Oh, thanks for reporting this. This looks like a bug: since SPARK_HIVE was
deprecated, we shouldn't rely on it any more.
On Wed, Aug 13, 2014 at 1:23 PM, ZHENG, Xu-dong dong...@gmail.com wrote:
I just found that this is because the lines below in make_distribution.sh don't work:
if [ $SPARK_HIVE ==
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
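For illustration, here is a rough sketch (my own, not from the tutorial) of such a
grid search with MLlib's ALS. Here training and validation are assumed to be
RDD[Rating] splits of your data, and the RMSE metric and parameter grid are just
examples:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// RMSE of a model on a held-out set of ratings
def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val ratesAndPreds = data.map(r => ((r.user, r.product), r.rating)).join(predictions)
  math.sqrt(ratesAndPreds.values.map { case (r, p) => (r - p) * (r - p) }.mean())
}

// Try every combination and keep the one with the lowest validation RMSE
val grid = for (rank <- Seq(10, 20); lambda <- Seq(0.01, 0.1); iters <- Seq(10, 20))
  yield (rank, lambda, iters)
val (bestParams, bestRmse) = grid.map { case (rank, lambda, iters) =>
  ((rank, lambda, iters), rmse(ALS.train(training, rank, iters, lambda), validation))
}.minBy(_._2)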
-Xiangrui
On Tue, Aug 12, 2014 at 8:01 PM,
In the next run of this application, it exited safely after Ctrl-\ (SIGQUIT).
grzegorz-bialek wrote
Hi,
when I run a Spark application on my local machine using spark-submit:
$SPARK_HOME/bin/spark-submit --driver-memory 1g --class <class> <jar>
When I want to interrupt the computation with Ctrl-C, it
Hi,
I wanted to access the Spark web UI after the application stops. I set
spark.eventLog.enabled to true and the logs are available
in JSON format in /tmp/spark-event, but the web UI isn't available at
http://driver-node:4040.
I'm running Spark in standalone mode.
What should I do to access the web UI?
Thanks a lot
Worked like a charm.
Hi,
I am trying to run and test some graph APIs using Java. I started with
connected components; here is my code:
JavaRDD<Edge<Long>> vertices;
// code to populate vertices
..
..
ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
ClassTag<Float> floatTag =
Thank you, Ton. That helps a lot.
I want to debug the Spark code to trace state transformations, so I use sbt as my
build tool and compile the Spark code in IntelliJ IDEA.
Zhanfeng Huo
From: Ron's Yahoo!
Date: 2014-08-12 03:46
To: Zhanfeng Huo
CC: user
Subject: Re: Compile spark code with idea
Let's say you have a model of class
org.apache.spark.mllib.classification.LogisticRegressionModel.
You can save the model to disk as follows:
import java.io.FileOutputStream
import java.io.ObjectOutputStream

val fos = new FileOutputStream("e:/model.obj")
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close()
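And a matching sketch (assuming the same path) for reading it back:

import java.io.{FileInputStream, ObjectInputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel

// Read the serialized model and cast it back to its original type
val ois = new ObjectInputStream(new FileInputStream("e:/model.obj"))
val loadedModel = ois.readObject().asInstanceOf[LogisticRegressionModel]
ois.close()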
I've been playing around with Spark off and on for the past month and have
developed some XML helper utilities that enable me to filter an XML dataset as
well as transform an XML dataset (we have a lot of XML content). I'm posting
this email to see if there would be any interest in this effort
Hi,
I have faced a similar issue when trying to run a map function with
predict. In my case I had some non-serializable fields in my calling class.
After making those fields transient, the error went away.
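A rough sketch of that workaround (class and field names are made up): mark the
non-serializable members @transient so they are skipped when the enclosing object is
shipped with the closure.

class Scorer(modelPath: String) extends Serializable {
  // log4j Loggers are not serializable; make the field transient and lazy so it is
  // skipped during serialization and re-created on each executor
  @transient lazy val logger = org.apache.log4j.Logger.getLogger(getClass)
  // fields that the executors actually need stay non-transient
}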
On Wed, Aug 13, 2014 at 6:39 PM, lancezhange lancezha...@gmail.com wrote:
let's say you
After a lot of grovelling through logs, I found out that the Nagios monitor
process detected that the machine was almost out of memory, and killed the
SNAP executor process.
So why is the machine running out of memory? Each node has 128GB of RAM, 4
executors, about 40GB of data. It did run out of
PS: I think that solving NotSerializableException errors by adding
'transient' is usually a mistake. It's a band-aid on a design problem.
transient causes the default serialization mechanism to not serialize
the field when the object is serialized. When deserialized, this field
will be null, which
My prediction code is simple enough, as follows:
val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
When model is the loaded one, the above code just doesn't work. Can you spot the
error?
Thanks.
PS. I use
+1 what Sean said. And if there are too many state/argument parameters for
your taste, you can always create a dedicated (serializable) class to
encapsulate them.
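A rough sketch of that idea (all names are made up): gather the state the closure
needs into one small serializable value, so only that value and the model are
captured, not the enclosing object.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LabeledPoint

// Case classes are serializable by default
case class ScoringParams(runId: String, threshold: Double)

def score(points: RDD[LabeledPoint],
          model: LogisticRegressionModel,
          params: ScoringParams): RDD[(String, Double, Double)] =
  // only model and params are captured here, not `this` or any surrounding class
  points.map(p => (params.runId, p.label, model.predict(p.features)))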
Sent while mobile. Pls excuse typos etc.
On Aug 13, 2014 6:58 AM, Sean Owen so...@cloudera.com wrote:
PS I think that solving not
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to
isolate the cause to serialization of the loaded model. And also try to
serialize the deserialized (loaded) model manually to see if that throws
any visible exceptions.
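A minimal sketch of that manual check (loadedModel and goodDataPoints stand for the
model read back from disk and the points RDD from the earlier messages): round-trip
the loaded model through Java serialization outside any Spark closure so a
NotSerializableException, if any, surfaces with a clear stack trace, and call predict
on a whole RDD so Spark ships the model itself rather than your own closure.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// 1) serialize the deserialized model manually
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(loadedModel)
out.close()

// 2) predict on an RDD[Vector] instead of mapping over the points yourself
val predictions = loadedModel.predict(goodDataPoints.map(_.features))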
Sent while mobile. Pls excuse typos etc.
On Aug 13,
Using the SchemaRDD insertInto method, is there any way to support partitions
on a field in the RDD? If not, what's the alternative, register a table and do
an insert into via SQL statement? Any plans to support partitioning via
insertInto? What other options are there for inserting into a
Dennis,
If it is PLSA with least-squares loss, then the QuadraticMinimizer that we
open sourced should be able to solve it for a modest number of topics (up to 1000, I
believe). If we integrate a CG solver for the equality constraint (Nocedal's KNITRO
paper is the reference), the topic size can be increased much larger than
Hi,
I'm trying to figure out how to constantly update, say, the 95th percentile
of a set of data through Spark Streaming. I'm not sure how to order the
dataset though, and while I can find percentiles in regular Spark, I can't
seem to figure out how to get that to transfer over to Spark
What is your Spark executor memory set to? (You can see it in Spark's web UI at
http://driver:4040 under the executors tab). One thing to be aware of is that
the JVM never really releases memory back to the OS, so it will keep filling up
to the maximum heap size you set. Maybe 4 executors with
This is not supported at the moment. There are no concrete plans to support it
through the programmatic API, but it should work using SQL as you suggested.
On Wed, Aug 13, 2014 at 8:22 AM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Using the SchemaRDD *insertInto*
The Spark UI isn't available through the same address; otherwise new
applications wouldn't be able to bind to it. Once the old application
finishes, the standalone Master renders the after-the-fact application UI
and exposes it under a different URL. To see this, go to the Master UI
(master-url:8080)
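As a minimal sketch (the directory is just an example), the application only needs
event logging turned on for the Master to be able to rebuild its UI afterwards:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")  // must be readable by the Master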
I would expect this to work with Spark SQL (available in 1.0+), but there is
a JIRA open to confirm that it works: SPARK-2883
(https://issues.apache.org/jira/browse/SPARK-2883).
On Mon, Aug 11, 2014 at 10:23 PM, vinay.kash...@socialinfra.net wrote:
Hi all,
Is it possible to use table with ORC
If the JVM heap size is close to the memory limit the OS sometimes kills
the process under memory pressure. I've usually found that lowering the
executor memory size helps.
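For reference, a minimal sketch of doing that programmatically (the value is purely
illustrative, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

// Keep the executor heap well below the node's physical memory so the OS
// doesn't kill the process under memory pressure
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.memory", "24g")
val sc = new SparkContext(conf)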
Shivaram
On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
What is your Spark executor
I do not believe this is true. If you are using a HiveContext you should
be able to register an RDD as a temporary table and then use INSERT INTO
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries)
to add data to a Hive
To add to the pile of information we're asking you to provide, what version
of Spark are you running?
2014-08-13 11:11 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu
:
If the JVM heap size is close to the memory limit the OS sometimes kills
the process under memory pressure. I've
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size
would indeed run out of memory (the machine has 110GB). And in fact they
would get repeatedly restarted and killed until eventually Spark gave up.
I'll try with a smaller limit, but it'll be a while - somehow my HDFS got
Hi,
Let me begin by describing my Spark setup on EC2 (launched using the
provided spark-ec2.py script):
- 100 c3.2xlarge workers (8 cores, 15 GB memory each)
- 1 c3.2xlarge Master (only running master daemon)
- Spark 1.0.2
- 8 GB mounted at / and 80 GB mounted at /mnt
Hi Ravi,
Setting SPARK_MEMORY doesn't do anything. I believe you confused it with
SPARK_MEM, which is now deprecated. You should set SPARK_EXECUTOR_MEMORY
instead, or spark.executor.memory as a config in
conf/spark-defaults.conf. Assuming you haven't set the executor memory
through a different
Can you link to the JIRA issue? I'm having to work around this bug and it
would be nice to monitor the JIRA so I can change my code when it's fixed.
I am having a similar problem:
I have a large dataset in HDFS and (for a few possible reasons, including a
filter operation and some of my computation nodes simply not being HDFS
datanodes) have a large skew in my RDD blocks: the master node always has
the most, while the worker nodes have few...
Arpan,
Which version of Spark are you using? Could you try the master or 1.1
branch, which can spill the data to disk during groupByKey()?
PS: it's better to use reduceByKey() or combineByKey() to reduce the data
size during the shuffle (see the sketch below).
Maybe there is a huge key in the data set; you can find it in
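A rough Scala sketch of that suggestion (pairs is an assumed RDD of key-value tuples;
counting is just an example aggregation):

import org.apache.spark.SparkContext._  // PairRDDFunctions implicits (pre-1.3 style)

// reduceByKey combines values map-side, so only one value per key crosses the shuffle
val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

// the groupByKey equivalent materializes every group in memory before counting
val countsViaGroup = pairs.groupByKey().mapValues(_.size.toLong)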
Hi Spark community,
We're excited about Spark at Adobe Research and have
just open sourced a project we use to automatically provision
a Spark cluster and submit applications.
The project is on GitHub, and we're happy for any feedback
from the community:
Hi Spark community,
We're excited about Spark at Adobe Research and have
just open sourced an example project writing and reading
Thrift objects to Parquet with Spark.
The project is on GitHub, and we're happy for any feedback:
https://github.com/adobe-research/spark-parquet-thrift-example
Hi All,
What is the best way to make a Spark Streaming driver highly available? I
would like the backup driver to pick up the processing if the primary driver
dies.
Thanks,
Ali
I think the problem is that when you are using yarn-cluster mode, the Spark driver
runs inside the application master, so the Hive configuration is not accessible to
the driver. Can you try to set those confs by using hiveContext.set(...)? Or maybe
you can copy hive-site.xml to spark/conf on the node
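A rough sketch of the first suggestion (the property value and URI are made-up
examples, and the setter has been spelled set/setConf depending on the release):

// point the driver-side HiveContext at the metastore explicitly, since the local
// hive-site.xml is not visible to a driver running inside the YARN application master
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")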
Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest
release)
I'll try changing it to a reduceByKey() and check the size of the largest
key and post the results here.
UPDATE: If I run this job and DO NOT specify the number of partitions for
the input textFile() (124 GB being
The 1.1 release will come out this month or next; we would really appreciate it
if you could test it with your real case.
Davies
On Wed, Aug 13, 2014 at 1:57 PM, Arpan Ghosh ar...@automatic.com wrote:
Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest
release)
I'll try
Hi Silvio,
You can insert into a static partition via SQL statement. Dynamic
partitioning is not supported at the moment.
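For example, a rough sketch of the static-partition route (table, column, and
partition values are made up; the method names follow the 1.0-era API, where
registerAsTable and hql were later renamed registerTempTable and sql):

// expose the SchemaRDD to SQL, then insert it into one static partition
schemaRdd.registerAsTable("events_staging")
hiveContext.hql(
  "INSERT INTO TABLE events PARTITION (dt='2014-08-13') SELECT * FROM events_staging")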
Thanks,
Yin
On Wed, Aug 13, 2014 at 2:03 PM, Michael Armbrust mich...@databricks.com
wrote:
This is not supported at the moment. There are no concrete plans at the
So you are saying that in spite of spark.shuffle.spill being set to true by
default, version 1.0.2 does not spill data to disk during a groupByKey()?
On Wed, Aug 13, 2014 at 2:05 PM, Davies Liu dav...@databricks.com wrote:
The 1.1 release will come out this or next month, we will really
In Spark (Scala/Java), it will spill the data to disk, but in PySpark,
it will not.
On Wed, Aug 13, 2014 at 2:10 PM, Arpan Ghosh ar...@automatic.com wrote:
So you are saying that in-spite of spark.shuffle.spill being set to true by
default, version 1.0.2 does not spill data to disk during a
Hi,
I have Yarn available to test and I'm currently working on getting my
application to run correctly on the Yarn cluster. I'll get back to you with
the results once I'm able to successfully run it (successfully meaning I at
least get to the point where my application currently fails)
On Tue,
Hi,
I'd like to read in a (binary) file from Python for which I have defined a
Java InputFormat (.java) definition. However, now I am stuck on how to use
that in Python and didn't find anything in the newsgroups either.
As far as I know, I have to use the newAPIHadoopRDD function. However, I am
not
I'm not that familiar with the Python APIs, but you should be able to
configure a Job object with your custom InputFormat and pass in the
required configuration (i.e. job.getConfiguration()) to newAPIHadoopRDD to
get the required RDD.
On Wed, Aug 13, 2014 at 2:59 PM, Tassilo Klein tjkl...@gmail.com
Tassilo, newAPIHadoopRDD has been added to PySpark in master and
yet-to-be-released 1.1 branch. It allows you to specify your custom
InputFormat. Examples of using it include hbase_inputformat.py and
cassandra_inputformat.py in examples/src/main/python. Check it out.
On Wed, Aug 13, 2014 at 3:12
Yes, that seems logical. But where / how do I pass the InputFormat
definition (.jar/.java/.class) to Spark?
I mean, when using Hadoop I need to call something like 'hadoop jar
myInputFormat.jar -inFormat myFormat other stuff' to register the file
format definition file.
Need help getting around these errors.
I have this program that runs fine on smaller input sizes. As it gets
larger, Spark has increasing difficulty being efficient and functioning
without errors. We have about 46GB free on each node. The workers and
executors are configured to use this up
Here are the biggest keys:
[ (17634, 87874097),
(8407, 38395833),
(20092, 14403311),
(9295, 4142636),
(14359, 3129206),
(13051, 2608708),
(14133, 2073118),
(4571, 2053514),
(16175, 2021669),
(5268, 1908557),
(3669, 1687313),
(14051,
Hello all,
I'm a beginner in Spark and Scala. I have the following code, which does a
groupBy on 2 keys.
val rdd2 = rdd1.groupBy(x => (x._2._1._1, x._2._1._2))
rdd1 looks like below; rdd1 is obtained as the result of a left outer join
between 2 RDDs.
Class[_ <: org.apache.spark.rdd.RDD[((Int,
Hi All,
I am using the decision tree algorithm and I get the following error. Any help
would be great!
java.lang.UnknownError: no bin was found for continuous variable. at
org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492) at
For the hottest key, the Python worker will need about 1-2 GB of memory to
do the groupByKey().
These configurations cannot help with the memory of the Python worker.
So, two options:
1) use reduceByKey() or combineByKey() to reduce the memory
consumption in the Python worker;
2) try the master or 1.1 branch
Hi,
On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu wrote:
what is the best way to make a spark streaming driver highly available.
I would also be interested in that. In particular for Streaming
applications where the Spark driver is running for a long time, this might
be
Yin, Michael,
Thanks, I'll try the SQL route.
From: Yin Huai huaiyin@gmail.com
Date: Wednesday, August 13, 2014 at 5:04 PM
To: Michael Armbrust mich...@databricks.com
Cc: Silvio Fiorito
Hi,
I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can
use more shuffle space...
Does Spark clean all the shuffle files once the runs are done? It seems to
me that the shuffle files are not cleaned...
Do I need to set this variable: spark.cleaner.ttl?
Right now we are
Hi all,
I have an issue where I'm able to run my code in standalone mode but not on
my cluster. I've isolated it to a few things but am at a loss as to how to
debug this. Below is the code. Any suggestions would be much appreciated.
Thanks!
1) RDD size is causing the problem. The code below as is
I've got ~500 tab-delimited log files, 25 gigs each, with the page name, the userId
of who viewed the page, and a timestamp.
I'm trying to build a basic Spark app to get unique visitors per page. I
was able to achieve this using Spark SQL by registering the RDD of a case
class and running a select
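Roughly the approach described, as a sketch with made-up paths, field order, and
names (sc and sqlContext are assumed to be in scope), using the 1.0-era Spark SQL
API:

case class PageView(page: String, userId: String)

val views = sc.textFile("hdfs:///logs/*.tsv")
  .map(_.split("\t"))
  .map(f => PageView(f(0), f(1)))

import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion
views.registerAsTable("views")
val uniqueVisitors = sqlContext.sql(
  "SELECT page, COUNT(DISTINCT userId) AS uniques FROM views GROUP BY page")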
Hi Deb,
If you don't have long-running Spark applications (those taking more than
spark.worker.cleanup.appDataTtl) then the TTL-based cleaner is a good
solution. If however you have a mix of long-running and short-running
applications, then the TTL-based solution will fail. It will clean up
Hi,
I tried running the sample SQL code JavaSparkSQL but keep getting this error.
The error comes on the line:
JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
C:\spark-submit --class org.apache.spark.examples.sql.JavaSparkSQL --master local
Before I start doing something on my own I wanted to check if someone has
created a script to deploy the latest version of Spark to Google Compute
Engine.
Thanks
-Soumya
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node
yarn-cluster.
Thanks Michael for the info.
Hi All,
Sorry for reposting this again, in the hope of getting some clues.
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Wed, Aug 13, 2014 at 3:53 PM, Sonal Goyal sonalgoy...@gmail.com wrote:
Hi,
I am trying to run and test some graph apis