Does RDD.saveAsObjectFile append or create a new file?

2014-03-21 Thread Jaonary Rabarisoa
Dear all, I need to run a series of transformations that map an RDD into another RDD. The computation changes over time and so does the resulting RDD. Each result is then saved to disk in order to do further analysis (for example, variation of the result over time). The question is, if I
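As far as I understand, saveAsObjectFile writes a fresh output directory (and fails if the path already exists) rather than appending, so one common pattern for this kind of evolving computation is to write each run to its own sub-path and reload it later with objectFile. A minimal Scala sketch with purely hypothetical paths and data:

    import org.apache.spark.SparkContext

    object SaveRunsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "save-runs-sketch")

        // Hypothetical root directory; each run gets its own sub-path because
        // saveAsObjectFile creates a new output directory instead of appending.
        val outputRoot = "hdfs:///tmp/results"

        for (runId <- 1 to 3) {
          val result = sc.parallelize(1 to 100).map(x => (x.toString, x.toLong * runId))
          result.saveAsObjectFile(outputRoot + "/run-" + runId)
        }

        // Any run can later be reloaded for further analysis.
        val reloaded = sc.objectFile[(String, Long)](outputRoot + "/run-1")
        println(reloaded.count())

        sc.stop()
      }
    }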

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
Hey, even I am getting the same error. I am running sudo ./run-example org.apache.spark.streaming.examples.FlumeEventCount spark://spark_master_hostname:7077 spark_master_hostname 7781 and getting no events in Spark Streaming. --- Time:

Re: Spark worker threads waiting

2014-03-21 Thread sparrow
Here is the stage overview: [image: Inline image 2] and here are the stage details for stage 0: [image: Inline image 1] Transformations from the first stage to the second one are trivial, so that should not be the bottleneck (apart from keyBy().groupByKey(), which causes the shuffle write/read). Kind

Re: Relation between DStream and RDDs

2014-03-21 Thread Sanjay Awatramani
Hi, I searched more articles and ran a few examples and have clarified my doubts. This answer by TD in another thread (  https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped me a lot. Here is a summary of my findings: 1) A DStream can consist of 0, 1, or more RDDs. 2)
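To illustrate the first point of the summary above, here is a minimal sketch of a streaming job in which each batch interval yields exactly one (possibly empty) RDD, observable via foreachRDD; the socket source and host/port are hypothetical:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamBatchesSketch {
      def main(args: Array[String]): Unit = {
        // One RDD is generated per batch interval (2 seconds here); a batch with
        // no incoming data simply yields an empty RDD.
        val ssc = new StreamingContext("local[2]", "dstream-batches", Seconds(2))
        val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

        lines.foreachRDD { rdd =>
          println("batch produced an RDD with " + rdd.count() + " records")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }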

Re: Relation between DStream and RDDs

2014-03-21 Thread Azuryy
Thanks for sharing here. Sent from my iPhone5s On March 21, 2014, at 20:44, Sanjay Awatramani sanjay_a...@yahoo.com wrote: Hi, I searched more articles and ran a few examples and have clarified my doubts. This answer by TD in another thread (

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread anoldbrain
Hi, this is my summary of the gap between expected behavior and actual behavior. FlumeEventCount spark://spark_master_hostname:7077 address port Expected: an 'agent' listening on address:port (binding to it). In the context of Spark, this agent should be running on one of the slaves, which should be

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote: ...the actual address, which in turn causes the 'Fail to bind to ...' error. This comes naturally because the slave that is running the code to bind to address:port has a different IP. So if we run the code on the slave where

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote: ...the actual address, which in turn causes the 'Fail to bind to ...' error. This comes naturally because the slave that is running the code to bind to address:port has a different IP. I ran sudo ./run-example

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread anoldbrain
It is my understanding that there is no way to make FlumeInputDStream work in a cluster environment with the current release. Switching to Kafka, if you can, would be my suggestion, although I have not used KafkaInputDStream. There is a big difference between the Kafka and Flume InputDStreams:
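A rough sketch of the difference discussed in this thread, assuming the external flume and kafka streaming artifacts are on the classpath and using placeholder hostnames: the Flume receiver must bind to the address of the worker it actually runs on (the push model behind the binding problem above), while the Kafka receiver connects out to ZooKeeper/brokers, so its placement does not matter:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils
    import org.apache.spark.streaming.kafka.KafkaUtils

    object ReceiverSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext("spark://spark_master_hostname:7077",
          "receiver-sketch", Seconds(5))

        // Flume push model: the receiver binds to host:port, so the host must be
        // the address of whichever worker the receiver is actually scheduled on.
        val flumeStream = FlumeUtils.createStream(ssc, "worker_hostname", 7781)
        flumeStream.count().map(c => "Received " + c + " flume events").print()

        // Kafka pull model: the receiver connects out to ZooKeeper/brokers, so it
        // does not matter which worker runs it. Hostnames are placeholders.
        val kafkaStream = KafkaUtils.createStream(ssc, "zk_host:2181", "my-group",
          Map("my-topic" -> 1))
        kafkaStream.count().map(c => "Received " + c + " kafka messages").print()

        ssc.start()
        ssc.awaitTermination()
      }
    }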

N-Fold validation and RDD partitions

2014-03-21 Thread Jaonary Rabarisoa
Hi, I need to partition my data, represented as an RDD, into n folds, run metrics computation in each fold, and finally compute the means of my metrics over all the folds. Can Spark do the data partitioning out of the box, or do I need to implement it myself? I know that RDD has a partitions method
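Spark does not appear to offer k-fold splitting out of the box in the release current at the time of this thread, so below is one hand-rolled sketch (the kFold helper is hypothetical): each record is tagged with a deterministic pseudo-random fold id, and train/test pairs are built by filtering. Caching the tagged RDD is advisable if the folds are reused:

    import scala.reflect.ClassTag
    import scala.util.Random
    import org.apache.spark.rdd.RDD

    object KFoldSketch {
      // Tag each record with a pseudo-random fold id, then build (train, test)
      // pairs by filtering. Only map/filter are used, so this should also work
      // on older RDD APIs; cache `tagged` if the folds are reused.
      def kFold[T: ClassTag](data: RDD[T], k: Int, seed: Long): Seq[(RDD[T], RDD[T])] = {
        val tagged = data.mapPartitionsWithIndex { (idx, iter) =>
          val rng = new Random(seed + idx) // per-partition seed => stable across recomputation
          iter.map(x => (rng.nextInt(k), x))
        }
        (0 until k).map { fold =>
          val test  = tagged.filter(_._1 == fold).map(_._2)
          val train = tagged.filter(_._1 != fold).map(_._2)
          (train, test)
        }
      }
    }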

Parallelizing job execution

2014-03-21 Thread Ognen Duzlevski
Hello, I have a task that runs on a week's worth of data (let's say) and produces a Set of tuples such as Set[(String,Long)] (essentially the output of countByValue.toMap). I want to produce 4 sets, one for each of 4 different weeks, and run an intersection of the 4 sets. I have the sequential
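One way to run the four weekly jobs concurrently is to submit the actions from separate threads against a single SparkContext, for example with Scala futures; a sketch with hypothetical HDFS paths follows (whether this matches the actual setup is an assumption):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.SparkContext

    object ParallelWeeksSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("spark://spark_master_hostname:7077", "parallel-weeks")

        // One countByValue job per (hypothetical) weekly path, submitted from
        // separate threads; the scheduler runs them concurrently if resources allow.
        val weeks = Seq("week1", "week2", "week3", "week4")
        val futures = weeks.map { w =>
          Future {
            sc.textFile("hdfs:///events/" + w + "/*").countByValue().toSet
          }
        }

        val sets = Await.result(Future.sequence(futures), Duration.Inf)
        val common = sets.reduce(_ intersect _)
        println("Tuples present in all four weeks: " + common.size)

        sc.stop()
      }
    }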

Spark executor paths

2014-03-21 Thread deric
I'm trying to run Spark on Mesos and I'm getting this error: java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at
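On Mesos, a ClassNotFoundException for a core class such as JavaSerializer can mean the executors never found the Spark distribution at all; whether that is the cause here is a guess, but a typical configuration is to point spark.executor.uri at a pre-built tarball. A sketch with a placeholder tarball path:

    import org.apache.spark.{SparkConf, SparkContext}

    object MesosExecutorUriSketch {
      def main(args: Array[String]): Unit = {
        // spark.executor.uri tells Mesos executors where to fetch a Spark
        // distribution; the tarball path below is purely a placeholder.
        val conf = new SparkConf()
          .setMaster("mesos://mesos_master:5050")
          .setAppName("mesos-uri-sketch")
          .set("spark.executor.uri",
            "hdfs:///frameworks/spark/spark-0.9.0-incubating-bin.tar.gz")

        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }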

Re: Spark worker threads waiting

2014-03-21 Thread Mayur Rustagi
In your task details I don't see a large skew in tasks, so the low CPU usage period occurs between stages or during stage execution. One possible issue is that your data is 89 GB of shuffle read; if the machine producing the shuffle data is not the one processing it, data shuffling across machines may be

Spark and Hadoop cluster

2014-03-21 Thread Sameer Tilak
Hi everyone, We are planning to set up Spark. The documentation mentions that it is possible to run Spark in standalone mode on a Hadoop cluster. Does anyone have any comments on the stability and performance of this mode?

Re: Spark and Hadoop cluster

2014-03-21 Thread Mayur Rustagi
Both are quite stable. YARN is in beta though, so it would be good to stick with standalone until Spark 1.0.0. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Mar 21, 2014 at 2:19 PM, Sameer Tilak

Re: Reload RDD saved with saveAsObjectFile

2014-03-21 Thread deenar.toraskar
Jaonary val loadedData: RDD[(String,(String,Array[Byte]))] = sc.objectFile(yourObjectFileName) Deenar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reload-RDD-saved-with-saveAsObjectFile-tp2943p3009.html Sent from the Apache Spark User List mailing

RE: Spark and Hadoop cluster

2014-03-21 Thread sstilak
Thanks, Mayur. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone. Original message From: Mayur Rustagi mayur.rust...@gmail.com Date: 03/21/2014 11:32 AM (GMT-08:00) To: user@spark.apache.org Subject: Re: Spark and Hadoop cluster

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-21 Thread Aureliano Buendia
On Tue, Mar 18, 2014 at 12:56 PM, Ognen Duzlevski og...@plainvanillagames.com wrote: On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote: On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote: Is there a reason for Spark using the older Akka? On Sun, Mar 2, 2014 at 1:53 PM, 1esha

Spark streaming kafka _output_

2014-03-21 Thread Benjamin Black
Howdy, folks! Anybody out there have a working Kafka _output_ for Spark Streaming? Perhaps one that doesn't involve instantiating a new producer for every batch? Thanks! b

How to save as a single file efficiently?

2014-03-21 Thread Aureliano Buendia
Hi, Our Spark app reduces a few hundred GB of data to a few hundred KB of CSV. We found that a partition count of 1000 is a good number to speed the process up. However, it does not make sense to have 1000 CSV files, each less than 1 KB. We used RDD.coalesce(1) to get only 1 CSV file, but

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
Try passing the shuffle=true parameter to coalesce, then it will do the map in parallel but still pass all the data through one reduce node for writing it out. That’s probably the fastest it will get. No need to cache if you do that. Matei On Mar 21, 2014, at 4:04 PM, Aureliano Buendia
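A minimal sketch of the pattern Matei describes, with hypothetical input and output paths: keep the expensive reduction at high parallelism and only funnel the final, tiny result through a single partition for writing:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object SingleCsvSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "single-csv-sketch")

        // The expensive reduction keeps plenty of partitions...
        val summary = sc.textFile("hdfs:///input/*")
          .map(line => (line.split(",")(0), 1L))
          .reduceByKey(_ + _, 1000)

        // ...and only the final, tiny result is funneled through one partition.
        // With shuffle = true the upstream work still runs in parallel.
        summary
          .map { case (k, v) => k + "," + v }
          .coalesce(1, shuffle = true)
          .saveAsTextFile("hdfs:///output/summary-csv")

        sc.stop()
      }
    }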

Shark Table for 22 columns

2014-03-21 Thread subacini Arunkumar
Hi, I am able to successfully create a Shark table with 3 columns and 2 rows. val recList = List(("value A1", "value B1", "value C1"), ("value A2", "value B2", "value c2")); val dbFields = List("Col A", "Col B", "Col C") val rdd = sc.parallelize(recList)

Re: How to save as a single file efficiently?

2014-03-21 Thread Aureliano Buendia
Good to know it's as simple as that! I wonder why shuffle=true is not the default for coalesce(). On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.comwrote: Try passing the shuffle=true parameter to coalesce, then it will do the map in parallel but still pass all the data

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread deenar.toraskar
Matei It turns out that saveAsObjectFile(), saveAsSequenceFile() and saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano found out in this post http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html Deenar --

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
Ah, the reason is that coalesce is often used to deal with lots of small input files on HDFS. In that case you don't want to reshuffle them all across the network; you just want each mapper to directly read multiple files (and you want fewer than one mapper per file). Matei On Mar 21,
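And the converse case Matei mentions, sketched below with placeholder paths: coalesce with the default shuffle=false merges many per-file partitions so each task reads several small files directly, without moving data across the network:

    import org.apache.spark.SparkContext

    object SmallFilesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "small-files-sketch")

        // textFile yields roughly one partition per small file; coalesce without a
        // shuffle merges them so each task reads several files directly, with no
        // data movement across the network. Paths and counts are placeholders.
        val records = sc.textFile("hdfs:///input/small-files/*").coalesce(200)
        records.saveAsTextFile("hdfs:///output/combined")

        sc.stop()
      }
    }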

Re: How to save as a single file efficiently?

2014-03-21 Thread deenar.toraskar
Aureliano Apologies for hijacking this thread. Matei, on the subject of processing lots (millions) of small input files on HDFS, what are the best practices to follow with Spark? Currently my code looks something like this. Without coalesce there is one task and one output file per input file.

pySpark memory usage

2014-03-21 Thread Jim Blomo
Hi all, I'm wondering if there are any settings I can use to reduce the memory needed by the PythonRDD when computing simple stats. I am getting OutOfMemoryError exceptions while calculating count() on big, but not absurd, records. It seems like PythonRDD is trying to keep too many of these

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread Aureliano Buendia
I think you bumped the wrong thread. As I mentioned in the other thread: saveAsHadoopFile only applies compression when the codec is available, and it does not seem to respect the global Hadoop compression properties. I'm not sure if this is a feature or a bug in Spark. If this is a feature,
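If the global Hadoop compression properties are indeed ignored, one workaround sketch (not verified against this exact report) is to pass an explicit JobConf with compression switched on to saveAsHadoopFile; paths and record types below are placeholders:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.compress.{CompressionCodec, GzipCodec}
    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object CompressedOutputSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "compressed-output-sketch")
        val pairs = sc.parallelize(1 to 1000)
          .map(i => (new Text(i.toString), new Text("value-" + i)))

        // Pass an explicit JobConf with compression enabled for this one write,
        // instead of relying on the cluster-wide Hadoop properties.
        val jobConf = new JobConf(sc.hadoopConfiguration)
        jobConf.setBoolean("mapred.output.compress", true)
        jobConf.setClass("mapred.output.compression.codec",
          classOf[GzipCodec], classOf[CompressionCodec])

        pairs.saveAsHadoopFile("hdfs:///output/compressed",
          classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]], jobConf)

        sc.stop()
      }
    }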

unable to build spark - sbt/sbt: line 50: killed

2014-03-21 Thread Bharath Bhushan
I am getting the following error when trying to build Spark. I tried various sizes for -Xmx and other memory-related arguments on the java command line, but the assembly command still fails. $ sbt/sbt assembly ... [info] Compiling 298 Scala sources and 17 Java sources to