Does RDD.saveAsObjectFile append or create a new file?

2014-03-21 Thread Jaonary Rabarisoa
Dear all, I need to run a series of transformations that map an RDD into another RDD. The computation changes over time and so does the resulting RDD. Each result is then saved to disk in order to do further analysis (for example, variation of the result over time). The question is, if I save
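
saveAsObjectFile writes a new Hadoop-style output directory and fails if the target path already exists; it does not append. A minimal sketch of saving each run to its own path (the loop, dataset, and path scheme below are placeholders, not the original poster's code):

    import org.apache.spark.SparkContext

    object SaveRuns {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "save-runs")
        // Each iteration writes to its own directory, since saveAsObjectFile
        // creates a fresh output directory rather than appending to one.
        for (run <- 1 to 3) {
          val result = sc.parallelize(1 to 100).map(x => (s"key$x", x.toLong * run))
          result.saveAsObjectFile(s"results/run-$run")
        }
        sc.stop()
      }
    }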

Re: How to distribute external executable (script) with Spark ?

2014-03-21 Thread Jaonary Rabarisoa
I finally found the answer. SparkContext has a method addFile. On Wed, Mar 19, 2014 at 5:23 PM, Mayur Rustagi wrote: > I doubt there is something like this out of the box. Easiest thing is to > package it into a jar & send that jar across. > Regards > > Mayur Rustagi > Ph: +1 (760) 203 3257 > htt
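
A minimal sketch of the SparkContext.addFile approach (the script name, path, and the way the script is invoked are placeholders; it assumes my_script.sh is executable and writes to stdout):

    import scala.sys.process._
    import org.apache.spark.{SparkContext, SparkFiles}

    val sc = new SparkContext("local[2]", "addfile-sketch")
    sc.addFile("/path/to/my_script.sh")              // ship the script to every node

    val data = sc.parallelize(1 to 10)
    val results = data.mapPartitions { iter =>
      // Resolve the local copy of the shipped script on this worker.
      val script = SparkFiles.get("my_script.sh")
      iter.map(x => Seq(script, x.toString).!!.trim) // run the script once per record
    }
    results.collect().foreach(println)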

Persist streams to text files

2014-03-21 Thread gaganbm
Hi, I am trying to persist the DStreams to text files. When I use the built-in API 'saveAsTextFiles' as: stream.saveAsTextFiles(resultDirectory) this creates a number of subdirectories, one for each batch, and within each subdirectory it creates a bunch of text files for each RDD (I assume). I a
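
One hedged alternative is to drop down to foreachRDD and control the layout per batch yourself, for example coalescing each batch to a single part file; the source, batch interval, and path scheme below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-batches")
    val ssc = new StreamingContext(conf, Seconds(1))
    val stream = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val resultDirectory = "results"                        // placeholder output root

    stream.foreachRDD { (rdd, time: Time) =>
      // One directory per batch, one part file per batch (coalesce(1) funnels
      // the batch through a single writer).
      rdd.coalesce(1).saveAsTextFile(s"$resultDirectory/batch-${time.milliseconds}")
    }
    ssc.start()
    ssc.awaitTermination()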

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
Hey, Even I am getting the same error. I am running, sudo ./run-example org.apache.spark.streaming.examples.FlumeEventCount spark://:7077 7781 and getting no events in Spark Streaming. --- Time: 1395395676000 ms -

Re: Spark worker threads waiting

2014-03-21 Thread sparrow
Here is the stage overview: [image: Inline image 2] and here are the stage details for stage 0: [image: Inline image 1] Transformations from the first stage to the second one are trivial, so that should not be the bottleneck (apart from keyBy().groupByKey(), which causes the shuffle write/read). Kind

Re: Relation between DStream and RDDs

2014-03-21 Thread Sanjay Awatramani
Hi, I searched more articles and ran a few examples and have clarified my doubts. This answer by TD in another thread (  https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped me a lot. Here is the summary of my findings: 1) A DStream can consist of 0 or 1 or more RDDs. 2) E
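
A small sketch illustrating the one-RDD-per-batch-interval point from the summary above (the source and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-rdds")
    val ssc = new StreamingContext(conf, Seconds(2))        // 2-second batch interval
    val lines = ssc.socketTextStream("localhost", 9999)     // placeholder source

    // Each batch interval yields exactly one (possibly empty) RDD for this DStream.
    lines.foreachRDD { (rdd, time: Time) =>
      println(s"Batch at ${time.milliseconds} ms has ${rdd.count()} records")
    }
    ssc.start()
    ssc.awaitTermination()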

Re: Relation between DStream and RDDs

2014-03-21 Thread Azuryy
Thanks for sharing here. Sent from my iPhone 5s > On March 21, 2014, at 20:44, Sanjay Awatramani wrote: > > Hi, > > I searched more articles and ran a few examples and have clarified my doubts. > This answer by TD in another thread ( > https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0n

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread anoldbrain
Hi, This is my summary of the gap between expected behavior and actual behavior. FlumeEventCount spark://:7077 Expected: an 'agent' listening on : (bind to). In the context of Spark, this agent should be running on one of the slaves, which should be the slave whose ip/hostname is . Observed:

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote: > the actual , which in turn causes the 'Fail to bind to ...' > error. This comes naturally because the slave that is running the code > to bind to : has a different ip. So if we run the code on the slave where we are sending

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote: > the actual , which in turn causes the 'Fail to bind to ...' > error. This comes naturally because the slave that is running the code > to bind to : has a different ip. I ran sudo ./run-example org.apache.spark.streaming.exa

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread anoldbrain
It is my understanding that there is no way to make FlumeInputDStream work in a cluster environment with the current release. Switching to Kafka, if you can, would be my suggestion, although I have not used KafkaInputDStream. There is a big difference between the Kafka and Flume InputDStreams: KafkaInputDS
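
For reference, a hedged sketch of wiring up the receiver-based Kafka connector of that era, which pulls from the brokers instead of binding to a particular slave's address the way the Flume receiver does; the ZooKeeper quorum, consumer group, and topic are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setMaster("spark://master:7077").setAppName("kafka-count")
    val ssc = new StreamingContext(conf, Seconds(2))

    // The Kafka receiver can run on any worker because it pulls from the brokers;
    // nothing needs to bind to a specific slave's address.
    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "zk1:2181",              // placeholder ZooKeeper quorum
      "spark-consumer-group",  // placeholder consumer group
      Map("events" -> 1))      // placeholder topic -> receiver threads

    kafkaStream.count().print()
    ssc.start()
    ssc.awaitTermination()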

N-Fold validation and RDD partitions

2014-03-21 Thread Jaonary Rabarisoa
Hi, I need to partition my data, represented as an RDD, into n folds, run a metrics computation in each fold, and finally compute the mean of my metrics over all the folds. Can Spark do the data partitioning out of the box, or do I need to implement it myself? I know that RDD has a partitions method an

Parallelizing job execution

2014-03-21 Thread Ognen Duzlevski
Hello, I have a task that runs on a week's worth of data (let's say) and produces a Set of tuples such as Set[(String,Long)] (essentially output of countByValue.toMap) I want to produce 4 sets, one each for a different week and run an intersection of the 4 sets. I have the sequential appro
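
One hedged way to overlap the four weekly jobs is to submit them from separate threads against the same SparkContext (job submission is thread-safe) and intersect the collected sets on the driver; the paths and parsing below are placeholders:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "weekly-intersection")
    val weekPaths = (1 to 4).map(w => s"hdfs:///data/week$w/*")   // placeholder paths

    // One Spark job per week, submitted concurrently; each yields a Set[(String, Long)].
    val weeklySets: Seq[Future[Set[(String, Long)]]] = weekPaths.map { path =>
      Future {
        sc.textFile(path)
          .map(_.split(",")(0))   // placeholder parsing: first CSV field
          .countByValue()         // Map[String, Long], returned to the driver
          .toSet
      }
    }

    // Intersect the four sets once all jobs have finished.
    val common = Await.result(Future.sequence(weeklySets), Duration.Inf).reduce(_ intersect _)
    println(s"${common.size} (value, count) pairs appear in all four weeks")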

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Sanjay Awatramani
Hi Jaonary, I believe the n folds should be mapped onto n keys in Spark using a map function. You can reduce the returned PairRDD and you should get your metric. I don't understand partitions fully, but from whatever I understand of them, they aren't required in your scenario. Regards, Sanjay
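
A hedged sketch of the keying idea above: tag each record with a random fold id, then hold each fold out in turn (the dataset and the per-fold "metric" are placeholders, and random assignment only gives approximately equal fold sizes):

    import scala.util.Random
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair/double RDD functions on older Spark

    val sc = new SparkContext("local[2]", "kfold-sketch")
    val k = 5
    val data = sc.parallelize(1 to 1000).map(_.toDouble)        // placeholder dataset

    // Tag every record with a fold id in [0, k).
    val keyed = data.map(x => (Random.nextInt(k), x)).cache()

    val metrics = (0 until k).map { fold =>
      val test  = keyed.filter { case (id, _) => id == fold }.values
      val train = keyed.filter { case (id, _) => id != fold }.values
      // Placeholder metric: the mean of the held-out fold (train is unused here).
      test.mean()
    }
    println(s"mean metric over $k folds: ${metrics.sum / k}")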

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
I'll start with the Kafka implementation. Thanks for all the help. On Mar 21, 2014 7:00 PM, "anoldbrain [via Apache Spark User List]" < ml-node+s1001560n2994...@n3.nabble.com> wrote: > It is my understanding that there is no way to make FlumeInputDStream work > in a cluster environment with the curre

Sliding Window operations do not work as documented

2014-03-21 Thread Sanjay Awatramani
Hi, I want to run a map/reduce process over the last 5 seconds of data, every 4 seconds. This is quite similar to the sliding window pictorial example under the Window Operations section of http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html . The RDDs returned by window t
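
Both the window length and the slide interval have to be multiples of the batch interval, so a 5-second window sliding every 4 seconds needs, for example, a 1-second batch interval. A hedged sketch (the source and the word-count workload are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair DStream functions on older Spark

    val conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))        // 1 s batches: 5 s and 4 s are both multiples
    val lines = ssc.socketTextStream("localhost", 9999)     // placeholder source

    // Every 4 seconds, process the last 5 seconds of data.
    lines.window(Seconds(5), Seconds(4))
         .flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .print()

    ssc.start()
    ssc.awaitTermination()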

assumption data must fit in memory per reducer

2014-03-21 Thread Koert Kuipers
does anyone know if the assumption that data must fit in memory per reducer has been lifted (perhaps if using disk-only mode)? or can anyone point me to the JIRA for this so i can follow it? thanks! koert

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Hai-Anh Trinh
Hi Jaonary, You can find the code for k-fold CV in https://github.com/apache/incubator-spark/pull/448. I have not found the time to resubmit the pull request against the latest master. On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani wrote: > Hi Jaonary, > > I believe the n folds should be mapped onto n keys i

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Jaonary Rabarisoa
Thank you Hai-Anh. Are the files CrossValidation.scala and RandomSplitRDD.scala enough to use it? I'm currently using Spark 0.9.0 and I want to avoid rebuilding everything. On Fri, Mar 21, 2014 at 4:58 PM, Hai-Anh Trinh wrote: > Hi Jaonary, > > You can find the code for k-fold CV in > https:/

Spark executor paths

2014-03-21 Thread deric
I'm trying to run Spark on Mesos and I'm getting this error: java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.SparkEnv$.instantiateClass$1(S
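
On Mesos, a ClassNotFoundException for core Spark classes in the executor often means the executors have no Spark distribution to load them from; the Spark-on-Mesos docs of that era have you point spark.executor.uri (or the SPARK_EXECUTOR_URI environment variable) at a Spark tarball the slaves can download. A hedged sketch, with the master URL and tarball URI as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Tell the Mesos executors where to fetch a Spark distribution from.
    val conf = new SparkConf()
      .setMaster("mesos://master:5050")      // placeholder Mesos master
      .setAppName("mesos-executor-uri")
      .set("spark.executor.uri",
           "hdfs:///tmp/spark-0.9.0-incubating-bin-hadoop2.tgz")   // placeholder URI

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).count())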

Re: Problem with HBase external table on freshly created EMR cluster

2014-03-21 Thread Kanwaldeep
Seems like this could be a version mismatch issue between the HBase version deployed and the jars being used. Here are the details on the versions we have set up. We are running CDH-4.6.0 (which includes Hadoop 2.0.0), and Spark was compiled against that version. Below is the environment variable

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread deenar.toraskar
Hi Aureliano, If you have managed to get a custom version of saveAsObject() that handles compression working, I would appreciate it if you could share the code. I have come across the same issue, and it would save me some time having to reinvent the wheel. Deenar -- View this message in context: h

Re: Spark worker threads waiting

2014-03-21 Thread Mayur Rustagi
In your task details I don't see a large skew in tasks, so the low CPU usage period occurs between stages or during stage execution. One possible issue: your data is 89 GB of shuffle read; if the machine producing the shuffle data is not the one processing it, shuffling data across machines may be caus

Spark and Hadoop cluster

2014-03-21 Thread Sameer Tilak
Hi everyone, We are planning to set up Spark. The documentation mentions that it is possible to run Spark in standalone mode on a Hadoop cluster. Does anyone have any comments on the stability and performance of this mode?

Re: Spark and Hadoop cluster

2014-03-21 Thread Mayur Rustagi
Both are quite stable. YARN is in beta though, so it would be good to test on standalone till Spark 1.0.0. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Mar 21, 2014 at 2:19 PM, Sameer Tilak wrote: > Hi

Re: Reload RDD saved with saveAsObjectFile

2014-03-21 Thread deenar.toraskar
Jaonary val loadedData: RDD[(String,(String,Array[Byte]))] = sc.objectFile("yourObjectFileName") Deenar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reload-RDD-saved-with-saveAsObjectFile-tp2943p3009.html Sent from the Apache Spark User List mailing

RE: Spark and Hadoop cluster

2014-03-21 Thread sstilak
Thanks, Mayur. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Original message From: Mayur Rustagi Date:03/21/2014 11:32 AM (GMT-08:00) To: user@spark.apache.org Subject: Re: Spark and Hadoop cluster Both are quite stable. Yarn is in beta though so would be

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread Aureliano Buendia
On Fri, Mar 21, 2014 at 5:53 PM, deenar.toraskar wrote: > Hi Aureliano > > If you have managed to get a custom version of saveAsObject() that handles > compression working, would appreciate if you could share the code. I have > come across the same issue and it would help me some time having to >

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-21 Thread Aureliano Buendia
On Tue, Mar 18, 2014 at 12:56 PM, Ognen Duzlevski < og...@plainvanillagames.com> wrote: > > On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote: > >> On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote: >> >>> Is there a reason for spark using the older akka? >>> >>> >>> >>> >>> On Sun, Mar

Spark streaming kafka _output_

2014-03-21 Thread Benjamin Black
Howdy, folks! Anybody out there have a working Kafka _output_ for Spark Streaming? Perhaps one that doesn't involve instantiating a new producer for every batch? Thanks! b

How to save as a single file efficiently?

2014-03-21 Thread Aureliano Buendia
Hi, Our Spark app reduces a few hundred GB of data to a few hundred KB of CSV. We found that a partition count of 1000 is a good number to speed the process up. However, it does not make sense to have 1000 CSV files, each less than 1 KB. We used RDD.coalesce(1) to get only 1 CSV file, but it

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
Try passing the shuffle=true parameter to coalesce, then it will do the map in parallel but still pass all the data through one reduce node for writing it out. That’s probably the fastest it will get. No need to cache if you do that. Matei On Mar 21, 2014, at 4:04 PM, Aureliano Buendia wrote:
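
A minimal sketch of the suggestion above (the input path, partition count, transformation, and output path are placeholders):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "single-csv")
    val rows = sc.textFile("hdfs:///input/*", 1000)   // placeholder input read with ~1000 partitions
      .map(_.toUpperCase)                             // placeholder transformation

    // shuffle = true keeps the upstream work parallel and only funnels the
    // (small) result through one reducer, which writes the single part file.
    rows.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///output/csv-single")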

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread Matei Zaharia
To use compression here, you might just have to set the correct Hadoop settings in SparkContext.hadoopConf. Matei On Mar 21, 2014, at 10:53 AM, deenar.toraskar wrote: > Hi Aureliano > > If you have managed to get a custom version of saveAsObject() that handles > compression working, would a

Shark Table for >22 columns

2014-03-21 Thread subacini Arunkumar
Hi, I am able to successfully create a Shark table with 3 columns and 2 rows. val recList = List((" value A1", "value B1", "value C1"), ("value A2", "value B2", "value c2")); val dbFields = List("Col A", "Col B", "Col C") val rdd = sc.parallelize(recList)

Re: How to save as a single file efficiently?

2014-03-21 Thread Aureliano Buendia
Good to know it's as simple as that! I wonder why shuffle=true is not the default for coalesce(). On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia wrote: > Try passing the shuffle=true parameter to coalesce, then it will do the > map in parallel but still pass all the data through one reduce node

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread deenar.toraskar
Matei, It turns out that saveAsObjectFile(), saveAsSequenceFile() and saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano found out in this post: http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html Deenar --

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
Ah, the reason is because coalesce is often used to deal with lots of small input files on HDFS. In that case you don’t want to reshuffle them all across the network, you just want each mapper to directly read multiple files (and you want fewer than one mapper per file). Matei On Mar 21, 2014,

Re: How to save as a single file efficiently?

2014-03-21 Thread deenar.toraskar
Aureliano, Apologies for hijacking this thread. Matei, On the subject of processing lots (millions) of small input files on HDFS, what are the best practices to follow in Spark? Currently my code looks something like this. Without coalesce there is one task and one output file per input file. But
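
A hedged sketch of the pattern under discussion: read many small files with a glob and coalesce without a shuffle, so that a single task reads several input files and the job writes far fewer part files (the paths, counts, and processing are placeholders):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "small-files")

    // A glob over millions of small files yields roughly one partition per file.
    val raw = sc.textFile("hdfs:///incoming/2014/03/21/*.log")   // placeholder path

    // coalesce without a shuffle lets each task read several input files directly,
    // so there are ~500 tasks and ~500 output part files instead of millions.
    raw.coalesce(500)
       .map(_.trim)                                              // placeholder processing
       .saveAsTextFile("hdfs:///processed/2014-03-21")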

pySpark memory usage

2014-03-21 Thread Jim Blomo
Hi all, I'm wondering if there are any settings I can use to reduce the memory needed by the PythonRDD when computing simple stats. I am getting OutOfMemoryError exceptions while calculating count() on big, but not absurd, records. It seems like PythonRDD is trying to keep too many of these records

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread Aureliano Buendia
I think you bumped the wrong thread. As I mentioned in the other thread: saveAsHadoopFile only applies compression when the codec is available, and it does not seem to respect the global Hadoop compression properties. I'm not sure if this is a feature or a bug in Spark. If this is a feature, t
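
A hedged sketch of passing the codec explicitly to saveAsHadoopFile instead of relying on the cluster-wide Hadoop properties, per the observation above (the data, output path, and Text/TextOutputFormat types are placeholders, and it assumes a Spark version whose saveAsHadoopFile overload accepts a codec class):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair RDD functions on older Spark

    val sc = new SparkContext("local[2]", "compressed-output")
    val pairs = sc.parallelize(1 to 100).map(i => (s"key$i", s"value$i"))   // placeholder data

    // Passing the codec class directly compresses this output regardless of
    // the global mapred.output.compress settings.
    pairs.saveAsHadoopFile(
      "out/compressed",
      classOf[Text],
      classOf[Text],
      classOf[TextOutputFormat[Text, Text]],
      classOf[GzipCodec])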

unable to build spark - sbt/sbt: line 50: killed

2014-03-21 Thread Bharath Bhushan
I am getting the following error when trying to build Spark. I tried various sizes for -Xmx and other memory-related arguments to the java command line, but the assembly command still fails. $ sbt/sbt assembly ... [info] Compiling 298 Scala sources and 17 Java sources to /vagrant/spark-0.9.

distinct on huge dataset

2014-03-21 Thread Kane
I have a huge 2.5 TB file. When I run: val t = sc.textFile("/user/hdfs/dump.csv") t.distinct.count It fails right away with a lot of: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 1 at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:
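
Two hedged things worth trying on an input this size: filter out empty or malformed lines before anything touches fields, and give distinct an explicit (large) number of shuffle partitions; the master, partition count, and filter below are illustrative only:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "distinct-count")   // placeholder master
    val t = sc.textFile("/user/hdfs/dump.csv")

    // Drop empty lines up front and give the distinct shuffle plenty of
    // partitions for a multi-TB input.
    val count = t.filter(_.nonEmpty)
                 .distinct(4000)
                 .count()
    println(count)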

Re: distinct on huge dataset

2014-03-21 Thread Aaron Davidson
Which version of Spark are you running? On Fri, Mar 21, 2014 at 10:45 PM, Kane wrote: > I have a huge 2.5 TB file. When I run: > val t = sc.textFile("/user/hdfs/dump.csv") > t.distinct.count > > It fails right away with a lot of: > > Loss was due to java.lang.ArrayIndexOutOfBoundsException > jav