Re: Setting SPARK_MEM higher than available memory in driver

2014-03-28 Thread Aaron Davidson
Assuming you're using a new enough version of Spark, you should use spark.executor.memory to set the memory for your executors, without changing the driver memory. See the docs for your version of Spark. On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, My worker
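For illustration, a minimal sketch of that setting through SparkConf (Spark 0.9-era API; the app name, master, and 4g value are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("MyApp")
  .set("spark.executor.memory", "4g") // per-executor heap, independent of driver memory
val sc = new SparkContext(conf)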

Re: Configuring shuffle write directory

2014-03-28 Thread Tsai Li Ming
Hi, Thanks! I found out that I wasn’t setting SPARK_JAVA_OPTS correctly. I took a look at the process table and saw that the “org.apache.spark.executor.CoarseGrainedExecutorBackend” didn’t have -Dspark.local.dir set. On 28 Mar, 2014, at 1:05 pm, Matei Zaharia
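For reference, a sketch of the equivalent programmatic setting (the directory path is hypothetical); in a 0.9-era standalone cluster the same property is what gets passed to executors via SPARK_JAVA_OPTS as -Dspark.local.dir=...:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/spark-local") // where shuffle files are written on each node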

Re: Replicating RDD elements

2014-03-28 Thread Sonal Goyal
Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at Spark's broadcasting in that case. On the other hand, if you want to take one item and create an RDD of 100
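For illustration, minimal sketches of both readings of the question (values and names hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("replicate-demo"))

// Reading 1: share one value with every node via a broadcast variable.
val factor = sc.broadcast(42)
val scaled = sc.parallelize(1 to 10).map(_ * factor.value)

// Reading 2: take one item and create an RDD of 100 copies of it.
val replicated = sc.parallelize(Seq.fill(100)("item"))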

Exception on simple pyspark script

2014-03-28 Thread idanzalz
Hi, I am a newbie with Spark. I tried installing 2 virtual machines, one as a client and one as standalone mode worker+master. Everything seems to run and connect fine, but when I try to run a simple script, I get weird errors. Here is the traceback, notice my program is just a one-liner:

Re: Not getting it

2014-03-28 Thread Sonal Goyal
Have you tried setting the partitioning ? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Mar 27, 2014 at 10:04 AM, lannyripple lanny.rip...@gmail.com wrote: Hi all, I've got something which I think should be straightforward but

spark.akka.frameSize setting problem

2014-03-28 Thread lihu
Hi, I just ran a simple example to generate some data for the ALS algorithm. My Spark version is 0.9, in local mode, and the memory of my node is 108G. But when I set conf.set("spark.akka.frameSize", "4096"), the following problem occurred; when I do not set this, it runs well.
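For context, spark.akka.frameSize is specified in MB, and a value as large as 4096 can overflow the frame size Akka tracks internally in bytes; a more modest setting is usually what's wanted (the value below is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "128") // in MB; the 0.9-era default is 10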

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample. On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I notice that RDD.cartesian has a strange behavior with cached and uncached data. More

Re: Exception on simple pyspark script

2014-03-28 Thread idanzalz
I sorted it out. Turns out that if the client uses Python 2.7 and the server is Python 2.6, you get some weird errors, like this and others. So you probably don't want to do that... -- View this message in context:

streaming: code to simulate a network socket data source

2014-03-28 Thread Diana Carroll
If you are learning about Spark Streaming, as I am, you've probably used netcat (nc) as mentioned in the Spark Streaming programming guide. I wanted something a little more useful, so I modified the ClickStreamGenerator code to make a very simple script that simply reads a file off disk and passes
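As a sketch of the same idea, a tiny Scala server that replays a file over a socket for socketTextStream to consume (port, path, and rate are hypothetical):

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

object FileStreamer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)  // the streaming job connects here
    val socket = server.accept()
    val out = new PrintWriter(socket.getOutputStream, true)
    for (line <- Source.fromFile("data.txt").getLines()) {
      out.println(line)
      Thread.sleep(100)                  // throttle to roughly 10 lines/sec
    }
    out.close()
    socket.close()
  }
}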

Re: Not getting it

2014-03-28 Thread lannyripple
I've played around with it. The CSV file looks like it gives 130 partitions. I'm assuming that's the standard 64MB split size for HDFS files. I have increased the number of partitions and the number of tasks for things like groupByKey and such. Usually I start blowing up on "GC overhead limit exceeded" or sometimes

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Debasish Das
Classes are serialized and sent to all the workers as Akka messages. For singletons and case classes I am not sure if they are Java-serialized or Kryo-serialized by default, but your own classes will definitely be much more efficient if serialized by Kryo. There is a comparison that Matei did for all

Re: function state lost when next RDD is processed

2014-03-28 Thread Mark Hamstra
As long as the amount of state being passed is relatively small, it's probably easiest to send it back to the driver and to introduce it into RDD transformations as the zero value of a fold. On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu amoc...@verticalscope.comwrote: I'd like to resurrect
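A minimal sketch of that pattern, with one caveat: RDD.fold applies its zero value once per partition (and once more when merging), so for a plain sum the carried state is safest added after the fold (names hypothetical):

val batch1 = sc.parallelize(Seq(1, 2, 3))
val total1 = batch1.fold(0)(_ + _)          // small state, now on the driver

val batch2 = sc.parallelize(Seq(4, 5, 6))
val total2 = batch2.fold(0)(_ + _) + total1 // state reintroduced in the next step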

Re: Not getting it

2014-03-28 Thread lannyripple
Ok. Based on Sonal's message I dived more into memory and partitioning and got it to work. For the CSV file I used 1024 partitions [textFile(path, 1024)] which cut the partition size down to 8MB (based on standard HDFS 64MB splits). For the key file I also adjusted partitions to use about 8MB.
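For reference, a sketch of the fix described (paths hypothetical):

val csv  = sc.textFile("hdfs:///data/big.csv", 1024) // ~8MB per partition instead of 64MB
val keys = sc.textFile("hdfs:///data/keys.txt", 128)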

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
I'd like to resurrect this thread since I don't have an answer yet. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 10:04 AM To: u...@spark.incubator.apache.org Subject: function state lost when next RDD is processed Is there a way to pass a custom function to spark to

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread pradeeps8
Hi Aureliano, I followed this thread to create a custom saveAsObjectFile. The following is the code. new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray)).map(x => (NullWritable.get(), new
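A completed sketch of what such a custom saveAsObjectFile might look like, assuming saveRDD contains serializable elements; the serialize helper below is a hypothetical stand-in for Spark's private Utils.serialize, and the output path is made up:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._

// Hypothetical stand-in for Spark's private Utils.serialize
def serialize(obj: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.toByteArray
}

saveRDD
  .mapPartitions(iter => iter.grouped(10).map(_.toArray))          // batch rows into arrays
  .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
  .saveAsSequenceFile("hdfs:///out/objects")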

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread Ognen Duzlevski
There is also this quote from the Tuning guide (http://spark.incubator.apache.org/docs/latest/tuning.html): Finally, if you don't register your classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful. It implies that you don't really
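For completeness, a sketch of registering a class with Kryo in the 0.9-era API (MyClass and MyRegistrator are hypothetical names):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyClass(val a: Int, val b: String)

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass]) // avoids storing the full class name with each object
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")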

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
Thanks! Ya that's what I'm doing so far, but I wanted to see if it's possible to keep the tuples inside Spark for fault tolerance purposes. -A From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: March-28-14 10:45 AM To: user@spark.apache.org Subject: Re: function state lost when next RDD

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Thanks a lot Ognen! It's not a fancy class that I wrote, and now I realized I neither extended Serializable nor registered with Kryo, and that's why it is not working. -- View this message in context:

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi, Thanks Nanzhu. I tried to implement your suggestion on the following scenario: I have an RDD of, say, 24 elements. When I partitioned it into two groups of 12 elements each, there is a loss of order of elements within a partition; elements are partitioned randomly. I need to preserve the order such that the first

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-28 Thread Tathagata Das
The cleaner ttl was introduced as a brute force method to clean all old data and metadata in the system, so that the system can run 24/7. The cleaner ttl should be set to a large value, so that RDDs older than that are not used. Though there are some cases where you may want to use an RDD again
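For reference, a sketch of that setting (the one-hour value is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cleaner.ttl", "3600") // seconds; RDDs and metadata older than this may be cleaned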

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I think you should sort each RDD -Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 4:44 PM To: u...@spark.incubator.apache.org Subject: Re: Splitting RDD and Grouping together to perform computation Hi, Thanks Nanzhu.I tried to implement your suggestion on

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I say you need to remap so you have a key for each tuple that you can sort on. Then call rdd.sortByKey(true) like this: mystream.transform(rdd => rdd.sortByKey(true)) For this fn to be available you need to import org.apache.spark.rdd.OrderedRDDFunctions -Original Message- From: yh18190

Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Syed A. Hashmi
From the gist of it, it seems like you need to override the default partitioner to control how your data is distributed among partitions. Take a look at the different Partitioners available (Default, Range, Hash); if none of these gets you the desired result, you might want to provide your own. On Fri,
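A minimal custom Partitioner sketch, using the 12-element groups from the earlier message (class and parameter names are hypothetical):

import org.apache.spark.Partitioner

class FixedGroupPartitioner(val groupSize: Int, val groups: Int) extends Partitioner {
  override def numPartitions: Int = groups
  // Keys are assumed to be element indices: 0..11 go to partition 0, 12..23 to partition 1.
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] / groupSize
  override def equals(other: Any): Boolean = other match {
    case p: FixedGroupPartitioner => p.groupSize == groupSize && p.numPartitions == numPartitions
    case _ => false
  }
}

// Usage, assuming indexedRdd is an RDD[(Int, V)] keyed by element position:
// indexedRdd.partitionBy(new FixedGroupPartitioner(12, 2))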

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi Andriana, Thanks for the suggestion. Could you please modify my code where I need to do so? I apologise for the inconvenience; because I am new to Spark I couldn't apply it appropriately. I would be thankful to you. -- View this message in context:

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
Not sure how to change your code because you'd need to generate the keys where you get the data. Sorry about that. I can tell you where to put the code to remap and sort though. import org.apache.spark.rdd.OrderedRDDFunctions val res2 = reduced_hccg.map(_._2).map(x => (newkey, x)).sortByKey(true)
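Putting the two suggestions together, a sketch that keys each element by its original position, sorts, then drops the key (names hypothetical):

import org.apache.spark.SparkContext._

val data = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// Key each element with (partition id, offset) so the original order is sortable.
val keyed = data.mapPartitionsWithIndex { (pid, iter) =>
  iter.zipWithIndex.map { case (x, i) => ((pid, i), x) }
}
val ordered = keyed.sortByKey(true).map(_._2) // back to plain elements, in order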

Mutable tagging RDD rows ?

2014-03-28 Thread Sung Hwan Chung
Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like a mutable RDD, but it's more)? If not, what would be the best way to do something like this? Basically, we need to keep mutable

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
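A sketch of the MutablePair idea (tag logic hypothetical); note this reduces allocation rather than making the RDD truly mutable, and the reused object is only safe when results are immediately serialized (e.g., shuffled or written out), not cached:

import org.apache.spark.util.MutablePair

val rows = sc.parallelize(Seq("a", "bb", "ccc"))

val tagged = rows.mapPartitions { iter =>
  val pair = new MutablePair[Long, String]()     // one pair reused per partition
  iter.map(x => pair.update(x.length.toLong, x)) // tag = length, as a stand-in
}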

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this? Matei On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample.

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread Sonal Goyal
What does your saveRDD contain? If you are using custom objects, they should be serializable. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 srinivasa.prad...@gmail.comwrote: Hi Aureliano, I

Re: function state lost when next RDD is processed

2014-03-28 Thread Mayur Rustagi
Are you referring to Spark Streaming? Can you save the sum as an RDD and keep joining the two RDDs together? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Mar 28, 2014 at 10:47 AM, Adrian Mocanu
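A sketch of that suggestion (keys and values hypothetical): keep a running-sum RDD and fold each new batch into it per key:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

var sums: RDD[(String, Int)] = sc.parallelize(Seq.empty[(String, Int)])

def absorb(batch: RDD[(String, Int)]): Unit = {
  sums = sums.union(batch).reduceByKey(_ + _) // new state = old sums + this batch
}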

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if you tried it, it would (mostly) work, and my uncertainty with respect to guarantees on the semantics. Definitely there would be no fault tolerance if the mutations depend on state that is not captured in the RDD lineage.

Re: Announcing Spark SQL

2014-03-28 Thread Rohit Rai
Thanks Patrick, I was thinking about that... Upon analysis I realized it would be something similar to the way HiveContext uses the custom Catalog stuff. I will review it again, on the lines of implementing SchemaRDD with Cassandra. Thanks for the pointer. Upon discussion with a couple of

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
That helps! Thank you. On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at