Re: How to efficiently join this two complicated rdds

2014-02-19 Thread Eugen Cepoi
Yeah, this is due to the fact that the broadcast variables are kept in memory, and I am guessing that they are referenced in a way that prevents them from being garbage collected... A solution could be to enable spark.cleaner.ttl, but I don't like it much as it sounds more like a hacky solution. There

Re: How to efficiently join this two complicated rdds

2014-02-19 Thread Eugen Cepoi
the cleaning of old broadcast vars. 2014-02-19 12:25 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com: Yeah, this is due to the fact that the broadcast variables are kept in memory, and I am guessing that they are referenced in a way that prevents them from being garbage collected... A solution could
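The workaround discussed above would look roughly like this. A minimal sketch, assuming the System-property style of configuration used by Spark at the time; the TTL value is illustrative only:

import org.apache.spark.SparkContext

object BroadcastCleanupExample {
  def main(args: Array[String]): Unit = {
    // Must be set before the SparkContext is created; value is in seconds, and
    // anything older than the TTL becomes eligible for cleanup.
    System.setProperty("spark.cleaner.ttl", "3600")
    val sc = new SparkContext("local[2]", "broadcast-cleanup-example")
    // ... create and use broadcast variables as usual ...
    sc.stop()
  }
}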

Re: How to efficiently join this two complicated rdds

2014-02-18 Thread Eugen Cepoi
Hi, What is the size of RDD two? You want to map a line from RDD one to multiple values from RDD two and get the sum of all of them? So as a result you would have an RDD of the size of RDD1, containing one number per line? 2014-02-18 8:06 GMT+01:00 hanbo hanbo...@gmail.com: Sincerely thank you for
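A minimal sketch of the join-and-sum described above; the keys, sample data and value types are hypothetical:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair RDD functions (needed in pre-1.3 Spark)

val sc = new SparkContext("local[2]", "join-sum-sketch")
val rdd1 = sc.parallelize(Seq(("k1", "line a"), ("k2", "line b")))     // (key, line) from RDD one
val rdd2 = sc.parallelize(Seq(("k1", 1.0), ("k1", 2.5), ("k2", 4.0)))  // (key, value) from RDD two

val sumPerLine = rdd1
  .join(rdd2)                                      // (key, (line, value)) for every matching pair
  .map { case (key, (line, value)) => ((key, line), value) }
  .reduceByKey(_ + _)                              // one summed number per line of RDD one

sumPerLine.collect().foreach(println)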

Re: How to confirm serializer type on workers?

2014-01-09 Thread Eugen Cepoi
Do you have the stack trace? I had something similar, where the Kryo deserializer was throwing EOF, but in fact the EOF means nothing: Spark catches Kryo exceptions and then throws EOF (and loses the real cause...). In my case Kryo couldn't find the class to deserialize to. 2014/1/8 Aureliano Buendia
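One way to avoid the masked ClassNotFound is to register the classes Kryo has to deserialize. A minimal sketch, with a hypothetical MyRecord class and the property-style configuration of that era:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, name: String)   // hypothetical application class

class MyKryoRegistrator extends KryoRegistrator {
  // Register every class that will travel through Kryo on the workers.
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}

// Before creating the SparkContext:
// System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// System.setProperty("spark.kryo.registrator", "MyKryoRegistrator")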

Re: Spark context jar confusions

2014-01-05 Thread Eugen Cepoi
hadoop in your fat jar: <include>org.apache.hadoop:*</include> This would take a big chunk of the fat jar. Isn't this jar already included in Spark? On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: It depends how you deploy, I don't find it so complicated... 1) To build
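The thread also mentions sbt assembly; with sbt-assembly the usual equivalent of the Maven include/exclude above is to mark Spark and Hadoop as "provided" so they stay out of the fat jar. A build.sbt sketch, with illustrative versions for that era:

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating" % "provided",
  "org.apache.hadoop"  % "hadoop-client" % "2.2.0"            % "provided"
)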

Re: debug standalone Spark jobs?

2014-01-05 Thread Eugen Cepoi
You can set the log level to INFO; it looks like Spark is logging application errors at INFO. When I have errors that I can reproduce only on live data, I run a Spark shell with my job in its classpath, then I debug and tweak things to find out what happens. 2014/1/5 Nan Zhu
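From a Spark shell this can be done programmatically with the log4j 1.x API bundled with Spark at the time; a minimal sketch:

import org.apache.log4j.{Level, Logger}

// Make messages logged at INFO (including the application errors mentioned above) visible.
Logger.getRootLogger.setLevel(Level.INFO)
Logger.getLogger("org.apache.spark").setLevel(Level.INFO)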

Re: Spark context jar confusions

2014-01-02 Thread Eugen Cepoi
Hi, This is the list of the jars you use in your job; the driver will send all those jars to each worker (otherwise the workers won't have the classes you need in your job). The easy way to go is to build a fat jar with your code and all the libs you depend on, and then use this utility to get the
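A minimal sketch of shipping the fat jar through the SparkContext constructor, using the Spark 0.8/0.9-era API; the master URL, app name and jar path are hypothetical:

import org.apache.spark.SparkContext

object MyJob {
  def main(args: Array[String]): Unit = {
    // Resolve the jar that contains this class (the fat jar once assembled),
    // or list it explicitly, e.g. Seq("target/myjob-assembly-0.1.jar").
    val jars = SparkContext.jarOfClass(MyJob.getClass).toSeq
    val sc = new SparkContext("spark://localhost:7077", "my-job", System.getenv("SPARK_HOME"), jars)
    // ... job code ...
    sc.stop()
  }
}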

Re: Spark context jar confusions

2014-01-02 Thread Eugen Cepoi
sbt assembly also create that jar? 3. Is calling sc.jarOfClass() the most common way of doing this? I cannot find any example by googling. What's the most common way that people use? On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, This is the list of the jars

Re: Spark context jar confusions

2014-01-02 Thread Eugen Cepoi
? Using spark://localhost:7077 is a good way to simulate the production driver and it provides the web ui. When using spark://localhost:7077, is it required to create the fat jar? Wouldn't that significantly slow down the development cycle? On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi cepoi.eu

Re: Spark context jar confusions

2014-01-02 Thread Eugen Cepoi
and it provides the web UI. When using spark://localhost:7077, is it required to create the fat jar? Wouldn't that significantly slow down the development cycle? On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: It depends how you deploy, I don't find it so complicated

Re: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-02 Thread Eugen Cepoi
Did you try setting the spark.executor.memory property to the amount of memory you want per worker? For example spark.executor.memory=2g http://spark.incubator.apache.org/docs/latest/configuration.html 2014/1/2 Archit Thakur archit279tha...@gmail.com Need not mention Workers could be seen
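A minimal sketch of the suggestion, using the System-property style of configuration from the linked page; the master URL is hypothetical:

// Must be set before the SparkContext is created.
System.setProperty("spark.executor.memory", "2g")
val sc = new org.apache.spark.SparkContext("spark://master:7077", "memory-example")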

Re: Spark context jar confusions

2014-01-02 Thread Eugen Cepoi
2014/1/2 Aureliano Buendia buendia...@gmail.com On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: When developing I am using local[2], which launches a local cluster with 2 workers. In most cases it is fine; I just encountered some strange behaviours with broadcast

Re: debugging NotSerializableException while using Kryo

2013-12-24 Thread Eugen Cepoi
In Scala, case classes are serializable by default, so your TileIdWritable should be a case class. I usually enable Kryo serialization for data objects and keep the default serializer for closures; this works pretty well. Eugen 2013/12/24 Ameet Kini ameetk...@gmail.com If Java serialization is the only one that properly
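A minimal sketch of the point above, with a hypothetical case class standing in for TileIdWritable:

// A case class extends Serializable automatically, so it works with the default
// Java serializer in closures and shuffles.
case class TileId(col: Int, row: Int, zoom: Int)

// Kryo can then be enabled for data serialization only; closures keep the default
// Java closure serializer regardless of this setting:
// System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")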

Re: HttpBroadcast strange behaviour, bug?

2013-11-19 Thread Eugen Cepoi
Ramachandrasekaran sri.ram...@gmail.com Try local[m], where m is the number of workers. For tests, local[2] should be ideal. This is generally the way to write tests for Spark code. On Tue, Nov 19, 2013 at 10:03 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Maybe a bug with HttpBroadcast
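A minimal sketch of such a test against local[2]; ScalaTest's FunSuite and the sample data are assumptions, and any other test framework works the same way:

import org.apache.spark.SparkContext
import org.scalatest.FunSuite

class BroadcastSuite extends FunSuite {
  test("broadcast value is visible in tasks") {
    val sc = new SparkContext("local[2]", "broadcast-test")
    try {
      val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
      val result = sc.parallelize(Seq("a", "b")).map(k => lookup.value(k)).collect()
      assert(result.sameElements(Array(1, 2)))
    } finally {
      sc.stop()
    }
  }
}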

Re: HttpBroadcast strange behaviour, bug?

2013-11-19 Thread Eugen Cepoi
for other inputs. On Tue, Nov 19, 2013 at 10:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Yes, sure, for usual tests it is fine, but the broadcast is only done if we are not in local mode (at least it seems so). In SparkContext we have def broadcast[T](value: T) = env.broadcastManager.newBroadcast

Re: Write to HBase from spark job

2013-10-12 Thread Eugen Cepoi
for output formats that go to a filesystem (e.g. HDFS), but HBase isn't a filesystem. Matei On Oct 11, 2013, at 8:53 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi there, I have got a few questions on how best to write to HBase from a Spark job. - If we want to write using

Write to HBase from spark job

2013-10-11 Thread Eugen Cepoi
Hi there, I have got a few questions on how best to write to HBase from a Spark job. - If we want to write using TableOutputFormat, are we supposed to use saveAsNewAPIHadoopFile? - Or should we do it by hand (without TableOutputFormat), in a foreach loop for example? - Or should we use
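A minimal sketch of the "by hand" option, writing from foreachPartition with the HBase client API of that era; the table, column family and qualifier names are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def writeToHBase(rdd: RDD[(String, String)]): Unit = {
  rdd.foreachPartition { rows =>
    val conf = HBaseConfiguration.create()      // picks up hbase-site.xml from the classpath
    val table = new HTable(conf, "my_table")    // one connection per partition
    try {
      rows.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
    } finally {
      table.close()
    }
  }
}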