spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes of data). The RDD is partitioned into 2048 partitions which are more or less equal and entirely cached in RAM. I evaluated the performance on several cluster sizes, and am witnessing a non-linear (power)

Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Additional missing relevant information: I'm running a transformation, there are no shuffles occurring, and at the end I'm performing a lookup of 4 partitions on the driver. On 10/7/15 11:26 AM, Yadid Ayzenberg wrote: Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several

Re: spark 1.4.1 - LZFException

2015-09-03 Thread Yadid Ayzenberg
going on Thanks Best Regards On Sun, Aug 23, 2015 at 1:27 AM, Yadid Ayzenberg <ya...@media.mit.edu> wrote: Hi All, We have a Spark standalone cluster running 1.4.1 and we are setting spark.io.compression.codec to lzf. I have a

spark 1.4.1 - LZFException

2015-08-22 Thread Yadid Ayzenberg
Hi All, We have a Spark standalone cluster running 1.4.1 and we are setting spark.io.compression.codec to lzf. I have a long-running interactive application which behaves normally, but after a few days I get the following exception in multiple jobs. Any ideas on what could be causing this
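The codec setting described above can be applied through SparkConf before the context is created. A minimal sketch, assuming a standalone deployment; the application name is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: select the LZF block compression codec for shuffled and cached data.
// In Spark 1.4, spark.io.compression.codec accepts "lzf", "snappy", or "lz4".
val conf = new SparkConf()
  .setAppName("lzf-codec-example") // placeholder application name
  .set("spark.io.compression.codec", "lzf")
val sc = new SparkContext(conf)
```

The same property can instead be set cluster-wide in spark-defaults.conf, which avoids hard-coding it in the application.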

Re: Change delimiter when collecting SchemaRDD

2014-08-29 Thread yadid ayzenberg
on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql("SELECT * FROM src").map(_.mkString("|")).collect() On Thu, Aug 28, 2014 at 7:58 AM, yadid ayzenberg ya...@media.mit.edu wrote: Hi All, Is there any way to change the delimiter from being a comma? Some
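The mkString-based suggestion above works because a Row's values are joined with whatever separator is passed in. The joining behavior can be illustrated with a plain Scala Seq standing in for a Row:

```scala
// Illustration of the delimiter change suggested above, using a plain Seq
// in place of a SchemaRDD Row. One field deliberately contains a comma.
val row = Seq("alice", "new york, ny", "42")
val pipeDelimited = row.mkString("|")
println(pipeDelimited) // alice|new york, ny|42
```

Because the embedded comma is no longer the field separator, the output splits unambiguously on '|'.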

Change delimiter when collecting SchemaRDD

2014-08-28 Thread yadid ayzenberg
Hi All, Is there any way to change the delimiter from being a comma? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid

Losing Executors on cluster with RDDs of 100GB

2014-08-22 Thread Yadid Ayzenberg
Hi all, I have a Spark cluster of 30 machines, 16GB / 8 cores each, running in standalone mode. Previously my application was working well (several RDDs, the largest being around 50G). When I started processing larger amounts of data (RDDs of 100G), my app started losing executors. I'm currently

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Yadid Ayzenberg
Yep, I just issued a pull request. Yadid On 5/31/14, 1:25 PM, Patrick Wendell wrote: 1. ctx is an instance of JavaSQLContext but the textFile method is called as a member of ctx. According to the API, JavaSQLContext does not have such a member, so I'm guessing this should be sc instead. Yeah,

possible typos in spark 1.0 documentation

2014-05-30 Thread Yadid Ayzenberg
Congrats on the new 1.0 release. Amazing work! It looks like there may be some typos in the latest http://spark.apache.org/docs/latest/sql-programming-guide.html in the Running SQL on RDDs section when choosing the Java example: 1. ctx is an instance of JavaSQLContext but the textFile method

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread Yadid Ayzenberg
An additional option: 4) Use SparkContext.addJar() and have the application ship your jar to all the nodes. Yadid On 5/4/14, 4:07 PM, DB Tsai wrote: If you add the breeze dependency in your build.sbt project, it will not be available to all the workers. There are a couple of options: 1) use sbt
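Option 4 above would look roughly like the following. This is a deployment sketch, not runnable as-is: the jar path is a placeholder and assumes the breeze jar has been built or downloaded locally:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of option 4: ship a dependency jar (e.g. breeze) to every executor
// via SparkContext.addJar, instead of pre-installing it on each node.
val sc = new SparkContext(new SparkConf().setAppName("addjar-example"))
sc.addJar("/path/to/breeze.jar") // placeholder path to the dependency jar
```

Jars added this way are fetched by each executor at task launch, which trades a one-time download cost for not having to manage the dependency on every worker.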

Re: Strange lookup behavior. Possible bug?

2014-04-30 Thread Yadid Ayzenberg
Dear Sparkers, Has anyone got any insight on this? I am really stuck. Yadid On 4/28/14, 11:28 AM, Yadid Ayzenberg wrote: Thanks for your answer. I tried running on a single machine - master and worker on one host. I get exactly the same results. Very little CPU activity on the machine

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Yadid Ayzenberg
. If you still see the issue, I'd check whether the task has really completed. What do you see on the web UI? Is the executor using CPU? Good luck. On Mon, Apr 28, 2014 at 2:35 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: Can someone please suggest how I can

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Yadid Ayzenberg
) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) On 4/28/14 11:28 AM, Yadid Ayzenberg wrote: Thanks for your answer. I tried running on a single machine - master and worker on one host. I get exactly the same results. Very little CPU activity on the machine in question. The web UI shows a single

Re: Strange lookup behavior. Possible bug?

2014-04-27 Thread Yadid Ayzenberg
:37 PM, Yadid Ayzenberg wrote: Some additional information - maybe this rings a bell with someone: I suspect this happens when the lookup returns more than one value. For 0 and 1 values, the function behaves as you would expect. Anyone? On 4/25/14, 1:55 PM, Yadid Ayzenberg wrote: Hi All, I'm

Strange lookup behavior. Possible bug?

2014-04-25 Thread Yadid Ayzenberg
Hi All, I'm running a lookup on a JavaPairRDD<String, Tuple2>. When running on a local machine, the lookup is successful. However, when running on a standalone cluster with the exact same dataset, one of the tasks never ends (constantly in RUNNING status). When viewing the worker log, it seems that
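For context, lookup(key) on a pair RDD returns all values stored under that key, which is why the multi-value case discussed in this thread matters. The semantics can be mimicked with plain Scala pairs (a hypothetical helper, not the Spark implementation):

```scala
// Pure-Scala illustration of PairRDD.lookup semantics: return every value
// stored under a given key (a key may map to zero, one, or many values).
def lookup[K, V](data: Seq[(K, V)], key: K): Seq[V] =
  data.collect { case (k, v) if k == key => v }

val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3)
println(lookup(pairs, "a")) // List(1, 3)
println(lookup(pairs, "x")) // List()
```

The zero-value and single-value cases behave as expected, which matches the observation in this thread that only multi-value lookups were problematic.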

Re: Strange lookup behavior. Possible bug?

2014-04-25 Thread Yadid Ayzenberg
Some additional information - maybe this rings a bell with someone: I suspect this happens when the lookup returns more than one value. For 0 and 1 values, the function behaves as you would expect. Anyone? On 4/25/14, 1:55 PM, Yadid Ayzenberg wrote: Hi All, I'm running a lookup