Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Anshul Singhle
...@sigmoidanalytics.com wrote: Does this speed it up? val rdd = sc.parallelize(1 to 100, 30); rdd.count Thanks, Best Regards On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running the following code in my cluster (standalone mode) via spark-shell: val rdd ...
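
A minimal sketch of the suggestion, assuming a spark-shell session where the SparkContext is available as sc; the second argument to parallelize sets the number of partitions explicitly:

    // Spread the 100 elements over 30 partitions (roughly one per core)
    // so every core gets a task in a single wave.
    val rdd = sc.parallelize(1 to 100, 30)
    rdd.count()   // triggers the job; returns 100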

Re: java.io.IOException: No space left on device

2015-04-29 Thread Anshul Singhle
Do you have multiple disks? Maybe your work directory is not on the right disk? On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com wrote: Hi, I'm using Spark (1.3.1) MLlib to run the random forest algorithm on tf-idf output; the training data is a file containing 156060 (size ...
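
One way to point Spark's scratch/work space at a larger disk, sketched here only; the path /mnt/bigdisk/spark-tmp is a hypothetical placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.local.dir controls where shuffle and spill files are written;
    // set it (comma-separated for several disks) before creating the context.
    val conf = new SparkConf()
      .setAppName("RandomForestTfIdf")
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // hypothetical path
    val sc = new SparkContext(conf)

Note that in standalone mode the SPARK_LOCAL_DIRS environment variable set on the workers takes precedence over spark.local.dir.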

Re: Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
Yes. On 29 Apr 2015 03:31, ayan guha guha.a...@gmail.com wrote: Is your driver running on the same machine as the master? On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running short Spark jobs on RDDs cached in memory. I'm also using a long-running job context. I want ...

rdd.count with 100 elements taking 1 second to run

2015-04-28 Thread Anshul Singhle
Hi, I'm running the following code in my cluster (standalone mode) via spark-shell: val rdd = sc.parallelize(1 to 100); rdd.count. This takes around 1.2s to run. Is this expected, or am I configuring something wrong? I'm using about 30 cores with 512 MB executor memory. As expected, GC time is ...
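
A rough way to separate one-off warm-up cost from per-job scheduling overhead, a sketch assuming the same spark-shell session:

    // Time several runs; the first count usually pays executor and
    // classloader warm-up, later runs show the steady-state overhead.
    val rdd = sc.parallelize(1 to 100)
    for (i <- 1 to 5) {
      val t0 = System.nanoTime()
      rdd.count()
      println(s"run $i: ${(System.nanoTime() - t0) / 1e6} ms")
    }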

Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
Hi, I'm running short Spark jobs on RDDs cached in memory. I'm also using a long-running job context. I want to be able to complete my jobs (on the cached RDD) in under 1 second. I'm getting the following job times with about 15 GB of data distributed across 6 nodes. Each executor has about 20 GB of ...
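
A minimal sketch of the setup described above, with a hypothetical input path; caching and materializing the RDD once inside the long-running context keeps later jobs from re-reading the data:

    import org.apache.spark.storage.StorageLevel

    // Cache the working set in memory once, inside the long-lived context...
    val cached = sc.textFile("hdfs:///data/events")      // hypothetical path
      .persist(StorageLevel.MEMORY_ONLY)
    cached.count()                                        // materialize the cache

    // ...then subsequent short jobs run against the in-memory copy.
    val errors = cached.filter(_.contains("ERROR")).count()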

Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Anshul Singhle
Hi firemonk9, What you're doing looks interesting. Can you share some more details? Are you running the same SparkContext for every job, or a separate SparkContext for each job? Does your system need to share RDDs across multiple jobs? If yes, how do you implement that? Also ...
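
One common pattern behind these questions, shown only as a sketch: a single long-lived SparkContext owned by the serving process, with cached RDDs kept in a registry so several jobs can reuse them (the names and structure here are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import scala.collection.concurrent.TrieMap

    object SharedSpark {
      // One context for the whole application; Spark allows only one per JVM.
      val sc = new SparkContext(new SparkConf().setAppName("job-server"))

      // Registry of cached RDDs shared across submitted jobs.
      private val rdds = TrieMap.empty[String, RDD[String]]

      def getOrLoad(name: String, path: String): RDD[String] =
        rdds.getOrElseUpdate(name, sc.textFile(path).cache())
    }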

Getting outofmemory errors on spark

2015-04-10 Thread Anshul Singhle
Hi, I'm reading data stored in S3, aggregating it, and storing it in Cassandra using a Spark job. When I run the job with approximately 3 million records (about 3-4 GB of data) stored in text files, I get the following error: (11529/14925) 15/04/10 19:32:43 INFO TaskSetManager: Starting task 11609.0 in ...
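
For context, a sketch of the kind of pipeline described, using the DataStax spark-cassandra-connector; the S3 bucket, keyspace, table, and column names are placeholders:

    import com.datastax.spark.connector._   // spark-cassandra-connector

    // Read the raw text files from S3, aggregate, then write to Cassandra.
    val lines = sc.textFile("s3n://my-bucket/events/*.txt")    // hypothetical bucket
    val counts = lines
      .map(line => (line.split(",")(0), 1L))                   // key on the first CSV field
      .reduceByKey(_ + _)

    counts.saveToCassandra("my_keyspace", "event_counts",
      SomeColumns("event_key", "total"))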