Re: Configuring Spark for reduceByKey on massive data sets

2015-10-11 Thread hotdog
Hi Daniel,
Did you solve your problem?
I ran into the same problem when running reduceByKey on a massive data set on YARN.






Re: Configuring Spark for reduceByKey on massive data sets

2014-05-18 Thread Daniel Mahler
Hi Matei,

Thanks for the suggestions.
Is the number of partitions set by calling 'myrdd.partitionBy(new
HashPartitioner(N))'?
Is there some heuristic formula for choosing a good number of partitions?

thanks
Daniel
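
For reference, here is a minimal Scala sketch of the two usual ways to control
the reduce-side partition count: passing it straight to reduceByKey, or
pre-partitioning with a HashPartitioner. The toy data and the count of 2000
are placeholders, not a recommendation.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object PartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-sketch"))

        // Toy pair RDD standing in for the real data set.
        val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

        // Option 1: pass the reduce-side partition count directly to reduceByKey.
        val counts1 = pairs.reduceByKey(_ + _, 2000)

        // Option 2: pre-partition with a HashPartitioner, then reduce.
        // reduceByKey picks up the existing partitioner, so the aggregation
        // itself does not trigger a second shuffle.
        val counts2 = pairs.partitionBy(new HashPartitioner(2000)).reduceByKey(_ + _)

        println(s"${counts1.count()} / ${counts2.count()}")
        sc.stop()
      }
    }

As for a heuristic, the Spark tuning guide suggests roughly 2-3 tasks per CPU
core in the cluster; for heavy shuffles it is usually safer to err toward more,
smaller partitions so each reduce task fits comfortably in memory.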






Re: Configuring Spark for reduceByKey on massive data sets

2014-05-18 Thread lukas nalezenec
Hi
Try using *reduceByKeyLocally*.
Regards
Lukas Nalezenec
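
For what it is worth, here is a small sketch of that approach on made-up data.
Note that reduceByKeyLocally performs the merge and hands back an ordinary Map
on the driver rather than an RDD, so it only fits cases like histograms where
the number of distinct keys is small.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object LocalReduceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("local-reduce-sketch"))

        // Made-up data with a small, bounded key space (256 histogram buckets).
        val buckets = sc.parallelize(1 to 1000000).map(i => (i % 256, 1L))

        // reduceByKeyLocally merges the values for each key and returns a plain
        // Map on the driver instead of an RDD, so the number of distinct keys
        // must be small enough to fit in driver memory.
        val histogram: scala.collection.Map[Int, Long] = buckets.reduceByKeyLocally(_ + _)

        println(histogram.toSeq.sortBy(_._1).take(10).mkString(", "))
        sc.stop()
      }
    }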


>


Re: Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Matei Zaharia
Make sure you set up enough reduce partitions so you don’t overload them. 
Another thing that may help is checking whether you’ve run out of local disk 
space on the machines, and turning on spark.shuffle.consolidateFiles to produce 
fewer files. Finally, there’s been a recent fix in both branch 0.9 and master 
that reduces the amount of memory used when there are small files (due to extra 
memory that was being taken by mmap()): 
https://issues.apache.org/jira/browse/SPARK-1145. You can find this in either 
the 1.0 release candidates on the dev list, or branch-0.9 in git.

Matei
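
A rough sketch of how those two suggestions look in code against the 0.9/1.0-era
API follows; the sample data and the partition count of 5000 are placeholders
only.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object ShuffleConfigSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-config-sketch")
          // Consolidate the per-map shuffle output into far fewer files.
          .set("spark.shuffle.consolidateFiles", "true")
        val sc = new SparkContext(conf)

        // Placeholder data; the real input would be the billion-item RDD.
        val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

        // Explicit reduce-side partition count so no single reducer is overloaded.
        val reduced = pairs.reduceByKey(_ + _, 5000)

        println(reduced.count())
        sc.stop()
      }
    }

Note that spark.shuffle.consolidateFiles only changes how many files the map
side writes; the reduce-side partition count still has to be set explicitly.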




Re: Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Madhu
Daniel,

How many partitions do you have?
Are they more or less uniformly distributed?
We have similar data volume currently running well on Hadoop MapReduce with
roughly 30 nodes. 
I was planning to test it with Spark. 
I'm very interested in your findings. 



-
Madhu
https://www.linkedin.com/in/msiddalingaiah


Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Daniel Mahler
I have had a lot of success with Spark on large datasets,
both in terms of performance and flexibility.
However, I hit a wall with reduceByKey when the RDD contains billions of
items.
I am reducing with simple functions like addition for building histograms,
so the reduction process should run in constant memory.
I am using tens of AWS EC2 machines, each with 60 GB of memory and 30 processors.

After a while the whole process just hangs.
I have not been able to isolate the root problem from the logs,
but I suspect that the problem is in the shuffling.
Simple mapping and filtering transformations work fine,
and the reduceByKey also goes through if I first cut the data down to
roughly 10^8 items.

What do I need to do to make reduceByKey work for >10^9 items?

thanks
Daniel
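
For concreteness, the kind of constant-memory histogram reduction described
above looks roughly like the sketch below; the bucketing rule, the data and the
partition count are invented for illustration. Because the reduce function is
plain addition, Spark can combine values map-side, so per-task memory should
scale with the number of distinct keys per partition rather than with the total
item count.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object HistogramSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("histogram-sketch"))

        // Stand-in for the real input; the bucketing rule is invented.
        val values = sc.parallelize(1 to 10000000).map(_.toDouble)

        val histogram = values
          .map(v => (math.floor(v / 1000.0).toLong, 1L)) // bucket each value
          .reduceByKey(_ + _, 4000)                      // explicit reduce partitions

        histogram.take(5).foreach(println)
        sc.stop()
      }
    }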