Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Bharath Ravi Kumar
The result was no different with saveAsHadoopFile. In both cases, I can see that I've misinterpreted the API docs. I'll explore the APIs a bit further for ways to save the iterable as chunks rather than as one large text/binary value. It might also help to clarify this aspect in the API docs. For those

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Bharath Ravi Kumar
I also realized from your description of saveAsText that the API is indeed behaving as expected, i.e., it is appropriate (though not optimal) for the API to construct a single string out of the value. If the value turns out to be large, the user of the API needs to reconsider the implementation

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Sean Owen
Yes, that's the same thing really. You're still writing a huge value as part of one single (key,value) record. The value exists in memory in order to be written to storage. Although there aren't hard limits, in general, keys and values aren't intended to be huge, like, hundreds of megabytes. You

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Reynold Xin
None of your tuning will help here because the problem is actually the way you are saving the output. If you take a look at the stacktrace, it is trying to build a single string that is too large for the VM to allocate. The VM is not actually running out of memory; rather, the JVM cannot

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Bharath Ravi Kumar
Thanks for responding. This is what I initially suspected, and hence asked why the library needed to construct the entire value buffer on a single host before writing it out. The stacktrace appeared to suggest that user code is not constructing the large buffer. I'm simply calling groupBy and

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Sean Owen
saveAsText means save every element of the RDD as one line of text. It works like TextOutputFormat in Hadoop MapReduce since that's what it uses. So you are causing it to create one big string out of each Iterable this way. On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar reachb...@gmail.com
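
To illustrate the point above, here is a minimal sketch in plain Python (not Spark code) of why writing a grouped value as one line produces one large string per key, and how flattening the groups first keeps every line small. The helper names (group_by, lines_one_per_group, lines_one_per_element) and the sample records are hypothetical. In Spark, the analogous step would presumably be a flatMap or flatMapValues over the grouped RDD before calling saveAsTextFile.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical record shape, loosely mirroring Tuple4<String, Integer, Double, String>.
Record = Tuple[str, int, float, str]


def group_by(records: List[Record], key_fn: Callable[[Record], str]) -> Dict[str, List[Record]]:
    """Collect all records sharing a key into one list (stand-in for groupBy)."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


def lines_one_per_group(groups: Dict[str, List[Record]]) -> Iterator[str]:
    """One line per key: the whole group is rendered into a single string."""
    for key, values in groups.items():
        yield f"{key}\t{values}"


def lines_one_per_element(groups: Dict[str, List[Record]]) -> Iterator[str]:
    """One line per element: each line stays small however large the group is."""
    for key, values in groups.items():
        for value in values:
            yield f"{key}\t{value}"


records = [("a", 1, 0.5, "x"), ("a", 2, 1.5, "y"), ("b", 3, 2.5, "z")]
groups = group_by(records, key_fn=lambda r: r[0])

big_lines = list(lines_one_per_group(groups))      # length grows with group size
small_lines = list(lines_one_per_element(groups))  # length bounded by one record
```

With only a few records the difference is negligible, but the longest line in the first form grows with the largest group, while in the second form it is bounded by the size of a single record.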

OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Hi, I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~ 100 million. The data size is 20GB and groupBy results in an RDD of 1061 keys with values being Iterable<Tuple4<String, Integer, Double, String>>. The job runs on 3 hosts in a standalone setup with each host's

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit. On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~ 100 million. The data size is 20GB and groupBy results

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Resurfacing the thread. Surely OOM shouldn't be the norm for a common groupBy / sort use case in a framework that leads sorting benchmarks? Or is there something fundamentally wrong in the usage? On 02-Nov-2014 1:06 am, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, I'm trying to run

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread arthur.hk.c...@gmail.com
Hi, FYI as follows. Could you post your heap size settings as well as your Spark app code? Regards, Arthur. 3.1.3 Detail Message: Requested array size exceeds VM limit. The detail message "Requested array size exceeds VM limit" indicates that the application (or APIs used by that application)