The result was no different with saveAsHadoopFile. In both cases, I can see
that I've misinterpreted the API docs. I'll explore the APIs a bit further
for ways to save the iterable as chunks rather than as one large text/binary
record. It might also help to clarify this aspect in the API docs.
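To illustrate the "save as chunks" idea: a minimal sketch in plain Java (no Spark), where instead of one (key, hugeIterable) record, several (key, chunk) records of bounded size are emitted. In Spark this would correspond to flattening the grouped values before writing; only the chunking logic is shown, and CHUNK_SIZE is a hypothetical tuning knob, not a Spark setting.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChunkValues {
    // Hypothetical chunk size; tune so each record stays comfortably small.
    static final int CHUNK_SIZE = 1000;

    // Split one huge (key, values) pair into many (key, boundedChunk) pairs.
    static <K, V> List<Map.Entry<K, List<V>>> chunk(K key, Iterable<V> values) {
        List<Map.Entry<K, List<V>>> out = new ArrayList<>();
        List<V> current = new ArrayList<>();
        for (V v : values) {
            current.add(v);
            if (current.size() == CHUNK_SIZE) {
                out.add(new SimpleEntry<>(key, current));
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) out.add(new SimpleEntry<>(key, current));
        return out;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 2500; i++) values.add(i);
        // 2500 values become chunks of 1000, 1000, 500
        for (Map.Entry<String, List<Integer>> e : chunk("k", values)) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " values");
        }
    }
}
```

Each chunk then serializes to a string of bounded size, so no single record forces a multi-gigabyte allocation.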
I also realized from your description of saveAsText that the API is indeed
behaving as expected, i.e. it is appropriate (though not optimal) for the
API to construct a single string out of the value. If the value turns out
to be large, the user of the API needs to reconsider the implementation.
Yes, that's the same thing really. You're still writing a huge value
as part of one single (key, value) record. The whole value has to exist
in memory in order to be written to storage. Although there aren't hard
limits, in general keys and values aren't intended to be huge, like
hundreds of megabytes.
None of your tuning will help here because the problem is actually the way
you are saving the output. If you take a look at the stack trace, it is
trying to build a single string that is too large for the VM to allocate.
The VM is not actually running out of memory; rather, the JVM cannot
allocate an array that large.
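A short sketch of the arithmetic behind that limit (plain Java, nothing Spark-specific): a JVM array is indexed by a signed int, so its length can never exceed Integer.MAX_VALUE (in practice a few elements less, depending on the VM), and a String built from a multi-gigabyte value needs one backing char array of roughly that many elements.

```java
public class ArrayLimit {
    // Hard upper bound on any JVM array length (signed 32-bit index).
    static final long MAX_ARRAY_LENGTH = Integer.MAX_VALUE;

    // Can `elements` items fit into a single JVM array?
    static boolean fitsInOneArray(long elements) {
        return elements <= MAX_ARRAY_LENGTH;
    }

    public static void main(String[] args) {
        // ~21 billion chars if a 20GB value were stringified at ~1 char per byte
        long twentyGB = 20L * 1024 * 1024 * 1024;
        System.out.println(fitsInOneArray(twentyGB)); // prints false
    }
}
```

So the failure is a per-array allocation limit, not heap exhaustion, which is why raising the heap size alone does not help.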
Thanks for responding. This is what I initially suspected, and hence asked
why the library needed to construct the entire value buffer on a single
host before writing it out. The stacktrace appeared to suggest that user
code is not constructing the large buffer. I'm simply calling groupBy and
saveAsText means save every element of the RDD as one line of text.
It works like TextOutputFormat in Hadoop MapReduce since that's what
it uses. So you are causing it to create one big string out of each
Iterable this way.
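To make "one big string out of each Iterable" concrete, here is a minimal sketch in plain Java of the toString-per-element behavior described above (the method name toLine is illustrative, not Spark's API): a (key, Iterable) pair produced by a groupBy is stringified whole, so every element under the key is concatenated into one in-memory String before anything is written.

```java
import java.util.Arrays;
import java.util.List;

public class OneLinePerElement {
    // One element of the RDD becomes one line of text via toString,
    // which is where the entire value gets materialized in memory.
    static String toLine(Object element) {
        return element.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the Iterable of values under one key after a groupBy
        List<Integer> groupedValues = Arrays.asList(1, 2, 3);
        System.out.println(toLine(groupedValues)); // prints [1, 2, 3]
    }
}
```

With 20GB of data under ~1000 keys, a single such line averages tens of megabytes at best and can be far larger for skewed keys.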
On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of
count ~100 million. The data size is 20GB and groupBy results in an RDD of
1061 keys with values being Iterable<Tuple4<String, Integer, Double,
String>>. The job runs on 3 hosts in a standalone setup with each host's
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit.
On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD
of count ~ 100 million. The data size is 20GB and groupBy results
Resurfacing the thread. OOM shouldn't be the norm for a common groupBy /
sort use case in a framework that is leading in sorting benchmarks, should
it? Or is there something fundamentally wrong in the usage?
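On the usage question: when the per-key result can be combined incrementally, a reduceByKey-style aggregation keeps each value small, whereas groupBy materializes every element under a key at once. A plain-Java, Spark-free sketch of that difference (reduceByKey here is a local stand-in for the Spark operator, not its actual implementation):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class ReduceNotGroup {
    // Fold each key's values into one running result instead of
    // collecting them all into a per-key Iterable.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs,
                                        BinaryOperator<V> combine) {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            out.merge(p.getKey(), p.getValue(), combine);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
            new SimpleEntry<>("a", 1),
            new SimpleEntry<>("b", 2),
            new SimpleEntry<>("a", 3));
        // Sums per key; no key ever holds more than one accumulated value
        System.out.println(reduceByKey(pairs, Integer::sum));
    }
}
```

This only applies when the downstream logic is an aggregation; if the full per-key collection is genuinely needed, the chunked-output approach discussed earlier in the thread is the alternative.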
On 02-Nov-2014 1:06 am, Bharath Ravi Kumar reachb...@gmail.com wrote:
Hi,
I'm trying to run
Hi,
FYI as follows. Could you post your heap size settings as well as your
Spark app code?
Regards
Arthur
3.1.3 Detail Message: Requested array size exceeds VM limit
The detail message Requested array size exceeds VM limit indicates that the
application (or APIs used by that application) attempted to allocate an
array that is larger than the heap size.