Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Sean Owen
Can you be more specific about numbers? I am not sure that splitting helps so much in the end, in that it has the same effect as executing, a smaller number at a time, the large number of tasks that the full cartesian join would generate. The full join is probably intractable no matter what in

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Ilya Ganelin
The split is something like 30 million into 2 million partitions. The reason that it becomes tractable is that after I perform the Cartesian on the split data and operate on it I don't keep the full results - I actually only keep a tiny fraction of that generated dataset - making the overall
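The approach Ilya describes - crossing one chunk at a time and immediately discarding all but a tiny fraction of each chunk's results - can be sketched outside Spark. This is a minimal Python illustration of the principle only, not Spark code; the function name and `keep` predicate are hypothetical:

```python
from itertools import product

def chunked_cartesian_filter(left, right, keep, n_chunks):
    """Cross `left` with `right` one chunk of `left` at a time,
    keeping only pairs that pass `keep`. The full cartesian
    product is never materialized at once, so peak memory is
    bounded by one chunk's product plus the (small) kept results."""
    results = []
    chunk_size = max(1, len(left) // n_chunks)
    for i in range(0, len(left), chunk_size):
        chunk = left[i:i + chunk_size]
        # Only this chunk's cross-product exists at any moment.
        results.extend(p for p in product(chunk, right) if keep(*p))
    return results
```

The same filtering could equally be fused into each per-split `cartesian` in Spark, so that only the kept fraction is ever shuffled or persisted.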

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Vladimir Rodionov
GC limit overhead exceeded is usually a sign of either an inadequate heap size (not the case here) or an application producing garbage (temp objects) faster than the garbage collector collects them - GC consumes most CPU cycles. 17G of Java heap is quite large for many applications and is above safe and

GC Issues with randomSplit on large dataset

2014-10-29 Thread Ganelin, Ilya
Hey all – not writing to necessarily get a fix but more to get an understanding of what’s going on internally here. I wish to take a cross-product of two very large RDDs (using cartesian), the product of which is well in excess of what can be stored on disk. Clearly that is intractable, thus
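For readers unfamiliar with the splitting step being discussed: Spark's `RDD.randomSplit(weights)` partitions an RDD into disjoint random subsets with sizes roughly proportional to the weights. Below is a minimal Python sketch of those semantics on a plain list - an illustration of the idea only, not Spark's actual implementation:

```python
import random

def random_split(data, weights, seed=None):
    """Randomly partition `data` into len(weights) disjoint subsets,
    each element landing in subset i with probability
    weights[i] / sum(weights) - mimicking the semantics of Spark's
    RDD.randomSplit on a local list."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative probability thresholds over [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for x in data:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r < b:
                splits[i].append(x)
                break
    return splits
```

Each split can then be crossed against the other side separately, so only one split's cartesian product is in flight at a time.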