Can you be more specific about numbers?
I am not sure that splitting helps much in the end, since it has the
same effect as executing, a smaller number at a time, the large number
of tasks that the full Cartesian join would generate. The full join is
probably intractable no matter what.
The split is something like 30 million records into 2 million partitions. The reason
it becomes tractable is that after I perform the Cartesian on the
split data and operate on it, I don't keep the full results - I actually
only keep a tiny fraction of that generated dataset - making the overall
computation tractable.
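The approach described above can be sketched in plain Python (not Spark) to show the idea: cross one chunk at a time against the other side, filter immediately, and keep only the surviving pairs so the full product is never materialized. The chunk size and the `keep` predicate here are illustrative stand-ins, not the poster's actual code.

```python
from itertools import islice, product

def chunked(iterable, size):
    """Yield successive fixed-size chunks from an iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def filtered_cartesian(left, right, keep, chunk_size=1000):
    """Cross `left` x `right` one chunk at a time, retaining only
    pairs that pass `keep`; at most one chunk's worth of the product
    is in flight at any moment."""
    results = []
    right = list(right)  # reused for every chunk
    for chunk in chunked(left, chunk_size):
        for pair in product(chunk, right):
            if keep(pair):
                results.append(pair)
    return results

# Toy run: of 100 x 100 = 10,000 candidate pairs, keep only the 11
# whose elements sum to 10.
pairs = filtered_cartesian(range(100), range(100),
                           lambda p: p[0] + p[1] == 10, chunk_size=16)
print(len(pairs))  # 11
```

In Spark the same shape would be a `cartesian` (or a broadcast of the smaller side) followed immediately by a `filter`, so only the tiny surviving fraction is ever shuffled or persisted.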
"GC overhead limit exceeded" is usually a sign of either an inadequate heap size
(not the case here) or the application producing garbage (temporary objects)
faster than the garbage collector can reclaim it, so GC consumes most CPU cycles. 17G of
Java heap is quite large for many applications and is above what is generally
considered safe and efficient for a single JVM.
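To confirm which of the two causes is at play, a first step is to turn on GC logging for the executors. A sketch of the relevant `spark-submit` configuration (the memory value and flags here are illustrative, not a recommendation for this workload):

```shell
# Enable verbose GC logging on each executor JVM so you can see
# whether collections are frequent and how much is reclaimed each time.
spark-submit \
  --executor-memory 17g \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your_app.jar
```

If the logs show back-to-back full collections that each recover little memory, the problem is allocation rate (temporary objects from the huge number of generated pairs) rather than heap size.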
Hey all – not writing to necessarily get a fix but more to get an understanding
of what’s going on internally here.
I wish to take a cross-product of two very large RDDs (using cartesian), the
product of which is well in excess of what can be stored on disk. Clearly that
is intractable, thus I split the data and process the Cartesian in pieces.
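A back-of-the-envelope calculation shows why the full product cannot be stored. Assuming (hypothetically) two sides of roughly 30 million records each, as the split figure earlier in the thread suggests, and an illustrative ~100 bytes per emitted pair:

```python
# Rough size of the full Cartesian product.
n_left = 30_000_000        # assumed size of one RDD
n_right = 30_000_000       # assumed size of the other
bytes_per_pair = 100       # illustrative guess

total_pairs = n_left * n_right
total_bytes = total_pairs * bytes_per_pair
print(total_pairs)             # 900,000,000,000,000 pairs (9e14)
print(total_bytes / 1e15)      # ~90 petabytes
```

At 9x10^14 pairs, even a few bytes per pair is orders of magnitude beyond any disk, which is why only a filtered fraction of the product can ever be kept.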