Hi all,

On a standalone cluster with more than 80 GB of RAM in each node, I am trying 
to:

1.      load a partitioned JSON dataframe of around 100 GB as input
2.      apply transformations such as casting some column types
3.      get some percentiles, which involves sort by, RDD transformations and 
lookup actions
4.      flatten many unordered rows with the HiveContext df explode function
5.      compute a histogram for each column of the dataframe
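
To make steps 3 and 5 concrete, here is a minimal single-machine sketch in 
plain Python of what gets computed per column. It is not the actual Spark job: 
the function names are made up for illustration, and the nearest-rank 
percentile and equal-width buckets are assumptions about the exact definitions.

```python
import math


def percentile_by_sort(values, q):
    """Nearest-rank percentile: sort the values, then look up one rank."""
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so that q = 0 returns the minimum.
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def histogram(values, bins):
    """Equal-width histogram; returns (bucket_edges, counts)."""
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are equal (hi == lo).
    width = (hi - lo) / bins or 1.0
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The last bucket is closed on the right so the maximum is counted.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return edges, counts


def describe_columns(columns, bins=4):
    """Apply both steps to every column of a {name: [values]} mapping."""
    return {
        name: {
            "p50": percentile_by_sort(vals, 50),
            "hist": histogram(vals, bins)[1],
        }
        for name, vals in columns.items()
    }
```

In the real job each of these per-column steps runs as Spark stages over the 
exploded dataframe, and it is the histogram part that never finishes.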

I have tried many strategies, from threaded per-column transformations to a 
for loop over the columns. I tried to move as many common transformations out 
of this loop as possible, and to persist the original dataframe at different 
points. I also tried FIFO vs. FAIR scheduling, speculation, etc.

With the dataframe limited to 50k rows, the operation is successful in 
local[32] in a fairly short amount of time, e.g. 30 minutes for all columns, of 
which around 14 minutes is just loading the data into memory and doing a count. 
But to go any further, I need to use cluster mode.

The first transformations and the percentile calculations are very fast, 
faster than in local mode. But computing the histogram on the final dataframe 
gets stuck; it appears to be stuck in garbage collection and never completes. 
With the 50k-row limited dataframe in cluster mode, exactly the same happens, 
where local mode succeeded. The task seems to hang at some random point, but 
always in the same stage, at the very end of the histogram computation.

I have also tried repartitioning up to 1024 partitions and coalescing down to 
1 partition, with no difference in the result. The process always hangs in 
cluster mode. Local mode cannot handle the full-size operation and ends up 
stuck somewhere as well. I have enabled the parallel garbage collector too.

How can I proceed and diagnose this?
Any help is appreciated.
Saif
