I’m pretty sure it’s the executor. Here’s the error from the executor logs:

WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 20) (10.20.167.28 executor 2): java.lang.OutOfMemoryError
        at java.base/java.io.ByteArrayOutputStream.hugeCapacity(Unknown Source)
If you look at the `toPandas` plan, you can see the Exchange stage that doesn’t 
occur in the `collect` example. I suppose it tries to pull all the data to a 
single executor for some reason.
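
For reference, a minimal workaround sketch, assuming the limited result actually 
fits in driver memory ("my_table" below is a placeholder, since the real table 
name is elided above). As I understand it, `collect()` on a limited DataFrame 
uses the driver-side take path rather than a shuffle, so the pandas DataFrame 
can be built locally instead of via `toPandas()`:

    import pandas as pd

    # Hedged workaround sketch, not a confirmed fix: fetch the limited rows
    # to the driver via collect(), then build the pandas DataFrame there.
    # "my_table" is a placeholder; the original table name was elided.
    rows = spark.table("my_table").limit(1000000).collect()  # list of Row objects
    pdf = pd.DataFrame([row.asDict() for row in rows])       # assembled on the driver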

> On 3 Nov 2021, at 21:34, Sean Owen <sro...@gmail.com> wrote:
> 
> No, you can collect() a DataFrame. You get Rows. collect() must create an 
> in-memory representation - how else could it work? Those aren't differences.
> Are you sure it's the executor that's OOM, not the driver? 
> 
> On Wed, Nov 3, 2021 at 1:32 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> Hi,
> 
> I might be wrong, but toPandas() works with DataFrames, whereas collect() works 
> at the RDD level. Also, toPandas() converts to Python objects in memory; I do 
> not think that collect() does that.
> 
> Regards,
> Gourav
> 
> On Wed, Nov 3, 2021 at 2:24 PM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:
> Hi, 
> 
> Spark 3.1.2 on K8s.
> 
> I encountered an OOM error while trying to create a pandas DataFrame from a 
> Spark DataFrame. My Spark driver has 60G of RAM, but the executors are tiny 
> compared to that (8G).
> 
> If I do `spark.table(…).limit(1000000).collect()` I get the following plan
> 
> <2021-11-03_17-07-55.png> (attached screenshot of the `collect` query plan)
> 
> 
> If I do `spark.table(…).limit(1000000).toPandas()` I get a more complicated 
> plan with an extra shuffle
> 
> <2021-11-03_17-08-31.png> (attached screenshot of the `toPandas` query plan)
> 
> IIUC, in the `toPandas` case all the data gets shuffled to a single executor 
> that fails with OOM, which doesn’t happen in the `collect` case. Does it really 
> work like that? How do I collect a large dataset that fits into the driver’s 
> memory?
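
A quick check of the collect() point quoted above, as a self-contained PySpark 
snippet (the tiny range DataFrame is just for illustration): collect() does work 
on a DataFrame directly and returns Row objects on the driver.

    # Minimal demonstration that collect() applies to DataFrames and
    # yields pyspark.sql.Row objects; fields are accessible by name.
    df = spark.range(3).toDF("id")
    rows = df.collect()
    print(rows)        # [Row(id=0), Row(id=1), Row(id=2)]
    print(rows[0].id)  # 0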
