I’m pretty sure it’s the executor:

WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 20) (10.20.167.28 executor 2): java.lang.OutOfMemoryError
        at java.base/java.io.ByteArrayOutputStream.hugeCapacity(Unknown Source)

If you look at the `toPandas` plan, you can see an Exchange stage that doesn’t occur in the `collect` example. I suppose it tries to pull all the data to a single executor for some reason.
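As a workaround I’m considering something like the sketch below: since `collect()` brings Rows straight to the driver without the extra exchange, I could build the pandas DataFrame there myself. The tuples here just stand in for the Rows a real session would return; the actual data would come from `spark.table(…).limit(…).collect()`.

```python
import pandas as pd

# In a real session these would come from Spark:
#   rows = spark.table(...).limit(1_000_000).collect()
#   cols = spark.table(...).columns
# Simulated here with plain tuples standing in for pyspark Row objects:
rows = [(1, "a"), (2, "b"), (3, "c")]
cols = ["id", "value"]

# Build the pandas DataFrame on the driver, bypassing toPandas() entirely.
pdf = pd.DataFrame(rows, columns=cols)
print(pdf.shape)  # (3, 2)
```

This keeps all the heavy lifting on the driver (which has 60G), though it loses whatever type conversions `toPandas()` would normally apply.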
> On Nov 3, 2021, at 21:34, Sean Owen <sro...@gmail.com> wrote:
>
> No, you can collect() a DataFrame. You get Rows. collect() must create an
> in-memory representation - how else could it work? Those aren't differences.
> Are you sure it's the executor that's OOM, not the driver?
>
> On Wed, Nov 3, 2021 at 1:32 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> Hi,
>
> I might be wrong, but toPandas() works with DataFrames, whereas collect() works
> at the RDD level. Also, toPandas() converts to Python objects in memory; I do
> not think that collect() does that.
>
> Regards,
> Gourav
>
> On Wed, Nov 3, 2021 at 2:24 PM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:
> Hi,
>
> Spark 3.1.2, K8s.
>
> I encountered an OOM error while trying to create a Pandas DataFrame from a
> Spark DataFrame. My Spark driver has 60G of RAM, but the executors are tiny
> compared to that (8G).
>
> If I do `spark.table(…).limit(1000000).collect()` I get the following plan:
>
> <2021-11-03_17-07-55.png>
>
> If I do `spark.table(…).limit(1000000).toPandas()` I get a more complicated
> plan with an extra shuffle:
>
> <2021-11-03_17-08-31.png>
>
> IIUC, in the `toPandas` case all the data gets shuffled to a single executor
> that fails with OOM, which doesn’t happen in the `collect` case. Why does it
> work like that? How do I collect a large dataset that fits into the memory of
> the driver?
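One thing worth trying (this is speculation on my part, not a confirmed fix for this plan shape): Spark 3.x has an Arrow-based path for `toPandas()`, controlled by a config documented for 3.1. A minimal sketch, assuming a live `spark` session:

```python
# Assumption: the Arrow-based conversion path may transfer data differently
# than the default toPandas() path. Config name is from the Spark 3.x docs.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Same call as before; behavior with limit() + Arrow would need to be
# verified against the actual query plan in the UI.
pdf = spark.table(…).limit(1000000).toPandas()
```

Whether the extra Exchange still appears with Arrow enabled would have to be checked in the SQL tab of the UI.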