Yes, you'll generally get one partition per block, and one task per partition.
The amount of RAM isn't directly relevant; the data isn't loaded into memory
all at once. But you may nevertheless see some improvement with larger
partitions / tasks, though typically only if your tasks are very small and
very fast right now (completing in a few seconds).
You can use minSplitSize to encourage some RDD APIs to choose larger
partitions, but not in the DF API.
Instead, you can try coalescing to a smaller number of partitions without a
shuffle (a shuffle would probably negate any benefit).

However, what I see here is different still -- you have serious data skew
because you partitioned by date, and I suspect some dates have lots of data
while others have almost none.
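If you want to confirm and work around that, a sketch along these lines
(only the file_date column and PATH are from your example; the salting trick
via rand() is just one common workaround, not the only option):

    from pyspark.sql import functions as F

    # see how unevenly the rows are spread across dates
    df.groupBy("file_date").count().orderBy(F.desc("count")).show()

    # repartitioning on a single low-cardinality column hashes each date into
    # one partition (and two dates can even collide into the same partition,
    # leaving other tasks idle); adding a random key spreads each date across
    # several tasks, at the cost of more than one output file per date
    df.repartition("file_date", F.rand()) \
        .write.partitionBy("file_date").parquet(PATH)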


On Fri, Jun 19, 2020 at 12:17 AM Rishi Shah <rishishah.s...@gmail.com>
wrote:

> Hi All,
>
> I have about 10 TB of Parquet data on S3, where the data files have 128 MB
> blocks. Spark will by default pick up one block per task, even though
> every task within an executor has at least 1.8 GB of memory. Isn't that
> wasteful? Is there any way to speed up this processing? Is there a way to
> force tasks to pick up more files that sum up to a certain block size, or
> will Spark always use one block per task? Basically, is there an override
> to make sure Spark tasks read larger block(s)?
>
> Also, as seen in the attached image: while writing 4 files (partitionBy
> file_date), one file per partition, 4 threads are active, but two of them
> seem to be doing nothing and the other 2 have taken over the writing for
> all 4 files. Shouldn't the 4 tasks pick up one file each?
>
> For this example, assume df has 4 file_dates' worth of data.
>
> df.repartition('file_date').write.partitionBy('file_date').parquet(PATH)
>
> [Attachment: Screen Shot 2020-06-18 at 2.01.53 PM.png (126K)]
>
> Any suggestions/feedback helps, appreciate it!
> --
> Regards,
>
> Rishi Shah
>
