Spark does allow appending new files to bucketed tables. At read time, Spark
combines the multiple files that belong to the same bucket into a single
partition.

Having said that, you need to be careful when appending to a bucketed table:
each append writes a new set of files per bucket, so over time you can end up
with lots of small files. You may need to run a periodic compaction job.

If you’re simply appending daily snapshots, you could consider using date
partitions instead.

From: Rishi Shah <rishishah.s...@gmail.com>
Date: Thursday, May 30, 2019 at 10:43 PM
To: "user @spark" <user@spark.apache.org>
Subject: [pyspark 2.3+] Bucketing with sort - incremental data load?

Hi All,

Can we use bucketing with sorting functionality to save data incrementally
(say daily)? I understand bucketing is supported in Spark only with
saveAsTable; however, can this be used with mode "append" instead of
"overwrite"?

My understanding was that with bucketing you need to rewrite the entire table
every time. Can someone advise?

--
Regards,

Rishi Shah
