Thanks very much for your input, Gourav and Silvio.

I have about 10TB of data, with new data getting stored daily. There's no
qualifying column for partitioning, which makes querying this table very
slow. So I wanted to sort the results before storing them each day; this is
why I was thinking of using bucketing with sorting. Do you think sorting the
data by a column or two before saving would help query performance on this
table?
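
Something like this is what I had in mind (just a sketch; the paths and the
columns "account_id" and "event_ts" below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical daily snapshot.
    daily_df = spark.read.parquet("/data/staging/2019-05-31")

    # Sorting within each output file lets Parquet row-group min/max
    # statistics on the sort columns skip data at query time.
    (daily_df
        .repartition(200, "account_id")
        .sortWithinPartitions("account_id", "event_ts")
        .write
        .mode("append")
        .parquet("/data/events"))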

My concern is that the data would only be sorted within each daily load, not
globally. Would that still help with performance? I could also compact the
files every month and sort before saving; I'm just not sure whether this is
going to help with the performance issues on this table.
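
For the monthly compaction, I was picturing something roughly like this
(again just a sketch; the paths and column names are made up):

    # Read everything written so far, re-sort, and rewrite as fewer,
    # larger files, writing to a new location rather than in place.
    full_df = spark.read.parquet("/data/events")

    (full_df
        .repartition(400, "account_id")
        .sortWithinPartitions("account_id", "event_ts")
        .write
        .mode("overwrite")
        .parquet("/data/events_compacted"))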

It would be great to get your advice on this.

On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Spark does allow appending new files to bucketed tables. When the data is
> read in, Spark will combine the multiple files belonging to the same
> buckets into the same partitions.
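>
> For example, once a table is bucketed on the join or aggregation key, reads
> can avoid a shuffle. A rough sketch (the table name "events" and the column
> "account_id" are just examples):
>
>     # "events" is assumed to be a table bucketed by "account_id".
>     events = spark.table("events")
>
>     # Grouping on the bucket column should avoid an Exchange in the
>     # physical plan; check with explain() to confirm.
>     events.groupBy("account_id").count().explain()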
>
> Having said that, you need to be very careful with bucketing, especially as
> you’re appending, to avoid generating lots of small files. So you may need
> to consider periodically running a compaction job.
>
> If you’re simply appending daily snapshots, then you could just consider
> using date partitions instead.
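>
> Something along these lines (a sketch; the path and the "load_date" column
> are placeholders, and daily_df is the day’s snapshot DataFrame):
>
>     # Partitioning the output by snapshot date means queries that
>     # filter on load_date only scan the matching directories.
>     (daily_df
>         .write
>         .mode("append")
>         .partitionBy("load_date")
>         .parquet("/data/events_by_date"))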
>
> From: Rishi Shah <rishishah.s...@gmail.com>
> Date: Thursday, May 30, 2019 at 10:43 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: [pyspark 2.3+] Bucketing with sort - incremental data load?
>
> Hi All,
>
> Can we use bucketing with sorting to save data incrementally (say, daily)?
> I understand bucketing is supported in Spark only with saveAsTable; however,
> can this be used with mode "append" instead of "overwrite"?
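>
> In other words, something like this (a sketch; the table name, bucket count,
> and columns are hypothetical):
>
>     # Bucketing + sorting is only supported through saveAsTable.
>     (daily_df
>         .write
>         .format("parquet")
>         .bucketBy(64, "account_id")
>         .sortBy("account_id", "event_ts")
>         .mode("append")
>         .saveAsTable("events_bucketed"))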
>
> My understanding of bucketing was that you need to rewrite the entire table
> every time. Can someone please advise?
>
> --
>
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah
