Bucketing will only help you with joins, and those usually happen on a key.
You mentioned that there is no such key in your data. If you just want to
search through large quantities of data, then sorting and partitioning by
time is what's left.
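
Roughly, something like this (a minimal PySpark sketch; the column and path
names are made up, assuming each daily batch carries an event timestamp the
queries filter on):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical daily batch; "event_ts" stands in for whatever timestamp
# column the queries actually filter on.
daily = spark.read.parquet("/data/incoming/2019-06-01")

(daily
 .withColumn("load_date", F.lit("2019-06-01"))  # partition value for this run
 .sortWithinPartitions("event_ts")              # each output file sorted by time
 .write
 .mode("append")
 .partitionBy("load_date")                      # one directory per day
 .parquet("/data/events"))

# Reads that filter on the partition column (and, via Parquet min/max
# statistics, on the sorted timestamp) can skip most of the data:
recent = (spark.read.parquet("/data/events")
          .where("load_date >= '2019-05-01'")
          .where(F.col("event_ts") >= "2019-05-25"))

A global sort is not required for this; per-file sorting is enough for the
Parquet reader to skip row groups based on min/max statistics.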

Rishi Shah <rishishah.s...@gmail.com> wrote on Sat, 1 Jun 2019 at 05:57:

> Thanks much for your input Gourav, Silvio.
>
> I have about 10TB of data, which gets stored daily. There's no qualifying
> column for partitioning, which makes querying this table super slow. So I
> wanted to sort the results before storing them daily. This is why I was
> thinking of using bucketing and sorting ... Do you think sorting data based
> on a column or two before saving would help query performance on this
> table?
>
> My concern is that data will be sorted on a daily basis, not globally. Would
> that help with performance? I can compact files every month as well and
> sort before saving. I'm just not sure if this is going to help with
> performance issues on this table.
>
> Would be great to get your advice on this.
>
> On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Spark does allow appending new files to bucketed tables. When the data is
>> read in, Spark will combine the multiple files belonging to the same
>> buckets into the same partitions.
>>
>>
>>
>> Having said that, you need to be very careful with bucketing, especially
>> as you’re appending, to avoid generating lots of small files. So you may
>> need to consider periodically running a compaction job.
>>
>>
>>
>> If you’re simply appending daily snapshots, could you just use date
>> partitions instead?
>>
>>
>>
>> *From: *Rishi Shah <rishishah.s...@gmail.com>
>> *Date: *Thursday, May 30, 2019 at 10:43 PM
>> *To: *"user @spark" <user@spark.apache.org>
>> *Subject: *[pyspark 2.3+] Bucketing with sort - incremental data load?
>>
>>
>>
>> Hi All,
>>
>>
>>
>> Can we use bucketing with sorting functionality to save data
>> incrementally (say daily) ? I understand bucketing is supported in Spark
>> only with saveAsTable, however can this be used with mode "append" instead
>> of "overwrite"?
>>
>>
>>
>> My understanding around bucketing was that you need to rewrite the entire
>> table every time. Can someone advise?
>>
>>
>>
>> --
>>
>> Regards,
>>
>>
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>
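
For reference, a rough sketch of the append-to-bucketed-table pattern Silvio
describes, in case you do end up with a join key later (table and column
names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

daily = spark.read.parquet("/data/incoming/2019-06-01")

# Only worthwhile if queries join or aggregate on the bucket column; each
# append writes one file per bucket per task, hence the periodic compaction
# Silvio mentions.
(daily.write
 .mode("append")
 .bucketBy(64, "customer_id")    # hypothetical join key
 .sortBy("customer_id")
 .format("parquet")
 .saveAsTable("events_bucketed"))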
