Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Georg Heiler
Bucketing will only help you with joins, and these usually happen on a key.
You mentioned that there is no such key in your data. If you just want to
search through large quantities of data, sorting and partitioning by time
is what's left.
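
For example, a minimal sketch of that approach (the path, the event_ts
column, and the staging table below are made-up placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("staging_events")  # hypothetical daily batch

(df
 .withColumn("event_date", F.to_date("event_ts"))  # derive a date partition column
 .repartition("event_date")             # group each date into its own write tasks
 .sortWithinPartitions("event_ts")      # sorted data within each date's files
 .write
 .mode("append")
 .partitionBy("event_date")             # directory-level partitioning by time
 .parquet("/data/events"))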

Rishi Shah  wrote on Sat, Jun 1, 2019 at 05:57:

> Thanks much for your input Gourav, Silvio.
>
> I have about 10TB of data, which gets stored daily. There's no qualifying
> column for partitioning, which makes querying this table super slow. So I
> wanted to sort the results before storing them daily. This is why I was
> thinking of using bucketing and sorting ... Do you think sorting the data
> based on a column or two before saving would help query performance on
> this table?
>
> My concern is that the data will be sorted on a daily basis and not
> globally. Would that help with performance? I can also compact files every
> month and sort before saving. I'm just not sure if this is going to help
> with the performance issues on this table.
>
> Would be great to get your advice on this.
>
> On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Spark does allow appending new files to bucketed tables. When the data is
>> read in, Spark will combine the multiple files belonging to the same
>> buckets into the same partitions.
>>
>> Having said that, you need to be very careful with bucketing, especially
>> as you're appending, to avoid generating lots of small files. So, you may
>> need to consider periodically running a compaction job.
>>
>> If you're simply appending daily snapshots, then you could just consider
>> using date partitions instead?
>>
>> *From: *Rishi Shah 
>> *Date: *Thursday, May 30, 2019 at 10:43 PM
>> *To: *"user @spark" 
>> *Subject: *[pyspark 2.3+] Bucketing with sort - incremental data load?
>>
>> Hi All,
>>
>> Can we use bucketing with sorting functionality to save data incrementally
>> (say, daily)? I understand bucketing is supported in Spark only with
>> saveAsTable; however, can this be used with mode "append" instead of
>> "overwrite"?
>>
>> My understanding was that with bucketing you need to rewrite the entire
>> table every time. Can someone advise?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>
>
> --
> Regards,
>
> Rishi Shah
>


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Rishi Shah
Thanks much for your input Gourav, Silvio.

I have about 10TB of data, which gets stored daily. There's no qualifying
column for partitioning, which makes querying this table super slow. So I
wanted to sort the results before storing them daily. This is why I was
thinking of using bucketing and sorting ... Do you think sorting the data
based on a column or two before saving would help query performance on this
table?

My concern is that the data will be sorted on a daily basis and not
globally. Would that help with performance? I can also compact files every
month and sort before saving. I'm just not sure if this is going to help
with the performance issues on this table.
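
For concreteness, the monthly compaction I have in mind would look roughly
like this (the path, columns, and partition count are all made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read one month of daily, locally-sorted output.
month = (spark.read.parquet("/data/events")
         .where(F.col("event_date").between("2019-05-01", "2019-05-31")))

# Rewrite as fewer, larger files, sorted on the likely query columns.
(month
 .repartition(64, "customer_id")                    # cluster the filter/join key
 .sortWithinPartitions("customer_id", "event_ts")
 .write
 .mode("overwrite")
 .parquet("/data/events_compacted/2019-05"))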

Would be great to get your advice on this.

On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Spark does allow appending new files to bucketed tables. When the data is
> read in, Spark will combine the multiple files belonging to the same
> buckets into the same partitions.
>
> Having said that, you need to be very careful with bucketing, especially
> as you're appending, to avoid generating lots of small files. So, you may
> need to consider periodically running a compaction job.
>
> If you're simply appending daily snapshots, then you could just consider
> using date partitions instead?
>
> *From: *Rishi Shah 
> *Date: *Thursday, May 30, 2019 at 10:43 PM
> *To: *"user @spark" 
> *Subject: *[pyspark 2.3+] Bucketing with sort - incremental data load?
>
> Hi All,
>
> Can we use bucketing with sorting functionality to save data incrementally
> (say, daily)? I understand bucketing is supported in Spark only with
> saveAsTable; however, can this be used with mode "append" instead of
> "overwrite"?
>
> My understanding was that with bucketing you need to rewrite the entire
> table every time. Can someone advise?
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Silvio Fiorito
Spark does allow appending new files to bucketed tables. When the data is read 
in, Spark will combine the multiple files belonging to the same buckets into 
the same partitions.
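
As a rough sketch, assuming a Hive-enabled session (the bucket count, table,
and column names are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("staging_events")  # hypothetical daily batch

# Each append writes its own set of bucket files, which Spark merges
# by bucket id at read time.
(df.write
   .mode("append")
   .bucketBy(32, "customer_id")
   .sortBy("event_ts")
   .saveAsTable("events_bucketed"))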

Having said that, you need to be very careful with bucketing, especially as
you're appending, to avoid generating lots of small files. So, you may need
to consider periodically running a compaction job.

If you’re simply appending daily snapshots, then you could just consider
using date partitions instead?
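
That is, something along these lines, reusing the hypothetical df from the
sketch above (the path and column name are again placeholders):

from pyspark.sql import functions as F

(df.withColumn("snapshot_date", F.current_date())  # the daily partition key
   .write
   .mode("append")
   .partitionBy("snapshot_date")
   .parquet("/data/daily_snapshots"))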

From: Rishi Shah 
Date: Thursday, May 30, 2019 at 10:43 PM
To: "user @spark" 
Subject: [pyspark 2.3+] Bucketing with sort - incremental data load?

Hi All,

Can we use bucketing with sorting functionality to save data incrementally
(say, daily)? I understand bucketing is supported in Spark only with
saveAsTable; however, can this be used with mode "append" instead of
"overwrite"?

My understanding was that with bucketing you need to rewrite the entire
table every time. Can someone advise?

--
Regards,

Rishi Shah


Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Gourav Sengupta
Hi Rishi,

I think that if you are sorting the data and then appending it (so each
batch is locally sorted), there will be no need to bucket the data, and you
are good with external tables that way.
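
A rough sketch of what I mean (the table, columns, and location are all
hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over the sorted, date-partitioned files written each day.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events_ext (
    customer_id STRING,
    event_ts TIMESTAMP
  )
  PARTITIONED BY (event_date DATE)
  STORED AS PARQUET
  LOCATION '/data/events'
""")

# Register the partitions that new daily appends have added.
spark.sql("MSCK REPAIR TABLE events_ext")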

Regards,
Gourav

On Fri, May 31, 2019 at 3:43 AM Rishi Shah  wrote:

> Hi All,
>
> Can we use bucketing with sorting functionality to save data incrementally
> (say, daily)? I understand bucketing is supported in Spark only with
> saveAsTable; however, can this be used with mode "append" instead of
> "overwrite"?
>
> My understanding was that with bucketing you need to rewrite the entire
> table every time. Can someone advise?
>
> --
> Regards,
>
> Rishi Shah
>


[pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-30 Thread Rishi Shah
Hi All,

Can we use bucketing with sorting functionality to save data incrementally
(say, daily)? I understand bucketing is supported in Spark only with
saveAsTable; however, can this be used with mode "append" instead of
"overwrite"?

My understanding was that with bucketing you need to rewrite the entire
table every time. Can someone advise?

-- 
Regards,

Rishi Shah