Re: [pyspark 2.3+] Bucketing with sort - incremental data load?
Bucketing will only help you with joins, and these usually happen on a key. You mentioned that there is no such key in your data. If you just want to search through large quantities of data, then sorting and partitioning by time is what is left.

Rishi Shah wrote on Sat, 1 Jun 2019 at 05:57:
Re: [pyspark 2.3+] Bucketing with sort - incremental data load?
Thanks much for your input Gourav, Silvio.

I have about 10TB of data, which gets stored daily. There's no qualifying column for partitioning, which makes querying this table super slow. So I wanted to sort the results before storing them daily. This is why I was thinking of using bucketing and sorting ... Do you think sorting data based on a column or two before saving would help query performance on this table?

My concern is that data will be sorted on a daily basis and not globally. Would that help with performance? I can also compact files every month and sort before saving. I'm just not sure if this is going to help with the performance issues on this table.

Would be great to get your advice on this.

--
Regards,

Rishi Shah

On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <silvio.fior...@granturing.com> wrote:
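On the per-day versus global sorting concern: columnar formats like Parquet keep min/max statistics per file and per row group, and readers skip ranges that cannot match a filter, so sorting each day's output by the lookup column still narrows scans within that day even without a global sort. A toy, Spark-free illustration of that effect (the file layout here is invented):

```python
import random

random.seed(0)

def file_stats(values, num_files=10, sort_first=False):
    """Split values into num_files chunks and record each chunk's (min, max),
    mimicking the per-file statistics a columnar format keeps."""
    if sort_first:
        values = sorted(values)
    size = len(values) // num_files
    chunks = [values[i * size:(i + 1) * size] for i in range(num_files)]
    return [(min(c), max(c)) for c in chunks]

def files_scanned(stats, needle):
    # A reader only opens files whose [min, max] range can contain the value.
    return sum(1 for lo, hi in stats if lo <= needle <= hi)

values = [random.randrange(1_000_000) for _ in range(10_000)]
needle = sorted(values)[len(values) // 2]   # some value we later look up

unsorted_scan = files_scanned(file_stats(values), needle)
sorted_scan = files_scanned(file_stats(values, sort_first=True), needle)

# With unsorted data, nearly every file's range covers the needle; with
# sorted data, only the file(s) whose range actually contains it are read.
print(unsorted_scan, sorted_scan)
```

The same logic applies per day: within each daily batch, a point or range filter on the sort column touches only a few files instead of all of them.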
Re: [pyspark 2.3+] Bucketing with sort - incremental data load?
Spark does allow appending new files to bucketed tables. When the data is read in, Spark will combine the multiple files belonging to the same buckets into the same partitions.

Having said that, you need to be very careful with bucketing, especially as you're appending, to avoid generating lots of small files. So, you may need to consider periodically running a compaction job.

If you're simply appending daily snapshots, then you could just consider using date partitions instead?

From: Rishi Shah
Date: Thursday, May 30, 2019 at 10:43 PM
To: "user @spark"
Subject: [pyspark 2.3+] Bucketing with sort - incremental data load?
Re: [pyspark 2.3+] Bucketing with sort - incremental data load?
Hi Rishi,

I think that if you are sorting and then appending data locally, there will be no need to bucket the data, and you are good with external tables that way.

Regards,
Gourav

On Fri, May 31, 2019 at 3:43 AM Rishi Shah wrote:
[pyspark 2.3+] Bucketing with sort - incremental data load?
Hi All,

Can we use bucketing with sort functionality to save data incrementally (say daily)? I understand bucketing is supported in Spark only with saveAsTable; however, can this be used with mode "append" instead of "overwrite"?

My understanding around bucketing was that you need to rewrite the entire table every time. Can someone help advise?

--
Regards,

Rishi Shah
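On the rewrite question: a row's bucket is fixed by hashing the bucket column modulo the bucket count, so each incremental batch writes files for the same fixed set of bucket ids, and a reader can line new files up with old ones without any rewrite. A toy Python illustration (Spark actually buckets with a Murmur3-based hash; `crc32` here is just a deterministic stand-in):

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Stand-in for Spark's Murmur3-based bucketing: deterministic hash mod N.
    return crc32(key.encode()) % NUM_BUCKETS

# bucket id -> list of (batch, key) rows, mimicking per-bucket files on disk.
table = defaultdict(list)

def append_batch(batch_id, keys):
    # Appending only adds rows under existing bucket ids; nothing is rewritten.
    for k in keys:
        table[bucket_of(k)].append((batch_id, k))

append_batch("day1", ["k1", "k2", "k3"])
append_batch("day2", ["k1", "k4"])      # "k1" shows up again the next day

# Every row for "k1", across both daily loads, lives in one bucket, so a
# lookup or join on the key only has to read 1/NUM_BUCKETS of the table.
rows_for_k1 = [r for r in table[bucket_of("k1")] if r[1] == "k1"]
```

The cost of this scheme is the one Silvio notes upthread: each append leaves more files per bucket until something compacts them.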