Bucketing will only help you with joins, and those usually happen on a key. You mentioned that there is no such key in your data. If you just want to search through large quantities of data, sorting and partitioning by time is what's left.
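For illustration, a minimal sketch of that approach (the paths and column names here are hypothetical): each daily batch is written partitioned by date and sorted within each partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical daily input
df = spark.read.parquet("/data/raw/current_day")

(df
 .repartition("event_date")                        # group rows by the date column
 .sortWithinPartitions("event_ts", "customer_id")  # sort rows inside each output file
 .write
 .mode("append")
 .partitionBy("event_date")                        # directory-level time partitioning
 .parquet("/data/events"))

Partition pruning on event_date then limits a query to the relevant days, and the per-file sort improves min/max (Parquet statistics) skipping within those days.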
Rishi Shah <rishishah.s...@gmail.com> wrote on Sat., June 1, 2019 at 05:57:

> Thanks much for your input Gourav, Silvio.
>
> I have about 10TB of data, which gets stored daily. There's no qualifying
> column for partitioning, which makes querying this table super slow. So I
> wanted to sort the results before storing them daily. This is why I was
> thinking of using bucketing and sorting ... Do you think sorting data based
> on a column or two before saving would help query performance on this
> table?
>
> My concern is, data will be sorted on a daily basis and not globally. Would
> that help with performance? I can compact files every month as well and
> sort before saving. Just not sure if this is going to help with the
> performance issues on this table.
>
> Would be great to get your advice on this.
>
> On Fri, May 31, 2019 at 10:42 AM Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Spark does allow appending new files to bucketed tables. When the data is
>> read in, Spark will combine the multiple files belonging to the same
>> buckets into the same partitions.
>>
>> Having said that, you need to be very careful with bucketing, especially
>> as you're appending, to avoid generating lots of small files. So you may
>> need to consider periodically running a compaction job.
>>
>> If you're simply appending daily snapshots, then you could just consider
>> using date partitions instead?
>>
>> *From: *Rishi Shah <rishishah.s...@gmail.com>
>> *Date: *Thursday, May 30, 2019 at 10:43 PM
>> *To: *"user @spark" <user@spark.apache.org>
>> *Subject: *[pyspark 2.3+] Bucketing with sort - incremental data load?
>>
>> Hi All,
>>
>> Can we use bucketing with sorting functionality to save data
>> incrementally (say daily)? I understand bucketing is supported in Spark
>> only with saveAsTable; however, can this be used with mode "append" instead
>> of "overwrite"?
>>
>> My understanding around bucketing was that you need to rewrite the entire
>> table every time. Can someone help advise?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>
>
> --
> Regards,
>
> Rishi Shah
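For reference, a minimal sketch of the bucketed append pattern discussed in the quoted thread (the table and column names are hypothetical; bucketBy and sortBy only work together with saveAsTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# hypothetical daily input
df = spark.read.parquet("/data/raw/current_day")

# Appending a daily batch to a bucketed, sorted table.
(df.write
   .mode("append")
   .bucketBy(64, "customer_id")        # bucket count must stay fixed across appends
   .sortBy("customer_id", "event_ts")  # sort within each bucket file
   .format("parquet")
   .saveAsTable("events_bucketed"))

Each append adds new files to the buckets (up to one per writing task) rather than rewriting them, which is why the periodic compaction job mentioned above becomes important over time.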