Thanks a lot Vinoth for your suggestion. I will look into it.

On Thu, 4 Jun 2020 at 10:15 AM, Vinoth Chandar <[email protected]> wrote:
> This is a good conversation. The ask for support of bucketed tables has not actually come up much, since if you are looking up things at that granularity, it almost feels like you are doing OLTP/database-like queries?
>
> Assuming you hash the primary key into a hash that denotes the partition, then a simple workaround is to always add a where clause using a UDF in Presto, i.e. where key = 123 and partition = hash_udf(123).
>
> But of course the downside is that your ops team needs to remember to add the second partition clause (which is not very different from querying large time-partitioned tables today).
>
> Our mid-term plan is to build out column indexes (RFC-15 has the details, if you are interested).
>
> On Wed, Jun 3, 2020 at 2:54 AM tanu dua <[email protected]> wrote:
>
> > If I need to plug in this hashing algorithm to resolve the partitions in Presto and Hive, what is the code I should look into?
> >
> > On Wed, Jun 3, 2020, 12:04 PM tanu dua <[email protected]> wrote:
> >
> > > Yes, that's also on the cards, and for developers that's OK, but we need to provide an interface for our ops people to execute queries from Presto, so I need to find out how they can calculate the hash if they fire a query on the primary key. They can fire a query including the primary key along with other fields. So that is the only problem I see with hash partitions, and to get it to work I believe I need to go deeper into the Presto Hudi plugin.
> > >
> > > On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <[email protected]> wrote:
> > >
> > >> Hi Tanu,
> > >>
> > >> If your primary key is an integer, you can add one more field as a hash of the integer and partition based on the hash field. It will add some complexity to reads and writes because the hash has to be computed prior to each read or write. Not sure whether the overhead of doing this exceeds the performance gains from having fewer partitions. I wonder why Hudi doesn't directly support hash-based partitions?
> > >>
> > >> Thanks
> > >> Jaimin
> > >>
> > >> On Wed, 3 Jun 2020 at 10:07, tanu dua <[email protected]> wrote:
> > >>
> > >> > Thanks Vinoth for the detailed explanation. Even I was thinking along the same lines and I will relook. We can reduce the 2nd and 3rd partitions, but it's very difficult to reduce the 1st partition, as that is the basic primary key of our domain model on which analysts and developers need to query almost 90% of the time, and it's an integer primary key that can't be decomposed further.
> > >> >
> > >> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <[email protected]> wrote:
> > >> >
> > >> > > Hi Tanu,
> > >> > >
> > >> > > For good query performance, it's recommended to write optimally sized files. Hudi already ensures that.
> > >> > >
> > >> > > Generally speaking, if you have too many partitions, then it also means too many files. Mostly people limit themselves to 1000s of partitions in their datasets, since queries typically crunch data based on time or a business domain (e.g. city for Uber). Partitioning too granularly - say based on user_id - is not very useful unless your queries only crunch per user. If you are using the Hive metastore, then 25M partitions mean 25M rows in your backing MySQL metastore DB as well - not very scalable.
> > >> > >
> > >> > > What I am trying to say is: even outside of Hudi, if analytics is your use case, it might be worth partitioning at a lower granularity and increasing the number of rows per parquet file.
> > >> > >
> > >> > > Thanks
> > >> > > Vinoth
> > >> > >
> > >> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > > We have a requirement to ingest 30M records into S3 backed by Hudi. I am figuring out the partition strategy and am ending up with a lot of partitions, like 25M partitions (primary partition) --> 2.5M (secondary partition) --> 2.5M (third partition), and each parquet file will have fewer than 10 rows of data.
> > >> > > >
> > >> > > > Our dataset will be ingested at once in full and then it will be incremental daily with less than 1k updates. So it is more read-heavy than write-heavy.
> > >> > > >
> > >> > > > So what is the suggestion in terms of Hudi performance - go ahead with the above partition strategy, or shall I reduce my partitions and increase the number of rows in each parquet file?
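
For anyone following along, here is a minimal Spark/Scala sketch of the hash-bucket idea discussed above (Jaimin's extra hash field at write time, plus Vinoth's "recompute the hash in the predicate" workaround at read time). The column names (id, payload), table name, S3 path, and bucket count are made-up placeholders, and the Presto-side UDF is only referenced in a comment - this is an illustration of the approach, not code from the thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HashBucketSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-hash-bucket-sketch")
      .getOrCreate()
    import spark.implicits._

    val numBuckets = 1024 // assumption: keeps partition count in the low thousands

    // Toy stand-in for the 30M-record dataset with an integer primary key "id".
    val inputDf = Seq((101, "a"), (202, "b"), (303, "c")).toDF("id", "payload")

    // Write side: add a deterministic bucket column and partition the Hudi table on it.
    val withBucket = inputDf.withColumn(
      "bucket", pmod(hash(col("id")), lit(numBuckets)))

    withBucket.write
      .format("hudi")
      .option("hoodie.table.name", "orders")                          // assumed table name
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "bucket")
      .option("hoodie.datasource.write.precombine.field", "payload")  // assumed field
      .mode("overwrite")
      .save("s3a://my-bucket/hudi/orders")                            // assumed path

    // Read side: recompute the same bucket for the literal key being looked up,
    // so only one partition is scanned. (Presto would need an equivalent UDF,
    // as suggested above: WHERE id = 123 AND bucket = hash_udf(123).)
    val key = 202
    val bucket = spark.range(1)
      .select(pmod(hash(lit(key)), lit(numBuckets)).as("b"))
      .first().getInt(0)

    spark.read.format("hudi")
      .load("s3a://my-bucket/hudi/orders")
      .where(col("id") === key && col("bucket") === bucket)
      .show()

    spark.stop()
  }
}

The key point is that the bucket expression must be byte-for-byte identical on the write and read paths; if ops users query through Presto or Hive instead of Spark, they would need an equivalent UDF registered there, which is exactly the operational downside Vinoth mentions.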
