Thanks a lot Vinoth for your suggestion. I will look into it.

On Thu, 4 Jun 2020 at 10:15 AM, Vinoth Chandar <[email protected]> wrote:

> This is a good conversation. The ask for support of bucketed tables has not
> actually come up much, since if you are looking up things at that
> granularity, it almost feels like you are doing OLTP/database-like queries?
>
> Assuming you hash the primary key into a value that denotes the partition,
> then a simple workaround is to always add a where clause using a UDF in
> Presto, i.e. where key = 123 and partition = hash_udf(123)
>
> But of course the downside is that your ops team needs to remember to add
> the second partition clause (which is not very different from querying
> large time-partitioned tables today).
>
> Our mid-term plan is to build out column indexes (RFC-15 has the details,
> if you are interested).
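
A minimal sketch of the workaround described above, in plain Python rather than as a real Presto UDF: a small helper derives the partition value from the primary key, so whoever fires the query does not have to remember the hashing scheme. The table name, column names, bucket count and hash choice are assumptions for illustration only.

    import hashlib

    NUM_BUCKETS = 1024  # assumed; must match the value used when the table was written

    def bucket_of(key: int) -> int:
        # Deterministic bucket for a primary key; the same function has to be
        # applied on the write path when deriving the partition column.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    def query_for(key: int) -> str:
        # Build a query that adds both predicates, so the engine can prune
        # down to a single hash partition.
        return ("SELECT * FROM orders "
                "WHERE id = {} AND bucket = {}".format(key, bucket_of(key)))

    print(query_for(123))
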
>
> On Wed, Jun 3, 2020 at 2:54 AM tanu dua <[email protected]> wrote:
>
> > If I need to plug in this hashing algorithm to resolve the partitions in
> > Presto and Hive, what is the code I should look into?
> >
> > On Wed, Jun 3, 2020, 12:04 PM tanu dua <[email protected]> wrote:
> >
> > > Yes, that's also on the cards, and for developers that's OK, but we need
> > > to provide an interface for our ops people to execute queries from
> > > Presto, so I need to find out how they can calculate the hash if they
> > > fire a query on the primary key. They may fire a query including the
> > > primary key along with other fields. That is the only problem I see with
> > > hash partitions, and to get it to work I believe I need to go deeper into
> > > the Presto Hudi plugin.
> > >
> > > On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <[email protected]>
> > > wrote:
> > >
> > >> Hi Tanu,
> > >>
> > >> If your primary key is an integer, you can add one more field as a hash
> > >> of that integer and partition based on the hash field. It will add some
> > >> complexity to reads and writes because the hash has to be computed prior
> > >> to each read or write. I am not sure whether the overhead of doing this
> > >> exceeds the performance gains from having fewer partitions. I wonder why
> > >> Hudi doesn't directly support hash-based partitions?
> > >>
> > >> Thanks
> > >> Jaimin
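
As a rough PySpark sketch of the suggestion above, assuming the table is written through the Hudi Spark datasource: derive a bucket column from the integer primary key and use it as the partition path. The column names, bucket count, paths and the precombine field below are illustrative assumptions, not part of any existing table.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    NUM_BUCKETS = 1024  # assumed; choose so each bucket holds well-sized files

    spark = SparkSession.builder.appName("hash-partition-sketch").getOrCreate()

    # Illustrative source; replace with the real input.
    df = spark.read.parquet("s3://some-bucket/raw/records/")

    # Derive the partition column from the integer primary key.
    bucketed = df.withColumn("bucket", F.col("id") % NUM_BUCKETS)

    (bucketed.write.format("hudi")
        .option("hoodie.table.name", "records")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.partitionpath.field", "bucket")
        .option("hoodie.datasource.write.precombine.field", "ts")  # assumed column
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("s3://some-bucket/hudi/records/"))

    # Reads then have to apply the same expression to prune partitions, e.g.
    # .where("id = 123 AND bucket = {}".format(123 % NUM_BUCKETS))
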
> > >>
> > >> On Wed, 3 Jun 2020 at 10:07, tanu dua <[email protected]> wrote:
> > >>
> > >> > Thanks, Vinoth, for the detailed explanation. I was thinking along the
> > >> > same lines too and will take another look. We can reduce the 2nd and
> > >> > 3rd partitions, but it's very difficult to reduce the 1st partition, as
> > >> > that is the basic primary key of our domain model, which analysts and
> > >> > developers need to query almost 90% of the time; it's an integer
> > >> > primary key and can't be decomposed further.
> > >> >
> > >> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <[email protected]>
> > >> wrote:
> > >> >
> > >> > > Hi tanu,
> > >> > >
> > >> > > For good query performance, it's recommended to write optimally
> > >> > > sized files. Hudi already ensures that.
> > >> > >
> > >> > > Generally speaking, if you have too many partitions, then it also
> > >> > > means too many files. Most people limit their datasets to thousands
> > >> > > of partitions, since queries typically crunch data based on time or
> > >> > > a business domain (e.g. city for Uber). Partitioning too granularly
> > >> > > (say, based on user_id) is not very useful unless your queries only
> > >> > > crunch per user. If you are using the Hive metastore, then 25M
> > >> > > partitions mean 25M rows in your backing MySQL metastore db as well,
> > >> > > which is not very scalable.
> > >> > >
> > >> > > What I am trying to say is: even outside of Hudi, if analytics is
> > >> > > your use case, it might be worth partitioning at a lower granularity
> > >> > > and increasing the number of rows per Parquet file.
> > >> > >
> > >> > > Thanks
> > >> > > Vinoth
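
To make the sizing argument above concrete, a back-of-the-envelope sketch; the average record size is an assumption purely for the arithmetic, and the point is that the partition count, not the data volume, ends up driving the file count.

    TOTAL_RECORDS = 30_000_000
    RECORD_SIZE_BYTES = 200            # assumed average record size, illustrative only
    TARGET_FILE_SIZE = 120 * 1024**2   # roughly Hudi's default target base file size

    def files_needed(num_partitions: int) -> int:
        # Each partition needs at least one file, more if its data exceeds the target size.
        records_per_partition = max(TOTAL_RECORDS // num_partitions, 1)
        bytes_per_partition = records_per_partition * RECORD_SIZE_BYTES
        files_per_partition = max(1, -(-bytes_per_partition // TARGET_FILE_SIZE))
        return num_partitions * files_per_partition

    print(files_needed(25_000_000))  # ~25M files holding one or two records each
    print(files_needed(50))          # ~50 files of roughly the target size
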
> > >> > >
> > >> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > > We have a requirement to ingest 30M records into S3 backed by
> > >> > > > Hudi. I am figuring out the partition strategy and ending up with
> > >> > > > a lot of partitions: 25M partitions (primary partition) --> 2.5M
> > >> > > > (secondary partition) --> 2.5M (third partition), and each Parquet
> > >> > > > file will have fewer than 10 rows of data.
> > >> > > >
> > >> > > > Our dataset will be ingested in full at once, and then it will be
> > >> > > > updated incrementally each day with fewer than 1k updates. So it's
> > >> > > > more read-heavy than write-heavy.
> > >> > > >
> > >> > > > So what would be the suggestion in terms of Hudi performance: go
> > >> > > > ahead with the above partition strategy, or should I reduce my
> > >> > > > partitions and increase the number of rows in each Parquet file?
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
