Re: [EXT] Re: Bucketing in Hudi

Roopa Murthy Thu, 22 Oct 2020 17:23:48 -0700

Hi Balaji,


Thanks for your response. I went through HoodieIndex in the source code, but I am 
not sure how indexing alone could help with bucketing.

Spark bucketing would involve writing the compacted files in a bucketed (clustered) 
fashion, so that when a Spark SQL query filters on a certain id, only the 
bucket (file) that the id hashes to is scanned for matching records. 
This means the data has to be written during compaction using Spark's saveAsTable 
API with bucketBy set to the desired number of buckets. Refer: 
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html
This will create a Spark bucketed table whose metadata differs from that of Hive 
bucketed tables, as Spark does not use Hive's hashing algorithm.
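
To make the pruning idea above concrete, here is a minimal sketch in plain 
Python. It is not Spark or Hudi code: the bucket count, the CRC32 hash, and the 
function names are all assumptions chosen for illustration, and Spark uses its 
own hash function rather than CRC32.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_for(record_id, num_buckets=NUM_BUCKETS):
    # CRC32 is only a deterministic stand-in for the engine's real hash.
    return zlib.crc32(record_id.encode("utf-8")) % num_buckets


def write_bucketed(records, num_buckets=NUM_BUCKETS):
    """Group records into one in-memory 'file' per bucket."""
    files = defaultdict(list)
    for rec in records:
        files[bucket_for(rec["id"], num_buckets)].append(rec)
    return dict(files)


def lookup(files, record_id, num_buckets=NUM_BUCKETS):
    """Scan only the bucket that record_id hashes to."""
    scanned = files.get(bucket_for(record_id, num_buckets), [])
    matches = [r for r in scanned if r["id"] == record_id]
    return matches, len(scanned)


records = [{"id": f"user-{i}", "value": i} for i in range(100)]
files = write_bucketed(records)
matches, scanned = lookup(files, "user-42")
print(matches, f"scanned {scanned} of {len(records)} records")
```

The point of the sketch is that the reader and the writer must agree on the 
hash function and the bucket count; otherwise the lookup would scan the wrong 
file.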

Is this something that Hudi might support?

Thanks,
Roopa

From: Balaji Varadarajan <[email protected]>
Date: Wednesday, October 21, 2020 at 9:01 PM
To: "[email protected]" <[email protected]>
Cc: DL-AIE <[email protected]>
Subject: [EXT] Re: Bucketing in Hudi

Hudi supports pluggable indexing (HoodieIndex), and the phases of index lookup 
are nicely abstracted out. We have a Jira for supporting Bucket Indexing: 
https://issues.apache.org/jira/browse/HUDI-55

You can get bucket indexing done by implementing that interface, along with 
additional changes for handling initial writes to the partition and for 
bucketing information, which IMO are not significant. If you are interested in 
contributing, we would be happy to help guide you and land the change.
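
As a rough illustration of what a bucket index has to do on the write path, 
here is a small sketch in plain Python. It does not use Hudi's HoodieIndex 
interface: the class name, method names, and file naming scheme are all 
hypothetical, and CRC32 is only a stand-in hash.

```python
import zlib


class BucketIndexSketch:
    """Hypothetical, simplified stand-in; not Hudi's HoodieIndex API."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        # Maps (partition, bucket) to a file id; filled in on first write.
        self.file_for_bucket = {}

    def bucket_of(self, record_key):
        return zlib.crc32(record_key.encode("utf-8")) % self.num_buckets

    def tag_location(self, partition, record_key):
        """Return (file_id, created) for the file this record belongs to."""
        bucket = self.bucket_of(record_key)
        slot = (partition, bucket)
        if slot not in self.file_for_bucket:
            # Initial write to this bucket of the partition: create the file.
            self.file_for_bucket[slot] = f"{partition}/bucket-{bucket:04d}"
            return self.file_for_bucket[slot], True
        # Later writes with the same key are routed to the same file.
        return self.file_for_bucket[slot], False


index = BucketIndexSketch(num_buckets=8)
first_file, first_created = index.tag_location("2020/10/21", "id-1")
again_file, again_created = index.tag_location("2020/10/21", "id-1")
other_file, other_created = index.tag_location("2020/10/22", "id-1")
print(first_file, again_file, other_file)
```

Because the location is derived from the key itself, no lookup against 
existing data is needed; the only state is which buckets already have a file.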

Thanks,
Balaji.V




On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy 
<[email protected]> wrote:


Hello Hudi team,

We have a requirement to compact data on S3, but we need bucketing on top of 
compaction so that at query time only the files relevant to the "id" in the 
query are scanned. We are told that bucketing is not currently supported 
in Hudi. Is it possible to extend Hudi to support it? What would it take to 
extend the framework in order to do this?

We are trying to determine, from a timeline perspective, whether this is an option 
to consider, and we need your help in analyzing and planning for it.

Thanks,
Roopa
