Hi Roopa,
Bucketing is a more general concept. I think what you are referring to is how
to integrate with Spark SQL's bucketing syntax. I was proposing a Hudi-native
solution in which we implement bucket indexing, which gives the same end result:
compacted (parquet) files written with keys hashed to a bucket id. You can
then use Hudi's Spark data source integration to write to this table and
get a bucketed layout.
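To make the idea concrete, here is a minimal sketch in plain Python of hashing keys to a bucket id on write and using the same hash to narrow a lookup to one bucket. It is only an illustration and not Hudi or Spark code: the CRC32 hash, the bucket count of 4, and the names bucket_id, write_bucketed and lookup are all assumptions made for this example.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket id with a stable hash.

    CRC32 is a stand-in here; a real engine would use its own hash function.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def write_bucketed(records, num_buckets: int = NUM_BUCKETS):
    """Group records by the bucket id of their key (one list per 'file')."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_id(rec["id"], num_buckets)].append(rec)
    return dict(buckets)


def lookup(buckets, record_key: str, num_buckets: int = NUM_BUCKETS):
    """Scan only the bucket the key hashes to, instead of every bucket."""
    candidates = buckets.get(bucket_id(record_key, num_buckets), [])
    return [rec for rec in candidates if rec["id"] == record_key]


records = [{"id": f"user-{i}", "value": i} for i in range(10)]
buckets = write_bucketed(records)
print(lookup(buckets, "user-7"))
```

The point of the sketch is that the writer and the reader must agree on the hash function and the bucket count; if either differs, the reader looks in the wrong bucket.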
Let me know if this makes sense.
Thanks,
Balaji.V
On Thursday, October 22, 2020, 05:23:11 PM PDT, Roopa Murthy
<[email protected]> wrote:
Hi Balaji,
Thanks for your response. I went through HoodieIndex in source code but I am
not sure how indexing alone could help with bucketing.
Spark bucketing would involve writing the compacted files in a bucketed/clustered
fashion such that when a Spark SQL query filters on a certain id, only the
bucket (file) that the id hashes to would be scanned for matching records.
This means data during compaction has to be written using Spark’s saveAsTable
API with bucketBy set to the desired number of buckets. Refer to:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html
This will create a Spark bucketed table whose metadata differs from that of Hive
bucketed tables, as Spark cannot understand Hive’s hashing algorithm.
Is this something that Hudi might support?
Thanks,
Roopa
From: Balaji Varadarajan <[email protected]>
Date: Wednesday, October 21, 2020 at 9:01 PM
To: "[email protected]" <[email protected]>
Cc: DL-AIE <[email protected]>
Subject: [EXT] Re: Bucketing in Hudi
Hudi supports pluggable indexing (HoodieIndex), and the phases of index lookup
are nicely abstracted out. We have a Jira for supporting bucket indexing:
https://issues.apache.org/jira/browse/HUDI-55
You can get bucket indexing done by implementing that interface, along with
additional changes for handling initial writes to the partition and for
bucketing information, which IMO are not significant. If you are interested in
contributing, we would be happy to guide you and help land the change.
Thanks,
Balaji.V
On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy
<[email protected]> wrote:
Hello Hudi team,
We have a requirement to compact data on S3, but we need bucketing on top of
compaction so that at query time only the files relevant to the "id" in the
query are scanned. We are told that bucketing is not currently supported
in Hudi. Is it possible to extend Hudi to support it? What does it take to
extend the framework in order to do this?
We are trying to assess, from a timeline perspective, whether this is an option
to consider, and need your help in analyzing and planning for it.
Thanks,
Roopa