Hi Vinoth, I'm glad to receive your reply. Here are some of my thoughts.
At 2023-03-31 10:17:52, "Vinoth Chandar" <[email protected]> wrote:
>I think we can focus more on validating the hash index + bloom filter vs
>consistent hash index more first. Have you looked at RFC-08, which is a
>kind of hash index as well, except it stores the key => file group mapping
>externally.
The idea of the RFC-08 index (rowKey -> partitionPath, fileID) is very similar
to the HBase index, but the index is implemented inside Hudi itself, so there
is no need to worry about consistency issues. The index can be written to
HFiles quickly, but a read may need to consult multiple HFiles, so the
performance of reading the index can be a problem. That is why the RFC's
proposer naturally thought of using hash buckets to partially solve this
problem. HBase's solution to having many HFiles is to add a min/max key index
and a Bloom filter index. In Hudi, we could directly build a min/max key index
and a Bloom filter index for FileGroups, eliminating the need to store the
index in HFiles. Another solution is to compact the HFiles, but that also adds
a burden to Hudi. We need to consider HFile read performance carefully when
using RFC-08.
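To make that alternative concrete, here is a minimal sketch of keeping a min/max key range plus a small Bloom filter per FileGroup, so a key lookup can prune file groups without reading any HFile. The class name, sizes, and md5-based hashing are all illustrative assumptions, not Hudi code.

```python
import hashlib

class FileGroupIndex:
    """Illustrative per-FileGroup pruning metadata: key range + Bloom filter."""

    def __init__(self, keys, num_bits=1024, num_hashes=3):
        self.min_key, self.max_key = min(keys), max(keys)
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0
        for key in keys:
            for i in range(num_hashes):
                self.bits |= 1 << self._bit(key, i)

    def _bit(self, key, i):
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.num_bits

    def might_contain(self, key):
        # Cheap range check first, then the Bloom probe. False positives
        # are possible; false negatives are not.
        if not (self.min_key <= key <= self.max_key):
            return False
        return all((self.bits >> self._bit(key, i)) & 1
                   for i in range(self.num_hashes))

# A lookup only opens the file groups whose index answers "maybe".
groups = [FileGroupIndex([f"key-{j:04d}" for j in range(g * 100, (g + 1) * 100)])
          for g in range(4)]
candidates = [g for g, idx in enumerate(groups) if idx.might_contain("key-0150")]
```

Here the min/max range check already rules out three of the four groups, so only one Bloom filter is even probed; together the two structures play the role the external HFile lookup would otherwise play.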
Therefore, I believe that hash partitioning + Bloom filter is still the
simplest and most effective solution when data growth is predictable within a
small range.
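As a back-of-envelope check on that claim, the sketch below compares the chance that at least one Bloom filter false-positives during a key lookup, with and without partition pruning. The per-file rate and file counts are assumed numbers for illustration, not measurements.

```python
def lookup_fp_rate(per_file_fp: float, files_probed: int) -> float:
    """Probability that at least one probed Bloom filter false-positives."""
    return 1 - (1 - per_file_fp) ** files_probed

P = 0.01          # assumed per-file Bloom filter false-positive rate
TOTAL_FILES = 1000
PARTITIONS = 256  # hash partitions fixed at table creation

# A global lookup without partitioning probes every file group's filter.
no_pruning = lookup_fp_rate(P, TOTAL_FILES)

# With hash partitioning, the key maps to exactly one partition, so only
# that partition's files (about TOTAL_FILES / PARTITIONS of them) are probed.
with_pruning = lookup_fp_rate(P, max(1, TOTAL_FILES // PARTITIONS))
```

With these assumed numbers the unpartitioned lookup almost always hits at least one false positive, while the partitioned lookup stays close to the per-file rate, which is the argument made above.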
At 2023-03-31 10:17:52, "Vinoth Chandar" <[email protected]> wrote:
>On Fri, Mar 24, 2023 at 2:14 AM 吕虎 <[email protected]> wrote:
>
>> Hi Vinoth, I am very happy to receive your reply. Here are some of my
>> thoughts.
>>
>> At 2023-03-21 23:32:44, "Vinoth Chandar" <[email protected]> wrote:
>> >>but when it is used for data expansion, it still involves the need to
>> >redistribute the data records of some data files, thus affecting the
>> >performance.
>> >but expansion of the consistent hash index is an optional operation right?
>>
>> >Sorry, not still fully understanding the differences here,
>> I'm sorry I didn't make myself clear. The expansion I mentioned last
>> time refers to the growth of data records in the Hudi table.
>> The difference between the consistent hash index and hash partitioning
>> with a Bloom filter index is how they handle data growth:
>> The consistent hash index splits files. Splitting files affects
>> performance, but it can keep working effectively indefinitely. So the
>> consistent hash index is suitable for scenarios where data growth cannot
>> be estimated, or where the data will grow very large.
>> Hash partitioning with a Bloom filter index instead creates new files.
>> Adding new files does not affect performance, but if there are too many
>> files, the probability of false positives in the Bloom filters will
>> increase. So hash partitioning with a Bloom filter index is suitable for
>> scenarios where data growth can be estimated within a relatively small range.
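The split-versus-append trade-off described above can be sketched on a small consistent-hash ring: splitting one overloaded bucket only redistributes that bucket's records, while every other bucket's records keep their file group. The ring layout, md5 hashing, and bucket boundaries are illustrative assumptions, not Hudi's actual implementation.

```python
import bisect
import hashlib

def key_hash(key: str) -> int:
    # Stable 32-bit hash of the record key (md5 here is illustrative).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def bucket_range(boundaries, key):
    """Return the (lo, hi) hash range that owns this key on the ring."""
    i = bisect.bisect_right(boundaries, key_hash(key))
    lo = boundaries[i - 1] if i > 0 else 0
    return lo, boundaries[i]

buckets = [2 ** 30, 2 ** 31, 3 * 2 ** 30, 2 ** 32]   # four buckets
keys = [f"user-{i}" for i in range(1000)]
before = {k: bucket_range(buckets, k) for k in keys}

# Split the overloaded bucket [2^30, 2^31) at its midpoint.
split = [2 ** 30, 2 ** 30 + 2 ** 29, 2 ** 31, 3 * 2 ** 30, 2 ** 32]
after = {k: bucket_range(split, k) for k in keys}

# Only records that lived in the split bucket change file groups.
moved = [k for k in keys if before[k] != after[k]]
assert all(before[k] == (2 ** 30, 2 ** 31) for k in moved)
```

By contrast, the hash-partition-plus-Bloom-filter approach never rewrites existing records on growth; it only adds files within a partition, at the cost of more filters to probe.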
>>
>>
>> >>Because the hash partition field values under the parquet file in a
>> >columnar storage format are all equal, the added column field hardly
>> >occupies storage space after compression.
>> >Any new meta field added adds other overhead in terms of evolving the schema,
>> >and so forth. Are you suggesting this is not possible to do without a new meta
>> >field?
>>
>> An implementation without a new meta field would be more elegant, but for
>> me, not yet being familiar with the Hudi source code, it is somewhat
>> difficult to implement; it should not be a problem for experts, though. If
>> you want to implement it without adding new meta fields, I hope I can
>> participate in some simple development and also learn how the experts do it.
>>
>>
>> >On Thu, Mar 16, 2023 at 2:22 AM 吕虎 <[email protected]> wrote:
>> >
>> >> Hello,
>> >> I feel very honored that you are interested in my views.
>> >>
>> >> Here are some of my thoughts marked with blue font.
>> >>
>> >> At 2023-03-16 13:18:08, "Vinoth Chandar" <[email protected]> wrote:
>> >>
>> >> >Thanks for the proposal! Some first set of questions here.
>> >> >
>> >> >>You need to pre-select the number of buckets and use the hash
>> function to
>> >> >determine which bucket a record belongs to.
>> >> >>when building the table according to the estimated amount of data,
>> and it
>> >> >cannot be changed after building the table
>> >> >>When the amount of data in a hash partition is too large, the data in
>> >> that
>> >> >partition will be split into multiple files in the way of Bloom index.
>> >> >
>> >> >All these issues are related to bucket sizing could be alleviated by
>> the
>> >> >consistent hashing index in 0.13? Have you checked it out? Love to hear
>> >> >your thoughts on this.
>> >>
>> >> Hash partitioning is applicable to data tables whose exact data
>> >> capacity cannot be given, but for which a rough range can be estimated.
>> >> For example, if a company currently has 300 million customers in the
>> >> United States, it will have at most 7 billion customers worldwide. In
>> >> this scenario, using hash partitioning to cope with data growth within
>> >> the known range, by directly adding files and building Bloom filters,
>> >> can still guarantee performance.
>> >> The consistent hash bucket index is also very valuable, but when it is
>> >> used for data expansion, it still involves redistributing the data
>> >> records of some data files, which affects performance. When the range of
>> >> data capacity is completely impossible to estimate, consistent hashing
>> >> is very suitable.
>> >> >> you can directly search the data under the partition, which greatly
>> >> >reduces the scope of the Bloom filter to search for files and reduces
>> the
>> >> >false positive of the Bloom filter.
>> >> >the bloom index is already partition aware and unless you use the
>> global
>> >> >version can achieve this. Am I missing something?
>> >> >
>> >> >Another key thing is - if we can avoid adding a new meta field, that
>> would
>> >> >be great. Is it possible to implement this similar to bucket index,
>> based
>> >on just table properties?
>> >> Adding a hash partition field to the table to implement hash
>> >> partitioning reuses the existing partitioning machinery well and
>> >> involves very few code changes. Because the hash partition field values
>> >> within a Parquet file, in columnar storage format, are all equal, the
>> >> added column hardly occupies any storage space after compression.
>> >> Of course, it is not strictly necessary to add a hash partition field to
>> >> the table; the hash partition value could instead be stored in the
>> >> corresponding metadata to achieve this function, but then it would be
>> >> difficult to reuse the existing functionality, and implementing hash
>> >> partitioning and partition pruning during queries would need more time
>> >> to develop and test.
>> >> >On Sat, Feb 18, 2023 at 8:18 PM 吕虎 <[email protected]> wrote:
>> >> >
>> >> >> Hi folks,
>> >> >>
>> >> >> Here is my proposal. Thank you very much for reading it. I am looking
>> >> >> forward to your agreement to create an RFC for it.
>> >> >>
>> >> >> Background
>> >> >>
>> >> >> In order to deal with the problem that the modification of a small
>> >> amount
>> >> >> of local data needs to rewrite the entire partition data, Hudi
>> divided
>> >> the
>> >> >> partition into multiple File Groups, and each File Group is
>> identified
>> >> by
>> >> >> the File ID. In this way, when a small amount of local data is
>> modified,
>> >> >> only the data of the corresponding File Group needs to be rewritten.
>> >> Hudi
>> >> >> consistently maps the given Hudi record to the File ID through the
>> index
>> >> >> mechanism. The mapping relationship between Record Key and File
>> >> Group/File
>> >> >> ID will not change once the first version of Record is determined.
>> >> >>
>> >> >> At present, Hudi's indexes mainly include Bloom filter index,
>> HBase
>> >> >> index and bucket index. The Bloom filter index has a false positive
>> >> >> problem. When a large amount of data results in a large number of
>> File
>> >> >> Groups, the false positive problem will magnify and lead to poor
>> >> >> performance. The HBase index depends on the external HBase database,
>> and
>> >> >> may be inconsistent, which will ultimately increase the operation and
>> >> >> maintenance costs. Bucket index makes each bucket of the bucket index
>> >> >> correspond to a File Group. You need to pre-select the number of
>> buckets
>> >> >> and use the hash function to determine which bucket a record belongs
>> to.
>> >> >> Therefore, you can directly determine the mapping relationship
>> between
>> >> the
>> >> >> Record Key and the File Group/File ID through the hash function.
>> Using
>> >> the
>> >> >> bucket index, you need to determine the number of buckets in advance
>> >> when
>> >> >> building the table according to the estimated amount of data, and it
>> >> cannot
>> >> >> be changed after building the table. The unreasonable number of
>> buckets
>> >> >> will seriously affect performance. Unfortunately, the amount of data
>> is
>> >> >> often unpredictable and will continue to grow.
>> >> >>
>> >> >> Hash partition feasibility
>> >> >>
>> >> >> In this context, I put forward the idea of hash partitioning.
>> The
>> >> >> principle is similar to bucket index, but in addition to the
>> advantages
>> >> of
>> >> >> bucket index, hash partitioning can retain the Bloom index. When the
>> >> amount
>> >> >> of data in a hash partition is too large, the data in that partition
>> >> will
>> >> >> be split into multiple files in the way of Bloom index. Therefore,
>> the
>> >> >> problem that bucket index depends heavily on the number of buckets
>> does
>> >> not
>> >> >> exist in the hash partition. Compared with the Bloom index, when
>> >> searching
>> >> >> for a data, you can directly search the data under the partition,
>> which
>> >> >> greatly reduces the scope of the Bloom filter to search for files and
>> >> >> reduces the false positive of the Bloom filter.
>> >> >>
>> >> >> Design of a simple hash partition implementation
>> >> >>
>> >> >> The idea is to use the capabilities of the ComplexKeyGenerator to
>> >> >> implement hash partitioning. The hash partition field is one of the
>> >> partition
>> >> >> fields of the ComplexKeyGenerator.
>> >> >>
>> >> >> When hash.partition.fields is specified and partition.fields
>> contains
>> >> >> _hoodie_hash_partition, a column named _hoodie_hash_partition will be
>> >> added
>> >> to this table as one of the partition keys.
>> >> >>
>> >> >> If predicates of hash.partition.fields appear in the query statement,
>> >> the
>> >> >> _hoodie_hash_partition = X predicate will be automatically added to
>> the
>> >> >> query statement for partition pruning.
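The two steps just described (deriving the `_hoodie_hash_partition` value on write, and appending the pruning predicate on read) could look roughly like the sketch below. The hash function, partition count, and helper names are assumptions for illustration, not the code in PR 7984.

```python
import zlib

NUM_HASH_PARTITIONS = 64  # assumed here; would come from table configuration

def hash_partition_value(hash_field_value: str) -> int:
    """Derive the _hoodie_hash_partition column value for one record."""
    return zlib.crc32(hash_field_value.encode()) % NUM_HASH_PARTITIONS

def add_pruning_predicate(query: str, hash_field_value: str) -> str:
    # When the query already pins the hash.partition.fields value, the
    # matching _hoodie_hash_partition = X predicate can be appended so
    # only one partition directory is scanned.
    partition = hash_partition_value(hash_field_value)
    return f"{query} AND _hoodie_hash_partition = {partition}"

pruned = add_pruning_predicate("SELECT * FROM t WHERE uuid = 'abc'", "abc")
```

Because the same deterministic hash is used on both paths, the rewritten query can only touch the one partition the record was written to.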
>> >> >>
>> >> >> Advantages of this design: simple implementation, no modification of
>> >> core
>> >> >> functions, so low risk.
>> >> >>
>> >> >> The above design has been implemented in pr 7984.
>> >> >>
>> >> >> https://github.com/apache/hudi/pull/7984
>> >> >>
>> >> >> How to use hash partition in spark data source can refer to
>> >> >>
>> >> >>
>> >> >>
>> >>
>> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>> >> >>
>> >> >> #testHashPartition
>> >> >>
>> >> >> Perhaps for experts, the implementation in the PR is not elegant enough.
>> I
>> >> >> also look forward to a more elegant implementation, from which I can
>> >> learn
>> >> >> more.
>> >> >>
>> >> >> Problems with hash partition:
>> >> >>
>> >> >> Because SparkSQL does not know the existence of hash partitions, when
>> >> two
>> >> >> hash-partitioned tables perform associated queries, it may not get
>> the
>> >> >> optimal execution plan. However, because hudi still has the ability
>> to
>> >> >> prune a single table, the performance of two tables can still be
>> greatly
>> >> >> improved by performing associated queries, compared with no hash
>> >> >> partitioning.
>> >> >>
>> >> >> The same problem also exists in the operation of group by. SparkSQL
>> does
>> >> >> not know that the aggregation operation has been completed in the
>> hash
>> >> >> partition. It may aggregate the results of different partition
>> >> aggregation
>> >> >> again. However, because the data volume after partition aggregation is
>> >> very
>> >> >> small, the redundant and useless aggregation operation has little
>> >> impact on
>> >> >> the overall performance.
>> >> >>
>> >> >> I'm not sure whether bucket indexes will have the same problem as
>> hash
>> >> >> partitions.
>> >> >>
>> >> >>
>> >> >> At 2023-02-17 05:18:19, "Y Ethan Guo" <[email protected]> wrote:
>> >> >> >+1 Thanks Lvhu for bringing up the idea. As Alexey suggested, it
>> >> would be
>> >> >> >good for you to write down the proposal with design details for
>> >> discussion
>> >> >> >in the community.
>> >> >> >
>> >> >> >On Thu, Feb 16, 2023 at 11:28 AM Alexey Kudinkin <
>> [email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> >> Thanks for your contribution, Lvhu!
>> >> >> >>
>> >> >> >> I think we should actually kick-start this effort with a small
>> RFC
>> >> >> >> outlining proposed changes first, as this is modifying the core
>> >> >> read-flow
>> >> >> >> for all Hudi tables and we want to make sure our approach there is
>> >> >> >> rock-solid.
>> >> >> >>
>> >> >> >> On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <[email protected]> wrote:
>> >> >> >>
>> >> >> >> > Hi folks,
>> >> >> >> > PR 7984【 https://github.com/apache/hudi/pull/7984 】
>> >> implements
>> >> >> >> hash
>> >> >> >> > partitioning.
>> >> >> >> > As you know, it is often difficult to find an appropriate
>> >> >> partition
>> >> >> >> > key in the existing big data. Hash partitioning can easily solve
>> >> this
>> >> >> >> > problem. It can greatly improve the performance of Hudi's big
>> data
>> >> >> >> > processing.
>> >> >> >> > The idea is to use the hash partition field as one of the
>> >> >> partition
>> >> >> >> > fields of the ComplexKeyGenerator, so this PR implementation
>> does
>> >> not
>> >> >> >> > involve logic modification of core code.
>> >> >> >> > The codes are easy to review, but I think hash partition
>> is
>> >> very
>> >> >> >> > useful. We really need it.
>> >> >> >> > How to use hash partition in spark data source can refer
>> to
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>> >> >> >> > #testHashPartition
>> >> >> >> >
>> >> >> >> > No public API or user-facing feature change or any
>> >> performance
>> >> >> >> > impact if the hash partition parameters are not specified.
>> >> >> >> >
>> >> >> >> > When hash.partition.fields is specified and
>> partition.fields
>> >> >> >> > contains _hoodie_hash_partition, a column named
>> >> _hoodie_hash_partition
>> >> >> >> will
>> >> >> >> > be added to this table as one of the partition keys.
>> >> >> >> >
>> >> >> >> > If predicates of hash.partition.fields appear in the query
>> >> >> >> > statement, the _hoodie_hash_partition = X predicate will be
>> >> >> automatically
>> >> >> >> > added to the query statement for partition pruning.
>> >> >> >> >
>> >> >> >> > Hope folks help and review!
>> >> >> >> > Thanks!
>> >> >> >> > Lvhu
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>>