I think we can focus first on validating hash index + bloom filter vs. the consistent hash index. Have you looked at RFC-08? It is a kind of hash index as well, except that it stores the key => file group mapping externally.
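Roughly, the distinction I am drawing (my own sketch in Scala, not the actual RFC-08 design; names are hypothetical):

    // hash-style indexes compute the file group from the key inline;
    // an RFC-08 style index instead looks the mapping up from external storage
    trait KeyToFileGroupStore {
      def lookup(recordKey: String): Option[String]  // file group id, if the key was seen before
      def put(recordKey: String, fileGroupId: String): Unit
    }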
On Fri, Mar 24, 2023 at 2:14 AM 吕虎 <lvh...@163.com> wrote:

> Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts.
>
> At 2023-03-21 23:32:44, "Vinoth Chandar" <vin...@apache.org> wrote:
>
> > > but when it is used for data expansion, it still involves the need to redistribute the data records of some data files, thus affecting the performance.
> >
> > but expansion of the consistent hash index is an optional operation right?
> > Sorry, still not fully understanding the differences here.
>
> I'm sorry I didn't make myself clear. The "expansion" I mentioned last time refers to the growth of data records in a Hudi table.
> The difference between the consistent hash index and hash partitioning with Bloom filter indexes lies in how each handles data growth:
> For the consistent hash index, files are split. Splitting files affects performance, but keeps working no matter how large the data grows. So the consistent hash index suits scenarios where data growth cannot be estimated at all, or where the data will grow very large.
> For hash partitions with Bloom filter indexes, new files are created instead. Adding new files does not affect performance, but if there are too many files, the probability of false positives from the Bloom filters increases. So hash partitioning with Bloom filter indexes suits scenarios where data growth can be estimated within a relatively small range.
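> To illustrate the point with plain Scala (illustrative only, not Hudi code):
>
>     // With hash partitioning, a record's partition is fixed forever:
>     def hashPartition(recordKey: String, numPartitions: Int): Int =
>       Math.floorMod(recordKey.hashCode, numPartitions)
>
>     // Growth is absorbed by adding new file groups (each with its own Bloom
>     // filter) inside the partition, so no existing file is rewritten.
>     // Consistent hashing instead splits an oversized bucket and redistributes
>     // its records into new buckets, which is the rewrite cost I mean.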
> > > Because the hash partition field values under the parquet file in a columnar storage format are all equal, the added column field hardly occupies storage space after compression.
> >
> > Any new meta field added adds other overhead in terms of evolving the schema, and so forth. Are you suggesting this is not possible to do without a new meta field?
>
> An implementation without a new meta field would be more elegant, but for me, who is not yet familiar with the Hudi source code, it is somewhat difficult to implement; it would not be a problem for experts. If it is implemented without adding new meta fields, I hope I can participate in some of the simpler development, and learn how the experts do it.
>
> > On Thu, Mar 16, 2023 at 2:22 AM 吕虎 <lvh...@163.com> wrote:
> >
> > > Hello,
> > > I feel very honored that you are interested in my views.
> > > Here are some of my thoughts, inline below.
> > >
> > > At 2023-03-16 13:18:08, "Vinoth Chandar" <vin...@apache.org> wrote:
> > >
> > > > Thanks for the proposal! Some first set of questions here.
> > > >
> > > > > You need to pre-select the number of buckets and use the hash function to determine which bucket a record belongs to.
> > > > > when building the table according to the estimated amount of data, and it cannot be changed after building the table
> > > > > When the amount of data in a hash partition is too large, the data in that partition will be split into multiple files in the way of the Bloom index.
> > > >
> > > > All these issues related to bucket sizing could be alleviated by the consistent hashing index in 0.13? Have you checked it out? Love to hear your thoughts on this.
> > >
> > > Hash partitioning is applicable to tables whose exact data volume cannot be given, but for which a rough range can be estimated. For example, if a company currently has 300 million customers in the United States, it will have at most 7 billion customers worldwide. In this scenario, hash partitioning copes with data growth within the known range by directly adding files and building Bloom filters, and can still guarantee performance.
> > > The consistent hash bucket index is also very valuable, but under data growth it still involves redistributing the data records of some data files, which affects performance. When it is completely impossible to estimate the range of the data volume, consistent hashing is very suitable.
> > >
> > > > > you can directly search the data under the partition, which greatly reduces the scope of the Bloom filter's search for files and reduces the false positives of the Bloom filter.
> > > >
> > > > the bloom index is already partition aware and unless you use the global version can achieve this. Am I missing something?
> > > >
> > > > Another key thing is - if we can avoid adding a new meta field, that would be great. Is it possible to implement this similar to bucket index, based on just table properties?
> > >
> > > Adding a hash partition field to the table implements the hash partition function while reusing the existing partitioning machinery, and it involves very few code changes. Because the hash partition field values within a parquet file in columnar storage format are all equal, the added column hardly occupies any storage space after compression.
> > > Of course, it is not strictly necessary to add a hash partition field to the table; the hash partition value could instead be stored in the corresponding metadata. But then it would be difficult to reuse the existing functionality: building hash partitions and pruning them at query time would need more development and testing effort.
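> > > A rough sketch of the idea in Spark (illustrative only; the function name and the exact hash are mine, the real code is in the PR):
> > >
> > >     import org.apache.spark.sql.DataFrame
> > >     import org.apache.spark.sql.functions.{col, hash, lit, pmod}
> > >
> > >     // Derive a deterministic partition value from the configured hash field
> > >     // and attach it as an ordinary column, so the existing partition-path
> > >     // and pruning machinery is reused unchanged.
> > >     def withHashPartition(df: DataFrame, hashField: String, numPartitions: Int): DataFrame =
> > >       df.withColumn("_hoodie_hash_partition", pmod(hash(col(hashField)), lit(numPartitions)))
> > >
> > > Within any single parquet file the _hoodie_hash_partition value is one constant, which is why it compresses to almost nothing.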
> > > > On Sat, Feb 18, 2023 at 8:18 PM 吕虎 <lvh...@163.com> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > Here is my proposal. Thank you very much for reading it. I am looking forward to your agreement to create an RFC for it.
> > > > >
> > > > > Background
> > > > >
> > > > > In order to deal with the problem that modifying a small amount of local data requires rewriting the data of an entire partition, Hudi divides each partition into multiple File Groups, each identified by a File ID. This way, when a small amount of local data is modified, only the data of the corresponding File Group needs to be rewritten. Hudi consistently maps a given Hudi record to a File ID through the index mechanism. The mapping between Record Key and File Group/File ID does not change once the first version of a record is written.
> > > > >
> > > > > At present, Hudi's indexes mainly include the Bloom filter index, the HBase index, and the bucket index. The Bloom filter index has a false positive problem: when a large amount of data results in a large number of File Groups, the false positives multiply and lead to poor performance. The HBase index depends on an external HBase database, may become inconsistent with the table, and ultimately increases operations and maintenance costs. The bucket index makes each bucket correspond to one File Group: you pre-select the number of buckets and use the hash function to determine which bucket a record belongs to. Therefore the mapping between the Record Key and the File Group/File ID is determined directly by the hash function. With the bucket index, you must fix the number of buckets when building the table, based on the estimated amount of data, and it cannot be changed afterwards. An unreasonable number of buckets seriously affects performance. Unfortunately, the amount of data is often unpredictable and keeps growing.
> > > > >
> > > > > Hash partition feasibility
> > > > >
> > > > > In this context, I put forward the idea of hash partitioning. The principle is similar to the bucket index, but in addition to the advantages of the bucket index, hash partitioning can retain the Bloom index: when the amount of data in a hash partition grows too large, the data in that partition is split into multiple files in the manner of the Bloom index. Therefore, the bucket index's heavy dependence on the number of buckets does not exist for hash partitions. Compared with the Bloom index, when searching for a record you can search directly under its hash partition, which greatly reduces the set of files the Bloom filters must check and so reduces Bloom filter false positives.
> > > > >
> > > > > Design of a simple hash partition implementation
> > > > >
> > > > > The idea is to use the capabilities of the ComplexKeyGenerator to implement hash partitioning: the hash partition field is one of the partition fields of the ComplexKeyGenerator.
> > > > >
> > > > > When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition is added to the table as one of the partition keys.
> > > > >
> > > > > If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate is automatically added to the query statement for partition pruning.
> > > > >
> > > > > Advantages of this design: simple implementation, no modification of core functions, so low risk.
> > > > >
> > > > > The above design has been implemented in PR 7984:
> > > > >
> > > > > https://github.com/apache/hudi/pull/7984
> > > > >
> > > > > For how to use the hash partition in the Spark data source, refer to
> > > > >
> > > > > https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
> > > > > #testHashPartition
> > > > >
> > > > > Perhaps for experts the implementation in the PR is not elegant enough. I also look forward to a more elegant implementation, from which I can learn more.
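> > > > > A minimal sketch of the usage (hash.partition.fields is the key named above; the other option keys are the standard Hudi Spark ones, the values are illustrative, and df is assumed to be a DataFrame with an id column; see #testHashPartition for the exact configuration):
> > > > >
> > > > >     df.write.format("hudi")
> > > > >       .option("hoodie.table.name", "customers")            // illustrative name
> > > > >       .option("hoodie.datasource.write.recordkey.field", "id")
> > > > >       .option("hoodie.datasource.write.keygenerator.class",
> > > > >               "org.apache.hudi.keygen.ComplexKeyGenerator")
> > > > >       .option("hash.partition.fields", "id")               // field(s) to hash
> > > > >       .option("hoodie.datasource.write.partitionpath.field",
> > > > >               "_hoodie_hash_partition")                    // partition fields contain _hoodie_hash_partition
> > > > >       .mode("append")
> > > > >       .save("/tmp/hudi/customers")
> > > > >
> > > > > At query time, a predicate on the hashed field, e.g. WHERE id = 12345, then gets _hoodie_hash_partition = X added automatically, pruning the scan to a single hash partition.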
> > > > > Problems with hash partitioning:
> > > > >
> > > > > Because SparkSQL does not know about the existence of hash partitions, when two hash-partitioned tables are joined it may not produce the optimal execution plan. However, because Hudi can still prune each table individually, joining two tables still performs much better than without hash partitioning.
> > > > >
> > > > > The same problem exists for group-by: SparkSQL does not know that aggregation has already been completed within a hash partition, so it may aggregate the per-partition aggregation results again. However, since the data volume after per-partition aggregation is very small, the redundant aggregation has little impact on overall performance.
> > > > >
> > > > > I'm not sure whether bucket indexes have the same problem as hash partitions.
> > > > >
> > > > > At 2023-02-17 05:18:19, "Y Ethan Guo" <yi...@apache.org> wrote:
> > > > >
> > > > > > +1 Thanks Lvhu for bringing up the idea. As Alexey suggested, it would be good for you to write down the proposal with design details for discussion in the community.
> > > > > >
> > > > > > On Thu, Feb 16, 2023 at 11:28 AM Alexey Kudinkin <ale...@onehouse.ai> wrote:
> > > > > >
> > > > > > > Thanks for your contribution, Lvhu!
> > > > > > >
> > > > > > > I think we should actually kick-start this effort with a small RFC outlining the proposed changes first, as this is modifying the core read-flow for all Hudi tables and we want to make sure our approach there is rock-solid.
> > > > > > >
> > > > > > > On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <lvh...@163.com> wrote:
> > > > > > >
> > > > > > > > Hi folks,
> > > > > > > > PR 7984 [ https://github.com/apache/hudi/pull/7984 ] implements hash partitioning.
> > > > > > > > As you know, it is often difficult to find an appropriate partition key in existing big data. Hash partitioning can easily solve this problem, and it can greatly improve the performance of Hudi's big data processing.
> > > > > > > > The idea is to use the hash partition field as one of the partition fields of the ComplexKeyGenerator, so this PR does not involve modifying the logic of the core code.
> > > > > > > > The code is easy to review, but I think the hash partition is very useful. We really need it.
> > > > > > > > For how to use the hash partition in the Spark data source, refer to
> > > > > > > > https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
> > > > > > > > #testHashPartition
> > > > > > > > There is no public API or user-facing feature change, nor any performance impact, if the hash partition parameters are not specified.
> > > > > > > > When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition is added to the table as one of the partition keys.
> > > > > > > > If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate is automatically added to the query statement for partition pruning.
> > > > > > > > Hope folks help and review!
> > > > > > > > Thanks!
> > > > > > > > Lvhu