Re:Re: DISCUSS

2023-03-16 Thread 吕虎
Hello, 
 I feel very honored that you are interested in my views.

 Here are some of my thoughts, inline below (originally marked in blue font).

At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:

>Thanks for the proposal! Here is a first set of questions.
>
>>You need to pre-select the number of buckets and use the hash function to
>determine which bucket a record belongs to.
>>when building the table according to the estimated amount of data, and it
>cannot be changed after building the table
>>When the amount of data in a hash partition is too large, the data in that
>partition will be split into multiple files in the way of Bloom index.
>
>All these issues related to bucket sizing could be alleviated by the
>consistent hashing index in 0.13, right? Have you checked it out? Would
>love to hear your thoughts on this.

Hash partitioning suits tables whose exact data capacity cannot be given but 
whose rough range can be estimated. For example, a company may currently have 
300 million customers in the United States and will have at most 7 billion 
customers worldwide. In this scenario, hash partitioning copes with data growth 
within the known range by directly adding files and building Bloom filters, 
while still guaranteeing performance.
The consistent hashing bucket index is also very valuable, but expanding it 
still requires redistributing the records of some data files, which affects 
performance. When the range of data capacity cannot be estimated at all, 
consistent hashing is a very good fit.
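To make the trade-off concrete, here is a minimal sketch (not Hudi's actual hashing; the key format and bucket counts are made up for illustration) of why a fixed modulo-style bucket index cannot change its bucket count in place: doubling the count remaps roughly half of all keys, so the affected file groups would all need rewriting.

```java
// Illustrative sketch, not Hudi's implementation: a fixed modulo-style
// bucket index remaps most records when the bucket count changes.
public class BucketResizeDemo {
    // Map a record key to one of numBuckets buckets by hashing.
    static int bucketOf(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        int moved = 0;
        for (int i = 1; i <= 1000; i++) {
            String key = "customer-" + i; // hypothetical key format
            // Doubling the bucket count remaps roughly half of the keys,
            // so the affected file groups would all need rewriting.
            if (bucketOf(key, 16) != bucketOf(key, 32)) moved++;
        }
        System.out.println("records remapped when resizing 16 -> 32 buckets: " + moved);
    }
}
```

Consistent hashing reduces this to redistributing only the records of the buckets being split, which is the cost referred to above.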
>> you can directly search the data under the partition, which greatly
>> reduces the set of files the Bloom filter must search and reduces the
>> Bloom filter's false positives.
>The Bloom index is already partition aware and, unless you use the global
>version, can achieve this. Am I missing something?
>
>Another key thing is - if we can avoid adding a new meta field, that would
>be great. Is it possible to implement this similar to the bucket index,
>based on just table properties?
Adding a hash partition field to the table implements the hash partition 
function, reuses the existing partitioning machinery well, and involves very 
few code changes. Because all values of the hash partition field within one 
Parquet file are equal, and Parquet is a columnar storage format, the added 
column occupies almost no storage space after compression.
Of course, it is not strictly necessary to add a hash partition field to the 
table; the hash partition value could instead be stored in the corresponding 
metadata. But that would make it hard to reuse existing functionality: building 
hash partitions and pruning partitions during query would then require more 
time to develop and test.
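As a minimal sketch of the derived-column idea (the column name `_hoodie_hash_partition` comes from the proposal; the hash function and the partition count of 64 are assumptions for illustration, not Hudi's actual key generator):

```java
// Hedged sketch: computing the derived _hoodie_hash_partition column value
// from the configured hash partition field. Hashing details are assumed.
public class HashPartitionColumn {
    static final int NUM_HASH_PARTITIONS = 64; // assumed table property

    // Value stored in the _hoodie_hash_partition column for a record.
    static int hashPartitionValue(String hashFieldValue) {
        return Math.floorMod(hashFieldValue.hashCode(), NUM_HASH_PARTITIONS);
    }

    public static void main(String[] args) {
        // Every record in one hash partition carries the same column value,
        // so within one Parquet file the column is a single repeated
        // constant and compresses (e.g. via run-length encoding) to almost
        // nothing.
        System.out.println("user-42 -> partition " + hashPartitionValue("user-42"));
    }
}
```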
>On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:
>
>> Hi folks,
>>
>> Here is my proposal. Thank you very much for reading it. I look
>> forward to your agreement to create an RFC for it.
>>
>> Background
>>
>> To deal with the problem that modifying a small amount of local data
>> requires rewriting an entire partition's data, Hudi divides each
>> partition into multiple File Groups, and each File Group is identified by
>> a File ID. In this way, when a small amount of local data is modified,
>> only the data of the corresponding File Group needs to be rewritten. Hudi
>> consistently maps a given record to a File ID through the index
>> mechanism. The mapping between Record Key and File Group/File ID does not
>> change once the first version of the record is written.
>>
>> At present, Hudi's indexes mainly include the Bloom filter index, the
>> HBase index, and the bucket index. The Bloom filter index has a
>> false-positive problem: when a large amount of data produces a large
>> number of File Groups, the false positives are magnified and performance
>> suffers. The HBase index depends on an external HBase database, may
>> become inconsistent with the table, and increases operation and
>> maintenance costs. The bucket index maps each bucket to a File Group: you
>> pre-select the number of buckets and use a hash function to determine
>> which bucket a record belongs to, so the mapping between Record Key and
>> File Group/File ID is determined directly by the hash function. However,
>> with the bucket index the number of buckets must be chosen when the table
>> is built, according to the estimated amount of data, and cannot be
>> changed afterwards. An unreasonable number of buckets will seriously hurt
>> performance, and unfortunately the amount of data is often unpredictable
>> and keeps growing.
>>
>> Hash partition feasibility
>>
>>  In this context, I put forward the idea of hash partitioning. The
>> principle is similar to bucket index, but in addition to the advantages of
>> bucket index, hash partitioning 

Re: DISCUSS

2023-03-16 Thread Vinoth Chandar
Thanks for the proposal! Here is a first set of questions.

>You need to pre-select the number of buckets and use the hash function to
determine which bucket a record belongs to.
>when building the table according to the estimated amount of data, and it
cannot be changed after building the table
>When the amount of data in a hash partition is too large, the data in that
partition will be split into multiple files in the way of Bloom index.

All these issues related to bucket sizing could be alleviated by the
consistent hashing index in 0.13, right? Have you checked it out? Would
love to hear your thoughts on this.

> you can directly search the data under the partition, which greatly
reduces the scope of the Bloom filter to search for files and reduces the
false positive of the Bloom filter.
The Bloom index is already partition aware and, unless you use the global
version, can achieve this. Am I missing something?

Another key thing is - if we can avoid adding a new meta field, that would
be great. Is it possible to implement this similar to the bucket index,
based on just table properties?

On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:

> Hi folks,
>
> Here is my proposal. Thank you very much for reading it. I look
> forward to your agreement to create an RFC for it.
>
> Background
>
> To deal with the problem that modifying a small amount of local data
> requires rewriting an entire partition's data, Hudi divides each
> partition into multiple File Groups, and each File Group is identified by
> a File ID. In this way, when a small amount of local data is modified,
> only the data of the corresponding File Group needs to be rewritten. Hudi
> consistently maps a given record to a File ID through the index
> mechanism. The mapping between Record Key and File Group/File ID does not
> change once the first version of the record is written.
>
> At present, Hudi's indexes mainly include the Bloom filter index, the
> HBase index, and the bucket index. The Bloom filter index has a
> false-positive problem: when a large amount of data produces a large
> number of File Groups, the false positives are magnified and performance
> suffers. The HBase index depends on an external HBase database, may
> become inconsistent with the table, and increases operation and
> maintenance costs. The bucket index maps each bucket to a File Group: you
> pre-select the number of buckets and use a hash function to determine
> which bucket a record belongs to, so the mapping between Record Key and
> File Group/File ID is determined directly by the hash function. However,
> with the bucket index the number of buckets must be chosen when the table
> is built, according to the estimated amount of data, and cannot be
> changed afterwards. An unreasonable number of buckets will seriously hurt
> performance, and unfortunately the amount of data is often unpredictable
> and keeps growing.
>
> Hash partition feasibility
>
>  In this context, I put forward the idea of hash partitioning. The
> principle is similar to the bucket index, but in addition to the bucket
> index's advantages, hash partitioning can retain the Bloom index. When
> the amount of data in a hash partition grows too large, the data in that
> partition is split into multiple files, as in the Bloom index. Therefore
> the bucket index's heavy dependence on the number of buckets does not
> exist for hash partitions. And compared with the Bloom index, a lookup
> for a record can go directly to the data under its partition, which
> greatly reduces the set of files the Bloom filter must search and
> reduces the Bloom filter's false positives.
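The lookup path above can be sketched as follows (a toy stand-in, not Hudi code: the file names, partition count, and two-files-per-partition layout are all assumptions for illustration):

```java
import java.util.*;

// Hedged sketch of the lookup path described above: hash the key to one
// partition first, then probe only that partition's file Bloom filters,
// instead of the Bloom filters of every file in the table.
public class HashPartitionLookup {
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        // Toy stand-in for the files (and their Bloom filters) per partition.
        Map<Integer, List<String>> filesByPartition = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            filesByPartition.put(p, List.of("file-" + p + "-a", "file-" + p + "-b"));
        }
        String key = "customer-7";
        int p = partitionOf(key, numPartitions);
        // Only these files' Bloom filters need to be probed for the key:
        List<String> candidates = filesByPartition.get(p);
        System.out.println("probe " + candidates.size()
            + " files instead of " + (numPartitions * 2));
    }
}
```

Fewer files probed means fewer chances for a Bloom filter false positive, which is the claim made above.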
>
> Design of a simple hash partition implementation
>
> The idea is to use the capabilities of the ComplexKeyGenerator to
> implement hash partitioning. The hash partition field is one of the
> partition fields of the ComplexKeyGenerator.
>
> When hash.partition.fields is specified and partition.fields contains
> _hoodie_hash_partition, a column named _hoodie_hash_partition will be
> added to the table as one of the partition keys.
>
> If predicates on hash.partition.fields appear in the query statement, a
> _hoodie_hash_partition = X predicate will be automatically added to the
> query statement for partition pruning.
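A minimal sketch of that rewrite (the `_hoodie_hash_partition` column name comes from the proposal; the string-based predicate handling, hash function, and partition count of 64 are simplifications assumed for illustration, not the PR's actual logic):

```java
// Hedged sketch of the partition-pruning rewrite described above: when the
// query carries an equality predicate on the configured hash partition
// field, append a matching _hoodie_hash_partition = X predicate.
public class PredicateRewriteDemo {
    static final int NUM_HASH_PARTITIONS = 64; // assumed table property

    static String addPruningPredicate(String whereClause, String hashField,
                                      String literal) {
        // Only rewrite when the query constrains the hash partition field.
        if (!whereClause.contains(hashField + " = ")) return whereClause;
        int x = Math.floorMod(literal.hashCode(), NUM_HASH_PARTITIONS);
        return whereClause + " AND _hoodie_hash_partition = " + x;
    }

    public static void main(String[] args) {
        // The added predicate lets the planner prune all other partitions.
        System.out.println(
            addPruningPredicate("user_id = 'u-42'", "user_id", "u-42"));
    }
}
```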
>
> Advantages of this design: a simple implementation that does not modify
> core functions, so the risk is low.
>
> The above design has been implemented in PR 7984:
>
> https://github.com/apache/hudi/pull/7984
>
> How to use hash partitioning with the Spark data source is shown in
>
>
> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>
> #testHashPartition
>
> Perhaps for experts, the implementation of the PR is not elegant enough. I
> also look