I think we can focus first on validating hash index + bloom filter vs. the consistent hash index. Have you looked at RFC-08? It is a kind of hash index as well, except that it stores the key => file group mapping externally.
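Roughly, the distinction I am drawing (my own sketch in Scala, not the actual RFC-08 design; names are hypothetical):

    // hash-style indexes compute the file group from the key inline;
    // an RFC-08 style index instead looks the mapping up from external storage
    trait KeyToFileGroupStore {
      def lookup(recordKey: String): Option[String]  // file group id, if the key was seen before
      def put(recordKey: String, fileGroupId: String): Unit
    }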
On Fri, Mar 24, 2023 at 2:14 AM 吕虎 <lvh...@163.com> wrote:

> Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts.
>
> At 2023-03-21 23:32:44, "Vinoth Chandar" <vin...@apache.org> wrote:
>
> > > but when it is used for data expansion, it still involves the need to redistribute the data records of some data files, thus affecting the performance.
> >
> > but expansion of the consistent hash index is an optional operation right?
> > Sorry, still not fully understanding the differences here.
>
> I'm sorry I didn't make myself clear. The "expansion" I mentioned last time refers to the growth of data records in a Hudi table.
> The difference between the consistent hash index and hash partitioning with Bloom filter indexes lies in how each handles data growth:
> For the consistent hash index, files are split. Splitting files affects performance, but keeps working no matter how large the data grows. So the consistent hash index suits scenarios where data growth cannot be estimated at all, or where the data will grow very large.
> For hash partitions with Bloom filter indexes, new files are created instead. Adding new files does not affect performance, but if there are too many files, the probability of false positives from the Bloom filters increases. So hash partitioning with Bloom filter indexes suits scenarios where data growth can be estimated within a relatively small range.
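> To illustrate the point with plain Scala (illustrative only, not Hudi code):
>
>     // With hash partitioning, a record's partition is fixed forever:
>     def hashPartition(recordKey: String, numPartitions: Int): Int =
>       Math.floorMod(recordKey.hashCode, numPartitions)
>
>     // Growth is absorbed by adding new file groups (each with its own Bloom
>     // filter) inside the partition, so no existing file is rewritten.
>     // Consistent hashing instead splits an oversized bucket and redistributes
>     // its records into new buckets, which is the rewrite cost I mean.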
> > > Because the hash partition field values under the parquet file in a columnar storage format are all equal, the added column field hardly occupies storage space after compression.
> >
> > Any new meta field added adds other overhead in terms of evolving the schema, and so forth. Are you suggesting this is not possible to do without a new meta field?
>
> An implementation without a new meta field would be more elegant, but for me, who is not yet familiar with the Hudi source code, it is somewhat difficult to implement; it would not be a problem for experts. If it is implemented without adding new meta fields, I hope I can participate in some of the simpler development, and learn how the experts do it.
>
> > On Thu, Mar 16, 2023 at 2:22 AM 吕虎 <lvh...@163.com> wrote:
> >
> > > Hello,
> > > I feel very honored that you are interested in my views.
> > > Here are some of my thoughts, inline below.
> > >
> > > At 2023-03-16 13:18:08, "Vinoth Chandar" <vin...@apache.org> wrote:
> > >
> > > > Thanks for the proposal! Some first set of questions here.
> > > >
> > > > > You need to pre-select the number of buckets and use the hash function to determine which bucket a record belongs to.
> > > > > when building the table according to the estimated amount of data, and it cannot be changed after building the table
> > > > > When the amount of data in a hash partition is too large, the data in that partition will be split into multiple files in the way of the Bloom index.
> > > >
> > > > All these issues related to bucket sizing could be alleviated by the consistent hashing index in 0.13? Have you checked it out? Love to hear your thoughts on this.
> > >
> > > Hash partitioning is applicable to tables whose exact data volume cannot be given, but for which a rough range can be estimated. For example, if a company currently has 300 million customers in the United States, it will have at most 7 billion customers worldwide. In this scenario, hash partitioning copes with data growth within the known range by directly adding files and building Bloom filters, and can still guarantee performance.
> > > The consistent hash bucket index is also very valuable, but under data growth it still involves redistributing the data records of some data files, which affects performance. When it is completely impossible to estimate the range of the data volume, consistent hashing is very suitable.
> > >
> > > > > you can directly search the data under the partition, which greatly reduces the scope of the Bloom filter's search for files and reduces the false positives of the Bloom filter.
> > > >
> > > > the bloom index is already partition aware and unless you use the global version can achieve this. Am I missing something?
> > > >
> > > > Another key thing is - if we can avoid adding a new meta field, that would be great. Is it possible to implement this similar to bucket index, based on just table properties?
> > >
> > > Adding a hash partition field to the table implements the hash partition function while reusing the existing partitioning machinery, and it involves very few code changes. Because the hash partition field values within a parquet file in columnar storage format are all equal, the added column hardly occupies any storage space after compression.
> > > Of course, it is not strictly necessary to add a hash partition field to the table; the hash partition value could instead be stored in the corresponding metadata. But then it would be difficult to reuse the existing functionality: building hash partitions and pruning them at query time would need more development and testing effort.
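> > > A rough sketch of the idea in Spark (illustrative only; the function name and the exact hash are mine, the real code is in the PR):
> > >
> > >     import org.apache.spark.sql.DataFrame
> > >     import org.apache.spark.sql.functions.{col, hash, lit, pmod}
> > >
> > >     // Derive a deterministic partition value from the configured hash field
> > >     // and attach it as an ordinary column, so the existing partition-path
> > >     // and pruning machinery is reused unchanged.
> > >     def withHashPartition(df: DataFrame, hashField: String, numPartitions: Int): DataFrame =
> > >       df.withColumn("_hoodie_hash_partition", pmod(hash(col(hashField)), lit(numPartitions)))
> > >
> > > Within any single parquet file the _hoodie_hash_partition value is one constant, which is why it compresses to almost nothing.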
> > > > On Sat, Feb 18, 2023 at 8:18 PM 吕虎 <lvh...@163.com> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > Here is my proposal. Thank you very much for reading it. I am looking forward to your agreement to create an RFC for it.
> > > > >
> > > > > Background
> > > > >
> > > > > In order to deal with the problem that modifying a small amount of local data requires rewriting the data of an entire partition, Hudi divides each partition into multiple File Groups, each identified by a File ID. This way, when a small amount of local data is modified, only the data of the corresponding File Group needs to be rewritten. Hudi consistently maps a given Hudi record to a File ID through the index mechanism. The mapping between Record Key and File Group/File ID does not change once the first version of a record is written.
> > > > >
> > > > > At present, Hudi's indexes mainly include the Bloom filter index, the HBase index, and the bucket index. The Bloom filter index has a false positive problem: when a large amount of data results in a large number of File Groups, the false positives multiply and lead to poor performance. The HBase index depends on an external HBase database, may become inconsistent with the table, and ultimately increases operations and maintenance costs. The bucket index makes each bucket correspond to one File Group: you pre-select the number of buckets and use the hash function to determine which bucket a record belongs to. Therefore the mapping between the Record Key and the File Group/File ID is determined directly by the hash function. With the bucket index, you must fix the number of buckets when building the table, based on the estimated amount of data, and it cannot be changed afterwards. An unreasonable number of buckets seriously affects performance. Unfortunately, the amount of data is often unpredictable and keeps growing.
> > > > >
> > > > > Hash partition feasibility
> > > > >
> > > > > In this context, I put forward the idea of hash partitioning. The principle is similar to the bucket index, but in addition to the advantages of the bucket index, hash partitioning can retain the Bloom index: when the amount of data in a hash partition grows too large, the data in that partition is split into multiple files in the manner of the Bloom index. Therefore, the bucket index's heavy dependence on the number of buckets does not exist for hash partitions. Compared with the Bloom index, when searching for a record you can search directly under its hash partition, which greatly reduces the set of files the Bloom filters must check and so reduces Bloom filter false positives.
> > > > >
> > > > > Design of a simple hash partition implementation
> > > > >
> > > > > The idea is to use the capabilities of the ComplexKeyGenerator to implement hash partitioning: the hash partition field is one of the partition fields of the ComplexKeyGenerator.
> > > > >
> > > > > When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition is added to the table as one of the partition keys.
> > > > >
> > > > > If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate is automatically added to the query statement for partition pruning.
> > > > >
> > > > > Advantages of this design: simple implementation, no modification of core functions, so low risk.
> > > > >
> > > > > The above design has been implemented in PR 7984:
> > > > >
> > > > > https://github.com/apache/hudi/pull/7984
> > > > >
> > > > > For how to use the hash partition in the Spark data source, refer to
> > > > >
> > > > > https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
> > > > > #testHashPartition
> > > > >
> > > > > Perhaps for experts the implementation in the PR is not elegant enough. I also look forward to a more elegant implementation, from which I can learn more.
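> > > > > A minimal sketch of the usage (hash.partition.fields is the key named above; the other option keys are the standard Hudi Spark ones, the values are illustrative, and df is assumed to be a DataFrame with an id column; see #testHashPartition for the exact configuration):
> > > > >
> > > > >     df.write.format("hudi")
> > > > >       .option("hoodie.table.name", "customers")            // illustrative name
> > > > >       .option("hoodie.datasource.write.recordkey.field", "id")
> > > > >       .option("hoodie.datasource.write.keygenerator.class",
> > > > >               "org.apache.hudi.keygen.ComplexKeyGenerator")
> > > > >       .option("hash.partition.fields", "id")               // field(s) to hash
> > > > >       .option("hoodie.datasource.write.partitionpath.field",
> > > > >               "_hoodie_hash_partition")                    // partition fields contain _hoodie_hash_partition
> > > > >       .mode("append")
> > > > >       .save("/tmp/hudi/customers")
> > > > >
> > > > > At query time, a predicate on the hashed field, e.g. WHERE id = 12345, then gets _hoodie_hash_partition = X added automatically, pruning the scan to a single hash partition.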
> > > > > Problems with hash partitioning:
> > > > >
> > > > > Because SparkSQL does not know about the existence of hash partitions, when two hash-partitioned tables are joined it may not produce the optimal execution plan. However, because Hudi can still prune each table individually, joining two tables still performs much better than without hash partitioning.
> > > > >
> > > > > The same problem exists for group-by: SparkSQL does not know that aggregation has already been completed within a hash partition, so it may aggregate the per-partition aggregation results again. However, since the data volume after per-partition aggregation is very small, the redundant aggregation has little impact on overall performance.
> > > > >
> > > > > I'm not sure whether bucket indexes have the same problem as hash partitions.
> > > > >
> > > > > At 2023-02-17 05:18:19, "Y Ethan Guo" <yi...@apache.org> wrote:
> > > > >
> > > > > > +1 Thanks Lvhu for bringing up the idea. As Alexey suggested, it would be good for you to write down the proposal with design details for discussion in the community.
> > > > > >
> > > > > > On Thu, Feb 16, 2023 at 11:28 AM Alexey Kudinkin <ale...@onehouse.ai> wrote:
> > > > > >
> > > > > > > Thanks for your contribution, Lvhu!
> > > > > > >
> > > > > > > I think we should actually kick-start this effort with a small RFC outlining the proposed changes first, as this is modifying the core read-flow for all Hudi tables and we want to make sure our approach there is rock-solid.
> > > > > > >
> > > > > > > On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <lvh...@163.com> wrote:
> > > > > > >
> > > > > > > > Hi folks,
> > > > > > > > PR 7984 [ https://github.com/apache/hudi/pull/7984 ] implements hash partitioning.
> > > > > > > > As you know, it is often difficult to find an appropriate partition key in existing big data. Hash partitioning can easily solve this problem, and it can greatly improve the performance of Hudi's big data processing.
> > > > > > > > The idea is to use the hash partition field as one of the partition fields of the ComplexKeyGenerator, so this PR does not involve modifying the logic of the core code.
> > > > > > > > The code is easy to review, but I think the hash partition is very useful. We really need it.
> > > > > > > > For how to use the hash partition in the Spark data source, refer to
> > > > > > > > https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
> > > > > > > > #testHashPartition
> > > > > > > > There is no public API or user-facing feature change, nor any performance impact, if the hash partition parameters are not specified.
> > > > > > > > When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition is added to the table as one of the partition keys.
> > > > > > > > If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate is automatically added to the query statement for partition pruning.
> > > > > > > > Hope folks help and review!
> > > > > > > > Thanks!
> > > > > > > > Lvhu