Re: [DISCUSS] Introduce incremental processing API in Hudi
+1 This sounds really interesting! I like that this implicitly gives Hudi the ability to do transformations on ingested data :)

On Sun, Aug 30, 2020 at 10:59 PM vino yang wrote:
> Hi everyone,
>
> For a long time, in the field of big data, people have hoped that their tools can make fuller use of big data's processing and analysis capabilities. At present, from an API perspective, Hudi mostly provides APIs related to data ingestion and relies on the various big data query engines on the query side, but it does not provide a convenient API for processing data after a transactional write.
>
> Currently, if a user wants to process the incremental data of a commit that has just completed, they need to go through three steps:
>
> 1. Write data to a Hudi table;
> 2. Query or check completion of the commit;
> 3. After the data is committed, fetch it through an incremental query, and then process it.
>
> If you want a quicker link here, you can use Hudi's write commit callback function to simplify it into two steps:
>
> 1. Write data to a Hudi table;
> 2. In the commit callback, trigger an incremental query to fetch the data, and then perform the processing.
>
> However, splitting this into two steps is still cumbersome for scenarios that want timely and efficient analysis on the data ingestion pipeline. Therefore, I propose to merge the entire process into one step and provide a set of incremental (or pipelined) processing APIs on top of it:
>
> Write the data to a Hudi table; after obtaining the data through a JavaRDD<...>, directly apply a user-defined function (UDF) to process it. The processing behavior can be described in these two steps:
>
> 1. Conventional transformations such as Map/Filter/Reduce;
> 2. Aggregation based on a fixed time window.
>
> These calculation functions should be engine independent. Therefore, I plan to introduce some new APIs that allow users to directly define incremental processing behavior after each write operation.
>
> The preliminary idea is to introduce a tool class, for example named IncrementalProcessingBuilder or PipelineBuilder, which can be used like this:
>
> IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
> builder.source()     // source table
>        .transform()
>        .sink()       // derived table
>        .build();
>
> IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<...> records, HudiMapFunction mapFunction);
> IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<...> records, HudiMapFunction mapFunction);
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<...> records, HudiFilterFunction filterFunction);
>
> // window function
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<...> records, HudiAggregateFunction aggFunction);
>
> It is suitable for scenarios where the commit interval (window) is moderate and ingestion latency is not a major concern.
>
> What do you think? Looking forward to your thoughts and opinions.
>
> Best,
> Vino
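The source/transform/sink chain proposed above can be sketched with plain Java collections. This is a minimal, hypothetical illustration of the builder shape only: the class and method names are invented for this sketch and are not Hudi APIs, and a real implementation would operate on the records of a just-completed commit rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed source -> transform -> sink pipeline.
// All names here are illustrative, not existing Hudi classes.
public class IncrementalPipelineSketch {

    static class Pipeline<T> {
        private final List<T> records;

        private Pipeline(List<T> records) {
            this.records = records;
        }

        // source(): in the proposal, this would yield the records of the
        // commit that just completed (the "source table" side).
        static <T> Pipeline<T> source(List<T> newlyCommitted) {
            return new Pipeline<>(newlyCommitted);
        }

        // A conventional map transformation (step 1 in the proposal).
        <R> Pipeline<R> map(Function<T, R> fn) {
            List<R> out = new ArrayList<>();
            for (T t : records) {
                out.add(fn.apply(t));
            }
            return new Pipeline<>(out);
        }

        // A conventional filter transformation.
        Pipeline<T> filter(Predicate<T> predicate) {
            List<T> out = new ArrayList<>();
            for (T t : records) {
                if (predicate.test(t)) {
                    out.add(t);
                }
            }
            return new Pipeline<>(out);
        }

        // sink(): in the proposal, this would write to a derived table.
        List<T> sink() {
            return records;
        }
    }
}
```

A caller would chain these much like the quoted builder example: `source(...).map(...).filter(...).sink()`, with the engine-independent functions plugged in as lambdas.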
Re: [DISCUSS] Introduce incremental processing API in Hudi
+1 This will give Hudi more capabilities besides data ingestion and writing, and make Hudi-based data processing more timely!

Best,
wangxianghu

From: Abhishek Modi
Sent: August 31, 2020, 15:01
To: dev@hudi.apache.org
Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
Re: DevX, Test infra Rgdn
+1 this is a great way to also ramp up on the code base

On Sun, Aug 30, 2020 at 8:00 AM Sivabalan wrote:
> As Hudi matures as a project, we need to get our devX and test infra rock solid: availability of test utils and base classes for ease of writing more tests, stable integration tests, ease of debuggability, micro benchmarks, performance test infra, automated checkstyle formatting, nightly snapshot builds, and so on.
>
> We have identified and categorized these into the areas below:
>
> - Test fixes and some cleanup. // There are a lot of jira tickets lying around in this section.
> - Test refactoring. // For ease of development, and to reduce clutter, we need to refactor the test infra: more test utils, base classes, etc.
> - More tests to improve coverage in some areas.
> - CI stability and ease of debugging integration tests.
> - Checkstyle, slf4j, warnings, spotless, etc.
> - Micro benchmarks. // Add a benchmarking framework to Hudi, then identify regressions on key paths.
> - Long running test suite.
> - Config cleanups in the hudi client.
> - Perf test environment.
> - Nightly builds.
>
> As we plan out the work in each of these sections, we are looking for help from the community in getting it done. The plan is to put together a few umbrella tickets for each of these areas, each with a coordinator: someone who has expertise in the area of interest. The coordinator will plan out the work in their respective area and help drive the initiative, with help from the community depending on who volunteers.
>
> I understand the list is huge. Some work areas are well defined and should get done if we allocate enough time and resources, but some are exploratory in nature and need an initial push to get the ball rolling.
>
> Very likely some of the work items would be well defined and easy for new folks to contribute. We do not have a target timeframe in mind (as we had one month for the bug bash), but we would like to get concrete work items done in decent time and have others ready by the next major release (e.g., the perf test env), depending on resources.
>
> Let us know if you would be interested in helping our community in this regard.
>
> --
> Regards,
> -Sivabalan
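On the micro-benchmark item above: before a proper framework is adopted, the idea can be sketched with a bare-bones timing harness. This is illustrative only — the class below is invented for this sketch, is not existing Hudi code, and a real effort would likely use a dedicated harness (JMH is one common choice on the JVM) to avoid JIT and dead-code pitfalls that naive timing loops suffer from.

```java
// Hypothetical sketch of a minimal timing harness, to illustrate the
// micro-benchmark idea only. Not Hudi code; a real benchmarking framework
// handles JIT warmup and dead-code elimination far more carefully.
public class MicroBenchSketch {

    // Runs `task` untimed `warmup` times (to let the JIT settle), then
    // `iters` timed runs, returning the average nanoseconds per run.
    static long averageNanos(Runnable task, int warmup, int iters) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / iters;
    }
}
```

Even a crude harness like this can flag order-of-magnitude regressions on key paths between builds, which is the stated goal of the benchmark item.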
Re: Hudi Writer vs Spark Parquet Writer - Sync
Hi Felix,

For read-side performance, we are focused on adding clustering support (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance) and consolidated metadata (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements) in the next release. The clustering support is much more generic and provides the capability to dynamically organize the data to suit query performance. Please take a look at those RFCs.

Balaji.V

On Sunday, August 30, 2020, 02:16:29 PM PDT, Kizhakkel Jose, Felix wrote:

Hello All,

Hive has the bucketBy feature, and Spark is going to add Hive-style bucketBy support for data sources; once implemented, it will largely benefit read performance. Since Hudi takes a different path when writing parquet data, are we planning to add bucketBy functionality? Spark keeps adding writer features that improve read performance, so with Hudi having its own writer, are we tracking these new Spark features so that the Hudi writer does not greatly lag behind the Spark file (parquet) writer or lack features?

Regards,
Felix K Jose
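For readers unfamiliar with the feature being discussed: bucketBy assigns each record to one of a fixed number of buckets by hashing its key, so a point lookup or bucketed join only needs to read one bucket's files instead of the whole table. The sketch below shows that core idea only; the class name is invented for illustration and this is not a Hudi or Spark API, and real engines pin down the exact hash function so writers and readers agree.

```java
// Conceptual sketch of hash bucketing (what Hive/Spark bucketBy does at its
// core). Illustrative only: not a Hudi or Spark API. The same formula must be
// used on both the write path (to place records) and the read path (to prune).
public class BucketingSketch {

    // Maps a record key to a bucket id in [0, numBuckets).
    static int bucketFor(String key, int numBuckets) {
        // floorMod guards against negative hashCode values.
        return Math.floorMod(key.hashCode(), numBuckets);
    }
}
```

Because the mapping is deterministic, a query filtering on `key = 'user-42'` can compute the bucket id and skip every other bucket's files — the read-side benefit Felix describes.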
Re: DevX, Test infra Rgdn
+1. This would be a great contribution as all developers will benefit from this work.

On Monday, August 31, 2020, 08:07:08 AM PDT, Vinoth Chandar wrote:

+1 this is a great way to also ramp up on the code base