Re: [DISCUSS] Introduce incremental processing API in Hudi
+1 This sounds really interesting! I like that this implicitly gives Hudi the ability to do transformations on ingested data :)

On Sun, Aug 30, 2020 at 10:59 PM vino yang wrote:
> Hi everyone,
>
> For a long time, in the field of big data, people have hoped that their tools can make fuller use of big data's processing and analysis capabilities. At present, from an API perspective, Hudi mostly provides APIs related to data ingestion and relies on the various big data query engines on the query side, but it does not provide a convenient API for processing data after a transactional write.
>
> Currently, if a user wants to process the incremental data of a commit that has just completed, they need to go through three steps:
>
> 1. Write data to a Hudi table;
> 2. Query or check completion of the commit;
> 3. After the data is committed, fetch it through an incremental query, and then process it.
>
> If you want a quicker link here, you can use Hudi's write commit callback function to simplify it into two steps:
>
> 1. Write data to a Hudi table;
> 2. In the commit callback, trigger an incremental query to fetch the data, and then perform the processing.
>
> However, splitting this into two steps is still cumbersome for scenarios that want timely and efficient analysis on the data ingestion pipeline. Therefore, I propose to merge the entire process into one step and provide a set of incremental (or pipelined) processing APIs on top of it:
>
> Write the data to a Hudi table; after obtaining the data through a JavaRDD<...>, directly apply a user-defined function (UDF) to process it. The processing behavior can be described in these two steps:
>
> 1. Conventional transformations such as Map/Filter/Reduce;
> 2. Aggregation based on a fixed time window.
>
> These calculation functions should be engine independent. Therefore, I plan to introduce some new APIs that allow users to directly define incremental processing behavior after each write operation.
>
> The preliminary idea is to introduce a tool class, for example named IncrementalProcessingBuilder or PipelineBuilder, which can be used like this:
>
> IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
> builder.source()     // source table
>        .transform()
>        .sink()       // derived table
>        .build();
>
> IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<...> records, HudiMapFunction mapFunction);
> IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<...> records, HudiMapFunction mapFunction);
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<...> records, HudiFilterFunction filterFunction);
>
> // window function
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<...> records, HudiAggregateFunction aggFunction);
>
> It is suitable for scenarios where the commit interval (window) is moderate and ingestion latency is not a major concern.
>
> What do you think? Looking forward to your thoughts and opinions.
>
> Best,
> Vino
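The source/transform/sink chain proposed above can be sketched with plain Java collections. This is a minimal, hypothetical illustration of the builder shape only: the class and method names are invented for this sketch and are not Hudi APIs, and a real implementation would operate on the records of a just-completed commit rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed source -> transform -> sink pipeline.
// All names here are illustrative, not existing Hudi classes.
public class IncrementalPipelineSketch {

    static class Pipeline<T> {
        private final List<T> records;

        private Pipeline(List<T> records) {
            this.records = records;
        }

        // source(): in the proposal, this would yield the records of the
        // commit that just completed (the "source table" side).
        static <T> Pipeline<T> source(List<T> newlyCommitted) {
            return new Pipeline<>(newlyCommitted);
        }

        // A conventional map transformation (step 1 in the proposal).
        <R> Pipeline<R> map(Function<T, R> fn) {
            List<R> out = new ArrayList<>();
            for (T t : records) {
                out.add(fn.apply(t));
            }
            return new Pipeline<>(out);
        }

        // A conventional filter transformation.
        Pipeline<T> filter(Predicate<T> predicate) {
            List<T> out = new ArrayList<>();
            for (T t : records) {
                if (predicate.test(t)) {
                    out.add(t);
                }
            }
            return new Pipeline<>(out);
        }

        // sink(): in the proposal, this would write to a derived table.
        List<T> sink() {
            return records;
        }
    }
}
```

A caller would chain these much like the quoted builder example: `source(...).map(...).filter(...).sink()`, with the engine-independent functions plugged in as lambdas.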
Re: [DISCUSS] Introduce incremental processing API in Hudi
+1 This will give Hudi more capabilities besides data ingestion and writing, and make Hudi-based data processing more timely!

Best,
wangxianghu

From: Abhishek Modi
Sent: August 31, 2020, 15:01
To: dev@hudi.apache.org
Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
Re: DevX, Test infra Rgdn
+1 this is a great way to also ramp up on the code base

On Sun, Aug 30, 2020 at 8:00 AM Sivabalan wrote:
> As Hudi matures as a project, we need to get our devX and test infra rock solid: availability of test utils and base classes for ease of writing more tests, stable integration tests, ease of debuggability, micro benchmarks, performance test infra, automated checkstyle formatting, nightly snapshot builds, and so on.
>
> We have identified and categorized these into the areas below:
>
> - Test fixes and some cleanup. // There are a lot of jira tickets lying around in this section.
> - Test refactoring. // For ease of development, and to reduce clutter, we need to refactor the test infra: more test utils, base classes, etc.
> - More tests to improve coverage in some areas.
> - CI stability and ease of debugging integration tests.
> - Checkstyle, slf4j, warnings, spotless, etc.
> - Micro benchmarks. // Add a benchmarking framework to Hudi, then identify regressions on key paths.
> - Long running test suite.
> - Config cleanups in the hudi client.
> - Perf test environment.
> - Nightly builds.
>
> As we plan out the work in each of these sections, we are looking for help from the community in getting it done. The plan is to put together a few umbrella tickets for each of these areas, each with a coordinator: someone who has expertise in the area of interest. The coordinator will plan out the work in their respective area and help drive the initiative, with help from the community depending on who volunteers.
>
> I understand the list is huge. Some work areas are well defined and should get done if we allocate enough time and resources, but some are exploratory in nature and need an initial push to get the ball rolling.
>
> Very likely some of the work items would be well defined and easy for new folks to contribute. We do not have a target timeframe in mind (as we had one month for the bug bash), but we would like to get concrete work items done in decent time and have others ready by the next major release (e.g., the perf test env), depending on resources.
>
> Let us know if you would be interested in helping our community in this regard.
>
> --
> Regards,
> -Sivabalan
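On the micro-benchmark item above: before a proper framework is adopted, the idea can be sketched with a bare-bones timing harness. This is illustrative only — the class below is invented for this sketch, is not existing Hudi code, and a real effort would likely use a dedicated harness (JMH is one common choice on the JVM) to avoid JIT and dead-code pitfalls that naive timing loops suffer from.

```java
// Hypothetical sketch of a minimal timing harness, to illustrate the
// micro-benchmark idea only. Not Hudi code; a real benchmarking framework
// handles JIT warmup and dead-code elimination far more carefully.
public class MicroBenchSketch {

    // Runs `task` untimed `warmup` times (to let the JIT settle), then
    // `iters` timed runs, returning the average nanoseconds per run.
    static long averageNanos(Runnable task, int warmup, int iters) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / iters;
    }
}
```

Even a crude harness like this can flag order-of-magnitude regressions on key paths between builds, which is the stated goal of the benchmark item.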
Re: Hudi Writer vs Spark Parquet Writer - Sync
Hi Felix,

For read-side performance, we are focused on adding clustering support (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance) and consolidated metadata (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements) in the next release. The clustering support is much more generic and provides the capability to dynamically organize the data to suit query performance. Please take a look at those RFCs.

Balaji.V

On Sunday, August 30, 2020, 02:16:29 PM PDT, Kizhakkel Jose, Felix wrote:

Hello All,

Hive has the bucketBy feature, and Spark is going to add Hive-style bucketBy support for data sources; once implemented, it will largely benefit read performance. Since Hudi takes a different path when writing parquet data, are we planning to add bucketBy functionality? Spark keeps adding writer features that improve read performance, so with Hudi having its own writer, are we tracking these new Spark features so that the Hudi writer does not greatly lag behind the Spark file (parquet) writer or lack features?

Regards,
Felix K Jose
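For readers unfamiliar with the feature being discussed: bucketBy assigns each record to one of a fixed number of buckets by hashing its key, so a point lookup or bucketed join only needs to read one bucket's files instead of the whole table. The sketch below shows that core idea only; the class name is invented for illustration and this is not a Hudi or Spark API, and real engines pin down the exact hash function so writers and readers agree.

```java
// Conceptual sketch of hash bucketing (what Hive/Spark bucketBy does at its
// core). Illustrative only: not a Hudi or Spark API. The same formula must be
// used on both the write path (to place records) and the read path (to prune).
public class BucketingSketch {

    // Maps a record key to a bucket id in [0, numBuckets).
    static int bucketFor(String key, int numBuckets) {
        // floorMod guards against negative hashCode values.
        return Math.floorMod(key.hashCode(), numBuckets);
    }
}
```

Because the mapping is deterministic, a query filtering on `key = 'user-42'` can compute the bucket id and skip every other bucket's files — the read-side benefit Felix describes.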
Re: DevX, Test infra Rgdn
+1. This would be a great contribution as all developers will benefit from this work.

On Monday, August 31, 2020, 08:07:08 AM PDT, Vinoth Chandar wrote:

+1 this is a great way to also ramp up on the code base