Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-08-31 Thread Abhishek Modi
+1

This sounds really interesting! I like that this implicitly gives Hudi the
ability to do transformations on ingested data :)

On Sun, Aug 30, 2020 at 10:59 PM vino yang  wrote:

> Hi everyone,
>
>
> For a long time, people in the big data field have hoped that their tools
> could unlock more of the processing and analysis power of their data. At
> present, from an API perspective, Hudi mostly provides APIs related to data
> ingestion and relies on the various query engines on the query side to
> expose capabilities, but it does not provide a convenient API for
> processing data after a transactional write.
>
> Currently, if a user wants to process the incremental data of a commit
> that has just completed, they need to go through three steps (a sketch of
> steps 2 and 3 follows the list):
>
>
>1.
>
>Write data to a hudi table;
>2.
>
>Query or check completion of commit;
>3.
>
>After the data is committed, the data is found out through incremental
>query, and then the data is processed;
>
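> For illustration, steps 2 and 3 might look roughly like this with the
> Spark datasource (a sketch only; the incremental query option keys reflect
> my understanding of the current release, and the instant and table path
> are placeholders):
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> public class IncrementalPullExample {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder().appName("incr-pull").getOrCreate();
>
>     // Step 2: determine the instant to read from, e.g. by inspecting the
>     // timeline for the commit that just completed (HoodieDataSourceHelpers
>     // can help here); the begin instant below is an exclusive lower bound.
>     String beginTime = "20200830105900";
>
>     // Step 3: the incremental query pulls only records committed after beginTime.
>     Dataset<Row> incremental = spark.read().format("hudi")
>         .option("hoodie.datasource.query.type", "incremental")
>         .option("hoodie.datasource.read.begin.instanttime", beginTime)
>         .load("/path/to/hudi_table");
>
>     incremental.createOrReplaceTempView("incremental_batch");
>     spark.sql("SELECT count(*) FROM incremental_batch").show();  // process the data
>   }
> }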
>
> If you want a tighter loop here, you can use Hudi's recently added write
> commit callback mechanism to simplify this into two steps:
>
>
>1.
>
>Write data to a hudi table;
>2.
>
>Based on the written commit callback function to trigger an incremental
>query to find out the data, and then perform data processing;
>
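> For example, the callback could look something like this (a sketch that
> assumes the HoodieWriteCommitCallback interface wired in via
> hoodie.write.commit.callback.class; exact package and getter names may
> differ by version):
>
> import org.apache.hudi.callback.HoodieWriteCommitCallback;
> import org.apache.hudi.callback.common.HoodieWriteCommitCallbackMessage;
>
> public class ProcessAfterCommitCallback implements HoodieWriteCommitCallback {
>   @Override
>   public void call(HoodieWriteCommitCallbackMessage message) {
>     // Invoked once the commit completes; kick off an incremental query
>     // from this instant (as in the earlier snippet) and run the
>     // user-defined processing on its result.
>     triggerIncrementalProcessing(message.getBasePath(), message.getCommitTime());
>   }
>
>   private void triggerIncrementalProcessing(String basePath, String commitTime) {
>     // user-defined processing of the newly committed data goes here
>   }
> }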
>
> However, even two steps are still cumbersome for scenarios that want more
> timely and efficient analysis directly on the data ingest pipeline.
> Therefore, I propose to merge the entire process into one step and provide
> a set of incremental (or "pipelined") processing APIs based on it:
>
> Write the data to a hudi table and, once the written data is available as
> a JavaRDD, directly apply a user-defined function (UDF) to process it. The
> processing behavior falls into two categories:
>
>
>1.
>
>Conventional conversion such as Map/Filter/Reduce;
>2.
>
>Aggregation calculation based on fixed time window;
>
>
> And these calculation functions should be engine independent. Therefore, I
> plan to introduce some new APIs that allow users to directly define
> incremental processing capabilities after each writing operation.
>
> The preliminary idea is that we can introduce a tool class named, for
> example, IncrementalProcessingBuilder or PipelineBuilder, which could be
> used like this:
>
> IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
>
> builder.source()     // source table
>
>     .transform()
>
>     .sink()          // derived table
>
>     .build();
>
> IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> records, HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiFilterFunction filterFunction);
>
> //window function
>
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiAggregateFunction aggFunction);
>
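> To make "engine independent" concrete, the function interfaces referenced
> above could be plain serializable interfaces along these lines (part of
> the proposal, not an existing Hudi API):
>
> public interface HudiMapFunction<I, O> extends java.io.Serializable {
>   O map(I record) throws Exception;
> }
>
> public interface HudiFilterFunction<I> extends java.io.Serializable {
>   boolean filter(I record) throws Exception;
> }
>
> // Aggregation over a fixed time window (e.g. one commit interval).
> public interface HudiAggregateFunction<I, ACC, O> extends java.io.Serializable {
>   ACC createAccumulator();
>   ACC add(I record, ACC accumulator);
>   O getResult(ACC accumulator);
> }
>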
> The approach is suitable for scenarios where the commit interval (window)
> is moderate and the end-to-end latency of data ingestion is not a primary
> concern.
>
>
> What do you think? Looking forward to your thoughts and opinions.
>
>
> Best,
>
> Vino
>


Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

2020-08-21 Thread Abhishek Modi
@sivabalan the current plan is to only add this for hoodie_record_key. But
I'm hoping to make the implementation general enough to add other columns
as well going forward :)

On Fri, Aug 21, 2020 at 11:49 AM Sivabalan  wrote:

> +1 for virtual record keys. Do you also propose to generalize this for
> partition path as well ?
>
>
> On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma 
> wrote:
>
> > This is a good option to have. :)
> >
> > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar 
> wrote:
> >
> > > IIRC _hoodie_record_key was supposed to be this standardized key field. :)
> > > Anyways, it's good to provide this option to the user.
> > > So +1 for RFC/further discussion.
> > >
> > > To level set, I want to also share some of the benefits of having an
> > > explicit key column.
> > > a) if you build your data lake using a bunch of hudi tables, now you
> > have a
> > > standardized data model
> > > b) Even if your key generator changes, it does not affect the existing
> > > data's keys, and updates will be matched correctly.
> > >
> > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan
> > >  wrote:
> > >
> > > >  +1. This should be good to have as an option. If everybody agrees,
> > > > please go ahead with RFC and we can discuss details there.
> > > > Balaji.V
> > > >
> > > > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi
> > > >  wrote:
> > > >
> > > >  Hi everyone!
> > > >
> > > > I was hoping to discuss adding support for making
> `_hoodie_record_key`
> > a
> > > > virtual column :)
> > > >
> > > > Context:
> > > > Currently, _hoodie_record_key is written to DFS, as a column in the
> > > Parquet
> > > > file. In our production systems at Uber however, _hoodie_record_key
> > > > contains data that can be found in a different column (or set of
> > > columns).
> > > > This means that we are storing duplicated data.
> > > >
> > > > Proposal:
> > > > In the interest of improving storage efficiency, we could add confs /
> > > > abstract classes that can construct the _hoodie_record_key given
> other
> > > > columns. That way we do not have to store duplicated data on DFS.
> > > >
> > > > Any thoughts on this?
> > > >
> > > > Best,
> > > > Modi
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


[DISCUSS] Support for `_hoodie_record_key` as a virtual column

2020-08-18 Thread Abhishek Modi
Hi everyone!

I was hoping to discuss adding support for making `_hoodie_record_key` a
virtual column :)

Context:
Currently, _hoodie_record_key is written to DFS, as a column in the Parquet
file. In our production systems at Uber however, _hoodie_record_key
contains data that can be found in a different column (or set of columns).
This means that we are storing duplicated data.

Proposal:
In the interest of improving storage efficiency, we could add confs /
abstract classes that can construct the _hoodie_record_key given other
columns. That way we do not have to store duplicated data on DFS.
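
To make the idea concrete, the abstraction could look something like this
(class, method, and column names below are placeholders for discussion, not
an existing Hudi API):

import org.apache.avro.generic.GenericRecord;

// Reconstructs _hoodie_record_key from data columns at read/merge time,
// so the key never has to be materialized in the Parquet file.
public abstract class VirtualRecordKeyResolver implements java.io.Serializable {
  public abstract String resolveRecordKey(GenericRecord row);
}

// Example: the record key is simply the value of a single "uuid" column
// (hypothetical column name).
public class SingleColumnKeyResolver extends VirtualRecordKeyResolver {
  private final String keyColumn = "uuid";

  @Override
  public String resolveRecordKey(GenericRecord row) {
    return String.valueOf(row.get(keyColumn));
  }
}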

Any thoughts on this?

Best,
Modi


[DISCUSS] Adding Metrics to Hudi Common

2020-07-27 Thread Abhishek Modi
Hi Everyone!

I'm hoping to have a discussion around adding a lightweight metrics class
to Hudi Common. There are parts of Hudi Common that have large performance
implications, and I think adding metrics to these parts will help us track
Hudi's health in production and help us understand the performance
implications of changes we make.

I've opened a Jira on this topic -
https://issues.apache.org/jira/browse/HUDI-1025. This jira specifically
suggests starting with HoodieWrapperFileSystem, as this class has
performance implications not just for Hudi, but also for the underlying
DFS.
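
To give a sense of what a lightweight metrics class could look like, here
is a rough sketch (all names are placeholders for discussion, not an
existing Hudi class):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class HoodieCommonMetrics {
  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  // Count an event, e.g. a call into HoodieWrapperFileSystem#open.
  public void increment(String name) {
    counters.computeIfAbsent(name, k -> new LongAdder()).increment();
  }

  // Accumulate a quantity, e.g. bytes read or call latency in millis.
  public void add(String name, long amount) {
    counters.computeIfAbsent(name, k -> new LongAdder()).add(amount);
  }

  public long get(String name) {
    LongAdder adder = counters.get(name);
    return adder == null ? 0L : adder.sum();
  }
}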

Looking forward to everyone's opinions on this :)

Best,
Modi