Re: [DISCUSS] Introduce incremental processing API in Hudi
+1 This sounds really interesting! I like that this implicitly gives Hudi
the ability to do transformations on ingested data :)

On Sun, Aug 30, 2020 at 10:59 PM vino yang wrote:

> Hi everyone,
>
> For a long time, in the field of big data, people have hoped that their
> tools could fully leverage big data's processing and analysis
> capabilities. At present, from the perspective of API, Hudi mostly
> provides APIs related to data ingestion and relies on the various big
> data query engines on the query side to unlock those capabilities, but it
> does not provide a convenient API for processing data after a
> transactional write.
>
> Currently, if a user wants to process the incremental data of a commit
> that has just completed, they need to go through three steps:
>
> 1. Write data to a hudi table;
> 2. Query or check completion of the commit;
> 3. After the data is committed, fetch it via an incremental query, and
>    then process it.
>
> If you want a quicker link here, you can use Hudi's recently added write
> commit callback to simplify this into two steps:
>
> 1. Write data to a hudi table;
> 2. Use the write commit callback to trigger an incremental query that
>    fetches the data, and then process it.
>
> However, even two steps are still cumbersome for scenarios that want more
> timely and efficient analysis on the data ingest pipeline. Therefore, I
> propose to merge the entire process into one step and provide a set of
> incremental (or pipelined) processing APIs on top of it:
>
> Write the data to a hudi table, and after obtaining the data as a
> JavaRDD<HoodieRecord>, directly apply a user-defined function (UDF) to
> process it. The processing behavior can be described by these two kinds
> of operations:
>
> 1. Conventional transformations such as Map/Filter/Reduce;
> 2. Aggregation over a fixed time window.
>
> These calculation functions should be engine independent. Therefore, I
> plan to introduce some new APIs that allow users to directly define
> incremental processing after each write operation.
>
> The preliminary idea is to introduce a tool class, for example named
> IncrementalProcessingBuilder or PipelineBuilder, which can be used like
> this:
>
> IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
>
> builder.source()      // source table
>     .transform()
>     .sink()           // derived table
>     .build();
>
> IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord> records,
>     HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord> records,
>     HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord> records,
>     HudiFilterFunction filterFunction);
>
> // window function
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord> records,
>     HudiAggregateFunction aggFunction);
>
> This is suitable for scenarios where the commit interval (window) is
> moderate and ingestion latency is not a major concern.
>
> What do you think? Looking forward to your thoughts and opinions.
>
> Best,
> Vino
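To make the proposal concrete, below is a minimal, self-contained sketch of
how the builder might be used. IncrementalProcessingBuilder, the
source/transform/sink/build chain, and HudiMapFunction are names taken from
the proposal above; the stub bodies, the Event type, and the table names are
illustrative assumptions, not a working implementation.

import java.io.Serializable;

public class IncrementalPipelineSketch {

  // Engine-independent function interface, as the proposal suggests --
  // plain Java, no Spark/Flink types in the signature.
  @FunctionalInterface
  interface HudiMapFunction<I, O> extends Serializable {
    O map(I record) throws Exception;
  }

  // Stub of the proposed builder so the sketch compiles; the real API does
  // not exist yet and may look different.
  static class IncrementalProcessingBuilder {
    IncrementalProcessingBuilder source(String table) { return this; }
    <I, O> IncrementalProcessingBuilder transform(HudiMapFunction<I, O> fn) { return this; }
    IncrementalProcessingBuilder sink(String table) { return this; }
    void build() { /* would register the pipeline to run after each commit */ }
  }

  // Illustrative payload type standing in for the ingested records.
  static class Event implements Serializable {
    String userId;
    long amountCents;
  }

  public static void main(String[] args) {
    // One step instead of write -> poll commit -> incremental query:
    // declare the pipeline once; it runs after every write operation.
    new IncrementalProcessingBuilder()
        .source("raw_events")                        // source table
        .transform((Event e) -> {                    // conventional Map step
          e.amountCents = Math.max(0L, e.amountCents);
          return e;
        })
        .sink("clean_events")                        // derived table
        .build();
  }
}

A key design question for an eventual RFC would be whether these functions
run inline in the write job (adding latency to each commit) or in a
decoupled stage triggered by the commit, since that choice drives the
delay/efficiency trade-off mentioned above.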
Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column
@sivabalan the current plan is to only add this for hoodie_record_key, but
I'm hoping to make the implementation general enough to add other columns
going forward :)

On Fri, Aug 21, 2020 at 11:49 AM Sivabalan wrote:

> +1 for virtual record keys. Do you propose to generalize this for
> partition path as well?
>
> On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma wrote:
>
> > This is a good option to have. :)
> >
> > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar wrote:
> >
> > > IIRC _hoodie_record_key was supposed to be this standardized key
> > > field. :) Anyways, it's good to provide this option to the user.
> > > So +1 for RFC/further discussion.
> > >
> > > To level set, I want to also share some of the benefits of having an
> > > explicit key column:
> > > a) if you build your data lake using a bunch of hudi tables, you now
> > > have a standardized data model;
> > > b) even if your key generator changes, it does not affect the
> > > existing data's keys, and updates will be matched correctly.
> > >
> > > On Thu, Aug 20, 2020 at 10:41 AM Balaji Varadarajan wrote:
> > >
> > > > +1. This should be good to have as an option. If everybody agrees,
> > > > please go ahead with the RFC and we can discuss details there.
> > > > Balaji.V
> > > >
> > > > On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi wrote:
> > > >
> > > > Hi everyone!
> > > >
> > > > I was hoping to discuss adding support for making
> > > > `_hoodie_record_key` a virtual column :)
> > > >
> > > > Context:
> > > > Currently, _hoodie_record_key is written to DFS as a column in the
> > > > Parquet file. In our production systems at Uber, however,
> > > > _hoodie_record_key contains data that can be found in a different
> > > > column (or set of columns). This means we are storing duplicated
> > > > data.
> > > >
> > > > Proposal:
> > > > In the interest of improving storage efficiency, we could add
> > > > confs / abstract classes that can reconstruct _hoodie_record_key
> > > > from other columns. That way we do not have to store duplicated
> > > > data on DFS.
> > > >
> > > > Any thoughts on this?
> > > >
> > > > Best,
> > > > Modi
>
> --
> Regards,
> -Sivabalan
[DISCUSS] Support for `_hoodie_record_key` as a virtual column
Hi everyone!

I was hoping to discuss adding support for making `_hoodie_record_key` a
virtual column :)

Context:
Currently, _hoodie_record_key is written to DFS as a column in the Parquet
file. In our production systems at Uber, however, _hoodie_record_key
contains data that can be found in a different column (or set of columns).
This means we are storing duplicated data.

Proposal:
In the interest of improving storage efficiency, we could add confs /
abstract classes that can reconstruct _hoodie_record_key from other
columns. That way we do not have to store duplicated data on DFS.

Any thoughts on this?

Best,
Modi
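To make the "abstract classes" part concrete, here is a minimal sketch of
what read-time key recovery could look like. VirtualKeyGenerator and the
column names are hypothetical, not existing Hudi APIs; the composite
variant simply mirrors how Hudi's complex key generator concatenates
field:value pairs on the write path.

import org.apache.avro.generic.GenericRecord;

// Rebuilds _hoodie_record_key from columns that are already stored,
// instead of reading a persisted copy of the key. Hypothetical class.
public abstract class VirtualKeyGenerator {

  public abstract String recoverRecordKey(GenericRecord record);
}

// Simplest case: the key is the value of an existing "uuid" column, so
// persisting _hoodie_record_key would duplicate that data byte for byte.
class UuidVirtualKeyGenerator extends VirtualKeyGenerator {
  @Override
  public String recoverRecordKey(GenericRecord record) {
    return String.valueOf(record.get("uuid"));
  }
}

// Composite case: the key is derived from several columns, joined as
// "field:value" pairs the way the write-side composite key generator does.
class CompositeVirtualKeyGenerator extends VirtualKeyGenerator {
  private final String[] fields = {"user_id", "event_ts"};

  @Override
  public String recoverRecordKey(GenericRecord record) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(fields[i]).append(':').append(record.get(fields[i]));
    }
    return sb.toString();
  }
}

On the read path, Hudi would call recoverRecordKey instead of projecting
the stored column, trading a small amount of CPU per record for not storing
the column at all.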
[DISCUSS] Adding Metrics to Hudi Common
Hi Everyone!

I'm hoping to have a discussion around adding a lightweight metrics class
to Hudi Common. There are parts of Hudi Common that have large performance
implications, and I think adding metrics to these parts will help us track
Hudi's health in production and understand the performance implications of
changes we make.

I've opened a Jira on this topic -
https://issues.apache.org/jira/browse/HUDI-1025. It specifically suggests
adding metrics to HoodieWrapperFileSystem, as this class has performance
implications not just for Hudi, but also for the underlying DFS.

Looking forward to everyone's opinions on this :)

Best,
Modi
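To give a sense of what "lightweight" could mean here, below is a minimal
sketch of such a metrics class, assuming in-memory LongAdder counters and
no reporter dependencies. All names are hypothetical.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical lock-free metrics holder, cheap enough to sit on the
// HoodieWrapperFileSystem hot path.
public final class HoodieLocalMetrics {

  private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();
  private static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();

  private HoodieLocalMetrics() {
  }

  public static void increment(String name) {
    COUNTERS.computeIfAbsent(name, k -> new LongAdder()).increment();
  }

  public static long count(String name) {
    LongAdder c = COUNTERS.get(name);
    return c == null ? 0L : c.sum();
  }

  // Times an operation and attributes the cost to `name`, e.g. "fs.open".
  public static <T, E extends Exception> T time(String name, CheckedSupplier<T, E> op) throws E {
    long start = System.nanoTime();
    try {
      return op.get();
    } finally {
      increment(name);
      TOTAL_NANOS.computeIfAbsent(name, k -> new LongAdder())
          .add(System.nanoTime() - start);
    }
  }

  @FunctionalInterface
  public interface CheckedSupplier<T, E extends Exception> {
    T get() throws E;
  }
}

A hot-path call such as FileSystem.open could then be wrapped as
HoodieLocalMetrics.time("fs.open", () -> fs.open(path)), which costs a pair
of System.nanoTime() calls and two LongAdder updates per invocation.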