Re: [DISCUSS] Restructure hudi-utilities module
+1 on Vinoth's suggestion to wait until the lower level (write-client) has been refactored and reorganized first. We can then look at Data-Source and DeltaStreamer to work out how best to organize them.

Balaji.V

On Sunday, March 8, 2020, 11:06:13 PM PDT, Vinoth Chandar wrote:

> My opinion is that we can first focus our efforts on making hudi-client
> agnostic and pluggable with different engines. We can tackle deltastreamer
> down the line once we have it.
Re: [DISCUSS] Restructure hudi-utilities module
>> make delta streamer an engine-agnostic part so that Spark and Flink can share some common logic.

If we make the change at the Write Client level to make it engine agnostic, it should help with most of the cases. I believe there will be Spark-specific pieces in the Source abstraction, since those use Spark datasources underneath in some cases.

My opinion is that we can first focus our efforts on making hudi-client agnostic and pluggable with different engines. We can tackle deltastreamer down the line once we have it.

On Wed, Mar 4, 2020 at 6:51 PM vino yang wrote:

> Actually, my suggestion here is to move the delta streamer into a new
> module and keep the current hudi-utilities module.
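A minimal sketch of the layering described above, where the write client holds the shared logic and only a thin context object knows about the engine. This is an illustration, not code from Hudi: the names `EngineContext`, `LocalEngineContext`, `WriteClient`, and `upsert`, as well as the `commit_time` field, are hypothetical, and Python is used only to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class EngineContext(ABC):
    """Hypothetical seam: the only place that knows which engine runs the work."""

    @abstractmethod
    def map(self, records: Iterable[dict], fn: Callable[[dict], dict]) -> List[dict]:
        ...


class LocalEngineContext(EngineContext):
    """Stand-in engine that runs in-process. A Spark or Flink context would
    implement the same method on top of its own distributed collections."""

    def map(self, records, fn):
        return [fn(r) for r in records]


class WriteClient:
    """Engine-agnostic write logic: depends on EngineContext, never on an engine."""

    def __init__(self, context: EngineContext):
        self.context = context

    def upsert(self, records, commit_time):
        # Shared logic: tag every record with the commit time it is written under.
        return self.context.map(records, lambda r: {**r, "commit_time": commit_time})


client = WriteClient(LocalEngineContext())
written = client.upsert([{"key": "a"}, {"key": "b"}], commit_time="20200308")
print(written)
```

Under this assumption, anything built on top of `WriteClient` is reusable across engines, while sources that read through engine-specific connectors would still need one implementation per engine.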
Re: [DISCUSS] Restructure hudi-utilities module
Hi guys,

My original thought is to make delta streamer an engine-agnostic part so that Spark and Flink can share some common logic.

>> I am not sure the ROI is there for renaming to hudi-deltastreamer and pulling this out. Every time we change a module name

Actually, my suggestion here is to move the delta streamer into a new module and keep the current hudi-utilities module. Although, in a way, moving classes is similar to renaming the module.

>> I propose we leave this module Spark-specific, i.e. depending on hudi-spark alone

OK, I will think about building a delta streaming mode via Flink and ignore the current implementation of delta streamer.

Best,
Vino

On Thu, Mar 5, 2020 at 12:47 AM, Vinoth Chandar wrote:

> Regarding how multi-framework support would affect this module, I propose
> we leave this module Spark-specific, i.e. depending on hudi-spark alone,
> until we can make Flink work end to end.
Re: [DISCUSS] Restructure hudi-utilities module
I am not sure the ROI is there for renaming to hudi-deltastreamer and pulling this out. Every time we change a module name, it's a breaking change, and I would prefer that we reserve those for really pressing issues, or take the natural course of development and get there.

Regarding how multi-framework support would affect this module, I propose we leave this module Spark-specific, i.e. depending on hudi-spark alone, until we can make Flink work end to end. This feels kind of premature to me.

On Wed, Mar 4, 2020 at 8:37 AM Gary Li wrote:

> +1. hudi-delta gives me the feeling that it has something to do with other
> frameworks... I’d vote for another name: hudi-deltastreamer, hudi-streamer,
> or hudi-stream.
Re: [DISCUSS] Restructure hudi-utilities module
+1. hudi-delta gives me the feeling that it has something to do with other frameworks... I’d vote for another name: hudi-deltastreamer, hudi-streamer, or hudi-stream.

On Wed, Mar 4, 2020 at 2:29 AM vino yang wrote:

> - create a new module, named “hudi-delta” (or other name?) and move the
>   deltastreamer, sources, schema, transform … packages into this module
[DISCUSS] Restructure hudi-utilities module
Hi folks,

Currently, the content of hudi-utilities looks a bit mixed. In summary, it falls into two groups:

- delta streamer and its related packages, e.g. deltastreamer, sources, schema, transform; these packages serve delta streamer.
- Some utility tools such as HDFSParquetImporter, HiveIncrementalPuller, HoodieCleaner and so on

We are trying to refactor the business logic that depends on the computing engine. Delta Streamer will be affected (especially the sources package, which is the starting point of a Spark/Flink job). Doing this restructuring would make that work clearer and more focused.

I would like to start a proposal to restructure the hudi-utilities module. Delta streamer is a great feature for hudi, yet its logic lives largely in hudi-utilities. Can we raise its importance by making delta streamer a module of its own? It could be named e.g. hudi-delta or something else. Then hudi-utilities can be a real utilities module hosting the HDFSParquetImporter, HiveIncrementalPuller, HoodieCleaner tools.

In short, we can do this restructuring work:

- create a new module, named “hudi-delta” (or other name?) and move the deltastreamer, sources, schema, transform … packages into this module
- leave HDFSParquetImporter, HiveIncrementalPuller, HoodieCleaner … in the current place (utilities module)

What do you think?

Any comments and suggestions are welcome and appreciated.

Best,
Vino