Re: New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Awesome!
I think Kyle has already fixed some issues around the Chinese (cn) docs in the PR above.
Could you review that?
Kyle, if you are here, please chime in. We can organize all the work under
a single umbrella JIRA (https://issues.apache.org/jira/browse/HUDI-270) so
it's easier for any volunteers to pick up.

On Thu, Oct 28, 2021 at 6:21 AM Shawy Geng  wrote:

> Hi Vinoth,
>
> Volunteering to update the Chinese doc. I have already commented at
> https://issues.apache.org/jira/browse/HUDI-2628.
> Are there any other volunteers who want to work together to translate?
> Please contact me.
>
> > On Oct 28, 2021, at 20:35, Vinoth Chandar wrote:
> >
> > Hi all,
> >
> > https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
> > content that can showcase all of the Hudi capabilities. Please chime in
> > and help merge the PR.
> >
> > As a follow-on, can we also fix the Chinese site docs after this?
> >
> > Thanks
> > Vinoth
>
>


New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Hi all,

https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
content that can showcase all of the Hudi capabilities. Please chime in
and help merge the PR.

As a follow-on, can we also fix the Chinese site docs after this?

Thanks
Vinoth


Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Vinoth Chandar
Sounds great!

On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris 
wrote:

> Hi Vinoth,
>
> Thanks for the starter. Definitely, once the new way to manage indexes
> lands and we get migrated to Hudi on our data lake, I'd be glad to give
> this a shot.
>
>
> Regards, Nicolas
>
> On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > Hi Nicolas,
> >
> > Thanks for raising this! I think it's a very valid ask.
> > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> >
> > As a proof of concept, would you be able to give filterExists() a shot
> > and
> > see if the filtering time improves?
> >
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> >
> > In the upcoming 0.10.0 release, we are planning to move the bloom
> > filters
> > out to a partition on the metadata table, to even speed this up for very
> > large tables.
> > https://issues.apache.org/jira/browse/HUDI-1295
> >
> > Please let us know if you are interested in testing that when the PR is
> > up.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris 
> > wrote:
> >
> > > hi !
> > >
> > > In my use case, for GDPR I have to export all information for a given
> > > user from several HUGE Hudi tables. Filtering the table results in a
> > > full scan of around 10 hours, and this will get worse year after year.
> > >
> > > Since the filter criteria is based on the bloom key (user_id), it would
> > > be handy to exploit the bloom filter and produce a temporary table
> > > (e.g., in the metastore) with the resulting rows.
> > >
> > > So far, bloom indexing is used for update/delete operations on a Hudi
> > > table.
> > >
> > > 1. There is an opportunity to exploit the bloom filter for select
> > > operations. The hudi options would be:
> > > operation: select
> > > result-table: 
> > > result-path: 
> > > result-schema:  (optional ; when empty no
> > > sync with the hms, only raw path)
> > >
> > >
> > > 2. It could be implemented as a predicate pushdown in the Spark
> > > datasource API, when filtering with an IN statement.
> > >
> > >
> > > Thoughts?
> > >
>
>


Re: Limitations of non-unique keys

2021-10-28 Thread Vinoth Chandar
Hi,

Are you asking if there are advantages to allowing duplicates or not having
keys in your table?

Having keys helps with other practical scenarios, in addition to what you
called out.
e.g., oftentimes you would want to backfill an insert-only table, and you
don't want to introduce duplicates when doing so.
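
For illustration, a minimal sketch of such a backfill, assuming an active
SparkSession `spark`; the option keys are standard Hudi datasource options,
while the paths, table name, and field names are purely illustrative:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    // Historical rows to backfill into an existing insert-only table.
    Dataset<Row> backfill = spark.read().parquet("s3://archive/events-2020/");
    backfill.write().format("hudi")
        .option("hoodie.table.name", "events")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "insert")
        // Drop rows whose keys already exist, so the backfill cannot duplicate.
        .option("hoodie.datasource.write.insert.drop.duplicates", "true")
        .mode(SaveMode.Append)
        .save("s3://lake/events/");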

Thanks
Vinoth

On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris 
wrote:

> Hi devs,
>
> AFAIK, Hudi has been designed to have primary keys as the Hudi key.
> However, it is possible to also choose a non-unique field. I have listed
> several troubles with such a design:
>
> A non-unique key leads to:
> - cannot delete/update a unique record
> - cannot apply a primary key for the new SQL tables feature
>
> Are there other downsides to choosing a non-unique key that you have in mind?
>
> In my case, having user_id as the Hudi key will help to apply deletions at
> the user level in any user table. The tables are insert-only, so the
> drawbacks listed above do not really apply. In case of errors in the
> tables, I have several options:
>
> - rollback to a previous commit
> - read a partition, filter it, and overwrite the partition
>
> Thanks
>


Re: Monthly or Bi-Monthly Dev meeting?

2021-10-22 Thread Vinoth Chandar
We could, but just need storage space over the longer term. :)

On Wed, Oct 20, 2021 at 9:56 PM Raymond Xu 
wrote:

> Timing looks ok. Are we going to record the sessions too?
>
> On Wed, Oct 20, 2021 at 7:17 PM Vinoth Chandar  wrote:
>
> > I think we can do 7AM PST winters and 8AM summers.
> > Will draft a page with a zoom link we can use and put up a PR.
> >
> >
> > On Thu, Oct 14, 2021 at 9:48 AM Vinoth Chandar 
> wrote:
> >
> > > Yes. I can do 7AM PST. Can others in PST chime in please?
> > >
> > > We can wrap this up this week.
> > >
> > > On Tue, Oct 12, 2021 at 7:25 PM Gary Li  wrote:
> > >
> > >> Hi Vinoth,
> > >>
> > >> Summertime 8 AM PST was 11 PM in China, so I guess it works for some
> > >> folks, but switching to wintertime it becomes 12 AM in China. It might
> > >> be a bit late IMO. Does 3 PM UTC (7 AM PST in winter, 8 AM in summer)
> > >> work?
> > >>
> > >> Best,
> > >> Gary
> > >>
> > >> On Tue, Oct 5, 2021 at 9:20 PM Pratyaksh Sharma <
> pratyaks...@gmail.com>
> > >> wrote:
> > >>
> > >> > Works for me in India :)
> > >> >
> > >> > On Tue, Oct 5, 2021 at 9:41 AM Vinoth Chandar 
> > >> wrote:
> > >> >
> > >> > > Looks like there is enough interest here.
> > >> > >
> > >> > > Moving onto timing. Does 8AM PST, on the second thursday of every
> > >> > > month work for everyone?
> > >> > > This is the time I find, works best for most time zones.
> > >> > >
> > >> > > On Thu, Sep 23, 2021 at 1:15 PM Y Ethan Guo <
> > ethan.guoyi...@gmail.com
> > >> >
> > >> > > wrote:
> > >> > >
> > >> > > > +1 on monthly community sync.
> > >> > > >
> > >> > > > On Thu, Sep 23, 2021 at 12:32 PM Udit Mehrotra <
> udi...@apache.org
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > > +1 for the monthly meeting. It would be great to start syncing
> > up
> > >> > > > > again. Thanks Vinoth for bringing it up !
> > >> > > > >
> > >> > > > > On Thu, Sep 23, 2021 at 12:14 PM Sivabalan <
> n.siv...@gmail.com>
> > >> > wrote:
> > >> > > > > >
> > >> > > > > > +1 on monthly meet up.
> > >> > > > > >
> > >> > > > > > On Thu, Sep 23, 2021 at 11:01 AM vino yang <
> > >> yanghua1...@gmail.com>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > > +1 for monthly
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Vino
> > >> > > > > > >
> > >> > > > > > > Pratyaksh Sharma wrote on Thu, Sep 23, 2021 at 9:36 PM:
> > >> > > > > > >
> > >> > > > > > > > Monthly should be good. Been a long time since we
> > connected
> > >> in
> > >> > > > these
> > >> > > > > > > > meetings. :)
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
> > >> > > > > > > > mail.vinoth.chan...@gmail.com> wrote:
> > >> > > > > > > >
> > >> > > > > > > > > 1 hour monthly is what I was proposing to be specific.
> > >> > > > > > > > >
> > >> > > > > > > > > On Thu, Sep 23, 2021 at 6:30 AM Gary Li <
> > >> gar...@apache.org>
> > >> > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > +1 for monthly.
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar <
> > >> > > > > vin...@apache.org>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hi all,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Once upon a time, we used to have a weekly
> community
> > >> > sync.
> > >> > > > > > > Wondering
> > >> > > > > > > > if
> > >> > > > > > > > > > > there is interest in having a monthly or
> bi-monthly
> > >> dev
> > >> > > > > meeting?
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Agenda could be
> > >> > > > > > > > > > > - Update/Summary of all dev work tracks
> > >> > > > > > > > > > > - Show and tell, where people can present their
> > >> ongoing
> > >> > > work
> > >> > > > > > > > > > > - Open floor discussions, bring up new issues.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Thanks
> > >> > > > > > > > > > > Vinoth
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Regards,
> > >> > > > > > -Sivabalan
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>


Re: feature request/proposal: leverage bloom indexes for reading

2021-10-22 Thread Vinoth Chandar
Hi Nicolas,

Thanks for raising this! I think it's a very valid ask.
https://issues.apache.org/jira/browse/HUDI-2601 has been raised.

As a proof of concept, would you be able to give filterExists() a shot and
see if the filtering time improves?
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
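
For anyone trying this, here is a rough sketch of the experiment, assuming
the 0.9.x-era HoodieReadClient API (constructor and generic details may
differ slightly from the linked revision); `jsc` is an existing
JavaSparkContext, and the base path, partition path, and `userIds` list are
illustrative:

    // Client exposing read/index operations against an existing table.
    HoodieReadClient<HoodieAvroPayload> readClient =
        new HoodieReadClient<>(new HoodieSparkEngineContext(jsc), basePath);

    // Probe records keyed by the user_ids to look up.
    JavaRDD<HoodieRecord<HoodieAvroPayload>> probes = jsc.parallelize(userIds)
        .map(id -> new HoodieRecord<>(new HoodieKey(id, "2021/10"),
            new HoodieAvroPayload(Option.empty())));

    long start = System.nanoTime();
    // filterExists returns the records NOT already in the table; everything
    // it filters out was located via the bloom index, which is what we time.
    long misses = readClient.filterExists(probes).count();
    System.out.printf("bloom-index lookup took %.1fs (%d misses)%n",
        (System.nanoTime() - start) / 1e9, misses);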

In the upcoming 0.10.0 release, we are planning to move the bloom filters
out to a partition of the metadata table, to speed this up even more for
very large tables.
https://issues.apache.org/jira/browse/HUDI-1295

Please let us know if you are interested in testing that when the PR is up.

Thanks
Vinoth

On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris 
wrote:

> hi !
>
> In my use case, for GDPR I have to export all information for a given
> user from several HUGE Hudi tables. Filtering the table results in a
> full scan of around 10 hours, and this will get worse year after year.
>
> Since the filter criteria is based on the bloom key (user_id), it would
> be handy to exploit the bloom filter and produce a temporary table
> (e.g., in the metastore) with the resulting rows.
>
> So far, bloom indexing is used for update/delete operations on a Hudi
> table.
>
> 1. There is an opportunity to exploit the bloom filter for select
> operations. The hudi options would be:
> operation: select
> result-table: 
> result-path: 
> result-schema:  (optional ; when empty no
> sync with the hms, only raw path)
>
>
> 2. It could be implemented as a predicate pushdown in the Spark
> datasource API, when filtering with an IN statement.
>
>
> Thoughts?
>


Re: Monthly or Bi-Monthly Dev meeting?

2021-10-20 Thread Vinoth Chandar
I think we can do 7AM PST winters and 8AM summers.
Will draft a page with a zoom link we can use and put up a PR.


On Thu, Oct 14, 2021 at 9:48 AM Vinoth Chandar  wrote:

> Yes. I can do 7AM PST. Can others in PST chime in please?
>
> We can wrap this up this week.
>
> On Tue, Oct 12, 2021 at 7:25 PM Gary Li  wrote:
>
>> Hi Vinoth,
>>
>> Summertime 8 AM PST was 11 PM in China, so I guess it works for some folks,
>> but switching to wintertime it becomes 12 AM in China. It might be a bit late
>> IMO. Does 3 PM UTC (7 AM PST in winter, 8 AM in summer) work?
>>
>> Best,
>> Gary
>>
>> On Tue, Oct 5, 2021 at 9:20 PM Pratyaksh Sharma 
>> wrote:
>>
>> > Works for me in India :)
>> >
>> > On Tue, Oct 5, 2021 at 9:41 AM Vinoth Chandar 
>> wrote:
>> >
>> > > Looks like there is enough interest here.
>> > >
>> > > Moving onto timing. Does 8AM PST, on the second thursday of every
>> > > month work for everyone?
>> > > This is the time I find, works best for most time zones.
>> > >
>> > > On Thu, Sep 23, 2021 at 1:15 PM Y Ethan Guo > >
>> > > wrote:
>> > >
>> > > > +1 on monthly community sync.
>> > > >
>> > > > On Thu, Sep 23, 2021 at 12:32 PM Udit Mehrotra 
>> > > wrote:
>> > > >
>> > > > > +1 for the monthly meeting. It would be great to start syncing up
>> > > > > again. Thanks Vinoth for bringing it up !
>> > > > >
>> > > > > On Thu, Sep 23, 2021 at 12:14 PM Sivabalan 
>> > wrote:
>> > > > > >
>> > > > > > +1 on monthly meet up.
>> > > > > >
>> > > > > > On Thu, Sep 23, 2021 at 11:01 AM vino yang <
>> yanghua1...@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > +1 for monthly
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Vino
>> > > > > > >
>> > > > > > > Pratyaksh Sharma wrote on Thu, Sep 23, 2021 at 9:36 PM:
>> > > > > > >
>> > > > > > > > Monthly should be good. Been a long time since we connected
>> in
>> > > > these
>> > > > > > > > meetings. :)
>> > > > > > > >
>> > > > > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
>> > > > > > > > mail.vinoth.chan...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > 1 hour monthly is what I was proposing to be specific.
>> > > > > > > > >
>> > > > > > > > > On Thu, Sep 23, 2021 at 6:30 AM Gary Li <
>> gar...@apache.org>
>> > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > +1 for monthly.
>> > > > > > > > > >
>> > > > > > > > > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar <
>> > > > > vin...@apache.org>
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi all,
>> > > > > > > > > > >
>> > > > > > > > > > > Once upon a time, we used to have a weekly community
>> > sync.
>> > > > > > > Wondering
>> > > > > > > > if
>> > > > > > > > > > > there is interest in having a monthly or bi-monthly
>> dev
>> > > > > meeting?
>> > > > > > > > > > >
>> > > > > > > > > > > Agenda could be
>> > > > > > > > > > > - Update/Summary of all dev work tracks
>> > > > > > > > > > > - Show and tell, where people can present their
>> ongoing
>> > > work
>> > > > > > > > > > > - Open floor discussions, bring up new issues.
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks
>> > > > > > > > > > > Vinoth
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Regards,
>> > > > > > -Sivabalan
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: [Phishing Risk] [External] is there solution to solve hbase data skew issue

2021-10-18 Thread Vinoth Chandar
Hi,

Hudi async compaction, in general, can deal with concurrent writes to the
same file group. This is done by the writer respecting all pending
compaction plans.

Thanks
Vinoth

On Sun, Oct 17, 2021 at 8:39 AM Jian Feng  wrote:

> I have a question: when DeltaStreamer does a delta commit with BloomIndex,
> if data is new, it may need to be appended to an existing file group.
> Meanwhile, that may cause a concurrency issue with the async compaction
> thread if the compaction plan contains the same file group. How does Hudi
> avoid that?
>
> On Fri, Oct 15, 2021 at 12:50 AM Vinoth Chandar  wrote:
>
> > Yeah, all the rate-limiting code in HBaseIndex is working around these
> > large bulk writes.
> >
> > On Tue, Oct 5, 2021 at 11:16 AM Jian Feng  wrote:
> >
> > > Actually, I met this problem when bootstrapping a huge table; after
> > > changing the region key split strategy, the problem was solved.
> > > I'm glad to hear that the HFile solution will work in the future. Since
> > > the bloom index cannot index MOR log files, new insert data still writes
> > > into parquet; that is why I chose the HBase index, to get better
> > > performance.
> > >
> > > Vinoth Chandar wrote on Tue, Oct 5, 2021 at 7:29 PM:
> > >
> > > > +1 on that answer. It's pretty spot on.
> > > >
> > > > Even as random prefix helps with HBase balancing, the issue then
> > becomes
> > > > that you lose all the key ordering inside the Hudi table, which
> > > > can be a nice thing if you ever want range pruning/indexing to be
> > > > effective.
> > > >
> > > > To paint a picture of all the work being done around this area. This
> > > work,
> > > > driven by uber engineers https://github.com/apache/hudi/pull/3508
> > could
> > > > technically solve the issue by directly reading HFiles
> > > > for the indexing, avoiding going to HBase servers. But obviously, it
> > > could
> > > > be less performant for small upsert batches than HBase (given the
> > region
> > > > servers will cache etc).
> > > > If your backing storage is a cloud/object storage, which again
> > throttles
> > > by
> > > > prefixes etc, then we could run into the same hotspotting problem
> > again.
> > > > Otherwise, for larger batches, this would be far more scalable.
> > > >
> > > >
> > > > On Mon, Oct 4, 2021 at 7:06 PM 管梓越 
> > wrote:
> > > >
> > > > > Hi jianfeng
> > > > > As far as I know, there may not be a solution on the Hudi side yet.
> > > > > However, I have met this problem before, so I hope my experience can
> > > > > help.
> > > > > As with other usages of HBase, adding a random prefix to the rowkey
> > > > > may be the most universal solution to this problem.
> > > > > We may change the primary key for Hudi by adding such a prefix
> > > > > before the data is ingested into Hudi. A new column could be added
> > > > > to save the original primary key for queries and hide Hudi's pk.
> > > > > Also, we may make a small modification to the HBase index: copy the
> > > > > HBase index code and add the prefix when querying and updating
> > > > > HBase. This way, the pk in HBase will differ from the one in Hudi,
> > > > > but the logic will be transparent to business logic. I have adopted
> > > > > this method in a prod environment. The withIndexClass config in
> > > > > IndexConfig can specify a custom index, which allows changing the
> > > > > index without recompiling the whole Hudi project.
> > > > >
> > > > > On Mon, Oct 4, 2021, 11:29 PM  wrote:
> > > > > When I bootstrap a huge HBase-indexed table, I found all keys have
> > > > > the prefix 'itemid:', which caused data skew; there are 100 region
> > > > > servers in HBase but only one was handling data. Is there any way to
> > > > > avoid this issue on the Hudi side? -- *Jian Feng,冯健* Shopee |
> > > > > Engineer | Data Infrastructure
> > > > >
> > > >
> > > --
> > >
> >
>
>
> --
> *Jian Feng,冯健*
> Shopee | Engineer | Data Infrastructure
>


Re: [DISCUSS] Presto Plugin for Hudi

2021-10-17 Thread Vinoth Chandar
+1 here in general.

Raised some points on the Trino thread; the same points apply here as well.
For Presto, would the new connector work with the Aria/RaptorX work done by
Facebook engineers?

On Sat, Oct 16, 2021 at 11:12 PM sagar sumit  wrote:

> Dear Hudi Community,
>
> I initiated a discussion thread earlier regarding a new Trino plugin for
> Hudi [1]. Please take a look if you haven't.
> This thread is for a new Hudi connector in Presto. The motivations for
> Trino apply here as well. This is just a separate thread to discuss nuances
> related to Presto.
>
> Regards,
> Sagar
>
> [1]
>
> https://lists.apache.org/thread.html/r3e71e548ea9a596bbfa50ee98b9e4ea89203484c1a5e7d102931492b%40%3Cdev.hudi.apache.org%3E
>


Re: [DISCUSS] Trino Plugin for Hudi

2021-10-17 Thread Vinoth Chandar
Hi Sagar;

Thanks for the detailed write-up. +1 on the separate connector in general.

I would love to understand a few aspects which work really well for the Hive
connector path (which is kind of why we did it this way to begin with):

- What's the new user experience for users? With the Hive plugin
integration, Hudi tables can be queried like any Hive table. This is very
nice and easy to get started with. Can we provide a seamless experience,
and what about existing tables?

- What are we giving up? Trino docs talk about caching etc. that is built
into the Hive connector.

- IMO we should retain the Hive connector path as well. Most of the issues
we faced are because Hudi was adding transactions/snapshots, which had no
good abstractions in the Hive connector.

Thanks
Vinoth

On Sat, Oct 16, 2021 at 11:06 PM sagar sumit  wrote:

> Dear Hudi Community,
>
> I would like to propose the development of a new Trino plugin/connector for
> Hudi.
>
> Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
> read-optimized queries on Merge-On-Read tables with Trino, through the
> input-format-based integration in the Hive connector [1]. This approach
> has known performance limitations with very large tables, which have since
> been fixed on PrestoDB [2]. We are working on replicating the same fixes
> on Trino as well [3].
>
> However, as Hudi keeps getting better, a new plugin to provide access to
> Hudi data and metadata will help in unlocking those capabilities for the
> Trino users. Just to name a few benefits: metadata-based listing, full
> schema evolution, etc. [4].
> Moreover, a separate Hudi connector would allow its independent evolution
> without having to worry about hacking/breaking the Hive connector.
>
> A separate connector also falls in line with our vision [5]
> when we think of a standalone timeline server or a lake cache to balance
> the tradeoff between writing and querying. Imagine users having read and
> write access to data and metadata in Hudi directly through Trino.
>
> I did some prototyping to get the snapshot queries on a Hudi COW table
> working with a new plugin [6], and I feel the effort is worth it. The
> high-level approach is to implement the connector SPI [7] provided by
> Trino, such as:
> a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
> b) HudiSplit and HudiSplitManager implement ConnectorSplit and
> ConnectorSplitManager to produce logical units of data partitioning, so
> that Trino can parallelize reads and writes.
>
> Let me know your thoughts on the proposal. I can draft an RFC for the
> detailed design discussion once we have consensus.
>
> Regards,
> Sagar
>
> References:
> [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> [3] https://github.com/trinodb/trino/pull/9641
> [4]
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> [5]
>
> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> [6] https://github.com/codope/trino/tree/hudi-plugin
> [7] https://trino.io/docs/current/develop/connectors.html
>


Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-10-14 Thread Vinoth Chandar
Sorry, dropped the ball on this.

If you do the following, your queries will be correct and not see any
duplicates/partial data.

- For Spark, you now need to do spark.read.format("hudi").load()
- For Presto/Trino, when you sync the table metadata out to Hive
metastores, Presto/Trino understand Hudi tables natively and can filter the
correct snapshot.

In general, while copying data, you need to copy both the data files and
the .hoodie folder.
You can also explore the incremental query in Hudi to simplify this
process.
The DeltaStreamer tool can do this for you, i.e., read incrementally from
table 1 in bucket 1, optionally transform it, and then transactionally
write to table 2 in bucket 2.

Hope that helps. Otherwise, your plan seems good to go!
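
For reference, a minimal sketch of the two read paths above, assuming an
active SparkSession `spark` (paths and the begin instant are illustrative;
the option keys are standard Hudi datasource options):

    // Snapshot query: only the latest committed files, no duplicates.
    Dataset<Row> snapshot = spark.read().format("hudi")
        .load("s3://bucket1/table1");

    // Incremental query: only records committed after the given instant;
    // useful for the bucket1 -> bucket2 copy without full rescans.
    Dataset<Row> changes = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20210927000000")
        .load("s3://bucket1/table1");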

On Mon, Sep 27, 2021 at 2:55 PM Xiong Qiang 
wrote:

> Hi Vinoth,
>
> Thank you very much for the detailed explanation. That is very helpful.
>
> For the downstream applications, we have Spark applications, and
> Presto/Trino.
> For Spark, We use spark.read.format('parquet').load() to read the Parquet
> files for other processing.
> For Presto/Trino, we use AWS Lambda to add or drop the partitions in a
> daily schedule.
>
> Our use case of Hudi is mainly on the Copy On Write mode. We load Parquet
> files from S3 bucket1 in Hudi, do selection and deletion, and then write to
> S3 bucket2.
> Our downstream application read from either S3 bucket1, or S3 bucket2, but
> not both. *My understanding is that this use case will not create a
> duplication issue. Is that correct?*
>
> In the future, we may consider combining S3 bucket2 into S3 bucket1. We
> plan to first delete the old/original Parquet file in bucket1, drop the
> partitions as necessary in Presto/Trino, and then copy the modified Parquet
> (i.e. output Parquet from Hudi dataset) from bucket2 to bucket1. *Will that
> result in duplicates (if we are in Copy On Write mode)?*
>
> Besides the potential duplicate, any other pitfall that I need to pay
> special attention to?
>
> Thanks a lot!
> Xiong
>
>
>
> On Wed, Sep 22, 2021 at 2:29 PM Vinoth Chandar  wrote:
>
> > Hi,
> >
> > There is no format difference whatsoever. Hudi just adds additional
> footers
> > for min, max key values and bloom filters to parquet and some meta fields
> > for tracking commit times for incremental queries and keys.
> > Any standard parquet reader can read the parquet files in a Hudi table.
> > These downstream applications, are these Spark jobs? what do you use to
> > consume the parquet files?
> >
> > The main thing your downstream reader needs to do is to read a correct
> > snapshot i.e only the latest committed files. Otherwise,you may end up
> with
> > duplicate values.
> > For example, when you issue the hudi delete, hudi will internally create
> a
> > new version of parquet files, without the deleted rows. So if you are not
> > careful about filtering for the latest file, you may end up reading both
> > files and have duplicates
> >
> > All of this happens automatically, if you are using a supported engine
> like
> > spark, flink, hive, presto, trino, ...
> >
> > yes. hudi (copy on write) dataset is a set of parquet files, with some
> > metadata.
> >
> > Hope that helps
> >
> > Thanks
> > Vinoth
> >
> >
> >
> >
> > On Fri, Sep 17, 2021 at 9:09 PM Xiong Qiang 
> > wrote:
> >
> > > Hi, all,
> > >
> > > I am new to Hudi, so please forgive me for naive questions.
> > >
> > > I was following the guides at
> > >
> > >
> >
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > and at https://hudi.incubator.apache.org/docs/quick-start-guide/.
> > >
> > > My goal is to load original Parquet files (written by Spark application
> > > from Kafka to S3) into Hudi, delete some rows, and then save back to (a
> > > different path in) S3 (the modified Parquet file). There are other
> > > downstream applications that consume the original Parquet files for
> > > further processing.
> > >
> > > My question: *Is there any format difference between the original
> Parquet
> > > files and the Hudi-modified Parquet files?* Are the Hudi-modified
> > > Parquet files compatible with the original Parquet files? In other
> > > words, will
> > > other downstream applications (previously consuming the original
> Parquet
> > > files) be able to consume the modified Parquet files (i.e. the Hudi
> > > dataset) without any code change?
> > >
> > 

Release 0.10.0 planning

2021-10-14 Thread Vinoth Chandar
Hi all,

It's time for our next release again!

I have marked out some blockers here on JIRA.

https://issues.apache.org/jira/projects/HUDI/versions/12350285


Quick highlights:
- Metadata table v2, which is synchronously updated
- Row writing (Spark) for all write operations
- Kafka Connect for the append-only data model
- New indexing schemes moving bloom filters and file range footers into the
metadata table to improve upsert/delete performance.
- Fixes needed for Trino/Presto support.
- Most of the "big-needle-mover" PRs that are already up.
- Revamp of docs to match our vision.

May need some help understanding all the Flink related changes.

Kindly review and let's use this thread to ratify and discuss timelines.


Thanks
Vinoth


Re: [Phishing Risk] [External] is there solution to solve hbase data skew issue

2021-10-14 Thread Vinoth Chandar
Yeah, all the rate-limiting code in HBaseIndex is working around these
large bulk writes.

On Tue, Oct 5, 2021 at 11:16 AM Jian Feng  wrote:

> Actually, I met this problem when bootstrapping a huge table; after
> changing the region key split strategy, the problem was solved.
> I'm glad to hear that the HFile solution will work in the future. Since the
> bloom index cannot index MOR log files, new insert data still writes into
> parquet; that is why I chose the HBase index, to get better performance.
>
> Vinoth Chandar wrote on Tue, Oct 5, 2021 at 7:29 PM:
>
> > +1 on that answer. It's pretty spot on.
> >
> > Even as random prefix helps with HBase balancing, the issue then becomes
> > that you lose all the key ordering inside the Hudi table, which
> > can be a nice thing if you ever want range pruning/indexing to be
> > effective.
> >
> > To paint a picture of all the work being done around this area. This
> work,
> > driven by uber engineers https://github.com/apache/hudi/pull/3508 could
> > technically solve the issue by directly reading HFiles
> > for the indexing, avoiding going to HBase servers. But obviously, it
> could
> > be less performant for small upsert batches than HBase (given the region
> > servers will cache etc).
> > If your backing storage is a cloud/object storage, which again throttles
> by
> > prefixes etc, then we could run into the same hotspotting problem again.
> > Otherwise, for larger batches, this would be far more scalable.
> >
> >
> > On Mon, Oct 4, 2021 at 7:06 PM 管梓越  wrote:
> >
> > > Hi jianfeng
> > > As far as I know, there may not be a solution on the Hudi side yet.
> > > However, I have met this problem before, so I hope my experience can help.
> > > As with other usages of HBase, adding a random prefix to the rowkey may be
> > > the most universal solution to this problem.
> > > We may change the primary key for Hudi by adding such a prefix before the
> > > data is ingested into Hudi. A new column could be added to save the
> > > original primary key for queries and hide Hudi's pk.
> > > Also, we may make a small modification to the HBase index: copy the HBase
> > > index code and add the prefix when querying and updating HBase. This way,
> > > the pk in HBase will differ from the one in Hudi, but the logic will be
> > > transparent to business logic. I have adopted this method in a prod
> > > environment. The withIndexClass config in IndexConfig can specify a
> > > custom index, which allows changing the index without recompiling the
> > > whole Hudi project.
> > >
> > > On Mon, Oct 4, 2021, 11:29 PM  wrote:
> > > When I bootstrap a huge HBase-indexed table, I found all keys have the
> > > prefix 'itemid:', which caused data skew; there are 100 region servers in
> > > HBase but only one was handling data. Is there any way to avoid this issue
> > > on the Hudi side? -- *Jian Feng,冯健* Shopee | Engineer | Data Infrastructure
> > >
> >
> --
>


Re: Monthly or Bi-Monthly Dev meeting?

2021-10-14 Thread Vinoth Chandar
Yes. I can do 7AM PST. Can others in PST chime in please?

We can wrap this up this week.

On Tue, Oct 12, 2021 at 7:25 PM Gary Li  wrote:

> Hi Vinoth,
>
> Summertime 8 AM PST was 11 PM in China, so I guess it works for some folks,
> but switching to wintertime it becomes 12 AM in China. It might be a bit late
> IMO. Does 3 PM UTC (7 AM PST in winter, 8 AM in summer) work?
>
> Best,
> Gary
>
> On Tue, Oct 5, 2021 at 9:20 PM Pratyaksh Sharma 
> wrote:
>
> > Works for me in India :)
> >
> > On Tue, Oct 5, 2021 at 9:41 AM Vinoth Chandar  wrote:
> >
> > > Looks like there is enough interest here.
> > >
> > > Moving onto timing. Does 8AM PST, on the second thursday of every
> > > month work for everyone?
> > > This is the time I find, works best for most time zones.
> > >
> > > On Thu, Sep 23, 2021 at 1:15 PM Y Ethan Guo 
> > > wrote:
> > >
> > > > +1 on monthly community sync.
> > > >
> > > > On Thu, Sep 23, 2021 at 12:32 PM Udit Mehrotra 
> > > wrote:
> > > >
> > > > > +1 for the monthly meeting. It would be great to start syncing up
> > > > > again. Thanks Vinoth for bringing it up !
> > > > >
> > > > > On Thu, Sep 23, 2021 at 12:14 PM Sivabalan 
> > wrote:
> > > > > >
> > > > > > +1 on monthly meet up.
> > > > > >
> > > > > > On Thu, Sep 23, 2021 at 11:01 AM vino yang <
> yanghua1...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > +1 for monthly
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > > Pratyaksh Sharma wrote on Thu, Sep 23, 2021 at 9:36 PM:
> > > > > > >
> > > > > > > > Monthly should be good. Been a long time since we connected
> in
> > > > these
> > > > > > > > meetings. :)
> > > > > > > >
> > > > > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
> > > > > > > > mail.vinoth.chan...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > 1 hour monthly is what I was proposing to be specific.
> > > > > > > > >
> > > > > > > > > On Thu, Sep 23, 2021 at 6:30 AM Gary Li  >
> > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 for monthly.
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar <
> > > > > vin...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Once upon a time, we used to have a weekly community
> > sync.
> > > > > > > Wondering
> > > > > > > > if
> > > > > > > > > > > there is interest in having a monthly or bi-monthly dev
> > > > > meeting?
> > > > > > > > > > >
> > > > > > > > > > > Agenda could be
> > > > > > > > > > > - Update/Summary of all dev work tracks
> > > > > > > > > > > - Show and tell, where people can present their ongoing
> > > work
> > > > > > > > > > > - Open floor discussions, bring up new issues.
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Vinoth
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > -Sivabalan
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-10-06 Thread Vinoth Chandar
Hi Gary,

We can pass the constructed timeline and filesystem view into the IOHandle.
I think it makes sense for how Flink does things.
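
To illustrate the shape of that change with simplified stand-in types (these
are NOT Hudi's real classes or constructor signatures, just the "build once
per table, pass into every handle" idea):

    // Stand-in types for illustration only.
    interface Timeline {}
    interface FileSystemView {}

    class AppendHandle {
      private final Timeline timeline;     // handed in; not rebuilt per handle
      private final FileSystemView view;   // shared, already-constructed view

      AppendHandle(Timeline timeline, FileSystemView view) {
        this.timeline = timeline;
        this.view = view;
      }
    }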

Thanks
Vinoth

On Fri, Sep 24, 2021 at 2:04 AM Gary Li  wrote:

> Hi Vinoth,
>
> Currently, each executor of Flink has a timeline server I believe. Do you
> think we can avoid passing the timeline and filesystem view into the
> IOHandle? I mean one IOHandle is handling the IO of one filegroup, and it
> doesn't need to know the timeline and filesystem view of the table, if we
> can define what the IOHandle is supposed to do during the initialization.
> Based on the log I see, constructing the filesystem view of a partition
> with 500 filegroups is taking 200ms. If the AppendHandle is only flushing a
> few records to disk, the actual flush could be faster than filesystem view
> construction.
>
> On Fri, Sep 24, 2021 at 12:03 PM Vinoth Chandar  wrote:
>
> > Thanks for the explanation. I get the streaming aspect better now, esp. in
> > Flink land. The timeline server and remote file system view are the
> > defaults. Assuming it's an RPC call that takes 10-100 ms to the timeline
> > server, not sure how much room there is for optimizing the loading of the
> > file system view itself. For the timeline, it can be redundant (again, in
> > the Spark model, it gets passed from driver to executor). Wondering if we
> > can serve the timeline also via the server (duh).
> >
> > In Flink, do we currently run a timeline server per executor? Wondering if
> > that helps or hurts. In Spark, we run one in the driver alone.
> >
> > If we want to pass in a constructed table filesystem view and timeline
> > into the IOHandle, we can. I am fine with it, but trying to understand
> > what exactly we are solving by that.
> >
> >
> > On Thu, Sep 23, 2021 at 7:05 PM Gary Li  wrote:
> >
> > > Hi Vinoth,
> > >
> > > IMO the IOHandle should be as lightweight as possible, especially when
> we
> > > want to do streaming and near-real-time update(possibly real-time in
> the
> > > future?). Constructing the timeline and filesystem view inside the
> handle
> > > is time-consuming. In some cases, some handles only write a few records
> > in
> > > each commit, when we try to commit very aggressively. The timeline
> server
> > > and remote filesystem view are helpful, but I feel like there is still
> > some
> > > room for improvement.
> > >
> > > Best,
> > > Gary
> > >
> > > On Fri, Sep 24, 2021 at 3:04 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi Gary,
> > > >
> > > > So in effect you want to pull all the timeline filtering out of the
> > > handles
> > > > and pass a plan, i.e., what file slice to work on, to the handle?
> > > > That does sound cleaner, but we need to introduce this additional
> > layer.
> > > > The timeline and filesystem view do live within the table, I believe
> > > today.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Wed, Sep 22, 2021 at 6:35 PM Gary Li  wrote:
> > > >
> > > > > Hi Vinoth,
> > > > >
> > > > > Thanks for your response. For HoodieIOHandle, IMO we could define
> the
> > > > scope
> > > > > of the Handle during the initialization, so we don't need to care
> > about
> > > > the
> > > > > timeline and table view when actually writing the data. Is that
> > > > possible? A
> > > > > HoodieTable could have many Handles writing data at the same time
> and
> > > it
> > > > > will look cleaner if we can keep the timeline and file system view
> > > inside
> > > > > the table itself.
> > > > >
> > > > > Best,
> > > > > Gary
> > > > >
> > > > > On Sat, Sep 18, 2021 at 12:06 AM Vinoth Chandar  >
> > > > wrote:
> > > > >
> > > > > > Hi Gary,
> > > > > >
> > > > > > Thanks for the detailed response. Let me add my take on it.
> > > > > >
> > > > > > >>HoodieFlinkMergeOnReadTable.upsert(List) to use
> the
> > > > > > AppendHandle.write(HoodieRecord) directly,
> > > > > >
> > > > > > I have the same issue on JavaClient, for the Kafka Connect
> > > > > implementation.
> > > > > > I have an idea of how we can implement this. Will raise a PR and
> > > > > > get your thoughts.

Re: [Phishing Risk] [External] is there solution to solve hbase data skew issue

2021-10-05 Thread Vinoth Chandar
+1 on that answer. It's pretty spot on.

Even as the random prefix helps with HBase balancing, the issue then becomes
that you lose all the key ordering inside the Hudi table, which
can be a nice thing if you ever want range pruning/indexing to be
effective.
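
To make the tradeoff concrete, a small hedged illustration of key salting
(the bucket count and format are illustrative): the bounded hash prefix
spreads HBase load across regions, but the resulting keys no longer sort by
user_id inside the Hudi table.

    // Prefix the key with a bounded hash bucket, e.g. "37:user-8812".
    static String saltedKey(String userId) {
      int bucket = Math.floorMod(userId.hashCode(), 100); // ~100 region servers
      return String.format("%02d:%s", bucket, userId);
    }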

To paint a picture of all the work being done around this area. This work,
driven by uber engineers https://github.com/apache/hudi/pull/3508 could
technically solve the issue by directly reading HFiles
for the indexing, avoiding going to HBase servers. But obviously, it could
be less performant for small upsert batches than HBase (given the region
servers will cache etc).
If your backing storage is a cloud/object storage, which again throttles by
prefixes etc, then we could run into the same hotspotting problem again.
Otherwise, for larger batches, this would be far more scalable.


On Mon, Oct 4, 2021 at 7:06 PM 管梓越  wrote:

> Hi jianfeng
> As far as I know, there may not be a solution on the Hudi side yet.
> However, I have met this problem before, so I hope my experience can help.
> As with other usages of HBase, adding a random prefix to the rowkey may be
> the most universal solution to this problem.
> We may change the primary key for Hudi by adding such a prefix before the
> data is ingested into Hudi. A new column could be added to save the original
> primary key for queries and hide Hudi's pk.
> Also, we may make a small modification to the HBase index: copy the HBase
> index code and add the prefix when querying and updating HBase. This way,
> the pk in HBase will differ from the one in Hudi, but the logic will be
> transparent to business logic. I have adopted this method in a prod
> environment. The withIndexClass config in IndexConfig can specify a custom
> index, which allows changing the index without recompiling the whole Hudi
> project.
>
> On Mon, Oct 4, 2021, 11:29 PM  wrote:
> When I bootstrap a huge HBase-indexed table, I found all keys have the prefix
> 'itemid:', which caused data skew; there are 100 region servers in HBase
> but only one was handling data. Is there any way to avoid this issue on the
> Hudi side? -- *Jian Feng,冯健* Shopee | Engineer | Data Infrastructure
>


Re: Monthly or Bi-Monthly Dev meeting?

2021-10-04 Thread Vinoth Chandar
Looks like there is enough interest here.

Moving onto timing. Does 8AM PST, on the second thursday of every
month work for everyone?
This is the time I find, works best for most time zones.

On Thu, Sep 23, 2021 at 1:15 PM Y Ethan Guo 
wrote:

> +1 on monthly community sync.
>
> On Thu, Sep 23, 2021 at 12:32 PM Udit Mehrotra  wrote:
>
> > +1 for the monthly meeting. It would be great to start syncing up
> > again. Thanks Vinoth for bringing it up !
> >
> > On Thu, Sep 23, 2021 at 12:14 PM Sivabalan  wrote:
> > >
> > > +1 on monthly meet up.
> > >
> > > On Thu, Sep 23, 2021 at 11:01 AM vino yang 
> > wrote:
> > >
> > > > +1 for monthly
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > Pratyaksh Sharma wrote on Thu, Sep 23, 2021 at 9:36 PM:
> > > >
> > > > > Monthly should be good. Been a long time since we connected in
> these
> > > > > meetings. :)
> > > > >
> > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
> > > > > mail.vinoth.chan...@gmail.com> wrote:
> > > > >
> > > > > > 1 hour monthly is what I was proposing to be specific.
> > > > > >
> > > > > > On Thu, Sep 23, 2021 at 6:30 AM Gary Li 
> wrote:
> > > > > >
> > > > > > > +1 for monthly.
> > > > > > >
> > > > > > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar <
> > vin...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Once upon a time, we used to have a weekly community sync.
> > > > Wondering
> > > > > if
> > > > > > > > there is interest in having a monthly or bi-monthly dev
> > meeting?
> > > > > > > >
> > > > > > > > Agenda could be
> > > > > > > > - Update/Summary of all dev work tracks
> > > > > > > > - Show and tell, where people can present their ongoing work
> > > > > > > > - Open floor discussions, bring up new issues.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Vinoth
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> >
>


Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Vinoth Chandar
Thanks for the explanation. I get the streaming aspect better now, esp. in
Flink land. The timeline server and remote file system view are the
defaults. Assuming it's an RPC call that takes 10-100 ms to the timeline
server, not sure how much room there is for optimizing the loading of the
file system view itself. For the timeline, it can be redundant (again, in
the Spark model, it gets passed from driver to executor). Wondering if we
can serve the timeline also via the server (duh).

In Flink, do we currently run a timeline server per executor? Wondering if
that helps or hurts. In Spark, we run one in the driver alone.

If we want to pass in a constructed table filesystem view and timeline into
the IOHandle, we can. I am fine with it, but trying to understand what
exactly we are solving by that.


On Thu, Sep 23, 2021 at 7:05 PM Gary Li  wrote:

> Hi Vinoth,
>
> IMO the IOHandle should be as lightweight as possible, especially when we
> want to do streaming and near-real-time update(possibly real-time in the
> future?). Constructing the timeline and filesystem view inside the handle
> is time-consuming. In some cases, some handles only write a few records in
> each commit, when we try to commit very aggressively. The timeline server
> and remote filesystem view are helpful, but I feel like there is still some
> room for improvement.
>
> Best,
> Gary
>
> On Fri, Sep 24, 2021 at 3:04 AM Vinoth Chandar  wrote:
>
> > Hi Gary,
> >
> > So in effect you want to pull all the timeline filtering out of the
> handles
> > and pass a plan, i.e., what file slice to work on, to the handle?
> > That does sound cleaner, but we need to introduce this additional layer.
> > The timeline and filesystem view do live within the table, I believe
> today.
> >
> > Thanks
> > Vinoth
> >
> > On Wed, Sep 22, 2021 at 6:35 PM Gary Li  wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for your response. For HoodieIOHandle, IMO we could define the
> > scope
> > > of the Handle during the initialization, so we don't need to care about
> > the
> > > timeline and table view when actually writing the data. Is that
> > possible? A
> > > HoodieTable could have many Handles writing data at the same time and
> it
> > > will look cleaner if we can keep the timeline and file system view
> inside
> > > the table itself.
> > >
> > > Best,
> > > Gary
> > >
> > > On Sat, Sep 18, 2021 at 12:06 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi Gary,
> > > >
> > > > Thanks for the detailed response. Let me add my take on it.
> > > >
> > > > >>HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
> > > > AppendHandle.write(HoodieRecord) directly,
> > > >
> > > > I have the same issue on JavaClient, for the Kafka Connect
> > > implementation.
> > > > I have an idea of how we can implement this. Will raise a PR and get
> > your
> > > > thoughts.
> > > > We can then see if this can be leveraged across Flink and Java
> clients.
> > > >
> > > > On the IOHandle not having the Table inside, I think the File
> > > > reader/writer  abstraction exists already and having the Table in the
> > io
> > > > layers helps us perform I/O
> > > > while maintaining consistency with the timeline.
> > > >
> > > > +1 on the next two points.
> > > >
> > > > I think these layers have well defined roles, and probably why we are
> > > able
> > > > to get this far :) . May be we need to pull I/O up into hudi-common ?
> > > >
> > > > For this project, we can trim the scope to code reuse and moving all
> > the
> > > > different engine specific implementations up into hudi-client-common.
> > > >
> > > > What do you think?
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > >
> > > > On Thu, Sep 16, 2021 at 6:55 AM Gary Li  wrote:
> > > >
> > > > > Huge +1. Recently I am working on making the Flink writer in a
> > > streaming
> > > > > fashion and found the List<HoodieRecord> interface is limiting the
> > > > > streaming power of Flink. By switching from
> > > > > HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
> > > > > AppendHandle.write(HoodieRecord) directly, the throughput was
> almost
> > > > > doubled and the checkpoint time of the writer was reduced from
> > minutes
> > > to
> > > >

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Vinoth Chandar
Hi Gary,

So, in effect, you want to pull all the timeline filtering out of the
handles and pass a plan, i.e., what file slice to work on, to the handle?
That does sound cleaner, but we need to introduce this additional layer.
The timeline and filesystem view do live within the table, I believe, today.

Thanks
Vinoth

On Wed, Sep 22, 2021 at 6:35 PM Gary Li  wrote:

> Hi Vinoth,
>
> Thanks for your response. For HoodieIOHandle, IMO we could define the scope
> of the Handle during the initialization, so we don't need to care about the
> timeline and table view when actually writing the data. Is that possible? A
> HoodieTable could have many Handles writing data at the same time and it
> will look cleaner if we can keep the timeline and file system view inside
> the table itself.
>
> Best,
> Gary
>
> On Sat, Sep 18, 2021 at 12:06 AM Vinoth Chandar  wrote:
>
> > Hi Gary,
> >
> > Thanks for the detailed response. Let me add my take on it.
> >
> > >>HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
> > AppendHandle.write(HoodieRecord) directly,
> >
> > I have the same issue on JavaClient, for the Kafka Connect
> implementation.
> > I have an idea of how we can implement this. Will raise a PR and get your
> > thoughts.
> > We can then see if this can be leveraged across Flink and Java clients.
> >
> > On the IOHandle not having the Table inside: I think the file
> > reader/writer abstraction exists already, and having the Table in the io
> > layers helps us perform I/O
> > while maintaining consistency with the timeline.
> >
> > +1 on the next two points.
> >
> > I think these layers have well-defined roles, which is probably why we
> > are able to get this far :). Maybe we need to pull I/O up into hudi-common?
> >
> > For this project, we can trim the scope to code reuse and moving all the
> > different engine specific implementations up into hudi-client-common.
> >
> > What do you think?
> >
> > Thanks
> > Vinoth
> >
> >
> > On Thu, Sep 16, 2021 at 6:55 AM Gary Li  wrote:
> >
> > > Huge +1. Recently I am working on making the Flink writer in a
> streaming
> > > fashion and found the List<HoodieRecord> interface is limiting the
> > > streaming power of Flink. By switching from
> > > HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
> > > AppendHandle.write(HoodieRecord) directly, the throughput was almost
> > > doubled and the checkpoint time of the writer was reduced from minutes
> to
> > > seconds. But I found it really difficult to fit this change into the
> > > current client interface.
> > >
> > > My 2 cents:
> > >
> > >- The HoodieIOHandle should only handle the IO, and not having
> > >HoodieTable inside.
> > >- We need a more streaming-friendly Handle. For Flink, we can
> > definitely
> > >change all the batch mode List<HoodieRecord> to processing
> > HoodieRecord
> > > one
> > >by one, just like the AppendHandle.write(HoodieRecord) and
> > >AppendHandle.close(). This will spread the computing cost and
> > >flattening the curve.
> > >- We can use the Handle to precisely control the JVM to avoid OOM
> and
> > >optimize the memory footprint. Then we don't need to implement
> another
> > >memory control mechanism in the compute engine itself.
> > >- HoodieClient, HoodieTable, HoodieIOHandle, HoodieTimeline,
> > >HoodieFileSystemView e.t.c should have a well-defined role and
> > > well-defined
> > >layer. We should know when to use what, it should be used by the
> > driver
> > > in
> > >a single thread or used by the worker in a distributed way.
> > >
> > > This is a big project and could benefit Hudi in the long term. Happy to
> > discuss
> > > more in the design doc or PRs.
> > >
> > > Best,
> > > Gary
> > >
> > > On Thu, Sep 16, 2021 at 3:21 AM Raymond Xu <
> xu.shiyan.raym...@gmail.com>
> > > wrote:
> > >
> > > > +1 that's a great improvement.
> > > >
> > > > On Wed, Sep 15, 2021 at 10:40 AM Sivabalan 
> wrote:
> > > >
> > > > > ++1. Definitely helps Hudi scale and makes it more maintainable.
> > > > > Thanks for driving this effort. Mostly devs show interest in major
> > > > > features and don't like to spend time on such foundational work. But
> > > > > as the project scales, this foundational work will have higher
> > > > > returns in the long run.

Re: Monthly or Bi-Monthly Dev meeting?

2021-09-23 Thread Vinoth Chandar
1 hour monthly is what I was proposing, to be specific.

On Thu, Sep 23, 2021 at 6:30 AM Gary Li  wrote:

> +1 for monthly.
>
> On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > Once upon a time, we used to have a weekly community sync. Wondering if
> > there is interest in having a monthly or bi-monthly dev meeting?
> >
> > Agenda could be
> > - Update/Summary of all dev work tracks
> > - Show and tell, where people can present their ongoing work
> > - Open floor discussions, bring up new issues.
> >
> > Thanks
> > Vinoth
> >
>


Re: How to do apache hudi performance test?

2021-09-23 Thread Vinoth Chandar
Hi,

Those numbers you see are from production at Uber, which I no longer have
access to. So they are not synthetic numbers.
I use my own little script for testing write performance - TPC-DS does not
really have good support for update/delete workloads.
I am happy to throw it up, but I think we could invest in a
`hudi-perf-suite` module wrapping some of these things, at least for
Spark/Flink to begin with.
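
For anyone taking a stab at it, one possible shape for such a script, as a
hedged sketch (assumes the hudi-spark bundle on the classpath; the table
name, key/precombine fields, and paths are illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class UpsertTimer {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-upsert-timer").getOrCreate();
        // args[0]: pre-generated batch of inserts/updates; args[1]: base path.
        Dataset<Row> batch = spark.read().parquet(args[0]);
        long start = System.nanoTime();
        batch.write().format("hudi")
            .option("hoodie.table.name", "perf_test")
            .option("hoodie.datasource.write.recordkey.field", "key")
            .option("hoodie.datasource.write.precombine.field", "ts")
            .option("hoodie.datasource.write.operation", "upsert")
            .mode(SaveMode.Append)
            .save(args[1]);
        System.out.printf("upsert took %.1fs%n",
            (System.nanoTime() - start) / 1e9);
        spark.stop();
      }
    }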

I am happy to assist anyone willing to take a stab at it.

Thanks
Vinoth

On Thu, Sep 23, 2021 at 2:45 AM Danny Chan  wrote:

> +1, a reproducible benchmark is important for users to test and then
> choose their final product.
>
> Best,
> Danny Chan
>
> > casel.chen wrote on Tue, Sep 14, 2021 at 9:38 PM:
>
> > Hello, everyone!
> >
> >
> > I want to know how to do an Apache Hudi performance test like
> > https://hudi.apache.org/docs/performance/. How do I monitor those metrics?
> > Any replay steps would be appreciated. Thanks!
> >
> >
> > Shuai
>


Monthly or Bi-Monthly Dev meeting?

2021-09-23 Thread Vinoth Chandar
Hi all,

Once upon a time, we used to have a weekly community sync. Wondering if
there is interest in having a monthly or bi-monthly dev meeting?

Agenda could be
- Update/Summary of all dev work tracks
- Show and tell, where people can present their ongoing work
- Open floor discussions, bring up new issues.

Thanks
Vinoth


Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-09-22 Thread Vinoth Chandar
Hi,

There is no format difference whatsoever. Hudi just adds additional footers
for min, max key values and bloom filters to parquet and some meta fields
for tracking commit times for incremental queries and keys.
Any standard parquet reader can read the parquet files in a Hudi table.
These downstream applications, are these Spark jobs? what do you use to
consume the parquet files?

The main thing your downstream reader needs to do is to read a correct
snapshot, i.e., only the latest committed files. Otherwise, you may end up with
duplicate values.
For example, when you issue the hudi delete, hudi will internally create a
new version of parquet files, without the deleted rows. So if you are not
careful about filtering for the latest file, you may end up reading both
files and have duplicates.

All of this happens automatically, if you are using a supported engine like
spark, flink, hive, presto, trino, ...

Yes, a hudi (copy on write) dataset is a set of parquet files, with some
metadata.
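
A short sketch of that distinction, assuming an active SparkSession `spark`
(the base path is illustrative):

    // Raw parquet read: scans every file version under the table, so rows
    // rewritten by an update/delete can appear twice.
    Dataset<Row> raw = spark.read()
        .parquet("s3://bucket1/table1/*/*.parquet");

    // Hudi-format read: resolves the timeline in .hoodie and returns only
    // the latest committed snapshot.
    Dataset<Row> snapshot = spark.read().format("hudi")
        .load("s3://bucket1/table1");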

Hope that helps

Thanks
Vinoth




On Fri, Sep 17, 2021 at 9:09 PM Xiong Qiang 
wrote:

> Hi, all,
>
> I am new to Hudi, so please forgive me for naive questions.
>
> I was following the guides at
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> and at https://hudi.incubator.apache.org/docs/quick-start-guide/.
>
> My goal is to load original Parquet files (written by Spark application
> from Kafka to S3) into Hudi, delete some rows, and then save back to (a
> different path in) S3 (the modified Parquet file). There are other
> downstream applications that consumes the original Parquet files for
> further processing.
>
> My question: *Is there any format difference between the original Parquet
> files and the Hudi-modified Parquet files?* Are the Hudi-modified Parquet
> files compatible with the original Parquet files? In other words, will
> other downstream applications (previously consuming the original Parquet
> files) be able to consume the modified Parquet files (i.e. the Hudi
> dataset) without any code change?
>
> In the docs, I have seen the phrase "Hudi dataset", which, in my
> understanding, is simply a Parquet file with accompanying Hudi metadata. I
> have also read the migration doc (
> https://hudi.incubator.apache.org/docs/migration_guide/). My understanding
> is that we can migrate from an original Parquet file to a Hudi dataset
> (Hudi-modified Parquet file). *Can we use (or migrate) a Hudi dataset
> (Hudi-modified Parquet file) back to an original Parquet file to be consumed
> by other downstream applications?*
>
> Thank you very much!
>


Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-17 Thread Vinoth Chandar
Hi Gary,

Thanks for the detailed response. Let me add my take on it.

>>HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
AppendHandle.write(HoodieRecord) directly,

I have the same issue on JavaClient, for the Kafka Connect implementation.
I have an idea of how we can implement this. Will raise a PR and get your
thoughts.
We can then see if this can be leveraged across Flink and Java clients.

On the IOHandle not having the Table inside: I think the file
reader/writer abstraction exists already, and having the Table in the I/O
layers helps us perform I/O
while maintaining consistency with the timeline.

+1 on the next two points.

I think these layers have well-defined roles, which is probably why we are
able to get this far :). Maybe we need to pull I/O up into hudi-common?

For this project, we can trim the scope to code reuse and moving all the
different engine-specific implementations up into hudi-client-common.

What do you think?

Thanks
Vinoth


On Thu, Sep 16, 2021 at 6:55 AM Gary Li  wrote:

> Huge +1. Recently I am working on making the Flink writer write in a streaming
> fashion and found the List<HoodieRecord> interface is limiting the
> streaming power of Flink. By switching from
> HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to using
> AppendHandle.write(HoodieRecord) directly, the throughput was almost
> doubled and the checkpoint time of the writer was reduced from minutes to
> seconds. But I found it really difficult to fit this change into the
> current client interface.
>
> My 2 cents:
>
>- The HoodieIOHandle should only handle the IO, and not have
>HoodieTable inside.
>- We need a more streaming-friendly Handle. For Flink, we can definitely
>change all the batch mode List<HoodieRecord> processing to handling
>HoodieRecord one by one, just like AppendHandle.write(HoodieRecord) and
>AppendHandle.close(). This will spread out the computing cost and
>flatten the curve.
>- We can use the Handle to precisely control the JVM to avoid OOM and
>optimize the memory footprint. Then we don't need to implement another
>memory control mechanism in the compute engine itself.
>- HoodieClient, HoodieTable, HoodieIOHandle, HoodieTimeline,
>HoodieFileSystemView etc. should each have a well-defined role and a
>well-defined layer. We should know when to use what, and whether it
>should be used by the driver in a single thread or by the workers in a
>distributed way.
>
> This is a big project and could benefit Hudi in the long term. Happy to discuss
> more in the design doc or PRs.
>
> Best,
> Gary
>
> On Thu, Sep 16, 2021 at 3:21 AM Raymond Xu 
> wrote:
>
> > +1 that's a great improvement.
> >
> > On Wed, Sep 15, 2021 at 10:40 AM Sivabalan  wrote:
> >
> > > ++1. Definitely helps Hudi scale and makes it more maintainable.
> Thanks
> > > for driving this effort. Mostly devs show interest in major features
> and
> > > don't like to spend time on such foundational work. But as the project
> > > scales, this foundational work will have higher returns in the long
> > run.
> > >
> > > On Wed, Sep 15, 2021 at 8:29 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Another +1 ,  HoodieData abstraction will go a long way in reducing
> > LoC.
> > > >
> > > > Happy to work with you to see this through! I really encourage top
> > > > contributors to the Flink and Java clients as well to
> > > > actively review all PRs, given there are subtle differences
> everywhere.
> > > >
> > > > This will help us smoothly provide all the core features across
> > engines.
> > > > Also help us easily write a DataSet/Row based
> > > > client for Spark as well.
> > > >
> > > > Onwards and upwards
> > > > Vinoth
> > > >
> > > > On Wed, Sep 15, 2021 at 4:57 AM vino yang 
> > wrote:
> > > >
> > > > > Hi Ethan,
> > > > >
> > > > > Big +1 for the proposal.
> > > > >
> > > > > Actually, we have discussed this topic before.[1]
> > > > >
> > > > > Will review your refactor PR later.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E
> > > > >
> > > > >
> > > > > Y Ethan Guo  于2021年9月15日周三 下午3:34写道:
> > > > >
> > > > > > Hi all,
> 

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread Vinoth Chandar
Another +1, the HoodieData abstraction will go a long way in reducing LoC.

Happy to work with you to see this through! I really encourage top
contributors to the Flink and Java clients as well to
actively review all PRs, given there are subtle differences everywhere.

This will help us smoothly provide all the core features across engines.
Also help us easily write a DataSet/Row based
client for Spark as well.

Onwards and upwards
Vinoth

On Wed, Sep 15, 2021 at 4:57 AM vino yang  wrote:

> Hi Ethan,
>
> Big +1 for the proposal.
>
> Actually, we have discussed this topic before.[1]
>
> Will review your refactor PR later.
>
> Best,
> Vino
>
> [1]:
>
> https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E
>
>
> Y Ethan Guo  于2021年9月15日周三 下午3:34写道:
>
> > Hi all,
> >
> > hudi-client module has core Hudi abstractions and client logic for
> > different engines like Spark, Flink, and Java.  While a previous effort
> > (HUDI-538 [1]) decoupled the integration with Spark, there is quite
> > some code duplication across different engines for almost the same logic
> > due to the current interface design.  Some parts also have divergence among
> > engines, making debugging and support difficult.
> >
> > I propose to further refactor the hudi-client module with the goal of
> > improving the code reuse across multiple engines and reducing the
> > divergence of the logic among them, so that the core Hudi action
> execution
> > logic should be shared across engines, except for engine specific
> > transformations.  Such a pattern also allows easy support of core Hudi
> > functionality for all engines in the future.  Specifically,
> >
> > (1) Abstracts the transformation boilerplates inside the
> > HoodieEngineContext and implements the engine-specific data
> transformation
> > logic in the subclasses.  Type cast can be done inside the engine
> context.
> > (2) Creates new HoodieData abstraction for passing input and output along
> > the flow of execution, and uses it in different Hudi abstractions, e.g.,
> > HoodieTable, HoodieIOHandle, BaseActionExecutor, instead of enforcing
> type
> > parameters encountering RDD<HoodieRecord> and List<HoodieRecord> which
> are
> > one source of duplication.
> > (3) Extracts common execution logic to hudi-client-common module from
> > multiple engines.
> >
> > As a first step and exploration for item (1) and (3) above, I've tried to
> > refactor the rollback actions and the PR is here [HUDI-2433][2].  In this
> > PR, I completely remove all engine-specific rollback packages and only
> keep
> > one rollback package in hudi-client-common, adding ~350 LoC while
> deleting
> > 1.3K LoC.  My next step is to refactor the commit action which
> encompasses
> > item (2) above.
> >
> > What do you folks think and any other suggestions?
> >
> > [1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for multi
> engine
> > support
> > https://issues.apache.org/jira/browse/HUDI-538
> > [2] PR: [HUDI-2433] Refactor rollback actions in hudi-client module
> > https://github.com/apache/hudi/pull/3664/files
> >
> > Best,
> > - Ethan
> >
>
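To illustrate item (2) above: a minimal sketch (not the actual Hudi API) of what an engine-agnostic HoodieData could look like, with a List-backed variant for the Java client; a Spark variant would delegate the same calls to JavaRDD:

import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Engine-agnostic holder that action executors would be written against.
abstract class HoodieData<T> {
  public abstract <O> HoodieData<O> map(Function<T, O> func);
  public abstract List<T> collectAsList();
}

// Java-client implementation backed by an in-memory List; a Spark
// implementation would wrap a JavaRDD and delegate to rdd.map(...).
final class HoodieListData<T> extends HoodieData<T> {
  private final List<T> data;

  HoodieListData(List<T> data) {
    this.data = data;
  }

  @Override
  public <O> HoodieData<O> map(Function<T, O> func) {
    return new HoodieListData<>(
        data.stream().map(func).collect(Collectors.toList()));
  }

  @Override
  public List<T> collectAsList() {
    return data;
  }
}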


Re: [ANNOUNCEMENT] CI changes

2021-09-07 Thread Vinoth Chandar
+1  this is a truly champion effort.

On Tue, Sep 7, 2021 at 11:51 AM Sivabalan  wrote:

> Really great job Raymond! Good to see improvements on CI infra. Definitely
> makes the developer experience a lot better.
>
>
> On Mon, Sep 6, 2021 at 9:19 PM vino yang  wrote:
>
> > awesome! Great job!
> >
> > Thanks for driving and landing this big infra improvement!
> >
> > Best,
> > Vino
> >
> > Raymond Xu  于2021年9月4日周六 上午9:42写道:
> >
> > > Hi all,
> > >
> > > As you may have noticed, we have been running Azure Pipelines for the
> > tests
> > > for some time and have recently retired Travis CI in this PR
> > > .
> > >
> > > Background
> > >
> > > It was a pain for the CI process in the past with Travis, which from
> time
> > > to time queued up CI jobs forever. This severely affected the developer
> > > experience for making contributions, and also the release process.
> > >
> > > The New Setup
> > >
> > > Thanks to the Flink community, who pioneered the CI setup, and MS
> Azure,
> > > who provided the free resources, we are able to mirror the repo and PRs
> > to
> > > a separate GitHub organization  and
> > run
> > > the tests in Azure Pipelines. Hudi's ci-bot
> > >  (forked from Flink's ci-bot
> > > ) runs on a GCP server and
> > > periodically
> > > scans recently changed PRs for CI submission. CI results are commented
> > back
> > > to the PR by hudi-bot . Full details
> about
> > > the
> > > setup are documented in this
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Guide+on+CI+infrastructure
> > > >
> > > page
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Guide+on+CI+infrastructure
> > > >
> > > .
> > >
> > > Azure Pipelines provides 10 free managed parallel jobs. CI tests are
> > split
> > > into 5 jobs. We have dedicated resources to test 2 PRs in parallel.
> > >
> > >- master builds:
> > >
> > >
> >
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build?definitionId=3
> > >- branch builds:
> > >
> > >
> >
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build?definitionId=5
> > >
> > > Note: PRs against asf-site (website updates) will be ignored by this
> > setup.
> > >
> > > Additionally, we make use of GitHub Actions to build against different
> > > Spark and Scala versions. GitHub Actions jobs also provide fast
> feedback
> > > for compliance like checkstyle and apache-rat.
> > >
> > > For PR Owners and Reviewers
> > >
> > > With these changes, PR owners and reviewers should pay attention to the
> > > following:
> > >
> > >- CI results are indicated in hudi-bot's comment
> > >- A new commit in the same PR will trigger a new build and cancel
> any
> > >existing build
> > >- Comment `@hudi-bot run azure` to manually trigger a new build
> > >- GitHub Actions jobs will show as checks in the PR
> > >- Minimum conditions to merge:
> > >   - Azure CI report shows success, and
> > >   - GitHub Actions jobs passed
> > >- For website update PRs (for asf-site branch), owners post
> > screenshots
> > >to show the changes in lieu of CI tests.
> > >
> > >
> > > Hope this contributes towards a more seamless developer experience.
> > Please
> > > reach out to the community for CI issues or further questions.
> > >
> > >
> > > Best,
> > > Raymond
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: Apache Hudi release voting process

2021-08-22 Thread Vinoth Chandar
Hi all,

First of all, it's great to see us debating around ensuring high quality,
timely releases. Shows we have developers
who care and are passionate around the project!

Thanks for establishing the timelines, Siva. I would like to add the
following data point: all 4 PRs in question
(raised on the voting thread) were submitted well after the Aug 13th
cutoff date we had originally agreed upon.
If there was communication to the RM or PMC before the cutoff around these
JIRAs, please chime in. In the absence of
this, it seems like this is an issue of some last minute feature requests
and 1 hot fix around SparkSQL. I am pretty
concerned about setting this precedent here, that RCs can be voted down if
certain bug fixes did not make it in time.

We have never encountered an issue like this before. So it's an opportunity
to draft an agreed-upon set of criteria
for what qualifies as a valid -1. This also needs us as a community to
invest in nightly integration tests, and more volunteers
to ensure our tests are not flaky and exercise all complex scenarios.
Without this, I think we have to enforce agreed-upon timelines
very stringently.

I added my +1 to 0.9.0, since I perceive all 4 PR issues to be
non-blocking (even though the SparkSQL bug is a serious limitation).
But, I'd still have us honor the timelines we agreed upon, rather than cut
another RC3 for these.

Apache Voting guidelines explicitly state that "Releases may not be
vetoed. Generally the community will cancel the release
vote if anyone identifies serious problems, but in most cases the ultimate
decision lies with the individual serving as release manager"

Love to hear more from the other PMC members and the RM. Looks like the RM
has the clear final decision here, unless the majority
binding votes for the RC cannot be obtained from the PMC.

Thanks
Vinoth

On Sun, Aug 22, 2021 at 1:25 PM Sivabalan  wrote:

> Hi folks,
> Wanted to start a thread to discuss our guidelines on the release
> process with Apache Hudi. You can find our existing release process here
> <
> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide
> >.
> On
> a high level, our release process is as follows.
>
> 1. Call for a release and rough timeline.
> 2. Nominate a release manager.
> 3. RM collects all release blockers and starts an email thread. This
> email is to call for any other jiras/release blockers to be included as
> part of the release. Also, asks respective owners of release blockers to
> commit to a time for closing it out.
> 4. After the deadline, if not all release blockers have landed: a. Move
> any pending release blockers that are not very critical to the next
> release. b. If there are some essential release blockers, ask the
> respective owners to drive them to closure and extend deadlines to get them
> in.
> 5. Once all release blockers are landed, works on RC1. Verifies the
> candidate and puts it out for voting.
> 6. If approved by majority, proceed with actual release.
> 7. If not approved by majority, waits for the fix to get merged and
> works on RC2.
>
> Coming to our 0.9.0 release, here is how the timeline looks like.
>
> Jul 14: Decided on RM
> Aug  3: RM sent an email with all release blockers. He called out for any
> missed release blockers and asked the respective owners to be mindful
> of the deadline for the release.
> Aug  5: Decided the deadline for release blockers to be landed as 13th Aug.
> Aug 14: All release blockers were landed. Those that could not be landed
> were rolled over to 0.10.0
> Aug 15: RC1 was announced.
> Aug 17: voted -1 by PMC due to config name issue. Existing jobs from older
> hudi versions might fail once migrated.
> Aug 18 to Aug 19: Config patch was worked upon and landed.
> Aug 20: RC2 was announced.
>
> From the literature I found on apache release guidelines
> , there are no strict
> guidelines on what can be qualified as -1. But only PMC votes are binding
> in general. From what we have seen, reasons for -1 are as follows: license
> issues, some part of the apache process not being followed
> properly, the basic quick start failing, a major regression for existing
> users who might be migrating to this latest release, or anything that's
> considered very critical for the project to have as part of the
> upcoming release. I also went through the release guidelines of other
> popular apache projects (spark, flink), but did not find any project
> specific guidelines on voting -1 as such.
>
> I agree that we have a gap here wrt voting guidelines. We can definitely
> work towards that and fix it before the next release on what can be
> qualified for -1. But we have laid out the release guidelines to make the
> process smooth and to follow the apache way. That's why RM sent an email
> calling out for any more release blockers to be considered for the upcoming
> release. And looks 

Re: [VOTE] Release 0.9.0, release candidate #2

2021-08-22 Thread Vinoth Chandar
+1 (binding)

RC check [1] passed

[1] https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b


On Sun, Aug 22, 2021 at 1:28 PM Sivabalan  wrote:

> We can keep the specific discussion out of this voting thread. Have started
> a new thread here
> <
> https://lists.apache.org/thread.html/r3bae7622904b04c7d1fb2ddaf5226e37166d5fbb1721f403b1b04545%40%3Cdev.hudi.apache.org%3E
> >
> to
> continue this discussion. We can keep this thread just for voting. Thanks.
>
> On Sun, Aug 22, 2021 at 2:13 AM Danny Chan  wrote:
>
> > It's not a surprise that 0.9 has a longer release process: Spark SQL
> > support was added, along with many improvements from the Flink engine. We need more
> patience
> > for this release IMO.
> >
> > Having another minor release like 0.9.1 is a solution but not a good one;
> > people place much more faith in the major release and it carries
> > many expectations. If people report problems during the release
> > process, just accept the fix if it is not a big PR, and there are only a
> few
> > up to now. It would not take too much time.
> >
> > I know that it has been about 4 months since the last release, but people
> > want a complete release version not a defective one.
> >
> > Best,
> > Danny
> >
> > Sivabalan  于2021年8月22日周日 上午11:50写道:
> >
> > > I would like to share my thoughts on the release process in general. I
> > will
> > > read more about what exactly qualifies for -1 and will look into what
> > Peng
> > > and Danny has put up. But some thoughts on the release in general.
> > >
> > > Every release process is very tedious and time consuming, and the RM does
> put
> > in
> > > a non-trivial amount of work in getting the release out. To make the
> > process
> > > smooth, RM started an email thread by Aug 3, calling for any release
> > > blockers. Would like to understand, if these were surfaced in that
> > thread?
> > > What I am afraid of is, we might keep delaying our release by adding
> more
> > > patches/bug fixes with every candidate. For instance, if we consider
> > these
> > > and RM works on RC3 and puts up a vote in 5 days and what if someone
> else
> > > wants to add a couple of more fixes or improvements to the release? If
> > it's
> > > a very serious bug that one cannot do basic operations like
> insert/upsert
> > > in any of the engines or some serious regression, yeah we can
> definitely
> > > block the release. But if there are corner case bugs, or any
> improvements
> > > in general, we can always have another release immediately following
> > this.
> > > This is my humble opinion having gone through the release process
> myself
> > in
> > > the past and have helped others in doing the release in Hudi. It's been
> > > more than 4 months we have had a release. Would be good for us to be
> > > mindful of that as well. Maybe this is common in other projects, but I
> am
> > > not aware of that. Please enlighten me if you have experience with
> other
> > > projects.
> > >
> > > I would like to hear from other PMCs and experts who are more
> > > knowledgeable about the release process.
> > > And if anyone has any suggestions on improving the release process in
> > > general (if we can seal the patches that go into a release upfront,
> > etc), I
> > > am all ears to that as well.
> > >
> > >
> > > On Sat, Aug 21, 2021 at 10:41 PM Danny Chan 
> > wrote:
> > >
> > > > I have fired a cherry-pick PR:
> > https://github.com/apache/hudi/pull/3519
> > > >
> > > > Best,
> > > > Danny
> > > >
> > > > Danny Chan  于2021年8月22日周日 上午9:07写道:
> > > >
> > > > > I'm sorry I would also vote -1.
> > > > >
> > > > > HUDI-2316
> > > > > HUDI-2340
> > > > > HUDI-2342
> > > > >
> > > > > are all important improvements for Flink and we hope they can be
> > > > > cherry picked to release 0.9.
> > > > >
> > > > > Best,
> > > > > Danny
> > > > >
> > > > > Udit Mehrotra  于2021年8月21日周六 上午7:13写道:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> Please review and vote on the release candidate #2 for the version
> > > > 0.9.0,
> > > > >> as follows:
> > > > >>
> > > > >> [ ] +1, Approve the release
> > > > >>
> > > > >> [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > >>
> > > > >> The complete staging area is available for your review, which
> > > includes:
> > > > >>
> > > > >> * JIRA release notes [1],
> > > > >>
> > > > >> * the official Apache source release and binary convenience
> releases
> > > to
> > > > be
> > > > >> deployed to dist.apache.org [2], which are signed with the key
> with
> > > > >> fingerprint 44A484600E48193A74F97447C47E66F8386204DF [3],
> > > > >>
> > > > >> * all artifacts to be deployed to the Maven Central Repository
> [4],
> > > > >>
> > > > >> * source code tag "release-0.9.0-rc2" [5],
> > > > >>
> > > > >> The vote will be open for at least 72 hours. It is adopted by
> > majority
> > > > >> approval, with at least 3 PMC affirmative votes.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Release Manager
> > > > >>
> > > > >> [1]
> 

Re: DISCUSS RFC RFC-32 Kafka Connect Sink for Hudi

2021-08-19 Thread Vinoth Chandar
+1 on this.

Thanks for driving this!

On Wed, Aug 18, 2021 at 4:44 PM Rajesh Mahindra  wrote:

> Hi All,
>
> We have a new RFC, RFC-32
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi
> >
> that
> details the design and implementation of a Kafka Sink for Hudi, ensuring
> that Kafka connect users can easily ingest / stream kafka records directly
> to Hudi Tables.
>
> Please review the RFC and provide any feedback. Do not hesitate to ask
> questions if any. Thanks a lot.
>
> --
> Rajesh Mahindra
>


Re: [DISCUSS] Enable Github Discussions

2021-08-18 Thread Vinoth Chandar
I think we need a VOTE thread for infra to enable this for us, and then we
can explore. Will start one!

On Thu, Aug 12, 2021 at 12:24 AM Raymond Xu 
wrote:

> +1
>
> On Wed, Aug 11, 2021 at 9:03 PM Balaji Varadarajan 
> wrote:
>
> > +1
> >
> > Balaji.V
> >
> > On Wed, Aug 11, 2021 at 7:12 PM Bhavani Sudha 
> > wrote:
> >
> > > +1
> > >
> > > Thanks,
> > > Sudha
> > >
> > > On Wed, Aug 11, 2021 at 7:08 PM vino yang 
> wrote:
> > >
> > > > +1
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > Pratyaksh Sharma  于2021年8月12日周四 上午2:16写道:
> > > >
> > > > > +1
> > > > >
> > > > > I have never used it, but we can try this out. :)
> > > > >
> > > > > On Thu, Jul 15, 2021 at 9:43 AM Vinoth Chandar 
> > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to propose that we explore the use of github
> > > discussions.
> > > > > Few
> > > > > > other apache projects have also been trying this out.
> > > > > >
> > > > > > Please chime in
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Release 0.9.0, release candidate #1

2021-08-17 Thread Vinoth Chandar
Hi,

This is another issue that needs to be addressed I think.

Checking for binary files in source release
There were non-text files in source release. Please check below

./hudi-flink/src/test/resources/test_source_5.data: application/csv;
charset=us-ascii

Thanks
Vinoth

On Tue, Aug 17, 2021 at 7:48 AM Vinoth Chandar  wrote:

> -1 (binding)
>
> An issue was surfaced yesterday that affects the re-definition of some
> configs in HoodieWriteConfig, e.g. TABLE_NAME, AVRO_SCHEMA,
> AVRO_SCHEMA_VALIDATE,
> which unfortunately do not have the _PROP suffix added. These are now
> re-defined as ConfigProperty members, and jobs using them directly (we
> have such usages
> even in our quickstart) will break upon upgrading to this release.
> While it's a bit of a step back to have to rename these to something else,
> it might make sense
> to keep all such members without the _PROP suffix as strings and make the new
> members TABLE_NAME_CFG etc., to avoid upgrade pain.
>
> I am also testing the RC for other issues, will update the thread again
> with findings.
>
> Thanks
> Vinoth
>
> On Sun, Aug 15, 2021 at 6:05 PM Raymond Xu 
> wrote:
>
>> Ok the Azure CI passed with the patch so the missing test coverage is
>> proved not an issue
>>
>> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=1770=results
>>
>> reverse my -1 to +1
>>
>> On Sun, Aug 15, 2021 at 5:10 PM Raymond Xu 
>> wrote:
>>
>> > Hey Udit, sorry I'm -1 due to a misconfigured test resulting in some
>> test
>> > cases getting skipped. The patch was merged to master.
>> > https://github.com/apache/hudi/pull/3476/files
>> >
>> > Created a hotfix PR to the release branch.
>> > https://github.com/apache/hudi/pull/3480/files
>> >
>> >
>> > On Sun, Aug 15, 2021 at 4:22 PM Udit Mehrotra 
>> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> Please review and vote on the release candidate #1 for the version
>> 0.9.0,
>> >> as follows:
>> >>
>> >> [ ] +1, Approve the release
>> >>
>> >> [ ] -1, Do not approve the release (please provide specific comments)
>> >>
>> >> The complete staging area is available for your review, which includes:
>> >>
>> >> * JIRA release notes [1],
>> >>
>> >> * the official Apache source release and binary convenience releases
>> to be
>> >> deployed to dist.apache.org [2], which are signed with the key with
>> >> fingerprint 44A484600E48193A74F97447C47E66F8386204DF [3],
>> >>
>> >> * all artifacts to be deployed to the Maven Central Repository [4],
>> >>
>> >> * source code tag "release-0.9.0-rc1" [5],
>> >>
>> >> The vote will be open for at least 72 hours. It is adopted by majority
>> >> approval, with at least 3 PMC affirmative votes.
>> >>
>> >> Thanks,
>> >>
>> >> Release Manager
>> >>
>> >> [1]
>> >>
>> >>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12350027
>> >>
>> >> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.9.0-rc1/
>> >>
>> >> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
>> >>
>> >> [4]
>> >>
>> >>
>> https://repository.apache.org/content/repositories/orgapachehudi-1042/org/apache/hudi/
>> >>
>> >> [5] https://github.com/apache/hudi/tree/release-0.9.0-rc1
>> >>
>> >
>>
>


Re: [VOTE] Release 0.9.0, release candidate #1

2021-08-17 Thread Vinoth Chandar
-1 (binding)

An issue was surfaced yesterday that affects the re-definition of some
configs in HoodieWriteConfig, e.g. TABLE_NAME, AVRO_SCHEMA,
AVRO_SCHEMA_VALIDATE,
which unfortunately do not have the _PROP suffix added. These are now
re-defined as ConfigProperty members, and jobs using them directly (we
have such usages
even in our quickstart) will break upon upgrading to this release.
While it's a bit of a step back to have to rename these to something else,
it might make sense
to keep all such members without the _PROP suffix as strings and make the new
members TABLE_NAME_CFG etc., to avoid upgrade pain.
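To illustrate the breakage with simplified stand-ins (these are not the exact Hudi classes, just the shape of the change):

import java.util.Properties;

public class ConfigRedefinitionExample {
  // Simplified stand-in for the new typed config holder.
  static final class ConfigProperty<T> {
    private final String key;
    ConfigProperty(String key) { this.key = key; }
    String key() { return key; }
  }

  // 0.8.x shape: a plain String constant users passed around directly.
  static final String TABLE_NAME_OLD = "hoodie.table.name";
  // 0.9.0 shape: the same name re-defined as a typed holder.
  static final ConfigProperty<String> TABLE_NAME =
      new ConfigProperty<>("hoodie.table.name");

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(TABLE_NAME_OLD, "my_table");   // old code: compiles
    // props.setProperty(TABLE_NAME, "my_table");    // new shape: no longer compiles
    props.setProperty(TABLE_NAME.key(), "my_table"); // required migration
    System.out.println(props);
  }
}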

I am also testing the RC for other issues, will update the thread again
with findings.

Thanks
Vinoth

On Sun, Aug 15, 2021 at 6:05 PM Raymond Xu 
wrote:

> Ok the Azure CI passed with the patch so the missing test coverage is
> proved not an issue
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=1770=results
>
> reverse my -1 to +1
>
> On Sun, Aug 15, 2021 at 5:10 PM Raymond Xu 
> wrote:
>
> > Hey Udit, sorry I'm -1 due to a misconfigured test resulting in some test
> > cases getting skipped. The patch was merged to master.
> > https://github.com/apache/hudi/pull/3476/files
> >
> > Created a hotfix PR to the release branch.
> > https://github.com/apache/hudi/pull/3480/files
> >
> >
> > On Sun, Aug 15, 2021 at 4:22 PM Udit Mehrotra  wrote:
> >
> >> Hi everyone,
> >>
> >> Please review and vote on the release candidate #1 for the version
> 0.9.0,
> >> as follows:
> >>
> >> [ ] +1, Approve the release
> >>
> >> [ ] -1, Do not approve the release (please provide specific comments)
> >>
> >> The complete staging area is available for your review, which includes:
> >>
> >> * JIRA release notes [1],
> >>
> >> * the official Apache source release and binary convenience releases to
> be
> >> deployed to dist.apache.org [2], which are signed with the key with
> >> fingerprint 44A484600E48193A74F97447C47E66F8386204DF [3],
> >>
> >> * all artifacts to be deployed to the Maven Central Repository [4],
> >>
> >> * source code tag "release-0.9.0-rc1" [5],
> >>
> >> The vote will be open for at least 72 hours. It is adopted by majority
> >> approval, with at least 3 PMC affirmative votes.
> >>
> >> Thanks,
> >>
> >> Release Manager
> >>
> >> [1]
> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12350027
> >>
> >> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.9.0-rc1/
> >>
> >> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> >>
> >> [4]
> >>
> >>
> https://repository.apache.org/content/repositories/orgapachehudi-1042/org/apache/hudi/
> >>
> >> [5] https://github.com/apache/hudi/tree/release-0.9.0-rc1
> >>
> >
>


Re: How to read hudi files with Mapreduce?

2021-08-13 Thread Vinoth Chandar
Hi Jian,

We have a hoodie-hadoop-mr package with some InputFormat implementations. You can try using
HoodieParquetInputFormat to read from an MR job.
I have only tested this with Hive myself, so I am wondering if anyone else
here has real experience trying it with MR directly.
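For reference, a rough and untested sketch of the wiring with the old mapred API (paths are placeholders, and the mapper/reducer setup may need adjusting):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hudi.hadoop.HoodieParquetInputFormat;

public class HudiMapReduceRead {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(HudiMapReduceRead.class);
    conf.setJobName("hudi-mr-read");

    // HoodieParquetInputFormat filters the input splits down to the
    // latest committed base files, like the Hive integration does.
    conf.setInputFormat(HoodieParquetInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/path/to/hudi/table"));
    FileOutputFormat.setOutputPath(conf, new Path("/tmp/hudi-mr-out"));

    // Mapper/reducer classes are omitted here; the defaults simply pass
    // records through.
    JobClient.runJob(conf);
  }
}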

Thanks
Vinoth

On Wed, Aug 11, 2021 at 3:29 AM Jian Feng  wrote:

> Hi all, can anyone give me a sample?
> --
>
> FengJian
>
> Data Infrastructure Team
>
> Mobile +65 90388153
>
> Address 5 Science Park Drive, Shopee Building, Singapore 118265
>


Re: Website redesign

2021-08-10 Thread Vinoth Chandar
> >   3. How to Contribute
> > > > >11. Added the code tab support for the quick-start spark guide,
> > the
> > > > same
> > > > >could be easily applied for other pages.
> > > > >
> > > > >
> > > > > Please refer to the docusaurus official documentation
> > > > > <https://docusaurus.io/docs/markdown-features> for adding more
> > > features
> > > > to
> > > > > the site.
> > > > >
> > > > > This is not the final version, the first step in the process is to
> > > > migrate
> > > > > the content as-is with few minor changes to the docusaurus
> platform,
> > > then
> > > > > we can make incremental changes to add more features.
> > > > >
> > > > > I'll be adding the search bar in the next iteration.
> > > > >
> > > > > These sed one-liners saved the day for me!!
> > > > >
> > > > > sed -i '' -E 's/src=("[^"]*")/src={require(\1).default}/g' *.md
> > > > > sed -i '' -E 's/keywords:(.*)$/keywords: [\1]/g' *.md
> > > > > sed -i '' -E 's///g' *.md
> > > > > sed -i '' -E 's/style=("[^"]*")//g' *.md
> > > > > sed -i '' -E 's/class=("[^"]*")/className=\1/g' *.md
> > > > > sed -i '' -E 's|/docs/0.5.0-|/docs/|g' *.md
> > > > > sed -i '' -E 's|/docs/0.5.1-|/docs/|g' *.md
> > > > > sed -i '' -E 's|/docs/0.5.2-|/docs/|g' *.md
> > > > > sed -i '' -E 's|/docs/0.5.3-|/docs/|g' *.md
> > > > > sed -i '' -E 's|/docs/0.6.0-|/docs/|g' *.md
> > > > > sed -i '' -E 's|/docs/0.7.0-|/docs/|g' *.md
> > > > > sed -i '' -E 's|/docs/0.8.0-|/docs/|g' *.md
> > > > >
> > > > > find . -name '*.md' -exec sed -i '' '/permalink:/d' {} +
> > > > >
> > > > > for filename in *.cn.md; do mv $filename ${filename//cn.md/md};
> done
> > > > >
> > > > > Cheers,
> > > > > Vinoth
> > > > >
> > > > >
> > > > > On Thu, Jul 29, 2021 at 12:53 AM Vinoth Chandar  >
> > > > wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > the PR is up! https://github.com/apache/hudi/pull/3366
> > > > > > Please review.
> > > > > >
> > > > > > This is truly heroic work, vingov, fixing all the broken links
> and
> > > > > cleaning
> > > > > > a lot of debt in the jekyll based theme !
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 12, 2021 at 10:48 PM Vinoth Chandar <
> vin...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Sounds good! Please grab the JIRA and we can start scoping it
> > into
> > > > sub
> > > > > > > tasks?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > > On Mon, Jul 12, 2021 at 10:02 PM Vinoth Govindarajan <
> > > > > > > vinoth.govindara...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi Folks,
> > > > > > >> I have experience in the past building websites, I can
> volunteer
> > > to
> > > > > work
> > > > > > >> on
> > > > > > >> this re-design.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Vinoth
> > > > > > >>
> > > > > > >>
> > > > > > >> On Fri, Jul 2, 2021 at 6:45 PM Vinoth Chandar <
> > vin...@apache.org>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> > At this point, scoping the work itself is a good first task,
> > > > > breaking
> > > > > > >> into
> > > > > > >> > sub tasks.
> > > > > > >> >
> > > > > > >> > I am willing to partner with someone closely, to drive this.
> > > > > > >> >
> > > > > > >> > On Wed, Jun 30, 2021 at 5:45 PM Danny Chan <
> > > danny0...@apache.org>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > All the pages assigns to volunteers or there is a someone
> > > major
> > > > in
> > > > > > it.
> > > > > > >> > >
> > > > > > >> > > Best,
> > > > > > >> > > Danny Chan
> > > > > > >> > >
> > > > > > >> > > Vinoth Chandar 于2021年7月1日 周四上午6:00写道:
> > > > > > >> > >
> > > > > > >> > > > Any volunteers? Also worth asking in slack?
> > > > > > >> > > >
> > > > > > >> > > > On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu <
> > > > > > >> > xu.shiyan.raym...@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi all,
> > > > > > >> > > > >
> > > > > > >> > > > > We've completed a re-design of Hudi's website (
> > > > > hudi.apache.org)
> > > > > > >> , in
> > > > > > >> > > the
> > > > > > >> > > > > goal of making the navigation more organized and
> > > information
> > > > > > more
> > > > > > >> > > > > discoverable. The design document can be found here
> > > (thanks
> > > > to
> > > > > > >> > designer
> > > > > > >> > > > > Joanna)
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6
> > > > > > >> > > > >
> > > > > > >> > > > > The design is ready for implementation; would like to
> > call
> > > > for
> > > > > > >> > > volunteers
> > > > > > >> > > > > to pick up this one!
> > > > > > >> > > > > https://issues.apache.org/jira/browse/HUDI-1985
> > > > > > >> > > > >
> > > > > > >> > > > > Cheers,
> > > > > > >> > > > > Raymond
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>


Re: [DISCUSS] Hudi 0.9.0 Release

2021-08-05 Thread Vinoth Chandar
Any other thoughts? Love to lock this date down sooner rather than later.

Thanks
Vinoth

On Tue, Aug 3, 2021 at 11:35 PM Udit Mehrotra  wrote:

> Agreed Vinoth. End of next week seems reasonable as a hard deadline for
> cutting the RC.
>
> If anyone thinks otherwise or needs more time, feel free to chime in.
>
> On Tue, Aug 3, 2021 at 8:10 PM Vinoth Chandar  wrote:
>
> > Thanks Udit! I propose we set end of next week as a hard deadline for
> > cutting the RC. Any thoughts?
> >
> > A good amount of progress is being made on these blockers, I think.
> >
> >
> > On Tue, Aug 3, 2021 at 5:13 PM Udit Mehrotra  wrote:
> >
> > > Hi Community,
> > >
> > > As we draw close to doing Hudi 0.9.0 release, I am happy to share a
> > summary
> > > of the key features/improvements that would be going in the release and
> > the
> > > current blockers for everyone's visibility.
> > >
> > > *Highlights*
> > >
> > >- [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink
> > >writer
> > >- [HUDI-1738] Detect and emit deleted records for Flink MOR table
> > >streaming read
> > >- [HUDI-1867] Support streaming reads for Flink COW table
> > >- [HUDI-1908] Global index for flink writer
> > >- [HUDI-1788] Support Insert Overwrite with Flink Writer
> > >- [HUDI-2209] Bulk insert for flink writer
> > >- [HUDI-1591] Support querying using non-globbed paths for Hudi
> Spark
> > >DataSource queries
> > >- [HUDI-1591] Partition pruning support for read optimized queries
> via
> > >Hudi Spark DataSource
> > >- [HUDI-1415] Register Hudi Table as a Spark DataSource Table with
> > >metastore. Queries via Spark SQL will be routed through Hudi
> > DataSource
> > >(instead of InputFormat), thus making it more performant due to
> > Spark's
> > >native/optimized readers
> > >- [HUDI-1591] Partition pruning support for snapshot queries via
> Hudi
> > >Spark DataSource
> > >- [HUDI-1658] DML and DDL support via Spark SQL
> > >- [HUDI-1790] Add SqlSource for DeltaStreamer to support backfill
> use
> > >cases:
> > >- [HUDI-251] Add JDBC Source support for DeltaStreamer
> > >- [HUDI-1910] Support Kafka based checkpointing for
> > HoodieDeltaStreamer
> > >- [HUDI-1371] Support metadata based listing for Spark DataSource
> and
> > >Spark SQL
> > >- [HUDI-2013] [HUDI-1717] [HUDI-2089] [HUDI-2016] Improvements to
> > >Metadata based listing
> > >- [HUDI-89] Introduce a HoodieConfig/ConfigProperty framework to
> bring
> > >all configs under one roof
> > >- [HUDI-2124] Grafana dashboard for Hudi
> > >- [HUDI-1104] [HUDI-1105] [HUDI-2009] Improvements to Bulk Insert
> via
> > >row writing
> > >- [HUDI-1483] Async clustering for Delta Streamer
> > >- [HUDI-2235] Add virtual key support to Hudi
> > >- [HUDI-1848] Add support for Hive Metastore in Hive-sync-tool
> > >- In addition, there have been significant improvements and bug
> fixes
> > to
> > >improve the overall stability of Flink Hudi integration
> > >
> > > *Current Blockers*
> > >
> > >- [HUDI-2208] Support Bulk Insert For Spark Sql (Owner: pengzhiwei)
> > >- [HUDI-1256] Follow on improvements to HFile tables for metadata
> > based
> > >listing (Owner: None)
> > >- [HUDI-2063] Add Doc For Spark Sql (DML and DDL) integration With
> > Hudi
> > >(Owner: pengzhiwei)
> > >- [HUDI-1842] Spark Sql Support For The Exists Hoodie Table (Owner:
> > >pengzhiwei)
> > >- [HUDI-1138] Re-implement marker files via timeline server (Owner:
> > >Ethan Guo)
> > >- [HUDI-1985] Website redesign implementation (Owner: Vinoth
> > >Govindarajan)
> > >- [HUDI-2232] MERGE INTO fails with table having nested struct
> (Owner:
> > >pengzhiwei)
> > >- [HUDI-1468] incremental read support with clustering (Owner:
> Liwei)
> > >- [HUDI-2250] Bulk insert support for tables w/ primary key (Owner:
> > > None)
> > >- [HUDI-] [SQL] Test catalog integration (Owner: Sagar Sumit)
> > >- [HUDI-2221] [SQL] Functionality testing with Spark 2 (Owner: Sagar
> > >Sumit)
> > >- [HUDI-1887] Setting default value to false for enabling schema
> post
> > >  

Re: [DISCUSS] Hudi is the data lake platform

2021-08-04 Thread Vinoth Chandar
Folks,

I have been digesting some feedback on what we show on the home page itself.

While the blog explains the vision, it might be good to bubble up sub-areas
that are
more relevant to our users today: transactions, updates, deletes.

So, I have raised a PR moving stuff around.

Now we lead with
- "Hudi brings transactions, record-level updates/deletes and change
streams to data lakes"

then explain the platform, in the next level of detail.

https://github.com/apache/hudi/pull/3406

On Mon, Aug 2, 2021 at 9:39 AM Vinoth Chandar  wrote:

> Thanks! Will work on it this week.
> Also redoing some images based on feedback.
>
> On Fri, Jul 30, 2021 at 2:06 AM vino yang  wrote:
>
>> +1
>>
>> Pratyaksh Sharma  于2021年7月30日周五 上午1:47写道:
>>
>> > Guess we should rebrand Hudi in the README.md file as well -
>> > https://github.com/apache/hudi#readme?
>> >
>> > This page still mentions the following -
>> >
>> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
>> > Incrementals. Hudi manages the storage of large analytical datasets on
>> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
>> >
>> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar 
>> wrote:
>> >
>> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
>> >>
>> >> Will land this monday, giving it more time over the weekend as well.
>> >>
>> >>
>> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang 
>> wrote:
>> >>
>> >> > Thanks vc
>> >> >
>> >> > Very good blog, in-depth and forward-looking. Learned!
>> >> >
>> >> > Best,
>> >> > Vino
>> >> >
>> >> > Vinoth Chandar  于2021年7月22日周四 上午3:58写道:
>> >> >
>> >> > > Expanding to users@ as well.
>> >> > >
>> >> > > Hi all,
>> >> > >
>> >> > > Since this discussion, I started to pen down a coherent strategy
>> and
>> >> > convey
>> >> > > these ideas via a blog post.
>> >> > > I have also done my own research, talked to (ex)colleagues I
>> respect
>> >> to
>> >> > get
>> >> > > their take and refine it.
>> >> > >
>> >> > > Here's a blog that hopefully explains this vision.
>> >> > >
>> >> > > https://github.com/apache/hudi/pull/3322
>> >> > >
>> >> > > Look forward to your feedback on the PR. We are hoping to land this
>> >> early
>> >> > > next week, if everyone is aligned.
>> >> > >
>> >> > > Thanks
>> >> > > Vinoth
>> >> > >
>> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li 
>> wrote:
>> >> > >
>> >> > > > +1 , Cannot agree more.
>> >> > > >  *aux metadata* and metatable, can make hudi have large
>> preformance
>> >> > > > optimization on query end.
>> >> > > > Can continuous develop.
>> >> > > > cache service may the necessary component in cloud native
>> >> environment.
>> >> > > >
>> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar 
>> wrote:
>> >> > > > > Hello all,
>> >> > > > >
>> >> > > > > Reading one more article today, positioning Hudi, as just a
>> table
>> >> > > format,
>> >> > > > > made me wonder, if we have done enough justice in explaining
>> what
>> >> we
>> >> > > have
>> >> > > > > built together here.
>> >> > > > > I tend to think of Hudi as the data lake platform, which has
>> the
>> >> > > > following
>> >> > > > > components, of which - one if a table format, one is a
>> >> transactional
>> >> > > > > storage layer.
>> >> > > > > But the whole stack we have is definitely worth more than the
>> sum
>> >> of
>> >> > > all
>> >> > > > > the parts IMO (speaking from my own experience from the past
>> 10+
>> >> > years
>> >> > > of
>> >> > > > > open source software dev).
>> >> > > > >
>> >> > &

Re: [DISCUSS] Hudi 0.9.0 Release

2021-08-03 Thread Vinoth Chandar
Thanks Udit! I propose we set end of next week as a hard deadline for
cutting the RC. Any thoughts?

A good amount of progress is being made on these blockers, I think.


On Tue, Aug 3, 2021 at 5:13 PM Udit Mehrotra  wrote:

> Hi Community,
>
> As we draw close to doing Hudi 0.9.0 release, I am happy to share a summary
> of the key features/improvements that would be going in the release and the
> current blockers for everyone's visibility.
>
> *Highlights*
>
>- [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink
>writer
>- [HUDI-1738] Detect and emit deleted records for Flink MOR table
>streaming read
>- [HUDI-1867] Support streaming reads for Flink COW table
>- [HUDI-1908] Global index for flink writer
>- [HUDI-1788] Support Insert Overwrite with Flink Writer
>- [HUDI-2209] Bulk insert for flink writer
>- [HUDI-1591] Support querying using non-globbed paths for Hudi Spark
>DataSource queries
>- [HUDI-1591] Partition pruning support for read optimized queries via
>Hudi Spark DataSource
>- [HUDI-1415] Register Hudi Table as a Spark DataSource Table with
>metastore. Queries via Spark SQL will be routed through Hudi DataSource
>(instead of InputFormat), thus making it more performant due to Spark's
>native/optimized readers
>- [HUDI-1591] Partition pruning support for snapshot queries via Hudi
>Spark DataSource
>- [HUDI-1658] DML and DDL support via Spark SQL
>- [HUDI-1790] Add SqlSource for DeltaStreamer to support backfill use
>cases:
>- [HUDI-251] Add JDBC Source support for DeltaStreamer
>- [HUDI-1910] Support Kafka based checkpointing for HoodieDeltaStreamer
>- [HUDI-1371] Support metadata based listing for Spark DataSource and
>Spark SQL
>- [HUDI-2013] [HUDI-1717] [HUDI-2089] [HUDI-2016] Improvements to
>Metadata based listing
>- [HUDI-89] Introduce a HoodieConfig/ConfigProperty framework to bring
>all configs under one roof
>- [HUDI-2124] Grafana dashboard for Hudi
>- [HUDI-1104] [HUDI-1105] [HUDI-2009] Improvements to Bulk Insert via
>row writing
>- [HUDI-1483] Async clustering for Delta Streamer
>- [HUDI-2235] Add virtual key support to Hudi
>- [HUDI-1848] Add support for Hive Metastore in Hive-sync-tool
>- In addition, there have been significant improvements and bug fixes to
>improve the overall stability of Flink Hudi integration
>
> *Current Blockers*
>
>- [HUDI-2208] Support Bulk Insert For Spark Sql (Owner: pengzhiwei)
>- [HUDI-1256] Follow on improvements to HFile tables for metadata based
>listing (Owner: None)
>- [HUDI-2063] Add Doc For Spark Sql (DML and DDL) integration With Hudi
>(Owner: pengzhiwei)
>- [HUDI-1842] Spark Sql Support For The Exists Hoodie Table (Owner:
>pengzhiwei)
>- [HUDI-1138] Re-implement marker files via timeline server (Owner:
>Ethan Guo)
>- [HUDI-1985] Website redesign implementation (Owner: Vinoth
>Govindarajan)
>- [HUDI-2232] MERGE INTO fails with table having nested struct (Owner:
>pengzhiwei)
>- [HUDI-1468] incremental read support with clustering (Owner: Liwei)
>- [HUDI-2250] Bulk insert support for tables w/ primary key (Owner:
> None)
>- [HUDI-] [SQL] Test catalog integration (Owner: Sagar Sumit)
>- [HUDI-2221] [SQL] Functionality testing with Spark 2 (Owner: Sagar
>Sumit)
>- [HUDI-1887] Setting default value to false for enabling schema post
>processor (Owner: Sivabalan)
>- [HUDI-1850] Fixing read of a empty table but with failed write (Owner:
>Sivabalan)
>- [HUDI-2151] Enable defaults for out of box performance (Owner: Udit
>Mehrotra)
>- [HUDI-2119] Ensure the rolled-back instance was previously synced to
>the Metadata Table when syncing a Rollback Instant (Owner: Prashant
> Wason)
>- [HUDI-1458] Support custom clustering strategies and preserve commit
>time to support incremental read (Owner: Satish Kotha)
>- [HUDI-1763] Fixing honoring of Ordering val in
>DefaultHoodieRecordPayload.preCombine (Owner: Sivabalan)
>- [HUDI-1129] Improving schema evolution support in hudi (Owner:
>Sivabalan)
>- [HUDI-2120] [DOC] Update docs about schema in flink sql configuration
>(Owner: Xianghu Wang)
>- [HUDI-2182] Support Compaction Command For Spark Sql (Owner:
>pengzhiwei)
>
> Please respond to the thread if you think that I have missed capturing any
> of the highlights or blockers for Hudi 0.9.0 release. For the owners of
> these release blockers, can you please provide a specific timeline you are
> willing to commit to for finishing these so we can cut an RC ?
>
> Thanks,
> Udit
>


Re: [DISCUSS] Disable ASF GitHub Bot comments under the JIRA issue

2021-08-02 Thread Vinoth Chandar
+1 as well.
Danny, please go ahead and remove it and land this.

On Sun, Aug 1, 2021 at 2:51 AM leesf  wrote:

> +1 to disable.
>
> Vinoth Chandar  于2021年7月28日周三 上午12:37写道:
>
> > Anybody with strong opinions to keep them?
> > I am happy to go back to clicking to get to github links.
> >
> > On Tue, Jul 27, 2021 at 6:33 AM xuedong luan 
> > wrote:
> >
> > > +1
> > >
> > > Danny Chan  于2021年7月27日周二 上午10:38写道:
> > >
> > > > I found that there are many ASF GitHub Bot comments under our issue
> > now,
> > > it
> > > > messes up the design discussions and is hard to read. The normal
> > > > comments are drowned in these junk messages.
> > > >
> > > > So i request to disable it to make the JIRA comments clear and clean.
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > >
> >
>


Re: [DISCUSS] Hudi is the data lake platform

2021-08-02 Thread Vinoth Chandar
Thanks! Will work on it this week.
Also redoing some images based on feedback.

On Fri, Jul 30, 2021 at 2:06 AM vino yang  wrote:

> +1
>
> Pratyaksh Sharma  于2021年7月30日周五 上午1:47写道:
>
> > Guess we should rebrand Hudi in the README.md file as well -
> > https://github.com/apache/hudi#readme?
> >
> > This page still mentions the following -
> >
> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> > Incrementals. Hudi manages the storage of large analytical datasets on
> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
> >
> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar 
> wrote:
> >
> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
> >>
> >> Will land this monday, giving it more time over the weekend as well.
> >>
> >>
> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang 
> wrote:
> >>
> >> > Thanks vc
> >> >
> >> > Very good blog, in-depth and forward-looking. Learned!
> >> >
> >> > Best,
> >> > Vino
> >> >
> >> > Vinoth Chandar  于2021年7月22日周四 上午3:58写道:
> >> >
> >> > > Expanding to users@ as well.
> >> > >
> >> > > Hi all,
> >> > >
> >> > > Since this discussion, I started to pen down a coherent strategy and
> >> > convey
> >> > > these ideas via a blog post.
> >> > > I have also done my own research, talked to (ex)colleagues I respect
> >> to
> >> > get
> >> > > their take and refine it.
> >> > >
> >> > > Here's a blog that hopefully explains this vision.
> >> > >
> >> > > https://github.com/apache/hudi/pull/3322
> >> > >
> >> > > Look forward to your feedback on the PR. We are hoping to land this
> >> early
> >> > > next week, if everyone is aligned.
> >> > >
> >> > > Thanks
> >> > > Vinoth
> >> > >
> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li 
> wrote:
> >> > >
> >> > > > +1 , Cannot agree more.
> >> > > >  *aux metadata* and metatable, can make hudi have large
> preformance
> >> > > > optimization on query end.
> >> > > > Can continuous develop.
> >> > > > cache service may the necessary component in cloud native
> >> environment.
> >> > > >
> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar  wrote:
> >> > > > > Hello all,
> >> > > > >
> >> > > > > Reading one more article today, positioning Hudi, as just a
> table
> >> > > format,
> >> > > > > made me wonder, if we have done enough justice in explaining
> what
> >> we
> >> > > have
> >> > > > > built together here.
> >> > > > > I tend to think of Hudi as the data lake platform, which has the
> >> > > > following
> >> > > > > components, of which - one if a table format, one is a
> >> transactional
> >> > > > > storage layer.
> >> > > > > But the whole stack we have is definitely worth more than the
> sum
> >> of
> >> > > all
> >> > > > > the parts IMO (speaking from my own experience from the past 10+
> >> > years
> >> > > of
> >> > > > > open source software dev).
> >> > > > >
> >> > > > > Here's what we have built so far.
> >> > > > >
> >> > > > > a) *table format* : something that stores table schema, a
> metadata
> >> > > table
> >> > > > > that stores file listing today, and being extended to store
> column
> >> > > ranges
> >> > > > > and more in the future (RFC-27)
> >> > > > > b) *aux metadata* : bloom filters, external record level indexes
> >> > today,
> >> > > > > bitmaps/interval trees and other advanced on-disk data
> structures
> >> > > > tomorrow
> >> > > > > c) *concurrency control* : we always supported MVCC based log
> >> based
> >> > > > > concurrency (serialize writes into a time ordered log), and we
> now
> >> > also
> >> > > > > have OCC for batch m

Re: Long test run times

2021-07-30 Thread Vinoth Chandar
There is probably a good amount of low-hanging fruit; we can make some
headway and see how it goes from there?

On Thu, Jul 29, 2021 at 7:18 PM Danny Chan  wrote:

> What should we do about these long-running tests? Simplify them into
> simpler UTs?
>
> Vinoth Chandar  于2021年7月30日周五 上午6:53写道:
>
> > I am looking into
> >
> > 614.322 org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> > 556.392 org.apache.hudi.metadata.TestHoodieBackedMetadata
> >
> > On Thu, Jul 29, 2021 at 1:54 PM Sivabalan  wrote:
> >
> > > I will take up some.
> > >
> > > org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> > > org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite
> > > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> > > org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Jul 29, 2021 at 4:04 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Folks,
> > > >
> > > > Our tests are now exceeding 60 minutes and suffering timeouts on
> azure
> > > > (travis is slow to actually schedule, but seems to finish on time).
> > > >
> > > > Following are the list of top slow tests. I am working from the top
> of
> > > the
> > > > list. Any one interested in chipping in?  Please respond with the
> test
> > > you
> > > > are interested in fixing.
> > > >
> > > > 927.371
> > > >
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> > > > 894.438
> > > >
> org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite
> > > > 866.088 org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> > > > 614.322 org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> > > > 556.392 org.apache.hudi.metadata.TestHoodieBackedMetadata
> > > > 291.229 org.apache.hudi.table.TestHoodieMergeOnReadTable
> > > > 245.922 org.apache.hudi.functional.TestStructuredStreaming
> > > > 234.737 org.apache.hudi.index.TestHoodieIndex
> > > > 224.893 org.apache.hudi.functional.TestMORDataSource
> > > > 217.354
> > > >
> org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter
> > > > 210.717 org.apache.hudi.common.functional.TestHoodieLogFormat
> > > > 204.989 org.apache.hudi.table.TestCleaner
> > > > 198.623 org.apache.hudi.client.TestBootstrap
> > > > 170.029 org.apache.hudi.hive.TestHiveSyncTool
> > > > 135.295 org.apache.hudi.table.action.compact.TestInlineCompaction
> > > > 116.381
> > org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> > > > 113.559 org.apache.hudi.functional.TestCOWDataSource
> > > > 111.297 org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: Long test run times

2021-07-29 Thread Vinoth Chandar
I am looking into

614.322 org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
556.392 org.apache.hudi.metadata.TestHoodieBackedMetadata

On Thu, Jul 29, 2021 at 1:54 PM Sivabalan  wrote:

> I will take up some.
>
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter
>
>
>
>
>
>
>
>
> On Thu, Jul 29, 2021 at 4:04 PM Vinoth Chandar  wrote:
>
> > Folks,
> >
> > Our tests are now exceeding 60 minutes and suffering timeouts on azure
> > (travis is slow to actually schedule, but seems to finish on time).
> >
> > Following are the list of top slow tests. I am working from the top of
> the
> > list. Any one interested in chipping in?  Please respond with the test
> you
> > are interested in fixing.
> >
> > 927.371
> > org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> > 894.438
> > org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite
> > 866.088 org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> > 614.322 org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> > 556.392 org.apache.hudi.metadata.TestHoodieBackedMetadata
> > 291.229 org.apache.hudi.table.TestHoodieMergeOnReadTable
> > 245.922 org.apache.hudi.functional.TestStructuredStreaming
> > 234.737 org.apache.hudi.index.TestHoodieIndex
> > 224.893 org.apache.hudi.functional.TestMORDataSource
> > 217.354
> > org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter
> > 210.717 org.apache.hudi.common.functional.TestHoodieLogFormat
> > 204.989 org.apache.hudi.table.TestCleaner
> > 198.623 org.apache.hudi.client.TestBootstrap
> > 170.029 org.apache.hudi.hive.TestHiveSyncTool
> > 135.295 org.apache.hudi.table.action.compact.TestInlineCompaction
> > 116.381 org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> > 113.559 org.apache.hudi.functional.TestCOWDataSource
> > 111.297 org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> >
> > Thanks
> > Vinoth
> >
>
>
> --
> Regards,
> -Sivabalan
>


Long test run times

2021-07-29 Thread Vinoth Chandar
Folks,

Our tests are now exceeding 60 minutes and suffering timeouts on azure
(travis is slow to actually schedule, but seems to finish on time).

Following are the list of top slow tests. I am working from the top of the
list. Any one interested in chipping in?  Please respond with the test you
are interested in fixing.

927.371
org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
894.438
org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite
866.088 org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
614.322 org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
556.392 org.apache.hudi.metadata.TestHoodieBackedMetadata
291.229 org.apache.hudi.table.TestHoodieMergeOnReadTable
245.922 org.apache.hudi.functional.TestStructuredStreaming
234.737 org.apache.hudi.index.TestHoodieIndex
224.893 org.apache.hudi.functional.TestMORDataSource
217.354
org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter
210.717 org.apache.hudi.common.functional.TestHoodieLogFormat
204.989 org.apache.hudi.table.TestCleaner
198.623 org.apache.hudi.client.TestBootstrap
170.029 org.apache.hudi.hive.TestHiveSyncTool
135.295 org.apache.hudi.table.action.compact.TestInlineCompaction
116.381 org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
113.559 org.apache.hudi.functional.TestCOWDataSource
111.297 org.apache.hudi.table.upgrade.TestUpgradeDowngrade

Thanks
Vinoth


Re: Website redesign

2021-07-29 Thread Vinoth Chandar
Folks,

the PR is up! https://github.com/apache/hudi/pull/3366
Please review.

This is truly heroic work, vingov, fixing all the broken links and cleaning
up a lot of debt in the Jekyll-based theme!


On Mon, Jul 12, 2021 at 10:48 PM Vinoth Chandar  wrote:

> Hi,
>
> Sounds good! Please grab the JIRA and we can start scoping it into sub
> tasks?
>
> Thanks
> Vinoth
>
> On Mon, Jul 12, 2021 at 10:02 PM Vinoth Govindarajan <
> vinoth.govindara...@gmail.com> wrote:
>
>> Hi Folks,
>> I have experience in the past building websites, I can volunteer to work
>> on
>> this re-design.
>>
>> Best,
>> Vinoth
>>
>>
>> On Fri, Jul 2, 2021 at 6:45 PM Vinoth Chandar  wrote:
>>
>> > At this point, scoping the work itself is a good first task, breaking
>> into
>> > sub tasks.
>> >
>> > I am willing to partner with someone closely, to drive this.
>> >
>> > On Wed, Jun 30, 2021 at 5:45 PM Danny Chan 
>> wrote:
>> >
> > > > Are all the pages assigned to volunteers, or is there someone leading this?
>> > >
>> > > Best,
>> > > Danny Chan
>> > >
> > > > Vinoth Chandar wrote on Thu, Jul 1, 2021 at 6:00 AM:
>> > >
>> > > > Any volunteers? Also worth asking in slack?
>> > > >
>> > > > On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu <
>> > xu.shiyan.raym...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > We've completed a re-design of Hudi's website (hudi.apache.org)
>> , in
>> > > the
>> > > > > goal of making the navigation more organized and information more
>> > > > > discoverable. The design document can be found here (thanks to
>> > designer
>> > > > > Joanna)
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6
>> > > > >
>> > > > > The design is ready for implementation; would like to call for
>> > > volunteers
>> > > > > to pick up this one!
>> > > > > https://issues.apache.org/jira/browse/HUDI-1985
>> > > > >
>> > > > > Cheers,
>> > > > > Raymond
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: [DISCUSS] Disable ASF GitHub Bot comments under the JIRA issue

2021-07-27 Thread Vinoth Chandar
Anybody with strong opinions on keeping them?
I am happy to go back to clicking to get to github links.

On Tue, Jul 27, 2021 at 6:33 AM xuedong luan 
wrote:

> +1
>
> > Danny Chan  wrote on Tue, Jul 27, 2021 at 10:38 AM:
>
> > I found that there are many ASF GitHub Bot comments under our issues now;
> it
> > clutters the design discussions and is hard to read. The normal
> > comments are drowned out by these junk messages.
> >
> > So i request to disable it to make the JIRA comments clear and clean.
> >
> > Best,
> > Danny Chan
> >
>


Re: How to disable the ASF GitHub Bot comments under the issue ticket ?

2021-07-26 Thread Vinoth Chandar
Hi Danny,

Worth discussing. It was turned on by adding "comment" here.

https://github.com/apache/hudi/blob/master/.asf.yaml#L41

The intention is that all GH activity is reflected once you see a JIRA.
Otherwise, one has to keep clicking back and forth.
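
For reference, the relevant section looks roughly like this (a sketch; the
authoritative copy is the file linked above):

notifications:
  # "comment" mirrors every GitHub comment onto the JIRA; dropping it
  # (e.g. keeping just "link label") should quiet the bot.
  jira_options: link label comment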

Maybe open this for discussion and see what everyone thinks?

Thanks
Vinoth

On Mon, Jul 26, 2021 at 3:06 AM Danny Chan  wrote:

> I found that there are many ASF GitHub Bot comments under our issues now; it
> clutters the design discussions and is hard to read.
>
> Is there are way to disable it ?
>
> Best,
> Danny
>


Re: [DISCUSS] Hudi is the data lake platform

2021-07-23 Thread Vinoth Chandar
Thanks Vino! Got a bunch of emoticons on the PR as well.

Will land this Monday, giving it more time over the weekend as well.


On Wed, Jul 21, 2021 at 7:36 PM vino yang  wrote:

> Thanks vc
>
> Very good blog, in-depth and forward-looking. Learned!
>
> Best,
> Vino
>
> Vinoth Chandar  wrote on Thu, Jul 22, 2021 at 3:58 AM:
>
> > Expanding to users@ as well.
> >
> > Hi all,
> >
> > Since this discussion, I started to pen down a coherent strategy and
> convey
> > these ideas via a blog post.
> > I have also done my own research, talked to (ex)colleagues I respect to
> get
> > their take and refine it.
> >
> > Here's a blog that hopefully explains this vision.
> >
> > https://github.com/apache/hudi/pull/3322
> >
> > Look forward to your feedback on the PR. We are hoping to land this early
> > next week, if everyone is aligned.
> >
> > Thanks
> > Vinoth
> >
> > On Wed, Apr 21, 2021 at 9:01 PM wei li  wrote:
> >
> > > +1, cannot agree more.
> > > *aux metadata* and the metadata table can bring large performance
> > > optimizations on the query side.
> > > They can be developed continuously.
> > > A cache service may be a necessary component in cloud-native environments.
> > >
> > > On 2021/04/13 05:29:55, Vinoth Chandar  wrote:
> > > > Hello all,
> > > >
> > > > Reading one more article today positioning Hudi as just a table
> > format
> > > > made me wonder if we have done enough justice in explaining what we
> > have
> > > > built together here.
> > > > I tend to think of Hudi as the data lake platform, which has the
> > > following
> > > > components, of which one is a table format and one is a transactional
> > > > storage layer.
> > > > But the whole stack we have is definitely worth more than the sum of
> > all
> > > > the parts IMO (speaking from my own experience from the past 10+
> years
> > of
> > > > open source software dev).
> > > >
> > > > Here's what we have built so far.
> > > >
> > > > a) *table format* : something that stores table schema, a metadata
> > table
> > > > that stores file listing today, and being extended to store column
> > ranges
> > > > and more in the future (RFC-27)
> > > > b) *aux metadata* : bloom filters, external record level indexes
> today,
> > > > bitmaps/interval trees and other advanced on-disk data structures
> > > tomorrow
> > > > c) *concurrency control* : we always supported MVCC based log based
> > > > concurrency (serialize writes into a time ordered log), and we now
> also
> > > > have OCC for batch merge workloads with 0.8.0. We will have
> multi-table
> > > and
> > > > fully non-blocking writers soon (see future work section of RFC-22)
> > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> Hudi,
> > > but
> > > > we support primary/unique key constraints and we could add foreign
> keys
> > > as
> > > > an extension, once our transactions can span tables.
> > > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > > files,
> > > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > > actions working off each other without blocking one another. (for
> most
> > > > parts).
> > > > f) *data services*: we also have higher level functionality with
> > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > callbacks, pre-commit validations are coming, error tables have been
> > > > proposed. I could also envision us building towards streaming egress,
> > > data
> > > > monitoring.
> > > >
> > > > I also think we should build the following (subject to separate
> > > > DISCUSS/RFCs)
> > > >
> > > > g) *caching service*: Hudi specific caching service that can hold
> > mutable
> > > > data and serve oft-queried data across engines.
> > > > h) t*imeline metaserver:* We already run a metaserver in spark
> > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > > turn
> > > > it into a scalable, sharded metastore, that all engines can use to
> > obtain
> > > > any metadata.
> > > >
> > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> opposed
> > to
> > > > "ingests & manages storage of large analytical datasets over DFS
> (hdfs
> > or
> > > > cloud stores)." and convey the scope of our vision,
> > > > given we have already been building towards that. It would also
> provide
> > > new
> > > > contributors a good lens to look at the project from.
> > > >
> > > > (This is very similar to, e.g., the evolution of Kafka from a
> pub-sub
> > > > system to an event streaming platform - with the addition of
> > > > MirrorMaker/Connect etc.)
> > > >
> > > > Please share your thoughts!
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>


Re: [VOTE] Move content off cWiki

2021-07-23 Thread Vinoth Chandar
Vote is now closed.

Vote passed with 13 +1s and no -1s.

Thanks all!

On Fri, Jul 23, 2021 at 7:31 AM Vinoth Chandar  wrote:

> +1
>
> On Fri, Jul 23, 2021 at 2:41 AM Gary Li  wrote:
>
>> +1
>>
>> On Tue, Jul 20, 2021 at 8:06 PM vino yang  wrote:
>>
>> > +1
>> >
>> > Navinder Brar  wrote on Tue, Jul 20, 2021
>> at 11:01 AM:
>> >
>> > > +1
>> > > Navinder
>> > >
>> > >
>> > > Sent from Yahoo Mail for iPhone
>> > >
>> > >
>> > > On Tuesday, July 20, 2021, 7:28 AM, Sivabalan 
>> > wrote:
>> > >
>> > > +1
>> > >
>> > > On Mon, Jul 19, 2021 at 9:19 PM Nishith  wrote:
>> > >
>> > > > +1
>> > > >
>> > > > -Nishith
>> > > >
>> > > > > On Jul 19, 2021, at 6:15 PM, Udit Mehrotra <
>> > udit.mehrotr...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > > +1
>> > > > >
>> > > > > Best,
>> > > > > Udit
>> > > > >
>> > > > >> On Mon, Jul 19, 2021 at 6:04 PM wangxianghu 
>> > wrote:
>> > > > >>
>> > > > >> +1 - Approve the move
>> > > > >>
>> > > > >>> On Jul 20, 2021 at 8:37 AM, Danny Chan  wrote:
>> > > > >>>
>> > > > >>> +1 - Approve the move
>> > > > >>
>> > > > >>
>> > > >
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > -Sivabalan
>> > >
>> > >
>> > >
>> > >
>> >
>>
>


Re: [VOTE] Move content off cWiki

2021-07-23 Thread Vinoth Chandar
+1

On Fri, Jul 23, 2021 at 2:41 AM Gary Li  wrote:

> +1
>
> On Tue, Jul 20, 2021 at 8:06 PM vino yang  wrote:
>
> > +1
> >
> > > Navinder Brar  wrote on Tue, Jul 20, 2021 at 11:01 AM:
> >
> > > +1
> > > Navinder
> > >
> > >
> > > Sent from Yahoo Mail for iPhone
> > >
> > >
> > > On Tuesday, July 20, 2021, 7:28 AM, Sivabalan 
> > wrote:
> > >
> > > +1
> > >
> > > On Mon, Jul 19, 2021 at 9:19 PM Nishith  wrote:
> > >
> > > > +1
> > > >
> > > > -Nishith
> > > >
> > > > > On Jul 19, 2021, at 6:15 PM, Udit Mehrotra <
> > udit.mehrotr...@gmail.com>
> > > > wrote:
> > > > >
> > > > > +1
> > > > >
> > > > > Best,
> > > > > Udit
> > > > >
> > > > >> On Mon, Jul 19, 2021 at 6:04 PM wangxianghu 
> > wrote:
> > > > >>
> > > > >> +1 - Approve the move
> > > > >>
> > > > >>> On Jul 20, 2021 at 8:37 AM, Danny Chan  wrote:
> > > > >>>
> > > > >>> +1 - Approve the move
> > > > >>
> > > > >>
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> > >
> > >
> > >
> >
>


Re: [DISCUSS] Create Spark and Flink utilities module

2021-07-21 Thread Vinoth Chandar
Hi Vinay,

I am not sure why we are bundling parquet with Flink. If we are, we could try
and resolve that.
That's the route we can first take IMO.

Our bundles don't bundle, spark, flink, hadoop, parquet. So I think a
single bundle is doable.

Happy to help with specific issues as they come up.
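
If it turns out the Flink bundle is shading parquet in, one way to resolve it
could be excluding those artifacts from the shaded jar, so the engine's own
version wins at runtime. A minimal maven-shade-plugin sketch (coordinates are
illustrative, untested against our poms):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <excludes>
        <!-- keep parquet out of the bundle; rely on the engine's version -->
        <exclude>org.apache.parquet:parquet-column</exclude>
        <exclude>org.apache.parquet:parquet-hadoop</exclude>
      </excludes>
    </artifactSet>
  </configuration>
</plugin>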

On Tue, Jul 20, 2021 at 9:35 AM Vinay Patil  wrote:

> Hi Vinoth,
>
> > I wonder if it's possible to structure the code in separate modules, but
> have a single bundle
>
> Yes, this is possible. I initially started doing the same in this PR:
> https://github.com/apache/hudi/pull/3162 , hence I wanted to discuss this
> here. If we create a single bundle, we have to make sure there are no
> dependency conflicts. For example, Flink-Hudi uses a different version
> of parquet than Spark-Hudi, because of which the tests started to
> fail with `java.lang.NoSuchMethodError:
> org.apache.parquet.column.ParquetProperties.getColumnIndexTruncateLength()`
>
>
> Regards,
> Vinay Patil
>
>
> On Tue, Jul 20, 2021 at 9:46 PM Vinoth Chandar  wrote:
>
> > Hi Vinay.
> >
> > Thanks for kicking this off.
> >
> > I wonder if it's possible to structure the code in separate modules, but
> > have a single bundle.
> > Or is that a painful experience? (if so, can you share what issues we are
> > running into?)
> >
> > We have rarely done backwards incompatible changes and users appreciate
> > that.
> > So I'd love to understand why this is warranted here.
> >
> > Thanks
> > Vinoth
> >
> > On Sat, Jul 17, 2021 at 7:58 AM Vinay Patil 
> > wrote:
> >
> > > Hi Team,
> > >
> > > As part of https://issues.apache.org/jira/browse/HUDI-1872, we are
> > > creating
> > > a separate flink-utilities module. Based on our discussion on the PR,
> > > should we even create a spark-utilities module. This would look like :
> > >
> > > hudi-utilities
> > > ├── hudi-flink-utilities
> > > └── hudi-spark-utilities
> > >
> > > This would also mean to create separate utilities-bundle for Flink and
> > > Spark,
> > >
> > > hudi-utilities-bundle
> > >  ├── hudi-flink-utilities-bundle
> > >  └── hudi-spark-utilities-bundle
> > >
> > > This is not a backward-compatible change, as users will have to provide
> an
> > > engine-specific bundle. IMO, since Hudi is supporting Flink and Spark,
> it
> > > will be good to have engine-specific bundles.
> > >
> > > What do you think?
> > >
> > > Regards,
> > > Vinay Patil
> > >
> >
>


Re: [DISCUSS] Hudi is the data lake platform

2021-07-21 Thread Vinoth Chandar
Expanding to users@ as well.

Hi all,

Since this discussion, I started to pen down a coherent strategy and convey
these ideas via a blog post.
I have also done my own research, talked to (ex)colleagues I respect to get
their take and refine it.

Here's a blog that hopefully explains this vision.

https://github.com/apache/hudi/pull/3322

Look forward to your feedback on the PR. We are hoping to land this early
next week, if everyone is aligned.

Thanks
Vinoth

On Wed, Apr 21, 2021 at 9:01 PM wei li  wrote:

> +1, cannot agree more.
> *aux metadata* and the metadata table can bring large performance
> optimizations on the query side.
> They can be developed continuously.
> A cache service may be a necessary component in cloud-native environments.
>
> On 2021/04/13 05:29:55, Vinoth Chandar  wrote:
> > Hello all,
> >
> > Reading one more article today positioning Hudi as just a table format
> > made me wonder if we have done enough justice in explaining what we have
> > built together here.
> > I tend to think of Hudi as the data lake platform, which has the
> following
> > components, of which one is a table format and one is a transactional
> > storage layer.
> > But the whole stack we have is definitely worth more than the sum of all
> > the parts IMO (speaking from my own experience from the past 10+ years of
> > open source software dev).
> >
> > Here's what we have built so far.
> >
> > a) *table format* : something that stores table schema, a metadata table
> > that stores file listing today, and being extended to store column ranges
> > and more in the future (RFC-27)
> > b) *aux metadata* : bloom filters, external record level indexes today,
> > bitmaps/interval trees and other advanced on-disk data structures
> tomorrow
> > c) *concurrency control* : we always supported MVCC based log based
> > concurrency (serialize writes into a time ordered log), and we now also
> > have OCC for batch merge workloads with 0.8.0. We will have multi-table
> and
> > fully non-blocking writers soon (see future work section of RFC-22)
> > d) *updates/deletes* : this is the bread-and-butter use-case for Hudi,
> but
> > we support primary/unique key constraints and we could add foreign keys
> as
> > an extension, once our transactions can span tables.
> > e) *table services*: a hudi pipeline today is self-managing - sizes
> files,
> > cleans, compacts, clusters data, bootstraps existing data - all these
> > actions working off each other without blocking one another. (for most
> > parts).
> > f) *data services*: we also have higher level functionality with
> > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > coming, ...and more), incremental ETL support, de-duplication, commit
> > callbacks, pre-commit validations are coming, error tables have been
> > proposed. I could also envision us building towards streaming egress,
> data
> > monitoring.
> >
> > I also think we should build the following (subject to separate
> > DISCUSS/RFCs)
> >
> > g) *caching service*: Hudi specific caching service that can hold mutable
> > data and serve oft-queried data across engines.
> > h) t*imeline metaserver:* We already run a metaserver in spark
> > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> turn
> > it into a scalable, sharded metastore, that all engines can use to obtain
> > any metadata.
> >
> > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to
> > "ingests & manages storage of large analytical datasets over DFS (hdfs or
> > cloud stores)." and convey the scope of our vision,
> > given we have already been building towards that. It would also provide
> new
> > contributors a good lens to look at the project from.
> >
> > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > system to an event streaming platform - with the addition of
> > MirrorMaker/Connect etc.)
> >
> > Please share your thoughts!
> >
> > Thanks
> > Vinoth
> >
>


Re: Hive integration Improvement

2021-07-20 Thread Vinoth Chandar
Thanks for this! Will review this week!

On Thu, Jul 15, 2021 at 5:15 AM 18717838093 <18717838...@126.com> wrote:

>
>
> Hi, experts.
>
>
> Currently, Hudi SQL statements for DML are executed by the Hive Driver with
> concatenated SQL statements in most cases. The way SQL is concatenated is
> hard to maintain and the code is easy to break. Other than that, multiple
> versions of Hive cannot be supported at the moment, which causes a lot of
> headaches for users. So, I would like to refactor and refine these
> two things to get a better design and make things more convenient for users.
>
>
> For example, the following functions use the Driver to execute SQL:
>
> HiveSyncTool#syncHoodieTable used for creating a database by driver.
> HoodieHiveClient#createTable, for creating a table by driver.
> HoodieHiveClient#addPartitionsToTable by driver.
> HoodieHiveClient#updatePartitionsToTable by driver.
> HoodieHiveClient#updateTableDefinition, alter table by driver.
>
>
>
>
> Other than that, HoodieHiveClient#updateTableProperties,
> HoodieHiveClient#scanTablePartitions, HoodieHiveClient#doesTableExist,
> etc., use the client API for their metadata operations. Considered from
> a design perspective, the two pieces are not aligned. So I think we need to
> abstract a unified interface for everything that contacts the HMS, and
> not use the Driver to execute DML. As for supporting multiple Hive
> versions, we can add a shim layer to support different versions of the
> HMS.
>
>
> I have a preliminary conception of the design in RFC-31 (
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment).
> I hope everyone can help with some reviews and provide some suggestions.
> thank you very much.
>
>
> - Looking forward to your reply.
>
>
> minglei
>
>
>
>
> 18717838093
> 18717838...@126.com
>
>
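
The unified-interface idea in the proposal above could look roughly like the
sketch below (names are hypothetical, not the actual RFC-31 API): every call
goes through the HMS client instead of SQL strings executed via the Hive
Driver, with a shim picking the implementation per Hive version.

// Hypothetical sketch of a version-agnostic sync client.
trait HoodieMetaSyncClient extends AutoCloseable {
  def createDatabase(name: String): Unit
  def tableExists(db: String, table: String): Boolean
  def createTable(db: String, table: String, avroSchemaJson: String): Unit
  def updateTableDefinition(db: String, table: String, avroSchemaJson: String): Unit
  def addPartitions(db: String, table: String, partitionPaths: Seq[String]): Unit
  def updatePartitions(db: String, table: String, partitionPaths: Seq[String]): Unit
}

object HoodieMetaSyncClient {
  // The shim layer: pick the implementation matching the cluster's Hive version.
  def forHiveVersion(version: String): HoodieMetaSyncClient =
    version.split('.').head match {
      case "2" => ??? // e.g. new Hive2MetaSyncClient(...)
      case "3" => ??? // e.g. new Hive3MetaSyncClient(...)
      case v   => throw new IllegalArgumentException(s"Unsupported Hive major version: $v")
    }
}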


Re: [DISCUSS] Move to spark v2 datasource API

2021-07-20 Thread Vinoth Chandar
Hi Siva,

Regarding the ability to specify distribution and sorting, can they be
dynamic? Not just at table creation time.
Hudi is really a storage system, i.e., it has a specific layout of data with
multiple tables (ro, rt) exposed.
So all of these "file" management APIs tend to fit poorly at times.
To your point, we still need to support indexing, transformations.

>>from within our V1's createRelation can't we bypass and start using
hudi_v2 somehow?

That would mean we still expose V1 datasource APIs to the users, right?
Which is what we want to avoid in the first place?

Thanks
Vinoth

On Thu, Jul 15, 2021 at 8:10 PM Sivabalan  wrote:

> I don't have much knowledge wrt catalog, but is there an option of
> exploring a Spark catalog based table to create a hudi table? I do know with
> Spark 3.2, you can add Distribution (a.k.a. partitioning) and Sort order to
> your table. But still not sure on custom transformation for indexing, etc.
>
> Also, wrt Option2, is there a way to not explicitly ask users to start
> using hudi_V2 once we have one. For eg, from within our V1's createRelation
> can't we bypass and start using hudi_v2 somehow? Directly using hudi_v2
> should also be an option. I need to explore more on these lines, but just
> putting it out.
>
> Once we make some headway in this(by some spark expertise), I can
> definitely contribute from my side on this project.
>
>
> On Thu, Jul 15, 2021 at 12:13 AM Vinoth Chandar  wrote:
>
> > Folks,
> >
> > As you may know, we still use the V1 API, given the flexibility it offers
> to
> > further transform the dataframe after one calls `df.write.format()`, to
> implement
> > a fully featured write pipeline with precombining, indexing, and custom
> > partitioning. The V2 API takes this away and rather provides a very
> restrictive
> > API that simply provides a partition-level write interface that hands over
> > an Iterator.
> >
> > That said, v2 has evolved (again) in Spark 3 and we would like to get
> with
> > the V2 APIs at some point, for both querying and writing. This thread
> > summarizes a few approaches we can take.
> >
> > *Option 1 : Introduce a pre write hook in Spark Datasource API*
> > If datasource API provided a simple way to further transform dataframes,
> > after the call to df.write.format is done, we would be able to move much
> of
> > our HoodieSparkSQLWriter logic into that and make the transition.
> >
> > Sivabalan engaged with the Spark community around this, without luck.
> > Anyone who can help revive this or make a more successful attempt, please
> > chime in.
> >
> > *Option 2 : Introduce a new datasource hudi_v2 for Spark Datasource +
> > HoodieSparkClient API*
> >
> > We would limit datasource writes to simply bulk_inserts or
> > insert_overwrites. All other write operations would be supported via a
> new
> > HoodieSparkClient API (similar to all the write clients we have, but
> works
> > with DataSet). Queries will be supported on the v2 APIs. This will
> be
> > done only for Spark 3.
> >
> > We would still keep the current v1 support until Spark supports it.
> > Obviously, users have to migrate pipelines to hudi_v2 at some point, if
> > datasource v1 support is dropped
> >
> > My concern is having two datasources, causing greater confusion for the
> > users.
> >
> > Maybe there are others that I did not list out here. Please add
> >
> > Thanks
> > Vinoth
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Create Spark and Flink utilities module

2021-07-20 Thread Vinoth Chandar
Hi Vinay.

Thanks for kicking this off.

I wonder if it's possible to structure the code in separate modules, but
have a single bundle.
Or is that a painful experience? (if so, can you share what issues we are
running into?)

We have rarely done backwards incompatible changes and users appreciate
that.
So I'd love to understand why this is warranted here.

Thanks
Vinoth

On Sat, Jul 17, 2021 at 7:58 AM Vinay Patil  wrote:

> Hi Team,
>
> As part of https://issues.apache.org/jira/browse/HUDI-1872, we are
> creating
> a separate flink-utilities module. Based on our discussion on the PR,
> should we even create a spark-utilities module. This would look like :
>
> hudi-utilities
> ├── hudi-flink-utilities
> └── hudi-spark-utilities
>
> This would also mean to create separate utilities-bundle for Flink and
> Spark,
>
> hudi-utilities-bundle
>  ├── hudi-flink-utilities-bundle
>  └── hudi-spark-utilities-bundle
>
> This is not a backward-compatible change, as users will have to provide an
> engine-specific bundle. IMO, since Hudi is supporting Flink and Spark, it
> will be good to have engine-specific bundles.
>
> What do you think?
>
> Regards,
> Vinay Patil
>


Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-19 Thread Vinoth Chandar
Thanks all. Started a VOTE around this.
If you can chime in with your +1s there as well, it will be great!

On Fri, Jul 16, 2021 at 8:31 PM Gary Li  wrote:

> +1 for option B.
>
> On Sat, Jul 17, 2021 at 9:51 AM Udit Mehrotra  wrote:
>
> > +1 for option B. For A, I will need more data points to convince myself
> if
> > GitHub issues will provide all the issue tracking functionality that Jira
> > provides today.
> >
> > Thanks,
> > Udit
> >
> > On Fri, Jul 16, 2021 at 2:33 PM Vinoth Chandar 
> wrote:
> >
> > > Looks like we can start with B has a lot of support.
> > > I will start a VOTE on B alone and we can proceed if the VOTE passes.
> > >
> > > On Fri, Jul 16, 2021 at 8:05 AM Nishith  wrote:
> > >
> > > > +1 for option B.
> > > >
> > > > > On Jul 15, 2021, at 10:50 PM, Bhavani Sudha <
> bhavanisud...@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > Completely agree on B. On A I feel the necessity to centralize
> > > > everything
> > > > > in one place but also without losing the capabilities of Jira. I
> > think
> > > we
> > > > > will have to explore tools either way.
> > > > >
> > > > > Thanks,
> > > > > Sudha
> > > > >
> > > > >> On Thu, Jul 15, 2021 at 10:42 PM vino yang  >
> > > > wrote:
> > > > >>
> > > > >> +1 for option B.
> > > > >>
> > > > >> Best,
> > > > >> Vino
> > > > >>
> > > > >> Sivabalan  wrote on Fri, Jul 16, 2021 at 10:35 AM:
> > > > >>
> > > > >>> +1 on B. Not sure on A though. I understand the intent to have
> all
> > in
> > > > >>> one place. but not very sure if we can get all functionality
> > > (version,
> > > > >>> type, component, status, parent- child relation), etc ported over
> > to
> > > > >>> github. I assume labels are the only option we have to achieve
> > these.
> > > > >>> Probably, we should also document the labels in detail so that
> > anyone
> > > > >>> looking to take a look at untriaged issues should know how/where
> to
> > > > look
> > > > >>> at. If we plan to use GH issues for all, I am sure there will be
> a
> > > lot
> > > > of
> > > > >>> proliferation of issues.
> > > > >>>
> > > > >>> On Fri, Jul 9, 2021 at 12:29 PM Vinoth Chandar <
> vin...@apache.org>
> > > > >> wrote:
> > > > >>>
> > > > >>>> Based on this, I will start consolidating more of the cWiki
> > content
> > > to
> > > > >>>> github wiki and master branch?
> > > > >>>>
> > > > >>>> JIRA vs GH Issue still probably needs more feedback. I do see
> the
> > > > >>> tradeoffs
> > > > >>>> there.
> > > > >>>>
> > > > >>>> On Fri, Jul 9, 2021 at 2:39 AM wei li 
> > > wrote:
> > > > >>>>
> > > > >>>>> +1
> > > > >>>>>
> > > > >>>>> On 2021/07/02 03:40:51, Vinoth Chandar 
> > wrote:
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> When we incubated Hudi, we made some initial choices around
> > > > >>>> collaboration
> > > > >>>>>> tools of choice. I am wondering if there are still optimal,
> > given
> > > > >> the
> > > > >>>>> scale
> > > > >>>>>> of the community at this point.
> > > > >>>>>>
> > > > >>>>>> Specifically, two points.
> > > > >>>>>>
> > > > >>>>>> A) Our issue tracker is JIRA, while we just use Github Issues
> > for
> > > > >>>> support
> > > > >>>>>> triage. While JIRA is pretty advanced and gives us the ability
> > to
> > > > >>> track
> > > > >>>>>> releases, versions and kanban boards, there are few practical
> > > > >>>> operational
> > > > >

[VOTE] Move content off cWiki

2021-07-19 Thread Vinoth Chandar
Hi all,

Starting a vote based on the DISCUSS thread here [1], to consolidate
content from cWiki into Github wiki and project's master branch (for design
docs)

Please chime with a

+1 - Approve the move
-1  - Disapprove the move (please state your reasoning)

The vote will use lazy consensus, needing three +1s to pass, remaining open
for 72 hours.

Thanks
Vinoth

[1]
https://lists.apache.org/thread.html/rb0a96bc10788c9635cc1a35ade7d5d42997a5c9591a5ec5d5a99adf0%40%3Cdev.hudi.apache.org%3E


Re: Welcome New Committers: Pengzhiwei and DannyChan

2021-07-16 Thread Vinoth Chandar
Congrats both! Your impact is amazing!
More miles to travel. Looking forward

On Fri, Jul 16, 2021 at 4:43 PM 18717838093 <18717838...@126.com> wrote:

> Congratulations! Well deserved!
>
>
>
> 18717838093
> 18717838...@126.com
>
>
> On 07/16/2021 19:50,wangxianghu wrote:
> Congratulations! Well deserved!
>
> On Jul 16, 2021 at 6:52 PM, vino yang  wrote:
>
> Congratulation to both of you! Well deserved!
>
> Best,
> Vino
>
> leesf  wrote on Fri, Jul 16, 2021 at 6:38 PM:
> Hi all,
>
> Please join me in congratulating our newest committers Pengzhiwei and
> DannyChan.
>
> Pengzhiwei has been a consistent contributor to Hudi. He has contributed
> numerous features, such as Spark SQL integration with Hudi, a Spark
> Structured Streaming Source for Hudi and a Spark FileIndex for Hudi, made
> lots of other good contributions around Spark, and is also very active in
> answering users' questions. He is a solid team player and an asset to the
> project.
>
> DannyChan has contributed many good features, such as the new streaming write
> pipeline for Flink with automatic compaction and cleaning (COW and MOR),
> batch and streaming readers for Flink (COW and MOR) and support for Flink SQL
> connectors (reader and writer). He actively joins the ML to answer
> users' questions, wrote a Hudi Flink integration guide, and
> launched a live show to promote the Hudi Flink integration for Chinese users.
>
> Thanks so much for your continued contributions to make Hudi better and
> better!
>
> Also, I would like to introduce the current state of Hudi in China. Hudi
> has become more and more popular in China with the help of all community
> members and has been adopted by almost all top companies in China,
> including Alibaba, Baidu, ByteDance, Huawei, Tencent and other companies,
> from startups to large companies, with data scale from TB to PB. You will find
> the logo wall below (PS: unofficial statistics; just listed some of them, and
> you can contact me to add your company logo if wanted).
>
> We would not achieve this without such a good community and the
> contribution of all community members. Cheers and Go!
>
>
>
> Thanks,
> Leesf
>
>


Welcome our PMC Member, Raymond Xu

2021-07-16 Thread Vinoth Chandar
Folks,

I am incredibly happy to share the addition of Raymond Xu to the Hudi PMC.
Raymond has been a valuable member of our community, over the past few
years now. Always hustlin and taking on the most underappreciated, but
extremely valuable aspects of the project, most recently with getting our
tests working smoothly on Azure CI!

Please join me in congratulating Raymond!

Onwards,
Vinoth


Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-16 Thread Vinoth Chandar
Looks like we can start with B has a lot of support.
I will start a VOTE on B alone and we can proceed if the VOTE passes.

On Fri, Jul 16, 2021 at 8:05 AM Nishith  wrote:

> +1 for option B.
>
> > On Jul 15, 2021, at 10:50 PM, Bhavani Sudha 
> wrote:
> >
> > Completely agree on B. On A I feel the necessity to centralize
> everything
> > in one place but also without losing the capabilities of Jira. I think we
> > will have to explore tools in eitherways.
> >
> > Thanks,
> > Sudha
> >
> >> On Thu, Jul 15, 2021 at 10:42 PM vino yang 
> wrote:
> >>
> >> +1 for option B.
> >>
> >> Best,
> >> Vino
> >>
> >> Sivabalan  wrote on Fri, Jul 16, 2021 at 10:35 AM:
> >>
> >>> +1 on B. Not sure on A though. I understand the intent to have all in
> >>> one place, but not very sure if we can get all the functionality (version,
> >>> type, component, status, parent-child relation, etc.) ported over to
> >>> github. I assume labels are the only option we have to achieve these.
> >>> Probably, we should also document the labels in detail so that anyone
> >>> looking to take a look at untriaged issues should know how/where to
> look
> >>> at. If we plan to use GH issues for all, I am sure there will be a lot
> of
> >>> proliferation of issues.
> >>>
> >>> On Fri, Jul 9, 2021 at 12:29 PM Vinoth Chandar 
> >> wrote:
> >>>
> >>>> Based on this, I will start consolidating more of the cWiki content to
> >>>> github wiki and master branch?
> >>>>
> >>>> JIRA vs GH Issue still probably needs more feedback. I do see the
> >>> tradeoffs
> >>>> there.
> >>>>
> >>>> On Fri, Jul 9, 2021 at 2:39 AM wei li  wrote:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> On 2021/07/02 03:40:51, Vinoth Chandar  wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> When we incubated Hudi, we made some initial choices around
> >>>> collaboration
> >>>>>> tools of choice. I am wondering if they are still optimal, given
> >> the
> >>>>> scale
> >>>>>> of the community at this point.
> >>>>>>
> >>>>>> Specifically, two points.
> >>>>>>
> >>>>>> A) Our issue tracker is JIRA, while we just use Github Issues for
> >>>> support
> >>>>>> triage. While JIRA is pretty advanced and gives us the ability to
> >>> track
> >>>>>> releases, versions and kanban boards, there are few practical
> >>>> operational
> >>>>>> problems.
> >>>>>>
> >>>>>> - Developers often open bug fixes/PR which all need to be
> >>> continuously
> >>>>>> tagged against a release version (fix version)
> >>>>>> - Referencing JIRAs from Pull Requests is great (we cannot do
> >> things
> >>>> like
> >>>>>> `fixes #1234` to close issues when PR lands, not an easy way to
> >> click
> >>>> and
> >>>>>> get to the JIRA)
> >>>>>> - Many more developers have a github account, to contribute to Hudi
> >>>>> though,
> >>>>>> they need an additional sign-up on jira.
> >>>>>>
> >>>>>> So wondering if we should just use one thing - Github Issues, and
> >>> build
> >>>>>> scripts/hubot or something to get the missing project management
> >> from
> >>>>>> boards.
> >>>>>>
> >>>>>> B) Our design docs are on cWiki. Even though we link it off the
> >> site,
> >>>>> from
> >>>>>> my experience, many do not discover them.
> >>>>>> For large PRs, we need to manually enforce that design and code are
> >>> in
> >>>>> sync
> >>>>>> before we land. If we can, I would love to make RFC being in good
> >>>> shape a
> >>>>>> pre-requisite for landing the PR.
> >>>>>> Once again, separate signup is needed to write design docs or
> >> comment
> >>>> on
> >>>>>> them.
> >>>>>>
> >>>>>> So, wondering if we can move our process docs etc into Github Wiki
> >>> and
> >>>>> RFCs
> >>>>>> to the master branch in a rfc folder, and we just use github PRs to
> >>>> raise
> >>>>>> RFCs and discuss them.
> >>>>>>
> >>>>>> This all also makes it easy for us to measure community activity
> >> and
> >>>> keep
> >>>>>> streamlining our processes.
> >>>>>>
> >>>>>> personally, these different channels are overwhelming to me
> >> at-least
> >>> :)
> >>>>>>
> >>>>>> Love to hear thoughts. Please specify if you are for,against each
> >> of
> >>> A
> >>>>> and
> >>>>>> B.
> >>>>>>
> >>>>>>
> >>>>>> Thanks
> >>>>>> Vinoth
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> -Sivabalan
> >>>
> >>
>


[DISCUSS] Move to spark v2 datasource API

2021-07-14 Thread Vinoth Chandar
Folks,

As you may know, we still use the V1 API, given the flexibility it offers to
further transform the dataframe after one calls `df.write.format()`, to
implement a fully featured write pipeline with precombining, indexing, and
custom partitioning. The V2 API takes this away and rather provides a very
restrictive API that simply provides a partition-level write interface that
hands over an Iterator.
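
To make the contrast concrete, here is a rough outline of the two interfaces
(simplified from Spark's actual APIs; bodies elided, class names illustrative):

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// V1: the source sees the whole DataFrame, so it can repartition, sort,
// precombine and index it before writing - this is where our
// HoodieSparkSQLWriter logic hooks in today.
class HudiV1Provider extends CreatableRelationProvider {
  override def createRelation(sqlContext: SQLContext, mode: SaveMode,
      parameters: Map[String, String], data: DataFrame): BaseRelation = {
    // free to transform `data` here before the actual write
    ???
  }
}

// V2 (Spark 3): the engine drives; a writer only ever sees one partition's rows.
class HudiV2PartitionWriter extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = ??? // no global view of the data
  override def commit(): WriterCommitMessage = ???
  override def abort(): Unit = ???
  override def close(): Unit = ???
}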

That said, v2 has evolved (again) in Spark 3 and we would like to get with
the V2 APIs at some point, for both querying and writing. This thread
summarizes a few approaches we can take.

*Option 1 : Introduce a pre write hook in Spark Datasource API*
If datasource API provided a simple way to further transform dataframes,
after the call to df.write.format is done, we would be able to move much of
our HoodieSparkSQLWriter logic into that and make the transition.

Sivabalan engaged with the Spark community around this, without luck.
Anyone who can help revive this or make a more successful attempt, please
chime in.

*Option 2 : Introduce a new datasource hudi_v2 for Spark Datasource +
HoodieSparkClient API*

We would limit datasource writes to simply bulk_inserts or
insert_overwrites. All other write operations would be supported via a new
HoodieSparkClient API (similar to all the write clients we have, but works
with DataSet). Queries will be supported on the v2 APIs. This will be
done only for Spark 3.

We would still keep the current v1 support until Spark supports it.
Obviously, users have to migrate pipelines to hudi_v2 at some point, if
datasource v1 support is dropped

My concern is having two datasources, causing greater confusion for the
users.

Maybe there are others that I did not list out here. Please add

Thanks
Vinoth


[DISCUSS] Enable Github Discussions

2021-07-14 Thread Vinoth Chandar
Hi all,

I would like to propose that we explore the use of GitHub Discussions. A few
other Apache projects have also been trying this out.

Please chime in

Thanks
Vinoth


Re: Release manager for 0.9.0

2021-07-14 Thread Vinoth Chandar
Next week works great. Hoping to land all remaining blockers by then.

Thanks for taking this up!

On Wed, Jul 14, 2021 at 7:16 PM Udit Mehrotra 
wrote:

> Hey Vinoth,
>
> I will have some cycles starting next week and can help drive the release
> this time. Let me know.
>
> Thanks,
> Udit
>
> Sent from my iPhone
>
> > On Jul 14, 2021, at 7:07 PM, Vinoth Chandar  wrote:
> >
> > Hi all,
> >
> > 0.9.0 is upon us. Any volunteers to drive this forward?
> >
> > Thanks
> > Vinoth
>


Release manager for 0.9.0

2021-07-14 Thread Vinoth Chandar
Hi all,

0.9.0 is upon us. Any volunteers to drive this forward?

Thanks
Vinoth


Re: Website redesign

2021-07-12 Thread Vinoth Chandar
Hi,

Sounds good! Please grab the JIRA and we can start scoping it into sub
tasks?

Thanks
Vinoth

On Mon, Jul 12, 2021 at 10:02 PM Vinoth Govindarajan <
vinoth.govindara...@gmail.com> wrote:

> Hi Folks,
> I have experience in the past building websites, I can volunteer to work on
> this re-design.
>
> Best,
> Vinoth
>
>
> On Fri, Jul 2, 2021 at 6:45 PM Vinoth Chandar  wrote:
>
> > At this point, scoping the work itself is a good first task, breaking
> into
> > sub tasks.
> >
> > I am willing to partner with someone closely, to drive this.
> >
> > On Wed, Jun 30, 2021 at 5:45 PM Danny Chan  wrote:
> >
> > > Are all the pages assigned to volunteers, or is there someone leading this?
> > >
> > > Best,
> > > Danny Chan
> > >
> > > Vinoth Chandar wrote on Thu, Jul 1, 2021 at 6:00 AM:
> > >
> > > > Any volunteers? Also worth asking in slack?
> > > >
> > > > On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu <
> > xu.shiyan.raym...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We've completed a re-design of Hudi's website (hudi.apache.org) ,
> in
> > > the
> > > > > goal of making the navigation more organized and information more
> > > > > discoverable. The design document can be found here (thanks to
> > designer
> > > > > Joanna)
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6
> > > > >
> > > > > The design is ready for implementation; would like to call for
> > > volunteers
> > > > > to pick up this one!
> > > > > https://issues.apache.org/jira/browse/HUDI-1985
> > > > >
> > > > > Cheers,
> > > > > Raymond
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-09 Thread Vinoth Chandar
Based on this, I will start consolidating more of the cWiki content to
github wiki and master branch?

JIRA vs GH Issue still probably needs more feedback. I do see the tradeoffs
there.

On Fri, Jul 9, 2021 at 2:39 AM wei li  wrote:

> +1
>
> On 2021/07/02 03:40:51, Vinoth Chandar  wrote:
> > Hi all,
> >
> > When we incubated Hudi, we made some initial choices around collaboration
> > tools of choice. I am wondering if they are still optimal, given the
> scale
> > of the community at this point.
> >
> > Specifically, two points.
> >
> > A) Our issue tracker is JIRA, while we just use Github Issues for support
> > triage. While JIRA is pretty advanced and gives us the ability to track
> > releases, versions and kanban boards, there are few practical operational
> > problems.
> >
> > - Developers often open bug fixes/PR which all need to be continuously
> > tagged against a release version (fix version)
> > - Referencing JIRAs from Pull Requests is clunky (we cannot do things like
> > `fixes #1234` to close issues when a PR lands, and there is no easy way to
> > click through to the JIRA)
> > - Many more developers have a github account, to contribute to Hudi
> though,
> > they need an additional sign-up on jira.
> >
> > So wondering if we should just use one thing - Github Issues, and build
> > scripts/hubot or something to get the missing project management from
> > boards.
> >
> > B) Our design docs are on cWiki. Even though we link it off the site,
> from
> > my experience, many do not discover them.
> > For large PRs, we need to manually enforce that design and code are in
> sync
> > before we land. If we can, I would love to make RFC being in good shape a
> > pre-requisite for landing the PR.
> > Once again, separate signup is needed to write design docs or comment on
> > them.
> >
> > So, wondering if we can move our process docs etc into Github Wiki and
> RFCs
> > to the master branch in an rfc folder, and we just use github PRs to raise
> > RFCs and discuss them.
> >
> > This all also makes it easy for us to measure community activity and keep
> > streamlining our processes.
> >
> > personally, these different channels are overwhelming to me at-least :)
> >
> > Love to hear thoughts. Please specify if you are for or against each of A
> and
> > B.
> >
> >
> > Thanks
> > Vinoth
> >
>


PSA : Rebase PRs before landing

2021-07-08 Thread Vinoth Chandar
Hi all,

We had a large Config framework change that went in, and since then there
have been at least two master breaks from not rebasing before landing.

Please check if your PR changes any configs and if so you may need to
rebase and rework before landing (even if there are no conflicts per se
shown on github). Reviewers, please pay close attention to this aspect,
until we cross over the hump.
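
A minimal flow, assuming a remote named upstream pointing at apache/hudi
(the branch name is illustrative):

git fetch upstream
git rebase upstream/master
# re-run any config-related tests locally, then update the PR branch
git push --force-with-lease origin my-branch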

Thanks
Vinoth


Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-08 Thread Vinoth Chandar
Any more strong opinions around these?

On Mon, Jul 5, 2021 at 7:43 AM Vinoth Chandar  wrote:

> I had similar views on A actually. JIRA is pretty powerful, queryable.
> But, I convinced myself on labelling and then building out dashboards using
> SQL (for reports/analytics).
> Still having one place for issues/prs. For releases, we can
> directly leverage milestones.
>
> We can definitely prioritize B first. Seems like everyone is on-board for
> that.
>
> @raymond : I am not sure if we will be able to install this on to the
> apache organization. Has been an issue for many of these github related
> services.
>
>
> On Sat, Jul 3, 2021 at 1:00 PM Raymond Xu 
> wrote:
>
>> Just to mention, there are some GitHub plugins that bring JIRA features to GH
>> issues. This one, for example, is free for open source.
>> https://www.zenhub.com/pricing
>>
>>
>>
>> On Fri, Jul 2, 2021 at 8:58 PM Navi Brar > .invalid>
>> wrote:
>>
>> > Hi,
>> >
>> >
>> > +1 on B
>> >
>> >
>> > But I have a slightly orthogonal view on A. I think JIRA should stay. It
>> > provides a lot more visibility into issue management. You can easily link
>> PRs,
>> > wikis, releases, etc., whereas in GitHub everyone will have to dig through
>> > the comments, or every GitHub issue might end up with so many
>> > labels that they lose their significance. I recently landed a JIRA
>> on
>> > Hudi and I think the integration with GitHub was pretty seamless.
>> >
>> >
>> > Happy either ways though.
>> >
>> >
>> > Thanks,
>> >
>> > Navinder
>> >
>> > >
>> > > On 03-Jul-2021, at 8:12 AM, vbal...@apache.org wrote:
>> > >
>> > >  +1 for both A and B. Makes sense to centralize bug tracking and RFCs
>> > in github.
>> > > Balaji.V
>> > >
>> > >
>> > >On Friday, July 2, 2021, 06:44:06 PM PDT, Vinoth Chandar <
>> > vin...@apache.org> wrote:
>> > >
>> > > Raymond - +1 on your thoughts.
>> > >
>> > > Once we have more voices and alignment, we can do one final RFC on
>> cWiki
>> > > covering everything.
>> > >
>> > > Can more people please chime in. Ideally we will put this to a VOTE
>> > >
>> > >> On Fri, Jul 2, 2021 at 12:54 PM Raymond Xu <
>> xu.shiyan.raym...@gmail.com
>> > >
>> > >> wrote:
>> > >>
>> > >> +1 for both A and B
>> > >>
>> > >> Also a related suggestion:
>> > >> we can put the release notes and new feature highlights in the
>> release
>> > >> notes section in GitHub releases instead of separately writing them
>> in
>> > the
>> > >> asf-site
>> > >>
>> > >>
>> > >> On Fri, Jul 2, 2021 at 11:25 AM Prashant Wason
>> > > >
>> > >> wrote:
>> > >>
>> > >>> +1 for complete Github migration. JIRA is too cumbersome and
>> painful to
>> > >>> use.
>> > >>>
>> > >>> Github PRs and wiki also improve visibility of the project and I
>> think
>> > >> may
>> > >>> increase community feedback and participation as its simpler to use.
>> > >>>
>> > >>> Prashant
>> > >>>
>> > >>>
>> > >>>> On Thu, Jul 1, 2021 at 8:41 PM Vinoth Chandar 
>> > wrote:
>> > >>>
>> > >>>> Hi all,
>> > >>>>
>> > >>>> When we incubated Hudi, we made some initial choices around
>> > >> collaboration
>> > >>>> tools of choice. I am wondering if there are still optimal, given
>> the
>> > >>> scale
>> > >>>> of the community at this point.
>> > >>>>
>> > >>>> Specifically, two points.
>> > >>>>
>> > >>>> A) Our issue tracker is JIRA, while we just use Github Issues for
>> > >> support
>> > >>>> triage. While JIRA is pretty advanced and gives us the ability to
>> > track
>> > >>>> releases, versions and kanban boards, there are few practical
>> > >> operational
>> > >>>> problems.
>> > >>>>
>> > >>>> - Developer

Re: [DISCUSS] scenario-based quickstart demo

2021-07-06 Thread Vinoth Chandar
Hi Raymond,

Are you suggesting a fix to the dev workflow or general site/quickstart
docs?

Agreed, the current doc is all-at-once, and at least better docs on
incrementally testing parts could be useful.
It takes a while to learn what to skip and what not to.

Thanks
Vinoth

On Sat, Jul 3, 2021 at 2:11 PM Raymond Xu 
wrote:

> I found the demo setup in the "docker" directory not beginner friendly. It
> took some effort to digest what's there and it's hard to play with.
> Proposing some scenario-based quickstart setup
>
> - Scenario 1: DeltaStreamer write
>   - sample raw dataset, local FS
>   - run deltastreamer with local Spark or Flink write to COW or MOR
> - Scenario 2: meta sync
>   - sample hoodie table (COW or MOR), local FS
>   - run hive sync with local Hive server
> - Scenario 3: SQL read
>   - sample hoodie table (COW or MOR), local FS
>   - run local Trino/Presto queries
> - More scenarios: incremental read, clustering, etc
>
> In all scenarios, users can choose between a release version and the local
> version of Hudi.
>
> Not meant to replace the current "docker" demo. It can be under a
> "quickstart" dir and aims to be more focused quick sandbox. A typical dev
> flow is
> 1. changed some code
> 2. run mvn install -DskipTests
> 3. play with affected scenarios to verify the change
>
> Any thoughts or comments? Thank you.
>


Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-05 Thread Vinoth Chandar
I had similar views on A actually. JIRA is pretty powerful, queryable. But,
I convinced myself on labelling and then building out dashboards using SQL
(for reports/analytics).
Still having one place for issues/prs. For releases, we can
directly leverage milestones.

We can definitely prioritize B first. Seems like everyone is on-board for
that.

@raymond : I am not sure if we will be able to install this on to the
apache organization. Has been an issue for many of these github related
services.


On Sat, Jul 3, 2021 at 1:00 PM Raymond Xu 
wrote:

> Just to mention, there are some GitHub plugins that bring JIRA features to GH
> issues. This one, for example, is free for open source.
> https://www.zenhub.com/pricing
>
>
>
> On Fri, Jul 2, 2021 at 8:58 PM Navi Brar 
> wrote:
>
> > Hi,
> >
> >
> > +1 on B
> >
> >
> > But I have a slightly orthogonal view on A. I think JIRA should stay. It
> > provides a lot more visibility into issue management. You can easily link PRs,
> > wikis, releases, etc., whereas in GitHub everyone will have to dig through
> > the comments, or every GitHub issue might end up with so many
> > labels that they lose their significance. I recently landed a JIRA
> on
> > Hudi and I think the integration with GitHub was pretty seamless.
> >
> >
> > Happy either ways though.
> >
> >
> > Thanks,
> >
> > Navinder
> >
> > >
> > > On 03-Jul-2021, at 8:12 AM, vbal...@apache.org wrote:
> > >
> > >  +1 for both A and B. Makes sense to centralize bug tracking and RFCs
> > in github.
> > > Balaji.V
> > >
> > >
> > >On Friday, July 2, 2021, 06:44:06 PM PDT, Vinoth Chandar <
> > vin...@apache.org> wrote:
> > >
> > > Raymond - +1 on your thoughts.
> > >
> > > Once we have more voices and alignment, we can do one final RFC on
> cWiki
> > > covering everything.
> > >
> > > Can more people please chime in. Ideally we will put this to a VOTE
> > >
> > >> On Fri, Jul 2, 2021 at 12:54 PM Raymond Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > >> wrote:
> > >>
> > >> +1 for both A and B
> > >>
> > >> Also a related suggestion:
> > >> we can put the release notes and new feature highlights in the release
> > >> notes section in GitHub releases instead of separately writing them in
> > the
> > >> asf-site
> > >>
> > >>
> > >> On Fri, Jul 2, 2021 at 11:25 AM Prashant Wason
>  > >
> > >> wrote:
> > >>
> > >>> +1 for complete Github migration. JIRA is too cumbersome and painful
> to
> > >>> use.
> > >>>
> > >>> Github PRs and wiki also improve visibility of the project and I
> think
> > >> may
> > >>> increase community feedback and participation as its simpler to use.
> > >>>
> > >>> Prashant
> > >>>
> > >>>
> > >>>> On Thu, Jul 1, 2021 at 8:41 PM Vinoth Chandar 
> > wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> When we incubated Hudi, we made some initial choices around
> > >> collaboration
> > >>>> tools of choice. I am wondering if they are still optimal, given
> the
> > >>> scale
> > >>>> of the community at this point.
> > >>>>
> > >>>> Specifically, two points.
> > >>>>
> > >>>> A) Our issue tracker is JIRA, while we just use Github Issues for
> > >> support
> > >>>> triage. While JIRA is pretty advanced and gives us the ability to
> > track
> > >>>> releases, versions and kanban boards, there are few practical
> > >> operational
> > >>>> problems.
> > >>>>
> > >>>> - Developers often open bug fixes/PR which all need to be
> continuously
> > >>>> tagged against a release version (fix version)
> > >>>> - Referencing JIRAs from Pull Requests is great (we cannot do things
> > >> like
> > >>>> `fixes #1234` to close issues when PR lands, not an easy way to
> click
> > >> and
> > >>>> get to the JIRA)
> > >>>> - Many more developers have a github account, to contribute to Hudi
> > >>> though,
> > >>>> they need an additional sign-up on jira.
> > >

Re: Website redesign

2021-07-02 Thread Vinoth Chandar
At this point, scoping the work itself is a good first task, breaking into
sub tasks.

I am willing to partner with someone closely, to drive this.

On Wed, Jun 30, 2021 at 5:45 PM Danny Chan  wrote:

> Are all the pages assigned to volunteers, or is there someone leading this?
>
> Best,
> Danny Chan
>
> Vinoth Chandar wrote on Thu, Jul 1, 2021 at 6:00 AM:
>
> > Any volunteers? Also worth asking in slack?
> >
> > On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > We've completed a re-design of Hudi's website (hudi.apache.org) , in
> the
> > > goal of making the navigation more organized and information more
> > > discoverable. The design document can be found here (thanks to designer
> > > Joanna)
> > >
> > >
> > >
> >
> https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6
> > >
> > > The design is ready for implementation; would like to call for
> volunteers
> > > to pick up this one!
> > > https://issues.apache.org/jira/browse/HUDI-1985
> > >
> > > Cheers,
> > > Raymond
> > >
> >
>


[DISCUSS] Consolidate all dev collaboration to Github

2021-07-01 Thread Vinoth Chandar
Hi all,

When we incubated Hudi, we made some initial choices around collaboration
tools of choice. I am wondering if they are still optimal, given the scale
of the community at this point.

Specifically, two points.

A) Our issue tracker is JIRA, while we just use Github Issues for support
triage. While JIRA is pretty advanced and gives us the ability to track
releases, versions and kanban boards, there are few practical operational
problems.

- Developers often open bug fixes/PR which all need to be continuously
tagged against a release version (fix version)
- Referencing JIRAs from Pull Requests is clunky (we cannot do things like
`fixes #1234` to close issues when a PR lands, and there is no easy way to
click through to the JIRA)
- Many more developers have a GitHub account; to contribute to Hudi, though,
they need an additional sign-up on JIRA.

So wondering if we should just use one thing - Github Issues, and build
scripts/hubot or something to get the missing project management from
boards.

B) Our design docs are on cWiki. Even though we link it off the site, from
my experience, many do not discover them.
For large PRs, we need to manually enforce that design and code are in sync
before we land. If we can, I would love to make RFC being in good shape a
pre-requisite for landing the PR.
Once again, separate signup is needed to write design docs or comment on
them.

So, wondering if we can move our process docs etc into Github Wiki and RFCs
to the master branch in an rfc folder, and we just use github PRs to raise
RFCs and discuss them.

This all also makes it easy for us to measure community activity and keep
streamlining our processes.

personally, these different channels are overwhelming to me at-least :)

Love to hear thoughts. Please specify if you are for or against each of A and
B.


Thanks
Vinoth


Re: [DISCUSS] Hash Index for HUDI

2021-06-30 Thread Vinoth Chandar
I see that we already have a PR up. Will catch up on it and provide some
initial comments.
Thanks!

On Wed, Jun 16, 2021 at 9:02 AM Shawy Geng  wrote:

> Combining bucket index and bloom filter is a great idea. There is no
> conflict between the two in implementation, and the bloom filter info can
> still be stored in the file for faster positioning.
>
> Best,
> Shawy
>
> > On Jun 9, 2021 at 16:23, Thiru Malai  wrote:
> >
> > Hi,
> >
> > This feature seems promising. If we are planning to assign the
> filegroupID as the hash mod value, then we can leverage this change in the
> Bloom Index as well, by pruning the files based on the hash mod value before
> min/max record_key pruning. That way the exploded RDD will be comparatively
> smaller, which will eventually optimise the shuffle size in the "Compute all
> comparisons needed between records and files" stages.
> >
> > Can we add this hash-based indexing approach to the Bloom Filter based
> approach also?
> >
> > On 2021/06/07 03:26:34, Danny Chan  wrote:
> >>> number of buckets expanded by multiple is recommended
> >> The condition is too harsh and the bucket number would grow
> >> exponentially.
> >>
> >>> with hash index can be solved by using multiple file groups per bucket
> as
> >> mentioned in the RFC
> >> The relation of file groups and buckets would be too complicated; we
> should
> >> avoid that. It also requires that the query engine be aware of the
> >> bucketing rules, which is not that transparent and is not a common query
> >> optimization.
> >>
> >> Best,
> >> Danny Chan
> >>
> >> 耿筱喻  wrote on Fri, Jun 4, 2021 at 6:06 PM:
> >>
> >>> Thank you for your questions.
> >>>
> >>> For the first question, expanding the number of buckets by a multiple is
> >>> recommended. Combine rehashing and clustering to re-distribute the data
> >>> without shuffling. For example, 2 buckets expand to 4 by splitting
> the 1st
> >>> bucket and rehashing the data in it into two smaller buckets: the 1st and
> 3rd buckets.
> >>> Details have been supplied in the RFC.
> >>>
> >>> For the second one, data skew when writing to Hudi with the hash index
> can be
> >>> solved by using multiple file groups per bucket, as mentioned in the
> RFC. For a
> >>> data processing engine like Spark, data skew when joining tables can be
> solved
> >>> by splitting the skewed partition into smaller units and distributing
> them
> >>> to different tasks to execute, and it works in some scenarios which have
> >>> fixed SQL patterns. Besides, a data skew solution needs more effort to be
> >>> compatible with the bucket join rule. However, the read and write long
> >>> tail caused by data skew in SQL queries is hard to solve.
> >>>
> >>> Regards,
> >>> Shawy
> >>>
>  On Jun 3, 2021 at 10:47, Danny Chan  wrote:
> 
>  Thanks for the new feature, very promising ~
> 
>  Some confusion about the *Scalability* and *Data Skew* part:
> 
>  How do we expand the number of existing buckets? Say we have 100
>  buckets before, but 120 buckets now; what is the algorithm?
> 
>  About the data skew, did you mean there is no good solution to solve
> this
>  problem now?
> 
>  Best,
>  Danny Chan
> 
>  耿筱喻  wrote on Wed, Jun 2, 2021 at 10:42 PM:
> 
> > Hi,
> > Currently, the Hudi index implementation is pluggable and provides two
> > options: bloom filter and HBase. When a Hudi table becomes large, the
> > performance of the bloom filter degrades drastically due to the increase
> in
> > false positive probability.
> >
> > The hash index is an efficient, light-weight approach to address the
> > performance issue. It is used in Hive, where it is called Bucket: it clusters
> the
> > records whose keys have the same hash value under a unique hash
> function.
> > This pre-distribution can accelerate SQL queries in some scenarios.
> > Besides, Bucket in Hive offers efficient sampling.
> >
> > I made an RFC for this:
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>> .
> >
> > Feel free to discuss under this thread and suggestions are welcome.
> >
> > Regards,
> > Shawy
> >>>
> >>>
> >>
>
>
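
For illustration, a minimal Java sketch of the pruning idea suggested in the
thread above: filter candidate files by bucket id before the min/max
record-key and bloom filter checks. The FileCandidate type and the per-file
bucket assignment are hypothetical stand-ins, not Hudi's actual Bloom index
code:

import java.util.List;
import java.util.stream.Collectors;

public class BucketPruneSketch {

  // Hypothetical stand-in for a base file plus its assigned bucket.
  static class FileCandidate {
    final String fileId;
    final int bucket;
    FileCandidate(String fileId, int bucket) {
      this.fileId = fileId;
      this.bucket = bucket;
    }
  }

  // Stable, non-negative hash -> bucket assignment.
  static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Only files in the key's bucket can contain the key, so they are the only
  // candidates left for the min/max record-key and bloom filter checks.
  static List<FileCandidate> pruneByBucket(String recordKey, int numBuckets,
                                           List<FileCandidate> files) {
    int bucket = bucketId(recordKey, numBuckets);
    return files.stream()
        .filter(f -> f.bucket == bucket)
        .collect(Collectors.toList());
  }
}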


Re: Website redesign

2021-06-30 Thread Vinoth Chandar
Any volunteers? Also worth asking in slack?

On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu 
wrote:

> Hi all,
>
> We've completed a re-design of Hudi's website (hudi.apache.org), with the
> goal of making the navigation more organized and information more
> discoverable. The design document can be found here (thanks to designer
> Joanna)
>
>
> https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6
>
> The design is ready for implementation; would like to call for volunteers
> to pick up this one!
> https://issues.apache.org/jira/browse/HUDI-1985
>
> Cheers,
> Raymond
>


Re: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-30 Thread Vinoth Chandar
This is now done, thanks to Navi!
Also checked that an update to the docs is properly reflected after this.

On Fri, Jun 25, 2021 at 8:21 AM Vinoth Chandar  wrote:

> Thanks! Happy to jump in as needed!
>
> On Thu, Jun 24, 2021 at 12:46 PM Navinder Brar
>  wrote:
>
>> Hi Vinoth
>>
>> I have created a jira for this
>> https://issues.apache.org/jira/browse/HUDI-2070.
>>
>>
>> I can assign this to myself and start working on it.
>>
>> Regards,
>> Navinder
>>
>> On Friday, 25 June, 2021, 12:23:26 am IST, Vinoth Chandar <
>> vin...@apache.org> wrote:
>>
>>  Hi Navinder,
>>
>>  Our site is pushed from the asf-site branch and it has a README on
>> building the site locally etc.; that’s a good starting point. I don’t
>> believe there is an open JIRA yet for this. On the example itself, I am not
>> sure myself since this is new. So we need to do some sleuthing and figure it
>> out.
>>
>> Please suggest next steps
>>
>> Thanks
>> Vinoth
>>
>> On Thu, Jun 24, 2021 at 8:56 AM Navinder Brar
>>  wrote:
>>
>> > Hi Vinoth,
>> >
>> > I can take this up. Is there any existing jira for this? Sorry, I am
>> > new to the Hudi community; if not, I can clone an existing one. Please share
>> a
>> > sample.
>> >
>> > Thanks,
>> > Navinder
>> >
>> >On Thursday, 24 June, 2021, 04:10:41 am IST, Vinoth Chandar <
>> > vin...@apache.org> wrote:
>> >
>> >  Hi all,
>> >
>> > Looks like this will apply to our site? Any volunteers to help fix this?
>> >
>> > Thanks
>> > Vinoth
>> >
>> > -- Forwarded message -
>> > From: Daniel Gruno 
>> > Date: Mon, May 31, 2021 at 6:41 AM
>> > Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only
>> as
>> > of July 1st
>> > To: Users 
>> >
>> >
>> > TL;DR: if your project web site is kept in subversion, disregard this
>> > email please. If your project web site is using git, and you have not
>> > deployed it via .asf.yaml, you MUST switch before July 1st or risk your
>> > web site going stale.
>> >
>> >
>> >
>> > Dear Apache projects,
>> > In order to simplify our web site publishing services and improve
>> > self-serve for projects and stability of deployments, we will be turning
>> > off the old 'gitwcsub' method of publishing git web sites. As of this
>> > moment, this involves 120 web sites. All web sites should switch to our
>> > self-serve method of publishing via the .asf.yaml meta-file. We aim to
>> > turn off gitwcsub around July 1st.
>> >
>> >
>> > ## How to publish via .asf.yaml:
>> > Publishing via .asf.yaml is described at:
>> > https://s.apache.org/asfyamlpublishing
>> > You can also see an example .asf.yaml with publishing and staging
>> > profiles for our own infra web site at:
>> >
>> https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml
>> >
>> > In short, one puts a file called .asf.yaml into the branch that needs to
>> > be published as the project's web site, with the following two-line
>> > content, in this case assuming the published branch is 'asf-site':
>> >
>> > publish:
>> >  whoami: asf-site
>> >
>> >
>> > It is important to note that the .asf.yaml file MUST be present at the
>> > root of the file system in the branch you wish to publish. The 'whoami'
>> > parameter acts as a guard, ensuring that only the intended branch is used
>> > for publishing.
>> >
>> >
>> > ## Is my project affected by this?
>> > The quickest way to check if you need to switch to a .asf.yaml approach
>> > is to check the site source page at
>> > https://infra-reports.apache.org/site-source/ - if your site is listed
>> > in yellow, you will need to switch. This page will also tell you which
>> > branch you are currently publishing as your web site. This is (should
>> > be) the branch that you must add a .asf.yaml meta file to.
>> >
>> > The web site source list updates every hour. If your project site
>> > appears in green, you are already using .asf.yaml for publishing and do
>> > not need to make any changes.
>> >
>> >
>> > ## What happens if we miss the deadline?
>> > If you miss the deadline, don't fret. Your site will of course still
>> > remain online as is, but new updates will not appear till you
>> > create/edit the .asf.yaml and set up publishing.
>> >
>> >
>> > ## Who do we contact if we have questions?
>> > Please contact us at us...@infra.apache.org if you have any additional
>> > questions.
>> >
>> >
>> > With regards,
>> > Daniel on behalf of ASF Infra.
>> >
>>
>
>


Re: [HELP] unstable tests in the travis CI

2021-06-30 Thread Vinoth Chandar
Thanks Raymond. Will chat more on the JIRA.
I am looking into some of these.

I vote for stabilizing things before we land any involved PRs.

On Sat, Jun 26, 2021 at 4:45 PM Raymond Xu 
wrote:

> I did some backlog grooming; putting flaky tests by class in subtasks there
> in
> https://issues.apache.org/jira/browse/HUDI-1248
>
> If you're working on any of those, please set the assignee. Let's
> parallelize the efforts :)
>
> Also, the Azure CI builds for master/release versions can be found here
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build?definitionId=3&_a=summary
>
> Looks like some flakiness only happened in Travis CI. Let's also keep
> observing how it goes in Azure.
>
> On Wed, Jun 23, 2021 at 12:48 PM Vinoth Chandar  wrote:
>
> > yes. CI is pretty flaky atm. There is a compiled list here
> > https://issues.apache.org/jira/browse/HUDI-1248
> >
> > Siva and I are looking into some of this and will try to get everything back
> to
> > normal again
> >
> > That schema evolution test, I have tried reproducing a few times, without
> > luck. :/
> >
> > On Wed, Jun 23, 2021 at 10:17 AM Prashant Wason  >
> > wrote:
> >
> > > Sure. I will take a look today. I wonder how the CI passed during the
> > > merge.
> > >
> > >
> > > On Wed, Jun 23, 2021 at 7:57 AM pzwpzw  > .invalid>
> > > wrote:
> > >
> > > > Hi @Prashant Wason, I found that after the [HUDI-1717]( commit hash:
> > > > 11e64b2db0ddf8f816561f8442b373de15a26d71) was merged yesterday, the
> > test
> > > > case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will
> always
> > > > crash:
> > > >
> > > > org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve
> > > > files in partition
> > > >
> > >
> >
> /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1
> > > > from metadata
> > > >
> > > > at
> > > >
> > >
> >
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
> > > > at
> > > >
> > >
> >
> org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:498)
> > > >
> > > > Can you take a look at this,  Thanks~
> > > >
> > > >
> > > >
> > > > On Jun 23, 2021 at 1:49 PM, Danny Chan  wrote:
> > > >
> > > > Hi, fellows, there are two test cases in the travis CI that fail
> very
> > > > often, which blocks our coding too many times. Please, if these tests
> > are
> > > > not stable, can we disable them first?
> > > > They are annoying ~
> > > >
> > > >
> > > > TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1]
> > > > HoodieSparkSqlWriterSuite: schema evolution for ... [2]
> > > >
> > > > [1] https://travis-ci.com/github/apache/hudi/jobs/518067391
> > > > [2] https://travis-ci.com/github/apache/hudi/jobs/518067393
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > > >
> > >
> >
>


Re: Could Hudi Data lake support low latency, high throughput random reads?

2021-06-26 Thread Vinoth Chandar
Yes. That's a working approach.

One thing I would like to suggest is the use of Hudi’s incremental queries
to update DynamoDB, as opposed to fully exporting periodically. Depending on
how much of your target DynamoDB table changes between loads, it can save
you cost and time.
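
For illustration, a minimal sketch of such an incremental pull from Spark
(Java), assuming the standard Hudi datasource options; the table path and
the checkpointing of lastCommitTime are hypothetical bookkeeping owned by
the export job, not by Hudi:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalExport {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-incr-export").getOrCreate();

    // Last commit instant the export job processed, persisted by the job
    // itself (e.g. alongside the DynamoDB data); hypothetical bookkeeping.
    String lastCommitTime = args[0];

    // Pull only the records that changed since that commit, instead of
    // re-reading the whole table.
    Dataset<Row> changed = spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", lastCommitTime)
        .load("s3://bucket/path/to/hudi_table"); // hypothetical table path

    // "changed" holds just the upserted rows; push them to DynamoDB with
    // whatever sink the job already uses (e.g. foreachPartition + AWS SDK).
    changed.show();
  }
}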

On Sat, Jun 26, 2021 at 5:43 PM Jialun Liu  wrote:

> Hey Vinoth,
>
> Thanks for your reply!
>
> I am actually looking in a different direction atm. Basically, write the
> transformed data into an OLTP database, e.g. DynamoDB; any data that needs to
> support low-latency, high-throughput reads would be exported periodically.
>
> Not sure if this is the right pattern, appreciated if you can point me to
> any similar architecture that I could study.
>
> Best regards,
> Bill
>
> On Wed, Jun 23, 2021 at 3:51 PM Vinoth Chandar  wrote:
>
> > >>>>Maybe it is just not sane to serve online request-response service
> > using Data lake as backend?
> > In general, data lakes have not evolved beyond analytics and ML at this
> point,
> > i.e. optimized for large batch scans.
> >
> > Not to say that this cannot be possible, but I am skeptical that it will
> > ever be as low-latency as your regular OLTP database.
> > Object store random reads are definitely going to cost ~100ms, like
> reading
> > from a highly loaded hard drive.
> >
> > Hudi does support an HFile format, which is more optimized for random
> reads.
> > We use it to store and serve table metadata.
> > So that path is worth pursuing, if you have the appetite for trying to
> > change the norm here. :)
> > There is probably some work to do here for scaling it for large amounts
> of
> > data.
> >
> > Hope that helps.
> >
> > Thanks
> > Vinoth
> >
> > On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu  wrote:
> >
> > > Hey Gary,
> > >
> > > Thanks for your reply!
> > >
> > > This is kinda sad that we are not able to serve the insights to
> > commercial
> > > customers in real time.
> > >
> > > Do we have any best practices/design patterns to get around the
> problem
> > in
> > > order to support online service for low latency, high throughput random
> > > reads by any chance?
> > >
> > > Best regards,
> > > Bill
> > >
> > > On Sun, Jun 6, 2021 at 2:19 AM Gary Li  wrote:
> > >
> > > > Hi Bill,
> > > >
> > > > Data lakes are used for offline analytics workloads with minutes of
> latency.
> > > > Data lakes (at least Hudi) don't fit the online request-response
> > > > service you described, for now.
> > > >
> > > > Best,
> > > > Gary
> > > >
> > > > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu 
> > > wrote:
> > > >
> > > > > Hey Felix,
> > > > >
> > > > > Thanks for your reply!
> > > > >
> > > > > I briefly researched Presto; it looks like it is designed to
> > support
> > > > the
> > > > > high concurrency of big data SQL queries. The official doc suggests
> it
> > > > could
> > > > > process queries in sub-seconds to minutes.
> > > > > https://prestodb.io/
> > > > > "Presto is targeted at analysts who expect response times ranging
> > from
> > > > > sub-second to minutes."
> > > > >
> > > > > However, the doc seems to suggest that it is supposed to be used by
> > > > > analysts running offline queries, and it is not designed to be used
> > as
> > > an
> > > > > OLTP database.
> > > > > https://prestodb.io/docs/current/overview/use-cases.html
> > > > >
> > > > > I am wondering if it is technically possible to use a data lake to
> > > support
> > > > > milliseconds-latency, high-throughput random reads at all today?
> Am I
> > > > just
> > > > > not thinking in the right direction? Maybe it is just not sane to
> > serve
> > > > > online request-response service using a data lake as the backend?
> > > > >
> > > > > Best regards,
> > > > > Bill
> > > > >
> > > > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > > > >  wrote:
> > > > >
> > > > > > Hi Bill,
> > > > > >
> > > > > > Did you try using Presto (from EMR) to query HUDI tables on S3,
> and
> >

Re: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-25 Thread Vinoth Chandar
Thanks! Happy to jump in as needed!

On Thu, Jun 24, 2021 at 12:46 PM Navinder Brar
 wrote:

> Hi Vinoth
>
> I have created a jira for this
> https://issues.apache.org/jira/browse/HUDI-2070.
>
>
> I can assign this to myself and start working on it.
>
> Regards,
> Navinder
>
> On Friday, 25 June, 2021, 12:23:26 am IST, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hi Navinder,
>
> Our site is pushed from the asf-site branch and it has a README on
> building the site locally etc.; that’s a good starting point. I don’t
> believe there is an open JIRA yet for this. On the example itself, I am not
> sure myself since this is new. So we need to do some sleuthing and figure it out.
>
> Please suggest next steps
>
> Thanks
> Vinoth
>
> On Thu, Jun 24, 2021 at 8:56 AM Navinder Brar
>  wrote:
>
> > Hi Vinoth,
> >
> > I can take this up. Is there any existing jira for this? Sorry, I am
> > new to the Hudi community; if not, I can clone an existing one. Please share a
> > sample.
> >
> > Thanks,
> > Navinder
> >
> >On Thursday, 24 June, 2021, 04:10:41 am IST, Vinoth Chandar <
> > vin...@apache.org> wrote:
> >
> >  Hi all,
> >
> > Looks like this will apply to our site? Any volunteers to help fix this?
> >
> > Thanks
> > Vinoth
> >
> > -- Forwarded message -
> > From: Daniel Gruno 
> > Date: Mon, May 31, 2021 at 6:41 AM
> > Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only
> as
> > of July 1st
> > To: Users 
> >
> >
> > TL;DR: if your project web site is kept in subversion, disregard this
> > email please. If your project web site is using git, and you have not
> > deployed it via .asf.yaml, you MUST switch before July 1st or risk your
> > web site going stale.
> >
> >
> >
> > Dear Apache projects,
> > In order to simplify our web site publishing services and improve
> > self-serve for projects and stability of deployments, we will be turning
> > off the old 'gitwcsub' method of publishing git web sites. As of this
> > moment, this involves 120 web sites. All web sites should switch to our
> > self-serve method of publishing via the .asf.yaml meta-file. We aim to
> > turn off gitwcsub around July 1st.
> >
> >
> > ## How to publish via .asf.yaml:
> > Publishing via .asf.yaml is described at:
> > https://s.apache.org/asfyamlpublishing
> > You can also see an example .asf.yaml with publishing and staging
> > profiles for our own infra web site at:
> > https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml
> >
> > In short, one puts a file called .asf.yaml into the branch that needs to
> > be published as the project's web site, with the following two-line
> > content, in this case assuming the published branch is 'asf-site':
> >
> > publish:
> >  whoami: asf-site
> >
> >
> > It is important to note that the .asf.yaml file MUST be present at the
> > root of the file system in the branch you wish to publish. The 'whoami'
> > parameter acts as a guard, ensuring that only the intended branch is used
> > for publishing.
> >
> >
> > ## Is my project affected by this?
> > The quickest way to check if you need to switch to a .asf.yaml approach
> > is to check the site source page at
> > https://infra-reports.apache.org/site-source/ - if your site is listed
> > in yellow, you will need to switch. This page will also tell you which
> > branch you are currently publishing as your web site. This is (should
> > be) the branch that you must add a .asf.yaml meta file to.
> >
> > The web site source list updates every hour. If your project site
> > appears in green, you are already using .asf.yaml for publishing and do
> > not need to make any changes.
> >
> >
> > ## What happens if we miss the deadline?
> > If you miss the deadline, don't fret. Your site will of course still
> > remain online as is, but new updates will not appear till you
> > create/edit the .asf.yaml and set up publishing.
> >
> >
> > ## Who do we contact if we have questions?
> > Please contact us at us...@infra.apache.org if you have any additional
> > questions.
> >
> >
> > With regards,
> > Daniel on behalf of ASF Infra.
> >
>


Re: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-24 Thread Vinoth Chandar
Hi Navinder,

Our site is pushed from the asf-site branch and it has a README on
building the site locally etc.; that’s a good starting point. I don’t
believe there is an open JIRA yet for this. On the example itself, I am not
sure myself since this is new. So we need to do some sleuthing and figure it out.

Please suggest next steps

Thanks
Vinoth

On Thu, Jun 24, 2021 at 8:56 AM Navinder Brar
 wrote:

> Hi Vinoth,
>
> I can take this up. Is there any existing jira for this? Sorry, I am
> new to the Hudi community; if not, I can clone an existing one. Please share a
> sample.
>
> Thanks,
> Navinder
>
> On Thursday, 24 June, 2021, 04:10:41 am IST, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hi all,
>
> Looks like this will apply to our site? Any volunteers to help fix this?
>
> Thanks
> Vinoth
>
> -- Forwarded message -
> From: Daniel Gruno 
> Date: Mon, May 31, 2021 at 6:41 AM
> Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only as
> of July 1st
> To: Users 
>
>
> TL;DR: if your project web site is kept in subversion, disregard this
> email please. If your project web site is using git, and you have not
> deployed it via .asf.yaml, you MUST switch before July 1st or risk your
> web site going stale.
>
>
>
> Dear Apache projects,
> In order to simplify our web site publishing services and improve
> self-serve for projects and stability of deployments, we will be turning
> off the old 'gitwcsub' method of publishing git web sites. As of this
> moment, this involves 120 web sites. All web sites should switch to our
> self-serve method of publishing via the .asf.yaml meta-file. We aim to
> turn off gitwcsub around July 1st.
>
>
> ## How to publish via .asf.yaml:
> Publishing via .asf.yaml is described at:
> https://s.apache.org/asfyamlpublishing
> You can also see an example .asf.yaml with publishing and staging
> profiles for our own infra web site at:
> https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml
>
> In short, one puts a file called .asf.yaml into the branch that needs to
> be published as the project's web site, with the following two-line
> content, in this case assuming the published branch is 'asf-site':
>
> publish:
>   whoami: asf-site
>
>
> It is important to note that the .asf.yaml file MUST be present at the
> root of the file system in the branch you wish to publish. The 'whoami'
> parameter acts as a guard, ensuring that only the intended branch is used
> for publishing.
>
>
> ## Is my project affected by this?
> The quickest way to check if you need to switch to a .asf.yaml approach
> is to check the site source page at
> https://infra-reports.apache.org/site-source/ - if your site is listed
> in yellow, you will need to switch. This page will also tell you which
> branch you are currently publishing as your web site. This is (should
> be) the branch that you must add a .asf.yaml meta file to.
>
> The web site source list updates every hour. If your project site
> appears in green, you are already using .asf.yaml for publishing and do
> not need to make any changes.
>
>
> ## What happens if we miss the deadline?
> If you miss the deadline, don't fret. Your site will of course still
> remain online as is, but new updates will not appear till you
> create/edit the .asf.yaml and set up publishing.
>
>
> ## Who do we contact if we have questions?
> Please contact us at us...@infra.apache.org if you have any additional
> questions.
>
>
> With regards,
> Daniel on behalf of ASF Infra.
>


Re: Could Hudi Data lake support low latency, high throughput random reads?

2021-06-23 Thread Vinoth Chandar
>>>>Maybe it is just not sane to serve online request-response service
using Data lake as backend?
In general, data lakes have not evolved beyond analytics and ML at this point,
i.e. optimized for large batch scans.

Not to say that this cannot be possible, but I am skeptical that it will
ever be as low-latency as your regular OLTP database.
Object store random reads are definitely going to cost ~100ms, like reading
from a highly loaded hard drive.

Hudi does support an HFile format, which is more optimized for random reads.
We use it to store and serve table metadata.
So that path is worth pursuing, if you have the appetite for trying to
change the norm here. :)
There is probably some work to do here for scaling it for large amounts of
data.

Hope that helps.

Thanks
Vinoth

On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu  wrote:

> Hey Gary,
>
> Thanks for your reply!
>
> This is kinda sad that we are not able to serve the insights to commercial
> customers in real time.
>
> Do we have any best practices/design patterns to get around the problem in
> order to support online service for low latency, high throughput random
> reads by any chance?
>
> Best regards,
> Bill
>
> On Sun, Jun 6, 2021 at 2:19 AM Gary Li  wrote:
>
> > Hi Bill,
> >
> > Data lakes are used for offline analytics workloads with minutes of latency.
> > Data lakes (at least Hudi) don't fit the online request-response
> > service you described, for now.
> >
> > Best,
> > Gary
> >
> > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu 
> wrote:
> >
> > > Hey Felix,
> > >
> > > Thanks for your reply!
> > >
> > > I briefly researched Presto; it looks like it is designed to support
> > the
> > > high concurrency of big data SQL queries. The official doc suggests it
> > could
> > > process queries in sub-seconds to minutes.
> > > https://prestodb.io/
> > > "Presto is targeted at analysts who expect response times ranging from
> > > sub-second to minutes."
> > >
> > > However, the doc seems to suggest that it is supposed to be used by
> > > analysts running offline queries, and it is not designed to be used as
> an
> > > OLTP database.
> > > https://prestodb.io/docs/current/overview/use-cases.html
> > >
> > > I am wondering if it is technically possible to use a data lake to
> support
> > > milliseconds-latency, high-throughput random reads at all today? Am I
> > just
> > > not thinking in the right direction? Maybe it is just not sane to serve
> > > online request-response service using a data lake as the backend?
> > >
> > > Best regards,
> > > Bill
> > >
> > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > >  wrote:
> > >
> > > > Hi Bill,
> > > >
> > > > Did you try using Presto (from EMR) to query HUDI tables on S3, and
> it
> > > > could support real time queries. And you have to partition your data
> > > > properly to minimize the amount of data each query has to
> scan/process.
> > > >
> > > > Regards,
> > > > Felix K Jose
> > > > From: Jialun Liu 
> > > > Date: Saturday, June 5, 2021 at 3:53 PM
> > > > To: dev@hudi.apache.org 
> > > > Subject: Could Hudi Data lake support low latency, high throughput
> > random
> > > > reads?
> > > >
> > > >
> > > > Hey guys,
> > > >
> > > > I am not sure if this is the right forum for this question, if you
> know
> > > > where this should be directed, appreciated for your help!
> > > >
> > > > The question is that "Could Hudi Data lake support low latency, high
> > > > throughput random reads?".
> > > >
> > > > I am considering building a data lake that produces auxiliary
> > information
> > > > for my main service table. Example, say my main service is S3 and I
> > want
> > > to
> > > > produce the S3 object pull count as the auxiliary information. I am
> > going
> > > > to use Apache Hudi and EMR to process the S3 access log to produce
> the
> > > pull
> > > > count. Now, what I don't know is that can data lake support low
> > latency,
> > > > high throughput random reads for online request-response type of
> > service?
> > > > This way I could serve this information to customers in real time.
> > > >
> > > > I could write the auxiliary information, pull count, back to the main
> > > > service table, but I personally don't think it is a sustainable
> > > > architecture. It would be hard to do independent and agile
> development
> > > if I
> > > > continue to add more derived attributes to the main table.
> > > >
> > > > Any help would be appreciated!
> > > >
> > > > Best regards,
> > > > Bill
> > > >

Fwd: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-23 Thread Vinoth Chandar
Hi all,

Looks like this will apply to our site? Any volunteers to help fix this?

Thanks
Vinoth

-- Forwarded message -
From: Daniel Gruno 
Date: Mon, May 31, 2021 at 6:41 AM
Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only as
of July 1st
To: Users 


TL;DR: if your project web site is kept in subversion, disregard this
email please. If your project web site is using git, and you have not
deployed it via .asf.yaml, you MUST switch before July 1st or risk your
web site going stale.



Dear Apache projects,
In order to simplify our web site publishing services and improve
self-serve for projects and stability of deployments, we will be turning
off the old 'gitwcsub' method of publishing git web sites. As of this
moment, this involves 120 web sites. All web sites should switch to our
self-serve method of publishing via the .asf.yaml meta-file. We aim to
turn off gitwcsub around July 1st.


## How to publish via .asf.yaml:
Publishing via .asf.yaml is described at:
https://s.apache.org/asfyamlpublishing
You can also see an example .asf.yaml with publishing and staging
profiles for our own infra web site at:
https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml

In short, one puts a file called .asf.yaml into the branch that needs to
be published as the project's web site, with the following two-line
content, in this case assuming the published branch is 'asf-site':

publish:
   whoami: asf-site


It is important to note that the .asf.yaml file MUST be present at the
root of the file system in the branch you wish to publish. The 'whoami'
parameter acts as a guard, ensuring that only the intended branch is used
for publishing.


## Is my project affected by this?
The quickest way to check if you need to switch to a .asf.yaml approach
is to check the site source page at
https://infra-reports.apache.org/site-source/ - if your site is listed
in yellow, you will need to switch. This page will also tell you which
branch you are currently publishing as your web site. This is (should
be) the branch that you must add a .asf.yaml meta file to.

The web site source list updates every hour. If your project site
appears in green, you are already using .asf.yaml for publishing and do
not need to make any changes.


## What happens if we miss the deadline?
If you miss the deadline, don't fret. Your site will of course still
remain online as is, but new updates will not appear till you
create/edit the .asf.yaml and set up publishing.


## Who do we contact if we have questions?
Please contact us at us...@infra.apache.org if you have any additional
questions.


With regards,
Daniel on behalf of ASF Infra.


Re: [HELP] unstable tests in the travis CI

2021-06-23 Thread Vinoth Chandar
yes. CI is pretty flaky atm. There is a compiled list here
https://issues.apache.org/jira/browse/HUDI-1248

Siva and I are looking into some of this and will try to get everything back to
normal again

That schema evolution test, I have tried reproducing a few times, without
luck. :/

On Wed, Jun 23, 2021 at 10:17 AM Prashant Wason 
wrote:

> Sure. I will take a look today. I wonder how the CI passed during the
> merge.
>
>
> On Wed, Jun 23, 2021 at 7:57 AM pzwpzw 
> wrote:
>
> > Hi @Prashant Wason, I found that after the [HUDI-1717]( commit hash:
> > 11e64b2db0ddf8f816561f8442b373de15a26d71) was merged yesterday, the test
> > case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will always
> > crash:
> >
> > org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve
> > files in partition
> >
> /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1
> > from metadata
> >
> > at
> >
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
> > at
> >
> org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> >
> > Can you take a look at this,  Thanks~
> >
> >
> >
> > On Jun 23, 2021 at 1:49 PM, Danny Chan  wrote:
> >
> > Hi, fellows, there are two test cases in the travis CI that fail very
> > often, which blocks our coding too many times. Please, if these tests are
> > not stable, can we disable them first?
> > They are annoying ~
> >
> >
> > TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1]
> > HoodieSparkSqlWriterSuite: schema evolution for ... [2]
> >
> > [1] https://travis-ci.com/github/apache/hudi/jobs/518067391
> > [2] https://travis-ci.com/github/apache/hudi/jobs/518067393
> >
> > Best,
> > Danny Chan
> >
> >
>


Re: [Discuss] Provide a Flag to choose between Flink or Spark

2021-06-16 Thread Vinoth Chandar
+1 on this effort overall. It will be a little tricky, but doable.

The first thing is to see how we can replace raw usages of Spark APIs with
HoodieEngineContext. It will be cool if we can completely generify
DeltaStreamer,
but I suspect we need Flink/Spark-specific modules ultimately.
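
As a rough illustration of what engine-agnostic code could look like, a
minimal Java sketch; the HoodieEngineContext#map signature is assumed from
hudi-client-common, and the partition-depth computation is only a stand-in
workload:

import java.util.List;
import org.apache.hudi.common.engine.HoodieEngineContext;

public class EngineAgnosticStep {

  // The same logic runs on Spark, Flink, or locally, depending on which
  // HoodieEngineContext implementation is passed in; no JavaSparkContext
  // usage leaks into the shared code.
  static List<Integer> partitionDepths(HoodieEngineContext context,
                                       List<String> partitionPaths) {
    return context.map(partitionPaths, p -> p.split("/").length, partitionPaths.size());
  }
}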

On Wed, Jun 16, 2021 at 12:17 AM Danny Chan  wrote:

> There was actually an issue here:
> https://issues.apache.org/jira/browse/HUDI-1872, maybe you can take it and
> go on with the work ~
>
> Best,
> Danny Chan
>
> Vinay Patil  于2021年6月11日周五 下午3:26写道:
>
> > Thank you Danny for your response.
> >
> > Can we have a JIRA story listing all the refactoring required for the
> > Hudi-Flink code as well?
> >
> > I will create a task if we agree that a flag will be helpful to choose
> > different runners.
> >
> > Regards,
> > Vinay Patil
> >
> >
> > On Fri, Jun 11, 2021 at 12:34 PM Danny Chan 
> wrote:
> >
> > > Basically agree with that, but before that we may need some refactoring
> > to
> > > the existing code:
> > >
> > > Move the HoodieFlinkStreamer from the hudi-flink module into the
> > > hudi-utilities to be together with the HoodieDeltaStreamer.
> > > We are planning to add separate Flink compaction programs too, which
> have
> > > the same problem.
> > >
> > > Best,
> > > Danny Chan
> > >
> > > Vinay Patil  于2021年6月9日周三 下午3:42写道:
> > >
> > > > Hi Team,
> > > >
> > > > Currently, Hudi supports Flink as well as Spark; there are two different
> > > > classes
> > > > 1. HoodieDeltaStreamer
> > > > 2. FlinkHoodieDeltaStreamer
> > > >
> > > > Should we have a provision to pass a flag like --runner to choose
> > > between
> > > > Flink or Spark and have a single entry point class which will take
> all
> > > the
> > > > common configs?
> > > >
> > > > Based on the runner flag, we can call HoodieDeltaStreamer or
> > > > FlinkHoodieDeltaStreamer
> > > >
> > > > Thoughts?
> > > >
> > > > Regards,
> > > > Vinay Patil
> > > >
> > >
> >
>


Re: Why hudi consider the Avro be the MOR's log format?

2021-06-15 Thread Vinoth Chandar
Hi,

We wanted a row-based format to quickly log changes to the base files and
flexibly compact the file groups we wanted. If we wrote parquet, for example, we
would incur the cost of writing parquet (can be up to 10x even) once during
ingest and once again during compaction.

Of course. This trades off query latency for ingest cost. There is also
ongoing work to flexibly keep log block data in parquet. See
InlineFileSystem/tests if interested.

Thanks
Vinoth

On Mon, Jun 14, 2021 at 1:54 AM LakeShen  wrote:

> Hi community,
>
> I have a question: why did Hudi choose Avro as the MOR log format?
>
>
> Best,
> LakeShen
>


Re: [DISCUSS] Hash Index for HUDI

2021-06-04 Thread Vinoth Chandar
Thanks for opening the RFC! At first glance, it seemed similar to RFC-08,
but the proposal seems to be adding a bucket id to each file group ID?
If I may suggest, we should call this BucketedIndex?

Instead of changing the existing file name, can we simply assign the
filegroupID as the hash mod value? i.e. just make the fileGroupIDs 0 to
numBuckets-1 (with some hash value of the partition path as well, for
uniqueness across the table)?
This way it is a localized change, not a major change in how we name
files/objects?
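
To make that concrete, a minimal Java sketch of such a scheme; the format
string and helper names are illustrative, not a proposal for the actual file
naming, and the check in main mirrors the bucket-splitting idea discussed in
the quoted mail below:

public class BucketFileGroupSketch {

  // Route a record to a bucket via a stable, non-negative hash.
  static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // A file group id derived from the partition path hash plus the bucket
  // number: the bucket identity is embedded in the file group id itself,
  // so no renaming of files/objects is needed.
  static String fileGroupId(String partitionPath, int bucket) {
    return String.format("%08x-%04d", partitionPath.hashCode(), bucket);
  }

  public static void main(String[] args) {
    // Doubling property behind split-based expansion: a key in bucket i of
    // n buckets lands in bucket i or i + n after expanding to 2n buckets,
    // so only the old bucket's file groups need rewriting.
    int n = 2;
    int before = bucketId("id1", n);
    int after = bucketId("id1", 2 * n);
    assert after == before || after == before + n;
    System.out.println(fileGroupId("par1", after));
  }
}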

I will review the RFC more carefully, early next week.

Thanks
Vinoth







On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻  wrote:

> Thank you for your questions.
>
> For the first question, expanding the number of buckets by a multiple is
> recommended. Combine rehashing and clustering to re-distribute the data
> without shuffling. For example, 2 buckets expand to 4 by splitting the 1st
> bucket and rehashing the data in it into two smaller buckets: the 1st and 3rd buckets.
> Details have been supplied to the RFC.
>
> For the second one, data skew when writing to Hudi with the hash index can be
> solved by using multiple file groups per bucket as mentioned in the RFC. For a
> data processing engine like Spark, data skew when joining tables can be solved
> by splitting the skewed partition into smaller units and distributing them
> to different tasks to execute, and it works in some scenarios which have
> fixed SQL patterns. Besides, the data skew solution needs more effort to be
> compatible with the bucket join rule. However, the read and write long tail
> caused by data skew in SQL queries is hard to solve.
>
> Regards,
> Shawy
>
> > On Jun 3, 2021 at 10:47, Danny Chan  wrote:
> >
> > Thanks for the new feature, very promising ~
> >
> > Some confusion about the *Scalability* and *Data Skew* part:
> >
> > How do we expand the number of existing buckets? Say we had 100
> > buckets before, but 120 buckets now; what is the algorithm?
> >
> > About the data skew, did you mean there is no good solution to solve this
> > problem now?
> >
> > Best,
> > Danny Chan
> >
> > 耿筱喻  wrote on Wed, Jun 2, 2021 at 10:42 PM:
> >
> >> Hi,
> >> Currently, the Hudi index implementation is pluggable and provides two
> >> options: bloom filter and HBase. When a Hudi table becomes large, the
> >> performance of the bloom filter degrades drastically due to the increase in
> >> false positive probability.
> >>
> >> The hash index is an efficient, light-weight approach to address the
> >> performance issue. It is used in Hive, where it is called Bucket: it clusters the
> >> records whose keys have the same hash value under a unique hash function.
> >> This pre-distribution can accelerate SQL queries in some scenarios.
> >> Besides, Bucket in Hive offers efficient sampling.
> >>
> >> I made an RFC for this:
> >> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> .
> >>
> >> Feel free to discuss under this thread and suggestions are welcome.
> >>
> >> Regards,
> >> Shawy
>
>


Welcome new committers and PMC Members!

2021-05-11 Thread Vinoth Chandar
Hello all,

Please join me in congratulating our newest set of committers and PMCs.

*Wenning Ding (Committer) *
Wenning has been a consistent contributor to Hudi, over the past year or
so. He has added some critical bug fixes and lots of good contributions
around Spark!

*Gary Li (PMC Member) *
Gary is a regular feature on all our support channels. He has contributed
numerous features to Hudi, and evangelized across many companies including
Bosch/Bytedance. Most of all, he is a solid team player and an asset to the
project.

Thanks so much for your continued contributions, to make Hudi better and
better!

Thanks
Vinoth


Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-28 Thread Vinoth Chandar
Hi Danny,

Thanks, I will review this asap. Already, in the "review in progress"
column.

Thanks
Vinoth

On Thu, Apr 22, 2021 at 12:49 AM Danny Chan  wrote:

> > Should we throw together a PoC/test code for an example Flink pipeline
> that
> will use hudi cdc flags + stateful operators?
>
> I have updated the pr https://github.com/apache/hudi/pull/2854,
>
> see the test case HoodieDataSourceITCase#testStreamReadWithDeletes.
>
> A data source:
>
> change_flag | uuid | name | age | ts | partition
>
> I, id1, Danny, 23, 1970-01-01T00:00:01, par1
> I, id2, Stephen, 33, 1970-01-01T00:00:02, par1
> I, id3, Julian, 53, 1970-01-01T00:00:03, par2
> I, id4, Fabian, 31, 1970-01-01T00:00:04, par2
> I, id5, Sophia, 18, 1970-01-01T00:00:05, par3
> I, id6, Emma, 20, 1970-01-01T00:00:06, par3
> I, id7, Bob, 44, 1970-01-01T00:00:07, par4
> I, id8, Han, 56, 1970-01-01T00:00:08, par4
> U, id1, Danny, 24, 1970-01-01T00:00:01, par1
> U, id2, Stephen, 34, 1970-01-01T00:00:02, par1
> I, id3, Julian, 53, 1970-01-01T00:00:03, par2
> D, id5, Sophia, 18, 1970-01-01T00:00:05, par3
> D, id9, Jane, 19, 1970-01-01T00:00:06, par3
>
> the streaming query "select name, sum(age) from t1 group by name" returns:
>
> change_flag | name | age_sum
> I, Danny, 24
> I Stephen, 34
>
> The result is the same as a batch snapshot query.
>
> Best,
> Danny Chan
>
> Vinoth Chandar  wrote on Wed, Apr 21, 2021 at 1:32 PM:
>
> > Keeping compatibility is a must, i.e. users should be able to upgrade to
> the
> > new release with the _hoodie_cdc_flag meta column,
> > and be able to query new data (with this new meta col) alongside old data
> > (without this new meta col).
> > In fact, they should be able to downgrade back to previous versions (say
> > there is some other snag they hit), and go back to not writing this new
> > meta column.
> > If this is too hard, then a config to control it is not a bad idea, at least
> > for an initial release?
> >
> > Thanks for clarifying the use-case! Makes total sense to me and look
> > forward to getting this going.
> > Should we throw together a PoC/test code for an example Flink pipeline
> that
> > will use hudi cdc flags + stateful operators?
> > It'll help us iron out gaps iteratively, finalize requirements - instead
> of
> > a more top-down, waterfall-like model?
> >
> > On Tue, Apr 20, 2021 at 8:25 PM Danny Chan  wrote:
> >
> > > > Is it providing the ability to author continuous queries on
> > > Hudi source tables end-end,
> > > given Flink can use the flags to generate retract/upsert streams
> > >
> > > Yes, that's the key point: with these flags plus Flink stateful
> operators,
> > > we can have a real-time incremental ETL pipeline.
> > >
> > > For example, a global aggregation that consumes cdc stream can do
> > > acc/retract continuously and send the changes to downstream.
> > > The ETL pipeline with cdc stream generates the same result as the batch
> > > snapshot with the same sql query.
> > >
> > > If keeping compatibility is a must with/without the new metadata
> > columns, I
> > > think there is no need to add a config option which brings in
> > > unnecessary overhead. If we do not ensure backward compatibility for the
> new
> > > column, then we should add such a config option and disable it by
> > > default.
> > >
> > > Best,
> > > Danny Chan
> > >
> > >
> > > Vinoth Chandar  wrote on Wed, Apr 21, 2021 at 6:30 AM:
> > >
> > > > Hi Danny,
> > > >
> > > > Read up on the Flink docs as well.
> > > >
> > > > If we don't actually publish data to the metacolumn, I think the
> > overhead
> > > > is pretty low w.r.t avro/parquet. Both are very good at encoding
> nulls.
> > > > But, I feel it's worth adding a HoodieWriteConfig to control this and
> > > since
> > > > addition of meta columns mostly happens in the handles,
> > > > it may not be as bad? Happy to suggest more concrete ideas on the
> PR.
> > > >
> > > > We still need to test backwards compatibility from different engines
> > > quite
> > > > early though and make sure there are no surprises.
> > > > Hive, Parquet, Spark, Presto all have their own rules on evolution as
> > > well.
> > > > So we need to think this through if/how seamlessly this can be turned
> > on
> > > > for existing tables
> > > >
> > > > As for testing the new column, given Flink is w

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Keeping compatibility is a must, i.e. users should be able to upgrade to the
new release with the _hoodie_cdc_flag meta column,
and be able to query new data (with this new meta col) alongside old data
(without this new meta col).
In fact, they should be able to downgrade back to previous versions (say
there is some other snag they hit), and go back to not writing this new
meta column.
If this is too hard, then a config to control it is not a bad idea, at least
for an initial release?

Thanks for clarifying the use-case! Makes total sense to me and look
forward to getting this going.
Should we throw together a PoC/test code for an example Flink pipeline that
will use hudi cdc flags + stateful operators?
It'll help us iron out gaps iteratively, finalize requirements - instead of
a more top-down, waterfall-like model?
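
For reference, a minimal sketch of what such a PoC could look like through
the Flink SQL connector (Java Table API); the connector options shown
('read.streaming.enabled' etc.) are assumed from the hudi-flink module, and
the path is hypothetical:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class CdcPipelinePoc {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // A Hudi table read as an unbounded change stream.
    tEnv.executeSql(
        "CREATE TABLE t1 ("
      + " uuid VARCHAR(20), name VARCHAR(10), age INT, ts TIMESTAMP(3), `partition` VARCHAR(20)"
      + ") PARTITIONED BY (`partition`) WITH ("
      + " 'connector' = 'hudi',"
      + " 'path' = 'file:///tmp/t1'," // hypothetical path
      + " 'table.type' = 'MERGE_ON_READ',"
      + " 'read.streaming.enabled' = 'true'"
      + ")");

    // A stateful aggregation over the stream: with the cdc flags, Flink can
    // retract superseded rows and continuously emit corrected sums, matching
    // what a batch snapshot query would return.
    tEnv.executeSql("SELECT name, SUM(age) FROM t1 GROUP BY name").print();
  }
}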

On Tue, Apr 20, 2021 at 8:25 PM Danny Chan  wrote:

> > Is it providing the ability to author continuous queries on
> Hudi source tables end-end,
> given Flink can use the flags to generate retract/upsert streams
>
> Yes, that's the key point: with these flags plus Flink stateful operators,
> we can have a real-time incremental ETL pipeline.
>
> For example, a global aggregation that consumes cdc stream can do
> acc/retract continuously and send the changes to downstream.
> The ETL pipeline with cdc stream generates the same result as the batch
> snapshot with the same sql query.
>
> If keeping compatibility is a must with/without the new metadata columns, I
> think there is no need to add a config option which brings in
> unnecessary overhead. If we do not ensure backward compatibility for the new
> column, then we should add such a config option and disable it
> by default.
>
> Best,
> Danny Chan
>
>
Vinoth Chandar  wrote on Wed, Apr 21, 2021 at 6:30 AM:
>
> > Hi Danny,
> >
> > Read up on the Flink docs as well.
> >
> > If we don't actually publish data to the metacolumn, I think the overhead
> > is pretty low w.r.t avro/parquet. Both are very good at encoding nulls.
> > But, I feel it's worth adding a HoodieWriteConfig to control this and
> since
> > addition of meta columns mostly happens in the handles,
> > it may not be as bad ? Happy to suggest more concrete ideas on the PR.
> >
> > We still need to test backwards compatibility from different engines
> quite
> > early though and make sure there are no surprises.
> > Hive, Parquet, Spark, Presto all have their own rules on evolution as
> well.
> > So we need to think this through if/how seamlessly this can be turned on
> > for existing tables
> >
> > As for testing the new column, given Flink is what will be able to
> consume
> > the flags, can we write a quick unit test using Dynamic Tables?
> > I am also curious to understand how the flags help the end user
> ultimately?
> > Reading the flink docs, I understand the concepts (coming from a Kafka
> > streams world,
> > most of it seems familiar), but what exact problem does the flag solve
> that
> > exist today? Is it providing the ability to author continuous queries on
> > Hudi source tables end-end,
> > given Flink can use the flags to generate retract/upsert streams?
> >
> > For hard deletes, we still need to do some core work to make it available
> > in the incremental query. So there's more to be done here for cracking
> this
> > end-end streaming/continuous ETL vision?
> >
> > Very exciting stuff!
> >
> > Thanks
> > Vinoth
> >
> >
> >
> >
> >
> > On Tue, Apr 20, 2021 at 2:23 AM Danny Chan  wrote:
> >
> > > Hi, I have created a PR here:
> > > https://github.com/apache/hudi/pull/2854/files
> > >
> > > In the PR I made these changes:
> > > 1. Add a metadata column: "_hoodie_cdc_operation"; I did not add a
> config
> > > option because I could not find a good way to make the code clean, and a
> > metadata
> > > column is very primitive while a config option would introduce too many
> > > changes
> > > 2. Modify the write handles to add the column: add the operation for the append
> > > handle but merge the changes for the create handle and merge handle
> > > 3. The flag is only useful for streaming reads, so I also merge the
> flags
> > > for the Flink batch reader; the Flink streaming reader would emit each record
> > with
> > > the right cdc operation
> > >
> > > I did not change any Spark code because i'm not familiar with that,
> Spark
> > > actually cannot handle these flags in operators. So by default, the
> > > column "_hoodie_cdc_operation" has a value from the Flink writer.
> > >

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
hen adding this support would be elegant.
> >> >
> >> >
> >> >
> >> > On Thu, Apr 15, 2021 at 11:33 PM Danny Chan 
> >> wrote:
> >> >
> >> >> Thanks Vinoth ~
> >> >>
> >> >> Here is a document about the notion of the "Flink Dynamic Table" [1];
> every
> >> >> operator that has accumulated state can handle
> >> retractions (UPDATE_BEFORE or
> >> >> DELETE) and then apply new changes (INSERT or UPDATE_AFTER), so that each
> >> >> operator can consume the CDC format messages in a streaming way.
> >> >>
> >> >> > Another aspect to think about is, how the new flag can be added to
> >> >> existing
> >> >> tables and if the schema evolution would be fine.
> >> >>
> >> >> That is also my concern, but it's not that bad because adding a new
> >> column
> >> >> is still compatible with the old schema in Avro.
> >> >>
> >> >> [1]
> >> >>
> >> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> >> >>
> >> >> Best,
> >> >> Danny Chan
> >> >>
> >> >> Vinoth Chandar  wrote on Fri, Apr 16, 2021 at 9:44 AM:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > Is the intent of the flag to convey if an insert, delete, or update
> >> >> changed
> >> >> > the record? If so I would imagine that we do this even for cow
> >> tables,
> >> >> > since that also supports a logical notion of a change stream using
> >> the
> >> >> > commit_time meta field.
> >> >> >
> >> >> > You may be right, but I am trying to understand the use case for
> >> this.
> >> >> Any
> >> >> > links/flink docs I can read?
> >> >> >
> >> >> > Another aspect to think about is, how the new flag can be added to
> >> >> existing
> >> >> > tables and if the schema evolution would be fine.
> >> >> >
> >> >> > Thanks
> >> >> > Vinoth
> >> >> >
> >> >> > On Thu, Apr 8, 2021 at 2:13 AM Danny Chan 
> >> wrote:
> >> >> >
> >> >> > > I tries to do a POC for flink locally and it works well, in the
> PR
> >> i
> >> >> add
> >> >> > a
> >> >> > > new metadata column named "_hoodie_change_flag", but actually i
> >> found
> >> >> > that
> >> >> > > only log format needs this flag, and the Spark may has no ability
> >> to
> >> >> > handle
> >> >> > > the flag for incremental processing yet.
> >> >> > >
> >> >> > > So should i add the "_hoodie_change_flag" metadata column, or is
> >> there
> >> >> > any
> >> >> > > better solution for this?
> >> >> > >
> >> >> > > Best,
> >> >> > > Danny Chan
> >> >> > >
> >> >> > > > Danny Chan  wrote on Fri, Apr 2, 2021 at 11:08 AM:
> >> >> > >
> >> >> > > > Thanks, cool. Then the remaining questions are:
> >> >> > > >
> >> >> > > > - where we record these changes, should we add a builtin meta
> >> field
> >> >> such
> >> >> > > as
> >> >> > > > the _change_flag_ like the other system columns for e.g
> >> >> > > _hoodie_commit_time
> >> >> > > > - what kind of table should keep these flags, in my thoughts,
> we
> >> >> should
> >> >> > > > only add these flags for "MERGE_ON_READ" table, and only for
> AVRO
> >> >> logs
> >> >> > > > - we should add a config there to switch on/off the flags in
> >> system
> >> >> > meta
> >> >> > > > fields
> >> >> > > >
> >> >> > > > What do you think?
> >> >> > > >
> >> >> > > > Best,
> >> >> > > > Danny Chan
> >> >> > > >
> >> >> > > > vino yang  wrote on Thu, Apr 1, 2021 at 10:58 AM:

Re: Re[2]:Re: About re-run Travis CI

2021-04-20 Thread Vinoth Chandar
Ack.

But the "rerun tests" bot should be working. I see the github actions
running actually. So not sure.

https://github.com/apache/hudi/actions

May be need a JIRA to investigate :)



On Fri, Apr 16, 2021 at 6:44 AM Roc Marshal  wrote:

>
>
>
> Susu Dong,
> Thanks for your help.
>  Now I confirm there are two ways to trigger CI:
>  1. Close the PR, then reopen the PR. (by Aditya Tiwari)
> 2. Do an “empty commit” to trigger CI.
> https://stackoverflow.com/questions/20138640/pushing-empty-commits-to-remote
> (by Susu Dong)
>
>
>  Thank you.
>
>
> Best , Roc.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On 2021-04-16 20:25:04, "Walter Dong"  wrote:
> >
> >Hey Roc,
> >
> >You could do an “empty commit” to trigger CI. Git allows you to submit an
> empty commit with the message only.
> >
> >
> https://stackoverflow.com/questions/20138640/pushing-empty-commits-to-remote
> >
> >Susu
> >
> >Friday, April 16, 2021 19:35 +0900 from flin...@126.com:
> >>Hi Vinoth Chandar, thanks for your reply.
> >>https://github.com/apache/hudi/pull/2822
> >>I left a comment "rerun tests", but it didn't work.
> >>It would be great if the gitbot could accept a command to activate the
> test process.
> >>
> >>
> >>
> >>Best, Roc
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>At 2021-04-16 12:57:56, "Vinoth Chandar" < vin...@apache.org > wrote:
> >>>If you leave a comment "rerun tests", I think there is a bot that also
> >>>kicks it off again.
> >>>
> >>>Please report if that still works, and if possible kindly send us a PR
> to
> >>>update the contributing page with this info :)
> >>>
> >>>On Thu, Apr 15, 2021 at 9:56 PM Vinoth Chandar < vin...@apache.org >
> wrote:
> >>>
> >>>> Hi Roc,
> >>>>
> >>>> You should be able to click the travis build, and restart from the
> >>>> travis-ci page.
> >>>>
> >>>> Thanks
> >>>> Vinoth
> >>>>
> >>>> On Thu, Apr 15, 2021 at 8:00 PM Roc Marshal < flin...@126.com >
> wrote:
> >>>>
> >>>>> Hello, all.
> >>>>> In some cases, the failed Travis CI test is shown on the GitHub PR
> page.
> >>>>> How do we rerun Travis CI?
> >>>>> Thank you .
> >>>>>
> >>>>>
> >>>>> Best Roc.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
>


Re: [DISCUSS] Hudi is the data lake platform

2021-04-19 Thread Vinoth Chandar
Looks like we have consensus here!  Will share the blog PR here once ready.

Thanks all!

On Fri, Apr 16, 2021 at 8:43 PM Sivabalan  wrote:

> totally +1 on clarifying Hudi's vision.
>
> On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal 
> wrote:
>
> > +1
> >
> > I also believe Hudi is a Data Platform technology providing many
> different
> > functionalities to build modern data lakes, Hudi's table format being
> just
> > one of them. I've been using this perspective in some of the conference
> > talks already ;)
> > With this rebranding (and hopefully some code/package structuring down
> the
> > road..), it's easier for us to communicate the value add of Hudi and its
> > associated features and generate interest for future contributors.
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar 
> wrote:
> >
> > > Thanks everyone for the feedback, so far!
> > >
> > > On the incremental aspects, that's actually Hudi's core design
> > > differentiation. While I believe the ETL today is still largely batch
> > > oriented, the way forward for everyone's
> > > benefit is indeed - incremental processing. We have already taken a
> giant
> > > step here for e.g in making raw data ingestion fully incremental using
> > > deltastreamer. We should keep working to crack incremental ETL at
> large.
> > > 100% with your line of thinking!
> > >
> > > It's been in my head for four full years now! :)
> > >
> > >
> >
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
> > >
> > > I have started drafting a blog/PR along these lines already. I will
> make
> > it
> > > more final and share it here, as we wait a couple more days for more
> feedback!
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Apr 13, 2021 at 7:01 PM Danny Chan 
> wrote:
> > >
> > > > +1 for the vision; personally I find the incremental ETL part promising,
> > > with
> > > > with an engine like Apache Flink we can do intermediate aggregation in
> > streaming
> > > > style.
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > > > leesf  wrote on Wed, Apr 14, 2021 at 9:52 AM:
> > > >
> > > > > +1. Cool and promising.
> > > > >
> > > > > Mehrotra, Udit  wrote on Wed, Apr 14, 2021 at 2:57 AM:
> > > > >
> > > > > > Agree with the rebranding Vinoth. Hudi is not just a "table
> format"
> > > and
> > > > > we
> > > > > > need to do justice to all the cool auxiliary features/services we
> > > have
> > > > > > built.
> > > > > >
> > > > > > Also, timeline metadata service in particular would be a really
> big
> > > win
> > > > > if
> > > > > > we move towards something like that.
> > > > > >
> > > > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma"  >
> > > > wrote:
> > > > > >
> > > > > > CAUTION: This email originated from outside of the
> > organization.
> > > Do
> > > > > > not click links or open attachments unless you can confirm the
> > sender
> > > > and
> > > > > > know the content is safe.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Definitely we are doing much more than only ingesting and
> > > managing
> > > > > data
> > > > > > over DFS.
> > > > > >
> > > > > > +1 from my side as well. :)
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <
> > susudo...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I love this rebranding. Totally agree. +1
> > > > > > >
> > > > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > > > > > xu.shiyan.raym...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 The vision looks fantastic.
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <
> gar...@apache.org
> > >

Re: PR Tracker board

2021-04-19 Thread Vinoth Chandar
Updated all open PRs into the following columns.

*Opened PRs* => New PRs, PRs with open issues, unclear problem statements,
non-ideal solution approaches
*Ready for Review* => PRs in final reviewable shape
*Review in progress* => PRs being actively reviewed


On Mon, Apr 19, 2021 at 9:40 AM Vinoth Chandar  wrote:

> Hi all,
>
> I know we have a build-up of great contributions :) [a great problem to
> have] that has kind of exceeded our existing triaging processes.
>
> So, in order to generate more transparency into the review process and
> understand where PRs are in the pipeline, I made a tracker board here:
>
> https://github.com/apache/hudi/projects/7
>
> I'll be working on triaging all 80+ PRs this week and move them to the
> right columns.
>
> Thanks
> Vinoth
>


Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-19 Thread Vinoth Chandar
The biggest difference between PR 1094 and the currently open PR is the addition of
fallback support, and that there is no moving around of configs in the same PR.
This should make the effort straightforward, IMO.

>HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP in their client code, they
need to either replace it with
HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP.key()
I think this is a small cost we can take, in return for much better docs
and maintainability.

On Mon, Apr 19, 2021 at 1:16 PM Vinoth Chandar  wrote:

> +1 from me. Long time coming.
>
> On Mon, Apr 19, 2021 at 12:02 PM Ding, Wenning 
> wrote:
>
>> Hi,
>> I plan to refactor the current Hudi configuration framework. lamberken<
>> https://github.com/lamberken> did similar things before in
>> https://github.com/apache/hudi/pull/1094, and I’d like to continue this
>> work and add more features to the ConfigOption class.
>>
>> The motivation for this change is, as lamberken noted, that config items
>> and their default values are currently dispersed across the Java class
>> files, which becomes confusing as more and more config items are defined.
>> This change would make it easy for Hudi developers to use and check these
>> configurations.
>>
>> We can also bind the configuration description within the ConfigOption
>> class. As a next step, we could do something similar to Flink and
>> automatically add/update property descriptions on the Hudi website:
>> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/description/Description.java
>> .
>> Besides, we can bind an inference function within the ConfigOption class,
>> providing a rule-based inference mechanism for some of our configurations.
>> For example, we can infer the key generator class from the Hudi record key
>> and partition fields: if the record key field contains a comma, indicating
>> multiple record keys, we should default to ComplexKeyGenerator; if there is
>> no partition column, we should use NonpartitionedKeyGenerator. This
>> inference mechanism makes Hudi more intelligent, so users don’t need to set
>> so many parameters on the client side.
>> The disadvantage of this change is that users who currently use e.g.
>> HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP in their client code need to
>> replace it with either
>> HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP.key() or
>> hoodie.bootstrap.base.path.
>> I opened a demo for this: https://github.com/apache/hudi/pull/2833.
>> Feel free to discuss under this thread and provide any suggestions!
>>
>> Related JIRAs:
>> https://issues.apache.org/jira/browse/HUDI-89
>> https://issues.apache.org/jira/projects/HUDI/issues/HUDI-375
>>
>> Thanks,
>> Wenning
>>
>>
>>
>>
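
To make the fallback idea above concrete, here is a minimal Java sketch of
what such a ConfigOption class could look like: a key() accessor,
deprecated-key fallback, and an optional inference hook. The names and
signatures below are illustrative assumptions for this discussion, not the
actual Hudi API.

    import java.util.Properties;
    import java.util.function.Function;

    // Illustrative sketch only: names/signatures are assumptions, not the final API.
    public class ConfigOption<T> {

      private final String key;            // e.g. "hoodie.bootstrap.base.path"
      private final T defaultValue;        // used when neither key nor fallback is set
      private final String[] fallbackKeys; // older/deprecated keys to also look up
      private final String description;    // doc text, usable for site generation
      private final Function<Properties, T> inferFunc; // optional rule-based inference

      public ConfigOption(String key, T defaultValue, String[] fallbackKeys,
                          String description, Function<Properties, T> inferFunc) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.fallbackKeys = fallbackKeys;
        this.description = description;
        this.inferFunc = inferFunc;
      }

      public String key() {
        return key;
      }

      public String description() {
        return description;
      }

      // Resolution order: exact key, then deprecated fallbacks, then inference,
      // then the default value.
      @SuppressWarnings("unchecked")
      public T resolve(Properties props) {
        if (props.containsKey(key)) {
          return (T) props.getProperty(key);
        }
        for (String fallback : fallbackKeys) {
          if (props.containsKey(fallback)) {
            return (T) props.getProperty(fallback);
          }
        }
        if (inferFunc != null) {
          T inferred = inferFunc.apply(props);
          if (inferred != null) {
            return inferred;
          }
        }
        return defaultValue;
      }
    }

With a class like this, each config is declared once with its key, default,
description, and fallbacks, and resolved uniformly everywhere, which is what
enables both the generated docs and the key-renaming migration path.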


Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-19 Thread Vinoth Chandar
+1 from me. Long time coming.

On Mon, Apr 19, 2021 at 12:02 PM Ding, Wenning 
wrote:

> Hi,
> I plan to refactor the current Hudi configuration framework. lamberken<
> https://github.com/lamberken> did similar things before in
> https://github.com/apache/hudi/pull/1094, and I’d like to continue this
> work and add more features to the ConfigOption class.
>
> The motivation for this change is, as lamberken noted, that config items
> and their default values are currently dispersed across the Java class
> files, which becomes confusing as more and more config items are defined.
> This change would make it easy for Hudi developers to use and check these
> configurations.
>
> We can also bind the configuration description within the ConfigOption
> class. As a next step, we could do something similar to Flink and
> automatically add/update property descriptions on the Hudi website:
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/description/Description.java
> .
> Besides, we can bind an inference function within the ConfigOption class,
> providing a rule-based inference mechanism for some of our configurations.
> For example, we can infer the key generator class from the Hudi record key
> and partition fields: if the record key field contains a comma, indicating
> multiple record keys, we should default to ComplexKeyGenerator; if there is
> no partition column, we should use NonpartitionedKeyGenerator. This
> inference mechanism makes Hudi more intelligent, so users don’t need to set
> so many parameters on the client side.
> The disadvantage of this change is that users who currently use e.g.
> HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP in their client code need to
> replace it with either
> HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP.key() or
> hoodie.bootstrap.base.path.
> I opened a demo for this: https://github.com/apache/hudi/pull/2833. Feel
> free to discuss under this thread and provide any suggestions!
>
> Related JIRAs:
> https://issues.apache.org/jira/browse/HUDI-89
> https://issues.apache.org/jira/projects/HUDI/issues/HUDI-375
>
> Thanks,
> Wenning
>
>
>
>


PR Tracker board

2021-04-19 Thread Vinoth Chandar
Hi all,

I know we have a build-up of great contributions :) [a great problem to
have] that has kind of exceeded our existing triaging processes.

So, in order to generate more transparency into the review process and
understand where PRs are in the pipeline, made a tracker board here

https://github.com/apache/hudi/projects/7

I'll be working on triaging all 80+ PRs this week and move them to the
right columns.

Thanks
Vinoth


Re: About re-run Travis CI

2021-04-15 Thread Vinoth Chandar
If you leave a comment "rerun tests", I think there is a bot that also
kicks it off again.

Please report if that still works, and if possible kindly send us a PR to
update the contributing page with this info :)

On Thu, Apr 15, 2021 at 9:56 PM Vinoth Chandar  wrote:

> Hi Roc,
>
> You should be able to click the travis build, and restart from the
> travis-ci page.
>
> Thanks
> Vinoth
>
> On Thu, Apr 15, 2021 at 8:00 PM Roc Marshal  wrote:
>
>> Hello, all.
>> In some cases, the Travis CI test failure is shown on the GitHub PR page.
>> How can I rerun Travis CI?
>> Thank you.
>>
>>
>> Best Roc.
>>
>>
>>
>>
>>
>>


Re: About re-run Travis CI

2021-04-15 Thread Vinoth Chandar
Hi Roc,

You should be able to click the travis build, and restart from the
travis-ci page.

Thanks
Vinoth

On Thu, Apr 15, 2021 at 8:00 PM Roc Marshal  wrote:

> Hello, all.
> In some cases, the Travis CI test failure is shown on the GitHub PR page.
> How can I rerun Travis CI?
> Thank you.
>
>
> Best Roc.
>
>
>
>
>
>


Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-15 Thread Vinoth Chandar
Hi,

Is the intent of the flag to convey whether an insert, delete, or update
changed the record? If so, I would imagine that we do this even for COW
tables, since those also support a logical notion of a change stream using
the commit_time meta field.

You may be right, but I am trying to understand the use case for this. Any
links/flink docs I can read?

Another aspect to think about is, how the new flag can be added to existing
tables and if the schema evolution would be fine.

Thanks
Vinoth

On Thu, Apr 8, 2021 at 2:13 AM Danny Chan  wrote:

> I tried a POC for Flink locally and it works well. In the PR I add a new
> metadata column named "_hoodie_change_flag", but I found that only the log
> format needs this flag, and Spark may not yet be able to handle the flag
> for incremental processing.
>
> So should I add the "_hoodie_change_flag" metadata column, or is there any
> better solution for this?
>
> Best,
> Danny Chan
>
> Danny Chan  wrote on Fri, Apr 2, 2021 at 11:08 AM:
>
> > Cool, thanks. Then the remaining questions are:
> >
> > - where we record these changes: should we add a built-in meta field such
> > as _change_flag_, like the other system columns, e.g. _hoodie_commit_time
> > - what kinds of tables should keep these flags: in my thoughts, we should
> > only add these flags for "MERGE_ON_READ" tables, and only for AVRO logs
> > - we should add a config to switch the flags in the system meta fields
> > on/off
> >
> > What do you think?
> >
> > Best,
> > Danny Chan
> >
> > vino yang  wrote on Thu, Apr 1, 2021 at 10:58 AM:
> >
> >> >> Oops, the image is corrupted; by "change flags" I mean: insert,
> >> >> update (before and after), and delete.
> >>
> >> Yes, the image I attached is also about these flags.
> >> [image: image (3).png]
> >>
> >> +1 for the idea.
> >>
> >> Best,
> >> Vino
> >>
> >>
> >> Danny Chan  wrote on Thu, Apr 1, 2021 at 10:03 AM:
> >>
> >>> Oops, the image is corrupted; by "change flags" I mean: insert,
> >>> update (before and after), and delete.
> >>>
> >>> The Flink engine can propagate the change flags internally between its
> >>> operators. If HUDI can send the change flags to Flink, incremental
> >>> calculation over CDC would be very natural (almost transparent to users).
> >>>
> >>> Best,
> >>> Danny Chan
> >>>
> >>> vino yang  wrote on Wed, Mar 31, 2021 at 11:32 PM:
> >>>
> >>> > Hi Danny,
> >>> >
> >>> > Thanks for kicking off this discussion thread.
> >>> >
> >>> > Yes, incremental query( or says "incremental processing") has always
> >>> been
> >>> > an important feature of the Hudi framework. If we can make this
> feature
> >>> > better, it will be even more exciting.
> >>> >
> >>> > In the data warehouse, in some complex calculations, I have not found
> >>> > a good way to conveniently use incremental change data (similar to the
> >>> > concept of a retract stream in Flink?) to locally "correct" the
> >>> > aggregation results (these aggregation results may belong to the DWS
> >>> > layer).
> >>> >
> >>> > BTW: Yes, I do admit that some simple calculation scenarios (a single
> >>> > table, or an algorithm that can very easily be retracted) can be dealt
> >>> > with based on incremental calculation over CDC.
> >>> >
> >>> > Of course, the expression of incremental calculation on various
> >>> occasions
> >>> > is sometimes not very clear. Maybe we will discuss it more clearly in
> >>> > specific scenarios.
> >>> >
> >>> > >> If HUDI can keep and propagate these change flags to its
> consumers,
> >>> we
> >>> > can
> >>> > use HUDI as the unified format for the pipeline.
> >>> >
> >>> > Regarding the "change flags" here, do you mean the flags like the one
> >>> > shown in the figure below?
> >>> >
> >>> > [image: image.png]
> >>> >
> >>> > Best,
> >>> > Vino
> >>> >
> >>> > Danny Chan  wrote on Wed, Mar 31, 2021 at 6:24 PM:
> >>> >
> >>> >> Hi dear HUDI community ~ Here I want to start a discussion about
> >>> >> using HUDI as the unified storage/format for data warehouse/lake
> >>> >> incremental computation.
> >>> >>
> >>> >> Usually people divide data warehouse production into several levels,
> >>> >> such as ODS (operational data store), DWD (data warehouse details),
> >>> >> DWS (data warehouse service), and ADS (application data service).
> >>> >>
> >>> >>
> >>> >> ODS -> DWD -> DWS -> ADS
> >>> >>
> >>> >> In the NEAR-REAL-TIME (or pure real-time) computation cases, a big
> >>> >> topic is syncing the change log (CDC pattern) from all kinds of RDBMSs
> >>> >> into the warehouse/lake. The CDC pattern records and propagates the
> >>> >> change flags (insert, update before/after, and delete) for the
> >>> >> consumer; with these flags, the downstream engines can do real-time
> >>> >> accumulative computation.
> >>> >>
> >>> >> Using a streaming engine like Flink, we can have a totally
> >>> >> NEAR-REAL-TIME computation pipeline for each of the layers.
> >>> >>
> >>> >> If HUDI can keep and propagate these change flags to its consumers,
> >>> we can
> >>> >> use HUDI as the 
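
To make the change-flag idea concrete, below is a small hypothetical Java
sketch of how a per-record "_hoodie_change_flag" value could be mapped to
Flink's RowKind, so that downstream operators see a proper changelog stream.
The column name and the I/-U/+U/D encoding are assumptions for illustration;
only RowKind itself is real Flink API.

    import org.apache.flink.types.RowKind;

    // Hypothetical mapping from a "_hoodie_change_flag" value to Flink's RowKind.
    // The flag encoding (I / -U / +U / D) is an assumption for illustration.
    public final class ChangeFlagMapper {

      private ChangeFlagMapper() {
      }

      public static RowKind toRowKind(String changeFlag) {
        switch (changeFlag) {
          case "I":  return RowKind.INSERT;        // newly inserted record
          case "-U": return RowKind.UPDATE_BEFORE; // old image of an updated record
          case "+U": return RowKind.UPDATE_AFTER;  // new image of an updated record
          case "D":  return RowKind.DELETE;        // deleted record
          default:
            throw new IllegalArgumentException("Unknown change flag: " + changeFlag);
        }
      }
    }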

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-15 Thread Vinoth Chandar
If you want to quickly try something, you can also build a jar off master and
run it independently (works for client-mode/spark-shell experiments):
https://dev.to/bytearray/using-your-own-apache-spark-hudi-versions-with-aws-emr-40a0



On Thu, Apr 15, 2021 at 6:09 AM Kizhakkel Jose, Felix
 wrote:

> Hi Nishith,
>
> I will check with Udit M, since he had helped me in the past with a custom
> jar for EMR.
>
> Regards,
> Felix K Jose
> From: nishith agarwal 
> Date: Wednesday, April 14, 2021 at 3:59 PM
> To: dev 
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
>
>
> No worries. Is the custom build something you can work with the AWS team to
> get installed to be able to test?
>
> -Nishith
>
> On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
>  wrote:
>
> > Hi Nishith, Vinoth,
> >
> > Thank you so much for the quick response and offering the help.
> >
> > Regards,
> > Felix K Jose
> > From: Kizhakkel Jose, Felix 
> > Date: Wednesday, April 14, 2021 at 3:55 PM
> > To: dev@hudi.apache.org 
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Hi Nishith,
> >
> > As I mentioned, we are on AWS EMR, but I don’t think we have this 0.8.0
> > version available as part of the existing release. So we need a custom
> > build to make it work on the latest EMR 6.1.0.
> >
> > Regards,
> > Felix K Jose
> > From: nishith agarwal 
> > Date: Wednesday, April 14, 2021 at 3:49 PM
> > To: dev 
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Felix,
> >
> > Happy to help you through trying and rolling out multi-writer on Hudi
> > tables. Do you have a test environment where you can try out the feature
> by
> > following the doc that Vinoth pointed to above?
> >
> > Thanks,
> > Nishith
> >
> > On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar 
> wrote:
> >
> > > Hi Felix,
> > >
> > > Most people I think are publishing this data into Kafka, and apply the
> > > deletes as a part of the streaming job itself. The reason why this
> works
> > is
> > > because typically, only a small fraction of users leave the service
> (say
> > <<
> > > 0.1% weekly is what I have heard). So, the cost of storage on Kafka is
> > not
> > > much. Is that not the case for you? Are you looking for one time
> > scrubbing
> > > of data for e.g? The benefit of this approach is that you eliminate any
> > > concurrency issues that arise from streaming job producing data for a
> > user,
> > > while the deletes are also issued for that user.
> > >
> > > On concurrency control, Hudi now supports multiple writers, if you want
> > to
> > > write a background job that will perform these deletes for you. it's in
> > > 0.8.0, see
> > > https://hudi.apache.org/docs/concurrency_control.html. One of us
> > > can help you out with trying this and rolling out. (Nishith is the
> > feature
> > > author). Here, if the delete job touches same files, that the streaming
> > job
> > > is writing to, then only one of them will succeed.
> > >
> > > We are working on a design for true lock free concurrency control,
> which
> > > provides the benefits of both models. But, won't be there for another
> > month
> > > or two.
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> > >  wrote:
> > >
> > > > Hi All,
> > > >
> > > > I have 100s of HUDI tables (AWS S3) where each of those are populated
> > via
> > > > Spark structured streaming from kafka streams. Now I have to delete
> > > records
> > > > for a given user (userId) from all the tables which has data for that
> > > user.
> > > > Me

Re: I want to contribute to Apache Hudi.

2021-04-15 Thread Vinoth Chandar
Done. You should have access now

On Thu, Apr 15, 2021 at 1:27 AM 蒋龙  wrote:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is
>
> Username: long jiang
> Full name: john
>
>


Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Vinoth Chandar
>But which one will fail, either streaming or delete batch job?

That's the pitfall with an OCC-based approach. We can't really choose. You
probably need to break up your delete job to be more incremental as well, and
run it in smaller batches to avoid contention. Otherwise, you'll get into a
scenario where the delete job runs for a few hours and then always fails
to commit, because the streaming job wrote some conflicting data.

>you are asking whether I can query those tables and pull the corresponding
records by spark job and write it back (output) to the same topic

Yes. That's what we did at Uber, for e.g.

Thanks
Vinoth

On Wed, Apr 14, 2021 at 12:59 PM nishith agarwal 
wrote:

> No worries. Is the custom build something you can work with the AWS team to
> get installed to be able to test?
>
> -Nishith
>
> On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
>  wrote:
>
> > Hi Nishith, Vinoth,
> >
> > Thank you so much for the quick response and offering the help.
> >
> > Regards,
> > Felix K Jose
> > From: Kizhakkel Jose, Felix 
> > Date: Wednesday, April 14, 2021 at 3:55 PM
> > To: dev@hudi.apache.org 
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Hi Nishith,
> >
> > As I mentioned, we are on AWS EMR, but I don’t think we have this 0.8.0
> > version available as part of the existing release. So we need a custom
> > build to make it work on the latest EMR 6.1.0.
> >
> > Regards,
> > Felix K Jose
> > From: nishith agarwal 
> > Date: Wednesday, April 14, 2021 at 3:49 PM
> > To: dev 
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Felix,
> >
> > Happy to help you through trying and rolling out multi-writer on Hudi
> > tables. Do you have a test environment where you can try out the feature
> by
> > following the doc that Vinoth pointed to above?
> >
> > Thanks,
> > Nishith
> >
> > On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar 
> wrote:
> >
> > > Hi Felix,
> > >
> > > Most people I think are publishing this data into Kafka, and apply the
> > > deletes as a part of the streaming job itself. The reason why this
> works
> > is
> > > because typically, only a small fraction of users leave the service
> (say
> > <<
> > > 0.1% weekly is what I have heard). So, the cost of storage on Kafka is
> > not
> > > much. Is that not the case for you? Are you looking for one time
> > scrubbing
> > > of data for e.g? The benefit of this approach is that you eliminate any
> > > concurrency issues that arise from streaming job producing data for a
> > user,
> > > while the deletes are also issued for that user.
> > >
> > > On concurrency control, Hudi now supports multiple writers, if you want
> > to
> > > write a background job that will perform these deletes for you. it's in
> > > 0.8.0, see
> >
> > > https://hudi.apache.org/docs/concurrency_control.html. One of us
> > > can help you out with trying this and rolling out. (Nishith is the
> > feature
> > > author). Here, if the delete job touches same files, that the streaming
> > job
> > > is writing to, then only one of them will succeed.
> > >
> > > We are working on a design for true lock free concurrency control,
> which
> > > provides the benefits of both models. But, won't be there for another
> > month
> > > or two.
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> > >  wrote:
> > >
> > > > Hi All,
> > > >
> > > > I have 100s of HUDI tables (AWS S3) where each of those are populated
> > via
> > > > Spark structured streaming from kafka streams. Now I have to delete
> > > records
> > > > for a given user (userId) from all the tables which has data for that
> > > user.
> > > > Meaning all tables where we have refe
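
For context on the multi-writer option discussed in this thread: per the
0.8.0 concurrency control docs, a second writer (e.g. a background delete
job) is configured with optimistic concurrency control and a lock provider,
roughly as in the Java sketch below. The ZooKeeper endpoint, lock key, and
paths are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class MultiWriterOptions {
      // Extra write options for a second concurrent writer; placeholder values.
      public static void writeWithOcc(Dataset<Row> df, String basePath) {
        df.write().format("hudi")
            .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
            .option("hoodie.cleaner.policy.failed.writes", "LAZY")
            .option("hoodie.write.lock.provider",
                "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
            .option("hoodie.write.lock.zookeeper.url", "zk-host")       // placeholder
            .option("hoodie.write.lock.zookeeper.port", "2181")
            .option("hoodie.write.lock.zookeeper.lock_key", "my_table") // placeholder
            .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
            .mode(SaveMode.Append)
            .save(basePath);
      }
    }

If the delete job and the streaming job touch the same file groups, only one
of the two commits succeeds, as described above.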

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Vinoth Chandar
Hi Felix,

Most people, I think, are publishing this data into Kafka, and applying the
deletes as part of the streaming job itself. The reason why this works is
because typically, only a small fraction of users leave the service (say <<
0.1% weekly is what I have heard). So, the cost of storage on Kafka is not
much. Is that not the case for you? Are you looking for one time scrubbing
of data for e.g? The benefit of this approach is that you eliminate any
concurrency issues that arise from streaming job producing data for a user,
while the deletes are also issued for that user.

On concurrency control, Hudi now supports multiple writers, if you want to
write a background job that will perform these deletes for you. it's in
0.8.0, see https://hudi.apache.org/docs/concurrency_control.html. One of us
can help you out with trying this and rolling out. (Nishith is the feature
author). Here, if the delete job touches same files, that the streaming job
is writing to, then only one of them will succeed.

We are working on a design for true lock free concurrency control, which
provides the benefits of both models. But, won't be there for another month
or two.

Thanks
Vinoth


On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
 wrote:

> Hi All,
>
> I have 100s of HUDI tables (AWS S3), each of which is populated via Spark
> structured streaming from Kafka streams. Now I have to delete records for a
> given user (userId) from all the tables which have data for that user,
> meaning all tables where we have a reference to that specific userId. I
> cannot republish all the events/records for that user to Kafka to perform
> the delete, since it's around 10-15 years' worth of data for each user and
> would be very costly and time consuming. So I am wondering how everybody
> is performing GDPR deletes on their HUDI tables?
>
>
> How do I get the delete request?
> On a delete Kafka topic we get a delete event [which just contains the
> userId of the user to delete], so we have to use that as a filter condition,
> read all the records from the HUDI tables, and write them back with the data
> source operation set to ‘delete’. But if the streaming job continues to
> ingest newly arriving data while this delete Spark job runs on the table,
> what will be the side effect? Will it work, since it seems like multiple
> writers are not currently supported?
>
> Could you help me with a solution?
>
> Regards,
> Felix K Jose
>
>
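
For reference, a minimal Java (Spark) sketch of the read-filter-delete
pattern discussed in this thread: pull the records matching a userId and
write them back with the Hudi delete operation. The path, table name, and
field names are placeholders, and older Hudi/Spark versions may need a glob
suffix on the read path; treat this as a sketch, not a drop-in job.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class GdprDeleteJob {
      public static void main(String[] args) {
        SparkSession spark =
            SparkSession.builder().appName("gdpr-delete").getOrCreate();
        String basePath = "s3://bucket/hudi/my_table"; // placeholder table path

        // Pull all records belonging to the user being erased.
        Dataset<Row> toDelete = spark.read().format("hudi").load(basePath)
            .where("userId = 'user-to-erase'");

        // Write them back with the delete operation; Hudi removes them by key.
        toDelete.write().format("hudi")
            .option("hoodie.datasource.write.operation", "delete")
            .option("hoodie.table.name", "my_table")                         // placeholder
            .option("hoodie.datasource.write.recordkey.field", "recordId")   // placeholder
            .option("hoodie.datasource.write.partitionpath.field", "region") // placeholder
            .mode(SaveMode.Append)
            .save(basePath);

        spark.stop();
      }
    }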

