totally +1 on clarifying Hudi's vision.

On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal <n3.nas...@gmail.com> wrote:
> +1
>
> I also believe Hudi is a Data Platform technology providing many different functionalities to build modern data lakes, Hudi's table format being just one of them. I've been using this perspective in some of the conference talks already ;)
> With this rebranding (and hopefully some code/package structuring down the road..), it's easier for us to communicate the value add of Hudi and its associated features and generate interest for future contributors.
>
> Thanks,
> Nishith
>
> On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Thanks everyone for the feedback, so far!
> >
> > On the incremental aspects, that's actually Hudi's core design differentiation. While I believe the ETL today is still largely batch oriented, the way forward, for everyone's benefit, is indeed incremental processing. We have already taken a giant step here, e.g., in making raw data ingestion fully incremental using deltastreamer. We should keep working to crack incremental ETL at large. 100% with your line of thinking!
> >
> > It's been in my head for four full years now! :)
> >
> > https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
> >
> > I have started drafting a blog/PR along these lines already. I will make it more final and share here, as we wait a couple more days for more feedback!
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Apr 13, 2021 at 7:01 PM Danny Chan <danny0...@apache.org> wrote:
> >
> > > +1 for the vision. Personally, I find the incremental ETL part promising; with an engine like Apache Flink we can do intermediate aggregation in streaming style.
> > >
> > > Best,
> > > Danny Chan
> > >
> > > On Wed, Apr 14, 2021 at 9:52 AM leesf <leesf0...@gmail.com> wrote:
> > >
> > > > +1. Cool and promising.
> > > >
> > > > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit <udi...@amazon.com.invalid> wrote:
> > > >
> > > > > Agree with the rebranding, Vinoth. Hudi is not just a "table format" and we need to do justice to all the cool auxiliary features/services we have built.
> > > > >
> > > > > Also, the timeline metadata service in particular would be a really big win if we move towards something like that.
> > > > >
> > > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" <pratyaks...@gmail.com> wrote:
> > > > >
> > > > >     Definitely we are doing much more than only ingesting and managing data over DFS.
> > > > >
> > > > >     +1 from my side as well. :)
> > > > >
> > > > >     On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <susudo...@gmail.com> wrote:
> > > > >
> > > > > > I love this rebranding. Totally agree. +1
> > > > > >
> > > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <xu.shiyan.raym...@gmail.com> wrote:
> > > > > >
> > > > > > > +1 The vision looks fantastic.
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <gar...@apache.org> wrote:
> > > > > > >
> > > > > > > > Awesome summary of Hudi! +1 as well.
> > > > > > > >
> > > > > > > > Gary Li
> > > > > > > >
> > > > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <rubenssoto2...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Excellent, I agree
> > > > > > > > >
> > > > > > > > > On Tue, Apr 13, 2021 at 07:23, vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > +1 Excited by this new vision!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Vino
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang <djw...@streamnative.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 The new brand is straightforward, a better description of Hudi.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Dianjin Wang
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <bhavanisud...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1. Cannot agree more. I think this makes total sense and will provide for a much better representation of the project.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hello all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Reading one more article today positioning Hudi as just a table format made me wonder if we have done enough justice in explaining what we have built together here.
> > > > > > > > > > > > > I tend to think of Hudi as the data lake platform, which has the following components, of which one is a table format and one is a transactional storage layer. But the whole stack we have is definitely worth more than the sum of all the parts IMO (speaking from my own experience from the past 10+ years of open source software dev).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Here's what we have built so far.
> > > > > > > > > > > > >
> > > > > > > > > > > > > a) *table format*: something that stores table schema, a metadata table that stores file listing today, and being extended to store column ranges and more in the future (RFC-27)
> > > > > > > > > > > > > b) *aux metadata*: bloom filters, external record level indexes today, bitmaps/interval trees and other advanced on-disk data structures tomorrow
> > > > > > > > > > > > > c) *concurrency control*: we have always supported MVCC-based, log-based concurrency (serialize writes into a time-ordered log), and we now also have OCC for batch merge workloads with 0.8.0. We will have multi-table and fully non-blocking writers soon (see future work section of RFC-22)
> > > > > > > > > > > > > d) *updates/deletes*: this is the bread-and-butter use-case for Hudi, but we support primary/unique key constraints and we could add foreign keys as an extension, once our transactions can span tables.
> > > > > > > > > > > > > e) *table services*: a Hudi pipeline today is self-managing - sizes files, cleans, compacts, clusters data, bootstraps existing data - all these actions working off each other without blocking one another (for the most part).
> > > > > > > > > > > > > f) *data services*: we also have higher level functionality with deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is coming, ...and more), incremental ETL support, de-duplication, commit callbacks, pre-commit validations are coming, error tables have been proposed. I could also envision us building towards streaming egress, data monitoring.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I also think we should build the following (subject to separate DISCUSS/RFCs)
> > > > > > > > > > > > >
> > > > > > > > > > > > > g) *caching service*: Hudi specific caching service that can hold mutable data and serve oft-queried data across engines.
> > > > > > > > > > > > > h) *timeline metaserver*: We already run a metaserver in Spark writer/drivers, backed by RocksDB & even Hudi's metadata table. Let's turn it into a scalable, sharded metastore, that all engines can use to obtain any metadata.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to "ingests & manages storage of large analytical datasets over DFS (hdfs or cloud stores)." and convey the scope of our vision, given we have already been building towards that. It would also provide new contributors a good lens to look at the project from.
> > > > > > > > > > > > >
> > > > > > > > > > > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub system to an event streaming platform, with the addition of MirrorMaker/Connect etc.)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please share your thoughts!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > Vinoth

--
Regards,
-Sivabalan