+1. I also believe Hudi is a data platform technology that provides many different functionalities for building modern data lakes, Hudi's table format being just one of them. I've been using this perspective in some conference talks already ;) With this rebranding (and hopefully some code/package structuring down the road), it's easier for us to communicate the value-add of Hudi and its associated features, and to generate interest among future contributors.
Thanks,
Nishith

On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks everyone for the feedback so far!
>
> On the incremental aspects, that's actually Hudi's core design
> differentiation. While I believe ETL today is still largely batch
> oriented, the way forward, for everyone's benefit, is indeed incremental
> processing. We have already taken a giant step here, e.g. in making raw
> data ingestion fully incremental using deltastreamer. We should keep
> working to crack incremental ETL at large. 100% with your line of
> thinking!
>
> It's been in my head for four full years now! :)
>
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
>
> I have started drafting a blog/PR along these lines already. I will
> finalize it and share it here, as we wait a couple more days for more
> feedback!
>
> Thanks
> Vinoth
>
> On Tue, Apr 13, 2021 at 7:01 PM Danny Chan <danny0...@apache.org> wrote:
>
> > +1 for the vision. Personally, I find the incremental ETL part
> > promising; with an engine like Apache Flink we can do intermediate
> > aggregation in streaming style.
> >
> > Best,
> > Danny Chan
> >
> > leesf <leesf0...@gmail.com> wrote on Wed, Apr 14, 2021 at 9:52 AM:
> >
> > > +1. Cool and promising.
> > >
> > > Mehrotra, Udit <udi...@amazon.com.invalid> wrote on Wed, Apr 14, 2021 at 2:57 AM:
> > >
> > > > Agree with the rebranding, Vinoth. Hudi is not just a "table
> > > > format" and we need to do justice to all the cool auxiliary
> > > > features/services we have built.
> > > >
> > > > Also, the timeline metadata service in particular would be a
> > > > really big win if we move towards something like that.
> > > >
> > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" <pratyaks...@gmail.com> wrote:
> > > >
> > > > Definitely we are doing much more than only ingesting and
> > > > managing data over DFS.
> > > >
> > > > +1 from my side as well. :)
> > > >
> > > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <susudo...@gmail.com> wrote:
> > > >
> > > > > I love this rebranding. Totally agree. +1
> > > > >
> > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <xu.shiyan.raym...@gmail.com> wrote:
> > > > >
> > > > > > +1 The vision looks fantastic.
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <gar...@apache.org> wrote:
> > > > > >
> > > > > > > Awesome summary of Hudi! +1 as well.
> > > > > > >
> > > > > > > Gary Li
> > > > > > >
> > > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <rubenssoto2...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Excellent, I agree
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 07:23, vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > +1 Excited by this new vision!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Vino
> > > > > > > > >
> > > > > > > > > Dianjin Wang <djw...@streamnative.io.invalid> wrote on Tue, Apr 13, 2021 at 3:53 PM:
> > > > > > > > >
> > > > > > > > > > +1 The new brand is straightforward, a better
> > > > > > > > > > description of Hudi.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Dianjin Wang
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <bhavanisud...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > +1. Cannot agree more. I think this makes total
> > > > > > > > > > > sense and will provide for a much better
> > > > > > > > > > > representation of the project.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello all,
> > > > > > > > > > > >
> > > > > > > > > > > > Reading one more article today positioning Hudi as
> > > > > > > > > > > > just a table format made me wonder if we have done
> > > > > > > > > > > > enough justice in explaining what we have built
> > > > > > > > > > > > together here.
> > > > > > > > > > > > I tend to think of Hudi as the data lake platform,
> > > > > > > > > > > > which has the following components, of which one is
> > > > > > > > > > > > a table format and one is a transactional storage
> > > > > > > > > > > > layer.
> > > > > > > > > > > > But the whole stack we have is definitely worth more
> > > > > > > > > > > > than the sum of all the parts IMO (speaking from my
> > > > > > > > > > > > own experience from the past 10+ years of open
> > > > > > > > > > > > source software dev).
> > > > > > > > > > > >
> > > > > > > > > > > > Here's what we have built so far.
> > > > > > > > > > > >
> > > > > > > > > > > > a) *table format*: something that stores table
> > > > > > > > > > > > schema, a metadata table that stores file listings
> > > > > > > > > > > > today, and is being extended to store column ranges
> > > > > > > > > > > > and more in the future (RFC-27)
> > > > > > > > > > > > b) *aux metadata*: bloom filters and external
> > > > > > > > > > > > record-level indexes today; bitmaps/interval trees
> > > > > > > > > > > > and other advanced on-disk data structures tomorrow
> > > > > > > > > > > > c) *concurrency control*: we have always supported
> > > > > > > > > > > > MVCC-based, log-based concurrency (serialize writes
> > > > > > > > > > > > into a time-ordered log), and we now also have OCC
> > > > > > > > > > > > for batch merge workloads with 0.8.0. We will have
> > > > > > > > > > > > multi-table and fully non-blocking writers soon (see
> > > > > > > > > > > > the future work section of RFC-22)
> > > > > > > > > > > > d) *updates/deletes*: this is the bread-and-butter
> > > > > > > > > > > > use case for Hudi, but we support primary/unique key
> > > > > > > > > > > > constraints, and we could add foreign keys as an
> > > > > > > > > > > > extension once our transactions can span tables.
> > > > > > > > > > > > e) *table services*: a Hudi pipeline today is
> > > > > > > > > > > > self-managing: it sizes files, cleans, compacts,
> > > > > > > > > > > > clusters data, and bootstraps existing data, with
> > > > > > > > > > > > all these actions working off each other without
> > > > > > > > > > > > blocking one another (for the most part).
> > > > > > > > > > > > f) *data services*: we also have higher-level
> > > > > > > > > > > > functionality with deltastreamer sources (scalable
> > > > > > > > > > > > DFS listing source, Kafka, Pulsar is coming, ...and
> > > > > > > > > > > > more), incremental ETL support, de-duplication, and
> > > > > > > > > > > > commit callbacks; pre-commit validations are coming,
> > > > > > > > > > > > and error tables have been proposed. I could also
> > > > > > > > > > > > envision us building towards streaming egress and
> > > > > > > > > > > > data monitoring.
> > > > > > > > > > > >
> > > > > > > > > > > > I also think we should build the following (subject
> > > > > > > > > > > > to separate DISCUSS/RFCs)
> > > > > > > > > > > >
> > > > > > > > > > > > g) *caching service*: a Hudi-specific caching service
> > > > > > > > > > > > that can hold mutable data and serve oft-queried
> > > > > > > > > > > > data across engines.
> > > > > > > > > > > > h) *timeline metaserver*: We already run a metaserver
> > > > > > > > > > > > in Spark writers/drivers, backed by RocksDB & even
> > > > > > > > > > > > Hudi's metadata table. Let's turn it into a
> > > > > > > > > > > > scalable, sharded metastore that all engines can use
> > > > > > > > > > > > to obtain any metadata.
> > > > > > > > > > > >
> > > > > > > > > > > > To this end, I propose we rebrand to "*Data Lake
> > > > > > > > > > > > Platform*", as opposed to "ingests & manages storage
> > > > > > > > > > > > of large analytical datasets over DFS (hdfs or cloud
> > > > > > > > > > > > stores)", and convey the scope of our vision, given
> > > > > > > > > > > > we have already been building towards that. It would
> > > > > > > > > > > > also provide new contributors a good lens to look at
> > > > > > > > > > > > the project from.
> > > > > > > > > > > >
> > > > > > > > > > > > (This is very similar to, e.g., the evolution of
> > > > > > > > > > > > Kafka from a pub-sub system to an event streaming
> > > > > > > > > > > > platform, with the addition of MirrorMaker/Connect
> > > > > > > > > > > > etc.)
> > > > > > > > > > > >
> > > > > > > > > > > > Please share your thoughts!
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Vinoth