Alright then. Happy to take the lead here. But please give me a week or so
to finish up the Spark bundling and other jar issues. Too much context
switching :)



On Mon, Sep 16, 2019 at 6:57 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi guys,
>
> Currently, I am busy with HUDI-203[1] and other things.
>
> I agree with Vinoth that we should try to find a new solution to decouple
> the dependency with the Spark RDD cache.
>
> It's an excellent way to start this big work.
>
> [1]: https://issues.apache.org/jira/browse/HUDI-203
>
> vbal...@apache.org <vbal...@apache.org> wrote on Tue, Sep 17, 2019 at 3:49 AM:
>
> >
> > +1 This is a pretty large undertaking. While the community is getting
> > their hands dirty and ramping up on Hudi internals, it would be
> productive
> > if Vinoth shepherds this
> > Balaji.V    On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth
> Chandar
> > <vin...@apache.org> wrote:
> >
> >  sg. :)
> >
> > I will wait for others on this thread as well to chime in.
> >
> > On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala <taher...@gmail.com>
> > wrote:
> >
> > > Vinoth, I think right now, given your experience with the project,
> > > you should be scoping out what needs to be done to take us there. So
> > > +1 for giving you more work :)
> > >
> > > We want to reach a point where we can start scoping out the addition
> > > of Flink and Beam components. Then I think we will make tremendous
> > > progress.
> > >
> > > On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar <vin...@apache.org>
> wrote:
> > >
> > > > I still feel the key thing here is reimplementing HoodieBloomIndex
> > > without
> > > > needing spark caching.
> > > >
> > > >
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
> > > >  documents the spark DAG in detail.
> > > >
> > > > If everyone feels it's best for me to scope the work out, then
> > > > happy to do it!
> > > >
> > > > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Guys, I think we are slowing down on this again. We need to start
> > > > > planning small tasks towards this. VC, please can you help
> > > > > fast-track this?
> > > > >
> > > > > Regards,
> > > > > Taher Koitawala
> > > > >
> > > > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org>
> > > wrote:
> > > > >
> > > > > > Look forward to the analysis. A key class to read would be
> > > > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > >> Currently Spark Streaming micro batching fits well with
> > > > > > > Hudi, since it amortizes the cost of indexing, workload
> > > > > > > profiling etc. 1 spark micro batch = 1 hudi commit.
> > > > > > > With the per-record model in Flink, I am not sure how useful it
> > > > > > > will be to support hudi.. for e.g, 1 input record cannot be 1
> > > > > > > hudi commit, it will be inefficient..
> > > > > > >
> > > > > > > Yes, if 1 input record = 1 hudi commit, it would be
> > > > > > > inefficient. For Flink streaming, we can also implement the
> > > > > > > "batch" and "micro-batch" model when processing data. For
> > > > > > > example:
> > > > > > >
> > > > > > >    - aggregation: use Flink's flexible window mechanism;
> > > > > > >    - non-aggregation: use Flink's stateful state API to cache a
> > > > > > >      batch of data
> > > > > > >
> > > > > > >
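A minimal plain-Java sketch of the "micro-batch" idea above (illustrative only: real code would use Flink's windows or keyed state, and these class and method names are hypothetical, not Hudi or Flink APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers per-record input into micro-batches, so that one flushed batch
// could map to one Hudi commit instead of one commit per record.
class MicroBatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> commitFn; // invoked once per "commit"
    private final List<T> buffer = new ArrayList<>();

    MicroBatchBuffer(int batchSize, Consumer<List<T>> commitFn) {
        this.batchSize = batchSize;
        this.commitFn = commitFn;
    }

    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Flush whatever is buffered as one batch (e.g. on a timer or checkpoint).
    void flush() {
        if (!buffer.isEmpty()) {
            commitFn.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

In a real streaming job the flush would also be driven by a processing-time timer, so that a slow stream still commits periodically instead of waiting for the buffer to fill.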
> > > > > > > >> On first focussing on decoupling of Spark and Hudi alone,
> > > > > > > yes, a full summary of how Spark is being used in a wiki page
> > > > > > > is a good start IMO. We can then hash out what can be
> > > > > > > generalized and what cannot be, and what needs to be left in
> > > > > > > hudi-client-spark vs hudi-client-core
> > > > > > >
> > > > > > > agree
> > > > > > >
> > > > > > > Vinoth Chandar <vin...@apache.org> wrote on Wed, Aug 14, 2019 at 8:35 AM:
> > > > > > >
> > > > > > > > >> We should only stick to Flink Streaming. Furthermore, if
> > > > > > > > there is a requirement for batch, then users should use
> > > > > > > > Spark, or we will anyway have a Beam integration coming up.
> > > > > > > >
> > > > > > > > Currently Spark Streaming micro batching fits well with
> > > > > > > > Hudi, since it amortizes the cost of indexing, workload
> > > > > > > > profiling etc. 1 spark micro batch = 1 hudi commit.
> > > > > > > > With the per-record model in Flink, I am not sure how useful
> > > > > > > > it will be to support hudi.. for e.g, 1 input record cannot
> > > > > > > > be 1 hudi commit, it will be inefficient..
> > > > > > > >
> > > > > > > > On first focussing on decoupling of Spark and Hudi alone,
> > > > > > > > yes, a full summary of how Spark is being used in a wiki page
> > > > > > > > is a good start IMO. We can then hash out what can be
> > > > > > > > generalized and what cannot be, and what needs to be left in
> > > > > > > > hudi-client-spark vs hudi-client-core
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <
> > yanghua1...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Nick and Taher,
> > > > > > > > >
> > > > > > > > > I just want to answer Nishith's question. Reference his old
> > > > > > description
> > > > > > > > > here:
> > > > > > > > >
> > > > > > > > > > You can do a parallel investigation while we are deciding
> > on
> > > > the
> > > > > > > module
> > > > > > > > > structure.  You could be looking at all the patterns in
> > Hudi's
> > > > > Spark
> > > > > > > APIs
> > > > > > > > > usage (RDD/DataSource/SparkContext) and see if such support
> > can
> > > > be
> > > > > > > > achieved
> > > > > > > > > in theory with Flink. If not, what is the workaround.
> > > Documenting
> > > > > > such
> > > > > > > > > patterns would be valuable when multiple engineers are
> > working
> > > on
> > > > > it.
> > > > > > > For
> > > > > > > > > e:g, Hudi relies on (a) custom partitioning logic for
> > > > > > > > > upserts, (b) caching RDDs to avoid reruns of costly stages,
> > > > > > > > > (c) a Spark upsert task knowing its spark
> > > > > > > > > partition/task/attempt ids
> > > > > > > > >
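Pattern (a) can be sketched as follows (a hypothetical illustration, not Hudi's actual partitioner): all upserts for a given record key are routed deterministically to the file group that owns that key.

```java
import java.util.List;

// Illustrative sketch of custom partitioning for upserts: map each record
// key to a stable file-group index, the way a custom Spark Partitioner
// would. Names here are hypothetical, not actual Hudi classes.
class UpsertRouter {
    private final List<String> fileGroupIds;

    UpsertRouter(List<String> fileGroupIds) {
        this.fileGroupIds = fileGroupIds;
    }

    // Deterministic: the same key always lands in the same partition, so
    // all updates for a key are processed by the same task/file group.
    int partitionFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), fileGroupIds.size());
    }

    String fileGroupFor(String recordKey) {
        return fileGroupIds.get(partitionFor(recordKey));
    }
}
```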
> > > > > > > > > And just like the title of this thread, we are going to try
> > to
> > > > > > decouple
> > > > > > > > > Hudi and Spark. That means we can run the whole of Hudi
> > > > > > > > > without depending on Spark. So we need to analyze all the
> > > > > > > > > usage of Spark in Hudi.
> > > > > > > > >
> > > > > > > > > Here we are not discussing the integration of Hudi and
> > > > > > > > > Flink in the application layer. Instead, I want Hudi to be
> > > > > > > > > decoupled from Spark and to allow other engines (such as
> > > > > > > > > Flink) to replace Spark.
> > > > > > > > >
> > > > > > > > > It can be divided into long-term goals and short-term
> > > > > > > > > goals, as Nishith stated in a recent email.
> > > > > > > > >
> > > > > > > > > I mentioned the Flink Batch API here because Hudi can
> > > > > > > > > connect with many different sources/sinks. Some file-based
> > > > > > > > > reads are not appropriate for Flink Streaming.
> > > > > > > > >
> > > > > > > > > Therefore, this is a comprehensive survey of the use of
> > > > > > > > > Spark in Hudi.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Vino
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 13, 2019 at 5:43 PM:
> > > > > > > > >
> > > > > > > > > > Hi Vino,
> > > > > > > > > >      According to what I've seen, Hudi has a lot of Spark
> > > > > > > > > > components flowing through it, like TaskContext,
> > > > > > > > > > JavaSparkContext, etc. The main classes I guess we should
> > > > > > > > > > focus on are HoodieTable and the Hoodie write clients.
> > > > > > > > > >
> > > > > > > > > > Also, Vino, I don't think we should be providing a Flink
> > > > > > > > > > DataSet implementation. We should stick to Flink Streaming
> > > > > > > > > > only. Furthermore, if there is a requirement for batch,
> > > > > > > > > > then users should use Spark, or we will anyway have a
> > > > > > > > > > Beam integration coming up.
> > > > > > > > > >
> > > > > > > > > > As for the cache, how about we write our own stateful
> > > > > > > > > > Flink function and use the RocksDBStateBackend with some
> > > > > > > > > > state TTL?
> > > > > > > > > >
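The effect Taher describes can be sketched in plain Java (illustrative only: real Flink code would pair keyed state with StateTtlConfig on a RocksDB-backed state backend; this sketch only shows the expiry semantics, with an injectable clock so expiry can be simulated):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Mimics keyed state with a TTL: cached entries expire instead of growing
// forever. Expired entries are removed lazily on access, roughly like
// state TTL's cleanup-on-read. Not a Flink API; a concept sketch only.
class TtlCache<K, V> {
    private final long ttlMillis;
    private final LongSupplier clock; // injectable for testing
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> writtenAt = new HashMap<>();

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        values.put(key, value);
        writtenAt.put(key, clock.getAsLong());
    }

    // Returns null for absent or expired entries.
    V get(K key) {
        Long t = writtenAt.get(key);
        if (t == null) {
            return null;
        }
        if (clock.getAsLong() - t >= ttlMillis) {
            values.remove(key);
            writtenAt.remove(key);
            return null;
        }
        return values.get(key);
    }
}
```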
> > > > > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <
> > > yanghua1...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > After doing some research, let me share my information:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Limitation of computing engine capabilities: Hudi
> > > > > > > > > > >      uses Spark's RDD#persist, and Flink currently has
> > > > > > > > > > >      no API to cache datasets. Maybe we can only choose
> > > > > > > > > > >      to use external storage, or not use a cache? For
> > > > > > > > > > >      the use of other APIs, the two currently offer
> > > > > > > > > > >      almost equivalent capabilities.
> > > > > > > > > > >    - The abstraction of the computing engine is
> > > > > > > > > > >      different: considering the different usage
> > > > > > > > > > >      scenarios of the computing engine in Hudi, and
> > > > > > > > > > >      since Flink has not yet implemented stream-batch
> > > > > > > > > > >      unification, we may use both Flink's DataSet API
> > > > > > > > > > >      (batch processing) and DataStream API (stream
> > > > > > > > > > >      processing).
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Vino
> > > > > > > > > > >
> > > > > > > > > > > nishith agarwal <n3.nas...@gmail.com> wrote on Thu, Aug 8, 2019 at 12:57 AM:
> > > > > > > > > > >
> > > > > > > > > > > > Nick,
> > > > > > > > > > > >
> > > > > > > > > > > > You bring up a good point about the non-trivial
> > > programming
> > > > > > model
> > > > > > > > > > > > differences between these different technologies.
> From
> > a
> > > > > > > > theoretical
> > > > > > > > > > > > perspective, I'd say considering a higher level
> > > abstraction
> > > > > > makes
> > > > > > > > > > sense.
> > > > > > > > > > > I
> > > > > > > > > > > > think we have to decouple some objectives and
> concerns
> > > > here.
> > > > > > > > > > > >
> > > > > > > > > > > > a) The immediate desire is to have Hudi be able to
> run
> > > on a
> > > > > > Flink
> > > > > > > > (or
> > > > > > > > > > > > non-spark) engine. This naturally begs the question
> of
> > > > > > decoupling
> > > > > > > > > Hudi
> > > > > > > > > > > > concepts from direct Spark dependencies.
> > > > > > > > > > > >
> > > > > > > > > > > > b) If we do want to initiate the above effort, would
> it
> > > > make
> > > > > > > sense
> > > > > > > > to
> > > > > > > > > > > just
> > > > > > > > > > > > have a higher level abstraction, building on other
> > > > > technologies
> > > > > > > > like
> > > > > > > > > > beam
> > > > > > > > > > > > (euphoria etc) and provide single, clean API's that
> may
> > > be
> > > > > more
> > > > > > > > > > > > maintainable from a code perspective. But at the same
> > > time
> > > > > this
> > > > > > > > will
> > > > > > > > > > > > introduce challenges on how to maintain efficiency
> and
> > > > > > optimized
> > > > > > > > > > runtime
> > > > > > > > > > > > dags for Hudi (since the code would move away from
> > point
> > > > > > > > integrations
> > > > > > > > > > and
> > > > > > > > > > > > whenever this happens, tuning natively for specific
> > > engines
> > > > > > > becomes
> > > > > > > > > > more
> > > > > > > > > > > > and more difficult).
> > > > > > > > > > > >
> > > > > > > > > > > > My general opinion is that, as the community grows
> over
> > > > time
> > > > > > with
> > > > > > > > > more
> > > > > > > > > > > > folks having an in-depth understanding of Hudi, going
> > > from
> > > > > > > > > > current_state
> > > > > > > > > > > ->
> > > > > > > > > > > > (a) -> (b) might be the most reliable and adoptable
> > path
> > > > for
> > > > > > this
> > > > > > > > > > > project.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Nishith
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <
> > > > > > > > > n...@semanticbeeng.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > There are some not trivial difference between
> > > programming
> > > > > > model
> > > > > > > > and
> > > > > > > > > > > > > runtime semantics between Beam, Spark and Flink.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > > > > > > > > > > >
> > > > > > > > > > > > > Nishith, Vino - thoughts?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Does it make sense to consider a higher-level
> > > > > > > > > > > > > abstraction / DSL instead of maintaining different
> > > > > > > > > > > > > code with the same functionality but different
> > > > > > > > > > > > > programming models?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > > > > > > > > > > > >
> > > > > > > > > > > > > Nick
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal <
> > > > > > > > n3.nas...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 for Approach 1 Point integration with each
> > > framework.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pros for point integration
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Hudi community is already familiar with spark
> > and
> > > > > spark
> > > > > > > > based
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > actions/shuffles etc. Since both modules can be
> > > > decoupled,
> > > > > > this
> > > > > > > > > > enables
> > > > > > > > > > > > us
> > > > > > > > > > > > > to have a steady release for Hudi for 1 execution
> > > engine
> > > > > > > (spark)
> > > > > > > > > > while
> > > > > > > > > > > we
> > > > > > > > > > > > > hone our skills and iterate on making flink dag
> > > > optimized,
> > > > > > > > > performant
> > > > > > > > > > > > with
> > > > > > > > > > > > > the right configuration.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - This might be a stepping stone towards
> rewriting
> > > the
> > > > > > > entire
> > > > > > > > > code
> > > > > > > > > > > > base
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > being agnostic of spark/flink. This approach will
> > help
> > > us
> > > > > fix
> > > > > > > > > tests,
> > > > > > > > > > > > > intricacies and help make the code base ready for a
> > > > larger
> > > > > > > > rework.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Seems like the easiest way to add flink
> support
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cons
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - More code paths to maintain and reason since
> the
> > > > spark
> > > > > > and
> > > > > > > > > flink
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > integrations will naturally diverge over time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Theoretically, I do like the idea of being able to
> > run
> > > > the
> > > > > > hudi
> > > > > > > > dag
> > > > > > > > > > on
> > > > > > > > > > > > beam
> > > > > > > > > > > > > more than point integrations, where there is one
> > > > API/logic
> > > > > to
> > > > > > > > > reason
> > > > > > > > > > > > about.
> > > > > > > > > > > > > But practically, that may not be the right
> direction.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pros
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Lesser cognitive burden in maintaining,
> evolving
> > > and
> > > > > > > > releasing
> > > > > > > > > > the
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > project with one API to reason with.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Theoretically, going forward assuming beam is
> > > > adopted
> > > > > > as a
> > > > > > > > > > > standard
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > programming paradigm for stream/batch, this would
> > > enable
> > > > > > > > consumers
> > > > > > > > > > > > leverage
> > > > > > > > > > > > > the power of hudi more easily.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cons
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Massive rewrite of the code base.
> Additionally,
> > > > since
> > > > > we
> > > > > > > > would
> > > > > > > > > > > have
> > > > > > > > > > > > >    moved
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > away from directly using spark APIs, there is a
> > bigger
> > > > risk
> > > > > > of
> > > > > > > > > > > > regression.
> > > > > > > > > > > > > We would have to be very thorough with all the
> > > > intricacies
> > > > > > and
> > > > > > > > > ensure
> > > > > > > > > > > the
> > > > > > > > > > > > > same stability of new releases.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Managing future features (which may be very
> > spark
> > > > > > driven)
> > > > > > > > will
> > > > > > > > > > > > either
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > clash or pause or will need to be reworked.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Tuning jobs for Spark/Flink type execution
> > > > frameworks
> > > > > > > > > > individually
> > > > > > > > > > > > >    might
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > be difficult and will get difficult over time as
> the
> > > > > project
> > > > > > > > > evolves,
> > > > > > > > > > > > where
> > > > > > > > > > > > > some beam integrations with spark/flink may not
> work
> > as
> > > > > > > expected.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Also, as pointed above, need to probably
> support
> > > the
> > > > > > > > > > hoodie-spark
> > > > > > > > > > > > >    module
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > as a first-class.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank,
> > > > > > > > > > > > > Nishith
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <
> > > > > > > > taher...@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Vinoth,
> > > > > > > > > > > > > Are there some tasks I can take up to ramp up the
> > code?
> > > > > Want
> > > > > > to
> > > > > > > > get
> > > > > > > > > > > > > more used to the code and understand the existing
> > > > > > > implementation
> > > > > > > > > > > better.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Taher Koitawala
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <
> > > > > > > vin...@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Let's see if others have any thoughts as well. We
> > > > > > > > > > > > > can plan to fix the approach by EOW.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <
> > > > > > > yanghua1...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi guys,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, +1 for Approach 1 like Taher.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If we can do a comprehensive analysis of this model
> > > > > > > > > > > > > and come up with means to refactor this cleanly,
> > > > > > > > > > > > > this would be promising.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, when we get the conclusion, we could start
> this
> > > > work.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Vino
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 6, 2019 at 12:28 AM:
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 for Approach 1: point integration with each
> > > > > > > > > > > > > framework
> > > > > > > > > > > > >
> > > > > > > > > > > > > Approach 2 has a problem, as you said: "Developers
> > > > > > > > > > > > > need to think about
> > > > > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > > > > > > > > > > So in the end, this may not be the panacea that it
> > > > > > > > > > > > > seems to be"
> > > > > > > > > > > > >
> > > > > > > > > > > > > We have seen various pipelines in the Beam DAG
> > > > > > > > > > > > > being expressed differently than we had them in our
> > > > > > > > > > > > > original use case. Also, switching between the Spark
> > > > > > > > > > > > > and Flink runners in Beam has various impacts on the
> > > > > > > > > > > > > pipelines; some features available in Flink are not
> > > > > > > > > > > > > available on the Spark runner, etc.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Refer to this compatibility matrix ->
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hence my vote is for Approach 1: let's decouple and
> > > > > > > > > > > > > build the abstraction for each framework. That is a
> > > > > > > > > > > > > much better option. We will also have more control
> > > > > > > > > > > > > over each framework's implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <
> > > > > > vin...@apache.org
> > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Would like to highlight that there are two distinct
> > > > > > > > > > > > > approaches here, with different tradeoffs. Think of
> > > > > > > > > > > > > this as my braindump, as I have been thinking about
> > > > > > > > > > > > > this quite a bit in the past.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Approach 1: Point integration with each framework*
> > > > > > > > > > > > >
> > > > > > > > > > > > > We may need a pure client module named for example
> > > > > > > > > > > > > hoodie-client-core(common)
> > > > > > > > > > > > > >> Then we could have: hoodie-client-spark,
> > > > > > hoodie-client-flink
> > > > > > > > > > > > > and hoodie-client-beam
> > > > > > > > > > > > >
> > > > > > > > > > > > > (+) This is the safest to do IMO, since we can
> > > > > > > > > > > > > isolate the current Spark execution (hoodie-spark,
> > > > > > > > > > > > > hoodie-client-spark) from the changes for flink,
> > > > > > > > > > > > > while it stabilizes over a few releases.
> > > > > > > > > > > > > (-) Downside is that the utilities need to be
> > > > > > > > > > > > > redone: hoodie-utilities-spark and
> > > > > > > > > > > > > hoodie-utilities-flink and hoodie-utilities-core?
> > > > > > > > > > > > > hoodie-cli?
> > > > > > > > > > > > >
> > > > > > > > > > > > > If we can do a comprehensive analysis of this model
> > > > > > > > > > > > > and come up with means to refactor this cleanly,
> > > > > > > > > > > > > this would be promising.
> > > > > > > > > > > > >
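The kind of seam hoodie-client-core could expose might look roughly like this (a sketch under assumed names, not actual Hudi interfaces): the core codes against a narrow engine interface, and hoodie-client-spark / hoodie-client-flink each supply an implementation of it.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine seam for hoodie-client-core: only the primitives the
// core actually needs, so each engine module can implement them natively.
interface EngineContext {
    <I, O> List<O> map(List<I> input, Function<I, O> fn);

    // No-op where the engine has no dataset cache (e.g. Flink today).
    <I> void persist(List<I> data);
}

// Trivial local implementation; a Spark module would delegate to RDDs and
// RDD#persist, a Flink module to DataStream/DataSet operators.
class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    @Override
    public <I> void persist(List<I> data) {
        // nothing to cache in a local, single-pass context
    }
}
```

The point of the seam is that code like HoodieBloomIndex would call only `EngineContext`, never SparkContext directly, which is what makes a hoodie-client-flink module possible.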
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Approach 2: Beam as the compute abstraction*
> > > > > > > > > > > > >
> > > > > > > > > > > > > Another, more drastic approach is to remove Spark
> > > > > > > > > > > > > as the compute abstraction for writing data and
> > > > > > > > > > > > > replace it with Beam.
> > > > > > > > > > > > >
> > > > > > > > > > > > > (+) All of the code remains more or less similar
> > > > > > > > > > > > > and there is one compute API to reason about.
> > > > > > > > > > > > >
> > > > > > > > > > > > > (-) The (very big) assumption here is that we are
> > > > > > > > > > > > > able to tune the Spark runtime the same way using
> > > > > > > > > > > > > Beam: custom partitioners, support for all RDD
> > > > > > > > > > > > > operations we invoke, caching etc etc.
> > > > > > > > > > > > > (-) It will be a massive rewrite, and testing of
> > > > > > > > > > > > > such a large rewrite would also be really
> > > > > > > > > > > > > challenging, since we need to pay attention to all
> > > > > > > > > > > > > intricate details to ensure the spark users today
> > > > > > > > > > > > > experience no regressions/side-effects
> > > > > > > > > > > > > (-) Note that we still need to probably support the
> > > > > > > > > > > > > hoodie-spark module, and maybe a first-class such
> > > > > > > > > > > > > integration with flink, for native flink/spark
> > > > > > > > > > > > > pipeline authoring. Users of, say, DeltaStreamer
> > > > > > > > > > > > > need to pass in Spark or Flink configs anyway..
> > > > > > > > > > > > > Developers need to think about
> > > > > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > > > > > > > > > > So in the end, this may not be the panacea that it
> > > > > > > > > > > > > seems to be.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > One goal for the HIP is to get us all to agree as a
> > > > > > > > > > > > > community which one to pick, with sufficient
> > > > > > > > > > > > > investigation, testing, benchmarking..
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <
> > > > > > > yanghua1...@gmail.com>
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 for both Beam and Flink
> > > > > > > > > > > > >
> > > > > > > > > > > > > First step here is to probably draw out the current
> > > > > > > > > > > > > hierarchy and figure out what the abstraction
> > > > > > > > > > > > > points are..
> > > > > > > > > > > > > In my opinion, the runtime (spark, flink) should be
> > > > > > > > > > > > > done at the hoodie-client level and just used by
> > > > > > > > > > > > > hoodie-utilities seamlessly..
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 for Vinoth's opinion, it should be the first
> step.
> > > > > > > > > > > > >
> > > > > > > > > > > > > No matter which computing framework we hope to
> > > > > > > > > > > > > integrate Hudi with, we need to decouple the Hudi
> > > > > > > > > > > > > client and Spark.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We may need a pure client module named for example
> > > > > > > > > > > > > hoodie-client-core(common)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then we could have: hoodie-client-spark,
> > > > > hoodie-client-flink
> > > > > > > and
> > > > > > > > > > > > > hoodie-client-beam
> > > > > > > > > > > > >
> > > > > > > > > > > > > Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's
> analysis.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <
> > > > > > > > > > > > >
> > > > > > > > > > > > > taher...@gmail.com>
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > So the way to go about this is to file a HIP, chalk
> > > > > > > > > > > > > out all the classes, and start moving towards a
> > > > > > > > > > > > > pure client.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Secondly, should we want to try Beam?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think there is to much going on here and I'm not
> > able
> > > > to
> > > > > > > > > > > > >
> > > > > > > > > > > > > follow.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If
> > > > > > > > > > > > >
> > > > > > > > > > > > > we
> > > > > > > > > > > > >
> > > > > > > > > > > > > want to try out beam all along I don't think it
> makes
> > > > sense
> > > > > > > > > > > > >
> > > > > > > > > > > > > to
> > > > > > > > > > > > >
> > > > > > > > > > > > > do
> > > > > > > > > > > > >
> > > > > > > > > > > > > anything
> > > > > > > > > > > > >
> > > > > > > > > > > > > on Flink then.
> > > > > > > > > > > > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
> > > wrote:
> > >
> > > > +1 My money is on this approach.
> > > >
> > > > The existing abstractions from Beam seem enough for the use cases
> > > > as I imagine them.
> > > >
> > > > Flink also has "dynamic table", "table source" and "table sink",
> > > > which seem very useful abstractions where Hudi might fit nicely:
> > > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > >
> > > > Attached a screen shot.
> > > >
> > > > This seems to fit with the original premise of Hudi as well.
> > > >
> > > > I am exploring this avenue with a use case that involves "temporal
> > > > joins on streams", which I need for feature extraction. If anyone
> > > > is interested in this or has concrete enough needs and use cases,
> > > > please let me know. Best to go from an agreed-upon set of 2-3 use
> > > > cases.
> > > >
> > > > Cheers
> > > >
> > > > Nick
> > > >
> > > > > Also, we do have some Beam experts on the mailing list.. Can
> > > > > you please weigh in on the viability of using Beam as the
> > > > > intermediate abstraction here between Spark/Flink? Hudi uses
> > > > > RDD APIs like groupBy, mapToPair, sortAndRepartition,
> > > > > reduceByKey, countByKey and also does custom partitioning a
> > > > > lot.
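To make that porting surface concrete, the pair-oriented operations listed above could themselves be captured behind an engine-neutral interface that each runtime implements. The sketch below is purely illustrative (none of these types exist in Hudi); a local in-memory backend shows the intended semantics that a Spark, Flink, or Beam implementation would have to preserve.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical engine-neutral view of the keyed operations Hudi relies on.
interface PairData<K, V> {
  Map<K, List<V>> groupByKey();
  Map<K, V> reduceByKey(BinaryOperator<V> reducer);
  Map<K, Long> countByKey();
}

// Local reference implementation, defining the expected semantics.
class LocalPairData<K, V> implements PairData<K, V> {
  private final List<Map.Entry<K, V>> pairs;

  LocalPairData(List<Map.Entry<K, V>> pairs) {
    this.pairs = pairs;
  }

  @Override
  public Map<K, List<V>> groupByKey() {
    Map<K, List<V>> out = new HashMap<>();
    for (Map.Entry<K, V> e : pairs) {
      out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
    }
    return out;
  }

  @Override
  public Map<K, V> reduceByKey(BinaryOperator<V> reducer) {
    Map<K, V> out = new HashMap<>();
    for (Map.Entry<K, V> e : pairs) {
      out.merge(e.getKey(), e.getValue(), reducer);
    }
    return out;
  }

  @Override
  public Map<K, Long> countByKey() {
    Map<K, Long> out = new HashMap<>();
    for (Map.Entry<K, V> e : pairs) {
      out.merge(e.getKey(), 1L, Long::sum);
    }
    return out;
  }
}
```

The harder part, as noted earlier in the thread, is not the operator surface itself but the places where HoodieBloomIndex leans on Spark-specific behavior such as RDD caching and custom partitioning.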
