If anyone wants to see a Flink streaming pipeline, here is a really small
and basic example:
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example

Consider users playing a game across multiple platforms, where each record
we receive contains only a timestamp, a username, and the current score.
The pipeline has a custom source function that produces this stream of
records.

Every 2 seconds, based on the event time attached to each record, the
pipeline aggregates the score of the current window into the user's total
score.

A user's score keeps increasing as new windows fire and new outputs are
emitted. That is where Hudi fits, as I see it: Hudi can intelligently show
only the latest records written.
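Roughly, a pipeline like that could look as follows (a minimal sketch, not
the exact code in the repo: the source is a stand-in, the class and field
names are illustrative, and the per-user running total is simplified to a
plain per-window sum):

import java.util.Random;

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class GameScorePipeline {

    // The (timestamp, username, score) record described above.
    public static class ScoreEvent {
        public long timestamp;
        public String username;
        public long score;
    }

    // Stand-in for the custom source function producing score events.
    public static class ScoreSource implements SourceFunction<ScoreEvent> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<ScoreEvent> ctx) throws Exception {
            Random rnd = new Random();
            while (running) {
                ScoreEvent e = new ScoreEvent();
                e.timestamp = System.currentTimeMillis();
                e.username = "user-" + rnd.nextInt(5);
                e.score = rnd.nextInt(100);
                ctx.collect(e);
                Thread.sleep(100);
            }
        }

        @Override
        public void cancel() { running = false; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.addSource(new ScoreSource())
           // event time comes from the timestamp attached to each record
           .assignTimestampsAndWatermarks(
               new BoundedOutOfOrdernessTimestampExtractor<ScoreEvent>(Time.seconds(1)) {
                   @Override
                   public long extractTimestamp(ScoreEvent e) { return e.timestamp; }
               })
           .keyBy(e -> e.username)
           .timeWindow(Time.seconds(2))   // fire every 2 seconds of event time
           .sum("score")                  // sum the scores in each window per user
           .print();

        env.execute("game-score-pipeline");
    }
}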



On Sun, Aug 4, 2019, 6:43 PM taher koitawala <taher...@gmail.com> wrote:

> Fully agreed with Vino. I think we should chalk out the classes, build the
> hierarchies, and start decoupling everything. Then we can move forward with
> the Flink and Beam streaming components.
>
> On Sun, Aug 4, 2019, 1:52 PM vino yang <yanghua1...@gmail.com> wrote:
>
>> Hi Nick,
>>
>> Thank you for your more detailed thoughts, and I fully agree with your
>> thoughts about HudiLink, which should also be part of the long-term
>> planning of the Hudi Ecology.
>>
>>
>> *But I find that we are approaching this from different angles and
>> starting points. I pay more attention to the rationality of the existing
>> architecture and whether the dependence on the computing engine is
>> pluggable. Don't get me wrong: I know very well that although we have
>> different perspectives, both views have value for Hudi.*
>> Let me give more details on the discussion I started earlier.
>>
>> Currently, multiple submodules of the Hudi project are tightly coupled to
>> Spark's design and dependencies. You can see that many of the class files
>> contain statements such as "import org.apache.spark.xxx".
>>
>> I first put forward a discussion: "Integrate Hudi with Apache Flink", and
>> then came up with a discussion: "Decouple Hudi and Spark".
>>
>> I think the word "Integrate" I used for the first discussion may not be
>> accurate enough. My intention is to make the computing engine used by Hudi
>> pluggable. To Hudi, Spark is just a library; it is not the core of Hudi
>> and should not be strongly coupled with it. The features currently
>> provided by Spark are also available from Flink. But in order to achieve
>> this, we need to decouple Hudi, at the code level, from its use of Spark.
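>>
>> To illustrate the direction (a hypothetical sketch only, not Hudi's actual
>> API): Hudi core could program against a neutral context, and each engine
>> module would supply an implementation behind it.
>>
>> import java.util.List;
>> import java.util.function.Function;
>>
>> // Hypothetical sketch: an engine-neutral abstraction over the distributed
>> // operations Hudi needs. Spark would back map() with RDD.map; Flink would
>> // back it with a DataStream/DataSet map operator.
>> public interface HoodieEngineContext {
>>     <I, O> List<O> map(List<I> input, Function<I, O> fn, int parallelism);
>> }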
>>
>> This makes sense both in terms of structural rationality and community
>> ecology.
>>
>> Best,
>> Vino
>>
>>
>> Semantic Beeng <n...@semanticbeeng.com> 于2019年8月4日周日 下午2:21写道:
>>
>>> "+1 for both Beam and Flink" - what I propose implies this indeed.
>>>
>>> But/and I am working from the desired functionality and a proposed design.
>>>
>>> (as opposed to starting with refactoring Hudi with the goal of close
>>> integration with Flink)
>>>
>>> I feel this is not necessary, but I am not an expert in the Hudi
>>> implementation.
>>>
>>> But I am pretty sure it is not sufficient for the use cases I have in
>>> mind. The gist is using Hudi as a file-based data lake + ML feature store
>>> that enables incremental analyses done with a combination of Flink, Beam,
>>> Spark, and TensorFlow (see Petastorm from Uber Engineering for an idea).
>>>
>>> Let us call this HudiLink from now on (think of it as a mediator, not
>>> another Hudi).
>>>
>>> The intuition behind looking at more than Flink is that both Beam and
>>> Flink have good design abstractions we might reuse and extend.
>>>
>>> As I said before, I do not believe in point-to-point integrations.
>>>
>>> Alternatively / in parallel, if you care to share your use cases, that
>>> would be very useful. Working with explicit use cases helps others relate
>>> and help.
>>>
>>> Also, if some of you see value in refactoring the Hudi implementation for
>>> a hard integration with Flink (but have no time to argue for it), of
>>> course please go ahead.
>>>
>>> That may be a valid bottom-up approach, but I cannot relate to it myself
>>> (due to a lack of use cases).
>>>
>>> I am working on some material about HudiLink; if any of you are
>>> interested, I might publish it when it is more mature.
>>>
>>> Hint: this was part of the inspiration
>>> https://eng.uber.com/michelangelo/
>>>
>>> One well-thought-out use case will get you "in". :-) Kidding, of course.
>>>
>>> Cheers
>>>
>>> Nick
>>>
>>>
>>> On August 3, 2019 at 10:55 PM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>
>>> +1 for both Beam and Flink
>>>
>>> First step here is probably to draw out the current hierarchy and figure
>>> out what the abstraction points are.
>>> In my opinion, the runtime (Spark, Flink) should be handled at the
>>> hoodie-client level and just used by hoodie-utilities seamlessly.
>>>
>>>
>>> +1 for Vinoth's opinion; it should be the first step.
>>>
>>> No matter which computing framework we hope Hudi will integrate with, we
>>> need to decouple the Hudi client from Spark.
>>>
>>> We may need a pure client module named, for example,
>>> hoodie-client-core (common).
>>>
>>> Then we could have: hoodie-client-spark, hoodie-client-flink, and
>>> hoodie-client-beam.
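>>>
>>> For example (a hypothetical sketch of how the modules could relate; the
>>> class names are made up for illustration):
>>>
>>> import java.util.List;
>>>
>>> import org.apache.spark.api.java.JavaSparkContext;
>>>
>>> // hoodie-client-core: engine-neutral client API (sketch)
>>> public abstract class AbstractHoodieWriteClient<T> {
>>>     public abstract void upsert(List<T> records, String instantTime);
>>> }
>>>
>>> // hoodie-client-spark: Spark-backed implementation (sketch)
>>> class SparkHoodieWriteClient<T> extends AbstractHoodieWriteClient<T> {
>>>     private final JavaSparkContext jsc;
>>>
>>>     SparkHoodieWriteClient(JavaSparkContext jsc) { this.jsc = jsc; }
>>>
>>>     @Override
>>>     public void upsert(List<T> records, String instantTime) {
>>>         // parallelize the records and run the Spark-specific write path
>>>         jsc.parallelize(records);
>>>     }
>>> }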
>>>
>>> Suneel Marthi <smar...@apache.org> 于2019年8月4日周日 上午10:45写道:
>>>
>>> +1 for Beam -- agree with Semantic Beeng's analysis.
>>>
>>> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com>
>>> wrote:
>>>
>>> So the way to go about this is to file a HIP, chalk out all the classes,
>>> and start moving towards a pure client.
>>>
>>> Secondly, do we want to try Beam?
>>>
>>> I think there is too much going on here and I'm not able to follow. If we
>>> want to try out Beam all along, then I don't think it makes sense to do
>>> anything on Flink.
>>>
>>> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
>>> wrote:
>>>
>>> >> +1 My money is on this approach.
>>> >>
>>> >> The existing abstractions from Beam seem enough for the use cases as I
>>> >> imagine them.
>>> >>
>>> >> Flink also has "dynamic table", "table source", and "table sink", which
>>> >> seem like very useful abstractions where Hudi might fit nicely.
>>> >>
>>> >>
>>> >>
>>> >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>>> >>
>>> >>
>>> >> Attached a screenshot.
>>> >>
>>> >> This seems to fit with the original premise of Hudi as well.
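>>> >>
>>> >> As a rough sketch of the fit (assuming "scoreStream" is a DataStream of
>>> >> the score events discussed earlier in this thread; imports elided, and
>>> >> the Hudi-backed sink is the hypothetical part):
>>> >>
>>> >> // Register a stream as a dynamic table and query it continuously;
>>> >> // a Hudi-backed table sink could consume the updating result.
>>> >> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>> >> StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
>>> >>
>>> >> tEnv.registerDataStream("scores", scoreStream, "username, score, ts.rowtime");
>>> >> Table totals = tEnv.sqlQuery(
>>> >>     "SELECT username, SUM(score) AS total FROM scores GROUP BY username");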
>>> >>
>>> >> I am exploring this avenue with a use case that involves "temporal
>>> >> joins on streams", which I need for feature extraction.
>>> >>
>>> >> Anyone who is interested in this or has concrete enough needs and use
>>> >> cases, please let me know.
>>> >>
>>> >> Best to go from an agreed-upon set of 2-3 use cases.
>>> >>
>>> >> Cheers
>>> >>
>>> >> Nick
>>> >>
>>> >>
>>> >> > Also, we do have some Beam experts on the mailing list. Can you
>>> >> > please weigh in on the viability of using Beam as the intermediate
>>> >> > abstraction here between Spark/Flink? Hudi uses RDD APIs like groupBy,
>>> >> > mapToPair, sortAndRepartition, reduceByKey, and countByKey, and also
>>> >> > does a lot of custom partitioning.
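>>> >>
>>> >> For what it's worth, here are the rough Beam counterparts of those
>>> >> operations (a sketch only: the input data is made up, and the
>>> >> partitioning caveat at the end is an assumption worth verifying):
>>> >>
>>> >> import org.apache.beam.sdk.Pipeline;
>>> >> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> >> import org.apache.beam.sdk.transforms.Count;
>>> >> import org.apache.beam.sdk.transforms.Create;
>>> >> import org.apache.beam.sdk.transforms.GroupByKey;
>>> >> import org.apache.beam.sdk.transforms.Sum;
>>> >> import org.apache.beam.sdk.values.KV;
>>> >> import org.apache.beam.sdk.values.PCollection;
>>> >>
>>> >> public class BeamOpsSketch {
>>> >>     public static void main(String[] args) {
>>> >>         Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>>> >>         PCollection<KV<String, Long>> pairs =
>>> >>             p.apply(Create.of(KV.of("u1", 10L), KV.of("u2", 5L)));
>>> >>
>>> >>         pairs.apply(GroupByKey.create()); // ~ RDD groupBy / mapToPair + group
>>> >>         pairs.apply(Sum.longsPerKey());   // ~ RDD reduceByKey with a sum
>>> >>         pairs.apply(Count.perKey());      // ~ RDD countByKey
>>> >>         // Physical partitioning is left to the runner in Beam, so
>>> >>         // Spark-style custom partitioning would need a different approach.
>>> >>         p.run().waitUntilFinish();
>>> >>     }
>>> >> }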
