Re: Pointers on Contributing to Structured Streaming Spark Runner

Etienne Chauchot Wed, 18 Sep 2019 06:40:16 -0700

Hi Xinyu,
Thanks for offering to help! My comments are inline:
On Friday, 13 September 2019 at 12:16 -0700, Xinyu Liu wrote:
> Hi, Etienne,
> The slides are very informative! Thanks for sharing the details about how the 
> Beam APIs are mapped onto Spark
> Structured Streaming.


Thanks !
> We (LinkedIn) are also interested in trying the new SparkRunner to run Beam 
> pipelines in batch, and in contributing to it
> too. From my understanding, the functionality on the batch side seems mostly 
> complete and covers quite a large
> percentage of the tests (with a few missing pieces like state and timers in ParDo, 
> and SDF). 

Correct, it passes 89% of the tests, but more is missing than SDF, state and 
timers: there is also ongoing encoder
work that I would like to commit and push before merging.
> If so, is it possible to merge the new runner sooner into master so it's much 
> easier for us to pull it in (we have an
> internal fork) and contribute back?

Sure, see my other mail on this thread. As Alexey mentioned, please join the 
sync meeting we have, the more the merrier!
> I am also curious about the schema part of the runner. It seems we can leverage the 
> schema-aware work in PCollection and
> translate from Beam schema to Spark schema, so it can be optimized in the planner 
> layer. It would be great to hear your
> plans on that.

Well, it is not designed yet but, if you remember my talk, we need to store 
Beam windowing information with the data
itself, so we end up with a Dataset<WindowedValue>. One lead that was 
discussed is to store it with a Spark schema such
as this:
1. field1: binary data for the Beam windowing information (it cannot be mapped to 
individual fields because Beam windowing info is
a complex structure)
2. the fields of the data as defined in the Beam schema, if there is one 
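
To make the layout above concrete, here is a minimal plain-Python sketch with no Spark or Beam dependency. Everything in it is an assumption for illustration only: the `WindowingInfo` class is a heavily simplified stand-in for Beam's windowing metadata, the column name `windowing`, the type labels and the helper names are hypothetical, and the fixed binary encoding is not what the runner uses. In Spark itself the first column would presumably be a `BinaryType` field of a `StructType`.

```python
import struct
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class WindowingInfo:
    """Hypothetical, simplified stand-in for Beam windowing metadata."""
    timestamp_millis: int
    window_start_millis: int
    window_end_millis: int
    pane_index: int


# Assumed fixed layout: three big-endian int64 values followed by one int32.
_FORMAT = ">qqqi"


def encode_windowing(info: WindowingInfo) -> bytes:
    """Serialize the windowing info into one opaque binary blob."""
    return struct.pack(
        _FORMAT,
        info.timestamp_millis,
        info.window_start_millis,
        info.window_end_millis,
        info.pane_index,
    )


def decode_windowing(blob: bytes) -> WindowingInfo:
    """Inverse of encode_windowing."""
    return WindowingInfo(*struct.unpack(_FORMAT, blob))


def row_schema(beam_fields: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Field 1 is the binary windowing blob; the Beam schema fields follow."""
    return [("windowing", "binary")] + list(beam_fields)


def to_row(info: WindowingInfo, values: Dict[str, Any],
           beam_fields: List[Tuple[str, str]]) -> Tuple[Any, ...]:
    """Build one row: the encoded windowing blob, then the data fields in schema order."""
    return (encode_windowing(info),) + tuple(values[name] for name, _ in beam_fields)
```

The point of this layout is that the data columns stay visible to a query planner while the windowing information travels as a single opaque column that only the runner needs to decode.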

> Congrats on this great work!
Thanks !
Best,
Etienne
> Thanks,
> Xinyu
> On Wed, Sep 11, 2019 at 6:02 PM Rui Wang wrote:
> > Hello Etienne,
> > Your slides mentioned that streaming mode development is blocked because 
> > Spark lacks support for multiple aggregations 
> > in its streaming mode, but that design is ongoing. Do you have a link or 
> > something else pointing to their design discussion/doc?
> > 
> > 
> > -Rui  
> > On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot wrote:
> > > Hi Rahul,
> > > Sure, and great! Thanks for proposing!
> > > If you want details, here is the presentation I gave 30 minutes ago
> > > at ApacheCon. You will find the video on YouTube shortly but, in the 
> > > meantime, here are my presentation slides.
> > > And here is the structured streaming branch. I'll be happy to review your 
> > > PRs, thanks! 
> > > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> > > Best,
> > > Etienne
> > > On Wednesday, 11 September 2019 at 16:37 +0530, rahul patwari wrote:
> > > > Hi Etienne,
> > > > 
> > > > I learned about the work going on in the Structured Streaming Spark 
> > > > Runner from the Apache Beam Wiki - Works in
> > > > Progress.
> > > > I have contributed to BeamSql earlier, and I am working on supporting 
> > > > PCollectionView in BeamSql.
> > > > 
> > > > I would love to understand the Runner's side of Apache Beam and 
> > > > contribute to the Structured Streaming Spark
> > > > Runner.
> > > > 
> > > > Can you please point me in the right direction?
> > > > 
> > > > Thanks,
> > > > Rahul
