Right, as I answered Reuven: if the user classes are not serializable (Kryo and Java), then we should avoid caching, and by that avoid serializing altogether.
2017-08-24 22:36 GMT+03:00 Ben Chambers <bchamb...@apache.org>:

> If the user classes are not Serializable, how does adding Serializable to
> the WindowedValue help? The user class, which is stored in a field in the
> WindowedValue, will still be non-serializable and thus cause problems.
>
> On Thu, Aug 24, 2017 at 11:59 AM Kobi Salant <kobi.sal...@gmail.com>
> wrote:
>
> > Right, if it is not Kryo or Java serializable then we should allow the
> > user to set caching off.
> >
> > 2017-08-24 21:55 GMT+03:00 Reuven Lax <re...@google.com.invalid>:
> >
> > > However, user classes are not guaranteed to be Java serializable
> > > either.
> > >
> > > On Thu, Aug 24, 2017 at 11:31 AM, Kobi Salant <kobi.sal...@gmail.com>
> > > wrote:
> > >
> > > > Thank you for your replies.
> > > >
> > > > Lukasz, Kryo does not fail on the WindowedValue but on the user
> > > > class, which in some cases is not Kryo-serializable, usually when
> > > > there is no default constructor. Switching to Java serialization
> > > > fails on WindowedValue, as it is not declared Serializable.
> > > >
> > > > Thomas, we thought about writing a custom serializer, but it cannot
> > > > work without knowing the exact coders of each RDD. It is similar to
> > > > the past discussion of whether it should be mandatory for the user
> > > > to set coders for each PCollection; this suggestion amounts to
> > > > asking Beam to auto-detect coders for all PCollections.
> > > >
> > > > Ben, Spark uses Java serialization to send closures, and it will
> > > > not work if the cluster contains different JVMs that are not
> > > > compatible. We agree with you all that it would be more efficient
> > > > to use coders for caching, but that means a major rewrite of the
> > > > Spark runner, changing almost all method signatures and internal
> > > > RDDs from a specific type to byte[].
> > > > As I said before, this is a fallback for users who have
> > > > non-serializable user classes. Even today, if an RDD is cached
> > > > with Kryo it doesn't pass through the coder, so we are not
> > > > worsening the situation but enabling users to use the Spark
> > > > runner.
> > > >
> > > > Thanks,
> > > > Kobi
> > > >
> > > > On Aug 24, 2017 20:37, "Thomas Weise" <t...@apache.org> wrote:
> > > >
> > > > > Would a custom Kryo serializer that uses the coders to perform
> > > > > serialization help?
> > > > >
> > > > > There are various ways Kryo lets you attach such a serializer
> > > > > without full surgery, including @Bind at the field level or at
> > > > > the class level.
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > > On Thu, Aug 24, 2017 at 9:57 AM, Lukasz Cwik
> > > > > <lc...@google.com.invalid> wrote:
> > > > >
> > > > > > How does Kryo's FieldSerializer fail on WindowedValue/PaneInfo?
> > > > > > It seems like those are pretty simple types.
> > > > > >
> > > > > > Also, I don't see why they can't be tagged with Serializable,
> > > > > > but I think the original reasoning was that coders should be
> > > > > > used to guarantee a stable representation across JVM versions.
> > > > > >
> > > > > > On Thu, Aug 24, 2017 at 5:38 AM, Kobi Salant
> > > > > > <kobi.sal...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I am working on Spark runner issue
> > > > > > > https://issues.apache.org/jira/browse/BEAM-2669
> > > > > > > "Kryo serialization exception when DStreams containing
> > > > > > > non-Kryo-serializable data are cached".
> > > > > > >
> > > > > > > Currently, the Spark runner enforces the Kryo serializer,
> > > > > > > and in shuffling scenarios we always use coders to transfer
> > > > > > > bytes over the network.
> > > > > > > But Spark can also serialize data for caching/persisting,
> > > > > > > and currently it uses Kryo for this.
> > > > > > >
> > > > > > > Today, when the user uses a class which is not Kryo
> > > > > > > serializable, caching fails, and we thought to open the
> > > > > > > option of using Java serialization as a fallback.
> > > > > > >
> > > > > > > The RDDs/DStreams behind the PCollections are usually typed
> > > > > > > RDD/DStream<WindowedValue<InputT>>, and when Spark tries to
> > > > > > > Java-serialize them for caching purposes it fails on
> > > > > > > WindowedValue not being Java serializable.
> > > > > > >
> > > > > > > Is there any objection to adding "implements Serializable"
> > > > > > > to SDK classes like WindowedValue, PaneInfo and others?
> > > > > > >
> > > > > > > Again, I want to emphasise that coders are a big part of the
> > > > > > > Spark runner code, and we do use them whenever a shuffle is
> > > > > > > expected. Changing the types of the RDDs/DStreams to byte[]
> > > > > > > would make the code pretty unreadable and would weaken
> > > > > > > type-safety checks.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Kobi
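Ben's objection can be reproduced with plain Java serialization: tagging a wrapper Serializable is not enough if the field it holds is not, because serialization recurses into fields. This minimal sketch uses hypothetical stand-ins (Windowed for WindowedValue, UserRecord for a user class), not actual Beam types:

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableWrapperDemo {

    // Hypothetical stand-in for a user class: not Serializable and no
    // default constructor (the case Kryo's FieldSerializer also fails on).
    static class UserRecord {
        final String payload;
        UserRecord(String payload) { this.payload = payload; }
    }

    // Hypothetical stand-in for WindowedValue, tagged Serializable.
    static class Windowed<T> implements Serializable {
        final T value;
        Windowed(T value) { this.value = value; }
    }

    public static void main(String[] args) throws Exception {
        Windowed<UserRecord> wv = new Windowed<>(new UserRecord("data"));
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            // Java serialization walks into the wrapper's fields, so
            // tagging the wrapper alone does not help: the user field
            // still fails.
            oos.writeObject(wv);
            System.out.println("serialized");
        } catch (NotSerializableException e) {
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

So making WindowedValue Serializable fixes the Java-serialization fallback only for user classes that are themselves Serializable; otherwise caching still fails, which is why the thread also discusses letting the user turn caching off.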
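The coder-based alternative discussed above (persisting byte[] instead of typed values) can be sketched with a minimal, hypothetical coder interface. The real org.apache.beam.sdk.coders.Coder encodes to and from streams, so this only illustrates the shape of the idea, not Beam's API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CoderCachingSketch {

    // Minimal, hypothetical stand-in for a Beam coder; the real
    // org.apache.beam.sdk.coders.Coder works with streams instead.
    interface SimpleCoder<T> {
        byte[] encode(T value) throws IOException;
        T decode(byte[] bytes) throws IOException;
    }

    static final SimpleCoder<String> STRING_CODER = new SimpleCoder<String>() {
        @Override
        public byte[] encode(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }
        @Override
        public String decode(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    };

    public static void main(String[] args) throws IOException {
        // Instead of persisting RDD<WindowedValue<T>> directly (which
        // requires T to be Kryo- or Java-serializable), the runner could
        // persist the coder's byte[] and decode it after the cache hit.
        byte[] cached = STRING_CODER.encode("hello");
        String restored = STRING_CODER.decode(cached);
        System.out.println(restored);
    }
}
```

This is the byte[]-RDD design the thread deems more efficient but rejects for now, since it would change almost all method signatures in the Spark runner and weaken compile-time type safety.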