Right, as I answered Reuven: if the user classes are not serializable (Kryo and Java), then we should avoid caching, and by that avoid serializing altogether.
2017-08-24 22:36 GMT+03:00 Ben Chambers <bchamb...@apache.org>:

> If the user classes are not Serializable, how does adding Serializable to
> the WindowedValue help? The user class, which is stored in a field in the
> WindowedValue, will still be non-serializable and thus cause problems.
>
> On Thu, Aug 24, 2017 at 11:59 AM Kobi Salant <kobi.sal...@gmail.com>
> wrote:
>
> > Right, if it is not Kryo or Java serializable then we should allow the
> > user to set caching off.
> >
> > 2017-08-24 21:55 GMT+03:00 Reuven Lax <re...@google.com.invalid>:
> >
> > > However, user classes are not guaranteed to be Java serializable
> > > either.
> > >
> > > On Thu, Aug 24, 2017 at 11:31 AM, Kobi Salant <kobi.sal...@gmail.com>
> > > wrote:
> > >
> > > > Thank you for your replies.
> > > >
> > > > Lukasz, Kryo does not fail on the WindowedValue but on the user
> > > > class, which in some cases is not Kryo-serializable, usually when
> > > > there is no default constructor. Switching to Java serialization
> > > > fails on WindowedValue, as it is not declared Serializable.
> > > >
> > > > Thomas, we thought about writing a custom serializer, but it cannot
> > > > work without knowing the exact coders of each RDD. It is similar to
> > > > the past discussion of whether it should be mandatory for the user
> > > > to set coders for each PCollection; this suggestion amounts to
> > > > asking Beam to auto-detect coders for all PCollections.
> > > >
> > > > Ben, Spark uses Java serialization to send closures, and it will
> > > > not work if the cluster contains different JVMs that are not
> > > > compatible. We agree with you all that it would be more efficient
> > > > to use coders for caching, but that means a major rewrite of the
> > > > Spark runner, changing almost all method signatures and internal
> > > > RDDs from a specific type to byte[].
> > > > As I said before, this is a fallback for users who have
> > > > non-serializable user classes. Even today, if an RDD is cached
> > > > with Kryo it doesn't pass through the coder, so we are not
> > > > worsening the situation but enabling users to use the Spark
> > > > runner.
> > > >
> > > > Thanks,
> > > > Kobi
> > > >
> > > > On Aug 24, 2017 20:37, "Thomas Weise" <t...@apache.org> wrote:
> > > >
> > > > > Would a custom Kryo serializer that uses the coders to perform
> > > > > serialization help?
> > > > >
> > > > > There are various ways Kryo lets you attach such a serializer
> > > > > without full surgery, including @Bind at the field level or at
> > > > > the class level.
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > > On Thu, Aug 24, 2017 at 9:57 AM, Lukasz Cwik
> > > > > <lc...@google.com.invalid> wrote:
> > > > >
> > > > > > How does Kryo's FieldSerializer fail on WindowedValue/PaneInfo?
> > > > > > It seems like those are pretty simple types.
> > > > > >
> > > > > > Also, I don't see why they can't be tagged with Serializable,
> > > > > > but I think the original reasoning was that coders should be
> > > > > > used to guarantee a stable representation across JVM versions.
> > > > > >
> > > > > > On Thu, Aug 24, 2017 at 5:38 AM, Kobi Salant
> > > > > > <kobi.sal...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I am working on Spark runner issue
> > > > > > > https://issues.apache.org/jira/browse/BEAM-2669
> > > > > > > "Kryo serialization exception when DStreams containing
> > > > > > > non-Kryo-serializable data are cached".
> > > > > > >
> > > > > > > Currently, the Spark runner enforces the Kryo serializer,
> > > > > > > and in shuffling scenarios we always use coders to transfer
> > > > > > > bytes over the network.
> > > > > > > But Spark can also serialize data for caching/persisting,
> > > > > > > and currently it uses Kryo for this.
> > > > > > >
> > > > > > > Today, when the user uses a class which is not Kryo
> > > > > > > serializable, caching fails, and we thought to open the
> > > > > > > option of using Java serialization as a fallback.
> > > > > > >
> > > > > > > The RDDs/DStreams behind the PCollections are usually typed
> > > > > > > RDD/DStream<WindowedValue<InputT>>, and when Spark tries to
> > > > > > > Java-serialize them for caching purposes it fails on
> > > > > > > WindowedValue not being Java serializable.
> > > > > > >
> > > > > > > Is there any objection to adding "implements Serializable"
> > > > > > > to SDK classes like WindowedValue, PaneInfo and others?
> > > > > > >
> > > > > > > Again, I want to emphasise that coders are a big part of the
> > > > > > > Spark runner code, and we do use them whenever a shuffle is
> > > > > > > expected. Changing the types of the RDDs/DStreams to byte[]
> > > > > > > would make the code pretty unreadable and would weaken
> > > > > > > type-safety checks.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Kobi
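Ben's objection can be reproduced with plain Java serialization: tagging a wrapper Serializable is not enough if the field it holds is not, because serialization recurses into fields. This minimal sketch uses hypothetical stand-ins (Windowed for WindowedValue, UserRecord for a user class), not actual Beam types:

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableWrapperDemo {

    // Hypothetical stand-in for a user class: not Serializable and no
    // default constructor (the case Kryo's FieldSerializer also fails on).
    static class UserRecord {
        final String payload;
        UserRecord(String payload) { this.payload = payload; }
    }

    // Hypothetical stand-in for WindowedValue, tagged Serializable.
    static class Windowed<T> implements Serializable {
        final T value;
        Windowed(T value) { this.value = value; }
    }

    public static void main(String[] args) throws Exception {
        Windowed<UserRecord> wv = new Windowed<>(new UserRecord("data"));
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            // Java serialization walks into the wrapper's fields, so
            // tagging the wrapper alone does not help: the user field
            // still fails.
            oos.writeObject(wv);
            System.out.println("serialized");
        } catch (NotSerializableException e) {
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

So making WindowedValue Serializable fixes the Java-serialization fallback only for user classes that are themselves Serializable; otherwise caching still fails, which is why the thread also discusses letting the user turn caching off.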
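The coder-based alternative discussed above (persisting byte[] instead of typed values) can be sketched with a minimal, hypothetical coder interface. The real org.apache.beam.sdk.coders.Coder encodes to and from streams, so this only illustrates the shape of the idea, not Beam's API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CoderCachingSketch {

    // Minimal, hypothetical stand-in for a Beam coder; the real
    // org.apache.beam.sdk.coders.Coder works with streams instead.
    interface SimpleCoder<T> {
        byte[] encode(T value) throws IOException;
        T decode(byte[] bytes) throws IOException;
    }

    static final SimpleCoder<String> STRING_CODER = new SimpleCoder<String>() {
        @Override
        public byte[] encode(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }
        @Override
        public String decode(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    };

    public static void main(String[] args) throws IOException {
        // Instead of persisting RDD<WindowedValue<T>> directly (which
        // requires T to be Kryo- or Java-serializable), the runner could
        // persist the coder's byte[] and decode it after the cache hit.
        byte[] cached = STRING_CODER.encode("hello");
        String restored = STRING_CODER.decode(cached);
        System.out.println(restored);
    }
}
```

This is the byte[]-RDD design the thread deems more efficient but rejects for now, since it would change almost all method signatures in the Spark runner and weaken compile-time type safety.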