Is it a problem only for streaming, or does it affect batch queries as well?

On Fri, May 8, 2020 at 11:42 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
> The first case of the user report is obvious: according to the report, the
> Avro-generated code contains a getter that refers back to itself, which
> Spark disallows (it throws an exception). The getter doesn't have a matching
> setter method (if I understand correctly), so technically it shouldn't matter.
>
> For the second case of the user report, I've reproduced it with my own code.
> Please refer to the gist code:
> https://gist.github.com/HeartSaVioR/fab85734b5be85198c48f45004c8e0ca
>
> This code aggregates the max value of the values per key, where the key is
> in the range of (0 ~ 9).
>
> We're expecting the result of execution to be like (0, 10000), (1, 10001),
> ..., (9, 10009), but the result is incorrect, as shown below:
>
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +---+--------+
> |key|maxValue|
> +---+--------+
> +---+--------+
>
> -------------------------------------------
> Batch: 1
> -------------------------------------------
> +---+--------+
> |key|maxValue|
> +---+--------+
> |  0|   18990|
> |  7|   18997|
> |  6|   18996|
> |  9|   18999|
> |  5|   18995|
> |  1|   18991|
> |  3|   18993|
> |  8|   18998|
> |  2|   18992|
> |  4|   18994|
> +---+--------+
>
> -------------------------------------------
> Batch: 2
> -------------------------------------------
> +-----+------------+
> |  key|    maxValue|
> +-----+------------+
> |18990|       30990|
> |18997|540502118145|
> |18996|249574852617|
> |18999|146327314953|
> |18995|243603134985|
> |18991|476309451025|
> |18993|287916490001|
> |18998|324427845137|
> |18992|412640801297|
> |18994|302012976401|
> +-----+------------+
> ...
>
> This can happen with such inconsistent schemas because State in Structured
> Streaming doesn't check the schema (both name and type are unchecked) and
> simply applies the raw values in column order.
>
> On Fri, May 8, 2020 at 5:50 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Can you give some simple examples to demonstrate the problem?
>> I think the inconsistency would bring problems, but I don't know how.
>>
>> On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> (bump to expose the discussion to more readers)
>>>
>>> On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> There are a couple of issues reported on the user@ mailing list that
>>>> result from an inconsistent schema in Encoders.bean.
>>>>
>>>> 1. Typed datataset from Avro generated classes? [1]
>>>> 2. spark structured streaming GroupState returns weird values from sate
>>>> [2]
>>>>
>>>> Below is the part of JavaTypeInference.inferDataType() which handles
>>>> beans:
>>>>
>>>> https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157
>>>>
>>>> It collects properties based on the availability of a getter.
>>>>
>>>> (The same applies to `SQLContext.beansToRows`.)
>>>>
>>>> JavaTypeInference.serializerFor() and
>>>> JavaTypeInference.deserializerFor() don't. They collect properties based
>>>> on the availability of both getter and setter.
>>>> (They call JavaTypeInference.inferDataType() internally, so the result is
>>>> inconsistent even when only these methods are called.)
>>>>
>>>> This inconsistency produces runtime issues when a Java bean has only a
>>>> getter for some fields, even when no such field exists for the getter
>>>> method, as getter/setter methods are determined by naming convention.
>>>>
>>>> I feel this is something we should fix, but would like to see opinions
>>>> on how to fix it. If a user query has the problematic beans but hasn't
>>>> encountered such an issue, fixing the issue would drop some columns, which
>>>> would be backward incompatible.
>>>> I think this is still the way to go, but if we are more concerned about
>>>> not breaking existing queries, we may want to at least document the ideal
>>>> form of the bean Spark expects.
>>>>
>>>> Would like to hear opinions on this.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1.
>>>> https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
>>>> 2.
>>>> http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3ccafx8l21dzbyv5m1qozs3y+pcmycwbtjko6ytwvkydztq7u4...@mail.gmail.com%3e
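
For readers who want to see the mismatch without running Spark, here is a minimal sketch that uses only the JDK's java.beans.Introspector. It is not Spark code: it only mimics the two selection rules described in the thread (getter only versus getter and setter). The bean `Record`, its `getDoubled()` getter, and the helper `properties` are hypothetical names made up for illustration.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BeanPropertyMismatch {

    // Hypothetical bean: "key" and "value" have a getter and a setter;
    // "doubled" has a getter only and no backing field.
    public static class Record {
        private int key;
        private long value;

        public int getKey() { return key; }
        public void setKey(int key) { this.key = key; }
        public long getValue() { return value; }
        public void setValue(long value) { this.value = value; }
        public long getDoubled() { return value * 2; }
    }

    // Collects bean property names by naming convention.
    // requireSetter = false keeps every property with a getter;
    // requireSetter = true keeps only properties that also have a setter.
    static List<String> properties(Class<?> cls, boolean requireSetter) {
        List<String> names = new ArrayList<>();
        try {
            for (PropertyDescriptor pd : Introspector.getBeanInfo(cls).getPropertyDescriptors()) {
                if (pd.getName().equals("class")) continue;  // comes from Object.getClass()
                if (pd.getReadMethod() == null) continue;    // no getter
                if (requireSetter && pd.getWriteMethod() == null) continue;
                names.add(pd.getName());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        Collections.sort(names);  // fixed order so the two lists are easy to compare
        return names;
    }

    public static void main(String[] args) {
        // Rule 1 (getter only): includes "doubled"
        System.out.println("getter only:       " + properties(Record.class, false));
        // Rule 2 (getter and setter): omits "doubled"
        System.out.println("getter and setter: " + properties(Record.class, true));
    }
}
```

The first list has three names (doubled, key, value) and the second has two (key, value). If a schema is built from the first rule while values are written following the second, the columns no longer line up, which is the kind of shift the Batch 2 output above suggests.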