Is it a problem only for streaming, or does it affect batch queries as well?

On Fri, May 8, 2020 at 11:42 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> The first case in the user reports is obvious - according to the report,
> the AVRO generated code contains a getter which refers back to its own type,
> hence Spark disallows it (throws an exception), but it doesn't have a
> matching setter method (if I understand correctly), so technically it
> shouldn't matter.
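For illustration, here is a minimal, hypothetical Java sketch of the shape described above (the class and property names are made up, not from the Avro report): a bean with a getter-only, self-referential property, inspected through the standard `java.beans` introspection that bean-based schema inference relies on. The getter-only property is discovered even though it has no backing field and no setter.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Hypothetical bean mimicking generated code: "self" is a getter-only
// property whose return type refers back to the bean itself.
class AvroLikeRecord {
    private long value;
    public long getValue() { return value; }
    public void setValue(long v) { this.value = v; }
    // No backing field and no setter; discovered purely by naming convention.
    public AvroLikeRecord getSelf() { return this; }
}

public class GetterOnlyDemo {
    public static void main(String[] args) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(AvroLikeRecord.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            // "self" shows up with a getter but no setter; "value" has both.
            System.out.println(pd.getName()
                + " getter=" + (pd.getReadMethod() != null)
                + " setter=" + (pd.getWriteMethod() != null));
        }
    }
}
```

A getter-based property collection would include `self`, while a getter-plus-setter-based collection would skip it, which is the divergence under discussion.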
>
> For the second case of user report, I've reproduced with my own code.
> Please refer to the gist code:
> https://gist.github.com/HeartSaVioR/fab85734b5be85198c48f45004c8e0ca
>
> This code aggregates the max value per key, where the key is in the range
> of 0 ~ 9.
>
> We expect results like (0, 10000), (1, 10001), ..., (9, 10009), but the
> actual output is incorrect, as shown below:
>
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +---+--------+
> |key|maxValue|
> +---+--------+
> +---+--------+
>
> -------------------------------------------
> Batch: 1
> -------------------------------------------
> +---+--------+
> |key|maxValue|
> +---+--------+
> |  0|   18990|
> |  7|   18997|
> |  6|   18996|
> |  9|   18999|
> |  5|   18995|
> |  1|   18991|
> |  3|   18993|
> |  8|   18998|
> |  2|   18992|
> |  4|   18994|
> +---+--------+
>
> -------------------------------------------
> Batch: 2
> -------------------------------------------
> +-----+------------+
> |  key|    maxValue|
> +-----+------------+
> |18990|       30990|
> |18997|540502118145|
> |18996|249574852617|
> |18999|146327314953|
> |18995|243603134985|
> |18991|476309451025|
> |18993|287916490001|
> |18998|324427845137|
> |18992|412640801297|
> |18994|302012976401|
> +-----+------------+
> ...
>
> This can happen with such inconsistent schemas because the state store in
> Structured Streaming doesn't check the schema (neither name nor type is
> checked) and simply applies the raw values in column order.
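A minimal sketch (plain Java, not Spark code; schema and values are made up for illustration) of what purely positional application means: the state row is written with one column order and read back assuming another, and with no name/type check the values land under the wrong fields - mirroring how `key` in Batch 2 above picks up the previous batch's `maxValue`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionalStateDemo {
    public static void main(String[] args) {
        // The writer saw the state schema as [key, maxValue].
        Object[] row = {7, 18997L};

        // The reader infers the schema with a different column order
        // (e.g. a different set of bean properties was picked up), but
        // no name or type check is performed.
        List<String> readerSchema = Arrays.asList("maxValue", "key");

        Map<String, Object> decoded = new LinkedHashMap<>();
        for (int i = 0; i < readerSchema.size(); i++) {
            decoded.put(readerSchema.get(i), row[i]);  // purely positional
        }
        System.out.println(decoded);  // {maxValue=7, key=18997}
    }
}
```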
>
> On Fri, May 8, 2020 at 5:50 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Can you give some simple examples to demonstrate the problem? I think the
>> inconsistency would bring problems, but I don't know how.
>>
>> On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> (bump to expose the discussion to more readers)
>>>
>>> On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> There are a couple of issues reported on the user@ mailing list
>>>> which result from the inconsistent schema handling in Encoders.bean.
>>>>
>>>> 1. Typed datataset from Avro generated classes? [1]
>>>> 2. spark structured streaming GroupState returns weird values from sate
>>>> [2]
>>>>
>>>> Below is a part of JavaTypeInference.inferDataType() which handles
>>>> beans:
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157
>>>>
>>>> It collects properties based on the availability of a getter.
>>>>
>>>> (The same applies to `SQLContext.beansToRows`.)
>>>>
>>>> JavaTypeInference.serializerFor() and
>>>> JavaTypeInference.deserializerFor() don't: they collect properties based
>>>> on the availability of both getter and setter.
>>>> (They call JavaTypeInference.inferDataType() internally, so the result is
>>>> inconsistent even when only these methods are called.)
>>>>
>>>> This inconsistency produces runtime issues when a Java bean only has
>>>> getters for some fields, even when there is no backing field for the
>>>> getter method - as getter/setter methods are discovered by naming
>>>> convention.
>>>>
>>>> I feel this is something we should fix, but I would like to see opinions
>>>> on how to fix it. If a user query contains the problematic beans but
>>>> hasn't encountered such an issue yet, fixing it would drop some columns,
>>>> which would be backward incompatible. I think this is still the way to
>>>> go, but if we are more concerned about not breaking existing queries, we
>>>> may want to at least document the ideal form of bean that Spark expects.
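If we go the documentation route, a sketch of the bean shape that both property-collection paths agree on (class and field names here are hypothetical, not from the reports): every property has a backing field plus a matching getter/setter pair, so getter-based and getter-plus-setter-based collection yield the same set of columns.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// A hypothetical well-formed bean: every property has a backing field and
// a matching getter/setter pair following the JavaBeans naming convention.
class MaxEvent {
    private int key;
    private long maxValue;
    public int getKey() { return key; }
    public void setKey(int key) { this.key = key; }
    public long getMaxValue() { return maxValue; }
    public void setMaxValue(long maxValue) { this.maxValue = maxValue; }
}

public class IdealBeanDemo {
    public static void main(String[] args) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(MaxEvent.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            // Every property exposes both methods, so there is no
            // getter-only property for the two code paths to disagree on.
            System.out.println(pd.getName() + " complete="
                + (pd.getReadMethod() != null && pd.getWriteMethod() != null));
        }
    }
}
```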
>>>>
>>>> Would like to hear opinions on this.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1.
>>>> https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
>>>> 2.
>>>> http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3ccafx8l21dzbyv5m1qozs3y+pcmycwbtjko6ytwvkydztq7u4...@mail.gmail.com%3e
>>>>
>>>
