[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603800#comment-16603800 ]

Fangshi Li commented on SPARK-24256:
------------------------------------

To summarize our discussion for this PR:
The spark-avro module is now merged into Spark as a built-in data source. The 
upstream community is not merging the AvroEncoder to support Avro types in 
Dataset; instead, the plan is to expose the user-defined type (UDT) API to 
support defining arbitrary user types in Dataset.

The purpose of this patch is to enable ExpressionEncoder to work together with 
other types of Encoders, while it seems that upstream prefers to go with UDT. 
Given this, we can close this PR and start the discussion on UDT in another 
channel.
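
For context, a minimal sketch of what a UDT-based definition looks like 
(illustrative names only; UserDefinedType and UDTRegistration are currently not 
public, which is exactly why the plan is to expose this API):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

// A plain user class that Spark's built-in encoders know nothing about.
class Point(val x: Double, val y: Double)

// A UDT maps the user class onto a Catalyst SQL type.
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType =
    StructType(Seq(StructField("x", DoubleType), StructField("y", DoubleType)))
  override def serialize(p: Point): InternalRow = InternalRow(p.x, p.y)
  override def deserialize(datum: Any): Point = datum match {
    case row: InternalRow => new Point(row.getDouble(0), row.getDouble(1))
  }
  override def userClass: Class[Point] = classOf[Point]
}

// Registration ties the user class to its UDT so Catalyst can resolve it.
UDTRegistration.register(classOf[Point].getName, classOf[PointUDT].getName)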

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24256
>                 URL: https://issues.apache.org/jira/browse/SPARK-24256
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Fangshi Li
>            Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively 
> supports these types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. 
> Although we can use AvroEncoder to define a Dataset whose type is an Avro 
> GenericRecord or SpecificRecord, such an Avro-typed Dataset has several 
> limitations (sketched below):
>  # We cannot use joinWith on this Dataset, since the result is a tuple and 
> Avro types cannot be fields of that tuple.
>  # We cannot use some type-safe aggregation methods on this Dataset, such as 
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields 
> in a case class, which we find is a very common use case.
> The root cause is that ExpressionEncoder does not support ser/de of a Scala 
> case class/tuple whose fields are of some other user-defined type that has 
> its own Encoder.
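> A rough sketch of these failures follows. MyAvroRecord, its getters, and the 
> "id" field stand in for any Avro SpecificRecord, and AvroEncoder.of is assumed 
> here as the spark-avro encoder factory; treat the names as illustrative rather 
> than as the exact API.
>
> import com.databricks.spark.avro.AvroEncoder
> import org.apache.spark.sql.{Dataset, SparkSession}
>
> val spark = SparkSession.builder().getOrCreate()
> import spark.implicits._
>
> // Encoding the top-level Avro type itself works fine.
> implicit val avroEnc = AvroEncoder.of[MyAvroRecord]
> val rawRecords: Seq[MyAvroRecord] = ??? // produced elsewhere
> val events: Dataset[MyAvroRecord] = spark.createDataset(rawRecords)
>
> // 1. Fails: the result needs an encoder for (MyAvroRecord, MyAvroRecord),
> //    and the tuple ExpressionEncoder cannot handle Avro-typed fields.
> val joined = events.joinWith(events, events("id") === events("id"))
>
> // 2. Fails for the same reason: reduceGroups yields a (key, value) tuple.
> val reduced = events.groupByKey(_.getUserId).reduceGroups((a, b) => a)
>
> // 3. Fails: the case-class ExpressionEncoder cannot derive the Avro field.
> case class Enriched(userId: Long, event: MyAvroRecord)
> val enriched = events.map(e => Enriched(e.getUserId, e))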
> To address this issue, we propose a trait as a contract (between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> classes/tuples/Java beans to contain fields of user-defined types.
> With this proposed patch and a minor modification in AvroEncoder, we remove 
> the above-mentioned limitations using the cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
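> With that conf set, the previously failing patterns go through; a short sketch 
> (reusing the illustrative names and the Avro-typed Dataset "events" from the 
> sketch above):
>
> import org.apache.spark.sql.{Dataset, SparkSession}
>
> val spark = SparkSession.builder()
>   .config("spark.expressionencoder.org.apache.avro.specific.SpecificRecord",
>           "com.databricks.spark.avro.AvroEncoder$")
>   .getOrCreate()
> import spark.implicits._
>
> // ExpressionEncoder now delegates SpecificRecord fields to AvroEncoder, so
> // tuples and case classes that contain Avro records can be encoded.
> val joined: Dataset[(MyAvroRecord, MyAvroRecord)] =
>   events.joinWith(events, events("id") === events("id"))
>
> case class Enriched(userId: Long, event: MyAvroRecord)
> val enriched: Dataset[Enriched] = events.map(e => Enriched(e.getUserId, e))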
> This is a patch we have implemented internally and have been using for a few 
> quarters. We want to propose it upstream, as we think it is a useful feature 
> that makes Dataset more flexible for user types.


