[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287735#comment-16287735 ]

Adamos Loizou commented on SPARK-22351:
---------------------------------------

Hello guys, once more I've run into this problem, this time with ADT/sealed-hierarchy examples. For reference, other people are already facing this issue ([stack overflow link|https://stackoverflow.com/questions/41030073/encode-an-adt-sealed-trait-hierarchy-into-spark-dataset-column]). Here is an example:
{code:java}
sealed trait Fruit
case object Apple extends Fruit
case object Orange extends Fruit

case class Bag(quantity: Int, fruit: Fruit)

Seq(Bag(1, Apple), Bag(3, Orange)).toDS // <- This fails because it can't find an encoder for Fruit
{code}
Ideally I'd like to be able to create my own encoder where I can tell it, for example, to use the case object's {{toString}} method to map it to a String column. How feasible would it be to expose an API for creating custom encoders? Unfortunately, not having this significantly limits the capacity for generalised and typesafe models. Thank you.

> Support user-created custom Encoders for Datasets
> -------------------------------------------------
>
>                 Key: SPARK-22351
>                 URL: https://issues.apache.org/jira/browse/SPARK-22351
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adamos Loizou
>            Priority: Minor
>
> It would be very helpful if we could easily support creating custom encoders for classes in Spark SQL.
> This is to allow a user to properly define a business model using types of their choice. They can then map them to Spark SQL types without being forced to pollute their model with the built-in mappable types (e.g. {{java.sql.Timestamp}}).
> Specifically in our case, we tend to use either the Java 8 time API or the Joda-Time API for dates instead of {{java.sql.Timestamp}}, whose API is quite limited compared to the others.
> Ideally we would like to be able to have a dataset of such a class:
> {code:java}
> case class Person(name: String, dateOfBirth: org.joda.time.LocalDate)
>
> // we define something that maps to Spark SQL TimestampType
> implicit def localDateTimeEncoder: Encoder[LocalDate] = ???
>
> // read csv and map it to the model
> val people: Dataset[Person] = spark.read.csv("/my/path/file.csv").as[Person]
> {code}
> While this was possible in Spark 1.6, it's no longer the case in Spark 2.x. It's also not straightforward how to support that using an {{ExpressionEncoder}} (any tips would be much appreciated).
> Thanks.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
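The mapping the comment above asks for can be sketched without Spark: encode each case object via its {{toString}} and decode with an explicit match. Note this is only an illustration of the requested behaviour; {{toColumn}}/{{fromColumn}} are hypothetical helper names, not part of any Spark API. An exposed custom-Encoder API would essentially wrap such a pair of functions.

```scala
// Spark-free sketch of the requested ADT <-> String column mapping.
// toColumn/fromColumn are hypothetical helpers, not a Spark API.
sealed trait Fruit
case object Apple extends Fruit
case object Orange extends Fruit

case class Bag(quantity: Int, fruit: Fruit)

object Fruit {
  // Encode: the case object's toString ("Apple", "Orange") becomes the column value.
  def toColumn(f: Fruit): String = f.toString

  // Decode: map the stored string back to the singleton case object.
  def fromColumn(s: String): Fruit = s match {
    case "Apple"  => Apple
    case "Orange" => Orange
    case other    => throw new IllegalArgumentException(s"Unknown fruit: $other")
  }
}
```

Until such an API exists, one workaround is a surrogate case class holding a {{String}} field, converting with these functions at the Dataset boundary via {{map}}.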
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224715#comment-16224715 ]

Adamos Loizou commented on SPARK-22351:
---------------------------------------

Hi [~hyukjin.kwon], in Spark 1.6 I managed to add support for custom types by defining subclasses of {{org.apache.spark.sql.types.UserDefinedType}}, e.g.:
{code:java}
class JodaLocalDateType extends UserDefinedType[org.joda.time.LocalDate] {
  override def sqlType: DataType = TimestampType
  override def serialize(p: org.joda.time.LocalDate) = ???
  override def deserialize(datum: Any): org.joda.time.LocalDate = ???
  ...
}
{code}
This abstract class has been made private in Spark 2.x. Unfortunately, there doesn't seem to be an easy alternative.
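The serialize/deserialize pair such a UDT needs can be sketched in plain Scala. To keep the sketch dependency-free it uses {{java.time.LocalDate}} rather than {{org.joda.time.LocalDate}}; {{LocalDateCodec}} and its start-of-day convention are assumptions for illustration, not Spark or Joda-Time API.

```scala
// Dependency-free sketch of a UDT-style serialize/deserialize pair.
// java.time.LocalDate stands in for org.joda.time.LocalDate; the
// conversion shape (date <-> timestamp) is the same either way.
import java.sql.Timestamp
import java.time.LocalDate

object LocalDateCodec {
  // Serialize: a date maps to a timestamp at midnight (start of day).
  def serialize(d: LocalDate): Timestamp =
    Timestamp.valueOf(d.atStartOfDay())

  // Deserialize: truncate the timestamp back to its date component.
  def deserialize(datum: Any): LocalDate = datum match {
    case ts: Timestamp => ts.toLocalDateTime.toLocalDate
    case other         => throw new IllegalArgumentException(s"Unexpected value: $other")
  }
}
```

The start-of-day choice makes the round trip lossless for dates; any information finer than a day is deliberately dropped on read.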
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223470#comment-16223470 ]

Hyukjin Kwon commented on SPARK-22351:
--------------------------------------

{quote}
While this was possible in Spark 1.6 it's not longer the case in Spark 2.x.
{quote}
Would you mind sharing the code? I can't seem to reproduce this in 1.6.
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218352#comment-16218352 ]

Adamos Loizou commented on SPARK-22351:
---------------------------------------

I have also looked at the {{Encoders}} API. The available methods do not allow creating an encoder for a custom type, such as an {{Encoder\[org.joda.time.LocalDate\]}} that maps to a specific Spark SQL type of my choosing ({{TimestampType}}). The only option {{Encoders}} offers is the Kryo one, which will not map it to {{TimestampType}}.
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218326#comment-16218326 ]

Sean Owen commented on SPARK-22351:
-----------------------------------

You are looking for {{org.apache.spark.sql.Encoder}}, right? It's a trait.
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218323#comment-16218323 ]

Adamos Loizou commented on SPARK-22351:
---------------------------------------

Hi [~srowen], yes. The problem is creating the encoder in the first place, i.e. how can you create an {{Encoder}} in Spark 2.x that serializes a Joda {{LocalDate}} to a {{Timestamp}}? I had a look at {{ExpressionEncoder}}, which is the _only_ implementation allowed, and it was non-trivial. If you believe it's relatively easy, please share. Thanks.
[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets
[ https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218318#comment-16218318 ]

Sean Owen commented on SPARK-22351:
-----------------------------------

In Dataset, you can see many APIs accepting an Encoder. Have you tried those?