How to set nullable field when create DataFrame using case class
Hi all,

Consider the following case:

import java.sql.Timestamp
case class MyProduct(t: Timestamp, a: Float)
val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
rdd.printSchema()

The output is:

root
 |-- t: timestamp (nullable = true)
 |-- a: float (nullable = false)

How can I set the timestamp column to be NOT nullable?

Regards,
Luis

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-nullable-field-when-create-DataFrame-using-case-class-tp27479.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
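For reference, one commonly suggested workaround (a sketch, not taken from this thread; it assumes a SparkSession named `spark`) is to rebuild the DataFrame against a copy of its inferred schema with `nullable` forced to `false`:

```scala
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

case class MyProduct(t: Timestamp, a: Float)

object NotNullSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val df = spark.createDataFrame(Seq(MyProduct(new Timestamp(0), 10)))

    // Copy the inferred schema, forcing nullable = false on every column.
    val strictSchema = StructType(df.schema.map(_.copy(nullable = false)))

    // Rebuild the DataFrame against the stricter schema. Note: this only
    // changes the schema metadata; Spark does not validate that the rows
    // actually contain no nulls.
    val strictDf = spark.createDataFrame(df.rdd, strictSchema)
    strictDf.printSchema()

    spark.stop()
  }
}
```

Whether flipping the flag this way is safe depends on the data actually being non-null, as Michael's reply below points out.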
Re: How to set nullable field when create DataFrame using case class
On Thu, Aug 4, 2016 at 11:56 PM, luismattor wrote:
> How can I set the timestamp column to be NOT nullable?

Hi,

Given [1], it's not possible without defining your own Encoder for the Dataset (the one you currently use implicitly). It'd be something as follows:

implicit def myEncoder: Encoder[MyProduct] = ???
spark.createDataset(Seq(MyProduct(new Timestamp(0), 10)))

I don't know how to create the Encoder myself, though. You'd need to use Encoders.product[MyProduct] as a guideline. This might help: http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6

[1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L672
Re: How to set nullable field when create DataFrame using case class
On Thu, Aug 4, 2016 at 11:56 PM, luismattor wrote:
> import java.sql.Timestamp
> case class MyProduct(t: Timestamp, a: Float)
> val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
> rdd.printSchema()
>
> The output is:
> root
> |-- t: timestamp (nullable = true)
> |-- a: float (nullable = false)
>
> How can I set the timestamp column to be NOT nullable?

Gotcha! :)

scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> case class MyProduct(t: java.sql.Timestamp, a: Float)
defined class MyProduct

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.catalyst.encoders._
import org.apache.spark.sql.catalyst.encoders._

scala> implicit def myEncoder: Encoder[MyProduct] = ExpressionEncoder[MyProduct].copy(schema = new StructType().add("t", "timestamp", false).add("a", "float", false))
myEncoder: org.apache.spark.sql.Encoder[MyProduct]

scala> spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema
root
 |-- t: timestamp (nullable = false)
 |-- a: float (nullable = false)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
Re: How to set nullable field when create DataFrame using case class
Nullable is an optimization for Spark SQL. It is telling Spark to not even do an if check when accessing that field.

In this case, your data *is* nullable, because Timestamp is an object in Java and you could put null there.

On Thu, Aug 4, 2016 at 2:56 PM, luismattor wrote:
> How can I set the timestamp column to be NOT nullable?
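Michael's point can be demonstrated without Spark at all; a minimal plain-Scala sketch (class name taken from the thread):

```scala
import java.sql.Timestamp

case class MyProduct(t: Timestamp, a: Float)

object NullableDemo {
  def main(args: Array[String]): Unit = {
    // t is a reference type (java.sql.Timestamp), so null is a legal value:
    val p = MyProduct(null, 10f)
    assert(p.t == null)

    // a is a primitive Float, which can never hold null; the line below
    // would not even compile, which is why Spark can safely infer
    // nullable = false for it:
    // val q = MyProduct(new Timestamp(0), null)
  }
}
```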
Re: How to set nullable field when create DataFrame using case class
Hi Jacek,

I have not used Encoders before. This definitely works! Thank you!

Luis

On 4 August 2016 at 18:23, Jacek Laskowski wrote:
> scala> implicit def myEncoder: Encoder[MyProduct] =
> ExpressionEncoder[MyProduct].copy(schema = new StructType().add("t",
> "timestamp", false).add("a", "float", false))
> myEncoder: org.apache.spark.sql.Encoder[MyProduct]
>
> scala> spark.createDataset(Seq(MyProduct(new Timestamp(0),
> 10))).printSchema
> root
>  |-- t: timestamp (nullable = false)
>  |-- a: float (nullable = false)
Re: How to set nullable field when create DataFrame using case class
Hi Michael,

Since we're at it, could you please point at the code where the optimization happens? I assume you're talking about Catalyst when whole-stage-generating the code for queries. Is this perhaps nullability (NULL value) propagation? I'd appreciate it (hoping that it would improve my understanding of the low-level bits quite substantially). Thanks!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Aug 5, 2016 at 1:24 AM, Michael Armbrust wrote:
> Nullable is an optimization for Spark SQL. It is telling spark to not even
> do an if check when accessing that field.
>
> In this case, your data is nullable, because timestamp is an object in java
> and you could put null there.
Re: How to set nullable field when create DataFrame using case class
Hi Jacek,

Is this line correct?

spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 5 August 2016 at 10:21, Jacek Laskowski wrote:
> Hi Michael,
>
> Since we're at it, could you please point at the code where the
> optimization happens? I assume you're talking about Catalyst when
> whole-gening the code for queries. Is this nullability (NULL value)
> propagation perhaps?
Re: How to set nullable field when create DataFrame using case class
Hi,

Seems so. It's equivalent to:

Seq(MyProduct(new Timestamp(0), 10)).toDS.printSchema

(and now I'm wondering why I didn't pick this variant)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Aug 5, 2016 at 11:29 AM, Mich Talebzadeh wrote:
> Hi Jacek,
>
> Is this line correct?
>
> spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema