How to set nullable field when creating DataFrame using case class

2016-08-04 Thread luismattor
Hi all,

Consider the following case:

import java.sql.Timestamp
case class MyProduct(t: Timestamp, a: Float)
val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
rdd.printSchema()

The output is:
root
 |-- t: timestamp (nullable = true)
 |-- a: float (nullable = false)

How can I set the timestamp column to be NOT nullable?

Regards,
Luis



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-nullable-field-when-create-DataFrame-using-case-class-tp27479.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to set nullable field when creating DataFrame using case class

2016-08-04 Thread Jacek Laskowski
On Thu, Aug 4, 2016 at 11:56 PM, luismattor  wrote:

> How can I set the timestamp column to be NOT nullable?

Hi,

Given [1], it's not possible without defining your own Encoder for
the Dataset (the one that is used implicitly).

It would be something like the following:

implicit def myEncoder: Encoder[MyProduct] = ???
spark.createDataset(Seq(MyProduct(new Timestamp(0), 10)))

I don't know offhand how to create such an Encoder, though. You'd want
to use Encoders.product[MyProduct] as a guideline.

That might help -
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6.

[1] 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L672
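
For reference, a minimal sketch of that guideline, assuming a Spark 2.x shell
where MyProduct is already defined (the val name below is made up for
illustration):

import org.apache.spark.sql.{Encoder, Encoders}

// The encoder Spark derives by default for a case class (a Product type).
val defaultEncoder: Encoder[MyProduct] = Encoders.product[MyProduct]

// Its inferred schema marks the Timestamp field as nullable, which is
// exactly what a custom encoder would have to override.
defaultEncoder.schema.printTreeString()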




Re: How to set nullable field when creating DataFrame using case class

2016-08-04 Thread Jacek Laskowski
On Thu, Aug 4, 2016 at 11:56 PM, luismattor  wrote:

> import java.sql.Timestamp
> case class MyProduct(t: Timestamp, a: Float)
> val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
> rdd.printSchema()
>
> The output is:
> root
>  |-- t: timestamp (nullable = true)
>  |-- a: float (nullable = false)
>
> How can I set the timestamp column to be NOT nullable?

Gotcha! :)

scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> case class MyProduct(t: java.sql.Timestamp, a: Float)
defined class MyProduct

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.catalyst.encoders._
import org.apache.spark.sql.catalyst.encoders._

scala> implicit def myEncoder: Encoder[MyProduct] =
ExpressionEncoder[MyProduct].copy(schema = new StructType().add("t",
"timestamp", false).add("a", "float", false))
myEncoder: org.apache.spark.sql.Encoder[MyProduct]

scala> spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema
root
 |-- t: timestamp (nullable = false)
 |-- a: float (nullable = false)
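
For completeness, the same idea as a compact, commented snippet rather than a
shell transcript (same assumptions as the Spark 2.0 session above, with spark
being the SparkSession and the code entered in the shell):

import java.sql.Timestamp
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class MyProduct(t: Timestamp, a: Float)

// Start from the encoder Spark derives for the case class and override
// only its declared schema, marking both fields as non-nullable.
implicit def myEncoder: Encoder[MyProduct] =
  ExpressionEncoder[MyProduct].copy(schema = new StructType()
    .add("t", "timestamp", nullable = false)
    .add("a", "float", nullable = false))

// createDataset picks up the implicit encoder defined above.
spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema()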

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: How to set nullable field when creating DataFrame using case class

2016-08-04 Thread Michael Armbrust
Nullable is an optimization for Spark SQL. It tells Spark that it does not
even need to do a null check when accessing that field.

In this case, your data *is* nullable, because Timestamp is an object in
Java and you could put null there.
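
To make that concrete, a small sketch using the same MyProduct definition as
in the original post:

import java.sql.Timestamp

case class MyProduct(t: Timestamp, a: Float)

// Timestamp is a reference type, so null is a perfectly legal value for t ...
val withNullTimestamp = MyProduct(null, 10f)

// ... whereas Float is backed by a JVM primitive and cannot hold null:
// MyProduct(new Timestamp(0), null)   // does not compile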

On Thu, Aug 4, 2016 at 2:56 PM, luismattor  wrote:

> Hi all,
>
> Consider the following case:
>
> import java.sql.Timestamp
> case class MyProduct(t: Timestamp, a: Float)
> val rdd = sc.parallelize(List(MyProduct(new Timestamp(0), 10))).toDF()
> rdd.printSchema()
>
> The output is:
> root
>  |-- t: timestamp (nullable = true)
>  |-- a: float (nullable = false)
>
> How can I set the timestamp column to be NOT nullable?
>
> Regards,
> Luis


Re: How to set nullable field when creating DataFrame using case class

2016-08-04 Thread Luis Mateos
Hi Jacek,

I had not used Encoders before. This definitely works! Thank you!

Luis


On 4 August 2016 at 18:23, Jacek Laskowski  wrote:

> Gotcha! :)
>
> scala> import java.sql.Timestamp
> import java.sql.Timestamp
>
> scala> case class MyProduct(t: java.sql.Timestamp, a: Float)
> defined class MyProduct
>
> scala> import org.apache.spark.sql._
> import org.apache.spark.sql._
>
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
>
> scala> import org.apache.spark.sql.catalyst.encoders._
> import org.apache.spark.sql.catalyst.encoders._
>
> scala> implicit def myEncoder: Encoder[MyProduct] =
> ExpressionEncoder[MyProduct].copy(schema = new StructType().add("t",
> "timestamp", false).add("a", "float", false))
> myEncoder: org.apache.spark.sql.Encoder[MyProduct]
>
> scala> spark.createDataset(Seq(MyProduct(new Timestamp(0),
> 10))).printSchema
> root
>  |-- t: timestamp (nullable = false)
>  |-- a: float (nullable = false)


Re: How to set nullable field when creating DataFrame using case class

2016-08-05 Thread Jacek Laskowski
Hi Michael,

Since we're at it, could you please point to the code where the
optimization happens? I assume you're talking about Catalyst when it
does whole-stage code generation for queries. Is this perhaps nullability
(NULL value) propagation? I'd appreciate a pointer (hoping it would improve
my understanding of the low-level bits quite substantially). Thanks!
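
In the meantime, one way to look at the effect is to dump the generated code;
a minimal sketch, assuming a Spark 2.0 shell (the query below is just an
arbitrary example):

import org.apache.spark.sql.execution.debug._

// Prints the Java source produced by whole-stage code generation for the
// query, so the per-column null handling can be inspected.
spark.range(1).selectExpr("id + 1").debugCodegen()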

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Aug 5, 2016 at 1:24 AM, Michael Armbrust  wrote:
> Nullable is an optimization for Spark SQL.  It is telling spark to not even
> do an if check when accessing that field.
>
> In this case, your data is nullable, because timestamp is an object in java
> and you could put null there.




Re: How to set nullable field when creating DataFrame using case class

2016-08-05 Thread Mich Talebzadeh
Hi Jacek,

Is this line correct?

spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: How to set nullable field when creating DataFrame using case class

2016-08-05 Thread Jacek Laskowski
Hi,

Seems so. It's equivalent to

Seq(MyProduct(new Timestamp(0), 10)).toDS.printSchema

(and now I'm wondering why I didn't pick this variant)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Aug 5, 2016 at 11:29 AM, Mich Talebzadeh
 wrote:
> Hi Jacek,
>
> Is this line correct?
>
> spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema
>
> Thanks
