Re: Change nullable property in Dataset schema

2016-08-17 Thread Kazuaki Ishizaki
Thank you for your comments.

> You should just Seq(...).toDS
I tried this; however, the result did not change.

>> val ds2 = ds1.map(e => e)
> Why do you use e => e? It's the identity function, so it does nothing.
Yes, e => e does nothing. For the sake of a simple example, I used the 
simplest possible expression in map(). In current Spark, the expression 
passed to map() does not change the schema of its output.
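To illustrate, a minimal sketch (my own example, not from the thread; it assumes a local SparkSession is available and spark.implicits._ is in scope):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("identity-map").getOrCreate()
import spark.implicits._

val ds1 = Seq(Array(1, 1), Array(2, 2)).toDS
val ds2 = ds1.map(e => e)

// map() derives the output schema from the implicit encoder, not from the
// mapping function, so ds2 keeps the same schema as ds1 (nullable = true
// for the "value" column), no matter what the function does.
assert(ds1.schema == ds2.schema)
spark.stop()
```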

>   .as(RowEncoder(new StructType()
>  .add("value", ArrayType(IntegerType, false), nullable = false)))
Sorry, this was my mistake. It did not work for my purpose; it actually 
does nothing.

Kazuaki Ishizaki



From:   Jacek Laskowski <ja...@japila.pl>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: user <user@spark.apache.org>
Date:   2016/08/15 04:56
Subject:    Re: Change nullable property in Dataset schema



On Wed, Aug 10, 2016 at 12:04 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> 
wrote:

>   import testImplicits._
>   test("test") {
>     val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS

You should just Seq(...).toDS

> val ds2 = ds1.map(e => e)

Why do you use e => e? It's the identity function, so it does nothing.

>   .as(RowEncoder(new StructType()
>  .add("value", ArrayType(IntegerType, false), nullable = false)))

I didn't know that was possible, but it looks like toDF is where you could
replace the schema too (in a less involved way).
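For reference, a sketch of that DataFrame-level route (my own untested example; it assumes a SparkSession `spark` with its implicits imported) would rebuild the DataFrame from its RDD with an explicit schema:

```scala
import org.apache.spark.sql.types._

val df = Seq(Array(1, 1), Array(2, 2)).toDF("value")

// createDataFrame takes the schema verbatim, so the nullable flags
// become exactly what we declare here.
val schema = StructType(Seq(
  StructField("value", ArrayType(IntegerType, containsNull = false), nullable = false)))
val df2 = spark.createDataFrame(df.rdd, schema)
df2.printSchema  // value: array (nullable = false)
```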

I learnt quite a lot from just a single email. Thanks!

Pozdrawiam,
Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org






Re: Change nullable property in Dataset schema

2016-08-17 Thread Kazuaki Ishizaki
My motivation is to simplify the Java code generated by the Tungsten 
compiler.

Here is a dump of generated code from the program.
https://gist.github.com/kiszk/402bd8bc45a14be29acb3674ebc4df24

If we can tell Catalyst that the result of map() is never null, we can 
eliminate conditional branches.
For example, in the generated code at the above URL, the condition at line 
45 is always false because our schema guarantees that the result of map() 
is never null. As a result, we can eliminate the assignments at lines 52 
and 56, and the conditional branches at lines 55 and 61.

Kazuaki Ishizaki



From:   Koert Kuipers <ko...@tresata.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: "user@spark.apache.org" <user@spark.apache.org>
Date:   2016/08/16 04:35
Subject:    Re: Change nullable property in Dataset schema



why do you want the array to have nullable = false? what is the benefit?

On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> 
wrote:
Dear all,
Could you let me know how to change the nullable property in a Dataset 
schema?

When I looked for how to change the nullable property in a DataFrame 
schema, I found the following approaches.
http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe

https://github.com/apache/spark/pull/13873 (not merged yet)

However, I cannot find how to change the nullable property in a Dataset 
schema. Even when I wrote the following program, the nullable property for 
"value: array" in ds2.schema is not changed.
If my understanding is correct, Spark 2.0 uses an ExpressionEncoder that 
is generated from Dataset[T] at 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L46


class Test extends QueryTest with SharedSQLContext {
  import testImplicits._
  test("test") {
    val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS
    val schema = new StructType().add("array", ArrayType(IntegerType, false), false)
    val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[Array[Int]], false)
    val encoder = new ExpressionEncoder[Array[Int]](schema, true,
      ScalaReflection.serializerFor[Array[Int]](inputObject).flatten,
      ScalaReflection.deserializerFor[Array[Int]],
      ClassTag[Array[Int]](classOf[Array[Int]]))
    val ds2 = ds1.map(e => e)(encoder)
    ds1.printSchema
    ds2.printSchema
  }
}

root
 |-- value: array (nullable = true)
 |    |-- element: integer (containsNull = false)

root
 |-- value: array (nullable = true)          // Expected (nullable = false)
 |    |-- element: integer (containsNull = false)


Kazuaki Ishizaki





Re: Change nullable property in Dataset schema

2016-08-15 Thread Koert Kuipers
why do you want the array to have nullable = false? what is the benefit?

On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki wrote:

> Dear all,
> Could you let me know how to change the nullable property in a Dataset
> schema?
>
> When I looked for how to change the nullable property in a DataFrame
> schema, I found the following approaches.
> http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
> https://github.com/apache/spark/pull/13873 (not merged yet)
>
> However, I cannot find how to change the nullable property in a Dataset
> schema. Even when I wrote the following program, the nullable property for
> "value: array" in ds2.schema is not changed.
> If my understanding is correct, Spark 2.0 uses an ExpressionEncoder that
> is generated from Dataset[T] at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L46
>
> class Test extends QueryTest with SharedSQLContext {
>   import testImplicits._
>   test("test") {
>     val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS
>     val schema = new StructType().add("array", ArrayType(IntegerType, false), false)
>     val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[Array[Int]], false)
>     val encoder = new ExpressionEncoder[Array[Int]](schema, true,
>       ScalaReflection.serializerFor[Array[Int]](inputObject).flatten,
>       ScalaReflection.deserializerFor[Array[Int]],
>       ClassTag[Array[Int]](classOf[Array[Int]]))
>     val ds2 = ds1.map(e => e)(encoder)
>     ds1.printSchema
>     ds2.printSchema
>   }
> }
>
> root
>  |-- value: array (nullable = true)
>  |    |-- element: integer (containsNull = false)
>
> root
>  |-- value: array (nullable = true)          // Expected (nullable = false)
>  |    |-- element: integer (containsNull = false)
>
>
> Kazuaki Ishizaki
>


Re: Change nullable property in Dataset schema

2016-08-14 Thread Jacek Laskowski
On Wed, Aug 10, 2016 at 12:04 AM, Kazuaki Ishizaki wrote:

>   import testImplicits._
>   test("test") {
>     val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS

You should just Seq(...).toDS

> val ds2 = ds1.map(e => e)

Why do you use e => e? It's the identity function, so it does nothing.

>   .as(RowEncoder(new StructType()
>  .add("value", ArrayType(IntegerType, false), nullable = false)))

I didn't know that was possible, but it looks like toDF is where you could
replace the schema too (in a less involved way).

I learnt quite a lot from just a single email. Thanks!

Pozdrawiam,
Jacek




Re: Change nullable property in Dataset schema

2016-08-10 Thread Kazuaki Ishizaki
After some investigation, I was able to change the nullable property of a 
Dataset[Array[Int]] in the following way. Is this the right way?

(1) Apply https://github.com/apache/spark/pull/13873
(2) Use two Encoders: one is a RowEncoder, the other is the predefined 
ExpressionEncoder.

class Test extends QueryTest with SharedSQLContext {
  import testImplicits._
  test("test") {
    val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS
    val ds2 = ds1.map(e => e)
      .as(RowEncoder(new StructType()
        .add("value", ArrayType(IntegerType, false), nullable = false)))
      .as(newIntArrayEncoder)
    ds1.printSchema
    ds2.printSchema
  }
}

root
 |-- value: array (nullable = true)
 |    |-- element: integer (containsNull = false)

root
 |-- value: array (nullable = false)
 |    |-- element: integer (containsNull = false)


Kazuaki Ishizaki



From:   Kazuaki Ishizaki/Japan/IBM@IBMJP
To: user@spark.apache.org
Date:   2016/08/03 23:46
Subject:Change nullable property in Dataset schema



Dear all,
Could you let me know how to change the nullable property in a Dataset 
schema?

When I looked for how to change the nullable property in a DataFrame 
schema, I found the following approaches.
http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe

https://github.com/apache/spark/pull/13873 (not merged yet)

However, I cannot find how to change the nullable property in a Dataset 
schema. Even when I wrote the following program, the nullable property for 
"value: array" in ds2.schema is not changed.
If my understanding is correct, Spark 2.0 uses an ExpressionEncoder that 
is generated from Dataset[T] at 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L46


class Test extends QueryTest with SharedSQLContext {
  import testImplicits._
  test("test") {
    val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS
    val schema = new StructType().add("array", ArrayType(IntegerType, false), false)
    val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[Array[Int]], false)
    val encoder = new ExpressionEncoder[Array[Int]](schema, true,
      ScalaReflection.serializerFor[Array[Int]](inputObject).flatten,
      ScalaReflection.deserializerFor[Array[Int]],
      ClassTag[Array[Int]](classOf[Array[Int]]))
    val ds2 = ds1.map(e => e)(encoder)
    ds1.printSchema
    ds2.printSchema
  }
}

root
 |-- value: array (nullable = true)
 |    |-- element: integer (containsNull = false)

root
 |-- value: array (nullable = true)          // Expected (nullable = false)
 |    |-- element: integer (containsNull = false)


Kazuaki Ishizaki