[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

2018-08-03 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/21952


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

2018-08-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21952#discussion_r207444698
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
@@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, 
rootAvroType: Schema, nullable:
   case (f1, f2) => newConverter(f1.dataType, 
resolveNullableType(f2.schema(), f1.nullable))
 }
 val numFields = catalystStruct.length
+val containsNull = catalystStruct.exists(_.nullable)
--- End diff --

Let's remove it. We can fix the issue that Spark always turn schema to 
nullable later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

2018-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21952#discussion_r207444381
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
@@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, 
rootAvroType: Schema, nullable:
   case (f1, f2) => newConverter(f1.dataType, 
resolveNullableType(f2.schema(), f1.nullable))
 }
 val numFields = catalystStruct.length
+val containsNull = catalystStruct.exists(_.nullable)
--- End diff --

Was addressing the feedback from @gengliangwang We can remove it since the 
cases when all the fields are not nullable will be probably fairly rare. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

2018-08-02 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21952#discussion_r207444037
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
@@ -100,17 +100,20 @@ class AvroSerializer(rootCatalystType: DataType, 
rootAvroType: Schema, nullable:
   et, resolveNullableType(avroType.getElementType, containsNull))
 (getter, ordinal) => {
   val arrayData = getter.getArray(ordinal)
-  val result = new java.util.ArrayList[Any]
--- End diff --

My previous experience in ml project told me that `ArrayList` has slower 
setter performance due to one extra function call, so my preference is using 
array as much as possible, and wrap it into the right container in the end.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

2018-08-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21952#discussion_r207443421
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
@@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, 
rootAvroType: Schema, nullable:
   case (f1, f2) => newConverter(f1.dataType, 
resolveNullableType(f2.schema(), f1.nullable))
 }
 val numFields = catalystStruct.length
+val containsNull = catalystStruct.exists(_.nullable)
--- End diff --

this only works when all the fields are not nullable, I don't think it's 
very useful.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

2018-08-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21952#discussion_r207443196
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
@@ -100,17 +100,20 @@ class AvroSerializer(rootCatalystType: DataType, 
rootAvroType: Schema, nullable:
   et, resolveNullableType(avroType.getElementType, containsNull))
 (getter, ordinal) => {
   val arrayData = getter.getArray(ordinal)
-  val result = new java.util.ArrayList[Any]
--- End diff --

can we just `new java.util.ArrayList[Any](len)` here instead of creating an 
array and wrap it with array list?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org