[ 
https://issues.apache.org/jira/browse/SPARK-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Singh updated SPARK-18090:
--------------------------------
    Description: 
*Problem Description:*
Reading a small parquet file (single column, single record) with a provided 
schema (StructType(Seq(StructField("field1",StringType,true), 
StructField("hour",StringType,true), StructField("batch",StringType,true)))), and 
with spark.sql.sources.partitionColumnTypeInference.enabled not set (i.e. 
defaulting to true), from a path like 
"<base-path>/hour=2016072313/batch=720b044894e14dcea63829bb4686c7e3" gives 
the following exception:
{code}
java.lang.NegativeArraySizeException
        at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:45)
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:196)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$8.apply(DataSourceStrategy.scala:239)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$8.apply(DataSourceStrategy.scala:238)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{code}
This is completely wrong behavior.

*Steps to Reproduce:*
Run the following commands from the Spark shell (after updating the paths):
{code:scala}
val df = sc.parallelize(Seq(("one", "2016072313", "720b044894e14dcea63829bb4686c7e3"))).toDF("field1", "hour", "batch")
df.write.partitionBy("hour", "batch").parquet("/home/<user>/SmallParquetForTest")
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("field1",StringType,true), StructField("hour",StringType,true), StructField("batch",StringType,true)))
val dfRead = sqlContext.read.schema(schema).parquet("file:///home/<user>/SmallParquetForTest")
dfRead.show()
{code}
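
As a possible workaround (a sketch based on the config named above, not verified against this exact dataset), disabling partition column type inference makes Spark read the partition values back as strings, which matches the provided schema:
{code:scala}
// Workaround sketch: with inference disabled, "hour" and "batch" come back
// as strings, so the provided StringType schema no longer conflicts.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val dfReadNoInfer = sqlContext.read.schema(schema).parquet("file:///home/<user>/SmallParquetForTest")
dfReadNoInfer.show()
{code}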

*Root Cause:*
I did some analysis by debugging this in Spark and found that the partition 
projection uses the inferred schema and generates a row with "hour" as an 
integer. Later on, the final projection uses the provided schema and reads 
"hour" as a string from the row generated by the partition projection. While 
reading "hour" as a string, its integer value 2016072313 is interpreted as the 
size of the string to be read, which causes the byte buffer size to overflow.
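
To make the size misinterpretation concrete, here is a minimal sketch against Catalyst internals (an illustration of the mechanism, not the exact code path Spark executes; UnsafeProjection and InternalRow are real Catalyst classes):
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._

// Write "hour" with the *inferred* type (IntegerType), as the partition
// projection does.
val writeSchema = StructType(Seq(StructField("hour", IntegerType, true)))
val unsafeRow = UnsafeProjection.create(writeSchema)(InternalRow(2016072313))

// Read the same ordinal with the *provided* type (StringType), as the final
// projection does: the 8-byte field slot is decoded as (offset, size), so the
// integer value 2016072313 becomes the "string" length. Copying such a value
// into another row makes BufferHolder.grow compute a negative array size.
val bogus = unsafeRow.getUTF8String(0)
println(bogus.numBytes()) // 2016072313
{code}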

*Expected Behavior:*
Either there should be an error saying that the inferred type and the provided 
type for a partition column do not match, or the provided type should be used 
when generating the partition projection.
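
For the first option, a hypothetical up-front check along these lines (checkPartitionColumnTypes is an illustrative name, not an existing Spark function) would surface the mismatch immediately instead of corrupting the projection:
{code:scala}
import org.apache.spark.sql.types.StructType

// Hypothetical validation sketch: fail fast when the user-provided type of a
// partition column disagrees with the inferred one.
def checkPartitionColumnTypes(provided: StructType, inferred: StructType): Unit = {
  for {
    inf  <- inferred
    prov <- provided.find(_.name == inf.name)
  } require(prov.dataType == inf.dataType,
    s"Partition column '${inf.name}': provided type ${prov.dataType} " +
      s"does not match inferred type ${inf.dataType}")
}
{code}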

> NegativeArraySize exception while reading parquet when inferred type and provided type for partition column are different
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18090
>                 URL: https://issues.apache.org/jira/browse/SPARK-18090
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Kapil Singh
>


