[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-23 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691458#comment-15691458 ]

Apache Spark commented on SPARK-18510:
--------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15997

> Partition schema inference corrupts data
> -----------------------------------------
>
>              Key: SPARK-18510
>              URL: https://issues.apache.org/jira/browse/SPARK-18510
>          Project: Spark
>       Issue Type: Bug
>       Components: SQL, Structured Streaming
> Affects Versions: 2.1.0
>         Reporter: Burak Yavuz
>         Assignee: Burak Yavuz
>         Priority: Blocker
>          Fix For: 2.1.0
>
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when reading, e.g.
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but partition inference infers those columns as IntegerType (or, I assume, 
> LongType or DoubleType, basically any fixed-size type), then once UnsafeRows 
> are generated, the data gets corrupted.
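> My best guess at the mechanism (illustrative only; this pokes at catalyst 
> internals, and the exact code path in the reader may differ): the partition 
> value gets written into the UnsafeRow with the inferred fixed-size type, but 
> read back through the user-specified StringType, so the fixed-size word is 
> decoded as an (offset, size) pointer into the row's variable-length region:
> {code}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
> import org.apache.spark.sql.types._
>
> // Write side: the value is materialized with the *inferred* type (int).
> val proj = UnsafeProjection.create(Array[DataType](IntegerType))
> val row = proj(InternalRow(7))
>
> // Read side: the user schema says StringType, so the same 8-byte word is
> // interpreted as the (offset, size) of a variable-length field; it points
> // at whatever bytes happen to live there.
> println(row.getUTF8String(0))  // garbage, not "7"
> {code}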
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
>   for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10)
>   .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
>   .coalesce(1)
>   .write
>   .partitionBy("part", "id")
>   .mode("overwrite")
>   .parquet(src.toString)
> val schema = new StructType()
>   .add("id", StringType)
>   .add("part", IntegerType)
>   .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is there to make each row long enough that you don't hit other weird 
> NullPointerExceptions, which I believe are caused by the same underlying issue.
> Output:
> {code}
> +-----+----+--------------------+
> |   id|part|                  ex|
> +-----+----+--------------------+
> |    �|   1|[1, 2, 3, 4, 5, 6...|
> |     |   0|[1, 2, 3, 4, 5, 6...|
> |     |   3|[1, 2, 3, 4, 5, 6...|
> |     |   2|[1, 2, 3, 4, 5, 6...|
> |     |   1|  [1, 2, 3, 4, 5, 6]|
> |     |   0|     [1, 2, 3, 4, 5]|
> |     |   3|        [1, 2, 3, 4]|
> |     |   2|           [1, 2, 3]|
> |     |   1|              [1, 2]|
> |     |   0|                 [1]|
> +-----+----+--------------------+
> {code}
> I was hoping to fix this as part of SPARK-18407, but it seems it's not only 
> applicable to Structured Streaming and deserves its own JIRA.
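> A possible workaround until there's a fix (assuming the root cause really is 
> the mismatch between the inferred and user-specified partition column types, 
> so making the two agree should sidestep it): either disable partition type 
> inference so the partition columns stay strings, or drop the user schema, let 
> inference pick the types, and cast afterwards. Sketch:
> {code}
> // Keep partition columns as strings by turning inference off; the
> // user-provided schema then has to declare them as StringType as well.
> spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
>
> // Or skip the user schema entirely and cast back after the load:
> spark.read.parquet(src.toString)
>   .withColumn("id", 'id.cast(StringType))
>   .show()
> {code}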




[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681953#comment-15681953 ]

Apache Spark commented on SPARK-18510:
--------------------------------------

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15951


[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Burak Yavuz (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681677#comment-15681677 ]

Burak Yavuz commented on SPARK-18510:
-------------------------------------

No. Working on a separate fix.


[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680487#comment-15680487 ]

Reynold Xin commented on SPARK-18510:
-------------------------------------

Does your pull request for SPARK-18407 fix this one as well?




[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680328#comment-15680328 ]

Burak Yavuz commented on SPARK-18510:
-------------------------------------

cc [~r...@databricks.com] I marked this as a blocker since it's pretty nasty. 
Feel free to downgrade if you disagree, or if you think the semantics of what 
I'm doing are wrong.
