[jira] [Updated] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24204:
--
Description: 
*SUMMARY*
- CSV: raises an AnalysisException on the driver side (see the sketch after this list).
- JSON: silently drops columns with null types.
- Parquet/ORC: raise runtime exceptions on the executor side.
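For reference, a hedged sketch of what the CSV and JSON paths do with the same two-column (int, null) DataFrame built in the ORC repro below; the exact error message and the schema inferred on read may differ across Spark versions:
{code}
scala> df.write.csv("/tmp/csv")
org.apache.spark.sql.AnalysisException: CSV data source does not support null data type.;

scala> df.write.json("/tmp/json")   // succeeds, but the NullType column "b" is silently dropped
scala> spark.read.json("/tmp/json").printSchema()
root
 |-- a: long (nullable = true)
{code}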

The native ORC file format throws an exception with a meaningless message on the executor side when unsupported types are passed:
{code}
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._
scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, null)))
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("b", NullType) :: Nil)
scala> val df = spark.createDataFrame(rdd, schema)
scala> df.write.orc("/tmp/orc")
java.lang.IllegalArgumentException: Can't parse category at 'struct<a:int,b:null^>'
  at org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
  at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
  at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
  at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
  at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
  at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226)
  at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
  at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
{code}
It seems better to verify the write schema on the driver side, as the CSV format already does:
https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65
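Below is a rough sketch of what such driver-side verification could look like, modeled on the CSVFileFormat.verifySchema linked above. This is illustrative only, not the actual patch: Spark internally throws AnalysisException, which cannot be constructed from user code, so a plain IllegalArgumentException stands in here.
{code}
import org.apache.spark.sql.types._

// Recursively check a write schema on the driver and fail fast with a
// readable message, instead of a meaningless executor-side exception.
def verifyWriteSchema(format: String, schema: StructType): Unit = {
  def verifyType(dataType: DataType): Unit = dataType match {
    case ByteType | ShortType | IntegerType | LongType | FloatType |
         DoubleType | BooleanType | StringType | BinaryType |
         DateType | TimestampType | _: DecimalType =>  // supported atomic types
    case st: StructType => st.fields.foreach(f => verifyType(f.dataType))
    case ArrayType(elementType, _) => verifyType(elementType)
    case MapType(keyType, valueType, _) =>
      verifyType(keyType)
      verifyType(valueType)
    case _ =>
      // NullType (and anything else unsupported) is rejected here.
      throw new IllegalArgumentException(
        s"$format data source does not support ${dataType.simpleString} data type.")
  }
  schema.fields.foreach(f => verifyType(f.dataType))
}

// e.g. verifyWriteSchema("ORC", schema) throws:
//   java.lang.IllegalArgumentException: ORC data source does not support null data type.
{code}
Each FileFormat could invoke such a check from prepareWrite, so an unsupported schema fails before any write tasks are launched.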

  was: (the previous description: identical, minus the *SUMMARY* section)


> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>

[jira] [Updated] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24204:
--
Summary: Verify a write schema in Json/Orc/ParquetFileFormat  (was: Verify 
a write schema in Orc/ParquetFileFormat)

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>