[jira] [Commented] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483468#comment-16483468 ]

Apache Spark commented on SPARK-24204:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21389

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
>                 Key: SPARK-24204
>                 URL: https://issues.apache.org/jira/browse/SPARK-24204
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Takeshi Yamamuro
>            Priority: Minor
>
> *SUMMARY*
>  - CSV: raises an analysis exception
>  - JSON: drops columns with null types
>  - Parquet/ORC: raise runtime exceptions
>
> The native ORC file format throws an exception with a meaningless message
> on the executor side when unsupported types are passed:
> {code}
> scala> import org.apache.spark.sql.Row
> scala> import org.apache.spark.sql.types._
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 'struct'
>         at org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
>         at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
>         at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
>         at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
>         at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
>         at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226)
>         at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
>         at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
>         at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side, as the CSV
> format already does:
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
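
For illustration only, the driver-side check proposed in the issue could look
roughly like the sketch below. This is not the code from the linked pull
request or from CSVFileFormat. The object name WriteSchemaCheck and both
method names are hypothetical, and the exception type and message are
assumptions. Only the org.apache.spark.sql.types classes are real Spark API.
The idea is to walk the schema before any task is launched and fail fast when
a NullType column is found, so the user never sees the executor-side ORC
parser error.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark: a minimal sketch of a
// driver-side write-schema check.
object WriteSchemaCheck {

  // Returns true if the type is NullType or contains NullType anywhere
  // inside a struct, array, or map.
  def containsNullType(dataType: DataType): Boolean = dataType match {
    case NullType => true
    case StructType(fields) => fields.exists(f => containsNullType(f.dataType))
    case ArrayType(elementType, _) => containsNullType(elementType)
    case MapType(keyType, valueType, _) =>
      containsNullType(keyType) || containsNullType(valueType)
    case _ => false
  }

  // Throws before the write starts if any top-level column has (or nests)
  // NullType. The exception type and message are assumptions for this sketch.
  def verifyWriteSchema(format: String, schema: StructType): Unit = {
    schema.fields.foreach { field =>
      if (containsNullType(field.dataType)) {
        throw new UnsupportedOperationException(
          s"$format data source does not support " +
            s"${field.dataType.simpleString} data type (column '${field.name}').")
      }
    }
  }
}
```

With the schema from the reproduction above, calling
WriteSchemaCheck.verifyWriteSchema("ORC", schema) before df.write.orc(...)
throws on the driver and names column 'b', rather than failing inside
org.apache.orc.TypeDescription on an executor.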
[jira] [Commented] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471283#comment-16471283 ]

Takeshi Yamamuro commented on SPARK-24204:
--

ok, I'll do it later. Thanks for the description update, too.