[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-33940:
------------------------------------

    Assignee:     (was: Apache Spark)

> allow configuring the max column name length in csv writer
> ----------------------------------------------------------
>
>                 Key: SPARK-33940
>                 URL: https://issues.apache.org/jira/browse/SPARK-33940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Nan Zhu
>            Priority: Major
>
> The CSV writer has an implicit limit on column name length due to univocity-parsers.
>
> When we initialize a writer
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211]),
> it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209]).
>
> That stringCache.get call has a maxStringLength cap
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104]),
> which is 1024 by default, so names longer than that come back from the cache as null.
>
> We do not expose this as a configurable option, so writing a DataFrame with a column name longer than 1024 characters fails with an NPE:
>
> ```
> [info] Cause: java.lang.NullPointerException:
> [info] at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info] at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info] at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info] at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info] at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info] at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
> [info] at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
> [info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>
> It can be reproduced by a simple unit test:
>
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>   .write
>   .option("header", "true")
>   .option("maxColumnNameLength", 1025)
>   .csv(dataPath)
> ```

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
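
For reference, the same limit can be exercised with univocity-parsers directly, without Spark. The snippet below is an untested sketch that follows the call chain described in the report; it assumes the univocity-parsers version bundled with Spark 3.1.0 is on the classpath, and it reuses the setHeaders/writeHeaders calls visible in the stack trace. The object name and the use of java.io.StringWriter are illustrative only.

```
import java.io.StringWriter

import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}

// Untested, Spark-free sketch of the column-name-length limit described above.
object LongHeaderNpeSketch {
  def main(args: Array[String]): Unit = {
    // 1025 characters: one more than StringCache's default maxStringLength of 1024.
    val superLongHeader = "c" * 1025

    val settings = new CsvWriterSettings()
    settings.setHeaders(superLongHeader)   // header names are normalized through NormalizedString

    val writer = new CsvWriter(new StringWriter(), settings)
    writer.writeHeaders()                  // expected to fail with the NPE in AbstractWriter.submitRow
    writer.close()
  }
}
```

Note that the maxColumnNameLength option used in the reproduction above is only the proposed name; as the report says, Spark currently exposes no option for this cap.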