[jira] [Created] (SPARK-33940) allow configuring the max column name length in csv writer

Nan Zhu (Jira) Tue, 29 Dec 2020 23:49:36 -0800

Nan Zhu created SPARK-33940:
-------------------------------

             Summary: allow configuring the max column name length in csv writer
                 Key: SPARK-33940
                 URL: https://issues.apache.org/jira/browse/SPARK-33940
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Nan Zhu



The CSV writer has an implicit limit on column name length, imposed by
univocity-parsers.

 

When a writer is initialized
[https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211],
it calls toIdentifierGroupArray, which eventually calls valueOf in
NormalizedString.java
([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209]).

 

valueOf looks the name up through stringCache.get, which applies a
maxStringLength cap
[https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104]
of 1024 by default.
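
To make the failure mode concrete, here is a minimal sketch of a length-capped cache, assuming the behavior described above (inputs over the cap are not cached and come back as null). CappedCache and its trim/lowercase normalization are illustrative stand-ins, not univocity's actual code.

```scala
import scala.collection.mutable

// Simplified stand-in for a length-capped string cache.
// The class name and the normalization step are hypothetical.
final class CappedCache(maxStringLength: Int = 1024) {
  private val cache = mutable.Map.empty[String, String]

  // Null or over-long inputs bypass the cache and yield null.
  def get(input: String): String =
    if (input == null || input.length > maxStringLength) null
    else cache.getOrElseUpdate(input, input.trim.toLowerCase)
}

object CappedCacheDemo {
  def main(args: Array[String]): Unit = {
    val cache = new CappedCache()
    val atCap = cache.get("c" * 1024)   // exactly at the cap: a value is returned
    val overCap = cache.get("c" * 1025) // one character over: null
    println(s"at cap is null: ${atCap == null}, over cap is null: ${overCap == null}")
  }
}
```

A caller that assumes the lookup always returns a value will then dereference null when it processes the header, which matches the kind of failure reported in this issue.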

 

This limit is not exposed as a configurable option, so writing a column name
longer than 1024 characters fails with a NullPointerException:

 

```

[info]   Cause: java.lang.NullPointerException:
[info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
[info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
[info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
[info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
[info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
[info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
[info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
[info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)

```

 

It can be reproduced with a simple unit test:

 

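```

import org.apache.spark.sql.Row
import spark.implicits._ // for toDF; assumes a SparkSession named `spark`

val row1 = Row("a")
// A 1025-character column name, one character over the default cap of 1024.
val superLongHeader = "c" * 1025
val df = Seq(row1.getString(0)).toDF(superLongHeader)
// dataPath is the output directory supplied by the surrounding test.
df.repartition(1)
  .write
  .option("header", "true")
  // Option name as proposed in this issue.
  .option("maxColumnNameLength", 1025)
  .csv(dataPath)

```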
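
As long as the cap is not configurable, a caller could guard against the failure by checking column names before writing. The following is only a sketch of such a pre-check: ColumnNameCheck and overlongColumns are hypothetical helpers, not part of Spark or univocity-parsers, and 1024 is the default cap cited above.

```scala
object ColumnNameCheck {
  // Default cap cited in this issue.
  val DefaultMaxColumnNameLength: Int = 1024

  // Returns the column names whose length exceeds `maxLength`.
  def overlongColumns(
      names: Seq[String],
      maxLength: Int = DefaultMaxColumnNameLength): Seq[String] =
    names.filter(_.length > maxLength)
}

object ColumnNameCheckDemo {
  def main(args: Array[String]): Unit = {
    val names = Seq("id", "c" * 1025)
    val bad = ColumnNameCheck.overlongColumns(names)
    println(s"${bad.size} column name(s) exceed the cap")
  }
}
```

With a DataFrame, the names could be passed as `ColumnNameCheck.overlongColumns(df.columns.toSeq)`, so the job fails early with a clear message instead of an NPE inside the writer.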

 



