Pierre Gramme created SPARK-32618:
-------------------------------------

             Summary: ORC writer doesn't support colon in column names
                 Key: SPARK-32618
                 URL: https://issues.apache.org/jira/browse/SPARK-32618
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.3.0
            Reporter: Pierre Gramme


Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct<a:b^:int>'}} when writing to ORC a DataFrame whose column names 
contain a colon ({{:}}). The code below reproduces it. The same problem occurs 
when the name with a colon is nested as a field of a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed XML, so other users may be affected too.

Has this been fixed since Spark 2.3.0? (Sorry, I can't easily test that.)

Is there any workaround? Replacing all colons with underscores in column names 
would be acceptable for me, but that is not easy to do across a large set of 
nested struct columns...
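
For reference, the rename described above could be sketched roughly as follows. This is only a sketch under assumptions: {{sanitize}} and {{withoutColons}} are hypothetical helper names (not part of Spark), it has not been tested against 2.3.0, and it does not handle collisions such as {{a:b}} and {{a_b}} both existing in the same struct. It rebuilds the schema recursively and re-applies it to the same rows, which works because row values are positional.

{code:java}
 import org.apache.spark.sql.DataFrame
 import org.apache.spark.sql.types._

 // Hypothetical helper: copy a data type, replacing ':' with '_' in every
 // struct field name, including fields nested in structs, arrays and maps.
 def sanitize(dt: DataType): DataType = dt match {
   case s: StructType =>
     StructType(s.fields.map { f =>
       f.copy(name = f.name.replace(':', '_'), dataType = sanitize(f.dataType))
     })
   case a: ArrayType => a.copy(elementType = sanitize(a.elementType))
   case m: MapType =>
     m.copy(keyType = sanitize(m.keyType), valueType = sanitize(m.valueType))
   case other => other
 }

 // Hypothetical helper: same rows, sanitized schema.
 // Goes through the RDD API, so it may be slower than a pure column expression.
 def withoutColons(df: DataFrame): DataFrame = {
   val newSchema = sanitize(df.schema).asInstanceOf[StructType]
   df.sparkSession.createDataFrame(df.rdd, newSchema)
 }
{code}

Usage would then be something like {{withoutColons(df).write.orc("some_path")}}, where {{df}} is the DataFrame to export and the path is a placeholder.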

Thanks

 

 
{code:java}
 import spark.implicits._ // already in scope in spark-shell; needed for toDF and $"..."
 import org.apache.spark.sql.functions.struct

 spark.conf.set("spark.sql.orc.impl", "native")

 // Top-level column whose name contains a colon
 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 'struct<a:b^:int>'

 // Same column name nested inside a struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColonStruct.write.orc("test_colon_struct")
 // Also fails with IllegalArgumentException: Can't parse category
{code}
 

 


