Hello,

I have been using pyarrow and PySpark to write Parquet files. With pyarrow I can 
successfully write out a Parquet file whose column names contain spaces, 
e.g. 'X Coordinate'.
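Roughly, the pyarrow side looks like this (a minimal sketch; the file name is 
just a placeholder):

import pyarrow as pa
import pyarrow.parquet as pq

# pyarrow accepts a column name containing a space without complaint.
table = pa.table({"X Coordinate": [1.0, 2.0, 3.0]})
pq.write_table(table, "coords.parquet")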
When I try to write out the same dataset using Spark's Parquet writer, it fails 
claiming:
"Attribute name "X Coordinate" contains invalid character(s) among " ,;{}()\n\t="."
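A minimal sketch that reproduces this for me (paths and values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,)], ["X Coordinate"])
# Raises an AnalysisException about the space in "X Coordinate".
df.write.parquet("coords_spark.parquet")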
It seems that Spark's Parquet implementation disallows those characters in a 
Parquet schema because they carry special meaning.
The code that performs this check is here:
https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565

I was wondering if there was a reason why the implementations have such a major 
difference when it comes to schema generation?

Cheers, Lucas Pickup
