hi Erin -- please send a separate e-mail to dev-unsubscr...@arrow.apache.org

Thanks

On Wed, Aug 16, 2017 at 1:06 PM, Erin Sobkow <esob...@parklandvalley.ca> wrote:
> Hi Wes:
>
> Somehow I have been inadvertently added to your list and am getting all these 
> emails that make no sense to me at all.  I'm in on some conversation I know 
> nothing about and am getting up to 20 emails a day from different people.  
> Can I ask you to remove me from your list and can you get all the other 
> people in your group to remove me as well?  Thanks!
>
> Erin Sobkow, BA Kin, RMT
> Community Consultant
> Parkland Valley Sport, Culture & Recreation District
>
> Box 263, Yorkton, SK  S3N 2V7
> Phone: (306) 786-6585
> Fax: (306) 782-0474
> Email:  esob...@parklandvalley.ca
> Website:  www.parklandvalley.ca
>
> If you no longer wish to receive electronic messages from Parkland Valley 
> Sport, Culture & Recreation District, please reply with the word 'STOP'.
>
>
>
> Together...building healthy communities through sport, culture and recreation
>
> -----Original Message-----
> From: Wes McKinney [mailto:wesmck...@gmail.com]
> Sent: August 16, 2017 10:04 AM
> To: dev@arrow.apache.org
> Subject: Re: Major difference between Spark and Arrow Parquet Implementations
>
> hi Lucas,
>
> My understanding is that the Parquet format by itself does not place any such 
> restrictions on the names of fields, and so this is a Spark SQL-specific 
> issue (anyone please correct me if I'm mistaken about this). I would be happy 
> to help add a schema cleaning option to normalize field names for use in 
> Spark. I just opened:
>
> https://issues.apache.org/jira/browse/ARROW-1359
>
> Thanks
> Wes
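>
> A schema-cleaning option like the one proposed above could look roughly like 
> the sketch below; `normalize_field_name` is a hypothetical helper, not an 
> existing pyarrow or Arrow API, and the character set is taken from Spark's 
> error message:

```python
import re

# Characters Spark SQL's Parquet writer rejects in field names
# (per its error message): space , ; { } ( ) \n \t =
_SPARK_INVALID = re.compile(r'[ ,;{}()\n\t=]')

def normalize_field_name(name: str, replacement: str = "_") -> str:
    """Replace Spark-incompatible characters with a safe substitute.

    Hypothetical helper for illustration only.
    """
    return _SPARK_INVALID.sub(replacement, name)

print(normalize_field_name("X Coordinate"))  # X_Coordinate
```

> Running every field name through such a pass before handing data to Spark 
> would sidestep the error Lucas reported below.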
>
> On Wed, Aug 16, 2017 at 11:58 AM, Lucas Pickup 
> <lucas.pic...@microsoft.com.invalid> wrote:
>> Hello,
>>
>> I have been using pyarrow and PySpark to write Parquet files. With pyarrow 
>> I can successfully write out a Parquet file whose column names contain 
>> spaces, e.g. 'X Coordinate'.
>> When I try to write out the same dataset using Spark's Parquet writer, it 
>> fails with:
>> "Attribute name "X Coordinate" contains invalid character(s) among 
>> " ,;{}()\n\t="."
>> It seems that, according to Spark's Parquet implementation, these 
>> characters are not allowed in a Parquet schema because they carry special 
>> meaning.
>> The code that checks this is 
>> here<https://github.com/apache/spark/blob/cba826d00173a945b0c9a7629c66e36fa73b723e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565>.
>>
>> I was wondering if there is a reason the two implementations differ so 
>> significantly when it comes to schema generation.
>>
>> Cheers, Lucas Pickup
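>>
>> For reference, the behavior of the linked Spark check can be mirrored in a 
>> few lines; this is an illustrative Python sketch of the validation, not 
>> Spark's actual Scala code:

```python
# Characters Spark SQL's ParquetSchemaConverter rejects in attribute names,
# as listed in the error message quoted above.
INVALID_CHARS = set(' ,;{}()\n\t=')

def check_field_name(name: str) -> None:
    """Raise if the name would be rejected by Spark's Parquet writer.

    Illustrative sketch only; Spark raises its own exception type.
    """
    if any(c in INVALID_CHARS for c in name):
        raise ValueError(
            f'Attribute name "{name}" contains invalid character(s) '
            'among " ,;{}()\\n\\t="'
        )

check_field_name("X_Coordinate")    # passes silently
# check_field_name("X Coordinate")  # would raise ValueError
```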
>
