[jira] [Commented] (PARQUET-465) Parquet-Avro does not support field removal

Ryan Blue (JIRA) Tue, 26 Jan 2016 13:29:05 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118075#comment-15118075
 ]


Ryan Blue commented on PARQUET-465:
-----------------------------------

[~eggsby], thanks for the thoroughness! Have you tried using different schemas 
for the projection and read?

It looks like the read schema can handle renames, but the projection schema 
must use the same names as the underlying file (which we should definitely 
fix). What if you try with v5 for your read schema, but this projection schema:

{code}
{ "type": "record",
  "name": "com.example.avro.compatibility.v5.CompatibilityTestRecord",
  "fields": [
    { "name": "notId", "type": "string" }
  ] }
{code}
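
To make the two roles concrete, here is a toy model in plain Java. It is only a sketch of the idea described above, not parquet-avro's implementation: the class `ProjectionVsReadSchema`, its methods, and the column names `extra` and `id` are hypothetical, and a projection name that is missing from the file simply selects nothing here (what parquet-avro does in that case is not modeled).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types or methods come from parquet-avro or Avro. */
final class ProjectionVsReadSchema {

    /** Step 1: the projection selects columns strictly by the names used in the file. */
    static Map<String, Object> project(Map<String, Object> fileRow, List<String> projection) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : projection) {
            if (fileRow.containsKey(column)) {
                out.put(column, fileRow.get(column));
            }
        }
        return out;
    }

    /**
     * Step 2: the read schema maps projected columns to its own field names.
     * Each entry is (read field name -> aliases, i.e. older names it may match).
     */
    static Map<String, Object> read(Map<String, Object> projected,
                                    Map<String, List<String>> readSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> field : readSchema.entrySet()) {
            List<String> candidates = new ArrayList<>();
            candidates.add(field.getKey());
            candidates.addAll(field.getValue());
            for (String candidate : candidates) {
                if (projected.containsKey(candidate)) {
                    out.put(field.getKey(), projected.get(candidate));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> fileRow = new LinkedHashMap<>();
        fileRow.put("notId", "abc"); // column as named in the file
        fileRow.put("extra", 42L);   // hypothetical column the reader does not want

        // The projection uses the file's own name.
        Map<String, Object> projected = project(fileRow, List.of("notId"));

        // The read schema uses a new name ("id", hypothetical) with the old one as alias.
        Map<String, List<String>> readSchema = new LinkedHashMap<>();
        readSchema.put("id", List.of("notId"));

        System.out.println(read(projected, readSchema)); // prints {id=abc}
    }
}
```

In this model, projecting with the new name `id` would select nothing, which is the reason for keeping the projection on the file's names and leaving the rename to the read schema.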

Also, what do you think about adding these cases as tests? I'd love to verify 
this behavior and use your work to ensure that we don't have future 
regressions! We also need to improve this area, and this is a great start for 
figuring out what to improve. On my way to work I was just wondering in which 
cases you need different read and projection schemas, and this answers it.

> Parquet-Avro does not support field removal
> -------------------------------------------
>
>                 Key: PARQUET-465
>                 URL: https://issues.apache.org/jira/browse/PARQUET-465
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.8.0
>            Reporter: Thomas Omans
>
> Parquet-Avro does not support removal of fields when used with the new 
> compatibility layer.
> Given a Parquet file written with parquet-avro at v1 with the following schema:
> {code}
> record FooBar {
>   long foo;
>   string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
>   string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'foo' not found
>       at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like the reader sees the field in the original version and assumes 
> that the new version must expect it too, but in this case the field was 
> simply removed. Avro schema resolution dictates that you just ignore this 
> field, since it is not relevant in the new version.
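
The resolution rule referred to above can be sketched in plain Java. This is a minimal illustration of the rule from the Avro specification (a writer field that the reader schema does not declare is ignored), not the code path in parquet-avro; the class `FieldRemovalResolution` and its `resolve` method are hypothetical, and default values for reader-only fields are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not Avro's or parquet-avro's actual resolver. */
final class FieldRemovalResolution {

    /** Keeps only the fields the reader schema declares; writer-only fields are ignored. */
    static Map<String, Object> resolve(Map<String, Object> written, List<String> readerFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : readerFields) {
            if (!written.containsKey(name)) {
                // Avro would fall back to the reader's default here; this sketch has none.
                throw new IllegalStateException("no value and no default for field: " + name);
            }
            out.put(name, written.get(name));
        }
        return out;
    }

    public static void main(String[] args) {
        // Record as written with v1: record FooBar { long foo; string bar; }
        Map<String, Object> v1Record = new LinkedHashMap<>();
        v1Record.put("foo", 1L);
        v1Record.put("bar", "x");

        // Reader schema v2 only declares "bar", so "foo" is dropped without an error.
        System.out.println(resolve(v1Record, List.of("bar"))); // prints {bar=x}
    }
}
```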



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
