[
https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200632#comment-14200632
]
Gabriel Reid commented on CRUNCH-480:
-------------------------------------
Having thought about this a bit more, I see that I was over-simplifying
things with my proposed fix of just doing the equivalent of
{{AvroReadSupport.setAvroReadSchema}} when a custom schema is provided. That
approach couples the two settings: a projection schema would always imply a
custom read schema, and vice versa.
I guess the situations that need to be supported are:
* no projection and use the write schema for reading
* use projection, but use the write schema for reading (which means some fields
will just be null)
* use projection and a custom read schema
I'm not clear if a custom read schema without a projection is something that
would be needed. [~esammer], could you elaborate on your use case?
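To make the cases above concrete, here is a minimal sketch of setting the projection and the read schema independently. It is not Crunch or Parquet code: a plain {{Map}} stands in for the Hadoop configuration, {{ReadSchemaConfig}} and {{example.projection}} are hypothetical names invented for this illustration, and only {{avro.read.schema}} comes from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Crunch or Parquet.
final class ReadSchemaConfig {
    // Key named in the issue description.
    static final String READ_SCHEMA_KEY = "avro.read.schema";
    // Assumed placeholder key for the projection; the real key may differ.
    static final String PROJECTION_KEY = "example.projection";

    /**
     * Builds the settings for one of the cases listed above.
     * Either argument may be null, meaning "not supplied".
     */
    static Map<String, String> configure(String projectionJson, String readSchemaJson) {
        Map<String, String> conf = new HashMap<>();
        if (projectionJson != null) {
            // A projection only limits which columns are read.
            conf.put(PROJECTION_KEY, projectionJson);
        }
        if (readSchemaJson != null) {
            // A custom read schema is set only when explicitly supplied,
            // so a projection no longer implies one.
            conf.put(READ_SCHEMA_KEY, readSchemaJson);
        }
        return conf;
    }
}
```

Calling {{configure(null, null)}} leaves both keys unset (write schema, no projection), {{configure(p, null)}} sets only the projection, and {{configure(p, r)}} sets both.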
I'm guessing that using a projection
> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
> Key: CRUNCH-480
> URL: https://issues.apache.org/jira/browse/CRUNCH-480
> Project: Crunch
> Issue Type: Bug
> Components: IO
> Affects Versions: 0.10.0
> Reporter: E. Sammer
> Assignee: Gabriel Reid
> Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration
> param required to use a user-supplied read schema that differs from the
> schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I found
> this:
> {code}
> this.recordConverter = readSupport.prepareForRead(
>     configuration, extraMetadata, fileSchema,
>     new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore
> the supplied requestedSchema and, instead, looks for the key avro.read.schema
> in the readSupportMetadata map. This is seriously kooky code in Parquet
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can
> never properly supply a read schema. Boooo hisssss.
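
The lookup described in the report can be modelled with a small toy, shown here as a sketch rather than Parquet's actual code. {{ReadSchemaLookup}} and {{resolveReadSchema}} are hypothetical names, schemas are plain strings, and the fallback to the file's schema when the key is absent is an assumption made for the illustration.

```java
import java.util.Map;

// Toy model of the reported behaviour; not Parquet's implementation.
final class ReadSchemaLookup {
    // Key named in the issue description.
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    static String resolveReadSchema(String fileSchema,
                                    String requestedSchema,
                                    Map<String, String> readSupportMetadata) {
        // requestedSchema is deliberately never consulted, mirroring the report.
        if (readSupportMetadata != null
                && readSupportMetadata.containsKey(READ_SCHEMA_KEY)) {
            return readSupportMetadata.get(READ_SCHEMA_KEY);
        }
        // Assumption: without the key, the schema stored with the file is used.
        return fileSchema;
    }
}
```

In this model, a caller that passes a requested schema but an empty or null metadata map always gets the file's schema back, which matches the symptom reported here.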