[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525795#comment-17525795
 ] 

Timothy Miller commented on PARQUET-1681:
-----------------------------------------

Have a look at my further analysis of PARQUET-2069. I don't think my fix for 
2069 will also fix 1681, but it seems to be the same kind of problem, 
happening in the same place. Basically, isElementType is returning the wrong 
result because there are types that Avro should treat as compatible, but it 
has not been properly informed about that. In the case of 2069, "list" and 
"array" were being considered incompatible, so I fixed it by hacking 
isElementType to inform Avro of that compatibility. We may need to completely 
rethink isElementType so that it is smarter about detecting reader/writer 
compatibility. In the case of 1681, "phones_items" is evidently some kind of 
list/array type, so it should test positive as compatible with another 
list/array type simply because both contain multiple members.
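
To make the name mismatch concrete, here is a toy model in plain Java. It is 
not the parquet-avro or Avro code: RecordShape, strictCompatible and 
structuralCompatible are hypothetical names made up for this sketch, and the 
real check goes through Avro's checkReaderWriterCompatibility(). It only shows 
how a comparison that includes the record name rejects "phones_items" versus 
"array", while a purely structural comparison accepts the same pair.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these types exist in Avro or parquet-avro. */
public class ElementTypeSketch {

    /** Hypothetical stand-in for a record schema: a name plus its field names. */
    public static final class RecordShape {
        final String name;
        final List<String> fieldNames;

        public RecordShape(String name, List<String> fieldNames) {
            this.name = name;
            this.fieldNames = fieldNames;
        }
    }

    /** Strict check: the record names must match and the writer must supply every reader field. */
    public static boolean strictCompatible(RecordShape reader, RecordShape writer) {
        return reader.name.equals(writer.name)
                && writer.fieldNames.containsAll(reader.fieldNames);
    }

    /** Relaxed check: ignore the record name and compare the fields only. */
    public static boolean structuralCompatible(RecordShape reader, RecordShape writer) {
        return writer.fieldNames.containsAll(reader.fieldNames);
    }

    public static void main(String[] args) {
        // Element record as named in the Avro schema from the issue.
        RecordShape fromAvro = new RecordShape("phones_items", Arrays.asList("phone_number"));
        // Same fields, but named "array" as in the Parquet file schema.
        RecordShape fromFile = new RecordShape("array", Arrays.asList("phone_number"));

        System.out.println("strict: " + strictCompatible(fromAvro, fromFile));
        System.out.println("structural: " + structuralCompatible(fromAvro, fromFile));
    }
}
```

In this toy model the strict check fails only because of the differing names. 
Whether a structural check is the right replacement in isElementType is an 
open question, not something this sketch settles.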

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1681
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1681
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.10.0, 1.9.1, 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Critical
>
> When using the Avro schema below to write a parquet(1.8.1) file and then 
> reading it back with parquet 1.10.1 without passing any schema, the read 
> throws an exception "XXX is not a group". Reading through parquet 1.8.1 is 
> fine. 
>            {
>               "name": "phones",
>               "type": [
>                 "null",
>                 {
>                   "type": "array",
>                   "items": {
>                     "type": "record",
>                     "name": "phones_items",
>                     "fields": [
>                       {
>                         "name": "phone_number",
>                         "type": [
>                           "null",
>                           "string"
>                         ],
>                         "default": null
>                       }
>                     ]
>                   }
>                 }
>               ],
>               "default": null
>             }
> The code used to read the file is below: 
>     val reader = AvroParquetReader.builder[SomeRecordType](parquetPath)
>       .withConf(new Configuration).build()
>     reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() to check compatibility. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the 
> Avro schema (converted from the file schema) incompatible: the name in the 
> Avro schema is ‘phones_items’, but the name in the Parquet schema is ‘array’. 
> isElementType() therefore returns false, which causes the “phone_number” 
> field in the above schema to be treated as a group type, which it is not. 
> The exception is then thrown at .asGroupType(). 
> I didn’t check whether writing via parquet 1.10.1 reproduces the same 
> problem. It could, because the translation of the Avro schema to the Parquet 
> schema has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
