[https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999420#comment-16999420]

Xinli Shang edited comment on PARQUET-1681 at 12/18/19 6:47 PM:
----------------------------------------------------------------

[~rdblue], I verified that this issue is not related to AVRO-2400: reverting 
the Avro version to 1.8.2 still reproduces it. 

[~fokko], I checked Parquet 1.9.1. It seems PARQUET-651 is present in 1.9.1, 
so it should be affected as well, but I have not verified that yet. 

I wrote a small unit test that reproduces the issue, and I came up with a 
change (still a proof of concept that needs polishing) that solves it: 
[https://github.com/shangxinli/parquet-mr/commit/f80469f55b83404ea334ee4019f658ecdb5ac575].
 An explanation of this fix is below. Note that this change won't fix Parquet 
data files that were already written without it. Reverting PARQUET-651 
(together with a small fix, 
[https://github.com/shangxinli/parquet-mr/commit/10875f5b634466e1a58a615bb21f9af426275247],
 for what I believe is a bug in 1.8.1) would solve it completely.

So I would like to hear the community's thoughts on which route we should take. 

Here is an explanation of the issue and the fix. 

Root cause: org.apache.parquet.avro.AvroSchemaConverter performs a lossy 
conversion. Consider the Avro schema below. 

{"type":"record","name":"myrecord","namespace":"org.apache.parquet.avro","fields":[{"name":"flatrecord","type":"int"},{"name":"nestedrecord","type":{"type":"record","name":"ignored","namespace":"","fields":[{"name":"nestedint1","type":"int"},{"name":"nestedint2","type":"int"}]}}]}

When I convert this Avro schema to a Parquet schema and then back to an Avro 
schema, the inner record name "ignored" is lost and replaced with 
"nestedrecord", the name of its enclosing field. This is why 
checkReaderWriterCompatibility() returns false and causes the issue I 
originally reported. Below is the Avro schema converted back from the Parquet 
schema. 

{"type":"record","name":"myrecord","namespace":"org.apache.parquet.avro","fields":[{"name":"flatrecord","type":"int"},{"name":"nestedrecord","type":{"type":"record","name":"nestedrecord","namespace":"","fields":[{"name":"nestedint1","type":"int"},{"name":"nestedint2","type":"int"}]}}]}
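Since the real converter is AvroSchemaConverter in parquet-mr's Java code, the round trip can only be sketched conceptually here; the Python below uses plain dicts as hypothetical stand-ins for Avro and Parquet schema objects, just to show why the record name is lost.

```python
# Conceptual sketch (NOT the real parquet-mr code): a Parquet group keeps
# the enclosing field's name, but has no slot for the Avro record's own name.

def avro_record_to_parquet_group(field_name, avro_record):
    # The record's name ("ignored" here) is dropped; only the fields survive.
    return {"field_name": field_name, "fields": avro_record["fields"]}

def parquet_group_to_avro_record(group):
    # Converting back, the only name available is the enclosing field's name.
    return {
        "type": "record",
        "name": group["field_name"],  # becomes "nestedrecord", not "ignored"
        "fields": group["fields"],
    }

original = {"type": "record", "name": "ignored",
            "fields": [{"name": "nestedint1", "type": "int"},
                       {"name": "nestedint2", "type": "int"}]}

group = avro_record_to_parquet_group("nestedrecord", original)
restored = parquet_group_to_avro_record(group)
print(restored["name"])  # prints "nestedrecord" -- the name "ignored" is gone
```

The fields round-trip intact; only the record name is lost, which is exactly what makes checkReaderWriterCompatibility() reject the restored schema.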

The change I mentioned above extends the Parquet schema with a metadata 
field. This metadata field is general purpose; in this case we use it to 
carry the inner record name so that it can be restored when the schema is 
converted back to an Avro schema. 
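Conceptually (again a hypothetical Python sketch with made-up names, not the actual code in the commit above), the fix stashes the original record name in the group's metadata on the way in and prefers it on the way out:

```python
# Sketch: carry the Avro record name through the Parquet schema as metadata.

def to_parquet_group_with_metadata(field_name, avro_record):
    # Same lossy group as before, plus a metadata entry holding the real name.
    return {"field_name": field_name,
            "fields": avro_record["fields"],
            "metadata": {"avro.record.name": avro_record["name"]}}

def to_avro_record_restoring_name(group):
    # Restore the original record name if the metadata carried it;
    # otherwise fall back to the old behavior (the field name).
    name = group.get("metadata", {}).get("avro.record.name",
                                         group["field_name"])
    return {"type": "record", "name": name, "fields": group["fields"]}

original = {"type": "record", "name": "ignored",
            "fields": [{"name": "nestedint1", "type": "int"}]}

group = to_parquet_group_with_metadata("nestedrecord", original)
restored = to_avro_record_restoring_name(group)
print(restored["name"])  # prints "ignored" -- the round trip is now lossless
```

The fallback also explains why this cannot repair files already written without the fix: their groups simply never recorded the metadata.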

 

With the change, the converted Avro schema is exactly the same as the original one. 

{"type":"record","name":"myrecord","namespace":"org.apache.parquet.avro","fields":[{"name":"flatrecord","type":"int"},{"name":"nestedrecord","type":{"type":"record","name":"ignored","namespace":"","fields":[{"name":"nestedint1","type":"int"},{"name":"nestedint2","type":"int"}]}}]}

 

This bug is severe: when it is hit, the data cannot be read at all. I don't 
know why nobody has reported it yet; maybe it is a corner case, or maybe 
Parquet 1.10.1 is not widely used in production yet. But once hit, it is 
severe enough to block the business. 

 

 


> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1681
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1681
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.10.0, 1.9.1, 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Critical
>
> When using the Avro schema below to write a parquet(1.8.1) file and then read 
> back by using parquet 1.10.1 without passing any schema, the reading throws 
> an exception "XXX is not a group" . Reading through parquet 1.8.1 is fine. 
>             {
>               "name": "phones",
>               "type": [
>                 "null",
>                 {
>                   "type": "array",
>                   "items": {
>                     "type": "record",
>                     "name": "phones_items",
>                     "fields": [
>                       {
>                         "name": "phone_number",
>                         "type": [
>                           "null",
>                           "string"
>                         ],
>                         "default": null
>                       }
>                     ]
>                   }
>                 }
>               ],
>               "default": null
>             }
> The code to read is as below: 
>     val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
>     reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() for the compatibility check. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the 
> Avro schema (converted from the file schema) incompatible: the record name 
> in the Avro schema is 'phones_items', but the name in the Parquet schema is 
> 'array'. Hence isElementType() returns false, which causes the 
> "phone_number" field in the above schema to be treated as a group type, 
> which it is not. The exception is then thrown at .asGroupType(). 
> I didn't try whether writing via Parquet 1.10.1 reproduces the same problem, 
> but it could, because the translation from Avro schema to Parquet schema has 
> not changed (not verified yet). 
>  I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
