[jira] [Updated] (PARQUET-2464) toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit

Abhishek Dixit (Jira) Wed, 01 May 2024 02:24:09 -0700


     [ 
https://issues.apache.org/jira/browse/PARQUET-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Abhishek Dixit updated PARQUET-2464:
------------------------------------
    Description: 
The toParquetMetadata method in ParquetMetadataConverter converts 
org.apache.parquet.hadoop.metadata.ParquetMetadata to 
org.apache.parquet.format.FileMetaData, but it does not set the dictionary 
page offset bit in FileMetaData.

When a FileMetaData object is serialized while writing to the footer and then 
deserialized, the dictionary offset is lost as the dictionary page offset bit 
was never set.
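
The loss described above can be sketched with a small model of an optional 
field. This is only an illustration: OptionalOffsetSketch is a hypothetical 
stand-in, not the real org.apache.parquet.format class, and the map-based 
write/read pair is an assumed simplification of footer serialization. It 
assumes that an optional field is written only when its "is set" bit is true.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a generated struct with one optional field.
// It is NOT the real org.apache.parquet.format.ColumnMetaData.
class OptionalOffsetSketch {
    static final String KEY = "dictionary_page_offset";

    private long dictionaryPageOffset;        // defaults to 0
    private boolean dictionaryPageOffsetSet;  // the "is set" bit

    // Setting the value also sets the bit.
    void setDictionaryPageOffset(long offset) {
        this.dictionaryPageOffset = offset;
        this.dictionaryPageOffsetSet = true;
    }

    boolean isSetDictionaryPageOffset() {
        return dictionaryPageOffsetSet;
    }

    long getDictionaryPageOffset() {
        return dictionaryPageOffset;
    }

    // Assumed serialization rule: an optional field is written only if its bit is set.
    Map<String, Long> write() {
        Map<String, Long> out = new HashMap<>();
        if (dictionaryPageOffsetSet) {
            out.put(KEY, dictionaryPageOffset);
        }
        return out;
    }

    // Deserialization can only restore fields that were actually written.
    static OptionalOffsetSketch read(Map<String, Long> in) {
        OptionalOffsetSketch result = new OptionalOffsetSketch();
        Long value = in.get(KEY);
        if (value != null) {
            result.setDictionaryPageOffset(value);
        }
        return result;
    }

    public static void main(String[] args) {
        // The setter is never called, as in the bug: the offset does not survive.
        OptionalOffsetSketch lost = read(new OptionalOffsetSketch().write());
        System.out.println(lost.isSetDictionaryPageOffset()); // false

        // The setter is called: the offset survives the round trip.
        OptionalOffsetSketch source = new OptionalOffsetSketch();
        source.setDictionaryPageOffset(4L);
        OptionalOffsetSketch kept = read(source.write());
        System.out.println(kept.getDictionaryPageOffset()); // 4
    }
}
```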

PARQUET-1850 attempted to fix this, but the fix was only partial.

It calls setDictionary_page_offset only if the encoding stats 
(getEncodingStats) are present:
{code:java}
if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
However, setDictionary_page_offset should also be called when the encoding 
stats are not present but the encodings are.

It should use the hasDictionaryPage implementation in ColumnChunkMetaData, 
shown below:
{code:java}
public boolean hasDictionaryPage() {
  EncodingStats stats = getEncodingStats();
  if (stats != null) {
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
  }

  Set<Encoding> encodings = getEncodings();
  return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
}
{code}
So the new change in ParquetMetadataConverter should look like:

 
{code:java}
if (columnMetaData.hasDictionaryPage()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
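
To make the difference between the two conditions concrete, here is a 
self-contained sketch. Everything in it is a simplified, hypothetical 
stand-in (DictionaryPageCheckSketch, Chunk, Stats and the reduced Encoding 
enum are not the real parquet-mr classes); only the two boolean conditions 
are taken from the snippets quoted in this issue.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified stand-ins; not the real parquet-mr classes.
class DictionaryPageCheckSketch {
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE, RLE_DICTIONARY }

    // Reduced model of encoding stats: only the two answers the checks need.
    static final class Stats {
        final boolean hasDictionaryPages;
        final boolean hasDictionaryEncodedPages;

        Stats(boolean hasDictionaryPages, boolean hasDictionaryEncodedPages) {
            this.hasDictionaryPages = hasDictionaryPages;
            this.hasDictionaryEncodedPages = hasDictionaryEncodedPages;
        }
    }

    // Reduced model of a column chunk: stats may be null, encodings are always present.
    static final class Chunk {
        final Stats stats;
        final Set<Encoding> encodings;

        Chunk(Stats stats, Set<Encoding> encodings) {
            this.stats = stats;
            this.encodings = encodings;
        }
    }

    // Condition quoted in the issue as the behaviour after PARQUET-1850.
    static boolean oldCondition(Chunk c) {
        return c.stats != null && c.stats.hasDictionaryPages;
    }

    // Mirrors the hasDictionaryPage() logic quoted in the issue.
    static boolean proposedCondition(Chunk c) {
        if (c.stats != null) {
            return c.stats.hasDictionaryPages && c.stats.hasDictionaryEncodedPages;
        }
        return c.encodings.contains(Encoding.PLAIN_DICTIONARY)
            || c.encodings.contains(Encoding.RLE_DICTIONARY);
    }

    public static void main(String[] args) {
        // No encoding stats, but a dictionary encoding is listed.
        Chunk noStats = new Chunk(null, EnumSet.of(Encoding.RLE, Encoding.RLE_DICTIONARY));
        System.out.println(oldCondition(noStats));      // false: offset would not be set
        System.out.println(proposedCondition(noStats)); // true: offset would be set
    }
}
```

In this model the two conditions also differ when the stats report dictionary 
pages but no dictionary-encoded data pages: the old condition is true and the 
proposed one is false. Whether that case matters for the actual fix is not 
stated in this issue.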

  was:
toParquetMetadata method converts 
org.apache.parquet.hadoop.metadata.ParquetMetadata to 
org.apache.parquet.format.FileMetaData but this does not set the dictionary 
page offset bit in FileMetaData.

When a FileMetaData object is serialized while writing to the footer and then 
deserialized, the dictionary offset is lost as the dictionary page offset bit 
was never set.

PARQUET-1850 attempted to fix this, but the fix was only partial.

It calls setDictionary_page_offset only if the encoding stats 
(getEncodingStats) are present:
{code:java}
if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
However, setDictionary_page_offset should also be called when the encoding 
stats are not present but the encodings are.

It should use the hasDictionaryPage implementation in ColumnChunkMetaData, 
shown below:
{code:java}
public boolean hasDictionaryPage() {
  EncodingStats stats = getEncodingStats();
  if (stats != null) {
    // ensure there is a dictionary page and that it is used to encode data pages
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
  }

  Set<Encoding> encodings = getEncodings();
  return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
}
{code}
So the new change in ParquetMetadataConverter should look like:

 
{code:java}
if (columnMetaData.hasDictionaryPage()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}


> toParquetMetadata method in ParquetMetadataConverter does not set dictionary 
> page offset bit
> --------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2464
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2464
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Abhishek Dixit
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
