[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823025#comment-16823025 ]
Uwe L. Korn commented on ARROW-4139:
------------------------------------

Have a look at https://github.com/apache/arrow/pull/2623/files which implements predicate pushdown in Arrow on the Python side. There should also be some code in there that handles the physical-to-logical conversion.

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is
> set
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-4139
>                 URL: https://issues.apache.org/jira/browse/ARROW-4139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>              Labels: parquet, python
>             Fix For: 0.14.0
>
>
> When writing Pandas data to Parquet format and reading it back again, I find
> that statistics of text columns are stored as byte arrays rather than as
> unicode text.
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding
> of how best to manage statistics. (I'd be quite happy to learn that it was
> the latter.)
> Here is a minimal example:
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
>
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
>
> >>> c
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
>   file_offset: 63
>   file_path:
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
>       has_min_max: True
>       min: b'a'
>       max: b'a'
>       null_count: 0
>       distinct_count: 0
>       num_values: 1
>   physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
>
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like
> UNICODE, though I don't have enough experience with Parquet data types to
> know if this is a good idea or possible.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)