[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829633#comment-16829633 ]
Michael Eaton edited comment on ARROW-4139 at 4/29/19 7:14 PM:
---------------------------------------------------------------
The following comment seems to suggest that there is a latent issue sorting unsigned types, specifically mentioning UTF-8. If this is indeed the case, would the sorting not have to be fixed before progress can continue on this issue?

[https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L140]

was (Author: meaton):
The following comment seems to suggest that there is a latent issue sorting unsigned types, specifically mentioning UTF-8. If this is indeed the case, then the sorting will have to be fixed before progress can continue on this issue.

[https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L140]

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is
> set
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-4139
>                 URL: https://issues.apache.org/jira/browse/ARROW-4139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>              Labels: parquet, pull-request-available, python
>             Fix For: 0.14.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When writing Pandas data to Parquet format and reading it back again I find
> that the statistics of text columns are stored as byte arrays rather than as
> unicode text.
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding
> of how best to manage statistics. (I'd be quite happy to learn that it was
> the latter.)
> Here is a minimal example:
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
>
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> rg = piece.row_group(0)
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
>
> >>> c
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
>   file_offset: 63
>   file_path:
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
>       has_min_max: True
>       min: b'a'
>       max: b'a'
>       null_count: 0
>       distinct_count: 0
>       num_values: 1
>       physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
>
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like
> UNICODE, though I don't have enough experience with Parquet data types to
> know if this is a good idea or possible.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
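The cast this issue requests can be sketched in a few lines. The following is a minimal illustration only, not pyarrow's actual implementation: `decode_stat` is a hypothetical helper showing the intended behavior, namely that raw byte-array min/max statistics are decoded to unicode when the column's Parquet ConvertedType is UTF8, and returned unchanged otherwise.

{code:python}
# Hypothetical helper (not part of pyarrow): decode a raw byte statistic
# to a Python str when the column's ConvertedType is UTF8; pass every
# other value (non-UTF8 columns, non-bytes stats) through unchanged.
def decode_stat(value, converted_type):
    if converted_type == "UTF8" and isinstance(value, (bytes, bytearray)):
        return bytes(value).decode("utf-8")
    return value

# For the reproduction above, c.statistics.min is b'a'; with this cast in
# place it would surface as the str 'a' instead:
print(decode_stat(b"a", "UTF8"))        # -> a
print(decode_stat(b"a", "UTF8").__class__.__name__)  # -> str
print(decode_stat(b"a", None))          # unchanged: b'a'
{code}

With such a cast applied inside `RowGroupStatistics`, `type(c.statistics.min)` in the example above would report `str` rather than `bytes` for UTF8 columns.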