[ 
https://issues.apache.org/jira/browse/ARROW-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386518#comment-16386518
 ] 

ASF GitHub Bot commented on ARROW-1982:
---------------------------------------

xhochy commented on a change in pull request #1698: ARROW-1982: [Python] Coerce 
Parquet statistics as bytes to more useful Python scalar types
URL: https://github.com/apache/arrow/pull/1698#discussion_r172287015
 
 

 ##########
 File path: python/pyarrow/_parquet.pyx
 ##########
 @@ -70,6 +70,28 @@ cdef class RowGroupStatistics:
                                self.num_values,
                                self.physical_type)
 
+    cdef inline _cast_statistic(self, object value):
+        cdef ParquetType physical_type = self.statistics.get().physical_type()
+        if physical_type == ParquetType_BOOLEAN:
+            return bool(int(value))
+        elif physical_type == ParquetType_INT32:
+            return int(value)
+        elif physical_type == ParquetType_INT64:
+            return int(value)
+        elif physical_type == ParquetType_INT96:
+            # TODO
 
 Review comment:
   We should also return `bytes` here.
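
   As an illustration only, the branching in the diff above can be sketched in plain Python. This is not the code from the pull request: `cast_statistic`, the string type names, and the fallback for unlisted types are hypothetical stand-ins, and the INT96 branch simply follows the suggestion to hand back `bytes`, assuming the raw value is bytes-like.

```python
def cast_statistic(physical_type, value):
    """Coerce a raw statistic to a Python scalar (hypothetical helper).

    `physical_type` is a plain string here; the real code compares against
    the ParquetType enum values shown in the diff.
    """
    if physical_type == "BOOLEAN":
        return bool(int(value))
    if physical_type in ("INT32", "INT64"):
        return int(value)
    if physical_type == "INT96":
        # Assumption: the raw value is bytes-like. There is no obvious
        # Python scalar for INT96, so return the raw bytes unchanged.
        return bytes(value)
    # Assumption: any type not listed above is passed through untouched.
    return value


print(cast_statistic("BOOLEAN", "1"))
print(cast_statistic("INT64", "662688000000"))
print(cast_statistic("INT96", b"\x00" * 12))
```

   Returning `bytes` keeps the INT96 branch lossless and leaves interpretation to the caller.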

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Return parquet statistics min/max as values instead of strings
> -----------------------------------------------------------------------
>
>                 Key: ARROW-1982
>                 URL: https://issues.apache.org/jira/browse/ARROW-1982
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Jim Crist
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Currently `min` and `max` column statistics are returned as formatted strings 
> of the _physical type_. This makes using them in Python a bit tricky, as the 
> strings need to be parsed as the proper _logical type_. Observe:
> {code}
> In [20]: import pandas as pd
> In [21]: df = pd.DataFrame({'a': [1, 2, 3],
>     ...:                    'b': ['a', 'b', 'c'],
>     ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
>     ...:
> In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
> In [23]: from pyarrow import parquet as pq
> In [24]: f = pq.ParquetFile('temp.parquet')
> In [25]: rg = f.metadata.row_group(0)
> In [26]: rg.column(0).statistics.min  # string instead of integer
> Out[26]: '1'
> In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
> Out[27]: 'a '
> In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
> Out[28]: '662688000000'
> {code}
> Since the type information is known, it should be possible to convert these 
> to arrow values instead of strings.
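
To make the parsing burden described in the issue concrete, here is a minimal sketch of what a caller currently has to do by hand with the three strings from the session above. `parse_statistic` and its `logical_type` labels are hypothetical and not part of pyarrow; the timestamp branch assumes the value is milliseconds since the Unix epoch in UTC, which matches the `'662688000000'` shown for `1991-01-01`.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_statistic(raw, logical_type):
    """Turn a formatted statistics string into a Python value (hypothetical)."""
    if logical_type == "int":
        return int(raw)
    if logical_type == "string":
        # Drop only the single trailing space the formatter appends.
        return raw[:-1] if raw.endswith(" ") else raw
    if logical_type == "timestamp_ms":
        # Assumption: milliseconds since the Unix epoch, UTC.
        return EPOCH + timedelta(milliseconds=int(raw))
    raise ValueError("unsupported logical type: %r" % (logical_type,))


print(parse_statistic("1", "int"))
print(repr(parse_statistic("a ", "string")))
print(parse_statistic("662688000000", "timestamp_ms"))
```

The string branch shows why the current format is lossy: a value that genuinely ends in a space cannot be told apart from formatter padding.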



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
