[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

Joris Van den Bossche (Jira) Tue, 17 Mar 2020 08:04:27 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060980#comment-17060980
 ]


Joris Van den Bossche commented on ARROW-5666:
----------------------------------------------

With the new dataset API (pyarrow.dataset), you can also specify the exact 
schema of the partition keys in case the default inference is not what you 
want (e.g. specify string when the values look like integers but you want to 
keep them as strings).

For example:

{code}
import pyarrow as pa
import pyarrow.dataset as ds
part = ds.partitioning(pa.schema([("year_week", pa.int64())]), flavor="hive")
dataset = ds.dataset("test", format="parquet", partitioning=part)
{code}

(With the actual example data from this issue, this int64 schema doesn't work: 
"2019_2" is no longer parsable as an int, so the column comes back as all 
nulls instead of raising an error. I'm going to open a separate issue about 
that.)

> [Python] Underscores in partition (string) values are dropped when reading 
> dataset
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-5666
>                 URL: https://issues.apache.org/jira/browse/ARROW-5666
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Julian de Ruiter
>            Priority: Major
>              Labels: dataset-parquet-read, parquet
>
> When reading a partitioned dataset, in which the partition column contains 
> string values with underscores, pyarrow seems to be ignoring the underscores 
> in the resulting values.
> For example if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64, 
> indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       0
>     ],
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
