[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060980#comment-17060980 ]

Joris Van den Bossche commented on ARROW-5666:
----------------------------------------------

With the new API, you can also specify the exact schema of the partition keys in case the default inference is not what you want (e.g. specify string when the values look like integers but you want to keep them as strings). For example:

{code}
import pyarrow as pa
import pyarrow.dataset as ds

part = ds.partitioning(pa.schema([("year_week", pa.int64())]), flavor="hive")
dataset = ds.dataset("test", format="parquet", partitioning=part)
{code}

(For the actual example code this doesn't work, because "2019_2" is no longer parsable as an int, and it then gives all nulls instead of raising an error. I am going to open a separate issue about that.)

> [Python] Underscores in partition (string) values are dropped when reading dataset
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-5666
>                 URL: https://issues.apache.org/jira/browse/ARROW-5666
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Julian de Ruiter
>            Priority: Major
>              Labels: dataset-parquet-read, parquet
>
> When reading a partitioned dataset, in which the partition column contains
> string values with underscores, pyarrow seems to be ignoring the underscores
> in the resulting values.
> For example, if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       0
>     ],
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
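The idea of an explicit partition schema, as described in the comment above, can be illustrated without pyarrow. The following is only a sketch under stated assumptions: `parse_hive_segments` and its `schema` argument (a plain dict mapping key names to Python types) are hypothetical names made up for this illustration, and this is not how pyarrow implements `ds.partitioning`. It shows that when the user declares a key as a string, no inference runs and the underscore survives.

```python
def parse_hive_segments(path, schema):
    """Parse hive-style 'key=value' directory segments of a path.

    Hypothetical helper for illustration only (not pyarrow code).
    `schema` maps partition key names to a Python type used for casting.
    """
    values = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # skip non-partition segments such as the file name
        key, raw = segment.split("=", 1)
        cast = schema.get(key, str)  # keys missing from the schema stay strings
        values[key] = cast(raw)
    return values


path = "test/year=2019/year_week=2019_2/part-0.parquet"
print(parse_hive_segments(path, {"year": int, "year_week": str}))
# {'year': 2019, 'year_week': '2019_2'}
```

Because the type comes from the caller rather than from guessing, a value such as "2019_2" is never passed through `int()` unless the user asks for it.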
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060975#comment-17060975 ]

Joris Van den Bossche commented on ARROW-5666:
----------------------------------------------

This now works with the new Datasets API:

{code}
In [2]: import pyarrow.dataset as ds

In [3]: dataset = ds.dataset("test/", format="parquet", partitioning="hive")

In [4]: dataset.schema
Out[4]:
value: int64
year_week: string

In [5]: dataset.to_table().to_pandas()
Out[5]:
   value year_week
0      1    2019_2
1      2    2019_3
{code}

So once we start using this new code in the parquet module (ARROW-8039), this issue should get resolved.
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872692#comment-16872692 ]

Julian de Ruiter commented on ARROW-5666:
-----------------------------------------

Ideally you would want to preserve the data type of the partition columns, but that's going to be hard to do properly if you also want to include other types such as dates, etc.

Maybe it would be most consistent to just read partitioned columns as a categorical type and let the user handle the details, as this makes the code less magic than what happens now.
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870952#comment-16870952 ]

Robin Kåveland commented on ARROW-5666:
---------------------------------------

That's just a slight modification to the expression above:

{code:java}
all(key.isdigit() or (key.startswith('-') and key[1:].isdigit()) for key in self.keys)
{code}

But it's starting to feel like a bad idea to attempt this coercion at all. Maybe it's better to force the user to deal with what type the partition key should have? If it were to be interpreted as something that looks a bit like {{pd.Categorical}}, it would be relatively cheap to read it into memory and deal with it after the user has read the file?
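The per-key test inside the expression above can be pulled out and tried on a few inputs. This is a minimal sketch; the function name `looks_like_integer` is hypothetical and does not exist in pyarrow.

```python
def looks_like_integer(key):
    """Return True for a run of digits with an optional leading minus sign.

    Hypothetical helper wrapping the per-key test from the expression above.
    """
    return key.isdigit() or (key.startswith("-") and key[1:].isdigit())


for key in ["10", "-10", "2019_2", "1.5", "-"]:
    print(repr(key), looks_like_integer(key))
```

With this predicate, "10" and "-10" qualify for integer conversion, while "2019_2", "1.5" and a bare "-" do not, so keys with underscores or decimal points would be kept as strings.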
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869345#comment-16869345 ]

Julian de Ruiter commented on ARROW-5666:
-----------------------------------------

What about negative values though? `isdigit` returns False for '-10', for example, whilst int('-10') returns an int.
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869341#comment-16869341 ]

Julian de Ruiter commented on ARROW-5666:
-----------------------------------------

I think `isdigit` might be a good solution; it would also guard against silently casting (for example) floats to integers.
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868924#comment-16868924 ]

Robin Kåveland commented on ARROW-5666:
---------------------------------------

Maybe it's an option to check {{all(key.isdigit() for key in self.keys)}}? Might also be ugly, but apparently:

{code:java}
>>> '1231_2'.isdigit()
False
>>> int('123_2')
1232
>>> '1231'.isdigit()
True
{code}

This breaks stuff like base 16 partition keys, but at least the original data is still unchanged, so the user can coerce themselves.
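A guard of the kind suggested above might look roughly like the following. This is a sketch only: `coerce_if_all_digits` is a hypothetical name, and plain Python lists stand in for the Arrow arrays that pyarrow would build.

```python
def coerce_if_all_digits(keys):
    """Convert keys to int only when every key consists solely of digits.

    Hypothetical helper illustrating the isdigit() check suggested above;
    plain lists are used instead of Arrow arrays.
    """
    if all(key.isdigit() for key in keys):
        return [int(key) for key in keys]
    # At least one key is not purely digits: keep the original strings
    return list(keys)


print(coerce_if_all_digits(["2019", "2020"]))      # [2019, 2020]
print(coerce_if_all_digits(["2019_2", "2019_3"]))  # ['2019_2', '2019_3']
print(coerce_if_all_digits(["1f", "2a"]))          # ['1f', '2a']
```

The last call shows the base 16 trade-off mentioned above: such keys stay strings and the user has to convert them. Note also that `str.isdigit()` accepts some non-ASCII digit characters that `int()` rejects, so a real implementation would likely need a stricter check.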
[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868583#comment-16868583 ]

Joris Van den Bossche commented on ARROW-5666:
----------------------------------------------

Thanks for the report! The problem is that we try to convert the keys to integers and, if that fails, just preserve them as strings. That is done here: https://github.com/apache/arrow/blob/961927af56b83d0dbca91132c3f07aa06d69fc63/python/pyarrow/parquet.py#L659-L663

{code}
# Only integer and string partition types are supported right now
try:
    integer_keys = [int(x) for x in self.keys]
    dictionary = lib.array(integer_keys)
except ValueError:
    dictionary = lib.array(self.keys)
{code}

and apparently, Python will convert a string with an underscore to an integer ...

{code}
In [3]: int("2019_1")
Out[3]: 20191
{code}

I think this is because in recent Python versions (3.6 and later, per PEP 515) underscores are allowed in integer literals (e.g. to separate thousands). We could special-case this and first check if there is an underscore in the string before trying to convert to integers, but that's a bit ugly.
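The coercion described in the comment above can be reproduced without pyarrow. The following is a minimal sketch, assuming plain Python lists in place of `lib.array`; the function name `coerce_partition_keys` is hypothetical and not part of pyarrow. It relies on `int()` accepting single underscores between digits, which has been the case since Python 3.6.

```python
def coerce_partition_keys(keys):
    """Mimic the try/except coercion quoted above (hypothetical helper).

    Plain lists stand in for lib.array; this is not pyarrow code.
    """
    try:
        # int() accepts "2019_2" and silently drops the underscore
        return [int(x) for x in keys]
    except ValueError:
        # If any key is rejected by int(), the whole column stays as strings
        return list(keys)


print(coerce_partition_keys(["2019_2", "2019_3"]))  # [20192, 20193]
print(coerce_partition_keys(["2019-2", "2019-3"]))  # ['2019-2', '2019-3']
```

The first call reproduces the reported symptom: the keys come back as the integers 20192 and 20193. The second call shows that a separator `int()` does not understand leaves the strings untouched.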