[jira] [Commented] (ARROW-14938) Partition column dissappear when reading dataset

Lance Dacey (Jira) Wed, 01 Dec 2021 05:23:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451813#comment-17451813
 ]


Lance Dacey commented on ARROW-14938:
-------------------------------------

Sure - refer to this section: 
https://arrow.apache.org/docs/python/dataset.html#different-partitioning-schemes

Passing partitioning="hive" is a shortcut that infers the data type of the 
partition column when it gets added back to the table, but you can also specify 
the schema of your partition columns explicitly using ds.partitioning().



> Partition column dissappear when reading dataset
> ------------------------------------------------
>
>                 Key: ARROW-14938
>                 URL: https://issues.apache.org/jira/browse/ARROW-14938
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.1
>         Environment: Debian bullseye, python 3.9
>            Reporter: Martin Gran
>            Priority: Major
>
> Appending CSV to parquet dataset with partitioning on "code".
> {code:python}
> table = pa.Table.from_pandas(chunk)
> pa.dataset.write_dataset(
>     table,
>     output_path,
>     basename_template=f"chunk_{y}_{{i}}",
>     format="parquet",
>     partitioning=["code"],
>     existing_data_behavior="overwrite_or_ignore",
> )
> {code}
> Loading the dataset again and expecting code to be in the dataframe.
> {code:python}
> import pyarrow.dataset as ds
> dataset = ds.dataset(
>     "../data/interim/2020_elements_parquet/",
>     format="parquet",
> )
> df = dataset.to_table().to_pandas()
> df["code"]
> {code}
> Trace
> {code:python}
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
>    3360             try:
> -> 3361                 return self._engine.get_loc(casted_key)
>    3362             except KeyError as err:
>
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
>
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
>
> pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
>
> pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
>
> KeyError: 'code'
>
> The above exception was the direct cause of the following exception:
>
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_24875/4149106129.py in <module>
> ----> 1 df["code"]
>
> ~/.local/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
>    3456             if self.columns.nlevels > 1:
>    3457                 return self._getitem_multilevel(key)
> -> 3458             indexer = self.columns.get_loc(key)
>    3459             if is_integer(indexer):
>    3460                 indexer = [indexer]
>
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
>    3361                 return self._engine.get_loc(casted_key)
>    3362             except KeyError as err:
> -> 3363                 raise KeyError(key) from err
>    3364
>    3365         if is_scalar(key) and isna(key) and not self.hasnans:
>
> KeyError: 'code'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
