[ 
https://issues.apache.org/jira/browse/ARROW-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071140#comment-17071140
 ] 

Joris Van den Bossche edited comment on ARROW-8276 at 3/30/20, 4:52 PM:
------------------------------------------------------------------------

Reproducer in Python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib


# create small partitioned dataset
table = pa.table({'col1': [1, 2, 3]})

basedir = pathlib.Path(".")
dataset_dir = basedir / "test_partitioned_fragment"
dataset_dir.mkdir(exist_ok=True)

(dataset_dir / "A=0").mkdir(exist_ok=True)
(dataset_dir / "A=1").mkdir(exist_ok=True)
pq.write_table(table, dataset_dir / "A=0" / "data.parquet")
pq.write_table(table, dataset_dir / "A=1" / "data.parquet")

# read it with the datasets API
dataset = ds.dataset(str(dataset_dir), format="parquet", partitioning="hive")

dataset.schema
dataset.to_table()

# reading one fragment fails
fragments = list(dataset.get_fragments())
fragments[0].to_table()
{code}

gives:

{code}
ArrowInvalid: Schema at index 0 was different: 
col1: int64
A: int32
vs
col1: int64
{code}
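
For readers unfamiliar with hive-style partitioning, here is a rough stdlib-only sketch of where the extra {{A}} column in the first schema comes from. The helper {{parse_hive_segments}} is hypothetical and is not how pyarrow implements partition discovery; it only illustrates that the partition key lives in the directory name, not in the Parquet file itself. Values are kept as strings here, whereas the dataset inferred {{int32}} for {{A}}.

```python
import pathlib


def parse_hive_segments(path):
    """Hypothetical helper: collect key=value directory names from a path."""
    partition_values = {}
    # Only look at the directories, not the file name itself
    for segment in pathlib.PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partition_values[key] = value
    return partition_values


# Columns physically stored in each data.parquet of the reproducer
file_columns = ["col1"]
partition_values = parse_hive_segments("test_partitioned_fragment/A=0/data.parquet")

# The dataset-level schema is the file columns plus the partition keys,
# which is why it has one more field than the schema read from the file
dataset_columns = file_columns + list(partition_values)
print(partition_values)
print(dataset_columns)
```

The mismatch in the error above has the same shape: one schema includes the partition field, the other only has what is stored in the file.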



> [C++][Dataset] Scanning a Fragment does not take into account the partition 
> columns
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-8276
>                 URL: https://issues.apache.org/jira/browse/ARROW-8276
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>             Fix For: 0.17.0
>
>
> Follow-up on ARROW-8061: the {{to_table}} method doesn't work for fragments 
> created from a partitioned dataset.
> (Will add a reproducer later.)
> cc [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
