[ https://issues.apache.org/jira/browse/ARROW-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071140#comment-17071140 ]
Joris Van den Bossche edited comment on ARROW-8276 at 3/30/20, 4:52 PM:
------------------------------------------------------------------------

Reproducer in python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib

# create small partitioned dataset
table = pa.table({'col1': [1, 2, 3]})

basedir = pathlib.Path(".")
dataset_dir = basedir / "test_partitioned_fragment"
dataset_dir.mkdir(exist_ok=True)
(dataset_dir / "A=0").mkdir(exist_ok=True)
(dataset_dir / "A=1").mkdir(exist_ok=True)
pq.write_table(table, dataset_dir / "A=0" / "data.parquet")
pq.write_table(table, dataset_dir / "A=1" / "data.parquet")

# read it with the datasets API
dataset = ds.dataset(str(dataset_dir), format="parquet", partitioning="hive")
dataset.schema
dataset.to_table()

# reading one fragment fails
fragments = list(dataset.get_fragments())
fragments[0].to_table()
{code}

gives:

{code}
ArrowInvalid: Schema at index 0 was different:
col1: int64
A: int32
vs
col1: int64
{code}

> [C++][Dataset] Scanning a Fragment does not take into account the partition columns
> -----------------------------------------------------------------------------------
>
> Key: ARROW-8276
> URL: https://issues.apache.org/jira/browse/ARROW-8276
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, C++ - Dataset
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up on ARROW-8061, the {{to_table}} method doesn't work for fragments
> created from a partitioned dataset.
> (will add a reproducer later)
> cc [~bkietz]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)