[ https://issues.apache.org/jira/browse/ARROW-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-8210:
-----------------------------------------
    Description:
While testing duplicate column names, I ran into multiple issues:

* Factory fails if there are duplicate columns, even for a single file
* In addition, we should also fix and/or test that the factory works for duplicate columns if the schemas are equal
* Once a Dataset with duplicated columns is created, scanning without any column projection fails

---

My Python reproducer:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs

# create single parquet file with duplicated column names
table = pa.table(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
    names=['a', 'b', 'a'])
pq.write_table(table, "data_duplicate_columns.parquet")
{code}

Factory fails:

{code}
dataset = ds.dataset("data_duplicate_columns.parquet", format="parquet")
...
~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    346
    347     factories = [_ensure_factory(f, **kwargs) for f in paths_or_factories]
--> 348     return UnionDatasetFactory(factories).finish()
    349
    350

ArrowInvalid: Can't unify schema with duplicate field names.
{code}

And when creating a Dataset manually:

{code:python}
from pathlib import Path

# assumption: basedir is the directory the file above was written to
basedir = Path(".")

schema = pa.schema([('a', 'int64'), ('b', 'int64'), ('a', 'int64')])
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    [str(basedir / "data_duplicate_columns.parquet")],
    [ds.ScalarExpression(True)])
{code}

then scanning fails:

{code}
>>> dataset.to_table()
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
b: int64
a: int64
{code}

  was:
While testing duplicate column names, I ran into multiple issues:

* Factory fails if there are duplicate columns, even for a single file
* In addition, we should also fix and/or test that factory works for duplicate columns if the schema's are equal
* Once a Dataset with duplicated columns is created, scanning without any column projection fails

> [C++][Dataset] Handling of duplicate columns in Dataset factory and scanning
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-8210
>                 URL: https://issues.apache.org/jira/browse/ARROW-8210
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Priority: Major
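
Both errors in the description come from a schema that contains the same field name more than once. Until the factory and the scanner handle that case, one possible workaround is to detect and rename duplicates before writing the file. The following is only a minimal sketch in plain Python, assuming the names come from something like `table.column_names`; `find_duplicate_names` and `deduplicate_names` are hypothetical helpers for illustration and are not part of pyarrow.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


def deduplicate_names(names):
    """Make every name unique by appending _1, _2, ... to repeats.

    Hypothetical naming scheme chosen for this sketch; the first
    occurrence of a name is kept unchanged.
    """
    used = set()
    result = []
    for name in names:
        candidate = name
        suffix = 0
        # Keep bumping the suffix until the candidate is not taken yet.
        while candidate in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


names = ['a', 'b', 'a']  # column names from the reproducer above
duplicates = find_duplicate_names(names)
unique_names = deduplicate_names(names)
```

The resulting list could then be passed to `Table.rename_columns` before calling `pq.write_table`. This only sidesteps the problem for data you control; it does not address the factory and scanning behaviour reported in this issue.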