[ https://issues.apache.org/jira/browse/ARROW-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir updated ARROW-10937:
-----------------------------
    Description:
Hello

It looks like pyarrow-2.0.0 cannot read partitioned Parquet datasets from S3 buckets:
{code:python}
import numpy as np
import pandas as pd
import s3fs
import pyarrow as pa
import pyarrow.parquet as pq

filesystem = s3fs.S3FileSystem()

# Build 10,000 daily rows with four random columns and a 'Year' partition column
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

# Write the table to S3 as a dataset partitioned by 'Year'
table = pa.Table.from_pandas(x, preserve_index=True)
pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet',
                    partition_cols=['Year'], filesystem=filesystem)
{code}

Now, reading it via pq.read_table:
{code:python}
pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem,
              use_pandas_metadata=True)
{code}
raises an exception:
{code}
ArrowInvalid: GetFileInfo() yielded path 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet', which is outside base dir 's3://bucket/test_pyarrow.parquet'
{code}

A direct read in pandas:
{code:python}
pd.read_parquet('s3://bucket/test_pyarrow.parquet')
{code}
returns an empty DataFrame.
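The two paths in the error message differ only in the 's3://' prefix: the listed file path has no scheme, while the base dir still carries one. The sketch below illustrates that mismatch with a naive prefix check. It is an illustration only, not Arrow's actual implementation; strip_scheme and is_under are hypothetical helper names introduced here.

```python
from urllib.parse import urlsplit


def strip_scheme(uri: str) -> str:
    """Return the 'bucket/key' form of a URI, dropping any 'scheme://' prefix."""
    parts = urlsplit(uri)
    if not parts.scheme:
        return uri  # already scheme-less, nothing to strip
    return parts.netloc + parts.path


def is_under(path: str, base_dir: str) -> bool:
    """Naive prefix check (hypothetical): does `path` lie inside `base_dir`?"""
    base = base_dir.rstrip("/") + "/"
    return path.startswith(base)


# Paths copied from the error message above
base_dir = "s3://bucket/test_pyarrow.parquet"
listed = "bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet"

# With the scheme left on the base dir, the listed file looks "outside" it
print(is_under(listed, base_dir))

# After normalizing the base dir, the same file is found inside it
print(is_under(listed, strip_scheme(base_dir)))
```

If this mismatch is indeed the cause, passing the scheme-less form ('bucket/test_pyarrow.parquet') to pq.read_table together with filesystem=filesystem may be worth trying as a workaround. That is an assumption and has not been verified here.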
The issue does not exist in pyarrow-1.0.1.

> ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10937
>                 URL: https://issues.apache.org/jira/browse/ARROW-10937
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Vladimir
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)