[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298828#comment-17298828 ]
Lance Dacey commented on ARROW-10694:
-------------------------------------

This is being worked on in the adlfs library, so I will close this. There are working adlfs branches that I have tested, but they have unfortunately also introduced new problems. Hopefully there will be a final solution soon.

> [Python] ds.write_dataset() generates empty files for each final partition
> --------------------------------------------------------------------------
>
>                 Key: ARROW-10694
>                 URL: https://issues.apache.org/jira/browse/ARROW-10694
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
>                      Python 3.8.6
>                      adlfs master branch
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset
>
> ds.write_dataset() generates an empty file for each final partition folder, which causes errors when reading the dataset or converting it to a table.
>
> I believe this may be caused by fs.mkdir(). Without a trailing slash in the path, an empty file is created in the "dev" container:
>
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>
> If the trailing slash is added, a proper folder is created:
>
> {code:java}
> fs.mkdir("dev/test2/")
> {code}
>
> Here is a full example of what happens with ds.write_dataset():
>
> {code:java}
> import pyarrow as pa
> import pyarrow.dataset as ds
> from pyarrow.dataset import DirectoryPartitioning
>
> schema = pa.schema(
>     [
>         ("year", pa.int16()),
>         ("month", pa.int8()),
>         ("day", pa.int8()),
>         ("report_date", pa.date32()),
>         ("employee_id", pa.string()),
>         ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
>     ]
> )
>
> part = DirectoryPartitioning(
>     pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
> )
>
> ds.write_dataset(
>     data=table,
>     base_dir="dev/test-dataset",
>     basename_template="test-{i}.parquet",
>     format="parquet",
>     partitioning=part,
>     schema=schema,
>     filesystem=fs,
> )
>
> dataset.files
> # sample printed below, note the empty files
> [
>     'dev/test-dataset/2018/1/1/test-0.parquet',
>     'dev/test-dataset/2018/10/1',
>     'dev/test-dataset/2018/10/1/test-27.parquet',
>     'dev/test-dataset/2018/3/1',
>     'dev/test-dataset/2018/3/1/test-6.parquet',
>     'dev/test-dataset/2020/1/1',
>     'dev/test-dataset/2020/1/1/test-2.parquet',
>     'dev/test-dataset/2020/10/1',
>     'dev/test-dataset/2020/10/1/test-29.parquet',
>     'dev/test-dataset/2020/11/1',
>     'dev/test-dataset/2020/11/1/test-32.parquet',
>     'dev/test-dataset/2020/2/1',
>     'dev/test-dataset/2020/2/1/test-5.parquet',
>     'dev/test-dataset/2020/7/1',
>     'dev/test-dataset/2020/7/1/test-20.parquet',
>     'dev/test-dataset/2020/8/1',
>     'dev/test-dataset/2020/8/1/test-23.parquet',
>     'dev/test-dataset/2020/9/1',
>     'dev/test-dataset/2020/9/1/test-26.parquet'
> ]
> {code}
>
> As you can see, there is an empty file for each "day" partition. I was not able to read the dataset at all until I manually deleted the first empty file in the dataset (2018/1/1).
>
> I then get an error when I try to use the to_table() method:
>
> {code:java}
> OSError                                   Traceback (most recent call last)
> <ipython-input-127-6fb0d79c4511> in <module>
> ----> 1 dataset.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
>
> Once I manually delete the empty file, I can use the to_table() method:
>
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas()
> {code}
>
> Is this a bug with pyarrow, adlfs, or fsspec?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
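As a workaround for the empty-file issue described above, the zero-byte marker entries can be filtered out of a file listing before opening the dataset. A minimal sketch in plain Python; the helper name drop_directory_markers and the {path: size} listing shape (as returned by e.g. fsspec's fs.du(path, total=False)) are assumptions for illustration, not part of any library API:

```python
def drop_directory_markers(listing):
    """Drop zero-byte entries that are really directory markers.

    A marker is a 0-byte entry whose path is also the parent
    directory of some other listed file, e.g. the empty
    'dev/test-dataset/2018/10/1' entry next to
    'dev/test-dataset/2018/10/1/test-27.parquet'.
    """
    paths = set(listing)
    return [
        path
        for path, size in listing.items()
        if not (size == 0 and any(p.startswith(path + "/") for p in paths))
    ]

# Sample modelled on the listing above (sizes are made up):
listing = {
    "dev/test-dataset/2018/10/1": 0,                     # empty marker file
    "dev/test-dataset/2018/10/1/test-27.parquet": 4096,  # real data file
}
print(drop_directory_markers(listing))
# -> ['dev/test-dataset/2018/10/1/test-27.parquet']
```

The surviving paths could then be passed as an explicit file list to ds.dataset(files, filesystem=fs, format="parquet", partitioning=part), which skips the empty files entirely instead of requiring them to be deleted by hand.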