Lance Dacey created ARROW-10694:
-----------------------------------

             Summary: [Python] ds.write_dataset() generates empty files for each final partition
                 Key: ARROW-10694
                 URL: https://issues.apache.org/jira/browse/ARROW-10694
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 2.0.0
         Environment: Ubuntu 18.04
                      Python 3.8.6
                      adlfs master branch
            Reporter: Lance Dacey
ds.write_dataset() generates an empty file in each final partition folder, which causes errors when reading the dataset or converting the dataset to a table. I believe this may be caused by fs.mkdir(). Without a trailing slash in the path, an empty file is created in the "dev" container:

{code:java}
import fsspec

fs = fsspec.filesystem(protocol="abfs", account_name=base.login, account_key=base.password)
fs.mkdir("dev/test2")
{code}

If the trailing slash is added, a proper folder is created:

{code:java}
fs.mkdir("dev/test2/")
{code}

Here is a full example of what happens with ds.write_dataset():

{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

schema = pa.schema(
    [
        ("year", pa.int16()),
        ("month", pa.int8()),
        ("day", pa.int8()),
        ("report_date", pa.date32()),
        ("employee_id", pa.string()),
        ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
    ]
)

part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))

# `table` is a pyarrow.Table matching `schema`
ds.write_dataset(
    data=table,
    base_dir="dev/test-dataset",
    basename_template="test-{i}.parquet",
    format="parquet",
    partitioning=part,
    schema=schema,
    filesystem=fs,
)

# `dataset` is the result of reading "dev/test-dataset" back
dataset.files
# sample printed below; note the empty files
[
    'dev/test-dataset/2018/1/1/test-0.parquet',
    'dev/test-dataset/2018/10/1',
    'dev/test-dataset/2018/10/1/test-27.parquet',
    'dev/test-dataset/2018/3/1',
    'dev/test-dataset/2018/3/1/test-6.parquet',
    'dev/test-dataset/2020/1/1',
    'dev/test-dataset/2020/1/1/test-2.parquet',
    'dev/test-dataset/2020/10/1',
    'dev/test-dataset/2020/10/1/test-29.parquet',
    'dev/test-dataset/2020/11/1',
    'dev/test-dataset/2020/11/1/test-32.parquet',
    'dev/test-dataset/2020/2/1',
    'dev/test-dataset/2020/2/1/test-5.parquet',
    'dev/test-dataset/2020/7/1',
    'dev/test-dataset/2020/7/1/test-20.parquet',
    'dev/test-dataset/2020/8/1',
    'dev/test-dataset/2020/8/1/test-23.parquet',
    'dev/test-dataset/2020/9/1',
    'dev/test-dataset/2020/9/1/test-26.parquet'
]
{code}

As you can see, there is an empty file for each "day" partition. I was not able to read the dataset at all until I manually deleted the first empty file (2018/1/1). Even after that, the to_table() method raises an error on the next empty file:

{code:java}
OSError                                   Traceback (most recent call last)
<ipython-input-127-6fb0d79c4511> in <module>
----> 1 dataset.to_table()

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
{code}

If I manually delete the empty file, I can then use to_table():

{code:java}
dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas()
{code}

Is this a bug in pyarrow, adlfs, or fsspec?
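A possible workaround until this is fixed is to delete the zero-byte files before reading the dataset. This is a rough, untested sketch using only generic fsspec calls (fs.du() and fs.rm()) rather than anything adlfs-specific, and it assumes the fs and part objects defined above:

{code:java}
import fsspec
import pyarrow.dataset as ds

# same abfs filesystem as in the example above
fs = fsspec.filesystem("abfs", account_name=base.login, account_key=base.password)

# fs.du(..., total=False) returns {path: size}; the spurious
# "directory marker" entries are the ones with size == 0
sizes = fs.du("dev/test-dataset", total=False)
for path, size in sizes.items():
    if size == 0:
        fs.rm(path)

# with the empty files gone, discovery and to_table() should work
dataset = ds.dataset(
    "dev/test-dataset",
    format="parquet",
    partitioning=part,  # the DirectoryPartitioning defined above
    filesystem=fs,
)
{code}

This only masks the symptom, of course; whatever creates the zero-byte entries (write_dataset(), adlfs mkdir(), or fsspec) still needs the actual fix.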