[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237299#comment-17237299 ]
Lance Dacey commented on ARROW-10694:
-------------------------------------

Sure: https://github.com/dask/adlfs/issues/137

I tried the exclude_invalid_files argument but ran into an error:

{code:java}
dataset = ds.dataset(source="dev/test-dataset",
                     format="parquet",
                     partitioning=partition,
                     exclude_invalid_files=True,
                     filesystem=fs)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-12-69571963d552> in <module>
----> 1 dataset = ds.dataset(source="dev/test-dataset",
      2                      format="parquet",
      3                      partitioning=partition,
      4                      exclude_invalid_files=True,
      5                      filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    669     # TODO(kszucs): support InMemoryDataset for a table input
    670     if _is_path_like(source):
--> 671         return _filesystem_dataset(source, **kwargs)
    672     elif isinstance(source, (tuple, list)):
    673         if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    434         selector_ignore_prefixes=selector_ignore_prefixes
    435     )
--> 436     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
    437
    438     return factory.finish(schema)

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in open_input_file(self, path)
    274
    275         if not self.fs.isfile(path):
--> 276             raise FileNotFoundError(path)
    277
    278         return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: dev/test-dataset/2018/1/1
{code}

That folder and the empty file exist, though:

{code:java}
for file in fs.find("dev/test-dataset"):
    print(file)

dev/test-dataset/2018/1/1
dev/test-dataset/2018/1/1/test-0.parquet
dev/test-dataset/2018/10/1
dev/test-dataset/2018/10/1/test-27.parquet
dev/test-dataset/2018/11/1
dev/test-dataset/2018/11/1/test-30.parquet
dev/test-dataset/2018/12/1
dev/test-dataset/2018/12/1/test-33.parquet
dev/test-dataset/2018/2/1
dev/test-dataset/2018/2/1/test-3.parquet
{code}
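For readers hitting the same FileNotFoundError, here is a possible workaround sketch (not a confirmed fix from this issue): skip the zero-byte "folder" blobs entirely by passing ds.dataset() an explicit list of the real *.parquet paths instead of the directory. The abfs credentials are placeholders, and partition_base_dir is an assumption here, intended to let the DirectoryPartitioning fields still be parsed from the explicit paths:

{code:java}
import fsspec
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

# Placeholder abfs filesystem, configured the same way as in the report
fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")

partition = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

# Keep only the real parquet files; the zero-byte "folder" entries such as
# dev/test-dataset/2018/1/1 are what the traceback above trips over
paths = [p for p in fs.find("dev/test-dataset") if p.endswith(".parquet")]

dataset = ds.dataset(
    source=paths,                            # explicit file list instead of a directory
    format="parquet",
    partitioning=partition,
    partition_base_dir="dev/test-dataset",   # assumed: strip the prefix so year/month/day are parsed
    filesystem=fs,
)
print(dataset.to_table().num_rows)
{code}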
> [Python] ds.write_dataset() generates empty files for each final partition
> --------------------------------------------------------------------------
>
>                 Key: ARROW-10694
>                 URL: https://issues.apache.org/jira/browse/ARROW-10694
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
>                      Python 3.8.6
>                      adlfs master branch
>            Reporter: Lance Dacey
>            Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder
> which causes errors when reading the dataset or converting a dataset to a table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the
> path, an empty file is created in the "dev" container:
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/")
> {code}
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
>     [
>         ("year", pa.int16()),
>         ("month", pa.int8()),
>         ("day", pa.int8()),
>         ("report_date", pa.date32()),
>         ("employee_id", pa.string()),
>         ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
>     ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table,
>                  base_dir="dev/test-dataset",
>                  basename_template="test-{i}.parquet",
>                  format="parquet",
>                  partitioning=part,
>                  schema=schema,
>                  filesystem=fs)
> dataset.files
> # sample printed below, note the empty files
> [
>     'dev/test-dataset/2018/1/1/test-0.parquet',
>     'dev/test-dataset/2018/10/1',
>     'dev/test-dataset/2018/10/1/test-27.parquet',
>     'dev/test-dataset/2018/3/1',
>     'dev/test-dataset/2018/3/1/test-6.parquet',
>     'dev/test-dataset/2020/1/1',
>     'dev/test-dataset/2020/1/1/test-2.parquet',
>     'dev/test-dataset/2020/10/1',
>     'dev/test-dataset/2020/10/1/test-29.parquet',
>     'dev/test-dataset/2020/11/1',
>     'dev/test-dataset/2020/11/1/test-32.parquet',
>     'dev/test-dataset/2020/2/1',
>     'dev/test-dataset/2020/2/1/test-5.parquet',
>     'dev/test-dataset/2020/7/1',
>     'dev/test-dataset/2020/7/1/test-20.parquet',
>     'dev/test-dataset/2020/8/1',
>     'dev/test-dataset/2020/8/1/test-23.parquet',
>     'dev/test-dataset/2020/9/1',
>     'dev/test-dataset/2020/9/1/test-26.parquet'
> ]
> {code}
> As you can see, there is an empty file for each "day" partition. I was not even able to read the dataset at all until I manually deleted the first empty file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError                                   Traceback (most recent call last)
> <ipython-input-127-6fb0d79c4511> in <module>
> ----> 1 dataset.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
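As a stopgap for anyone hitting the "Parquet file size is 0 bytes" error above, a minimal cleanup sketch (not from the issue itself, and assuming the adlfs info dicts expose the usual fsspec "type" and "size" keys): delete the zero-byte placeholder blobs before scanning, after which dataset.to_table() should behave as in the last snippet above.

{code:java}
import fsspec

# Placeholder abfs filesystem, configured the same way as in the report
fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")

# fs.find(..., detail=True) returns {path: info}; info carries "type" and "size"
for path, info in fs.find("dev/test-dataset", detail=True).items():
    is_empty_marker = (
        info.get("type") == "file"
        and info.get("size", 0) == 0
        and not path.endswith(".parquet")
    )
    if is_empty_marker:
        print(f"removing empty placeholder: {path}")
        fs.rm(path)
{code}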