[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237299#comment-17237299 ]
Lance Dacey commented on ARROW-10694:
-------------------------------------

Sure: https://github.com/dask/adlfs/issues/137

I tried the exclude_invalid_files argument but ran into an error:

{code:java}
dataset = ds.dataset(source="dev/test-dataset",
                     format="parquet",
                     partitioning=partition,
                     exclude_invalid_files=True,
                     filesystem=fs)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-12-69571963d552> in <module>
----> 1 dataset = ds.dataset(source="dev/test-dataset",
      2                      format="parquet",
      3                      partitioning=partition,
      4                      exclude_invalid_files=True,
      5                      filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    669     # TODO(kszucs): support InMemoryDataset for a table input
    670     if _is_path_like(source):
--> 671         return _filesystem_dataset(source, **kwargs)
    672     elif isinstance(source, (tuple, list)):
    673         if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    434         selector_ignore_prefixes=selector_ignore_prefixes
    435     )
--> 436     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
    437
    438     return factory.finish(schema)

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in open_input_file(self, path)
    274
    275         if not self.fs.isfile(path):
--> 276             raise FileNotFoundError(path)
    277
    278         return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: dev/test-dataset/2018/1/1
{code}

That folder and the empty file exist, though:

{code:java}
for file in fs.find("dev/test-dataset"):
    print(file)

dev/test-dataset/2018/1/1
dev/test-dataset/2018/1/1/test-0.parquet
dev/test-dataset/2018/10/1
dev/test-dataset/2018/10/1/test-27.parquet
dev/test-dataset/2018/11/1
dev/test-dataset/2018/11/1/test-30.parquet
dev/test-dataset/2018/12/1
dev/test-dataset/2018/12/1/test-33.parquet
dev/test-dataset/2018/2/1
dev/test-dataset/2018/2/1/test-3.parquet
{code}
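For readers hitting the same FileNotFoundError, here is a possible workaround sketch (not a confirmed fix from this issue): skip the zero-byte "folder" blobs entirely by passing ds.dataset() an explicit list of the real *.parquet paths instead of the directory. The abfs credentials are placeholders, and partition_base_dir is an assumption here, intended to let the DirectoryPartitioning fields still be parsed from the explicit paths:

{code:java}
import fsspec
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

# Placeholder abfs filesystem, configured the same way as in the report
fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")

partition = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

# Keep only the real parquet files; the zero-byte "folder" entries such as
# dev/test-dataset/2018/1/1 are what the traceback above trips over
paths = [p for p in fs.find("dev/test-dataset") if p.endswith(".parquet")]

dataset = ds.dataset(
    source=paths,                            # explicit file list instead of a directory
    format="parquet",
    partitioning=partition,
    partition_base_dir="dev/test-dataset",   # assumed: strip the prefix so year/month/day are parsed
    filesystem=fs,
)
print(dataset.to_table().num_rows)
{code}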
> [Python] ds.write_dataset() generates empty files for each final partition
> --------------------------------------------------------------------------
>
>                 Key: ARROW-10694
>                 URL: https://issues.apache.org/jira/browse/ARROW-10694
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
>                      Python 3.8.6
>                      adlfs master branch
>            Reporter: Lance Dacey
>            Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder
> which causes errors when reading the dataset or converting a dataset to a table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the
> path, an empty file is created in the "dev" container:
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/")
> {code}
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
>     [
>         ("year", pa.int16()),
>         ("month", pa.int8()),
>         ("day", pa.int8()),
>         ("report_date", pa.date32()),
>         ("employee_id", pa.string()),
>         ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
>     ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table,
>                  base_dir="dev/test-dataset",
>                  basename_template="test-{i}.parquet",
>                  format="parquet",
>                  partitioning=part,
>                  schema=schema,
>                  filesystem=fs)
> dataset.files
> # sample printed below, note the empty files
> [
>     'dev/test-dataset/2018/1/1/test-0.parquet',
>     'dev/test-dataset/2018/10/1',
>     'dev/test-dataset/2018/10/1/test-27.parquet',
>     'dev/test-dataset/2018/3/1',
>     'dev/test-dataset/2018/3/1/test-6.parquet',
>     'dev/test-dataset/2020/1/1',
>     'dev/test-dataset/2020/1/1/test-2.parquet',
>     'dev/test-dataset/2020/10/1',
>     'dev/test-dataset/2020/10/1/test-29.parquet',
>     'dev/test-dataset/2020/11/1',
>     'dev/test-dataset/2020/11/1/test-32.parquet',
>     'dev/test-dataset/2020/2/1',
>     'dev/test-dataset/2020/2/1/test-5.parquet',
>     'dev/test-dataset/2020/7/1',
>     'dev/test-dataset/2020/7/1/test-20.parquet',
>     'dev/test-dataset/2020/8/1',
>     'dev/test-dataset/2020/8/1/test-23.parquet',
>     'dev/test-dataset/2020/9/1',
>     'dev/test-dataset/2020/9/1/test-26.parquet'
> ]
> {code}
> As you can see, there is an empty file for each "day" partition. I was not even able to read the dataset at all until I manually deleted the first empty file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError                                   Traceback (most recent call last)
> <ipython-input-127-6fb0d79c4511> in <module>
> ----> 1 dataset.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
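As a stopgap for anyone hitting the "Parquet file size is 0 bytes" error above, a minimal cleanup sketch (not from the issue itself, and assuming the adlfs info dicts expose the usual fsspec "type" and "size" keys): delete the zero-byte placeholder blobs before scanning, after which dataset.to_table() should behave as in the last snippet above.

{code:java}
import fsspec

# Placeholder abfs filesystem, configured the same way as in the report
fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")

# fs.find(..., detail=True) returns {path: info}; info carries "type" and "size"
for path, info in fs.find("dev/test-dataset", detail=True).items():
    is_empty_marker = (
        info.get("type") == "file"
        and info.get("size", 0) == 0
        and not path.endswith(".parquet")
    )
    if is_empty_marker:
        print(f"removing empty placeholder: {path}")
        fs.rm(path)
{code}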