David McGuire created ARROW-10029:
-------------------------------------

             Summary: Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
                 Key: ARROW-10029
                 URL: https://issues.apache.org/jira/browse/ARROW-10029
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: David McGuire


@martindurant good news (for you): I have a repro test case that is 100% 
`pyarrow`, so it looks like `s3fs` is not involved.

@jorisvandenbossche how should I follow up on this, now that it reproduces with 
`pyarrow.filesystem.LocalFileSystem`?

```python
import pyarrow.parquet as pq
import pyarrow.filesystem as fs

class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)

fs = LoggingLocalFileSystem()
dataset_url = "dataset"

# Viewing the file system *directories* as a tree, one thread is required for
# every non-leaf node in order to avoid deadlock:

# 1) dataset
# 2) dataset/foo=1
# 3) dataset/foo=1/bar=2
# 4) dataset/foo=1/bar=2/baz=0
# 5) dataset/foo=1/bar=2/baz=1
# 6) dataset/foo=1/bar=2/baz=2
# *) dataset/foo=1/bar=2/baz=0/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=true
# *) dataset/foo=1/bar=2/baz=0/qux=true
# *) dataset/foo=1/bar=2/baz=2/qux=false
# *) dataset/foo=1/bar=2/baz=2/qux=true

# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False,
                            metadata_nthreads=threads)
print(len(dataset.pieces))

# This hangs indefinitely
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False,
                            metadata_nthreads=threads)
print(len(dataset.pieces))
```

```bash
$ python repro.py 
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
```

**NOTE:** this *also* happens with the un-decorated `LocalFileSystem`, and when 
omitting the `filesystem` argument.
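
The thread-count observation in the repro comments can be modeled without `pyarrow` at all. The following is a minimal standard-library sketch, not the actual `ParquetDataset` code: it *assumes* that the root directory is visited on the calling thread, that every subdirectory is submitted to one bounded thread pool, and that each visit blocks until its children finish. The names `count_leaves`, `WalkStalled`, and `TREE` are hypothetical, and the polling deadline exists only so the sketch reports a stall instead of hanging.

```python
import concurrent.futures as cf
import time


class WalkStalled(Exception):
    """Raised when the simulated walk is still blocked at the deadline."""


# Children of "dataset", mirroring the directory layout in the repro.
TREE = {
    "foo=1": {
        "bar=2": {
            f"baz={i}": {"qux=false": {}, "qux=true": {}} for i in range(3)
        }
    }
}


def count_leaves(tree, max_workers, timeout=1.0):
    """Count leaf directories; return None if the walk stalls."""
    deadline = time.monotonic() + timeout
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:

        def visit(children):
            if not children:
                return 1  # leaf directory: nothing to wait for
            # One pool task per subdirectory (assumed behavior).
            pending = [pool.submit(visit, sub) for sub in children.values()]
            while True:
                # The visiting thread is occupied while it waits here.
                done, not_done = cf.wait(pending, timeout=0.05)
                if not not_done:
                    return sum(f.result() for f in done)
                if time.monotonic() > deadline:
                    for f in not_done:
                        f.cancel()
                    raise WalkStalled

        try:
            return visit(tree)  # root runs on the calling thread
        except WalkStalled:
            return None


print(count_leaves(TREE, max_workers=6))  # prints 6
print(count_leaves(TREE, max_workers=5))  # prints None (stalled)
```

Under these assumptions, the five non-leaf subdirectories (`foo=1`, `bar=2`, and the three `baz=*`) each hold a worker while waiting, so a pool of 5 has no thread left to visit the `qux=*` leaves, while a pool of 6 has exactly one spare worker and completes. That matches the 6-versus-5 behavior in the output above, though it does not prove this is what `pyarrow` does internally.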



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
