David McGuire created ARROW-10029:
-------------------------------------

             Summary: Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
                 Key: ARROW-10029
                 URL: https://issues.apache.org/jira/browse/ARROW-10029
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: David McGuire

@martindurant good news (for you): I have a repro test case that is 100% `pyarrow`, so it looks like `s3fs` is not involved.

@jorisvandenbossche how should I follow up with this, based on `pyarrow.filesystem.LocalFileSystem`?

```python
import pyarrow.parquet as pq
import pyarrow.filesystem as fs


class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)


fs = LoggingLocalFileSystem()
dataset_url = "dataset"

# Viewing the File System *directories* as a tree, one thread is required for every non-leaf node,
# in order to avoid deadlock

# 1) dataset
# 2) dataset/foo=1
# 3) dataset/foo=1/bar=2
# 4) dataset/foo=1/bar=2/baz=0
# 5) dataset/foo=1/bar=2/baz=1
# 6) dataset/foo=1/bar=2/baz=2
# *) dataset/foo=1/bar=2/baz=0/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=true
# *) dataset/foo=1/bar=2/baz=0/qux=true
# *) dataset/foo=1/bar=2/baz=2/qux=false
# *) dataset/foo=1/bar=2/baz=2/qux=true

# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

# This hangs indefinitely
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))
```

```bash
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
```

**NOTE:** this *also* happens with the un-decorated `LocalFileSystem`, and when omitting the `filesystem` argument.
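
For what it's worth, the 6-versus-5 thread threshold above can be reproduced without `pyarrow` at all. The following is only a sketch of the general pattern I suspect is at play: tasks in a bounded `concurrent.futures.ThreadPoolExecutor` that submit child tasks to the same pool and then block waiting on them. That this matches what `ParquetDataset` does internally is an assumption on my part, and `TREE`, `visit`, and `count_leaves` are hypothetical names made up for the illustration. A timeout stands in for the hang so the script terminates.

```python
import concurrent.futures as cf

# Hypothetical stand-in for the directory layout in the report: each key is a
# non-leaf directory, each value lists its sub-directories.
TREE = {
    "dataset": ["dataset/foo=1"],
    "dataset/foo=1": ["dataset/foo=1/bar=2"],
    "dataset/foo=1/bar=2": [f"dataset/foo=1/bar=2/baz={i}" for i in range(3)],
}
for i in range(3):
    base = f"dataset/foo=1/bar=2/baz={i}"
    TREE[base] = [f"{base}/qux=false", f"{base}/qux=true"]


def visit(pool, path, timeout):
    """Count leaf directories under `path`, fanning out through `pool`."""
    children = TREE.get(path, [])
    if not children:
        return 1  # leaf directory: nothing further to submit
    # Submit one task per child, then block this thread until all finish.
    futures = [pool.submit(visit, pool, child, timeout) for child in children]
    done, not_done = cf.wait(futures, timeout=timeout)
    if not_done:
        # Without the timeout this wait would never return: every worker is
        # blocked here, so nobody is left to run the queued child tasks.
        raise TimeoutError(f"stalled below {path}")
    return sum(f.result() for f in done)


def count_leaves(max_workers, timeout=1.0):
    """Return the leaf count, or None if the walk stalls within `timeout`."""
    try:
        with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
            # The root is visited on the calling thread, not in the pool.
            return visit(pool, "dataset", timeout)
    except TimeoutError:
        return None


if __name__ == "__main__":
    print(count_leaves(6))  # 6
    print(count_leaves(5))  # None, after roughly `timeout` seconds
```

In this sketch the five non-leaf directories below the root each pin one worker while waiting, so a sixth worker is needed to process the `qux=*` leaves. With only five workers the leaf tasks sit in the queue forever, which lines up with the walk output stopping after `baz=2` in the hanging run.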
-- This message was sent by Atlassian Jira (v8.3.4#803005)