[ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-10029:
------------------------------------------
    Labels: dataset-parquet-read  (was: )

> [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10029
>                 URL: https://issues.apache.org/jira/browse/ARROW-10029
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: David McGuire
>            Priority: Major
>              Labels: dataset-parquet-read
>         Attachments: repro.py
>
>
> @martindurant good news (for you): I have a repro test case that is 100%
> {{pyarrow}}, so it looks like {{s3fs}} is not involved.
> @jorisvandenbossche how should I follow up with this, based on
> {{pyarrow.filesystem.LocalFileSystem}}?
> Viewing the File System *directories* as a tree, one thread is required for
> every non-leaf node, in order to avoid deadlock.
> 1) dataset
> 2) dataset/foo=1
> 3) dataset/foo=1/bar=2
> 4) dataset/foo=1/bar=2/baz=0
> 5) dataset/foo=1/bar=2/baz=1
> 6) dataset/foo=1/bar=2/baz=2
> *) dataset/foo=1/bar=2/baz=0/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=true
> *) dataset/foo=1/bar=2/baz=0/qux=true
> *) dataset/foo=1/bar=2/baz=2/qux=false
> *) dataset/foo=1/bar=2/baz=2/qux=true
> {code}
> import pyarrow.parquet as pq
> import pyarrow.filesystem as fs
> class LoggingLocalFileSystem(fs.LocalFileSystem):
>     def walk(self, path):
>         print(path)
>         return super().walk(path)
> fs = LoggingLocalFileSystem()
> dataset_url = "dataset"
> threads = 6
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs,
>     validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> threads = 5
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs,
>     validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> {code}
> *_Call with 6 threads completes._*
> *_Call with 5 threads hangs indefinitely._*
> {code}
> $ python repro.py
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> dataset/foo=1/bar=2/baz=0/qux=false
> dataset/foo=1/bar=2/baz=0/qux=true
> dataset/foo=1/bar=2/baz=1/qux=false
> dataset/foo=1/bar=2/baz=1/qux=true
> dataset/foo=1/bar=2/baz=2/qux=false
> dataset/foo=1/bar=2/baz=2/qux=true
> 6
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> ^C
> ...
> KeyboardInterrupt
> ^C
> ...
> KeyboardInterrupt
> {code}
> **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and
> when omitting the {{filesystem}} argument.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
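
The report observes that one thread is needed for every non-leaf directory. Taking that observation at face value, a rough way to check a dataset before choosing {{metadata_nthreads}} is to count the directories that contain at least one subdirectory. The sketch below is an illustration only: {{count_non_leaf_dirs}} and {{build_example_tree}} are hypothetical helpers, not part of pyarrow, and the rule itself is the reporter's empirical finding, not documented behavior.

```python
import os
import tempfile


def count_non_leaf_dirs(root):
    """Count directories under root (root included) that have at least one subdirectory."""
    return sum(1 for _dirpath, dirnames, _files in os.walk(root) if dirnames)


def build_example_tree(base):
    """Recreate the partition directories from the report (no data files) under base."""
    root = os.path.join(base, "dataset")
    for baz in range(3):
        for qux in ("false", "true"):
            os.makedirs(
                os.path.join(root, "foo=1", "bar=2", f"baz={baz}", f"qux={qux}")
            )
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = build_example_tree(tmp)
        # Non-leaf directories: dataset, foo=1, bar=2, baz=0, baz=1, baz=2
        print(count_non_leaf_dirs(root))  # prints 6
```

For the layout in the report this gives 6, which matches the smallest thread count that completed in the repro (6 completes, 5 hangs).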