Hi,
I've hit an issue on Python 3.9.12 with pyarrow 7.0.0: creating a PyArrow
dataset over a remote filesystem (such as a GCS filesystem), opening a batch
iterator over the dataset, and then having the program exit / clean up
immediately afterwards causes a fatal PyGILState_Release error.
The error looks like:
Fatal Python error: PyGILState_Release: thread state 0x7fbfd4002380 must be
current when releasing
Python runtime state: finalizing (tstate=0x55a079959380)
Thread 0x00007fbfff5ee400 (most recent call first):
<no Python frame>
Example reproduction code:
import pandas as pd
import pyarrow.dataset as ds
# Get GCS fsspec filesystem
fs = get_gcs_fs()
dummy_df = pd.DataFrame({"a": [1,2,3]})
# Write out some dummy data for us to load a dataset from
data_path = "test-bucket/debug-arrow-datasets/data.parquet"
with fs.open(data_path, "wb") as f:
    dummy_df.to_parquet(f)
dummy_ds = ds.dataset([data_path], filesystem=fs)
batch_iter = dummy_ds.to_batches()
# Program finishes here.
# Putting some buffer time after the iterator is opened makes the issue go
# away:
# import time
# time.sleep(1)
Using local parquet files for the dataset, adding some buffer time between
opening the iterator and program exit (via time.sleep or similar), or consuming
the entire iterator all seem to make the issue go away. Is this reproducible if
you swap in your own GCS filesystem?
Thanks,
Alex