[ https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-4633:
-----------------------------------------
    Labels: dataset-parquet-read newbie parquet  (was: newbie parquet)

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> ----------------------------------------------------------------------
>
>                 Key: ARROW-4633
>                 URL: https://issues.apache.org/jira/browse/ARROW-4633
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1, 0.12.0
>         Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>            Reporter: Taylor Johnson
>            Priority: Minor
>              Labels: dataset-parquet-read, newbie, parquet
>
> The following code suggests that ParquetFile.read(use_threads=False) still creates a ThreadPool. The same behavior is observed with ParquetFile.read_row_group(use_threads=False).
> This does not appear to be a problem for pyarrow.Table.to_pandas(use_threads=False).
> I have tried tracing the error. Starting in python/pyarrow/parquet.py, both ParquetReader.read_all() and ParquetReader.read_row_group() pass the use_threads argument along to self.reader, which is a ParquetReader imported from _parquet.pyx.
> Following the calls into python/pyarrow/_parquet.pyx, ParquetReader.read_all() and ParquetReader.read_row_group() contain the following code, which looks suspicious:
> {code}
> if use_threads:
>     self.set_use_threads(use_threads)
> {code}
> Why not just always call self.set_use_threads(use_threads)? As written, use_threads=False is never forwarded to the underlying reader.
> ParquetReader.set_use_threads simply calls self.reader.get().set_use_threads(use_threads), where self.reader is a unique_ptr[FileReader]. I think this points to cpp/src/parquet/arrow/reader.cc, but I'm not sure about that. The FileReader::Impl::ReadRowGroup logic looks fine: ::arrow::internal::GetCpuThreadPool() is only called if use_threads is true. The same is true for ReadTable.
> So when is the ThreadPool getting created?
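> The effect of that guard can be illustrated with a small Python sketch. The Reader class below is a hypothetical stand-in, not the real Cython class; it only assumes the wrapped reader starts with threading enabled by default:
> {code}
> # Hypothetical stand-in for the Cython ParquetReader, illustrating why
> # guarding the setter with `if use_threads:` is a bug: a False value is
> # silently dropped, so the underlying reader keeps its threaded default.
>
> class Reader:
>     def __init__(self):
>         # assume the wrapped C++ FileReader defaults to threaded reads
>         self.use_threads = True
>
>     def set_use_threads(self, use_threads):
>         self.use_threads = use_threads
>
>     def read(self, use_threads):
>         # the suspicious pattern from _parquet.pyx
>         if use_threads:
>             self.set_use_threads(use_threads)
>         return self.use_threads
>
> r = Reader()
> print(r.read(use_threads=False))  # True: the False was never forwarded
> r.set_use_threads(False)          # an unconditional call would propagate it
> print(r.read(use_threads=False))  # False
> {code}

```python
# Hypothetical stand-in for the Cython ParquetReader, illustrating why
# guarding the setter with `if use_threads:` is a bug: a False value is
# silently dropped, so the underlying reader keeps its threaded default.

class Reader:
    def __init__(self):
        # assume the wrapped C++ FileReader defaults to threaded reads
        self.use_threads = True

    def set_use_threads(self, use_threads):
        self.use_threads = use_threads

    def read(self, use_threads):
        # the suspicious pattern from _parquet.pyx
        if use_threads:
            self.set_use_threads(use_threads)
        return self.use_threads

r = Reader()
print(r.read(use_threads=False))  # True: the False was never forwarded
r.set_use_threads(False)          # an unconditional call would propagate it
print(r.read(use_threads=False))  # False
```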
> Example code:
> --------------------------------------------------
> {code}
> import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> use_threads = False
> p = psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
>
> df = pd.DataFrame({'x': [0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
>
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
>
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
>
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
>
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> {code}
> -----------------------------------------------------------------------
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
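> For anyone reproducing this without psutil, the OS-level thread count can also be read from /proc on Linux (the environment of this report). The num_threads helper below is a hypothetical stdlib-only sketch, not part of the report:
> {code}
> # Stdlib-only alternative to psutil.Process().num_threads(): read the
> # Threads: field from /proc/self/status. Linux-specific; returns -1
> # where /proc is unavailable.
>
> def num_threads():
>     try:
>         with open('/proc/self/status') as f:
>             for line in f:
>                 if line.startswith('Threads:'):
>                     return int(line.split()[1])
>     except OSError:
>         pass  # /proc not available on this platform
>     return -1
>
> print('Running with {} threads'.format(num_threads()))
> {code}

```python
# Stdlib-only alternative to psutil.Process().num_threads(): read the
# Threads: field from /proc/self/status. Linux-specific; returns -1
# where /proc is unavailable.

def num_threads():
    try:
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('Threads:'):
                    return int(line.split()[1])
    except OSError:
        pass  # /proc not available on this platform
    return -1

print('Running with {} threads'.format(num_threads()))
```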