[ https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-4633:
-----------------------------------------
    Labels: dataset-parquet-read newbie parquet  (was: newbie parquet)

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> ----------------------------------------------------------------------
>
>                 Key: ARROW-4633
>                 URL: https://issues.apache.org/jira/browse/ARROW-4633
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1, 0.12.0
>         Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>            Reporter: Taylor Johnson
>            Priority: Minor
>              Labels: dataset-parquet-read, newbie, parquet
>
> The following code suggests that ParquetFile.read(use_threads=False) still creates a ThreadPool. The same behavior is observed with ParquetFile.read_row_group(use_threads=False).
> This does not appear to be a problem for pyarrow.Table.to_pandas(use_threads=False).
> I have tried tracing the error. Starting in python/pyarrow/parquet.py, both ParquetReader.read_all() and ParquetReader.read_row_group() pass the use_threads argument along to self.reader, which is a ParquetReader imported from _parquet.pyx.
> Following the calls into python/pyarrow/_parquet.pyx, ParquetReader.read_all() and ParquetReader.read_row_group() contain the following code, which looks suspicious:
> {code}
> if use_threads:
>     self.set_use_threads(use_threads)
> {code}
> Why not just always call self.set_use_threads(use_threads)? As written, use_threads=False is never forwarded to the underlying reader.
> ParquetReader.set_use_threads simply calls self.reader.get().set_use_threads(use_threads), where self.reader is a unique_ptr[FileReader]. I think this points to cpp/src/parquet/arrow/reader.cc, but I'm not sure about that. The FileReader::Impl::ReadRowGroup logic looks fine: ::arrow::internal::GetCpuThreadPool() is only called if use_threads is true. The same is true for ReadTable.
> So when is the ThreadPool getting created?
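> The effect of that guard can be illustrated with a small Python sketch. The Reader class below is a hypothetical stand-in, not the real Cython class; it only assumes the wrapped reader starts with threading enabled by default:
> {code}
> # Hypothetical stand-in for the Cython ParquetReader, illustrating why
> # guarding the setter with `if use_threads:` is a bug: a False value is
> # silently dropped, so the underlying reader keeps its threaded default.
>
> class Reader:
>     def __init__(self):
>         # assume the wrapped C++ FileReader defaults to threaded reads
>         self.use_threads = True
>
>     def set_use_threads(self, use_threads):
>         self.use_threads = use_threads
>
>     def read(self, use_threads):
>         # the suspicious pattern from _parquet.pyx
>         if use_threads:
>             self.set_use_threads(use_threads)
>         return self.use_threads
>
> r = Reader()
> print(r.read(use_threads=False))  # True: the False was never forwarded
> r.set_use_threads(False)          # an unconditional call would propagate it
> print(r.read(use_threads=False))  # False
> {code}

```python
# Hypothetical stand-in for the Cython ParquetReader, illustrating why
# guarding the setter with `if use_threads:` is a bug: a False value is
# silently dropped, so the underlying reader keeps its threaded default.

class Reader:
    def __init__(self):
        # assume the wrapped C++ FileReader defaults to threaded reads
        self.use_threads = True

    def set_use_threads(self, use_threads):
        self.use_threads = use_threads

    def read(self, use_threads):
        # the suspicious pattern from _parquet.pyx
        if use_threads:
            self.set_use_threads(use_threads)
        return self.use_threads

r = Reader()
print(r.read(use_threads=False))  # True: the False was never forwarded
r.set_use_threads(False)          # an unconditional call would propagate it
print(r.read(use_threads=False))  # False
```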
> Example code:
> --------------------------------------------------
> {code}
> import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> use_threads = False
> p = psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
>
> df = pd.DataFrame({'x': [0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
>
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
>
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
>
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
>
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> {code}
> -----------------------------------------------------------------------
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
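> For anyone reproducing this without psutil, the OS-level thread count can also be read from /proc on Linux (the environment of this report). The num_threads helper below is a hypothetical stdlib-only sketch, not part of the report:
> {code}
> # Stdlib-only alternative to psutil.Process().num_threads(): read the
> # Threads: field from /proc/self/status. Linux-specific; returns -1
> # where /proc is unavailable.
>
> def num_threads():
>     try:
>         with open('/proc/self/status') as f:
>             for line in f:
>                 if line.startswith('Threads:'):
>                     return int(line.split()[1])
>     except OSError:
>         pass  # /proc not available on this platform
>     return -1
>
> print('Running with {} threads'.format(num_threads()))
> {code}

```python
# Stdlib-only alternative to psutil.Process().num_threads(): read the
# Threads: field from /proc/self/status. Linux-specific; returns -1
# where /proc is unavailable.

def num_threads():
    try:
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('Threads:'):
                    return int(line.split()[1])
    except OSError:
        pass  # /proc not available on this platform
    return -1

print('Running with {} threads'.format(num_threads()))
```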