[jira] [Created] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view
Paul Balanca created ARROW-11006:
Summary: [Python] Array to_numpy slow compared to Numpy.view
Key: ARROW-11006
URL: https://issues.apache.org/jira/browse/ARROW-11006
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Paul Balanca
Assignee: Paul Balanca

The method `to_numpy` is quite slow compared to NumPy slicing and viewing performance. For instance:
{code:python}
N = 100
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The previous benchmark is clearly an extreme case, but the idea is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting a view should be as small as possible (there are scenarios where you cannot easily cache this view, so you end up calling `to_numpy` a fair number of times). Part of this overhead is probably due to PyArrow implementing a very generic Pandas conversion and using it even for very simple NumPy-like dense arrays.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas
Michael Peleshenko created ARROW-11007:
Summary: [Python] Memory leak in pq.read_table and table.to_pandas
Key: ARROW-11007
URL: https://issues.apache.org/jira/browse/ARROW-11007
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Michael Peleshenko

While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.

*Sample Code*
{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 200
    df = pd.DataFrame({
        "string": ["test"] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
{code}

*Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment    Occurrences    Line Contents
     9    161.7 MiB    161.7 MiB            1      @profile
    10                                             def read_file(f):
    11    212.1 MiB     50.4 MiB            1          table = pq.read_table(f)
    12    258.2 MiB     46.1 MiB            1          df = table.to_pandas(strings_to_categorical=True)
    13    258.2 MiB      0.0 MiB            1          del table
    14    256.3 MiB     -1.9 MiB            1          del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment    Occurrences    Line Contents
     9    256.3 MiB    256.3 MiB            1      @profile
    10                                             def read_file(f):
    11    279.2 MiB     23.0 MiB            1          table = pq.read_table(f)
    12    322.2 MiB     43.0 MiB            1          df = table.to_pandas(strings_to_categorical=True)
    13    322.2 MiB      0.0 MiB            1          del table
    14    320.3 MiB     -1.9 MiB            1          del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment    Occurrences    Line Contents
     9    320.3 MiB    320.3 MiB            1      @profile
    10                                             def read_file(f):
    11    326.9 MiB      6.5 MiB            1          table = pq.read_table(f)
    12    361.7 MiB     34.8 MiB            1          df = table.to_pandas(strings_to_categorical=True)
    13    361.7 MiB      0.0 MiB            1          del table
    14    359.8 MiB     -1.9 MiB            1          del df
{code}

*Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment    Occurrences    Line Contents
     9    138.4 MiB    138.4 MiB            1      @profile
    10                                             def read_file(f):
    11    186.2 MiB     47.8 MiB            1          table = pq.read_table(f)
    12    219.2 MiB     33.0 MiB            1          df = table.to_pandas(strings_to_categorical=True)
    13    171.7 MiB    -47.5 MiB            1          del table
    14    139.3 MiB    -32.4 MiB            1          del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment    Occurrences    Line Contents
     9    139.3 MiB    139.3 MiB            1      @profile
    10                                             def read_file(f):
    11    186.8 MiB     47.5 MiB            1          table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB            1          df = table.to_pandas(strings_to_categorical=True)
    13    171.5 MiB    -47.7 MiB            1          del table
    14    139.1 MiB    -32.4 MiB            1          del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment    Occurrences    Line Contents
     9    139.1 MiB    139.1 MiB            1      @profile
    10                                             def read_file(f):
    11    186.8 MiB     47.7 MiB            1          table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB            1          df = table.to_pandas(strings_to_categorical=True)
{code}
[jira] [Created] (ARROW-11008) [Rust][DataFusion] Simplify count accumulator
Daniël Heres created ARROW-11008:
Summary: [Rust][DataFusion] Simplify count accumulator
Key: ARROW-11008
URL: https://issues.apache.org/jira/browse/ARROW-11008
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres
[jira] [Created] (ARROW-11009) [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc
Wes McKinney created ARROW-11009:
Summary: [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc
Key: ARROW-11009
URL: https://issues.apache.org/jira/browse/ARROW-11009
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney
Fix For: 3.0.0

We routinely get reports like ARROW-11007 where there is suspicion of a memory leak (which may or may not be valid). Having an easy way, requiring no changes to application code, to toggle usage of the non-system memory allocator would help with determining whether the memory usage patterns are the result of the allocator being used.
[jira] [Created] (ARROW-11010) [Python] `np.float` deprecation warning in `_pandas_logical_type_map`
Tim Swast created ARROW-11010:
Summary: [Python] `np.float` deprecation warning in `_pandas_logical_type_map`
Key: ARROW-11010
URL: https://issues.apache.org/jira/browse/ARROW-11010
Project: Apache Arrow
Issue Type: Task
Reporter: Tim Swast

I get the following warning when converting a floating point column in a pandas dataframe into a pyarrow array:
```
/Users/swast/src/python-bigquery/.nox/prerelease_deps/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1031: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. Use `float` by itself, which is identical in behavior, to silence this warning. If you specifically wanted the numpy scalar type, use `np.float_` here.
  'floating': np.float,
```
[jira] [Created] (ARROW-11011) [Rust] [DataFusion] Implement hash partitioning
Andy Grove created ARROW-11011:
Summary: [Rust] [DataFusion] Implement hash partitioning
Key: ARROW-11011
URL: https://issues.apache.org/jira/browse/ARROW-11011
Project: Apache Arrow
Issue Type: New Feature
Components: Rust - DataFusion
Reporter: Andy Grove

Once https://issues.apache.org/jira/browse/ARROW-10582 is implemented, we should add support for hash partitioning. The logical and physical plans already support creating a plan with hash partitioning, but the execution still needs to be implemented in repartition.rs.
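The real implementation belongs in DataFusion's repartition.rs; as a language-agnostic sketch of the idea, hash partitioning routes each row by the hash of its key columns, so rows with equal keys always land in the same output partition (which is what makes partitioned hash joins and aggregations correct):

```python
# Illustrative sketch of hash partitioning (not DataFusion code).
def hash_partition(rows, key_fn, num_partitions):
    """Route each row to partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key_fn(row)) % num_partitions].append(row)
    return partitions

rows = [("a", 1), ("b", 2), ("a", 3)]
parts = hash_partition(rows, key_fn=lambda r: r[0], num_partitions=4)
# Both ("a", ...) rows are guaranteed to end up in the same partition.
```

In DataFusion the equivalent would hash Arrow key arrays batch-at-a-time rather than row-at-a-time, but the routing invariant is the same.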
[jira] [Created] (ARROW-11012) [Rust] [DataFusion] Make write_csv and write_parquet concurrent
Andy Grove created ARROW-11012:
Summary: [Rust] [DataFusion] Make write_csv and write_parquet concurrent
Key: ARROW-11012
URL: https://issues.apache.org/jira/browse/ARROW-11012
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Andy Grove

ExecutionContext.write_csv and write_parquet currently iterate over the output partitions, executing each one and writing its results out one at a time. We should run these as tokio tasks so they can run concurrently. This should, in theory, help with memory usage when the plan contains repartition operators. We may want to add a configuration option so we can choose between serial and parallel writes?
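A minimal sketch of the proposed change, using Python's asyncio as a stand-in for tokio: spawn one task per output partition and await them all, instead of executing and writing partitions sequentially.

```python
# Concurrent per-partition writes (asyncio standing in for tokio tasks).
import asyncio


async def write_partition(index, batch):
    # Placeholder for the real write_csv/write_parquet I/O for one partition.
    await asyncio.sleep(0)
    return f"part-{index}: {len(batch)} rows"


async def write_all(partitions):
    # gather() runs all partition writes concurrently, like tokio::spawn + join.
    tasks = [write_partition(i, p) for i, p in enumerate(partitions)]
    return await asyncio.gather(*tasks)


results = asyncio.run(write_all([[1, 2], [3], []]))
```

The memory benefit the issue mentions comes from consuming all repartition inputs concurrently instead of buffering partitions that are not yet being drained.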
[jira] [Created] (ARROW-11013) [Rust] CSV Reader cannot handle leading/trailing WhiteSpace
Mike Seddon created ARROW-11013:
Summary: [Rust] CSV Reader cannot handle leading/trailing WhiteSpace
Key: ARROW-11013
URL: https://issues.apache.org/jira/browse/ARROW-11013
Project: Apache Arrow
Issue Type: Bug
Components: Rust, Rust - DataFusion
Affects Versions: 2.0.0
Reporter: Mike Seddon

Currently the CSV reader assumes very clean input data without things like leading spaces. This means parsing data like the TPC-H 'answers' set from the databricks/tpch_dbgen repo (sample below) does not work. Spark uses the Univocity parser library, which provides the options 'ignoreLeadingWhitespace' and 'ignoreTrailingWhitespace' that would help fix this issue.
```
l|l|sum_qty|sum_base_price|sum_disc_price|sum_charge|avg_qty|avg_price|avg_disc|count_order
A|F|37734107.00|56586554400.73|53758257134.87|55909065222.83|25.52|38273.13|0.05| 1478493
```
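A sketch of the desired behavior, with Python's stdlib csv module used purely for illustration: trim leading/trailing whitespace from each field before type inference, which is what Univocity's `ignoreLeadingWhitespace`/`ignoreTrailingWhitespace` options do for Spark.

```python
# Trimming per-field whitespace before parsing; without the strip(), the
# " 1478493" field in the sample above fails integer parsing.
import csv
import io

raw = "A|F|37734107.00| 1478493\n"
reader = csv.reader(io.StringIO(raw), delimiter="|")
rows = [[field.strip() for field in row] for row in reader]
```

In the Rust reader this would be a pair of opt-in builder flags applied to each field's byte slice before the typed parse.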
[jira] [Created] (ARROW-11014) [Rust] [DataFusion] ParquetExec reports incorrect statistics
Andy Grove created ARROW-11014:
Summary: [Rust] [DataFusion] ParquetExec reports incorrect statistics
Key: ARROW-11014
URL: https://issues.apache.org/jira/browse/ARROW-11014
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove

ParquetExec represents one or more Parquet files but currently returns statistics based only on the first file.
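A sketch of the fix in language-agnostic terms: when a scan covers several files, additive statistics such as row counts and byte sizes must be summed across all of them, not taken from the first.

```python
# Illustrative sketch (not DataFusion code): aggregate per-file statistics
# for a multi-file scan instead of reporting the first file's numbers.
def combined_statistics(per_file_stats):
    return {
        "num_rows": sum(s["num_rows"] for s in per_file_stats),
        "total_byte_size": sum(s["total_byte_size"] for s in per_file_stats),
    }

stats = combined_statistics([
    {"num_rows": 100, "total_byte_size": 4096},
    {"num_rows": 50, "total_byte_size": 2048},
])
```

Non-additive statistics (e.g. min/max column values) would instead need to be merged with min/max over the files, or dropped when any file lacks them.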