[jira] [Resolved] (ARROW-11350) [C++] Bump dependency versions
[ https://issues.apache.org/jira/browse/ARROW-11350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-11350. -- Resolution: Fixed Issue resolved by pull request 9296 [https://github.com/apache/arrow/pull/9296] > [C++] Bump dependency versions > -- > > Key: ARROW-11350 > URL: https://issues.apache.org/jira/browse/ARROW-11350 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 7h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11470) [C++] Overflow occurs on integer multiplications in ComputeRowMajorStrides, ComputeColumnMajorStrides, and CheckTensorStridesValidity
[ https://issues.apache.org/jira/browse/ARROW-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenta Murata updated ARROW-11470: - Summary: [C++] Overflow occurs on integer multiplications in ComputeRowMajorStrides, ComputeColumnMajorStrides, and CheckTensorStridesValidity (was: [C++] Overflow occurs on integer multiplications in Compute(Row|Column)MajorStrides) > [C++] Overflow occurs on integer multiplications in ComputeRowMajorStrides, > ComputeColumnMajorStrides, and CheckTensorStridesValidity > - > > Key: ARROW-11470 > URL: https://issues.apache.org/jira/browse/ARROW-11470 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > OSS-Fuzz reports that the integer multiplication in the ComputeRowMajorStrides > function overflows. > https://oss-fuzz.com/testcase-detail/623225726408 > The same issue exists in ComputeColumnMajorStrides. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-11398) [C++][Compute] Test failures with gcc-9.3 on aarch64
[ https://issues.apache.org/jira/browse/ARROW-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai closed ARROW-11398. Resolution: Won't Fix To summarize: - gcc-9.3 aarch64 auto vectorization generates buggy code for this [code block|https://github.com/apache/arrow/blob/d0ce28b7fdcfba16de404e20eda47097df17/cpp/src/arrow/util/bitmap_generate.h#L89-L96]. - gcc-7.5, gcc-10.1 don't have this problem. > [C++][Compute] Test failures with gcc-9.3 on aarch64 > > > Key: ARROW-11398 > URL: https://issues.apache.org/jira/browse/ARROW-11398 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 >Reporter: Tobias Mayer >Assignee: Yibo Cai >Priority: Major > Attachments: aarch64.log > > > The tests > * arrow-compute-scalar-test > ** "TestCompareKernel.PrimitiveRandomTests" > * arrow-compute-vector-test > ** "TestFilterKernelWithNumeric/3.CompareArrayAndFilterRandomNumeric" > ** "TestFilterKernelWithNumeric/7.CompareArrayAndFilterRandomNumeric" > Fail on aarch64 on NixOS. Full Build log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11398) [C++][Compute] Test failures with gcc-9.3 on aarch64
[ https://issues.apache.org/jira/browse/ARROW-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277665#comment-17277665 ] Yibo Cai commented on ARROW-11398: -- Wrote a simple test program to reproduce this issue. Details in gcc bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98949 > [C++][Compute] Test failures with gcc-9.3 on aarch64 > > > Key: ARROW-11398 > URL: https://issues.apache.org/jira/browse/ARROW-11398 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 >Reporter: Tobias Mayer >Assignee: Yibo Cai >Priority: Major > Attachments: aarch64.log > > > The tests > * arrow-compute-scalar-test > ** "TestCompareKernel.PrimitiveRandomTests" > * arrow-compute-vector-test > ** "TestFilterKernelWithNumeric/3.CompareArrayAndFilterRandomNumeric" > ** "TestFilterKernelWithNumeric/7.CompareArrayAndFilterRandomNumeric" > Fail on aarch64 on NixOS. Full Build log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
[ https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277659#comment-17277659 ] Paul Taylor edited comment on ARROW-10255 at 2/3/21, 3:44 AM: -- [~bhulette] I vote no on the current PR for 4 reasons: # Arrow releases have moved to releases major-version-revs only, so npm won't upgrade libs/people by default # The [very minor changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507] we had to make to our own tests makes me think this shouldn't be a huge pain to upgrade for users (not to mention Field and Schema aren't the most common APIs to interact with directly) # These changes are needed by [vis.gl|https://github.com/visgl] to import ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're reimplementing the bits of Schema/Field/DataType they need because importing ours adds ~250k (~24k minified) to their bundle, which is over their size budget. # Adding deprecation warnings will add to the size of the lib, and likely in ways that can't be tree-shaken. In Python size on-disk isn't an issue, so people add deprecation warnings all the time, but without extensive tooling support it's difficult to do/guide users how to do in JS. was (Author: paul.e.taylor): [~bhulette] I vote no on the current PR for 3 reasons: # Arrow releases have moved to releases major-version-revs only, so npm won't upgrade libs/people by default # The [very minor changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507] we had to make to our own tests makes me think this shouldn't be a huge pain to upgrade for users (not to mention Field and Schema aren't the most common APIs to interact with directly) # These changes are needed by [vis.gl|https://github.com/visgl] to import ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're reimplementing the bits of Schema/Field/DataType they need because importing ours adds ~250k (~24k minified) to their bundle, which is over their size budget. > [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking > --- > > Key: ARROW-10255 > URL: https://issues.apache.org/jira/browse/ARROW-10255 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: 0.17.1 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Presently most of our public classes can't be easily > [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library > consumers. This is a problem for libraries that only need to use parts of > Arrow. 
> For example, the vis.gl projects have an integration test that imports three > of our simpler classes and tests the resulting bundle size: > {code:javascript} > import {Schema, Field, Float32} from 'apache-arrow'; > // | Bundle Size| Compressed > // | 202KB (207112) KB | 45KB (46618) KB > {code} > We can help solve this with the following changes: > * Add "sideEffects": false to our ESM package.json > * Reorganize our imports to only include what's needed > * Eliminate or move some static/member methods to standalone exported > functions > * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't > compile in its own Buffer shim > * Removing flatbuffers namespaces from generated TS because these defeat > Webpack's tree-shaking ability > Candidate functions for removal/moving to standalone functions: > * Schema.new, Schema.from, Schema.prototype.compareTo > * Field.prototype.compareTo > * Type.prototype.compareTo > * Table.new, Table.from > * Column.new > * Vector.new, Vector.from > * RecordBatchReader.from > After applying a few of the above changes to the Schema and flatbuffers > files, I was able to reduce the vis.gl's import size 90%: > {code:javascript} > // Bundle Size | Compressed > // 24KB (24942) KB | 6KB (6154) KB > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
[ https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277659#comment-17277659 ] Paul Taylor commented on ARROW-10255: - [~bhulette] I vote no on the current PR for 3 reasons: # Arrow releases have moved to releases major-version-revs only, so npm won't upgrade libs/people by default # The [very minor changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507] we had to make to our own tests makes me think this shouldn't be a huge pain to upgrade for users (not to mention Field and Schema aren't the most common APIs to interact with directly) # These changes are needed by [vis.gl|https://github.com/visgl] to import ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're reimplementing the bits of Schema/Field/DataType they need because importing ours adds ~250k (~24k minified) to their bundle, which is over their size budget. > [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking > --- > > Key: ARROW-10255 > URL: https://issues.apache.org/jira/browse/ARROW-10255 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: 0.17.1 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Presently most of our public classes can't be easily > [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library > consumers. This is a problem for libraries that only need to use parts of > Arrow. > For example, the vis.gl projects have an integration test that imports three > of our simpler classes and tests the resulting bundle size: > {code:javascript} > import {Schema, Field, Float32} from 'apache-arrow'; > // | Bundle Size| Compressed > // | 202KB (207112) KB | 45KB (46618) KB > {code} > We can help solve this with the following changes: > * Add "sideEffects": false to our ESM package.json > * Reorganize our imports to only include what's needed > * Eliminate or move some static/member methods to standalone exported > functions > * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't > compile in its own Buffer shim > * Removing flatbuffers namespaces from generated TS because these defeat > Webpack's tree-shaking ability > Candidate functions for removal/moving to standalone functions: > * Schema.new, Schema.from, Schema.prototype.compareTo > * Field.prototype.compareTo > * Type.prototype.compareTo > * Table.new, Table.from > * Column.new > * Vector.new, Vector.from > * RecordBatchReader.from > After applying a few of the above changes to the Schema and flatbuffers > files, I was able to reduce the vis.gl's import size 90%: > {code:javascript} > // Bundle Size | Compressed > // 24KB (24942) KB | 6KB (6154) KB > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277647#comment-17277647 ] Tao He commented on ARROW-11463: Thanks for the background on pickle 5, [~lausen]. For your case, rather than invoking `pa.serialize`, you could create a stream using `pa.ipc.new_stream` and feed it an IpcWriteOptions, then use `.write_table()` to write your tables to the stream. cf.: + [https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream] + [https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchStreamWriter.html#pyarrow.RecordBatchStreamWriter.write_table] Hope that helps! > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
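A minimal sketch of the stream-writer approach suggested above, assuming a pyarrow build in which IpcWriteOptions exposes the allow_64bit flag (this issue's fix version, 4.0.0, or later); the small table and in-memory sink are placeholders for the large chunked table and the Plasma-style destination:

{code:python}
import pyarrow as pa

# Placeholder for the large, many-chunk table from the report.
table = pa.table({"x": list(range(10))})

# allow_64bit on IpcWriteOptions is assumed here -- exposing it is the point of this issue.
options = pa.ipc.IpcWriteOptions(allow_64bit=True)

sink = pa.BufferOutputStream()  # or a file / shared-memory region for Plasma-style use
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)

# Read the stream back to verify the round trip.
with pa.ipc.open_stream(sink.getvalue()) as reader:
    restored = reader.read_all()
{code}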
[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pac A. He updated ARROW-11456: -- Description: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 1.5 billion unique rows. Each value was effectively a unique base64 encoded length 24 string. was: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 1.5 billion unique rows. Each value was effectively a unique base64 encoded length 22 string. > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 >Reporter: Pac A. 
He >Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 1.5 billion unique > rows. Each value was effectively a unique base64 encoded length 24 string. -- This message was sent by Atlassian Jira (v8.3.4#803005)
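Two possible mitigations, sketched below; neither is a confirmed fix for this issue, and the file name and column are placeholders. The first only helps if the file actually contains more than one row group; the second avoids the 32-bit offset limit on the writer side by storing the column as large_string:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# 1. Read one row group at a time so no single read has to build a string
#    column holding more than 2^31 - 1 bytes of character data.
pf = pq.ParquetFile("large.parquet")
parts = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
table = pa.concat_tables(parts)  # stays chunked, so each chunk keeps its own 32-bit offsets

# 2. When producing the file, store the huge column as large_string
#    (64-bit offsets) instead of the default string (32-bit offsets).
big = pa.table({"id": pa.array(["Zm9vYmFyYmF6cXV1eA=="] * 3, type=pa.large_string())})
pq.write_table(big, "large_string.parquet")
{code}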
[jira] [Updated] (ARROW-11479) Add method to return compressed size of row group
[ https://issues.apache.org/jira/browse/ARROW-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Karthick updated ARROW-11479: --- Issue Type: New Feature (was: Improvement) > Add method to return compressed size of row group > - > > Key: ARROW-11479 > URL: https://issues.apache.org/jira/browse/ARROW-11479 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Manoj Karthick >Priority: Minor > > Create a method to return the compressed size of all columns in the row > group. This will help with calculating the total compressed size of the > Parquet File. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11479) Add method to return compressed size of row group
[ https://issues.apache.org/jira/browse/ARROW-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11479: --- Labels: pull-request-available (was: ) > Add method to return compressed size of row group > - > > Key: ARROW-11479 > URL: https://issues.apache.org/jira/browse/ARROW-11479 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Manoj Karthick >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Create a method to return the compressed size of all columns in the row > group. This will help with calculating the total compressed size of the > Parquet File. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11479) Add method to return compressed size of row group
Manoj Karthick created ARROW-11479: -- Summary: Add method to return compressed size of row group Key: ARROW-11479 URL: https://issues.apache.org/jira/browse/ARROW-11479 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Manoj Karthick Create a method to return the compressed size of all columns in the row group. This will help with calculating the total compressed size of the Parquet File. -- This message was sent by Atlassian Jira (v8.3.4#803005)
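For reference, pyarrow already exposes the equivalent information through its Parquet metadata API; a minimal sketch of the lookup the requested Rust method would provide (the file name is a placeholder):

{code:python}
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # Compressed size of a row group = sum of its column chunks' compressed sizes.
    compressed = sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
    print(i, compressed)
{code}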
[jira] [Commented] (ARROW-11478) [R] Consider ways to make arrow.skip_nul option more user-friendly
[ https://issues.apache.org/jira/browse/ARROW-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277628#comment-17277628 ] Ian Cook commented on ARROW-11478: -- I think I'm in favor of option 3, assuming it's feasible (which I think it should be). > [R] Consider ways to make arrow.skip_nul option more user-friendly > -- > > Key: ARROW-11478 > URL: https://issues.apache.org/jira/browse/ARROW-11478 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Minor > Fix For: 4.0.0 > > > In Arrow 3.0.0, the {{arrow.skip_nul}} option effectively defaults to > {{FALSE}} for consistency with {{base::readLines}} and {{base::scan}}. > If the user keeps this default option value, then conversion of string data > containing embedded nuls causes an error with a message like: > {code:java} > embedded nul in string: '\0' {code} > If the user sets the option to {{TRUE}}, then no error occurs, but this > warning is issued: > {code:java} > Stripping '\0' (nul) from character vector {code} > Consider whether we should: > # Keep this all as it is > # Change the default option value to {{TRUE}} > # Keep the default option value as it is, but catch the error and re-throw > it with a more actionable message that tells the user how to set the option -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11478) [R] Consider ways to make arrow.skip_nul option more user-friendly
Ian Cook created ARROW-11478: Summary: [R] Consider ways to make arrow.skip_nul option more user-friendly Key: ARROW-11478 URL: https://issues.apache.org/jira/browse/ARROW-11478 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 3.0.0 Reporter: Ian Cook Assignee: Ian Cook Fix For: 4.0.0 In Arrow 3.0.0, the {{arrow.skip_nul}} option effectively defaults to {{FALSE}} for consistency with {{base::readLines}} and {{base::scan}}. If the user keeps this default option value, then conversion of string data containing embedded nuls causes an error with a message like: {code:java} embedded nul in string: '\0' {code} If the user sets the option to {{TRUE}}, then no error occurs, but this warning is issued: {code:java} Stripping '\0' (nul) from character vector {code} Consider whether we should: # Keep this all as it is # Change the default option value to {{TRUE}} # Keep the default option value as it is, but catch the error and re-throw it with a more actionable message that tells the user how to set the option -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-951) [JS] Fix generated API documentation
[ https://issues.apache.org/jira/browse/ARROW-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-951. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9375 [https://github.com/apache/arrow/pull/9375] > [JS] Fix generated API documentation > > > Key: ARROW-951 > URL: https://issues.apache.org/jira/browse/ARROW-951 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Minor > Labels: documentation, pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > The current generated API documentation doesn't respect the project's > namespaces, it simply lists all exported objects. We should see if we can > make typedoc display the project's structure (even if it means re-structuring > the code a bit), or find another approach for doc generation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11467) [R] Fix reference to json_table_reader() in R docs
[ https://issues.apache.org/jira/browse/ARROW-11467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-11467. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9393 [https://github.com/apache/arrow/pull/9393] > [R] Fix reference to json_table_reader() in R docs > -- > > Key: ARROW-11467 > URL: https://issues.apache.org/jira/browse/ARROW-11467 > Project: Apache Arrow > Issue Type: Task > Components: R >Affects Versions: 3.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The docs entry for the R function {{read_json_arrow()}} refers to the > nonexistent function {{json_table_reader()}}. This should be changed to > {{JsonTableReader$create()}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv
[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277546#comment-17277546 ] Jonathan Keane commented on ARROW-11433: Yeah, I tried it with the system allocator and that alone doesn’t resolve it > [R] Unexpectedly slow results reading csv > - > > Key: ARROW-11433 > URL: https://issues.apache.org/jira/browse/ARROW-11433 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Priority: Minor > > This came up working on benchmarking Arrow's CSV reading. As far as I can > tell this only impacts R, and only when reading the csv into arrow (but not > pulling it in to R). It appears that most arrow interactions after the csv is > read will result in this behavior not happening. > What I'm seeing is that on subsequent reads, the time to read gets longer and > longer (frequently in a stair step pattern where every other iteration takes > longer). > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + tab <- NULL > + } > + }) >user system elapsed > 24.788 19.485 7.216 >user system elapsed > 24.952 21.786 9.225 >user system elapsed > 25.150 23.039 10.332 >user system elapsed > 25.382 31.012 17.995 >user system elapsed > 25.309 25.140 12.356 >user system elapsed > 25.302 26.975 13.938 >user system elapsed > 25.509 34.390 21.134 >user system elapsed > 25.674 28.195 15.048 >user system elapsed > 25.031 28.094 16.449 >user system elapsed > 25.825 37.165 23.379 > # total time: >user system elapsed > 256.178 299.671 175.119 > {code} > Interestingly, doing something as unrelated as > {{arrow:::default_memory_pool()}} which is [only getting the default memory > pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. > Other interactions totally unrelated to the table also similarly alleviate > this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or > proactively invalidating with {{tab$invalidate()}} > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + pool <- arrow:::default_memory_pool() > + tab <- NULL > + } > + }) >user system elapsed > 25.257 19.475 6.785 >user system elapsed > 25.271 19.838 6.821 >user system elapsed > 25.288 20.103 6.861 >user system elapsed > 25.188 20.290 7.217 >user system elapsed > 25.283 20.043 6.832 >user system elapsed > 25.194 19.947 6.906 >user system elapsed > 25.278 19.993 6.834 >user system elapsed > 25.355 20.018 6.833 >user system elapsed > 24.986 19.869 6.865 >user system elapsed > 25.130 19.878 6.798 > # total time: >user system elapsed > 255.381 210.598 83.109 > > > {code} > I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the > same behavior. > I checked against pyarrow, and do not see the same: > {code:python} > from pyarrow import csv > import time > for i in range(1, 10): > start = time.time() > table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv") > print(time.time() - start) > del table > {code} > results: > {code} > 7.586184978485107 > 7.542470932006836 > 7.92852783203125 > 7.647372007369995 > 7.742412805557251 > 8.101378917694092 > 7.7359960079193115 > 7.843957901000977 > 7.6457719802856445 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv
[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277544#comment-17277544 ] Neal Richardson commented on ARROW-11433: - "Only on mac" and "freeing memory" makes me think of ARROW-6994 and the various issues linked to that (see also https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L67-L85). I forget, have you tried using the system memory allocator? > [R] Unexpectedly slow results reading csv > - > > Key: ARROW-11433 > URL: https://issues.apache.org/jira/browse/ARROW-11433 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Priority: Minor > > This came up working on benchmarking Arrow's CSV reading. As far as I can > tell this only impacts R, and only when reading the csv into arrow (but not > pulling it in to R). It appears that most arrow interactions after the csv is > read will result in this behavior not happening. > What I'm seeing is that on subsequent reads, the time to read gets longer and > longer (frequently in a stair step pattern where every other iteration takes > longer). > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + tab <- NULL > + } > + }) >user system elapsed > 24.788 19.485 7.216 >user system elapsed > 24.952 21.786 9.225 >user system elapsed > 25.150 23.039 10.332 >user system elapsed > 25.382 31.012 17.995 >user system elapsed > 25.309 25.140 12.356 >user system elapsed > 25.302 26.975 13.938 >user system elapsed > 25.509 34.390 21.134 >user system elapsed > 25.674 28.195 15.048 >user system elapsed > 25.031 28.094 16.449 >user system elapsed > 25.825 37.165 23.379 > # total time: >user system elapsed > 256.178 299.671 175.119 > {code} > Interestingly, doing something as unrelated as > {{arrow:::default_memory_pool()}} which is [only getting the default memory > pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. > Other interactions totally unrelated to the table also similarly alleviate > this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or > proactively invalidating with {{tab$invalidate()}} > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + pool <- arrow:::default_memory_pool() > + tab <- NULL > + } > + }) >user system elapsed > 25.257 19.475 6.785 >user system elapsed > 25.271 19.838 6.821 >user system elapsed > 25.288 20.103 6.861 >user system elapsed > 25.188 20.290 7.217 >user system elapsed > 25.283 20.043 6.832 >user system elapsed > 25.194 19.947 6.906 >user system elapsed > 25.278 19.993 6.834 >user system elapsed > 25.355 20.018 6.833 >user system elapsed > 24.986 19.869 6.865 >user system elapsed > 25.130 19.878 6.798 > # total time: >user system elapsed > 255.381 210.598 83.109 > > > {code} > I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the > same behavior. 
> I checked against pyarrow, and do not see the same: > {code:python} > from pyarrow import csv > import time > for i in range(1, 10): > start = time.time() > table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv") > print(time.time() - start) > del table > {code} > results: > {code} > 7.586184978485107 > 7.542470932006836 > 7.92852783203125 > 7.647372007369995 > 7.742412805557251 > 8.101378917694092 > 7.7359960079193115 > 7.843957901000977 > 7.6457719802856445 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv
[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277543#comment-17277543 ] Jonathan Keane commented on ARROW-11433: We thought it might be due to the mmaping of the file, but turning it off at https://github.com/apache/arrow/blob/master/r/R/csv.R#L187 with {{mmap = FALSE}} still exhibits the same pattern. > [R] Unexpectedly slow results reading csv > - > > Key: ARROW-11433 > URL: https://issues.apache.org/jira/browse/ARROW-11433 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Priority: Minor > > This came up working on benchmarking Arrow's CSV reading. As far as I can > tell this only impacts R, and only when reading the csv into arrow (but not > pulling it in to R). It appears that most arrow interactions after the csv is > read will result in this behavior not happening. > What I'm seeing is that on subsequent reads, the time to read gets longer and > longer (frequently in a stair step pattern where every other iteration takes > longer). > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + tab <- NULL > + } > + }) >user system elapsed > 24.788 19.485 7.216 >user system elapsed > 24.952 21.786 9.225 >user system elapsed > 25.150 23.039 10.332 >user system elapsed > 25.382 31.012 17.995 >user system elapsed > 25.309 25.140 12.356 >user system elapsed > 25.302 26.975 13.938 >user system elapsed > 25.509 34.390 21.134 >user system elapsed > 25.674 28.195 15.048 >user system elapsed > 25.031 28.094 16.449 >user system elapsed > 25.825 37.165 23.379 > # total time: >user system elapsed > 256.178 299.671 175.119 > {code} > Interestingly, doing something as unrelated as > {{arrow:::default_memory_pool()}} which is [only getting the default memory > pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. > Other interactions totally unrelated to the table also similarly alleviate > this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or > proactively invalidating with {{tab$invalidate()}} > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + pool <- arrow:::default_memory_pool() > + tab <- NULL > + } > + }) >user system elapsed > 25.257 19.475 6.785 >user system elapsed > 25.271 19.838 6.821 >user system elapsed > 25.288 20.103 6.861 >user system elapsed > 25.188 20.290 7.217 >user system elapsed > 25.283 20.043 6.832 >user system elapsed > 25.194 19.947 6.906 >user system elapsed > 25.278 19.993 6.834 >user system elapsed > 25.355 20.018 6.833 >user system elapsed > 24.986 19.869 6.865 >user system elapsed > 25.130 19.878 6.798 > # total time: >user system elapsed > 255.381 210.598 83.109 > > > {code} > I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the > same behavior. 
> I checked against pyarrow, and do not see the same: > {code:python} > from pyarrow import csv > import time > for i in range(1, 10): > start = time.time() > table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv") > print(time.time() - start) > del table > {code} > results: > {code} > 7.586184978485107 > 7.542470932006836 > 7.92852783203125 > 7.647372007369995 > 7.742412805557251 > 8.101378917694092 > 7.7359960079193115 > 7.843957901000977 > 7.6457719802856445 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv
[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277537#comment-17277537 ] Jonathan Keane commented on ARROW-11433: Ben and I spent some time on this today. Turns out it's not reproducible on Ubuntu as far as we tried. So we suspect something specific with macOS. As another example of even the most basic cpp <-> R interaction alleviating the behavior: {code:r} > system.time({ + for (i in 1:10) { + print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) + dev_null <- arrow::CsvParseOptions$create() + tab <- NULL + } + }) user system elapsed 27.894 22.633 12.224 user system elapsed 25.016 19.855 8.576 user system elapsed 25.092 20.625 8.051 user system elapsed 25.263 21.161 8.353 user system elapsed 25.168 20.575 8.745 user system elapsed 25.087 20.207 7.672 user system elapsed 25.311 20.525 7.438 user system elapsed 25.013 20.537 7.962 user system elapsed 25.088 20.671 8.371 user system elapsed 25.409 20.375 7.659 user system elapsed 258.004 224.511 106.238 {code} > [R] Unexpectedly slow results reading csv > - > > Key: ARROW-11433 > URL: https://issues.apache.org/jira/browse/ARROW-11433 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Priority: Minor > > This came up working on benchmarking Arrow's CSV reading. As far as I can > tell this only impacts R, and only when reading the csv into arrow (but not > pulling it in to R). It appears that most arrow interactions after the csv is > read will result in this behavior not happening. > What I'm seeing is that on subsequent reads, the time to read gets longer and > longer (frequently in a stair step pattern where every other iteration takes > longer). > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + tab <- NULL > + } > + }) >user system elapsed > 24.788 19.485 7.216 >user system elapsed > 24.952 21.786 9.225 >user system elapsed > 25.150 23.039 10.332 >user system elapsed > 25.382 31.012 17.995 >user system elapsed > 25.309 25.140 12.356 >user system elapsed > 25.302 26.975 13.938 >user system elapsed > 25.509 34.390 21.134 >user system elapsed > 25.674 28.195 15.048 >user system elapsed > 25.031 28.094 16.449 >user system elapsed > 25.825 37.165 23.379 > # total time: >user system elapsed > 256.178 299.671 175.119 > {code} > Interestingly, doing something as unrelated as > {{arrow:::default_memory_pool()}} which is [only getting the default memory > pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. > Other interactions totally unrelated to the table also similarly alleviate > this behavior (e.g. 
{{empty_tab <- Table$create(data.frame())}}) or > proactively invalidating with {{tab$invalidate()}} > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + pool <- arrow:::default_memory_pool() > + tab <- NULL > + } > + }) >user system elapsed > 25.257 19.475 6.785 >user system elapsed > 25.271 19.838 6.821 >user system elapsed > 25.288 20.103 6.861 >user system elapsed > 25.188 20.290 7.217 >user system elapsed > 25.283 20.043 6.832 >user system elapsed > 25.194 19.947 6.906 >user system elapsed > 25.278 19.993 6.834 >user system elapsed > 25.355 20.018 6.833 >user system elapsed > 24.986 19.869 6.865 >user system elapsed > 25.130 19.878 6.798 > # total time: >user system elapsed > 255.381 210.598 83.109 > > > {code} > I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the > same behavior. > I checked against pyarrow, and do not see the same: > {code:python} > from pyarrow import csv > import time > for i in range(1, 10): > start = time.time() > table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv") > print(time.time() - start) > del table > {code} > results: > {code} > 7.586184978485107 > 7.542470932006836 > 7.92852783203125 > 7.647372007369995 > 7.742412805557251 > 8.101378917694092 > 7.7359960079193115 > 7.843957901000977 > 7.6457719802856445 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content
[ https://issues.apache.org/jira/browse/ARROW-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated ARROW-11477: - Description: Collecting various ideas here for general ways to improve the R package README and vignettes for the 4.0.0 release: * Consider moving the "building" and "developing" content out of the REAMDE and into a vignette focused on that topic. (Rationale: most users of the R package today are downloading prebuilt binaries, not building their own; most users today are end users, not developers; a more valuable use for the README—especially since that it's the homepage of the R docs site—would be as a place to highlight key capabilities of the package, not to show folks all the technical details of building it.) * Get the "Using the Arrow C++ Library in R" vignette to show in the Articles menu on the R docs site. * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear that dplyr verbs can be used with Arrow Tables and RecordBatches (not just Datasets) and describe differences in dplyr support for these different Arrow objects. * Check all the links in the "Project docs" menu on the docs site; some of them are currently broken or go to directory listings was: Collecting various ideas here for general ways to improve the R package README and vignettes for the 4.0.0 release: * Consider moving the "building" and "developing" content out of the REAMDE and into a vignette focused on that topic. (Rationale: most users of the R package today are downloading prebuilt binaries, not building their own; most users today are end users, not developers; a more valuable use for the README—especially since that it's the homepage of the R docs site—would be as a place to highlight key capabilities of the package, not to show folks all the technical details of building it.) * Get the "Using the Arrow C++ Library in R" vignette to show in the Articles menu on the R docs site. * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear that dplyr verbs can be used with Arrow Tables and RecordBatches (not just Datasets) and describe differences in dplyr support for these different Arrow objects. > [R][Doc] Reorganize and improve README and vignette content > --- > > Key: ARROW-11477 > URL: https://issues.apache.org/jira/browse/ARROW-11477 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 3.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 4.0.0 > > > Collecting various ideas here for general ways to improve the R package > README and vignettes for the 4.0.0 release: > * Consider moving the "building" and "developing" content out of the REAMDE > and into a vignette focused on that topic. (Rationale: most users of the R > package today are downloading prebuilt binaries, not building their own; most > users today are end users, not developers; a more valuable use for the > README—especially since that it's the homepage of the R docs site—would be as > a place to highlight key capabilities of the package, not to show folks all > the technical details of building it.) > * Get the "Using the Arrow C++ Library in R" vignette to show in the > Articles menu on the R docs site. > * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear > that dplyr verbs can be used with Arrow Tables and RecordBatches (not just > Datasets) and describe differences in dplyr support for these different Arrow > objects. 
> * Check all the links in the "Project docs" menu on the docs site; some of > them are currently broken or go to directory listings -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277515#comment-17277515 ] Leonard Lausen commented on ARROW-11463: Thank you for sharing the tests / example code [~apitrou]. Pickle v5 is really useful. For example, the following code can replicate my use-case for the Plasma store based on providing a folder in {{/dev/shm}} as {{path}}. {code:python} import pickle import mmap def shm_pickle(path, tbl): idx = 0 def buffer_callback(buf): nonlocal idx with open(path / f'{idx}.bin', 'wb') as f: f.write(buf) idx += 1 with open(path / 'meta.pkl', 'wb') as f: pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback) def shm_unpickle(path): num_buffers = len(list(path.iterdir())) - 1 # exclude meta.idx buffers = [] for idx in range(num_buffers): f = open(path / f'{idx}.bin', 'rb') buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)) with open(path / 'meta.pkl', 'rb') as f: return pickle.load(f, buffers=buffers) {code} > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
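A short usage sketch for the two helpers above; the table contents and the target directory (which would sit under {{/dev/shm}} for the shared-memory use-case described) are placeholders:

{code:python}
from pathlib import Path
import pyarrow as pa

# Placeholder directory; for the shared-memory use-case it would live under /dev/shm.
path = Path("/dev/shm/example-table")
path.mkdir(parents=True, exist_ok=True)

tbl = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})
shm_pickle(path, tbl)          # writes meta.pkl plus one .bin file per out-of-band buffer
restored = shm_unpickle(path)  # reloads the table, mapping the buffer files back in
assert restored.equals(tbl)
{code}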
[jira] [Resolved] (ARROW-11310) [Rust] Implement arrow JSON writer
[ https://issues.apache.org/jira/browse/ARROW-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11310. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9256 [https://github.com/apache/arrow/pull/9256] > [Rust] Implement arrow JSON writer > -- > > Key: ARROW-11310 > URL: https://issues.apache.org/jira/browse/ARROW-11310 > Project: Apache Arrow > Issue Type: Task > Components: Rust >Reporter: QP Hou >Assignee: QP Hou >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content
[ https://issues.apache.org/jira/browse/ARROW-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277511#comment-17277511 ] Neal Richardson commented on ARROW-11477: - Re the "Using the Arrow C++ Library in R" vignette, it is (or at least used to be) pkgdown's convention that `vignette("pkgname")` gets put as the "Get started" link in the top menu bar, and all other vignettes go under Articles. But we have the ability to override that. > [R][Doc] Reorganize and improve README and vignette content > --- > > Key: ARROW-11477 > URL: https://issues.apache.org/jira/browse/ARROW-11477 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 3.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 4.0.0 > > > Collecting various ideas here for general ways to improve the R package > README and vignettes for the 4.0.0 release: > * Consider moving the "building" and "developing" content out of the REAMDE > and into a vignette focused on that topic. (Rationale: most users of the R > package today are downloading prebuilt binaries, not building their own; most > users today are end users, not developers; a more valuable use for the > README—especially since that it's the homepage of the R docs site—would be as > a place to highlight key capabilities of the package, not to show folks all > the technical details of building it.) > * Get the "Using the Arrow C++ Library in R" vignette to show in the > Articles menu on the R docs site. > * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear > that dplyr verbs can be used with Arrow Tables and RecordBatches (not just > Datasets) and describe differences in dplyr support for these different Arrow > objects. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content
Ian Cook created ARROW-11477: Summary: [R][Doc] Reorganize and improve README and vignette content Key: ARROW-11477 URL: https://issues.apache.org/jira/browse/ARROW-11477 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Affects Versions: 3.0.0 Reporter: Ian Cook Assignee: Ian Cook Fix For: 4.0.0 Collecting various ideas here for general ways to improve the R package README and vignettes for the 4.0.0 release: * Consider moving the "building" and "developing" content out of the REAMDE and into a vignette focused on that topic. (Rationale: most users of the R package today are downloading prebuilt binaries, not building their own; most users today are end users, not developers; a more valuable use for the README—especially since that it's the homepage of the R docs site—would be as a place to highlight key capabilities of the package, not to show folks all the technical details of building it.) * Get the "Using the Arrow C++ Library in R" vignette to show in the Articles menu on the R docs site. * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear that dplyr verbs can be used with Arrow Tables and RecordBatches (not just Datasets) and describe differences in dplyr support for these different Arrow objects. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-11474) [C++] Update bundled re2 version
[ https://issues.apache.org/jira/browse/ARROW-11474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson closed ARROW-11474. --- Assignee: Kouhei Sutou Resolution: Duplicate Done in ARROW-11350 after all > [C++] Update bundled re2 version > > > Key: ARROW-11474 > URL: https://issues.apache.org/jira/browse/ARROW-11474 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Neal Richardson >Assignee: Kouhei Sutou >Priority: Major > Fix For: 4.0.0 > > > I tried increasing the re2 version to 2020-11-01 in ARROW-11350 but it failed > in a few builds with > {code} > /usr/bin/ar: > /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a: > No such file or directory > make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9 > make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2 > {code} > (or similar). My theory is that something changed in their cmake build setup > so that either libre2.a is not where we expect it, or it's building a shared > library instead, or something. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11476) [Rust][DataFusion] Test running of TPCH benchmarks in CI
[ https://issues.apache.org/jira/browse/ARROW-11476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11476: --- Labels: pull-request-available (was: ) > [Rust][DataFusion] Test running of TPCH benchmarks in CI > > > Key: ARROW-11476 > URL: https://issues.apache.org/jira/browse/ARROW-11476 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Daniël Heres >Assignee: Daniël Heres >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11476) [Rust][DataFusion] Test running of TPCH benchmarks in CI
Daniël Heres created ARROW-11476: Summary: [Rust][DataFusion] Test running of TPCH benchmarks in CI Key: ARROW-11476 URL: https://issues.apache.org/jira/browse/ARROW-11476 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Daniël Heres Assignee: Daniël Heres -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277432#comment-17277432 ] Ali Cetin commented on ARROW-11427: --- Cool. I can give it a try in the coming days. > [C++] Arrow uses AVX512 instructions even when not supported by the OS > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277421#comment-17277421 ] Antoine Pitrou commented on ARROW-11427: (removed previous post, sorry) [~ali.cetin] Can you try to install the following wheel and see if it fixes the issue? [https://github.com/ursacomputing/crossbow/releases/download/build-27-github-wheel-windows-cp38/pyarrow-3.1.0.dev112-cp38-cp38-win_amd64.whl] Also, it will allow you to inspect the current SIMD level, like this: {code:java} $ python -c "import pyarrow as pa; print(pa.runtime_info())" RuntimeInfo(simd_level='avx2', detected_simd_level='avx2') {code} You should get "avx2" on Windows Server 2012, and "avx512" on Windows Server 2016. > [C++] Arrow uses AVX512 instructions even when not supported by the OS > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11435) Allow creating ParquetPartition from external crate
[ https://issues.apache.org/jira/browse/ARROW-11435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11435. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9369 [https://github.com/apache/arrow/pull/9369] > Allow creating ParquetPartition from external crate > --- > > Key: ARROW-11435 > URL: https://issues.apache.org/jira/browse/ARROW-11435 > Project: Apache Arrow > Issue Type: Task > Components: Rust - DataFusion >Reporter: QP Hou >Assignee: QP Hou >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Without this functionality, it's not possible to implement a table provider in an > external crate that targets the parquet format, since ParquetExec takes > ParquetPartition as an argument. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11475) [C++] Upgrade mimalloc
Neal Richardson created ARROW-11475: --- Summary: [C++] Upgrade mimalloc Key: ARROW-11475 URL: https://issues.apache.org/jira/browse/ARROW-11475 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 I tried this in ARROW-11350 but ran into an issue (https://github.com/microsoft/mimalloc/issues/353). That has since been resolved and we could apply a patch to bring it in. Or we can wait for it to get into a proper release. There is also now a 1.7 release, which claims to work on the Apple M1, as well as a 2.0 version, which claims better performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11474) [C++] Update bundled re2 version
[ https://issues.apache.org/jira/browse/ARROW-11474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-11474: Description: I tried increasing the re2 version to 2020-11-01 in ARROW-11350 but it failed in a few builds with {code} /usr/bin/ar: /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a: No such file or directory make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9 make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2 {code} (or similar). My theory is that something changed in their cmake build setup so that either libre2.a is not where we expect it, or it's building a shared library instead, or something. was: I tried increasing the re2 version to 2020-11-01 in but it failed in a few builds with {code} /usr/bin/ar: /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a: No such file or directory make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9 make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2 {code} (or similar). My theory is that something changed in their cmake build setup so that either libre2.a is not where we expect it, or it's building a shared library instead, or something. > [C++] Update bundled re2 version > > > Key: ARROW-11474 > URL: https://issues.apache.org/jira/browse/ARROW-11474 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Neal Richardson >Priority: Major > Fix For: 4.0.0 > > > I tried increasing the re2 version to 2020-11-01 in ARROW-11350 but it failed > in a few builds with > {code} > /usr/bin/ar: > /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a: > No such file or directory > make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9 > make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2 > {code} > (or similar). My theory is that something changed in their cmake build setup > so that either libre2.a is not where we expect it, or it's building a shared > library instead, or something. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11474) [C++] Update bundled re2 version
Neal Richardson created ARROW-11474: --- Summary: [C++] Update bundled re2 version Key: ARROW-11474 URL: https://issues.apache.org/jira/browse/ARROW-11474 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Neal Richardson Fix For: 4.0.0 I tried increasing the re2 version to 2020-11-01 in but it failed in a few builds with {code} /usr/bin/ar: /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a: No such file or directory make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9 make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2 {code} (or similar). My theory is that something changed in their cmake build setup so that either libre2.a is not where we expect it, or it's building a shared library instead, or something. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-11427: --- Comment: was deleted (was: [~ali.cetin] Could you try installing this wheel and see if it fixes the issue: [https://github.com/ursacomputing/crossbow/releases/download/build-26-github-wheel-windows-cp38/pyarrow-3.1.0.dev109-cp38-cp38-win_amd64.whl] ?) > [C++] Arrow uses AVX512 instructions even when not supported by the OS > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277392#comment-17277392 ] Antoine Pitrou commented on ARROW-11463: PyArrow serialization is deprecated, users should use pickle themselves. It is true that out-of-band data provides zero-copy support for buffers embedded in the pickled data. It is tested here: [https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_array.py#L1695] > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
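As a minimal sketch of the out-of-band mechanism referenced above (assuming the installed pyarrow version implements pickle protocol 5 reduction; the table contents are illustrative):
{code:python}
import pickle

import pyarrow as pa

table = pa.table({"values": list(range(1000))})

# Collect buffers out of band instead of copying them into the pickle stream.
buffers = []
payload = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)

# If the object does not expose out-of-band buffers, `buffers` simply stays
# empty and the data travels in-band as usual.
restored = pickle.loads(payload, buffers=buffers)
assert restored.equals(table)
{code}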
[jira] [Updated] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11427: --- Labels: pull-request-available (was: ) > [C++] Arrow uses AVX512 instructions even when not supported by the OS > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277373#comment-17277373 ] Leonard Lausen commented on ARROW-11463: Specifically, do you mean that PyArrow serialization is deprecated or that SerializationContext is deprecated? Ie should users use pickle themselves, or will PyArrow just use pickle internally when serializing? > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11308) [Rust] [Parquet] Add Arrow decimal array writer
[ https://issues.apache.org/jira/browse/ARROW-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11308: --- Labels: pull-request-available (was: ) > [Rust] [Parquet] Add Arrow decimal array writer > --- > > Key: ARROW-11308 > URL: https://issues.apache.org/jira/browse/ARROW-11308 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277352#comment-17277352 ] Antoine Pitrou commented on ARROW-11427: [~ali.cetin] Could you try installing this wheel and see if it fixes the issue: [https://github.com/ursacomputing/crossbow/releases/download/build-26-github-wheel-windows-cp38/pyarrow-3.1.0.dev109-cp38-cp38-win_amd64.whl] ? > [C++] Arrow uses AVX512 instructions even when not supported by the OS > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Fix For: 4.0.0 > > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277348#comment-17277348 ] Leonard Lausen commented on ARROW-11463: Thank you [~apitrou] for the background. For Plasma, Tao is developing a fork at https://github.com/alibaba/libvineyard which currently also uses PyArrow serialization and is thus affected by this issue. For PyArrow serialization and Pickle 5, I see that you are the author of the PEP. Thank you for driving that. Is it correct that the out-of-band data support makes it possible to use it for zero-copy / shared memory applications? Is there any plan for PyArrow to use Pickle 5 by default when running on Py3.8+? > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11469) [Python] Performance degradation parquet reading of wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11469: -- Summary: [Python] Performance degradation parquet reading of wide dataframes (was: [Python] Performance degradation wide dataframes) > [Python] Performance degradation parquet reading of wide dataframes > --- > > Key: ARROW-11469 > URL: https://issues.apache.org/jira/browse/ARROW-11469 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 >Reporter: Axel G >Priority: Minor > Attachments: profile_wide300.svg > > > I noticed a relatively big performance degradation in version 1.0.0+ when > trying to load wide dataframes. > For example you should be able to reproduce by doing: > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame(np.random.rand(100, 1)) > table = pa.Table.from_pandas(df) > pq.write_table(table, "temp.parquet") > %timeit pd.read_parquet("temp.parquet"){code} > In version 0.17.0, this takes about 300-400 ms and for anything above and > including 1.0.0, this suddenly takes around 2 seconds. > > Thanks for looking into this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11469) [Python] Performance degradation wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11469: -- Summary: [Python] Performance degradation wide dataframes (was: Performance degradation wide dataframes) > [Python] Performance degradation wide dataframes > > > Key: ARROW-11469 > URL: https://issues.apache.org/jira/browse/ARROW-11469 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 >Reporter: Axel G >Priority: Minor > Attachments: profile_wide300.svg > > > I noticed a relatively big performance degradation in version 1.0.0+ when > trying to load wide dataframes. > For example you should be able to reproduce by doing: > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame(np.random.rand(100, 1)) > table = pa.Table.from_pandas(df) > pq.write_table(table, "temp.parquet") > %timeit pd.read_parquet("temp.parquet"){code} > In version 0.17.0, this takes about 300-400 ms and for anything above and > including 1.0.0, this suddenly takes around 2 seconds. > > Thanks for looking into this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11421) [Rust][DataFusion] Support group by Date32
[ https://issues.apache.org/jira/browse/ARROW-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-11421. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9355 [https://github.com/apache/arrow/pull/9355] > [Rust][DataFusion] Support group by Date32 > -- > > Key: ARROW-11421 > URL: https://issues.apache.org/jira/browse/ARROW-11421 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Daniël Heres >Assignee: Daniël Heres >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-11427: --- Summary: [C++] Arrow uses AVX512 instructions even when not supported by the OS (was: [Python] Windows Server 2012 w/ Xeon Platinum 8171M crashes after upgrading to pyarrow 3.0) > [C++] Arrow uses AVX512 instructions even when not supported by the OS > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Fix For: 4.0.0 > > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11427) [Python] Windows Server 2012 w/ Xeon Platinum 8171M crashes after upgrading to pyarrow 3.0
[ https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-11427: --- Fix Version/s: 4.0.0 > [Python] Windows Server 2012 w/ Xeon Platinum 8171M crashes after upgrading > to pyarrow 3.0 > -- > > Key: ARROW-11427 > URL: https://issues.apache.org/jira/browse/ARROW-11427 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel > Xeon Platinum 8171m >Reporter: Ali Cetin >Priority: Major > Fix For: 4.0.0 > > > *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so > I'm unable to test it with other OS's. Azure VM's are assigned different > type of CPU's of same "class" depending on availability. I will try my "luck" > later. > VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after > upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when > reading parquet files larger than 4096 bits!? > Windows closes Python with exit code 255 and produces this: > > {code:java} > Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: > 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: > 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc > Faulting process id: 0x1b10 Faulting application start time: > 0x01d6f4a43dca3c14 Faulting application path: > D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe > Faulting module path: > D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code} > > Tested on: > ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs|| > |Windows Server 2012 Data Center|Fail|OK| > |Windows Server 2016 Data Center| OK|OK| > |Windows Server 2019 Data Center| | | > |Windows 10| |OK| > > Example code (Python): > {code:java} > import numpy as np > import pandas as pd > data_len = 2**5 > data = pd.DataFrame( > {"values": np.arange(0., float(data_len), dtype=float)}, > index=np.arange(0, data_len, dtype=int) > ) > data.to_parquet("test.parquet") > data = pd.read_parquet("test.parquet", engine="pyarrow") # fails here! > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11473) Needs a handling for missing columns while reading parquet file
[ https://issues.apache.org/jira/browse/ARROW-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jason khadka updated ARROW-11473: - Description: Currently there is no way to handle the error raised by missing columns in a parquet file. If a column passed is missing, it just raises an ArrowInvalid error {code:java} columns=[item1, item2, item3] #item3 is not there in parquet file pd.read_parquet(file_name, columns = columns) > ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code} There is no way to handle this. The ArrowInvalid also does not carry any information that can give out the field name so that on the next try this field can be ignored. Example : {code:java} from pyarrow.lib import ArrowInvalid read_columns = ['a','b','X'] df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) file_name = '/tmp/my_df.pq' df.to_parquet(file_name) try: df = pd.read_parquet(file_name, columns = read_columns) except ArrowInvalid as e: inval = e print(inval.args) >("Field named 'X' not found or not unique in the schema.",){code} You could parse the message above to get 'X', but that is a bit of a hectic solution. It would be great if the error message contained the field name. So, you could do for example : {code:java} inval.field > 'X'{code} Or a better feature would be to have error handling in read_table of pyarrow, where something like {{error='ignore'}} could be passed. This would then ignore the missing column by checking the schema. Example, in case above : {code:java} df = pd.read_parquet(file_name, columns = read_columns, error = 'ignore'){code} Would ignore the missing column 'X' was: Currently there is no way to handle the error raised by missing columns in parquet file. If a column passed is missing, it just raises ArrowInvalid error {code:java} columns=[item1, item2, item3] #item3 is not there in parquet file pd.read_parquet(file_name, columns = columns) > ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code} There is no way to handle this. The ArrowInvalid also does not carry any information that can give out the field name so that in next try this filed can be ignored. Example : {{}} {code:java} from pyarrow.lib import ArrowInvalid read_columns = ['a','b','X'] df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) file_name = '/tmp/my_df.pq' df.to_parquet(file_name) try: df = pd.read_parquet(file_name, columns = read_columns) except ArrowInvalid as e: inval = e print(inval.args) >("Field named 'X' not found or not unique in the schema.",){code} {{}} You could parse the message above to get 'X', but that is a bit of hectic solution. It would be great if the error message contained the field name. So, you could do for example : {{}} {code:java} inval.field > 'X'{code} Or a better feature would be to have a error handling in read_table of pyarrow, where something like {{error='ignore'}}could be passed. This would then ignore the missing column by checking the schema. > Needs a handling for missing columns while reading parquet file > > > Key: ARROW-11473 > URL: https://issues.apache.org/jira/browse/ARROW-11473 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: jason khadka >Priority: Major > > Currently there is no way to handle the error raised by missing columns in > a parquet file. > If a column passed is missing, it just raises an ArrowInvalid error > {code:java} > columns=[item1, item2, item3] #item3 is not there in parquet file > pd.read_parquet(file_name, columns = columns) > > ArrowInvalid: Field named 'item3' not found or not unique in the > > schema.{code} > There is no way to handle this. The ArrowInvalid also does not carry any > information that can give out the field name so that on the next try this field > can be ignored. > Example : > {code:java} > from pyarrow.lib import ArrowInvalid > read_columns = ['a','b','X'] > df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) > file_name = '/tmp/my_df.pq' df.to_parquet(file_name) > try: > df = pd.read_parquet(file_name, columns = read_columns) > except ArrowInvalid as e: > inval = e > print(inval.args) > >("Field named 'X' not found or not unique in the schema.",){code} > > You could parse the message above to get 'X', but that is a bit of a hectic > solution. It would be great if the error message contained the field name. > So, you could do for example : > > {code:java} > inval.field > > 'X'{code} > Or a better feature would be to have error handling in read_table of > pyarrow, where something like {{error='ignore'}} could be passed. This would > then ignore the missing column by checking the schema. > Example, in case
[jira] [Created] (ARROW-11473) Needs a handling for missing columns while reading parquet file
jason khadka created ARROW-11473: Summary: Needs a handling for missing columns while reading parquet file Key: ARROW-11473 URL: https://issues.apache.org/jira/browse/ARROW-11473 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: jason khadka Currently there is no way to handle the error raised by missing columns in a parquet file. If a column passed is missing, it just raises an ArrowInvalid error {code:java} columns=[item1, item2, item3] #item3 is not there in parquet file pd.read_parquet(file_name, columns = columns) > ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code} There is no way to handle this. The ArrowInvalid also does not carry any information that can give out the field name so that on the next try this field can be ignored. Example : {code:java} from pyarrow.lib import ArrowInvalid read_columns = ['a','b','X'] df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) file_name = '/tmp/my_df.pq' df.to_parquet(file_name) try: df = pd.read_parquet(file_name, columns = read_columns) except ArrowInvalid as e: inval = e print(inval.args) >("Field named 'X' not found or not unique in the schema.",){code} You could parse the message above to get 'X', but that is a bit of a hectic solution. It would be great if the error message contained the field name. So, you could do for example : {code:java} inval.field > 'X'{code} Or a better feature would be to have error handling in read_table of pyarrow, where something like {{error='ignore'}} could be passed. This would then ignore the missing column by checking the schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
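A possible workaround sketch for the feature requested above, assuming the file schema can be inspected up front via {{pyarrow.parquet.ParquetFile}} and its {{schema_arrow}} attribute (file path and column names are illustrative):
{code:python}
import pyarrow.parquet as pq

file_name = "/tmp/my_df.pq"   # illustrative path from the report above
requested = ["a", "b", "X"]   # 'X' is not present in the file

# Keep only the requested columns that actually exist in the file schema.
available = set(pq.ParquetFile(file_name).schema_arrow.names)
usable = [c for c in requested if c in available]

df = pq.read_table(file_name, columns=usable).to_pandas()
{code}
This sidesteps the ArrowInvalid error by never asking for a missing field, rather than suppressing it.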
[jira] [Comment Edited] (ARROW-11469) Performance degradation wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277261#comment-17277261 ] Joris Van den Bossche edited comment on ARROW-11469 at 2/2/21, 4:27 PM: [~Axelg1] Thanks for the report We have had similar issues in the past (eg ARROW-9924, ARROW-9827), but it seems that some things deteriorated again. So as a temporary workaround, you can specify {{use_legacy_dataset=True}} to use the old code path (another alternative is using the single-file {{pq.ParquetFile}} interface, this will never have overhead for dealing with potentially more complicated datasets). cc [~bkietz] There seems to be a lot of overhead being spent in the projection ({{RecordBatchProjector}}, and specifically {{SetInputSchema}}, {{CheckProjectable}}, {{FieldRef}} finding, see the attached profile [^profile_wide300.svg] ), while in this case there is actually no projection happening. was (Author: jorisvandenbossche): [~Axelg1] Thanks for the report We have had similar issues in the past (eg ARROW-9924, ARROW-9827), but it seems that some things deteriorated again. So as a temporary workaround, you can specify {{use_legacy_dataset=True}} to use the old code path (another alternative is using the single-file {{pq.ParquetFile}} interface, this will never have overhead for dealing with potentially more complicated datasets). cc [~bkietz] There seems to be a lot of overhead being spent in the projection ({{RecordBatchProjector}}, and specifically {{SetInputSchema}}, {{CheckProjectable}}, {{FieldRef}} finding, see the attached profile), while in this case there is actually no projection happening. [^profile_wide300.svg] > Performance degradation wide dataframes > --- > > Key: ARROW-11469 > URL: https://issues.apache.org/jira/browse/ARROW-11469 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 >Reporter: Axel G >Priority: Minor > Attachments: profile_wide300.svg > > > I noticed a relatively big performance degradation in version 1.0.0+ when > trying to load wide dataframes. > For example you should be able to reproduce by doing: > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame(np.random.rand(100, 1)) > table = pa.Table.from_pandas(df) > pq.write_table(table, "temp.parquet") > %timeit pd.read_parquet("temp.parquet"){code} > In version 0.17.0, this takes about 300-400 ms and for anything above and > including 1.0.0, this suddenly takes around 2 seconds. > > Thanks for looking into this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11469) Performance degradation wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11469: -- Attachment: profile_wide300.svg > Performance degradation wide dataframes > --- > > Key: ARROW-11469 > URL: https://issues.apache.org/jira/browse/ARROW-11469 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 >Reporter: Axel G >Priority: Minor > Attachments: profile_wide300.svg > > > I noticed a relatively big performance degradation in version 1.0.0+ when > trying to load wide dataframes. > For example you should be able to reproduce by doing: > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame(np.random.rand(100, 1)) > table = pa.Table.from_pandas(df) > pq.write_table(table, "temp.parquet") > %timeit pd.read_parquet("temp.parquet"){code} > In version 0.17.0, this takes about 300-400 ms and for anything above and > including 1.0.0, this suddenly takes around 2 seconds. > > Thanks for looking into this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11469) Performance degradation wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277261#comment-17277261 ] Joris Van den Bossche commented on ARROW-11469: --- [~Axelg1] Thanks for the report We have had similar issues in the past (eg ARROW-9924, ARROW-9827), but it seems that some things deteriorated again. So as a temporary workaround, you can specify {{use_legacy_dataset=True}} to use the old code path (another alternative is using the single-file {{pq.ParquetFile}} interface, this will never have overhead for dealing with potentially more complicated datasets). cc [~bkietz] There seems to be a lot of overhead being spent in the projection ({{RecordBatchProjector}}, and specifically {{SetInputSchema}}, {{CheckProjectable}}, {{FieldRef}} finding, see the attached profile), while in this case there is actually no projection happening. [^profile_wide300.svg] > Performance degradation wide dataframes > --- > > Key: ARROW-11469 > URL: https://issues.apache.org/jira/browse/ARROW-11469 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 >Reporter: Axel G >Priority: Minor > Attachments: profile_wide300.svg > > > I noticed a relatively big performance degradation in version 1.0.0+ when > trying to load wide dataframes. > For example you should be able to reproduce by doing: > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame(np.random.rand(100, 1)) > table = pa.Table.from_pandas(df) > pq.write_table(table, "temp.parquet") > %timeit pd.read_parquet("temp.parquet"){code} > In version 0.17.0, this takes about 300-400 ms and for anything above and > including 1.0.0, this suddenly takes around 2 seconds. > > Thanks for looking into this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
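A rough sketch of the two workarounds mentioned in the comment above (whether pandas forwards {{use_legacy_dataset}} to pyarrow depends on the pandas and pyarrow versions in use):
{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Workaround 1: fall back to the pre-datasets code path.
df = pd.read_parquet("temp.parquet", use_legacy_dataset=True)

# Workaround 2: use the single-file interface directly, bypassing the
# dataset projection machinery.
df = pq.ParquetFile("temp.parquet").read().to_pandas()
{code}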
[jira] [Resolved] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-11463. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9394 [https://github.com/apache/arrow/pull/9394] > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
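With the fix in place, usage from Python could look roughly like the sketch below (assuming the option is exposed as {{allow_64bit}} on {{pyarrow.ipc.IpcWriteOptions}}, mirroring the C++ field; the sample table is illustrative):
{code:python}
import pyarrow as pa

table = pa.table({"x": list(range(10))})

# Allow field lengths that do not fit in a signed 32-bit int.
options = pa.ipc.IpcWriteOptions(allow_64bit=True)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)

buf = sink.getvalue()
{code}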
[jira] [Resolved] (ARROW-11462) [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX
[ https://issues.apache.org/jira/browse/ARROW-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-11462. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9391 [https://github.com/apache/arrow/pull/9391] > [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX > --- > > Key: ARROW-11462 > URL: https://issues.apache.org/jira/browse/ARROW-11462 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277239#comment-17277239 ] Joris Van den Bossche commented on ARROW-11456: --- bq. If you still need code, I can write a function to generate it. That would help, yes. > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 >Reporter: Pac A. He >Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 1.5 billion unique > rows. Each value was effectively a unique base64 encoded length 22 string. -- This message was sent by Atlassian Jira (v8.3.4#803005)
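If a self-contained generator helps, a sketch along these lines might do (file name, batch size, and the use of {{secrets.token_urlsafe}} for roughly 22-character unique strings are all illustrative; writing 1.5 billion rows produces a very large file, so shrink {{n_rows}} for a quick test):
{code:python}
import secrets

import pyarrow as pa
import pyarrow.parquet as pq

n_rows = 1_500_000_000
batch_size = 10_000_000

schema = pa.schema([("s", pa.string())])
with pq.ParquetWriter("large_strings.parquet", schema) as writer:
    written = 0
    while written < n_rows:
        n = min(batch_size, n_rows - written)
        # token_urlsafe(16) yields 22-character strings, similar to the report.
        values = [secrets.token_urlsafe(16) for _ in range(n)]
        writer.write_table(pa.table({"s": values}, schema=schema))
        written += n
{code}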
[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277234#comment-17277234 ] Pac A. He edited comment on ARROW-11456 at 2/2/21, 4:12 PM: For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such files. That's a workaround for now, if only for Python, until this issue is resolved. was (Author: apacman): For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such files. That's a workaround for now until this issue is resolved. > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 >Reporter: Pac A. He >Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 1.5 billion unique > rows. Each value was effectively a unique base64 encoded length 22 string. -- This message was sent by Atlassian Jira (v8.3.4#803005)
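Besides switching engines, another mitigation sometimes tried for this capacity error is reading the file one row group at a time so each column stays chunked; this only helps if no single row group exceeds the 2^31 element limit on its own (file name illustrative):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("input.parquet")
pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]

# Columns remain chunked per row group instead of being rebuilt as one array.
table = pa.concat_tables(pieces)
df = table.to_pandas()
{code}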
[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pac A. He updated ARROW-11456: -- Description: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 1.5 billion unique rows. Each value was effectively a unique base64 encoded length 22 string. was: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 1.5 billion unique rows. Each value was effectively a unique base64 encoded length 22 string > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 >Reporter: Pac A. 
He >Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] >
[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pac A. He updated ARROW-11456: -- Description: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. This problem started after I added a string column with 1.5 billion unique rows. Each value was effectively a unique base64 encoded length 22 string was: When reading a large parquet file, I have this error: {noformat} df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet return impl.read( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read return self.api.parquet.read_table( File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read return self.reader.read_all(column_indices=column_indices, File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648 {noformat} Isn't pyarrow supposed to support large parquets? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why arrow uses [31-bit computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] It's not even 32-bit as sizes are non-negative. > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 >Reporter: Pac A. 
He >Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. > This problem started after I added a string column with 1.5 billion unique > rows. Each value was effectively a unique base64-encoded string of length 22. -- This message was sent by Atlassian Jira (v8.3.4#803005)
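Editor's note: the numbers in that traceback line up with the 32-bit offsets used by Arrow's default string/binary layout. A rough back-of-the-envelope check, assuming (as the message suggests) that BinaryBuilder's "child elements" are the bytes of string data in a single array:

{code:python}
# Back-of-the-envelope arithmetic behind the Capacity error (illustrative only).
rows = 1_500_000_000              # unique string values reported above
bytes_per_value = 22              # each value is a length-22 base64 string
total_bytes = rows * bytes_per_value
per_array_limit = 2**31 - 1       # what 32-bit offsets can address, ~2.1 GB

print(total_bytes)                     # 33_000_000_000 (~33 GB of character data)
print(total_bytes / per_array_limit)   # ~15.4x over the limit for one array
{code}

So the column cannot be materialized as a single 32-bit-offset string array; it would have to be split into many chunks or read back as a 64-bit-offset type such as {{large_string}}/{{large_binary}}.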
[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings
[ https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277234#comment-17277234 ] Pac A. He commented on ARROW-11456: --- For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such files. That's a workaround for now until this issue is resolved. > [Python] Parquet reader cannot read large strings > - > > Key: ARROW-11456 > URL: https://issues.apache.org/jira/browse/ARROW-11456 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0, 3.0.0 > Environment: pyarrow 3.0.0 / 2.0.0 > pandas 1.2.1 > python 3.8.6 >Reporter: Pac A. He >Priority: Major > > When reading a large parquet file, I have this error: > > {noformat} > df: Final = pd.read_parquet(input_file_uri, engine="pyarrow") > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 459, in read_parquet > return impl.read( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", > line 221, in read > return self.api.parquet.read_table( > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 1638, in read_table > return dataset.read(columns=columns, use_threads=use_threads, > File > "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", > line 327, in read > return self.reader.read_all(column_indices=column_indices, > File "pyarrow/_parquet.pyx", line 1126, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > OSError: Capacity error: BinaryBuilder cannot reserve space for more than > 2147483646 child elements, got 2147483648 > {noformat} > Isn't pyarrow supposed to support large parquets? It let me write this > parquet file, but now it doesn't let me read it back. I don't understand why > arrow uses [31-bit > computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] > It's not even 32-bit as sizes are non-negative. -- This message was sent by Atlassian Jira (v8.3.4#803005)
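Editor's note: for anyone hitting the same error before it is addressed in pyarrow, the workaround above is a one-line engine switch. A minimal sketch, assuming {{fastparquet}} >= 0.5.0 is installed and using a hypothetical file path:

{code:python}
import pandas as pd

# Hypothetical path; substitute the file that fails with the pyarrow engine.
input_file_uri = "large_strings.parquet"

# fastparquet does not go through Arrow's 32-bit-offset BinaryBuilder,
# so the oversized string column reads back without the Capacity error.
df = pd.read_parquet(input_file_uri, engine="fastparquet")
{code}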
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277230#comment-17277230 ] Antoine Pitrou commented on ARROW-11463: [~lausen] I'm not sure your question has a possible answer, but please note that both PyArrow serialization and Plasma are deprecated and unmaintained. For the former, the recommended replacement is pickle with protocol 5. For the latter, you may want to contact the developers of the Ray project (they used to maintain Plasma and decided to fork it). > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
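Editor's note: to make the recommended replacement concrete, here is a minimal pickle protocol 5 sketch. The table is only an example payload, and whether any buffers are actually passed out-of-band depends on the object's pickle support:

{code:python}
import pickle
import pyarrow as pa

table = pa.table({"x": list(range(10))})  # example payload only

# Serialize with protocol 5; buffer_callback collects any out-of-band buffers.
buffers = []
payload = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)

# Deserialize, handing any out-of-band buffers back to pickle.
restored = pickle.loads(payload, buffers=buffers)
assert restored.equals(table)
{code}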
[jira] [Updated] (ARROW-11469) Performance degradation wide dataframes
[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11469: -- Description: I noticed a relatively big performance degradation in version 1.0.0+ when trying to load wide dataframes. For example you should be able to reproduce by doing: {code:java} import numpy as np import pandas as pd import pyarrow as pa import pyarrow.parquet as pq df = pd.DataFrame(np.random.rand(100, 1)) table = pa.Table.from_pandas(df) pq.write_table(table, "temp.parquet") %timeit pd.read_parquet("temp.parquet"){code} In version 0.17.0, this takes about 300-400 ms and for anything above and including 1.0.0, this suddenly takes around 2 seconds. Thanks for looking into this. was: I noticed a relatively big performance degradation in version 1.0.0+ when trying to load wide dataframes. For example you should be able to reproduce by doing: {code:java} import numpy as np import pandas as pd import pyarrow as pa import pyarrow.parquet as pq df = pd.DataFrame(np.random.rand(100, 1)) table = pa.Table.from_pandas(df) pd.write_table(table, "temp.parquet") %timeit pd.read_parquet("temp.parquet"){code} In version 0.17.0, this takes about 300-400 ms and for anything above and including 1.0.0, this suddenly takes around 2 seconds. Thanks for looking into this. > Performance degradation wide dataframes > --- > > Key: ARROW-11469 > URL: https://issues.apache.org/jira/browse/ARROW-11469 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0 >Reporter: Axel G >Priority: Minor > > I noticed a relatively big performance degradation in version 1.0.0+ when > trying to load wide dataframes. > For example you should be able to reproduce by doing: > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame(np.random.rand(100, 1)) > table = pa.Table.from_pandas(df) > pq.write_table(table, "temp.parquet") > %timeit pd.read_parquet("temp.parquet"){code} > In version 0.17.0, this takes about 300-400 ms and for anything above and > including 1.0.0, this suddenly takes around 2 seconds. > > Thanks for looking into this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
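Editor's note: when chasing this kind of regression it helps to time the Arrow-level read and the pandas conversion separately. A rough sketch along the lines of the reproducer above; the column count is illustrative only, since the width in the description appears truncated in this archive:

{code:python}
from timeit import timeit

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 10,000 columns chosen purely for illustration of a "wide" frame.
df = pd.DataFrame(np.random.rand(100, 10_000))
table = pa.Table.from_pandas(df)
pq.write_table(table, "temp.parquet")

# Separating the two steps shows whether the slowdown is in the parquet
# read itself or in the Arrow-to-pandas conversion.
print("read_table      :", timeit(lambda: pq.read_table("temp.parquet"), number=10))
print("read + to_pandas:", timeit(lambda: pq.read_table("temp.parquet").to_pandas(), number=10))
{code}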
[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277170#comment-17277170 ] Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:52 PM: - Thank you Tao! How can we specify the IPC stream writer instance for the {{_serialize_pyarrow_table}} which is configured to be the {{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only supports specifying {{SerializationContext}} and I'm unsure how to configure the writer instance via {{SerializationContext}} was (Author: lausen): Thank you Tao! How can we specify the IPC stream writer instance for the {{_serialize_pyarrow_table}} which is configured to be the {{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only supports specifying {{SerializationContext}} and I'm unsure how to configure the writer instance via {{SerializationContext}} > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277170#comment-17277170 ] Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:51 PM: - Thank you Tao! How can we specify the IPC stream writer instance for the {{_serialize_pyarrow_table}} which is configured to be the {{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only supports specifying {{SerializationContext}} and I'm unsure how to configure the writer instance via {{SerializationContext}} was (Author: lausen): Thank you Tao! How can we specify the IPC stream writer instance for the `_serialize_pyarrow_table` which is configured to be the default_serialization_handler and used by `plasma_client.put`? It only supports specifying {{SerializationContext}} and I'm unsure how to configure the writer instance via {{}}{{SerializationContext}} > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277170#comment-17277170 ] Leonard Lausen commented on ARROW-11463: Thank you Tao! How can we specify the IPC stream writer instance for the `_serialize_pyarrow_table` which is configured to be the default_serialization_handler and used by `plasma_client.put`? It only supports specifying {{SerializationContext}} and I'm unsure how to configure the writer instance via {{}}{{SerializationContext}} > Allow configuration of IpcWriterOptions 64Bit from PyArrow > -- > > Key: ARROW-11463 > URL: https://issues.apache.org/jira/browse/ARROW-11463 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Leonard Lausen >Assignee: Tao He >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` > will be around 1000x slower compared to the `pyarrow.Table.take` on the table > with combined chunks (1 chunk). Unfortunately, if such table contains large > list data type, it's easy for the flattened table to contain more than 2**31 > rows and serialization of the table with combined chunks (eg for Plasma > store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays > larger than 2^31 - 1 in length` > I couldn't find a way to enable 64bit support for the serialization as called > from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions > 64 bit setting; further the Python serialization APIs do not allow > specification of IpcWriteOptions) > I was able to serialize successfully after changing the default and rebuilding > {code:c++} > modified cpp/src/arrow/ipc/options.h > @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions { >/// \brief If true, allow field lengths that don't fit in a signed 32-bit > int. >/// >/// Some implementations may not be able to parse streams created with > this option. > - bool allow_64bit = false; > + bool allow_64bit = true; > >/// \brief The maximum permitted schema nesting depth. >int max_recursion_depth = kMaxNestingDepth; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
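Editor's note: until the 64-bit flag is exposed in Python, one interim approach (a sketch of plain IPC stream writing, not the Plasma serialization handler itself) is to keep each record batch below the 32-bit limit instead of combining the table into a single chunk:

{code:python}
import pyarrow as pa

def write_ipc_stream(table: pa.Table, sink) -> None:
    """Write a table to an Arrow IPC stream in bounded-size batches.

    Splitting into batches keeps every array under the default 32-bit
    length limit, so allow_64bit is not needed. The max_chunksize value
    (rows per batch) is illustrative only.
    """
    with pa.ipc.new_stream(sink, table.schema) as writer:
        for batch in table.to_batches(max_chunksize=1_000_000):
            writer.write_batch(batch)
{code}

For an in-memory target, {{sink}} can be a {{pa.BufferOutputStream()}}, and {{sink.getvalue()}} afterwards yields the serialized buffer.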
[jira] [Commented] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0
[ https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277142#comment-17277142 ] Joris Van den Bossche commented on ARROW-11400: --- Marking it as 3.0.0, as it was already fixed in the release, my PR was only adding a test > [Python] Pickled ParquetFileFragment has invalid partition_expresion with > dictionary type in pyarrow 2.0 > > > Key: ARROW-11400 > URL: https://issues.apache.org/jira/browse/ARROW-11400 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > From https://github.com/dask/dask/pull/7066#issuecomment-767156623 > Simplified reproducer: > {code:python} > import pyarrow.parquet as pq > import pyarrow.dataset as ds > table = pa.table({'part': ['A', 'B']*5, 'col': range(10)}) > pq.write_to_dataset(table, "test_partitioned_parquet", > partition_cols=["part"]) > # with partitioning_kwargs = {} there is no error > partitioning_kwargs = {"max_partition_dictionary_size": -1} > dataset = ds.dataset( > "test_partitioned_parquet/", format="parquet", > partitioning=ds.HivePartitioning.discover( **partitioning_kwargs) > ) > frag = list(dataset.get_fragments())[0] > {code} > Querying this fragment works fine, but after serialization/deserialization > with pickle, it gives errors (and with the original data example I actually > got a segfault as well): > {code} > In [16]: import pickle > In [17]: frag2 = pickle.loads(pickle.dumps(frag)) > In [19]: frag2.partition_expression > ... > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: > invalid continuation byte > In [20]: frag2.to_table(schema=schema, columns=columns) > Out[20]: > pyarrow.Table > col: int64 > part: dictionary > In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas() > ... > ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ArrowException: Unknown error: Wrapping ɻ� failed > {code} > It seems the issue was specifically with a partition expression with > dictionary type. > Also when using an integer columns as the partition column, you get wrong > values (but silently in this case): > {code:python} > In [42]: frag.partition_expression > Out[42]: >1, > 2 > ][0]:dictionary)> > In [43]: frag2.partition_expression > Out[43]: >170145232, > 32754 > ][0]:dictionary)> > {code} > Now, it seems this is fixed in master. But since I don't remember it was > fixed intentionally ([~bkietz]?), it would be good to add some tests for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
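Editor's note: a rough sketch of the kind of pickle round-trip test described (the test name is hypothetical; it simply mirrors the reproducer in the issue, with {{tmp_path}} being the standard pytest fixture):

{code:python}
import pickle

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def test_partition_expression_pickle_roundtrip(tmp_path):
    table = pa.table({"part": ["A", "B"] * 5, "col": list(range(10))})
    pq.write_to_dataset(table, str(tmp_path), partition_cols=["part"])

    dataset = ds.dataset(
        str(tmp_path), format="parquet",
        partitioning=ds.HivePartitioning.discover(max_partition_dictionary_size=-1),
    )
    frag = list(dataset.get_fragments())[0]
    frag2 = pickle.loads(pickle.dumps(frag))

    # The dictionary-typed partition expression must survive the round trip.
    assert str(frag2.partition_expression) == str(frag.partition_expression)
    assert frag2.to_table().equals(frag.to_table())
{code}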
[jira] [Resolved] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0
[ https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-11400. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 9350 [https://github.com/apache/arrow/pull/9350] > [Python] Pickled ParquetFileFragment has invalid partition_expresion with > dictionary type in pyarrow 2.0 > > > Key: ARROW-11400 > URL: https://issues.apache.org/jira/browse/ARROW-11400 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > From https://github.com/dask/dask/pull/7066#issuecomment-767156623 > Simplified reproducer: > {code:python} > import pyarrow.parquet as pq > import pyarrow.dataset as ds > table = pa.table({'part': ['A', 'B']*5, 'col': range(10)}) > pq.write_to_dataset(table, "test_partitioned_parquet", > partition_cols=["part"]) > # with partitioning_kwargs = {} there is no error > partitioning_kwargs = {"max_partition_dictionary_size": -1} > dataset = ds.dataset( > "test_partitioned_parquet/", format="parquet", > partitioning=ds.HivePartitioning.discover( **partitioning_kwargs) > ) > frag = list(dataset.get_fragments())[0] > {code} > Querying this fragment works fine, but after serialization/deserialization > with pickle, it gives errors (and with the original data example I actually > got a segfault as well): > {code} > In [16]: import pickle > In [17]: frag2 = pickle.loads(pickle.dumps(frag)) > In [19]: frag2.partition_expression > ... > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: > invalid continuation byte > In [20]: frag2.to_table(schema=schema, columns=columns) > Out[20]: > pyarrow.Table > col: int64 > part: dictionary > In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas() > ... > ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ArrowException: Unknown error: Wrapping ɻ� failed > {code} > It seems the issue was specifically with a partition expression with > dictionary type. > Also when using an integer columns as the partition column, you get wrong > values (but silently in this case): > {code:python} > In [42]: frag.partition_expression > Out[42]: >1, > 2 > ][0]:dictionary)> > In [43]: frag2.partition_expression > Out[43]: >170145232, > 32754 > ][0]:dictionary)> > {code} > Now, it seems this is fixed in master. But since I don't remember it was > fixed intentionally ([~bkietz]?), it would be good to add some tests for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20
[ https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11472: --- Labels: pull-request-available (was: ) > [Python][CI] Kartothek integrations build is failing with numpy 1.20 > > > Key: ARROW-11472 > URL: https://issues.apache.org/jira/browse/ARROW-11472 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure > looks like: > {code} > ERROR collecting tests/io/dask/dataframe/test_read.py > _ > tests/io/dask/dataframe/test_read.py:185: in > @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) > kartothek/core/testing.py:65: in get_dataframe_not_nested > "unicode": pd.Series(["Ö"], dtype=np.unicode), > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: > in __init__ > data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: > in sanitize_array > subarr = _try_cast(data, dtype, copy, raise_cast_failure) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: > in _try_cast > maybe_cast_to_integer_array(arr, dtype) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: > in maybe_cast_to_integer_array > casted = np.array(arr, dtype=dtype, copy=copy) > E ValueError: invalid literal for int() with base 10: 'Ö' > {code} > So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with > numpy 1.20.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20
[ https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-11472: - Assignee: Joris Van den Bossche > [Python][CI] Kartothek integrations build is failing with numpy 1.20 > > > Key: ARROW-11472 > URL: https://issues.apache.org/jira/browse/ARROW-11472 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > > See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure > looks like: > {code} > ERROR collecting tests/io/dask/dataframe/test_read.py > _ > tests/io/dask/dataframe/test_read.py:185: in > @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) > kartothek/core/testing.py:65: in get_dataframe_not_nested > "unicode": pd.Series(["Ö"], dtype=np.unicode), > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: > in __init__ > data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: > in sanitize_array > subarr = _try_cast(data, dtype, copy, raise_cast_failure) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: > in _try_cast > maybe_cast_to_integer_array(arr, dtype) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: > in maybe_cast_to_integer_array > casted = np.array(arr, dtype=dtype, copy=copy) > E ValueError: invalid literal for int() with base 10: 'Ö' > {code} > So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with > numpy 1.20.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20
[ https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277113#comment-17277113 ] Joris Van den Bossche commented on ARROW-11472: --- Looking into this, and this seems to be cause by numpy's aliasing of {{np.unicode}} to be {{int}}. This is already reported and fixed (https://github.com/numpy/numpy/issues/18287), so I assume it will soon be in a 1.20.1 release. > [Python][CI] Kartothek integrations build is failing with numpy 1.20 > > > Key: ARROW-11472 > URL: https://issues.apache.org/jira/browse/ARROW-11472 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure > looks like: > {code} > ERROR collecting tests/io/dask/dataframe/test_read.py > _ > tests/io/dask/dataframe/test_read.py:185: in > @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) > kartothek/core/testing.py:65: in get_dataframe_not_nested > "unicode": pd.Series(["Ö"], dtype=np.unicode), > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: > in __init__ > data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: > in sanitize_array > subarr = _try_cast(data, dtype, copy, raise_cast_failure) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: > in _try_cast > maybe_cast_to_integer_array(arr, dtype) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: > in maybe_cast_to_integer_array > casted = np.array(arr, dtype=dtype, copy=copy) > E ValueError: invalid literal for int() with base 10: 'Ö' > {code} > So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with > numpy 1.20.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
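Editor's note: the failure is easy to sidestep on the kartothek side by avoiding the deprecated alias until numpy 1.20.1 lands; a small sketch (the exact kartothek change is hypothetical):

{code:python}
import numpy as np
import pandas as pd

# Under numpy 1.20.0 the deprecated alias misbehaves (see the numpy issue above),
# so this raises "invalid literal for int() with base 10: 'Ö'":
#   pd.Series(["Ö"], dtype=np.unicode)
# Using the builtin str (or an explicit "U" dtype) avoids the alias entirely:
s = pd.Series(["Ö"], dtype=str)
print(s.dtype)  # object
{code}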
[jira] [Updated] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20
[ https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11472: -- Description: See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure looks like: {code} ERROR collecting tests/io/dask/dataframe/test_read.py _ tests/io/dask/dataframe/test_read.py:185: in @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) kartothek/core/testing.py:65: in get_dataframe_not_nested "unicode": pd.Series(["Ö"], dtype=np.unicode), /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: in __init__ data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: in sanitize_array subarr = _try_cast(data, dtype, copy, raise_cast_failure) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: in _try_cast maybe_cast_to_integer_array(arr, dtype) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: in maybe_cast_to_integer_array casted = np.array(arr, dtype=dtype, copy=copy) E ValueError: invalid literal for int() with base 10: 'Ö' {code} So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with numpy 1.20.0 was: See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure looks like: {code} ERROR collecting tests/io/dask/dataframe/test_read.py _ tests/io/dask/dataframe/test_read.py:185: in @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) kartothek/core/testing.py:65: in get_dataframe_not_nested "unicode": pd.Series(["Ö"], dtype=np.unicode), /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: in __init__ data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: in sanitize_array subarr = _try_cast(data, dtype, copy, raise_cast_failure) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: in _try_cast maybe_cast_to_integer_array(arr, dtype) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: in maybe_cast_to_integer_array casted = np.array(arr, dtype=dtype, copy=copy) E ValueError: invalid literal for int() with base 10: 'Ö' {code} So it seems that {{ pd.Series(["Ö"], dtype=np.unicode)}} stopped working with numpy 1.20.0 > [Python][CI] Kartothek integrations build is failing with numpy 1.20 > > > Key: ARROW-11472 > URL: https://issues.apache.org/jira/browse/ARROW-11472 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure > looks like: > {code} > ERROR collecting tests/io/dask/dataframe/test_read.py > _ > tests/io/dask/dataframe/test_read.py:185: in > @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) > kartothek/core/testing.py:65: in get_dataframe_not_nested > "unicode": pd.Series(["Ö"], dtype=np.unicode), > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: > in __init__ > data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: > in sanitize_array > subarr = _try_cast(data, dtype, copy, raise_cast_failure) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: > in _try_cast > 
maybe_cast_to_integer_array(arr, dtype) > /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: > in maybe_cast_to_integer_array > casted = np.array(arr, dtype=dtype, copy=copy) > E ValueError: invalid literal for int() with base 10: 'Ö' > {code} > So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with > numpy 1.20.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20
Joris Van den Bossche created ARROW-11472: - Summary: [Python][CI] Kartothek integrations build is failing with numpy 1.20 Key: ARROW-11472 URL: https://issues.apache.org/jira/browse/ARROW-11472 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure looks like: {code} ERROR collecting tests/io/dask/dataframe/test_read.py _ tests/io/dask/dataframe/test_read.py:185: in @pytest.mark.parametrize("col", get_dataframe_not_nested().columns) kartothek/core/testing.py:65: in get_dataframe_not_nested "unicode": pd.Series(["Ö"], dtype=np.unicode), /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: in __init__ data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480: in sanitize_array subarr = _try_cast(data, dtype, copy, raise_cast_failure) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587: in _try_cast maybe_cast_to_integer_array(arr, dtype) /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723: in maybe_cast_to_integer_array casted = np.array(arr, dtype=dtype, copy=copy) E ValueError: invalid literal for int() with base 10: 'Ö' {code} So it seems that {{ pd.Series(["Ö"], dtype=np.unicode)}} stopped working with numpy 1.20.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale
[ https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-7288: - Assignee: Kouhei Sutou (was: Neal Richardson) > [C++][R] read_parquet() freezes on Windows with Japanese locale > --- > > Key: ARROW-7288 > URL: https://issues.apache.org/jira/browse/ARROW-7288 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.15.1 > Environment: R 3.6.1 on Windows 10 >Reporter: Hiroaki Yutani >Assignee: Kouhei Sutou >Priority: Critical > Labels: parquet, pull-request-available > Fix For: 4.0.0 > > Time Spent: 7h 10m > Remaining Estimate: 0h > > The following example on read_parquet()'s doc freezes (seems to wait for the > result forever) on my Windows. > df <- read_parquet(system.file("v0.7.1.parquet", package="arrow")) > The CRAN checks are all fine, which means the example is successfully > executed on the CRAN Windows. So, I have no idea why it doesn't work on my > local. > [https://cran.r-project.org/web/checks/check_results_arrow.html] > Here's my session info in case it helps: > {code:java} > > sessioninfo::session_info() > - Session info > - > setting value > version R version 3.6.1 (2019-07-05) > os Windows 10 x64 > system x86_64, mingw32 > ui RStudio > language en > collate Japanese_Japan.932 > ctypeJapanese_Japan.932 > tz Asia/Tokyo > date 2019-12-01 > - Packages > - > package * version date lib source > arrow * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1) > assertthat0.2.12019-03-21 [1] CRAN (R 3.6.0) > bit 1.1-14 2018-05-29 [1] CRAN (R 3.6.0) > bit64 0.9-72017-05-08 [1] CRAN (R 3.6.0) > cli 1.1.02019-03-19 [1] CRAN (R 3.6.0) > crayon1.3.42017-09-16 [1] CRAN (R 3.6.0) > fs1.3.12019-05-06 [1] CRAN (R 3.6.0) > glue 1.3.12019-03-12 [1] CRAN (R 3.6.0) > magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0) > purrr 0.3.32019-10-18 [1] CRAN (R 3.6.1) > R62.4.12019-11-12 [1] CRAN (R 3.6.1) > Rcpp 1.0.32019-11-08 [1] CRAN (R 3.6.1) > reprex0.3.02019-05-16 [1] CRAN (R 3.6.0) > rlang 0.4.22019-11-23 [1] CRAN (R 3.6.1) > rstudioapi0.10 2019-03-19 [1] CRAN (R 3.6.0) > sessioninfo 1.1.12018-11-05 [1] CRAN (R 3.6.0) > tidyselect0.2.52018-10-11 [1] CRAN (R 3.6.0) > withr 2.1.22018-03-15 [1] CRAN (R 3.6.0) > [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6 > [2] C:/Program Files/R/R-3.6.1/library > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale
[ https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-7288. --- Resolution: Fixed Issue resolved by pull request 9367 [https://github.com/apache/arrow/pull/9367] > [C++][R] read_parquet() freezes on Windows with Japanese locale > --- > > Key: ARROW-7288 > URL: https://issues.apache.org/jira/browse/ARROW-7288 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.15.1 > Environment: R 3.6.1 on Windows 10 >Reporter: Hiroaki Yutani >Assignee: Neal Richardson >Priority: Critical > Labels: parquet, pull-request-available > Fix For: 4.0.0 > > Time Spent: 7h > Remaining Estimate: 0h > > The following example on read_parquet()'s doc freezes (seems to wait for the > result forever) on my Windows. > df <- read_parquet(system.file("v0.7.1.parquet", package="arrow")) > The CRAN checks are all fine, which means the example is successfully > executed on the CRAN Windows. So, I have no idea why it doesn't work on my > local. > [https://cran.r-project.org/web/checks/check_results_arrow.html] > Here's my session info in case it helps: > {code:java} > > sessioninfo::session_info() > - Session info > - > setting value > version R version 3.6.1 (2019-07-05) > os Windows 10 x64 > system x86_64, mingw32 > ui RStudio > language en > collate Japanese_Japan.932 > ctypeJapanese_Japan.932 > tz Asia/Tokyo > date 2019-12-01 > - Packages > - > package * version date lib source > arrow * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1) > assertthat0.2.12019-03-21 [1] CRAN (R 3.6.0) > bit 1.1-14 2018-05-29 [1] CRAN (R 3.6.0) > bit64 0.9-72017-05-08 [1] CRAN (R 3.6.0) > cli 1.1.02019-03-19 [1] CRAN (R 3.6.0) > crayon1.3.42017-09-16 [1] CRAN (R 3.6.0) > fs1.3.12019-05-06 [1] CRAN (R 3.6.0) > glue 1.3.12019-03-12 [1] CRAN (R 3.6.0) > magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0) > purrr 0.3.32019-10-18 [1] CRAN (R 3.6.1) > R62.4.12019-11-12 [1] CRAN (R 3.6.1) > Rcpp 1.0.32019-11-08 [1] CRAN (R 3.6.1) > reprex0.3.02019-05-16 [1] CRAN (R 3.6.0) > rlang 0.4.22019-11-23 [1] CRAN (R 3.6.1) > rstudioapi0.10 2019-03-19 [1] CRAN (R 3.6.0) > sessioninfo 1.1.12018-11-05 [1] CRAN (R 3.6.0) > tidyselect0.2.52018-10-11 [1] CRAN (R 3.6.0) > withr 2.1.22018-03-15 [1] CRAN (R 3.6.0) > [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6 > [2] C:/Program Files/R/R-3.6.1/library > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11410) [Rust][Parquet] Implement returning dictionary arrays from parquet reader
[ https://issues.apache.org/jira/browse/ARROW-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277063#comment-17277063 ] Andrew Lamb commented on ARROW-11410: - [~yordan-pavlov] I think this would be amazing -- and we would definitely use it in IOx. This is the kind of thing that is on our longer term roadmap and I would love to help (e.g. code review, or testing , or documentation, etc). Let me know! > [Rust][Parquet] Implement returning dictionary arrays from parquet reader > - > > Key: ARROW-11410 > URL: https://issues.apache.org/jira/browse/ARROW-11410 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Yordan Pavlov >Priority: Major > > Currently the Rust parquet reader returns a regular array (e.g. string array) > even when the column is dictionary encoded in the parquet file. > If the parquet reader had the ability to return dictionary arrays for > dictionary encoded columns this would bring many benefits such as: > * faster reading of dictionary encoded columns from parquet (as no > conversion/expansion into a regular array would be necessary) > * more efficient memory use as the dictionary array would use less memory > when loaded in memory > * faster filtering operations as SIMD can be used to filter over the numeric > keys of a dictionary string array instead of comparing string values in a > string array > [~nevime] , [~alamb] let me know what you think -- This message was sent by Atlassian Jira (v8.3.4#803005)
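Editor's note: the memory argument is easy to see with the existing Python bindings; a rough illustration, using Python only as an analogy for what the Rust reader would return:

{code:python}
import pyarrow as pa

# A low-cardinality string column: many repeats of a few distinct values.
values = ["red", "green", "blue"] * 100_000
plain = pa.array(values)                 # string array with 32-bit offsets
encoded = plain.dictionary_encode()      # dictionary<indices=int32, values=string>

print(plain.nbytes)    # offsets plus every repeated character
print(encoded.nbytes)  # int32 indices plus a 3-entry dictionary; roughly half
                       # here, and far less for longer or more repetitive strings
{code}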
[jira] [Updated] (ARROW-11464) [Python] pyarrow.parquet.read_pandas doesn't conform to its docs
[ https://issues.apache.org/jira/browse/ARROW-11464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-11464: -- Fix Version/s: 4.0.0 > [Python] pyarrow.parquet.read_pandas doesn't conform to its docs > > > Key: ARROW-11464 > URL: https://issues.apache.org/jira/browse/ARROW-11464 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 > Environment: latest >Reporter: Pac A. He >Priority: Major > Fix For: 4.0.0 > > > The {{*pyarrow.parquet.read_pandas*}} > [implementation|https://github.com/apache/arrow/blob/816c23af4478fe28f31d474e90b433baeadb7a78/python/pyarrow/parquet.py#L1740-L1754] > doesn't conform to its > [docs|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_pandas.html#pyarrow.parquet.read_pandas] > in at least these ways: > # The docs state that a *{{filesystem}}* option can be provided, as it > should be. Without this option I cannot read from S3, etc. The > implementation, however, doesn't have this option! As such I currently cannot > use it to read from S3! > # The docs (_not in the type definition!_) state that the default value for > *{{use_legacy_dataset}}* is False, whereas the implementation has a value of > True. Actually, however, if a value of True is used, then I can't read a > partitioned dataset at all, so in this case maybe its the doc that needs to > change to True. > It looks to have been implemented and reviewed pretty carelessly. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
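Editor's note: until {{read_pandas}} gains the documented {{filesystem}} argument, the non-legacy {{read_table}} path accepts one; a sketch with a hypothetical bucket path and region:

{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

# Hypothetical bucket, prefix, and region.
s3 = fs.S3FileSystem(region="us-east-1")
table = pq.read_table("my-bucket/path/to/dataset", filesystem=s3,
                      use_legacy_dataset=False)
df = table.to_pandas()
{code}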
[jira] [Assigned] (ARROW-11471) [Rust] DoubleEndedIterator for BitChunks
[ https://issues.apache.org/jira/browse/ARROW-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann reassigned ARROW-11471: -- Assignee: Jörn Horstmann > [Rust] DoubleEndedIterator for BitChunks > > > Key: ARROW-11471 > URL: https://issues.apache.org/jira/browse/ARROW-11471 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 3.0.0 >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Major > > The usecase is to efficiently find the last non-null value in an array slice > by iterating over the bits in chunks and using u64::leading_zeroes -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11471) [Rust] DoubleEndedIterator for BitChunks
Jörn Horstmann created ARROW-11471: -- Summary: [Rust] DoubleEndedIterator for BitChunks Key: ARROW-11471 URL: https://issues.apache.org/jira/browse/ARROW-11471 Project: Apache Arrow Issue Type: Improvement Components: Rust Affects Versions: 3.0.0 Reporter: Jörn Horstmann The usecase is to efficiently find the last non-null value in an array slice by iterating over the bits in chunks and using u64::leading_zeroes -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-11470) [C++] Overflow occurs on integer multiplications in Compute(Row|Column)MajorStrides
[ https://issues.apache.org/jira/browse/ARROW-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-11470: --- Labels: pull-request-available (was: ) > [C++] Overflow occurs on integer multiplications in > Compute(Row|Column)MajorStrides > --- > > Key: ARROW-11470 > URL: https://issues.apache.org/jira/browse/ARROW-11470 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > OSS-Fuzz reports the integer multiplication in ComputeRowMajorStrides > function occurs overflow. > https://oss-fuzz.com/testcase-detail/623225726408 > The same issue exists in ComputeColumnMajorStrides. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11470) [C++] Overflow occurs on integer multiplications in Compute(Row|Column)MajorStrides
Kenta Murata created ARROW-11470: Summary: [C++] Overflow occurs on integer multiplications in Compute(Row|Column)MajorStrides Key: ARROW-11470 URL: https://issues.apache.org/jira/browse/ARROW-11470 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Kenta Murata Assignee: Kenta Murata OSS-Fuzz reports that the integer multiplication in the ComputeRowMajorStrides function overflows. https://oss-fuzz.com/testcase-detail/623225726408 The same issue exists in ComputeColumnMajorStrides. -- This message was sent by Atlassian Jira (v8.3.4#803005)
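Editor's note: the pattern at issue is the running product over the tensor shape; a small Python sketch of row-major stride computation with an explicit guard (illustrative only, not the Arrow C++ implementation):

{code:python}
INT64_MAX = 2**63 - 1

def row_major_strides(shape, itemsize):
    """Compute C-order strides in bytes, refusing shapes whose running
    product of extents would overflow a signed 64-bit integer."""
    strides = [0] * len(shape)
    acc = itemsize
    for i in reversed(range(len(shape))):
        strides[i] = acc
        # This multiplication is where a fixed-width implementation overflows.
        if shape[i] > 0 and acc > INT64_MAX // shape[i]:
            raise OverflowError("strides exceed the int64 range")
        acc *= shape[i]
    return strides

# Example: a normal shape works, a fuzz-like degenerate shape is rejected
# instead of silently wrapping around.
print(row_major_strides((2, 3, 4), 8))   # [96, 32, 8]
# row_major_strides((2**40, 2**40), 8)   # would raise OverflowError
{code}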