[jira] [Resolved] (ARROW-11350) [C++] Bump dependency versions

2021-02-02 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11350.
--
Resolution: Fixed

Issue resolved by pull request 9296
[https://github.com/apache/arrow/pull/9296]

> [C++] Bump dependency versions
> --
>
> Key: ARROW-11350
> URL: https://issues.apache.org/jira/browse/ARROW-11350
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11470) [C++] Overflow occurs on integer multiplications in ComputeRowMajorStrides, ComputeColumnMajorStrides, and CheckTensorStridesValidity

2021-02-02 Thread Kenta Murata (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenta Murata updated ARROW-11470:
-
Summary: [C++] Overflow occurs on integer multiplications in 
ComputeRowMajorStrides, ComputeColumnMajorStrides, and 
CheckTensorStridesValidity  (was: [C++] Overflow occurs on integer 
multiplications in Compute(Row|Column)MajorStrides)

> [C++] Overflow occurs on integer multiplications in ComputeRowMajorStrides, 
> ComputeColumnMajorStrides, and CheckTensorStridesValidity
> -
>
> Key: ARROW-11470
> URL: https://issues.apache.org/jira/browse/ARROW-11470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> OSS-Fuzz reports that the integer multiplication in the ComputeRowMajorStrides 
> function overflows.
> https://oss-fuzz.com/testcase-detail/623225726408
> The same issue exists in ComputeColumnMajorStrides.
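For illustration only (a Python sketch, not the C++ fix itself): row-major 
strides are built by multiplying the element width by the trailing dimension 
sizes, so a large enough shape pushes the product past the signed 64-bit range. 
The helper below is hypothetical and only shows the overflow condition being 
guarded against.

{code:python}
INT64_MAX = 2**63 - 1

def row_major_strides(shape, byte_width):
    # Build strides from the innermost dimension outwards, refusing to let
    # any intermediate product exceed the signed 64-bit range.
    strides = []
    stride = byte_width
    for dim in reversed(shape):
        strides.append(stride)
        if dim != 0 and stride > INT64_MAX // dim:
            raise OverflowError("stride computation overflows int64")
        stride *= dim
    return list(reversed(strides))

# 8 bytes * 2**20 * 2**20 * 2**25 exceeds 2**63 - 1, so this overflows.
try:
    row_major_strides([2**25, 2**20, 2**20], 8)
except OverflowError as exc:
    print(exc)
{code}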



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11398) [C++][Compute] Test failures with gcc-9.3 on aarch64

2021-02-02 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai closed ARROW-11398.

Resolution: Won't Fix

To summarize:
- gcc-9.3 aarch64 auto-vectorization generates buggy code for this [code 
block|https://github.com/apache/arrow/blob/d0ce28b7fdcfba16de404e20eda47097df17/cpp/src/arrow/util/bitmap_generate.h#L89-L96].
- gcc-7.5 and gcc-10.1 don't have this problem.

> [C++][Compute] Test failures with gcc-9.3 on aarch64
> 
>
> Key: ARROW-11398
> URL: https://issues.apache.org/jira/browse/ARROW-11398
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: Tobias Mayer
>Assignee: Yibo Cai
>Priority: Major
> Attachments: aarch64.log
>
>
> The tests
> * arrow-compute-scalar-test
> ** "TestCompareKernel.PrimitiveRandomTests"
> * arrow-compute-vector-test
> ** "TestFilterKernelWithNumeric/3.CompareArrayAndFilterRandomNumeric"
> ** "TestFilterKernelWithNumeric/7.CompareArrayAndFilterRandomNumeric"
> fail on aarch64 on NixOS. The full build log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11398) [C++][Compute] Test failures with gcc-9.3 on aarch64

2021-02-02 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277665#comment-17277665
 ] 

Yibo Cai commented on ARROW-11398:
--

I wrote a simple test program to reproduce this issue.
Details are in the gcc bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98949


> [C++][Compute] Test failures with gcc-9.3 on aarch64
> 
>
> Key: ARROW-11398
> URL: https://issues.apache.org/jira/browse/ARROW-11398
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: Tobias Mayer
>Assignee: Yibo Cai
>Priority: Major
> Attachments: aarch64.log
>
>
> The tests
> * arrow-compute-scalar-test
> ** "TestCompareKernel.PrimitiveRandomTests"
> * arrow-compute-vector-test
> ** "TestFilterKernelWithNumeric/3.CompareArrayAndFilterRandomNumeric"
> ** "TestFilterKernelWithNumeric/7.CompareArrayAndFilterRandomNumeric"
> fail on aarch64 on NixOS. The full build log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2021-02-02 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277659#comment-17277659
 ] 

Paul Taylor edited comment on ARROW-10255 at 2/3/21, 3:44 AM:
--

[~bhulette] I vote no on the current PR for 4 reasons: 
# Arrow releases now bump the major version only, so npm won't upgrade 
libraries or users to it by default
# The [very minor 
changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507]
 we had to make to our own tests make me think this shouldn't be a huge pain 
for users to upgrade through (not to mention Field and Schema aren't the most 
common APIs to interact with directly)
# These changes are needed by [vis.gl|https://github.com/visgl] to import 
ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're 
reimplementing the bits of Schema/Field/DataType they need because importing 
ours adds ~250k (~24k minified) to their bundle, which is over their size 
budget.
# Adding deprecation warnings will add to the size of the library, and likely 
in ways that can't be tree-shaken. In Python, on-disk size isn't an issue, so 
people add deprecation warnings all the time, but without extensive tooling 
support it's difficult to do, or to guide users through, in JS.



was (Author: paul.e.taylor):
[~bhulette] I vote no on the current PR for 3 reasons: 
# Arrow releases now bump the major version only, so npm won't upgrade 
libraries or users to it by default
# The [very minor 
changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507]
 we had to make to our own tests make me think this shouldn't be a huge pain 
for users to upgrade through (not to mention Field and Schema aren't the most 
common APIs to interact with directly)
# These changes are needed by [vis.gl|https://github.com/visgl] to import 
ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're 
reimplementing the bits of Schema/Field/DataType they need because importing 
ours adds ~250k (~24k minified) to their bundle, which is over their size 
budget.


> [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
> ---
>
> Key: ARROW-10255
> URL: https://issues.apache.org/jira/browse/ARROW-10255
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 0.17.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Presently most of our public classes can't be easily 
> [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library 
> consumers. This is a problem for libraries that only need to use parts of 
> Arrow.
> For example, the vis.gl projects have an integration test that imports three 
> of our simpler classes and tests the resulting bundle size:
> {code:javascript}
> import {Schema, Field, Float32} from 'apache-arrow';
> // | Bundle Size             | Compressed
> // | 202 KB (207,112 bytes)  | 45 KB (46,618 bytes)
> {code}
> We can help solve this with the following changes:
> * Add "sideEffects": false to our ESM package.json
> * Reorganize our imports to only include what's needed
> * Eliminate or move some static/member methods to standalone exported 
> functions
> * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't 
> compile in its own Buffer shim
> * Remove flatbuffers namespaces from generated TS because they defeat 
> Webpack's tree-shaking ability
> Candidate functions for removal/moving to standalone functions:
> * Schema.new, Schema.from, Schema.prototype.compareTo
> * Field.prototype.compareTo
> * Type.prototype.compareTo
> * Table.new, Table.from
> * Column.new
> * Vector.new, Vector.from
> * RecordBatchReader.from
> After applying a few of the above changes to the Schema and flatbuffers 
> files, I was able to reduce vis.gl's import size by 90%:
> {code:javascript}
> // Bundle Size           | Compressed
> // 24 KB (24,942 bytes)  | 6 KB (6,154 bytes)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2021-02-02 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277659#comment-17277659
 ] 

Paul Taylor commented on ARROW-10255:
-

[~bhulette] I vote no on the current PR for 3 reasons: 
# Arrow releases now bump the major version only, so npm won't upgrade 
libraries or users to it by default
# The [very minor 
changes|https://github.com/apache/arrow/pull/8418/files#diff-281075682d7444bc1be962b47cff16401e18f1b9bafee2b58557a3f73fb54507]
 we had to make to our own tests make me think this shouldn't be a huge pain 
for users to upgrade through (not to mention Field and Schema aren't the most 
common APIs to interact with directly)
# These changes are needed by [vis.gl|https://github.com/visgl] to import 
ArrowJS in [loaders.gl|https://github.com/visgl/loaders.gl]. Currently they're 
reimplementing the bits of Schema/Field/DataType they need because importing 
ours adds ~250k (~24k minified) to their bundle, which is over their size 
budget.


> [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
> ---
>
> Key: ARROW-10255
> URL: https://issues.apache.org/jira/browse/ARROW-10255
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 0.17.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Presently most of our public classes can't be easily 
> [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library 
> consumers. This is a problem for libraries that only need to use parts of 
> Arrow.
> For example, the vis.gl projects have an integration test that imports three 
> of our simpler classes and tests the resulting bundle size:
> {code:javascript}
> import {Schema, Field, Float32} from 'apache-arrow';
> // | Bundle Size             | Compressed
> // | 202 KB (207,112 bytes)  | 45 KB (46,618 bytes)
> {code}
> We can help solve this with the following changes:
> * Add "sideEffects": false to our ESM package.json
> * Reorganize our imports to only include what's needed
> * Eliminate or move some static/member methods to standalone exported 
> functions
> * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't 
> compile in its own Buffer shim
> * Remove flatbuffers namespaces from generated TS because they defeat 
> Webpack's tree-shaking ability
> Candidate functions for removal/moving to standalone functions:
> * Schema.new, Schema.from, Schema.prototype.compareTo
> * Field.prototype.compareTo
> * Type.prototype.compareTo
> * Table.new, Table.from
> * Column.new
> * Vector.new, Vector.from
> * RecordBatchReader.from
> After applying a few of the above changes to the Schema and flatbuffers 
> files, I was able to reduce vis.gl's import size by 90%:
> {code:javascript}
> // Bundle Size           | Compressed
> // 24 KB (24,942 bytes)  | 6 KB (6,154 bytes)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Tao He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277647#comment-17277647
 ] 

Tao He commented on ARROW-11463:


Thanks for the background on pickle 5, [~lausen].

[~lausen] For your case, rather than invoking `pa.serialize`, you could create 
a stream using `pa.ipc.new_stream`, feed it an IpcWriteOptions, and then use 
`.write_table()` to write your tables to the stream.

cf.:

+ 
[https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream]

+ 
[https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchStreamWriter.html#pyarrow.RecordBatchStreamWriter.write_table]

Hope that helps!
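A minimal sketch of that suggestion (it assumes {{allow_64bit}} is exposed on 
{{pyarrow.ipc.IpcWriteOptions}}, which is what this issue adds; the toy table 
stands in for the real chunked table):

{code:python}
import pyarrow as pa

# Sketch only: write a table to an IPC stream with 64-bit field lengths
# enabled. Assumes IpcWriteOptions exposes allow_64bit, as added by the
# pull request attached to this issue.
table = pa.table({"x": list(range(10))})            # stand-in for the real table
options = pa.ipc.IpcWriteOptions(allow_64bit=True)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)
serialized = sink.getvalue()  # pyarrow.Buffer holding the IPC stream
{code}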

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table 
> with combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows, and serialization of the table with combined chunks (e.g. for the Plasma 
> store) will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large Parquet files? It let me write this 
parquet file, but now it doesn't let me read it back. I don't understand why 
Arrow uses [31-bit 
computing|https://arrow.apache.org/docs/format/Columnar.html#array-lengths]. 
It's not even 32-bit, given that sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64-encoded string of length 24.

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large Parquet files? It let me write this 
parquet file, but now it doesn't let me read it back. I don't understand why 
Arrow uses [31-bit 
computing|https://arrow.apache.org/docs/format/Columnar.html#array-lengths]. 
It's not even 32-bit, given that sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64-encoded string of length 22.
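One workaround that may be worth trying (a hedged sketch, not verified against 
this data): scan the file as a stream of record batches with 
{{pyarrow.dataset}}, so no single Arrow chunk has to hold all 1.5 billion 
strings. {{large.parquet}} is a placeholder path.

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Sketch only: reading in batches keeps each chunk well below the
# 2**31 - 1 element limit of 32-bit offset builders.
dataset = ds.dataset("large.parquet", format="parquet")  # placeholder path
batches = [batch for batch in dataset.to_batches()]
table = pa.Table.from_batches(batches)  # chunked table, no single huge chunk
{code}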


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 

[jira] [Updated] (ARROW-11479) Add method to return compressed size of row group

2021-02-02 Thread Manoj Karthick (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Karthick updated ARROW-11479:
---
Issue Type: New Feature  (was: Improvement)

> Add method to return compressed size of row group
> -
>
> Key: ARROW-11479
> URL: https://issues.apache.org/jira/browse/ARROW-11479
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Manoj Karthick
>Priority: Minor
>
> Create a method to return the compressed size of all columns in the row 
> group. This will help with calculating the total compressed size of the 
> Parquet File.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11479) Add method to return compressed size of row group

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11479:
---
Labels: pull-request-available  (was: )

> Add method to return compressed size of row group
> -
>
> Key: ARROW-11479
> URL: https://issues.apache.org/jira/browse/ARROW-11479
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Manoj Karthick
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Create a method to return the compressed size of all columns in the row 
> group. This will help with calculating the total compressed size of the 
> Parquet File.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11479) Add method to return compressed size of row group

2021-02-02 Thread Manoj Karthick (Jira)
Manoj Karthick created ARROW-11479:
--

 Summary: Add method to return compressed size of row group
 Key: ARROW-11479
 URL: https://issues.apache.org/jira/browse/ARROW-11479
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Manoj Karthick


Create a method to return the compressed size of all columns in the row group. 
This will help with calculating the total compressed size of the Parquet File.
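For comparison, per-column compressed sizes are already exposed through the 
Parquet metadata in pyarrow; the following is an illustrative sketch of the 
equivalent computation (not the Rust API this issue asks for), with 
{{example.parquet}} as a placeholder path.

{code:python}
import pyarrow.parquet as pq

# Sum per-column compressed sizes to get each row group's compressed size,
# then sum over row groups for the whole file.
meta = pq.ParquetFile("example.parquet").metadata  # placeholder path
file_total = 0
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    rg_total = sum(
        row_group.column(col).total_compressed_size
        for col in range(row_group.num_columns)
    )
    file_total += rg_total
print(file_total)
{code}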



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11478) [R] Consider ways to make arrow.skip_nul option more user-friendly

2021-02-02 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277628#comment-17277628
 ] 

Ian Cook commented on ARROW-11478:
--

I think I'm in favor of option 3, assuming it's feasible (which I think it 
should be).

> [R] Consider ways to make arrow.skip_nul option more user-friendly
> --
>
> Key: ARROW-11478
> URL: https://issues.apache.org/jira/browse/ARROW-11478
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Minor
> Fix For: 4.0.0
>
>
> In Arrow 3.0.0, the {{arrow.skip_nul}} option effectively defaults to 
> {{FALSE}} for consistency with {{base::readLines}} and {{base::scan}}.
> If the user keeps this default option value, then conversion of string data 
> containing embedded nuls causes an error with a message like:
> {code:java}
> embedded nul in string: '\0' {code}
> If the user sets the option to {{TRUE}}, then no error occurs, but this 
> warning is issued:
> {code:java}
> Stripping '\0' (nul) from character vector {code}
> Consider whether we should:
>  # Keep this all as it is
>  # Change the default option value to {{TRUE}}
>  # Keep the default option value as it is, but catch the error and re-throw 
> it with a more actionable message that tells the user how to set the option



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11478) [R] Consider ways to make arrow.skip_nul option more user-friendly

2021-02-02 Thread Ian Cook (Jira)
Ian Cook created ARROW-11478:


 Summary: [R] Consider ways to make arrow.skip_nul option more 
user-friendly
 Key: ARROW-11478
 URL: https://issues.apache.org/jira/browse/ARROW-11478
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook
 Fix For: 4.0.0


In Arrow 3.0.0, the {{arrow.skip_nul}} option effectively defaults to {{FALSE}} 
for consistency with {{base::readLines}} and {{base::scan}}.

If the user keeps this default option value, then conversion of string data 
containing embedded nuls causes an error with a message like:
{code:java}
embedded nul in string: '\0' {code}
If the user sets the option to {{TRUE}}, then no error occurs, but this warning 
is issued:
{code:java}
Stripping '\0' (nul) from character vector {code}
Consider whether we should:
 # Keep this all as it is
 # Change the default option value to {{TRUE}}
 # Keep the default option value as it is, but catch the error and re-throw it 
with a more actionable message that tells the user how to set the option



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-951) [JS] Fix generated API documentation

2021-02-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-951.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9375
[https://github.com/apache/arrow/pull/9375]

> [JS] Fix generated API documentation
> 
>
> Key: ARROW-951
> URL: https://issues.apache.org/jira/browse/ARROW-951
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Minor
>  Labels: documentation, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The current generated API documentation doesn't respect the project's 
> namespaces, it simply lists all exported objects. We should see if we can 
> make typedoc display the project's structure (even if it means re-structuring 
> the code a bit), or find another approach for doc generation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11467) [R] Fix reference to json_table_reader() in R docs

2021-02-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-11467.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9393
[https://github.com/apache/arrow/pull/9393]

> [R] Fix reference to json_table_reader() in R docs
> --
>
> Key: ARROW-11467
> URL: https://issues.apache.org/jira/browse/ARROW-11467
> Project: Apache Arrow
>  Issue Type: Task
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The docs entry for the R function {{read_json_arrow()}} refers to the 
> nonexistent function {{json_table_reader()}}. This should be changed to 
> {{JsonTableReader$create()}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv

2021-02-02 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277546#comment-17277546
 ] 

Jonathan Keane commented on ARROW-11433:


Yeah, I tried it with the system allocator and that alone doesn’t resolve it.

> [R] Unexpectedly slow results reading csv
> -
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Minor
>
> This came up while benchmarking Arrow's CSV reading. As far as I can 
> tell this only impacts R, and only when reading the csv into Arrow (but not 
> pulling it into R). It appears that most Arrow interactions after the csv is 
> read prevent this behavior from happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and 
> longer (frequently in a stair step pattern where every other iteration takes 
> longer).
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  24.788  19.485   7.216 
>user  system elapsed 
>  24.952  21.786   9.225 
>user  system elapsed 
>  25.150  23.039  10.332 
>user  system elapsed 
>  25.382  31.012  17.995 
>user  system elapsed 
>  25.309  25.140  12.356 
>user  system elapsed 
>  25.302  26.975  13.938 
>user  system elapsed 
>  25.509  34.390  21.134 
>user  system elapsed 
>  25.674  28.195  15.048 
>user  system elapsed 
>  25.031  28.094  16.449 
>user  system elapsed 
>  25.825  37.165  23.379 
> # total time:
>user  system elapsed 
> 256.178 299.671 175.119 
> {code}
> Interestingly, doing something as unrelated as calling 
> {{arrow:::default_memory_pool()}} (which [only gets the default memory 
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]) 
> alleviates this behavior. Other interactions totally unrelated to the table 
> also alleviate it (e.g. {{empty_tab <- Table$create(data.frame())}}), as does 
> proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  25.257  19.475   6.785 
>user  system elapsed 
>  25.271  19.838   6.821 
>user  system elapsed 
>  25.288  20.103   6.861 
>user  system elapsed 
>  25.188  20.290   7.217 
>user  system elapsed 
>  25.283  20.043   6.832 
>user  system elapsed 
>  25.194  19.947   6.906 
>user  system elapsed 
>  25.278  19.993   6.834 
>user  system elapsed 
>  25.355  20.018   6.833 
>user  system elapsed 
>  24.986  19.869   6.865 
>user  system elapsed 
>  25.130  19.878   6.798 
> # total time:
>user  system elapsed 
> 255.381 210.598  83.109 ​
> > 
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the 
> same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
>     start = time.time()
>     table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
>     print(time.time() - start)
>     del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv

2021-02-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277544#comment-17277544
 ] 

Neal Richardson commented on ARROW-11433:
-

"Only on mac" and "freeing memory" makes me think of ARROW-6994 and the various 
issues linked to that (see also 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L67-L85).
 I forget, have you tried using the system memory allocator?

> [R] Unexpectedly slow results reading csv
> -
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Minor
>
> This came up while benchmarking Arrow's CSV reading. As far as I can 
> tell this only impacts R, and only when reading the csv into Arrow (but not 
> pulling it into R). It appears that most Arrow interactions after the csv is 
> read prevent this behavior from happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and 
> longer (frequently in a stair step pattern where every other iteration takes 
> longer).
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  24.788  19.485   7.216 
>user  system elapsed 
>  24.952  21.786   9.225 
>user  system elapsed 
>  25.150  23.039  10.332 
>user  system elapsed 
>  25.382  31.012  17.995 
>user  system elapsed 
>  25.309  25.140  12.356 
>user  system elapsed 
>  25.302  26.975  13.938 
>user  system elapsed 
>  25.509  34.390  21.134 
>user  system elapsed 
>  25.674  28.195  15.048 
>user  system elapsed 
>  25.031  28.094  16.449 
>user  system elapsed 
>  25.825  37.165  23.379 
> # total time:
>user  system elapsed 
> 256.178 299.671 175.119 
> {code}
> Interestingly, doing something as unrelated as calling 
> {{arrow:::default_memory_pool()}} (which [only gets the default memory 
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]) 
> alleviates this behavior. Other interactions totally unrelated to the table 
> also alleviate it (e.g. {{empty_tab <- Table$create(data.frame())}}), as does 
> proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  25.257  19.475   6.785 
>user  system elapsed 
>  25.271  19.838   6.821 
>user  system elapsed 
>  25.288  20.103   6.861 
>user  system elapsed 
>  25.188  20.290   7.217 
>user  system elapsed 
>  25.283  20.043   6.832 
>user  system elapsed 
>  25.194  19.947   6.906 
>user  system elapsed 
>  25.278  19.993   6.834 
>user  system elapsed 
>  25.355  20.018   6.833 
>user  system elapsed 
>  24.986  19.869   6.865 
>user  system elapsed 
>  25.130  19.878   6.798 
> # total time:
>user  system elapsed 
> 255.381 210.598  83.109 ​
> > 
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the 
> same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
>     start = time.time()
>     table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
>     print(time.time() - start)
>     del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv

2021-02-02 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277543#comment-17277543
 ] 

Jonathan Keane commented on ARROW-11433:


We thought it might be due to the mmapping of the file, but turning it off at 
https://github.com/apache/arrow/blob/master/r/R/csv.R#L187 with {{mmap = 
FALSE}} still exhibits the same pattern.

> [R] Unexpectedly slow results reading csv
> -
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Minor
>
> This came up while benchmarking Arrow's CSV reading. As far as I can 
> tell this only impacts R, and only when reading the csv into Arrow (but not 
> pulling it into R). It appears that most Arrow interactions after the csv is 
> read prevent this behavior from happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and 
> longer (frequently in a stair step pattern where every other iteration takes 
> longer).
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  24.788  19.485   7.216 
>user  system elapsed 
>  24.952  21.786   9.225 
>user  system elapsed 
>  25.150  23.039  10.332 
>user  system elapsed 
>  25.382  31.012  17.995 
>user  system elapsed 
>  25.309  25.140  12.356 
>user  system elapsed 
>  25.302  26.975  13.938 
>user  system elapsed 
>  25.509  34.390  21.134 
>user  system elapsed 
>  25.674  28.195  15.048 
>user  system elapsed 
>  25.031  28.094  16.449 
>user  system elapsed 
>  25.825  37.165  23.379 
> # total time:
>user  system elapsed 
> 256.178 299.671 175.119 
> {code}
> Interestingly, doing something as unrelated as calling 
> {{arrow:::default_memory_pool()}} (which [only gets the default memory 
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]) 
> alleviates this behavior. Other interactions totally unrelated to the table 
> also alleviate it (e.g. {{empty_tab <- Table$create(data.frame())}}), as does 
> proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  25.257  19.475   6.785 
>user  system elapsed 
>  25.271  19.838   6.821 
>user  system elapsed 
>  25.288  20.103   6.861 
>user  system elapsed 
>  25.188  20.290   7.217 
>user  system elapsed 
>  25.283  20.043   6.832 
>user  system elapsed 
>  25.194  19.947   6.906 
>user  system elapsed 
>  25.278  19.993   6.834 
>user  system elapsed 
>  25.355  20.018   6.833 
>user  system elapsed 
>  24.986  19.869   6.865 
>user  system elapsed 
>  25.130  19.878   6.798 
> # total time:
>user  system elapsed 
> 255.381 210.598  83.109 ​
> > 
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the 
> same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
>     start = time.time()
>     table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
>     print(time.time() - start)
>     del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results reading csv

2021-02-02 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277537#comment-17277537
 ] 

Jonathan Keane commented on ARROW-11433:


Ben and I spent some time on this today. It turns out it's not reproducible on 
Ubuntu, as far as we tried, so we suspect something specific to macOS.

As another example, even the most basic C++ <-> R interaction alleviates the 
behavior:

{code:r}
> system.time({
+   for (i in 1:10) {
+ print(system.time(tab <- 
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ dev_null <- arrow::CsvParseOptions$create()
+ tab <- NULL
+   }
+ })
   user  system elapsed 
 27.894  22.633  12.224 
   user  system elapsed 
 25.016  19.855   8.576 
   user  system elapsed 
 25.092  20.625   8.051 
   user  system elapsed 
 25.263  21.161   8.353 
   user  system elapsed 
 25.168  20.575   8.745 
   user  system elapsed 
 25.087  20.207   7.672 
   user  system elapsed 
 25.311  20.525   7.438 
   user  system elapsed 
 25.013  20.537   7.962 
   user  system elapsed 
 25.088  20.671   8.371 
   user  system elapsed 
 25.409  20.375   7.659 
   user  system elapsed 
258.004 224.511 106.238 
{code}

> [R] Unexpectedly slow results reading csv
> -
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Minor
>
> This came up while benchmarking Arrow's CSV reading. As far as I can 
> tell this only impacts R, and only when reading the csv into Arrow (but not 
> pulling it into R). It appears that most Arrow interactions after the csv is 
> read prevent this behavior from happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and 
> longer (frequently in a stair step pattern where every other iteration takes 
> longer).
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  24.788  19.485   7.216 
>user  system elapsed 
>  24.952  21.786   9.225 
>user  system elapsed 
>  25.150  23.039  10.332 
>user  system elapsed 
>  25.382  31.012  17.995 
>user  system elapsed 
>  25.309  25.140  12.356 
>user  system elapsed 
>  25.302  26.975  13.938 
>user  system elapsed 
>  25.509  34.390  21.134 
>user  system elapsed 
>  25.674  28.195  15.048 
>user  system elapsed 
>  25.031  28.094  16.449 
>user  system elapsed 
>  25.825  37.165  23.379 
> # total time:
>user  system elapsed 
> 256.178 299.671 175.119 
> {code}
> Interestingly, doing something as unrelated as calling 
> {{arrow:::default_memory_pool()}} (which [only gets the default memory 
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]) 
> alleviates this behavior. Other interactions totally unrelated to the table 
> also alleviate it (e.g. {{empty_tab <- Table$create(data.frame())}}), as does 
> proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> + print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> +   }
> + })
>user  system elapsed 
>  25.257  19.475   6.785 
>user  system elapsed 
>  25.271  19.838   6.821 
>user  system elapsed 
>  25.288  20.103   6.861 
>user  system elapsed 
>  25.188  20.290   7.217 
>user  system elapsed 
>  25.283  20.043   6.832 
>user  system elapsed 
>  25.194  19.947   6.906 
>user  system elapsed 
>  25.278  19.993   6.834 
>user  system elapsed 
>  25.355  20.018   6.833 
>user  system elapsed 
>  24.986  19.869   6.865 
>user  system elapsed 
>  25.130  19.878   6.798 
> # total time:
>user  system elapsed 
> 255.381 210.598  83.109 ​
> > 
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the 
> same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
>     start = time.time()
>     table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
>     print(time.time() - start)
>     del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content

2021-02-02 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-11477:
-
Description: 
Collecting various ideas here for general ways to improve the R package README 
and vignettes for the 4.0.0 release:
 * Consider moving the "building" and "developing" content out of the README 
and into a vignette focused on that topic. (Rationale: most users of the R 
package today are downloading prebuilt binaries, not building their own; most 
users today are end users, not developers; a more valuable use for the 
README—especially since it's the homepage of the R docs site—would be as a 
place to highlight key capabilities of the package, not to show folks all the 
technical details of building it.)
 * Get the "Using the Arrow C++ Library in R" vignette to show in the Articles 
menu on the R docs site.
 * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear that 
dplyr verbs can be used with Arrow Tables and RecordBatches (not just Datasets) 
and describe differences in dplyr support for these different Arrow objects.
 * Check all the links in the "Project docs" menu on the docs site; some of 
them are currently broken or go to directory listings

  was:
Collecting various ideas here for general ways to improve the R package README 
and vignettes for the 4.0.0 release:
 * Consider moving the "building" and "developing" content out of the README 
and into a vignette focused on that topic. (Rationale: most users of the R 
package today are downloading prebuilt binaries, not building their own; most 
users today are end users, not developers; a more valuable use for the 
README—especially since it's the homepage of the R docs site—would be as a 
place to highlight key capabilities of the package, not to show folks all the 
technical details of building it.)
 * Get the "Using the Arrow C++ Library in R" vignette to show in the Articles 
menu on the R docs site.
 * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear that 
dplyr verbs can be used with Arrow Tables and RecordBatches (not just Datasets) 
and describe differences in dplyr support for these different Arrow objects.


> [R][Doc] Reorganize and improve README and vignette content
> ---
>
> Key: ARROW-11477
> URL: https://issues.apache.org/jira/browse/ARROW-11477
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 4.0.0
>
>
> Collecting various ideas here for general ways to improve the R package 
> README and vignettes for the 4.0.0 release:
>  * Consider moving the "building" and "developing" content out of the README 
> and into a vignette focused on that topic. (Rationale: most users of the R 
> package today are downloading prebuilt binaries, not building their own; most 
> users today are end users, not developers; a more valuable use for the 
> README—especially since it's the homepage of the R docs site—would be as 
> a place to highlight key capabilities of the package, not to show folks all 
> the technical details of building it.)
>  * Get the "Using the Arrow C++ Library in R" vignette to show in the 
> Articles menu on the R docs site.
>  * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear 
> that dplyr verbs can be used with Arrow Tables and RecordBatches (not just 
> Datasets) and describe differences in dplyr support for these different Arrow 
> objects.
>  * Check all the links in the "Project docs" menu on the docs site; some of 
> them are currently broken or go to directory listings



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277515#comment-17277515
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you for sharing the tests / example code, [~apitrou]. Pickle protocol 5 is 
really useful. For example, the following code replicates my Plasma-store use 
case by providing a folder in {{/dev/shm}} as {{path}}.
{code:python}
import pickle
import mmap

def shm_pickle(path, tbl):
    idx = 0
    def buffer_callback(buf):
        nonlocal idx
        with open(path / f'{idx}.bin', 'wb') as f:
            f.write(buf)
        idx += 1
    with open(path / 'meta.pkl', 'wb') as f:
        pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback)


def shm_unpickle(path):
    num_buffers = len(list(path.iterdir())) - 1  # exclude meta.pkl
    buffers = []
    for idx in range(num_buffers):
        f = open(path / f'{idx}.bin', 'rb')
        buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
    with open(path / 'meta.pkl', 'rb') as f:
        return pickle.load(f, buffers=buffers)
{code}
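A hypothetical round trip with those helpers (the {{/dev/shm/example}} directory 
and the toy table are assumptions, not part of the original use case):

{code:python}
import pathlib
import pyarrow as pa

# Hypothetical usage of shm_pickle/shm_unpickle above; /dev/shm/example is
# an assumed scratch directory on a Linux host.
path = pathlib.Path("/dev/shm/example")
path.mkdir(parents=True, exist_ok=True)
tbl = pa.table({"x": list(range(10))})
shm_pickle(path, tbl)
assert shm_unpickle(path).equals(tbl)
{code}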

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table 
> with combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows, and serialization of the table with combined chunks (e.g. for the Plasma 
> store) will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11310) [Rust] Implement arrow JSON writer

2021-02-02 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11310.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9256
[https://github.com/apache/arrow/pull/9256]

> [Rust] Implement arrow JSON writer
> --
>
> Key: ARROW-11310
> URL: https://issues.apache.org/jira/browse/ARROW-11310
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content

2021-02-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277511#comment-17277511
 ] 

Neal Richardson commented on ARROW-11477:
-

Re the "Using the Arrow C++ Library in R" vignette, it is (or at least used to 
be) pkgdown's convention that `vignette("pkgname")` gets put as the "Get 
started" link in the top menu bar, and all other vignettes go under Articles. 
But we have the ability to override that. 

> [R][Doc] Reorganize and improve README and vignette content
> ---
>
> Key: ARROW-11477
> URL: https://issues.apache.org/jira/browse/ARROW-11477
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 4.0.0
>
>
> Collecting various ideas here for general ways to improve the R package 
> README and vignettes for the 4.0.0 release:
>  * Consider moving the "building" and "developing" content out of the README 
> and into a vignette focused on that topic. (Rationale: most users of the R 
> package today are downloading prebuilt binaries, not building their own; most 
> users today are end users, not developers; a more valuable use for the 
> README—especially since it's the homepage of the R docs site—would be as 
> a place to highlight key capabilities of the package, not to show folks all 
> the technical details of building it.)
>  * Get the "Using the Arrow C++ Library in R" vignette to show in the 
> Articles menu on the R docs site.
>  * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear 
> that dplyr verbs can be used with Arrow Tables and RecordBatches (not just 
> Datasets) and describe differences in dplyr support for these different Arrow 
> objects.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content

2021-02-02 Thread Ian Cook (Jira)
Ian Cook created ARROW-11477:


 Summary: [R][Doc] Reorganize and improve README and vignette 
content
 Key: ARROW-11477
 URL: https://issues.apache.org/jira/browse/ARROW-11477
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook
 Fix For: 4.0.0


Collecting various ideas here for general ways to improve the R package README 
and vignettes for the 4.0.0 release:
 * Consider moving the "building" and "developing" content out of the README 
and into a vignette focused on that topic. (Rationale: most users of the R 
package today are downloading prebuilt binaries, not building their own; most 
users today are end users, not developers; a more valuable use for the 
README—especially since it's the homepage of the R docs site—would be as a 
place to highlight key capabilities of the package, not to show folks all the 
technical details of building it.)
 * Get the "Using the Arrow C++ Library in R" vignette to show in the Articles 
menu on the R docs site.
 * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear that 
dplyr verbs can be used with Arrow Tables and RecordBatches (not just Datasets) 
and describe differences in dplyr support for these different Arrow objects.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11474) [C++] Update bundled re2 version

2021-02-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-11474.
---
  Assignee: Kouhei Sutou
Resolution: Duplicate

Done in ARROW-11350 after all

> [C++] Update bundled re2 version
> 
>
> Key: ARROW-11474
> URL: https://issues.apache.org/jira/browse/ARROW-11474
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 4.0.0
>
>
> I tried increasing the re2 version to 2020-11-01 in ARROW-11350 but it failed 
> in a few builds with 
> {code}
> /usr/bin/ar: 
> /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a:
>  No such file or directory
> make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9
> make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2
> {code}
> (or similar). My theory is that something changed in their cmake build setup 
> so that either libre2.a is not where we expect it, or it's building a shared 
> library instead, or something.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11476) [Rust][DataFusion] Test running of TPCH benchmarks in CI

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11476:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Test running of TPCH benchmarks in CI
> 
>
> Key: ARROW-11476
> URL: https://issues.apache.org/jira/browse/ARROW-11476
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11476) [Rust][DataFusion] Test running of TPCH benchmarks in CI

2021-02-02 Thread Jira
Daniël Heres created ARROW-11476:


 Summary: [Rust][DataFusion] Test running of TPCH benchmarks in CI
 Key: ARROW-11476
 URL: https://issues.apache.org/jira/browse/ARROW-11476
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS

2021-02-02 Thread Ali Cetin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277432#comment-17277432
 ] 

Ali Cetin commented on ARROW-11427:
---

Cool. I can give it a try in the coming days.

> [C++] Arrow uses AVX512 instructions even when not supported by the OS
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS

2021-02-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277421#comment-17277421
 ] 

Antoine Pitrou commented on ARROW-11427:


(removed previous post, sorry)

[~ali.cetin] Can you try to install the following wheel and see it if fixes the 
issue?

[https://github.com/ursacomputing/crossbow/releases/download/build-27-github-wheel-windows-cp38/pyarrow-3.1.0.dev112-cp38-cp38-win_amd64.whl]

Also, it will allow you to inspect the current SIMD level, like this:
{code:java}
$ python -c "import pyarrow as pa; print(pa.runtime_info())"
RuntimeInfo(simd_level='avx2', detected_simd_level='avx2')
{code}

You should get "avx2" on Windows Server 2012, and "avx512" on Windows Server 
2016.

> [C++] Arrow uses AVX512 instructions even when not supported by the OS
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11435) Allow creating ParquetPartition from external crate

2021-02-02 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11435.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9369
[https://github.com/apache/arrow/pull/9369]

> Allow creating ParquetPartition from external crate
> ---
>
> Key: ARROW-11435
> URL: https://issues.apache.org/jira/browse/ARROW-11435
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - DataFusion
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Without this functionality, it's not possible to implement table provider in 
> external crate that targets parquet format since ParquetExec takes 
> ParquetPartition as an argument.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11475) [C++] Upgrade mimalloc

2021-02-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11475:
---

 Summary: [C++] Upgrade mimalloc
 Key: ARROW-11475
 URL: https://issues.apache.org/jira/browse/ARROW-11475
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Neal Richardson
 Fix For: 4.0.0


I tried this in ARROW-11350 but ran into an issue 
(https://github.com/microsoft/mimalloc/issues/353). That has since been 
resolved and we could apply a patch to bring it in. Or we can wait for it to 
get into a proper release.

There is also now a 1.7 release, which claims to work on the Apple M1, as well 
as a 2.0 version, which claims better performance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11474) [C++] Update bundled re2 version

2021-02-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11474:

Description: 
I tried increasing the re2 version to 2020-11-01 in ARROW-11350 but it failed 
in a few builds with 

{code}
/usr/bin/ar: 
/root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a:
 No such file or directory
make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9
make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2
{code}

(or similar). My theory is that something changed in their cmake build setup so 
that either libre2.a is not where we expect it, or it's building a shared 
library instead, or something.

  was:
I tried increasing the re2 version to 2020-11-01 in 

but it failed in a few builds with 

{code}
/usr/bin/ar: 
/root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a:
 No such file or directory
make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9
make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2
{code}

(or similar). My theory is that something changed in their cmake build setup so 
that either libre2.a is not where we expect it, or it's building a shared 
library instead, or something.


> [C++] Update bundled re2 version
> 
>
> Key: ARROW-11474
> URL: https://issues.apache.org/jira/browse/ARROW-11474
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 4.0.0
>
>
> I tried increasing the re2 version to 2020-11-01 in ARROW-11350 but it failed 
> in a few builds with 
> {code}
> /usr/bin/ar: 
> /root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a:
>  No such file or directory
> make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9
> make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2
> {code}
> (or similar). My theory is that something changed in their cmake build setup 
> so that either libre2.a is not where we expect it, or it's building a shared 
> library instead, or something.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11474) [C++] Update bundled re2 version

2021-02-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11474:
---

 Summary: [C++] Update bundled re2 version
 Key: ARROW-11474
 URL: https://issues.apache.org/jira/browse/ARROW-11474
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Neal Richardson
 Fix For: 4.0.0


I tried increasing the re2 version to 2020-11-01 in 

but it failed in a few builds with 

{code}
/usr/bin/ar: 
/root/rpmbuild/BUILD/apache-arrow-3.1.0.dev107/cpp/build/re2_ep-install/lib/libre2.a:
 No such file or directory
make[2]: *** [release/libarrow_bundled_dependencies.a] Error 9
make[1]: *** [src/arrow/CMakeFiles/arrow_bundled_dependencies.dir/all] Error 2
{code}

(or similar). My theory is that something changed in their cmake build setup so 
that either libre2.a is not where we expect it, or it's building a shared 
library instead, or something.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11427:
---
Comment: was deleted

(was: [~ali.cetin] Could you try installing this wheel and see if it fixes the 
issue:

[https://github.com/ursacomputing/crossbow/releases/download/build-26-github-wheel-windows-cp38/pyarrow-3.1.0.dev109-cp38-cp38-win_amd64.whl]

?)

> [C++] Arrow uses AVX512 instructions even when not supported by the OS
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277392#comment-17277392
 ] 

Antoine Pitrou commented on ARROW-11463:


PyArrow serialization is deprecated; users should use pickle themselves.

It is true that out-of-band data provides zero-copy support for buffers 
embedded in the pickled data. It is tested here:

[https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_array.py#L1695]
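
For reference, a minimal sketch of the out-of-band mechanism (Python 3.8+); the array and variable names below are illustrative, not taken from that test:

{code:python}
import pickle
import pyarrow as pa

arr = pa.array(range(1_000_000))

# Collect out-of-band buffers instead of copying them into the pickle stream.
buffers = []
data = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# Reconstruct the array; the out-of-band buffers can be handed back
# (e.g. from shared memory) without an extra copy.
arr2 = pickle.loads(data, buffers=buffers)
assert arr.equals(arr2)
{code}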

 

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such table contains large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11427:
---
Labels: pull-request-available  (was: )

> [C++] Arrow uses AVX512 instructions even when not supported by the OS
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277373#comment-17277373
 ] 

Leonard Lausen commented on ARROW-11463:


Specifically, do you mean that PyArrow serialization is deprecated or that 
SerializationContext is deprecated? Ie should users use pickle themselves, or 
will PyArrow just use pickle internally when serializing?

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such table contains large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11308) [Rust] [Parquet] Add Arrow decimal array writer

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11308:
---
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Add Arrow decimal array writer
> ---
>
> Key: ARROW-11308
> URL: https://issues.apache.org/jira/browse/ARROW-11308
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS

2021-02-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277352#comment-17277352
 ] 

Antoine Pitrou commented on ARROW-11427:


[~ali.cetin] Could you try installing this wheel and see if it fixes the issue:

[https://github.com/ursacomputing/crossbow/releases/download/build-26-github-wheel-windows-cp38/pyarrow-3.1.0.dev109-cp38-cp38-win_amd64.whl]

?

> [C++] Arrow uses AVX512 instructions even when not supported by the OS
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
> Fix For: 4.0.0
>
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277348#comment-17277348
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you [~apitrou] for the background. For Plasma, Tao is developing a fork 
at https://github.com/alibaba/libvineyard which currently also uses PyArrow 
serialization and is thus affected by this issue. For PyArrow serialization 
and Pickle 5, I see that you are the author of the PEP. Thank you for driving 
that. Is it correct that the out-of-band data support makes it possible to use 
it for zero-copy / shared memory applications? Is there any plan for PyArrow to 
use Pickle 5 by default when running on Py3.8+?

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such table contains large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11469) [Python] Performance degradation parquet reading of wide dataframes

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11469:
--
Summary: [Python] Performance degradation parquet reading of wide 
dataframes  (was: [Python] Performance degradation wide dataframes)

> [Python] Performance degradation parquet reading of wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>Reporter: Axel G
>Priority: Minor
> Attachments: profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11469) [Python] Performance degradation wide dataframes

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11469:
--
Summary: [Python] Performance degradation wide dataframes  (was: 
Performance degradation wide dataframes)

> [Python] Performance degradation wide dataframes
> 
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>Reporter: Axel G
>Priority: Minor
> Attachments: profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11421) [Rust][DataFusion] Support group by Date32

2021-02-02 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11421.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9355
[https://github.com/apache/arrow/pull/9355]

> [Rust][DataFusion] Support group by Date32
> --
>
> Key: ARROW-11421
> URL: https://issues.apache.org/jira/browse/ARROW-11421
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11427) [C++] Arrow uses AVX512 instructions even when not supported by the OS

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11427:
---
Summary: [C++] Arrow uses AVX512 instructions even when not supported by 
the OS  (was: [Python] Windows Server 2012 w/ Xeon Platinum 8171M crashes after 
upgrading to pyarrow 3.0)

> [C++] Arrow uses AVX512 instructions even when not supported by the OS
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
> Fix For: 4.0.0
>
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11427) [Python] Windows Server 2012 w/ Xeon Platinum 8171M crashes after upgrading to pyarrow 3.0

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11427:
---
Fix Version/s: 4.0.0

> [Python] Windows Server 2012 w/ Xeon Platinum 8171M crashes after upgrading 
> to pyarrow 3.0
> --
>
> Key: ARROW-11427
> URL: https://issues.apache.org/jira/browse/ARROW-11427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Windows Server 2012 Datacenter, Azure VM (D2_v2), Intel 
> Xeon Platinum 8171m
>Reporter: Ali Cetin
>Priority: Major
> Fix For: 4.0.0
>
>
> *Update*: Azure (D2_v2) VM no longer spins-up with Xeon Platinum 8171m, so 
> I'm unable to test it with other OS's.  Azure VM's are assigned different 
> type of CPU's of same "class" depending on availability. I will try my "luck" 
> later.
> VM's w/ Xeon Platinum 8171m running on Azure (D2_v2) start crashing after 
> upgrading from pyarrow 2.0 to pyarrow 3.0. However, this only happens when 
> reading parquet files larger than 4096 bits!?
> Windows closes Python with exit code 255 and produces this:
>  
> {code:java}
> Faulting application name: python.exe, version: 3.8.3150.1013, time stamp: 
> 0x5ebc7702 Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 
> 0x60060ce3 Exception code: 0xc01d Fault offset: 0x0047aadc 
> Faulting process id: 0x1b10 Faulting application start time: 
> 0x01d6f4a43dca3c14 Faulting application path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\SomeApp.Fabric.Executor.ProcessActorPkg.Code.1.0.218-prod\Python38\python.exe
>  Faulting module path: 
> D:\SvcFab\_App\SomeApp.FabricType_App32\temp\Executions\50cfffe8-9250-4ac7-8ba8-08d8c2bb3edf\.venv\lib\site-packages\pyarrow\arrow.dll{code}
>  
> Tested on:
> ||OS||Xeon Platinum 8171m or 8272CL||Other CPUs||
> |Windows Server 2012 Data Center|Fail|OK|
> |Windows Server 2016 Data Center| OK|OK|
> |Windows Server 2019 Data Center| | |
> |Windows 10| |OK|
>  
> Example code (Python): 
> {code:java}
> import numpy as np
> import pandas as pd
> data_len = 2**5
> data = pd.DataFrame(
> {"values": np.arange(0., float(data_len), dtype=float)},
> index=np.arange(0, data_len, dtype=int)
> )
> data.to_parquet("test.parquet")
> data = pd.read_parquet("test.parquet", engine="pyarrow")  # fails here!
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11473) Needs a handling for missing columns while reading parquet file

2021-02-02 Thread jason khadka (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jason khadka updated ARROW-11473:
-
Description: 
Currently there is no way to handle the error raised by missing columns in a 
parquet file.

If a column passed is missing, it just raises an ArrowInvalid error:
{code:java}
columns=[item1, item2, item3] #item3 is not there in parquet file

pd.read_parquet(file_name, columns = columns)

> ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
There is no way to handle this. The ArrowInvalid error also does not carry any 
information that gives out the field name, so that on the next try this field 
could be ignored.

Example :
{code:java}
import pandas as pd
from pyarrow.lib import ArrowInvalid

read_columns = ['a', 'b', 'X']
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})

file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

try:
    df = pd.read_parquet(file_name, columns=read_columns)
except ArrowInvalid as e:
    inval = e

print(inval.args)
>("Field named 'X' not found or not unique in the schema.",){code}
 

You could parse the message above to get 'X', but that is a bit of a hectic 
solution. It would be great if the error message contained the field name. So 
you could do, for example:

 
{code:java}
inval.field 
> 'X'{code}
Or a better feature would be to have error handling in read_table of pyarrow, 
where something like {{error='ignore'}} could be passed. This would then ignore 
the missing column by checking the schema.

Example, in the case above:
{code:java}
df = pd.read_parquet(file_name, columns=read_columns, error='ignore'){code}
This would ignore the missing column 'X'.
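
In the meantime, a rough sketch of how that behavior can be approximated on the 
user side by checking the file schema first ({{pq.read_schema}} is standard 
pyarrow; the intersection logic is only illustrative):

{code:python}
import pandas as pd
import pyarrow.parquet as pq

read_columns = ['a', 'b', 'X']
file_name = '/tmp/my_df.pq'

# Keep only the requested columns that actually exist in the file schema,
# approximating a hypothetical error='ignore' mode.
available = set(pq.read_schema(file_name).names)
safe_columns = [c for c in read_columns if c in available]

df = pd.read_parquet(file_name, columns=safe_columns)
{code}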

  was:
Currently there is no way to handle the error raised by missing columns in 
parquet file.

If a column passed is missing, it just raises ArrowInvalid error
{code:java}
columns=[item1, item2, item3] #item3 is not there in parquet file

pd.read_parquet(file_name, columns = columns)

> ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
There is no way to handle this. The ArrowInvalid also does not carry any 
information that can give out the field name so that in next try this filed can 
be ignored.

Example :

{code:java}

from pyarrow.lib import ArrowInvalid 

read_columns = ['a','b','X'] 
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) 

file_name = '/tmp/my_df.pq' df.to_parquet(file_name) 

try: 
df = pd.read_parquet(file_name, columns = read_columns) 
except ArrowInvalid as e: 
inval = e 

print(inval.args)
>("Field named 'X' not found or not unique in the schema.",){code}
 


You could parse the message above to get 'X', but that is a bit of hectic 
solution. It would be great if the error message contained the field name. So, 
you could do for example :

 

{code:java}
inval.field 
> 'X'{code}
Or a better feature would be to have a error handling in read_table of pyarrow, 
where something like {{error='ignore'}}could be passed. This would then ignore 
the missing column by checking the schema.


> Needs a handling for missing columns while reading parquet file 
> 
>
> Key: ARROW-11473
> URL: https://issues.apache.org/jira/browse/ARROW-11473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: jason khadka
>Priority: Major
>
> Currently there is no way to handle the error raised by missing columns in 
> parquet file.
> If a column passed is missing, it just raises ArrowInvalid error
> {code:java}
> columns=[item1, item2, item3] #item3 is not there in parquet file
> pd.read_parquet(file_name, columns = columns)
> > ArrowInvalid: Field named 'item3' not found or not unique in the 
> > schema.{code}
> There is no way to handle this. The ArrowInvalid also does not carry any 
> information that can give out the field name so that in next try this filed 
> can be ignored.
> Example :
> {code:java}
> from pyarrow.lib import ArrowInvalid 
> read_columns = ['a','b','X'] 
> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) 
> file_name = '/tmp/my_df.pq' df.to_parquet(file_name) 
> try: 
> df = pd.read_parquet(file_name, columns = read_columns) 
> except ArrowInvalid as e: 
> inval = e 
> print(inval.args)
> >("Field named 'X' not found or not unique in the schema.",){code}
>  
> You could parse the message above to get 'X', but that is a bit of hectic 
> solution. It would be great if the error message contained the field name. 
> So, you could do for example :
>  
> {code:java}
> inval.field 
> > 'X'{code}
> Or a better feature would be to have a error handling in read_table of 
> pyarrow, where something like \{{error='ignore'}}could be passed. This would 
> then ignore the missing column by checking the schema.
> Example, in case 

[jira] [Created] (ARROW-11473) Needs a handling for missing columns while reading parquet file

2021-02-02 Thread jason khadka (Jira)
jason khadka created ARROW-11473:


 Summary: Needs a handling for missing columns while reading 
parquet file 
 Key: ARROW-11473
 URL: https://issues.apache.org/jira/browse/ARROW-11473
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: jason khadka


Currently there is no way to handle the error raised by missing columns in 
parquet file.

If a column passed is missing, it just raises ArrowInvalid error
{code:java}
columns=[item1, item2, item3] #item3 is not there in parquet file

pd.read_parquet(file_name, columns = columns)

> ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
There is no way to handle this. The ArrowInvalid also does not carry any 
information that can give out the field name so that in next try this filed can 
be ignored.

Example :

{code:java}

from pyarrow.lib import ArrowInvalid 

read_columns = ['a','b','X'] 
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']}) 

file_name = '/tmp/my_df.pq' df.to_parquet(file_name) 

try: 
df = pd.read_parquet(file_name, columns = read_columns) 
except ArrowInvalid as e: 
inval = e 

print(inval.args)
>("Field named 'X' not found or not unique in the schema.",){code}
 


You could parse the message above to get 'X', but that is a bit of hectic 
solution. It would be great if the error message contained the field name. So, 
you could do for example :

 

{code:java}
inval.field 
> 'X'{code}
Or a better feature would be to have a error handling in read_table of pyarrow, 
where something like {{error='ignore'}}could be passed. This would then ignore 
the missing column by checking the schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11469) Performance degradation wide dataframes

2021-02-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277261#comment-17277261
 ] 

Joris Van den Bossche edited comment on ARROW-11469 at 2/2/21, 4:27 PM:


[~Axelg1] Thanks for the report

We have had similar issues in the past (eg ARROW-9924, ARROW-9827), but it 
seems that some things deteriorated again. 

So as a temporary workaround, you can specify {{use_legacy_dataset=True}} to 
use the old code path (another alternative is using the single-file 
{{pq.ParquetFile}} interface, this will never have overhead for dealing with 
potentially more complicated datasets).

cc [~bkietz] There seems to be a lot of overhead being spent in the projection 
({{RecordBatchProjector}}, and specifically {{SetInputSchema}}, 
{{CheckProjectable}}, {{FieldRef}} finding, see the attached profile 
[^profile_wide300.svg] ), while in this case there is actually no projection 
happening.   
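
A short sketch of the two workarounds mentioned above, using the file from the 
reproduction and only standard pyarrow.parquet calls:

{code:python}
import pyarrow.parquet as pq

# 1. Old code path, bypassing the datasets-based reader.
table = pq.read_table("temp.parquet", use_legacy_dataset=True)
df = table.to_pandas()

# 2. Single-file interface, which skips the dataset/projection machinery entirely.
table = pq.ParquetFile("temp.parquet").read()
df = table.to_pandas()
{code}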





was (Author: jorisvandenbossche):
[~Axelg1] Thanks for the report

We have had similar issues in the past (eg ARROW-9924, ARROW-9827), but it 
seems that some things deteriorated again. 

So as a temporary workaround, you can specify {{use_legacy_dataset=True}} to 
use the old code path (another alternative is using the single-file 
{{pq.ParquetFile}} interface, this will never have overhead for dealing with 
potentially more complicated datasets).

cc [~bkietz] There seems to be a lot of overhead being spent in the projection 
({{RecordBatchProjector}}, and specifically {{SetInputSchema}}, 
{{CheckProjectable}}, {{FieldRef}} finding, see the attached profile), while in 
this case there is actually no projection happening.   [^profile_wide300.svg] 




> Performance degradation wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>Reporter: Axel G
>Priority: Minor
> Attachments: profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11469) Performance degradation wide dataframes

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11469:
--
Attachment: profile_wide300.svg

> Performance degradation wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>Reporter: Axel G
>Priority: Minor
> Attachments: profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11469) Performance degradation wide dataframes

2021-02-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277261#comment-17277261
 ] 

Joris Van den Bossche commented on ARROW-11469:
---

[~Axelg1] Thanks for the report

We have had similar issues in the past (eg ARROW-9924, ARROW-9827), but it 
seems that some things deteriorated again. 

So as a temporary workaround, you can specify {{use_legacy_dataset=True}} to 
use the old code path (another alternative is using the single-file 
{{pq.ParquetFile}} interface, this will never have overhead for dealing with 
potentially more complicated datasets).

cc [~bkietz] There seems to be a lot of overhead being spent in the projection 
({{RecordBatchProjector}}, and specifically {{SetInputSchema}}, 
{{CheckProjectable}}, {{FieldRef}} finding, see the attached profile), while in 
this case there is actually no projection happening.   [^profile_wide300.svg] 




> Performance degradation wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>Reporter: Axel G
>Priority: Minor
> Attachments: profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-11463.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9394
[https://github.com/apache/arrow/pull/9394]
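
Assuming the pull request exposes the flag as an {{allow_64bit}} keyword on 
{{pyarrow.ipc.IpcWriteOptions}} in 4.0.0 (an assumption based on the issue 
title, not verified against the merged code), usage would look roughly like:

{code:python}
import pyarrow as pa

sink = pa.BufferOutputStream()
table = pa.table({"x": range(10)})

# allow_64bit is assumed to be the Python-level name of the C++ option.
options = pa.ipc.IpcWriteOptions(allow_64bit=True)
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)
{code}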

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such table contains large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11462) [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-11462.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9391
[https://github.com/apache/arrow/pull/9391]

> [Developer] Remove needless quote from the default DOCKER_VOLUME_PREFIX
> ---
>
> Key: ARROW-11462
> URL: https://issues.apache.org/jira/browse/ARROW-11462
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277239#comment-17277239
 ] 

Joris Van den Bossche commented on ARROW-11456:
---

bq.  If you still need code, I can write a function to generate it.

That would help, yes.
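
In the meantime, a scaled-down sketch of the kind of generator described in the 
report (unique ~22-character base64 strings); the real case used ~1.5 billion 
rows, so n_rows would need to be raised far enough for a single string column to 
exceed the 2**31 - 1 capacity reported by BinaryBuilder:

{code:python}
import base64
import pandas as pd

# Scaled down for illustration only.
n_rows = 1_000_000
values = [
    base64.b64encode(i.to_bytes(16, "little")).decode("ascii")[:22]
    for i in range(n_rows)
]
df = pd.DataFrame({"s": values})
df.to_parquet("big_strings.parquet", engine="pyarrow")

pd.read_parquet("big_strings.parquet", engine="pyarrow")
{code}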

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 22 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277234#comment-17277234
 ] 

Pac A. He edited comment on ARROW-11456 at 2/2/21, 4:12 PM:


For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such 
files. That's a workaround for now, if only for Python, until this issue is 
resolved.
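
For completeness, the workaround as a one-liner (requires the fastparquet 
package to be installed; the path is illustrative):

{code:python}
import pandas as pd

df = pd.read_parquet("big_strings.parquet", engine="fastparquet")
{code}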


was (Author: apacman):
For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such 
files. That's a workaround for now until this issue is resolved.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was effectively a unique base64 encoded length 22 string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string.

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> 

[jira] [Updated] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pac A. He updated ARROW-11456:
--
Description: 
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.

This problem started after I added a string column with 1.5 billion unique 
rows. Each value was effectively a unique base64 encoded length 22 string

  was:
When reading a large parquet file, I have this error:

 
{noformat}
df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 459, in read_parquet
return impl.read(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
line 221, in read
return self.api.parquet.read_table(
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
1638, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 
327, in read
return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in 
pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
2147483646 child elements, got 2147483648
{noformat}
Isn't pyarrow supposed to support large parquets? It let me write this parquet 
file, but now it doesn't let me read it back. I don't understand why arrow uses 
[31-bit 
computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
It's not even 32-bit as sizes are non-negative.


> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.
> This problem started after I added a string column with 1.5 billion unique 
> rows. Each value was 

[jira] [Commented] (ARROW-11456) [Python] Parquet reader cannot read large strings

2021-02-02 Thread Pac A. He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277234#comment-17277234
 ] 

Pac A. He commented on ARROW-11456:
---

For what it's worth, {{fastparquet}} v0.5.0 had no trouble at all reading such 
files. That's a workaround for now until this issue is resolved.

> [Python] Parquet reader cannot read large strings
> -
>
> Key: ARROW-11456
> URL: https://issues.apache.org/jira/browse/ARROW-11456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0, 3.0.0
> Environment: pyarrow 3.0.0 / 2.0.0
> pandas 1.2.1
> python 3.8.6
>Reporter: Pac A. He
>Priority: Major
>
> When reading a large parquet file, I have this error:
>  
> {noformat}
> df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 459, in read_parquet
> return impl.read(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", 
> line 221, in read
> return self.api.parquet.read_table(
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 1638, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", 
> line 327, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1126, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Capacity error: BinaryBuilder cannot reserve space for more than 
> 2147483646 child elements, got 2147483648
> {noformat}
> Isn't pyarrow supposed to support large parquets? It let me write this 
> parquet file, but now it doesn't let me read it back. I don't understand why 
> arrow uses [31-bit 
> computing.|https://arrow.apache.org/docs/format/Columnar.html#array-lengths] 
> It's not even 32-bit as sizes are non-negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277230#comment-17277230
 ] 

Antoine Pitrou commented on ARROW-11463:


[~lausen] I'm not sure your question has a possible answer, but please note 
that both PyArrow serialization and Plasma are deprecated and unmaintained. For 
the former, the recommended replacement is pickle with protocol 5. For the 
latter, you may want to contact the developers of the Ray project (they used to 
maintain Plasma and decided to fork it).
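A rough sketch of that recommendation (the example table here is made up; pickle 
protocol 5 requires Python 3.8+):

{code:python}
# Replacement sketch: pickle with protocol 5 instead of pyarrow serialization.
import pickle

import pyarrow as pa

table = pa.table({"x": range(10)})         # hypothetical example table
payload = pickle.dumps(table, protocol=5)  # serialize
restored = pickle.loads(payload)           # deserialize back to a Table
assert restored.equals(table)
{code}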

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11469) Performance degradation wide dataframes

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11469:
--
Description: 
I noticed a relatively big performance degradation in version 1.0.0+ when 
trying to load wide dataframes.

For example you should be able to reproduce by doing:
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(np.random.rand(100, 1))
table = pa.Table.from_pandas(df)
pq.write_table(table, "temp.parquet")

%timeit pd.read_parquet("temp.parquet"){code}
In version 0.17.0, this takes about 300-400 ms and for anything above and 
including 1.0.0, this suddenly takes around 2 seconds.

 

Thanks for looking into this.

  was:
I noticed a relatively big performance degradation in version 1.0.0+ when 
trying to load wide dataframes.

For example you should be able to reproduce by doing:
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(np.random.rand(100, 1))
table = pa.Table.from_pandas(df)
pd.write_table(table, "temp.parquet")

%timeit pd.read_parquet("temp.parquet"){code}
In version 0.17.0, this takes about 300-400 ms and for anything above and 
including 1.0.0, this suddenly takes around 2 seconds.

 

Thanks for looking into this.


> Performance degradation wide dataframes
> ---
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>Reporter: Axel G
>Priority: Minor
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 1))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277170#comment-17277170
 ] 

Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:52 PM:
-

Thank you Tao! How can we specify the IPC stream writer instance for the 
{{_serialize_pyarrow_table}} which is configured to be the 
{{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only 
supports specifying {{SerializationContext}} and I'm unsure how to configure 
the writer instance via {{SerializationContext}}


was (Author: lausen):
 Thank you Tao! How can we specify the IPC stream writer instance for the 
{{_serialize_pyarrow_table}} which is configured to be the 
{{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only 
supports specifying {{SerializationContext}} and I'm unsure how to configure 
the writer instance via {{SerializationContext}}

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277170#comment-17277170
 ] 

Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:51 PM:
-

 Thank you Tao! How can we specify the IPC stream writer instance for the 
{{_serialize_pyarrow_table}} which is configured to be the 
{{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only 
supports specifying {{SerializationContext}} and I'm unsure how to configure 
the writer instance via {{SerializationContext}}


was (Author: lausen):
 Thank you Tao! How can we specify the IPC stream writer instance for the 
`_serialize_pyarrow_table` which is configured to be the 
default_serialization_handler and used by `plasma_client.put`? It only supports 
specifying {{SerializationContext}} and I'm unsure how to configure the writer 
instance via {{SerializationContext}}

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277170#comment-17277170
 ] 

Leonard Lausen commented on ARROW-11463:


 Thank you Tao! How can we specify the IPC stream writer instance for the 
`_serialize_pyarrow_table` which is configured to be the 
default_serialization_handler and used by `plasma_client.put`? It only supports 
specifying {{SerializationContext}} and I'm unsure how to configure the writer 
instance via {{SerializationContext}}

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table 
> with combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, it's easy for the flattened table to contain more than 2**31 
> rows and serialization of the table with combined chunks (eg for Plasma 
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays 
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64 bit setting; further the Python serialization APIs do not allow 
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>/// \brief If true, allow field lengths that don't fit in a signed 32-bit 
> int.
>///
>/// Some implementations may not be able to parse streams created with 
> this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>/// \brief The maximum permitted schema nesting depth.
>int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-02-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277142#comment-17277142
 ] 

Joris Van den Bossche commented on ARROW-11400:
---

Marking it as 3.0.0, as it was already fixed in that release; my PR was only 
adding a test.

> [Python] Pickled ParquetFileFragment has invalid partition_expresion with 
> dictionary type in pyarrow 2.0
> 
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet", 
> partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
> "test_partitioned_parquet/", format="parquet", 
> partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization 
> with pickle, it gives errors (and with the original data example I actually 
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: 
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]: 
> pyarrow.Table
> col: int64
> part: dictionary
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with 
> dictionary type. 
> Also when using an integer columns as the partition column, you get wrong 
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]: 
>1,
>   2
> ][0]:dictionary)>
> In [43]: frag2.partition_expression
> Out[43]: 
>170145232,
>   32754
> ][0]:dictionary)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember it was 
> fixed intentionally ([~bkietz]?), it would be good to add some tests for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-11400.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9350
[https://github.com/apache/arrow/pull/9350]

> [Python] Pickled ParquetFileFragment has invalid partition_expresion with 
> dictionary type in pyarrow 2.0
> 
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet", 
> partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
> "test_partitioned_parquet/", format="parquet", 
> partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization 
> with pickle, it gives errors (and with the original data example I actually 
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: 
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]: 
> pyarrow.Table
> col: int64
> part: dictionary
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with 
> dictionary type. 
> Also when using an integer columns as the partition column, you get wrong 
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]: 
>1,
>   2
> ][0]:dictionary)>
> In [43]: frag2.partition_expression
> Out[43]: 
>170145232,
>   32754
> ][0]:dictionary)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember it was 
> fixed intentionally ([~bkietz]?), it would be good to add some tests for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11472:
---
Labels: pull-request-available  (was: )

> [Python][CI] Kartothek integrations build is failing with numpy 1.20
> 
>
> Key: ARROW-11472
> URL: https://issues.apache.org/jira/browse/ARROW-11472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure 
> looks like:
> {code}
>   ERROR collecting tests/io/dask/dataframe/test_read.py 
> _
> tests/io/dask/dataframe/test_read.py:185: in 
> @pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
> kartothek/core/testing.py:65: in get_dataframe_not_nested
> "unicode": pd.Series(["Ö"], dtype=np.unicode),
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: 
> in __init__
> data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
>  in sanitize_array
> subarr = _try_cast(data, dtype, copy, raise_cast_failure)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
>  in _try_cast
> maybe_cast_to_integer_array(arr, dtype)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
>  in maybe_cast_to_integer_array
> casted = np.array(arr, dtype=dtype, copy=copy)
> E   ValueError: invalid literal for int() with base 10: 'Ö'
> {code}
> So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
> numpy 1.20.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-11472:
-

Assignee: Joris Van den Bossche

> [Python][CI] Kartothek integrations build is failing with numpy 1.20
> 
>
> Key: ARROW-11472
> URL: https://issues.apache.org/jira/browse/ARROW-11472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>
> See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure 
> looks like:
> {code}
>   ERROR collecting tests/io/dask/dataframe/test_read.py 
> _
> tests/io/dask/dataframe/test_read.py:185: in 
> @pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
> kartothek/core/testing.py:65: in get_dataframe_not_nested
> "unicode": pd.Series(["Ö"], dtype=np.unicode),
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: 
> in __init__
> data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
>  in sanitize_array
> subarr = _try_cast(data, dtype, copy, raise_cast_failure)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
>  in _try_cast
> maybe_cast_to_integer_array(arr, dtype)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
>  in maybe_cast_to_integer_array
> casted = np.array(arr, dtype=dtype, copy=copy)
> E   ValueError: invalid literal for int() with base 10: 'Ö'
> {code}
> So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
> numpy 1.20.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20

2021-02-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277113#comment-17277113
 ] 

Joris Van den Bossche commented on ARROW-11472:
---

Looking into this, and this seems to be caused by numpy's aliasing of 
{{np.unicode}} to {{int}}. This is already reported and fixed 
(https://github.com/numpy/numpy/issues/18287), so I assume it will soon be in a 
1.20.1 release. 
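For reference, a minimal sketch of a workaround on the affected numpy version 
(my illustration, not kartothek code): passing the builtin {{str}} instead of the 
broken alias works as before.

{code:python}
# Workaround sketch: use the builtin str (or an explicit "U" dtype) rather than
# the np.unicode alias, which is broken in numpy 1.20.0.
import pandas as pd

s = pd.Series(["Ö"], dtype=str)
print(s)
{code}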


> [Python][CI] Kartothek integrations build is failing with numpy 1.20
> 
>
> Key: ARROW-11472
> URL: https://issues.apache.org/jira/browse/ARROW-11472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure 
> looks like:
> {code}
>   ERROR collecting tests/io/dask/dataframe/test_read.py 
> _
> tests/io/dask/dataframe/test_read.py:185: in 
> @pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
> kartothek/core/testing.py:65: in get_dataframe_not_nested
> "unicode": pd.Series(["Ö"], dtype=np.unicode),
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: 
> in __init__
> data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
>  in sanitize_array
> subarr = _try_cast(data, dtype, copy, raise_cast_failure)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
>  in _try_cast
> maybe_cast_to_integer_array(arr, dtype)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
>  in maybe_cast_to_integer_array
> casted = np.array(arr, dtype=dtype, copy=copy)
> E   ValueError: invalid literal for int() with base 10: 'Ö'
> {code}
> So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
> numpy 1.20.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11472:
--
Description: 
See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure looks 
like:

{code}
  ERROR collecting tests/io/dask/dataframe/test_read.py 
_
tests/io/dask/dataframe/test_read.py:185: in 
@pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
kartothek/core/testing.py:65: in get_dataframe_not_nested
"unicode": pd.Series(["Ö"], dtype=np.unicode),
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: in 
__init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
 in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
 in _try_cast
maybe_cast_to_integer_array(arr, dtype)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
 in maybe_cast_to_integer_array
casted = np.array(arr, dtype=dtype, copy=copy)
E   ValueError: invalid literal for int() with base 10: 'Ö'
{code}

So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
numpy 1.20.0

  was:
See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure looks 
like:

{code}
  ERROR collecting tests/io/dask/dataframe/test_read.py 
_
tests/io/dask/dataframe/test_read.py:185: in 
@pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
kartothek/core/testing.py:65: in get_dataframe_not_nested
"unicode": pd.Series(["Ö"], dtype=np.unicode),
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: in 
__init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
 in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
 in _try_cast
maybe_cast_to_integer_array(arr, dtype)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
 in maybe_cast_to_integer_array
casted = np.array(arr, dtype=dtype, copy=copy)
E   ValueError: invalid literal for int() with base 10: 'Ö'
{code}

So it seems that {{ pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
numpy 1.20.0


> [Python][CI] Kartothek integrations build is failing with numpy 1.20
> 
>
> Key: ARROW-11472
> URL: https://issues.apache.org/jira/browse/ARROW-11472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure 
> looks like:
> {code}
>   ERROR collecting tests/io/dask/dataframe/test_read.py 
> _
> tests/io/dask/dataframe/test_read.py:185: in 
> @pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
> kartothek/core/testing.py:65: in get_dataframe_not_nested
> "unicode": pd.Series(["Ö"], dtype=np.unicode),
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: 
> in __init__
> data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
>  in sanitize_array
> subarr = _try_cast(data, dtype, copy, raise_cast_failure)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
>  in _try_cast
> maybe_cast_to_integer_array(arr, dtype)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
>  in maybe_cast_to_integer_array
> casted = np.array(arr, dtype=dtype, copy=copy)
> E   ValueError: invalid literal for int() with base 10: 'Ö'
> {code}
> So it seems that {{pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
> numpy 1.20.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11472) [Python][CI] Kartothek integrations build is failing with numpy 1.20

2021-02-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-11472:
-

 Summary: [Python][CI] Kartothek integrations build is failing with 
numpy 1.20
 Key: ARROW-11472
 URL: https://issues.apache.org/jira/browse/ARROW-11472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


See eg https://github.com/ursacomputing/crossbow/runs/1804464537, failure looks 
like:

{code}
  ERROR collecting tests/io/dask/dataframe/test_read.py 
_
tests/io/dask/dataframe/test_read.py:185: in 
@pytest.mark.parametrize("col", get_dataframe_not_nested().columns)
kartothek/core/testing.py:65: in get_dataframe_not_nested
"unicode": pd.Series(["Ö"], dtype=np.unicode),
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/series.py:335: in 
__init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:480:
 in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/construction.py:587:
 in _try_cast
maybe_cast_to_integer_array(arr, dtype)
/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1723:
 in maybe_cast_to_integer_array
casted = np.array(arr, dtype=dtype, copy=copy)
E   ValueError: invalid literal for int() with base 10: 'Ö'
{code}

So it seems that {{ pd.Series(["Ö"], dtype=np.unicode)}} stopped working with 
numpy 1.20.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7288:
-

Assignee: Kouhei Sutou  (was: Neal Richardson)

> [C++][R] read_parquet() freezes on Windows with Japanese locale
> ---
>
> Key: ARROW-7288
> URL: https://issues.apache.org/jira/browse/ARROW-7288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.15.1
> Environment: R 3.6.1 on Windows 10
>Reporter: Hiroaki Yutani
>Assignee: Kouhei Sutou
>Priority: Critical
>  Labels: parquet, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> The following example on read_parquet()'s doc freezes (seems to wait for the 
> result forever) on my Windows.
> df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
> The CRAN checks are all fine, which means the example is successfully 
> executed on the CRAN Windows. So, I have no idea why it doesn't work on my 
> local.
> [https://cran.r-project.org/web/checks/check_results_arrow.html]
> Here's my session info in case it helps:
> {code:java}
> > sessioninfo::session_info()
> - Session info 
> -
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os   Windows 10 x64
>  system   x86_64, mingw32
>  ui   RStudio
>  language en
>  collate  Japanese_Japan.932
>  ctypeJapanese_Japan.932
>  tz   Asia/Tokyo
>  date 2019-12-01
> - Packages 
> -
>  package * version  date   lib source
>  arrow   * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1)
>  assertthat0.2.12019-03-21 [1] CRAN (R 3.6.0)
>  bit   1.1-14   2018-05-29 [1] CRAN (R 3.6.0)
>  bit64 0.9-72017-05-08 [1] CRAN (R 3.6.0)
>  cli   1.1.02019-03-19 [1] CRAN (R 3.6.0)
>  crayon1.3.42017-09-16 [1] CRAN (R 3.6.0)
>  fs1.3.12019-05-06 [1] CRAN (R 3.6.0)
>  glue  1.3.12019-03-12 [1] CRAN (R 3.6.0)
>  magrittr  1.5  2014-11-22 [1] CRAN (R 3.6.0)
>  purrr 0.3.32019-10-18 [1] CRAN (R 3.6.1)
>  R62.4.12019-11-12 [1] CRAN (R 3.6.1)
>  Rcpp  1.0.32019-11-08 [1] CRAN (R 3.6.1)
>  reprex0.3.02019-05-16 [1] CRAN (R 3.6.0)
>  rlang 0.4.22019-11-23 [1] CRAN (R 3.6.1)
>  rstudioapi0.10 2019-03-19 [1] CRAN (R 3.6.0)
>  sessioninfo   1.1.12018-11-05 [1] CRAN (R 3.6.0)
>  tidyselect0.2.52018-10-11 [1] CRAN (R 3.6.0)
>  withr 2.1.22018-03-15 [1] CRAN (R 3.6.0)
> [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6
> [2] C:/Program Files/R/R-3.6.1/library
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale

2021-02-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7288.
---
Resolution: Fixed

Issue resolved by pull request 9367
[https://github.com/apache/arrow/pull/9367]

> [C++][R] read_parquet() freezes on Windows with Japanese locale
> ---
>
> Key: ARROW-7288
> URL: https://issues.apache.org/jira/browse/ARROW-7288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.15.1
> Environment: R 3.6.1 on Windows 10
>Reporter: Hiroaki Yutani
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: parquet, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> The following example on read_parquet()'s doc freezes (seems to wait for the 
> result forever) on my Windows.
> df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
> The CRAN checks are all fine, which means the example is successfully 
> executed on the CRAN Windows. So, I have no idea why it doesn't work on my 
> local.
> [https://cran.r-project.org/web/checks/check_results_arrow.html]
> Here's my session info in case it helps:
> {code:java}
> > sessioninfo::session_info()
> - Session info 
> -
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os   Windows 10 x64
>  system   x86_64, mingw32
>  ui   RStudio
>  language en
>  collate  Japanese_Japan.932
>  ctypeJapanese_Japan.932
>  tz   Asia/Tokyo
>  date 2019-12-01
> - Packages 
> -
>  package * version  date   lib source
>  arrow   * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1)
>  assertthat0.2.12019-03-21 [1] CRAN (R 3.6.0)
>  bit   1.1-14   2018-05-29 [1] CRAN (R 3.6.0)
>  bit64 0.9-72017-05-08 [1] CRAN (R 3.6.0)
>  cli   1.1.02019-03-19 [1] CRAN (R 3.6.0)
>  crayon1.3.42017-09-16 [1] CRAN (R 3.6.0)
>  fs1.3.12019-05-06 [1] CRAN (R 3.6.0)
>  glue  1.3.12019-03-12 [1] CRAN (R 3.6.0)
>  magrittr  1.5  2014-11-22 [1] CRAN (R 3.6.0)
>  purrr 0.3.32019-10-18 [1] CRAN (R 3.6.1)
>  R62.4.12019-11-12 [1] CRAN (R 3.6.1)
>  Rcpp  1.0.32019-11-08 [1] CRAN (R 3.6.1)
>  reprex0.3.02019-05-16 [1] CRAN (R 3.6.0)
>  rlang 0.4.22019-11-23 [1] CRAN (R 3.6.1)
>  rstudioapi0.10 2019-03-19 [1] CRAN (R 3.6.0)
>  sessioninfo   1.1.12018-11-05 [1] CRAN (R 3.6.0)
>  tidyselect0.2.52018-10-11 [1] CRAN (R 3.6.0)
>  withr 2.1.22018-03-15 [1] CRAN (R 3.6.0)
> [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6
> [2] C:/Program Files/R/R-3.6.1/library
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11410) [Rust][Parquet] Implement returning dictionary arrays from parquet reader

2021-02-02 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277063#comment-17277063
 ] 

Andrew Lamb commented on ARROW-11410:
-

[~yordan-pavlov] I think this would be amazing -- and we would definitely use 
it in IOx. This is the kind of thing that is on our longer-term roadmap, and I 
would love to help (e.g. code review, testing, or documentation).

Let me know! 

> [Rust][Parquet] Implement returning dictionary arrays from parquet reader
> -
>
> Key: ARROW-11410
> URL: https://issues.apache.org/jira/browse/ARROW-11410
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> Currently the Rust parquet reader returns a regular array (e.g. string array) 
> even when the column is dictionary encoded in the parquet file.
> If the parquet reader had the ability to return dictionary arrays for 
> dictionary encoded columns this would bring many benefits such as:
>  * faster reading of dictionary encoded columns from parquet (as no 
> conversion/expansion into a regular array would be necessary)
>  * more efficient memory use as the dictionary array would use less memory 
> when loaded in memory
>  * faster filtering operations as SIMD can be used to filter over the numeric 
> keys of a dictionary string array instead of comparing string values in a 
> string array
> [~nevime] , [~alamb]  let me know what you think
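
A tiny illustration of the filtering benefit mentioned above (plain Python used 
only as an analogy; the Rust reader change itself is what this issue asks for):

{code:python}
# Analogy only: on a dictionary-encoded column, a filter can resolve the target
# string to a key once and then compare integers, instead of comparing strings
# for every row.
dictionary = ["foo", "bar"]              # dictionary values
keys = [0, 1, 0, 0, 1]                   # dictionary-encoded column (indices)

target_key = dictionary.index("foo")     # one string lookup
mask = [k == target_key for k in keys]   # integer comparisons only

assert mask == [dictionary[k] == "foo" for k in keys]
{code}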



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11464) [Python] pyarrow.parquet.read_pandas doesn't conform to its docs

2021-02-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11464:
--
Fix Version/s: 4.0.0

> [Python] pyarrow.parquet.read_pandas doesn't conform to its docs
> 
>
> Key: ARROW-11464
> URL: https://issues.apache.org/jira/browse/ARROW-11464
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: latest
>Reporter: Pac A. He
>Priority: Major
> Fix For: 4.0.0
>
>
> The {{*pyarrow.parquet.read_pandas*}} 
> [implementation|https://github.com/apache/arrow/blob/816c23af4478fe28f31d474e90b433baeadb7a78/python/pyarrow/parquet.py#L1740-L1754]
>  doesn't conform to its 
> [docs|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_pandas.html#pyarrow.parquet.read_pandas]
>  in at least these ways:
>  # The docs state that a *{{filesystem}}* option can be provided, as it 
> should be. Without this option I cannot read from S3, etc. The 
> implementation, however, doesn't have this option! As such I currently cannot 
> use it to read from S3!
>  # The docs (_not in the type definition!_) state that the default value for 
> *{{use_legacy_dataset}}* is False, whereas the implementation has a value of 
> True. Actually, however, if a value of True is used, then I can't read a 
> partitioned dataset at all, so in this case maybe it's the doc that needs to 
> change to True.
> It looks to have been implemented and reviewed pretty carelessly.
>  
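
One possible interim sketch (not from the report; the bucket path and region are 
made up) is to go through {{read_table}}, which does accept a {{filesystem}} 
argument, and convert afterwards:

{code:python}
# Workaround sketch: pq.read_table exposes filesystem=, read_pandas does not.
import pyarrow.fs as pafs
import pyarrow.parquet as pq

fs = pafs.S3FileSystem(region="us-east-1")        # hypothetical region
table = pq.read_table(
    "my-bucket/path/to/dataset",                  # hypothetical path
    filesystem=fs,
    use_legacy_dataset=False,
)
df = table.to_pandas()
{code}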



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11471) [Rust] DoubleEndedIterator for BitChunks

2021-02-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann reassigned ARROW-11471:
--

Assignee: Jörn Horstmann

> [Rust] DoubleEndedIterator for BitChunks
> 
>
> Key: ARROW-11471
> URL: https://issues.apache.org/jira/browse/ARROW-11471
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>
> The use case is to efficiently find the last non-null value in an array slice 
> by iterating over the bits in chunks and using u64::leading_zeros



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11471) [Rust] DoubleEndedIterator for BitChunks

2021-02-02 Thread Jira
Jörn Horstmann created ARROW-11471:
--

 Summary: [Rust] DoubleEndedIterator for BitChunks
 Key: ARROW-11471
 URL: https://issues.apache.org/jira/browse/ARROW-11471
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 3.0.0
Reporter: Jörn Horstmann


The use case is to efficiently find the last non-null value in an array slice by 
iterating over the bits in chunks and using u64::leading_zeros
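
A small sketch of the idea (Python used here purely for illustration; 
{{last_valid_index}} is a hypothetical helper, not the Rust BitChunks API):

{code:python}
# Scan 64-bit validity-bitmap chunks from the back; within the last non-zero
# chunk, the highest set bit gives the index of the last valid value.
# bit_length() - 1 plays the role of 63 - leading_zeros on a u64.
def last_valid_index(chunks):
    for i in range(len(chunks) - 1, -1, -1):
        chunk = chunks[i]
        if chunk:
            return i * 64 + chunk.bit_length() - 1
    return None  # all values are null

bitmap = [0b1011, 0b0]                # bits 0, 1 and 3 are valid
assert last_valid_index(bitmap) == 3
{code}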



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11470) [C++] Overflow occurs on integer multiplications in Compute(Row|Column)MajorStrides

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11470:
---
Labels: pull-request-available  (was: )

> [C++] Overflow occurs on integer multiplications in 
> Compute(Row|Column)MajorStrides
> ---
>
> Key: ARROW-11470
> URL: https://issues.apache.org/jira/browse/ARROW-11470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> OSS-Fuzz reports that the integer multiplication in the ComputeRowMajorStrides 
> function overflows.
> https://oss-fuzz.com/testcase-detail/623225726408
> The same issue exists in ComputeColumnMajorStrides.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11470) [C++] Overflow occurs on integer multiplications in Compute(Row|Column)MajorStrides

2021-02-02 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-11470:


 Summary: [C++] Overflow occurs on integer multiplications in 
Compute(Row|Column)MajorStrides
 Key: ARROW-11470
 URL: https://issues.apache.org/jira/browse/ARROW-11470
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


OSS-Fuzz reports that the integer multiplication in the ComputeRowMajorStrides 
function overflows.

https://oss-fuzz.com/testcase-detail/623225726408

The same issue exists in ComputeColumnMajorStrides.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)