[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277515#comment-17277515
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you for sharing the tests / example code, [~apitrou]. Pickle protocol 5 
is really useful. For example, the following code replicates my use case for 
the Plasma store by providing a folder in {{/dev/shm}} as {{path}}.
{code:python}
import pickle
import mmap


def shm_pickle(path, tbl):
    idx = 0

    def buffer_callback(buf):
        nonlocal idx
        with open(path / f'{idx}.bin', 'wb') as f:
            f.write(buf)
        idx += 1

    with open(path / 'meta.pkl', 'wb') as f:
        pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback)


def shm_unpickle(path):
    num_buffers = len(list(path.iterdir())) - 1  # exclude meta.pkl
    buffers = []
    for idx in range(num_buffers):
        f = open(path / f'{idx}.bin', 'rb')
        buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
    with open(path / 'meta.pkl', 'rb') as f:
        return pickle.load(f, buffers=buffers)
{code}
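As a quick check, the helpers round-trip a table (a minimal sketch, assuming 
a tmpfs directory such as {{/dev/shm/arrow-demo}} is available and that 
PyArrow's protocol-5 pickling emits the table buffers out-of-band, as in your 
example):
{code:python}
import pathlib
import pyarrow as pa

path = pathlib.Path('/dev/shm/arrow-demo')  # hypothetical scratch directory
path.mkdir(exist_ok=True)

tbl = pa.table({'x': list(range(1000))})
shm_pickle(path, tbl)          # writes meta.pkl plus one .bin file per buffer
restored = shm_unpickle(path)  # memory-maps the .bin files back read-only
assert restored.equals(tbl)
{code}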

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277373#comment-17277373
 ] 

Leonard Lausen commented on ARROW-11463:


Specifically, do you mean that PyArrow serialization is deprecated, or that 
SerializationContext is deprecated? I.e., should users use pickle themselves, 
or will PyArrow just use pickle internally when serializing?

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277348#comment-17277348
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you [~apitrou] for the background. For Plasma, Tao is developing a fork 
at https://github.com/alibaba/libvineyard which currently also uses PyArrow 
serialization and is thus affected by this issue. For PyArrow serialization 
and pickle protocol 5, I see that you are the author of the PEP; thank you 
for driving that. Is it correct that the out-of-band data support makes it 
possible to use pickle for zero-copy / shared-memory applications? Is there 
any plan for PyArrow to use pickle protocol 5 by default when running on 
Python 3.8+?
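To illustrate the out-of-band mechanism with the stdlib alone (a minimal 
sketch; {{PickleBuffer}} stands in for whatever buffer type an application 
shares):
{code:python}
import pickle

# A writable 1 MiB buffer; wrapping it in PickleBuffer lets protocol 5
# transmit it out-of-band instead of copying it into the pickle stream.
data = bytearray(b'x' * (1 << 20))
buffers = []
payload = pickle.dumps(pickle.PickleBuffer(data), protocol=5,
                       buffer_callback=buffers.append)

# The receiver supplies the buffers back; `payload` itself holds no copy of
# `data`, which is what enables zero-copy / shared-memory transports.
restored = pickle.loads(payload, buffers=buffers)
assert bytes(restored) == bytes(data)
{code}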

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170
 ] 

Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:52 PM:
-

Thank you Tao! How can we specify the IPC stream writer instance for 
{{_serialize_pyarrow_table}}, which is configured as the 
{{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only 
supports specifying a {{SerializationContext}}, and I'm unsure how to 
configure the writer instance via {{SerializationContext}}.


was (Author: lausen):
Thank you Tao! How can we specify the IPC stream writer instance for 
{{_serialize_pyarrow_table}}, which is configured as the 
{{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only 
supports specifying a {{SerializationContext}}, and I'm unsure how to 
configure the writer instance via {{SerializationContext}}.

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170
 ] 

Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:51 PM:
-

Thank you Tao! How can we specify the IPC stream writer instance for 
{{_serialize_pyarrow_table}}, which is configured as the 
{{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only 
supports specifying a {{SerializationContext}}, and I'm unsure how to 
configure the writer instance via {{SerializationContext}}.


was (Author: lausen):
 Thank you Tao! How can we specify the IPC stream writer instance for the 
`_serialize_pyarrow_table` which is configured to be the 
default_serialization_handler and used by `plasma_client.put`? It only supports 
specifying {{SerializationContext}} and I'm unsure how to configure the writer 
instance via {{SerializationContext}}

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you Tao! How can we specify the IPC stream writer instance for 
`_serialize_pyarrow_table`, which is configured as the 
`default_serialization_handler` and used by `plasma_client.put`? It only 
supports specifying a {{SerializationContext}}, and I'm unsure how to 
configure the writer instance via {{SerializationContext}}.
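For what it's worth, one shape such a workaround could take (only a sketch, 
not confirmed API usage: it assumes {{SerializationContext.register_type}} 
can override the table handler, and that {{allow_64bit}} becomes a keyword on 
{{pyarrow.ipc.IpcWriteOptions}} as proposed here):
{code:python}
import pyarrow as pa

context = pa.SerializationContext()

def _serialize_table(tbl):
    sink = pa.BufferOutputStream()
    opts = pa.ipc.IpcWriteOptions(allow_64bit=True)  # assumed keyword
    with pa.ipc.new_stream(sink, tbl.schema, options=opts) as writer:
        writer.write_table(tbl)
    return sink.getvalue()

def _deserialize_table(buf):
    return pa.ipc.open_stream(buf).read_all()

context.register_type(pa.Table, 'pyarrow.Table',
                      custom_serializer=_serialize_table,
                      custom_deserializer=_deserialize_table)
{code}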

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-01 Thread Leonard Lausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Lausen updated ARROW-11463:
---
Description: 
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
be around 1000x slower than `pyarrow.Table.take` on the same table with 
combined chunks (1 chunk). Unfortunately, if such a table contains a large 
list data type, the flattened table easily exceeds 2**31 rows, and 
serialization of the table with combined chunks (e.g. for the Plasma store) 
will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for serialization as called 
from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
64-bit setting; further, the Python serialization APIs do not allow 
specifying IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:
{code:c++}
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}
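Once exposed, the Python side could look like the following sketch (assuming 
an {{allow_64bit}} keyword on {{pyarrow.ipc.IpcWriteOptions}} that mirrors 
the C++ field):
{code:python}
import pyarrow as pa

tbl = pa.table({'x': list(range(8))})  # stand-in for the large table

opts = pa.ipc.IpcWriteOptions(allow_64bit=True)  # assumed keyword
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, tbl.schema, options=opts) as writer:
    writer.write_table(tbl)

# Round-trip to verify the stream reads back.
restored = pa.ipc.open_stream(sink.getvalue()).read_all()
assert restored.equals(tbl)
{code}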

  was:
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
be around 1000x slower than `pyarrow.Table.take` on the same table with 
combined chunks (1 chunk). Unfortunately, if such a table contains a large 
list data type, the flattened table easily exceeds 2**31 rows, and 
serialization (e.g. for the Plasma store) will fail with 
`pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
length`.

I couldn't find a way to enable 64-bit support for serialization as called 
from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
64-bit setting; further, the Python serialization APIs do not allow 
specifying IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:
{code:c++}
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}




> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Priority: Major
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization of the table with combined chunks (e.g. for the Plasma store) 
> will fail with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger 
> than 2^31 - 1 in length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-01 Thread Leonard Lausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Lausen updated ARROW-11463:
---
Description: 
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
be around 1000x slower than `pyarrow.Table.take` on the same table with 
combined chunks (1 chunk). Unfortunately, if such a table contains a large 
list data type, the flattened table easily exceeds 2**31 rows, and 
serialization (e.g. for the Plasma store) will fail with 
`pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
length`.

I couldn't find a way to enable 64-bit support for serialization as called 
from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
64-bit setting; further, the Python serialization APIs do not allow 
specifying IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:
{code:c++}
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}



  was:
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
be around 1000x slower than `pyarrow.Table.take` on the same table with 
combined chunks (1 chunk). Unfortunately, if such a table contains a large 
list data type, the flattened table easily exceeds 2**31 rows, and 
serialization (e.g. for the Plasma store) will fail with 
`pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
length`.

I couldn't find a way to enable 64-bit support for serialization as called 
from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
64-bit setting; further, the Python serialization APIs do not allow 
specifying IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

```
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```


> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Priority: Major
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` 
> will be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large 
> list data type, the flattened table easily exceeds 2**31 rows, and 
> serialization (e.g. for the Plasma store) will fail with 
> `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
> length`.
> I couldn't find a way to enable 64-bit support for serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specifying IpcWriteOptions).
> I was able to serialize successfully after changing the default and 
> rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-01 Thread Leonard Lausen (Jira)
Leonard Lausen created ARROW-11463:
--

 Summary: Allow configuration of IpcWriterOptions 64Bit from PyArrow
 Key: ARROW-11463
 URL: https://issues.apache.org/jira/browse/ARROW-11463
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Leonard Lausen


For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
be around 1000x slower than `pyarrow.Table.take` on the same table with 
combined chunks (1 chunk). Unfortunately, if such a table contains a large 
list data type, the flattened table easily exceeds 2**31 rows, and 
serialization (e.g. for the Plasma store) will fail with 
`pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
length`.

I couldn't find a way to enable 64-bit support for serialization as called 
from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
64-bit setting; further, the Python serialization APIs do not allow 
specifying IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

```
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11380) Plasma packages for arm64

2021-01-25 Thread Leonard Lausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Lausen updated ARROW-11380:
---
Description: 
"Note that Plasma packages are available only for amd64. Because 
nvidia-cuda-toolkit package isn't available for arm64." 
https://issues.apache.org/jira/browse/ARROW-6715

NVIDIA supports CUDA on ARM, so this should be possible in principle?

Please also clarify the relation with https://github.com/alibaba/libvineyard. 
Do you intend to continue developing Plasma?

  was:
"Note that Plasma packages are available only for amd64. Because 
nvidia-cuda-toolkit package isn't available for arm64." 
https://issues.apache.org/jira/browse/ARROW-6715

NVIDIA supports CUDA on ARM, so this should be possible in principle?


> Plasma packages for arm64
> -
>
> Key: ARROW-11380
> URL: https://issues.apache.org/jira/browse/ARROW-11380
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Leonard Lausen
>Priority: Major
>
> "Note that Plasma packages are available only for amd64. Because 
> nvidia-cuda-toolkit package isn't available for arm64." 
> https://issues.apache.org/jira/browse/ARROW-6715
> NVIDIA supports CUDA on ARM, so this should be possible in principle?
> Please also clarify the relation with https://github.com/alibaba/libvineyard. 
> Do you intend to continue developing Plasma?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6715) [Website] Describe "non-free" component is needed for Plasma packages in install page

2021-01-25 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271504#comment-17271504
 ] 

Leonard Lausen commented on ARROW-6715:
---

Thanks. I opened https://issues.apache.org/jira/browse/ARROW-11380

> [Website] Describe "non-free" component is needed for Plasma packages in 
> install page
> -
>
> Key: ARROW-6715
> URL: https://issues.apache.org/jira/browse/ARROW-6715
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Because Plasma packages depend on the nvidia-cuda-toolkit package, which is 
> in the non-free component.
> Note that Plasma packages are available only for amd64. Because 
> nvidia-cuda-toolkit package isn't available for arm64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11380) Plasma packages for arm64

2021-01-25 Thread Leonard Lausen (Jira)
Leonard Lausen created ARROW-11380:
--

 Summary: Plasma packages for arm64
 Key: ARROW-11380
 URL: https://issues.apache.org/jira/browse/ARROW-11380
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Leonard Lausen


"Note that Plasma packages are available only for amd64. Because 
nvidia-cuda-toolkit package isn't available for arm64." 
https://issues.apache.org/jira/browse/ARROW-6715

NVIDIA supports CUDA on ARM, so this should be possible in principle?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6715) [Website] Describe "non-free" component is needed for Plasma packages in install page

2021-01-25 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271409#comment-17271409
 ] 

Leonard Lausen commented on ARROW-6715:
---

NVIDIA supports CUDA on ARM. Could you elaborate on what is missing for 
providing Plasma on arm64?

> [Website] Describe "non-free" component is needed for Plasma packages in 
> install page
> -
>
> Key: ARROW-6715
> URL: https://issues.apache.org/jira/browse/ARROW-6715
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Because Plasma packages depend on the nvidia-cuda-toolkit package, which is 
> in the non-free component.
> Note that Plasma packages are available only for amd64. Because 
> nvidia-cuda-toolkit package isn't available for arm64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10349) [Python] build and publish aarch64 wheels

2021-01-25 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271406#comment-17271406
 ] 

Leonard Lausen commented on ARROW-10349:


While not wheels, there are binaries for arm64 at 
https://arrow.apache.org/install/

> [Python] build and publish aarch64 wheels
> -
>
> Key: ARROW-10349
> URL: https://issues.apache.org/jira/browse/ARROW-10349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
> Environment: os: Linux
> arch: aarch64
>Reporter: Jonathan Swinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The currently released source distribution for Arrow on pypi.org doesn't 
> build on Ubuntu 20.04. It may be possible to install additional build 
> dependencies to make it work, but it would be better to publish aarch64 
> (arm64) wheels to pypi.org in addition to the currently published x86_64 
> wheels for Linux.
> {{$ pip install pyarrow}}
> should just work on Linux/aarch64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2021-01-24 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271015#comment-17271015
 ] 

Leonard Lausen commented on ARROW-9773:
---

There is a similar issue with large tables (many rows) of medium-size lists 
(~512 elements per list). When using the `pa.list_` type, `take` will fail 
due to `offset overflow while concatenating arrays`; using `pa.large_list` 
works. (But in practice it doesn't help, as `.take` performs 3 orders of 
magnitude slower (~1 s vs ~1 ms) than indexing operations in pandas.)
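A minimal sketch of the `pa.large_list` workaround (shown on a toy table, 
since a real overflow needs more than 2**31 flattened elements, and assuming 
the cast kernels support list-to-large_list):
{code:python}
import pyarrow as pa

# Toy table with 32-bit list offsets; real tables overflow at 2**31 elements.
t = pa.table({'col': pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))})

# Casting to large_list switches to 64-bit offsets, so take() can concatenate.
t_large = t.cast(pa.schema([pa.field('col', pa.large_list(pa.int64()))]))
print(t_large.take([1, 0]))
{code}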

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> ---
>
> Key: ARROW-9773
> URL: https://issues.apache.org/jira/browse/ARROW-9773
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: David Li
>Priority: Major
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down 
> when calling Take() from a ChunkedArray or Table where concatenating the 
> arrays would result in an array that's too large. While inconvenient to 
> implement, it would be useful if this case were handled.
> This could be done as a higher-level wrapper around Take(), perhaps.
> Example in Python:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File 
> "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
>  line 268, in take
> return call_function('take', [data, indices], options)
>   File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
>   File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
>   File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
> {code}
> In this example, it would be useful if Take() or a higher-level wrapper could 
> generate multiple record batches as output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10868) pip install --user fails to install lib

2020-12-09 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246964#comment-17246964
 ] 

Leonard Lausen commented on ARROW-10868:


Do you mean the environment variables?

I think the issue here is that the Arrow `setup.py` file may not handle the 
`--user` case correctly.

> pip install --user fails to install lib
> ---
>
> Key: ARROW-10868
> URL: https://issues.apache.org/jira/browse/ARROW-10868
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Priority: Major
>
> Compiling and installing C++ library via:
> {code}
> cd ~/src/pyarrow/cpp
> mkdir build
> cd build
> CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
> ninja
> sudo ninja install
> {code}
> Then installing the Python package as follows will claim to succeed, but 
> will actually fail to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` fails):
> {code}
> cd ~/src/pyarrow/python
> pip install --user .
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10867) build failure on aarch64 with -DARROW_PYTHON=ON and gcc

2020-12-09 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246963#comment-17246963
 ] 

Leonard Lausen commented on ARROW-10867:


It might be a valid gcc bug, but it would still be great if Arrow could work 
around the bug, and if one of the Arrow maintainers could report the bug to 
gcc, as you are more familiar with the code.

> build failure on aarch64 with -DARROW_PYTHON=ON and gcc
> ---
>
> Key: ARROW-10867
> URL: https://issues.apache.org/jira/browse/ARROW-10867
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Leonard Lausen
>Priority: Major
> Attachments: arrow
>
>
> Arrow will trigger compiler errors in (at least) gcc7, gcc8 and gcc9 on 
> aarch64 on a https://aws.amazon.com/ec2/instance-types/c6/ instance.
> Compiling with clang-11 works fine.
> ```
> ../src/arrow/compute/kernels/scalar_cast_nested.cc: In function ‘void arrow::compute::internal::CastListExec(arrow::compute::KernelContext*, const arrow::compute::ExecBatch&, arrow::Datum*) [with Type = arrow::LargeListType]’:
> ../src/arrow/compute/kernels/scalar_cast_nested.cc:33:6: internal compiler error: Segmentation fault
>  void CastListExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
>   ^~~~
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See  for instructions.
> ```
> Full build log attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10867) build failure on aarch64 with -DARROW_PYTHON=ON and gcc

2020-12-09 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246961#comment-17246961
 ] 

Leonard Lausen commented on ARROW-10867:


master branch 
https://github.com/apache/arrow/commit/3deae8dd50da773ba215704e567d9937b04b02c5

> build failure on aarch64 with -DARROW_PYTHON=ON and gcc
> ---
>
> Key: ARROW-10867
> URL: https://issues.apache.org/jira/browse/ARROW-10867
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Leonard Lausen
>Priority: Major
> Attachments: arrow
>
>
> Arrow will trigger compiler errors in (at least) gcc7, gcc8 and gcc9 on 
> aarch64 on a https://aws.amazon.com/ec2/instance-types/c6/ instance.
> Compiling with clang-11 works fine.
> ```
> ../src/arrow/compute/kernels/scalar_cast_nested.cc: In function ‘void arrow::compute::internal::CastListExec(arrow::compute::KernelContext*, const arrow::compute::ExecBatch&, arrow::Datum*) [with Type = arrow::LargeListType]’:
> ../src/arrow/compute/kernels/scalar_cast_nested.cc:33:6: internal compiler error: Segmentation fault
>  void CastListExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
>   ^~~~
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See  for instructions.
> ```
> Full build log attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10868) pip install --user fails to install lib

2020-12-09 Thread Leonard Lausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Lausen updated ARROW-10868:
---
Description: 
Compiling and installing C++ library via:


{code}
cd ~/src/pyarrow/cpp
mkdir build
cd build
CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
ninja
sudo ninja install
{code}


Then installing the Python package as follows will claim to succeed, but will 
actually fail to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` fails):

{code}
cd ~/src/pyarrow/python
pip install --user .
{code}

  was:
Compiling and installing C++ library via:

```
cd ~/src/pyarrow/cpp
mkdir build
cd build
CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
ninja
sudo ninja install
```

Then installing the Python package as follows will claim to succeed, but will 
actually fail to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` fails):

```
cd ~/src/pyarrow/python
pip install --user .
```



> pip install --user fails to install lib
> ---
>
> Key: ARROW-10868
> URL: https://issues.apache.org/jira/browse/ARROW-10868
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Priority: Major
>
> Compiling and installing C++ library via:
> {code}
> cd ~/src/pyarrow/cpp
> mkdir build
> cd build
> CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
> ninja
> sudo ninja install
> {code}
> Then installing the Python package as follows will claim to succeed, but 
> will actually fail to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` fails):
> {code}
> cd ~/src/pyarrow/python
> pip install --user .
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10868) pip install --user fails to install lib

2020-12-09 Thread Leonard Lausen (Jira)
Leonard Lausen created ARROW-10868:
--

 Summary: pip install --user fails to install lib
 Key: ARROW-10868
 URL: https://issues.apache.org/jira/browse/ARROW-10868
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Leonard Lausen


Compiling and installing C++ library via:

```
cd ~/src/pyarrow/cpp
mkdir build
cd build
CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
ninja
sudo ninja install
```

Then installing the Python package as follows will claim to succeed, but will 
actually fail to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` fails):

```
cd ~/src/pyarrow/python
pip install --user .
```




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10867) build failure on aarch64 with -DARROW_PYTHON=ON and gcc

2020-12-09 Thread Leonard Lausen (Jira)
Leonard Lausen created ARROW-10867:
--

 Summary: build failure on aarch64 with -DARROW_PYTHON=ON and gcc
 Key: ARROW-10867
 URL: https://issues.apache.org/jira/browse/ARROW-10867
 Project: Apache Arrow
  Issue Type: Task
Reporter: Leonard Lausen
 Attachments: arrow

Arrow will trigger compiler errors in (at least) gcc7, gcc8 and gcc9 on aarch64 
on a https://aws.amazon.com/ec2/instance-types/c6/ instance.
Compiling with clang-11 works fine.

```
../src/arrow/compute/kernels/scalar_cast_nested.cc: In function ‘void arrow::compute::internal::CastListExec(arrow::compute::KernelContext*, const arrow::compute::ExecBatch&, arrow::Datum*) [with Type = arrow::LargeListType]’:
../src/arrow/compute/kernels/scalar_cast_nested.cc:33:6: internal compiler error: Segmentation fault
 void CastListExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
  ^~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
```

Full build log attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10866) manylinux aarch64 wheel

2020-12-09 Thread Leonard Lausen (Jira)
Leonard Lausen created ARROW-10866:
--

 Summary: manylinux aarch64 wheel
 Key: ARROW-10866
 URL: https://issues.apache.org/jira/browse/ARROW-10866
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Leonard Lausen


Please provide an aarch64 wheel on https://pypi.org/project/pyarrow/#files



--
This message was sent by Atlassian Jira
(v8.3.4#803005)