[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Tao He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277647#comment-17277647
 ] 

Tao He commented on ARROW-11463:


Thanks for the background on pickle 5, [~lausen].

[~lausen] For your case, rather than invoking `pa.serialize`, you could create 
a stream using `pa.ipc.new_stream`, pass it an `IpcWriteOptions`, and then use 
`.write_table()` to write your tables to the stream.

cf.:
* [https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream]
* [https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchStreamWriter.html#pyarrow.RecordBatchStreamWriter.write_table]
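
For reference, a minimal sketch of that approach (not from this thread); it assumes the `allow_64bit` flag is exposed on `pyarrow.ipc.IpcWriteOptions` from Python, as proposed in this issue:
{code:python}
import pyarrow as pa

# Illustrative only: write a table to an IPC stream with 64-bit array lengths
# allowed. Assumes IpcWriteOptions exposes `allow_64bit` (the change tracked here).
table = pa.table({"x": [1, 2, 3]})

sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(allow_64bit=True)
writer = pa.ipc.new_stream(sink, table.schema, options=options)
writer.write_table(table)
writer.close()

# Read the stream back into a table.
reader = pa.ipc.open_stream(sink.getvalue())
restored = reader.read_all()
{code}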

 

Hope that helps!

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Leonard Lausen
>Assignee: Tao He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will 
> be around 1000x slower than `pyarrow.Table.take` on the same table with 
> combined chunks (1 chunk). Unfortunately, if such a table contains a large list 
> data type, the flattened table easily exceeds 2**31 rows, and serialization of 
> the table with combined chunks (e.g. for the Plasma store) fails with 
> `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
> length`.
> I couldn't find a way to enable 64-bit support for the serialization as called 
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 
> 64-bit setting; further, the Python serialization APIs do not allow 
> specification of IpcWriteOptions).
> I was able to serialize successfully after changing the default and rebuilding:
> {code:c++}
> modified   cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>  
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}





[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277515#comment-17277515
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you for sharing the tests / example code [~apitrou]. Pickle protocol 5 is 
really useful. For example, the following code replicates my Plasma store use 
case, given a folder in {{/dev/shm}} as {{path}}.
{code:python}
import pickle
import mmap

def shm_pickle(path, tbl):
    # Write each out-of-band buffer to its own file under `path`.
    idx = 0
    def buffer_callback(buf):
        nonlocal idx
        with open(path / f'{idx}.bin', 'wb') as f:
            f.write(buf)
        idx += 1
    with open(path / 'meta.pkl', 'wb') as f:
        pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback)


def shm_unpickle(path):
    num_buffers = len(list(path.iterdir())) - 1  # exclude meta.pkl
    # Memory-map the buffer files so unpickling is zero-copy.
    buffers = []
    for idx in range(num_buffers):
        f = open(path / f'{idx}.bin', 'rb')
        buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
    with open(path / 'meta.pkl', 'rb') as f:
        return pickle.load(f, buffers=buffers)
{code}
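
A hypothetical invocation of these helpers (the directory name and table contents below are made up for illustration):
{code:python}
from pathlib import Path
import pyarrow as pa

# Hypothetical usage of shm_pickle/shm_unpickle above.
tbl = pa.table({"x": list(range(1000))})
shm_dir = Path('/dev/shm/arrow-demo')
shm_dir.mkdir(parents=True, exist_ok=True)

shm_pickle(shm_dir, tbl)
restored = shm_unpickle(shm_dir)
assert restored.equals(tbl)
{code}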



[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277392#comment-17277392
 ] 

Antoine Pitrou commented on ARROW-11463:


PyArrow serialization is deprecated; users should use pickle themselves.

It is true that out-of-band data provides zero-copy support for buffers 
embedded in the pickled data. It is tested here:

[https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_array.py#L1695]
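
For illustration, a minimal in-memory sketch of protocol-5 out-of-band pickling (not taken from that test file):
{code:python}
import pickle
import pyarrow as pa

arr = pa.array(range(1000))

# Collect Arrow buffers out of band instead of embedding them in the pickle stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The buffers can be handed back (e.g. via shared memory) when unpickling.
restored = pickle.loads(payload, buffers=buffers)
assert restored.equals(arr)
{code}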

 



[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277373#comment-17277373
 ] 

Leonard Lausen commented on ARROW-11463:


Specifically, do you mean that PyArrow serialization is deprecated or that 
SerializationContext is deprecated? I.e., should users use pickle themselves, or 
will PyArrow just use pickle internally when serializing?



[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277348#comment-17277348
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you [~apitrou] for the background. For Plasma, Tao is developing a fork at 
https://github.com/alibaba/libvineyard which currently also uses PyArrow 
serialization and is thus affected by this issue. Regarding PyArrow serialization 
and pickle protocol 5, I see that you are the author of the PEP; thank you for 
driving that. Is it correct that the out-of-band data support makes pickle usable 
for zero-copy / shared-memory applications? Is there any plan for PyArrow to use 
pickle protocol 5 by default when running on Python 3.8+?



[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277230#comment-17277230
 ] 

Antoine Pitrou commented on ARROW-11463:


[~lausen] I'm not sure your question has a possible answer, but please note 
that both PyArrow serialization and Plasma are deprecated and unmaintained. For 
the former, the recommended replacement is pickle with protocol 5. For the 
latter, you may want to contact the developers of the Ray project (they used to 
maintain Plasma and decided to fork it).



[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-02 Thread Leonard Lausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170
 ] 

Leonard Lausen commented on ARROW-11463:


Thank you, Tao! How can we specify the IPC stream writer instance for 
`_serialize_pyarrow_table`, which is configured as the default serialization 
handler and used by `plasma_client.put`? It only supports specifying a 
{{SerializationContext}}, and I'm unsure how to configure the writer instance via 
{{SerializationContext}}.



[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow

2021-02-01 Thread Tao He (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276819#comment-17276819
 ] 

Tao He commented on ARROW-11463:


`pa.ipc.new_stream` accepts an options argument: 
[https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream]

I think we just need to expose the `allow_64bit` field to Python.

I could make a PR for that.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)