[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277647#comment-17277647 ]

Tao He commented on ARROW-11463:

Thanks for the background on pickle 5, [~lausen]. For your case, rather than invoking `pa.serialize`, you could create a stream using `pa.ipc.new_stream`, pass an `IpcWriteOptions`, and then use `.write_table()` to write your tables to the stream. See:

- [https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream]
- [https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchStreamWriter.html#pyarrow.RecordBatchStreamWriter.write_table]

Hope that helps!

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> --
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: Leonard Lausen
> Assignee: Tao He
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take`
> will be around 1000x slower than `pyarrow.Table.take` on the table with
> combined chunks (1 chunk). Unfortunately, if such a table contains a large
> list data type, it is easy for the flattened table to contain more than 2**31
> rows, and serialization of the table with combined chunks (e.g. for the
> Plasma store) will fail with `pyarrow.lib.ArrowCapacityError: Cannot write
> arrays larger than 2^31 - 1 in length`.
>
> I couldn't find a way to enable 64-bit support for the serialization as
> called from Python (IpcWriteOptions in Python does not expose the
> CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do
> not allow specification of IpcWriteOptions).
>
> I was able to serialize successfully after changing the default and
> rebuilding:
> {code:c++}
> modified cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277515#comment-17277515 ]

Leonard Lausen commented on ARROW-11463:

Thank you for sharing the tests / example code, [~apitrou]. Pickle protocol 5 is really useful. For example, the following code replicates my use case for the Plasma store, given a folder in {{/dev/shm}} as {{path}}:

{code:python}
import mmap
import pickle


def shm_pickle(path, tbl):
    idx = 0

    def buffer_callback(buf):
        nonlocal idx
        with open(path / f'{idx}.bin', 'wb') as f:
            f.write(buf)
        idx += 1

    with open(path / 'meta.pkl', 'wb') as f:
        pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback)


def shm_unpickle(path):
    num_buffers = len(list(path.iterdir())) - 1  # exclude meta.pkl
    buffers = []
    for idx in range(num_buffers):
        f = open(path / f'{idx}.bin', 'rb')
        buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
    with open(path / 'meta.pkl', 'rb') as f:
        return pickle.load(f, buffers=buffers)
{code}
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277392#comment-17277392 ]

Antoine Pitrou commented on ARROW-11463:

PyArrow serialization is deprecated; users should use pickle themselves. It is true that out-of-band data provides zero-copy support for buffers embedded in the pickled data. It is tested here: [https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_array.py#L1695]
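The out-of-band mechanism Antoine refers to can be illustrated with a plain {{pickle.PickleBuffer}}, independent of Arrow. A minimal sketch:

```python
import pickle

data = bytearray(b"x" * 4096)

# With a buffer_callback, protocol 5 ships the buffer out-of-band:
# the pickle stream itself carries only metadata.
buffers = []
meta = pickle.dumps(pickle.PickleBuffer(data), protocol=5,
                    buffer_callback=buffers.append)

# The receiver reconstructs from metadata plus the original buffers,
# so the 4096 payload bytes are never copied into the pickle stream.
restored = pickle.loads(meta, buffers=buffers)
```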
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277373#comment-17277373 ]

Leonard Lausen commented on ARROW-11463:

Specifically, do you mean that PyArrow serialization is deprecated, or that SerializationContext is deprecated? That is, should users use pickle themselves, or will PyArrow just use pickle internally when serializing?
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277348#comment-17277348 ]

Leonard Lausen commented on ARROW-11463:

Thank you [~apitrou] for the background. For Plasma, Tao is developing a fork at https://github.com/alibaba/libvineyard which currently also uses PyArrow serialization and is thus affected by this issue.

Regarding PyArrow serialization and pickle 5, I see that you are the author of the PEP; thank you for driving that. Is it correct that the out-of-band data support makes it possible to use pickle for zero-copy / shared-memory applications? Is there any plan for PyArrow to use pickle protocol 5 by default when running on Python 3.8+?
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277230#comment-17277230 ]

Antoine Pitrou commented on ARROW-11463:

[~lausen] I'm not sure your question has a possible answer, but please note that both PyArrow serialization and Plasma are deprecated and unmaintained. For the former, the recommended replacement is pickle with protocol 5. For the latter, you may want to contact the developers of the Ray project (they used to maintain Plasma and decided to fork it).
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170 ]

Leonard Lausen commented on ARROW-11463:

Thank you, Tao! How can we specify the IPC stream writer instance for `_serialize_pyarrow_table`, which is configured as the default_serialization_handler and used by `plasma_client.put`? It only supports specifying {{SerializationContext}}, and I'm unsure how to configure the writer instance via {{SerializationContext}}.
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276819#comment-17276819 ]

Tao He commented on ARROW-11463:

`pa.ipc.new_stream` accepts an options argument: [https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream] I think we just need to expose the `allow_64bit` field to Python. I could make a PR for that.