[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277515#comment-17277515 ]

Leonard Lausen commented on ARROW-11463:
----------------------------------------

Thank you for sharing the tests / example code [~apitrou]. Pickle protocol 5 is really useful. For example, the following code replicates my Plasma-store use case, given a directory in {{/dev/shm}} as {{path}}:

{code:python}
import mmap
import pickle


def shm_pickle(path, tbl):
    idx = 0

    def buffer_callback(buf):
        nonlocal idx
        with open(path / f'{idx}.bin', 'wb') as f:
            f.write(buf)
        idx += 1

    with open(path / 'meta.pkl', 'wb') as f:
        pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback)


def shm_unpickle(path):
    num_buffers = len(list(path.iterdir())) - 1  # exclude meta.pkl
    buffers = []
    for idx in range(num_buffers):
        f = open(path / f'{idx}.bin', 'rb')
        buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
    with open(path / 'meta.pkl', 'rb') as f:
        return pickle.load(f, buffers=buffers)
{code}

(A usage sketch of these helpers follows the quoted issue below.)

> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> ----------------------------------------------------------
>
>                 Key: ARROW-11463
>                 URL: https://issues.apache.org/jira/browse/ARROW-11463
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>            Reporter: Leonard Lausen
>            Assignee: Tao He
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization of the table with combined chunks (e.g. for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.
>
> I couldn't find a way to enable 64-bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do not allow specification of IpcWriteOptions).
>
> I was able to serialize successfully after changing the default and rebuilding:
>
> {code:c++}
> modified cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
>    /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
>    ///
>    /// Some implementations may not be able to parse streams created with this option.
> -  bool allow_64bit = false;
> +  bool allow_64bit = true;
>
>    /// \brief The maximum permitted schema nesting depth.
>    int max_recursion_depth = kMaxNestingDepth;
> {code}
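For completeness, a usage sketch of the helpers above (illustrative only: assumes a fresh, empty tmpfs-backed directory and a small pyarrow Table; with pyarrow versions that do not emit out-of-band buffers for protocol 5, everything simply lands in {{meta.pkl}} and the round trip still works):

{code:python}
import pathlib

import pyarrow as pa

# Hypothetical location; any empty directory works, /dev/shm keeps it in shared memory.
shm_dir = pathlib.Path('/dev/shm/arrow-demo')
shm_dir.mkdir(exist_ok=True)

tbl = pa.table({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
shm_pickle(shm_dir, tbl)      # writes meta.pkl plus one .bin file per out-of-band buffer

tbl2 = shm_unpickle(shm_dir)  # maps the .bin files back instead of copying them
assert tbl.equals(tbl2)
{code}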
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277373#comment-17277373 ]

Leonard Lausen commented on ARROW-11463:
----------------------------------------

Specifically, do you mean that PyArrow serialization is deprecated, or that SerializationContext is deprecated? I.e., should users use pickle themselves, or will PyArrow just use pickle internally when serializing?
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277348#comment-17277348 ]

Leonard Lausen commented on ARROW-11463:
----------------------------------------

Thank you [~apitrou] for the background. For Plasma, Tao is developing a fork at https://github.com/alibaba/libvineyard which currently also uses PyArrow serialization and is thus affected by this issue.

Regarding PyArrow serialization and pickle protocol 5: I see that you are the author of the PEP. Thank you for driving that. Is it correct that the out-of-band data support makes it usable for zero-copy / shared-memory applications? Is there any plan for PyArrow to use pickle protocol 5 by default when running on Python 3.8+?
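For context, the zero-copy mechanism that PEP 574 enables can be sketched as follows. numpy is used purely for illustration, since its arrays already implement out-of-band pickling; whether PyArrow objects take the same path is exactly the question above:

{code:python}
import pickle

import numpy as np

arr = np.zeros(1_000_000)  # 8 MB payload

buffers = []
# With protocol 5, the payload is handed to buffer_callback out-of-band
# instead of being copied into the pickle byte stream.
meta = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
assert len(meta) < arr.nbytes

# Reconstruction reuses the supplied buffers, so the new array
# views the original memory rather than a copy of it.
arr2 = pickle.loads(meta, buffers=buffers)
assert np.shares_memory(arr, arr2)
{code}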
[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170 ]

Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:52 PM:
-----------------------------------------------------------------

Thank you Tao! How can we specify the IPC stream writer instance for {{_serialize_pyarrow_table}}, which is configured to be the {{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only supports specifying {{SerializationContext}}, and I'm unsure how to configure the writer instance via {{SerializationContext}}.

was (Author: lausen):
Thank you Tao! How can we specify the IPC stream writer instance for {{_serialize_pyarrow_table}}, which is configured to be the {{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only supports specifying {{SerializationContext}}, and I'm unsure how to configure the writer instance via {{SerializationContext}}.
[jira] [Comment Edited] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170 ]

Leonard Lausen edited comment on ARROW-11463 at 2/2/21, 2:51 PM:
-----------------------------------------------------------------

Thank you Tao! How can we specify the IPC stream writer instance for {{_serialize_pyarrow_table}}, which is configured to be the {{default_serialization_handler}} and used by {{PlasmaClient.put}}? It only supports specifying {{SerializationContext}}, and I'm unsure how to configure the writer instance via {{SerializationContext}}.

was (Author: lausen):
Thank you Tao! How can we specify the IPC stream writer instance for the `_serialize_pyarrow_table` which is configured to be the default_serialization_handler and used by `plasma_client.put`? It only supports specifying {{SerializationContext}} and I'm unsure how to configure the writer instance via {{SerializationContext}}
[jira] [Commented] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277170#comment-17277170 ]

Leonard Lausen commented on ARROW-11463:
----------------------------------------

Thank you Tao! How can we specify the IPC stream writer instance for the `_serialize_pyarrow_table` which is configured to be the default_serialization_handler and used by `plasma_client.put`? It only supports specifying {{SerializationContext}}, and I'm unsure how to configure the writer instance via {{SerializationContext}}.
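For reference, after ARROW-11463's fix (Fix For: 4.0.0) the Python-side configuration should look roughly like the sketch below; the {{allow_64bit}} keyword is what the linked pull request adds and is hypothetical until that release:

{code:python}
import pyarrow as pa

tbl = pa.table({'x': [1, 2, 3]})
sink = pa.BufferOutputStream()

# allow_64bit mirrors the C++ IpcWriteOptions field of the same name.
options = pa.ipc.IpcWriteOptions(allow_64bit=True)
with pa.ipc.new_stream(sink, tbl.schema, options=options) as writer:
    writer.write_table(tbl)

serialized = sink.getvalue()  # an Arrow Buffer holding the IPC stream
{code}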
[jira] [Updated] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leonard Lausen updated ARROW-11463:
-----------------------------------
    Description: 
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization of the table with combined chunks (e.g. for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do not allow specification of IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

{code:c++}
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}

  was:
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization (e.g. for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do not allow specification of IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

{code:c++}
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}
[jira] [Updated] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leonard Lausen updated ARROW-11463:
-----------------------------------
    Description: 
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization (e.g. for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do not allow specification of IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

{code:c++}
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}

  was:
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization (e.g. for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do not allow specification of IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

```
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```
[jira] [Created] (ARROW-11463) Allow configuration of IpcWriterOptions 64Bit from PyArrow
Leonard Lausen created ARROW-11463:
--------------------------------------

             Summary: Allow configuration of IpcWriterOptions 64Bit from PyArrow
                 Key: ARROW-11463
                 URL: https://issues.apache.org/jira/browse/ARROW-11463
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
            Reporter: Leonard Lausen


For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to `pyarrow.Table.take` on the same table with combined chunks (1 chunk). Unfortunately, if such a table contains a large list data type, it's easy for the flattened table to contain more than 2**31 rows, and serialization (e.g. for the Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64-bit setting; further, the Python serialization APIs do not allow specification of IpcWriteOptions).

I was able to serialize successfully after changing the default and rebuilding:

```
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
```
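A scaled-down sketch of the chunking effect described above (2,000 chunks instead of 20k+; absolute timings are indicative only):

{code:python}
import time

import pyarrow as pa

# Same data as many small chunks vs. one combined chunk.
chunks = [pa.array(range(i, i + 100)) for i in range(0, 200_000, 100)]
many = pa.table({'x': pa.chunked_array(chunks)})
one = many.combine_chunks()

indices = list(range(0, 200_000, 7))
for name, t in [('many chunks', many), ('one chunk', one)]:
    start = time.perf_counter()
    t.take(indices)
    print(name, time.perf_counter() - start)
{code}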
[jira] [Updated] (ARROW-11380) Plasma packages for arm64
[ https://issues.apache.org/jira/browse/ARROW-11380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leonard Lausen updated ARROW-11380:
-----------------------------------
    Description: 
"Note that Plasma packages are available only for amd64. Because nvidia-cuda-toolkit package isn't available for arm64." https://issues.apache.org/jira/browse/ARROW-6715

Nvidia supports CUDA on ARM, so this should be possible in principle?

Please also clarify the relation to https://github.com/alibaba/libvineyard. Do you intend to continue developing Plasma?

  was:
"Note that Plasma packages are available only for amd64. Because nvidia-cuda-toolkit package isn't available for arm64." https://issues.apache.org/jira/browse/ARROW-6715

Nvidia supports CUDA on ARM, so this should be possible in principle?
[jira] [Commented] (ARROW-6715) [Website] Describe "non-free" component is needed for Plasma packages in install page
[ https://issues.apache.org/jira/browse/ARROW-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271504#comment-17271504 ]

Leonard Lausen commented on ARROW-6715:
---------------------------------------

Thanks. I opened https://issues.apache.org/jira/browse/ARROW-11380

> [Website] Describe "non-free" component is needed for Plasma packages in install page
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-6715
>                 URL: https://issues.apache.org/jira/browse/ARROW-6715
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Kouhei Sutou
>            Assignee: Kouhei Sutou
>            Priority: Major
>             Fix For: 3.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Because Plasma packages depend on the nvidia-cuda-toolkit package, which is in the non-free component.
>
> Note that Plasma packages are available only for amd64, because the nvidia-cuda-toolkit package isn't available for arm64.
[jira] [Created] (ARROW-11380) Plasma packages for arm64
Leonard Lausen created ARROW-11380:
--------------------------------------

             Summary: Plasma packages for arm64
                 Key: ARROW-11380
                 URL: https://issues.apache.org/jira/browse/ARROW-11380
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Leonard Lausen


"Note that Plasma packages are available only for amd64. Because nvidia-cuda-toolkit package isn't available for arm64." https://issues.apache.org/jira/browse/ARROW-6715

Nvidia supports CUDA on ARM, so this should be possible in principle?
[jira] [Commented] (ARROW-6715) [Website] Describe "non-free" component is needed for Plasma packages in install page
[ https://issues.apache.org/jira/browse/ARROW-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271409#comment-17271409 ]

Leonard Lausen commented on ARROW-6715:
---------------------------------------

Nvidia supports CUDA on Arm. Could you elaborate on what is missing for providing Plasma on arm64?
[jira] [Commented] (ARROW-10349) [Python] build and publish aarch64 wheels
[ https://issues.apache.org/jira/browse/ARROW-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271406#comment-17271406 ]

Leonard Lausen commented on ARROW-10349:
----------------------------------------

While not wheels, there are binaries for arm64 at https://arrow.apache.org/install/

> [Python] build and publish aarch64 wheels
> ------------------------------------------
>
>                 Key: ARROW-10349
>                 URL: https://issues.apache.org/jira/browse/ARROW-10349
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Packaging, Python
>         Environment: os: Linux
>                      arch: aarch64
>            Reporter: Jonathan Swinney
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> The currently released source distribution for Arrow on pypi.org doesn't build on Ubuntu 20.04. It may be possible to install additional build dependencies to make it work, but it would be better to publish aarch64 (arm64) wheels to pypi.org in addition to the currently published x86_64 wheels for Linux.
>
> {{$ pip install pyarrow}}
>
> should just work on Linux/aarch64.
[jira] [Commented] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
[ https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271015#comment-17271015 ]

Leonard Lausen commented on ARROW-9773:
---------------------------------------

There is a similar issue with large tables (many rows) of medium-size lists (~512 elements per list). When using the `pa.list_` type, `take` will fail due to `offset overflow while concatenating arrays`. Using `pa.large_list` works. (But in practice it doesn't help much, as `.take` is around 3 orders of magnitude slower (~1s vs ~1ms) than indexing operations in pandas.) A sketch of the `large_list` workaround follows the quoted issue below.

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> -------------------------------------------------------------------------
>
>                 Key: ARROW-9773
>                 URL: https://issues.apache.org/jira/browse/ARROW-9773
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 1.0.0
>            Reporter: David Li
>            Priority: Major
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down when calling Take() from a ChunkedArray or Table where concatenating the arrays would result in an array that's too large. While inconvenient to implement, it would be useful if this case were handled.
>
> This could be done as a higher-level wrapper around Take(), perhaps.
>
> Example in Python:
>
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", line 268, in take
>     return call_function('take', [data, indices], options)
>   File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
>   File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
> {code}
>
> In this example, it would be useful if Take() or a higher-level wrapper could generate multiple record batches as output.
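A sketch of the {{large_list}} workaround mentioned in the comment above: {{pa.large_list}} uses 64-bit offsets, so concatenation inside {{take}} cannot overflow them. The data here is small on purpose; the overflow only occurs once the flattened child array exceeds 2^31 elements:

{code:python}
import pyarrow as pa

lists = [[0] * 512 for _ in range(100)]
t32 = pa.table({'col': pa.array(lists, type=pa.list_(pa.int64()))})
t64 = pa.table({'col': pa.array(lists, type=pa.large_list(pa.int64()))})

# Both succeed at this size; at scale, only the large_list variant avoids
# "offset overflow while concatenating arrays".
t32.take([5, 1, 42])
t64.take([5, 1, 42])
{code}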
[jira] [Commented] (ARROW-10868) pip install --user fails to install lib
[ https://issues.apache.org/jira/browse/ARROW-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246964#comment-17246964 ]

Leonard Lausen commented on ARROW-10868:
----------------------------------------

Do you mean the environment variables? I think the issue here is that the Arrow `setup.py` file may not handle the `--user` case correctly.

> pip install --user fails to install lib
> -----------------------------------------
>
>                 Key: ARROW-10868
>                 URL: https://issues.apache.org/jira/browse/ARROW-10868
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>            Reporter: Leonard Lausen
>            Priority: Major
>
> Compiling and installing the C++ library via:
>
> {code}
> cd ~/src/pyarrow/cpp
> mkdir build
> cd build
> CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
> ninja
> sudo ninja install
> {code}
>
> Then installing the Python package as follows will claim to succeed, but actually fails to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` will fail):
>
> {code}
> cd ~/src/pyarrow/python
> pip install --user .
> {code}
[jira] [Commented] (ARROW-10867) build failure on aarch64 with -DARROW_PYTHON=ON and gcc
[ https://issues.apache.org/jira/browse/ARROW-10867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246963#comment-17246963 ]

Leonard Lausen commented on ARROW-10867:
----------------------------------------

It might be a valid gcc bug, but it would still be great if Arrow could work around it, and if one of the Arrow maintainers could report the bug to gcc, as you are more familiar with the code.

> build failure on aarch64 with -DARROW_PYTHON=ON and gcc
> ---------------------------------------------------------
>
>                 Key: ARROW-10867
>                 URL: https://issues.apache.org/jira/browse/ARROW-10867
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++
>            Reporter: Leonard Lausen
>            Priority: Major
>         Attachments: arrow
>
> Arrow will trigger compiler errors in (at least) gcc7, gcc8 and gcc9 on aarch64 on a https://aws.amazon.com/ec2/instance-types/c6/ instance. Compiling with clang-11 works fine.
>
> ```
> ../src/arrow/compute/kernels/scalar_cast_nested.cc: In function ‘void arrow::compute::internal::CastListExec(arrow::compute::KernelContext*, const arrow::compute::ExecBatch&, arrow::Datum*) [with Type = arrow::LargeListType]’:
> ../src/arrow/compute/kernels/scalar_cast_nested.cc:33:6: internal compiler error: Segmentation fault
>  void CastListExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
>       ^~~~
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See  for instructions.
> ```
>
> Full build log attached.
[jira] [Commented] (ARROW-10867) build failure on aarch64 with -DARROW_PYTHON=ON and gcc
[ https://issues.apache.org/jira/browse/ARROW-10867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246961#comment-17246961 ]

Leonard Lausen commented on ARROW-10867:
----------------------------------------

master branch, https://github.com/apache/arrow/commit/3deae8dd50da773ba215704e567d9937b04b02c5
[jira] [Updated] (ARROW-10868) pip install --user fails to install lib
[ https://issues.apache.org/jira/browse/ARROW-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leonard Lausen updated ARROW-10868:
-----------------------------------
    Description: 
Compiling and installing the C++ library via:

{code}
cd ~/src/pyarrow/cpp
mkdir build
cd build
CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
ninja
sudo ninja install
{code}

Then installing the Python package as follows will claim to succeed, but actually fails to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` will fail):

{code}
cd ~/src/pyarrow/python
pip install --user .
{code}

  was:
Compiling and installing the C++ library via:

```
cd ~/src/pyarrow/cpp
mkdir build
cd build
CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
ninja
sudo ninja install
```

Then installing the Python package as follows will claim to succeed, but actually fails to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` will fail):

```
cd ~/src/pyarrow/python
pip install --user .
```
[jira] [Created] (ARROW-10868) pip install --user fails to install lib
Leonard Lausen created ARROW-10868:
--------------------------------------

             Summary: pip install --user fails to install lib
                 Key: ARROW-10868
                 URL: https://issues.apache.org/jira/browse/ARROW-10868
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
            Reporter: Leonard Lausen


Compiling and installing the C++ library via:

```
cd ~/src/pyarrow/cpp
mkdir build
cd build
CC=clang-11 CXX=clang++-11 cmake -GNinja -DARROW_PYTHON=ON ..
ninja
sudo ninja install
```

Then installing the Python package as follows will claim to succeed, but actually fails to provide `pyarrow.lib` (`python3 -c 'import pyarrow.lib'` will fail):

```
cd ~/src/pyarrow/python
pip install --user .
```
[jira] [Created] (ARROW-10867) build failure on aarch64 with -DARROW_PYTHON=ON and gcc
Leonard Lausen created ARROW-10867:
--------------------------------------

             Summary: build failure on aarch64 with -DARROW_PYTHON=ON and gcc
                 Key: ARROW-10867
                 URL: https://issues.apache.org/jira/browse/ARROW-10867
             Project: Apache Arrow
          Issue Type: Task
            Reporter: Leonard Lausen
         Attachments: arrow


Arrow will trigger compiler errors in (at least) gcc7, gcc8 and gcc9 on aarch64 on a https://aws.amazon.com/ec2/instance-types/c6/ instance. Compiling with clang-11 works fine.

```
../src/arrow/compute/kernels/scalar_cast_nested.cc: In function ‘void arrow::compute::internal::CastListExec(arrow::compute::KernelContext*, const arrow::compute::ExecBatch&, arrow::Datum*) [with Type = arrow::LargeListType]’:
../src/arrow/compute/kernels/scalar_cast_nested.cc:33:6: internal compiler error: Segmentation fault
 void CastListExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
      ^~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
```

Full build log attached.
[jira] [Created] (ARROW-10866) manylinux aarch64 wheel
Leonard Lausen created ARROW-10866:
--------------------------------------

             Summary: manylinux aarch64 wheel
                 Key: ARROW-10866
                 URL: https://issues.apache.org/jira/browse/ARROW-10866
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
            Reporter: Leonard Lausen


Please provide an aarch64 wheel on https://pypi.org/project/pyarrow/#files