[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leonard Lausen updated ARROW-11463:
-----------------------------------
Description:

For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` is around 1000x slower than `pyarrow.Table.take` on the same table with its chunks combined into one.

Unfortunately, if such a table contains a large list data type, the flattened table can easily exceed 2^31 rows, and serialization (e.g. for the Plasma store) fails with `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`.

I couldn't find a way to enable 64-bit support for serialization as called from Python: `IpcWriteOptions` in Python does not expose the `CIpcWriteOptions` 64-bit setting, and the Python serialization APIs do not allow specifying `IpcWriteOptions` at all.

I was able to serialize successfully after changing the default and rebuilding:

{code:c++}
modified cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;

   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}
> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> ----------------------------------------------------------
>
>                 Key: ARROW-11463
>                 URL: https://issues.apache.org/jira/browse/ARROW-11463
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>            Reporter: Leonard Lausen
>            Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)