[ 
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Lausen updated ARROW-11463:
-----------------------------------
    Description: 
For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` can be 
around 1000x slower than `pyarrow.Table.take` on the same table with combined 
chunks (a single chunk). Unfortunately, if such a table contains a large list 
data type, the flattened table can easily exceed 2^31 rows, and serialization 
(e.g. for the Plasma store) will then fail with 
`pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in 
length`
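
For illustration, a minimal sketch of the chunking effect (the sizes below are 
made up for brevity; the reported tables have 2M+ rows across 20k+ chunks):

{code:python}
import pyarrow as pa

# Build a table whose single column is split into many tiny chunks.
chunks = [pa.array([i, i + 1]) for i in range(0, 20_000, 2)]
table = pa.table({"x": pa.chunked_array(chunks)})  # 20k rows, 10k chunks

indices = pa.array(range(0, 20_000, 7))

# take() on the heavily chunked table must resolve every index against the
# chunk layout; on the combined (single-chunk) table it is a plain gather.
slow = table.take(indices)
fast = table.combine_chunks().take(indices)
assert slow == fast
{code}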

I couldn't find a way to enable 64-bit support for the serialization as called 
from Python: `IpcWriteOptions` in Python does not expose the `CIpcWriteOptions` 
64-bit setting, and furthermore the Python serialization APIs do not allow 
specification of `IpcWriteOptions` at all.
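
To make the gap concrete, here is roughly what one would want to write from 
Python; the `allow_64bit=` keyword below is the missing piece, not an existing 
parameter:

{code:python}
import pyarrow as pa

# IpcWriteOptions exists in Python, but (as of this report) it does not
# expose the C++ allow_64bit field, so this raises TypeError:
opts = pa.ipc.IpcWriteOptions(allow_64bit=True)  # hypothetical keyword

# And even with a valid options object, the serialization entry points used
# with Plasma do not accept an IpcWriteOptions argument at all.
{code}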

I was able to serialize successfully after changing the default and rebuilding:


{code:c++}
modified   cpp/src/arrow/ipc/options.h
@@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
   ///
   /// Some implementations may not be able to parse streams created with this option.
-  bool allow_64bit = false;
+  bool allow_64bit = true;
 
   /// \brief The maximum permitted schema nesting depth.
   int max_recursion_depth = kMaxNestingDepth;
{code}
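
A rough repro of the original failure from Python (assumes a machine with 
enough memory; `pa.nulls` is just a cheap way to get a very long array):

{code:python}
import pyarrow as pa

# A single batch with more than 2^31 - 1 rows.
big = pa.table({"x": pa.nulls(2**31, pa.int8())})

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, big.schema) as writer:
    # Fails with ArrowCapacityError on a stock build (allow_64bit = false);
    # succeeds after rebuilding with allow_64bit = true as patched above.
    writer.write_table(big)
{code}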



> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> ----------------------------------------------------------
>
>                 Key: ARROW-11463
>                 URL: https://issues.apache.org/jira/browse/ARROW-11463
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>            Reporter: Leonard Lausen
>            Priority: Major


