[ 
https://issues.apache.org/jira/browse/ARROW-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-11878:
----------------------------------
    Fix Version/s: 9.0.0
                       (was: 8.0.0)

> [C++] Improve Converter API to support chunking
> -----------------------------------------------
>
>                 Key: ARROW-11878
>                 URL: https://issues.apache.org/jira/browse/ARROW-11878
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 9.0.0
>
>
> We would like to be able to chunk a data frame when converting it to an 
> Arrow Table in R (see ARROW-9293). Apparently pyarrow does not support this 
> either. [~romainfrancois] says two things need to happen: 
>  - The Converter API needs to be able to Extend() over a range of values. 
> The current API is the {{Status Extend(SEXP x, int64_t size)}} override, 
> which says "ingest the vector x, which has this many elements". 
>  - A Chunker, or perhaps a new class, would sit on top of that, so that 
> {{Chunker::Extend(x)}} calls {{Converter$Extend(x, start, size)}} multiple 
> times, once for each chunk. 
> The current chunker, I believe, solves a different problem. It is rooted in 
> a Converter that handles elements one at a time, so that: 
>   - if the element can be added with Append(), that's fine
>   - if not, it creates a new chunk and tries again
> The current chunker has a multi-element method, but it is all-or-nothing: 
> {code}
>   // we could get a bit smarter here since the whole batch of appendable
>   // values will be rejected if a capacity error is raised
>   Status Extend(InputType values, int64_t size) {
>     auto status = converter_->Extend(values, size);
>     if (ARROW_PREDICT_FALSE(status.IsCapacityError())) {
>       if (converter_->builder()->length() == 0) {
>         return status;
>       }
>       ARROW_RETURN_NOT_OK(FinishChunk());
>       return Extend(values, size);
>     }
>     length_ += size;
>     return status;
>   }
> {code}
> This offers no way to say, for example, "take this vector and chunk it into 
> arrays of this size", which is what we want. 
> cc [~kszucs] [~bkietz]

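The two steps proposed in the issue could be sketched roughly as follows. This is a minimal, standalone illustration and not Arrow code: RangeConverter, FixedSizeChunker, Finish() and chunks() are hypothetical names, and a std::vector<int> stands in for both the R vector (SEXP) and the Arrow builder. The converter accepts a (start, size) range, and the chunker sitting on top of it calls the converter once per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Converter; not the Arrow class.
// Extend() ingests the half-open range [start, start + size) of `values`.
class RangeConverter {
 public:
  void Extend(const std::vector<int>& values, int64_t start, int64_t size) {
    assert(start >= 0 && size >= 0 &&
           start + size <= static_cast<int64_t>(values.size()));
    buffer_.insert(buffer_.end(), values.begin() + start,
                   values.begin() + start + size);
  }

  // Hand back the values accumulated so far and reset for the next chunk.
  std::vector<int> Finish() {
    std::vector<int> out = std::move(buffer_);
    buffer_.clear();
    return out;
  }

 private:
  std::vector<int> buffer_;
};

// Hypothetical chunker layered on top of the converter: it splits the input
// into chunks of at most `chunk_size` elements.
class FixedSizeChunker {
 public:
  explicit FixedSizeChunker(int64_t chunk_size) : chunk_size_(chunk_size) {
    if (chunk_size <= 0) {
      throw std::invalid_argument("chunk_size must be positive");
    }
  }

  // One converter call per chunk, instead of one all-or-nothing call.
  void Extend(const std::vector<int>& values) {
    const int64_t n = static_cast<int64_t>(values.size());
    for (int64_t start = 0; start < n; start += chunk_size_) {
      const int64_t size = std::min(chunk_size_, n - start);
      converter_.Extend(values, start, size);
      chunks_.push_back(converter_.Finish());
    }
  }

  const std::vector<std::vector<int>>& chunks() const { return chunks_; }

 private:
  int64_t chunk_size_;
  RangeConverter converter_;
  std::vector<std::vector<int>> chunks_;
};
```

With a chunk size of 3, calling Extend() on a 7-element vector yields three chunks of 3, 3 and 1 elements. A real implementation would return Status from each call and would presumably still need the existing capacity-error handling inside each chunk.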


--
This message was sent by Atlassian Jira
(v8.20.7#820007)
