[jira] [Comment Edited] (ARROW-4713) [C++] Improve C++ Orc Adapter performance and memory footprint

Ying Zhou (Jira) Sun, 17 Jan 2021 20:48:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267009#comment-17267009
 ]


Ying Zhou edited comment on ARROW-4713 at 1/18/21, 4:47 AM:
------------------------------------------------------------

Hmm, this looks interesting. If @Yurui Zhou won’t take it, I potentially can. 
However, I don’t think I will have time for it before July. So if I take it, 
the work will have to wait about half a year and won’t be available in 4.0.


was (Author: yingzhou474):
Hmm, this looks interesting. I don’t think I will have time for it before July, 
though. So if I take it, the work will have to wait about half a year and won’t 
be available in 4.0.

> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
>                 Key: ARROW-4713
>                 URL: https://issues.apache.org/jira/browse/ARROW-4713
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yurui Zhou
>            Assignee: Yurui Zhou
>            Priority: Major
>              Labels: orc, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows 
> users to read ORC files into Arrow RecordBatches. However, this 
> implementation has several drawbacks:
>  * Inefficient conversion that incurs huge memcpy overhead
>  ** Currently the ORC adapter performs a byte-by-byte memcpy to move data 
> from the ORC VectorBatch to the Arrow RecordBatch, even though the ORC 
> VectorBatch shares the same memory layout as Arrow for most data types.
>  * Huge memory footprint because of the lack of a TableReader implementation
>  ** The ORC adapter currently only allows users to read data in units of 
> stripes. However, because ORC is a columnar format with a high compression 
> ratio, the data read from a single ORC stripe can potentially take up 
> gigabytes of memory, which makes the ORC adapter hard to use in production 
> environments.
> Here we propose a new ORC adapter implementation to fix the issues mentioned 
> above:
>  * To reduce conversion overhead, instead of performing a naive data copy, 
> the new adapter would take full advantage of the memory layout similarity 
> between ORC VectorBatch and Arrow RecordBatch. Namely, the new adapter will 
> perform pointer manipulation to transfer memory ownership from the 
> VectorBatch to the Arrow RecordBatch whenever possible.
>  * The new ORC adapter would give users row-level granularity when reading 
> data from an ORC file. The user should be able to specify how many rows to 
> expect in each output RecordBatch, and the ORC adapter should make sure that 
> no more than the requested number of rows is returned.
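
The two proposals above can be illustrated with a minimal, standard-library-only
sketch. It does not use the real ORC or Arrow APIs: SourceColumn, TargetColumn,
ConvertByCopy, ConvertByMove and RowLimitedReader are hypothetical names made up
for this illustration, and std::vector stands in for the actual buffers. It only
shows the general idea of adopting a buffer instead of copying it, and of
capping the number of rows returned per batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded ORC column (not the real ORC type).
struct SourceColumn {
  std::vector<int64_t> values;
};

// Hypothetical stand-in for an Arrow-side column (not the real Arrow type).
struct TargetColumn {
  std::vector<int64_t> values;
};

// Copying conversion: allocates a second buffer and copies every element.
TargetColumn ConvertByCopy(const SourceColumn& src) {
  return TargetColumn{src.values};
}

// Ownership-transfer conversion: the target adopts the source's buffer,
// so no element is copied and the source gives the buffer up.
TargetColumn ConvertByMove(SourceColumn&& src) {
  return TargetColumn{std::move(src.values)};
}

// Hypothetical row-granular reader: returns at most max_rows rows per call.
class RowLimitedReader {
 public:
  RowLimitedReader(std::vector<int64_t> rows, std::size_t max_rows)
      : rows_(std::move(rows)), max_rows_(max_rows) {
    assert(max_rows_ > 0);  // a zero-row batch size would never make progress
  }

  // Fills *out with the next batch; returns false once the data is exhausted.
  bool Next(std::vector<int64_t>* out) {
    out->clear();
    if (pos_ >= rows_.size()) return false;
    const std::size_t n = std::min(max_rows_, rows_.size() - pos_);
    const auto first = rows_.begin() + static_cast<std::ptrdiff_t>(pos_);
    out->assign(first, first + static_cast<std::ptrdiff_t>(n));
    pos_ += n;
    return true;
  }

 private:
  std::vector<int64_t> rows_;
  std::size_t max_rows_;
  std::size_t pos_ = 0;
};
```

In this sketch the moved column keeps the same underlying data pointer as the
source, which is the property the proposal relies on. A real adapter would also
have to handle the data types whose layouts differ between ORC and Arrow, and
those would still need a conversion step.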



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
