Remi Dettai created ARROW-8875:
----------------------------------

             Summary: [C++] use AWS SDK SetResponseStreamFactory to avoid a 
copy of bytes
                 Key: ARROW-8875
                 URL: https://issues.apache.org/jira/browse/ARROW-8875
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Remi Dettai


Currently, in `GetObjectRange` of s3fs, the `GetObjectRequest` has no 
`ResponseStreamFactory` assigned. This means that the bytes returned by the S3 
API are first written to a `std::basic_stringbuf`. To my understanding, this 
has two performance impacts:
 * `std::basic_stringbuf` uses a growing array to buffer the response, which 
means many reallocations as the response comes in
 * on top of that, there is a copy out of the `std::basic_stringbuf` when the 
data is read into the Arrow buffer.

This seems unnecessarily costly.

With `ResponseStreamFactory`, we might manage to get the data directly into the 
Arrow buffer.

I can give it a try, but I would need some advice. Is there an existing 
utility to stream data into an Arrow buffer (if it exists, it is well hidden!)? 
Or should I stream the data into a plain array and then transfer ownership to 
Arrow?
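
To make the idea concrete, here is a minimal sketch of a stream that writes 
straight into caller-owned memory instead of a growing `std::basic_stringbuf`. 
It uses only the standard library; `FixedBufferStreamBuf` and 
`FixedBufferIOStream` are hypothetical names, not existing Arrow or AWS SDK 
classes. As far as I understand, `SetResponseStreamFactory` expects a callable 
returning an `Aws::IOStream*` (a `std::iostream`), so a stream of this shape 
could be returned from the factory. That wiring, and how the SDK takes 
ownership of the stream, is an assumption and is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <streambuf>
#include <vector>

// Hypothetical helper: a streambuf whose put area is a fixed, caller-owned
// block of memory (for example the mutable data of a preallocated buffer).
class FixedBufferStreamBuf : public std::streambuf {
 public:
  FixedBufferStreamBuf(char* data, std::size_t size) {
    // All writes go directly into [data, data + size).
    setp(data, data + size);
  }

  // Number of bytes written into the destination so far.
  std::size_t bytes_written() const {
    return static_cast<std::size_t>(pptr() - pbase());
  }

  // overflow() is intentionally not overridden: the default returns EOF,
  // so writing past the end makes the owning stream fail instead of
  // reallocating.
};

// Hypothetical helper: a std::iostream that owns the streambuf above.
class FixedBufferIOStream : public std::iostream {
 public:
  FixedBufferIOStream(char* data, std::size_t size)
      : std::iostream(nullptr), buf_(data, size) {
    // buf_ is constructed after the base class, so attach it here.
    // rdbuf() also clears the error state set by the null streambuf.
    rdbuf(&buf_);
  }

  std::size_t bytes_written() const { return buf_.bytes_written(); }

 private:
  FixedBufferStreamBuf buf_;
};
```

The destination must be sized up front (for a range read the length is known 
from the request), and it must outlive the stream. A write that does not fit 
leaves the stream in a failed state, which the caller would have to check.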



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
