[jira] [Created] (HADOOP-19101) Vectored Read into off-heap buffer broken

Steve Loughran (Jira) Mon, 04 Mar 2024 12:46:18 -0800

Steve Loughran created HADOOP-19101:
---------------------------------------


             Summary: Vectored Read into off-heap buffer broken
                 Key: HADOOP-19101
                 URL: https://issues.apache.org/jira/browse/HADOOP-19101
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs, fs/azure
    Affects Versions: 3.3.6, 3.4.0
            Reporter: Steve Loughran
            Assignee: Steve Loughran



{{VectoredReadUtils.readInDirectBuffer()}} always starts reading at 
position zero, even when the range begins at a different offset. As a result, 
the buffer can be filled with data from the wrong part of the file.

The fix for this is straightforward: we pass in a FileRange and use its offset 
as the starting position.
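
A minimal, self-contained sketch of the corrected read loop follows. It is not the Hadoop source: {{PositionedSource}}, {{DirectBufferReadSketch}}, {{readRangeIntoDirectBuffer()}} and the 64 KiB temporary buffer size are hypothetical names and values chosen for illustration, and the stand-in interface omits the checked IOException a real stream would declare. The point it shows is only that the read position is seeded from the range offset rather than from zero.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a positioned-read stream; NOT the Hadoop API. */
interface PositionedSource {
  void readFully(long position, byte[] dest, int destOffset, int length);
}

final class DirectBufferReadSketch {
  /** Assumed size of the on-heap staging array; illustrative only. */
  static final int TMP_BUFFER_MAX_SIZE = 64 * 1024;

  /**
   * Copy rangeLength bytes starting at rangeOffset into the (direct) buffer,
   * staging through an on-heap array because a direct buffer has no
   * backing byte[] to read into.
   */
  static void readRangeIntoDirectBuffer(long rangeOffset, int rangeLength,
      ByteBuffer buffer, PositionedSource source) {
    if (rangeLength == 0) {
      return;
    }
    byte[] tmp = new byte[Math.min(TMP_BUFFER_MAX_SIZE, rangeLength)];
    // Start at the range's own offset. Starting at 0 here is the kind of
    // bug described above: the wrong part of the file would be read.
    long position = rangeOffset;
    int readBytes = 0;
    while (readBytes < rangeLength) {
      int chunk = Math.min(tmp.length, rangeLength - readBytes);
      source.readFully(position, tmp, 0, chunk);
      buffer.put(tmp, 0, chunk);
      position += chunk;
      readBytes += chunk;
    }
  }

  /** In-memory source used only to exercise the sketch. */
  static PositionedSource inMemory(byte[] data) {
    return (position, dest, destOffset, length) -> {
      if (position < 0 || position + length > data.length) {
        throw new IndexOutOfBoundsException("read past end of data");
      }
      System.arraycopy(data, (int) position, dest, destOffset, length);
    };
  }
}
```

Reading a range at a non-zero offset into a buffer from {{ByteBuffer.allocateDirect()}} and comparing it with the same slice of the source is enough to catch a regression of this kind.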

However, this does mean that all shipping releases 3.3.5-3.4.0 cannot safely 
perform vectored reads into direct buffers through HDFS, ABFS or Azure. Note 
that we have never seen this in production because the parquet and ORC 
libraries both read into on-heap storage.

Those libraries need to be audited to make sure that they never attempt to 
read into off-heap DirectBuffers. This is a bit trickier than you would think 
because an allocator is passed in. For PARQUET-2171 we will:
* only invoke the API on streams which explicitly declare their support for the 
API (so fallback in parquet itself)
* not invoke when direct buffer allocation is in use.
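
The two conditions above can be sketched as a single guard. This is an illustration only, not the PARQUET-2171 change: {{VectoredReadGuardSketch}}, {{shouldUseVectoredRead()}} and the capability string are hypothetical, and the stream's capability probe and the allocator's direct/heap flag are passed in as plain values so the example stays self-contained.

```java
import java.util.function.Predicate;

final class VectoredReadGuardSketch {
  /** Hypothetical capability name, used only for this illustration. */
  static final String VECTORED_CAPABILITY = "vectored-read";

  /**
   * Use the vectored read API only when the stream explicitly declares
   * support for it AND the allocator hands out on-heap buffers.
   *
   * @param streamHasCapability probe answering "does the stream declare X?"
   * @param allocatorIsDirect   true if buffers would be allocated off-heap
   */
  static boolean shouldUseVectoredRead(Predicate<String> streamHasCapability,
      boolean allocatorIsDirect) {
    return streamHasCapability.test(VECTORED_CAPABILITY) && !allocatorIsDirect;
  }
}
```

When the guard returns false, the caller would fall back to its existing non-vectored read path.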



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
