[jira] [Commented] (HADOOP-19101) Vectored Read into off-heap buffer broken in fallback implementation

2024-03-08 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824782#comment-17824782
 ] 

Steve Loughran commented on HADOOP-19101:
-

* Hive Tez is also safe
* Hive LLAP is exposed as it reads into off-heap buffers

> Vectored Read into off-heap buffer broken in fallback implementation
> 
>
> Key: HADOOP-19101
> URL: https://issues.apache.org/jira/browse/HADOOP-19101
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs, fs/azure
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Blocker
>
> {{VectoredReadUtils.readInDirectBuffer()}} always starts off reading at 
> position zero even when the range is at a different offset. As a result: you 
> can get incorrect information.
> Thanks for this is straightforward: we pass in a FileRange and use its offset 
> as the starting position.
> However, this does mean that all shipping releases 3.3.5-3.4.0 cannot safely 
> read vectorIO into direct buffers through HDFS, ABFS or GCS. Note that we 
> have never seen this in production because the parquet and ORC libraries both 
> read into on-heap storage.
> Those libraries needs to be audited to make sure that they never attempt to 
> read into off-heap DirectBuffers. This is a bit trickier than you would think 
> because an allocator is passed in. For PARQUET-2171 we will 
> * only invoke the API on streams which explicitly declare their support for 
> the API (so fallback in parquet itself)
> * not invoke when direct buffer allocation is in use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-19101) Vectored Read into off-heap buffer broken in fallback implementation

2024-03-06 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824167#comment-17824167
 ] 

Steve Loughran commented on HADOOP-19101:
-

Code review update
* Iceberg is safe reading parquet data
* Spark is safe reading directly (vectorized or not) and through iceberg

> Vectored Read into off-heap buffer broken in fallback implementation
> 
>
> Key: HADOOP-19101
> URL: https://issues.apache.org/jira/browse/HADOOP-19101
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs, fs/azure
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Blocker
>
> {{VectoredReadUtils.readInDirectBuffer()}} always starts off reading at 
> position zero even when the range is at a different offset. As a result: you 
> can get incorrect information.
> Thanks for this is straightforward: we pass in a FileRange and use its offset 
> as the starting position.
> However, this does mean that all shipping releases 3.3.5-3.4.0 cannot safely 
> read vectorIO into direct buffers through HDFS, ABFS or GCS. Note that we 
> have never seen this in production because the parquet and ORC libraries both 
> read into on-heap storage.
> Those libraries needs to be audited to make sure that they never attempt to 
> read into off-heap DirectBuffers. This is a bit trickier than you would think 
> because an allocator is passed in. For PARQUET-2171 we will 
> * only invoke the API on streams which explicitly declare their support for 
> the API (so fallback in parquet itself)
> * not invoke when direct buffer allocation is in use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-19101) Vectored Read into off-heap buffer broken in fallback implementation

2024-03-06 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824166#comment-17824166
 ] 

Steve Loughran commented on HADOOP-19101:
-

* tests didn't validate the default impl, just native/local FS and s3a, all of 
which get it right.
* I'd added the abfs contract tests and things blew up or hing (see the PR 
there), but before looking at those I was trying to follow the VectorReadUtils 
stuff and make sense of what was passed down and concluded that either I didn't 
understand it *or* the code was broken. After a while I concluded it had to be 
#2


> Vectored Read into off-heap buffer broken in fallback implementation
> 
>
> Key: HADOOP-19101
> URL: https://issues.apache.org/jira/browse/HADOOP-19101
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs, fs/azure
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Blocker
>
> {{VectoredReadUtils.readInDirectBuffer()}} always starts off reading at 
> position zero even when the range is at a different offset. As a result: you 
> can get incorrect information.
> Thanks for this is straightforward: we pass in a FileRange and use its offset 
> as the starting position.
> However, this does mean that all shipping releases 3.3.5-3.4.0 cannot safely 
> read vectorIO into direct buffers through HDFS, ABFS or GCS. Note that we 
> have never seen this in production because the parquet and ORC libraries both 
> read into on-heap storage.
> Those libraries needs to be audited to make sure that they never attempt to 
> read into off-heap DirectBuffers. This is a bit trickier than you would think 
> because an allocator is passed in. For PARQUET-2171 we will 
> * only invoke the API on streams which explicitly declare their support for 
> the API (so fallback in parquet itself)
> * not invoke when direct buffer allocation is in use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-19101) Vectored Read into off-heap buffer broken in fallback implementation

2024-03-06 Thread Harshit Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824079#comment-17824079
 ] 

Harshit Gupta commented on HADOOP-19101:


[~ste...@apache.org] how did you discover this issue? I thought we had tests 
that defined and changed the offset of the ranges being read irrespective of 
the offset? 

> Vectored Read into off-heap buffer broken in fallback implementation
> 
>
> Key: HADOOP-19101
> URL: https://issues.apache.org/jira/browse/HADOOP-19101
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs, fs/azure
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Blocker
>
> {{VectoredReadUtils.readInDirectBuffer()}} always starts off reading at 
> position zero even when the range is at a different offset. As a result: you 
> can get incorrect information.
> Thanks for this is straightforward: we pass in a FileRange and use its offset 
> as the starting position.
> However, this does mean that all shipping releases 3.3.5-3.4.0 cannot safely 
> read vectorIO into direct buffers through HDFS, ABFS or GCS. Note that we 
> have never seen this in production because the parquet and ORC libraries both 
> read into on-heap storage.
> Those libraries needs to be audited to make sure that they never attempt to 
> read into off-heap DirectBuffers. This is a bit trickier than you would think 
> because an allocator is passed in. For PARQUET-2171 we will 
> * only invoke the API on streams which explicitly declare their support for 
> the API (so fallback in parquet itself)
> * not invoke when direct buffer allocation is in use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org