[jira] [Commented] (HADOOP-19101) Vectored Read into off-heap buffer broken in fallback implementation

[ https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824782#comment-17824782 ]

Steve Loughran commented on HADOOP-19101:
-----------------------------------------
* Hive Tez is also safe
* Hive LLAP is exposed, as it reads into off-heap buffers

> Vectored Read into off-heap buffer broken in fallback implementation
>
> Key: HADOOP-19101
> URL: https://issues.apache.org/jira/browse/HADOOP-19101
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/azure
> Affects Versions: 3.4.0, 3.3.6
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Blocker
>
> {{VectoredReadUtils.readInDirectBuffer()}} always starts reading at
> position zero, even when the range is at a different offset. As a result,
> you can get incorrect data back.
> The fix for this is straightforward: we pass in a FileRange and use its
> offset as the starting position.
> However, this does mean that all shipping releases 3.3.5-3.4.0 cannot safely
> read vectorIO into direct buffers through HDFS, ABFS or GCS. Note that we
> have never seen this in production because the parquet and ORC libraries
> both read into on-heap storage.
> Those libraries need to be audited to make sure that they never attempt to
> read into off-heap DirectBuffers. This is a bit trickier than you would
> think because an allocator is passed in. For PARQUET-2171 we will:
> * only invoke the API on streams which explicitly declare their support for
>   the API (so fall back within parquet itself)
> * not invoke it when direct buffer allocation is in use.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
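The bug and its fix can be illustrated with a minimal Java sketch. This is not Hadoop's actual `VectoredReadUtils` code: the file and the positioned-read helper below are simplified stand-ins for `PositionedReadable.readFully()`, kept self-contained so the behaviour is checkable. The essential point from the issue is that the on-heap staging read must start at the range's offset, not at position zero.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class VectoredFallbackSketch {
    // Stand-in for a file: byte i of the "file" has value i.
    static final byte[] FILE = new byte[64];
    static {
        for (int i = 0; i < FILE.length; i++) {
            FILE[i] = (byte) i;
        }
    }

    // Simplified stand-in for PositionedReadable.readFully(position, buf, off, len).
    static void readFully(long position, byte[] buf, int off, int len) {
        System.arraycopy(FILE, (int) position, buf, off, len);
    }

    // Corrected fallback: stage the read in an on-heap array, starting at the
    // range's offset (the broken code always read from position zero), then
    // copy into the caller's direct buffer.
    static void readInDirectBuffer(long rangeOffset, int rangeLength, ByteBuffer direct) {
        byte[] staging = new byte[rangeLength];
        readFully(rangeOffset, staging, 0, rangeLength); // offset, NOT 0
        direct.put(staging);
        direct.flip();
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(8);
        readInDirectBuffer(16, 8, direct);
        byte[] got = new byte[8];
        direct.get(got);
        // Bytes 16..23 of the file, not 0..7 as the broken fallback returned.
        System.out.println(Arrays.toString(got));
        // → [16, 17, 18, 19, 20, 21, 22, 23]
    }
}
```

With the original bug, the same call would return bytes 0..7 regardless of the range's offset, which is why reads through the fallback path silently produced wrong data.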
[ https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824167#comment-17824167 ]

Steve Loughran commented on HADOOP-19101:
-----------------------------------------
Code review update:
* Iceberg is safe reading parquet data
* Spark is safe reading directly (vectorized or not) and through Iceberg
[ https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824166#comment-17824166 ]

Steve Loughran commented on HADOOP-19101:
-----------------------------------------
* The tests didn't validate the default implementation, just the native/local FS and s3a, all of which get it right.
* I'd added the abfs contract tests and things blew up or hung (see the PR there). But before looking at those, I was trying to follow the VectoredReadUtils code and make sense of what was passed down, and concluded that either I didn't understand it *or* the code was broken. After a while I concluded it had to be #2.
[ https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824079#comment-17824079 ]

Harshit Gupta commented on HADOOP-19101:
----------------------------------------
[~ste...@apache.org] how did you discover this issue? I thought we had tests that varied the offsets of the ranges being read?
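The PARQUET-2171 mitigation described in the issue (only invoke the vectored API on streams that explicitly declare support, and never when the caller's allocator hands out direct buffers) can be sketched as a simple gate. This is an illustrative sketch, not Parquet's actual code: the capability flag is taken as a boolean input rather than probing Hadoop's StreamCapabilities, and the allocator is modelled as the `IntFunction<ByteBuffer>` shape used by Hadoop's `readVectored`.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

public class VectoredReadGate {
    // Probe the caller's allocator with a tiny allocation to see whether it
    // produces direct (off-heap) buffers, which the broken fallback in
    // releases 3.3.5-3.4.0 cannot safely fill.
    static boolean allocatesDirect(IntFunction<ByteBuffer> allocator) {
        return allocator.apply(1).isDirect();
    }

    // Use the vectored read API only when the stream declares support for it
    // AND the allocator is on-heap; otherwise fall back to ordinary reads.
    static boolean useVectoredRead(boolean streamDeclaresSupport,
                                   IntFunction<ByteBuffer> allocator) {
        return streamDeclaresSupport && !allocatesDirect(allocator);
    }

    public static void main(String[] args) {
        System.out.println(useVectoredRead(true, ByteBuffer::allocate));       // true
        System.out.println(useVectoredRead(true, ByteBuffer::allocateDirect)); // false
        System.out.println(useVectoredRead(false, ByteBuffer::allocate));      // false
    }
}
```

The design choice here mirrors the two bullets in the issue: gating on declared support keeps the fallback path inside parquet itself, and gating on the allocator avoids ever handing a direct buffer to a possibly-broken fallback implementation.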