[ 
https://issues.apache.org/jira/browse/HADOOP-16202?focusedWorklogId=535586&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-535586
 ]

ASF GitHub Bot logged work on HADOOP-16202:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Jan/21 17:56
            Start Date: 13/Jan/21 17:56
    Worklog Time Spent: 10m 
      Work Description: steveloughran edited a comment on pull request #2584:
URL: https://github.com/apache/hadoop/pull/2584#issuecomment-759582014


   I think we should be more ambitious with the read policy than just "fadvise", 
because we can then use it as a declaration that lets the input streams tune all 
their parameters, e.g. buffer sizing and whether to do async prefetch. 
   
   Then we could allow stores to support not just seek policies, but also let 
callers declare what they are planning to read, e.g. "parquet-bytebuffer", to 
mean "I'm reading parquet files through the bytebuffer positioned read API".
   
   ```
   openFile("s3a://datasets/set1/input.parquet")
     .opt("fs.openfile.policy", "parquet-vectored, impala, parquet, random")
     .build().get()
   ```
   
   
   Example: `opt("fs.openfile.read.policy", "parquet-vectored, parquet, random")` 
to mean "optimise for impala for vectored IO, then generic vectored IO, then 
generic random IO". Store implementors would get to make their own decisions 
about what to set, based on profiling &c. Applications would need to set the 
policy on `openFile()`, so they would need to know which names to use. We can 
discuss that with them, maybe by predefining some options which *may* be 
supported.
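
   A minimal sketch of how a store might pick a policy from such a 
comma-separated list: walk the names in order and take the first one it 
recognises. `ReadPolicyResolver` and the set of supported names are 
hypothetical and purely illustrative; none of this is existing Hadoop API, 
and a real store could choose differently.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper, not part of Hadoop: resolves a read policy list. */
final class ReadPolicyResolver {
  private ReadPolicyResolver() {}

  /**
   * Return the first policy in the comma-separated list that this store
   * supports, or empty if none of the names is recognised.
   */
  static Optional<String> resolve(String policyList, Set<String> supported) {
    return Arrays.stream(policyList.split(","))
        .map(String::trim)                       // tolerate "a, b,c" spacing
        .filter(name -> !name.isEmpty())         // skip empty entries
        .map(name -> name.toLowerCase(Locale.ROOT))
        .filter(supported::contains)             // keep only known policies
        .findFirst();                            // caller's order is priority
  }

  public static void main(String[] args) {
    // Assumed: this store only knows the two generic policies.
    Set<String> supported = Set.of("random", "sequential");
    String chosen = resolve("parquet-vectored, parquet, random", supported)
        .orElse("default");
    System.out.println(chosen); // prints "random"
  }
}
```

   Unknown names are skipped rather than rejected, so applications can list 
specific policies first without breaking on stores that only know the 
generic ones.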


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 535586)
    Time Spent: 6h 40m  (was: 6.5h)

> Stabilize openFile() and adopt internally
> -----------------------------------------
>
>                 Key: HADOOP-16202
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16202
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3, tools/distcp
>    Affects Versions: 3.3.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The {{openFile()}} builder API lets us add new options when reading a file.
> Add an option {{"fs.s3a.open.option.length"}} which takes a long and allows 
> the length of the file to be declared. If set, *no check for the existence of 
> the file is issued when opening the file*.
> Also: allow withFileStatus() to take any FileStatus implementation, rather 
> than only S3AFileStatus, and do not check that the path matches the path 
> being opened. This is needed to support viewFS-style wrapping and mounting.
> Adopt it where appropriate, to stop clusters with S3A reads switched to 
> random IO from killing download/localization:
> * fs shell copyToLocal
> * distcp
> * IOUtils.copy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
