[
https://issues.apache.org/jira/browse/PARQUET-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634325#comment-17634325
]
ASF GitHub Bot commented on PARQUET-2213:
-----------------------------------------
steveloughran commented on code in PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#discussion_r1022697746
##########
parquet-common/src/main/java/org/apache/parquet/io/InputFile.java:
##########
@@ -41,4 +41,16 @@ public interface InputFile {
*/
SeekableInputStream newStream() throws IOException;
+ /**
+ * Open a new {@link SeekableInputStream} for the underlying data file,
+ * restricted to the byte range [offset, offset + length).
+ *
+ * @param offset the offset in the file to start reading from
+ * @param length the total number of bytes to read
+ * @return a new {@link SeekableInputStream} for the requested range
+ * @throws IOException if the stream cannot be opened
+ */
+ default SeekableInputStream newStream(long offset, long length) throws IOException {
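The excerpt ends at the signature. For orientation, a minimal sketch of what a
default body could look like, assuming the range is treated as a purely
advisory hint (this is not taken from the PR):

    // Hypothetical default body, NOT from the PR: treat (offset, length) as
    // an advisory hint, reuse the existing whole-file stream and seek to the
    // requested offset. A range-aware InputFile implementation would instead
    // pass the hint down to the storage layer when opening the stream.
    default SeekableInputStream newStream(long offset, long length) throws IOException {
      SeekableInputStream stream = newStream();
      stream.seek(offset);
      return stream;
    }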
Review Comment:
You should go with the Hadoop
https://issues.apache.org/jira/browse/HADOOP-16202 openFile() options; the s3a
filesystem now reads them, and that lines abfs/gcs up to do the same. You can
declare the split start/end as well as the file length, so that:
* file length: the client can skip existence probes, since it already knows
the file limit
* split range: the client knows not to prefetch past the end of the split, if
it prefetches at all
* read policy: a standard set of policies, plus a parse rule of "CSV list of
policies: pick the first one you recognise"; again, usable by all the stores.
A sketch of that openFile() usage follows below.
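For concreteness, here is a sketch of passing those hints through the
HADOOP-16202 openFile() builder (Hadoop 3.3.x+). The fs.option.openfile.* keys
are the standard option names from that work; the hadoopOpen helper and the
"random" policy choice are illustrative assumptions, not Parquet API:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.functional.FutureIO;

    // Illustrative helper: open a file with the standard openFile() hints.
    static FSDataInputStream hadoopOpen(FileSystem fs, Path path,
        long fileLength, long splitStart, long splitEnd) throws IOException {
      return FutureIO.awaitFuture(
          fs.openFile(path)
              // "random" is one of the standard policies; a CSV list is also
              // accepted, with the first recognised entry winning.
              .opt("fs.option.openfile.read.policy", "random")
              // Known file length: lets s3a skip the existence probe.
              .opt("fs.option.openfile.length", Long.toString(fileLength))
              // Split bounds: tells the store not to prefetch past the split.
              .opt("fs.option.openfile.split.start", Long.toString(splitStart))
              .opt("fs.option.openfile.split.end", Long.toString(splitEnd))
              .build());
    }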
> Add an alternative InputFile.newStream that allows an input range
> -----------------------------------------------------------------
>
> Key: PARQUET-2213
> URL: https://issues.apache.org/jira/browse/PARQUET-2213
> Project: Parquet
> Issue Type: Improvement
> Reporter: Chao Sun
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)