[
https://issues.apache.org/jira/browse/PARQUET-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634325#comment-17634325
]
ASF GitHub Bot commented on PARQUET-2213:
-----------------------------------------
steveloughran commented on code in PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#discussion_r1022697746
##########
parquet-common/src/main/java/org/apache/parquet/io/InputFile.java:
##########
@@ -41,4 +41,16 @@ public interface InputFile {
*/
SeekableInputStream newStream() throws IOException;
+ /**
+ * Open a new {@link SeekableInputStream} for the underlying data file,
+ * restricted to the byte range [offset, offset + length).
+ *
+ * @param offset the offset in the file to start reading from
+ * @param length the total number of bytes to read
+ * @return a new {@link SeekableInputStream} for the requested range
+ * @throws IOException if the stream cannot be opened
+ */
+ default SeekableInputStream newStream(long offset, long length) throws IOException {
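The excerpt ends at the signature. For orientation, a minimal sketch of what a
default body could look like, assuming the range is treated as a purely
advisory hint (this is not taken from the PR):

    // Hypothetical default body, NOT from the PR: treat (offset, length) as
    // an advisory hint, reuse the existing whole-file stream and seek to the
    // requested offset. A range-aware InputFile implementation would instead
    // pass the hint down to the storage layer when opening the stream.
    default SeekableInputStream newStream(long offset, long length) throws IOException {
      SeekableInputStream stream = newStream();
      stream.seek(offset);
      return stream;
    }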
Review Comment:
You should go with the Hadoop
https://issues.apache.org/jira/browse/HADOOP-16202 openFile() options; the s3a
filesystem now reads them, and that lines abfs/gcs up to do the same. You can
declare the split start/end as well as the file length, so that:
* file length: the client can skip existence probes, since it already knows
the file limit
* split range: the client knows not to prefetch past the end of the split, if
it prefetches at all
* read policy: a standard set of policies, plus a parse rule of "CSV list of
policies: pick the first one you recognise"; again, usable by all the stores.
A sketch of that openFile() usage follows below.
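For concreteness, here is a sketch of passing those hints through the
HADOOP-16202 openFile() builder (Hadoop 3.3.x+). The fs.option.openfile.* keys
are the standard option names from that work; the hadoopOpen helper and the
"random" policy choice are illustrative assumptions, not Parquet API:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.functional.FutureIO;

    // Illustrative helper: open a file with the standard openFile() hints.
    static FSDataInputStream hadoopOpen(FileSystem fs, Path path,
        long fileLength, long splitStart, long splitEnd) throws IOException {
      return FutureIO.awaitFuture(
          fs.openFile(path)
              // "random" is one of the standard policies; a CSV list is also
              // accepted, with the first recognised entry winning.
              .opt("fs.option.openfile.read.policy", "random")
              // Known file length: lets s3a skip the existence probe.
              .opt("fs.option.openfile.length", Long.toString(fileLength))
              // Split bounds: tells the store not to prefetch past the split.
              .opt("fs.option.openfile.split.start", Long.toString(splitStart))
              .opt("fs.option.openfile.split.end", Long.toString(splitEnd))
              .build());
    }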
> Add an alternative InputFile.newStream that allows an input range
> -----------------------------------------------------------------
>
> Key: PARQUET-2213
> URL: https://issues.apache.org/jira/browse/PARQUET-2213
> Project: Parquet
> Issue Type: Improvement
> Reporter: Chao Sun
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)