[
https://issues.apache.org/jira/browse/HADOOP-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050609#comment-18050609
]
ASF GitHub Bot commented on HADOOP-19767:
-----------------------------------------
anujmodi2021 commented on PR #8153:
URL: https://github.com/apache/hadoop/pull/8153#issuecomment-3723523348
> @anujmodi2021 I am trying to propose a single optimised implementation of
an input stream across cloud implementations, as I think we all need this kind
of logic. Ideally I want to get to a place where 80% of the logic is shared in
a common layer, and then we only implement cloud specific clients to actually
make the requests separately.
>
> There is some consensus to move the shared logic into the parquet-java
repo: https://lists.apache.org/thread/nbksq32cs8h1ldj8762y6wh9zzp8gqx6 , and
some buy-in from the team at Google. I'll be following up on this in the new
year.
>
> Would be great to get your thoughts and if your team would also like to
collaborate on this.
Thanks for the heads-up, @ahmarsuhail.
This sounds like a good plan to me as well. We will keep a close eye on the
updates in this thread and contribute wherever we can.
This change does not alter how ABFS handles Parquet files, though. It only
improves the infrastructure and adds the capability to plug in future
improvements seamlessly. We will help address any gaps in ABFS that affect the
common layer you are working toward.
> ABFS: [Read] Introduce Abfs Input Policy for detecting read patterns
> --------------------------------------------------------------------
>
> Key: HADOOP-19767
> URL: https://issues.apache.org/jira/browse/HADOOP-19767
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.4.2
> Reporter: Anuj Modi
> Assignee: Anuj Modi
> Priority: Major
> Labels: pull-request-available
>
> Since the inception of the ABFS driver, there has been a single
> implementation of AbfsInputStream. Different kinds of workloads require
> different heuristics to give the best performance for that type of workload.
> For example:
> # Sequential read workloads such as DFSIO and DistCp gain performance from
> prefetching.
> # Random read workloads, on the other hand, do not need prefetching; enabling
> it for them is pure overhead and is TPS-heavy.
> # Query workloads involving Parquet/ORC files benefit from optimizations such
> as footer reads and small-file reads.
> To accommodate this, we need to determine the read pattern and create an
> input stream implemented for that particular pattern.
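
The idea in the issue description can be illustrated with a minimal sketch.
Everything in it is hypothetical: the class, enum, and method names, the 16 KiB
footer window, and the classification heuristic are assumptions made for
illustration and are not taken from the ABFS driver or from PR #8153. It only
shows how a sequence of read offsets might be mapped to one of the three
workload types listed above, and how that result might gate prefetching.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from the ABFS driver. */
final class ReadPatternSketch {

    /** The three workload types named in the issue description. */
    enum ReadPattern { SEQUENTIAL, RANDOM, COLUMNAR }

    /** One read request: starting offset and number of bytes requested. */
    static final class Read {
        final long offset;
        final int length;

        Read(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Assumed threshold: a first read that starts within this many bytes of
    // the end of the file is treated as a Parquet/ORC-style footer read.
    static final long FOOTER_WINDOW_BYTES = 16 * 1024;

    /** Classifies an observed sequence of reads against a file of known length. */
    static ReadPattern classify(List<Read> reads, long fileLength) {
        if (reads.isEmpty()) {
            return ReadPattern.SEQUENTIAL; // assumed default when nothing is known
        }
        Read first = reads.get(0);
        // A first read that jumps straight to the tail suggests a footer read.
        if (first.offset > 0 && fileLength - first.offset <= FOOTER_WINDOW_BYTES) {
            return ReadPattern.COLUMNAR;
        }
        // Sequential only if every read starts exactly where the previous one ended.
        for (int i = 1; i < reads.size(); i++) {
            Read prev = reads.get(i - 1);
            if (reads.get(i).offset != prev.offset + prev.length) {
                return ReadPattern.RANDOM;
            }
        }
        return ReadPattern.SEQUENTIAL;
    }

    /** In this sketch, prefetching is enabled only for sequential reads. */
    static boolean shouldPrefetch(ReadPattern pattern) {
        return pattern == ReadPattern.SEQUENTIAL;
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(new Read(0, 4), new Read(4, 4), new Read(8, 4));
        ReadPattern pattern = classify(reads, 1_000_000L);
        System.out.println(pattern + " prefetch=" + shouldPrefetch(pattern));
    }
}
```

A real driver could equally take the pattern from caller-supplied
configuration instead of inferring it from offsets; the sketch picks inference
only because it is easy to show in a few lines.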