Ádám Szita created HIVE-23947:
---------------------------------
Summary: Cache affinity is unset for text files read by LLAP
Key: HIVE-23947
URL: https://issues.apache.org/jira/browse/HIVE-23947
Project: Hive
Issue Type: Bug
Reporter: Ádám Szita
Assignee: Ádám Szita
LLAP relies on HostAffinitySplitLocationProvider to always route the same
splits to the same LLAP daemons. With such a consistent distribution of data
among the nodes, we get a good cache hit ratio and thus good performance.
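To illustrate what consistent routing means here, the following is a minimal
sketch and not Hive's actual HostAffinitySplitLocationProvider code. The class
name AffinitySketch, the method pickDaemon, the daemon names and the use of
CRC32 with a simple modulo are all assumptions made for this example only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Illustrative only: NOT Hive's HostAffinitySplitLocationProvider. */
final class AffinitySketch {
    private AffinitySketch() {}

    /**
     * Maps a split, identified by file path and start offset, to one daemon.
     * The same (path, offset, daemon list) always yields the same daemon.
     */
    static String pickDaemon(String path, long startOffset, List<String> daemons) {
        CRC32 crc = new CRC32();
        crc.update((path + "@" + startOffset).getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the remainder is a valid index.
        int index = (int) (crc.getValue() % daemons.size());
        return daemons.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical daemon names and file path.
        List<String> daemons = Arrays.asList("llap-0", "llap-1", "llap-2");
        String first = pickDaemon("/warehouse/t/part-0.txt", 0L, daemons);
        String second = pickDaemon("/warehouse/t/part-0.txt", 0L, daemons);
        System.out.println(first + " " + second + " same=" + first.equals(second));
    }
}
```

Because the target depends only on the split's identity and the daemon list, a
re-run of the same query sends each split to the daemon that already cached it.
A plain modulo remaps most splits whenever the daemon list changes, so this is
only meant to show the idea, not a production-grade scheme.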
For text files this is almost never the case: HostAffinitySplitLocationProvider
is never used, because HS2 does not set the cache affinity flag in the job conf
for text input format content during compilation. The launched Tez AM then has
to rely on HDFS location information to route the splits (and therefore tasks)
to the executor nodes. This location information might not overlap well with
where the daemons actually run, and in the S3 case the Tez AM will mostly
choose executors at random.
As a result, the hit ratio hardly ever reaches 100%: each time we re-run the
same query, some disk/S3 reads still occur, causing poor performance. This
lasts until the same content has been populated into all the daemons (after
running the query tens or hundreds of times).
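The effect on the hit ratio can be shown with a toy simulation. This is a
sketch under simplifying assumptions, not a model of LLAP itself: each daemon's
cache is an unbounded set of split ids, nothing is ever evicted, and "random"
routing picks a daemon uniformly. The class name CacheWarmupSketch and all
parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Toy model: each daemon's cache is a set of split ids; nothing is evicted. */
final class CacheWarmupSketch {
    private CacheWarmupSketch() {}

    /**
     * Runs the same query `runs` times over `splits` splits and `daemons` daemons.
     * Returns the cache hit ratio observed in each run.
     */
    static double[] simulate(int daemons, int splits, int runs, boolean consistent, long seed) {
        Random random = new Random(seed);
        List<Set<Integer>> caches = new ArrayList<>();
        for (int d = 0; d < daemons; d++) {
            caches.add(new HashSet<>());
        }
        double[] hitRatios = new double[runs];
        for (int run = 0; run < runs; run++) {
            int hits = 0;
            for (int split = 0; split < splits; split++) {
                // Consistent routing: fixed target per split. Otherwise: uniform random target.
                int target = consistent ? split % daemons : random.nextInt(daemons);
                // add() returns false when the split was already cached there: a hit.
                if (!caches.get(target).add(split)) {
                    hits++;
                }
            }
            hitRatios[run] = (double) hits / splits;
        }
        return hitRatios;
    }

    public static void main(String[] args) {
        double[] consistent = simulate(4, 100, 10, true, 42L);
        double[] random = simulate(4, 100, 10, false, 42L);
        for (int run = 0; run < 10; run++) {
            System.out.printf("run %d: consistent=%.2f random=%.2f%n",
                    run + 1, consistent[run], random[run]);
        }
    }
}
```

In this model, consistent routing reaches a full hit ratio on the second run,
while random routing only approaches it as every daemon gradually caches every
split.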