Hello Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#3).

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating 
consistent remote placement
......................................................................

[WIP] IMPALA-8630: Hash the full path when calculating consistent remote 
placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

This is a proof of concept and is not ready to go.

Work remaining (if this is the approach we take):
 - Add some scheduler tests to simulate this approach
 - Look into an end-to-end test based on the Docker tests (which do remote IO)

Alternatives:
 - Hash the full path when reading the file metadata in the catalog
   and store it in FbFileDesc.
 - Construct the DescriptorTbl in the scheduler and use that to
   reconstruct the full path or partition path for a file and hash it.

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
6 files changed, 87 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/3
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 3
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to