[ https://issues.apache.org/jira/browse/HIVE-21910?focusedWorklogId=269225&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-269225 ]
ASF GitHub Bot logged work on HIVE-21910:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Jun/19 12:31
            Start Date: 28/Jun/19 12:31
    Worklog Time Spent: 10m
      Work Description: pvary commented on pull request #690: HIVE-21910: Multiple target location generation in HostAffinitySplitLocationProvider
URL: https://github.com/apache/hive/pull/690#discussion_r298573326

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HostAffinitySplitLocationProvider.java
 ##########
 @@ -72,11 +78,17 @@ public HostAffinitySplitLocationProvider(List<String> knownLocations) {
      FileSplit fsplit = (FileSplit) split;
      String splitDesc = "Split at " + fsplit.getPath() + " with offset= " + fsplit.getStart() + ", length=" + fsplit.getLength();
 -    List<String> preferredLocations = preferLocations(fsplit);
 -    String location =
 -        preferredLocations.get(determineLocation(preferredLocations, fsplit.getPath().toString(),
 -            fsplit.getStart(), splitDesc));
 -    return (location != null) ? new String[] { location } : null;
 +    List<String> preferredLocations = new ArrayList<>(preferLocations(fsplit));
 +    List<String> finalLocations = new ArrayList<>(numberOfLocations);
 +    // Generate new preferred locations until we need more, or we do not have any preferred
 +    // location left
 +    while (finalLocations.size() < numberOfLocations && preferredLocations.size() > 0) {
 +      String nextLocation = preferredLocations.get(determineLocation(preferredLocations,
 +          fsplit.getPath().toString(), fsplit.getStart(), splitDesc));
 +      finalLocations.add(nextLocation);
 +      preferredLocations.remove(nextLocation);

 Review comment:
   I did some measurements for the split generation with this code:
   `
     @Test (timeout = 5000000)
     public void testOrcSplitsBasic() throws IOException {
       HostAffinitySplitLocationProvider locationProvider =
           new HostAffinitySplitLocationProvider(executorLocations, true, 1);
       InputSplit os1 = createMockFileSplit(true, "path1", 0, 1000,
           new String[] {locations.get(0), locations.get(1), locations.get(2), locations.get(3)});
       long start = System.nanoTime();
       for (int i = 0; i < 100000; i++) {
         locationProvider.getLocations(os1);
       }
       // nanoseconds to milliseconds
       LOG.error("TIME: " + (System.nanoTime() - start) / 1000000);
     }
   `

   I got the following results:

   Original code (~6100ms for 100k requests):
   - 5859
   - 6511
   - 6813
   - 5721
   - 5663

   New code with 1 location (~5823ms for 100k requests):
   - 5877
   - 5621
   - 5613
   - 5883
   - 6120

   New code with 2 locations (~6579ms for 100k requests):
   - 6433
   - 6825
   - 6574
   - 6444
   - 6621

   I do not see why the new code should be faster, so this probably means there is high variation in the data.
   Generating 2 locations instead of 1 seems to add about a 10% overhead. Since this is 0.006ms per request, this seems reasonable to me.

   What is your opinion?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 269225)
    Time Spent: 1.5h  (was: 1h 20m)

> Multiple target location generation in HostAffinitySplitLocationProvider
> ------------------------------------------------------------------------
>
>                 Key: HIVE-21910
>                 URL: https://issues.apache.org/jira/browse/HIVE-21910
>             Project: Hive
>          Issue Type: Sub-task
>          Components: llap
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21910.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We need to generate multiple target locations by
> HostAffinitySplitLocationProvider, so we will have deterministic fallback
> nodes in case the target node is disabled

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
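
The loop in the diff quoted in the review comment picks one location at a time and removes it from the candidate list, so every later pick acts as a deterministic fallback. A minimal, self-contained sketch of that idea follows. The class name `FallbackLocationSketch`, the method `pickLocations`, and the hash-modulo version of `determineLocation` are hypothetical stand-ins for illustration only; they are not Hive's actual implementation, whose hashing logic is not shown in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class FallbackLocationSketch {

  // Hypothetical stand-in for determineLocation(): maps (path, start) to an index
  // into the candidate list. Hive's real method is not reproduced here.
  static int determineLocation(List<String> candidates, String path, long start) {
    int hash = Objects.hash(path, start);
    return Math.floorMod(hash, candidates.size());
  }

  // Picks up to numberOfLocations distinct candidates. Each pick is removed from
  // the remaining list, so the next iteration chooses among the others.
  static List<String> pickLocations(List<String> preferred, String path, long start,
      int numberOfLocations) {
    List<String> remaining = new ArrayList<>(preferred); // copy, the input is not mutated
    List<String> result = new ArrayList<>(numberOfLocations);
    while (result.size() < numberOfLocations && !remaining.isEmpty()) {
      // remove(int) deletes by index and returns the removed element
      String next = remaining.remove(determineLocation(remaining, path, start));
      result.add(next);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> hosts = Arrays.asList("host1", "host2", "host3", "host4");
    System.out.println(pickLocations(hosts, "path1", 0L, 2));
  }
}
```

Under these assumptions, the same inputs always yield the same ordered list, so the second entry can serve as the fallback when the first node is unavailable, which is the goal stated in the issue description.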