[ 
https://issues.apache.org/jira/browse/HIVE-21910?focusedWorklogId=269225&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-269225
 ]

ASF GitHub Bot logged work on HIVE-21910:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Jun/19 12:31
            Start Date: 28/Jun/19 12:31
    Worklog Time Spent: 10m 
      Work Description: pvary commented on pull request #690: HIVE-21910: 
Multiple target location generation in HostAffinitySplitLocationProvider
URL: https://github.com/apache/hive/pull/690#discussion_r298573326
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HostAffinitySplitLocationProvider.java
 ##########
 @@ -72,11 +78,17 @@ public HostAffinitySplitLocationProvider(List<String> knownLocations) {
     FileSplit fsplit = (FileSplit) split;
     String splitDesc = "Split at " + fsplit.getPath() + " with offset= " + fsplit.getStart()
         + ", length=" + fsplit.getLength();
-    List<String> preferredLocations = preferLocations(fsplit);
-    String location =
-        preferredLocations.get(determineLocation(preferredLocations, fsplit.getPath().toString(),
-            fsplit.getStart(), splitDesc));
-    return (location != null) ? new String[] { location } : null;
+    List<String> preferredLocations = new ArrayList<>(preferLocations(fsplit));
+    List<String> finalLocations = new ArrayList<>(numberOfLocations);
+    // Generate new preferred locations until we need more, or we do not have any preferred
+    // location left
+    while (finalLocations.size() < numberOfLocations && preferredLocations.size() > 0) {
+      String nextLocation = preferredLocations.get(determineLocation(preferredLocations,
+          fsplit.getPath().toString(), fsplit.getStart(), splitDesc));
+      finalLocations.add(nextLocation);
+      preferredLocations.remove(nextLocation);
 
 Review comment:
   I did some measurements for the split generation with this code:
   `
   @Test (timeout = 5000000)
   public void testOrcSplitsBasic() throws IOException {
     HostAffinitySplitLocationProvider locationProvider =
         new HostAffinitySplitLocationProvider(executorLocations, true, 1);
   
     InputSplit os1 = createMockFileSplit(true, "path1", 0, 1000,
         new String[] {locations.get(0), locations.get(1), locations.get(2), locations.get(3)});
   
     // Time 100k location lookups for the same split
     long start = System.nanoTime();
     for (int i = 0; i < 100000; i++) {
       locationProvider.getLocations(os1);
     }
     // Log the elapsed time in milliseconds
     LOG.error("TIME: " + (System.nanoTime() - start) / 1000000);
   }
   `
   
   I got the following results:
   Original code (~6100ms for 100k requests):
   - 5859
   - 6511
   - 6813
   - 5721
   - 5663
   
   New code with 1 location (~5823ms for 100k requests):
   - 5877
   - 5621
   - 5613
   - 5883
   - 6120
   
   New code with 2 locations (~6579ms for 100k requests):
   - 6433
   - 6825
   - 6574
   - 6444
   - 6621
   
   I do not see why the new code would be faster, so the gap between the
   original code and the new code with 1 location is most likely measurement
   noise. Generating 2 locations instead of 1 adds roughly 10% overhead. Since
   that is only about 0.006ms extra per request, it seems reasonable to me.
   
   What is your opinion?
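
The averages and the overhead estimate above can be rechecked from the raw timings. This is a minimal sketch, assuming a hypothetical helper class `SplitTimingSummary` that is not part of Hive; only the timing values come from the measurements listed in this comment.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Hive: summarizes the timings listed above.
public class SplitTimingSummary {

  // Mean run time in milliseconds per 100k requests.
  public static double mean(long[] runsMs) {
    return Arrays.stream(runsMs).average().orElse(Double.NaN);
  }

  // Relative overhead of candidate over baseline, in percent.
  public static double overheadPercent(double baselineMs, double candidateMs) {
    return (candidateMs / baselineMs - 1.0) * 100.0;
  }

  public static void main(String[] args) {
    long[] original = {5859, 6511, 6813, 5721, 5663};
    long[] oneLocation = {5877, 5621, 5613, 5883, 6120};
    long[] twoLocations = {6433, 6825, 6574, 6444, 6621};

    double orig = mean(original);
    double one = mean(oneLocation);
    double two = mean(twoLocations);

    System.out.printf(Locale.ROOT, "original=%.1f one=%.1f two=%.1f%n", orig, one, two);
    System.out.printf(Locale.ROOT, "two vs one: %.1f%%, two vs original: %.1f%%%n",
        overheadPercent(one, two), overheadPercent(orig, two));
    // Extra time per request in ms: difference of the means divided by 100k requests.
    System.out.printf(Locale.ROOT, "extra per request vs one: %.4f ms%n", (two - one) / 100000);
  }
}
```

With these numbers, 2 locations cost about 13% more than the new code with 1 location and about 8% more than the original code, which brackets the rough 10% figure. Five runs per variant is a small sample, so the percentages should be read as approximate.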
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 269225)
    Time Spent: 1.5h  (was: 1h 20m)

> Multiple target location generation in HostAffinitySplitLocationProvider
> ------------------------------------------------------------------------
>
>                 Key: HIVE-21910
>                 URL: https://issues.apache.org/jira/browse/HIVE-21910
>             Project: Hive
>          Issue Type: Sub-task
>          Components: llap
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21910.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HostAffinitySplitLocationProvider needs to generate multiple target 
> locations, so that we have deterministic fallback nodes in case the target 
> node is disabled
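
The idea in the description can be sketched as repeatedly picking one node from the remaining candidates and removing it, so the order of the fallbacks depends only on the split. This is an illustrative sketch, not the Hive implementation: the class `FallbackLocationsSketch` and the method `pickIndex` are hypothetical, and `pickIndex` uses a plain hash modulo as a stand-in for whatever `determineLocation` actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch, not part of Hive.
public class FallbackLocationsSketch {

  // Stand-in for determineLocation (assumption): maps (path, start) to an index in [0, size).
  public static int pickIndex(int size, String path, long start) {
    return Math.floorMod(Objects.hash(path, start), size);
  }

  // Picks up to numberOfLocations distinct nodes (assumed non-negative count),
  // in a deterministic order for a given split.
  public static List<String> pickLocations(List<String> candidates, String path, long start,
      int numberOfLocations) {
    List<String> remaining = new ArrayList<>(candidates);
    List<String> result = new ArrayList<>();
    // Keep picking while more locations are needed and candidates are left.
    while (result.size() < numberOfLocations && !remaining.isEmpty()) {
      int index = pickIndex(remaining.size(), path, start);
      // Removing the chosen node guarantees the next pick is a different node.
      result.add(remaining.remove(index));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
    System.out.println(pickLocations(nodes, "path1", 0L, 2));
  }
}
```

In this sketch the first entry is the same node that a single-location request would return, and the later entries serve as fallbacks in a fixed order.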



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
