Thanks Gopal:
            I have checked the Tez 0.6.0 code.

From the code, there is a locationHint there (I am not sure when the location
hint is set), but it looks like a task will try to reuse an existing container
before asking for a new one.
I have also seen that there is a warm-up function on the client side, so I think
that if I can warm up 1000 containers for one Tez job, that job should run fast.
My network is 10GbE, so network I/O is not a bottleneck in my case, and whether
the hint is data-aware is not that important to me.
But if I can make all tasks run at the same time, that would be good.
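
For the warm-up idea, this is roughly what I have in mind, assuming the TezClient
session API (PreWarmVertex / preWarmContainers); I have not verified the exact
signatures against 0.6.0, and the container count and size below are just
illustrative:

import java.io.IOException;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.PreWarmVertex;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.TezException;

public class PreWarmExample {
  public static void main(String[] args) throws TezException, IOException {
    TezConfiguration tezConf = new TezConfiguration();
    // Pre-warming only helps in session mode, where the held containers can be
    // reused by DAGs submitted to the same session.
    tezConf.setBoolean(TezConfiguration.TEZ_AM_SESSION_MODE, true);

    TezClient tezClient = TezClient.create("warm-session", tezConf);
    tezClient.start();

    // Ask the AM to bring up ~1000 containers of 4 GB / 1 vcore each before the
    // first DAG is submitted (sizes are made up for this sketch).
    PreWarmVertex preWarm =
        PreWarmVertex.create("prewarm", 1000, Resource.newInstance(4096, 1));
    tezClient.preWarmContainers(preWarm);

    // ... build and submit DAGs against the same session here ...

    tezClient.stop();
  }
}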


>set tez.grouping.split-waves=1.7
>the split-waves measures current queue capacity *1.7x to go wider than the
>actual available capacity
[skater] But my environment uses the default queue and the root user, so the *1.7
multiplier does not help in this case.

>set tez.grouping.min-size=16777216
[skater] What does this parameter mean? Is there a wiki where I can look up these options?
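
My current reading of these two knobs (please correct me if I am wrong):
split-waves multiplies the capacity currently available to the query to decide
how many grouped splits to aim for, and tez.grouping.min-size is the lower bound
on the size of a grouped split. A back-of-the-envelope calculation for my cluster:

    available vcores for the queue        1000
    tez.grouping.split-waves              1.7
    target grouped splits          ~ 1000 * 1.7 = 1700
    each grouped split stays >= tez.grouping.min-size (16777216 bytes = 16 MB),
    so for a small input the final task count can end up below 1700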


Thanks in advance,


Following is the code I found where a container is allocated for a task attempt:

if (locationHint != null) {
  TaskBasedLocationAffinity taskAffinity = locationHint.getAffinitizedTask();
  if (taskAffinity != null) {
    // Task-based affinity: try to place this attempt on the container that ran
    // a specific, already successful task attempt.
    Vertex vertex = appContext.getCurrentDAG().getVertex(taskAffinity.getVertexName());
    Preconditions.checkNotNull(vertex, "Invalid vertex in task based affinity " + taskAffinity
        + " for attempt: " + taskAttempt.getID());
    int taskIndex = taskAffinity.getTaskIndex();
    Preconditions.checkState(taskIndex >= 0 && taskIndex < vertex.getTotalTasks(),
        "Invalid taskIndex in task based affinity " + taskAffinity
        + " for attempt: " + taskAttempt.getID());
    TaskAttempt affinityAttempt = vertex.getTask(taskIndex).getSuccessfulAttempt();
    if (affinityAttempt != null) {
      Preconditions.checkNotNull(affinityAttempt.getAssignedContainerID(),
          affinityAttempt.getID());
      // Ask the scheduler for the specific container that ran the affinitized attempt.
      taskScheduler.allocateTask(taskAttempt,
          event.getCapability(),
          affinityAttempt.getAssignedContainerID(),
          Priority.newInstance(event.getPriority()),
          event.getContainerContext(),
          event);
      return;
    }
    LOG.info("Attempt: " + taskAttempt.getID() + " has task based affinity to " + taskAffinity
        + " but no locality information exists for it. Ignoring hint.");
    // fall through with null hosts/racks
  } else {
    // No task affinity: take the host/rack lists from the location hint, if present.
    hosts = (locationHint.getHosts() != null)
        ? locationHint.getHosts().toArray(new String[locationHint.getHosts().size()])
        : null;
    racks = (locationHint.getRacks() != null)
        ? locationHint.getRacks().toArray(new String[locationHint.getRacks().size()])
        : null;
  }
}

// Default path: ask the scheduler for a container using host/rack locality hints
// (both may be null).
taskScheduler.allocateTask(taskAttempt,
    event.getCapability(),
    hosts,
    racks,
    Priority.newInstance(event.getPriority()),
    event.getContainerContext(),
    event);
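
From the allocation code above, the locationHint would have to be set on the DAG
API side when the vertex is built. The sketch below is my guess at how that looks;
I have not checked these method names against the 0.6.0 API, so please treat them
as an assumption (the host and rack names are made up):

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

import org.apache.tez.dag.api.TaskLocationHint;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.VertexLocationHint;

public class LocationHintExample {
  // Attach per-task location hints to a vertex while the DAG is being built.
  static void addHints(Vertex vertex) {
    // One TaskLocationHint per task index; here a single task that prefers
    // host1/host2 on rack1.
    TaskLocationHint task0 = TaskLocationHint.createTaskLocationHint(
        new HashSet<String>(Arrays.asList("host1", "host2")),
        Collections.singleton("/rack1"));
    List<TaskLocationHint> perTask = Collections.singletonList(task0);
    vertex.setLocationHint(VertexLocationHint.create(perTask));
  }
}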






At 2015-04-11 12:58:06, "Gopal Vijayaraghavan" <[email protected]> wrote:
>
>>         I have a Hive full-scan job. With Hive on MR I can fully use the
>>whole cluster's 1000 CPU vcores (I use the split size to make the number of
>>mapper tasks 1200), but Tez only uses around 700 vcores even though I set the
>>same Hive split size. How do I configure Tez so that it fully uses all the
>>cluster resources?
>
>If you're on hive-1.0/later, the option to go wide is called
>tez.grouping.split-waves.
>
>With ORC, the regular MRv2 splits generate empty tasks (so that not all
>map-tasks process valid ranges).
>
>But to get it as wide as possible
>
>set mapred.max.split.size=33554432
>set tez.grouping.split-waves=1.7
>set tez.grouping.min-size=16777216
>
>should do the trick; the split-waves setting measures current queue capacity *
>1.7x to go wider than the actual available capacity.
>
>In previous versions (0.13/0.14), "set" commands don't work, so the
>options are prefixed by the tez.am.* - you have to do
>
>hive -hiveconf tez.am.grouping.split-waves=1.7 -hiveconf
>tez.grouping.min-size=16777216 -hiveconf mapred.max.split.size=33554432
>
>
>We hope to throw away these hacks in hive-1.2 & for this Prasanth checked
>in a couple of different split strategies for ORC in hive-1.2.0
>(ETL/BI/HYBRID) etc.
>
>I will probably send out my slides about ORC (incl. new split gen) after
>Hadoop Summit Europe, if you want more details.
>
>Ideally, any tests with the latest code would help me fix anything that's
>specific to your use-cases.
>
>
>Cheers,
>Gopal
