> From the code, there is a locationHint there (I am not sure when the
> location hint is set), but it looks like the task will reuse the
> container before going to ask for a new container.
Yes - then we end up being able to reuse a container, which has a lot of benefits when running 10,000 tasks of the same vertex. As long as we're getting a new split of the same vertex, we do not deserialize the Operator pipeline, since the plan is identical, and we cache other details which are the same across splits. Reuse also produces the JIT performance improvements, particularly when dealing with queries which take < 10s.

> And my network is 10GE, so I think network IO is not a bottleneck for
> my case, so whether to make the data-aware hint is not that important.
> But if I can make all tasks run at the same time, then it will be
> good.

The overhead of non-locality is not the network layer, but the DFS client & IO pipeline - there is a lot of DFS audit logging and data checksum verification which happens for each read sequence. On local disks, we bypass the HDFS daemon after the block open and read data off the disks directly via a unix domain socket mechanism called short-circuit reads - there's a highly optimized assembly crc32 checksum implementation there that works really fast.

We can actually max out a 100Gbps switch with just shuffles - not the single host ethernet card, but the 20 machines exchanging data while connected to a single switch. That said, you need to tune kernel parameters to get that level of performance - mount all dirs noatime,nodiratime, run nscd on all hosts, turn transparent huge page defrag off, increase somaxconn to > 4k, tune the wmem_max/rmem_max sizes, enable TCP offload on the NIC, and set vm.dirty_ratio to 50% as well as vm.dirty_background_ratio=30 (yes, that's what you need, sadly). Anyway, it only takes one maxed-out host to slow the whole query down.
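For reference, the sysctl side of that tuning looks roughly like the sketch below - note the wmem/rmem values and the eth0 interface name are my placeholders, not recommendations; size them for your NICs:

    # /etc/sysctl.d/99-shuffle.conf (sketch)
    net.core.somaxconn = 4096          # listen backlog > 4k
    net.core.wmem_max = 16777216       # socket send buffer cap (placeholder)
    net.core.rmem_max = 16777216       # socket receive buffer cap (placeholder)
    vm.dirty_ratio = 50                # % of RAM dirty before writers block
    vm.dirty_background_ratio = 30     # % of RAM dirty before background flush

    # THP defrag off; offload flags depend on your NIC (eth0 is a placeholder)
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    ethtool -K eth0 tso on gso on gro on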
That said, it is entirely possible that Tez is running your tasks in less time than it waits for locality (~400ms). I sent out the YARN swimlanes example to someone else earlier on the hive lists:

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh

You can see your locality delays (look for blank gaps between usage, or a staggering of the green line) - http://people.apache.org/~gopalv/query27.svg

If you want to send me something to analyze, please send me "yarn logs -applicationId <appid> | grep HISTORY", which is the raw data for that chart.

> set tez.grouping.min-size=16777216
> [skater] What does this parameter mean? Do we have a wiki to trace the
> options?

That parameter configures the equivalent of CombineInputFormat for Tez. Instead of combining file ranges, Tez groups existing InputSplit[] together - it keeps grouping splits until each group reaches 16MB, or until the group count adds up to 1.7x cluster capacity (with a max size of 1GB by default). This is necessary for Hive ACID (insert, update, delete transactions) to work correctly, since each split gets more than one file - a base file plus a sequence of delta files.
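The knobs behind that behaviour, with the defaults described above (a sketch - tune the min/max sizes to your data layout):

    set tez.grouping.min-size=16777216;    -- don't make groups smaller than 16MB
    set tez.grouping.max-size=1073741824;  -- don't make groups bigger than 1GB
    set tez.grouping.split-waves=1.7;      -- aim for ~1.7x the cluster's capacity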
We had some discussion around docs here recently:

http://mail-archives.apache.org/mod_mbox/tez-user/201504.mbox/%3CCAJqL3EKWiY4y9BxwLFoLBmu5yB4MSEjDW_59J98yK9joKx375w%40mail.gmail.com%3E

In hive-0.6, other than grouping, you can also set the min-held containers to your maximum, so that we don't let any containers expire between queries. This hurts concurrency, but since you want to run 1 big query on an entire cluster, this would help (as Bikas said). Run

    hive -hiveconf hive.prewarm.enabled=true -hiveconf hive.prewarm.numcontainers=1000

if you want to do a land-grab on the cluster & hold onto those containers indefinitely.

Since Tez does some predicate pruning during split generation, make sure you turn on 'hive.optimize.index.filter=true' and 'hive.optimize.ppd=true' to use the ROW INDEX data within the ORC files.

For more hive-specific questions, ask the hive user list.

Cheers,
Gopal
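P.S. Putting those session settings together as a sketch - the min-held containers property name below is the Tez one I mean (tez.am.session.min.held-containers); verify it against your Tez version:

    -- keep containers alive between queries (pair with the prewarm count)
    set tez.am.session.min.held-containers=1000;

    -- let split generation use the ORC row-index data for predicate pruning
    set hive.optimize.ppd=true;
    set hive.optimize.index.filter=true;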