Re: Question about the task assignment strategy

2012-09-12 Thread Hiroyuki Yamada
Hi, Thank you for re-running the experiments. I launched a job through Hive with the default TextInputFormat. The job is the TPC-H Q1 query, which is a simple selection query on the lineitem table. Each data file (data01...data14) is about 300GB, so about 4.2TB (=300GB*14) in total. I really

Question about the task assignment strategy

2012-09-11 Thread Hiroyuki Yamada
Hi, I want to make sure my understanding of task assignment in Hadoop is correct. When scanning a file with multiple tasktrackers, I am wondering how a task is assigned to each tasktracker. Is it based on the block sequence or on data locality? Let me explain my question by example.

Re: Question about the task assignment strategy

2012-09-11 Thread Hemanth Yamijala
Hi, Task assignment takes data locality into account first, not block sequence. In Hadoop, tasktrackers ask the jobtracker to be assigned tasks. When such a request comes to the jobtracker, it will try to look for an unassigned task which needs data that is close to the tasktracker and will
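The locality-first policy described above can be sketched as follows. This is a hypothetical simplification, not the actual JobTracker code: when a tasktracker heartbeats in, the scheduler first looks for a task whose input split has a replica on that node, and only then falls back to a non-local task.

```python
# Hypothetical sketch of locality-first task assignment (simplified;
# not Hadoop's actual JobTracker scheduler). Each pending task records
# the hosts holding replicas of its input split.

def assign_task(tasktracker_host, unassigned_tasks):
    """Prefer a data-local task; fall back to any remaining task."""
    for task in unassigned_tasks:
        if tasktracker_host in task["replica_hosts"]:
            unassigned_tasks.remove(task)
            return task, "data-local"
    if unassigned_tasks:
        # No local work for this node: hand out a non-local task.
        return unassigned_tasks.pop(0), "non-local"
    return None, None

tasks = [
    {"id": "t1", "replica_hosts": {"nodeA", "nodeB"}},
    {"id": "t2", "replica_hosts": {"nodeC"}},
]
task, locality = assign_task("nodeC", tasks)
# nodeC gets t2 as a data-local task; t1 would only be a non-local fallback.
```

The key point for the thread: block sequence never enters the decision; only the replica locations of each split do.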

Re: Question about the task assignment strategy

2012-09-11 Thread Hiroyuki Yamada
I figured out the cause. The HDFS block size is 128MB, but I specified mapred.min.split.size as 512MB, and data-local I/O processing goes wrong for some reason. When I remove the mapred.min.split.size configuration, the tasktrackers pick data-local tasks. Why does this happen? It seems like a bug. Split is
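The arithmetic behind the observed behavior can be worked through. In Hadoop's FileInputFormat the split size is computed roughly as max(minSplitSize, min(maxSplitSize, blockSize)), and a split's reported locations come from only one of the blocks it covers (an assumption worth verifying against the exact Hadoop version in the thread). So a 512MB minimum split over 128MB blocks makes each split span four blocks, and with replication factor 1 most of a split's bytes end up on other nodes even when the scheduler labels the task "local":

```python
# Split sizing as in Hadoop's FileInputFormat (approximately):
#   split_size = max(min_split_size, min(max_split_size, block_size))

def split_size(block_size, min_split=1, max_split=float("inf")):
    return max(min_split, min(max_split, block_size))

MB = 1024 * 1024
block = 128 * MB  # HDFS block size from the thread

# Default: one split per block, so every split can be fully data-local.
assert split_size(block) == 128 * MB

# With mapred.min.split.size = 512MB, each split covers 4 blocks. The
# split's locality is reported from one block's hosts, so with
# replication factor 1 roughly 3 of the 4 blocks are read remotely.
big = split_size(block, min_split=512 * MB)
assert big == 512 * MB
assert big // block == 4
```

This is consistent with the symptom in the thread: removing mapred.min.split.size restores one-block splits, and locality goes back to normal.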

Re: Question about the task assignment strategy

2012-09-11 Thread Hemanth Yamijala
Hi, I tried an experiment similar to yours but couldn't replicate the issue. I generated 64 MB files and added them to my DFS - one file from every machine, with a replication factor of 1, like you did. My block size was 64MB. I verified the blocks were located on the same machine as where I