Hi,
Thank you for replicating the experiments.
I launched a job through Hive with the default TextInputFormat.
The job is TPC-H query Q1, a simple selection query over the lineitem table.
Each data set (data01...data14) is about 300GB, so about
4.2TB (= 300GB * 14) in total.
I really
Hi,
I want to confirm whether my understanding of task assignment in Hadoop
is correct.
When scanning a file with multiple tasktrackers,
I am wondering how a task is assigned to each tasktracker.
Is it based on block sequence or data locality?
Let me explain my question with an example.
Hi,
Task assignment takes data locality into account first, not block
sequence. In Hadoop, tasktrackers ask the jobtracker to be assigned tasks.
When such a request reaches the jobtracker, it tries to find an
unassigned task whose input data is close to the tasktracker and will
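The locality-first lookup described above can be sketched roughly as follows. This is a minimal simulation, not Hadoop's actual JobTracker code; the class and host names are made up for illustration:

```java
import java.util.*;

// Hypothetical sketch: when a tasktracker asks for work, prefer a task
// whose input block has a replica on that tasktracker's host; only if
// no such task exists, fall back to any remaining (non-local) task.
public class LocalityScheduler {
    // task id -> hosts holding a replica of the task's input block
    private final Map<Integer, Set<String>> pending = new LinkedHashMap<>();

    public void addTask(int taskId, Set<String> replicaHosts) {
        pending.put(taskId, replicaHosts);
    }

    // Returns the assigned task id, or -1 if no tasks remain.
    public int assign(String trackerHost) {
        // First pass: look for a data-local task.
        for (Map.Entry<Integer, Set<String>> e : pending.entrySet()) {
            if (e.getValue().contains(trackerHost)) {
                pending.remove(e.getKey());
                return e.getKey();
            }
        }
        // Fallback: hand out any remaining task, even though it is not local.
        Iterator<Integer> it = pending.keySet().iterator();
        if (it.hasNext()) {
            int id = it.next();
            it.remove();
            return id;
        }
        return -1;
    }

    public static void main(String[] args) {
        LocalityScheduler s = new LocalityScheduler();
        s.addTask(0, new HashSet<>(Arrays.asList("hostA")));
        s.addTask(1, new HashSet<>(Arrays.asList("hostB")));
        System.out.println(s.assign("hostB")); // 1 (local task preferred)
        System.out.println(s.assign("hostB")); // 0 (non-local fallback)
    }
}
```

The real scheduler also distinguishes rack-local from off-rack tasks, but the idea is the same: locality decides the order, not the block sequence.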
I figured out the cause.
The HDFS block size is 128MB, but
I specified mapred.min.split.size as 512MB,
and data-local I/O processing goes wrong for some reason.
When I remove the mapred.min.split.size configuration,
tasktrackers pick data-local tasks.
Why does this happen?
It seems like a bug.
Split is
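For reference, Hadoop's FileInputFormat computes the split size as max(minSplitSize, min(maxSplitSize, blockSize)), so a 512MB minimum forces each split to span four 128MB blocks; those blocks' replicas may sit on different nodes, so no single tasktracker is local to the whole split, which would explain the drop in data-local tasks. A small sketch of that rule (the class name here is made up; only the formula mirrors FileInputFormat):

```java
// Sketch of the split-size rule used by Hadoop's FileInputFormat:
// splitSize = max(minSplitSize, min(maxSplitSize, blockSize)).
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long blockSize = 128 * MB;
        // Default minimum split size: splits align with 128MB blocks.
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE) / MB);        // 128
        // mapred.min.split.size = 512MB: each split now spans 4 blocks,
        // so a split is rarely fully local to one tasktracker.
        System.out.println(computeSplitSize(blockSize, 512 * MB, Long.MAX_VALUE) / MB); // 512
    }
}
```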
Hi,
I tried an experiment similar to yours but couldn't reproduce the issue.
I generated 64 MB files and added them to my DFS - one file from every
machine, with a replication factor of 1, like you did. My block size was
64MB. I verified the blocks were located on the same machine as where I