[jira] [Commented] (MAPREDUCE-2636) Scheduling over disks horizontally

2013-01-07 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546326#comment-13546326
 ] 

Eli Collins commented on MAPREDUCE-2636:


Btw, as of HDFS-3672 the disk location information is exposed.

 Scheduling over disks horizontally
 --

 Key: MAPREDUCE-2636
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2636
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: job submission
Reporter: Evert Lammerts
Priority: Minor

 Based on this message: 
 http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201106.mbox/browser
 The JT schedules tasks on nodes based on metadata it gets from the NN. The 
 namenode does not know on which disk a block resides. It might happen that on 
 a node running 4 tasks, all read from the same disk. This can affect 
 performance.
 An optimization might be to schedule horizontally over disks instead of 
 nodes. Any ideas?
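The idea in the description can be sketched as a toy simulation. Everything below (task, node, and disk names, and the greedy policy) is made up for illustration; it is not JobTracker code:

```python
from collections import defaultdict

def assign_tasks(tasks):
    """Greedy disk-aware assignment: each task lists the (node, disk)
    pairs holding a replica of its input block; pick the pair whose
    disk currently has the fewest tasks assigned to it."""
    load = defaultdict(int)          # (node, disk) -> tasks assigned so far
    assignment = {}
    for task, replicas in tasks.items():
        choice = min(replicas, key=lambda loc: load[loc])
        load[choice] += 1
        assignment[task] = choice
    return assignment

# Four tasks whose blocks all have a replica on node1/disk0: a purely
# node-level scheduler could send all four reads to that one disk,
# while this sketch spreads them over the available disks.
tasks = {
    "t1": [("node1", "disk0"), ("node2", "disk1")],
    "t2": [("node1", "disk0"), ("node1", "disk1")],
    "t3": [("node1", "disk0"), ("node3", "disk2")],
    "t4": [("node1", "disk0")],
}
print(assign_tasks(tasks))
```

With disk-level replica information, only t1 and t4 end up sharing disk0; without it, all four could collide there.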

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-2636) Scheduling over disks horizontally

2012-12-21 Thread Qinghe Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537755#comment-13537755
 ] 

Qinghe Jin commented on MAPREDUCE-2636:
---

Hi Steve, although the number of disks may be several times the number of 
nodes, only a few extra bits are needed to identify a disk. Does it really 
matter that much?

It's a good idea to consider output and intermediate data, but do we need to 
think about it for each task? I think the best configuration ensures the 
locality of each task, which means it reads from and writes to the same disk. 
That way it makes more sense to the scheduler and to the user.

Conflict detection is necessary. If we rush to assign tasks to busy nodes, it 
is not only harmful to the running tasks but will also cause load-imbalance 
problems. For conflict detection there are two approaches: 1) find out how 
many tasks are running on the node; 2) monitor the actual usage of the 
different resources (for disks, we can use disk wait time). I prefer the 
second method, since there may be more than one Hadoop deployment sharing the 
machines.
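The second approach (monitoring actual disk usage) can be sketched by sampling /proc/diskstats on Linux. The field positions below follow the documented kernel iostats layout, but the sample lines and numbers are invented, and this is a standalone simulation, not TaskTracker code:

```python
SAMPLE = """\
   8       0 sda 5437 120 250290 4020 9230 300 812044 15800 0 6100 19900
   8      16 sdb 5002 101 240100 3900 9100 250 800000 35600 2 14200 39700
"""

def busy_ms(diskstats_text):
    """Parse /proc/diskstats-style lines and return, per device, the
    milliseconds spent doing I/O (kernel iostats field 10 after the
    device name) and the weighted milliseconds (field 11), which also
    counts queued requests and so approximates disk wait time."""
    stats = {}
    for line in diskstats_text.splitlines():
        f = line.split()
        if len(f) >= 14:
            stats[f[2]] = {"io_ms": int(f[12]), "weighted_ms": int(f[13])}
    return stats

print(busy_ms(SAMPLE))
```

These counters are cumulative, so sampling twice and taking the difference gives per-disk utilisation over the interval; here sdb's much larger weighted time would mark it as the busier disk.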





[jira] [Commented] (MAPREDUCE-2636) Scheduling over disks horizontally

2011-07-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059852#comment-13059852
 ] 

Steve Loughran commented on MAPREDUCE-2636:
---

Right now the JT is ignorant of where blocks live on a server, only that they 
are server-local. Indeed, the Namenode doesn't know either. It would be quite a 
large change to add this information, and if it took up more Namenode memory, 
large-filesystem sites would be reluctant to adopt the improvement.

You'd also have to take into account not just the source disk, but the output 
disks, and maybe the location of any intermediate/overspill storage; the JT 
would need to know not just how many slots were free, but which disks each 
active task is reading and writing, either by having this data pushed to it or 
by checking prior to scheduling.

Like I said, a lot of work. Rather than rushing to do this, I'd recommend you 
come up with a way of measuring the conflict that is occurring. That way 
different clusters (with different numbers of disks per server) could get data 
on how much of an issue this is, and whether adding more HDDs to a server 
improves things or whether, as more tasks get executed on multicore CPUs, it 
gets worse.
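One way the suggested measurement could look, as a hypothetical sketch (this is not an existing Hadoop metric; the function and the task-to-disk data are made up): periodically record which disk each running task is reading and compute the fraction of tasks that share a disk.

```python
from collections import Counter

def disk_conflict(task_disks):
    """A simple contention metric: the fraction of concurrently running
    tasks that share their source disk with at least one other task.
    0.0 means every task has a disk to itself; higher values mean more
    tasks are competing for the same spindles."""
    counts = Counter(task_disks.values())
    shared = sum(n for n in counts.values() if n > 1)
    return shared / len(task_disks)

# Four tasks on a node with three disks: three of them read disk0.
print(disk_conflict({"t1": "disk0", "t2": "disk0",
                     "t3": "disk0", "t4": "disk2"}))  # 0.75
```

Logged over time on clusters with different disks-per-server ratios, a metric like this would show how often the conflict actually occurs before anyone invests in a disk-aware scheduler.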
