[jira] [Updated] (MAPREDUCE-6062) Use TestDFSIO test random read : job failed

2016-05-06 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-6062:

Assignee: (was: Koji Noguchi)

> Use TestDFSIO test random read : job failed
> ---
>
> Key: MAPREDUCE-6062
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6062
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: benchmarks
>Affects Versions: 2.2.0
> Environment: command : hadoop jar $JAR_PATH TestDFSIO-read -random 
> -nrFiles 12 -size 8000
>Reporter: chongyuanhuang
>
> This is log:
> 2014-09-01 13:57:29,876 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.lang.IllegalArgumentException: n must be 
> positive
>   at java.util.Random.nextInt(Random.java:300)
>   at 
> org.apache.hadoop.fs.TestDFSIO$RandomReadMapper.nextOffset(TestDFSIO.java:601)
>   at 
> org.apache.hadoop.fs.TestDFSIO$RandomReadMapper.doIO(TestDFSIO.java:580)
>   at 
> org.apache.hadoop.fs.TestDFSIO$RandomReadMapper.doIO(TestDFSIO.java:546)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:134)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:37)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> 2014-09-01 13:57:29,886 INFO [main] org.apache.hadoop.mapred.Task: Runnning 
> cleanup for the task
> 2014-09-01 13:57:29,894 WARN [main] 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
> hdfs://m101:8020/benchmarks/TestDFSIO/io_random_read/_temporary/1/_temporary/attempt_1409538816633_0005_m_01_0
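A plausible cause (an assumption, not confirmed in this thread): java.util.Random.nextInt(n) throws exactly this "n must be positive" error when n <= 0, and the trace points at TestDFSIO$RandomReadMapper.nextOffset. With -size 8000 (MB per file), the per-file byte size no longer fits in an int, so a cast of the file size to int would produce a negative bound. A minimal standalone sketch of that overflow:

{code}
import java.util.Random;

// Minimal sketch of the suspected overflow (an assumption about nextOffset,
// not the actual TestDFSIO source): casting a >2 GB file size to int yields a
// negative bound, and Random.nextInt then throws "n must be positive".
public class RandomReadOffsetOverflow {
  public static void main(String[] args) {
    long fileSizeBytes = 8000L * 1024 * 1024;   // -size 8000 => ~8 GB per file
    int bound = (int) fileSizeBytes;            // overflows to a negative int
    System.out.println("bound = " + bound);     // prints -201326592
    new Random().nextInt(bound);                // IllegalArgumentException: n must be positive
  }
}
{code}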






[jira] [Commented] (MAPREDUCE-5439) mapred-default.xml has missing properties

2013-07-31 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725651#comment-13725651
 ] 

Koji Noguchi commented on MAPREDUCE-5439:
-

I believe mapreduce.{map,reduce}.java.opts were intentionally left out of 
mapred-default.xml so that they won't override a user's mapred.child.java.opts 
setting.
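
A minimal sketch of the fallback this protects (assumed resolution order and default value, not the actual MR source): if mapred-default.xml shipped a value for mapreduce.map.java.opts, the task-specific key would always resolve and the user's mapred.child.java.opts would never be consulted.

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of the assumed fallback: the task-specific key wins when set,
// otherwise the legacy mapred.child.java.opts value is used.
public class JavaOptsFallback {
  static String mapJavaOpts(Configuration conf) {
    return conf.get("mapreduce.map.java.opts",
        conf.get("mapred.child.java.opts", "-Xmx200m"));   // default value is an assumption
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.set("mapred.child.java.opts", "-Xmx1024m");        // user's legacy setting
    // Because mapreduce.map.java.opts has no shipped default, the legacy value still applies:
    System.out.println(mapJavaOpts(conf));                  // prints -Xmx1024m
  }
}
{code}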


> mapred-default.xml has missing properties
> -
>
> Key: MAPREDUCE-5439
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5439
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 2.1.0-beta
>Reporter: Siddharth Wagle
> Fix For: 2.1.0-beta
>
>
> Properties that need to be added:
> mapreduce.map.memory.mb
> mapreduce.map.java.opts
> mapreduce.reduce.memory.mb
> mapreduce.reduce.java.opts
> Properties that need to be fixed:
> mapred.child.java.opts should not be in mapred-default.
> yarn.app.mapreduce.am.command-opts description needs fixing



[jira] [Resolved] (MAPREDUCE-114) All reducer tasks are finished, while some mapper tasks are still running

2013-04-11 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi resolved MAPREDUCE-114.


Resolution: Won't Fix

Fixed in YARN (MAPREDUCE-279). Not getting fixed in 0.20.*/1.*.

> All reducer tasks are finished, while some mapper tasks are still running
> -
>
> Key: MAPREDUCE-114
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-114
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Qi Liu
> Attachments: hadoop-bug-overview.png, hadoop-bug-useless-task.png
>
>
> In a high-load environment (i.e. multiple jobs are queued up to be executed), 
> when all reducer tasks of a job are finished, some mapper tasks of the same 
> job may still be running (possibly re-executed due to a lost task tracker, etc.).
> This should not happen when a job has at least one reducer task. When all 
> reducer tasks are in the SUCCEEDED state, the Hadoop JobTracker should kill all 
> running mapper tasks, since their execution would be meaningless. The job should 
> also switch to the SUCCEEDED state when all reducer tasks of that job have 
> completed successfully.



[jira] [Updated] (MAPREDUCE-3688) Need better Error message if AM is killed/throws exception

2013-03-06 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-3688:


Attachment: mapreduce-3688-h0.23-v02.patch

Another common error is the ApplicationMaster going out of memory when the number 
of tasks is large.  Adding the error message to stdout so that the OOM shows up.

{quote}
Diagnostics: Application application_1362579399138_0003 failed 1 times due 
to AM Container for appattempt_1362579399138_0003_01 exited with exitCode: 
255 due to: Error starting MRAppMaster: java.lang.OutOfMemoryError: Java heap 
space at 
{quote}

Forgot to mention, but having these messages in the UI also means they show up 
on the jobclient (console) side as well.
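
A rough sketch of the approach (hypothetical names, not the attached patch): catch the startup failure, including OutOfMemoryError, and echo it to the otherwise unused stdout so it lands in the container log and in the diagnostics.

{code}
// Sketch only: stand-ins for the real MRAppMaster startup path.
public class AppMasterStartupSketch {

  // Hypothetical stand-in; simulate the failure mode seen above.
  static void initAndStartAppMaster() throws Exception {
    throw new OutOfMemoryError("Java heap space");
  }

  public static void main(String[] args) {
    try {
      initAndStartAppMaster();
    } catch (Throwable t) {
      // Echo to stdout, which the AM otherwise leaves unused, so the
      // container stdout log and the diagnostics show the real cause.
      System.out.println("Error starting MRAppMaster: " + t);
      t.printStackTrace(System.out);
      System.exit(1);
    }
  }
}
{code}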

> Need better Error message if AM is killed/throws exception
> --
>
> Key: MAPREDUCE-3688
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3688
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am, mrv2
>Affects Versions: 0.23.1
>Reporter: David Capwell
>Assignee: Sandy Ryza
> Fix For: 0.23.2
>
> Attachments: mapreduce-3688-h0.23-v01.patch, 
> mapreduce-3688-h0.23-v02.patch
>
>
> We need better error messages in the UI if the AM gets killed or throws an 
> Exception.
> If the following error gets thrown: 
> java.lang.NumberFormatException: For input string: "9223372036854775807l" // 
> last char is an L
> then the UI should say this exception.  Instead I get the following:
> Application application_1326504761991_0018 failed 1 times due to AM Container 
> for appattempt_1326504761991_0018_01
> exited with exitCode: 1 due to: Exception from container-launch: 
> org.apache.hadoop.util.Shell$ExitCodeException



[jira] [Updated] (MAPREDUCE-3688) Need better Error message if AM is killed/throws exception

2013-03-05 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-3688:


Attachment: mapreduce-3688-h0.23-v01.patch

This has been a pain for our users as well.

I don't think this patch will fly well with the reviewers, but maybe it'll help 
move the discussion forward. 

I didn't see a good way of communicating the error message to the caller, so I 
decided to sacrifice the stdout stream, which the current MRAppMaster does not use. 

After the patch, the web UI would show:

{quote}
Diagnostics: Application application_1362527487477_0005 failed 1 times due 
to AM Container for appattempt_1362527487477_0005_01 exited with exitCode: 
1 due to: Error starting MRAppMaster: org.apache.hadoop.yarn.YarnException: 
java.io.IOException: Split metadata size exceeded 20. Aborting job 
job_1362527487477_0005 at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1290)
 at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1146)
 at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1118)
 at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:382)
 at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
 at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
 at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
 at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:823) at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:121) at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1094)
 at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:998) 
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1273) 
at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1221)
 at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1269)
 at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1226) 
Caused by: java.io.IOException: Split metadata size exceeded 20. Aborting job 
job_1362527487477_0005 at 
org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:53)
 at 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1285)
 ... 16 more .Failing this attempt.. Failing the application.
{quote}

(This patch is based on 0.23)

> Need better Error message if AM is killed/throws exception
> --
>
> Key: MAPREDUCE-3688
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3688
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am, mrv2
>Affects Versions: 0.23.1
>Reporter: David Capwell
>Assignee: Sandy Ryza
> Fix For: 0.23.2
>
> Attachments: mapreduce-3688-h0.23-v01.patch
>
>
> We need better error messages in the UI if the AM gets killed or throws an 
> Exception.
> If the following error gets thrown: 
> java.lang.NumberFormatException: For input string: "9223372036854775807l" // 
> last char is an L
> then the UI should say this exception.  Instead I get the following:
> Application application_1326504761991_0018 failed 1 times due to AM Container 
> for appattempt_1326504761991_0018_01
> exited with exitCode: 1 due to: Exception from container-launch: 
> org.apache.hadoop.util.Shell$ExitCodeException



[jira] [Updated] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x

2012-12-06 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-4499:


Description: 
When there are lots of jobs and tasks active in a cluster, the process of 
figuring out whether or not to launch a speculative task becomes very 
expensive. 

I could be missing something but it certainly looks like on every heartbeat we 
could be scanning 10's of thousands of tasks looking for something which might 
need to be speculatively executed. In most cases, nothing gets chosen so we 
completely trashed our data cache and didn't even find a task to schedule, just 
to do it all over again on the next heartbeat.

On busy jobtrackers, the following backtrace is very common:

"IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800
nid=0xb50 runnable [0x45adb000]
  java.lang.Thread.State: RUNNABLE
   at java.util.TreeMap.valEquals(TreeMap.java:1182)
   at java.util.TreeMap.containsValue(TreeMap.java:227)
   at java.util.TreeMap$Values.contains(TreeMap.java:940)
   at
org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
   at
org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
   - locked <0x2aaefde82338> (a
org.apache.hadoop.mapred.JobInProgress)
   at
org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
   - locked <0x2aaefde82338> (a
org.apache.hadoop.mapred.JobInProgress)
   at
org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
   - locked <0x2aaefde82338> (a
org.apache.hadoop.mapred.JobInProgress)
   at
org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
   at
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
   at
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
   at
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
   at
org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
   at
org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
   - locked <0x2aab6e27a4c8> (a
org.apache.hadoop.mapred.CapacityTaskScheduler)
   at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
   - locked <0x2aab6e191278> (a org.apache.hadoop.mapred.JobTracker)
...)

  was:  


> Looking for speculative tasks is very expensive in 1.x
> --
>
> Key: MAPREDUCE-4499
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv1, performance
>Affects Versions: 1.0.3
>Reporter: Nathan Roberts
>Assignee: Koji Noguchi
> Fix For: 1.2.0
>
> Attachments: mapreduce-4499-v1.0.2-1.patch
>
>
> When there are lots of jobs and tasks active in a cluster, the process of 
> figuring out whether or not to launch a speculative task becomes very 
> expensive. 
> I could be missing something but it certainly looks like on every heartbeat 
> we could be scanning 10's of thousands of tasks looking for something which 
> might need to be speculatively executed. In most cases, nothing gets chosen 
> so we completely trashed our data cache and didn't even find a task to 
> schedule, just to do it all over again on the next heartbeat.
> On busy jobtrackers, the following backtrace is very common:
> "IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800
> nid=0xb50 runnable [0x45adb000]
>   java.lang.Thread.State: RUNNABLE
>at java.util.TreeMap.valEquals(TreeMap.java:1182)
>at java.util.TreeMap.containsValue(TreeMap.java:227)
>at java.util.TreeMap$Values.contains(TreeMap.java:940)
>at
> org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
>at
> org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
>- locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
>at
> org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
>- locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
>at
> org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
>- locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
>at
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
>at
> org.apache.hadoop.mapred.CapacityTaskSched

[jira] [Updated] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x

2012-12-06 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-4499:


Description: (was: When there are lots of jobs and tasks active in a 
cluster, the process of figuring out whether or not to launch a speculative 
task becomes very expensive. 

I could be missing something but it certainly looks like on every heartbeat we 
could be scanning 10's of thousands of tasks looking for something which might 
need to be speculatively executed. In most cases, nothing gets chosen so we 
completely trashed our data cache and didn't even find a task to schedule, just 
to do it all over again on the next heartbeat.

On busy jobtrackers, the following backtrace is very common:

"IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800
nid=0xb50 runnable [0x45adb000]
   java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.valEquals(TreeMap.java:1182)
at java.util.TreeMap.containsValue(TreeMap.java:227)
at java.util.TreeMap$Values.contains(TreeMap.java:940)
at
org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
at
org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
- locked <0x2aaefde82338> (a
org.apache.hadoop.mapred.JobInProgress)
at
org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
- locked <0x2aaefde82338> (a
org.apache.hadoop.mapred.JobInProgress)
at
org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
- locked <0x2aaefde82338> (a
org.apache.hadoop.mapred.JobInProgress)
at
org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
at
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
at
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
at
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
at
org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
at
org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
- locked <0x2aab6e27a4c8> (a
org.apache.hadoop.mapred.CapacityTaskScheduler)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
- locked <0x2aab6e191278> (a org.apache.hadoop.mapred.JobTracker)
...)

> Looking for speculative tasks is very expensive in 1.x
> --
>
> Key: MAPREDUCE-4499
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv1, performance
>Affects Versions: 1.0.3
>Reporter: Nathan Roberts
>Assignee: Koji Noguchi
> Fix For: 1.2.0
>
> Attachments: mapreduce-4499-v1.0.2-1.patch
>
>
>   



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2012-11-27 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504767#comment-13504767
 ] 

Koji Noguchi commented on MAPREDUCE-4819:
-

bq. I don't want the correctness of the job to depend on the marker on hdfs.

I meant HDFS in user space, like the output path.  If this is stored elsewhere, 
where the user cannot access it, I have no problem.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2012-11-27 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504758#comment-13504758
 ] 

Koji Noguchi commented on MAPREDUCE-4819:
-

bq. like the client never being notified at all because the AM crashes after 
unregistering with the RM but before it notifies the client.

As long as the client eventually fails, that's not a problem.

The critical problem we have here is a false positive from the client's perspective:
the client gets 'success' but the output is incomplete or corrupt (due to a retried 
application/job (over)writing the same target path).

If we can have the AM and RM agree on the job status before telling the client, 
I think that would work.  There could be a corner case where the AM and RM say the 
job was successful but the client thinks it failed. That false negative is much 
better than the false-positive issue we have now.  Even in 0.20, we had cases where 
the JobTracker reported the job as successful but the client thought it failed due 
to a communication failure with the JobTracker.  That is fine to happen, and we 
should let the client handle the recovery-or-retry.


bq. In general it seems like we need to come up with a set of markers that 
previous AM's leave behind

I don't want the correctness of the job to depend on the marker on hdfs.




> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Updated] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x

2012-08-16 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-4499:


Attachment: mapreduce-4499-v1.0.2-1.patch

Attaching a patch with the if/else rewrite.  It tries to change the order of the 
boolean conditions without changing the logic.


> Looking for speculative tasks is very expensive in 1.x
> --
>
> Key: MAPREDUCE-4499
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv1, performance
>Affects Versions: 1.0.3
>Reporter: Nathan Roberts
> Attachments: mapreduce-4499-v1.0.2-1.patch
>
>
> When there are lots of jobs and tasks active in a cluster, the process of 
> figuring out whether or not to launch a speculative task becomes very 
> expensive. 
> I could be missing something but it certainly looks like on every heartbeat 
> we could be scanning 10's of thousands of tasks looking for something which 
> might need to be speculatively executed. In most cases, nothing gets chosen 
> so we completely trashed our data cache and didn't even find a task to 
> schedule, just to do it all over again on the next heartbeat.
> On busy jobtrackers, the following backtrace is very common:
> "IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800
> nid=0xb50 runnable [0x45adb000]
>java.lang.Thread.State: RUNNABLE
> at java.util.TreeMap.valEquals(TreeMap.java:1182)
> at java.util.TreeMap.containsValue(TreeMap.java:227)
> at java.util.TreeMap$Values.contains(TreeMap.java:940)
> at
> org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
> at
> org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
> - locked <0x2aab6e27a4c8> (a
> org.apache.hadoop.mapred.CapacityTaskScheduler)
> at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
> - locked <0x2aab6e191278> (a org.apache.hadoop.mapred.JobTracker)
> ...





[jira] [Commented] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x

2012-08-16 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436225#comment-13436225
 ] 

Koji Noguchi commented on MAPREDUCE-4499:
-

Looked at one of the busy JobTrackers.  Attached btrace for a couple of seconds 
and counted the booleans.

Out of 2093791 JobInProgress.findSpeculativeTask calls, 2437 of them had 
shouldRemove=true. 
Out of 2213670 TaskInProgress.hasSpeculativeTask calls, 137 of them were 
'true'.  

Of course these numbers differ from cluster to cluster, but I believe it shows 
the possibility of some savings. 


> Looking for speculative tasks is very expensive in 1.x
> --
>
> Key: MAPREDUCE-4499
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv1, performance
>Affects Versions: 1.0.3
>Reporter: Nathan Roberts
>
> When there are lots of jobs and tasks active in a cluster, the process of 
> figuring out whether or not to launch a speculative task becomes very 
> expensive. 
> I could be missing something but it certainly looks like on every heartbeat 
> we could be scanning 10's of thousands of tasks looking for something which 
> might need to be speculatively executed. In most cases, nothing gets chosen 
> so we completely trashed our data cache and didn't even find a task to 
> schedule, just to do it all over again on the next heartbeat.
> On busy jobtrackers, the following backtrace is very common:
> "IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800
> nid=0xb50 runnable [0x45adb000]
>java.lang.Thread.State: RUNNABLE
> at java.util.TreeMap.valEquals(TreeMap.java:1182)
> at java.util.TreeMap.containsValue(TreeMap.java:227)
> at java.util.TreeMap$Values.contains(TreeMap.java:940)
> at
> org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
> at
> org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
> - locked <0x2aab6e27a4c8> (a
> org.apache.hadoop.mapred.CapacityTaskScheduler)
> at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
> - locked <0x2aab6e191278> (a org.apache.hadoop.mapred.JobTracker)
> ...





[jira] [Commented] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x

2012-08-15 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435573#comment-13435573
 ] 

Koji Noguchi commented on MAPREDUCE-4499:
-

Assuming tip.hasRunOnMachine is expensive, we can try reordering 
the if/else so that we call it less often.

From JobInProgress.java:
{noformat}
  if (!tip.hasRunOnMachine(ttStatus.getHost(),
                           ttStatus.getTrackerName())) {
    if (tip.hasSpeculativeTask(currentTime, avgProgress)) {
      // In case of shared list we don't remove it. Since the TIP failed
      // on this tracker can be scheduled on some other tracker.
      if (shouldRemove) {
        iter.remove(); //this tracker is never going to run it again
      }
      return tip;
    }
  } else {
    // Check if this tip can be removed from the list.
    // If the list is shared then we should not remove.
    if (shouldRemove) {
      // This tracker will never speculate this tip
      iter.remove();
    }
  }
}
{noformat}

Checking the action for each true/false combination:
{noformat}
tip.hasRun  tip.hasSpeculative  shouldRemove  Action
    F               F                F        -
    F               F                T        -
    F               T                F        return tip
    F               T                T        iter.remove() & return tip
    T               F                F        -
    T               F                T        iter.remove()
    T               T                F        -
    T               T                T        iter.remove()
{noformat}

Can we rewrite the above logic to
{noformat}
if (tip.hasSpeculative) {
  if (shouldRemove) {
    iter.remove();
  }
  if (!tip.hasRun) {
    return tip;
  }
} else {
  if (shouldRemove && tip.hasRun) {
    iter.remove();
  }
}
{noformat}


From the jstacks we see, I can tell that shouldRemove is often 'false' in our 
case.  Depending on the value of tip.hasSpeculative, we may reduce the number of 
tip.hasRun calls with this rewrite.
(I don't know how often 'hasSpeculative' is true.)
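
For what it's worth, a small standalone check (a sketch, not part of the patch) that enumerates all eight combinations from the table above and confirms the rewrite takes the same action as the original logic:

{code}
public class SpeculativeRewriteCheck {
  // Action encoded as: bit 0 = iter.remove(), bit 1 = return tip.
  static int original(boolean hasRun, boolean hasSpec, boolean shouldRemove) {
    if (!hasRun) {
      if (hasSpec) {
        return (shouldRemove ? 1 : 0) | 2;   // maybe remove, then return tip
      }
      return 0;                              // nothing to do
    }
    return shouldRemove ? 1 : 0;             // already ran here: maybe remove
  }

  static int rewritten(boolean hasRun, boolean hasSpec, boolean shouldRemove) {
    int action = 0;
    if (hasSpec) {
      if (shouldRemove) action |= 1;
      if (!hasRun)      action |= 2;
    } else {
      if (shouldRemove && hasRun) action |= 1;
    }
    return action;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 8; i++) {
      boolean hasRun = (i & 4) != 0, hasSpec = (i & 2) != 0, shouldRemove = (i & 1) != 0;
      if (original(hasRun, hasSpec, shouldRemove) != rewritten(hasRun, hasSpec, shouldRemove)) {
        throw new AssertionError("mismatch at hasRun=" + hasRun
            + " hasSpec=" + hasSpec + " shouldRemove=" + shouldRemove);
      }
    }
    System.out.println("original and rewritten logic agree on all 8 cases");
  }
}
{code}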


> Looking for speculative tasks is very expensive in 1.x
> --
>
> Key: MAPREDUCE-4499
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv1, performance
>Affects Versions: 1.0.3
>Reporter: Nathan Roberts
>
> When there are lots of jobs and tasks active in a cluster, the process of 
> figuring out whether or not to launch a speculative task becomes very 
> expensive. 
> I could be missing something but it certainly looks like on every heartbeat 
> we could be scanning 10's of thousands of tasks looking for something which 
> might need to be speculatively executed. In most cases, nothing gets chosen 
> so we completely trashed our data cache and didn't even find a task to 
> schedule, just to do it all over again on the next heartbeat.
> On busy jobtrackers, the following backtrace is very common:
> "IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800
> nid=0xb50 runnable [0x45adb000]
>java.lang.Thread.State: RUNNABLE
> at java.util.TreeMap.valEquals(TreeMap.java:1182)
> at java.util.TreeMap.containsValue(TreeMap.java:227)
> at java.util.TreeMap$Values.contains(TreeMap.java:940)
> at
> org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
> at
> org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
> - locked <0x2aaefde82338> (a
> org.apache.hadoop.mapred.JobInProgress)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
> at
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
> at
> org.apache.hadoop.mapred.CapacityTaskSchedul

[jira] [Updated] (MAPREDUCE-1684) ClusterStatus can be cached in CapacityTaskScheduler.assignTasks()

2012-08-15 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-1684:


Attachment: mapreduce-1684-v1.0.2-1.patch

bq. Currently, CapacityTaskScheduler.assignTasks() calls getClusterStatus() 
thrice

I think it calls getClusterStatus #jobs times in the worst case.

For each heartbeat from TaskTracker with some slots available, 
{noformat}
heartbeat --> assignTasks 
  --> addMap/ReduceTasks
  --> TaskSchedulingMgr.assignTasks
  --> For each queue : queuesForAssigningTasks
  --> getTaskFromQueue(queue)
  --> For each j : queue.getRunningJobs()
  --> obtainNewTask --> **getClusterStatus**
{noformat}

bq. It can be cached in assignTasks() and re-used.

Attaching a patch.  Would this work?

The motivation is that we see getClusterStatus way too often in our jstacks, 
holding the global lock.
{noformat}
"IPC Server handler 15 on 50300" daemon prio=10 tid=0x5fc5d800 
nid=0x6828 runnable [0x44847000]
   java.lang.Thread.State: RUNNABLE
  at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:4065)
  - locked <0x2aab6e638bd8> (a org.apache.hadoop.mapred.JobTracker)
  at 
org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:503)
  at 
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
  at 
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
  at 
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
  at 
org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
  at 
org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
  - locked <0x2aab6e7ffb10> (a 
org.apache.hadoop.mapred.CapacityTaskScheduler)
  at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
  - locked <0x2aab6e638bd8> (a org.apache.hadoop.mapred.JobTracker)
  at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
{noformat}
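
A minimal sketch of the caching idea (hypothetical names, not the attached patch): read the cluster status once per assignTasks() call and pass it down, instead of calling getClusterStatus() again for every running job inside obtainNewTask():

{code}
public class CachedClusterStatusSketch {

  static final class ClusterStatus {          // stand-in for o.a.h.mapred.ClusterStatus
    final int maxMapTasks, maxReduceTasks;
    ClusterStatus(int m, int r) { maxMapTasks = m; maxReduceTasks = r; }
  }

  // Pretend this is the expensive, JobTracker-lock-holding call.
  static ClusterStatus getClusterStatus() { return new ClusterStatus(100, 50); }

  static void assignTasks() {
    ClusterStatus cached = getClusterStatus();   // one call per heartbeat
    addMapTasks(cached);                         // re-used by the map pass
    addReduceTasks(cached);                      // ... and by the reduce pass
  }

  static void addMapTasks(ClusterStatus s)    { System.out.println("map slots: " + s.maxMapTasks); }
  static void addReduceTasks(ClusterStatus s) { System.out.println("reduce slots: " + s.maxReduceTasks); }

  public static void main(String[] args) { assignTasks(); }
}
{code}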


> ClusterStatus can be cached in CapacityTaskScheduler.assignTasks()
> --
>
> Key: MAPREDUCE-1684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1684
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: capacity-sched
>Reporter: Amareshwari Sriramadasu
> Attachments: mapreduce-1684-v1.0.2-1.patch
>
>
> Currently,  CapacityTaskScheduler.assignTasks() calls getClusterStatus() 
> thrice: once in assignTasks(), once in MapTaskScheduler and once in 
> ReduceTaskScheduler. It can be cached in assignTasks() and re-used.





[jira] [Commented] (MAPREDUCE-4352) Jobs fail during resource localization when directories in file cache reaches to unix directory limit

2012-06-19 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396773#comment-13396773
 ] 

Koji Noguchi commented on MAPREDUCE-4352:
-

Sounds similar to MAPREDUCE-1538 (pre-2.0).

> Jobs fail during resource localization when directories in file cache reaches 
> to unix directory limit
> -
>
> Key: MAPREDUCE-4352
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4352
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.1-alpha, 3.0.0
>Reporter: Devaraj K
>Assignee: Devaraj K
>
> If we have multiple jobs which uses distributed cache with small size of 
> files, the directory limit reaches before reaching the cache size and fails 
> to create any directories in file cache. The jobs start failing with the 
> below exception.
> {code:xml}
> java.io.IOException: mkdir of 
> /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
>   at 
> org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> {code}
> We should have a mechanism to clean the cache files if it crosses specified 
> number of directories like cache size.





[jira] [Commented] (MAPREDUCE-2765) DistCp Rewrite

2011-08-12 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084215#comment-13084215
 ] 

Koji Noguchi commented on MAPREDUCE-2765:
-

bq. Would the reviewers/watchers kindly comment on whether it's alright to 
deprecate the "-filelimit" and "-sizelimit" options, in DistCpV2?
bq.
+1.  I think we (Yahoo) requested it but ended up not using it at all.

Just to be clear:
bq. The file-listing isn't filtered until the map-task runs 
bq.
This used to be the case in the old, old DistCp.  We changed that when we added the 
-filelimit feature (which we never used).
 

> DistCp Rewrite
> --
>
> Key: MAPREDUCE-2765
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: distcp
>Affects Versions: 0.20.203.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: distcpv2.20.203.patch, distcpv2_trunk.patch
>
>
> This is a slightly modified version of the DistCp rewrite that Yahoo uses in 
> production today. The rewrite was ground-up, with specific focus on:
> 1. improved startup time (postponing as much work as possible to the MR job)
> 2. support for multiple copy-strategies
> 3. new features (e.g. -atomic, -async, -bandwidth.)
> 4. improved programmatic use
> Some effort has gone into refactoring what used to be achieved by a single 
> large (1.7 KLOC) source file, into a design that (hopefully) reads better too.
> The proposed DistCpV2 preserves command-line-compatibility with the old 
> version, and should be a drop-in replacement.
> New to v2:
> 1. Copy-strategies and the DynamicInputFormat:
>   A copy-strategy determines the policy by which source-file-paths are 
> distributed between map-tasks. (These boil down to the choice of the 
> input-format.) 
>   If no strategy is explicitly specified on the command-line, the policy 
> chosen is "uniform size", where v2 behaves identically to old-DistCp. (The 
> number of bytes transferred by each map-task is roughly equal, at a per-file 
> granularity.) 
>   Alternatively, v2 ships with a "dynamic" copy-strategy (in the 
> DynamicInputFormat). This policy acknowledges that 
>   (a)  dividing files based only on file-size might not be an 
> even distribution (E.g. if some datanodes are slower than others, or if some 
> files are skipped.)
>   (b) a "static" association of a source-path to a map increases 
> the likelihood of long-tails during copy.
>   The "dynamic" strategy divides the list-of-source-paths into a number 
> (> nMaps) of smaller parts. When each map completes its current list of 
> paths, it picks up a new list to process, if available. So if a map-task is 
> stuck on a slow (and not necessarily large) file, other maps can pick up the 
> slack. The thinner the file-list is sliced, the greater the parallelism (and 
> the lower the chances of long-tails). Within reason, of course: the number of 
> these short-lived list-files is capped at an overridable maximum.
>   Internal benchmarks against source/target clusters with some slow(ish) 
> datanodes have indicated significant performance gains when using the 
> dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
>   Please note that the DynamicInputFormat might prove useful outside of 
> DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also 
> note that the copy-strategies have no bearing on the CopyMapper.map() 
> implementation.
>   
> 2. Improved startup-time and programmatic use:
>   When the old-DistCp runs with -update, and creates the 
> list-of-source-paths, it attempts to filter out files that might be skipped 
> (by comparing file-sizes, checksums, etc.) This significantly increases the 
> startup time (or the time spent in serial processing till the MR job is 
> launched), blocking the calling-thread. This becomes pronounced as nFiles 
> increases. (Internal benchmarks have seen situations where more time is spent 
> setting up the job than on the actual transfer.)
>   DistCpV2 postpones as much work as possible to the MR job. The 
> file-listing isn't filtered until the map-task runs (at which time, identical 
> files are skipped). DistCpV2 can now be run "asynchronously". The program 
> quits at job-launch, logging the job-id for tracking. Programmatically, the 
> DistCp.execute() returns a Job instance for progress-tracking.
>   
> 3. New features:
>   (a)   -async: As described in #2.
>   (b)   -atomic: Data is copied to a (user-specifiable) tmp-location, and 
> then moved atomically to destination.
>   (c)   -bandwidth: Enforces a limit on the bandwidth consumed per map.
>   (d)   -strategy: As above.
>   
> A more comprehensive descri

[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere

2011-08-01 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073827#comment-13073827
 ] 

Koji Noguchi commented on MAPREDUCE-2324:
-

bq. Should we just disable that check?
+1

> Job should fail if a reduce task can't be scheduled anywhere
> 
>
> Key: MAPREDUCE-2324
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2, 0.20.205.0
>Reporter: Todd Lipcon
>Assignee: Robert Joseph Evans
> Fix For: 0.20.205.0
>
> Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.





[jira] [Commented] (MAPREDUCE-2510) TaskTracker throw OutOfMemoryError after upgrade to jetty6

2011-05-25 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039365#comment-13039365
 ] 

Koji Noguchi commented on MAPREDUCE-2510:
-

> > Is there an issue about upgrading Jetty to 6.1.26?
> None that I'm aware of. Upgrading from Jetty5 to Jetty6 was painful, but 
> upgrading within Jetty6 has been very easy.
>
After the 6.1.26 upgrade, I think we started seeing various fetch-failure issues 
that persist on the TaskTracker/Jetty side, delaying jobs (MAPREDUCE-2529, 
MAPREDUCE-2530, etc.).  So far we haven't found any fixes and are instead working 
on a workaround.


> TaskTracker throw OutOfMemoryError after upgrade to jetty6
> --
>
> Key: MAPREDUCE-2510
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2510
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Liyin Liang
>
> Our product cluster's TaskTracker sometimes throw OutOfMemoryError after 
> upgrade to jetty6. The exception in TT's log is as follows:
> 2011-05-17 19:16:40,756 ERROR org.mortbay.log: Error for /mapOutput
> java.lang.OutOfMemoryError: Java heap space
> at java.io.BufferedInputStream.(BufferedInputStream.java:178)
> at 
> org.apache.hadoop.fs.BufferedFSInputStream.(BufferedFSInputStream.java:44)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
> at 
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3040)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:324)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
> at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> Exceptions in .out file:
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "process reaper" java.lang.OutOfMemoryError: Java heap 
> space
> Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap 
> space
> java.lang.OutOfMemoryError: Java heap space
> java.lang.reflect.InvocationTargetException
> Exception in thread "IPC Server handler 6 on 50050" at 
> sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.mortbay.log.Slf4jLog.warn(Slf4jLog.java:126)
> at org.mortbay.log.Log.warn(Log.java:181)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:449)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:324)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
> at org.mortbay.jetty.

[jira] [Commented] (MAPREDUCE-2476) Set mapreduce scheduler to capacity scheduler for RPM/Debian packages by default

2011-05-05 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029625#comment-13029625
 ] 

Koji Noguchi commented on MAPREDUCE-2476:
-

In addition to Todd's point, do many users really need the features of the 
capacity scheduler (and/or the fair scheduler)?


> Set mapreduce scheduler to capacity scheduler for RPM/Debian packages by 
> default
> 
>
> Key: MAPREDUCE-2476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2476
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: build
>Affects Versions: 0.20.203.1
> Environment: Redhat 5.5, Java 6
>Reporter: Eric Yang
>Assignee: Eric Yang
>
> Hadoop RPM/Debian package is default to use the default scheduler.  It would 
> be nice to setup the packages to use capacity scheduler instead.



[jira] Created: (MAPREDUCE-2291) TaskTracker Decommission that waits for all map(intermediate) outputs to be pulled

2011-01-31 Thread Koji Noguchi (JIRA)
TaskTracker Decommission that waits for all map(intermediate) outputs to be 
pulled 
---

 Key: MAPREDUCE-2291
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2291
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Reporter: Koji Noguchi


On our clusters, users were getting affected when ops were decommissioning a 
large number of TaskTracker nodes. 

Correct me if I'm wrong, but the current decommission of TaskTrackers only waits 
for the running tasks to finish, not for
the jobs whose map outputs are kept on the decommissioning TaskTrackers.

Any way we can handle this better?






[jira] Resolved: (MAPREDUCE-2075) Show why the job failed (e.g. Job ___ failed because task ____ failed 4 times)

2010-09-16 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi resolved MAPREDUCE-2075.
-

Resolution: Duplicate

Duplicate of  MAPREDUCE-343.

> Show why the job failed  (e.g. Job ___ failed because task  failed 4 
> times)
> ---
>
> Key: MAPREDUCE-2075
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2075
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Koji Noguchi
>Priority: Minor
> Fix For: 0.22.0
>
>
> When our users have questions about their jobs' failures, they tend to 
> copy&paste all the userlog exceptions they see on the webui/console.  
> However, most of them are not the ones that caused the job to fail.  When we 
> tell them "This task failed 4 times", sometimes that's enough information for 
> them to solve the problem on their own.
> It would be nice if the jobclient or the job status page showed the reason for 
> the job being flagged as failed.




[jira] Created: (MAPREDUCE-2076) Showing inputsplit filename/offset inside the webui or tasklog

2010-09-16 Thread Koji Noguchi (JIRA)
Showing inputsplit filename/offset inside the webui or tasklog
--

 Key: MAPREDUCE-2076
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2076
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Koji Noguchi
Priority: Minor
 Fix For: 0.22.0


For debugging purposes, it would be nice to have the input split's filename and 
offset for FileInputFormat and the like
(in addition to the input split's node list that is already shown).





[jira] Created: (MAPREDUCE-2075) Show why the job failed (e.g. Job ___ failed because task ____ failed 4 times)

2010-09-16 Thread Koji Noguchi (JIRA)
Show why the job failed  (e.g. Job ___ failed because task  failed 4 times)
---

 Key: MAPREDUCE-2075
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2075
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Koji Noguchi
Priority: Minor
 Fix For: 0.22.0


When our users have questions about their jobs' failures, they tend to 
copy&paste all the userlog exceptions they see on the webui/console.  However, 
most of them are not the ones that caused the job to fail.  When we tell them 
"This task failed 4 times", sometimes that's enough information for them to 
solve the problem on their own.

It would be nice if the jobclient or the job status page showed the reason for the 
job being flagged as failed.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-2074) Task should fail when symlink creation fail

2010-09-16 Thread Koji Noguchi (JIRA)
Task should fail when symlink creation fail
---

 Key: MAPREDUCE-2074
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2074
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: distributed-cache
Affects Versions: 0.20.2
Reporter: Koji Noguchi
Priority: Minor
 Fix For: 0.22.0


If I pass an invalid symlink as   
-Dmapred.cache.files=/user/knoguchi/onerecord.txt#abc/abc

The task only reports a WARN and goes on.

{noformat} 
2010-09-16 21:38:49,782 INFO org.apache.hadoop.mapred.TaskRunner: Creating 
symlink: 
/0/tmp/mapred-local/taskTracker/knoguchi/distcache/-5031501808205559510_-128488332_1354038698/abc-nn1.def.com/user/knoguchi/onerecord.txt
 <- 
/0/tmp/mapred-local/taskTracker/knoguchi/jobcache/job_201008310107_15105/attempt_201008310107_15105_m_00_0/work/./abc/abc
2010-09-16 21:38:49,789 WARN org.apache.hadoop.mapred.TaskRunner: Failed to 
create symlink: 
/0/tmp/mapred-local/taskTracker/knoguchi/distcache/-5031501808205559510_-128488332_1354038698/abc-nn1.def.com/user/knoguchi/onerecord.txt
 <- 
/0/tmp/mapred-local/taskTracker/knoguchi/jobcache/job_201008310107_15105/attempt_201008310107_15105_m_00_0/work/./abc/abc
{noformat} 

I believe we should fail the task at this point.
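
As a rough editorial sketch of that proposal (not the actual TaskRunner patch; it assumes org.apache.hadoop.fs.FileUtil.symLink returns the shell exit code):

{noformat}
import java.io.IOException;

import org.apache.hadoop.fs.FileUtil;

// Hedged sketch: fail the task by throwing instead of only logging a WARN
// when the symlink cannot be created.
public class SymlinkOrFail {
  public static void createSymlinkOrFail(String target, String link)
      throws IOException {
    // FileUtil.symLink runs "ln -s" and returns the shell's exit code; 0 means success.
    if (FileUtil.symLink(target, link) != 0) {
      throw new IOException("Failed to create symlink: " + link + " <- " + target);
    }
  }
}
{noformat}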

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-16 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898990#action_12898990
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

{quote}
The TaskTracker, on being requested to run a task requiring CAR resource md5_F 
checks whether md5_F is localized.

* If md5_F is already localized - then nothing more needs to be done. the 
localized version is used by the Task
* If md5_F is not localized - then its fetched from the CAR repository
{quote}
What are we gaining by using md5_F on the TaskTracker side?
Can we use the existing 'cacheStatus.mtime == confFileStamp' check and change 
the order of the check so that no unnecessary getFileStatus call is made 
(MAPREDUCE-2011)? 
Otherwise, this can only be used for dist cache files loaded by this framework and 
would require two separate pieces of logic on the TaskTracker side.

> Jobs should not submit the same jar files over and over again
> -
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.
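
For illustration only (this is not part of the attached 1901.PATCH, and the class/method names are made up), a small sketch of how such a content signature could be computed for a jar before upload:

{noformat}
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Hedged sketch: identical jars hash to the same signature, so they can be
// cached once instead of being re-uploaded for every job.
public class JarSignature {
  public static String md5Of(String path) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    InputStream in = new FileInputStream(path);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) > 0) {
        md.update(buf, 0, n);
      }
    } finally {
      in.close();
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }
}
{noformat}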

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup

2010-08-16 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898981#action_12898981
 ] 

Koji Noguchi commented on MAPREDUCE-2011:
-

MAPREDUCE-1901 has a detailed proposal for how to handle the distributed cache better 
for files loaded by the jobclient (-libjars).
As part of it, it mentions:

{quote}
The TaskTracker, on being requested to run a task requiring CAR resource md5_F 
checks whether md5_F is localized.

* If md5_F is already localized - then nothing more needs to be done. the 
localized version is used by the Task
* If md5_F is not localized - then its fetched from the CAR repository
{quote}

This Jira is basically asking for the same thing, except it proposes using the existing 
mtime instead of the new md5_F.
Just to reduce the mtime/getFileStatus calls, the mtime check is enough and keeps 
the change small.



> Reduce number of getFileStatus call made from every 
> task(TaskDistributedCache) setup
> 
>
> Key: MAPREDUCE-2011
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Reporter: Koji Noguchi
>
> On our cluster, we had jobs with 20 dist cache files and very short-lived tasks, 
> resulting in 500 map tasks launched per second and 10,000 
> getFileStatus calls to the namenode.  The namenode can handle this, but asking to 
> see if we can reduce this somehow.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898484#action_12898484
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

bq. For me, that's not a worry. It may delay individual job submissions, but 
the overall load on hdfs isn't much.
bq. (at least compared to the later phase of hundreds and thousands of tasktrackers 
looking up the mtime of 'all those jars'.)

Since my problem is just about the lookup of mtime, I created a new jira 
MAPREDUCE-2011.

> Jobs should not submit the same jar files over and over again
> -
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup

2010-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898481#action_12898481
 ] 

Koji Noguchi commented on MAPREDUCE-2011:
-

When a task is initialized, it calls getFileStatus for every distributed cache 
file/archive entry it has (_dfsFileStamp_) and compares it with the task's 
timestamp specified in the config (_confFileStamp_).
This makes sure that tasks fail *at start up* if distributed cache files were 
changed after the job was submitted and before the task started.

(It still doesn't guarantee that job would fail reliably since all the tasks 
could have been started before the modification.)


Now asking if we can change this logic to:
if an exact localized cache exists ('lcacheStatus.mtime == confFileStamp') on the 
TaskTracker, use that and do not call getFileStatus (_dfsFileStamp_).

With this, no getFileStatus calls are made if the TaskTracker already has the 
localized cache with the same timestamp.  This should reduce the number of 
getFileStatus calls significantly when people submit jobs using the same 
distributed cache files.

This still makes sure that all the tasks use the same dist cache files 
specified at the job startup (correctness).

But with this change, tasks that would have failed at start-up due to 
(_dfsFileStamp_ != _confFileStamp_) can now succeed.
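
A rough sketch of the reordered check described above (names such as CacheStatus and needsLocalization are placeholders for this illustration, not the actual TaskTracker code):

{noformat}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DistCacheCheckSketch {
  // Stand-in for the record the TaskTracker keeps about a localized cache file.
  static class CacheStatus {
    long mtime;
  }

  static boolean needsLocalization(FileSystem fs, Path cacheFile,
      long confFileStamp, CacheStatus localized) throws IOException {
    if (localized != null && localized.mtime == confFileStamp) {
      // The localized copy matches the timestamp recorded at job submission:
      // reuse it and skip the getFileStatus round trip to the NameNode.
      return false;
    }
    // Only now look up dfsFileStamp, and fail the task if the file changed
    // after the job was submitted.
    FileStatus dfsStatus = fs.getFileStatus(cacheFile);
    if (dfsStatus.getModificationTime() != confFileStamp) {
      throw new IOException("Distributed cache file " + cacheFile
          + " changed after job submission");
    }
    return true;
  }
}
{noformat}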



> Reduce number of getFileStatus call made from every 
> task(TaskDistributedCache) setup
> 
>
> Key: MAPREDUCE-2011
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distributed-cache
>Reporter: Koji Noguchi
>
> On our cluster, we had jobs with 20 dist cache files and very short-lived tasks, 
> resulting in 500 map tasks launched per second and 10,000 
> getFileStatus calls to the namenode.  The namenode can handle this, but asking to 
> see if we can reduce this somehow.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup

2010-08-13 Thread Koji Noguchi (JIRA)
Reduce number of getFileStatus call made from every task(TaskDistributedCache) 
setup


 Key: MAPREDUCE-2011
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distributed-cache
Reporter: Koji Noguchi


On our cluster, we had jobs with 20 dist cache files and very short-lived tasks, 
resulting in 500 map tasks launched per second and 10,000 
getFileStatus calls to the namenode.  The namenode can handle this, but asking to 
see if we can reduce this somehow.  


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898300#action_12898300
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

bq. u would get a trace of all those jars being uploaded to hdfs. it's 
ridiculous.

For me, that's not a worry.  It may delay individual job submissions, but 
the overall load on hdfs isn't much
(at least compared to the later phase of hundreds and thousands of tasktrackers 
looking up the mtime of 'all those jars').

> Jobs should not submit the same jar files over and over again
> -
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-12 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897833#action_12897833
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

bq. we have started testing this patch internally and this would become 
production in a couple of weeks.

Joydeep, is this being tested in your production?  What does the load look like?
I don't know the details, but I like the "part of the goal here is to not have 
to look up mtimes again and again" part.

We certainly have applications with many small tasks and multiple 
libjars/distributed caches, resulting in too many getfileinfo calls to the 
namenode.

> Jobs should not submit the same jar files over and over again
> -
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and over again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1839) HadoopArchives should provide a way to configure replication

2010-06-03 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875222#action_12875222
 ] 

Koji Noguchi commented on MAPREDUCE-1839:
-

bq. I tested it yesterday on Hadoop 0.20 and it doesn't work. 

Could you clarify what didn't work?
If the mapreduce archive job failed with an unknown param, maybe you don't have 
MAPREDUCE-826, which sets up the ToolRunner.

I tried just now.  Got 

{noformat}
% hadoop dfs -lsr mytest1.har  
-rw---   5 knoguchi users947 2010-06-03 17:47 
/user/knoguchi/mytest1.har/_index
-rw---   5 knoguchi users 23 2010-06-03 17:47 
/user/knoguchi/mytest1.har/_masterindex
-rw---   2 knoguchi users  68064 2010-06-03 17:46 
/user/knoguchi/mytest1.har/part-0
%
{noformat} 

Replication was successfully set to 2. 

Maybe you're talking about the replication shown when doing listStatus on the 
files inside the har ?

When I do 
hadoop dfs -lsr har:///user/knoguchi/mytest1.har , it shows 
{noformat}
...
-rw---   5 knoguchi users  17018 2010-06-03 17:47 
/user/knoguchi/mytest1.har/tmptmp/abc
{noformat}

This is because the permission and replication factor are simply taken from the 
_index file.  This is fixed in MAPREDUCE-1628.


> HadoopArchives should provide a way to configure replication
> 
>
> Key: MAPREDUCE-1839
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1839
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Affects Versions: 0.20.1
>Reporter: Ramkumar Vadali
>Priority: Minor
>
> When creating HAR archives, the part files use the default replication of the 
> filesystem. This should be made configurable through either the configuration 
> file or command line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1839) HadoopArchives should provide a way to configure replication

2010-06-02 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874909#action_12874909
 ] 

Koji Noguchi commented on MAPREDUCE-1839:
-

Probably a silly question, but can't we set it through the command line?  
{noformat}
% hadoop archive -Ddfs.replication=2  ...
{noformat}  


> HadoopArchives should provide a way to configure replication
> 
>
> Key: MAPREDUCE-1839
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1839
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Affects Versions: 0.20.1
>Reporter: Ramkumar Vadali
>Priority: Minor
>
> When creating HAR archives, the part files use the default replication of the 
> filesystem. This should be made configurable through either the configuration 
> file or command line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1648) Use RollingFileAppender to limit tasklogs

2010-04-06 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854051#action_12854051
 ] 

Koji Noguchi commented on MAPREDUCE-1648:
-

When reviewing the patch, please test the performance and make sure we don't 
re-introduce the slowness observed at HADOOP-1553.

> Use RollingFileAppender to limit tasklogs
> -
>
> Key: MAPREDUCE-1648
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1648
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Reporter: Guilin Sun
>Priority: Minor
>
> There are at least two types of task logs: syslog and stdlog.
> The task JVM writes syslog via log4j with TaskLogAppender.  TaskLogAppender works 
> much like "tail -c": it stores the last N bytes/lines of logs in memory (via a queue) 
> and does the real output only once all logs are committed and the Appender is about to close.
> The common problem with TaskLogAppender and 'tail -c' is that everything is kept in 
> memory and the user can't see any log output while the task is in progress.
> So I'm going to try RollingFileAppender instead of TaskLogAppender, using 
> MaxFileSize&MaxBackupIndex to limit the log file size.
> RollingFileAppender is also suitable for stdout/stderr: just redirect 
> stdout/stderr to log4j via LoggingOutputStream, no client code has to be 
> changed, and RollingFileAppender seems better than 'tail -c' too.
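
As a rough illustration of that idea (assumptions: log4j 1.x on the classpath; the file path and limits are arbitrary, and this is not the actual TaskLogAppender replacement):

{noformat}
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.RollingFileAppender;

public class RollingTaskLogSketch {
  public static void main(String[] args) throws Exception {
    // Logs are written to disk as they happen, so a user can tail the file
    // while the task is still running; total size is bounded by rolling.
    RollingFileAppender appender = new RollingFileAppender(
        new PatternLayout("%d{ISO8601} %p %c: %m%n"), "/tmp/task-syslog.log");
    appender.setMaxFileSize("10MB");   // MaxFileSize: cap each log file
    appender.setMaxBackupIndex(3);     // MaxBackupIndex: keep at most three rolled files
    Logger.getRootLogger().addAppender(appender);

    Logger.getLogger(RollingTaskLogSketch.class).info("sample task log line");
  }
}
{noformat}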

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-114) All reducer tasks are finished, while some mapper tasks are still running

2010-03-11 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844220#action_12844220
 ] 

Koji Noguchi commented on MAPREDUCE-114:


Is this related to MAPREDUCE-1060 ?

> All reducer tasks are finished, while some mapper tasks are still running
> -
>
> Key: MAPREDUCE-114
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-114
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Qi Liu
> Attachments: hadoop-bug-overview.png, hadoop-bug-useless-task.png
>
>
> In a high-load environment (i.e. multiple jobs are queued up to be executed), 
> when all reducer tasks of a job are finished, some mapper tasks of the same 
> job may still be running (possibly re-executed due to a lost task tracker, etc.).
> This should not happen when a job has at least one reducer task. When all 
> reducer tasks are in SUCCEEDED state, the Hadoop JobTracker should kill all 
> running mapper tasks, since their execution would be meaningless. The job should 
> also switch to SUCCEEDED state when all reducer tasks of that job have completed 
> successfully.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020

2010-02-22 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi reassigned MAPREDUCE-837:
--

Assignee: Mahadev konar

> harchive fail when output directory has URI with default port of 8020
> -
>
> Key: MAPREDUCE-837
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-837
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: harchive
>Affects Versions: 0.20.1
>Reporter: Koji Noguchi
>Assignee: Mahadev konar
>Priority: Minor
>
> % hadoop archive -archiveName abc.har /user/knoguchi/abc 
> hdfs://mynamenode:8020/user/knoguchi
> doesn't work on either 0.18 or 0.20

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete

2010-02-09 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831893#action_12831893
 ] 

Koji Noguchi commented on MAPREDUCE-1305:
-

bq. Is supporting Trash useful for DistCp users running with -delete?

To me, yes.
I've seen many of our users delete their files accidentally.  
Trash has saved us a great deal of time.

I'd like to request that the Trash part stay if there isn't much of a performance 
problem.

> Massive performance problem with DistCp and -delete
> ---
>
> Key: MAPREDUCE-1305
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1305
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Affects Versions: 0.20.1
>Reporter: Peter Romianowski
>Assignee: Peter Romianowski
> Attachments: M1305-1.patch, MAPREDUCE-1305.patch
>
>
> *First problem*
> In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus 
> objects when the path is all we need.
> The performance problem comes from 
> org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write which tries 
> to retrieve file permissions by issuing a "ls -ld " which is painfully 
> slow.
> Changed that to just serialize Path and not FileStatus.
> *Second problem*
> To delete the files we invoke the "hadoop" command line tool with option 
> "-rmr ". Again, for each file.
> Changed that to dstfs.delete(path, true)
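
A small sketch of that second change (illustrative only; dstfs, the class name, and the list of stale paths are assumptions for the example, not the patch itself):

{noformat}
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteNonexistingSketch {
  // Delete stale destination paths through the FileSystem API:
  // one RPC per path instead of forking a "hadoop ... -rmr" process per file.
  static void deleteStale(FileSystem dstfs, Iterable<Path> stale) throws IOException {
    for (Path p : stale) {
      dstfs.delete(p, true);   // recursive delete
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem dstfs = FileSystem.get(new Configuration());
    deleteStale(dstfs, Arrays.asList(new Path("/tmp/stale-dir")));
  }
}
{noformat}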

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1372) ConcurrentModificationException in JobInProgress

2010-01-12 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799247#action_12799247
 ] 

Koji Noguchi commented on MAPREDUCE-1372:
-

When we hit this, that task never gets scheduled and the job gets stuck forever.

> ConcurrentModificationException in JobInProgress
> 
>
> Key: MAPREDUCE-1372
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1372
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.1
>Reporter: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
>
> We have seen the following  ConcurrentModificationException in one of our 
> clusters
> {noformat}
> java.io.IOException: java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
> at java.util.HashMap$KeyIterator.next(HashMap.java:828)
> at 
> org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2018)
> at 
> org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:1077)
> at 
> org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:796)
> at 
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:589)
> at 
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:677)
> at 
> org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:348)
> at 
> org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTask(CapacityTaskScheduler.java:1397)
> at 
> org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1349)
> at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2976)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-826) harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception

2009-09-09 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-826:
---

Attachment: mapreduce-826-2.patch

Thanks Mahadev. Made the change.

Since this is a patch around main, I didn't find a straightforward way to do a 
unit test.

One manual test.  Before the patch.
$ hadoop archive -archiveName myhar.har -p /tmp/somenonexistdir  somedir 
/user/knoguchi
null
$ echo $?
0

After the patch, 

$ hadoop archive -archiveName myhar.har -p /tmp/somenonexistdir  somedir 
/user/knoguchi
Exception in archives
null
lieliftbean-lm:trunk knoguchi$ echo $?
1

I guess we should also fix the NPE when src doesn't exist.  I'm leaving it for 
now since this was a good manual test case.



> harchive doesn't use ToolRunner / harchive returns 0 even if the job fails 
> with exception
> -
>
> Key: MAPREDUCE-826
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-826
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: harchive
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Trivial
> Fix For: 0.21.0
>
> Attachments: mapreduce-826-1.patch, mapreduce-826-2.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-883) harchive: Document how to unarchive

2009-08-18 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-883:
---

Attachment: mapreduce-883-0.patch

Simple doc suggesting to use cp/distcp for unarchiving.

> harchive: Document how to unarchive
> ---
>
> Key: MAPREDUCE-883
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-883
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: documentation, harchive
>Reporter: Koji Noguchi
>Priority: Minor
> Attachments: mapreduce-883-0.patch
>
>
> I was thinking of implementing harchive's 'unarchive' feature, but realized 
> it has been implemented already ever since harchive was introduced.
> It just needs to be documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex

2009-08-17 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744331#action_12744331
 ] 

Koji Noguchi commented on MAPREDUCE-865:


Simple testing.
Created har file with 
/a/b/2000files/xa to xaadnj
and /a/b/2000files/2000files/xa to xaadnj

Created har archive called myarchive.har.

About 4500 files. 

Without the patch, 
/usr/bin/time hadoop dfs -lsr har:///user/knoguchi/myarchive.har > /dev/null
  
31.72user 5.23system *1:13.19* elapsed 50%CPU (0avgtext+0avgdata 0maxresident)

with 9000 open calls to the Namenode (_masterindex and _index) and also 4500 
filestatus calls to _index (I think).

With the patch, 
23.59user 0.58system *0:22.97* elapsed 105%CPU (0avgtext+0avgdata 0maxresident)

with one _masterindex open call and five _index open calls.
Setting -Dfs.har.indexcache.num=1 changed the number of _index open calls to 
10, but the elapsed time didn't change much.


The goal of the patch is more about reducing the load/calls to the namenode than 
about speeding up the 'ls' commands.

Note that since the client caches the entire _masterindex and also caches each 
STORE (cache range) it reads, the initial call would be slower.



> harchive: Reduce the number of open calls  to _index and _masterindex 
> --
>
> Key: MAPREDUCE-865
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Koji Noguchi
>Priority: Minor
> Attachments: mapreduce-865-0.patch
>
>
> When I have a har file with 1000 files in it, 
>% hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Any way to reduce this number?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex

2009-08-17 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-865:
---

Attachment: mapreduce-865-0.patch

Primitive patch for discussion.

bq. So instead of open->read->close _index for each part file, thinking of 
keeping the index file open when possible.

Instead of keeping an open handle, this one simply reads 'Stores' (ranges of 
the cache) and keeps the last 5 of them (configurable) in memory.
If the files are typical mapreduce outputs with many part-* files, the number of 
open calls to _index will be significantly reduced.
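
A toy sketch of this kind of bounded cache (the class and parameter names are invented for illustration and do not reflect the attached patch):

{noformat}
import java.util.LinkedHashMap;
import java.util.Map;

// Keep only the last N index "stores" in memory; the least recently used
// entry is evicted, so repeated lookups of part-* files avoid reopening _index.
public class IndexStoreCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public IndexStoreCache(int maxEntries) {
    super(16, 0.75f, true);          // access-order: eldest == least recently used
    this.maxEntries = maxEntries;    // e.g. 5, as discussed above
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}
{noformat}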




> harchive: Reduce the number of open calls  to _index and _masterindex 
> --
>
> Key: MAPREDUCE-865
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Koji Noguchi
>Priority: Minor
> Attachments: mapreduce-865-0.patch
>
>
> When I have a har file with 1000 files in it, 
>% hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Any way to reduce this number?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-883) harchive: Document how to unarchive

2009-08-17 Thread Koji Noguchi (JIRA)
harchive: Document how to unarchive
---

 Key: MAPREDUCE-883
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-883
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: documentation, harchive
Reporter: Koji Noguchi
Priority: Minor


I was thinking of implementing harchive's 'unarchive' feature, but realized it 
has been implemented already ever since harchive was introduced.
It just needs to be documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex

2009-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743041#action_12743041
 ] 

Koji Noguchi commented on MAPREDUCE-865:


I believe _masterindex is probably small enough to fit in memory (cache).
For the _index file, 1 million files can correspond to an _index size of 100 MBytes 
(it depends on the path length).
Creating a local copy could be costly.

In our clusters, most of the files are mapreduce output files. 
/a/b/part-0
/a/b/part-1
/a/b/part-2
...
These show up as a set in the _index file in this order since 
HarFileSystem.getHarHash is written that way.
So instead of open->read->close _index for each part file, thinking of keeping 
the index file open when possible.


> harchive: Reduce the number of open calls  to _index and _masterindex 
> --
>
> Key: MAPREDUCE-865
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Koji Noguchi
>Priority: Minor
>
> When I have a har file with 1000 files in it, 
>% hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Any way to reduce this number?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex

2009-08-13 Thread Koji Noguchi (JIRA)
harchive: Reduce the number of open calls  to _index and _masterindex 
--

 Key: MAPREDUCE-865
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: harchive
Reporter: Koji Noguchi
Priority: Minor


When I have a har file with 1000 files in it, 
   % hadoop dfs -lsr har:///user/knoguchi/myhar.har/
would open/read/close the _index/_masterindex files 1000 times.

This makes the client slow and adds some load to the namenode as well.
Any way to reduce this number?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020

2009-08-07 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740697#action_12740697
 ] 

Koji Noguchi commented on MAPREDUCE-837:


bq. I'll create a separate Jira for the 0.20 job succeeding part.

Created MAPREDUCE-838

> harchive fail when output directory has URI with default port of 8020
> -
>
> Key: MAPREDUCE-837
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-837
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: harchive
>Affects Versions: 0.20.1
>Reporter: Koji Noguchi
>Priority: Minor
>
> % hadoop archive -archiveName abc.har /user/knoguchi/abc 
> hdfs://mynamenode:8020/user/knoguchi
> doesn't work on either 0.18 or 0.20

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-838) Task succeeds even when committer.commitTask fails with IOException

2009-08-07 Thread Koji Noguchi (JIRA)
Task succeeds even when committer.commitTask fails with IOException
---

 Key: MAPREDUCE-838
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-838
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 0.20.1
Reporter: Koji Noguchi


In MAPREDUCE-837, the job succeeded with empty output even though all the tasks 
were throwing IOException at committer.commitTask.

{noformat}
2009-08-07 17:51:47,458 INFO org.apache.hadoop.mapred.TaskRunner: Task 
attempt_200907301448_8771_r_00_0 is allowed to commit now
2009-08-07 17:51:47,466 WARN org.apache.hadoop.mapred.TaskRunner: Failure 
committing: java.io.IOException: Can not get the relative path: \
base = 
hdfs://mynamenode:8020/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0
 \
child = 
hdfs://mynamenode/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0/_index
  at 
org.apache.hadoop.mapred.FileOutputCommitter.getFinalPath(FileOutputCommitter.java:150)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:106)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
  at 
org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
  at org.apache.hadoop.mapred.Task.commit(Task.java:768)
  at org.apache.hadoop.mapred.Task.done(Task.java:692)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)

2009-08-07 17:51:47,468 WARN org.apache.hadoop.mapred.TaskRunner: Failure 
asking whether task can commit: java.io.IOException: \
Can not get the relative path: base = 
hdfs://mynamenode:8020/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0
 \
child = 
hdfs://mynamenode/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0/_index
  at 
org.apache.hadoop.mapred.FileOutputCommitter.getFinalPath(FileOutputCommitter.java:150)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:106)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
  at 
org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
  at org.apache.hadoop.mapred.Task.commit(Task.java:768)
  at org.apache.hadoop.mapred.Task.done(Task.java:692)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)

2009-08-07 17:51:47,469 INFO org.apache.hadoop.mapred.TaskRunner: Task 
attempt_200907301448_8771_r_00_0 is allowed to commit now
2009-08-07 17:51:47,472 INFO org.apache.hadoop.mapred.TaskRunner: Task 
'attempt_200907301448_8771_r_00_0' done.


{noformat}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020

2009-08-07 Thread Koji Noguchi (JIRA)
harchive fail when output directory has URI with default port of 8020
-

 Key: MAPREDUCE-837
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-837
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: harchive
Affects Versions: 0.20.1
Reporter: Koji Noguchi
Priority: Minor


% hadoop archive -archiveName abc.har /user/knoguchi/abc 
hdfs://mynamenode:8020/user/knoguchi

doesn't work on either 0.18 or 0.20


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020

2009-08-07 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740692#action_12740692
 ] 

Koji Noguchi commented on MAPREDUCE-837:


hadoop archive -archiveName abc.har /user/knoguchi/abc 
hdfs://mynamenode:8020/user/knoguchi

In 0.18, the job fails with
{noformat}
09/08/07 19:41:57 INFO mapred.JobClient: Task Id :
attempt_200908071938_0001_m_00_2, Status : FAILED
Failed to rename output with the exception: java.io.IOException: Can not get the
relative path: base =
hdfs://mynamenode:8020/user/knoguchi/abc.har/_temporary/_attempt_200908071938_0001_m_00_2
child =
hdfs://mynamenode/user/knoguchi/abc.har/_temporary/_attempt_200908071938_0001_m_00_2/part-0
at org.apache.hadoop.mapred.Task.getFinalPath(Task.java:590)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:603)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:621)
at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:565)
at
org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2616)
{noformat}

In 0.20, it logs the above warning but the job succeeds with an empty output directory 
(which is worse).

I'll create a separate Jira for the 0.20 job-succeeding part.



> harchive fail when output directory has URI with default port of 8020
> -
>
> Key: MAPREDUCE-837
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-837
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: harchive
>Affects Versions: 0.20.1
>Reporter: Koji Noguchi
>Priority: Minor
>
> % hadoop archive -archiveName abc.har /user/knoguchi/abc 
> hdfs://mynamenode:8020/user/knoguchi
> doesn't work on either 0.18 or 0.20

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-826) harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception

2009-08-04 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-826:
---

Attachment: mapreduce-826-1.patch

1) Calls ToolRunner.run.
2) Took out the catch(Exception e) and let main fail with a stack dump. At least 
the return value would be non-zero.
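
Roughly, the main() shape this implies (a sketch under the assumption that the archive tool implements Tool; the real HadoopArchives class differs in detail):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ArchiveToolSketch extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // ... parse args, build and submit the archiving job; throw on failure ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options (-D, -conf, ...) and calls run();
    // an uncaught exception or a non-zero return makes the process exit non-zero.
    int ret = ToolRunner.run(new Configuration(), new ArchiveToolSketch(), args);
    System.exit(ret);
  }
}
{noformat}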

> harchive doesn't use ToolRunner / harchive returns 0 even if the job fails 
> with exception
> -
>
> Key: MAPREDUCE-826
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-826
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: harchive
>Reporter: Koji Noguchi
>Priority: Trivial
> Attachments: mapreduce-826-1.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-826) harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception

2009-08-04 Thread Koji Noguchi (JIRA)
harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with 
exception
-

 Key: MAPREDUCE-826
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-826
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: harchive
Reporter: Koji Noguchi
Priority: Trivial




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.