[jira] Commented: (MAPREDUCE-1821) IFile.Reader should check whether data crc has checked before it stop reading.

2010-05-26 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872118#action_12872118
 ] 

ZhuGuanyin commented on MAPREDUCE-1821:
---

positionToNextRecord() in IFile.Reader should check whether checksumIn has 
verified the CRC when the reader gets EOF_MARKER for both currentKeyLength and 
currentValueLength.

When IFileInputStream verifies the CRC in doRead(), it should set a private 
flag, and the Reader could query that flag through a public IFileInputStream 
method.

If IFile.Reader reads two -1 bytes for the key length and value length and 
checksumIn has not yet verified the CRC, it should throw an exception to fail 
the task.
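
The idea above can be sketched roughly as follows. This is only an 
illustration, not the Hadoop code: ChecksummedInput and RecordReaderSketch are 
hypothetical stand-ins for IFileInputStream and IFile.Reader, lengths are 
stored as single signed bytes instead of vints, and the expected CRC is passed 
in rather than read from the end of the file.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical stand-in for IFileInputStream: it serves payload bytes and
// records whether the CRC comparison has happened yet.
final class ChecksummedInput {
    private final byte[] payload;
    private final long expectedCrc;
    private final CRC32 crc = new CRC32();
    private int pos = 0;
    private boolean checksumVerified = false;

    ChecksummedInput(byte[] payload, long expectedCrc) {
        this.payload = payload;
        this.expectedCrc = expectedCrc;
    }

    // Returns the next byte as a signed value. The CRC is compared only
    // once the last payload byte has been consumed.
    int read() throws IOException {
        if (pos >= payload.length) {
            throw new EOFException("read past end of payload");
        }
        byte b = payload[pos++];
        crc.update(b);
        if (pos == payload.length) {
            if (crc.getValue() != expectedCrc) {
                throw new IOException("checksum mismatch");
            }
            checksumVerified = true;
        }
        return b;
    }

    // The public query method proposed in the comment.
    boolean isChecksumVerified() {
        return checksumVerified;
    }
}

// Hypothetical stand-in for IFile.Reader.
final class RecordReaderSketch {
    static final int EOF_MARKER = -1;

    // Reads (keyLength, valueLength, key, value) records and returns how
    // many were read. Fails if the EOF markers show up before the stream
    // has verified its checksum.
    static int readAll(ChecksummedInput in) throws IOException {
        int records = 0;
        while (true) {
            int keyLength = in.read();
            int valueLength = in.read();
            if (keyLength == EOF_MARKER && valueLength == EOF_MARKER) {
                if (!in.isChecksumVerified()) {
                    throw new IOException(
                        "EOF marker seen before the checksum was verified");
                }
                return records;
            }
            for (int i = 0; i < keyLength + valueLength; i++) {
                in.read(); // skip key and value bytes
            }
            records++;
        }
    }
}
```

With this shape, a stray pair of -1 bytes in the middle of a corrupted file 
fails the read instead of ending it silently.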

> IFile.Reader should check whether data crc has checked before it stop reading.
> --
>
> Key: MAPREDUCE-1821
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1821
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Reporter: ZhuGuanyin
>Assignee: ZhuGuanyin
>
> Currently IFile data has its CRC checked in IFileInputStream (the doRead 
> method). Normally an IFile ends with 2 bytes of -1, the EOF_MARKER for the 
> key length and value length, followed by a 4-byte CRC checksum.
> The IFileInputStream checksumIn checks the CRC before IFile.Reader gets the 
> EOF_MARKER.
> IFile.Reader stops reading when positionToNextRecord() reads EOF_MARKER (-1) 
> for the key length and EOF_MARKER (-1) for the value length.
> But if an error has occurred (the IFile is corrupted) and the IFile.Reader 
> reads -1, -1 somewhere other than the end of the IFile, the data may never 
> be checked.
> The Reader then assumes it has read all the data and closes, so the task may 
> appear to succeed without any warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1821) IFile.Reader should check whether data crc has checked before it stop reading.

2010-05-26 Thread ZhuGuanyin (JIRA)
IFile.Reader should check whether data crc has checked before it stop reading.
--

 Key: MAPREDUCE-1821
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1821
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Reporter: ZhuGuanyin
Assignee: ZhuGuanyin


Currently IFile data has its CRC checked in IFileInputStream (the doRead 
method). Normally an IFile ends with 2 bytes of -1, the EOF_MARKER for the key 
length and value length, followed by a 4-byte CRC checksum.
The IFileInputStream checksumIn checks the CRC before IFile.Reader gets the 
EOF_MARKER.
IFile.Reader stops reading when positionToNextRecord() reads EOF_MARKER (-1) 
for the key length and EOF_MARKER (-1) for the value length.

But if an error has occurred (the IFile is corrupted) and the IFile.Reader 
reads -1, -1 somewhere other than the end of the IFile, the data may never be 
checked.
The Reader then assumes it has read all the data and closes, so the task may 
appear to succeed without any warning.




[jira] Updated: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-14 Thread ZhuGuanyin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhuGuanyin updated MAPREDUCE-1277:
--

Attachment: streaming-1277-new.patch

Regenerated the patch using svn diff at the root directory, thanks.

> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
>Assignee: ZhuGuanyin
> Fix For: 0.21.0
>
> Attachments: streaming-1277-new.patch, streaming-1277.patch
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Commented: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-10 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789081#action_12789081
 ] 

ZhuGuanyin commented on MAPREDUCE-1277:
---

I think the framework should not care about the character set of the input or 
the user log; the input or output may even contain more than one character set.

What Hadoop needs to do is read raw data for the user's mapper or reducer, 
collect the raw stdout and stderr data, and save them on HDFS or the 
tasktracker's local disk.

Raw in, raw out, no matter what the character set is.

> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
>Assignee: ZhuGuanyin
> Fix For: 0.21.0
>
> Attachments: streaming-1277.patch
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Commented: (MAPREDUCE-1254) job.xml should add crc check in tasktracker and sub jvm.

2009-12-10 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788657#action_12788657
 ] 

ZhuGuanyin commented on MAPREDUCE-1254:
---

I only gave that example to show that inexpensive disks are not reliable: the 
kernel did not notice the hardware failure while the file was being truncated.

1) job.xml in the configuration is loaded asynchronously, so it could be 
corrupted or go missing before it is parsed. If that happens, the corrupted 
data or the default data would be loaded without notice (meaning some tasks 
run with the right configuration while others run with the wrong one);

2) job.xml holds many important parameters, so it needs to be checked before 
it is used;

3) if we don't check the CRC, why do we generate the CRC checksum file?  :)

> job.xml should add crc check in tasktracker and sub jvm.
> 
>
> Key: MAPREDUCE-1254
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1254
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: task, tasktracker
>Affects Versions: 0.22.0
>Reporter: ZhuGuanyin
>
> Currently job.xml in tasktracker and subjvm are write to local disk through 
> ChecksumFilesystem, and already had crc checksum information, but load the 
> job.xml file without crc check. It would cause the mapred job finished 
> successful but with wrong data because of disk error.  Example: The 
> tasktracker and sub task jvm would load the default configuration if it 
> doesn't successfully load the job.xml which maybe replace the mapper with 
> IdentityMapper. 




[jira] Commented: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-09 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788489#action_12788489
 ] 

ZhuGuanyin commented on MAPREDUCE-1277:
---

This patch changes

System.err.println(lineStr);

to
 
System.err.write(line.getBytes(),0,line.getLength());
System.err.println();

I think it can be verified by review; it is not easy to write a test case for 
this JIRA.

Manual steps to check this:

1) Copy a small file to HDFS.

2) Run a streaming job using the following mapper:

#!/bin/sh
cat >/dev/null

echo "㊣ ?※" >&2
echo "礙骯襖壩闆辦" >&2

3) Check the task's stderr output; the logs will be corrupted.

4) Apply the patch and run the streaming job again; the task's stderr will be 
fine.

This patch is useful when users need to write debug messages, for example an 
input record that might be encoded in Big5, GBK and so on.
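
Why the byte-level write matters can be shown with a small sketch. It is not 
the PipeMapRed code: RawStderrSketch and both method names are hypothetical, 
an in-memory stream stands in for System.err, and the trailing newline is left 
out so the byte arrays can be compared directly. The input below is the GBK 
encoding of two Chinese characters, which is not valid UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

final class RawStderrSketch {
    // Lossy path: decode the bytes as UTF-8 into a String, then encode the
    // String again. Bytes that are not valid UTF-8 become U+FFFD.
    static byte[] viaString(byte[] line) {
        String decoded = new String(line, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Raw path: hand the bytes to PrintStream.write(byte[], int, int),
    // which passes them through without any charset conversion.
    static byte[] viaRawBytes(byte[] line) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true);
        out.write(line, 0, line.length);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // GBK bytes for two Chinese characters (0xD6D0, 0xCEC4).
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println("via String keeps bytes: "
            + java.util.Arrays.equals(viaString(gbk), gbk));
        System.out.println("via raw write keeps bytes: "
            + java.util.Arrays.equals(viaRawBytes(gbk), gbk));
    }
}
```

The String path changes the bytes, while the raw path returns them unchanged, 
which is the behaviour the patch relies on.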

> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
> Fix For: 0.21.0
>
> Attachments: streaming-1277.patch
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Updated: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-09 Thread ZhuGuanyin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhuGuanyin updated MAPREDUCE-1277:
--

Attachment: streaming-1277.patch

> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
> Fix For: 0.21.0
>
> Attachments: streaming-1277.patch
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Updated: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-09 Thread ZhuGuanyin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhuGuanyin updated MAPREDUCE-1277:
--

Status: Patch Available  (was: Open)

> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
> Fix For: 0.21.0
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Commented: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-09 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788021#action_12788021
 ] 

ZhuGuanyin commented on MAPREDUCE-1277:
---

Test case:
Use the following mapper and you will see that the stderr log is corrupted.

#!/bin/sh
cat >/dev/null

echo "㊣ ?※" >&2
echo "礙骯襖壩闆辦" >&2



> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
> Fix For: 0.21.0
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Commented: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-09 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788022#action_12788022
 ] 

ZhuGuanyin commented on MAPREDUCE-1277:
---

A simple solution:

Change line 492 in PipeMapRed.java from

System.err.println(lineStr);

to:
System.err.write(line.getBytes(),0,line.getLength());
System.err.println();

I will attach the patch soon. 

> Streaming job should support other characterset in user's stderr log, not 
> only utf8
> ---
>
> Key: MAPREDUCE-1277
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming
>Affects Versions: 0.21.0
>Reporter: ZhuGuanyin
> Fix For: 0.21.0
>
>
> Current implementation in streaming  only support utf8 encoded user stderr 
> log, it should encode free to support other characterset.




[jira] Created: (MAPREDUCE-1277) Streaming job should support other characterset in user's stderr log, not only utf8

2009-12-09 Thread ZhuGuanyin (JIRA)
Streaming job should support other characterset in user's stderr log, not only 
utf8
---

 Key: MAPREDUCE-1277
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1277
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.21.0
Reporter: ZhuGuanyin
 Fix For: 0.21.0


The current streaming implementation only supports UTF-8 encoded user stderr 
logs; it should be encoding-agnostic so that it supports other character sets.




[jira] Commented: (MAPREDUCE-1254) job.xml should add crc check in tasktracker and sub jvm.

2009-12-04 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785838#action_12785838
 ] 

ZhuGuanyin commented on MAPREDUCE-1254:
---

Local inexpensive disks are not reliable. We once found that a non-empty file 
had become zero length without any warning in the OS kernel messages; only 
some minutes later did the kernel messages report the disk failures. During 
that time, read operations returned success without throwing any IOException.

In the current implementation, an IOException is thrown if job.xml is missing, 
but a corrupted or truncated configuration file goes undetected.
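
A minimal sketch of the kind of check being asked for, under loud assumptions: 
it does not use Hadoop's checksum file format, ConfigChecksumSketch and its 
methods are hypothetical, and the expected CRC32 is simply passed in as a 
number recorded when the file was written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

final class ConfigChecksumSketch {
    // CRC32 over the whole byte array.
    static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Reads the file and fails loudly when its CRC differs from the value
    // recorded at write time, so a truncated or corrupted file is never
    // handed to the parser.
    static byte[] readVerified(Path file, long expectedCrc) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        long actualCrc = crcOf(bytes);
        if (actualCrc != expectedCrc) {
            throw new IOException("CRC mismatch for " + file
                + ": expected " + expectedCrc + " but got " + actualCrc);
        }
        return bytes;
    }
}
```

A file truncated to zero length, as in the incident described above, fails 
this check instead of being read successfully.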

> job.xml should add crc check in tasktracker and sub jvm.
> 
>
> Key: MAPREDUCE-1254
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1254
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: task, tasktracker
>Affects Versions: 0.22.0
>Reporter: ZhuGuanyin
>
> Currently job.xml in tasktracker and subjvm are write to local disk through 
> ChecksumFilesystem, and already had crc checksum information, but load the 
> job.xml file without crc check. It would cause the mapred job finished 
> successful but with wrong data because of disk error.  Example: The 
> tasktracker and sub task jvm would load the default configuration if it 
> doesn't successfully load the job.xml which maybe replace the mapper with 
> IdentityMapper. 




[jira] Commented: (MAPREDUCE-222) Shuffle should be refactored to a separate task by itself

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784075#action_12784075
 ] 

ZhuGuanyin commented on MAPREDUCE-222:
--

I think it would be better if the shuffle and sort phases were separated from 
the reduce task.

1) In the current implementation, a rescheduled reduce has to shuffle and sort 
again if the former reduce task failed. For example, the shuffle and sort 
phases cost a lot of time if a reduce needs to fetch intermediate map output 
from 100k maps.

2) We could shuffle and sort while another job's or task's reducer is running, 
which would maximize resource utilization. In the current implementation, a 
reduce slot is consumed while the task is shuffling or waiting for the maps to 
finish.

3) We could localize the reduce task on the tasktracker where its shuffle ran.

> Shuffle should be refactored to a separate task by itself
> -
>
> Key: MAPREDUCE-222
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-222
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Devaraj Das
>
> Currently, shuffle phase is part of the reduce task. The idea here is to move 
> out the shuffle as a first-class task. This will improve the usage of the 
> network since we will then be able to schedule shuffle tasks independently, 
> and later on pin reduce tasks to those nodes. This will make most sense for 
> apps where there are multiple waves of reduces (the second wave of reduces 
> can directly start off doing the "reducer" phase).




[jira] Commented: (MAPREDUCE-1099) Setup and cleanup tasks could affect job latency if they are caught running on bad nodes

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784067#action_12784067
 ] 

ZhuGuanyin commented on MAPREDUCE-1099:
---

We have encountered the same problem, so we simply removed the setup and 
cleanup tasks (by importing the patch from 
https://issues.apache.org/jira/browse/MAPREDUCE-463 ).

> Setup and cleanup tasks could affect job latency if they are caught running 
> on bad nodes
> 
>
> Key: MAPREDUCE-1099
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1099
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.1
>Reporter: Hemanth Yamijala
>
> We found cases on our clusters where a setup task got scheduled on a bad node 
> and took upwards of several minutes to run, adversely affecting job runtimes. 
> Speculation did not help here as speculation is not used for setup tasks. I 
> suspect the same could happen for cleanup tasks as well.




[jira] Created: (MAPREDUCE-1254) job.xml should add crc check in tasktracker and sub jvm.

2009-11-30 Thread ZhuGuanyin (JIRA)
job.xml should add crc check in tasktracker and sub jvm.


 Key: MAPREDUCE-1254
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1254
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: task, tasktracker
Affects Versions: 0.22.0
Reporter: ZhuGuanyin


Currently job.xml in the tasktracker and sub JVM is written to local disk 
through ChecksumFileSystem and already has CRC checksum information, but the 
job.xml file is loaded without a CRC check. A disk error could therefore cause 
a mapred job to finish successfully but with wrong data. Example: the 
tasktracker and sub-task JVM would load the default configuration if they fail 
to load job.xml, which might replace the mapper with IdentityMapper. 




[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784019#action_12784019
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

I agree: separating the long-held lock methods from the heartbeat thread and 
never doing I/O while holding locks is the best solution. We tried, but found 
it is not easy to achieve and would not be resolved soon, so I propose a 
temporary solution. 
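
For illustration only, the "no I/O while holding locks" pattern looks roughly 
like this. CleanupSketch is a hypothetical class, not TaskTracker code: the 
lock is held just long enough to copy the list of paths, and the slow file 
deletions run after it has been released.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class CleanupSketch {
    private final Object lock = new Object();
    private final List<Path> pending = new ArrayList<>();

    // Queues a path for later deletion; only touches in-memory state.
    void schedule(Path path) {
        synchronized (lock) {
            pending.add(path);
        }
    }

    // Deletes everything queued so far and returns how many files were
    // actually removed. The disk I/O happens outside the lock, so other
    // threads that need the lock are not blocked by slow deletes.
    int cleanup() throws IOException {
        List<Path> batch;
        synchronized (lock) {
            batch = new ArrayList<>(pending);
            pending.clear();
        }
        int deleted = 0;
        for (Path path : batch) {
            if (Files.deleteIfExists(path)) {
                deleted++;
            }
        }
        return deleted;
    }
}
```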

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker report task status to jobtracker through heartbeat, 
> sometimes if the tasktracker  lock the tasktracker to do some cleanup  job, 
> like remove task temp data on disk, the heartbeat thread would hang for a 
> long time while waiting for the lock, so the jobtracker just thought it had 
> lost and would reschedule all its finished maps or un finished reduce on 
> other tasktrackers, we call it "fake lost tasktracker", some times it doesn't 
> acceptable especially when we run some large jobs.  So We introduce a 
> out-of-band heartbeat mechanism to send an out-of-band heartbeat in that case.




[jira] Commented: (MAPREDUCE-1248) Redundant memory copying in StreamKeyValUtil

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783614#action_12783614
 ] 

ZhuGuanyin commented on MAPREDUCE-1248:
---

The same thing happens in KeyValueLineRecordReader.java when it calls the 
next() method.

> Redundant memory copying in StreamKeyValUtil
> 
>
> Key: MAPREDUCE-1248
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1248
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: contrib/streaming
>Reporter: Ruibang He
>Priority: Minor
>
> I found that when MROutputThread collecting the output of  Reducer, it calls 
> StreamKeyValUtil.splitKeyVal() and two local byte-arrays are allocated there 
> for each line of output. Later these two byte-arrays are passed to variable 
> key and val. There are twice memory copying here, one is the 
> System.arraycopy() method, the other is inside key.set() / val.set().
> This causes double times of memory copying for the whole output (may lead to 
> higher CPU consumption), and frequent temporay object allocation.




[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783604#action_12783604
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

We could make the out-of-band heartbeat thread in the tasktracker optional (a 
configurable parameter, with the thread not started by default), since small 
clusters running small jobs do not need it. The additional thread is very 
useful for clusters running large jobs. Our production Hadoop cluster became 
more robust and no longer sees fake lost tasktrackers. I would attach the 
patch if anyone is interested.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker report task status to jobtracker through heartbeat, 
> sometimes if the tasktracker  lock the tasktracker to do some cleanup  job, 
> like remove task temp data on disk, the heartbeat thread would hang for a 
> long time while waiting for the lock, so the jobtracker just thought it had 
> lost and would reschedule all its finished maps or un finished reduce on 
> other tasktrackers, we call it "fake lost tasktracker", some times it doesn't 
> acceptable especially when we run some large jobs.  So We introduce a 
> out-of-band heartbeat mechanism to send an out-of-band heartbeat in that case.




[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783574#action_12783574
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

The taskCleanup thread locks the TaskTracker when it calls 
MapOutputFile.removeAll() through TaskTracker.purgeTask() to clean up a task 
or TaskTracker.purgeJob() to clean up a job. If the intermediate output files 
are larger than 50GB and there are other I/O operations on the same disk, it 
holds the tasktracker lock long enough for the jobtracker to treat this 
tasktracker as dead.

I think the current heartbeat thread has to handle too many things that are 
not its duty. Deadlocks in the tasktracker may still happen and may go 
undetected in the current implementation, and I don't think it is the 
heartbeat's duty to detect deadlocks in the tasktracker.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker report task status to jobtracker through heartbeat, 
> sometimes if the tasktracker  lock the tasktracker to do some cleanup  job, 
> like remove task temp data on disk, the heartbeat thread would hang for a 
> long time while waiting for the lock, so the jobtracker just thought it had 
> lost and would reschedule all its finished maps or un finished reduce on 
> other tasktrackers, we call it "fake lost tasktracker", some times it doesn't 
> acceptable especially when we run some large jobs.  So We introduce a 
> out-of-band heartbeat mechanism to send an out-of-band heartbeat in that case.




[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783566#action_12783566
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

The out-of-band heartbeat thread (or we could call it the true heartbeat 
thread) only sends the tasktracker's name to the jobtracker, and the 
jobtracker just updates its last-seen time. We could add a new interface to 
InterTrackerProtocol, so it doesn't add a lot of confusion or complexity. 
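
The jobtracker side of this could be as small as the sketch below. 
LivenessTracker, ping() and isLost() are hypothetical names, not part of 
InterTrackerProtocol, and the clock is passed in as a parameter to keep the 
example deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical jobtracker-side bookkeeping for out-of-band heartbeats.
final class LivenessTracker {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long expiryMillis;

    LivenessTracker(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    // Called when an out-of-band heartbeat arrives. It carries only the
    // tasktracker's name and only refreshes the last-seen time.
    void ping(String trackerName, long nowMillis) {
        lastSeenMillis.put(trackerName, nowMillis);
    }

    // A tracker counts as lost if it was never seen or has been silent
    // for longer than the expiry interval.
    boolean isLost(String trackerName, long nowMillis) {
        Long lastSeen = lastSeenMillis.get(trackerName);
        return lastSeen == null || nowMillis - lastSeen > expiryMillis;
    }
}
```

Because the ping touches nothing but a concurrent map, it needs none of the 
tasktracker locks that the regular heartbeat waits on.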



> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker report task status to jobtracker through heartbeat, 
> sometimes if the tasktracker  lock the tasktracker to do some cleanup  job, 
> like remove task temp data on disk, the heartbeat thread would hang for a 
> long time while waiting for the lock, so the jobtracker just thought it had 
> lost and would reschedule all its finished maps or un finished reduce on 
> other tasktrackers, we call it "fake lost tasktracker", some times it doesn't 
> acceptable especially when we run some large jobs.  So We introduce a 
> out-of-band heartbeat mechanism to send an out-of-band heartbeat in that case.




[jira] Commented: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-30 Thread ZhuGuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783564#action_12783564
 ] 

ZhuGuanyin commented on MAPREDUCE-1247:
---

We captured Java jstack output whenever a tasktracker became a fake lost 
tasktracker on Hadoop version 0.19, and found:

7 times the heartbeat thread was waiting for the TaskTracker lock (5 times 
because the taskCleanup thread held it for a long time, 2 times because a 
reduce sub JVM called TaskTracker.getMapCompletionEvents())


4 times the heartbeat thread was waiting for the TaskTracker.TaskInProgress 
lock (3 times because the taskCleanup thread held it for a long time, 1 time 
because TaskLauncher held it for a long time)

2 times the heartbeat thread was waiting for the AllocatorPerContext lock 


The heartbeat thread should only be responsible for reporting whether the 
tasktracker is alive or dead, but in the current implementation it has too 
many other things to do. We should let the heartbeat thread do only what it 
has to do.

> Send out-of-band heartbeat to avoid fake lost tasktracker
> -
>
> Key: MAPREDUCE-1247
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: ZhuGuanyin
>
> Currently the TaskTracker report task status to jobtracker through heartbeat, 
> sometimes if the tasktracker  lock the tasktracker to do some cleanup  job, 
> like remove task temp data on disk, the heartbeat thread would hang for a 
> long time while waiting for the lock, so the jobtracker just thought it had 
> lost and would reschedule all its finished maps or un finished reduce on 
> other tasktrackers, we call it "fake lost tasktracker", some times it doesn't 
> acceptable especially when we run some large jobs.  So We introduce a 
> out-of-band heartbeat mechanism to send an out-of-band heartbeat in that case.




[jira] Created: (MAPREDUCE-1247) Send out-of-band heartbeat to avoid fake lost tasktracker

2009-11-29 Thread ZhuGuanyin (JIRA)
Send out-of-band heartbeat to avoid fake lost tasktracker
-

 Key: MAPREDUCE-1247
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1247
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: ZhuGuanyin


Currently the TaskTracker reports task status to the jobtracker through 
heartbeats. Sometimes the tasktracker holds the TaskTracker lock to do cleanup 
work, such as removing task temp data on disk, and the heartbeat thread hangs 
for a long time waiting for the lock. The jobtracker then considers the 
tasktracker lost and reschedules all its finished maps and unfinished reduces 
on other tasktrackers. We call this a "fake lost tasktracker". It is sometimes 
unacceptable, especially when we run large jobs, so we introduce an 
out-of-band heartbeat mechanism to send an out-of-band heartbeat in that case.
