[jira] Commented: (MAPREDUCE-279) Map-Reduce 2.0

2011-03-18 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008452#comment-13008452
 ] 

Arun C Murthy commented on MAPREDUCE-279:
-

Thanks for your feedback, Tom.

bq. I wonder if it would be easier not to move the 
src/java/org/apache/hadoop/mapred(uce) trees at this stage.

The main issue is the dependency chain: currently the mr-client depends purely 
on APIs in the yarn package. In the alternate proposal (which we considered), 
mr-client would need to depend on both yarn and src/java for the runtime. 

The current scheme is both more modular and enforces discipline by ensuring 
that the MapReduce runtime (map, sort, shuffle, merge, reduce) cannot, even 
accidentally, start relying on classes in the server package i.e. JT/TT etc. 
This also seems like the right end-state for the project.

Also, as you pointed out, the changes to classes in 
src/java/org/apache/hadoop/mapred(uce) are very minor and the 'svn mv' is both 
well documented (MR-279_MR_files_to_move.txt, MR-279.sh) and straight-forward.



bq. MAPREDUCE-1638 is highly relevant for this work

Thanks! MAPREDUCE-1638 is very relevant. MAPREDUCE-279 already has some of the 
changes you proposed there i.e. keeping server classes in a separate source 
structure from the implementation classes - we should collaborate both on trunk 
and on the MR-279 branch to ensure consistency. I'm happy to merge if 
necessary. 


> Map-Reduce 2.0
> --
>
> Key: MAPREDUCE-279
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-279
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: jobtracker, tasktracker
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
> Fix For: 0.23.0
>
> Attachments: MR-279.patch, MR-279.patch, MR-279.sh, 
> MR-279_MR_files_to_move.txt
>
>
> Re-factor MapReduce into a generic resource scheduler and a per-job, 
> user-defined component that manages the application execution. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (MAPREDUCE-279) Map-Reduce 2.0

2011-03-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008517#comment-13008517
 ] 

Todd Lipcon commented on MAPREDUCE-279:
---

Hi Arun. I spent the train ride this morning looking over yarn/src/main/avro in 
the branch. Here are a few comments, sorry for the somewhat 
stream-of-consciousness format.


- Is the correct suffix still .genavro? Thought we'd changed the name to 
.avroidl or something?
- Apache licenses needed on these files
- Does AvroIDL convert javadoc-style comments on records/protocols into JavaDoc 
on generated code? If so we should do more of that.


- AMRMProtocol:
-- the "release" parameter to allocate is strange: (a) it seems the function is 
misnamed if you can also release things as you call it, and (b) why isn't it an 
array?
-- if you want to cancel previous resource requests, do you submit a new one 
with a negative numContainers?


- ApplicationSubmissionContext:
-- would be good to have some kind of scheduler-specific parameters here? eg 
maybe a scheduler has something beyond just "priority" (eg. perhaps a deadline)
-- using just URL type directly for resources - seems not quite flexible 
enough? eg one useful construct would be a URL + checksum
-- what's resources_todo going to be?
-- passing "user" - agreed, this should be more flexible than simple string.
-- Why not contain a ContainerLaunchContext to specify the container in which 
to run the AM? Seems like lots of duplicated fields.

- ContainerManager:
-- not following YarnContainerTags - these are opaque enums, how do they get 
interpolated in a string?
-- how does one access stderr/stdout contents? both while they're being written 
and after a container has terminated? (maybe I just haven't gotten to that bit 
yet somewhere else)

- yarn-types.avro:
-- For the typesafe ID classes, do we need to specify explicit comparison 
orderings? I don't know Avro behavior here.
-- Did you consider making the ids all strings instead of ints? The pro would 
be that there could be canonical formats, like "AM-" for app masters vs 
"C-" for containers. AWS does a good job of this.
-- Resource: field names should include units, like "int memoryMB"
-- what are ContainerTokens? could use some extra doc at the protocol layer 
here. (I assume this is for security?)
-- The "Container" type doesn't appear 
-- the URL record is missing user/password used for http basic auth or s3n auth
-- there are some hard tabs in this file
-- ApplicationMaster:
--- httpPort seems like it would be better described as something like 
"httpStatusURL"?
-- LocalResourceVisibility:
--- just to clarify, APPLICATION visibility means "only to this application 
submitted by this user". ie if joe and bob both submit MapReduce 2.x.y jobs 
with identical jars, it still won't share, even if sha1s match?
--- if bob submits the same application (ie MR 2.x.y) twice, do APPLICATION 
visibility files get shared?




[jira] Updated: (MAPREDUCE-2368) RAID DFS regression

2011-03-18 Thread Ramkumar Vadali (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramkumar Vadali updated MAPREDUCE-2368:
---

Status: Open  (was: Patch Available)

Will resubmit patch now that MiniMRCluster delays are resolved.

> RAID DFS regression
> ---
>
> Key: MAPREDUCE-2368
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2368
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: contrib/raid
> Reporter: Ramkumar Vadali
> Assignee: Ramkumar Vadali
> Fix For: 0.20.3
>
> Attachments: MAPREDUCE-2368.patch
>
>
> The patch for MAPREDUCE-2248 did not handle zero-length files correctly, 
> which leads to ArrayIndexOutOfBoundsException when opening a zero-length 
> file. That case needs special handling.



[jira] Updated: (MAPREDUCE-2368) RAID DFS regression

2011-03-18 Thread Ramkumar Vadali (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramkumar Vadali updated MAPREDUCE-2368:
---

Status: Patch Available  (was: Open)



[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel

2011-03-18 Thread gopikannan venugopalsamy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008557#comment-13008557
 ] 

gopikannan venugopalsamy commented on MAPREDUCE-2257:
-

I'd like to work on this. Hey Nikhil, would you like to discuss?

> distcp can copy blocks in parallel
> --
>
> Key: MAPREDUCE-2257
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: distcp
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).
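
The idea in the description can be sketched roughly as follows: split a file into block-sized byte ranges, copy each range independently, then append the parts in order. This is only a local-filesystem illustration, not distcp and not the HDFS Concat API. The names `block_ranges`, `copy_range`, and `parallel_copy` are hypothetical, and a thread pool stands in for the map tasks.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def block_ranges(file_size, block_size):
    """Return (offset, length) pairs covering the file, one per block."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def copy_range(src_path, offset, length, part_path):
    """Copy one byte range of the source into its own part file."""
    with open(src_path, "rb") as src, open(part_path, "wb") as part:
        src.seek(offset)
        part.write(src.read(length))
    return part_path


def parallel_copy(src_path, dst_path, block_size, workers=4):
    """Copy block ranges concurrently, then stitch the parts in order."""
    ranges = block_ranges(os.path.getsize(src_path), block_size)
    with tempfile.TemporaryDirectory() as tmp:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(copy_range, src_path, off, length,
                            os.path.join(tmp, "part-%05d" % i))
                for i, (off, length) in enumerate(ranges)
            ]
            # Collecting results in submission order preserves block order.
            parts = [f.result() for f in futures]
        # Stand-in for the concat step: append the parts sequentially.
        with open(dst_path, "wb") as dst:
            for part in parts:
                with open(part, "rb") as p:
                    dst.write(p.read())
```

With this shape the unit of work becomes a block rather than a file, so one very large file no longer ties up a single long-running task. A zero-length source yields no ranges and an empty destination.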



[jira] Commented: (MAPREDUCE-2345) Optimize jobtracker's memory usage

2011-03-18 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008597#comment-13008597
 ] 

Allen Wittenauer commented on MAPREDUCE-2345:
-

> But how about a running job with tens of thousands of tasks? We see that big 
> running 
> jobs use much memory in the cluster. 

This is almost always a sign that the data being read is not laid out 
efficiently (e.g. too small a block size), that one needs to use 
CombineFileInputFormat, or that there are just too many reducers in play. There is 
almost never a reason to have jobs in the x0,000 area unless the dataset is 
Just That Big.
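
A rough way to see the effect of file layout on task count: with one split per block per file, many small files yield one map task each, while packing whole files into larger combined splits cuts the count sharply. The sketch below is a back-of-the-envelope estimate with made-up sizes, not Hadoop's actual split calculation. `splits_per_file` and `combined_splits` are hypothetical helpers.

```python
import math


def splits_per_file(file_sizes, block_size):
    """One split per block, and at least one split per non-empty file."""
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)


def combined_splits(file_sizes, max_split_size):
    """Greedily pack whole files into splits of at most max_split_size bytes."""
    splits, current = 0, 0
    for size in file_sizes:
        if size <= 0:
            continue
        # Close the current split if this file would overflow it.
        if current and current + size > max_split_size:
            splits += 1
            current = 0
        current += size
    # Count the last, partially filled split if there is one.
    return splits + (1 if current else 0)


MB = 1024 * 1024
small_files = [4 * MB] * 20000  # hypothetical: 20,000 files of 4 MB each
print(splits_per_file(small_files, 64 * MB))   # one task per small file
print(combined_splits(small_files, 256 * MB))  # many files share one split
```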

> Optimize jobtracker's  memory usage  
> -
>
> Key: MAPREDUCE-2345
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2345
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: jobtracker
> Affects Versions: 0.21.0
> Reporter: MengWang
> Labels: hadoop
> Fix For: 0.23.0
>
> Attachments: jt-memory-useage.bmp
>
>
> Too many tasks eat up a considerable amount of the JobTracker's heap space. 
> According to our observation, a 50 GB heap can support up to 5,000,000 tasks, 
> so we should optimize the JobTracker's memory usage to allow more jobs and 
> tasks. A YourKit Java profile shows that counters, duplicate strings, and 
> tasks waste too much memory. Our optimizations around these three points 
> reduced the JobTracker's memory usage to 1/3. 
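
For scale, the figures in the description work out to roughly 10 KB of heap per task. The arithmetic below assumes the 50 GB figure means GiB and that the reported reduction to one third applies uniformly to the per-task average; both are assumptions, not statements from the report.

```python
heap_bytes = 50 * 1024**3        # 50 GB heap, read as GiB (assumption)
tasks = 5_000_000                # task count quoted in the description

per_task = heap_bytes / tasks    # average heap per task, in bytes
per_task_after = per_task / 3    # if usage drops to one third (assumption)

print(round(per_task), round(per_task_after))
```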



[jira] Commented: (MAPREDUCE-279) Map-Reduce 2.0

2011-03-18 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008648#comment-13008648
 ] 

Chris Douglas commented on MAPREDUCE-279:
-

bq. Why not contain a ContainerLaunchContext to specify the container in which 
to run the AM? Seems like lots of duplicated fields.
Agreed. Fixing this also addresses the URL as insufficient for resources. The 
\_todo form was introduced to effect this, and remains in-progress.

bq. how does one access stderr/stdout contents? both while they're being 
written and after a container has terminated? (maybe I just haven't gotten to 
that bit yet somewhere else)
This is still a TODO (working on it now). In the short term, something similar 
to what the TT does is probably sufficient, I hope.

bq. Did you consider making the ids all strings instead of ints? The pro would 
be that there could be canonical formats, like "AM-" for app masters vs 
"C-" for containers.
Some of the implementation ended up relying on a consistent mapping of int ids 
to strings, so going all the way could make sense. On the other hand, parsing 
strings to determine relationships between containers and applications is 
regrettable.

bq. the URL record is missing user/password used for http basic auth or s3n auth
Agreed, full URIs should be supported, though pushing that all the way through 
FileContext and FileSystem could be painful.

bq. just to clarify, APPLICATION visibility means "only to this application 
submitted by this user". ie if joe and bob both submit MapReduce 2.x.y jobs 
with identical jars, it still won't share, even if sha1s match?
Right. The target layout for the NodeManager looks roughly like this:
{noformat}
for x in localdir:
  $x/filecache # public cache
  $x/usercache
  $x/usercache/$user
  $x/usercache/filecache # private cache
  $x/usercache/$user/appcache
  $x/usercache/$user/appcache/$appid
  $x/usercache/$user/appcache/$appid/filecache # application cache
  $x/usercache/$user/appcache/$appid/$containerid
  $x/usercache/$user/appcache/$appid/output # output retained after container exits, i.e. intermediate data
{noformat}
So cleanup at the end of a container or an application can just delete the 
corresponding subdirs. Matching a job jar between invocations would require 
one to register that resource as PUBLIC/PRIVATE. The APPLICATION scope is more 
for job.xml and the like.
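
The layout above can be expressed as a small path builder. This is a sketch only: `nm_layout` and `cleanup_targets` are hypothetical names, the private cache is placed under `$x/usercache/$user/filecache` as given in the follow-up correction to this comment, and the cleanup rule (container end removes the container directory, application end removes the whole `$appid` subtree) is one reading of the text rather than confirmed NodeManager behavior.

```python
import posixpath


def nm_layout(local_dir, user, app_id, container_id):
    """Build the cache paths under one NodeManager local directory."""
    user_dir = posixpath.join(local_dir, "usercache", user)
    app_dir = posixpath.join(user_dir, "appcache", app_id)
    return {
        "public_cache": posixpath.join(local_dir, "filecache"),
        "private_cache": posixpath.join(user_dir, "filecache"),
        "application_cache": posixpath.join(app_dir, "filecache"),
        "container_dir": posixpath.join(app_dir, container_id),
        "output_dir": posixpath.join(app_dir, "output"),
    }


def cleanup_targets(layout, application_finished):
    """Directories that can be deleted when a container or application ends."""
    if application_finished:
        # The application's whole subtree goes away, including its file cache.
        return [posixpath.dirname(layout["application_cache"])]
    # Only the container's working directory; the output dir is retained.
    return [layout["container_dir"]]
```

Because every APPLICATION-scoped file lives under `$appid`, two applications never share those files even for the same user, which matches the answer given above.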



[jira] Commented: (MAPREDUCE-279) Map-Reduce 2.0

2011-03-18 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008655#comment-13008655
 ] 

Chris Douglas commented on MAPREDUCE-279:
-

Sorry, the location of the private cache is {{$x/usercache/$user/filecache}}, 
not {{$x/usercache/filecache}}.



[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel

2011-03-18 Thread Rosie Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008669#comment-13008669
 ] 

Rosie Li commented on MAPREDUCE-2257:
-

I'm working on this feature right now. I've already finished writing the code 
and am testing it now.



[jira] Created: (MAPREDUCE-2396) InMemFSMergeThread.doInMemMerge() misses the last MapOutput for each merge round

2011-03-18 Thread Elton Tian (JIRA)
InMemFSMergeThread.doInMemMerge() misses the last MapOutput for each merge round


 Key: MAPREDUCE-2396
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2396
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.20.2
Reporter: Elton Tian
Priority: Critical


In ReduceTask.shuffleInMemory(), once a new MapOutput is read, 
ramManager.closeInMemoryFile() is called to notify waitForDataToMerge() to 
check whether an in-memory merge is required. If the threshold is met, 
doInMemMerge() may start before the MapOutput that was just read has been 
added to mapOutputsFilesInMemory. So the "dataAvailable.notify();" call should 
be removed, letting noteCopiedMapOutput() notify the merge thread instead. 
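
The ordering problem can be illustrated with a toy model in which the merger is only ever woken after the new output is already in the shared list. This is not the ReduceTask code: the class `InMemoryOutputs` and its methods are hypothetical, and the threshold here counts outputs rather than memory, which is a simplification.

```python
import threading


class InMemoryOutputs:
    """Toy model: fetchers add outputs, a merger drains them at a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.outputs = []
        self.cond = threading.Condition()

    def note_copied_output(self, output):
        # Add the output and notify under the same lock, so the merger
        # can never wake up before the new output is visible in the list.
        with self.cond:
            self.outputs.append(output)
            if len(self.outputs) >= self.threshold:
                self.cond.notify()

    def wait_and_merge(self):
        # Block until enough outputs are present, then take all of them.
        with self.cond:
            while len(self.outputs) < self.threshold:
                self.cond.wait()
            batch, self.outputs = self.outputs, []
        return batch
```

Notifying before the append, as the report describes, would let the merger take its batch while the newest output is still missing; doing both steps under one lock removes that window.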



[jira] Created: (MAPREDUCE-2397) Allow user to sort jobs in different sections (Completed, Failed, etc.) by the various columns available

2011-03-18 Thread Stephen Tunney (JIRA)
Allow user to sort jobs in different sections (Completed, Failed, etc.) by the 
various columns available


 Key: MAPREDUCE-2397
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2397
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Reporter: Stephen Tunney
Priority: Trivial


It would be nice (IMHO) to be able to sort the tables on the jobtracker.jsp 
page by any column (jobID would be most logical at first) so that one could 
eliminate scrolling all of the time.  Perhaps also have the page save the 
user's sorting preferences per table too.
