[jira] Updated: (HADOOP-2596) add SequenceFile.createWriter() method that takes block size as parameter

2008-01-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2596:
--

Status: Patch Available  (was: Open)

 add SequenceFile.createWriter() method that takes block size as parameter
 -

 Key: HADOOP-2596
 URL: https://issues.apache.org/jira/browse/HADOOP-2596
 Project: Hadoop
  Issue Type: Improvement
  Components: io
 Environment: all
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Minor
 Fix For: 0.16.0

 Attachments: patch2596.txt


 Currently it is not possible to create a SequenceFile.Writer using a block 
 size other than the default.
 A createWriter() overload whose signature receives the block size as a 
 parameter should be added to the SequenceFile class.
 With all the current signatures for this method there is significant code 
 duplication; if possible, the createWriter() methods should be refactored to 
 avoid such duplication.
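 For illustration, one possible shape for such an overload (a sketch only; 
 this parameter list is assumed, not taken from the attached patch):
 {code}
 // Internally this would pass blockSize straight through to
 // FileSystem.create(name, true, bufferSize, replication, blockSize)
 // instead of relying on the configured default block size.
 public static SequenceFile.Writer createWriter(FileSystem fs, Configuration conf,
     Path name, Class keyClass, Class valClass,
     int bufferSize, short replication, long blockSize,
     SequenceFile.CompressionType compressionType) throws IOException;
 {code}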

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2596) add SequenceFile.createWriter() method that takes block size as parameter

2008-01-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2596:
--

Status: Open  (was: Patch Available)

Re-submitting to hudson since an unrelated test failed...

 add SequenceFile.createWriter() method that takes block size as parameter
 -

 Key: HADOOP-2596
 URL: https://issues.apache.org/jira/browse/HADOOP-2596
 Project: Hadoop
  Issue Type: Improvement
  Components: io
 Environment: all
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Minor
 Fix For: 0.16.0

 Attachments: patch2596.txt


 Currently it is not possible to create a SequenceFile.Writer using a block 
 size other than the default.
 A createWriter() overload whose signature receives the block size as a 
 parameter should be added to the SequenceFile class.
 With all the current signatures for this method there is significant code 
 duplication; if possible, the createWriter() methods should be refactored to 
 avoid such duplication.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2402) Lzo compression compresses each write from TextOutputFormat

2008-01-17 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2402:
--

Status: Open  (was: Patch Available)

Mostly looks ok, but there are too many unrelated white-space changes - hence, 
I'm cancelling this patch.

 Lzo compression compresses each write from TextOutputFormat
 ---

 Key: HADOOP-2402
 URL: https://issues.apache.org/jira/browse/HADOOP-2402
 Project: Hadoop
  Issue Type: Bug
  Components: io, mapred, native
Reporter: Chris Douglas
Assignee: Chris Douglas
 Fix For: 0.16.0

 Attachments: 2402-0.patch, 2402-1.patch


 Outputting with TextOutputFormat and Lzo compression generates a file such 
 that each key, tab delimiter, and value is compressed separately.
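 Roughly what happens, as a simplified sketch (this is not the actual 
 TextOutputFormat code; the names are illustrative):
 {code}
 // Each write() below reaches the compressor as a separate call, so each
 // key, tab and value ends up in its own compressed chunk.
 DataOutputStream out =
     new DataOutputStream(codec.createOutputStream(rawOut));
 out.write(keyBytes);    // compressed chunk 1
 out.write(tabBytes);    // compressed chunk 2
 out.write(valueBytes);  // compressed chunk 3
 // One possible remedy is to batch the small writes, e.g.
 //   new DataOutputStream(
 //       new BufferedOutputStream(codec.createOutputStream(rawOut)));
 {code}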

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-14 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Attachment: mapred_tutorial.html
HADOOP-2574_1_20080114.patch

Updated to incorporate Phu's original ask and Amar's feedback... again, I've 
attached the generated mapred_tutorial.html for folks to review it without 
having to figure out Forrest.

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3

 Attachments: HADOOP-2574_0_20080110.patch, 
 HADOOP-2574_1_20080114.patch, mapred_tutorial.html, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-14 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Fix Version/s: (was: 0.16.0)
   Status: Patch Available  (was: Open)

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3

 Attachments: HADOOP-2574_0_20080110.patch, 
 HADOOP-2574_1_20080114.patch, mapred_tutorial.html, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2574) bugs in mapred tutorial

2008-01-14 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558569#action_12558569
 ] 

Arun C Murthy commented on HADOOP-2574:
---

Phu - does this patch 
(http://issues.apache.org/jira/secure/attachment/12373083/mapred_tutorial.html) 
address your concerns?

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3

 Attachments: HADOOP-2574_0_20080110.patch, 
 HADOOP-2574_1_20080114.patch, mapred_tutorial.html, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2516) HADOOP-1819 removed a public api JobTracker.getTracker in 0.15.0

2008-01-14 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558574#action_12558574
 ] 

Arun C Murthy commented on HADOOP-2516:
---

I'd still go with Owen's comment: mark it as won't fix and then move 
HADOOP-1819 to the *INCOMPATIBLE CHANGES* section...

 HADOOP-1819 removed a public api JobTracker.getTracker in 0.15.0
 

 Key: HADOOP-2516
 URL: https://issues.apache.org/jira/browse/HADOOP-2516
 Project: Hadoop
  Issue Type: Bug
Affects Versions: 0.15.1
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.15.3


 HADOOP-1819 removed a 0.14.0 public api {{JobTracker.getTracker}} in 0.15.0.
 http://svn.apache.org/viewvc?view=rev&revision=575438 and
 http://svn.apache.org/viewvc/lucene/hadoop/branches/branch-0.15/src/java/org/apache/hadoop/mapred/JobTracker.java?r1=573708&r2=575438&diff_format=h
 There is a simple work-around i.e. use the return value of 
 {{JobTracker.startTracker}} ... yet, is this a 0.15.2 blocker?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2077) Logging version number (and compiled date) at STARTUP_MSG

2008-01-14 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2077:
--

Attachment: HADOOP-2077_1_20080114.patch

Slight modification to the log-msg:

{noformat}
08/01/14 17:33:25 INFO dfs.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = neo/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.16.0-dev
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/lucene/hadoop/trunk -r 
611760; compiled by 'arun' on Mon Jan 14 17:33:13 IST 2008
************************************************************/
{noformat}

 Logging version number (and compiled date) at STARTUP_MSG  
 ---

 Key: HADOOP-2077
 URL: https://issues.apache.org/jira/browse/HADOOP-2077
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Koji Noguchi
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0

 Attachments: HADOOP-2077_0_20080110.patch, 
 HADOOP-2077_0_20080110.patch, HADOOP-2077_1_20080114.patch


 This will help us figure out which version of hadoop we were running when 
 looking back at the logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2077) Logging version number (and compiled date) at STARTUP_MSG

2008-01-14 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2077:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this.

 Logging version number (and compiled date) at STARTUP_MSG  
 ---

 Key: HADOOP-2077
 URL: https://issues.apache.org/jira/browse/HADOOP-2077
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Koji Noguchi
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0

 Attachments: HADOOP-2077_0_20080110.patch, 
 HADOOP-2077_0_20080110.patch, HADOOP-2077_1_20080114.patch


 This will help us figure out which version of hadoop we were running when 
 looking back at the logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-14 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this.

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3

 Attachments: HADOOP-2574_0_20080110.patch, 
 HADOOP-2574_1_20080114.patch, mapred_tutorial.html, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2574) bugs in mapred tutorial

2008-01-14 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558917#action_12558917
 ] 

Arun C Murthy commented on HADOOP-2574:
---

I've clarified in the tutorial that WordCount v1 works with local, 
pseudo-distributed and fully-distributed modes while v2 needs HDFS to be up 
and running (pseudo-distributed or fully-distributed) - primarily due to the 
usage of the DistributedCache. Works?

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3

 Attachments: HADOOP-2574_0_20080110.patch, 
 HADOOP-2574_1_20080114.patch, mapred_tutorial.html, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-13 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Status: Open  (was: Patch Available)

Uh, I missed:

 bq. The quickstart tutorial does not make it clear which examples work under 
which scenarios (Stand alone, Pseudo-Distributed, or Fully-Distributed).

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3, 0.16.0

 Attachments: HADOOP-2574_0_20080110.patch, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-12 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2570:
--

Status: Open  (was: Patch Available)

Re-trying hudson...

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: HADOOP-2570_1_20080112.patch, patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-12 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2570:
--

Status: Patch Available  (was: Open)

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: HADOOP-2570_1_20080112.patch, patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1876) Persisting completed jobs status

2008-01-12 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1876:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Alejandro!

 Persisting completed jobs status
 

 Key: HADOOP-1876
 URL: https://issues.apache.org/jira/browse/HADOOP-1876
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: all
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch1876.txt, patch1876.txt, patch1876.txt


 Currently the JobTracker keeps information about completed jobs in memory. 
 This information is flushed from the cache when it has outlived 
 #RETIRE_JOB_INTERVAL or because the limit of completed jobs in memory has 
 been reached (#MAX_COMPLETE_USER_JOBS_IN_MEMORY). 
 Also, if the JobTracker is restarted (due to being recycled or due to a 
 crash) information about completed jobs is lost.
 If any of the above scenarios happens before the job information is queried 
 by a hadoop client (normally the job submitter or a monitoring component) 
 there is no way to obtain such information.
 A way to avoid this is for the JobTracker to persist the completed jobs 
 information in DFS upon job completion. This would be done at the time the 
 job is moved to the completed jobs queue. Then, when querying the JobTracker 
 for information about a completed job, if it is not found in the memory 
 queue, a lookup in DFS would be done to retrieve the completed job information. 
 A directory in DFS (under mapred/system) would be used to persist completed 
 job information; for each completed job there would be a directory with the 
 job ID, and within that directory all the information about the job: status, 
 jobprofile, counters and completion events.
 A configuration property will indicate for how long persisted job information 
 should be kept in DFS. After such a period it will be cleaned up automatically.
 This improvement would not introduce API changes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-11 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558080#action_12558080
 ] 

Arun C Murthy commented on HADOOP-2570:
---

All tests fail with:

{noformat}
2008-01-11 17:35:53,433 INFO  mapred.TaskTracker 
(TaskTracker.java:launchTaskForJob(703)) - 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/jobcache/job_20080735_0001/work in any of the configured local 
directories
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at 
org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1395)
at 
org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1469)
at 
org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:693)
at 
org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:686)
at 
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1279)
at 
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:920)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1315)
at 
org.apache.hadoop.mapred.MiniMRCluster$TaskTrackerRunner.run(MiniMRCluster.java:144)
at java.lang.Thread.run(Thread.java:595)
{noformat}

The problem is that LocalDirAllocator.getLocalPathToRead throws an 
exception when the path is not found - this patch should handle that exception 
and go ahead and create the symlink...
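Something along these lines, as a sketch (variable names assumed; this is not 
the committed change):

{code}
Path jobWorkDir;
try {
  jobWorkDir = lDirAlloc.getLocalPathToRead(
      "taskTracker/jobcache/" + jobId + "/work", fConf);
} catch (DiskChecker.DiskErrorException e) {
  // getLocalPathToRead() throws when the path exists in none of the
  // configured local dirs; treat that as "not created yet" and go ahead
  // and create the work dir (and the symlink to it) instead of failing.
  jobWorkDir = lDirAlloc.getLocalPathForWrite(
      "taskTracker/jobcache/" + jobId + "/work", fConf);
}
{code}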


 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Attachment: HADOOP-2574_0_20080110.patch

Here is a patch which addresses most of Phu's concerns...

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3, 0.16.0

 Attachments: HADOOP-2574_0_20080110.patch


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HADOOP-2574) bugs in mapred tutorial

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy reassigned HADOOP-2574:
-

Assignee: Arun C Murthy

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3, 0.16.0

 Attachments: HADOOP-2574_0_20080110.patch


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Attachment: mapred_tutorial.html

Here is how the tutorial looks with this patch...

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3, 0.16.0

 Attachments: HADOOP-2574_0_20080110.patch, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-11 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558087#action_12558087
 ] 

Arun C Murthy commented on HADOOP-2570:
---

Sigh, this exception seems to stem from the fact that the LocalDirAllocator is 
not used to create the *taskTracker/jobcache/jobid/work* directory at all. It 
is always created in the same partition as the *taskTracker/jobcache/jobid/* 
directory.

This means LocalDirAllocator doesn't know about the 
*taskTracker/jobcache/jobid/work* directory at all and hence the 
DiskErrorException.

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2574) bugs in mapred tutorial

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2574:
--

Status: Patch Available  (was: Open)

 bugs in mapred tutorial
 ---

 Key: HADOOP-2574
 URL: https://issues.apache.org/jira/browse/HADOOP-2574
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Reporter: Doug Cutting
Assignee: Arun C Murthy
 Fix For: 0.15.3, 0.16.0

 Attachments: HADOOP-2574_0_20080110.patch, mapred_tutorial.html


 Sam Pullara sends me:
 {noformat}
 Phu was going through the WordCount example... lines 52 and 53 should have 
 args[0] and args[1]:
 http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html
 The javac and jar command are also wrong, they don't include the directories 
 for the packages, should be:
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d 
 classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes .
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-11 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558133#action_12558133
 ] 

Arun C Murthy commented on HADOOP-2570:
---

Please ignore my previous comments... it's been a long day (maybe the following 
ones too! *smile*)

It seems the test cases don't have a jar, so an 'if' check in 
TaskTracker.localizeJob fails and the work directory isn't created. This 
explains the exception seen in the TaskTracker.launchTaskForJob function. 

I didn't make any headway after that...

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1965) Handle map output buffers better

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1965:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amar - this was a long-drawn affair!

 Handle map output buffers better
 

 Key: HADOOP-1965
 URL: https://issues.apache.org/jira/browse/HADOOP-1965
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.16.0
Reporter: Devaraj Das
Assignee: Amar Kamat
 Fix For: 0.16.0

 Attachments: 1965_single_proc_150mb_gziped.jpeg, 
 1965_single_proc_150mb_gziped.pdf, 1965_single_proc_150mb_gziped_breakup.png, 
 HADOOP-1965-1.patch, HADOOP-1965-Benchmark.patch, 
 HADOOP-1965-Benchmark.patch, HADOOP-1965-Benchmark.patch, 
 HADOOP-1965-Benchmark.patch, HADOOP-1965-Benchmark.patch, HADOOP-2419.patch, 
 HADOOP-2419.patch, HADOOP-2419.patch, HADOOP-2419.patch


 Today, the map task stops calling the map method while sort/spill is using 
 the (single instance of) map output buffer. One improvement that can be made 
 to the performance of the map task is to have another buffer for writing 
 the map outputs to while sort/spill is using the first buffer.
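 As a toy illustration of the double-buffering idea (this is not the MapTask 
 code; all names here are made up):
 {code}
 import java.util.concurrent.ArrayBlockingQueue;
 import java.util.concurrent.BlockingQueue;

 public class DoubleBufferSketch {
   public static void main(String[] args) throws InterruptedException {
     final BlockingQueue<int[]> free = new ArrayBlockingQueue<int[]>(2);
     final BlockingQueue<int[]> full = new ArrayBlockingQueue<int[]>(2);
     free.put(new int[1024]);
     free.put(new int[1024]);

     Thread spiller = new Thread(new Runnable() { // stands in for sort/spill
       public void run() {
         try {
           while (true) {
             int[] buf = full.take();             // wait for a full buffer
             // ... sort 'buf' and spill it to disk here ...
             free.put(buf);                       // recycle the buffer
           }
         } catch (InterruptedException ie) { }
       }
     });
     spiller.setDaemon(true);
     spiller.start();

     for (int i = 0; i < 10; i++) {               // stands in for map() output
       int[] buf = free.take();  // blocks only if *both* buffers are in use
       // ... fill 'buf' with serialized map outputs here ...
       full.put(buf);
     }
   }
 }
 {code}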

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2570:
--

Attachment: HADOOP-2570_1_20080112.patch

bq. It seems the test cases don't have a jar, so an 'if' check in 
TaskTracker.localizeJob fails and the work directory isn't created. This 
explains the exception seen in the TaskTracker.launchTaskForJob function.

Here is a patch which changes TaskTracker.localizeJob to fix the problem 
described above, along with Amareshwari's original fix.

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: HADOOP-2570_1_20080112.patch, patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-11 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2570:
--

Status: Patch Available  (was: Open)

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: HADOOP-2570_1_20080112.patch, patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557636#action_12557636
 ] 

Arun C Murthy commented on HADOOP-2570:
---

Sigh, the only way I see to fix this post HADOOP-2227 is to symlink the work 
directory from the partition on which the task's cwd is present; this is so 
because user scripts could just use the ../work/ path and there is no way for 
us to pass them extra configuration parameters etc.

Thoughts?

 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
 Fix For: 0.15.3


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2573) limit running tasks per job

2008-01-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557757#action_12557757
 ] 

Arun C Murthy commented on HADOOP-2573:
---

I'd like to throw *job priority* into this festering pool...

At least changing the job-priority (done by the cluster-admin) should result 
in a change in the number of max_slots... thoughts?



 limit running tasks per job
 ---

 Key: HADOOP-2573
 URL: https://issues.apache.org/jira/browse/HADOOP-2573
 Project: Hadoop
  Issue Type: New Feature
  Components: mapred
Reporter: Doug Cutting
 Fix For: 0.17.0


 It should be possible to specify a limit to the number of tasks per job 
 permitted to run simultaneously.  If, for example, you have a cluster of 50 
 nodes, with 100 map task slots and 100 reduce task slots, and the configured 
 limit is 25 simultaneous tasks/job, then four or more jobs will be able to 
 run at a time.  This will permit short jobs to pass longer-running jobs.  
 This also avoids some problems we've seen with HOD, where nodes are 
 underutilized in their tail, and it should permit improved input locality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2510) Map-Reduce 2.0

2008-01-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557754#action_12557754
 ] 

Arun C Murthy commented on HADOOP-2510:
---

bq. I would, however, argue that the JobScheduler should not be part of 
MapReduce itself and rather a separate component. 

Sure, that is _precisely_ the idea. I guess we are on the same page now. 
JobScheduler is the big-daddy of the cluster.

As Eric alludes, the gravy is that by moving MR into the client-code 
(JobManager) we can support multiple parallel-computation paradigms, in 
addition to MR itself. Clearly, we are a long way ...

 Map-Reduce 2.0
 --

 Key: HADOOP-2510
 URL: https://issues.apache.org/jira/browse/HADOOP-2510
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy

 We, at Yahoo!, have been using Hadoop-On-Demand as the resource 
 provisioning/scheduling mechanism. 
 With HoD the user uses a self-service system to ask for a set of nodes. HoD 
 allocates these from a global pool and also provisions a private Map-Reduce 
 cluster for the user. She then runs her jobs and shuts the cluster down via 
 HoD when done. All user-private clusters use the same humongous, static HDFS 
 (e.g. 2k node HDFS). 
 More details about HoD are available here: HADOOP-1301.
 
 h3. Motivation
 The current deployment (Hadoop + HoD) has a couple of implications:
  * _Non-optimal Cluster Utilization_
1. Job-private Map-Reduce clusters imply that the user-cluster potentially 
 could be *idle* for at least a while before being detected and shut down.
2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with 
 a much smaller no. of reduces; with maps being light and quick and reduces 
 being i/o heavy and longer-running. Users typically allocate clusters 
 depending on the no. of maps (i.e. input size) which leads to the scenario 
 where all the maps are done (idle nodes in the cluster) and the few reduces 
 are chugging along. Right now, we do not have the ability to shrink the 
 HoD'ed Map-Reduce clusters which would alleviate this issue. 
  * _Impact on data-locality_
 With the current setup of a static, large HDFS and much smaller (5/10/20/50 
 node) clusters there is a good chance of losing one of Map-Reduce's primary 
 features: ability to execute tasks on the datanodes where the input splits 
 are located. In fact, we have seen the data-local tasks go down to 20-25 
 percent in the GridMix benchmarks, from the 95-98 percent we see on the 
 randomwriter+sort runs run as part of the hadoopqa benchmarks (granted, a 
 synthetic benchmark, but still). Admittedly, HADOOP-1985 (rack-aware 
 Map-Reduce) helps significantly here.
 
 Primarily, the notion of *job-level scheduling* leading to private clusters, 
 as opposed to *task-level scheduling*, is a good peg on which to hang the 
 majority of the blame.
 Keeping the above factors in mind, here are some thoughts on how to 
 re-structure Hadoop Map-Reduce to solve some of these issues.
 
 h3. State of the Art
 As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD 
 for a bit) does provide task-level scheduling; however, as it exists today, 
 its scalability to tens-of-thousands of user-jobs per week is in question.
 Let's review its current architecture and main components:
  * JobTracker: It does both *task-scheduling* and *task-monitoring* 
 (tasktrackers send task-statuses via periodic heartbeats), which implies it 
 is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce 
 framework i.e. its failure implies that all the jobs in the system fail. This 
 means a static, large Map-Reduce cluster is fairly susceptible and a definite 
 suspect. Clearly HoD solves this by having per-job clusters, albeit with the 
 above drawbacks.
 * TaskTracker: The slave in the system which executes one task at a time 
 under direction from the JobTracker.
  * JobClient: The per-job client which just submits the job and polls the 
 JobTracker for status. 
 
 h3. Proposal - Map-Reduce 2.0 
 The primary idea is to move to task-level scheduling and static Map-Reduce 
 clusters (so as to maintain the same storage cluster and compute cluster 
 paradigm) as a way to directly tackle the two main issues illustrated above. 
 Clearly, we will have to get around the existing problems, especially w.r.t. 
 scalability and reliability.
 The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, 
 static cluster. 
 Here is an overview of what its main components would look like:
  * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. 
 Let's call this the *JobScheduler* henceforth. Clearly (data-locality aware) 
 Maui/Moab are candidates for being the scheduler, in which case the 
 JobScheduler is just a thin wrapper 

[jira] Updated: (HADOOP-2131) Speculative execution should be allowed for reducers only

2008-01-10 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2131:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!

 Speculative execution should be allowed for reducers only
 -

 Key: HADOOP-2131
 URL: https://issues.apache.org/jira/browse/HADOOP-2131
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: Hadoop job, map fetches data from external systems
Reporter: Srikanth Kakani
Assignee: Amareshwari Sri Ramadasu
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch-2131.txt, patch-2131.txt, patch-2131.txt


 Consider hadoop jobs where maps fetch data from external systems, and emit 
 the data. The reducers in this are identity reducers. The data processed by 
 these jobs is huge. There could be slow nodes in this cluster and some of the 
 reducers run twice as slow as their counterparts. This could result in a long 
 tail. Speculative execution would help greatly in such cases. However, given 
 the current hadoop, we have to select speculative execution for both maps and 
 reducers. In this case map performance is hurt, as the maps fetch data 
 from external systems and thereby overload them.
 Speculative execution only on reducers would be a great way to solve this 
 problem.
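 For example, with per-phase switches along the lines proposed here, a job 
 could enable speculation for reduces only (a sketch; the property names and 
 MyJob are assumptions):
 {code}
 JobConf conf = new JobConf(MyJob.class);
 conf.setBoolean("mapred.map.tasks.speculative.execution", false);
 conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
 {code}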

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2510) Map-Reduce 2.0

2008-01-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557781#action_12557781
 ] 

Arun C Murthy commented on HADOOP-2510:
---

bq. What I meant was more of a SW organization point of view. The JobScheduler 
should not be part of the MapReduce sub-project.

Ah, point taken. I misunderstood your previous comment...

 Map-Reduce 2.0
 --

 Key: HADOOP-2510
 URL: https://issues.apache.org/jira/browse/HADOOP-2510
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy

 We, at Yahoo!, have been using Hadoop-On-Demand as the resource 
 provisioning/scheduling mechanism. 
 With HoD the user uses a self-service system to ask for a set of nodes. HoD 
 allocates these from a global pool and also provisions a private Map-Reduce 
 cluster for the user. She then runs her jobs and shuts the cluster down via 
 HoD when done. All user-private clusters use the same humongous, static HDFS 
 (e.g. 2k node HDFS). 
 More details about HoD are available here: HADOOP-1301.
 
 h3. Motivation
 The current deployment (Hadoop + HoD) has a couple of implications:
  * _Non-optimal Cluster Utilization_
1. Job-private Map-Reduce clusters imply that the user-cluster potentially 
 could be *idle* for at least a while before being detected and shut down.
2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with 
 a much smaller no. of reduces; with maps being light and quick and reduces 
 being i/o heavy and longer-running. Users typically allocate clusters 
 depending on the no. of maps (i.e. input size) which leads to the scenario 
 where all the maps are done (idle nodes in the cluster) and the few reduces 
 are chugging along. Right now, we do not have the ability to shrink the 
 HoD'ed Map-Reduce clusters which would alleviate this issue. 
  * _Impact on data-locality_
 With the current setup of a static, large HDFS and much smaller (5/10/20/50 
 node) clusters there is a good chance of losing one of Map-Reduce's primary 
 features: ability to execute tasks on the datanodes where the input splits 
 are located. In fact, we have seen the data-local tasks go down to 20-25 
 percent in the GridMix benchmarks, from the 95-98 percent we see on the 
 randomwriter+sort runs run as part of the hadoopqa benchmarks (granted, a 
 synthetic benchmark, but still). Admittedly, HADOOP-1985 (rack-aware 
 Map-Reduce) helps significantly here.
 
 Primarily, the notion of *job-level scheduling* leading to private clusters, 
 as opposed to *task-level scheduling*, is a good peg on which to hang the 
 majority of the blame.
 Keeping the above factors in mind, here are some thoughts on how to 
 re-structure Hadoop Map-Reduce to solve some of these issues.
 
 h3. State of the Art
 As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD 
 for a bit) does provide task-level scheduling; however, as it exists today, 
 its scalability to tens-of-thousands of user-jobs per week is in question.
 Let's review its current architecture and main components:
  * JobTracker: It does both *task-scheduling* and *task-monitoring* 
 (tasktrackers send task-statuses via periodic heartbeats), which implies it 
 is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce 
 framework i.e. its failure implies that all the jobs in the system fail. This 
 means a static, large Map-Reduce cluster is fairly susceptible and a definite 
 suspect. Clearly HoD solves this by having per-job clusters, albeit with the 
 above drawbacks.
 * TaskTracker: The slave in the system which executes one task at a time 
 under direction from the JobTracker.
  * JobClient: The per-job client which just submits the job and polls the 
 JobTracker for status. 
 
 h3. Proposal - Map-Reduce 2.0 
 The primary idea is to move to task-level scheduling and static Map-Reduce 
 clusters (so as to maintain the same storage cluster and compute cluster 
 paradigm) as a way to directly tackle the two main issues illustrated above. 
 Clearly, we will have to get around the existing problems, especially w.r.t. 
 scalability and reliability.
 The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, 
 static cluster. 
 Here is an overview of what its main components would look like:
  * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. 
 Let's call this the *JobScheduler* henceforth. Clearly (data-locality aware) 
 Maui/Moab are candidates for being the scheduler, in which case the 
 JobScheduler is just a thin wrapper around them. 
  * TaskTracker: These stay as before, save for some minor changes as 
 illustrated later in the piece.
  * JobClient: Fatten up the JobClient by putting a lot more intelligence into 
 it. Enhance it to talk to the JobTracker to ask for available 

[jira] Commented: (HADOOP-2570) streaming jobs fail after HADOOP-2227

2008-01-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557906#action_12557906
 ] 

Arun C Murthy commented on HADOOP-2570:
---

bq. the 2 places where the jobcache dir was used in streaming were to 'chmod' 
the executable and to look up this directory in PATH. Would it be OK to 
construct jobCacheDir as done in HADOOP-2227?

Lohit, that still won't help scripts which use ../work/myscript - so this 
is the best approach for 0.15.3.

In light of this bug, HADOOP-2116 is a little more complicated than originally 
thought; I have a few thoughts about this which I'll put up there.


bq. Submiting the patch with symlinks to ../work from taskdir

+1



 streaming jobs fail after HADOOP-2227
 -

 Key: HADOOP-2570
 URL: https://issues.apache.org/jira/browse/HADOOP-2570
 Project: Hadoop
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 0.15.2
Reporter: lohit vijayarenu
Assignee: Amareshwari Sri Ramadasu
Priority: Blocker
 Fix For: 0.15.3

 Attachments: patch-2570.txt


 HADOOP-2227 changes jobCacheDir. In streaming, jobCacheDir was constructed 
 like this:
 {code}
 File jobCacheDir = new File(currentDir.getParentFile().getParent(), "work");
 {code}
 We should change this to get it working. Referring to the changes made in 
 HADOOP-2227, I see that the APIs used there to construct the path are not 
 public, and hard-coding the path in streaming does not look good. Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2116) Job.local.dir to be exposed to tasks

2008-01-10 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2116:
--

Status: Open  (was: Patch Available)

In light of HADOOP-2570, I'm cancelling this patch.

Reasoning:

The *-file* option works by putting the script into the job's jar file by 
unjar-ing, copying and then jar-ing it again. (yuck!) 

This means that on the TaskTracker the script has moved from jobCache/work to 
jobCache/job_jar_xml (I propose we rename that to *private*, heh). Clearly 
user-scripts which rely on ../work/script_name will break again...

Having said that, we need to debate whether this feature is an 
incompatible change - what do folks think?

If people say otherwise we need to ensure all files in jobCache/private are 
symlinked into jobCache/work... ugh!



I'd like to take this opportunity to take a hard look at streaming's *-file* 
option too. The unjar/jar way is completely backwards! We _should_ rework the 
-file option to use the DistributedCache and the symlink option it provides.
So, user-scripts can simply be invoked as ./script rather than ../work/script. 
Yes, the way to maintain compatibility (if we want) is to use the previous 
option of symlinking files into jobCache/work also. I'd strongly vote for this 
option.

Thoughts?
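
As a sketch of the DistributedCache route (the URI and file names below are 
examples only), given the job's JobConf {{conf}}:

{code}
// Ship the script via the DistributedCache and expose it as ./myscript
// in the task's cwd via a symlink.
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile(
    new URI("hdfs://namenode:9000/user/joe/myscript#myscript"), conf);
// The streaming command then refers to ./myscript instead of ../work/myscript.
{code}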

 Job.local.dir to be exposed to tasks
 

 Key: HADOOP-2116
 URL: https://issues.apache.org/jira/browse/HADOOP-2116
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.14.3
 Environment: All
Reporter: Milind Bhandarkar
Assignee: Amareshwari Sri Ramadasu
 Fix For: 0.16.0

 Attachments: patch-2116.txt, patch-2116.txt


 Currently, since all task cwds are created under a jobcache directory, users 
 that need a job-specific shared directory for use as scratch space create 
 ../work. This is hacky, and will break when HADOOP-2115 is addressed. For 
 such jobs, hadoop mapred should expose job.local.dir via localized 
 configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1876) Persisting completed jobs status

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1876:
--

Assignee: Alejandro Abdelnur
  Status: Open  (was: Patch Available)

bq. It seems to me that it would be much easier to retrofit the JobHistory to 
use info out of the files the patch is writing than the other way around.

I guess we should consider that we might be better off, in the long 
run, moving away from the custom textual format used today by 
{{JobHistory}} and going the {{Writable}} way - much less, and more standard, 
code. I don't believe the textual format buys us much, and it is a pain to 
maintain.

If folks agree, I'm okay with this patch going in as-is (oh, and yes, this is a 
very different use-case) and then fixing {{JobHistory}} to use {{Writable}} to 
serialize the necessary data-structures. Thoughts?



That said, some comments about the patch:

Alejandro, could you please ensure that the {{completedJobsStoreThread}} isn't 
_started at all_ if the feature is switched off?

Maybe we could add a boolean {{mapred.job.tracker.persist.jobstatus}} flag to 
turn the feature on/off.
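
Something like this, as a sketch (only the flag name and the thread come from 
the discussion above):

{code}
if (conf.getBoolean("mapred.job.tracker.persist.jobstatus", false)) {
  // Only start the persistence thread when the feature is switched on.
  completedJobsStoreThread.start();
}
{code}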



 Persisting completed jobs status
 

 Key: HADOOP-1876
 URL: https://issues.apache.org/jira/browse/HADOOP-1876
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: all
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch1876.txt, patch1876.txt


 Currently the JobTracker keeps information about completed jobs in memory. 
 This information is flushed from the cache when it has outlived 
 #RETIRE_JOB_INTERVAL or because the limit of completed jobs in memory has 
 been reached (#MAX_COMPLETE_USER_JOBS_IN_MEMORY). 
 Also, if the JobTracker is restarted (due to being recycled or due to a 
 crash) information about completed jobs is lost.
 If any of the above scenarios happens before the job information is queried 
 by a hadoop client (normally the job submitter or a monitoring component) 
 there is no way to obtain such information.
 A way to avoid this is for the JobTracker to persist the completed jobs 
 information in DFS upon job completion. This would be done at the time the 
 job is moved to the completed jobs queue. Then, when querying the JobTracker 
 for information about a completed job, if it is not found in the memory 
 queue, a lookup in DFS would be done to retrieve the completed job information. 
 A directory in DFS (under mapred/system) would be used to persist completed 
 job information; for each completed job there would be a directory with the 
 job ID, and within that directory all the information about the job: status, 
 jobprofile, counters and completion events.
 A configuration property will indicate for how long persisted job information 
 should be kept in DFS. After such a period it will be cleaned up automatically.
 This improvement would not introduce API changes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1281) Speculative map tasks aren't getting killed although the TIP completed

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1281:
--

Attachment: HADOOP-1281_2_20080109.patch

Exact same patch as before, but added comments rationalizing the fix...

 Speculative map tasks aren't getting killed although the TIP completed
 --

 Key: HADOOP-1281
 URL: https://issues.apache.org/jira/browse/HADOOP-1281
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.16.0

 Attachments: HADOOP-1281_1_20071117.patch, 
 HADOOP-1281_2_20071123.patch, HADOOP-1281_2_20080109.patch


 The speculative map tasks run to completion although the TIP succeeded since 
 the other task completed elsewhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1281) Speculative map tasks aren't getting killed although the TIP completed

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1281:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this.

 Speculative map tasks aren't getting killed although the TIP completed
 --

 Key: HADOOP-1281
 URL: https://issues.apache.org/jira/browse/HADOOP-1281
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.16.0

 Attachments: HADOOP-1281_1_20071117.patch, 
 HADOOP-1281_2_20071123.patch, HADOOP-1281_2_20080109.patch


 The speculative map tasks run to completion although the TIP succeeded since 
 the other task completed elsewhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-1876) Persisting completed jobs status

2008-01-09 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557397#action_12557397
 ] 

Arun C Murthy commented on HADOOP-1876:
---

bq. Can this patch make the JobHistory log obsolete? Or at least is that 
intended? I hate to see the same information logged in different places in 
different forms using different code paths.

This patch doesn't do that, but that is definitely the direction I'd go too... 
+1.

Should we broaden the scope of HADOOP-2178 to re-work JobHistory to use 
Writables rather than the custom format? Or is that a new jira?

bq. Other than being in text format (which has its pros and cons), job history 
log is event based [...]

Yes, moving to Writable wouldn't hurt the _job analysis_ part since, as you 
point out, it's event-based - we just need to use Writable.readFields rather 
than the custom text-parsing... does anyone see other issues?
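
As a rough sketch of the {{Writable}} direction (the event class and its 
fields below are hypothetical, not from any patch):

{noformat}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical history event serialized via Writable instead of text.
public class TaskFinishedEvent implements Writable {
  private Text taskId = new Text();
  private long finishTime;

  public void write(DataOutput out) throws IOException {
    taskId.write(out);
    out.writeLong(finishTime);
  }

  public void readFields(DataInput in) throws IOException {
    taskId.readFields(in);     // replaces the custom text-parsing
    finishTime = in.readLong();
  }
}
{noformat}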



 Persisting completed jobs status
 

 Key: HADOOP-1876
 URL: https://issues.apache.org/jira/browse/HADOOP-1876
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: all
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch1876.txt, patch1876.txt


 Currently the JobTracker keeps information about completed jobs in memory. 
 This information is flushed from the cache when it has outlived its retention 
 interval (#RETIRE_JOB_INTERVAL) or because the limit of completed jobs in 
 memory has been reached (#MAX_COMPLETE_USER_JOBS_IN_MEMORY). 
 Also, if the JobTracker is restarted (due to being recycled or due to a 
 crash), information about completed jobs is lost.
 If any of the above scenarios happens before the job information is queried 
 by a hadoop client (normally the job submitter or a monitoring component), 
 there is no way to obtain such information.
 A way to avoid this is for the JobTracker to persist the completed jobs 
 information in DFS upon job completion. This would be done at the time the 
 job is moved to the completed jobs queue. Then, when querying the JobTracker 
 for information about a completed job, if it is not found in the memory 
 queue, a lookup in DFS would be done to retrieve the completed job 
 information. 
 A directory in DFS (under mapred/system) would be used to persist completed 
 job information; for each completed job there would be a directory named with 
 the job ID, and within that directory all the information about the job: 
 status, job profile, counters and completion events.
 A configuration property will indicate for how long persisted job information 
 should be kept in DFS. After such a period it will be cleaned up automatically.
 This improvement would not introduce API changes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2268) JobControl classes should use interfaces rather than implemenations

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2268:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Adrian!

 JobControl classes should use interfaces rather than implemenations
 ---

 Key: HADOOP-2268
 URL: https://issues.apache.org/jira/browse/HADOOP-2268
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.15.0
Reporter: Adrian Woodhead
Assignee: Adrian Woodhead
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2268-1.patch, HADOOP-2268-2.patch, 
 HADOOP-2268-3.patch, HADOOP-2268-4.patch


 See HADOOP-2202 for background on this issue. Arun C. Murthy agrees that when 
 possible it is preferable to program against the interface rather than a 
 concrete implementation (more flexible, allows for changing the 
 implementation in future, etc.). JobControl currently exposes running, waiting, 
 ready, successful and dependent jobs as ArrayList rather than List. I propose 
 to change this to List.
 I will code up a patch for this.
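
As a sketch of the change (a field and getter along these lines; JobControl's 
actual member names may differ):

{noformat}
import java.util.ArrayList;
import java.util.List;

// Sketch: keep the concrete type as an implementation detail and expose
// only the interface to callers.
public class JobControlSketch {
  private List successfulJobs = new ArrayList();  // impl choice stays private

  public List getSuccessfulJobs() {               // callers program to List
    return successfulJobs;
  }
}
{noformat}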

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HADOOP-2077) Logging version number (and compiled date) at STARTUP_MSG

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy reassigned HADOOP-2077:
-

Assignee: Arun C Murthy

 Logging version number (and compiled date) at STARTUP_MSG  
 ---

 Key: HADOOP-2077
 URL: https://issues.apache.org/jira/browse/HADOOP-2077
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Koji Noguchi
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0


 This will help us figure out which version of hadoop we were running when 
 looking back at the logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2077) Logging version number (and compiled date) at STARTUP_MSG

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2077:
--

Attachment: HADOOP-2077_0_20080110.patch

Simple fix. 
I haven't gotten around to testing it much since svn.apache.org is _super_ 
slow...

 Logging version number (and compiled date) at STARTUP_MSG  
 ---

 Key: HADOOP-2077
 URL: https://issues.apache.org/jira/browse/HADOOP-2077
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Koji Noguchi
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0

 Attachments: HADOOP-2077_0_20080110.patch


 This will help us figure out which version of hadoop we were running when 
 looking back at the logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2077) Logging version number (and compiled date) at STARTUP_MSG

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2077:
--

Attachment: HADOOP-2077_0_20080110.patch

Minor change in formatting of the output, which now looks like:


{noformat}
2008-01-10 03:57:15,143 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = neo/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version  = 0.16.0-dev
STARTUP_MSG:   subversion = http://svn.apache.org/repos/asf/lucene/hadoop/trunk 
-r 610541
STARTUP_MSG:   compiled-by = arun on Thu Jan 10 03:57:03 IST 2008
************************************************************/
{noformat}
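
For reference, the fields in such a banner are available from 
{{org.apache.hadoop.util.VersionInfo}}; a minimal sketch (not the patch 
itself) of printing them:

{noformat}
import java.net.InetAddress;
import org.apache.hadoop.util.VersionInfo;

// Sketch: print the same startup fields the banner above shows.
public class StartupBanner {
  public static void main(String[] args) throws Exception {
    System.out.println("STARTUP_MSG: Starting NameNode");
    System.out.println("STARTUP_MSG:   host = " + InetAddress.getLocalHost());
    System.out.println("STARTUP_MSG:   version = " + VersionInfo.getVersion());
    System.out.println("STARTUP_MSG:   subversion = " + VersionInfo.getUrl()
        + " -r " + VersionInfo.getRevision());
    System.out.println("STARTUP_MSG:   compiled-by = " + VersionInfo.getUser()
        + " on " + VersionInfo.getDate());
  }
}
{noformat}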


 Logging version number (and compiled date) at STARTUP_MSG  
 ---

 Key: HADOOP-2077
 URL: https://issues.apache.org/jira/browse/HADOOP-2077
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Koji Noguchi
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0

 Attachments: HADOOP-2077_0_20080110.patch, 
 HADOOP-2077_0_20080110.patch


 This will help us figure out which version of hadoop we were running when 
 looking back at the logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2077) Logging version number (and compiled date) at STARTUP_MSG

2008-01-09 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2077:
--

Status: Patch Available  (was: Open)

 Logging version number (and compiled date) at STARTUP_MSG  
 ---

 Key: HADOOP-2077
 URL: https://issues.apache.org/jira/browse/HADOOP-2077
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Koji Noguchi
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0

 Attachments: HADOOP-2077_0_20080110.patch, 
 HADOOP-2077_0_20080110.patch


 This will help us figure out which version of hadoop we were running when 
 looking back at the logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2131) Speculative execution should be allowed for reducers only

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2131:
--

Status: Open  (was: Patch Available)

Please go ahead and deprecate the old {{mapred.speculative.execution}} config 
in favour of the new ones, which should be set to *true* in hadoop-default.xml.

For 0.16.0 we should let {{mapred.speculative.execution}} override the new 
ones, since if it is set it means folks actually cared about the non-default 
value and went ahead and set it in their hadoop-site.xml.
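
A sketch of that precedence (the per-phase property names below are 
assumptions for illustration; conf is the JobConf being read):

{noformat}
// Sketch: the deprecated global flag, if explicitly set, wins for 0.16.0.
boolean mapSpeculative =
    conf.getBoolean("mapred.map.tasks.speculative.execution", true);
boolean reduceSpeculative =
    conf.getBoolean("mapred.reduce.tasks.speculative.execution", true);
String deprecated = conf.get("mapred.speculative.execution");
if (deprecated != null) {  // the user explicitly set the old key
  mapSpeculative = reduceSpeculative =
      Boolean.valueOf(deprecated).booleanValue();
}
{noformat}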

 Speculative execution should be allowed for reducers only
 -

 Key: HADOOP-2131
 URL: https://issues.apache.org/jira/browse/HADOOP-2131
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: Hadoop job, map fetches data from external systems
Reporter: Srikanth Kakani
Assignee: Amareshwari Sri Ramadasu
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch-2131.txt


 Consider hadoop jobs where maps fetch data from external systems and emit 
 the data. The reducers in this case are identity reducers. The data processed 
 by these jobs is huge. There could be slow nodes in the cluster, and some of 
 the reducers may run twice as slow as their counterparts, which could result 
 in a long tail. Speculative execution would help greatly in such cases. 
 However, given the current hadoop, we have to select speculative execution 
 for both maps and reducers, hurting map performance since the maps fetch data 
 from external systems and duplicate maps overload those systems.
 Speculative execution only on reducers would be a great way to solve this 
 problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2285) TextInputFormat is slow compared to reading files.

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2285:
--

Status: Open  (was: Patch Available)

Minor nit: this patch removes a public constructor rather than adding a new one:

{noformat}
-  public LineRecordReader(InputStream in, long offset, long endOffset)
+  public LineRecordReader(InputStream in, long offset, long endOffset,
+  Configuration job)
{noformat}
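
One backwards-compatible option (a sketch under the assumption that the new 
constructor's signature otherwise matches; not the actual patch) is to retain 
the old constructor and delegate:

{noformat}
// Sketch: keep the old public constructor, delegating with a default conf.
public LineRecordReader(InputStream in, long offset, long endOffset)
    throws IOException {
  this(in, offset, endOffset, new Configuration());
}
{noformat}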

 TextInputFormat is slow compared to reading files.
 --

 Key: HADOOP-2285
 URL: https://issues.apache.org/jira/browse/HADOOP-2285
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.16.0

 Attachments: fast-line.patch


 The LineRecordReader reads from the source byte by byte, which seems to be 
 half as fast as if the readLine method were defined directly on the memory 
 buffer instead of on an InputStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2285) TextInputFormat is slow compared to reading files.

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2285:
--

Attachment: fast-line2.patch

Attaching a simple fix to my previous comment on Owen's behalf...

 TextInputFormat is slow compared to reading files.
 --

 Key: HADOOP-2285
 URL: https://issues.apache.org/jira/browse/HADOOP-2285
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.16.0

 Attachments: fast-line.patch, fast-line2.patch


 The LineRecordReader reads from the source byte by byte, which seems to be 
 half as fast as if the readLine method were defined directly on the memory 
 buffer instead of on an InputStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2285) TextInputFormat is slow compared to reading files.

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2285:
--

Status: Patch Available  (was: Open)

 TextInputFormat is slow compared to reading files.
 --

 Key: HADOOP-2285
 URL: https://issues.apache.org/jira/browse/HADOOP-2285
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Fix For: 0.16.0

 Attachments: fast-line.patch, fast-line2.patch


 The LineRecordReader reads from the source byte by byte, which seems to be 
 half as fast as if the readLine method were defined directly on the memory 
 buffer instead of on an InputStream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2487) Provide an option to get job status for all jobs run by or submitted to a job tracker

2008-01-08 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556886#action_12556886
 ] 

Arun C Murthy commented on HADOOP-2487:
---

You are right, I withdraw my earlier comments...

+1

 Provide an option to get job status for all jobs run by or submitted to a job 
 tracker
 -

 Key: HADOOP-2487
 URL: https://issues.apache.org/jira/browse/HADOOP-2487
 Project: Hadoop
  Issue Type: New Feature
  Components: mapred
Reporter: Hemanth Yamijala
Assignee: Amareshwari Sri Ramadasu
 Fix For: 0.16.0

 Attachments: patch-2487.txt


 This is an RFE for providing an RPC in Hadoop that can expose status 
 information for jobs submitted to a JobTracker. Such a feature can be used 
 to develop tools that analyse jobs.
 It is possible that other information is also useful - such as running times 
 of jobs, etc.
 Comments ?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1281) Speculative map tasks aren't getting killed although the TIP completed

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1281:
--

Status: Patch Available  (was: Reopened)

I finally got around to testing this patch thoroughly, hence marking it PA.

 Speculative map tasks aren't getting killed although the TIP completed
 --

 Key: HADOOP-1281
 URL: https://issues.apache.org/jira/browse/HADOOP-1281
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.16.0

 Attachments: HADOOP-1281_1_20071117.patch, 
 HADOOP-1281_2_20071123.patch


 The speculative map tasks run to completion although the TIP succeeded since 
 the other task completed elsewhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1660) add support for native library toDistributedCache

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1660:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this.

 add support for native library toDistributedCache 
 --

 Key: HADOOP-1660
 URL: https://issues.apache.org/jira/browse/HADOOP-1660
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: unix (different handling would be required for windows)
Reporter: Alejandro Abdelnur
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-1660_0_20080108.patch


 Currently if an M/R job depends on a JNI-based component, the dynamic library 
 must be available on all the task nodes. This is not possible, especially 
 when you have no control over the cluster machines and are just using them as 
 a service.
 It should be possible to specify, using the DistributedCache, which native 
 libraries a job needs.
 For example via a new method 'public void addLibrary(Path libraryPath, 
 JobConf conf)'.
 The added libraries would make it to the local FS of the task nodes (the same 
 way as cached resources), but instead of being part of the classpath they 
 would be copied to a lib directory, and that lib directory would be added to 
 the LD_LIBRARY_PATH of the task JVM.
 An alternative would be to set the '-Djava.library.path=' task JVM parameter 
 to the lib directory above. However, this would break for libraries that 
 depend on other libraries, as the dependent one would not be in the 
 LD_LIBRARY_PATH and the OS would fail to find it, since it is not the JVM 
 doing the load of the dependent one.
 For uncached usage of native libraries, a special directory in the JAR could 
 be used for native libraries. But I'd argue that the DistributedCache 
 enhancement would be enough, and if somebody wants to use a native library 
 s/he should use the DistributedCache. Or a JobConf addLibrary method that 
 uses the DistributedCache under the hood at submission time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HADOOP-622) Users should be able to change the environment in which their maps/reduces run.

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy resolved HADOOP-622.
--

Resolution: Duplicate

Fixed as a part of HADOOP-1660.

 Users should be able to change the environment in which their maps/reduces 
 run.
 ---

 Key: HADOOP-622
 URL: https://issues.apache.org/jira/browse/HADOOP-622
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Mahadev konar
Assignee: Owen O'Malley
Priority: Minor

 This would be useful with caching. So you would be able to, say, cache file X 
 and then be able to change environment variables like 
 PATH/LD_LIBRARY_PATH to include the local path where the file was cached. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2131) Speculative execution should be allowed for reducers only

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2131:
--

Attachment: patch-2131.txt

Attaching an updated patch (with a couple of minor javadoc fixes) on 
Amareshwari's behalf so that Hudson can pick this up right away...

 Speculative execution should be allowed for reducers only
 -

 Key: HADOOP-2131
 URL: https://issues.apache.org/jira/browse/HADOOP-2131
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: Hadoop job, map fetches data from external systems
Reporter: Srikanth Kakani
Assignee: Amareshwari Sri Ramadasu
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch-2131.txt, patch-2131.txt, patch-2131.txt


 Consider hadoop jobs where maps fetch data from external systems and emit 
 the data. The reducers in this case are identity reducers. The data processed 
 by these jobs is huge. There could be slow nodes in the cluster, and some of 
 the reducers may run twice as slow as their counterparts, which could result 
 in a long tail. Speculative execution would help greatly in such cases. 
 However, given the current hadoop, we have to select speculative execution 
 for both maps and reducers, hurting map performance since the maps fetch data 
 from external systems and duplicate maps overload those systems.
 Speculative execution only on reducers would be a great way to solve this 
 problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2131) Speculative execution should be allowed for reducers only

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2131:
--

Status: Open  (was: Patch Available)

Need to update this patch to reflect recent changes to trunk...

 Speculative execution should be allowed for reducers only
 -

 Key: HADOOP-2131
 URL: https://issues.apache.org/jira/browse/HADOOP-2131
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: Hadoop job, map fetches data from external systems
Reporter: Srikanth Kakani
Assignee: Amareshwari Sri Ramadasu
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch-2131.txt, patch-2131.txt


 Consider hadoop jobs where maps fetch data from external systems and emit 
 the data. The reducers in this case are identity reducers. The data processed 
 by these jobs is huge. There could be slow nodes in the cluster, and some of 
 the reducers may run twice as slow as their counterparts, which could result 
 in a long tail. Speculative execution would help greatly in such cases. 
 However, given the current hadoop, we have to select speculative execution 
 for both maps and reducers, hurting map performance since the maps fetch data 
 from external systems and duplicate maps overload those systems.
 Speculative execution only on reducers would be a great way to solve this 
 problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2131) Speculative execution should be allowed for reducers only

2008-01-08 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2131:
--

Status: Patch Available  (was: Open)

 Speculative execution should be allowed for reducers only
 -

 Key: HADOOP-2131
 URL: https://issues.apache.org/jira/browse/HADOOP-2131
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: Hadoop job, map fetches data from external systems
Reporter: Srikanth Kakani
Assignee: Amareshwari Sri Ramadasu
Priority: Critical
 Fix For: 0.16.0

 Attachments: patch-2131.txt, patch-2131.txt, patch-2131.txt


 Consider hadoop jobs where maps fetch data from external systems and emit 
 the data. The reducers in this case are identity reducers. The data processed 
 by these jobs is huge. There could be slow nodes in the cluster, and some of 
 the reducers may run twice as slow as their counterparts, which could result 
 in a long tail. Speculative execution would help greatly in such cases. 
 However, given the current hadoop, we have to select speculative execution 
 for both maps and reducers, hurting map performance since the maps fetch data 
 from external systems and duplicate maps overload those systems.
 Speculative execution only on reducers would be a great way to solve this 
 problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-07 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this.

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch, 
 HADOOP-2535_1_20080107.patch, HADOOP-2535_2_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1660) add support for native library toDistributedCache

2008-01-07 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1660:
--

Fix Version/s: 0.16.0
   Status: Patch Available  (was: Open)

 add support for native library toDistributedCache 
 --

 Key: HADOOP-1660
 URL: https://issues.apache.org/jira/browse/HADOOP-1660
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: unix (different handling would be required for windows)
Reporter: Alejandro Abdelnur
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-1660_0_20080108.patch


 Currently if an M/R job depends on a JNI-based component, the dynamic library 
 must be available on all the task nodes. This is not possible, especially 
 when you have no control over the cluster machines and are just using them as 
 a service.
 It should be possible to specify, using the DistributedCache, which native 
 libraries a job needs.
 For example via a new method 'public void addLibrary(Path libraryPath, 
 JobConf conf)'.
 The added libraries would make it to the local FS of the task nodes (the same 
 way as cached resources), but instead of being part of the classpath they 
 would be copied to a lib directory, and that lib directory would be added to 
 the LD_LIBRARY_PATH of the task JVM.
 An alternative would be to set the '-Djava.library.path=' task JVM parameter 
 to the lib directory above. However, this would break for libraries that 
 depend on other libraries, as the dependent one would not be in the 
 LD_LIBRARY_PATH and the OS would fail to find it, since it is not the JVM 
 doing the load of the dependent one.
 For uncached usage of native libraries, a special directory in the JAR could 
 be used for native libraries. But I'd argue that the DistributedCache 
 enhancement would be enough, and if somebody wants to use a native library 
 s/he should use the DistributedCache. Or a JobConf addLibrary method that 
 uses the DistributedCache under the hood at submission time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1660) add support for native library toDistributedCache

2008-01-07 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1660:
--

Attachment: HADOOP-1660_0_20080108.patch

Here is a candidate patch which adds the child task's cwd to its 
{{java.library.path}}; I've also updated the forrest-docs to reflect this.
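
Roughly, the idea is the following (a sketch; the class and method names are 
illustrative, not the patch's actual code):

{noformat}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: append the child task's working directory to java.library.path
// so native libraries localized via the DistributedCache can be loaded.
public class ChildLibraryPath {
  public static List buildVmArgs(File workDir) {
    String libraryPath = System.getProperty("java.library.path");
    libraryPath = (libraryPath == null)
        ? workDir.getAbsolutePath()
        : libraryPath + File.pathSeparator + workDir.getAbsolutePath();
    List vmArgs = new ArrayList();
    vmArgs.add("-Djava.library.path=" + libraryPath);
    return vmArgs;
  }
}
{noformat}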

 add support for native library toDistributedCache 
 --

 Key: HADOOP-1660
 URL: https://issues.apache.org/jira/browse/HADOOP-1660
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: unix (different handling would be required for windows)
Reporter: Alejandro Abdelnur
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-1660_0_20080108.patch


 Currently if an M/R job depends on a JNI-based component, the dynamic library 
 must be available on all the task nodes. This is not possible, especially 
 when you have no control over the cluster machines and are just using them as 
 a service.
 It should be possible to specify, using the DistributedCache, which native 
 libraries a job needs.
 For example via a new method 'public void addLibrary(Path libraryPath, 
 JobConf conf)'.
 The added libraries would make it to the local FS of the task nodes (the same 
 way as cached resources), but instead of being part of the classpath they 
 would be copied to a lib directory, and that lib directory would be added to 
 the LD_LIBRARY_PATH of the task JVM.
 An alternative would be to set the '-Djava.library.path=' task JVM parameter 
 to the lib directory above. However, this would break for libraries that 
 depend on other libraries, as the dependent one would not be in the 
 LD_LIBRARY_PATH and the OS would fail to find it, since it is not the JVM 
 doing the load of the dependent one.
 For uncached usage of native libraries, a special directory in the JAR could 
 be used for native libraries. But I'd argue that the DistributedCache 
 enhancement would be enough, and if somebody wants to use a native library 
 s/he should use the DistributedCache. Or a JobConf addLibrary method that 
 uses the DistributedCache under the hood at submission time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2487) Provide an option to get job status for all jobs run by or submitted to a job tracker

2008-01-07 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556677#action_12556677
 ] 

Arun C Murthy commented on HADOOP-2487:
---

This patch enhances the _bin/hadoop job_ command like so:
{noformat}
$ bin/hadoop job -list all
{noformat}

However, I think it's better to keep the _job_ command per-job specific and add 
a *listAllJobs* switch to the _bin/hadoop jobtracker_ command:

{noformat}
$ bin/hadoop jobtracker -listalljobs
{noformat}


 Provide an option to get job status for all jobs run by or submitted to a job 
 tracker
 -

 Key: HADOOP-2487
 URL: https://issues.apache.org/jira/browse/HADOOP-2487
 Project: Hadoop
  Issue Type: New Feature
  Components: mapred
Reporter: Hemanth Yamijala
Assignee: Amareshwari Sri Ramadasu
 Fix For: 0.16.0

 Attachments: patch-2487.txt


 This is an RFE for providing an RPC in Hadoop that can expose status 
 information for jobs submitted to a JobTracker. Such a feature can be used 
 to develop tools that analyse jobs.
 It is possible that other information is also useful - such as running times 
 of jobs, etc.
 Comments ?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HADOOP-1660) add support for native library toDistributedCache

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy reassigned HADOOP-1660:
-

Assignee: Arun C Murthy

 add support for native library toDistributedCache 
 --

 Key: HADOOP-1660
 URL: https://issues.apache.org/jira/browse/HADOOP-1660
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
 Environment: unix (different handling would be required for windows)
Reporter: Alejandro Abdelnur
Assignee: Arun C Murthy

 Currently if an M/R job depends on a JNI-based component, the dynamic library 
 must be available on all the task nodes. This is not possible, especially 
 when you have no control over the cluster machines and are just using them as 
 a service.
 It should be possible to specify, using the DistributedCache, which native 
 libraries a job needs.
 For example via a new method 'public void addLibrary(Path libraryPath, 
 JobConf conf)'.
 The added libraries would make it to the local FS of the task nodes (the same 
 way as cached resources), but instead of being part of the classpath they 
 would be copied to a lib directory, and that lib directory would be added to 
 the LD_LIBRARY_PATH of the task JVM.
 An alternative would be to set the '-Djava.library.path=' task JVM parameter 
 to the lib directory above. However, this would break for libraries that 
 depend on other libraries, as the dependent one would not be in the 
 LD_LIBRARY_PATH and the OS would fail to find it, since it is not the JVM 
 doing the load of the dependent one.
 For uncached usage of native libraries, a special directory in the JAR could 
 be used for native libraries. But I'd argue that the DistributedCache 
 enhancement would be enough, and if somebody wants to use a native library 
 s/he should use the DistributedCache. Or a JobConf addLibrary method that 
 uses the DistributedCache under the hood at submission time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HADOOP-2535) Indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)
Indentation fix in TaskRunner.java
--

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Trivial
 Fix For: 0.16.0


TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Priority: Minor  (was: Trivial)
 Summary: Remove support for deprecated mapred.child.heap.size and 
indentation fix in TaskRunner.java  (was: Indentation fix in TaskRunner.java)

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Attachment: HADOOP-2535_0_20080107.patch

Candidate patch which removes support for deprecated {{mapred.child.heap.size}} 
and fixes the indentation irritants. I've also updated some comments...
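
After the removal, the child JVM options come from {{mapred.child.java.opts}} 
alone; roughly (a sketch; the -Xmx200m default mirrors hadoop-default.xml):

{noformat}
// Sketch (conf is the task's JobConf; vmArgs the child JVM command line):
// no more mapred.child.heap.size fallback - one knob for the JVM opts.
String javaOpts = conf.get("mapred.child.java.opts", "-Xmx200m");
vmArgs.add(javaOpts);
{noformat}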

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Status: Patch Available  (was: Open)

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Status: Open  (was: Patch Available)

I noticed, a little too late, that another old comment is _wrong_ ... *smile*

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Attachment: HADOOP-2535_1_20080107.patch

Updated patch to fix:

{noformat}
-  // <name>mapred.child.optional.jvm.args</name>
{noformat}

as

{noformat}
-  // <name>mapred.child.java.opts</name>
{noformat}



 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch, 
 HADOOP-2535_1_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Status: Patch Available  (was: Open)

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch, 
 HADOOP-2535_1_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2535) Remove support for deprecated mapred.child.heap.size and indentation fix in TaskRunner.java

2008-01-06 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2535:
--

Status: Open  (was: Patch Available)

I need to fix some more docs for {{mapred.child.heap.size}} ...

 Remove support for deprecated mapred.child.heap.size and indentation fix in 
 TaskRunner.java
 ---

 Key: HADOOP-2535
 URL: https://issues.apache.org/jira/browse/HADOOP-2535
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.2
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2535_0_20080107.patch, 
 HADOOP-2535_1_20080107.patch


 TaskRunner.java (lines 289-344) has the wrong indentation - 4 spaces rather 
 than the standard 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs

2008-01-05 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556271#action_12556271
 ] 

Arun C Murthy commented on HADOOP-2528:
---

I'm with Doug on the need to fail early if either the input or the output 
directory isn't readable/writable.

I'm wondering if it makes sense to add utility APIs in fs which, given a 
directory name, check for existence, validate against a given set of 
permissions, etc. We could then use these to validate the job inputs via a 
single RPC, rather than one per file as is done today (without this patch). 
Thoughts?
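
A hypothetical shape for such a utility (the class, method and single-RPC 
batching are assumptions, not an existing API):

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: fail early if any input path is missing. A real
// version would also validate permissions, ideally server-side in one RPC.
public class JobPathValidator {
  public static void validateInputs(FileSystem fs, Path[] inputs)
      throws IOException {
    for (int i = 0; i < inputs.length; i++) {
      if (!fs.exists(inputs[i])) {
        throw new IOException("Input path does not exist: " + inputs[i]);
      }
    }
  }
}
{noformat}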

 check permissions for job inputs and outputs
 

 Key: HADOOP-2528
 URL: https://issues.apache.org/jira/browse/HADOOP-2528
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Doug Cutting
 Fix For: 0.16.0

 Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch


 On job submission, filesystem permissions should be checked to ensure that 
 the input directory is readable and that the output directory is writable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2516) HADOOP-1819 removed a public api JobTracker.getTracker in 0.15.0

2008-01-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555850#action_12555850
 ] 

Arun C Murthy commented on HADOOP-2516:
---

bq. There isn't a way to get the old functionality without leaving the static 
variable. I think the static variable and the usage of it was causing trouble 
because the reference was visible before the object was finished being 
constructed.

Fair point. Should we just mark HADOOP-1819 as an *incompatible change* for 
reference?

bq. But whatever the outcome, it certainly shouldn't be marked for fixing in 
0.16.

+1



 HADOOP-1819 removed a public api JobTracker.getTracker in 0.15.0
 

 Key: HADOOP-2516
 URL: https://issues.apache.org/jira/browse/HADOOP-2516
 Project: Hadoop
  Issue Type: Bug
Affects Versions: 0.15.1
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 HADOOP-1819 removed a 0.14.0 public api {{JobTracker.getTracker}} in 0.15.0.
 http://svn.apache.org/viewvc?view=rev&revision=575438 and
 http://svn.apache.org/viewvc/lucene/hadoop/branches/branch-0.15/src/java/org/apache/hadoop/mapred/JobTracker.java?r1=573708&r2=575438&diff_format=h
 There is a simple work-around i.e. use the return value of 
 {{JobTracker.startTracker}} ... yet, is this a 0.15.2 blocker?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2106) Hadoop daemons should support generic command-line options by implementing the Tool interface

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2106:
--

Component/s: mapred
 dfs

I'm marking this for 0.17.0 after a discussion with Hemanth, the original 
requestor of this feature.

 Hadoop daemons should support generic command-line options by implementing 
 the Tool interface
 -

 Key: HADOOP-2106
 URL: https://issues.apache.org/jira/browse/HADOOP-2106
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.16.0


 Hadoop daemons (NN/DN/JT/TT) should support generic command-line options 
 (i.e. -nn / -jt/ -conf / -D) by implementing the Tool interface.
 This is particularly useful for cases where the masters (NN/JT) are to be 
 configured dynamically, e.g. via HoD.
 (I suspect we will also need to tweak some of the hadoop scripts.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1622) Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1622:
--

Component/s: mapred
Description: 
More likely than not, a user's job may depend on multiple jars.
Right now, when submitting a job through bin/hadoop, there is no way for the 
user to specify that. 
A workaround for that is to re-package all the dependent jars into a new jar 
or put the dependent jar files in the lib dir of the new jar.
This workaround causes unnecessary inconvenience to the user. Furthermore, if 
the user does not own the main function 
(as is the case when the user uses Aggregate, datajoin or streaming), the user 
has to re-package those system jar files too.
It is much desired that hadoop provide a clean and simple way for the user to 
specify a list of dependent jar files at the time 
of job submission. Something like:

bin/hadoop  --depending_jars j1.jar:j2.jar 


  was:

More likely than not, a user's job may depend on multiple jars.
Right now, when submitting a job through bin/hadoop, there is no way for the 
user to specify that. 
A workaround for that is to re-package all the dependent jars into a new jar 
or put the dependent jar files in the lib dir of the new jar.
This workaround causes unnecessary inconvenience to the user. Furthermore, if 
the user does not own the main function 
(as is the case when the user uses Aggregate, datajoin or streaming), the user 
has to re-package those system jar files too.
It is much desired that hadoop provide a clean and simple way for the user to 
specify a list of dependent jar files at the time 
of job submission. Something like:

bin/hadoop  --depending_jars j1.jar:j2.jar 



 Hadoop should provide a way to allow the user to specify jar file(s) the user 
 job depends on
 

 Key: HADOOP-1622
 URL: https://issues.apache.org/jira/browse/HADOOP-1622
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Runping Qi
Assignee: Dennis Kubes
 Fix For: 0.16.0

 Attachments: hadoop-1622-4-20071008.patch, HADOOP-1622-5.patch, 
 HADOOP-1622-6.patch, HADOOP-1622-7.patch, HADOOP-1622-8.patch, 
 HADOOP-1622-9.patch, multipleJobJars.patch, multipleJobResources.patch, 
 multipleJobResources2.patch


 More likely than not, a user's job may depend on multiple jars.
 Right now, when submitting a job through bin/hadoop, there is no way for the 
 user to specify that. 
 A workaround for that is to re-package all the dependent jars into a new jar 
 or put the dependent jar files in the lib dir of the new jar.
 This workaround causes unnecessary inconvenience to the user. Furthermore, 
 if the user does not own the main function 
 (as is the case when the user uses Aggregate, datajoin or streaming), the 
 user has to re-package those system jar files too.
 It is much desired that hadoop provide a clean and simple way for the user 
 to specify a list of dependent jar files at the time 
 of job submission. Something like:
 bin/hadoop  --depending_jars j1.jar:j2.jar 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2099) Pending, running, completed tasks should also be shown as percentage

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2099:
--

Component/s: mapred

I'm marking this for 0.17.0.

 Pending, running, completed tasks should also be shown as percentage
 

 Key: HADOOP-2099
 URL: https://issues.apache.org/jira/browse/HADOOP-2099
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.14.0
Reporter: Amar Kamat
Assignee: Amar Kamat
Priority: Minor
 Fix For: 0.16.0

 Attachments: HADOOP-2099.patch, percent.png




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2106) Hadoop daemons should support generic command-line options by implementing the Tool interface

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2106:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

 Hadoop daemons should support generic command-line options by implementing 
 the Tool interface
 -

 Key: HADOOP-2106
 URL: https://issues.apache.org/jira/browse/HADOOP-2106
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.17.0


 Hadoop daemons (NN/DN/JT/TT) should support generic command-line options 
 (i.e. -nn / -jt/ -conf / -D) by implementing the Tool interface.
 This is particularly useful for cases where the masters (NN/JT) are to be 
 configured dynamically, e.g. via HoD.
 (I suspect we will also need to tweak some of the hadoop scripts.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2148) Inefficient FSDataset.getBlockFile()

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2148:
--

Component/s: dfs

 Inefficient FSDataset.getBlockFile()
 

 Key: HADOOP-2148
 URL: https://issues.apache.org/jira/browse/HADOOP-2148
 Project: Hadoop
  Issue Type: Improvement
  Components: dfs
Affects Versions: 0.14.0
Reporter: Konstantin Shvachko
 Fix For: 0.16.0


 FSDataset.getBlockFile() first verifies that the block is valid and then 
 returns the file name corresponding to the block.
 Doing that, it performs the data-node blockMap lookup twice; only one lookup 
 is needed here. 
 This is important since the data-node blockMap is big.
 Another observation is that data-nodes do not need the blockMap at all. File 
 names can be derived from the block IDs;
 there is no need to hold a Block-to-File mapping in memory.
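
For illustration, deriving the file from the ID is a pure function (a sketch; 
data-node block files use the blk_ prefix):

{noformat}
import java.io.File;

// Sketch of the observation above: the block file name is derivable from
// the block ID, so no in-memory Block-to-File map is strictly needed.
public class BlockFileName {
  public static File blockFile(File dataDir, long blockId) {
    return new File(dataDir, "blk_" + blockId);
  }
}
{noformat}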

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2206) Design/implement a general log-aggregation framework for Hadoop

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2206:
--

  Component/s: mapred
   dfs
Fix Version/s: (was: 0.16.0)
   0.17.0

I'm marking for 0.17.0.

 Design/implement a general log-aggregation framework for Hadoop
 ---

 Key: HADOOP-2206
 URL: https://issues.apache.org/jira/browse/HADOOP-2206
 Project: Hadoop
  Issue Type: New Feature
  Components: dfs, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.17.0


 I'd like to propose a log-aggregation framework which facilitates collection, 
 aggregation and storage of the logs of the Hadoop Map-Reduce framework and 
 user-jobs in HDFS. Clearly the design/implementation of this framework is 
 heavily influenced and limited by Hadoop itself, e.g. the lack of appends, 
 the need to avoid too many small files (think: stdout/stderr/syslog of each 
 map/reduce task) and so on. 
 This framework will be especially useful once HoD (HADOOP-1301) is used to 
 provision dynamic, per-user, Map-Reduce clusters.
 h4. Requirements:
 *  Store the various logs to a configurable location in the Hadoop 
 Distributed FileSystem
 ** User task logs (stdout, stderr, syslog)
 ** Map-Reduce daemons' logs (JobTracker and TaskTracker)
 * Integrate well with Hadoop and ensure no adverse performance impact on the 
 Map-Reduce framework.
 * It must not use an HDFS file (or more!) per task, which would swamp the 
 NameNode capabilities.
 * The aggregation system must be distributed and reliable.
 * Facilities/tools to read the aggregated logs.
 * The aggregated logs should be compressed.
 h4. Architecture:
 Here is a high-level overview of the log-aggregation framework:
 h5. Logging
 * Provision a cloud of log-aggregators in the cluster (outside of the Hadoop 
 cluster, running on a subset of nodes in the cluster). Let's call each one 
 in the cloud a Log Aggregator, i.e. an LA.
 * Each LA writes out 2 files per Map-Reduce cluster: an index file and a data 
 file. The LA maintains one directory per Map-Reduce cluster on HDFS.
 * The index file format is simple (a sample entry appears after this list):
 ** streamid (_streamid_ is either daemon identifier e.g. 
 tasktracker_foo.bar.com:57891 or $jobid-$taskid-(stdout|stderr|syslog) or 
 individual task-logs)
 ** timestamp
 ** logs-data start offset
 ** no. of bytes
 * Each Hadoop daemon (JT/TT) is given the entire list of LAs in the cluster.
 * Each daemon picks one LA (at random) from the list, opens an exclusive 
 stream with the LA after identifying itself (i.e. ${daemonid}) and sends its 
 logs. In case of error/failure to log, it just connects to another LA as 
 above and starts logging to it.
 * The logs are sent to the LA by a new log4j appender. The appender provides 
 some amount of buffering on the client-side.
 * Implement a feature in the TaskTracker which lets it use the same appender 
 to send out the userlogs (stdout/stderr/syslog) to the LA after task 
 completion. This is important to ensure that logging to the LA at runtime 
 doesn't hurt the task's performance (see HADOOP-1553). The TaskTracker picks 
 an LA per task in a manner similar to the one it uses for its own logs, 
 identifies itself (${jobid}, ${taskid}, {stdout|stderr|syslog}) and streams 
 the entire task-log in one go. In fact we can pick different LAs for each of 
 the task's stdout, stderr and syslog logs - each an exclusive stream to a 
 single LA.
 * The LA buffers some amount of data in memory (say 16K) and then flushes 
 that data to the HDFS file (per LA per cluster) after writing out an entry to 
 the index file.
 * The LA periodically purges old logs (monthly, fortnightly or weekly as 
 today). 
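 
 To make the index-file format concrete, here is a minimal sketch of one index 
 entry as a Writable; the class and field names are illustrative assumptions, 
 not an existing API:
 {code}
 import java.io.*;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.Writable;
 
 // Illustrative sketch of one LA index-file entry (all names hypothetical).
 class IndexEntry implements Writable {
   Text streamId = new Text(); // e.g. "tasktracker_foo.bar.com:57891"
   long timestamp;             // when the chunk was flushed
   long startOffset;           // offset of the log data in the LA's data file
   int  length;                // no. of bytes of log data
 
   public void write(DataOutput out) throws IOException {
     streamId.write(out);
     out.writeLong(timestamp);
     out.writeLong(startOffset);
     out.writeInt(length);
   }
 
   public void readFields(DataInput in) throws IOException {
     streamId.readFields(in);
     timestamp = in.readLong();
     startOffset = in.readLong();
     length = in.readInt();
   }
 }
 {code}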
 h5. Getting the logged information
 The main requirement is to implement a simple set of tools to query the LA 
 (i.e. the index/data files on HDFS) to glean the logged information.
 If we think of each Map-Reduce cluster's logs as a set of archives (i.e. 
 one file per cluster per LA used), we need the ability to query the 
 log-archive to figure out the available streams, and the ability to fetch one 
 entire stream or a subset of it based on timestamp-ranges. Essentially 
 these are simple tools which parse the index files of each LA (for a given 
 Hadoop cluster) and return the required information.
 h6. Query for available streams
 The query just returns all the available streams in a cluster-log archive 
 identified by the HDFS path.
 It looks something like this for a cluster with 3 nodes which ran 2 jobs, the 
 first of which had 2 maps and 1 reduce, and the second 1 map and 1 reduce:
 {noformat}
$ la -query /log-aggregation/cluster-20071113
Available streams:
jobtracker_foo.bar.com:57893
tasktracker_baz.bar.com:57841

[jira] Updated: (HADOOP-2447) HDFS should be capable of limiting the total number of inodes in the system

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2447:
--

Component/s: dfs

 HDFS should be capable of limiting the total number of inodes in the system
 ---

 Key: HADOOP-2447
 URL: https://issues.apache.org/jira/browse/HADOOP-2447
 Project: Hadoop
  Issue Type: New Feature
  Components: dfs
Reporter: Sameer Paranjpye
Assignee: dhruba borthakur
 Fix For: 0.16.0

 Attachments: fileLimit.patch, fileLimit2.patch


 The HDFS Namenode should be capable of limiting the total number of Inodes 
 (files + directories). This can be done through a config variable, settable in 
 hadoop-site.xml. The default should be no limit.
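 
 A sketch of what this could look like in hadoop-site.xml; the property name 
 below is an assumption (with 0 meaning no limit):
 {noformat}
 <!-- hypothetical property name; 0 = no limit -->
 <property>
   <name>dfs.max.objects</name>
   <value>500000</value>
 </property>
 {noformat}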

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2054) Improve memory model for map-side sorts

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2054:
--

Fix Version/s: (was: 0.16.0)

Pushing this to 0.17.0 and beyond...

 Improve memory model for map-side sorts
 ---

 Key: HADOOP-2054
 URL: https://issues.apache.org/jira/browse/HADOOP-2054
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 {{MapTask#MapOutputBuffer}} uses a plain-jane {{DataOutputBuffer}} which 
 defaults to a 32-byte buffer, and the {{DataOutputBuffer#write}} 
 call doubles the underlying byte-array when it needs more space.
 However for maps which output any decent amount of data (e.g. 128MB in 
 examples/Sort.java) this means the buffer grows painfully slowly from 2^6 to 
 2^28, and each time this results in a new array being created, followed by an 
 array-copy:
 {noformat}
 public void write(DataInput in, int len) throws IOException {
   int newcount = count + len;
   if (newcount > buf.length) {
 byte newbuf[] = new byte[Math.max(buf.length << 1, newcount)];
 System.arraycopy(buf, 0, newbuf, 0, count);
 buf = newbuf;
   }
   in.readFully(buf, count, len);
   count = newcount;
 }
 {noformat}
 I reckon we could do much better in the {{MapTask}}, specifically... 
 E.g. we start with a buffer of size 1/4KB and quadruple, rather than double, 
 up to, say, 4/8/16MB, and only then resume doubling (or less); this policy is 
 sketched below.
 This way the buffer ramps up quickly, minimizing the no. of 
 {{System.arraycopy}} calls and the number of small, short-lived buffers to GC; 
 the later switch to doubling ensures we don't ramp up too fast and waste 
 memory to fragmentation.
 Of course, this issue is about benchmarking and figuring if all this is worth 
 it, and, if so, what are the right set of trade-offs to make.
 Thoughts?
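 
 A minimal sketch of the growth policy described above (quadruple while small, 
 then double); the 4MB cross-over point is purely illustrative:
 {code}
 // Quadruple while small to minimize System.arraycopy calls and short-lived
 // garbage; double past the threshold to limit wastage due to fragmentation.
 static int nextCapacity(int current, int needed) {
   final int QUADRUPLE_LIMIT = 4 << 20;  // illustrative 4MB cross-over point
   int grown = (current < QUADRUPLE_LIMIT) ? current << 2 : current << 1;
   return Math.max(grown, needed);
 }
 {code}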

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2099) Pending, running, completed tasks should also be shown as percentage

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2099:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

 Pending, running, completed tasks should also be shown as percentage
 

 Key: HADOOP-2099
 URL: https://issues.apache.org/jira/browse/HADOOP-2099
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.14.0
Reporter: Amar Kamat
Assignee: Amar Kamat
Priority: Minor
 Fix For: 0.17.0

 Attachments: HADOOP-2099.patch, percent.png




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2120) dfs -getMerge does not do what it says it does

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2120:
--

Component/s: (was: mapred)
 fs

 dfs -getMerge does not do what it says it does
 --

 Key: HADOOP-2120
 URL: https://issues.apache.org/jira/browse/HADOOP-2120
 Project: Hadoop
  Issue Type: Bug
  Components: fs
Affects Versions: 0.14.3
 Environment: All
Reporter: Milind Bhandarkar
 Fix For: 0.16.0


 dfs -getMerge, which calls FileUtil.CopyMerge, contains this javadoc:
 {code}
 Get all the files in the directories that match the source file pattern
* and merge and sort them to only one file on local fs 
* srcf is kept.
 {code}
 However, it only concatenates the set of input files, rather than merging 
 them in sorted order.
 Ideally, the copyMerge should be equivalent to a map-reduce job with 
 IdentityMapper and IdentityReducer with numReducers = 1. However, not having 
 to run this as a map-reduce job has some advantages, since it increases 
 cluster utilization during the reduce phase.
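 
 To illustrate the gap, here is a minimal sketch of the merge the javadoc 
 promises, i.e. a k-way merge of already-sorted text inputs (the helper below 
 is hypothetical; the current copyMerge simply appends the files one after 
 another):
 {code}
 import java.io.*;
 import java.util.*;
 
 // Hypothetical helper: merge already-sorted line-oriented inputs in sorted
 // order, as an IdentityMapper/IdentityReducer job with one reduce would.
 static void sortedMerge(List<BufferedReader> inputs, Writer out)
     throws IOException {
   // heap entries: { current line, the reader it came from }
   PriorityQueue<Object[]> heap = new PriorityQueue<Object[]>(
       Math.max(1, inputs.size()), new Comparator<Object[]>() {
         public int compare(Object[] a, Object[] b) {
           return ((String) a[0]).compareTo((String) b[0]);
         }
       });
   for (BufferedReader r : inputs) {
     String line = r.readLine();
     if (line != null) heap.add(new Object[] { line, r });
   }
   while (!heap.isEmpty()) {
     Object[] top = heap.poll();
     out.write((String) top[0]);
     out.write('\n');
     String next = ((BufferedReader) top[1]).readLine();
     if (next != null) heap.add(new Object[] { next, top[1] });
   }
 }
 {code}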

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2141) speculative execution start up condition based on completion time

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2141:
--


Moving this to 0.17.0 as discussions are still on...

 speculative execution start up condition based on completion time
 -

 Key: HADOOP-2141
 URL: https://issues.apache.org/jira/browse/HADOOP-2141
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.15.0
Reporter: Koji Noguchi
Assignee: Arun C Murthy
 Fix For: 0.17.0


 We had one job with speculative execution hang.
 4 reduce tasks were stuck with 95% completion because of a bad disk. 
 Devaraj pointed out 
 bq. One of the conditions that must be met for launching a speculative 
 instance of a task is that it must be at least 20% behind the average 
 progress, and this is not true here.
 It would be nice if speculative execution also starts up when tasks stop 
 making progress.
 Devaraj suggested 
 bq. Maybe, we should introduce a condition for average completion time for 
 tasks in the speculative execution check. 
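 
 A sketch of the kind of combined check being suggested; the names and 
 thresholds below are illustrative, not the actual JobInProgress code:
 {code}
 // Existing-style progress-gap test OR the suggested "no progress for too
 // long" test (all names and thresholds are hypothetical).
 boolean shouldSpeculate(double taskProgress, double avgProgress,
                         long now, long lastProgressTime) {
   final double SPECULATIVE_GAP = 0.2;              // 20% behind the average
   final long   STALL_TIMEOUT   = 10 * 60 * 1000L;  // 10 min without progress
   boolean farBehind = taskProgress < avgProgress - SPECULATIVE_GAP;
   boolean stalled   = (now - lastProgressTime) > STALL_TIMEOUT;
   return farBehind || stalled;
 }
 {code}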

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1986:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

I'm moving this to 0.17.0 while we continue discussions here...

 Add support for a general serialization mechanism for Map Reduce
 

 Key: HADOOP-1986
 URL: https://issues.apache.org/jira/browse/HADOOP-1986
 Project: Hadoop
  Issue Type: New Feature
  Components: mapred
Reporter: Tom White
Assignee: Tom White
 Fix For: 0.17.0

 Attachments: hadoop-serializer-v2.tar.gz, SerializableWritable.java, 
 serializer-v1.patch, serializer-v2.patch


 Currently Map Reduce programs have to use WritableComparable-Writable 
 key-value pairs. While it's possible to write Writable wrappers for other 
 serialization frameworks (such as Thrift), this is not very convenient: it 
 would be nicer to be able to use arbitrary types directly, without explicit 
 wrapping and unwrapping.
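 
 As a sketch of the kind of pluggable mechanism meant here (the interface below 
 is illustrative, not the committed API from the attached patches):
 {code}
 import java.io.*;
 
 // Illustrative generic serializer: lets Map-Reduce move arbitrary types
 // without Writable wrappers. Names are assumptions.
 interface Serializer<T> {
   void open(OutputStream out) throws IOException;
   void serialize(T t) throws IOException;
   void close() throws IOException;
 }
 {code}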

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2125) Exception thrown for URL.openConnection used in the shuffle phase should be caught thus making it possible to reuse the connection for future use

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2125:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

I'm moving this to 0.17.0; we need more investigation into HTTP keep-alive etc. 
The sketch below gives the relevant idiom.
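
For reference, the standard Java idiom this points at: when a request over 
HttpURLConnection fails, draining and closing the error stream is what allows 
the underlying keep-alive socket to be reused. A minimal sketch, independent of 
the shuffle code:
{code}
import java.io.*;
import java.net.*;

// Drain the error stream on failure so the keep-alive connection can be
// reused by subsequent requests (standard HttpURLConnection behaviour).
static void drainOnError(HttpURLConnection conn) {
  try {
    InputStream err = conn.getErrorStream();
    if (err != null) {
      byte[] buf = new byte[4096];
      while (err.read(buf) != -1) { /* discard */ }
      err.close();
    }
  } catch (IOException ignored) {
    // nothing more to do; the connection simply won't be reused
  }
}
{code}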

 Exception thrown for URL.openConnection used in the shuffle phase should be 
 caught thus making it possible to reuse the connection for future use
 -

 Key: HADOOP-2125
 URL: https://issues.apache.org/jira/browse/HADOOP-2125
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.16.0
Reporter: Amar Kamat
Assignee: Amar Kamat
 Fix For: 0.17.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HADOOP-2167) Reduce tips complete 100%, but job does not complete saying reduces still running.

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy resolved HADOOP-2167.
---

Resolution: Cannot Reproduce

We haven't seen this nor can we seem to repro it. Also HADOOP-2216 led us 
astray...

I'm closing this for now, please re-open if required.

 Reduce tips complete 100%, but job does not complete saying reduces still 
 running.
 --

 Key: HADOOP-2167
 URL: https://issues.apache.org/jira/browse/HADOOP-2167
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Reporter: Amareshwari Sri Ramadasu
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.16.0


 The job's reduces are stuck at 99.43% progress with 2 reduces in the running 
 state, and the job does not complete.
 But the reduce task list on the JobTracker shows they are 100% complete, 
 marked as SUCCEEDED, and the Finishtime is available on jobtasks.jsp and in 
 the jobhistory as well.
 With ipc.client.timeout = 60, the exceptions on the TTs running the reduces 
 are:
 On one of the TTs, the logs show the following:
 2007-11-07 08:34:16,092 INFO org.apache.hadoop.mapred.TaskTracker: Task 
 task_200711070637_0001_r_000150_0 is done.
 2007-11-07 08:35:34,013 INFO org.apache.hadoop.mapred.TaskTracker: Task 
 task_200711070637_0001_r_000156_0 is done.
 2007-11-07 08:42:44,751 ERROR org.apache.hadoop.mapred.TaskTracker: Caught 
 exception: java.net.SocketTimeoutException: timedout waiting for rpc response
 at org.apache.hadoop.ipc.Client.call(Client.java:484)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
 at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
 at 
 org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:897)
 at 
 org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:799)
 at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1193)
 at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2055)
 2007-11-07 08:42:44,767 INFO org.apache.hadoop.mapred.TaskTracker: Resending 
 'status' to ...
 On the other TT,
 2007-11-07 08:40:30,484 INFO org.apache.hadoop.mapred.TaskTracker: Task 
 task_200711070637_0001_r_000160_0 is done.
 2007-11-07 08:42:45,508 ERROR org.apache.hadoop.mapred.TaskTracker: Caught 
 exception: java.net.SocketTimeoutException: timedout waiting for rpc response
 at org.apache.hadoop.ipc.Client.call(Client.java:484)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
 at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
 at 
 org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:897)
 at 
 org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:799)
 at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1193)
 at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2055)
 2007-11-07 08:42:45,508 INFO org.apache.hadoop.mapred.TaskTracker: Resending 
 'status' to ...
 On JT logs, the reduce tasks are done successfully:
 2007-11-07 06:39:09,151 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
 'task_200711070637_0001_r_000160_0' to tip tip_200711070637_0001_r_000160, 
 for tracker 'x'
 2007-11-07 08:42:45,708 INFO org.apache.hadoop.mapred.TaskRunner: Saved 
 output of task 'task_200711070637_0001_r_000160_0' to 'y'
 2007-11-07 08:42:45,708 INFO org.apache.hadoop.mapred.JobInProgress: Task 
 'task_200711070637_0001_r_000160_0' has completed 
 tip_200711070637_0001_r_000160 successfully.
 This would suggest that if tasks are done before the timeout, the problem 
 occurs in the progress update. The behaviour is also not consistent, since 
 other reduce tasks in the same situation succeed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HADOOP-1733) LocalJobRunner uses old-style job/tip ids

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy resolved HADOOP-1733.
---

Resolution: Won't Fix

HADOOP-544 will make this moot... either way, as Doug notes:

{quote}
I don't see a need for job ids to be identical between localrunner and 
jobtracker. User code should not rely on the format of job ids. Having them 
different helps enforce that! We should never parse job ids, but only require 
them to be sufficiently unique.
{quote}

 LocalJobRunner uses old-style job/tip ids
 -

 Key: HADOOP-1733
 URL: https://issues.apache.org/jira/browse/HADOOP-1733
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.14.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0


 We should rework LocalJobRunner to use the new style job/tip ids (post 
 HADOOP-1473).
 Is this a *blocker*? This isn't a functionality bug, yet ...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2221) Configuration.toString is broken

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2221:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

Moving this to 0.17.0.

 Configuration.toString is broken
 

 Key: HADOOP-2221
 URL: https://issues.apache.org/jira/browse/HADOOP-2221
 Project: Hadoop
  Issue Type: Bug
  Components: conf
Affects Versions: 0.15.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.17.0

 Attachments: HADOOP-2221_1_2007117.patch


 {{Configuration.toString}} doesn't string-ify the {{Configuration.resources}} 
 field which was added in HADOOP-785.
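 
 A minimal sketch of the shape of the fix; the field names are assumptions, not 
 the attached patch:
 {code}
 // Sketch: include the resources list (added in HADOOP-785) when
 // string-ifying the Configuration; field names are hypothetical.
 public String toString() {
   StringBuffer sb = new StringBuffer("Configuration: ");
   sb.append("resources: ").append(resources);
   return sb.toString();
 }
 {code}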

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2141) speculative execution start up condition based on completion time

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2141:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

 speculative execution start up condition based on completion time
 -

 Key: HADOOP-2141
 URL: https://issues.apache.org/jira/browse/HADOOP-2141
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Affects Versions: 0.15.0
Reporter: Koji Noguchi
Assignee: Arun C Murthy
 Fix For: 0.17.0


 We had one job with speculative execution hang.
 4 reduce tasks were stuck with 95% completion because of a bad disk. 
 Devaraj pointed out 
 bq. One of the conditions that must be met for launching a speculative 
 instance of a task is that it must be at least 20% behind the average 
 progress, and this is not true here.
 It would be nice if speculative execution also starts up when tasks stop 
 making progress.
 Devaraj suggested 
 bq. Maybe, we should introduce a condition for average completion time for 
 tasks in the speculative execution check. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2165) Augment JobHistory to store tasks' userlogs

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2165:
--

Fix Version/s: (was: 0.16.0)
   0.17.0

Pushing this to 0.17.0.

 Augment JobHistory to store tasks' userlogs
 ---

 Key: HADOOP-2165
 URL: https://issues.apache.org/jira/browse/HADOOP-2165
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy
 Fix For: 0.17.0


 It will be very useful to be able to see the job's userlogs (the 
 stdout/stderr/syslog of the tasks) from the JobHistory page. It will greatly 
 aid in debugging etc.
 At the very minimum we should have links from the JobHistory to the logs on 
 the TT.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-1281) Speculative map tasks aren't getting killed although the TIP completed

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-1281:
--

Priority: Critical  (was: Major)

I'm marking up the priority to reflect that this is an important bug to fix for 
0.16.0; we are losing lots of cycles due to this.

 Speculative map tasks aren't getting killed although the TIP completed
 --

 Key: HADOOP-1281
 URL: https://issues.apache.org/jira/browse/HADOOP-1281
 Project: Hadoop
  Issue Type: Bug
  Components: mapred
Affects Versions: 0.15.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 0.16.0

 Attachments: HADOOP-1281_1_20071117.patch, 
 HADOOP-1281_2_20071123.patch


 The speculative map tasks run to completion although the TIP succeeded since 
 the other task completed elsewhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2510) Map-Reduce 2.0

2008-01-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555883#action_12555883
 ] 

Arun C Murthy commented on HADOOP-2510:
---

bq. [...] as opposed to a regular heartbeat from the TaskTracker. Probes can be 
done intelligently depending on the state of the overall Job and could 
significantly reduce network RPC traffic. Does this matter in practice on large 
clusters?

Yes. To clarify, the idea is that the JobManager pings the TaskTrackers (today 
the TaskTracker pings the JobTracker) for status-updates for its tasks. Clearly 
it only pings the TaskTrackers which are _currently_ running its tasks.

bq. currently, SPECULATIVE_GAP and SPECULATIVE_LAG control speculative 
execution at the task level. As with heartbeats versus probes, wouldn't this be 
better handled at the JobManager/MapReduce master level? Either way, this 
should be a JobConf param.

Yes. Again, the idea is that the JobManager decides to schedule 
speculative-tasks via SPECULATIVE_{LAG|GAP} etc., same as the normal tasks. It 
then asks the JobScheduler for free TaskTrackers. 

Thus _which_ task needs to run (normal/failed/speculative) is decided by the 
JobManager, whereas _where_ the task should be run (i.e. on which TaskTracker) 
is decided by the JobScheduler; the latter doesn't care about the nature of the 
task (though it does care about the job's priorities etc.).

bq. We set this to off for our cluster because it has caused severe instability 
when running many jobs simultaneously.

Which version of Hadoop are you running? Things have improved a fair bit 
recently; further improvements are underway (HADOOP-2141).

 Map-Reduce 2.0
 --

 Key: HADOOP-2510
 URL: https://issues.apache.org/jira/browse/HADOOP-2510
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy

 We, at Yahoo!, have been using Hadoop-On-Demand as the resource 
 provisioning/scheduling mechanism. 
 With HoD the user uses a self-service system to ask for a set of nodes. HoD 
 allocates these from a global pool and also provisions a private Map-Reduce 
 cluster for the user. She then runs her jobs and shuts the cluster down via 
 HoD when done. All user-private clusters use the same humongous, static HDFS 
 (e.g. 2k node HDFS). 
 More details about HoD are available here: HADOOP-1301.
 
 h3. Motivation
 The current deployment (Hadoop + HoD) has a couple of implications:
  * _Non-optimal Cluster Utilization_
1. Job-private Map-Reduce clusters imply that the user-cluster potentially 
 could be *idle* for at least a while before being detected and shut-down.
2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with 
 much-smaller no. of reduces; with maps being light and quick and reduces 
 being i/o heavy and longer-running. Users typically allocate clusters 
 depending on the no. of maps (i.e. input size) which leads to the scenario 
 where all the maps are done (idle nodes in the cluster) and the few reduces 
 are chugging along. Right now, we do not have the ability to shrink the 
 HoD'ed Map-Reduce clusters which would alleviate this issue. 
  * _Impact on data-locality_
 With the current setup of a static, large HDFS and much smaller (5/10/20/50 
 node) clusters there is a good chance of losing one of Map-Reduce's primary 
 features: ability to execute tasks on the datanodes where the input splits 
 are located. In fact, we have seen the data-local tasks go down to 20-25 
 percent in the GridMix benchmarks, from the 95-98 percent we see on the 
 randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a 
 synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware 
 Map-Reduce) helps significantly here.
 
 Primarily, the notion of *job-level scheduling* leading to private clusters, 
 as opposed to *task-level scheduling*, is a good peg on which to hang the 
 majority of the blame.
 Keeping the above factors in mind, here are some thoughts on how to 
 re-structure Hadoop Map-Reduce to solve some of these issues.
 
 h3. State of the Art
 As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD 
 for a bit) does provide task-level scheduling; however, its scalability to 
 tens-of-thousands of user-jobs per week is in question.
 Let's review its current architecture and main components:
  * JobTracker: It does both *task-scheduling* and *task-monitoring* 
 (tasktrackers send task-statuses via periodic heartbeats), which implies it 
 is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce 
 framework i.e. its failure implies that all the jobs in the system fail. This 
 means a static, large Map-Reduce cluster is fairly susceptible and a definite 
 suspect. Clearly HoD solves this by having per-job clusters, albeit with the 
 above drawbacks.
  * 

[jira] Updated: (HADOOP-2390) Document the user-controls for intermediate/output compression via forrest

2008-01-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2390:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this.

 Document the user-controls for intermediate/output compression via forrest
 --

 Key: HADOOP-2390
 URL: https://issues.apache.org/jira/browse/HADOOP-2390
 Project: Hadoop
  Issue Type: Improvement
  Components: documentation, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-2390_1_20071221.patch, 
 HADOOP-2390_2_20080102.patch


 We should document the user-controls for compressing the intermediate and job 
 outputs, including the types (record/block) and the various codecs in the 
 hadoop website via forrest (mapred_tutorial.html).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HADOOP-2516) HADOOP-1819 removed a public api JobTracker.getTracker in 0.15.0

2008-01-03 Thread Arun C Murthy (JIRA)
HADOOP-1819 removed a public api JobTracker.getTracker in 0.15.0


 Key: HADOOP-2516
 URL: https://issues.apache.org/jira/browse/HADOOP-2516
 Project: Hadoop
  Issue Type: Bug
Affects Versions: 0.15.1
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0


HADOOP-1819 removed a 0.14.0 public api {{JobTracker.getTracker}} in 0.15.0.

http://svn.apache.org/viewvc?view=rev&revision=575438 and
http://svn.apache.org/viewvc/lucene/hadoop/branches/branch-0.15/src/java/org/apache/hadoop/mapred/JobTracker.java?r1=573708&r2=575438&diff_format=h

There is a simple work-around i.e. use the return value of 
{{JobTracker.startTracker}} ... yet, is this a 0.15.2 blocker?
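
The work-around in code form, a sketch assuming the 0.15-era signatures:
{code}
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobTracker;

public class StartJT {
  public static void main(String[] args) throws Exception {
    // Keep the handle returned by startTracker instead of calling the
    // removed JobTracker.getTracker().
    JobTracker tracker = JobTracker.startTracker(new JobConf());
    tracker.offerService();
  }
}
{code}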

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2510) Map-Reduce 2.0

2008-01-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555634#action_12555634
 ] 

Arun C Murthy commented on HADOOP-2510:
---

bq. 2) One of our problems [...]

Right, this will not affect your special case at all... you can continue to run 
multiple clusters on the same machines with different configs, ports etc.

bq. I'm not totally sure [...]

Yep. The point is to get people to think about ways of improving Map-Reduce to 
be scalable/reliable while maintaining the single static MR cluster, doing away 
with the notion of job-private clusters (i.e. HoD), as expounded in the 
Motivation section.

The stretch goal is to see if we can enhance it to support other, non-MR 
paradigms too.

bq. You discuss the jobtracker being a single point of failure, but the 
namenode is already a more serious point of failure, since it is much more work 
to rebuild a namenode if it dies.

Sure, that is at least as important; however I believe it's unrelated to this 
discussion.

 Map-Reduce 2.0
 --

 Key: HADOOP-2510
 URL: https://issues.apache.org/jira/browse/HADOOP-2510
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy

 We, at Yahoo!, have been using Hadoop-On-Demand as the resource 
 provisioning/scheduling mechanism. 
 With HoD the user uses a self-service system to ask for a set of nodes. HoD 
 allocates these from a global pool and also provisions a private Map-Reduce 
 cluster for the user. She then runs her jobs and shuts the cluster down via 
 HoD when done. All user-private clusters use the same humongous, static HDFS 
 (e.g. 2k node HDFS). 
 More details about HoD are available here: HADOOP-1301.
 
 h3. Motivation
 The current deployment (Hadoop + HoD) has a couple of implications:
  * _Non-optimal Cluster Utilization_
1. Job-private Map-Reduce clusters imply that the user-cluster potentially 
 could be *idle* for at least a while before being detected and shut-down.
2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with 
 much-smaller no. of reduces; with maps being light and quick and reduces 
 being i/o heavy and longer-running. Users typically allocate clusters 
 depending on the no. of maps (i.e. input size) which leads to the scenario 
 where all the maps are done (idle nodes in the cluster) and the few reduces 
 are chugging along. Right now, we do not have the ability to shrink the 
 HoD'ed Map-Reduce clusters which would alleviate this issue. 
  * _Impact on data-locality_
 With the current setup of a static, large HDFS and much smaller (5/10/20/50 
 node) clusters there is a good chance of losing one of Map-Reduce's primary 
 features: ability to execute tasks on the datanodes where the input splits 
 are located. In fact, we have seen the data-local tasks go down to 20-25 
 percent in the GridMix benchmarks, from the 95-98 percent we see on the 
 randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a 
 synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware 
 Map-Reduce) helps significantly here.
 
 Primarily, the notion of *job-level scheduling* leading to private clusters, 
 as opposed to *task-level scheduling*, is a good peg on which to hang the 
 majority of the blame.
 Keeping the above factors in mind, here are some thoughts on how to 
 re-structure Hadoop Map-Reduce to solve some of these issues.
 
 h3. State of the Art
 As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD 
 for a bit) does provide task-level scheduling; however, its scalability to 
 tens-of-thousands of user-jobs per week is in question.
 Let's review its current architecture and main components:
  * JobTracker: It does both *task-scheduling* and *task-monitoring* 
 (tasktrackers send task-statuses via periodic heartbeats), which implies it 
 is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce 
 framework i.e. its failure implies that all the jobs in the system fail. This 
 means a static, large Map-Reduce cluster is fairly susceptible and a definite 
 suspect. Clearly HoD solves this by having per-job clusters, albeit with the 
 above drawbacks.
  * TaskTracker: The slave in the system which executes one task at-a-time 
 under directions from the JobTracker.
  * JobClient: The per-job client which just submits the job and polls the 
 JobTracker for status. 
 
 h3. Proposal - Map-Reduce 2.0 
 The primary idea is to move to task-level scheduling and static Map-Reduce 
 clusters (so as to maintain the same storage cluster and compute cluster 
 paradigm) as a way to directly tackle the two main issues illustrated above. 
 Clearly, we will have to get around the existing problems, especially w.r.t. 
 scalability and reliability.
 The proposal is to re-work Hadoop 

[jira] Updated: (HADOOP-2344) Free up the buffers (input and error) while executing a shell command before waiting for it to finish.

2008-01-02 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2344:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amar!

 Free up the buffers (input and error) while executing a shell command before 
 waiting for it to finish.
 --

 Key: HADOOP-2344
 URL: https://issues.apache.org/jira/browse/HADOOP-2344
 Project: Hadoop
  Issue Type: Bug
Affects Versions: 0.16.0
Reporter: Amar Kamat
Assignee: Amar Kamat
 Fix For: 0.16.0

 Attachments: HADOOP-2231.patch, HADOOP-2344.patch, HADOOP-2344.patch, 
 HADOOP-2344.patch, HADOOP-2344.patch, HADOOP-2344.patch, HADOOP-2344.patch


 Process.waitFor() should be invoked after freeing up the input and error 
 stream.  While fixing https://issues.apache.org/jira/browse/HADOOP-2231 we 
 found that this might be a possible cause.
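 
 For context, a minimal sketch of the pattern in question: consume stdout and 
 stderr fully before waitFor(), so a full pipe buffer cannot block the child 
 (illustrative, not the attached patch; real code drains the two streams in 
 separate threads):
 {code}
 import java.io.*;
 
 static int run(ProcessBuilder pb) throws IOException, InterruptedException {
   Process p = pb.start();
   drain(p.getInputStream());  // child's stdout
   drain(p.getErrorStream());  // child's stderr
   return p.waitFor();         // only wait once both pipes are empty
 }
 
 static void drain(InputStream in) throws IOException {
   byte[] buf = new byte[4096];
   while (in.read(buf) != -1) { /* discard */ }
   in.close();
 }
 {code}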

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2390) Document the user-controls for intermediate/output compression via forrest

2008-01-02 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2390:
--

Status: Open  (was: Patch Available)

Devaraj pointed out that the current patch doesn't talk about native-hadoop 
compression libraries...

Unfortunately this means we need to port 
http://wiki.apache.org/lucene-hadoop/NativeHadoop to forrest and then link off 
of it; a new patch is forthcoming.

 Document the user-controls for intermediate/output compression via forrest
 --

 Key: HADOOP-2390
 URL: https://issues.apache.org/jira/browse/HADOOP-2390
 Project: Hadoop
  Issue Type: Improvement
  Components: documentation, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-2390_1_20071221.patch


 We should document the user-controls for compressing the intermediate and job 
 outputs, including the types (record/block) and the various codecs in the 
 hadoop website via forrest (mapred_tutorial.html).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2390) Document the user-controls for intermediate/output compression via forrest

2008-01-02 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2390:
--

Status: Patch Available  (was: Open)

 Document the user-controls for intermediate/output compression via forrest
 --

 Key: HADOOP-2390
 URL: https://issues.apache.org/jira/browse/HADOOP-2390
 Project: Hadoop
  Issue Type: Improvement
  Components: documentation, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-2390_1_20071221.patch, 
 HADOOP-2390_2_20080102.patch


 We should document the user-controls for compressing the intermediate and job 
 outputs, including the types (record/block) and the various codecs in the 
 hadoop website via forrest (mapred_tutorial.html).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2390) Document the user-controls for intermediate/output compression via forrest

2008-01-02 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2390:
--

Attachment: HADOOP-2390_2_20080102.patch

Promised patch incorporating details found in 
http://wiki.apache.org/lucene-hadoop/NativeHadoop into forrest-based 
native_libraries.html.

 Document the user-controls for intermediate/output compression via forrest
 --

 Key: HADOOP-2390
 URL: https://issues.apache.org/jira/browse/HADOOP-2390
 Project: Hadoop
  Issue Type: Improvement
  Components: documentation, mapred
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-2390_1_20071221.patch, 
 HADOOP-2390_2_20080102.patch


 We should document the user-controls for compressing the intermediate and job 
 outputs, including the types (record/block) and the various codecs in the 
 hadoop website via forrest (mapred_tutorial.html).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HADOOP-2510) Map-Reduce 2.0

2008-01-02 Thread Arun C Murthy (JIRA)
Map-Reduce 2.0
--

 Key: HADOOP-2510
 URL: https://issues.apache.org/jira/browse/HADOOP-2510
 Project: Hadoop
  Issue Type: Improvement
  Components: mapred
Reporter: Arun C Murthy


We, at Yahoo!, have been using Hadoop-On-Demand as the resource 
provisioning/scheduling mechanism. 

With HoD the user uses a self-service system to ask for a set of nodes. HoD 
allocates these from a global pool and also provisions a private Map-Reduce 
cluster for the user. She then runs her jobs and shuts the cluster down via HoD 
when done. All user-private clusters use the same humongous, static HDFS (e.g. 
2k node HDFS). 

More details about HoD are available here: HADOOP-1301.



h3. Motivation

The current deployment (Hadoop + HoD) has a couple of implications:

 * _Non-optimal Cluster Utilization_

   1. Job-private Map-Reduce clusters imply that the user-cluster potentially 
could be *idle* for at least a while before being detected and shut-down.

   2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with 
much-smaller no. of reduces; with maps being light and quick and reduces being 
i/o heavy and longer-running. Users typically allocate clusters depending on 
the no. of maps (i.e. input size) which leads to the scenario where all the 
maps are done (idle nodes in the cluster) and the few reduces are chugging 
along. Right now, we do not have the ability to shrink the HoD'ed Map-Reduce 
clusters which would alleviate this issue. 

 * _Impact on data-locality_

With the current setup of a static, large HDFS and much smaller (5/10/20/50 
node) clusters there is a good chance of losing one of Map-Reduce's primary 
features: ability to execute tasks on the datanodes where the input splits are 
located. In fact, we have seen the data-local tasks go down to 20-25 percent in 
the GridMix benchmarks, from the 95-98 percent we see on the randomwriter+sort 
runs run as part of the hadoopqa benchmarks (admittedly a synthetic benchmark, 
but yet). Admittedly, HADOOP-1985 (rack-aware Map-Reduce) helps significantly 
here.



Primarily, the notion of *job-level scheduling* leading to private clusters, as 
opposed to *task-level scheduling*, is a good peg on which to hang the majority 
of the blame.

Keeping the above factors in mind, here are some thoughts on how to 
re-structure Hadoop Map-Reduce to solve some of these issues.



h3. State of the Art

As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD for 
a bit) does provide task-level scheduling; however, its scalability to 
tens-of-thousands of user-jobs per week is in question.

Let's review its current architecture and main components:

 * JobTracker: It does both *task-scheduling* and *task-monitoring* 
(tasktrackers send task-statuses via periodic heartbeats), which implies it is 
fairly loaded. It is also a _single-point of failure_ in the Map-Reduce 
framework i.e. its failure implies that all the jobs in the system fail. This 
means a static, large Map-Reduce cluster is fairly susceptible and a definite 
suspect. Clearly HoD solves this by having per-job clusters, albeit with the 
above drawbacks.
 * TaskTracker: The slave in the system which executes one task at-a-time under 
directions from the JobTracker.
 * JobClient: The per-job client which just submits the job and polls the 
JobTracker for status. 



h3. Proposal - Map-Reduce 2.0 

The primary idea is to move to task-level scheduling and static Map-Reduce 
clusters (so as to maintain the same storage cluster and compute cluster 
paradigm) as a way to directly tackle the two main issues illustrated above. 
Clearly, we will have to get around the existing problems, especially w.r.t. 
scalability and reliability.

The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, 
static cluster. 

Here is an overview of how its main components would look like:
 * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. 
Let's call this the *JobScheduler* henceforth. Clearly (data-locality aware) 
Maui/Moab are candidates for being the scheduler, in which case the 
JobScheduler is just a thin wrapper around them. 
 * TaskTracker: These stay as before, with some minor changes as illustrated 
later in the piece.
 * JobClient: Fatten up the JobClient by putting a lot more intelligence into 
it. Enhance it to talk to the JobTracker to ask for available TaskTrackers and 
then contact them to schedule and monitor the tasks. So we'll have lots of 
per-job clients talking to the JobScheduler and the relevant TaskTrackers for 
their respective jobs, a big change from today. Let's call this the 
*JobManager* henceforth (see the sketch after this list). 
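
To make the division of labour concrete, an illustrative sketch of the two 
roles (a pure strawman for discussion, not proposed code):
{code}
import java.util.List;

// All names hypothetical. The JobScheduler only hands out TaskTracker slots;
// each per-job JobManager decides which task runs where and monitors it.
interface JobScheduler {
  List<String> allocateTrackers(String jobId, int numSlots);
}

interface JobManager {
  void scheduleTask(String tracker, String taskId);   // which task, where
  float probeProgress(String tracker, String taskId); // manager pulls status
}
{code}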

A broad sketch of how things would work: 

h4. Deployment

There is a single, static, large Map-Reduce cluster, and no per-job clusters.

Essentially there is one global JobScheduler with thousands of independent 

[jira] Updated: (HADOOP-2511) HADOOP-2344 introduced a javadoc warning

2008-01-02 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2511:
--

Attachment: HADOOP-2511_1_20080103.patch

Straight-forward fix.

 HADOOP-2344 introduced a javadoc warning
 

 Key: HADOOP-2511
 URL: https://issues.apache.org/jira/browse/HADOOP-2511
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.16.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-2511_1_20080103.patch


 {noformat}
   [javadoc] 
 /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/util/Shell.java:70:
  warning - @param argument Interval is not a parameter name.
 {noformat}
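 
 The shape of the fix, sketched (the real parameter name is an assumption based 
 on the warning):
 {code}
 // Before: the @param tag names something that is not a parameter.
 //   @param Interval the time to wait between polls
 // After: use the actual parameter name (assumed to be "interval" here).
 //   @param interval the time to wait between polls
 {code}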

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2511) HADOOP-2344 introduced a javadoc warning

2008-01-02 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2511:
--

Status: Patch Available  (was: Open)

 HADOOP-2344 introduced a javadoc warning
 

 Key: HADOOP-2511
 URL: https://issues.apache.org/jira/browse/HADOOP-2511
 Project: Hadoop
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.16.0
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.16.0

 Attachments: HADOOP-2511_1_20080103.patch


 {noformat}
   [javadoc] 
 /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/util/Shell.java:70:
  warning - @param argument Interval is not a parameter name.
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-2344) Free up the buffers (input and error) while executing a shell command before waiting for it to finish.

2008-01-02 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12555424#action_12555424
 ] 

Arun C Murthy commented on HADOOP-2344:
---

Ugh, the long story is that a similar patch didn't generate a javadoc 
warning... sad excuse. My bad.

I filed/fixed HADOOP-2511 to fix the javadoc warning. 

 Free up the buffers (input and error) while executing a shell command before 
 waiting for it to finish.
 --

 Key: HADOOP-2344
 URL: https://issues.apache.org/jira/browse/HADOOP-2344
 Project: Hadoop
  Issue Type: Bug
Affects Versions: 0.16.0
Reporter: Amar Kamat
Assignee: Amar Kamat
 Fix For: 0.16.0

 Attachments: HADOOP-2231.patch, HADOOP-2344.patch, HADOOP-2344.patch, 
 HADOOP-2344.patch, HADOOP-2344.patch, HADOOP-2344.patch, HADOOP-2344.patch


 Process.waitFor() should be invoked after freeing up the input and error 
 stream.  While fixing https://issues.apache.org/jira/browse/HADOOP-2231 we 
 found that this might be a possible cause.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

2007-12-29 Thread Arun C Murthy (JIRA)
Implement utility-tools for working with SequenceFiles
--

 Key: HADOOP-2501
 URL: https://issues.apache.org/jira/browse/HADOOP-2501
 Project: Hadoop
  Issue Type: New Feature
  Components: io
Reporter: Arun C Murthy


It would be nice to implement a bunch of utilities to work with SequenceFiles:

 * info (print-out header information such as key/value types, compression 
type/codec etc.)
 * cat
 * head/tail
 * merge multiple seq-files into one
 * ...

I'd imagine this would look like:
{noformat}
$ bin/hadoop seq -info /user/joe/blah.seq
$ bin/hadoop seq -head -n 10 /user/joe/blah.seq
{noformat}
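
As a sketch of what the -info sub-command could print using the existing 
public API (SequenceFile.Reader already exposes the header fields; the tool 
framing itself is hypothetical):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Sketch of "seq -info": dump header information from a SequenceFile.
public class SeqInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader in = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      System.out.println("key class        : " + in.getKeyClassName());
      System.out.println("value class      : " + in.getValueClassName());
      System.out.println("compressed       : " + in.isCompressed());
      System.out.println("block-compressed : " + in.isBlockCompressed());
    } finally {
      in.close();
    }
  }
}
{code}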


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-2344) Free up the buffers (input and error) while executing a shell command before waiting for it to finish.

2007-12-29 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2344:
--

Status: Patch Available  (was: Open)

 Free up the buffers (input and error) while executing a shell command before 
 waiting for it to finish.
 --

 Key: HADOOP-2344
 URL: https://issues.apache.org/jira/browse/HADOOP-2344
 Project: Hadoop
  Issue Type: Bug
Affects Versions: 0.16.0
Reporter: Amar Kamat
Assignee: Amar Kamat
 Fix For: 0.16.0

 Attachments: HADOOP-2231.patch, HADOOP-2344.patch, HADOOP-2344.patch, 
 HADOOP-2344.patch, HADOOP-2344.patch, HADOOP-2344.patch, HADOOP-2344.patch


 Process.waitFor() should be invoked after freeing up the input and error 
 stream.  While fixing https://issues.apache.org/jira/browse/HADOOP-2231 we 
 found that this might be a possible cause.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


