[jira] Commented: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce

2009-08-18 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744414#action_12744414
 ] 

Hemanth Yamijala commented on MAPREDUCE-711:


bq. Can you please run tests on Hudson (Giridharan could help with it I 
suppose) and commit the changes to HDFS when the tests pass.

I have already run the tests with the updated jars locally. There does not 
appear to be a way to run these off Hudson. So, we are planning to commit the 
jars and then trigger a Hudson HDFS build to make sure things still work. If 
something breaks, we will revert the commit and check again. (But given they 
pass locally, I am hoping we won't get to that.)

Also, the MapReduce build failure in the tests is being tracked in 
MAPREDUCE-880 and is unrelated to this commit.

Giri, can you please commit the common and Map/Reduce jars to HDFS and trigger 
a build?

 Move Distributed Cache from Common to Map/Reduce
 

 Key: MAPREDUCE-711
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-711
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Vinod K V
 Attachments: MAPREDUCE-711-20090709-common.txt, 
 MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, 
 MAPREDUCE-711-20090710.txt


 Distributed Cache logically belongs as part of map/reduce and not Common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.

2009-08-18 Thread Jothi Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744415#action_12744415
 ] 

Jothi Padmanabhan commented on MAPREDUCE-157:
-

Regarding the interface for readers, we could support two kinds of users:

# Users who want fine-grained control and would handle the individual events 
themselves.
# Users who want coarser-grained, summary-style information.

Users of type 1, who want finer-grained information, could use Event 
Readers to iterate through events and do the necessary processing.

For users of type 2, we could provide summary information through a 
JobHistoryParser class. This class would internally build the Job-Task-Attempt 
hierarchy by consuming all events using an event reader, and make the 
summary information available for users to access. Users could do something 
like

{code}
parser.init(historyFileOrStream);

JobInfo jobInfo = parser.getJobInfo();

// Use the getters to get job info (example: start time, finish time, counters,
// id, user name, conf, total maps, total reduces, among others)

List<TaskInfo> taskInfoList = jobInfo.getAllTasks();

// Iterate through the list and do the necessary processing. Getters for
// TaskInfo would include task id, task type, status, splits, counters, etc.

List<TaskAttemptInfo> attemptsList = taskInfo.getAllAttempts();

// TaskAttemptInfo would have getters for attempt id, errors, status, state,
// start time, finish time, tracker name, port, etc.
{code}


Comments/Suggestions/Thoughts?

 Job History log file format is not friendly for external tools.
 ---

 Key: MAPREDUCE-157
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-157
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Jothi Padmanabhan

 Currently, parsing the job history logs with external tools is very difficult 
 because of the format. The most critical problem is that newlines aren't 
 escaped in the strings. That makes using tools like grep, sed, and awk very 
 tricky.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-773) LineRecordReader can report non-zero progress while it is processing a compressed stream

2009-08-18 Thread Devaraj Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated MAPREDUCE-773:
--

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

I just committed this.

 LineRecordReader can report non-zero progress while it is processing a 
 compressed stream
 

 Key: MAPREDUCE-773
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-773
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 0.21.0

 Attachments: 773.2.patch, 773.3.patch, 773.patch, 773.patch


 Currently, the LineRecordReader returns 0.0 from getProgress() for most 
 inputs (since the end of the filesplit is set to Long.MAX_VALUE for 
 compressed inputs). This can be improved to return a non-zero progress even 
 for compressed streams (though it may not be very reflective of the actual 
 progress).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-862) Modify UI to support a hierarchy of queues

2009-08-18 Thread Sreekanth Ramakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreekanth Ramakrishnan updated MAPREDUCE-862:
-

Attachment: initialscreen.png
detailspage.png
clustersummarymodification.png

Attaching screenshots of how the UI would look for the modified queue design.

The cluster summary would be modified to introduce a new column showing the 
number of queues, linked to the modified queue details page, which is 
described in initialscreen.png.

From initialscreen.png we can click through the queue hierarchy, which would 
have two kinds of pages: for {{ContainerQueues}} we would not have a job list, 
and for {{JobQueue}} we have a job list in addition to scheduling information.

 Modify UI to support a hierarchy of queues
 --

 Key: MAPREDUCE-862
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-862
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Hemanth Yamijala
 Attachments: clustersummarymodification.png, detailspage.png, 
 initialscreen.png, subqueue.png


 MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
 framework. This JIRA is for defining changes to the UI related to queues. 
 This includes the hadoop queue CLI and the web UI on the JobTracker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-862) Modify UI to support a hierarchy of queues

2009-08-18 Thread Sreekanth Ramakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreekanth Ramakrishnan updated MAPREDUCE-862:
-

Attachment: subqueue.png

 Modify UI to support a hierarchy of queues
 --

 Key: MAPREDUCE-862
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-862
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Hemanth Yamijala
 Attachments: clustersummarymodification.png, detailspage.png, 
 initialscreen.png, subqueue.png


 MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
 framework. This JIRA is for defining changes to the UI related to queues. 
 This includes the hadoop queue CLI and the web UI on the JobTracker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-430) Task stuck in cleanup with OutOfMemoryErrors

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1271#action_1271
 ] 

Hadoop QA commented on MAPREDUCE-430:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12416767/MAPREDUCE-430-v1.7.patch
  against trunk revision 805081.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/console

This message is automatically generated.

 Task stuck in cleanup with OutOfMemoryErrors
 

 Key: MAPREDUCE-430
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-430
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Amareshwari Sriramadasu
Assignee: Amar Kamat
 Fix For: 0.20.1

 Attachments: MAPREDUCE-430-v1.6-branch-0.20.patch, 
 MAPREDUCE-430-v1.6.patch, MAPREDUCE-430-v1.7.patch


 Observed a task with an OutOfMemory error, stuck in cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1279#action_1279
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-849:
---

Configuration properties in the Mapreduce project can be categorized as 
follows, with a suggested name for each category.
||Category|| Suggested Name||
|Cluster config | mapreduce.* |
|JobTracker config | mapreduce.jobtracker.* |
|TaskTracker config | mapreduce.tasktracker.* |
|Job-level config | mapreduce.job.* |
|Task-level config | mapreduce.task.* |
|Map task config | mapreduce.map.* |
|Reduce task config | mapreduce.reduce.* |
|Job client config | mapreduce.jobclient.* |
|Pipes config | mapreduce.pipes.* |
|Lib config | mapreduce.libname.* |
|Example config | mapreduce.example-name.* |
|Test config | mapreduce.test.* |
|Streaming config | mapreduce.streaming.* or streaming.*|
|Contrib project config | mapreduce.contrib-project.* or contrib-project.* |

Thoughts?

 Renaming of configuration property names in mapreduce
 -

 Key: MAPREDUCE-849
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-849
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Amareshwari Sriramadasu
Assignee: Amareshwari Sriramadasu
 Fix For: 0.21.0


 In-line with HDFS-531, property names in configuration files should be 
 standardized in MAPREDUCE. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.

2009-08-18 Thread rahul k singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744452#action_12744452
 ] 

rahul k singh commented on MAPREDUCE-861:
-

As mentioned above, we had an internal agreement that we would be going ahead 
with xml based configuration for hierarchical queues.

In terms of how the configuration would be structured for hierarchical queues, 
we had 2 options in mind.

Option 1:
--
mapred-queues.xml would contain the queue hierarchy.

A typical hierarchical queue configuration would look like:
{code:xml}
<queue>
  <name>q1</name>
  <queue>
    <name>q1q1</name>
    <administrators>u1,u2,u3</administrators>
    <submitters>u1,u2</submitters>
    <state>stop/running</state>
    <schedulingContext>
      <capacity></capacity>
      <maxCapacity></maxCapacity>
    </schedulingContext>
  </queue>
</queue>
{code}

The configuration above defines a queue q1 and a single child q1q1.

The schedulingContext tag would act as a black-box section for the 
mapred based parsers.
The xsd definition of schedulingContext would be:
{code:xml}
<xs:element name="schedulingContext" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:any/>
  </xs:complexType>
</xs:element>
{code}

By defining schedulingContext as xs:any, we can extend this section of the 
configuration to add any kind of tags to the schedulingContext.

Advantages:
1. This approach allows us to have a single configuration file.
2. It is generic: it allows users to declare scheduler properties the way they 
want.

Disadvantages:
1. This would result in having the parsing logic in different places: 
framework-level changes are parsed in the framework, and scheduler-specific 
parsing would be done in the scheduler.
2. It is more cumbersome to implement.
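To make the option 1 parsing discussion concrete, here is a minimal, 
self-contained sketch (plain JDK DOM only, no Hadoop classes; the class and 
method names are made up for illustration, not part of the proposal) of how a 
framework-level parser could walk the hierarchical queue XML while treating 
{{schedulingContext}} as an opaque black box:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class QueueConfigSketch {

  // Recursively collect fully-qualified queue names (parent:child),
  // ignoring the scheduler-opaque schedulingContext section entirely.
  static void collectQueues(Element queue, String parent, List<String> out) {
    String name = null;
    NodeList children = queue.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n.getNodeType() == Node.ELEMENT_NODE
          && n.getNodeName().equals("name")) {
        name = n.getTextContent().trim();
      }
    }
    String qualified = (parent == null) ? name : parent + ":" + name;
    out.add(qualified);
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n.getNodeType() == Node.ELEMENT_NODE
          && n.getNodeName().equals("queue")) {
        collectQueues((Element) n, qualified, out);
      }
    }
  }

  public static List<String> parse(String xml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    List<String> out = new ArrayList<>();
    collectQueues(doc.getDocumentElement(), null, out);
    return out;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<queue><name>q1</name>"
        + "<queue><name>q1q1</name>"
        + "<schedulingContext><capacity>50</capacity></schedulingContext>"
        + "</queue></queue>";
    System.out.println(parse(xml));  // [q1, q1:q1q1]
  }
}
```

A scheduler could then be handed the raw schedulingContext element and apply 
its own parsing, which is exactly where the split-parsing disadvantage shows 
up.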

Option 2:
-
Same as option 1, except that the definition of schedulingContext would 
change. It would have child tags key and value, which would define the 
key-value mappings of the various properties required by schedulers.

For example:

For example:
{code:xml}
<queue>
  <name>q1</name>
  <queue>
    <name>q1q1</name>
    <administrators>u1,u2,u3</administrators>
    <submitters>u1,u2</submitters>
    <state>stop/running</state>
    <schedulingContext>
      <key>capacity</key>
      <value></value>
      <key>maxCapacity</key>
      <value></value>
    </schedulingContext>
  </queue>
</queue>
{code}

The new xsd for schedulingContext would look like:
{code:xml}
<xs:element name="schedulingContext" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="key" minOccurs="0" maxOccurs="unbounded">
      </xs:element>
      <xs:element name="value" minOccurs="0" maxOccurs="unbounded">
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
{code}

Advantages:
1. Allows us to have a single configuration file.
2. Provides a consistent way to specify scheduling properties.
3. Easier to implement, and the parsing logic now resides in one common place.

Disadvantages:
1. Doesn't allow nested settings for scheduler properties.
2. Assumes that scheduler properties would always be in key-value format.
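Under option 2, the common parsing logic reduces to pairing up successive key 
and value children. A minimal sketch (plain JDK DOM; the class name is 
hypothetical, not proposed API) of that single shared parser:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SchedulingContextSketch {

  // Pair up successive <key>/<value> children into a property map that any
  // scheduler can consume without doing its own XML parsing.
  public static Map<String, String> parseContext(String xml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    Map<String, String> props = new LinkedHashMap<>();
    NodeList children = doc.getDocumentElement().getChildNodes();
    String key = null;
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n.getNodeType() != Node.ELEMENT_NODE) {
        continue;
      }
      if (n.getNodeName().equals("key")) {
        key = n.getTextContent().trim();
      } else if (n.getNodeName().equals("value") && key != null) {
        props.put(key, n.getTextContent().trim());
        key = null;
      }
    }
    return props;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<schedulingContext>"
        + "<key>capacity</key><value>50</value>"
        + "<key>maxCapacity</key><value>100</value>"
        + "</schedulingContext>";
    System.out.println(parseContext(xml));  // {capacity=50, maxCapacity=100}
  }
}
```

The flat map is what drives disadvantage 1: anything nested inside a value 
would have to be flattened into a string first.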

 Modify queue configuration format and parsing to support a hierarchy of 
 queues.
 ---

 Key: MAPREDUCE-861
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-861
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Hemanth Yamijala
Assignee: rahul k singh

 MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
 framework. This JIRA is for defining changes to the configuration related to 
 queues. 
 The current format for defining a queue and its properties is as follows: 
 mapred.queue.queue-name.property-name. For e.g. 
 mapred.queue.queue-name.acl-submit-job. The reason for using this verbose 
 format was to be able to reuse the Configuration parser in Hadoop. However, 
 administrators currently using the queue configuration have already indicated 
 a very strong desire for a more manageable format. Since, this becomes more 
 unwieldy with hierarchical queues, the time may be good to introduce a new 
 format for representing queue configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce

2009-08-18 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744459#action_12744459
 ] 

Vinod K V commented on MAPREDUCE-849:
-

These names look a lot cleaner. +1 for the overall direction. But, we should 
also think of ways to continue doing this going forward even after this issue 
gets committed.

While doing this, if we can create the corresponding java.lang.String property 
names, a la HADOOP-3583, and use them everywhere, it will be really good. For 
e.g.,
{code}
static final String MAPREDUCE_CLUSTER_EXAMPLE_CONFIG_PROPERTY = 
    "mapreduce.cluster.example.config";
{code}

Also, I think usage of strings like _mapreduce.map.max.attempts_ and 
_mapreduce.jobtracker.maxtasks.per.job_ should be discouraged in favour of 
_mapreduce.map.max-attempts_ and _mapreduce.jobtracker.maxtasks-per-job_ 
respectively. Thoughts about this?

I am assuming that configuration related to sub-components should start with a 
prefix of the parent component. For e.g., _mapred.healthChecker.script.args_ 
will be _mapreduce.tasktracker.healthChecker.script-args_ . Right?

 Renaming of configuration property names in mapreduce
 -

 Key: MAPREDUCE-849
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-849
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Amareshwari Sriramadasu
Assignee: Amareshwari Sriramadasu
 Fix For: 0.21.0


 In-line with HDFS-531, property names in configuration files should be 
 standardized in MAPREDUCE. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744462#action_12744462
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-849:
---

bq. I am assuming that configuration related to sub-components should start 
with a prefix of the parent component. For e.g., 
mapred.healthChecker.script.args will be 
mapreduce.tasktracker.healthChecker.script-args . Right?
Yes. I will post a document which contains complete change-list of old name to 
new name. 

 Renaming of configuration property names in mapreduce
 -

 Key: MAPREDUCE-849
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-849
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Amareshwari Sriramadasu
Assignee: Amareshwari Sriramadasu
 Fix For: 0.21.0


 In-line with HDFS-531, property names in configuration files should be 
 standardized in MAPREDUCE. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-181) mapred.system.dir should be accessible only to hadoop daemons

2009-08-18 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744467#action_12744467
 ] 

Devaraj Das commented on MAPREDUCE-181:
---

I wonder whether it makes sense to have the jobclient write two files in place 
of a single split file:

1) the splits info (the actual bytes), written to a secure location on the 
hdfs (with permissions 700)
2) the split metadata, which is a set of entries like 
{map-id : location_1, location_2, .., location_n, start-offset-in-split-file, 
length} for each map-id. This is serialized over RPC, and the JobTracker 
writes it to the well known mapred-system-directory (which the JobTracker owns 
with perms 700).

The JobTracker just reads/loads the metadata, and creates the TIP cache.

The TaskTracker is handed a split object that looks something like 
{start-offset-in-split-file, length}. As part of task localization, the TT 
copies the specific bytes from the split file (securely) and launches the task, 
which then reads the split; or the TT could simply stream it over RPC to the 
child. The replication factor could be set to a high number for the splits 
info file.

Doing it in this way should reduce the size of the split file information 
considerably (and we can have a cap on the metadata size as well), and also 
provide security for the user generated split files' content.
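As an illustration only (the class and field names below are made up, not what 
was committed for this issue), a per-map metadata entry of the shape described 
above could be serialized with a fixed, compact layout like this:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical per-map split metadata entry: the split locations plus the
// {start-offset-in-split-file, length} pointer into the secure splits file.
public class SplitMetaInfoSketch {
  String[] locations;
  long startOffset;
  long length;

  SplitMetaInfoSketch(String[] locations, long startOffset, long length) {
    this.locations = locations;
    this.startOffset = startOffset;
    this.length = length;
  }

  // Write the entry: location count, locations, then the file pointer.
  void write(DataOutputStream out) throws IOException {
    out.writeInt(locations.length);
    for (String loc : locations) {
      out.writeUTF(loc);
    }
    out.writeLong(startOffset);
    out.writeLong(length);
  }

  // Read an entry back in the same order it was written.
  static SplitMetaInfoSketch read(DataInputStream in) throws IOException {
    String[] locs = new String[in.readInt()];
    for (int i = 0; i < locs.length; i++) {
      locs[i] = in.readUTF();
    }
    return new SplitMetaInfoSketch(locs, in.readLong(), in.readLong());
  }

  public static void main(String[] args) throws IOException {
    SplitMetaInfoSketch meta =
        new SplitMetaInfoSketch(new String[] {"host1", "host2"}, 1024L, 512L);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    meta.write(new DataOutputStream(buf));
    SplitMetaInfoSketch back =
        read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    System.out.println(back.locations.length + " " + back.startOffset + " "
        + back.length);  // 2 1024 512
  }
}
```

Since the entry carries only host names and a (offset, length) pair rather 
than the split bytes themselves, the metadata stays small regardless of how 
large the user-generated split payload is.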

For the JobConf, passing the basic and the minimum info to the JobTracker as 
Hong suggested on MAPREDUCE-841 seems to make sense. For all other conf 
properties, the Task can load them directly from the HDFS. The max size (in 
terms of #bytes) of the basic information could be easily derived and we could 
have a cap on that for the RPC communication.

Thoughts?

 mapred.system.dir should be accessible only to hadoop daemons 
 --

 Key: MAPREDUCE-181
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-181
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Amar Kamat
Assignee: Amar Kamat
 Attachments: hadoop-3578-branch-20-example-2.patch, 
 hadoop-3578-branch-20-example.patch, HADOOP-3578-v2.6.patch, 
 HADOOP-3578-v2.7.patch


 Currently the jobclient accesses the {{mapred.system.dir}} to add job 
 details. Hence the {{mapred.system.dir}} has the permissions of 
 {{rwx-wx-wx}}. This could be a security loophole where the job files might 
 get overwritten/tampered after the job submission. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce

2009-08-18 Thread Hemanth Yamijala (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Yamijala resolved MAPREDUCE-711.


   Resolution: Fixed
Fix Version/s: 0.21.0
 Release Note: 
- Removed distributed cache classes and package from the Common project. 
- Added the same to the mapreduce project. 
- This will mean that users using Distributed Cache will now necessarily need 
the mapreduce jar in Hadoop 0.21.
- Modified the package name to o.a.h.mapreduce.filecache from o.a.h.filecache 
and deprecated the old package name.
 Hadoop Flags: [Incompatible change, Reviewed]

HDFS tests have also passed. Now, all the projects are sync'ed up.

I committed this to trunk. Thanks, Vinod !

 Move Distributed Cache from Common to Map/Reduce
 

 Key: MAPREDUCE-711
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-711
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Vinod K V
 Fix For: 0.21.0

 Attachments: MAPREDUCE-711-20090709-common.txt, 
 MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, 
 MAPREDUCE-711-20090710.txt


 Distributed Cache logically belongs as part of map/reduce and not Common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.

2009-08-18 Thread rahul k singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744505#action_12744505
 ] 

rahul k singh commented on MAPREDUCE-861:
-

There is a small error in the xsd mentioned above for option 2:
{code:xml}
<xs:element name="schedulingContext" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="key" minOccurs="1" maxOccurs="1">
      </xs:element>
      <xs:element name="value" minOccurs="1" maxOccurs="1">
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
{code}


 Modify queue configuration format and parsing to support a hierarchy of 
 queues.
 ---

 Key: MAPREDUCE-861
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-861
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Hemanth Yamijala
Assignee: rahul k singh

 MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
 framework. This JIRA is for defining changes to the configuration related to 
 queues. 
 The current format for defining a queue and its properties is as follows: 
 mapred.queue.queue-name.property-name. For e.g. 
 mapred.queue.queue-name.acl-submit-job. The reason for using this verbose 
 format was to be able to reuse the Configuration parser in Hadoop. However, 
 administrators currently using the queue configuration have already indicated 
 a very strong desire for a more manageable format. Since, this becomes more 
 unwieldy with hierarchical queues, the time may be good to introduce a new 
 format for representing queue configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744516#action_12744516
 ] 

Hadoop QA commented on MAPREDUCE-476:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12416836/MAPREDUCE-476-20090818.txt
  against trunk revision 805324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 14 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/console

This message is automatically generated.

 extend DistributedCache to work locally (LocalJobRunner)
 

 Key: MAPREDUCE-476
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: sam rash
Assignee: Philip Zeyliger
Priority: Minor
 Attachments: HADOOP-2914-v1-full.patch, 
 HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, 
 MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, 
 MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, 
 MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, 
 MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, 
 MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476.patch


 The DistributedCache does not work locally when using the outlined recipe at 
 http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html
  
 Ideally, LocalJobRunner would take care of populating the JobConf and copying 
 remote files to the local file system (http, assume hdfs = default fs = local 
 fs when doing local development).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-883) harchive: Document how to unarchive

2009-08-18 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-883:
---

Attachment: mapreduce-883-0.patch

Simple doc suggesting to use cp/distcp for unarchiving.

 harchive: Document how to unarchive
 ---

 Key: MAPREDUCE-883
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-883
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: documentation, harchive
Reporter: Koji Noguchi
Priority: Minor
 Attachments: mapreduce-883-0.patch


 I was thinking of implementing harchive's 'unarchive' feature, but realized 
 it has been implemented already ever since harchive was introduced.
 It just needs to be documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-885:


Status: Patch Available  (was: Open)

 More efficient SQL queries for DBInputFormat
 

 Key: MAPREDUCE-885
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-885.patch


 DBInputFormat generates InputSplits by counting the available rows in a 
 table, and selecting subsections of the table via the LIMIT and OFFSET 
 SQL keywords. These are only meaningful in an ordered context, so the query 
 also includes an ORDER BY clause on an index column. The resulting queries 
 are often inefficient and require full table scans. Actually using multiple 
 mappers with these queries can lead to O(n^2) behavior in the database, where 
 n is the number of splits. Attempting to use parallelism with these queries 
 is counter-productive.
 A better mechanism is to organize splits based on data values themselves, 
 which can be performed in the WHERE clause, allowing for index range scans of 
 tables, and can better exploit parallelism in the database.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)
More efficient SQL queries for DBInputFormat


 Key: MAPREDUCE-885
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-885.patch

DBInputFormat generates InputSplits by counting the available rows in a table, 
and selecting subsections of the table via the LIMIT and OFFSET SQL 
keywords. These are only meaningful in an ordered context, so the query also 
includes an ORDER BY clause on an index column. The resulting queries are 
often inefficient and require full table scans. Actually using multiple mappers 
with these queries can lead to O(n^2) behavior in the database, where n is the 
number of splits. Attempting to use parallelism with these queries is 
counter-productive.

A better mechanism is to organize splits based on data values themselves, which 
can be performed in the WHERE clause, allowing for index range scans of tables, 
and can better exploit parallelism in the database.
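To sketch the difference (illustrative only; the class below is not the 
attached patch, and the table/column names are invented), splits can be formed 
as non-overlapping value ranges on the split column, each becoming an 
independent WHERE-bounded query instead of an ORDER BY ... LIMIT/OFFSET scan:

```java
import java.util.ArrayList;
import java.util.List;

// Generate one range-scan query per split over a numeric split column.
// Each query can use an index range scan instead of a full ordered scan.
public class RangeSplitSketch {

  public static List<String> splitQueries(String table, String col,
                                          long min, long max, int numSplits) {
    List<String> queries = new ArrayList<>();
    // Split [min, max] into numSplits contiguous, non-overlapping ranges.
    long span = max - min + 1;
    long base = span / numSplits;
    long extra = span % numSplits;  // spread the remainder over early splits
    long lo = min;
    for (int i = 0; i < numSplits; i++) {
      long size = base + (i < extra ? 1 : 0);
      long hi = lo + size - 1;
      queries.add("SELECT * FROM " + table
          + " WHERE " + col + " >= " + lo
          + " AND " + col + " <= " + hi);
      lo = hi + 1;
    }
    return queries;
  }

  public static void main(String[] args) {
    for (String q : splitQueries("employees", "id", 1, 100, 4)) {
      System.out.println(q);
    }
    // SELECT * FROM employees WHERE id >= 1 AND id <= 25
    // SELECT * FROM employees WHERE id >= 26 AND id <= 50   ... and so on
  }
}
```

Since each mapper touches a disjoint range, the database does O(n) total work 
across n splits rather than re-sorting and skipping rows for every OFFSET.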

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744809#action_12744809
 ] 

Hadoop QA commented on MAPREDUCE-885:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12416936/MAPREDUCE-885.patch
  against trunk revision 805324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/490/console

This message is automatically generated.

 More efficient SQL queries for DBInputFormat
 

 Key: MAPREDUCE-885
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-885.patch


 DBInputFormat generates InputSplits by counting the available rows in a 
 table, and selecting subsections of the table via the LIMIT and OFFSET 
 SQL keywords. These are only meaningful in an ordered context, so the query 
 also includes an ORDER BY clause on an index column. The resulting queries 
 are often inefficient and require full table scans. Actually using multiple 
 mappers with these queries can lead to O(n^2) behavior in the database, where 
 n is the number of splits. Attempting to use parallelism with these queries 
 is counter-productive.
 A better mechanism is to organize splits based on data values themselves, 
 which can be performed in the WHERE clause, allowing for index range scans of 
 tables, and can better exploit parallelism in the database.




[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744811#action_12744811
 ] 

Aaron Kimball commented on MAPREDUCE-885:
-

I think this patch won't apply until MAPREDUCE-875 is in.

 More efficient SQL queries for DBInputFormat
 

 Key: MAPREDUCE-885
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-885.patch


 DBInputFormat generates InputSplits by counting the available rows in a 
 table, and selecting subsections of the table via the LIMIT and OFFSET 
 SQL keywords. These are only meaningful in an ordered context, so the query 
 also includes an ORDER BY clause on an index column. The resulting queries 
 are often inefficient and require full table scans. Actually using multiple 
 mappers with these queries can lead to O(n^2) behavior in the database, where 
 n is the number of splits. Attempting to use parallelism with these queries 
 is counter-productive.
 A better mechanism is to organize splits based on data values themselves, 
 which can be performed in the WHERE clause, allowing for index range scans of 
 tables, and can better exploit parallelism in the database.




[jira] Commented: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744816#action_12744816
 ] 

Aaron Kimball commented on MAPREDUCE-875:
-

The failing Sqoop tests claim they fail because they can't find Avro. Not sure 
why this is happening -- Sqoop doesn't make use of Avro anywhere. Recycling 
patch status in case the failure was transient. If not, do I have to put some 
more libraries in ivy.xml? 

According to {{git-blame}}, this was added to the root ivy.xml earlier that day:

{code}
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 276)     <dependency org="org.apache.hadoop"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 277)       name="avro"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 278)       rev="1.0.0"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 279)       conf="common->default"/>
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 280)     <dependency org="org.codehaus.jackso
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 281)       name="jackson-mapper-asl"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 282)       rev="1.0.1"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 283)       conf="common->default"/>
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 284)     <dependency org="com.thoughtworks.pa
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 285)       name="paranamer"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 286)       rev="1.5"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 +0000 287)       conf="common->default"/>
{code}

 Make DBRecordReader execute queries lazily
 --

 Key: MAPREDUCE-875
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-875.patch


 DBInputFormat's DBRecordReader executes the user's SQL query in the 
 constructor. If the query is long-running, this can cause task timeout. The 
 user is unable to spawn a background thread (e.g., in a MapRunnable) to 
 inform Hadoop of on-going progress. 
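A minimal sketch of the lazy pattern described above, with illustrative names only (not the real DBRecordReader API): the query is executed on the first nextKeyValue() call rather than in the constructor, so construction returns immediately and the framework can see progress before the long-running query starts.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a lazily-executing record reader. The Supplier
// stands in for executing the SQL query and materializing its ResultSet.
public class LazyRecordReader<T> {
    private final Supplier<List<T>> query; // deferred query execution
    private Iterator<T> results;           // null until the query has run
    private T current;

    public LazyRecordReader(Supplier<List<T>> query) {
        this.query = query;                // constructor does NOT run the query
    }

    public boolean nextKeyValue() {
        if (results == null) {
            results = query.get().iterator(); // query runs here, on first use
        }
        if (!results.hasNext()) {
            return false;
        }
        current = results.next();
        return true;
    }

    public T getCurrentValue() {
        return current;
    }

    public static void main(String[] args) {
        LazyRecordReader<Integer> r =
            new LazyRecordReader<>(() -> Arrays.asList(1, 2, 3));
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentValue());
        }
    }
}
```

With this shape, anything the framework does between construction and the first record request (heartbeats, progress reporting) happens before the query blocks.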




[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-875:


Status: Patch Available  (was: Open)

 Make DBRecordReader execute queries lazily
 --

 Key: MAPREDUCE-875
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-875.patch


 DBInputFormat's DBRecordReader executes the user's SQL query in the 
 constructor. If the query is long-running, this can cause task timeout. The 
 user is unable to spawn a background thread (e.g., in a MapRunnable) to 
 inform Hadoop of on-going progress. 




[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-875:


Attachment: MAPREDUCE-875.2.patch

Attaching a new patch after resyncing with trunk. Just realized that Avro was 
already added to Sqoop's ivy.xml.

 Make DBRecordReader execute queries lazily
 --

 Key: MAPREDUCE-875
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-875.2.patch, MAPREDUCE-875.patch


 DBInputFormat's DBRecordReader executes the user's SQL query in the 
 constructor. If the query is long-running, this can cause task timeout. The 
 user is unable to spawn a background thread (e.g., in a MapRunnable) to 
 inform Hadoop of on-going progress. 




[jira] Updated: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job

2009-08-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-336:


Attachment: MAPREDUCE-336_0_20090818.patch

Straightforward fix.

 The logging level of the tasks should be configurable by the job
 

 Key: MAPREDUCE-336
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Arun C Murthy
 Fix For: 0.21.0

 Attachments: MAPREDUCE-336_0_20090818.patch


 It would be nice to be able to configure the logging level of the Task JVM's 
 separately from the server JVM's. Reducing logging substantially increases 
 performance and reduces the consumption of local disk on the task trackers.




[jira] Updated: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job

2009-08-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-336:


Fix Version/s: 0.21.0
   Status: Patch Available  (was: Open)

 The logging level of the tasks should be configurable by the job
 

 Key: MAPREDUCE-336
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Arun C Murthy
 Fix For: 0.21.0

 Attachments: MAPREDUCE-336_0_20090818.patch


 It would be nice to be able to configure the logging level of the Task JVM's 
 separately from the server JVM's. Reducing logging substantially increases 
 performance and reduces the consumption of local disk on the task trackers.




[jira] Commented: (MAPREDUCE-880) TestRecoveryManager times out

2009-08-18 Thread Amar Kamat (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744883#action_12744883
 ] 

Amar Kamat commented on MAPREDUCE-880:
--

Looked into this. The problem appears to be the case where the jobtracker is 
dead while the tasktrackers still have tasks running. In that case 
MiniMRCluster.shutdown() waits forever for the tasks to finish (i.e., for the 
trackers to become idle). Earlier the tasks somehow were not getting scheduled, 
so this used to work fine. Continuing with the debugging. 

 TestRecoveryManager times out
 -

 Key: MAPREDUCE-880
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-880
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Reporter: Amar Kamat


