[jira] Commented: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744414#action_12744414 ] Hemanth Yamijala commented on MAPREDUCE-711: bq. Can you please run tests on Hudson (Giridharan could help with it I suppose) and commit the changes to HDFS when the tests pass. I have already run the tests with the updated jars locally. There does not appear to be a way to run these off Hudson. So, we are planning to commit the jars and then trigger a Hudson HDFS build to make sure things still work. If something breaks, we will revert the commit and check again. (But given they pass locally, I am hoping we won't get to that.) Also, the MapReduce build failure in the tests is being tracked in MAPREDUCE-880 and is unrelated to this commit. Giri, can you please commit the common and Map/Reduce jars to HDFS and trigger a build? Move Distributed Cache from Common to Map/Reduce Key: MAPREDUCE-711 URL: https://issues.apache.org/jira/browse/MAPREDUCE-711 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Owen O'Malley Assignee: Vinod K V Attachments: MAPREDUCE-711-20090709-common.txt, MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, MAPREDUCE-711-20090710.txt Distributed Cache logically belongs as part of map/reduce and not Common. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.
[ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744415#action_12744415 ] Jothi Padmanabhan commented on MAPREDUCE-157: - Regarding the interface for readers, we could support two kinds of users:
# Users who want fine-grained control and would handle the individual events themselves.
# Users who want coarser-grained, summary-style information.
For users of type 1, who want finer-grained information, they could use Event Readers to iterate through events and do the necessary processing. For users of type 2, we could provide summary information through a JobHistoryParser class. This class would internally build the Job-Task-Attempt hierarchy/information by consuming all events using an event reader, and make the summary information available for users to access. Users could do something like
{code}
parser.init(history file or stream);
JobInfo jobInfo = parser.getJobInfo();
// use the getters to get job info (example: start time, finish time,
// counters, id, user name, conf, total maps, total reduces, among others)
List<TaskInfo> taskInfoList = jobInfo.getAllTasks();
// Iterate through the list and do the necessary processing. Getters for
// TaskInfo would include task id, task type, status, splits, counters, etc.
List<TaskAttemptInfo> attemptsList = taskInfo.getAllAttempts();
// TaskAttemptInfo would have getters for attempt id, errors, status, state,
// start time, finish time, tracker name, port, etc.
{code}
Comments/Suggestions/Thoughts? Job History log file format is not friendly for external tools. --- Key: MAPREDUCE-157 URL: https://issues.apache.org/jira/browse/MAPREDUCE-157 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Jothi Padmanabhan Currently, parsing the job history logs with external tools is very difficult because of the format. The most critical problem is that newlines aren't escaped in the strings.
That makes using tools like grep, sed, and awk very tricky. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
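The "type 2" parser described above would fold a stream of individual events into per-task summary state. As a rough illustration of that idea only, here is a self-contained sketch; the `Event` record and method names are stand-ins, not the actual Hadoop history API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JobHistorySummarySketch {
    // Stand-in for a parsed history event (illustrative, not the real class).
    record Event(String taskId, String status) {}

    // Consume all events, the way the proposed JobHistoryParser would, and
    // keep the last reported status per task as the "summary" view.
    static Map<String, String> summarize(List<Event> events) {
        Map<String, String> lastStatus = new LinkedHashMap<>();
        for (Event e : events) {
            lastStatus.put(e.taskId(), e.status());
        }
        return lastStatus;
    }

    public static void main(String[] args) {
        List<Event> log = List.of(
            new Event("task_1", "RUNNING"),
            new Event("task_1", "SUCCEEDED"),
            new Event("task_2", "RUNNING"));
        System.out.println(summarize(log)); // last status per task id
    }
}
```

A real parser would of course build the full Job-Task-Attempt hierarchy rather than a flat map, but the consume-then-query shape is the same.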
[jira] Updated: (MAPREDUCE-773) LineRecordReader can report non-zero progress while it is processing a compressed stream
[ https://issues.apache.org/jira/browse/MAPREDUCE-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj Das updated MAPREDUCE-773: -- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) I just committed this. LineRecordReader can report non-zero progress while it is processing a compressed stream Key: MAPREDUCE-773 URL: https://issues.apache.org/jira/browse/MAPREDUCE-773 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Reporter: Devaraj Das Assignee: Devaraj Das Fix For: 0.21.0 Attachments: 773.2.patch, 773.3.patch, 773.patch, 773.patch Currently, the LineRecordReader returns 0.0 from getProgress() for most inputs (since the end of the filesplit is set to Long.MAX_VALUE for compressed inputs). This can be improved to return a non-zero progress even for compressed streams (though it may not be very reflective of the actual progress). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
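The idea behind the fix can be sketched as follows (names and the exact formula are illustrative, not the committed patch): when the split end is `Long.MAX_VALUE` (compressed input), approximate progress from the compressed bytes consumed relative to the compressed file length instead of the uncompressed position:

```java
public class CompressedProgressSketch {
    // start/end: split boundaries; pos: current stream position;
    // compressedFileLen: total length of the compressed input file.
    static float getProgress(long start, long pos, long end, long compressedFileLen) {
        if (end != Long.MAX_VALUE) {
            // Uncompressed split: the usual position-based progress.
            return Math.min(1.0f, (pos - start) / (float) (end - start));
        }
        // Compressed stream: approximate using compressed bytes read so far.
        return Math.min(1.0f, pos / (float) compressedFileLen);
    }

    public static void main(String[] args) {
        System.out.println(getProgress(0, 512, Long.MAX_VALUE, 1024)); // 0.5
    }
}
```

As the comment notes, this is only an approximation: compression ratios vary across a file, so the value is non-zero and monotonic but not exact.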
[jira] Updated: (MAPREDUCE-862) Modify UI to support a hierarchy of queues
[ https://issues.apache.org/jira/browse/MAPREDUCE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreekanth Ramakrishnan updated MAPREDUCE-862: - Attachment: initialscreen.png detailspage.png clustersummarymodification.png Attaching screenshots of how the UI would look for the modified queue design. The cluster summary would be modified to introduce a new column showing the number of queues, linked to the modified queue details page described in initialscreen.png. From initialscreen.png we can click through the queue hierarchy, which has two kinds of pages: for {{ContainerQueues}} there is no job list, while for {{JobQueue}} there is a job list in addition to the scheduling information. Modify UI to support a hierarchy of queues -- Key: MAPREDUCE-862 URL: https://issues.apache.org/jira/browse/MAPREDUCE-862 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Hemanth Yamijala Attachments: clustersummarymodification.png, detailspage.png, initialscreen.png, subqueue.png MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce framework. This JIRA is for defining changes to the UI related to queues. This includes the hadoop queue CLI and the web UI on the JobTracker. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-862) Modify UI to support a hierarchy of queues
[ https://issues.apache.org/jira/browse/MAPREDUCE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreekanth Ramakrishnan updated MAPREDUCE-862: - Attachment: subqueue.png Modify UI to support a hierarchy of queues -- Key: MAPREDUCE-862 URL: https://issues.apache.org/jira/browse/MAPREDUCE-862 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Hemanth Yamijala Attachments: clustersummarymodification.png, detailspage.png, initialscreen.png, subqueue.png -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-430) Task stuck in cleanup with OutOfMemoryErrors
[ https://issues.apache.org/jira/browse/MAPREDUCE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1271#action_1271 ] Hadoop QA commented on MAPREDUCE-430: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416767/MAPREDUCE-430-v1.7.patch against trunk revision 805081. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/console This message is automatically generated. Task stuck in cleanup with OutOfMemoryErrors Key: MAPREDUCE-430 URL: https://issues.apache.org/jira/browse/MAPREDUCE-430 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Amareshwari Sriramadasu Assignee: Amar Kamat Fix For: 0.20.1 Attachments: MAPREDUCE-430-v1.6-branch-0.20.patch, MAPREDUCE-430-v1.6.patch, MAPREDUCE-430-v1.7.patch Observed a task with an OutOfMemoryError, stuck in cleanup. -- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1279#action_1279 ] Amareshwari Sriramadasu commented on MAPREDUCE-849: --- Configuration properties in the Mapreduce project can be categorized as follows, with a suggested name prefix for each category.
||Category||Suggested Name||
|Cluster config | mapreduce.* |
|JobTracker config | mapreduce.jobtracker.* |
|TaskTracker config | mapreduce.tasktracker.* |
|Job-level config | mapreduce.job.* |
|Task-level config | mapreduce.task.* |
|Map task config | mapreduce.map.* |
|Reduce task config | mapreduce.reduce.* |
|Job client config | mapreduce.jobclient.* |
|Pipes config | mapreduce.pipes.* |
|Lib config | mapreduce.libname.* |
|Example config | mapreduce.example-name.* |
|Test config | mapreduce.test.* |
|Streaming config | mapreduce.streaming.* or streaming.* |
|Contrib project config | mapreduce.contrib-project.* or contrib-project.* |
Thoughts? Renaming of configuration property names in mapreduce - Key: MAPREDUCE-849 URL: https://issues.apache.org/jira/browse/MAPREDUCE-849 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Amareshwari Sriramadasu Assignee: Amareshwari Sriramadasu Fix For: 0.21.0 In-line with HDFS-531, property names in configuration files should be standardized in MAPREDUCE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
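A rename like this usually ships with a deprecation table so old keys keep working. The sketch below shows one plausible shape for that; the two mapping entries are illustrative examples consistent with the categories above, not the final change-list:

```java
import java.util.Map;

public class ConfigRenameSketch {
    // Hypothetical old-name -> new-name mapping; real entries would come
    // from the complete change-list document mentioned in this issue.
    static final Map<String, String> RENAMES = Map.of(
        "mapred.job.tracker", "mapreduce.jobtracker.address",
        "mapred.task.tracker.report.address", "mapreduce.tasktracker.report.address");

    // Resolve a possibly-deprecated key to its categorized name.
    static String resolve(String key) {
        return RENAMES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(resolve("mapred.job.tracker"));
    }
}
```

Unknown keys pass through unchanged, so configurations written against the new names are unaffected.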
[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.
[ https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744452#action_12744452 ] rahul k singh commented on MAPREDUCE-861: - As mentioned above, we had an internal agreement that we would go ahead with an XML-based configuration for hierarchical queues. In terms of how the configuration would be structured for hierarchical queues, we had 2 options in mind.
Option 1:
--
mapred-queues.xml would contain the queue hierarchy. A typical hierarchical queue configuration would look like:
{code:xml}
<queue>
  <name>q1</name>
  <queue>
    <name>q1q1</name>
    <administrators>u1,u2,u3</administrators>
    <submitters>u1,u2</submitters>
    <state>stop/running</state>
    <schedulingContext>
      <capacity></capacity>
      <maxCapacity></maxCapacity>
    </schedulingContext>
  </queue>
</queue>
{code}
The configuration above defines a queue q1 and a single child q1q1. The schedulingContext tag would act as a black-box section for the mapred-based parsers. The xsd definition of schedulingContext would be:
{code:xml}
<xs:element name="schedulingContext" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:any/>
  </xs:complexType>
</xs:element>
{code}
By defining schedulingContext as xs:any we can extend this section of the configuration to add any kind of tags to the schedulingContext.
Advantages:
1. This approach allows us to have a single configuration file.
2. It is generic enough that it allows users to declare scheduler properties the way they want.
Disadvantages:
1. This would result in having the parsing logic in different places: framework-level parsing is done in the framework, and scheduler-specific parsing in the scheduler.
2. More cumbersome to implement.
Option 2:
-
Same as option 1, except that the definition of schedulingContext would change. It would have child tags key and value which would define the key-value mappings of the various properties required by schedulers.
For example:
{code:xml}
<queue>
  <name>q1</name>
  <queue>
    <name>q1q1</name>
    <administrators>u1,u2,u3</administrators>
    <submitters>u1,u2</submitters>
    <state>stop/running</state>
    <schedulingContext>
      <key>capacity</key><value></value>
      <key>maxCapacity</key><value></value>
    </schedulingContext>
  </queue>
</queue>
{code}
The new xsd for schedulingContext would look like:
{code:xml}
<xs:element name="schedulingContext" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="key" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="value" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
{code}
Advantages:
1. Allows us to have a single configuration file.
2. Provides a consistent way to specify scheduling properties.
3. Easier to implement, and the parsing logic now resides in one common place.
Disadvantages:
1. Doesn't allow nested settings for scheduler properties.
2. Assumes that scheduler properties would always be in key-value format.
Modify queue configuration format and parsing to support a hierarchy of queues. --- Key: MAPREDUCE-861 URL: https://issues.apache.org/jira/browse/MAPREDUCE-861 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Hemanth Yamijala Assignee: rahul k singh MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce framework. This JIRA is for defining changes to the configuration related to queues. The current format for defining a queue and its properties is as follows: mapred.queue.queue-name.property-name. For e.g. mapred.queue.queue-name.acl-submit-job. The reason for using this verbose format was to be able to reuse the Configuration parser in Hadoop. However, administrators currently using the queue configuration have already indicated a very strong desire for a more manageable format. Since, this becomes more unwieldy with hierarchical queues, the time may be good to introduce a new format for representing queue configuration. -- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
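Option 2's flat key/value schedulingContext can be read with the stock JDK XML APIs, which is what puts the parsing logic "in one common place". The sketch below is illustrative only: the element names follow the proposal above, while the class and method names are made up for the demo:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SchedulingContextParseSketch {
    // Read the <key>/<value> children of a schedulingContext fragment
    // into a plain property map for the scheduler to consume.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList keys = doc.getElementsByTagName("key");
            NodeList values = doc.getElementsByTagName("value");
            Map<String, String> props = new LinkedHashMap<>();
            for (int i = 0; i < keys.getLength(); i++) {
                props.put(keys.item(i).getTextContent(),
                          values.item(i).getTextContent());
            }
            return props;
        } catch (Exception e) {
            throw new RuntimeException("bad schedulingContext XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<schedulingContext>"
            + "<key>capacity</key><value>30</value>"
            + "<key>maxCapacity</key><value>50</value>"
            + "</schedulingContext>";
        System.out.println(parse(xml)); // {capacity=30, maxCapacity=50}
    }
}
```

With Option 1's xs:any content this generic loop would not work, since the framework cannot know the scheduler's element names in advance; that is exactly the trade-off the comment describes.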
[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744459#action_12744459 ] Vinod K V commented on MAPREDUCE-849: - These names look a lot cleaner. +1 for the overall direction. But, we should also think of ways to continue doing this going forward even after this issue gets committed. While doing this, if we can create the corresponding java.lang.String property names, a la HADOOP-3583, and use them everywhere, it would be really good. For e.g.,
{code}
static final String MAPREDUCE_CLUSTER_EXAMPLE_CONFIG_PROPERTY =
    "mapreduce.cluster.example.config";
{code}
Also, I think usage of strings like _mapreduce.map.max.attempts_ and _mapreduce.jobtracker.maxtasks.per.job_ should be discouraged in favour of _mapreduce.map.max-attempts_ and _mapreduce.jobtracker.maxtasks-per-job_ respectively. Thoughts about this? I am assuming that configuration related to sub-components should start with a prefix of the parent component. For e.g., _mapred.healthChecker.script.args_ will be _mapreduce.tasktracker.healthChecker.script-args_ . Right? Renaming of configuration property names in mapreduce - Key: MAPREDUCE-849 URL: https://issues.apache.org/jira/browse/MAPREDUCE-849 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Amareshwari Sriramadasu Assignee: Amareshwari Sriramadasu Fix For: 0.21.0 In-line with HDFS-531, property names in configuration files should be standardized in MAPREDUCE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744462#action_12744462 ] Amareshwari Sriramadasu commented on MAPREDUCE-849: --- bq. I am assuming that configuration related to sub-components should start with a prefix of the parent component. For e.g., mapred.healthChecker.script.args will be mapreduce.tasktracker.healthChecker.script-args . Right? Yes. I will post a document which contains complete change-list of old name to new name. Renaming of configuration property names in mapreduce - Key: MAPREDUCE-849 URL: https://issues.apache.org/jira/browse/MAPREDUCE-849 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Amareshwari Sriramadasu Assignee: Amareshwari Sriramadasu Fix For: 0.21.0 In-line with HDFS-531, property names in configuration files should be standardized in MAPREDUCE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-181) mapred.system.dir should be accessible only to hadoop daemons
[ https://issues.apache.org/jira/browse/MAPREDUCE-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744467#action_12744467 ] Devaraj Das commented on MAPREDUCE-181: --- I wonder whether it makes sense to have the jobclient write two files instead of a single split file: 1) the splits info (the actual bytes) written to a secure location on HDFS (with permissions 700) 2) the split metadata, which is a set of entries like {map-id : location_1, location_2, ..., location_n, start-offset-in-split-file, length} for each map-id. This is serialized over RPC, and the JobTracker writes it to the well-known mapred system directory (which the JobTracker owns with perms 700). The JobTracker just reads/loads the metadata and creates the TIP cache. The TaskTracker is handed a split object that looks something like {start-offset-in-split-file, length}. As part of task localization, the TT copies the specific bytes from the split file (securely) and launches the task, which then reads the split; or the TT could simply stream it over RPC to the child. The replication factor could be set to a high number for the splits info file. Doing it this way should reduce the size of the split file information considerably (and we can have a cap on the metadata size as well), and also provide security for the user-generated split files' content. For the JobConf, passing the basic and minimum info to the JobTracker as Hong suggested on MAPREDUCE-841 seems to make sense. For all other conf properties, the Task can load them directly from HDFS. The max size (in terms of #bytes) of the basic information could be easily derived and we could have a cap on that for the RPC communication. Thoughts?
mapred.system.dir should be accessible only to hadoop daemons -- Key: MAPREDUCE-181 URL: https://issues.apache.org/jira/browse/MAPREDUCE-181 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Amar Kamat Assignee: Amar Kamat Attachments: hadoop-3578-branch-20-example-2.patch, hadoop-3578-branch-20-example.patch, HADOOP-3578-v2.6.patch, HADOOP-3578-v2.7.patch Currently the jobclient accesses the {{mapred.system.dir}} to add job details. Hence the {{mapred.system.dir}} has the permissions of {{rwx-wx-wx}}. This could be a security loophole where the job files might get overwritten/tampered after the job submission. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
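The two-file proposal above boils down to a small metadata record per map plus the byte-range handed to the TaskTracker. The following is a sketch of those shapes under that proposal; all class and field names are assumptions for illustration, not the actual patch:

```java
public class SplitMetaInfoSketch {
    // One metadata entry per map: preferred hosts plus the byte range of
    // this map's split inside the raw (700-permission) split file.
    record SplitMetaInfo(String[] locations, long startOffset, long length) {}

    // What a TaskTracker would be handed: just the byte range to copy
    // (or stream) from the split file during task localization.
    record SplitIndex(long startOffset, long length) {}

    // The JobTracker keeps the locations for scheduling and forwards only
    // the range to the TT.
    static SplitIndex toIndex(SplitMetaInfo meta) {
        return new SplitIndex(meta.startOffset(), meta.length());
    }

    public static void main(String[] args) {
        SplitMetaInfo m = new SplitMetaInfo(new String[] {"host1", "host2"}, 0L, 128L);
        System.out.println(toIndex(m));
    }
}
```

Keeping the locations out of the TaskTracker-facing record is what lets the metadata file stay small and capped, per the comment.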
[jira] Resolved: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemanth Yamijala resolved MAPREDUCE-711. Resolution: Fixed Fix Version/s: 0.21.0 Release Note: - Removed distributed cache classes and package from the Common project. - Added the same to the mapreduce project. - This will mean that users using Distributed Cache will now necessarily need the mapreduce jar in Hadoop 0.21. - Modified the package name to o.a.h.mapreduce.filecache from o.a.h.filecache and deprecated the old package name. Hadoop Flags: [Incompatible change, Reviewed] HDFS tests have also passed. Now, all the projects are sync'ed up. I committed this to trunk. Thanks, Vinod ! Move Distributed Cache from Common to Map/Reduce Key: MAPREDUCE-711 URL: https://issues.apache.org/jira/browse/MAPREDUCE-711 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Owen O'Malley Assignee: Vinod K V Fix For: 0.21.0 Attachments: MAPREDUCE-711-20090709-common.txt, MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, MAPREDUCE-711-20090710.txt Distributed Cache logically belongs as part of map/reduce and not Common. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.
[ https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744505#action_12744505 ] rahul k singh commented on MAPREDUCE-861: - There is a small error in the xsd mentioned above for option 2:
{code:xml}
<xs:element name="schedulingContext" minOccurs="0" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="key" minOccurs="1" maxOccurs="1"/>
      <xs:element name="value" minOccurs="1" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
{code}
Modify queue configuration format and parsing to support a hierarchy of queues. --- Key: MAPREDUCE-861 URL: https://issues.apache.org/jira/browse/MAPREDUCE-861 Project: Hadoop Map/Reduce Issue Type: Sub-task Reporter: Hemanth Yamijala Assignee: rahul k singh -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744516#action_12744516 ] Hadoop QA commented on MAPREDUCE-476: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416836/MAPREDUCE-476-20090818.txt against trunk revision 805324. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 14 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/console This message is automatically generated. 
extend DistributedCache to work locally (LocalJobRunner) Key: MAPREDUCE-476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: sam rash Assignee: Philip Zeyliger Priority: Minor Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476.patch The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file system (http, assume hdfs = default fs = local fs) when doing local development. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-883) harchive: Document how to unarchive
[ https://issues.apache.org/jira/browse/MAPREDUCE-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated MAPREDUCE-883: --- Attachment: mapreduce-883-0.patch Simple doc suggesting the use of cp/distcp for unarchiving. harchive: Document how to unarchive --- Key: MAPREDUCE-883 URL: https://issues.apache.org/jira/browse/MAPREDUCE-883 Project: Hadoop Map/Reduce Issue Type: Improvement Components: documentation, harchive Reporter: Koji Noguchi Priority: Minor Attachments: mapreduce-883-0.patch I was thinking of implementing harchive's 'unarchive' feature, but realized it has been implemented already ever since harchive was introduced. It just needs to be documented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Patch Available (was: Open) More efficient SQL queries for DBInputFormat Key: MAPREDUCE-885 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Aaron Kimball Assignee: Aaron Kimball Attachments: MAPREDUCE-885.patch DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the LIMIT and OFFSET SQL keywords. These are only meaningful in an ordered context, so the query also includes an ORDER BY clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive. A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
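The WHERE-clause approach described in this issue amounts to partitioning the value range of an indexed column into per-split range predicates. A minimal sketch of that query generation, with table and column names made up for the demo (this is not the patch itself):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitSketch {
    // Partition [min, max] on an indexed column into numSplits half-open
    // ranges, each expressed as a WHERE clause the database can serve with
    // an index range scan (no LIMIT/OFFSET, no full ordering).
    static List<String> rangeQueries(String table, String col,
                                     long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = (max - min + numSplits - 1) / numSplits; // ceiling division
        for (long lo = min; lo <= max; lo += span) {
            long hi = Math.min(lo + span, max + 1);
            queries.add("SELECT * FROM " + table
                + " WHERE " + col + " >= " + lo + " AND " + col + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        rangeQueries("employees", "id", 0, 99, 4).forEach(System.out::println);
    }
}
```

Because each mapper touches a disjoint range, the database does O(n) total work across n splits instead of the O(n^2) re-scanning that LIMIT/OFFSET incurs.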
[jira] Created: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
More efficient SQL queries for DBInputFormat Key: MAPREDUCE-885 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Aaron Kimball Assignee: Aaron Kimball Attachments: MAPREDUCE-885.patch DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the LIMIT and OFFSET SQL keywords. These are only meaningful in an ordered context, so the query also includes an ORDER BY clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive. A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744809#action_12744809 ] Hadoop QA commented on MAPREDUCE-885: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416936/MAPREDUCE-885.patch against trunk revision 805324. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/490/console This message is automatically generated. More efficient SQL queries for DBInputFormat Key: MAPREDUCE-885 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Aaron Kimball Assignee: Aaron Kimball Attachments: MAPREDUCE-885.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744811#action_12744811 ] Aaron Kimball commented on MAPREDUCE-885: - I think this patch won't apply until MAPREDUCE-875 is in. More efficient SQL queries for DBInputFormat Key: MAPREDUCE-885 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Aaron Kimball Assignee: Aaron Kimball Attachments: MAPREDUCE-885.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-875) Make DBRecordReader execute queries lazily
[ https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744816#action_12744816 ] Aaron Kimball commented on MAPREDUCE-875:

The failing Sqoop tests claim they fail because Avro cannot be found. Not sure why this is happening -- Sqoop doesn't make use of Avro anywhere. Recycling the patch status in case this was transient. If not, do I have to put some more random libraries in ivy.xml? According to {{git-blame}}, this was added to the root ivy.xml earlier that day:

{code}
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 276)     <dependency org="org.apache.hadoop"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 277)       name="avro"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 278)       rev="1.0.0"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 279)       conf="common-default"/>
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 280)     <dependency org="org.codehaus.jackso
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 281)       name="jackson-mapper-asl"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 282)       rev="1.0.1"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 283)       conf="common-default"/>
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 284)     <dependency org="com.thoughtworks.pa
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 285)       name="paranamer"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 286)       rev="1.5"
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 287)       conf="common-default"/>
{code}

Make DBRecordReader execute queries lazily
------------------------------------------
Key: MAPREDUCE-875
URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Attachments: MAPREDUCE-875.patch

DBInputFormat's DBRecordReader executes the user's SQL query in the constructor. If the query is long-running, this can cause task timeout. The user is unable to spawn a background thread (e.g., in a MapRunnable) to inform Hadoop of on-going progress.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
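The lazy-execution idea behind this issue can be sketched as follows. This is a simplified, hypothetical stand-in rather than the actual DBRecordReader code: the query execution is modeled as a Supplier so the example is self-contained without a JDBC connection.

```java
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical sketch of lazy query execution: the constructor does no I/O,
// so the framework can report progress before the (possibly long-running)
// query is ever issued.
class LazyRecordReader<T> {
    private final Supplier<Iterator<T>> queryRunner; // stands in for running the SQL query
    private Iterator<T> results;                     // null until the first record is requested
    private T current;

    LazyRecordReader(Supplier<Iterator<T>> queryRunner) {
        this.queryRunner = queryRunner;              // no database work here
    }

    boolean nextKeyValue() {
        if (results == null) {
            results = queryRunner.get();             // query runs lazily, on first access
        }
        if (results.hasNext()) {
            current = results.next();
            return true;
        }
        return false;
    }

    T getCurrentValue() {
        return current;
    }
}
```

Because the constructor returns immediately, the task does not sit inside it past the task timeout; the expensive work happens on the first nextKeyValue() call, by which point the framework is already tracking progress.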
[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily
[ https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-875:

Status: Patch Available (was: Open)

Make DBRecordReader execute queries lazily
------------------------------------------
Key: MAPREDUCE-875
URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Attachments: MAPREDUCE-875.patch

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily
[ https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-875:

Attachment: MAPREDUCE-875.2.patch

Attaching a new patch after re-syncing with trunk. Just realized that Avro was already added to Sqoop's ivy.xml.

Make DBRecordReader execute queries lazily
------------------------------------------
Key: MAPREDUCE-875
URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Attachments: MAPREDUCE-875.2.patch, MAPREDUCE-875.patch

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-336:

Attachment: MAPREDUCE-336_0_20090818.patch

Straight-forward fix.

The logging level of the tasks should be configurable by the job
----------------------------------------------------------------
Key: MAPREDUCE-336
URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Arun C Murthy
Fix For: 0.21.0
Attachments: MAPREDUCE-336_0_20090818.patch

It would be nice to be able to configure the logging level of the task JVMs separately from the server JVMs. Reducing logging substantially increases performance and reduces the consumption of local disk on the task trackers.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job
[ https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-336:

Fix Version/s: 0.21.0
Status: Patch Available (was: Open)

The logging level of the tasks should be configurable by the job
----------------------------------------------------------------
Key: MAPREDUCE-336
URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Owen O'Malley
Assignee: Arun C Murthy
Fix For: 0.21.0
Attachments: MAPREDUCE-336_0_20090818.patch

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
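The patch itself is not reproduced here, but the feature can be sketched as a per-job configuration. The property names below are illustrative assumptions, not confirmed by this thread; the actual names come from the attached patch.

```xml
<!-- Hypothetical job configuration sketch: property names are illustrative. -->
<!-- The idea: the job, not the cluster, chooses the task JVMs' log level.   -->
<property>
  <name>mapred.map.child.log.level</name>
  <value>WARN</value>   <!-- quieter map tasks: less log I/O, less local disk -->
</property>
<property>
  <name>mapred.reduce.child.log.level</name>
  <value>INFO</value>   <!-- reduce tasks keep the default verbosity -->
</property>
```

The server daemons (JobTracker, TaskTracker) would keep their own log4j configuration, untouched by these job-level settings.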
[jira] Commented: (MAPREDUCE-880) TestRecoveryManager times out
[ https://issues.apache.org/jira/browse/MAPREDUCE-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744883#action_12744883 ] Amar Kamat commented on MAPREDUCE-880:

Looked into this. The problem appears to be the case where the JobTracker is dead while the TaskTrackers still have tasks running. In such cases MiniMRCluster.shutdown() waits forever for the tasks to finish (i.e., for the trackers to become idle). Earlier the tasks somehow were not scheduled, so the test used to pass. Continuing with the debugging.

TestRecoveryManager times out
-----------------------------
Key: MAPREDUCE-880
URL: https://issues.apache.org/jira/browse/MAPREDUCE-880
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: test
Reporter: Amar Kamat

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.