[jira] Commented: (MAPREDUCE-834) When TaskTracker config uses old memory management values, its memory monitoring is disabled.

2009-08-18 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744915#action_12744915
 ] 

Hemanth Yamijala commented on MAPREDUCE-834:


A few comments:

- The memory allotted per slot based on the old configuration should not be derived 
from getMaxVirtualMemoryForTask(), but from 
JobConf.MAPRED_TASK_DEFAULT_MAXVMEM_PROPERTY. Also note that this value is in 
bytes, while the system maintains everything else in MB, so it should be 
converted to MB (see the sketch below).
- testTaskMemoryMonitoringWithDeprecatedConfiguration should also set the TT 
configuration for JobConf.MAPRED_TASK_DEFAULT_MAXVMEM_PROPERTY in bytes instead 
of MAPRED_TASK_MAXVMEM_PROPERTY.
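
A minimal sketch of the bytes-to-MB conversion being suggested, assuming {{conf}} is 
the TaskTracker's JobConf (variable names here are illustrative, not from the patch):
{code}
// Illustrative sketch only, not the attached patch: read the deprecated
// byte-valued default and convert it to MB, since the TT tracks memory in MB.
long defaultMaxVmemBytes =
    conf.getLong(JobConf.MAPRED_TASK_DEFAULT_MAXVMEM_PROPERTY,
                 JobConf.DISABLED_MEMORY_LIMIT);
long memoryPerSlotMB = (defaultMaxVmemBytes == JobConf.DISABLED_MEMORY_LIMIT)
    ? JobConf.DISABLED_MEMORY_LIMIT
    : defaultMaxVmemBytes / (1024 * 1024);
{code}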

> When TaskTracker config uses old memory management values, its memory 
> monitoring is disabled.
> --
>
> Key: MAPREDUCE-834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Karam Singh
> Attachments: mapreduce-834-1.patch
>
>
> TaskTracker memory config values -:
> mapred.tasktracker.vmem.reserved=8589934592
> mapred.task.default.maxvmem=2147483648
> mapred.task.limit.maxvmem=4294967296
> mapred.tasktracker.pmem.reserved=2147483648
> TaskTracker start as -:
>2009-08-05 12:39:03,308 WARN 
> org.apache.hadoop.mapred.TaskTracker: The variable 
> mapred.tasktracker.vmem.reserved is no longer used
>   2009-08-05 12:39:03,308 WARN 
> org.apache.hadoop.mapred.TaskTracker: The variable 
> mapred.tasktracker.pmem.reserved is no longer used
>   2009-08-05 12:39:03,308 WARN 
> org.apache.hadoop.mapred.TaskTracker: The variable 
> mapred.task.default.maxvmem is no longer used
>   2009-08-05 12:39:03,308 WARN 
> org.apache.hadoop.mapred.TaskTracker: The variable mapred.task.limit.maxvmem 
> is no longer used
>   2009-08-05 12:39:03,308 INFO 
> org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for 
> all reduce tasks on 
>   2009-08-05 12:39:03,309 INFO 
> org.apache.hadoop.mapred.TaskTracker:  Using MemoryCalculatorPlugin : 
> org.apache.hadoop.util.linuxmemorycalculatorplu...@19be4777
>   2009-08-05 12:39:03,311 WARN 
> org.apache.hadoop.mapred.TaskTracker: TaskTracker's 
> totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-777) A method for finding and tracking jobs from the new API

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-777:
--

Status: Patch Available  (was: Open)

> A method for finding and tracking jobs from the new API
> ---
>
> Key: MAPREDUCE-777
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-777
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: client
>Reporter: Owen O'Malley
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
> Attachments: patch-777-1.txt, patch-777-2.txt, patch-777.txt
>
>
> We need to create a replacement interface for the JobClient API in the new 
> interface. In particular, the user needs to be able to query and track jobs 
> that were launched by other processes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-777) A method for finding and tracking jobs from the new API

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-777:
--

Attachment: patch-777-2.txt

Patch incorporating review comments, except comment (4).
bq. Move Counters(org.apache.hadoop.mapred.Counters counters) to a method in 
the old api
This needs the CounterGroup constructor(s) to be made public, so I did not move this 
method to the old api.

> A method for finding and tracking jobs from the new API
> ---
>
> Key: MAPREDUCE-777
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-777
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: client
>Reporter: Owen O'Malley
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
> Attachments: patch-777-1.txt, patch-777-2.txt, patch-777.txt
>
>
> We need to create a replacement interface for the JobClient API in the new 
> interface. In particular, the user needs to be able to query and track jobs 
> that were launched by other processes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.

2009-08-18 Thread rahul k singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744910#action_12744910
 ] 

rahul k singh commented on MAPREDUCE-861:
-

We had an offline discussion with Owen and Eric. 
There was an agreement in principle to use option 2 with a slight modification.

So all the configuration still remains the same, except the <properties> part, which 
would change.

Hence the new configuration would look like:
{code:xml}
<queues>
  <queue>
    <name>queue1</name>
    <properties>
      ...
    </properties>
    <queue>
      <name>subQueue1</name>
      <acl-submit-job>alice,bob</acl-submit-job>
      <state>running</state>
    </queue>
  </queue>
</queues>
{code}

> Modify queue configuration format and parsing to support a hierarchy of 
> queues.
> ---
>
> Key: MAPREDUCE-861
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-861
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Hemanth Yamijala
>Assignee: rahul k singh
>
> MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
> framework. This JIRA is for defining changes to the configuration related to 
> queues. 
> The current format for defining a queue and its properties is as follows: 
> mapred.queue.<queue-name>.<property-name>. For e.g. 
> mapred.queue.<queue-name>.acl-submit-job. The reason for using this verbose 
> format was to be able to reuse the Configuration parser in Hadoop. However, 
> administrators currently using the queue configuration have already indicated 
> a very strong desire for a more manageable format. Since, this becomes more 
> unwieldy with hierarchical queues, the time may be good to introduce a new 
> format for representing queue configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-777) A method for finding and tracking jobs from the new API

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amareshwari Sriramadasu updated MAPREDUCE-777:
--

Status: Open  (was: Patch Available)

Cancelling the patch to incorporate offline comments from Amar.
Comments include:
1. Introduce Counters.downgrade() instead of a constructor.
2. 
{code}
+  org.apache.hadoop.mapreduce.JobClient.TaskStatusFilter newFilter = 
+getNewFilter(filter);
+  printTaskEvents(events, newFilter, profiling, mapRanges, reduceRanges);
{code}
Use getNewFilter directly.

3. Deprecate the public methods in JobTracker that were changed for the new 
JobSubmissionProtocol.
4. Move Counters(org.apache.hadoop.mapred.Counters counters) to a method in the old 
api.

> A method for finding and tracking jobs from the new API
> ---
>
> Key: MAPREDUCE-777
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-777
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: client
>Reporter: Owen O'Malley
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
> Attachments: patch-777-1.txt, patch-777.txt
>
>
> We need to create a replacement interface for the JobClient API in the new 
> interface. In particular, the user needs to be able to query and track jobs 
> that were launched by other processes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-880) TestRecoveryManager times out

2009-08-18 Thread Amar Kamat (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744883#action_12744883
 ] 

Amar Kamat commented on MAPREDUCE-880:
--

Looked into this. The problem seems to be the case where the jobtracker 
is dead while the tasktrackers still have some tasks running. In such cases 
MiniMRCluster.shutdown() waits forever for the tasks to finish (i.e. for the trackers 
to become idle). Somehow the tasks were not scheduled earlier, which is why it used to 
work fine. Continuing with the debugging.

> TestRecoveryManager times out
> -
>
> Key: MAPREDUCE-880
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-880
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Amar Kamat
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-832) Too many WARN messages about deprecated memory config variables in JobTracker log

2009-08-18 Thread Hemanth Yamijala (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Yamijala updated MAPREDUCE-832:
---

Assignee: rahul k singh
  Status: Patch Available  (was: Open)

> Too many WARN messages about deprecated memory config variables in JobTracker 
> log
> -
>
> Key: MAPREDUCE-832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-832
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: Karam Singh
>Assignee: rahul k singh
> Attachments: mapreduce-832-20.patch, mapreduce-832.patch
>
>
> When a user submits a mapred job using the old memory config variable 
> (mapred.task.maxvmem), the following message appears too many times in the JobTracker logs -:
> [
> WARN org.apache.hadoop.mapred.JobConf: The variable mapred.task.maxvmem is no 
> longer used instead use  mapred.job.map.memory.mb and 
> mapred.job.reduce.memory.mb
> ]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-832) Too many WARN messages about deprecated memory config variables in JobTracker log

2009-08-18 Thread Hemanth Yamijala (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Yamijala updated MAPREDUCE-832:
---

Attachment: mapreduce-832.patch

Attached a new patch that works for trunk. It is the same as what Rahul 
uploaded, except that I modified the method checkAndWarnDeprecation to not require a 
Configuration instance; instead, it uses the current object's own values. 
Running this through Hudson.
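
For reference, a rough sketch of the shape of that change, assuming the JobConf 
constants that already exist for these property names (this is an illustration, not 
the attached patch):
{code}
// Illustrative sketch only: warn from the JobConf's own values instead of
// taking a Configuration argument.
private void checkAndWarnDeprecation() {
  if (get(JobConf.MAPRED_TASK_MAXVMEM_PROPERTY) != null) {
    LOG.warn("The variable " + JobConf.MAPRED_TASK_MAXVMEM_PROPERTY
        + " is no longer used instead use "
        + JobConf.MAPRED_JOB_MAP_MEMORY_MB_PROPERTY + " and "
        + JobConf.MAPRED_JOB_REDUCE_MEMORY_MB_PROPERTY);
  }
}
{code}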

> Too many WARN messages about deprecated memory config variables in JobTracker 
> log
> -
>
> Key: MAPREDUCE-832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-832
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: Karam Singh
> Attachments: mapreduce-832-20.patch, mapreduce-832.patch
>
>
> When a user submits a mapred job using the old memory config variable 
> (mapred.task.maxvmem), the following message appears too many times in the JobTracker logs -:
> [
> WARN org.apache.hadoop.mapred.JobConf: The variable mapred.task.maxvmem is no 
> longer used instead use  mapred.job.map.memory.mb and 
> mapred.job.reduce.memory.mb
> ]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job

2009-08-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy reassigned MAPREDUCE-336:
---

Assignee: Arun C Murthy

> The logging level of the tasks should be configurable by the job
> 
>
> Key: MAPREDUCE-336
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Arun C Murthy
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-336_0_20090818.patch
>
>
> It would be nice to be able to configure the logging level of the Task JVM's 
> separately from the server JVM's. Reducing logging substantially increases 
> performance and reduces the consumption of local disk on the task trackers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job

2009-08-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-336:


Fix Version/s: 0.21.0
   Status: Patch Available  (was: Open)

> The logging level of the tasks should be configurable by the job
> 
>
> Key: MAPREDUCE-336
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Arun C Murthy
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-336_0_20090818.patch
>
>
> It would be nice to be able to configure the logging level of the Task JVM's 
> separately from the server JVM's. Reducing logging substantially increases 
> performance and reduces the consumption of local disk on the task trackers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-336) The logging level of the tasks should be configurable by the job

2009-08-18 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-336:


Attachment: MAPREDUCE-336_0_20090818.patch

Straight-forward fix.

> The logging level of the tasks should be configurable by the job
> 
>
> Key: MAPREDUCE-336
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-336
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Arun C Murthy
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-336_0_20090818.patch
>
>
> It would be nice to be able to configure the logging level of the Task JVM's 
> separately from the server JVM's. Reducing logging substantially increases 
> performance and reduces the consumption of local disk on the task trackers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-875:


Attachment: MAPREDUCE-875.2.patch

Attaching a new patch after resyncing with trunk. Just realized that Avro was 
already added to Sqoop's ivy.xml.

> Make DBRecordReader execute queries lazily
> --
>
> Key: MAPREDUCE-875
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-875.2.patch, MAPREDUCE-875.patch
>
>
> DBInputFormat's DBRecordReader executes the user's SQL query in the 
> constructor. If the query is long-running, this can cause task timeout. The 
> user is unable to spawn a background thread (e.g., in a MapRunnable) to 
> inform Hadoop of on-going progress. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-875:


Status: Patch Available  (was: Open)

> Make DBRecordReader execute queries lazily
> --
>
> Key: MAPREDUCE-875
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-875.patch
>
>
> DBInputFormat's DBRecordReader executes the user's SQL query in the 
> constructor. If the query is long-running, this can cause task timeout. The 
> user is unable to spawn a background thread (e.g., in a MapRunnable) to 
> inform Hadoop of on-going progress. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744816#action_12744816
 ] 

Aaron Kimball commented on MAPREDUCE-875:
-

The failing Sqoop tests claim they fail because they can't find Avro. Not 
sure why this is happening -- Sqoop doesn't make use of Avro anywhere. 
Recycling the patch status in case this was transient. If not, do I have to put 
some more random libraries in ivy.xml? 

According to {{git-blame}}, this was added to the root ivy.xml earlier that day:

{code}
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 276) 
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 280) 
9e58f6fc (Sharad Agarwal 2009-08-14 05:10:40 + 284) 
{code}

> Make DBRecordReader execute queries lazily
> --
>
> Key: MAPREDUCE-875
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-875.patch
>
>
> DBInputFormat's DBRecordReader executes the user's SQL query in the 
> constructor. If the query is long-running, this can cause task timeout. The 
> user is unable to spawn a background thread (e.g., in a MapRunnable) to 
> inform Hadoop of on-going progress. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-875) Make DBRecordReader execute queries lazily

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-875:


Status: Open  (was: Patch Available)

> Make DBRecordReader execute queries lazily
> --
>
> Key: MAPREDUCE-875
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-875
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-875.patch
>
>
> DBInputFormat's DBRecordReader executes the user's SQL query in the 
> constructor. If the query is long-running, this can cause task timeout. The 
> user is unable to spawn a background thread (e.g., in a MapRunnable) to 
> inform Hadoop of on-going progress. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744811#action_12744811
 ] 

Aaron Kimball commented on MAPREDUCE-885:
-

I think this patch won't apply until MAPREDUCE-875 is in.

> More efficient SQL queries for DBInputFormat
> 
>
> Key: MAPREDUCE-885
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a 
> table, and selecting subsections of the table via the "LIMIT" and "OFFSET" 
> SQL keywords. These are only meaningful in an ordered context, so the query 
> also includes an "ORDER BY" clause on an index column. The resulting queries 
> are often inefficient and require full table scans. Actually using multiple 
> mappers with these queries can lead to O(n^2) behavior in the database, where 
> n is the number of splits. Attempting to use parallelism with these queries 
> is counter-productive.
> A better mechanism is to organize splits based on data values themselves, 
> which can be performed in the WHERE clause, allowing for index range scans of 
> tables, and can better exploit parallelism in the database.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744809#action_12744809
 ] 

Hadoop QA commented on MAPREDUCE-885:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12416936/MAPREDUCE-885.patch
  against trunk revision 805324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/490/console

This message is automatically generated.

> More efficient SQL queries for DBInputFormat
> 
>
> Key: MAPREDUCE-885
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a 
> table, and selecting subsections of the table via the "LIMIT" and "OFFSET" 
> SQL keywords. These are only meaningful in an ordered context, so the query 
> also includes an "ORDER BY" clause on an index column. The resulting queries 
> are often inefficient and require full table scans. Actually using multiple 
> mappers with these queries can lead to O(n^2) behavior in the database, where 
> n is the number of splits. Attempting to use parallelism with these queries 
> is counter-productive.
> A better mechanism is to organize splits based on data values themselves, 
> which can be performed in the WHERE clause, allowing for index range scans of 
> tables, and can better exploit parallelism in the database.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)
More efficient SQL queries for DBInputFormat


 Key: MAPREDUCE-885
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Attachments: MAPREDUCE-885.patch

DBInputFormat generates InputSplits by counting the available rows in a table, 
and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL 
keywords. These are only meaningful in an ordered context, so the query also 
includes an "ORDER BY" clause on an index column. The resulting queries are 
often inefficient and require full table scans. Actually using multiple mappers 
with these queries can lead to O(n^2) behavior in the database, where n is the 
number of splits. Attempting to use parallelism with these queries is 
counter-productive.

A better mechanism is to organize splits based on data values themselves, which 
can be performed in the WHERE clause, allowing for index range scans of tables, 
and can better exploit parallelism in the database.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-885:


Status: Patch Available  (was: Open)

> More efficient SQL queries for DBInputFormat
> 
>
> Key: MAPREDUCE-885
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a 
> table, and selecting subsections of the table via the "LIMIT" and "OFFSET" 
> SQL keywords. These are only meaningful in an ordered context, so the query 
> also includes an "ORDER BY" clause on an index column. The resulting queries 
> are often inefficient and require full table scans. Actually using multiple 
> mappers with these queries can lead to O(n^2) behavior in the database, where 
> n is the number of splits. Attempting to use parallelism with these queries 
> is counter-productive.
> A better mechanism is to organize splits based on data values themselves, 
> which can be performed in the WHERE clause, allowing for index range scans of 
> tables, and can better exploit parallelism in the database.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-885:


Attachment: MAPREDUCE-885.patch

This patch introduces DataDrivenDBInputFormat. This class extends DBInputFormat 
and reuses much of its common logic (e.g., setting up and tearing down 
connections, configuration, DBWritable, etc). But it adds a 
DataDrivenDBInputSplit class which splits queries based on data values, e.g. 
{{"id >= 10 AND id < 20"}} for one split, and {{"id >= 20 AND id < 30"}} for 
the next one. The resulting queries run significantly faster and parallelise 
properly.

Instead of requiring a counting query like DBInputFormat, this InputFormat 
requires a query that returns the min and max values of the split column on the 
data to import. DataDrivenDBInputSplit is a subclass of DBInputSplit; the 
original DBRecordReader family of classes has been modified to discriminate 
between the new InputSplit class vs. the old one; if it detects a new one, it 
submits the newer WHERE-based query rather than the LIMIT/OFFSET-based query to 
the database.

The min and max values of the column are used to generate splits via linear 
interpolation between the values. A DBSplitter interface has been added, which 
takes the min and max values for the column, as well as the number of splits to 
use. It then generates about this many splits, which subdivide the range of 
values into roughly-even intervals. Several DBSplitter implementations are 
provided which are applicable to different data types. For example, there is an 
IntegerSplitter which can split INTEGER, BIGINT, TINYINT, LONG, etc. columns. 
The FloatSplitter implementation works on DECIMAL, NUMBER, and REAL datatypes. 
A TextSplitter implementation is provided, but its utility is 
database-dependent. Databases may choose to sort strings via a number of 
algorithms (e.g., case-sensitive vs. case-insensitive). The TextSplitter 
assumes that strings are sorted in Unicode codepoint order. (e.g., "AAA" < 
"BBB" < "aaa".) A warning will be logged if the TextSplitter is used. 

Explicit tests have been added for some of the splitters. Sqoop has been 
modified to use the new InputFormat with encouraging performance results. 
Sqoop's existing regression test suite exercises the code paths for all the 
splitters and isolated several bugs which were fixed prior to submitting this 
patch. I will post the Sqoop patch separately after this JIRA issue is 
committed.
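
To make the splitting idea concrete, here is a minimal sketch of how an integer split 
column could be turned into WHERE-clause predicates by linear interpolation. It 
illustrates the approach described above and is not code from the patch; the column 
name and bounds are made up:
{code}
// Illustrative sketch only: divide [min, max] of the split column into roughly
// even ranges and emit one bounding predicate per split.
List<String> predicates = new ArrayList<String>();
long min = 0, max = 100;   // values returned by the min/max query
int numSplits = 10;
long step = Math.max(1, (max - min + 1) / numSplits);
for (long lo = min; lo <= max; lo += step) {
  long hi = Math.min(lo + step, max + 1);
  predicates.add("id >= " + lo + " AND id < " + hi);  // e.g. "id >= 10 AND id < 20"
}
{code}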

> More efficient SQL queries for DBInputFormat
> 
>
> Key: MAPREDUCE-885
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a 
> table, and selecting subsections of the table via the "LIMIT" and "OFFSET" 
> SQL keywords. These are only meaningful in an ordered context, so the query 
> also includes an "ORDER BY" clause on an index column. The resulting queries 
> are often inefficient and require full table scans. Actually using multiple 
> mappers with these queries can lead to O(n^2) behavior in the database, where 
> n is the number of splits. Attempting to use parallelism with these queries 
> is counter-productive.
> A better mechanism is to organize splits based on data values themselves, 
> which can be performed in the WHERE clause, allowing for index range scans of 
> tables, and can better exploit parallelism in the database.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-883) harchive: Document how to unarchive

2009-08-18 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated MAPREDUCE-883:
---

Attachment: mapreduce-883-0.patch

A simple doc change suggesting the use of cp/distcp for unarchiving.

> harchive: Document how to unarchive
> ---
>
> Key: MAPREDUCE-883
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-883
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: documentation, harchive
>Reporter: Koji Noguchi
>Priority: Minor
> Attachments: mapreduce-883-0.patch
>
>
> I was thinking of implementing harchive's 'unarchive' feature, but realized 
> it has been implemented already ever since harchive was introduced.
> It just needs to be documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-237) Runtimes of TestJobTrackerRestart* testcases are high again

2009-08-18 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744661#action_12744661
 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-237:
--

Got a TestJobTrackerRestart timeout on Hudson [build 
#487|http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/487/testReport/org.apache.hadoop.mapred/TestJobTrackerRestart/testJobTrackerRestart/].

> Runtimes of TestJobTrackerRestart* testcases are high again
> ---
>
> Key: MAPREDUCE-237
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-237
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Amar Kamat
>Assignee: Amar Kamat
>
> [junit] Running org.apache.hadoop.mapred.TestJobTrackerRestart
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 575.887 sec
> [junit] Running org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 864.319 sec
> Something I saw on trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744516#action_12744516
 ] 

Hadoop QA commented on MAPREDUCE-476:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12416836/MAPREDUCE-476-20090818.txt
  against trunk revision 805324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 14 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/489/console

This message is automatically generated.

> extend DistributedCache to work locally (LocalJobRunner)
> 
>
> Key: MAPREDUCE-476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-476
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: sam rash
>Assignee: Philip Zeyliger
>Priority: Minor
> Attachments: HADOOP-2914-v1-full.patch, 
> HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, 
> MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, 
> MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, 
> MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, 
> MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, 
> MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476.patch
>
>
> The DistributedCache does not work locally when using the outlined recipe at 
> http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html
>  
> Ideally, LocalJobRunner would take care of populating the JobConf and copying 
> remote files to the local file system (http, assume hdfs = default fs = local 
> fs) when doing local development.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.

2009-08-18 Thread rahul k singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744505#action_12744505
 ] 

rahul k singh commented on MAPREDUCE-861:
-

There is a small error in the xsd mentioned above for option 2:
{code:xml}

  

  
  
  
  

  



{code}


> Modify queue configuration format and parsing to support a hierarchy of 
> queues.
> ---
>
> Key: MAPREDUCE-861
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-861
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Hemanth Yamijala
>Assignee: rahul k singh
>
> MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
> framework. This JIRA is for defining changes to the configuration related to 
> queues. 
> The current format for defining a queue and its properties is as follows: 
> mapred.queue.<queue-name>.<property-name>. For e.g. 
> mapred.queue.<queue-name>.acl-submit-job. The reason for using this verbose 
> format was to be able to reuse the Configuration parser in Hadoop. However, 
> administrators currently using the queue configuration have already indicated 
> a very strong desire for a more manageable format. Since, this becomes more 
> unwieldy with hierarchical queues, the time may be good to introduce a new 
> format for representing queue configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce

2009-08-18 Thread Hemanth Yamijala (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Yamijala resolved MAPREDUCE-711.


   Resolution: Fixed
Fix Version/s: 0.21.0
 Release Note: 
- Removed distributed cache classes and package from the Common project. 
- Added the same to the mapreduce project. 
- This will mean that users using Distributed Cache will now necessarily need 
the mapreduce jar in Hadoop 0.21.
- Modified the package name to o.a.h.mapreduce.filecache from o.a.h.filecache 
and deprecated the old package name.
 Hadoop Flags: [Incompatible change, Reviewed]

HDFS tests have also passed. Now, all the projects are sync'ed up.

I committed this to trunk. Thanks, Vinod!

> Move Distributed Cache from Common to Map/Reduce
> 
>
> Key: MAPREDUCE-711
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-711
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Vinod K V
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-711-20090709-common.txt, 
> MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, 
> MAPREDUCE-711-20090710.txt
>
>
> Distributed Cache logically belongs as part of map/reduce and not Common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-181) mapred.system.dir should be accessible only to hadoop daemons

2009-08-18 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744467#action_12744467
 ] 

Devaraj Das commented on MAPREDUCE-181:
---

I wonder whether it makes sense to have the jobclient write two files per 
split file:

1) the splits info (the actual bytes) written to a secure location on the hdfs 
(with permissions 700)
2) the split metadata, which is a set of entries (roughly, each split's offset into 
the splits file, its length, and its locations) for each map-id. This is serialized 
over RPC, and the JobTracker writes it to the well-known mapred-system-directory 
(which the JobTracker owns with perms 700).

The JobTracker just reads/loads the metadata, and creates the TIP cache.

The TaskTracker is handed off a split object that looks something like the metadata 
entry described above. As part of task localization, the TT copies the specific bytes 
from the split file (securely) and launches the task, which then reads the split; or 
the TT could simply stream it over RPC to the child. The replication factor could be 
set to a high number for the splits info file.

Doing it in this way should reduce the size of the split file information 
considerably (and we can have a cap on the metadata size as well), and also 
provide security for the user generated split files' content.

For the JobConf, passing the basic and the minimum info to the JobTracker as 
Hong suggested on MAPREDUCE-841 seems to make sense. For all other conf 
properties, the Task can load them directly from the HDFS. The max size (in 
terms of #bytes) of the basic information could be easily derived and we could 
have a cap on that for the RPC communication.

Thoughts?
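
To sketch what one such metadata entry might carry (field names are illustrative, not 
an agreed-upon format):
{code}
// Illustrative sketch only: one metadata entry per map-id, pointing into the
// secure splits file and carrying the locations needed to build the TIP cache.
class SplitMetaInfo {
  String[] locations;   // hosts used for scheduling the map
  long startOffset;     // offset of this split's bytes within the splits file
  long splitLength;     // length of the serialized split data
}
{code}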

> mapred.system.dir should be accessible only to hadoop daemons 
> --
>
> Key: MAPREDUCE-181
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-181
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Amar Kamat
>Assignee: Amar Kamat
> Attachments: hadoop-3578-branch-20-example-2.patch, 
> hadoop-3578-branch-20-example.patch, HADOOP-3578-v2.6.patch, 
> HADOOP-3578-v2.7.patch
>
>
> Currently the jobclient accesses the {{mapred.system.dir}} to add job 
> details. Hence the {{mapred.system.dir}} has the permissions of 
> {{rwx-wx-wx}}. This could be a security loophole where the job files might 
> get overwritten/tampered after the job submission. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce

2009-08-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744464#action_12744464
 ] 

Hudson commented on MAPREDUCE-711:
--

Integrated in Hadoop-Hdfs-trunk #53 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/53/])
MAPREDUCE-711. Updated common and mapreduce jars from rev 804918 & 805081 resp.


> Move Distributed Cache from Common to Map/Reduce
> 
>
> Key: MAPREDUCE-711
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-711
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Vinod K V
> Attachments: MAPREDUCE-711-20090709-common.txt, 
> MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, 
> MAPREDUCE-711-20090710.txt
>
>
> Distributed Cache logically belongs as part of map/reduce and not Common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744462#action_12744462
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-849:
---

bq. I am assuming that configuration related to sub-components should start 
with a prefix of the parent component. For e.g., 
mapred.healthChecker.script.args will be 
mapreduce.tasktracker.healthChecker.script-args . Right?
Yes. I will post a document containing the complete change list from old names to 
new names. 

> Renaming of configuration property names in mapreduce
> -
>
> Key: MAPREDUCE-849
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-849
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
>
> In-line with HDFS-531, property names in configuration files should be 
> standardized in MAPREDUCE. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce

2009-08-18 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744459#action_12744459
 ] 

Vinod K V commented on MAPREDUCE-849:
-

These names look a lot cleaner. +1 for the overall direction. But we should 
also think of ways to continue doing this going forward, even after this issue 
gets committed.

While doing this, if we can create the corresponding java.lang.String property 
names, a la HADOOP-3583, and use them everywhere, it would be really good. For 
e.g.,
{code}
static final String MAPREDUCE_CLUSTER_EXAMPLE_CONFIG_PROPERTY = 
    "mapreduce.cluster.example.config";
{code}

Also, I think usage of strings like _mapreduce.map.max.attempts_ and 
_mapreduce.jobtracker.maxtasks.per.job_ should be discouraged in favour of 
_mapreduce.map.max-attempts_ and _mapreduce.jobtracker.maxtasks-per-job_ 
respectively. Thoughts about this?

I am assuming that configuration related to sub-components should start with a 
prefix of the parent component. For e.g., _mapred.healthChecker.script.args_ 
will be _mapreduce.tasktracker.healthChecker.script-args_ . Right?

> Renaming of configuration property names in mapreduce
> -
>
> Key: MAPREDUCE-849
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-849
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
>
> In-line with HDFS-531, property names in configuration files should be 
> standardized in MAPREDUCE. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-861) Modify queue configuration format and parsing to support a hierarchy of queues.

2009-08-18 Thread rahul k singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744452#action_12744452
 ] 

rahul k singh commented on MAPREDUCE-861:
-

As mentioned above, we had an internal agreement that we would be going ahead 
with xml based configuration for hierarchical queues.

In terms of how the configuration would be structured for hierarchical queues, we 
had 2 options in mind.

Option 1:
--
mapred-queues.xml would contain the queue hierarchy.

A typical hierarchical queue configuration would look like:
{code:xml}
<queues>
  <queue>
    <name>q1</name>
    <queue>
      <name>q1q1</name>
      <acl-submit-job>u1,u2,u3</acl-submit-job>
      <acl-administer-jobs>u1,u2</acl-administer-jobs>
      <state>stop/running</state>
      <properties>
        ...
      </properties>
    </queue>
  </queue>
</queues>
{code}

The configuration above defines a queue "q1" and a single child "q1q1".

The <properties> tag would act as a black box kind of section for the 
mapred-based parsers.
The xsd definition of <properties> would be:
{code:xml}

  

  

{code}

By defining <properties> as an open (any-content) element, we can extend this section 
of the configuration to add any kind of tags to <properties>.

Advantages:
1. This approach allows a single configuration file.
2. It is generic enough, in that it allows users to declare scheduler properties 
the way they want.

Disadvantages:
1. The parsing logic would end up in different places: framework-level parsing in the 
framework, and scheduler-specific parsing in the scheduler.
2. It is more cumbersome to implement.

Option 2:
-
Same as option 1, except that the definition of <properties> would change. 
It would have child tags for keys and values, which would define the key-value 
mappings of the various properties required by schedulers.

For example:
{code:xml}
<queues>
  <queue>
    <name>q1</name>
    <queue>
      <name>q1q1</name>
      <acl-submit-job>u1,u2,u3</acl-submit-job>
      <acl-administer-jobs>u1,u2</acl-administer-jobs>
      <state>stop/running</state>
      <properties>
        <property key="capacity" value="..."/>
        <property key="maxCapacity" value="..."/>
      </properties>
    </queue>
  </queue>
</queues>
{code}

The new xsd for <properties> would look like:
{code:xml}

  

  
  
  
  
  

  

{code}

Advantages:
1. Allows a single configuration file.
2. Provides a consistent way to specify scheduling properties.
3. Easier to implement, and the parsing logic resides in one common place.

Disadvantages:
1. Doesn't allow nested settings for scheduler properties.
2. Assumes that scheduler properties will always be in key-value format.

> Modify queue configuration format and parsing to support a hierarchy of 
> queues.
> ---
>
> Key: MAPREDUCE-861
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-861
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Hemanth Yamijala
>Assignee: rahul k singh
>
> MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
> framework. This JIRA is for defining changes to the configuration related to 
> queues. 
> The current format for defining a queue and its properties is as follows: 
> mapred.queue.<queue-name>.<property-name>. For e.g. 
> mapred.queue.<queue-name>.acl-submit-job. The reason for using this verbose 
> format was to be able to reuse the Configuration parser in Hadoop. However, 
> administrators currently using the queue configuration have already indicated 
> a very strong desire for a more manageable format. Since, this becomes more 
> unwieldy with hierarchical queues, the time may be good to introduce a new 
> format for representing queue configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-849) Renaming of configuration property names in mapreduce

2009-08-18 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1279#action_1279
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-849:
---

Configuration properties in the Mapreduce project can be categorized into the 
following, with a suggested name for each category.
||Category|| Suggested Name||
|Cluster config | mapreduce.* |
|JobTracker config | mapreduce.jobtracker.* |
|TaskTracker config | mapreduce.tasktracker.* |
|Job-level config | mapreduce.job.* |
|Task-level config | mapreduce.task.* |
|Map task config | mapreduce.map.* |
|Reduce task config | mapreduce.reduce.* |
|Job client config | mapreduce.jobclient.* |
|Pipes config | mapreduce.pipes.* |
|Lib config | mapreduce.<lib-name>.* |
|Example config | mapreduce.<example-name>.* |
|Test config | mapreduce.test.* |
|Streaming config | mapreduce.streaming.* or streaming.*|
|Contrib project config | mapreduce.<project-name>.* or <project-name>.* |

Thoughts?

> Renaming of configuration property names in mapreduce
> -
>
> Key: MAPREDUCE-849
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-849
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Amareshwari Sriramadasu
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
>
> In-line with HDFS-531, property names in configuration files should be 
> standardized in MAPREDUCE. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-430) Task stuck in cleanup with OutOfMemoryErrors

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1271#action_1271
 ] 

Hadoop QA commented on MAPREDUCE-430:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12416767/MAPREDUCE-430-v1.7.patch
  against trunk revision 805081.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/console

This message is automatically generated.

> Task stuck in cleanup with OutOfMemoryErrors
> 
>
> Key: MAPREDUCE-430
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-430
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Amareshwari Sriramadasu
>Assignee: Amar Kamat
> Fix For: 0.20.1
>
> Attachments: MAPREDUCE-430-v1.6-branch-0.20.patch, 
> MAPREDUCE-430-v1.6.patch, MAPREDUCE-430-v1.7.patch
>
>
> Obesrved a task with OutOfMemory error, stuck in cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-862) Modify UI to support a hierarchy of queues

2009-08-18 Thread Sreekanth Ramakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreekanth Ramakrishnan updated MAPREDUCE-862:
-

Attachment: subqueue.png

> Modify UI to support a hierarchy of queues
> --
>
> Key: MAPREDUCE-862
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-862
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Hemanth Yamijala
> Attachments: clustersummarymodification.png, detailspage.png, 
> initialscreen.png, subqueue.png
>
>
> MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
> framework. This JIRA is for defining changes to the UI related to queues. 
> This includes the hadoop queue CLI and the web UI on the JobTracker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-862) Modify UI to support a hierarchy of queues

2009-08-18 Thread Sreekanth Ramakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreekanth Ramakrishnan updated MAPREDUCE-862:
-

Attachment: initialscreen.png
detailspage.png
clustersummarymodification.png

Attaching screens of how the UI would look for the modified queue design.

The cluster summary would be modified to introduce a new column with the number of 
queues, linked to the modified queue details page, which is described in 
initialscreen.png.

From initialscreen.png we can click through the queue hierarchy, which would have 
two kinds of pages: for {{ContainerQueues}} we would not have a job list, and for 
{{JobQueue}} we have a job list in addition to the scheduling information.

> Modify UI to support a hierarchy of queues
> --
>
> Key: MAPREDUCE-862
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-862
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Hemanth Yamijala
> Attachments: clustersummarymodification.png, detailspage.png, 
> initialscreen.png, subqueue.png
>
>
> MAPREDUCE-853 proposes to introduce a hierarchy of queues into the Map/Reduce 
> framework. This JIRA is for defining changes to the UI related to queues. 
> This includes the hadoop queue CLI and the web UI on the JobTracker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce

2009-08-18 Thread Giridharan Kesavan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744426#action_12744426
 ] 

Giridharan Kesavan commented on MAPREDUCE-711:
--

Updated hdfs/lib with common and mapreduce jars from rev 804918 & 805081 resp. 

Triggered a hdfs trunk build (build added to build queue, as vesta is still 
running a patch build).
http://hudson.zones.apache.org/hudson/view/Hdfs/job/Hadoop-Hdfs-trunk/52/ 

> Move Distributed Cache from Common to Map/Reduce
> 
>
> Key: MAPREDUCE-711
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-711
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Vinod K V
> Attachments: MAPREDUCE-711-20090709-common.txt, 
> MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, 
> MAPREDUCE-711-20090710.txt
>
>
> Distributed Cache logically belongs as part of map/reduce and not Common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-773) LineRecordReader can report non-zero progress while it is processing a compressed stream

2009-08-18 Thread Devaraj Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated MAPREDUCE-773:
--

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

I just committed this.

> LineRecordReader can report non-zero progress while it is processing a 
> compressed stream
> 
>
> Key: MAPREDUCE-773
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-773
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Reporter: Devaraj Das
>Assignee: Devaraj Das
> Fix For: 0.21.0
>
> Attachments: 773.2.patch, 773.3.patch, 773.patch, 773.patch
>
>
> Currently, the LineRecordReader returns 0.0 from getProgress() for most 
> inputs (since the "end" of the filesplit is set to Long.MAX_VALUE for 
> compressed inputs). This can be improved to return a non-zero progress even 
> for compressed streams (though it may not be very reflective of the actual 
> progress).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-284) Improvements to RPC between Child and TaskTracker

2009-08-18 Thread Ravi Gummadi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-284:
---

Attachment: MR-284.v1.patch

Attaching a patch that sets ipc.client.tcpnodelay to true in core-default.xml.
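
As a quick illustration of the effect (assuming the new default ships in 
core-default.xml as described), a plain Configuration would then report the flag as 
true without any per-job override:
{code}
// Illustration of the described default change, not part of the patch.
Configuration conf = new Configuration();
boolean noDelay = conf.getBoolean("ipc.client.tcpnodelay", false);
// noDelay is true once core-default.xml sets ipc.client.tcpnodelay to true
{code}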

> Improvements to RPC between Child and TaskTracker
> -
>
> Key: MAPREDUCE-284
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-284
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Arun C Murthy
>Assignee: Ravi Gummadi
> Fix For: 0.21.0
>
> Attachments: MR-284.patch, MR-284.v1.patch
>
>
> We could improve the RPC between the Child and TaskTracker:
>* Set ping interval lower by default to 5s
>* Disable nagle's algorithm (tcp no-delay)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.

2009-08-18 Thread Jothi Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744415#action_12744415
 ] 

Jothi Padmanabhan commented on MAPREDUCE-157:
-

Regarding the interface for readers, we could support two kinds of users:

# Users who want fine-grained control and would handle the individual events 
themselves. 
# Users who want a coarser-grained, summary kind of information. 

Users of type 1, who want finer-grained information, could use Event 
Readers to iterate through the events and do the necessary processing.

For users of type 2, we could provide the summary information through a 
JobHistoryParser class. This class would internally build the Job-Task-Attempt 
hierarchy/information by consuming all events using an event reader and make the 
summary information available for users to access. Users could do something 
like:

{code}

parser.init(history file or stream)

JobInfo jobInfo = parser.getJobInfo();

// use the getters to get jobinfo (example: start time, finish time, counters, 
id, user name, conf, total maps, total reds, among others)

List<TaskInfo> taskInfoList = jobInfo.getAllTasks();

// Iterate through the list and do necessary processing. Getters for taskinfo 
would include taskid, task type, status, splits, counters, etc

List<TaskAttemptInfo> attemptsList = taskInfo.getAllAttempts();

// Attempt info would have getters for attempt id, errors, status, state, start 
time, finish time, tracker name, port etc.

{code}
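
For users of type 1, the event-level iteration could look roughly like the sketch 
below; the EventReader/HistoryEvent names follow the discussion above and are 
assumptions, not a finalized API:
{code}
// Illustrative sketch only: fine-grained consumption of history events.
EventReader reader = new EventReader(fs, historyFilePath);
HistoryEvent event;
while ((event = reader.getNextEvent()) != null) {
  // switch on the event type and do the necessary processing
}
reader.close();
{code}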


Comments/Suggestions/Thoughts?

> Job History log file format is not friendly for external tools.
> ---
>
> Key: MAPREDUCE-157
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-157
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Owen O'Malley
>Assignee: Jothi Padmanabhan
>
> Currently, parsing the job history logs with external tools is very difficult 
> because of the format. The most critical problem is that newlines aren't 
> escaped in the strings. That makes using tools like grep, sed, and awk very 
> tricky.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce

2009-08-18 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744414#action_12744414
 ] 

Hemanth Yamijala commented on MAPREDUCE-711:


bq. Can you please run tests on Hudson (Giridharan could help with it I 
suppose) and commit the changes to HDFS when the tests pass.

I have already run the tests with the updated jars locally. There does not 
appear to be a way to run these off Hudson. So, we are planning to commit the 
jars and then trigger a Hudson HDFS build to make sure things work still. If 
something breaks, we will revert the commit and check again. (But given they 
pass locally, I am hoping we won't get to it).

Also, the MapReduce build failure in the tests is being tracked in 
MAPREDUCE-880 and is unrelated to this commit.

Giri, can you please commit the common and Map/Reduce jars to HDFS and trigger 
a build?

> Move Distributed Cache from Common to Map/Reduce
> 
>
> Key: MAPREDUCE-711
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-711
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Vinod K V
> Attachments: MAPREDUCE-711-20090709-common.txt, 
> MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, 
> MAPREDUCE-711-20090710.txt
>
>
> Distributed Cache logically belongs as part of map/reduce and not Common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.