[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-09-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1274#action_1274
 ] 

Hudson commented on MAPREDUCE-885:
--

Integrated in Hadoop-Mapreduce-trunk #83 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/83/])
. More efficient SQL queries for DBInputFormat. Contributed by Aaron 
Kimball.


 More efficient SQL queries for DBInputFormat
 

 Key: MAPREDUCE-885
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
 Fix For: 0.21.0

 Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, 
 MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, 
 MAPREDUCE-885.patch


 DBInputFormat generates InputSplits by counting the available rows in a 
 table and selecting subsections of the table via the LIMIT and OFFSET 
 SQL keywords. These are only meaningful in an ordered context, so the query 
 also includes an ORDER BY clause on an index column. The resulting queries 
 are often inefficient and require full table scans. Running multiple 
 mappers with these queries can lead to O(n^2) behavior in the database, where 
 n is the number of splits, so attempting to use parallelism with these 
 queries is counter-productive.
 A better mechanism is to organize splits based on the data values themselves. 
 This can be expressed in the WHERE clause, which allows index range scans of 
 tables and better exploits parallelism in the database.
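
 The contrast between the two split strategies in the description can be
 sketched as follows. This is a minimal illustration, not the code from the
 attached patches: the class SplitQuerySketch, its method names, and the
 employees/id table and column used in the usage note are assumptions, and the
 range version assumes an integer split column whose minimum and maximum are
 already known (with max >= min and at least one split requested).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: builds the two styles of per-split query as plain strings. */
final class SplitQuerySketch {

    /** Row-count style: each split is a LIMIT/OFFSET window over an ordered scan. */
    static List<String> offsetQueries(String table, String orderCol,
                                      long rowCount, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunk = rowCount / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            // The last split absorbs any remainder rows.
            long limit = (i == numSplits - 1) ? rowCount - offset : chunk;
            queries.add("SELECT * FROM " + table + " ORDER BY " + orderCol
                + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    /** Data-driven style: each split is a value range on the split column. */
    static List<String> rangeQueries(String table, String splitCol,
                                     long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = max - min + 1;
        // Never create more splits than there are candidate values.
        int n = (int) Math.min(numSplits, span);
        long step = span / n;
        for (int i = 0; i < n; i++) {
            long lo = min + i * step;
            // The last split is closed at max so remainder values are not lost.
            String upper = (i == n - 1)
                ? splitCol + " <= " + max
                : splitCol + " < " + (lo + step);
            queries.add("SELECT * FROM " + table + " WHERE " + splitCol
                + " >= " + lo + " AND " + upper);
        }
        return queries;
    }
}
```

 For example, SplitQuerySketch.offsetQueries("employees", "id", 100, 4) and
 SplitQuerySketch.rangeQueries("employees", "id", 1, 100, 4) each return four
 queries. In the first style, the query for split i typically makes the
 database order and skip the rows of all earlier splits before returning
 anything, which is where the quadratic growth comes from. In the second, each
 predicate can be answered by a range scan if the split column is indexed.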

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-09-09 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753185#action_12753185
 ] 

Aaron Kimball commented on MAPREDUCE-885:
-

I believe that the remaining findbugs warning is spurious. 
MysqlDDDBRR.executeQuery() saves a reference to the statement object that is 
created and executed; the DBRR.close() method then calls {{statement.close()}} 
later. This is the same pattern used in the rest of the DBRRs, but findbugs 
seems to have a problem with this one.
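
The pattern described above can be sketched roughly as follows. This is a 
hypothetical stand-in, not the code from the patch: the class name 
StatementHoldingReader and its constructor are assumptions made for 
illustration, and only standard java.sql interfaces are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical stand-in for a DBRR-style reader; not the patch's code. */
class StatementHoldingReader implements AutoCloseable {
    private final Connection connection;
    // Kept as a field so that close() can release it later.
    private PreparedStatement statement;

    StatementHoldingReader(Connection connection) {
        this.connection = connection;
    }

    ResultSet executeQuery(String query) throws SQLException {
        // The statement is deliberately not closed here: the caller still
        // needs to read from the ResultSet it produces.
        this.statement = connection.prepareStatement(query);
        return statement.executeQuery();
    }

    @Override
    public void close() throws SQLException {
        if (statement != null) {
            statement.close();
            statement = null; // makes a second close() a no-op
        }
    }
}
```

A static analyzer that only looks inside executeQuery() sees a statement that 
is opened and never closed in that method, which is one plausible reason for a 
warning of this kind even though close() releases the statement later.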





[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749059#action_12749059
 ] 

Hadoop QA commented on MAPREDUCE-885:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12418034/MAPREDUCE-885.5.patch
  against trunk revision 808730.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 3 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/console

This message is automatically generated.




[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-28 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749074#action_12749074
 ] 

Aaron Kimball commented on MAPREDUCE-885:
-

Failures are unrelated.




[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748655#action_12748655
 ] 

Hadoop QA commented on MAPREDUCE-885:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12417849/MAPREDUCE-885.3.patch
  against trunk revision 808730.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/532/console

This message is automatically generated.




[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-25 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747277#action_12747277
 ] 

Enis Soztutar commented on MAPREDUCE-885:
-

Data-driven splits are really neat. Just a few suggestions: 
- We can add a getSplitter(int sqlDataType) method to DDDBIF and move the SQL 
type -> DBSplitter instance mapping into it, so that classes extending DDDBIF 
can easily override this logic, for skewed data, etc. 
- Introduce DDDBRR extending DBRR in DDDBIF and move getDataBasedSelectQuery() 
there as an overridden implementation of getSelectQuery(). 
- Do we need mapred.lib.db.DDDBIF, since it is introduced as deprecated? I know 
that lots of legacy code is using the old API, but adding an already-deprecated 
class seems odd. 
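
The first suggestion might look roughly like the sketch below. Only the names 
getSplitter(int sqlDataType) and DBSplitter come from the comment above; the 
describe() method, the concrete splitter classes, and the two format classes 
are hypothetical and exist only to show how a subclass could override the 
type-to-splitter mapping. This is not the API that was eventually committed.

```java
import java.sql.Types;

/** Hypothetical splitter interface; describe() exists only for illustration. */
interface DBSplitter {
    String describe();
}

class EvenIntegerSplitter implements DBSplitter {
    public String describe() { return "even integer ranges"; }
}

class EvenTextSplitter implements DBSplitter {
    public String describe() { return "even text ranges"; }
}

class SkewAwareIntegerSplitter implements DBSplitter {
    public String describe() { return "skew-aware integer ranges"; }
}

/** Hypothetical base format: the type-to-splitter mapping lives in one method. */
class DataDrivenFormatSketch {
    protected DBSplitter getSplitter(int sqlDataType) {
        switch (sqlDataType) {
            case Types.INTEGER:
            case Types.BIGINT:
                return new EvenIntegerSplitter();
            case Types.CHAR:
            case Types.VARCHAR:
                return new EvenTextSplitter();
            default:
                throw new IllegalArgumentException(
                    "No splitter for SQL type " + sqlDataType);
        }
    }
}

/** A subclass changes the mapping for one type and defers to the base otherwise. */
class SkewAwareFormatSketch extends DataDrivenFormatSketch {
    @Override
    protected DBSplitter getSplitter(int sqlDataType) {
        if (sqlDataType == Types.INTEGER) {
            return new SkewAwareIntegerSplitter();
        }
        return super.getSplitter(sqlDataType);
    }
}
```

Keeping the mapping in a single protected method means a subclass that knows 
its data is skewed only has to replace the splitter for the affected column 
type, without touching the rest of the split-generation logic.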





[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744809#action_12744809
 ] 

Hadoop QA commented on MAPREDUCE-885:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12416936/MAPREDUCE-885.patch
  against trunk revision 805324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/490/console

This message is automatically generated.




[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

2009-08-18 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744811#action_12744811
 ] 

Aaron Kimball commented on MAPREDUCE-885:
-

I think this patch won't apply until MAPREDUCE-875 is in.
