[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1274#action_1274 ]

Hudson commented on MAPREDUCE-885:

Integrated in Hadoop-Mapreduce-trunk #83 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/83/])
More efficient SQL queries for DBInputFormat. Contributed by Aaron Kimball.

More efficient SQL queries for DBInputFormat

Key: MAPREDUCE-885
URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Fix For: 0.21.0
Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, MAPREDUCE-885.patch

DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the LIMIT and OFFSET SQL keywords. These are only meaningful in an ordered context, so the query also includes an ORDER BY clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.

A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
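The contrast drawn in the issue description above can be sketched as follows. This is a minimal illustration, not the code from the attached patches: the class `SplitQueries`, its method names, and the table and column names are all hypothetical, and the split column is assumed to be an indexed integer. With the LIMIT/OFFSET form, a database that implements OFFSET by scanning and discarding rows reads further into the table for each later split, which is one way the O(n^2) behavior can arise; with the range form, each split can be served by an index range scan.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Hadoop: builds both styles of per-split query. */
final class SplitQueries {

    /** Row-count style: each split is a LIMIT/OFFSET window over an ordered scan. */
    static List<String> byOffset(String table, String orderCol, long rowCount, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunk = rowCount / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            // The last split absorbs any remainder rows.
            long length = (i == numSplits - 1) ? rowCount - offset : chunk;
            queries.add("SELECT * FROM " + table + " ORDER BY " + orderCol
                + " LIMIT " + length + " OFFSET " + offset);
        }
        return queries;
    }

    /** Data-driven style: each split is a half-open value range [lo, hi) on the split column. */
    static List<String> byRange(String table, String splitCol, long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long step = Math.max(1, (max - min + 1) / numSplits);
        long lo = min;
        for (int i = 0; i < numSplits && lo <= max; i++) {
            // The last split extends to just past the maximum value.
            long hi = (i == numSplits - 1) ? max + 1 : Math.min(lo + step, max + 1);
            queries.add("SELECT * FROM " + table + " WHERE " + splitCol + " >= " + lo
                + " AND " + splitCol + " < " + hi);
            lo = hi;
        }
        return queries;
    }
}
```

In this sketch the minimum and maximum would come from a single `SELECT MIN(col), MAX(col)` query issued before the splits are computed, which is an assumption about how the bounds are obtained. Identifiers are concatenated into the SQL text only for brevity.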
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753185#action_12753185 ]

Aaron Kimball commented on MAPREDUCE-885:

I believe that the remaining findbugs warning is spurious. MysqlDDDBRR.executeQuery() saves a reference to the statement object that it creates and executes; the DBRR.close() method then calls {{statement.close()}} later. This is the same pattern used in the rest of the DBRRs, but findbugs seems to have a problem with this one.
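The ownership pattern described in Aaron Kimball's comment about the findbugs warning might look roughly like the sketch below. The class `StatementOwningReader` is a hypothetical stand-in written against plain JDBC interfaces; it is not the record reader from the patch. The point it illustrates is that the statement is deliberately left open when `executeQuery()` returns, because the returned ResultSet depends on it, and is released later in `close()`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical stand-in for a record reader; names are illustrative, not Hadoop's. */
final class StatementOwningReader implements AutoCloseable {
    private final Connection connection;
    // Kept as a field so close() can release it once the caller is done with the ResultSet.
    private PreparedStatement statement;

    StatementOwningReader(Connection connection) {
        this.connection = connection;
    }

    /** Creates and runs the statement without closing it; the ResultSet still needs it. */
    ResultSet executeQuery(String query) throws SQLException {
        statement = connection.prepareStatement(query);
        return statement.executeQuery();
    }

    /** Releases the statement opened by executeQuery(), if there is one. */
    @Override
    public void close() throws SQLException {
        if (statement != null) {
            statement.close();
            statement = null;
        }
    }
}
```

A static analyzer that only inspects `executeQuery()` sees a statement opened and never closed in that method, which is a plausible reason for a warning even though the resource is released elsewhere.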
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749059#action_12749059 ]

Hadoop QA commented on MAPREDUCE-885:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12418034/MAPREDUCE-885.5.patch
against trunk revision 808730.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 3 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/535/console

This message is automatically generated.
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749074#action_12749074 ]

Aaron Kimball commented on MAPREDUCE-885:

Failures are unrelated.
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748655#action_12748655 ]

Hadoop QA commented on MAPREDUCE-885:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12417849/MAPREDUCE-885.3.patch
against trunk revision 808730.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/532/console

This message is automatically generated.
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747277#action_12747277 ]

Enis Soztutar commented on MAPREDUCE-885:

Data-driven splits are really neat. Just a few suggestions:
- We can add a getSplitter(int sqlDataType) method to DDDBIF and move the mapping from SQL type to DBSplitter instance into it, so that classes extending it can easily override this logic (for skewed data, etc.).
- Introduce DDDBRR extending DBRR in DDDBIF, and move getDataBasedSelectQuery() into it as an overridden implementation of getSelectQuery().
- Do we need mapred.lib.db.DDDBIF, since it is introduced as deprecated? I know that lots of legacy code is using the old API, but adding an already deprecated class seems odd.
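Enis Soztutar's first suggestion, an overridable getSplitter(int sqlDataType) hook, could be sketched along these lines. Every type here (`SketchSplitter`, `SketchDataDrivenFormat`, `SkewAwareFormat`) is hypothetical and stands in for the real classes only to show the shape of the idea; the strings returned by `describe()` are placeholders for actual splitting strategies.

```java
import java.sql.Types;

/** Hypothetical stand-in for a splitter strategy; not Hadoop's DBSplitter interface. */
interface SketchSplitter {
    String describe();
}

/** Hypothetical base input format that owns the mapping from SQL type to splitter. */
class SketchDataDrivenFormat {
    /** Subclasses override this to change how a given column type is split. */
    protected SketchSplitter getSplitter(int sqlDataType) {
        switch (sqlDataType) {
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                return () -> "even integer ranges";
            case Types.CHAR:
            case Types.VARCHAR:
                return () -> "even text prefix ranges";
            default:
                throw new IllegalArgumentException("No splitter for SQL type " + sqlDataType);
        }
    }
}

/** Hypothetical subclass for a table whose integer keys are heavily skewed. */
class SkewAwareFormat extends SketchDataDrivenFormat {
    @Override
    protected SketchSplitter getSplitter(int sqlDataType) {
        if (sqlDataType == Types.INTEGER || sqlDataType == Types.BIGINT) {
            return () -> "ranges sized from a key histogram";
        }
        // Every other type keeps the default mapping.
        return super.getSplitter(sqlDataType);
    }
}
```

Keeping the type dispatch in one protected method means a subclass can replace the strategy for a single column type without copying the rest of the split computation.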
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744809#action_12744809 ]

Hadoop QA commented on MAPREDUCE-885:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12416936/MAPREDUCE-885.patch
against trunk revision 805324.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/490/console

This message is automatically generated.
[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744811#action_12744811 ]

Aaron Kimball commented on MAPREDUCE-885:

I think this patch won't apply until MAPREDUCE-875 is in.