[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated MAPREDUCE-885: Resolution: Fixed Fix Version/s: 0.21.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) I've committed this. Thanks Aaron! > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Fix For: 0.21.0 > > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, > MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Open (was: Patch Available) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, > MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Patch Available (was: Open) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, > MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Attachment: MAPREDUCE-885.6.patch Attaching patch that addresses findbugs warnings > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, > MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Attachment: MAPREDUCE-885.5.patch Just realized that the previous patches forgot to include changes to {{src/test/}} -- adding now. > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Patch Available (was: Open) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Assignee: zhuweimin (was: Aaron Kimball) Status: Patch Available (was: Open) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: zhuweimin > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Open (was: Patch Available) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: zhuweimin > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Open (was: Patch Available) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Attachment: MAPREDUCE-885.4.patch new patch resync'd with trunk. > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.4.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Patch Available (was: Open) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Attachment: MAPREDUCE-885.3.patch Attaching new patch which reverts incidental whitespace changes in OracleDBRR. > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, > MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Open (was: Patch Available) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Attachment: MAPREDUCE-885.2.patch Attaching new patch that incorporates Enis's comments. Enis -- I've made the change to add DDDBRR extending DBRR and added a MySQLDDDBRR on top of that after all. I think it's cleaner this way after all. > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Status: Patch Available (was: Open) > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
[ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Kimball updated MAPREDUCE-885: Attachment: MAPREDUCE-885.patch This patch introduces DataDrivenDBInputFormat. This class extends DBInputFormat and reuses much of its common logic (e.g., setting up and tearing down connections, configuration, DBWritable, etc). But it adds a DataDrivenDBInputSplit class which splits queries based on data values, e.g. {{"id >= 10 AND id < 20"}} for one split, and {{"id >= 20 AND id < 30"}} for the next one. The resulting queries run significantly faster and parallelise properly. Instead of requiring a counting query like DBInputFormat, this InputFormat requires a query that returns the min and max values of the split column on the data to import. DataDrivenDBInputSplit is a subclass of DBInputSplit; the original DBRecordReader family of classes has been modified to discriminate between the new InputSplit class vs. the old one; if it detects a new one, it submits the newer WHERE-based query rather than the LIMIT/OFFSET-based query to the database. The min and max values of the column are used to generate splits via linear interpolation between the values. A DBSplitter interface has been added, which takes the min and max values for the column, as well as the number of splits to use. It then generates about this many splits, which subdivide the range of values into roughly-even intervals. Several DBSplitter implementations are provided which are applicable to different data types. For example, there is an IntegerSplitter which can split INTEGER, BIGINT, TINYINT, LONG, etc. columns. The FloatSplitter implementation works on DECIMAL, NUMBER, and REAL datatypes. A TextSplitter implementation is provided, but its utility is database-dependent. Databases may choose to sort strings via a number of algorithms (e.g., case-sensitive vs. case-insensitive). The TextSplitter assumes that strings are sorted in Unicode codepoint order. (e.g., "AAA" < "BBB" < "aaa".) A warning will be logged if the TextSplitter is used. Explicit tests have been added for some of the splitters. Sqoop has been modified to use the new InputFormat with encouraging performance results. Sqoop's existing regression test suite exercises the code paths for all the splitters and isolated several bugs which were fixed prior to submitting this patch. I will post the Sqoop patch separately after this JIRA issue is committed. > More efficient SQL queries for DBInputFormat > > > Key: MAPREDUCE-885 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Aaron Kimball >Assignee: Aaron Kimball > Attachments: MAPREDUCE-885.patch > > > DBInputFormat generates InputSplits by counting the available rows in a > table, and selecting subsections of the table via the "LIMIT" and "OFFSET" > SQL keywords. These are only meaningful in an ordered context, so the query > also includes an "ORDER BY" clause on an index column. The resulting queries > are often inefficient and require full table scans. Actually using multiple > mappers with these queries can lead to O(n^2) behavior in the database, where > n is the number of splits. Attempting to use parallelism with these queries > is counter-productive. > A better mechanism is to organize splits based on data values themselves, > which can be performed in the WHERE clause, allowing for index range scans of > tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.