[
https://issues.apache.org/jira/browse/HIVE-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762297#comment-13762297
]
Angela Li commented on HIVE-4247:
---------------------------------
Could someone please raise the priority in this? Thx!
> Filtering on a hbase row key duplicates results across multiple mappers
> -----------------------------------------------------------------------
>
> Key: HIVE-4247
> URL: https://issues.apache.org/jira/browse/HIVE-4247
> Project: Hive
> Issue Type: Bug
> Components: HBase Handler
> Affects Versions: 0.9.0
> Environment: All Platforms
> Reporter: Karthik Kumara
> Labels: patch
> Attachments: HiveHBaseTableInputFormat.patch
>
>
> Steps to reproduce
> 1. Create a Hive external table with HiveHbaseHandler with enough data in the
> hbase table to spawn multiple mappers for the hive query.
> 2. Write a query which has a filter (in the where clause) based on the hbase
> row key.
> 3. Running the map reduce job leads to each mapper querying the entire data
> set. duplicating the data for each mapper. Each mapper processes the entire
> filtered range and the results get multiplied as the number of mappers run.
> Expected behavior:
> Each mapper should process a different part of the data and should not
> duplicate.
> Cause:
> The cause seems to be the convertFilter method in HiveHBaseTableInputFormat.
> convertFilter has this piece of code which rewrites the start and the stop
> row for each split which leads each mapper to process the entire range
> if (tableSplit != null) {
> tableSplit = new TableSplit(
> tableSplit.getTableName(),
> startRow,
> stopRow,
> tableSplit.getRegionLocation());
> }
> The scan already has the start and stop row set when the splits are created.
> So this piece of code is probably redundant.
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira