[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ning Zhang reassigned HIVE-2144:
--------------------------------

    Assignee: Tomasz Nykiel  (was: Ning Zhang)

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first run a SELECT query to see if the specific ID
> was already inserted by another task (most likely a speculative or previously
> failed task). Depending on whether the ID is there, an INSERT or UPDATE query
> is issued. So there are basically two queries per row inserted into the
> intermediate stats table. This workload could be halved if we insert the row
> anyway (it is very rare that IDs are duplicated) and use a different SQL
> query in the aggregation phase to dedup the ID (e.g., using group-by and
> max()). The benefit is that, even though the aggregation query is more
> expensive, it runs only once per query.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
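
The idea in the description above can be sketched roughly as follows. This is only an illustration, not the JDBCStatsPublisher code: it uses Python's sqlite3 module in place of the real JDBC stats database, and the table name stats_tmp, the columns id and row_count, the ID format, and the functions publish and aggregate are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical intermediate stats table (names are illustrative, not Hive's schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats_tmp (id TEXT, row_count INTEGER)")


def publish(conn, stat_id, row_count):
    """Insert unconditionally: one query per row, no SELECT-then-INSERT/UPDATE."""
    conn.execute(
        "INSERT INTO stats_tmp (id, row_count) VALUES (?, ?)",
        (stat_id, row_count),
    )


def aggregate(conn, id_prefix):
    """Dedup repeated IDs with GROUP BY + MAX(), then sum over all tasks."""
    query = """
        SELECT COALESCE(SUM(deduped.row_count), 0)
        FROM (
            SELECT id, MAX(row_count) AS row_count
            FROM stats_tmp
            WHERE id LIKE ?
            GROUP BY id
        ) AS deduped
    """
    return conn.execute(query, (id_prefix + "%",)).fetchone()[0]


# Two tasks publish stats; the second ID is written twice, as a
# speculative or retried task attempt would do.
publish(conn, "job1/task_000", 100)
publish(conn, "job1/task_001", 250)
publish(conn, "job1/task_001", 250)

total = aggregate(conn, "job1/")
```

Here the publishing side issues exactly one statement per row, and the extra cost of handling duplicates moves into the single aggregation query. Taking MAX() per ID is one possible dedup rule; whether it is the right one depends on how duplicate attempts can differ.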