reduce workload generated by JDBCStatsPublisher
-----------------------------------------------

                 Key: HIVE-2144
                 URL: https://issues.apache.org/jira/browse/HIVE-2144
             Project: Hive
          Issue Type: Improvement
            Reporter: Ning Zhang
            Assignee: Ning Zhang


In JDBCStatsPublisher, we first issue a SELECT query to see whether the 
specific ID was already inserted by another task (most likely a speculative or 
previously failed task). Depending on whether the ID is there, an UPDATE or 
INSERT query is issued. So there are basically two queries per row inserted 
into the intermediate stats table. This workload could be cut in half if we 
always insert (it is very rare that IDs are duplicated) and use a different 
SQL query in the aggregation phase to dedup the IDs (e.g., using group-by and 
max()). The benefit is that even though the aggregation query is more 
expensive, it is only run once per query. 
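
A minimal sketch of the proposed scheme, using an in-memory SQLite database so 
it is runnable on its own. The table name partition_stats, the columns id and 
row_count, and the helpers publish() and aggregate() are assumptions made for 
illustration; they are not the actual Hive schema or JDBCStatsPublisher code.

```python
import sqlite3


def publish(conn, task_id, row_count):
    # Insert unconditionally: no SELECT to check whether the ID already exists.
    conn.execute(
        "INSERT INTO partition_stats (id, row_count) VALUES (?, ?)",
        (task_id, row_count),
    )


def aggregate(conn, prefix):
    # Dedup in the aggregation phase: collapse duplicate IDs with
    # GROUP BY + MAX(), then sum the per-ID values.
    cur = conn.execute(
        "SELECT COALESCE(SUM(max_rows), 0) FROM ("
        "  SELECT id, MAX(row_count) AS max_rows"
        "  FROM partition_stats WHERE id LIKE ? GROUP BY id"
        ")",
        (prefix + "%",),
    )
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
# Hypothetical intermediate stats table; no unique constraint on id.
conn.execute("CREATE TABLE partition_stats (id TEXT, row_count INTEGER)")

publish(conn, "job1/task0", 100)
publish(conn, "job1/task0", 100)  # duplicate from a speculative task
publish(conn, "job1/task1", 50)

total = aggregate(conn, "job1/")  # duplicates counted once
```

Each published row costs a single INSERT, and the extra work of the GROUP BY 
is paid once, when the stats are aggregated.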

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
