[
https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036707#comment-13036707
]
[email protected] commented on HIVE-2144:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review696
-----------------------------------------------------------
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1395>
Here we need to catch SQLRecoverableException and retry.
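A minimal sketch of what such a retry could look like, assuming a small wrapper around the JDBC call. RetryingSql, SqlAction, withRetry, maxRetries and baseWaitMs are hypothetical names used for illustration and are not part of the patch:

```java
import java.sql.SQLException;
import java.sql.SQLRecoverableException;

/** Hypothetical helper for illustration; not part of JDBCStatsPublisher. */
final class RetryingSql {

    /** A JDBC action that may fail with an SQLException. */
    @FunctionalInterface
    interface SqlAction<T> {
        T run() throws SQLException;
    }

    /**
     * Runs the action and retries it only when it fails with
     * SQLRecoverableException. maxRetries and baseWaitMs are
     * illustrative parameters, not existing Hive settings.
     */
    static <T> T withRetry(SqlAction<T> action, int maxRetries, long baseWaitMs)
            throws SQLException {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.run();
            } catch (SQLRecoverableException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, surface the failure
                }
                try {
                    // Linear back-off before the next attempt.
                    Thread.sleep(baseWaitMs * (attempt + 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new SQLException("interrupted while waiting to retry", ie);
                }
            }
        }
    }
}
```

Any other SQLException is propagated immediately, so only failures the driver reports as recoverable are retried.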
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1396>
If these parameters are present in conf/hive-default.xml, you don't need to set
them again here, since new JobConf() should read them from hive-default.xml.
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1397>
The usual use case for aggregateStats() is that the key is the
prefix (e.g., file_000) of the strings inserted by publishStats, so that all
keys that match the prefix are aggregated.
Can you add one more test for aggregateStats('file_000')?
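To make the prefix semantics concrete, here is a small in-memory model. It is only a sketch: PrefixAggregationSketch and its methods are hypothetical stand-ins and do not reproduce the real publisher or aggregator signatures.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the stats table; hypothetical, for illustration only. */
final class PrefixAggregationSketch {
    private final Map<String, Long> rowCounts = new HashMap<>();

    /** Records a row count under a full key, replacing any earlier value. */
    void publish(String key, long rowCount) {
        rowCounts.put(key, rowCount);
    }

    /** Sums the row counts of every key that starts with the given prefix. */
    long aggregate(String prefix) {
        long total = 0;
        for (Map.Entry<String, Long> entry : rowCounts.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                total += entry.getValue();
            }
        }
        return total;
    }
}
```

With keys file_0000 and file_0001 published, aggregating on file_000 covers both, while aggregating on a full key covers only that one entry.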
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1398>
won't this also change the stats at the 2nd publishStat()?
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1399>
Also add another aggregateStats() test for the prefix.
- Ning
On 2011-05-19 23:14:26, Tomasz Nykiel wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/765/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-05-19 23:14:26)
bq.
bq.
bq. Review request for hive.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics: the first checks whether the ID was already inserted by another task, and the second inserts a new row or updates the existing one.
bq. The update case occurs very rarely, since duplicates most likely originate from speculative or failed tasks.
bq.
bq. Currently the schema of the stat table is the following:
bq.
bq. PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ), with no integrity constraints declared.
bq.
bq. We amend it to:
bq.
bq. PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
bq.
bq. HIVE-2144 improves performance by greedily executing the INSERT statement.
bq. Instead of executing two queries per inserted row, we then execute a single INSERT query.
bq. In the case of a primary key constraint violation, we perform a single UPDATE query.
bq. The UPDATE query needs to check whether the currently inserted stats are "newer" than the ones already in the table.
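The insert-first strategy described in the summary can be sketched roughly as follows. This is not the patch itself: InsertFirstSketch and publish are hypothetical names, "newer" is modelled here as a larger ROW_COUNT purely as an assumption, and the sketch also assumes the driver reports a duplicate key as SQLIntegrityConstraintViolationException (some drivers only signal it through the SQLState).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

/** Hypothetical sketch of the insert-first strategy; not the actual patch. */
final class InsertFirstSketch {
    static final String INSERT_SQL =
        "INSERT INTO PARTITION_STAT_TABLE (ID, ROW_COUNT) VALUES (?, ?)";
    // Assumption: "newer" stats are modelled as a larger ROW_COUNT.
    static final String UPDATE_SQL =
        "UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ? WHERE ID = ? AND ROW_COUNT < ?";

    /** Returns the number of statements executed: 1 normally, 2 on a duplicate ID. */
    static int publish(Connection conn, String id, long rowCount) throws SQLException {
        try (PreparedStatement insert = conn.prepareStatement(INSERT_SQL)) {
            insert.setString(1, id);
            insert.setLong(2, rowCount);
            insert.executeUpdate();
            return 1; // common case: a single INSERT, no preceding SELECT
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            // Rare case: the ID already exists, fall through to the UPDATE.
        }
        try (PreparedStatement update = conn.prepareStatement(UPDATE_SQL)) {
            update.setLong(1, rowCount);
            update.setString(2, id);
            update.setLong(3, rowCount);
            update.executeUpdate();
            return 2;
        }
    }
}
```

The primary key on ID is what makes the second insert fail, so the sketch only behaves this way once the constraint exists on the table.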
bq.
bq.
bq. This addresses bug HIVE-2144.
bq. https://issues.apache.org/jira/browse/HIVE-2144
bq.
bq.
bq. Diffs
bq. -----
bq.
bq.   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140
bq.   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION
bq.
bq. Diff: https://reviews.apache.org/r/765/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. TestStatsPublisher JUnit test:
bq. - basic behaviour
bq. - multiple updates
bq. - cleanup of the statistics table after aggregation
bq.
bq. Standalone testing on the cluster.
bq. - insert/analyze queries over non-partitioned/partitioned tables
bq.
bq. NOTE. For correct behaviour, either the primary key index needs to be created or the PARTITION_STAT_TABLE table dropped, which triggers creation of the table with the constraint declared.
bq.
bq.
bq. Thanks,
bq.
bq. Tomasz
bq.
bq.
> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
> Key: HIVE-2144
> URL: https://issues.apache.org/jira/browse/HIVE-2144
> Project: Hive
> Issue Type: Improvement
> Reporter: Ning Zhang
> Assignee: Tomasz Nykiel
> Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID
> was inserted by another task (most likely a speculative or previously
> failed task). Depending on whether the ID is there, an INSERT or UPDATE query is
> issued. So there are basically two queries per row inserted into the
> intermediate stats table. This workload could be halved if we insert
> the row anyway (it is very rare that IDs are duplicated) and use a different SQL
> query in the aggregation phase to dedup the IDs (e.g., using group-by and
> max()). The benefit is that even though the aggregation query is more
> expensive, it is only run once per query.
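The dedup-at-aggregation idea from the description can be illustrated with an in-memory equivalent of group-by and max(). This is a sketch only: DedupAggregationSketch and StatRow are hypothetical names, and the SQL in the comment is an assumed shape of such a query rather than anything taken from the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of deduplicating at aggregation time. */
final class DedupAggregationSketch {

    /** One row of the intermediate stats table; duplicate IDs are allowed. */
    static final class StatRow {
        final String id;
        final long rowCount;

        StatRow(String id, long rowCount) {
            this.id = id;
            this.rowCount = rowCount;
        }
    }

    /**
     * Keeps the largest ROW_COUNT per ID, then sums the survivors.
     * Assumed SQL shape of the same idea:
     *   SELECT SUM(M) FROM
     *     (SELECT MAX(ROW_COUNT) AS M FROM PARTITION_STAT_TABLE GROUP BY ID) T
     */
    static long aggregate(List<StatRow> rows) {
        Map<String, Long> maxPerId = new HashMap<>();
        for (StatRow row : rows) {
            maxPerId.merge(row.id, row.rowCount, Long::max); // group-by ID, max()
        }
        long total = 0;
        for (long value : maxPerId.values()) {
            total += value;
        }
        return total;
    }
}
```

The per-row cost stays at one blind INSERT, and the extra grouping work is paid once when the statistics are aggregated.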