Hi,
I am trying to get Hive to store stats after a query runs, and I am stumped.

hive.stats.autogather is *true* by default.

This is in our hive-site.xml:

  <property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/lib/hive/aux-jars/mysql-connector-java-5.1.34.jar</value>
  </property>

  <property>
    <name>hive.stats.dbclass</name>
    <value>jdbc:mysql</value>
  </property>

  <property>
    <name>hive.stats.dbconnectionstring</name>
    <value>jdbc:mysql://<hostname>:3306/<database>?useUnicode=true&amp;characterEncoding=UTF-8&amp;user=<username>&amp;password=<password></value>
  </property>

  <property>
    <name>hive.stats.jdbcdriver</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
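
To rule out a typo in the file itself, here is a minimal Python sketch (not part of Hive) that lists the hive.stats.* properties found in a hive-site.xml-style document. The embedded XML is a trimmed copy of the properties above; the <configuration> wrapper and the helper name stats_properties are my own additions for illustration, and the connection string is left out because it only holds placeholders.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the properties shown above. The <configuration> root is
# assumed here; the connection string is omitted (placeholders only).
HIVE_SITE = """<configuration>
  <property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/lib/hive/aux-jars/mysql-connector-java-5.1.34.jar</value>
  </property>
  <property>
    <name>hive.stats.dbclass</name>
    <value>jdbc:mysql</value>
  </property>
  <property>
    <name>hive.stats.jdbcdriver</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""


def stats_properties(xml_text):
    """Return {name: value} for every hive.stats.* property in the XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        # findtext returns None when the child element is missing.
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith("hive.stats."):
            props[name] = value
    return props


if __name__ == "__main__":
    for name, value in sorted(stats_properties(HIVE_SITE).items()):
        print(f"{name} = {value}")
```

This only confirms what the file contains. It says nothing about which values the running Hive session actually resolved.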

After a query runs, I see this error in the Hive shell:


*[Error 30017]: Skipping stats aggregation by error
org.apache.hadoop.hive.ql.metadata.HiveException: [Error 30000]:
StatsPublisher cannot be obtained. There was a error to retrieve the
StatsPublisher, and retrying might help. If you dont want the query to fail
because accurate statistics could not be collected, set
hive.stats.reliable=false*

I see this in the Hive logs:

*2017-09-08T17:11:41,625 ERROR [Thread-140]: ERROR
org.apache.hadoop.hive.ql.stats.StatsFactory
(StatsFactory.java:initialize(76)) - jdbc:mysql Publisher/Aggregator
classes cannot be loaded.*

*2017-09-08T17:11:51,405 ERROR [779c2413-28ee-4fb0-8bda-0dd8f8bebe04 main]:
ERROR org.apache.hadoop.hive.ql.stats.StatsFactory
(StatsFactory.java:initialize(76)) - jdbc:mysql Publisher/Aggregator
classes cannot be loaded.*

*2017-09-08T17:11:51,406 ERROR [779c2413-28ee-4fb0-8bda-0dd8f8bebe04 main]:
ERROR org.apache.hadoop.hive.ql.exec.Task
(SessionState.java:printError(1038)) - [Error 30017]: Skipping stats
aggregation by error org.apache.hadoop.hive.ql.metadata.HiveException:
[Error 30000]: StatsPublisher cannot be obtained. There was a error to
retrieve the StatsPublisher, and retrying might help. If you dont want the
query to fail because accurate statistics could not be collected, set
hive.stats.reliable=false*

I looked at the documentation at
https://cwiki.apache.org/confluence/display/Hive/StatsDev.
It says that the MySQL and HBase stats publishers are already implemented in
the Hive code and that they implement the *IStatsPublisher* and
*IStatsAggregator* interfaces. Looking at the Hive codebase, I don't see those
interfaces defined anywhere. However, I did find the *StatsPublisher* and
*StatsAggregator* interfaces. When I searched for classes that implement
*StatsPublisher*, I found only two:


*public class FSStatsPublisher implements StatsPublisher*
*public static class TFSOStatsPublisher implements StatsPublisher*

The second one is a test class, and nothing else extends the first one.

So I am not sure whether the code is missing, whether I am supposed to
implement something myself, or whether the configuration is wrong or
incomplete.


Also, on the same page (https://cwiki.apache.org/confluence/display/Hive/StatsDev),
I see this:

The user can also specify the implementation to be used for the storage of
temporary statistics setting the variable hive.stats.dbclass. For example,
to set HBase as the implementation of *temporary* statistics storage (the
default is jdbc:derby or fs, depending on the Hive version) the user should
issue the following command:

What does *temporary* mean in this context?

What is more confusing is that when I run
*DESC EXTENDED <tablename>;*

I see this:


*parameters:{transient_lastDdlTime=1504891564, totalSize=391013,
COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=6}*

Then I ran *ANALYZE TABLE <tablename> COMPUTE STATISTICS FOR COLUMNS* and
ran *DESC EXTENDED* again:

*parameters:{transient_lastDdlTime=1504891564, totalSize=391013,
COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","col2":"true","col3":"true","col4":"true"}},
numFiles=6}*

Does this mean some stats are stored?
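
To make the difference between the two outputs easier to see, here is a small Python sketch that decodes the COLUMN_STATS_ACCURATE value from each run. The two strings are copied from the DESC EXTENDED output above; the helper name summarize is hypothetical. As far as I can tell, the value only flags which statistics are marked accurate and does not contain the statistics themselves.

```python
import json

# COLUMN_STATS_ACCURATE before and after running ANALYZE ... FOR COLUMNS,
# copied from the DESC EXTENDED output above.
BEFORE = '{"BASIC_STATS":"true"}'
AFTER = (
    '{"BASIC_STATS":"true","COLUMN_STATS":'
    '{"id":"true","col2":"true","col3":"true","col4":"true"}}'
)


def summarize(raw):
    """Return (basic_flag, sorted list of columns flagged "true")."""
    parsed = json.loads(raw)
    basic = parsed.get("BASIC_STATS") == "true"
    columns = sorted(
        col
        for col, flag in parsed.get("COLUMN_STATS", {}).items()
        if flag == "true"
    )
    return basic, columns


if __name__ == "__main__":
    print("before:", summarize(BEFORE))
    print("after: ", summarize(AFTER))
```

Before the ANALYZE run no columns are flagged; afterwards all four are.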

Any help is appreciated.

Thanks.

-- 
Regards,
Premal Shah.
