[
https://issues.apache.org/jira/browse/HIVE-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839684#comment-13839684
]
Prasanth J commented on HIVE-5936:
----------------------------------
Even ROW_COUNT and RAW_DATA_SIZE are not reliable. The following sequence of
operations illustrates this:
{code}
hive> create table test (key string, value string);
OK
Time taken: 0.069 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 0.231 seconds
hive> desc formatted test;
OK
# col_name data_type comment
key string None
value string None
# Detailed Table Information
Database: default
Owner: pjayachandran
CreateTime: Wed Dec 04 17:31:32 PST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/tmp/warehouse/test
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 0
rawDataSize 0
totalSize 5812
transient_lastDdlTime 1386207121
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.094 seconds, Fetched: 32 row(s)
hive> drop table test;
OK
Time taken: 0.423 seconds
hive> set hive.stats.autogather=false;
hive> create table test (key string, value string);
OK
Time taken: 0.03 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
OK
Time taken: 0.097 seconds
hive> desc formatted test;
OK
# col_name data_type comment
key string None
value string None
# Detailed Table Information
Database: default
Owner: pjayachandran
CreateTime: Wed Dec 04 17:32:29 PST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/tmp/warehouse/test
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
numFiles 1
numRows -1
rawDataSize -1
totalSize 5812
transient_lastDdlTime 1386207152
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.061 seconds, Fetched: 32 row(s)
hive> set hive.stats.collect.rawdatasize=false;
hive> analyze table test compute statistics;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Listening for transport dt_socket at address: 65378
2013-12-04 17:35:55.379 java[81428:1003] Unable to load realm info from SCDynamicStore
Execution log at:
/var/folders/2w/4x52xg597k50_bt27x3_k9tw0000gn/T//pjayachandran/pjayachandran_20131204173535_82f7e5c3-0016-4a63-a89c-e07b6ed07ab4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-12-04 17:35:57,347 null map = 0%, reduce = 0%
2013-12-04 17:36:14,366 null map = 100%, reduce = 0%
Ended Job = job_local124477567_0001
Execution completed successfully
MapredLocal task succeeded
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 36.769 seconds
hive> desc formatted test;
OK
# col_name data_type comment
key string None
value string None
# Detailed Table Information
Database: default
Owner: pjayachandran
CreateTime: Wed Dec 04 17:32:29 PST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: file:/tmp/warehouse/test
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 500
rawDataSize 0
totalSize 5812
transient_lastDdlTime 1386207374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.064 seconds, Fetched: 32 row(s)
hive>
{code}
As seen above, the statistics differ depending on whether automatic stats
gathering (hive.stats.autogather) is enabled. With it enabled, LOAD DATA
leaves numRows and rawDataSize at 0; with it disabled, both are -1. After
ANALYZE with hive.stats.collect.rawdatasize=false, numRows is 500 but
rawDataSize is still 0. Also, not all SerDes support RAW_DATA_SIZE. AFAIK,
LazySimpleSerDe and ORC support RAW_DATA_SIZE. LazySimpleSerDe supports it
during both INSERT and ANALYZE, but ORC supports it only during INSERT.
Since there are multiple code paths through which stats can be updated, I do
not think RAW_DATA_SIZE and ROW_COUNT are always reliable.
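To make the point concrete, here is a minimal sketch of how a consumer of these table parameters might decide whether a stored value can be trusted. The class and method names (StatsSanityCheck, readLong, isUsable) are hypothetical and are not part of Hive; the rule it applies (treat a negative value, or a zero value on a table with non-zero totalSize, as unknown) is only an assumption drawn from the three outputs above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Hive code: decides whether a stat read from the
// table parameters looks usable, based on the cases seen in the transcript.
public class StatsSanityCheck {

    // Parse a numeric table parameter; a missing or malformed value is
    // treated as -1 (unknown).
    public static long readLong(Map<String, String> params, String key) {
        String v = params.get(key);
        if (v == null) {
            return -1L;
        }
        try {
            return Long.parseLong(v.trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // Assumed rule: -1 means "never collected"; 0 on a table whose files
    // are non-empty means the stat was not actually computed.
    public static boolean isUsable(Map<String, String> params, String key) {
        long value = readLong(params, key);
        long totalSize = readLong(params, "totalSize");
        if (value < 0) {
            return false;
        }
        if (value == 0 && totalSize > 0) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Values taken from the last "desc formatted" output above.
        Map<String, String> params = new HashMap<>();
        params.put("numRows", "500");
        params.put("rawDataSize", "0");
        params.put("totalSize", "5812");
        System.out.println("numRows usable: " + isUsable(params, "numRows"));
        System.out.println("rawDataSize usable: " + isUsable(params, "rawDataSize"));
    }
}
```

Note that COLUMN_STATS_ACCURATE is deliberately not consulted in this sketch: in the first output it is true while numRows is 0, so it does not appear to be a sufficient signal on its own.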
The following code segment is removed in HIVE-5921:
{code}
if (nr < 0) {
  nr = 0;
}
{code}
Instead, if ROW_COUNT is <= 0, the number of rows is estimated from the
average row size computed from the schema:
{code}
if (nr <= 0) {
  nr = 0;
  int avgRowSize = estimateRowSizeFromSchema(conf, schema, neededColumns);
  if (avgRowSize > 0) {
    nr = ds / avgRowSize;
  }
}
{code}
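For illustration, the fallback above can be written as a small standalone method. This is a sketch under stated assumptions, not the Hive implementation: it assumes ds is the data size in bytes, takes avgRowSize as a plain parameter instead of calling estimateRowSizeFromSchema, and the class name RowCountEstimate and the 200-byte row size in main are made up for the example.

```java
// Standalone sketch of the fallback logic; names other than nr, ds and
// avgRowSize are hypothetical.
public class RowCountEstimate {

    // nr:         row count read from the metastore (may be 0 or -1)
    // ds:         data size in bytes (assumed meaning of "ds")
    // avgRowSize: average row size in bytes, estimated from the schema
    public static long estimateRowCount(long nr, long ds, int avgRowSize) {
        if (nr <= 0) {
            nr = 0;
            if (avgRowSize > 0) {
                // Integer division: the estimate is rounded down.
                nr = ds / avgRowSize;
            }
        }
        return nr;
    }

    public static void main(String[] args) {
        long totalSize = 5812L;     // totalSize of the test table above
        int assumedRowSize = 200;   // hypothetical schema-based estimate

        // numRows = -1 (autogather disabled): falls back to the estimate.
        System.out.println(estimateRowCount(-1L, totalSize, assumedRowSize));
        // numRows = 500 (after ANALYZE): the stored value is kept.
        System.out.println(estimateRowCount(500L, totalSize, assumedRowSize));
    }
}
```

The estimate is only as good as the schema-based row size. For a table of short strings like the one above, it can be far from the 500 rows that ANALYZE reports.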
There is another subtask, HIVE-5949, which will add a flag indicating whether
the statistics are accurate (all statistics come from the metastore) or
estimated.
> analyze command failing to collect stats with counter mechanism
> ---------------------------------------------------------------
>
> Key: HIVE-5936
> URL: https://issues.apache.org/jira/browse/HIVE-5936
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Affects Versions: 0.13.0
> Reporter: Ashutosh Chauhan
> Assignee: Navis
> Attachments: HIVE-5936.1.patch.txt, HIVE-5936.2.patch.txt
>
>
> With counter mechanism, MR job is successful, but StatsTask on client fails
> with NPE.
--
This message was sent by Atlassian JIRA
(v6.1#6144)