[ 
https://issues.apache.org/jira/browse/HIVE-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882153#comment-13882153
 ] 

Lefty Leverenz commented on HIVE-6300:
--------------------------------------

Good detailed descriptions.  Just some nit-picks and a few points of confusion: 

# Please limit the line lengths to 100 chars.  (hive-default.xml.template is 
far from perfect on this convention, but I'm planning to tidy it up someday.)
# "hive/tez" should be Hive/Tez and "java" should be Java in these descriptions:
#* hive.stats.max.variable.length
#* hive.stats.list.num.entries
#* hive.stats.map.num.entries
# In hive.stats.map.parallelism description:
#* "through each of the operator" should be "operators" or "through each 
operator" 
#* "Some operators like GROUPBY, generates more number of rows that corresponds 
to the number of mappers." -- omit the comma and make "generates" plural 
("generate").  Also, I'm not sure what you mean by "more number of rows that 
corresponds to the number of mappers" -- what's the correspondence: do more 
rows mean more parallelism?  At first I thought "that" should be "than" but now 
I don't know.  The comment in HiveConf.java is simpler:  "to accurately compute 
statistics for GROUPBY map side parallelism needs to be known".
#* "hive" should be Hive
# In hive.stats.fetch.column.stats description, "for each needed columns" 
should be "column" and "when the number of columns are high" should be "is 
high".  Also, why does the comment in HiveConf.java mention partitions too?  
Maybe it's left over from previous behavior, before 
hive.stats.fetch.partition.stats was created:
#* +    // statistics annotation fetches column statistics for all required columns and for all
+    // required partitions which can be very expensive sometimes
# In hive.stats.fetch.partition.stats description, "paritition" should be 
"partition" and "when the number of partitions are high" should be "is high".  
Also, does this information mean the same as what's in HiveConf.java?
#* "When this flag is disabled, Hive will make calls to filesystem to get file 
sizes and will estimate the number of rows from row schema."
#* HiveConf.java:  "basic sizes being fetched from namenode"
# In hive.stats.avg.row.size description:
#* again, "through each of the operator" should be "operators" or "through each 
operator"
#* "LIMIT operator (which knows the number of rows) will use this value to 
estimate the size of data flowing through LIMIT operator" left me wondering 
what's done to estimate data flowing through other operators.  (I now realize 
those are estimated using other configs, but isn't it the optimizer that uses 
this value, not the LIMIT operator?)  Also, this description doesn't seem 
to match what's in HiveConf.java -- "average row size will be used to estimate 
the number of rows/data size" -- is number of rows known or not?
# In hive.stats.join.factor description:
#* again, "through each of the operator" should be "operators" or "through each 
operator"
#* by the way, in HiveConf.java the comment is slightly garbled:  "in the 
absence of column statistics, the estimated number of rows/data size that will 
<be> emitted from join operator will depend on t <this> factor"
# In hive.stats.deserialization.factor description:
#* again, "through each of the operator" should be "operators" or "through each 
operator"
#* "Since files in table/partitions are ..." should be "tables/partitions" 
(micro-nit) 

Whew.  Sorry about the number of nits.  If you like, I can make these changes 
in a temporary patch and let you remove the ones you don't like and clear up 
confusions in a third patch.

> Add documentation for stats configs to hive-default.xml.template
> ----------------------------------------------------------------
>
>                 Key: HIVE-6300
>                 URL: https://issues.apache.org/jira/browse/HIVE-6300
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor, Statistics
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>            Priority: Minor
>             Fix For: 0.13.0
>
>         Attachments: HIVE-6300.1.patch
>
>
> Add documentation for the following configs
> hive.stats.max.variable.length
> hive.stats.list.num.entries
> hive.stats.map.num.entries
> hive.stats.map.parallelism
> hive.stats.fetch.column.stats
> hive.stats.avg.row.size
> hive.stats.join.factor
> hive.stats.deserialization.factor
> hive.stats.fetch.partition.stats



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
