[ https://issues.apache.org/jira/browse/HIVE-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148017#comment-17148017 ]
Zoltan Haindrich commented on HIVE-23776:
-----------------------------------------

I'm not sure if it was clear; basicstats is currently composed of:
* collected stats: numRows, rawDataSize
* "quickstats": numFiles, totalSize; this ticket is about these

> There is stats annotation fallback which relies on this.

Stats also has its [own|https://github.com/apache/hive/blob/6440d93981e6d6aab59ecf2e77ffa45cd84d47de/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205] file scanning machinery, so it could easily live without it. In most cases the compiler will rely on columnstats-based info anyway, so I don't see any problem here.

Right now I don't know why workload management would use this kind of info (it may also resort to file scanning if it really needs this).

I've done a quick check, and neither NUM_FILES nor TOTAL_SIZE was used anywhere that looked problematic... but I guess if I start removing them, the code and the tests will tell whether they can be removed.


> Retire quickstats autocollection
> --------------------------------
>
>                 Key: HIVE-23776
>                 URL: https://issues.apache.org/jira/browse/HIVE-23776
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>
> this is about:
> * num files
> * datasize (sum of filesizes)
> * num erasure coded files
> right now these are scanned during every BasicStatsTask execution - which
> means some filesystem reads/etc - for small inserts these are visible in case
> the fs is a bit slower (s3 and friends)
> I don't think they are really in use...we rely more on columnstats, which are
> more accurate; also, the datasize in this case is the "offline" (on-disk)
> size, while we should instead be calculating with "online" sizes...
> proposal:
> * remove collection and storage of this data
> * collect it on the fly during "desc formatted" statements to provide them
> for informational purposes
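
As a rough illustration of the "collect it on the fly" proposal above, here is a minimal Python sketch (not Hive code) that computes numFiles and totalSize for a table directory on demand by walking the filesystem, instead of storing them at write time. The helper names `quick_stats` and `demo`, and the local directory layout, are assumptions made for illustration only; Hive itself would go through the Hadoop FileSystem API rather than the local filesystem.

```python
import os
import tempfile


def quick_stats(table_dir):
    """Hypothetical helper: count files and sum their on-disk sizes."""
    num_files = 0
    total_size = 0
    # Walk the table directory recursively so partition subdirectories count too.
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            num_files += 1
            total_size += os.path.getsize(os.path.join(root, name))
    return {"numFiles": num_files, "totalSize": total_size}


def demo():
    """Build a throwaway table layout and compute its stats on demand."""
    with tempfile.TemporaryDirectory() as table_dir:
        part = os.path.join(table_dir, "part=1")
        os.makedirs(part)
        with open(os.path.join(part, "000000_0"), "wb") as f:
            f.write(b"x" * 10)
        with open(os.path.join(table_dir, "000001_0"), "wb") as f:
            f.write(b"y" * 5)
        return quick_stats(table_dir)


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the trade-off in the proposal: the filesystem scan is paid only when someone asks for the numbers, rather than on every insert.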