[ 
https://issues.apache.org/jira/browse/HIVE-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148017#comment-17148017
 ] 

Zoltan Haindrich commented on HIVE-23776:
-----------------------------------------

I'm not sure if it was clear; basic stats are currently composed of:
* collected stuff: numRows, rawDataSize
* "quickstats" stuff: numFiles, totalSize; this ticket is only about these

>  There is stats annotation fallback which relies on this.
The stats code also has its 
[own|https://github.com/apache/hive/blob/6440d93981e6d6aab59ecf2e77ffa45cd84d47de/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205]
  file scanning machinery, so it could easily live without it. In most cases 
the compiler will rely on columnstats-based info anyway, so I don't see any 
problem here.

Right now I don't know why workload management would use this kind of info (it 
may also resort to file scanning if it really needs this).

I've done a quick check, and neither NUM_FILES nor TOTAL_SIZE was used anywhere 
that looked problematic...
but I guess if I start removing them, the code and the tests will tell whether 
they can be removed.
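
For illustration only, here is a minimal sketch of what computing numFiles and 
totalSize on the fly by scanning a directory could look like. It is not Hive 
code: the class and method names (QuickStatsSketch, scan) are hypothetical, and 
it uses java.nio.file against a local directory as a stand-in for the Hadoop 
FileSystem API that Hive would actually go through.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: derive the "quickstats" values by scanning files on demand
// instead of storing them after every insert.
class QuickStatsSketch {

    // Hypothetical holder for the two values discussed above.
    static final class QuickStats {
        final long numFiles;
        final long totalSize;

        QuickStats(long numFiles, long totalSize) {
            this.numFiles = numFiles;
            this.totalSize = totalSize;
        }
    }

    // Walks the directory tree and sums up the count and on-disk size
    // of all regular files found under it.
    static QuickStats scan(Path tableDir) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(tableDir)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        long totalSize = 0;
        for (Path f : files) {
            totalSize += Files.size(f);
        }
        return new QuickStats(files.size(), totalSize);
    }
}
```

The point of the sketch is only that both values are cheap to derive when 
someone asks for them, so they do not have to be collected on every write.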

> Retire quickstats autocollection
> --------------------------------
>
>                 Key: HIVE-23776
>                 URL: https://issues.apache.org/jira/browse/HIVE-23776
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>
> this is about:
> * num files
> * datasize (sum of filesizes)
> * num erasure coded files
> right now these are scanned during every BasicStatsTask execution, which 
> means some filesystem reads etc.; for small inserts these are noticeable when 
> the fs is a bit slower (s3 and friends)
> I don't think they are really in use... we rely more on columnstats, which are 
> more accurate; also, the datasize in this case is the "offline" (on-disk) 
> size, while we should instead calculate with "online" sizes...
> proposal:
> * remove collection and storage of this data
> * collect it on the fly during "desc formatted" statements to provide them 
> for informational purposes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
