[ 
https://issues.apache.org/jira/browse/TAJO-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174622#comment-14174622
 ] 

Jihoon Son commented on TAJO-1120:
----------------------------------

Hyunsik, thanks for your comment, and sorry for the confusion.

As you noted, collecting column-level statistics every time a table is stored 
incurs a large overhead.
So, it is important to enable column-level statistics collection only when it 
is necessary.
I think the query planner should be responsible for deciding when 
column-level statistics are necessary.
As you also noted, that decision logic should be implemented in multiple 
steps, and I hope it can be handled in separate issues.

In this issue, I'd like to implement only the ability to enable column-level 
statistics collection when storing a table.
I've already implemented an enforcer for this feature.
I still need to add some unit tests, but I can share it with you if you want.
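
To make the scope concrete, here is a minimal sketch of the kind of per-column 
collection I mean. It is not the actual patch and does not use Tajo's classes: 
the class name ColumnStatsSketch, its methods, and the boolean flag (standing in 
for the enforcer's decision) are all hypothetical, and it assumes a single 
column of Long values with an exact distinct count.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only; none of these names come from the Tajo code base.
final class ColumnStatsSketch {
    // Stands in for the enforcer's decision to collect column-level stats.
    private final boolean enabled;
    private long numRows;
    private long numNulls;
    private Long min;
    private Long max;
    // Exact distinct tracking; this is the memory-hungry part of the overhead.
    private final Set<Long> distinct = new HashSet<>();

    ColumnStatsSketch(boolean enabled) {
        this.enabled = enabled;
    }

    // Called once per value while rows are written to the stored table.
    void accept(Long value) {
        numRows++;
        if (!enabled) {
            return; // skip all column-level work when collection is off
        }
        if (value == null) {
            numNulls++;
            return;
        }
        if (min == null || value < min) {
            min = value;
        }
        if (max == null || value > max) {
            max = value;
        }
        distinct.add(value);
    }

    long numRows() { return numRows; }
    long numNulls() { return numNulls; }
    Long min() { return min; }
    Long max() { return max; }
    long numDistinct() { return distinct.size(); }
}
```

When the flag is off, only the row count is maintained, which is the point of 
making the collection optional.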

Sincerely,
Jihoon

> Enable collecting column stats when storing a table if necessary
> ----------------------------------------------------------------
>
>                 Key: TAJO-1120
>                 URL: https://issues.apache.org/jira/browse/TAJO-1120
>             Project: Tajo
>          Issue Type: Improvement
>          Components: catalog
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.9.1
>
>
> Currently, the number of null values and the max/min values of a column are 
> collected only in the shuffle stage.
> In addition, the number of distinct values of a column does not seem to be 
> collected anywhere.
> However, some recent issues such as TAJO-838 and TAJO-1091 require these 
> statistics, and thus we need to collect them for tables that are newly stored 
> via CTAS or INSERT INTO statements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
