[ https://issues.apache.org/jira/browse/TAJO-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174622#comment-14174622 ]
Jihoon Son commented on TAJO-1120:
----------------------------------
Hyunsik, thanks for your comment, and sorry for the confusion.
As you noted, collecting column-level statistics every time a table is stored
incurs a large overhead.
So, it is important to enable column-level statistics collection only when it
is necessary.
I think that the query planner is responsible for deciding when
column-level statistics are necessary.
As you also noted, deciding when column-level statistics are necessary
should be implemented in multiple steps, and I hope that this part can be
handled in separate issues.
In this issue, I'd like to implement only the ability to enable column-level
statistics collection when storing a table.
Actually, I've already implemented an enforcer for this feature.
Although I still need to add some unit tests, I can share it with you if you'd
like.
Sincerely,
Jihoon
> Enable collecting column stats when storing a table if necessary
> ----------------------------------------------------------------
>
> Key: TAJO-1120
> URL: https://issues.apache.org/jira/browse/TAJO-1120
> Project: Tajo
> Issue Type: Improvement
> Components: catalog
> Reporter: Jihoon Son
> Assignee: Jihoon Son
> Fix For: 0.9.1
>
>
> Currently, the number of null values and the max/min values of a column are
> collected only in the shuffle stage.
> In addition, the number of distinct values of a column does not seem to be
> collected anywhere.
> However, some recent issues such as TAJO-838 and TAJO-1091 require these
> statistics, and thus we need to collect them for tables that are newly stored
> via CTAS or INSERT INTO statements.
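
For illustration only, the statistics named in the description (null count, min/max, and number of distinct values) could be gathered while rows are written, behind a switch that is off by default. This is a rough Python sketch, not Tajo's Java code: `ColumnStats`, `store_table`, and the `collect_column_stats` flag are hypothetical names, and the exact distinct count via a set is an assumption that a real system would probably replace with an approximate sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ColumnStats:
    """Running statistics for one column (hypothetical structure)."""
    num_nulls: int = 0
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    # Exact distinct tracking; assumed here only to keep the sketch simple.
    distinct: set = field(default_factory=set)

    def update(self, value):
        # Nulls are counted but excluded from min/max/distinct.
        if value is None:
            self.num_nulls += 1
            return
        if self.min_value is None or value < self.min_value:
            self.min_value = value
        if self.max_value is None or value > self.max_value:
            self.max_value = value
        self.distinct.add(value)

    @property
    def num_distinct(self):
        return len(self.distinct)


def store_table(rows, num_columns, collect_column_stats=False):
    """Pretend to store rows; gather per-column stats only when asked."""
    stats = (
        [ColumnStats() for _ in range(num_columns)]
        if collect_column_stats
        else None
    )
    stored = []
    for row in rows:
        stored.append(row)  # stand-in for the actual write path
        if stats is not None:
            for col_stats, value in zip(stats, row):
                col_stats.update(value)
    return stored, stats
```

With the flag left at its default, the per-row overhead of the statistics loop is skipped entirely, which is the trade-off discussed in the comment above.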