[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813964#comment-16813964 ] ASF GitHub Bot commented on DRILL-7089: --- sohami commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Labels: ready-to-commit > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813614#comment-16813614 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-481335547 @amansinha100, thanks for pointing this, they passed, so I have added `ready-to-commit` label. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Labels: ready-to-commit > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813611#comment-16813611 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-481333534 @vvysotskyi if the regression tests pass you can update the read-to-commit label in the JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813263#comment-16813263 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-481200553 Rebased onto the master and resolved merge conflicts. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812515#comment-16812515 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-480885230 @vvysotskyi thanks for the explanations. I am mostly good with the PR..so its a +1 from me but it would be good if @gparai can take a quick look at the statistics related changes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812299#comment-16812299 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272945631 ## File path: exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java ## @@ -106,6 +107,53 @@ public boolean isValueStatistic() { public boolean isExact() { return true; } + }, + + /** + * Column statistics kind which represents number of non-null values for the specific column. + */ + NON_NULL_COUNT(Statistic.NNROWCOUNT) { +@Override +public Double mergeStatistics(List statisticsList) { + double nonNullRowCount = 0; + for (ColumnStatistics statistics : statisticsList) { +Double nnRowCount = (Double) statistics.getStatistic(this); +if (nnRowCount != null) { + nonNullRowCount += nnRowCount; +} + } + return nonNullRowCount; +} + }, + + /** + * Column statistics kind which represents number of distinct values for the specific column. + */ + NVD(Statistic.NDV) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for NDV"); +} + }, + + /** + * Column statistics kind which width of the specific column. + */ + AVG_WIDTH(Statistic.AVG_WIDTH) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for avg_width"); +} + }, + + /** + * Column statistics kind which width of the specific column. Review comment: Thanks, missed it during the rebase onto the master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812297#comment-16812297 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272945731 ## File path: exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java ## @@ -106,6 +107,53 @@ public boolean isValueStatistic() { public boolean isExact() { return true; } + }, + + /** + * Column statistics kind which represents number of non-null values for the specific column. + */ + NON_NULL_COUNT(Statistic.NNROWCOUNT) { +@Override +public Double mergeStatistics(List statisticsList) { + double nonNullRowCount = 0; + for (ColumnStatistics statistics : statisticsList) { +Double nnRowCount = (Double) statistics.getStatistic(this); +if (nnRowCount != null) { + nonNullRowCount += nnRowCount; +} + } + return nonNullRowCount; +} + }, + + /** + * Column statistics kind which represents number of distinct values for the specific column. + */ + NVD(Statistic.NDV) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for NDV"); +} + }, + + /** + * Column statistics kind which width of the specific column. + */ + AVG_WIDTH(Statistic.AVG_WIDTH) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for avg_width"); +} + }, + + /** + * Column statistics kind which width of the specific column. + */ + HISTOGRAM("histogram") { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for avg_width"); Review comment: Thanks, fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812296#comment-16812296 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272954092 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/MetadataProviderManager.java ## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.base; + +import org.apache.drill.exec.planner.common.DrillStatsTable; +import org.apache.drill.exec.record.metadata.schema.SchemaProvider; + +/** + * Base interface for passing and obtaining {@link SchemaProvider}, {@link DrillStatsTable} and + * {@link TableMetadataProvider}, responsible for creating required + * {@link TableMetadataProviderBuilder} which constructs required {@link TableMetadataProvider} + * based on specified providers + */ +public interface MetadataProviderManager { Review comment: `MetadataProviderManager` exposes `SchemaProvider` and `StatsProvider` in order to pass them into builder for constructing `TableMetadataProvider`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812298#comment-16812298 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272964601 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillTable.java ## @@ -49,9 +50,7 @@ private final String userName; private GroupScan scan; private SessionOptionManager options; - // Stores the statistics(rowcount, NDV etc.) associated with the table - private DrillStatsTable statsTable; - private TupleMetadata schema; + private MetadataProviderManager metadataProviderManager; Review comment: This solution wasn't a result of following a particular Design Pattern. It was done in order to allow caching of `MetadataProvider` and preserve lazy initialization. `MetadataProviderManager` will be stored to the cache and once `MetadataProvider` is created, `MetadataProviderManager` will receive a link to it and it will be reused for constructing all further `MetadataProvider` instances. Yes, that's correct that "there's a 1-to-many relationship between Manager-->Provider and a 1-to-1 relationship between a Provider and its associated DrillTable". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811947#comment-16811947 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-480620573 @vvysotskyi one other thought is that since this PR touches several of the Statistics related classes, considering the Drill 1.16.0 release is imminent and the core statistics testing is ongoing, it would be disruptive to introduce these changes at this stage. Even though functional regression might pass, we have had performance issues even with statistics disabled, so I feel hesitant for 1.16.0. Let me know what you think. Also cc-ing @gparai This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811946#comment-16811946 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-480620573 @vvysotskyi one other thought is that since this PR touches several of the Statistics related classes, considering the Drill 1.16.0 release is imminent and the core statistics testing is ongoing, it would be disruptive to introduce these changes at this stage. Even though functional regression might pass, we have had performance issues even with statistics disabled, so I feel hesitant for 1.16.0. Let me know what you think. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811945#comment-16811945 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272845347 ## File path: exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java ## @@ -106,6 +107,53 @@ public boolean isValueStatistic() { public boolean isExact() { return true; } + }, + + /** + * Column statistics kind which represents number of non-null values for the specific column. + */ + NON_NULL_COUNT(Statistic.NNROWCOUNT) { +@Override +public Double mergeStatistics(List statisticsList) { + double nonNullRowCount = 0; + for (ColumnStatistics statistics : statisticsList) { +Double nnRowCount = (Double) statistics.getStatistic(this); +if (nnRowCount != null) { + nonNullRowCount += nnRowCount; +} + } + return nonNullRowCount; +} + }, + + /** + * Column statistics kind which represents number of distinct values for the specific column. + */ + NVD(Statistic.NDV) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for NDV"); +} + }, + + /** + * Column statistics kind which width of the specific column. + */ + AVG_WIDTH(Statistic.AVG_WIDTH) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for avg_width"); +} + }, + + /** + * Column statistics kind which width of the specific column. Review comment: Change to histogram. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811943#comment-16811943 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272845887 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/MetadataProviderManager.java ## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.base; + +import org.apache.drill.exec.planner.common.DrillStatsTable; +import org.apache.drill.exec.record.metadata.schema.SchemaProvider; + +/** + * Base interface for passing and obtaining {@link SchemaProvider}, {@link DrillStatsTable} and + * {@link TableMetadataProvider}, responsible for creating required + * {@link TableMetadataProviderBuilder} which constructs required {@link TableMetadataProvider} + * based on specified providers + */ +public interface MetadataProviderManager { Review comment: Since the `TableMetadataProvider` is supposed to provide the Schema and the Statistics, it is not clear why the `MetadataProviderManager` also exposes the SchemaProvider and the StatsProvider which in some sense are 'children' of the `TableMetadataProvider` since a Table must exist in order for its Schema and Statistics to exist. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811944#comment-16811944 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272846760 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillTable.java ## @@ -49,9 +50,7 @@ private final String userName; private GroupScan scan; private SessionOptionManager options; - // Stores the statistics(rowcount, NDV etc.) associated with the table - private DrillStatsTable statsTable; - private TupleMetadata schema; + private MetadataProviderManager metadataProviderManager; Review comment: The fact that the DrillTable has a reference to the `MetadataProviderManager` rather than the `MetadataProvider` itself is significant..I suppose you are following a particular Design Pattern .. probably the Facade Pattern [1] ? The MetadataProvider also follows the Builder pattern, so there's 2 levels of indirection: Manager --> Provider --> Builder. My initial thought was that 'do we need both or could a Provider be able to create instances for multiple tables, so a Provider --> Builder be sufficient ? Your approach is that there's a 1-to-many relationship between Manager-->Provider and a 1-to-1 relationship between a Provider and its associated DrillTable. Is that correct ? [1] https://en.wikipedia.org/wiki/Facade_pattern This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811942#comment-16811942 ] ASF GitHub Bot commented on DRILL-7089: --- amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#discussion_r272845354 ## File path: exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java ## @@ -106,6 +107,53 @@ public boolean isValueStatistic() { public boolean isExact() { return true; } + }, + + /** + * Column statistics kind which represents number of non-null values for the specific column. + */ + NON_NULL_COUNT(Statistic.NNROWCOUNT) { +@Override +public Double mergeStatistics(List statisticsList) { + double nonNullRowCount = 0; + for (ColumnStatistics statistics : statisticsList) { +Double nnRowCount = (Double) statistics.getStatistic(this); +if (nnRowCount != null) { + nonNullRowCount += nnRowCount; +} + } + return nonNullRowCount; +} + }, + + /** + * Column statistics kind which represents number of distinct values for the specific column. + */ + NVD(Statistic.NDV) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for NDV"); +} + }, + + /** + * Column statistics kind which width of the specific column. + */ + AVG_WIDTH(Statistic.AVG_WIDTH) { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for avg_width"); +} + }, + + /** + * Column statistics kind which width of the specific column. + */ + HISTOGRAM("histogram") { +@Override +public Object mergeStatistics(List statisticsList) { + throw new UnsupportedOperationException("Cannot merge statistics for avg_width"); Review comment: 'statistics for histogram' This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811151#comment-16811151 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-480379029 Regression tests are passed for this PR, so it is ready to be reviewed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807779#comment-16807779 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-479008647 @amansinha100, could you please take a look? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807778#comment-16807778 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728#issuecomment-479008544 Diagrams of the classes introduced in this PR: https://docs.google.com/presentation/d/1XG_xgR4okzXaJ3Z7HFHfzCwlM5VkNfre8GFEAd2Zo8k/edit?usp=sharing This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes
[ https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807710#comment-16807710 ] ASF GitHub Bot commented on DRILL-7089: --- vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for TableMetadataProvider at query level and adapt statistics to use Drill metastore API URL: https://github.com/apache/drill/pull/1728 In the scope of this PR introduced caching of table metadata (schema and statistics) at the query level. Introduced `MetadataProviderManager` which holds both `SchemaProvider` and `DrillStatsTable` and `TableMetadataProvider` if it was already created. `MetadataProviderManager` instance will be cached and used for every `DrillTable` which corresponds to the same table. Such an approach was used to preserve lazy initialization of group scan and `TableMetadataProvider` instances, so once the first instance of `TableMetadataProvider` is created, it will be stored in the `MetadataProviderManager` and its metadata will be reused for all further `TableMetadataProvider` instances. Another part of this PR is connected with the adoption of statistics to use Drill Metastore API. Enhanced logic to distinguish exact and estimated metadata, and used `TableMetadata` for obtaining statistics. Will create and attach a class diagram later. Also, tests should be run for this PR, so for now, I'll leave it in draft state. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement caching of BaseMetadata classes > - > > Key: DRILL-7089 > URL: https://issues.apache.org/jira/browse/DRILL-7089 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.16.0 >Reporter: Volodymyr Vysotskyi >Assignee: Volodymyr Vysotskyi >Priority: Major > Fix For: 1.16.0 > > > In the scope of DRILL-6852 were introduced new classes for metadata usage. > These classes may be reused in other GroupScan instances to preserve heap > usage for the case when metadata is large. > The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass > them to the {{GroupScan}}, so in the scope of the single query, it will be > possible to reuse them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)