[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813964#comment-16813964
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

sohami commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813614#comment-16813614
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-481335547
 
 
   @amansinha100, thanks for pointing this, they passed, so I have added 
`ready-to-commit` label.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813611#comment-16813611
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-481333534
 
 
   @vvysotskyi if the regression tests pass you can update the read-to-commit 
label in the JIRA. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813263#comment-16813263
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-481200553
 
 
   Rebased onto the master and resolved merge conflicts.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812515#comment-16812515
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-480885230
 
 
   @vvysotskyi thanks for the explanations.  I am mostly good with the PR..so 
its a +1 from me but it would be good if @gparai can take a quick look at the 
statistics related changes. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812299#comment-16812299
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272945631
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java
 ##
 @@ -106,6 +107,53 @@ public boolean isValueStatistic() {
 public boolean isExact() {
   return true;
 }
+  },
+
+  /**
+   * Column statistics kind which represents number of non-null values for the 
specific column.
+   */
+  NON_NULL_COUNT(Statistic.NNROWCOUNT) {
+@Override
+public Double mergeStatistics(List 
statisticsList) {
+  double nonNullRowCount = 0;
+  for (ColumnStatistics statistics : statisticsList) {
+Double nnRowCount = (Double) statistics.getStatistic(this);
+if (nnRowCount != null) {
+  nonNullRowCount += nnRowCount;
+}
+  }
+  return nonNullRowCount;
+}
+  },
+
+  /**
+   * Column statistics kind which represents number of distinct values for the 
specific column.
+   */
+  NVD(Statistic.NDV) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
NDV");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
+   */
+  AVG_WIDTH(Statistic.AVG_WIDTH) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
avg_width");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
 
 Review comment:
   Thanks, missed it during the rebase onto the master.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812297#comment-16812297
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272945731
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java
 ##
 @@ -106,6 +107,53 @@ public boolean isValueStatistic() {
 public boolean isExact() {
   return true;
 }
+  },
+
+  /**
+   * Column statistics kind which represents number of non-null values for the 
specific column.
+   */
+  NON_NULL_COUNT(Statistic.NNROWCOUNT) {
+@Override
+public Double mergeStatistics(List 
statisticsList) {
+  double nonNullRowCount = 0;
+  for (ColumnStatistics statistics : statisticsList) {
+Double nnRowCount = (Double) statistics.getStatistic(this);
+if (nnRowCount != null) {
+  nonNullRowCount += nnRowCount;
+}
+  }
+  return nonNullRowCount;
+}
+  },
+
+  /**
+   * Column statistics kind which represents number of distinct values for the 
specific column.
+   */
+  NVD(Statistic.NDV) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
NDV");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
+   */
+  AVG_WIDTH(Statistic.AVG_WIDTH) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
avg_width");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
+   */
+  HISTOGRAM("histogram") {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
avg_width");
 
 Review comment:
   Thanks, fixed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812296#comment-16812296
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272954092
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/MetadataProviderManager.java
 ##
 @@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.base;
+
+import org.apache.drill.exec.planner.common.DrillStatsTable;
+import org.apache.drill.exec.record.metadata.schema.SchemaProvider;
+
+/**
+ * Base interface for passing and obtaining {@link SchemaProvider}, {@link 
DrillStatsTable} and
+ * {@link TableMetadataProvider}, responsible for creating required
+ * {@link TableMetadataProviderBuilder} which constructs required {@link 
TableMetadataProvider}
+ * based on specified providers
+ */
+public interface MetadataProviderManager {
 
 Review comment:
   `MetadataProviderManager` exposes `SchemaProvider` and `StatsProvider` in 
order to pass them into builder for constructing `TableMetadataProvider`.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812298#comment-16812298
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272964601
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillTable.java
 ##
 @@ -49,9 +50,7 @@
   private final String userName;
   private GroupScan scan;
   private SessionOptionManager options;
-  // Stores the statistics(rowcount, NDV etc.) associated with the table
-  private DrillStatsTable statsTable;
-  private TupleMetadata schema;
+  private MetadataProviderManager metadataProviderManager;
 
 Review comment:
   This solution wasn't a result of following a particular Design Pattern. It 
was done in order to allow caching of `MetadataProvider` and preserve lazy 
initialization. `MetadataProviderManager` will be stored to the cache and once 
`MetadataProvider` is created, `MetadataProviderManager` will receive a link to 
it and it will be reused for constructing all further `MetadataProvider` 
instances.
   
   Yes, that's correct that "there's a 1-to-many relationship between 
Manager-->Provider and a 1-to-1 relationship between a Provider and its 
associated DrillTable".
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811947#comment-16811947
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-480620573
 
 
   @vvysotskyi one other thought is that since this PR touches several of the 
Statistics related classes,  considering the Drill 1.16.0 release is imminent 
and the core statistics testing is ongoing, it would be disruptive to introduce 
these changes at this stage.  Even though functional regression might pass, we 
have had performance issues even with statistics disabled, so I feel hesitant 
for 1.16.0.  Let me know what you think.  
   
   Also cc-ing @gparai 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811946#comment-16811946
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-480620573
 
 
   @vvysotskyi one other thought is that since this PR touches several of the 
Statistics related classes,  considering the Drill 1.16.0 release is imminent 
and the core statistics testing is ongoing, it would be disruptive to introduce 
these changes at this stage.  Even though functional regression might pass, we 
have had performance issues even with statistics disabled, so I feel hesitant 
for 1.16.0.  Let me know what you think.  
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811945#comment-16811945
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272845347
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java
 ##
 @@ -106,6 +107,53 @@ public boolean isValueStatistic() {
 public boolean isExact() {
   return true;
 }
+  },
+
+  /**
+   * Column statistics kind which represents number of non-null values for the 
specific column.
+   */
+  NON_NULL_COUNT(Statistic.NNROWCOUNT) {
+@Override
+public Double mergeStatistics(List 
statisticsList) {
+  double nonNullRowCount = 0;
+  for (ColumnStatistics statistics : statisticsList) {
+Double nnRowCount = (Double) statistics.getStatistic(this);
+if (nnRowCount != null) {
+  nonNullRowCount += nnRowCount;
+}
+  }
+  return nonNullRowCount;
+}
+  },
+
+  /**
+   * Column statistics kind which represents number of distinct values for the 
specific column.
+   */
+  NVD(Statistic.NDV) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
NDV");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
+   */
+  AVG_WIDTH(Statistic.AVG_WIDTH) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
avg_width");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
 
 Review comment:
   Change to histogram.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811943#comment-16811943
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272845887
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/MetadataProviderManager.java
 ##
 @@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.base;
+
+import org.apache.drill.exec.planner.common.DrillStatsTable;
+import org.apache.drill.exec.record.metadata.schema.SchemaProvider;
+
+/**
+ * Base interface for passing and obtaining {@link SchemaProvider}, {@link 
DrillStatsTable} and
+ * {@link TableMetadataProvider}, responsible for creating required
+ * {@link TableMetadataProviderBuilder} which constructs required {@link 
TableMetadataProvider}
+ * based on specified providers
+ */
+public interface MetadataProviderManager {
 
 Review comment:
   Since the `TableMetadataProvider` is supposed to provide the Schema and the 
Statistics, it is not clear why the `MetadataProviderManager` also exposes the 
SchemaProvider and the StatsProvider which in some sense are 'children' of the 
`TableMetadataProvider`  since a Table must exist in order for its Schema and 
Statistics to exist. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811944#comment-16811944
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272846760
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillTable.java
 ##
 @@ -49,9 +50,7 @@
   private final String userName;
   private GroupScan scan;
   private SessionOptionManager options;
-  // Stores the statistics(rowcount, NDV etc.) associated with the table
-  private DrillStatsTable statsTable;
-  private TupleMetadata schema;
+  private MetadataProviderManager metadataProviderManager;
 
 Review comment:
   The fact that the DrillTable has a reference to the 
`MetadataProviderManager` rather than the `MetadataProvider` itself is 
significant..I suppose you are following a particular Design Pattern .. 
probably the Facade Pattern [1] ?  The MetadataProvider also follows the 
Builder pattern, so there's 2 levels of indirection:  Manager --> Provider --> 
Builder.   My initial thought was that 'do we need both or could a Provider be 
able to create instances for multiple tables, so a Provider --> Builder be 
sufficient ?  Your approach is that there's a  1-to-many relationship between 
Manager-->Provider and a 1-to-1 relationship between a Provider and its 
associated DrillTable.  Is that correct ?   
   
   [1] https://en.wikipedia.org/wiki/Facade_pattern
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811942#comment-16811942
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

amansinha100 commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#discussion_r272845354
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/metastore/ColumnStatisticsKind.java
 ##
 @@ -106,6 +107,53 @@ public boolean isValueStatistic() {
 public boolean isExact() {
   return true;
 }
+  },
+
+  /**
+   * Column statistics kind which represents number of non-null values for the 
specific column.
+   */
+  NON_NULL_COUNT(Statistic.NNROWCOUNT) {
+@Override
+public Double mergeStatistics(List 
statisticsList) {
+  double nonNullRowCount = 0;
+  for (ColumnStatistics statistics : statisticsList) {
+Double nnRowCount = (Double) statistics.getStatistic(this);
+if (nnRowCount != null) {
+  nonNullRowCount += nnRowCount;
+}
+  }
+  return nonNullRowCount;
+}
+  },
+
+  /**
+   * Column statistics kind which represents number of distinct values for the 
specific column.
+   */
+  NVD(Statistic.NDV) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
NDV");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
+   */
+  AVG_WIDTH(Statistic.AVG_WIDTH) {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
avg_width");
+}
+  },
+
+  /**
+   * Column statistics kind which width of the specific column.
+   */
+  HISTOGRAM("histogram") {
+@Override
+public Object mergeStatistics(List 
statisticsList) {
+  throw new UnsupportedOperationException("Cannot merge statistics for 
avg_width");
 
 Review comment:
   'statistics for histogram'
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811151#comment-16811151
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-480379029
 
 
   Regression tests are passed for this PR, so it is ready to be reviewed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807779#comment-16807779
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-479008647
 
 
   @amansinha100, could you please take a look?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807778#comment-16807778
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on issue #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728#issuecomment-479008544
 
 
   Diagrams of the classes introduced in this PR: 
https://docs.google.com/presentation/d/1XG_xgR4okzXaJ3Z7HFHfzCwlM5VkNfre8GFEAd2Zo8k/edit?usp=sharing
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7089) Implement caching of BaseMetadata classes

2019-04-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807710#comment-16807710
 ] 

ASF GitHub Bot commented on DRILL-7089:
---

vvysotskyi commented on pull request #1728: DRILL-7089: Implement caching for 
TableMetadataProvider at query level and adapt statistics to use Drill 
metastore API
URL: https://github.com/apache/drill/pull/1728
 
 
   In the scope of this PR introduced caching of table metadata (schema and 
statistics) at the query level.
   Introduced `MetadataProviderManager` which holds both `SchemaProvider` and 
`DrillStatsTable` and `TableMetadataProvider` if it was already created.
   `MetadataProviderManager` instance will be cached and used for every 
`DrillTable` which corresponds to the same table.
   Such an approach was used to preserve lazy initialization of group scan and 
`TableMetadataProvider` instances, so once the first instance of 
`TableMetadataProvider` is created, it will be stored in the 
`MetadataProviderManager` and its metadata will be reused for all further 
`TableMetadataProvider` instances.
   
   Another part of this PR is connected with the adoption of statistics to use 
Drill Metastore API. Enhanced logic to distinguish exact and estimated 
metadata, and used `TableMetadata` for obtaining statistics.
   
   Will create and attach a class diagram later.
   
   Also, tests should be run for this PR, so for now, I'll leave it in draft 
state.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement caching of BaseMetadata classes
> -
>
> Key: DRILL-7089
> URL: https://issues.apache.org/jira/browse/DRILL-7089
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Volodymyr Vysotskyi
>Priority: Major
> Fix For: 1.16.0
>
>
> In the scope of DRILL-6852 were introduced new classes for metadata usage. 
> These classes may be reused in other GroupScan instances to preserve heap 
> usage for the case when metadata is large.
> The idea is to store {{BaseMetadata}} inheritors in {{DrillTable}} and pass 
> them to the {{GroupScan}}, so in the scope of the single query, it will be 
> possible to reuse them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)