Re: how to represent and optimize a query over partitioned storage

2021-12-13 Thread Alessandro Solimando
Thank you Jacques and Jing Zhang for the code pointers!

On Mon, 13 Dec 2021 at 03:58, Jing Zhang  wrote:

> Hi,
> partition pruner is a common requirement.
> You could provide a rule which push partition evaluated by filter condition
> into a TableScan.
> There are two ways to ensure the converted result would be chosen in the
> final plan.
> 1. If use CBO planner, you could override the `getRowCount ` by merging all
> the row count of matched partitions into total row count. After apply the
> rule, the total row count must be less than before.
> 2. You could use HBO planner which I thought is proper for this case,
> because we could always get benifite  from partition pruner.
>
> Apache Flink also provides Partitioner pruner, you could find it's
> implementation here [1]. Hope it helps.
>
> [1]
>
> https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushPartitionIntoTableSourceScanRule.java
>
> Best,
> Jing Zhang
>
> Jacques Nadeau  于2021年12月13日周一 01:57写道:
>
> > I don't have a great pointer to an intelligent arbitrary filter coster.
> The
> > default one in Calcite isn't great (last I checked) as it considers more
> > filters to be equivalent to more reduction. This means it dramatically
> > overestimates set reduction. In Dremio, in simple cases we had to apply
> an
> > upper limit on reduction.
> >
> > For partitions specifically, we actually apply the condition to the
> > underlying partition details to get an accurate new cost. You can see the
> > base class of the pruning here:
> >
> >
> >
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/planner/logical/partition/PruneScanRuleBase.java
> >
> > In this code, the problem is decomposed into first doing search argument
> > pruning (sarg) followed by arbitrary expression pruning (using an
> > interpreter).
> >
> > Sorry I don't have a great pointer to a high-quality selectivity
> estimator
> > in OSS that uses more advanced stats. Maybe someone else can point to
> one.
> >
> > On Sat, Dec 11, 2021 at 11:43 PM Alessandro Solimando <
> > alessandro.solima...@gmail.com> wrote:
> >
> > > Do you have any code pointer for achieving that, Jacques?
> > >
> > > My main concern is how to estimate the new cost. Do you leverage the
> > > estimation of predicate selectivity over the partitioning expression
> > maybe?
> > >
> > > Il Dom 12 Dic 2021, 05:48 Jacques Nadeau  ha
> > scritto:
> > >
> > > > What we have done in the past is push filters into a scan and alter
> the
> > > > costing (and estimated row count). In cases where the filter or
> > portions
> > > of
> > > > the filter can be applied against partitioning columns, you prune
> > > > partitions and use a new row count estimate/cost estimate based on
> the
> > > > reduced partition set.
> > > >
> > > >
> > > > On Fri, Dec 10, 2021 at 10:25 AM Maxim Gramin <
> mgra...@querifylabs.com
> > >
> > > > wrote:
> > > >
> > > > > I assume that some of the filter conditions (which are involved in
> > the
> > > > > choice of partitions ) may by pushdown'ed to TableScan
> > > > >
> > > > > On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
> > > > >  wrote:
> > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Given some partitioned storage, we can omit the scan of some
> > > partitions
> > > > > > when a filter is present. How can the lower cost of the scan be
> > > > > > represented? As far as I can tell the current approach only
> allows
> > > > > > providing a single cost for the TableScan and Filter can only add
> > to
> > > > > > that. Should my implementation provide a rule that combines
> > > > > > Filter+TableScan?
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: how to represent and optimize a query over partitioned storage

2021-12-12 Thread Jing Zhang
Hi,
partition pruner is a common requirement.
You could provide a rule which push partition evaluated by filter condition
into a TableScan.
There are two ways to ensure the converted result would be chosen in the
final plan.
1. If use CBO planner, you could override the `getRowCount ` by merging all
the row count of matched partitions into total row count. After apply the
rule, the total row count must be less than before.
2. You could use HBO planner which I thought is proper for this case,
because we could always get benifite  from partition pruner.

Apache Flink also provides Partitioner pruner, you could find it's
implementation here [1]. Hope it helps.

[1]
https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushPartitionIntoTableSourceScanRule.java

Best,
Jing Zhang

Jacques Nadeau  于2021年12月13日周一 01:57写道:

> I don't have a great pointer to an intelligent arbitrary filter coster. The
> default one in Calcite isn't great (last I checked) as it considers more
> filters to be equivalent to more reduction. This means it dramatically
> overestimates set reduction. In Dremio, in simple cases we had to apply an
> upper limit on reduction.
>
> For partitions specifically, we actually apply the condition to the
> underlying partition details to get an accurate new cost. You can see the
> base class of the pruning here:
>
>
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/planner/logical/partition/PruneScanRuleBase.java
>
> In this code, the problem is decomposed into first doing search argument
> pruning (sarg) followed by arbitrary expression pruning (using an
> interpreter).
>
> Sorry I don't have a great pointer to a high-quality selectivity estimator
> in OSS that uses more advanced stats. Maybe someone else can point to one.
>
> On Sat, Dec 11, 2021 at 11:43 PM Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
> > Do you have any code pointer for achieving that, Jacques?
> >
> > My main concern is how to estimate the new cost. Do you leverage the
> > estimation of predicate selectivity over the partitioning expression
> maybe?
> >
> > Il Dom 12 Dic 2021, 05:48 Jacques Nadeau  ha
> scritto:
> >
> > > What we have done in the past is push filters into a scan and alter the
> > > costing (and estimated row count). In cases where the filter or
> portions
> > of
> > > the filter can be applied against partitioning columns, you prune
> > > partitions and use a new row count estimate/cost estimate based on the
> > > reduced partition set.
> > >
> > >
> > > On Fri, Dec 10, 2021 at 10:25 AM Maxim Gramin  >
> > > wrote:
> > >
> > > > I assume that some of the filter conditions (which are involved in
> the
> > > > choice of partitions ) may by pushdown'ed to TableScan
> > > >
> > > > On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
> > > >  wrote:
> > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > Given some partitioned storage, we can omit the scan of some
> > partitions
> > > > > when a filter is present. How can the lower cost of the scan be
> > > > > represented? As far as I can tell the current approach only allows
> > > > > providing a single cost for the TableScan and Filter can only add
> to
> > > > > that. Should my implementation provide a rule that combines
> > > > > Filter+TableScan?
> > > > >
> > > >
> > >
> >
>


Re: how to represent and optimize a query over partitioned storage

2021-12-12 Thread Jacques Nadeau
I don't have a great pointer to an intelligent arbitrary filter coster. The
default one in Calcite isn't great (last I checked) as it considers more
filters to be equivalent to more reduction. This means it dramatically
overestimates set reduction. In Dremio, in simple cases we had to apply an
upper limit on reduction.

For partitions specifically, we actually apply the condition to the
underlying partition details to get an accurate new cost. You can see the
base class of the pruning here:

https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/planner/logical/partition/PruneScanRuleBase.java

In this code, the problem is decomposed into first doing search argument
pruning (sarg) followed by arbitrary expression pruning (using an
interpreter).

Sorry I don't have a great pointer to a high-quality selectivity estimator
in OSS that uses more advanced stats. Maybe someone else can point to one.

On Sat, Dec 11, 2021 at 11:43 PM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Do you have any code pointer for achieving that, Jacques?
>
> My main concern is how to estimate the new cost. Do you leverage the
> estimation of predicate selectivity over the partitioning expression maybe?
>
> Il Dom 12 Dic 2021, 05:48 Jacques Nadeau  ha scritto:
>
> > What we have done in the past is push filters into a scan and alter the
> > costing (and estimated row count). In cases where the filter or portions
> of
> > the filter can be applied against partitioning columns, you prune
> > partitions and use a new row count estimate/cost estimate based on the
> > reduced partition set.
> >
> >
> > On Fri, Dec 10, 2021 at 10:25 AM Maxim Gramin 
> > wrote:
> >
> > > I assume that some of the filter conditions (which are involved in the
> > > choice of partitions ) may by pushdown'ed to TableScan
> > >
> > > On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
> > >  wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > Given some partitioned storage, we can omit the scan of some
> partitions
> > > > when a filter is present. How can the lower cost of the scan be
> > > > represented? As far as I can tell the current approach only allows
> > > > providing a single cost for the TableScan and Filter can only add to
> > > > that. Should my implementation provide a rule that combines
> > > > Filter+TableScan?
> > > >
> > >
> >
>


Re: how to represent and optimize a query over partitioned storage

2021-12-11 Thread Alessandro Solimando
Do you have any code pointer for achieving that, Jacques?

My main concern is how to estimate the new cost. Do you leverage the
estimation of predicate selectivity over the partitioning expression maybe?

Il Dom 12 Dic 2021, 05:48 Jacques Nadeau  ha scritto:

> What we have done in the past is push filters into a scan and alter the
> costing (and estimated row count). In cases where the filter or portions of
> the filter can be applied against partitioning columns, you prune
> partitions and use a new row count estimate/cost estimate based on the
> reduced partition set.
>
>
> On Fri, Dec 10, 2021 at 10:25 AM Maxim Gramin 
> wrote:
>
> > I assume that some of the filter conditions (which are involved in the
> > choice of partitions ) may by pushdown'ed to TableScan
> >
> > On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
> >  wrote:
> >
> > >
> > > Hi,
> > >
> > > Given some partitioned storage, we can omit the scan of some partitions
> > > when a filter is present. How can the lower cost of the scan be
> > > represented? As far as I can tell the current approach only allows
> > > providing a single cost for the TableScan and Filter can only add to
> > > that. Should my implementation provide a rule that combines
> > > Filter+TableScan?
> > >
> >
>


Re: how to represent and optimize a query over partitioned storage

2021-12-11 Thread Jacques Nadeau
What we have done in the past is push filters into a scan and alter the
costing (and estimated row count). In cases where the filter or portions of
the filter can be applied against partitioning columns, you prune
partitions and use a new row count estimate/cost estimate based on the
reduced partition set.


On Fri, Dec 10, 2021 at 10:25 AM Maxim Gramin 
wrote:

> I assume that some of the filter conditions (which are involved in the
> choice of partitions ) may by pushdown'ed to TableScan
>
> On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
>  wrote:
>
> >
> > Hi,
> >
> > Given some partitioned storage, we can omit the scan of some partitions
> > when a filter is present. How can the lower cost of the scan be
> > represented? As far as I can tell the current approach only allows
> > providing a single cost for the TableScan and Filter can only add to
> > that. Should my implementation provide a rule that combines
> > Filter+TableScan?
> >
>


Re: how to represent and optimize a query over partitioned storage

2021-12-10 Thread Maxim Gramin
I assume that some of the filter conditions (which are involved in the
choice of partitions ) may by pushdown'ed to TableScan

On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
 wrote:

>
> Hi,
>
> Given some partitioned storage, we can omit the scan of some partitions
> when a filter is present. How can the lower cost of the scan be
> represented? As far as I can tell the current approach only allows
> providing a single cost for the TableScan and Filter can only add to
> that. Should my implementation provide a rule that combines
> Filter+TableScan?
>


how to represent and optimize a query over partitioned storage

2021-12-10 Thread Константин Новиков

Hi,
 
Given some partitioned storage, we can omit the scan of some partitions when a 
filter is present. How can the lower cost of the scan be represented? As far as 
I can tell the current approach only allows providing a single cost for the 
TableScan and Filter can only add to that. Should my implementation provide a 
rule that combines Filter+TableScan?