date:20210217

Support SI at Segment level

2021-02-17 Thread Nihal

Hi all,

Currently, if the parent (main) table and the SI table do not have the same
valid segments, we disable the SI table. From the next query onwards, we
scan and prune only the parent table until the next load or REINDEX command
is triggered (these commands bring the parent and SI table segments back in
sync). As a result, queries take longer to return results while SI is
disabled.

To solve this problem, we are planning to support SI at the segment level.
This means we will not disable SI when the parent and SI tables do not have
the same segments. Instead, we will prune on SI for all segments that are
valid in the SI table, and prune on the main/parent table for the rest of
the segments.


At the time of pruning with the main table in TableIndex.prune, if SI exists
for the corresponding filter, then all segments which are not present in the
SI table will be pruned using the corresponding parent table segment.
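
A minimal sketch of the segment split described above, in Python for
brevity. The function name and the plain string segment IDs are hypothetical
and only illustrate the routing; this is not the actual TableIndex.prune
code.

```python
def split_segments_for_pruning(parent_segments, si_segments):
    """Partition the parent table's valid segments by SI availability.

    Segments that are also valid in the SI table can be pruned through SI;
    the remaining segments fall back to pruning on the parent table.
    """
    si_set = set(si_segments)
    prune_with_si = [s for s in parent_segments if s in si_set]
    prune_with_parent = [s for s in parent_segments if s not in si_set]
    return prune_with_si, prune_with_parent


# Hypothetical state: segments 2 and 3 are loaded in the parent table
# but not yet in the SI table.
parent = ["0", "1", "2", "3"]
si = ["0", "1"]
via_si, via_parent = split_segments_for_pruning(parent, si)
```

With this split, SI stays enabled for segments 0 and 1, and only segments 2
and 3 pay the cost of pruning on the parent table.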

Please let me know your thoughts and inputs on this.

Regards
Nihal kumar ojha



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Improve carbondata CDC performance

2021-02-17 Thread akashrn5

Hi all,

The design doc is updated. Please go through it and give your
inputs/suggestions.

Thanks,

Regards,
Akash R




Re: Improve carbondata CDC performance

2021-02-17 Thread David CaiQiang

Hi Akash,
You can enhance the runtime filter to improve the join performance.

It has the rule to dynamically check whether the join can add the
runtime filter or not.

Better to push down the runtime filter into CarbonDataSourceScan, and
better to avoid adding a UDF function to rewrite the plan.





-
Best Regards
David Cai

Re: Improve carbondata CDC performance

2021-02-17 Thread Akash r

Hi David,

Are you talking about Spark's dynamic runtime filter pushdown? If that is
the case, I had already raised a discussion a few months back to do
something similar in CarbonData, to improve Carbon joins by using
CarbonData's metadata. But since Spark is already doing it, it was decided
not to make changes on the Carbon side for the same feature and add
complexity.

So this design was decided in the community meeting last week. Please
correct me if I have misunderstood your reply.

Also, it is not a plan change. It basically just adds a WHERE filter, a
custom UDF on block paths, to the current join query, which skips scanning
the unwanted block files.
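
A rough illustration of that idea, in Python for brevity. The row layout,
the "block_path" field, and the way the set of wanted blocks is obtained are
assumptions for this sketch only; the real filter is a UDF inside the Spark
join query.

```python
def join_with_block_filter(target_rows, source_rows, wanted_blocks):
    """Inner join on "id", scanning only target rows from wanted blocks.

    The join logic itself is unchanged; the extra predicate on block_path
    plays the role of the additional WHERE filter.
    """
    source_by_key = {row["id"]: row for row in source_rows}
    # Extra filter: skip rows that live in blocks we do not need to scan.
    candidates = [t for t in target_rows if t["block_path"] in wanted_blocks]
    return [(t, source_by_key[t["id"]])
            for t in candidates if t["id"] in source_by_key]


# Hypothetical data: three target blocks, two source keys.
target_rows = [
    {"id": 1, "block_path": "part-0"},
    {"id": 2, "block_path": "part-1"},
    {"id": 3, "block_path": "part-2"},
]
source_rows = [{"id": 1, "value": "a"}, {"id": 3, "value": "c"}]
wanted_blocks = {"part-0", "part-2"}  # assumed to be computed beforehand
joined = join_with_block_filter(target_rows, source_rows, wanted_blocks)
```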

Regards,
Akash

On Thu, Feb 18, 2021, 6:36 AM David CaiQiang  wrote:

> Hi Akash,
> You can enhance the runtime filter to improve the join performance.
>
> It has the rule to dynamically check whether the join can add the
> runtime filter or not.
>
> Better to push down the runtime filter into CarbonDataSourceScan, and
> better to avoid adding a UDF function to rewrite the plan.
>
> Best Regards
> David Cai

Re: Improve carbondata CDC performance

2021-02-17 Thread Akash r

Hi,

In addition to this, the probability of false positives will be higher when
we just push a runtime filter of the source data's min/max onto the target,
as it depends purely on the source data.

I had even done a POC by just adding a range filter on the target table.

We need file-level or block-level pruning like SI does, so this approach
was decided.
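
A small worked example of the false-positive concern, in Python for brevity.
The block names, their min/max values, and both helper functions are made up
for illustration, and the block-level check here is only a stand-in for
whatever the actual pruning does.

```python
def prune_by_source_range(blocks, source_keys):
    """Keep every block whose [min, max] overlaps the source's overall range."""
    lo, hi = min(source_keys), max(source_keys)
    return [name for name, (bmin, bmax) in blocks.items()
            if bmax >= lo and bmin <= hi]


def prune_by_source_keys(blocks, source_keys):
    """Keep only blocks whose [min, max] contains at least one source key."""
    return [name for name, (bmin, bmax) in blocks.items()
            if any(bmin <= k <= bmax for k in source_keys)]


# Hypothetical target blocks with (min, max) of the join key.
blocks = {"block_0": (1, 100), "block_1": (101, 500), "block_2": (501, 1000)}
source_keys = [5, 990]  # sparse keys spanning a wide range

range_hits = prune_by_source_range(blocks, source_keys)
key_hits = prune_by_source_keys(blocks, source_keys)
```

Here the single range filter [5, 990] keeps all three blocks, including
block_1 which holds no matching key, while the per-key block check keeps
only block_0 and block_2.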

Regards
Akash

On Thu, Feb 18, 2021, 6:36 AM David CaiQiang  wrote:

> Hi Akash,
> You can enhance the runtime filter to improve the join performance.
>
> It has the rule to dynamically check whether the join can add the
> runtime filter or not.
>
> Better to push down the runtime filter into CarbonDataSourceScan, and
> better to avoid adding a UDF function to rewrite the plan.
>
> Best Regards
> David Cai
