Re: [DISCUSS] IEP-71 Public API for secondary index search

Andrey Mashenkov Wed, 07 Apr 2021 06:46:18 -0700

Maksim,

The ScanQuery API provides a filter as
> param that for case of index query should be splitted on such conditions.
> It looks like a non-trivial task.
>
ScanQuery, TextQuery and partially SQL query share the same infrastructure.
I've thought we could extend, improve and reuse some ScanQuery code that
already works fine: map query on topology, IO, batching.
Add IndexCondition alongside the Filter, and abstract query executor from
source (primary and secondary Indexes).
Add a sorted merge algorithm to the query merge stage. It can be very
useful also for TextQueries that suffers from the absence of sorted merge
and a "limit' condition work incorrectly.


If you think it will be too hard than creating from scratch, I'm ok.

3. Ignite creates a proxy object that is filled with objects that are
> inlined. If a user tries to access a field that isn't inlined or not
> indexed, then deserialization will start and Ignite will log.warn() about
> that.
>
Agree, this can be faster.
I don't like the idea a user code will be executed inside BTree operation,
any exception can cause FailureHandler triggering and stop the node.

There is one more thing that could be improved.
ScanQuery now iterates over per-partition PK Hash index trees and has
performance issues on a small grid with a large number of partitions.
So, there are many partitions on every node and many trees should be
scanned.
In this case scan over a secondary index gives significant boots even if
every row is materialized, because we need to traverse over a single tree
per-node.
Having the ability to run a ScanQuery over a secondary index (if one
exists) instead of PK Hash will be great.


On Wed, Apr 7, 2021 at 11:18 AM Maksim Timonin <[email protected]>
wrote:

> Hi, Andrey!
>
> Thanks for the review and your comments!
>
> >> Is it possible to extend ScanQuery functionality to pass index condition
> I investigated this way and see some issues:
> 1. Querying of indexes is not a scan actually. It's
> a tree traverse (predicate operation is an exclusion, other operations like
> gt, lt, min, max have explicit boundaries). An index query consists of
> conditions that match an index structure. In general for a multi-key index
> there can be multiple conditions. The ScanQuery API provides a filter as
> param that for case of index query should be splitted on such conditions.
> It looks like a non-trivial task.
> 2. Querying of an index requires a sorted result, while The ScanQuery
> doesn't matter about that. So there will be a different behavior of the
> iterator for scanning a cache and querying indexes. It's not much to
> implement I think, but it can make ScanQuery unclear for a user.
>
> Maybe it's a point to separate traverse (gt, lt, in, etc...) and scan
> (predicate) index operations to different API. So there still will be a new
> query type for the traversing.
>
> But we will introduce some inheritors for ScanQuery, like TableScanQuery
> and IndexScanQuery, for scan and filter. Then the question is about
> ordering, Cache and Table scans aren't ordered, but Index is. Then we can
> introduce an optional param "order" for ScanQuery too.
>
> WDYT?
>
> >> Functional indices
> >> This task looks like a huge one because the lifecycle of such classes
> should be described first
> I agree with you. That this part should be investigated deeper than I did.
> So let's postpone discussion about functional indexes for a while. IEP-71
> declares some phases, functional indexes are part of the 2nd phase, but
> users will get new functionality already from the 1st phase. Then I'll dig
> into things you mentioned. Thanks for pointing them out.
>
> >> IndexScan by the predicate is questionable
> Also in comments to the IEP on the Confluence you mentioned about
> deserialization that is required to get an object for predicate function.
> Now I see it like that:
> 1. The predicate should operate only with indexed fields;
> 2. User win from predicate only if index is inlined properly (even a part
> of rows aren't inlined due to varlen - it still can be faster then make a
> ScanQuery);
> 3. Ignite creates a proxy object that is filled with objects that are
> inlined. If a user tries to access a field that isn't inlined or not
> indexed, then deserialization will start and Ignite will log.warn() about
> that.
>
> So, I think it's a valid use case. Is there smth I'm missing?
>
>
>
>
>
> On Tue, Apr 6, 2021 at 6:21 PM Andrey Mashenkov <
> [email protected]>
> wrote:
>
> > Hi Maksim,
> >
> > Nice idea, I'd like to see this feature in Ignite.
> > The motivation is clear to me, it would be nice to have fast scans and
> omit
> > SQL overhead on planning, parsing and etc in some simple use-cases.
> >
> > I've left few minor comments to the IEP, but I have the next questions
> > which answer I failed to find in IEP.
> > 1. Is it possible to extend ScanQuery functionality to pass index
> condition
> > as a hint/parameter rather than create a separate query type?
> > This allows a user to run a query over the particular table (for
> > multi-table per cache case) and use an index for some type of conditions.
> >
> > 2. Functional indices, as you wrote, should use Functions distributed via
> > peerClassLoading mechanics.
> > This means there will no class with function on server sides and such
> > classes are not persistent. Seems, they can survive grid restart.
> > This task looks like a huge one because the lifecycle of such classes
> > should be described first.
> > Possible pitfalls are:
> > * Durability. Function code MUST be persistent, to survive node restart
> as
> > there can be no guaranteed classes available on the server-side.
> > * Consistency. Server (and maybe clients) nodes MUST have the same class
> > code at a time.
> > * Code ownership. Would class code be shared or per-cache? If first, you
> > can't just change class code by loading a new one, because other caches
> may
> > use this function.
> > If second, different caches may have different code/behavior, that may be
> > non-obvious to end-user.
> >
> > 3. IndexScan by the predicate is questionable.
> > Maybe it will can faster if there are multiple tables in a cache, but
> looks
> > similar to ScanQuery with a filter.
> >
> > Also, I believe we can have a common API (configuring, creating, using)
> for
> > all types of Indices, but
> > some types (e.g. functional) will be ignored in SQL due to limited
> support
> > on H2 side,
> > and other types will be shared and could be used by ScanQuery engine as
> > well as by SQL engine.
> >
> > On Tue, Apr 6, 2021 at 4:14 PM Maksim Timonin <[email protected]>
> > wrote:
> >
> > > Hi, Igniters!
> > >
> > > I'd like to propose a new feature - opportunity to query and create
> > indexes
> > > from public API.
> > >
> > > It will help in some cases, where:
> > > 1. SQL is not applicable by design of user application;
> > > 2. Where IndexScan is preferable than ScanQuery for performance
> reasons;
> > > 3. Functional indexes are required.
> > >
> > > Also it'll be great to have a transactional support for such queries,
> > like
> > > the "select for update" query provides. But I don't dig there much. It
> > will
> > > be a next step if this API will be implemented.
> > >
> > > I've prepared an IEP-71 for that [1] with more details. Please share
> your
> > > thoughts.
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-71+Public+API+for+secondary+index+search
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >
>


-- 
Best regards,
Andrey V. Mashenkov

Re: [DISCUSS] IEP-71 Public API for secondary index search

Reply via email to