Re: [DISCUSS] IEP-71 Public API for secondary index search

Maksim Timonin Mon, 12 Apr 2021 02:22:03 -0700

Hi, Andrey!

Am I right, that you mean this ticket [1] *IGNITE-12291 Create controllable
paged query requests / responses for TextQuery similar to current SQL
result processing*, when talked about incorrect limit work for TextQueries?



[1] https://issues.apache.org/jira/browse/IGNITE-12291

On Thu, Apr 8, 2021 at 4:32 PM Maksim Timonin <timonin.ma...@gmail.com>
wrote:

> Hi, Andrey!
>
> >> ScanQuery, TextQuery and partially SQL query share the same
> infrastructure
> I think I understand what you mean. I debug query processing and now agree
> that it's a nice idea to try to reuse the infrastructure of scan and text
> queries. Also as I can see there already Reducer functionality exists, so I
> hope we can use that. I'm not absolutely confident now that it will work
> fine, but I'm going to start there. Thanks for pointing me this direction!
>
> >> I don't like the idea a user code will be executed inside BTree
> operation
> On the confluence page I've shown that a predicate passes as
> TreeRowClosure. In this case you're right, any exception in a predicate
> will lead to a CorruptedTreeException. But I see another legal way to
> implement the predicate operation. BPlusTree.find accepts the X param that
> passed to IO.getRow(). As I understand this param helps to control how much
> returned row is filled. Then we can use it to return an object that
> contains only basic info - link, pageAddr, offset. Then predicate operation
> will be applied on the higher level on a cursor returned by a tree (like
> H2TreeIndex does). It's safe to run user code there, we can handle
> exceptions there.
>
>
>
> On Wed, Apr 7, 2021 at 4:46 PM Andrey Mashenkov <
> andrey.mashen...@gmail.com> wrote:
>
>> Maksim,
>>
>> The ScanQuery API provides a filter as
>> > param that for case of index query should be splitted on such
>> conditions.
>> > It looks like a non-trivial task.
>> >
>> ScanQuery, TextQuery and partially SQL query share the same
>> infrastructure.
>> I've thought we could extend, improve and reuse some ScanQuery code that
>> already works fine: map query on topology, IO, batching.
>> Add IndexCondition alongside the Filter, and abstract query executor from
>> source (primary and secondary Indexes).
>> Add a sorted merge algorithm to the query merge stage. It can be very
>> useful also for TextQueries that suffers from the absence of sorted merge
>> and a "limit' condition work incorrectly.
>>
>> If you think it will be too hard than creating from scratch, I'm ok.
>>
>> 3. Ignite creates a proxy object that is filled with objects that are
>> > inlined. If a user tries to access a field that isn't inlined or not
>> > indexed, then deserialization will start and Ignite will log.warn()
>> about
>> > that.
>> >
>> Agree, this can be faster.
>> I don't like the idea a user code will be executed inside BTree operation,
>> any exception can cause FailureHandler triggering and stop the node.
>>
>> There is one more thing that could be improved.
>> ScanQuery now iterates over per-partition PK Hash index trees and has
>> performance issues on a small grid with a large number of partitions.
>> So, there are many partitions on every node and many trees should be
>> scanned.
>> In this case scan over a secondary index gives significant boots even if
>> every row is materialized, because we need to traverse over a single tree
>> per-node.
>> Having the ability to run a ScanQuery over a secondary index (if one
>> exists) instead of PK Hash will be great.
>>
>>
>> On Wed, Apr 7, 2021 at 11:18 AM Maksim Timonin <timonin.ma...@gmail.com>
>> wrote:
>>
>> > Hi, Andrey!
>> >
>> > Thanks for the review and your comments!
>> >
>> > >> Is it possible to extend ScanQuery functionality to pass index
>> condition
>> > I investigated this way and see some issues:
>> > 1. Querying of indexes is not a scan actually. It's
>> > a tree traverse (predicate operation is an exclusion, other operations
>> like
>> > gt, lt, min, max have explicit boundaries). An index query consists of
>> > conditions that match an index structure. In general for a multi-key
>> index
>> > there can be multiple conditions. The ScanQuery API provides a filter as
>> > param that for case of index query should be splitted on such
>> conditions.
>> > It looks like a non-trivial task.
>> > 2. Querying of an index requires a sorted result, while The ScanQuery
>> > doesn't matter about that. So there will be a different behavior of the
>> > iterator for scanning a cache and querying indexes. It's not much to
>> > implement I think, but it can make ScanQuery unclear for a user.
>> >
>> > Maybe it's a point to separate traverse (gt, lt, in, etc...) and scan
>> > (predicate) index operations to different API. So there still will be a
>> new
>> > query type for the traversing.
>> >
>> > But we will introduce some inheritors for ScanQuery, like TableScanQuery
>> > and IndexScanQuery, for scan and filter. Then the question is about
>> > ordering, Cache and Table scans aren't ordered, but Index is. Then we
>> can
>> > introduce an optional param "order" for ScanQuery too.
>> >
>> > WDYT?
>> >
>> > >> Functional indices
>> > >> This task looks like a huge one because the lifecycle of such classes
>> > should be described first
>> > I agree with you. That this part should be investigated deeper than I
>> did.
>> > So let's postpone discussion about functional indexes for a while.
>> IEP-71
>> > declares some phases, functional indexes are part of the 2nd phase, but
>> > users will get new functionality already from the 1st phase. Then I'll
>> dig
>> > into things you mentioned. Thanks for pointing them out.
>> >
>> > >> IndexScan by the predicate is questionable
>> > Also in comments to the IEP on the Confluence you mentioned about
>> > deserialization that is required to get an object for predicate
>> function.
>> > Now I see it like that:
>> > 1. The predicate should operate only with indexed fields;
>> > 2. User win from predicate only if index is inlined properly (even a
>> part
>> > of rows aren't inlined due to varlen - it still can be faster then make
>> a
>> > ScanQuery);
>> > 3. Ignite creates a proxy object that is filled with objects that are
>> > inlined. If a user tries to access a field that isn't inlined or not
>> > indexed, then deserialization will start and Ignite will log.warn()
>> about
>> > that.
>> >
>> > So, I think it's a valid use case. Is there smth I'm missing?
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Apr 6, 2021 at 6:21 PM Andrey Mashenkov <
>> > andrey.mashen...@gmail.com>
>> > wrote:
>> >
>> > > Hi Maksim,
>> > >
>> > > Nice idea, I'd like to see this feature in Ignite.
>> > > The motivation is clear to me, it would be nice to have fast scans and
>> > omit
>> > > SQL overhead on planning, parsing and etc in some simple use-cases.
>> > >
>> > > I've left few minor comments to the IEP, but I have the next questions
>> > > which answer I failed to find in IEP.
>> > > 1. Is it possible to extend ScanQuery functionality to pass index
>> > condition
>> > > as a hint/parameter rather than create a separate query type?
>> > > This allows a user to run a query over the particular table (for
>> > > multi-table per cache case) and use an index for some type of
>> conditions.
>> > >
>> > > 2. Functional indices, as you wrote, should use Functions distributed
>> via
>> > > peerClassLoading mechanics.
>> > > This means there will no class with function on server sides and such
>> > > classes are not persistent. Seems, they can survive grid restart.
>> > > This task looks like a huge one because the lifecycle of such classes
>> > > should be described first.
>> > > Possible pitfalls are:
>> > > * Durability. Function code MUST be persistent, to survive node
>> restart
>> > as
>> > > there can be no guaranteed classes available on the server-side.
>> > > * Consistency. Server (and maybe clients) nodes MUST have the same
>> class
>> > > code at a time.
>> > > * Code ownership. Would class code be shared or per-cache? If first,
>> you
>> > > can't just change class code by loading a new one, because other
>> caches
>> > may
>> > > use this function.
>> > > If second, different caches may have different code/behavior, that
>> may be
>> > > non-obvious to end-user.
>> > >
>> > > 3. IndexScan by the predicate is questionable.
>> > > Maybe it will can faster if there are multiple tables in a cache, but
>> > looks
>> > > similar to ScanQuery with a filter.
>> > >
>> > > Also, I believe we can have a common API (configuring, creating,
>> using)
>> > for
>> > > all types of Indices, but
>> > > some types (e.g. functional) will be ignored in SQL due to limited
>> > support
>> > > on H2 side,
>> > > and other types will be shared and could be used by ScanQuery engine
>> as
>> > > well as by SQL engine.
>> > >
>> > > On Tue, Apr 6, 2021 at 4:14 PM Maksim Timonin <
>> timonin.ma...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi, Igniters!
>> > > >
>> > > > I'd like to propose a new feature - opportunity to query and create
>> > > indexes
>> > > > from public API.
>> > > >
>> > > > It will help in some cases, where:
>> > > > 1. SQL is not applicable by design of user application;
>> > > > 2. Where IndexScan is preferable than ScanQuery for performance
>> > reasons;
>> > > > 3. Functional indexes are required.
>> > > >
>> > > > Also it'll be great to have a transactional support for such
>> queries,
>> > > like
>> > > > the "select for update" query provides. But I don't dig there much.
>> It
>> > > will
>> > > > be a next step if this API will be implemented.
>> > > >
>> > > > I've prepared an IEP-71 for that [1] with more details. Please share
>> > your
>> > > > thoughts.
>> > > >
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-71+Public+API+for+secondary+index+search
>> > > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > > Andrey V. Mashenkov
>> > >
>> >
>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>>
>

Re: [DISCUSS] IEP-71 Public API for secondary index search

Reply via email to