Re: [DISCUSS] IEP-71 Public API for secondary index search

Maksim Timonin Fri, 17 Sep 2021 05:30:19 -0700

Hi!

The patch merged (IndexQuery with required index name). A lot of thanks to
Andrew Mashenkov, Taras Ledkov and Alex Plekhanov for their time and
reviews.


By the way, I'd like to up the discussion about the optionality of the
indexName param for creating IndexQuery. I believe that it's a useful
feature. My arguments are:
1. Ignite provides users the possibility to create an index without name
setting (with the QuerySqlField annotation, with QueryIndex for
QueryEntity). Then Ignite should provide the possibility to query indexes
the same way it uses for creating them.

2. Naming of the index isn't transparent:
- creating index name is hidden in internal packages: QueryUtils#indexName.
- the QuerySqlField annotation provides an opportunity to create an index
(set `index = true`), but it doesn't set name to index. Users will have to
replace QuerySqlField(index=true) with QuerySqlField(orderedGroups=) to set
the index name. This is a little bit misleading to use group index to set
naming for a single field index.

3. There is an example of using such API in other databases: MongoDB
doesn't require to specify an index name [1], it searches it by fields
only. Note, MongoDB doesn't require strict fields order for querying
indexes, but requires it for sorting [2].

When defining an index we enforce order of fields. QueryIndex uses
LinkedHashMap for that, QuerySqlField.Group uses the order param. But I
agree that it can be very inconvenient to fill IndexQuery with criteria in
the same order. Also cases when users support both indexes (A, B) and (B,
A) are pretty rare, and in this specific case one should set the index name
in any case to avoid misleading results. Also let's take in consideration
that there is already a case with MongoDB that doesn't restrict field
order.

So, I propose:
1. Allow create IndexQuery with list of criteria only, without specifying
name.
2. In this case iterate over all indexes created for specified value type,
and use the first that matches all criteria fields.


If there are no objections I'll prepare a patch for that. Igniters, WDYT?



[1] https://docs.mongodb.com/manual/indexes/
[2]
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/#sort-on-multiple-fields



On Mon, Aug 30, 2021 at 5:52 PM Maksim Timonin <[email protected]>
wrote:

> Hi!
>
> It's a PR re-review request.
>
> I've reworked PR [1]. The major changes are:
> 1. indexName is a required param for building IndexQuery.
> 2. the order of criteria fields doesn't make sense - now it checks only
> that criteria is a prefix for index fields.
> 3. IndexQuery can accept a value type (only value class was before).
> Thanks a lot to Taras Ledkov for this comment, I missed this.
> 4. Also refactored code in places where Andrey Mashenkov put his comments.
>
> Also created a separate ticket [2] for building IndexQuery from a list of
> criteria fields only.
>
> [1] https://github.com/apache/ignite/pull/9118
> [2] https://issues.apache.org/jira/browse/IGNITE-15391
>
> On Fri, Aug 27, 2021 at 10:49 AM Maksim Timonin <[email protected]>
> wrote:
>
>> Hi!
>>
>> Thanks all for your comments!
>>
>> 1. I'm OK on the first step to make the indexName param required and not
>> check fields order for this case. So, I'll remove code with search index by
>> fields from current PR.
>>
>> 2. But I propose to create a ticket for possible improvement - IndexQuery
>> should be created with a list of fields only.
>> I believe that it's a useful feature. My main argument is still the same:
>> - Ignite provides to users the possibility to create an index without
>> name setting. Then Ignite should provide the possibility to query them.
>> - There is an example of using such API in other databases: as Ivan
>> mentioned, MongoDB doesn't require to specify an index name [1], it
>> searches it by fields only. Note, MongoDB doesn't require strict fields
>> order for querying indexes, but requires it for sorting [2].
>>
>> I don't see reasons not to provide a similar API for our users. Agree,
>> that order of fields is under discussion.
>>
>> So, let's postpone step 2, and discuss it later again, as currently it is
>> a blocker for Index API. If somebody sees strong arguments against this
>> proposal and my arguments, please write them.
>>
>> [1] https://docs.mongodb.com/manual/indexes/
>> [2]
>> https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/#sort-on-multiple-fields
>>
>>
>>
>>
>> On Thu, Aug 26, 2021 at 8:00 PM Courtney Robinson <
>> [email protected]> wrote:
>>
>>> Prefer 1 from Teras' response. Specifying index name is preferred.
>>> I've seen customers do idx(A,B) and idx(B,A) where semantics change
>>> between
>>> the two.
>>>
>>> Regards,
>>> Courtney Robinson
>>> Founder and CEO, Hypi
>>> Tel: ++44 208 123 2413 (GMT+0) <https://hypi.io>
>>>
>>> <https://hypi.io>
>>> https://hypi.io
>>>
>>>
>>> On Thu, Aug 26, 2021 at 4:28 PM Taras Ledkov <[email protected]>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > My proposal:
>>> > 1. Don't search index by criteria, specify the index name always
>>> > (preferred).
>>> >
>>> > OR
>>> >
>>> > 2. Search index by criteria without check the order of criteriones.
>>> > Use the Set of criterions instead of the ordered collection.
>>> > In the strange case when the both index exist (a, b) and (b, a) - use
>>> > the any index
>>> > when index name isn't specified.
>>> >
>>> > On 26.08.2021 16:49, Maksim Timonin wrote:
>>> > > There are some thoughts about strict field order:
>>> > > 1. Index (A, B) is not equivalent to index (B, A). Some queries may
>>> have
>>> > > different performance on such indexes, and users have to specify the
>>> > right
>>> > > index. What if both indexes exist?
>>> > > 2. We should avoid cases when a user uses in query only field B for
>>> index
>>> > > (A, B). We have to force the user to specify range for (A) too, or
>>> > > explicitly set it (null, null). Otherwise it looks like a mistake.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Aug 26, 2021 at 4:39 PM Ivan Daschinsky <[email protected]
>>> >
>>> > wrote:
>>> > >
>>> > >> 1. I suppose, that the next step is to implement the api for
>>> manually
>>> > >> creating index. I think that user wants to create index that will
>>> speed
>>> > up
>>> > >> his criteria base queries, so he or she will use the same criteria
>>> to
>>> > >> define the index. So no problem at all
>>> > >> 2. We should print warning or throws exception if there is not any
>>> index
>>> > >> that match specific criteria.
>>> > >>
>>> > >> BTW, Mongo DB doesn't make user to write index name in query. It
>>> just
>>> > >> works.
>>> > >>
>>> > >> чт, 26 авг. 2021 г., 15:52 Taras Ledkov <[email protected]>:
>>> > >>
>>> > >>> Hi,
>>> > >>>
>>> > >>>> It is an usability nightmare to make user write index name in all
>>> > >> cases.
>>> > >>> I don't see any difference between specifying the index name and
>>> > >>> specifying the index fields in the right order.
>>> > >>> Do you see?
>>> > >>>
>>> > >>> Let's there is the index:
>>> > >>> idx_A_B ON TBL (A, B)
>>> > >>>
>>> > >>> Is it OK that the query like below doesn't math the index
>>> 'idx_A_B'?
>>> > >>> new IndexQuery<>(..)
>>> > >>>       .setCriteria(lt("b", 1), lt("a", 2));
>>> > >>>
>>> > >>> On 26.08.2021 15:23, Ivan Daschinsky wrote:
>>> > >>>> I am against to make user write index name. It is quite simple and
>>> > >>>> straightforward algorithm to match index to field names, so it is
>>> > >> strange
>>> > >>>> to compare it to sql engine optimizer.
>>> > >>>>
>>> > >>>> It is an usability nightmare to make user write index name in all
>>> > >> cases.
>>> > >>>> чт, 26 авг. 2021 г., 14:42 Maksim Timonin <
>>> [email protected]>:
>>> > >>>>
>>> > >>>>> Hi, Igniters!
>>> > >>>>>
>>> > >>>>> There is a discussion about how to specify an index to query
>>> with an
>>> > >>>>> IndexQuery [1]. Currently my PR provides 2 ways to specify index:
>>> > >>>>> 1. With a table and index name;
>>> > >>>>> 2. With a table and list of index fields (without index name). In
>>> > this
>>> > >>> case
>>> > >>>>> IndexQueryProcessor tries to find an index that matches table and
>>> > >> index
>>> > >>>>> fields in strict order (order of fields in criteria has to match
>>> the
>>> > >>> order
>>> > >>>>> of fields in index).
>>> > >>>>>
>>> > >>>>> Discussion is whether is the second approach valid?
>>> > >>>>>
>>> > >>>>> Pros:
>>> > >>>>> 1. Currently index name is an optional field for QueryIndex and
>>> > >>>>> QuerySqlField. Then users can create an index with a table and
>>> list
>>> > of
>>> > >>>>> fields. Then, we should provide an opportunity to define an
>>> index for
>>> > >>>>> querying the same way as we do for creating.
>>> > >>>>> 2. It's required to know the index name to query it (in case the
>>> > index
>>> > >>> was
>>> > >>>>> created without an explicit name). Users can find it and then
>>> use it
>>> > >> as
>>> > >>> a
>>> > >>>>> constant in code, but I see some troubles there:
>>> > >>>>> 2.1. Get index name by querying the system view INDEXES. Note,
>>> that
>>> > >>> system
>>> > >>>>> views are marked as an experimental feature [2].
>>> > >>>>> 2.2. There is a workaround to know an index name with EXPLAIN
>>> clause
>>> > >> for
>>> > >>>>> sql query that uses the required index (but it depends on SQL
>>> > >>> optimizer).
>>> > >>>>> 2.3. Users can use the index name builder, but it is in the
>>> > >>>>> internal package
>>> > >>>>>
>>> (org.apache.ignite.internal.processors.query.QueryUtils#indexName).
>>> > >>> Then it
>>> > >>>>> can be changed from version to version without warning, and then
>>> the
>>> > >>> user
>>> > >>>>> can't rely on it in code.
>>> > >>>>> 3. Name of the Primary Key index (_key_PK) is predefined and
>>> > hardcoded
>>> > >>> in
>>> > >>>>> Ignite. Users can't set it while creating it, and the name of PK
>>> > index
>>> > >>> is
>>> > >>>>> hardcoded in the internal package too
>>> (QueryUtils.PRIMARY_KEY_INDEX).
>>> > >>>>>
>>> > >>>>> Cons:
>>> > >>>>> 1. It's declared that IndexQuery avoids some SQL steps (like
>>> > planning,
>>> > >>>>> optimizer) in favor of speed. It looks like that looking for an
>>> index
>>> > >> by
>>> > >>>>> list of fields is the work of an optimizer.
>>> > >>>>> 2. It can be not obvious that the order of fields in a query has
>>> to
>>> > >>> match
>>> > >>>>> the order of fields in the related index. We should have it in
>>> mind
>>> > >> when
>>> > >>>>> building a query - there should be a check for order of fields
>>> before
>>> > >>>>> querying.
>>> > >>>>>
>>> > >>>>>   From my side, I think that arguments for enforcing usage of an
>>> > index
>>> > >>> name
>>> > >>>>> for queries are strong enough. But for me it's strange that it's
>>> > >>> possible
>>> > >>>>> to create an index without a name, but it's required to use name
>>> to
>>> > >>> query
>>> > >>>>> it. Also taking in consideration that there is no guaranteed way
>>> to
>>> > >> get
>>> > >>> an
>>> > >>>>> index name (or I don't know it).
>>> > >>>>>
>>> > >>>>> Igniters, what do you think?
>>> > >>>>> [1]
>>> https://github.com/apache/ignite/pull/9118#discussion_r642557531
>>> > >>>>> [2]
>>> > >>>
>>> https://ignite.apache.org/docs/latest/monitoring-metrics/system-views
>>> > >>>>>
>>> > >>>>> On Fri, Aug 6, 2021 at 4:04 PM Maksim Timonin <
>>> > >> [email protected]>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>>> Hi, all!
>>> > >>>>>>
>>> > >>>>>> It's a gentle reminder. There is a PR for the new Index API
>>> [1]. It
>>> > >> was
>>> > >>>>>> approved by Alex Plekhanov. Does anybody want to review this API
>>> > too?
>>> > >>> If
>>> > >>>>>> there won't be objections we're going to merge it Monday, 16th
>>> of
>>> > >>> August.
>>> > >>>>>> Thanks!
>>> > >>>>>>
>>> > >>>>>> [1] https://github.com/apache/ignite/pull/9118
>>> > >>>>>>
>>> > >>>>>> On Fri, May 21, 2021 at 10:43 PM Maksim Timonin <
>>> > >>> [email protected]
>>> > >>>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Andrey, hi!
>>> > >>>>>>>
>>> > >>>>>>> Some updates, there.
>>> > >>>>>>>
>>> > >>>>>>> I've submitted a PR for IndexQuery [1]. There is an issue about
>>> > lazy
>>> > >>>>> page
>>> > >>>>>>> loading, that is also related to Text query ticket
>>> IGNITE-12291.
>>> > >>>>>>>
>>> > >>>>>>> CacheQueries already have pending pages functionality, it's
>>> done
>>> > >> with
>>> > >>>>>>> multiple sending GridCacheQueryRequest. There was an issue with
>>> > >>>>> TextQuery
>>> > >>>>>>> and limit, after exceeding a limit we still send requests, so I
>>> > >>>>> submitted a
>>> > >>>>>>> patch to fix this [2].
>>> > >>>>>>>
>>> > >>>>>>> But currently, TextQuery, as SqlFieldsQuery also does, prepares
>>> > >> whole
>>> > >>>>>>> data on query request, holds it, and provides a cursor over
>>> this
>>> > >>>>>>> collection.
>>> > >>>>>>>
>>> > >>>>>>> As I understand you correctly, you propose to run TextQuery
>>> over
>>> > >> index
>>> > >>>>>>> with every poll page request. We can do this with Lucene
>>> > >>>>>>> IndexSearcher.searchAfter. So from one side, it will save
>>> > resources.
>>> > >>> But
>>> > >>>>>>> from the other side, no queries (no TextQuery, no
>>> SqlFieldsQuery)
>>> > >> lock
>>> > >>>>>>> index for querying. So there can be data inconsistency, as
>>> there
>>> > can
>>> > >>> be
>>> > >>>>>>> concurrent operations on an index while a user iterates over
>>> the
>>> > >>>>> cursor. It
>>> > >>>>>>> also could be for queries now, due to no index lock being
>>> there,
>>> > but
>>> > >>> the
>>> > >>>>>>> window of time of such inconsistency is much shorter.
>>> > >>>>>>>
>>> > >>>>>>> The same dilemma I have for IndexQuery. In my patch [1] I
>>> provide
>>> > >> lazy
>>> > >>>>>>> iteration over BPlusTree. There is no lock on an index too
>>> while
>>> > >>>>> querying.
>>> > >>>>>>> And I want to discuss the right way. I have in mind the next
>>> > things:
>>> > >>>>>>> 1. Indexes currently doesn't support transactions, also SQL
>>> queries
>>> > >>>>> don't
>>> > >>>>>>> lock index for queries, so Ignite don't guarantee data
>>> consistency;
>>> > >>>>>>> 2. As I understand preparing whole data for SQL queries is
>>> required
>>> > >>> due
>>> > >>>>>>> to relations between tables. The more complex query and
>>> relations
>>> > we
>>> > >>>>> have,
>>> > >>>>>>> the much consistency issues we have in result in case of
>>> parallel
>>> > >>>>>>> operations;
>>> > >>>>>>> 3. Querying a single index only (by TextQuery or IndexQuery)
>>> > doesn't
>>> > >>>>>>> affect any relations, so we can allow concurrent updates, as it
>>> > >> could
>>> > >>>>>>> affect a query result but it doesn't hurt.
>>> > >>>>>>>
>>> > >>>>>>> And following these thoughts, it's right to implement lazy
>>> > >> iterations
>>> > >>>>>>> over indexes. What do you think?
>>> > >>>>>>>
>>> > >>>>>>> Also, there is a second topic to discuss. BPlusTree indexes
>>> support
>>> > >>>>> query
>>> > >>>>>>> parallelism. But CacheQueries don't. There needs to be a
>>> change to
>>> > >>>>>>> infrastructure to support query parallelism, so on this patch
>>> [1] I
>>> > >>>>> handle
>>> > >>>>>>> multiple segments in a single thread. And this works OK, as in
>>> the
>>> > >>> case
>>> > >>>>> of
>>> > >>>>>>> lazy querying it's very fast to initialize a cursor, so there
>>> is
>>> > not
>>> > >>>>> much
>>> > >>>>>>> overhead on multiple segments. I ran performance tests and
>>> found
>>> > >> that
>>> > >>> in
>>> > >>>>>>> some cases, IndexQuery beats SqlFieldsQuery even with enabled
>>> > >>>>>>> queryParallelism (it helps a SqlFieldsQuery much). So the need
>>> for
>>> > >>>>>>> supporting queryParallelism for IndexQuery is required to be
>>> tested
>>> > >>>>> well.
>>> > >>>>>>> As IndexQuery already can help users to speed up some queries I
>>> > >>> propose
>>> > >>>>> to
>>> > >>>>>>> check queryParallelism a little bit later. WDYT?
>>> > >>>>>>>
>>> > >>>>>>> So, those 2 things affect the Apache Ignite release that
>>> IndexQuery
>>> > >>> will
>>> > >>>>>>> be delivered with. So, please let me know your thoughts.
>>> > >>>>>>>
>>> > >>>>>>> Any thoughts from the community are welcome too.
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> [1] https://github.com/apache/ignite/pull/9118
>>> > >>>>>>> [2] https://github.com/apache/ignite/pull/9086
>>> > >>>>>>>
>>> > >>>>>>> On Mon, Apr 12, 2021 at 1:52 PM Maksim Timonin <
>>> > >>> [email protected]
>>> > >>>>>>> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> Andrey,
>>> > >>>>>>>>
>>> > >>>>>>>> Thanks! I picked it.
>>> > >>>>>>>>
>>> > >>>>>>>> On Mon, Apr 12, 2021 at 1:51 PM Maksim Timonin <
>>> > >>>>> [email protected]>
>>> > >>>>>>>> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> Stephen,
>>> > >>>>>>>>>
>>> > >>>>>>>>> I don't see a reason to replace or deprecate IndexingSpi.
>>> I'm not
>>> > >>>>>>>>> sure how smbd uses it, but it works now.
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Mon, Apr 12, 2021 at 1:42 PM Stephen Darlington <
>>> > >>>>>>>>> [email protected]> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> Is this a replacement for IndexingSpi? Put bluntly, do we
>>> > >> deprecate
>>> > >>>>>>>>>> (and remove) it?
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Or do you see them as complimentary?
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> On 12 Apr 2021, at 11:29, Maksim Timonin <
>>> > >> [email protected]
>>> > >>>>>>>>>> wrote:
>>> > >>>>>>>>>>> Hi Stephen!
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Please have a look at the QueryProcessing paragraph [1].
>>> I've
>>> > >>>>>>>>>> described
>>> > >>>>>>>>>>> why IndexingSpi doesn't fit us well.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> [1]
>>> > >>>>>>>>>>>
>>> > >>
>>> >
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-71+Public+API+for+secondary+index+search#IEP71PublicAPIforsecondaryindexsearch-2)QueryProcessing
>>> > >>>>>>>>>>> On Mon, Apr 12, 2021 at 1:24 PM Stephen Darlington <
>>> > >>>>>>>>>>> [email protected]> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>> How does this fit with the current IndexingSpi?
>>> Superficially
>>> > >>> they
>>> > >>>>>>>>>> appear
>>> > >>>>>>>>>>>> to do very similar things?
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Regards,
>>> > >>>>>>>>>>>> Stephen
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>>> On 6 Apr 2021, at 14:13, Maksim Timonin <
>>> > >>> [email protected]
>>> > >>>>>>>>>> wrote:
>>> > >>>>>>>>>>>>> Hi, Igniters!
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> I'd like to propose a new feature - opportunity to query
>>> and
>>> > >>>>> create
>>> > >>>>>>>>>>>> indexes
>>> > >>>>>>>>>>>>> from public API.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> It will help in some cases, where:
>>> > >>>>>>>>>>>>> 1. SQL is not applicable by design of user application;
>>> > >>>>>>>>>>>>> 2. Where IndexScan is preferable than ScanQuery for
>>> > >> performance
>>> > >>>>>>>>>> reasons;
>>> > >>>>>>>>>>>>> 3. Functional indexes are required.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Also it'll be great to have a transactional support for
>>> such
>>> > >>>>>>>>>> queries,
>>> > >>>>>>>>>>>> like
>>> > >>>>>>>>>>>>> the "select for update" query provides. But I don't dig
>>> there
>>> > >>>>>>>>>> much. It
>>> > >>>>>>>>>>>> will
>>> > >>>>>>>>>>>>> be a next step if this API will be implemented.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> I've prepared an IEP-71 for that [1] with more details.
>>> > Please
>>> > >>>>>>>>>> share your
>>> > >>>>>>>>>>>>> thoughts.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> [1]
>>> > >>>>>>>>>>>>>
>>> > >>
>>> >
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-71+Public+API+for+secondary+index+search
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>> --
>>> > >>> Taras Ledkov
>>> > >>> Mail-To: [email protected]
>>> > >>>
>>> > >>>
>>> > --
>>> > Taras Ledkov
>>> > Mail-To: [email protected]
>>> >
>>> >
>>>
>>

Re: [DISCUSS] IEP-71 Public API for secondary index search

Reply via email to