Hello,
after some debugging. I think I know what is going on. Just as a reminder:
I have a Lucene index A that indexes nt:file and some general properties
and another Lucene index B that indexes a type X that inherits nt:file and
indexes a property Y.

So, for a query like the following with a limit of 20
SELECT [jcr:uuid]  FROM [X]  WHERE  [Y] = 'ADJ0158on1ljc'
Oak chooses the general index A over the more specific index B. And,
according to the log, both have the "same" cost.


1.The plan for index A has costPerExecution 1, costPerEntry 1 and
estimatedEntryCount 623349. The plan for B has costPerExecution 1,
costPerEntry 1 and estimatedEntryCount 124670. To me it would indicate that
B has less cost. But, in private method getBestSelectorExecutionPlan from
class org.apache.jackrabbit.oak.query.QueryImpl (oak-core) it enters in the
following block (maxEntryCount has a value of 20, the limit I set to the
query)
if (sortOrder == null || p.getSortOrder() != null) {
    // if the query is unordered, or
    // if the query contains "order by" and the index can sort on that,
    // then we don't need to read all entries from the index
    entryCount = Math.min(maxEntryCount, entryCount);
}

The entry count for both plans is leveled to 20. I don't think that is ok.
Both indexes now have the "same" cost (double c = costPerExecution +
entryCount * costPerEntry) and since index A comes first on the list of
plans, it gets chosen as the best. The chosen plan is

[X] as [X] /* lucene:A(/oak:index/A) jcr:primaryType:X where [X].[Y] =
'ADJ0158on1ljc' */

An instance of FulltextPathCursor is created and here is where the "The
query read or traversed more than 100000 nodes" exception is thrown.

2. When I use EXPLAIN MEASURE in the query, the cost calculation is made
with the estimatedEntryCount of each plan. That's why index B is chosen
correctly here and gives the idea that B will be used.

I do not know if what I wrote makes sense to anybody. Maybe in more recent
versions of Oak, this is fixed or another approach is used. If it is,
great. But for now, we are stuck in 1.12.0 and I don't see many workarounds
(maybe setting a large limit to the query, but that would affect memory
consumption, I guess).

it seems I have no other choice than setting a tag to index B and run the
query forcing the use of that index.

Regards.

Jorge

El mar, 15 nov 2022 a las 16:45, Jorge Flórez (<jorgeeduardoflo...@gmail.com>)
escribió:

> Hi,
> as additional info, I executed the query using  "explain measure":
>
> explain measure SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>  [Num_Radicado] = 'R-2022-005778'
>
> and the result was:
>
>
> [RADICADO_MIGRADO] as [RADICADO_MIGRADO] /* lucene:Index1(/oak:index/Index1) 
> Num_Radicado:R-2022-005778 where [RADICADO_MIGRADO].[Num_Radicado] = 
> 'R-2022-005778' */ cost: { \"RADICADO_MIGRADO\": { perEntry: 1.0, 
> perExecution: 1.0, count: 52210 } }
>
> It seems the correct index would be used, but as you read in my previous
> mail, *that did not happen*. Any help is appreciated.
>
> Regards.
>
> Jorge
>
> El lun, 14 nov 2022 a las 13:02, Jorge Flórez (<
> jorgeeduardoflo...@gmail.com>) escribió:
>
>> Hello,
>>
>> in a repository we have (very large it seems) there are two index
>> definitions. Please see the image:
>>
>> https://drive.google.com/file/d/1KS2MZHfj1aRoWm7v6o3kbNFCPormxEft/view?usp=sharing
>>
>> One index to make the depiction of a content tree faster (Index2, which
>> indexes nodes of type nt:folder) and one to make queries over a specific
>> node type and property faster (Index1, which indexes nodes of type
>> RADICADO_MIGRADO and the property Num_Radicado).
>>
>> When I use queries like
>>
>> SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE  [Num_Radicado] =
>> 'R-2022-005778' and isdescendantnode('/')
>>
>> SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE  [Num_Radicado] =
>> 'R-2022-005778'
>>
>> Index2 is being picked. Which results in:
>> The query read or traversed more than 100000 nodes.:
>> java.lang.UnsupportedOperationException: The query read or traversed more
>> than 100000 nodes. To avoid affecting other tasks, processing was stopped.
>>
>> Why is Index2 picked, having that index1 is specific for that node type
>> and indexes that property? in this case both indexes are returning the same
>> cost...
>>
>> Thanks in advance.
>>
>> P.S.
>> The cost calculation and chosen plan for each query is here:
>>
>> Parsing JCR-SQL2 statement: SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]
>>  WHERE  [Num_Radicado] = 'R-2022-005778' and isdescendantnode('/')
>> cost using filter Filter(query=SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]
>>  WHERE  [Num_Radicado] = 'R-2022-005778' and isdescendantnode('/'),
>> path=//*, property=[Num_Radicado=[R-2022-005778]])
>> cost for reference is Infinity
>> cost for property is Infinity
>> cost for nodeType is 409504.0
>>
>> *cost for [/oak:index/Index2] of type (lucene-property) with plan
>> [lucene:Index2(/oak:index/Index2) jcr:primaryType:RADICADO_MIGRADO] is
>> 3.00cost for [/oak:index/Index1] of type (lucene-property) with plan
>> [lucene:Index1(/oak:index/Index1) Num_Radicado:R-2022-005778] is 3.00*
>> cost for lucene-property is Infinity
>> cost for aggregate lucene is Infinity
>> selected index
>> org.apache.jackrabbit.oak.plugins.index.lucene.LucenePropertyIndex@283b96a
>> with plan /oak:index/Index2 and
>> org.apache.jackrabbit.oak.plugins.index.lucene.LucenePropertyIndex@283b96a
>> with plan /oak:index/Index1 have similar costs 3.0 and 3.0 for query
>> Filter(query=SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>>  [Num_Radicado] = 'R-2022-005778' and isdescendantnode('/'), path=//*,
>> property=[Num_Radicado=[R-2022-005778]]) - check query explanation / index
>> definitions
>> cost for traverse is 3823716.0
>> count: 1 query: SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>>  [Num_Radicado] = 'x' and isdescendantnode('x')
>> query execute SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>>  [Num_Radicado] = 'R-2022-005778' and isdescendantnode('/')
>> query plan [RADICADO_MIGRADO] as [RADICADO_MIGRADO] /*
>> lucene:Index2(/oak:index/Index2) jcr:primaryType:RADICADO_MIGRADO where
>> ([RADICADO_MIGRADO].[Num_Radicado] = 'R-2022-005778') and
>> (isdescendantnode([RADICADO_MIGRADO], [/])) */
>> The query read or traversed more than 100000 nodes.
>>
>> Parsing JCR-SQL2 statement: SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]
>>  WHERE  [Num_Radicado] = 'R-2022-005778'
>> cost using filter Filter(query=SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]
>>  WHERE  [Num_Radicado] = 'R-2022-005778', path=*,
>> property=[Num_Radicado=[R-2022-005778]])
>> cost for reference is Infinity
>> cost for property is Infinity
>> cost for nodeType is 409504.0
>>
>> *cost for [/oak:index/Index2] of type (lucene-property) with plan
>> [lucene:Index2(/oak:index/Index2) jcr:primaryType:RADICADO_MIGRADO] is
>> 3.00cost for [/oak:index/Index1] of type (lucene-property) with plan
>> [lucene:Index1(/oak:index/Index1) Num_Radicado:R-2022-005778] is 3.00*
>> cost for lucene-property is Infinity
>> cost for aggregate lucene is Infinity
>> selected index
>> org.apache.jackrabbit.oak.plugins.index.lucene.LucenePropertyIndex@1cdf077b
>> with plan /oak:index/Index2 and
>> org.apache.jackrabbit.oak.plugins.index.lucene.LucenePropertyIndex@1cdf077b
>> with plan /oak:index/Index1 have similar costs 3.0 and 3.0 for query
>> Filter(query=SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>>  [Num_Radicado] = 'R-2022-005778', path=*,
>> property=[Num_Radicado=[R-2022-005778]]) - check query explanation / index
>> definitions
>> cost for traverse is 3889252.0
>> count: 1 query: SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>>  [Num_Radicado] = 'x'
>> query execute SELECT [jcr:uuid]  FROM [RADICADO_MIGRADO]  WHERE
>>  [Num_Radicado] = 'R-2022-005778'
>> query plan [RADICADO_MIGRADO] as [RADICADO_MIGRADO] /*
>> lucene:Index2(/oak:index/Index2) jcr:primaryType:RADICADO_MIGRADO where
>> [RADICADO_MIGRADO].[Num_Radicado] = 'R-2022-005778' */
>> The query read or traversed more than 100000 nodes.
>>
>>

Reply via email to