[ 
https://issues.apache.org/jira/browse/OAK-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575682#comment-17575682
 ] 

Nuno Santos commented on OAK-9881:
----------------------------------

The purpose of the logic described above is to optimize for the case where the 
wildcard appears at the end of the query, transforming a potentially expensive 
Lucene or Elastic wildcard query into a more efficient query, like a prefix 
query.

The following plans were obtained using Elastic, for the queries 
{noformat}
select * from [nt:base] where [propa] like '12%'

select * from [nt:base] where [propa] like '12_'
{noformat}
 

With the current code in trunk, both are transformed into wildcard queries:
{noformat}
Query: select * from [nt:base] where [propa] like '12%'
Plan: [[nt:base] as [nt:base] /* 
elasticsearch:46567d9b-ea8b-401e-aacb-78960881bc3b(/oak:index/46567d9b-ea8b-401e-aacb-78960881bc3b)
 {"bool":{"filter":[{"wildcard":{"propa":{"value":"12*"}}}]}}  where 
[nt:base].[propa] like '12%' */]

Query: select * from [nt:base] where [propa] like '12_'
Plan: [[nt:base] as [nt:base] /* 
elasticsearch:46567d9b-ea8b-401e-aacb-78960881bc3b(/oak:index/46567d9b-ea8b-401e-aacb-78960881bc3b)
 {"bool":{"filter":[{"wildcard":{"propa":{"value":"12?"}}}]}}  where 
[nt:base].[propa] like '12_' */]
{noformat}
After fixing the code to properly check the last character of the string, both 
SQL2 queries are transformed into Elastic prefix queries:
{noformat}
Query: select * from [nt:base] where [propa] like '12%'
Plan: [[nt:base] as [nt:base] /* 
elasticsearch:556f667c-d5e8-44e6-82d6-364f1c5ea79b(/oak:index/556f667c-d5e8-44e6-82d6-364f1c5ea79b)
 {"bool":{"filter":[{"prefix":{"propa":{"value":"12"}}}]}}  where 
[nt:base].[propa] like '12%' */]

Query: select * from [nt:base] where [propa] like '12_'
Plan: [[nt:base] as [nt:base] /* 
elasticsearch:556f667c-d5e8-44e6-82d6-364f1c5ea79b(/oak:index/556f667c-d5e8-44e6-82d6-364f1c5ea79b)
 {"bool":{"filter":[{"prefix":{"propa":{"value":"12"}}}]}}  where 
[nt:base].[propa] like '12_' */]
{noformat}
For the filter with a single-character wildcard ({{like '12_'}}), the Elastic 
query is still a prefix query, and retrieves from ES the same results as 
{{like '12%'}}. The final results of the SQL2 query are correct, probably 
because Oak does another pass over the results returned from Elastic to enforce 
the real restriction. But a prefix query in Elastic may return a potentially 
large number of results (all properties whose value starts with '12', 
regardless of the length of the string), which in some cases can lead to very 
poor performance. For instance, imagine a property for which there are nodes 
with all values from 1 to 10 000 000. For a restriction {{like '1_'}}, Oak 
would retrieve from Elastic a large fraction of the index (around 1 million 
nodes) just to then filter it down to 10 values (numbers 10 to 19).
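
The estimate above can be checked with a quick count. This is only a sketch: it assumes the property values are stored as plain decimal strings, and the class and method names are made up for the illustration. It compares how many values a prefix query on '1' would fetch with how many are exactly '1' followed by one more character.

```java
public class PrefixOverfetchSketch {

    // Counts values in 1..max whose decimal string starts with the prefix,
    // i.e. what a prefix query on that string would return.
    static long countPrefix(String prefix, int max) {
        long count = 0;
        for (int i = 1; i <= max; i++) {
            if (String.valueOf(i).startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // Counts values in 1..max that are the prefix followed by exactly one
    // more character, i.e. what the pattern prefix + '_' really asks for.
    static long countPrefixPlusOneChar(String prefix, int max) {
        long count = 0;
        for (int i = 1; i <= max; i++) {
            String s = String.valueOf(i);
            if (s.length() == prefix.length() + 1 && s.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int max = 10_000_000;
        System.out.println(countPrefix("1", max));            // prints 1111112
        System.out.println(countPrefixPlusOneChar("1", max)); // prints 10
    }
}
```

So under these assumptions roughly 1.1 million hits would be pulled from Elastic to produce 10 final results.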

So the intention of the code before this fix was probably a better 
optimization, using prefix queries when the SQL2 query ends with % and wildcard 
queries when it ends with _. 

Maybe we should revert the Elastic code to be the same as before the fix, and 
use this optimization only for queries ending with %.

As the code is not reachable because of the off-by-one bug, the current trunk 
is using wildcard queries in all cases, so we are missing a potential easy 
optimization.
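
The proposal above (prefix query only when the pattern ends with a single %, wildcard query otherwise) could be sketched roughly as follows. This is not the actual ElasticRequestHandler or LuceneIndex code: the class and method names are hypothetical, the result is a plain string instead of a real query object, and escaped wildcards are ignored. The point is only the comparison against {{length() - 1}} instead of {{length()}}.

```java
public class LikeRewriteSketch {

    // Hypothetical helper: true only when the sole wildcard in the LIKE
    // pattern is one trailing '%'.
    static boolean isPrefixCandidate(String likePattern) {
        int firstPercent = likePattern.indexOf('%');
        // indexOf() returns at most length() - 1, so compare against that,
        // not against length() as the unreachable condition does.
        return firstPercent >= 0
                && firstPercent == likePattern.length() - 1
                && likePattern.indexOf('_') < 0;
    }

    // Describes the query that would be built, as text for illustration.
    static String describe(String likePattern) {
        if (isPrefixCandidate(likePattern)) {
            String prefix = likePattern.substring(0, likePattern.length() - 1);
            return "prefix(" + prefix + ")";
        }
        // Everything else stays a wildcard query: '%' -> '*', '_' -> '?'.
        return "wildcard(" + likePattern.replace('%', '*').replace('_', '?') + ")";
    }

    public static void main(String[] args) {
        for (String p : new String[] {"12%", "12_", "1%2%", "%12"}) {
            System.out.println(p + " -> " + describe(p));
        }
    }
}
```

With this check, {{like '12%'}} becomes a prefix query on '12', while {{like '12_'}} remains the wildcard query '12?' and does not over-fetch.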

> Unreachable code in the logic that processes like constraints
> -------------------------------------------------------------
>
>                 Key: OAK-9881
>                 URL: https://issues.apache.org/jira/browse/OAK-9881
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: indexing
>            Reporter: Nuno Santos
>            Priority: Minor
>
> In ElasticRequestHandler, the following code has a section that is 
> unreachable:
> {code:java}
> private static Query like(String name, String first) {
>     first = first.replace('%', WildcardQuery.WILDCARD_STRING);
>     first = first.replace('_', WildcardQuery.WILDCARD_CHAR);
>     int indexOfWS = first.indexOf(WildcardQuery.WILDCARD_STRING);
>     int indexOfWC = first.indexOf(WildcardQuery.WILDCARD_CHAR);
>     int len = first.length();
>     if (indexOfWS == len || indexOfWC == len) { 
>          // Unreachable code
>     }{code}
> The condition {{indexOfWS == len || indexOfWC == len}} will always evaluate 
> to false because the variables {{indexOfWS}} and {{indexOfWC}} are between 
> {{-1}} and {{len-1}} (from the specification of {{indexOf()}}), so they will 
> never be equal to {{len}}. (I found this issue from a warning in the static  
> analyzer of IntelliJ).
> Is this indeed a bug? If so, then we are missing tests to expose this bug. 
> The same logic can be found here:
> https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndex.java#L767-L791



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
