[
https://issues.apache.org/jira/browse/OAK-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575682#comment-17575682
]
Nuno Santos commented on OAK-9881:
----------------------------------
The purpose of the logic described above is to optimize for the case where the
wildcard appears at the end of the query, transforming a potentially expensive
Lucene or Elastic wildcard query into a more efficient query, like a prefix
query.
The following was done using Elastic, for the queries:
{noformat}
select * from [nt:base] where [propa] like '12%'
select * from [nt:base] where [propa] like '12_'
{noformat}
With the current code in trunk, they are both transformed into wildcard queries:
{noformat}
Query: select * from [nt:base] where [propa] like '12%'
Plan: [[nt:base] as [nt:base] /*
elasticsearch:46567d9b-ea8b-401e-aacb-78960881bc3b(/oak:index/46567d9b-ea8b-401e-aacb-78960881bc3b)
{"bool":{"filter":[{"wildcard":{"propa":{"value":"12*"}}}]}} where
[nt:base].[propa] like '12%' */]
Query: select * from [nt:base] where [propa] like '12_'
Plan: [[nt:base] as [nt:base] /*
elasticsearch:46567d9b-ea8b-401e-aacb-78960881bc3b(/oak:index/46567d9b-ea8b-401e-aacb-78960881bc3b)
{"bool":{"filter":[{"wildcard":{"propa":{"value":"12?"}}}]}} where
[nt:base].[propa] like '12_' */]
{noformat}
After fixing the code to properly check the last character of the string, the
SQL2 queries are transformed into Elastic prefix queries:
{noformat}
Query: select * from [nt:base] where [propa] like '12%'
Plan: [[nt:base] as [nt:base] /*
elasticsearch:556f667c-d5e8-44e6-82d6-364f1c5ea79b(/oak:index/556f667c-d5e8-44e6-82d6-364f1c5ea79b)
{"bool":{"filter":[{"prefix":{"propa":{"value":"12"}}}]}} where
[nt:base].[propa] like '12%' */]
Query: select * from [nt:base] where [propa] like '12_'
Plan: [[nt:base] as [nt:base] /*
elasticsearch:556f667c-d5e8-44e6-82d6-364f1c5ea79b(/oak:index/556f667c-d5e8-44e6-82d6-364f1c5ea79b)
{"bool":{"filter":[{"prefix":{"propa":{"value":"12"}}}]}} where
[nt:base].[propa] like '12_' */]
{noformat}
For the filter with the single-character wildcard ({{like '12_'}}), the
Elastic query is still a prefix query, and will retrieve from ES the same
results as {{like '12%'}}. The final results of the SQL2 query are correct,
probably because Oak does another pass over the results returned from Elastic
to enforce the real restriction. But a prefix query in Elastic may return a
large number of results (all properties whose value starts with '12',
regardless of the length of the string), which in some cases can lead to very
poor performance. For instance, imagine a property for which there are nodes
with all values from 1 to 10 000 000. For a restriction {{like '1_'}} Oak would
retrieve from Elastic a large fraction of the index (around 1 million nodes)
just to then filter it down to 10 values (numbers 10 to 19).
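The numbers in this example can be checked with a small standalone count. This is only a sketch with hypothetical class and method names; it does not query Oak or Elastic. It compares how many values a prefix of "1" selects with how many values have exactly one character after that prefix:

```java
// Hypothetical sketch: counts matches over the values 1..max, no index involved.
class PrefixVersusWildcardCount {

    // Number of values in 1..max whose decimal string starts with the prefix
    // (what a prefix query on "1" would return).
    static long countPrefix(int max, String prefix) {
        long count = 0;
        for (int v = 1; v <= max; v++) {
            if (Integer.toString(v).startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // Number of values in 1..max that are the prefix followed by exactly one
    // more character (what the restriction like '1_' really asks for).
    static long countPrefixPlusOneChar(int max, String prefix) {
        long count = 0;
        for (int v = 1; v <= max; v++) {
            String s = Integer.toString(v);
            if (s.length() == prefix.length() + 1 && s.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrefix(10_000_000, "1"));            // 1111112
        System.out.println(countPrefixPlusOneChar(10_000_000, "1")); // 10
    }
}
```

So in this scenario the prefix query returns roughly 11% of the values, of which only 10 survive the second pass.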
So the intention of the code before this fix was probably a more selective
optimization: use prefix queries when the LIKE pattern ends with % and wildcard
queries when it ends with _.
Maybe we should revert the Elastic code to be the same as before the fix, and
use this optimization only for queries ending with %.
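A minimal sketch of that proposal, assuming escape sequences in the LIKE pattern can be ignored. The class and method names are hypothetical, and the result is returned as a plain string rather than a Lucene or Elastic query object so that the example has no dependencies:

```java
// Hypothetical sketch, not the Oak implementation.
class LikeRewriteSketch {

    // Returns "term:<v>", "prefix:<v>" or "wildcard:<v>" for a SQL2 LIKE pattern.
    // Assumption: escape sequences such as \% and \_ are not handled here.
    static String rewrite(String like) {
        int firstPercent = like.indexOf('%');
        int firstUnderscore = like.indexOf('_');
        int len = like.length();

        // No wildcard at all: an exact match is enough.
        if (firstPercent == -1 && firstUnderscore == -1) {
            return "term:" + like;
        }
        // The only wildcard is a single % in the last position: prefix query.
        if (firstUnderscore == -1 && firstPercent == len - 1) {
            return "prefix:" + like.substring(0, len - 1);
        }
        // Everything else, including a trailing _, stays a wildcard query.
        return "wildcard:" + like.replace('%', '*').replace('_', '?');
    }

    public static void main(String[] args) {
        System.out.println(rewrite("12%"));  // prefix:12
        System.out.println(rewrite("12_"));  // wildcard:12?
        System.out.println(rewrite("1%2%")); // wildcard:1*2*
        System.out.println(rewrite("12"));   // term:12
    }
}
```

With this shape, {{like '12_'}} keeps the more selective wildcard query and only {{like '12%'}} is turned into a prefix query.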
As the code is not reachable because of the off-by-one bug, the current trunk
is using wildcard queries in all cases, so we are missing a potential easy
optimization.
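To illustrate the off-by-one: {{String.indexOf}} returns at most {{length() - 1}}, so a comparison against {{length()}} can never succeed, while a comparison against {{length() - 1}} detects a wildcard whose first occurrence is the last character. A small sketch with hypothetical method names; the empty-string guard is there because both sides of the comparison are -1 for an empty pattern:

```java
// Hypothetical sketch of the two checks; not taken from the Oak code base.
class IndexOfBoundsSketch {

    // Same shape as the condition in trunk: indexOf() is never equal to
    // length(), so this returns false for every input.
    static boolean originalCheck(String s, char wildcard) {
        return s.indexOf(wildcard) == s.length();
    }

    // True when the first (and therefore only) occurrence of the wildcard is
    // the last character of the string.
    static boolean onlyAtEnd(String s, char wildcard) {
        return !s.isEmpty() && s.indexOf(wildcard) == s.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(originalCheck("12%", '%')); // false
        System.out.println(onlyAtEnd("12%", '%'));     // true
        System.out.println(onlyAtEnd("1%2%", '%'));    // false
        System.out.println(onlyAtEnd("", '%'));        // false
    }
}
```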
> Unreachable code in the logic that processes like constraints
> -------------------------------------------------------------
>
> Key: OAK-9881
> URL: https://issues.apache.org/jira/browse/OAK-9881
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: indexing
> Reporter: Nuno Santos
> Priority: Minor
>
> In ElasticRequestHandler, the following code has a section that is
> unreachable:
> {code:java}
> private static Query like(String name, String first) {
> first = first.replace('%', WildcardQuery.WILDCARD_STRING);
> first = first.replace('_', WildcardQuery.WILDCARD_CHAR);
> int indexOfWS = first.indexOf(WildcardQuery.WILDCARD_STRING);
> int indexOfWC = first.indexOf(WildcardQuery.WILDCARD_CHAR);
> int len = first.length();
> if (indexOfWS == len || indexOfWC == len) {
> // Unreachable code
> }{code}
> The condition {{indexOfWS == len || indexOfWC == len}} will always evaluate
> to false because the variables {{indexOfWS}} and {{indexOfWC}} are between
> {{-1}} and {{len-1}} (from the specification of {{indexOf()}}), so they will
> never be equal to {{len}}. (I found this issue from a warning in the static
> analyzer of IntelliJ).
> Is this indeed a bug? If so, then we are missing tests to expose this bug.
> The same logic can be found here:
> https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndex.java#L767-L791