Re: [Initially posted to users@j.a.o] Problem with read limits & a query using a lucene index with many results (but below setting queryLimitReads)

2019-02-13 Thread Georg Henzler

Hi Thomas,

thanks for the quick answer!


> Yes, I have seen cases where an index is re-opened during query
> execution. In that case, already returned entries are read again and
> skipped, so basically counted twice. I think it would be good to fix
> this (only count entries once).


It sounds like this is the root cause of my problem. I created OAK-8046
to track it.


> I think queries should read at most a few thousand entries. That way,
> there are no problems if the limit is set to 100'000. If an
> application needs to read more than that, then best run multiple
> queries, using keyset pagination if needed:
>
> * https://blog.jooq.org/tag/keyset-pagination/
> * https://use-the-index-luke.com/no-offset


The use case with the redirect maps is not really UI related, and it does
not use an offset. Splitting the redirect map generation into multiple
queries is not that straightforward since we have ~2 million nodes... and
the number of nodes under the first-level root paths changes over time. So
the problem could be fixed for a little while, and then one of the root
paths grows beyond 100,000 again and the problem is back.
Before splitting it up into multiple queries I would rather use a traversal
without a query (that is at least 100% safe), but doing that feels wasteful
- the lucene index is in place and ready to be used. Also, it worked
perfectly fine with Oak 1.4.3. I still think it would be great to have
a query option to disable the limits for "export use cases" like the one
described.
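A plain traversal of that kind can be sketched like this (a self-contained
Java toy, with maps standing in for JCR nodes; `Node`, `collectRedirects`
and the sample content are made up for illustration - only the property
name `redirectTarget` comes from this thread):

```java
import java.util.*;

public class RedirectTraversal {

    // Toy stand-in for a JCR node: a property map plus named children.
    static class Node {
        final Map<String, String> props = new HashMap<>();
        final Map<String, Node> children = new LinkedHashMap<>();

        Node child(String name) {
            return children.computeIfAbsent(name, n -> new Node());
        }
    }

    // Depth-first walk collecting "path -> redirectTarget" for every node that
    // carries the property -- the query-free equivalent of
    // "WHERE s.[redirectTarget] IS NOT NULL".
    static Map<String, String> collectRedirects(Node node, String path, Map<String, String> out) {
        String target = node.props.get("redirectTarget");
        if (target != null) {
            out.put(path.isEmpty() ? "/" : path, target);
        }
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            collectRedirects(e.getValue(), path + "/" + e.getKey(), out);
        }
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node();
        root.child("content").child("a").props.put("redirectTarget", "/content/b");
        root.child("content").child("c"); // no redirect, skipped by the walk
        // prints {/content/a=/content/b}
        System.out.println(collectRedirects(root, "", new LinkedHashMap<>()));
    }
}
```

The walk touches every node exactly once, which is why it avoids the
double-counting that can happen when an index is re-opened mid-query.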


-Georg


Re: [Initially posted to users@j.a.o] Problem with read limits & a query using a lucene index with many results (but below setting queryLimitReads)

2019-02-13 Thread Thomas Mueller
Hi,

> Wouldn't it make sense to introduce a query option ala [1] to disable 
> read/memory limits for one particular query?

It's possible, but my fear is that people would use the option in their queries 
too often...

> OAK-6875 does not always have the desired effect (for sure there is some
> non-deterministic behaviour for large content being accessed via a lucene
> index)

Yes, I have seen cases where an index is re-opened during query execution. In 
that case, already returned entries are read again and skipped, so basically 
counted twice. I think it would be good to fix this (only count entries once).

I think queries should read at most a few thousand entries. That way, there 
are no problems if the limit is set to 100'000. If an application needs to read 
more than that, then best run multiple queries, using keyset pagination if 
needed:

* https://blog.jooq.org/tag/keyset-pagination/
* https://use-the-index-luke.com/no-offset
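As a toy illustration of the keyset approach (plain Java over an in-memory
list of paths; `page`, `fetchAll` and the page size are hypothetical, not
an Oak or JCR API - in a real repository each `page` call would be one
query ordered by path with a "> last seen path" condition):

```java
import java.util.*;
import java.util.stream.*;

public class KeysetPagination {

    // One simulated "query": up to pageSize paths strictly greater than
    // lastKey, in ascending order. A real query would push the "> lastKey"
    // condition and the ORDER BY down to the index.
    static List<String> page(List<String> allPaths, String lastKey, int pageSize) {
        return allPaths.stream()
                .filter(p -> p.compareTo(lastKey) > 0)
                .sorted()
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    // Issues small pages until a short page signals the end of the result set.
    static List<String> fetchAll(List<String> allPaths, int pageSize) {
        List<String> result = new ArrayList<>();
        String lastKey = ""; // all real keys (paths) sort after the empty string
        while (true) {
            List<String> batch = page(allPaths, lastKey, pageSize);
            result.addAll(batch);
            if (batch.size() < pageSize) {
                break;
            }
            lastKey = batch.get(batch.size() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/content/a", "/content/c", "/content/b", "/content/e", "/content/d");
        // prints [/content/a, /content/b, /content/c, /content/d, /content/e]
        System.out.println(fetchAll(paths, 2));
    }
}
```

Because each page restarts from the last seen key rather than an offset,
no page ever re-reads entries already returned.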

Regards,
Thomas
 



[Initially posted to users@j.a.o] Problem with read limits & a query using a lucene index with many results (but below setting queryLimitReads)

2019-02-12 Thread Georg Henzler

Hi all,

sorry for cross-posting, but I didn't get an answer on the users list.

I think the change made with OAK-6875 does not always have the desired
effect (for sure there is some non-deterministic behaviour for large
content being accessed via a lucene index, which should at least be
deterministic and explainable). See the email below for details (if
somebody could confirm or reject my assumptions, that would already help a
lot!)


Also, in general: wouldn't it make sense to introduce a query option à la
[1] to disable read/memory limits for one particular query? It would then
just be a safety net for queries that unexpectedly exceed the limits, and
for special use cases as described below it could be turned off.


-Georg

[1] 
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#Query_Option_Index_Tag
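For reference, the index-tag option from [1] is written like this in
JCR-SQL2 (a sketch; `myTag` is a made-up tag name, and a limits option
would presumably follow the same `OPTION(...)` pattern):

```sql
SELECT * FROM [cq:PageContent] AS s
WHERE ISDESCENDANTNODE([/content])
  AND s.[redirectTarget] IS NOT NULL
  OPTION(INDEX TAG [myTag])
```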


-------- Original Message --------
Subject: Problem with read limits & a query using a lucene index with 
many results (but below setting queryLimitReads)

Date: 2019-02-07 01:46
From: Georg Henzler 
To: us...@jackrabbit.apache.org
Reply-To: us...@jackrabbit.apache.org

Hi all,

We have a servlet in place that exports redirects to Apache using rewrite
maps [1]. That servlet runs a query [2] against a large repository that
holds ~2 million nodes of the primary type cq:PageContent (as referenced
in the query). We have a lucene index defined for the property
redirectTarget that holds around 1 million documents when checked via JMX
[3] (the custom index also holds the properties sling:alias and
sling:vanityPath, which are not strictly needed for this query but for
another use case; see [7] for the exact definition). When checking the
query with the explain query tool, it always uses the index
www_redirectmanager as desired. The number of nodes that have the property
redirectTarget set is ~150,000. The servlet usually returns within 1-2
minutes, which is totally fine (it is called once per hour).

Since upgrading to Oak 1.8.7 (we had 1.4.3 before without problems), we
get the error [6] in around 2% of the cases (so most of the time it works,
but sometimes we get the error and the servlet fails; it is *not*
deterministic). I suppose this is connected to the change in [5]. We have
already increased queryLimitInMemory and queryLimitReads
(PID org.apache.jackrabbit.oak.query.QueryEngineSettingsService) to
500,000 (from the default 200,000), but we still get the error every now
and then. We once had one node that always (deterministically) returned
the error [6]; after reindexing [7] we were back to the non-deterministic
2% of queries (but even while the problem was deterministic on that node,
explain query always reported that the index was used).
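For readers reproducing this setup: the limits are raised via an OSGi
configuration for that PID, e.g. a file
org.apache.jackrabbit.oak.query.QueryEngineSettingsService.config
containing (a sketch in Apache Sling's typed .config notation, where
L"..." marks Long values; the numbers mirror the ones from this thread):

```
queryLimitInMemory=L"500000"
queryLimitReads=L"500000"
```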

I have the following understanding:
1. The settings queryLimitInMemory and queryLimitReads are both evaluated
*after* the results from the index are retrieved (so the query engine asks
the index for nodes, gets ~150,000, reads those, and then applies further
criteria to narrow down the result set; the limits exist to avoid large
result sets during this filtering).
2. Having multiple properties in the index [3] should not really make a
difference for this particular problem, since no matter how many
properties the index holds, the result set for query [2] is always the
same.
3. No matter whether the assumptions from 1. and 2. are true, the problem
should be deterministic.

Has anyone else run into a similar problem? Are the assumptions above
correct? Obviously the query [2] could be split up into many queries for
sub paths, or even replaced by a traversal of all paths for the property,
but conceptually it should really be possible to do this in one query IMHO.

-Georg

[1] https://httpd.apache.org/docs/2.4/rewrite/rewritemap.html
[2] SELECT * FROM [cq:PageContent] AS s WHERE ISDESCENDANTNODE([/content])
    AND s.[redirectTarget] IS NOT NULL
[3]
/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3DLucene+Index+statistics%2Ctype%3DLuceneIndex
[4]
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#Slow_Queries_and_Read_Limits
[5] https://issues.apache.org/jira/browse/OAK-6875
[6] 07.12.2018 11:01:22.408 *WARN* [192.168.166.72 [1544176801343] GET
    /bin/www/redirectmap/redirecttarget HTTP/1.1]
    org.apache.jackrabbit.oak.query.FilterIterators The query read or
    traversed more than 50 nodes.
    java.lang.UnsupportedOperationException: The query read or traversed
    more than 50 nodes. To avoid affecting other tasks, processing was
    stopped.


[7]