[jira] [Created] (OAK-11139) Allow downloading only recently changed nodes from MongoDB

2024-09-23 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11139:


 Summary: Allow downloading only recently changed nodes from MongoDB
 Key: OAK-11139
 URL: https://issues.apache.org/jira/browse/OAK-11139
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Oak-run indexing allows downloading all nodes from MongoDB to the tree store.

However, sometimes we only need the set of recently changed nodes, not all 
nodes: for example, to update an existing index, or for backup purposes.
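As a rough illustration of the idea: Oak's MongoDB documents carry a "_modified" timestamp, so a "recent changes only" download would filter on it, roughly like the query {"_modified": {"$gte": since}}. The sketch below is a self-contained stand-in (an in-memory map replaces the MongoDB collection; all names are illustrative, not Oak's API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RecentNodes {
    // Keep only node paths whose "_modified" value is >= since,
    // mimicking a {"_modified": {"$gte": since}} query.
    static List<String> changedSince(Map<String, Long> modifiedByPath, long since) {
        return modifiedByPath.entrySet().stream()
                .filter(e -> e.getValue() >= since)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```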



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (OAK-11108) Tree store: support parallel indexing

2024-09-17 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-11108.
--
Resolution: Fixed

> Tree store: support parallel indexing
> -
>
> Key: OAK-11108
> URL: https://issues.apache.org/jira/browse/OAK-11108
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The tree store (unlike the flat file store) makes it relatively easy to 
> index in parallel, that is, using multiple threads. 
> We have already implemented parallel indexing by splitting the flat file 
> store, but this requires splitting at the exact "border" of Lucene documents. 
> Splitting takes time, is complicated (where exactly is the border?), and 
> doesn't always work, e.g. if index definitions have conflicting "borders" 
> (e.g. indexing folders and indexing assets and pages at the same time).





[jira] [Resolved] (OAK-11129) Improve Lucene documentation

2024-09-16 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-11129.
--
Resolution: Fixed

The public documentation has been updated.

> Improve Lucene documentation
> 
>
> Key: OAK-11129
> URL: https://issues.apache.org/jira/browse/OAK-11129
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The Lucene documentation 
> https://jackrabbit.apache.org/oak/docs/query/lucene.html is misleading for 
> binary properties. See also 
> https://stackoverflow.com/questions/78973742/indexing-a-binary-and-searching-with-contains-cannot-find-results/78989278#78989278





[jira] [Commented] (OAK-11129) Improve Lucene documentation

2024-09-16 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881959#comment-17881959
 ] 

Thomas Mueller commented on OAK-11129:
--

PR https://github.com/apache/jackrabbit-oak/pull/1720/files

> Improve Lucene documentation
> 
>
> Key: OAK-11129
> URL: https://issues.apache.org/jira/browse/OAK-11129
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The Lucene documentation 
> https://jackrabbit.apache.org/oak/docs/query/lucene.html is misleading for 
> binary properties. See also 
> https://stackoverflow.com/questions/78973742/indexing-a-binary-and-searching-with-contains-cannot-find-results/78989278#78989278





[jira] [Created] (OAK-11129) Improve Lucene documentation

2024-09-16 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11129:


 Summary: Improve Lucene documentation
 Key: OAK-11129
 URL: https://issues.apache.org/jira/browse/OAK-11129
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller


The Lucene documentation 
https://jackrabbit.apache.org/oak/docs/query/lucene.html is misleading for 
binary properties. See also 
https://stackoverflow.com/questions/78973742/indexing-a-binary-and-searching-with-contains-cannot-find-results/78989278#78989278





[jira] [Assigned] (OAK-11108) Tree store: support parallel indexing

2024-09-13 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-11108:


Assignee: Thomas Mueller

> Tree store: support parallel indexing
> -
>
> Key: OAK-11108
> URL: https://issues.apache.org/jira/browse/OAK-11108
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The tree store (unlike the flat file store) makes it relatively easy to 
> index in parallel, that is, using multiple threads. 
> We have already implemented parallel indexing by splitting the flat file 
> store, but this requires splitting at the exact "border" of Lucene documents. 
> Splitting takes time, is complicated (where exactly is the border?), and 
> doesn't always work, e.g. if index definitions have conflicting "borders" 
> (e.g. indexing folders and indexing assets and pages at the same time).





[jira] [Commented] (OAK-11073) Create conversion utils in oak-commons to convert iterables/iterators to set/list/stream

2024-09-12 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881321#comment-17881321
 ] 

Thomas Mueller commented on OAK-11073:
--

Conversions is used in org.apache.jackrabbit.oak.plugins.value (it is public).

What about AdapterUtils? That name is a bit less likely to cause conflicts. 
But I don't have a strong opinion; in my view, we could also just leave it as 
it is.



> Create conversion utils in oak-commons to convert iterables/iterators to 
> set/list/stream
> 
>
> Key: OAK-11073
> URL: https://issues.apache.org/jira/browse/OAK-11073
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: commons
>Reporter: Rishabh Daim
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.70.0
>
> Attachments: stream-it.diff
>
>






[jira] [Resolved] (OAK-11107) Index statistics support for multi-threaded indexing

2024-09-11 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-11107.
--
Resolution: Fixed

> Index statistics support for multi-threaded indexing
> 
>
> Key: OAK-11107
> URL: https://issues.apache.org/jira/browse/OAK-11107
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The oak-run-commons IndexerStatisticsTracker does not track nodes correctly 
> when multiple threads are used: if a second thread calls "startEntry" before 
> the first thread calls "endEntry", the starting times of the entries get 
> mixed up. As a result, the total time is incorrect, and many entries are 
> incorrectly logged as slow.





[jira] [Commented] (OAK-11108) Tree store: support parallel indexing

2024-09-11 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880891#comment-17880891
 ] 

Thomas Mueller commented on OAK-11108:
--

PR https://github.com/apache/jackrabbit-oak/pull/1707

> Tree store: support parallel indexing
> -
>
> Key: OAK-11108
> URL: https://issues.apache.org/jira/browse/OAK-11108
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Priority: Major
>
> The tree store (unlike the flat file store) makes it relatively easy to 
> index in parallel, that is, using multiple threads. 
> We have already implemented parallel indexing by splitting the flat file 
> store, but this requires splitting at the exact "border" of Lucene documents. 
> Splitting takes time, is complicated (where exactly is the border?), and 
> doesn't always work, e.g. if index definitions have conflicting "borders" 
> (e.g. indexing folders and indexing assets and pages at the same time).





[jira] [Created] (OAK-11108) Tree store: support parallel indexing

2024-09-11 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11108:


 Summary: Tree store: support parallel indexing
 Key: OAK-11108
 URL: https://issues.apache.org/jira/browse/OAK-11108
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller


The tree store (unlike the flat file store) makes it relatively easy to index 
in parallel, that is, using multiple threads.

We have already implemented parallel indexing by splitting the flat file 
store, but this requires splitting at the exact "border" of Lucene documents. 
Splitting takes time, is complicated (where exactly is the border?), and 
doesn't always work, e.g. if index definitions have conflicting "borders" 
(e.g. indexing folders and indexing assets and pages at the same time).
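The parallel-indexing idea can be sketched with plain ExecutorService workers, each handling an independent batch of node paths. This is a hypothetical simplification, not the oak-run implementation: indexInParallel, the batch layout, and the counter standing in for actual indexing are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelIndexer {
    // Index each batch of paths on its own thread; returns the total number
    // of entries processed. No pre-splitting at document borders is needed,
    // because each batch is independent.
    static int indexInParallel(List<List<String>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger indexed = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        for (List<String> batch : batches) {
            futures.add(pool.submit(() -> {
                for (String path : batch) {
                    indexed.incrementAndGet(); // stand-in for indexing `path`
                }
            }));
        }
        for (Future<?> f : futures) {
            try {
                f.get(); // wait for this worker to finish
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        pool.shutdown();
        return indexed.get();
    }
}
```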





[jira] [Assigned] (OAK-11107) Index statistics support for multi-threaded indexing

2024-09-11 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-11107:


Assignee: Thomas Mueller

> Index statistics support for multi-threaded indexing
> 
>
> Key: OAK-11107
> URL: https://issues.apache.org/jira/browse/OAK-11107
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The oak-run-commons IndexerStatisticsTracker does not track nodes correctly 
> when multiple threads are used: if a second thread calls "startEntry" before 
> the first thread calls "endEntry", the starting times of the entries get 
> mixed up. As a result, the total time is incorrect, and many entries are 
> incorrectly logged as slow.





[jira] [Created] (OAK-11107) Index statistics support for multi-threaded indexing

2024-09-11 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11107:


 Summary: Index statistics support for multi-threaded indexing
 Key: OAK-11107
 URL: https://issues.apache.org/jira/browse/OAK-11107
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller


The oak-run-commons IndexerStatisticsTracker does not track nodes correctly 
when multiple threads are used: if a second thread calls "startEntry" before 
the first thread calls "endEntry", the starting times of the entries get mixed 
up. As a result, the total time is incorrect, and many entries are incorrectly 
logged as slow.
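To illustrate why a single shared start-time field breaks under concurrency, and one way to fix it: keep the start time per thread. The class below is a simplified, hypothetical tracker following the method names from the description ("startEntry"/"endEntry"), not the actual oak-run-commons implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

class EntryTimingTracker {
    // Each thread keeps its own start time, so interleaved
    // startEntry/endEntry calls from other threads cannot mix them up.
    private final ThreadLocal<Long> start = new ThreadLocal<>();
    private final AtomicLong totalNanos = new AtomicLong();

    void startEntry() {
        start.set(System.nanoTime());
    }

    void endEntry() {
        // The elapsed time is computed against this thread's own start time.
        totalNanos.addAndGet(System.nanoTime() - start.get());
        start.remove();
    }

    long totalNanos() {
        return totalNanos.get();
    }
}
```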





[jira] [Created] (OAK-11099) Tree Store: support indexing from a pack file (without unpacking)

2024-09-09 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11099:


 Summary: Tree Store: support indexing from a pack file (without 
unpacking)
 Key: OAK-11099
 URL: https://issues.apache.org/jira/browse/OAK-11099
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller


The tree store supports pack files, which reduce the number of files.

Currently, such pack files need to be unpacked before indexing can start, 
which is not actually necessary. Unpacking a 225 GB file takes about 15 
minutes; these 15 minutes can be saved.
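The core of "indexing without unpacking" is random access into the pack: read an entry directly at its offset instead of extracting everything first. The sketch below uses a hypothetical, simplified layout (offset + length per entry), not the tree store's actual pack format; writeTempPack is only a test helper.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PackReader {
    // Read `length` bytes starting at `offset` and decode as UTF-8.
    static String readEntry(String packFile, long offset, int length) {
        try (RandomAccessFile f = new RandomAccessFile(packFile, "r")) {
            byte[] buf = new byte[length];
            f.seek(offset);    // jump straight to the entry
            f.readFully(buf);  // no need to extract the whole pack first
            return new String(buf, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for demonstration: write bytes to a temp file, return its path.
    static String writeTempPack(byte[] data) {
        try {
            File f = File.createTempFile("pack", ".bin");
            f.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(f)) {
                out.write(data);
            }
            return f.getPath();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```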





[jira] [Commented] (OAK-11073) Create conversion utils in oak-commons to convert iterables/iterators to set/list

2024-09-09 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880206#comment-17880206
 ] 

Thomas Mueller commented on OAK-11073:
--

This looks good to me. Related: I never really understood why the enhanced 
"for" loop doesn't support "Iterator". The reasons given in the JSR don't 
convince me:

"Appendix I. Design FAQ -- Why can't I use the enhanced for statement with an 
Iterator (rather than an Iterable or array)?

Two reasons: (1) The construct would not provide much in the way of syntactic 
improvement if you had an explicit iterator in your code, and (2) Execution of 
the loop would have the "side effect" of advancing (and typically exhausting) 
the iterator. In other words, the enhanced for statement provides a simple, 
elegant solution for the common case of iterating over a collection or array, 
and does not attempt to address more complicated cases, which are better 
addressed with the traditional for statement."


> Create conversion utils in oak-commons to convert iterables/iterators to 
> set/list
> -
>
> Key: OAK-11073
> URL: https://issues.apache.org/jira/browse/OAK-11073
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: commons
>Reporter: Rishabh Daim
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.70.0
>
> Attachments: stream-it.diff
>
>






[jira] [Resolved] (OAK-10532) Cost estimation for "not(@x)" calculates cost for "@x='value'" instead

2024-09-04 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10532.
--
Fix Version/s: 1.70.0
   Resolution: Fixed

> Cost estimation for "not(@x)" calculates cost for "@x='value'" instead
> --
>
> Key: OAK-10532
> URL: https://issues.apache.org/jira/browse/OAK-10532
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.70.0
>
>
> The cost estimation for a query that uses a Lucene index calculates the cost 
> incorrectly if there is a "not()" condition. Example:
> {noformat}
> /jcr:root//*[(not(@x)) and (not(@y))]
> {noformat}
> The Lucene query is then:
> {noformat}
> +:nullProps:x +:nullProps:y
> {noformat}
> But the cost estimation seems to take into account the number of documents 
> for the fields "x" and "y", instead of the field ":nullProps".





[jira] [Commented] (OAK-10532) Cost estimation for "not(@x)" calculates cost for "@x='value'" instead

2024-08-28 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877457#comment-17877457
 ] 

Thomas Mueller commented on OAK-10532:
--

https://github.com/apache/jackrabbit-oak/pull/1673/files

> Cost estimation for "not(@x)" calculates cost for "@x='value'" instead
> --
>
> Key: OAK-10532
> URL: https://issues.apache.org/jira/browse/OAK-10532
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The cost estimation for a query that uses a Lucene index calculates the cost 
> incorrectly if there is a "not()" condition. Example:
> {noformat}
> /jcr:root//*[(not(@x)) and (not(@y))]
> {noformat}
> The Lucene query is then:
> {noformat}
> +:nullProps:x +:nullProps:y
> {noformat}
> But the cost estimation seems to take into account the number of documents 
> for the fields "x" and "y", instead of the field ":nullProps".





[jira] [Assigned] (OAK-10532) Cost estimation for "not(@x)" calculates cost for "@x='value'" instead

2024-08-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10532:


Assignee: Thomas Mueller

> Cost estimation for "not(@x)" calculates cost for "@x='value'" instead
> --
>
> Key: OAK-10532
> URL: https://issues.apache.org/jira/browse/OAK-10532
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> The cost estimation for a query that uses a Lucene index calculates the cost 
> incorrectly if there is a "not()" condition. Example:
> {noformat}
> /jcr:root//*[(not(@x)) and (not(@y))]
> {noformat}
> The Lucene query is then:
> {noformat}
> +:nullProps:x +:nullProps:y
> {noformat}
> But the cost estimation seems to take into account the number of documents 
> for the fields "x" and "y", instead of the field ":nullProps".





[jira] [Resolved] (OAK-11054) Oak AsyncCheckpointCreatorTest sometimes fails

2024-08-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-11054.
--
Resolution: Fixed

> Oak AsyncCheckpointCreatorTest sometimes fails
> --
>
> Key: OAK-11054
> URL: https://issues.apache.org/jira/browse/OAK-11054
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Priority: Major
>  Labels: checkpoint, index
>
> The "oldest" checkpoint is removed, but for this to work reliably, the
> checkpoints need to be at least 1 ms apart. So if we wait at least 1 ms,
> then the checkpoints are not on the same millisecond. This is a bit of a
> hack, but I think it's safer to change the test case than to change the code.
> https://github.com/apache/jackrabbit-oak/actions/runs/10506276783/job/29105589468#step:6:2225
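The test-side workaround described above can be sketched as a small helper that spins until the millisecond clock has actually advanced, so two consecutive checkpoints can never share a timestamp. This is an illustrative sketch, not the actual test code.

```java
class ClockTick {
    // Busy-wait until System.currentTimeMillis() changes, so the next
    // checkpoint is guaranteed to get a later timestamp. Spins at most
    // about one millisecond.
    static long waitForNextMillisecond() {
        long start = System.currentTimeMillis();
        long now;
        do {
            now = System.currentTimeMillis();
        } while (now == start);
        return now;
    }
}
```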





[jira] [Resolved] (OAK-11055) Warnings "falling back to classic diff" fill the log

2024-08-27 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-11055.
--
Fix Version/s: 1.70.0
   Resolution: Fixed

> Warnings "falling back to classic diff" fill the log
> 
>
> Key: OAK-11055
> URL: https://issues.apache.org/jira/browse/OAK-11055
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: documentmk
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.70.0
>
>
> I see the following warning a lot in the log file. As this is a known case, 
> I think we should not always log the full exception stack trace, only the 
> message. Otherwise, the log file might fill the disk.
> {noformat}
> 00:04:16.333 [main] WARN  o.a.j.o.p.document.DocumentNodeStore - 
> diffJournalChildren failed with IllegalStateException, falling back to 
> classic diff
> java.lang.IllegalStateException: Root document does not have a lastRev entry 
> for local clusterId 0
> at 
> org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.readTrunkChanges(JournalDiffLoader.java:139)
> at 
> org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.call(JournalDiffLoader.java:75)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.diffImpl(DocumentNodeStore.java:3341)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$9.call(DocumentNodeStore.java:1991)
> at 
> org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:85)
> at 
> org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:79)
> at 
> org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.load(CacheLIRS.java:1019)
> at 
> org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.get(CacheLIRS.java:980)
> at org.apache.jackrabbit.oak.cache.CacheLIRS.get(CacheLIRS.java:291)
> at 
> org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:243)
> at 
> org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:57)
> at 
> org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache.getChanges(MemoryDiffCache.java:79)
> at 
> org.apache.jackrabbit.oak.plugins.document.TieredDiffCache.getChanges(TieredDiffCache.java:74)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1986)
> at 
> org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
> at 
> org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
> at 
> org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
> at 
> org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
> at 
> org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:51)
> at 
> org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalFlatFileStoreStrategy.createSortedStoreFile(IncrementalFlatFileStoreStrategy.java:88)
> at 
> org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalStoreBuilder.build(IncrementalStoreBuilder.java:124)
> at 
> org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.buildStore(DocumentStoreIndexerBase.java:232)
> at 
> com.adobe.granite.indexing.tool.BuildIndexStoreCmd.run(BuildIndexSt

[jira] [Commented] (OAK-11055) Warnings "falling back to classic diff" fill the log

2024-08-22 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875835#comment-17875835
 ] 

Thomas Mueller commented on OAK-11055:
--

https://github.com/apache/jackrabbit-oak/pull/1665

> Warnings "falling back to classic diff" fill the log
> 
>
> Key: OAK-11055
> URL: https://issues.apache.org/jira/browse/OAK-11055
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: documentmk
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I see the following warning a lot in the log file. As this is a known case, 
> I think we should not always log the full exception stack trace, only the 
> message. Otherwise, the log file might fill the disk.
> {noformat}
> 00:04:16.333 [main] WARN  o.a.j.o.p.document.DocumentNodeStore - 
> diffJournalChildren failed with IllegalStateException, falling back to 
> classic diff
> java.lang.IllegalStateException: Root document does not have a lastRev entry 
> for local clusterId 0
> at 
> org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.readTrunkChanges(JournalDiffLoader.java:139)
> at 
> org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.call(JournalDiffLoader.java:75)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.diffImpl(DocumentNodeStore.java:3341)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$9.call(DocumentNodeStore.java:1991)
> at 
> org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:85)
> at 
> org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:79)
> at 
> org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.load(CacheLIRS.java:1019)
> at 
> org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.get(CacheLIRS.java:980)
> at org.apache.jackrabbit.oak.cache.CacheLIRS.get(CacheLIRS.java:291)
> at 
> org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:243)
> at 
> org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:57)
> at 
> org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache.getChanges(MemoryDiffCache.java:79)
> at 
> org.apache.jackrabbit.oak.plugins.document.TieredDiffCache.getChanges(TieredDiffCache.java:74)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1986)
> at 
> org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
> at 
> org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
> at 
> org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
> at 
> org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
> at 
> org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
> at 
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
> at 
> org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:51)
> at 
> org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalFlatFileStoreStrategy.createSortedStoreFile(IncrementalFlatFileStoreStrategy.java:88)
> at 
> org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalStoreBuilder.build(IncrementalStoreBuilder.java:124)
> at 
> org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.buildStore(DocumentStoreIndexerBase.java:232)
> at 
> com.adobe.granite.indexing.tool.BuildIndexS

[jira] [Created] (OAK-11055) Warnings "falling back to classic diff" fill the log

2024-08-22 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11055:


 Summary: Warnings "falling back to classic diff" fill the log
 Key: OAK-11055
 URL: https://issues.apache.org/jira/browse/OAK-11055
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller


I see the following warning a lot in the log file. As this is a known case, I 
think we should not always log the full exception stack trace, only the 
message. Otherwise, the log file might fill the disk.

{noformat}
00:04:16.333 [main] WARN  o.a.j.o.p.document.DocumentNodeStore - 
diffJournalChildren failed with IllegalStateException, falling back to classic 
diff
java.lang.IllegalStateException: Root document does not have a lastRev entry 
for local clusterId 0
at 
org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.readTrunkChanges(JournalDiffLoader.java:139)
at 
org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.call(JournalDiffLoader.java:75)
at 
org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.diffImpl(DocumentNodeStore.java:3341)
at 
org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$9.call(DocumentNodeStore.java:1991)
at 
org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:85)
at 
org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:79)
at 
org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.load(CacheLIRS.java:1019)
at org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.get(CacheLIRS.java:980)
at org.apache.jackrabbit.oak.cache.CacheLIRS.get(CacheLIRS.java:291)
at 
org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:243)
at 
org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:57)
at 
org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache.getChanges(MemoryDiffCache.java:79)
at 
org.apache.jackrabbit.oak.plugins.document.TieredDiffCache.getChanges(TieredDiffCache.java:74)
at 
org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1986)
at 
org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
at org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
at org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
at org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
at org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:51)
at org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalFlatFileStoreStrategy.createSortedStoreFile(IncrementalFlatFileStoreStrategy.java:88)
at org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalStoreBuilder.build(IncrementalStoreBuilder.java:124)
at org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.buildStore(DocumentStoreIndexerBase.java:232)
at com.adobe.granite.indexing.tool.BuildIndexStoreCmd.run(BuildIndexStoreCmd.java:167)
at com.adobe.granite.indexing.tool.Main.main(Main.java:124)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-11055) Warnings "falling back to classic diff" fill the log

2024-08-22 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-11055:
-
Component/s: documentmk

> Warnings "falling back to classic diff" fill the log
> 
>
> Key: OAK-11055
> URL: https://issues.apache.org/jira/browse/OAK-11055
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: documentmk
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I see the following warning a lot in the log file. As this is a known case, I 
> think we should not always log the exception stack trace, only the message. 
> Otherwise, the log file could fill up the disk.
> {noformat}
> 00:04:16.333 [main] WARN  o.a.j.o.p.document.DocumentNodeStore - 
> diffJournalChildren failed with IllegalStateException, falling back to 
> classic diff
> java.lang.IllegalStateException: Root document does not have a lastRev entry 
> for local clusterId 0
> at org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.readTrunkChanges(JournalDiffLoader.java:139)
> at org.apache.jackrabbit.oak.plugins.document.JournalDiffLoader.call(JournalDiffLoader.java:75)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.diffImpl(DocumentNodeStore.java:3341)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$9.call(DocumentNodeStore.java:1991)
> at org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:85)
> at org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache$1.call(MemoryDiffCache.java:79)
> at org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.load(CacheLIRS.java:1019)
> at org.apache.jackrabbit.oak.cache.CacheLIRS$Segment.get(CacheLIRS.java:980)
> at org.apache.jackrabbit.oak.cache.CacheLIRS.get(CacheLIRS.java:291)
> at org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:243)
> at org.apache.jackrabbit.oak.plugins.document.persistentCache.NodeCache.get(NodeCache.java:57)
> at org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache.getChanges(MemoryDiffCache.java:79)
> at org.apache.jackrabbit.oak.plugins.document.TieredDiffCache.getChanges(TieredDiffCache.java:74)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1986)
> at org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
> at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
> at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
> at org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
> at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
> at org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:147)
> at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compareExisting(JsopNodeStateDiffer.java:100)
> at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer$1.childNodeChanged(JsopNodeStateDiffer.java:65)
> at org.apache.jackrabbit.oak.plugins.document.DiffCache.parseJsopDiff(DiffCache.java:123)
> at org.apache.jackrabbit.oak.plugins.document.JsopNodeStateDiffer.compare(JsopNodeStateDiffer.java:51)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1993)
> at org.apache.jackrabbit.oak.plugins.document.AbstractDocumentNodeState.compareAgainstBaseState(AbstractDocumentNodeState.java:118)
> at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:51)
> at org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalFlatFileStoreStrategy.createSortedStoreFile(IncrementalFlatFileStoreStrategy.java:88)
> at org.apache.jackrabbit.oak.index.indexer.document.incrementalstore.IncrementalStoreBuilder.build(IncrementalStoreBuilder.java:124)
> at org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.buildStore(DocumentStoreIndexerBase.java:232)
> at com.adobe.granite.indexing.tool.BuildIndexStoreCmd.run(BuildIndexStoreCmd.java:167)
> at com.adobe.granite.indexing.tool

[jira] [Commented] (OAK-11054) Oak AsyncCheckpointCreatorTest sometimes fails

2024-08-22 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875828#comment-17875828
 ] 

Thomas Mueller commented on OAK-11054:
--

https://github.com/apache/jackrabbit-oak/pull/1664

> Oak AsyncCheckpointCreatorTest sometimes fails
> --
>
> Key: OAK-11054
> URL: https://issues.apache.org/jira/browse/OAK-11054
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Priority: Major
>  Labels: checkpoint, index
>
> The "oldest" checkpoint is removed, but for this to work reliably, the
> checkpoints need to be at least 1 ms apart. So if we wait at least 1 ms,
> then the checkpoints are not on the same millisecond. This is a bit of a
> hack, but I think it's safer to change the test case than to change the code.
> https://github.com/apache/jackrabbit-oak/actions/runs/10506276783/job/29105589468#step:6:2225





[jira] [Updated] (OAK-11054) Oak AsyncCheckpointCreatorTest sometimes fails

2024-08-22 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-11054:
-
Labels: checkpoint index  (was: index)

> Oak AsyncCheckpointCreatorTest sometimes fails
> --
>
> Key: OAK-11054
> URL: https://issues.apache.org/jira/browse/OAK-11054
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Priority: Major
>  Labels: checkpoint, index
>
> The "oldest" checkpoint is removed, but for this to work reliably, the
> checkpoints need to be at least 1 ms apart. So if we wait at least 1 ms,
> then the checkpoints are not on the same millisecond. This is a bit of a
> hack, but I think it's safer to change the test case than to change the code.
> https://github.com/apache/jackrabbit-oak/actions/runs/10506276783/job/29105589468#step:6:2225





[jira] [Updated] (OAK-11054) Oak AsyncCheckpointCreatorTest sometimes fails

2024-08-22 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-11054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-11054:
-
Labels: index  (was: )

> Oak AsyncCheckpointCreatorTest sometimes fails
> --
>
> Key: OAK-11054
> URL: https://issues.apache.org/jira/browse/OAK-11054
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Priority: Major
>  Labels: index
>
> The "oldest" checkpoint is removed, but for this to work reliably, the
> checkpoints need to be at least 1 ms apart. So if we wait at least 1 ms,
> then the checkpoints are not on the same millisecond. This is a bit of a
> hack, but I think it's safer to change the test case than to change the code.
> https://github.com/apache/jackrabbit-oak/actions/runs/10506276783/job/29105589468#step:6:2225





[jira] [Created] (OAK-11054) Oak AsyncCheckpointCreatorTest sometimes fails

2024-08-22 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11054:


 Summary: Oak AsyncCheckpointCreatorTest sometimes fails
 Key: OAK-11054
 URL: https://issues.apache.org/jira/browse/OAK-11054
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller


The "oldest" checkpoint is removed, but for this to work reliably, the
checkpoints need to be at least 1 ms apart. So if we wait at least 1 ms,
then the checkpoints are not on the same millisecond. This is a bit of a
hack, but I think it's safer to change the test case than to change the code.

https://github.com/apache/jackrabbit-oak/actions/runs/10506276783/job/29105589468#step:6:2225
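The test-side fix described above (wait until the millisecond clock has advanced, so two checkpoints never share a timestamp) can be sketched as follows. This is an illustrative Python sketch, not Oak's actual Java test code, and `next_millis` is a hypothetical helper:

```python
import time

# Illustrative sketch (not Oak's test code): busy-wait until the millisecond
# clock has advanced past `last_ms`, so that two consecutive "checkpoints"
# never land on the same millisecond.
def next_millis(last_ms):
    ms = int(time.time() * 1000)
    while ms <= last_ms:
        time.sleep(0.001)  # wait at least ~1 ms, then re-read the clock
        ms = int(time.time() * 1000)
    return ms

t1 = next_millis(0)
t2 = next_millis(t1)
assert t2 > t1  # the two checkpoint timestamps are at least 1 ms apart
```

With this, removing the "oldest" checkpoint is deterministic, because no two checkpoints can tie on the same millisecond.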





[jira] [Commented] (OAK-11018) doc: clarify warning about setting jcr:uuid on non-referenceable nodes

2024-08-15 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873910#comment-17873910
 ] 

Thomas Mueller commented on OAK-11018:
--

I'm not sure what you mean by "evil" here.

> Manually adding a property with the name jcr:uuid to a non referenceable node 
> might have unexpected effects as Oak maintains an unique index on jcr:uuid 
> properties. As the namespace jcr is reserved, doing so is strongly 
> discouraged.

I would keep this. Or maybe we can clarify it? I don't know what exactly is 
unclear.

>  might have unexpected effects 

Someone might expect duplicate UUIDs to work, and a "uniqueness constraint 
violation" would then be unexpected to them.

> Manually adding a property with the name jcr:uuid to a non referenceable

I'm not sure what the motivation for this issue is. Do you plan to add such 
properties?

> doc: clarify warning about setting jcr:uuid on non-referenceable nodes
> --
>
> Key: OAK-11018
> URL: https://issues.apache.org/jira/browse/OAK-11018
> Project: Jackrabbit Oak
>  Issue Type: Documentation
>  Components: doc
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Minor
>
> [https://jackrabbit.apache.org/oak/docs/differences.html#Identifiers] says 
> (as per change in OAK-2164):
> {quote}Manually adding a property with the name jcr:uuid to a non 
> referenceable node might have unexpected effects as Oak maintains an unique 
> index on jcr:uuid properties. As the namespace jcr is reserved, doing so is 
> strongly discouraged.
> {quote}
> But the tests for OAK-11000 show that this just works as "expected in Oak" 
> (throwing an exception even though no mix:referenceable is present) - as the 
> UUID index is maintained even for nodes that do not have mix:referenceable.
> Should we remove or rephrase that warning?





[jira] [Commented] (OAK-11025) Silence more warnings for ordered properties

2024-08-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873517#comment-17873517
 ] 

Thomas Mueller commented on OAK-11025:
--

https://github.com/apache/jackrabbit-oak/pull/1645

> Silence more warnings for ordered properties
> 
>
> Key: OAK-11025
> URL: https://issues.apache.org/jira/browse/OAK-11025
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> With oak-run indexing, we have a lot of warnings, each with a stack trace, 
> that look like this:
> {noformat}
> 07:23:34.803 [main] WARN  o.a.j.o.p.i.l.LuceneDocumentMaker - [...] Ignoring 
> ordered property. Could not convert property ... of type STRING to type DATE 
> for path ...
> java.lang.IllegalArgumentException: Not a date string: 2025-05-06T15:27:36
>   at org.apache.jackrabbit.oak.plugins.value.Conversions$Converter.toCalendar(Conversions.java:100)
>   at org.apache.jackrabbit.oak.plugins.value.Conversions$Converter.toDate(Conversions.java:112)
> {noformat}
> We already have a mechanism to silence similar messages (logging them only 
> once every 10 seconds). So we can extend that mechanism to also cover the 
> other cases.
> I checked and found 3 cases:
> * IllegalArgumentException: Not a date string
> * RuntimeException: Unable to parse the provided date field
> * NumberFormatException: For input string
> This issue is not about the performance impact, but to reduce the amount of 
> unnecessary logging.





[jira] [Created] (OAK-11025) Silence more warnings for ordered properties

2024-08-14 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-11025:


 Summary: Silence more warnings for ordered properties
 Key: OAK-11025
 URL: https://issues.apache.org/jira/browse/OAK-11025
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller
Assignee: Thomas Mueller


With oak-run indexing, we have a lot of warnings, each with a stack trace, 
that look like this:

{noformat}
07:23:34.803 [main] WARN  o.a.j.o.p.i.l.LuceneDocumentMaker - [...] Ignoring 
ordered property. Could not convert property ... of type STRING to type DATE 
for path ...
java.lang.IllegalArgumentException: Not a date string: 2025-05-06T15:27:36
at org.apache.jackrabbit.oak.plugins.value.Conversions$Converter.toCalendar(Conversions.java:100)
at org.apache.jackrabbit.oak.plugins.value.Conversions$Converter.toDate(Conversions.java:112)
{noformat}
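For context on the error above: the string `2025-05-06T15:27:36` lacks a timezone designator, which the strict date conversion rejects. A hedged Python sketch of such a strict check (the helper name and regex are assumptions for illustration, not Oak's `Conversions` implementation):

```python
import re
from datetime import datetime

# Hypothetical strict ISO-8601 date check, similar in spirit to a converter
# that rejects "Not a date string": the value must carry an explicit
# timezone, either "Z" or a +hh:mm / -hh:mm offset.
TZ_SUFFIX = re.compile(r"(Z|[+-]\d{2}:\d{2})$")

def is_strict_date_string(s):
    if not TZ_SUFFIX.search(s):
        return False  # no timezone designator -> not a valid date string
    try:
        datetime.fromisoformat(s.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

assert is_strict_date_string("2025-05-06T15:27:36") is False       # rejected
assert is_strict_date_string("2025-05-06T15:27:36+02:00") is True  # accepted
```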

We already have a mechanism to silence similar messages (logging them only once 
every 10 seconds). So we can extend that mechanism to also cover the other cases.

I checked and found 3 cases:

* IllegalArgumentException: Not a date string
* RuntimeException: Unable to parse the provided date field
* NumberFormatException: For input string

This issue is not about the performance impact, but to reduce the amount of 
unnecessary logging.
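The silencing mechanism described above (logging a given message at most once per interval) could look roughly like this. The class and method names are invented for the sketch and do not mirror Oak's API:

```python
import time

# Illustrative rate-limited logger: a message key is fully logged at most
# once per `interval` seconds; repeats inside the window are suppressed.
class ThrottledLogger:
    def __init__(self, interval=10.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_logged = {}  # message key -> time of last full log
        self.lines = []        # stands in for the real log appender

    def warn(self, key, message):
        now = self.clock()
        last = self.last_logged.get(key)
        if last is None or now - last >= self.interval:
            self.last_logged[key] = now
            self.lines.append(message)  # the stack trace would go here too
            return True
        return False  # suppressed: already logged within the window

fake_now = [0.0]  # injectable clock so the example is deterministic
log = ThrottledLogger(interval=10.0, clock=lambda: fake_now[0])
assert log.warn("not-a-date", "Not a date string: ...") is True
fake_now[0] = 5.0
assert log.warn("not-a-date", "Not a date string: ...") is False  # in window
fake_now[0] = 12.0
assert log.warn("not-a-date", "Not a date string: ...") is True   # window over
```

Keying by exception class and message prefix (rather than the full message) would collapse the three cases listed above into a handful of throttled keys.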





[jira] [Commented] (OAK-10341) Indexing: replace FlatFileStore+PersistedLinkedList with a tree store

2024-07-11 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865087#comment-17865087
 ] 

Thomas Mueller commented on OAK-10341:
--

I wasn't able to rebase the old branch for some reason, so I created a new 
branch.

https://github.com/apache/jackrabbit-oak/pull/1577

> Indexing: replace FlatFileStore+PersistedLinkedList with a tree store
> -
>
> Key: OAK-10341
> URL: https://issues.apache.org/jira/browse/OAK-10341
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, for indexing large repositories with the document store, we first 
> read all nodes and write them to a sorted file (sorting and merging when 
> needed). Then we index from that sorted file (called "FlatFileStore").
> There are multiple problems with this mechanism:
> * The last merging stage of the flat file store is actually not needed: we 
> could index from the un-merged streams. It would save one step where we write 
> and read all the data.
> * It requires to know the aggregation in the index definition, in order to 
> have a set of "preferred children". If this is unknown, then indexing might 
> take nearly infinite time. 
> * Even if it is known, indexing might be very slow, especially if there 
> are many direct child nodes for some of the nodes that require aggregation. 
> * It requires a PersistedLinkedList to avoid running out of memory. This 
> persisted linked list uses a key-value store internally. This is an 
> additional overhead: we store and read the data again. However, access to 
> that storage is still done using just an iterator, and not with a key lookup. 
> So performance can still be quite bad.
> * For parallel indexing, we split the flat file. This is not possible unless 
> we know the aggregation. Sometimes splitting is not possible.
> We want to explore using a tree store that would solve all of the above 
> problems.
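The first point above (indexing from the un-merged sorted streams instead of producing one merged flat file) can be illustrated with a lazy k-way merge; the runs and paths below are invented for the sketch:

```python
import heapq

# Hedged sketch of "index from un-merged streams": instead of first merging
# sorted runs into one flat file on disk, a lazy k-way merge can feed the
# indexer directly in globally sorted order.
run1 = ["/content/a", "/content/c", "/content/e"]
run2 = ["/content/b", "/content/d"]
run3 = ["/content/f"]

# heapq.merge yields the sorted union lazily, without materializing a
# merged file; the indexer would consume this iterator.
merged = list(heapq.merge(run1, run2, run3))
assert merged == ["/content/a", "/content/b", "/content/c",
                  "/content/d", "/content/e", "/content/f"]
```

In the real store the runs would be on-disk sorted files and the entries node records, but the merge structure is the same.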





[jira] [Assigned] (OAK-10341) Indexing: replace FlatFileStore+PersistedLinkedList with a tree store

2024-07-11 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10341:


Assignee: Thomas Mueller

> Indexing: replace FlatFileStore+PersistedLinkedList with a tree store
> -
>
> Key: OAK-10341
> URL: https://issues.apache.org/jira/browse/OAK-10341
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, for indexing large repositories with the document store, we first 
> read all nodes and write them to a sorted file (sorting and merging when 
> needed). Then we index from that sorted file (called "FlatFileStore").
> There are multiple problems with this mechanism:
> * The last merging stage of the flat file store is actually not needed: we 
> could index from the un-merged streams. It would save one step where we write 
> and read all the data.
> * It requires to know the aggregation in the index definition, in order to 
> have a set of "preferred children". If this is unknown, then indexing might 
> take nearly infinite time. 
> * Even if it is known, indexing might be very slow, especially if there 
> are many direct child nodes for some of the nodes that require aggregation. 
> * It requires a PersistedLinkedList to avoid running out of memory. This 
> persisted linked list uses a key-value store internally. This is an 
> additional overhead: we store and read the data again. However, access to 
> that storage is still done using just an iterator, and not with a key lookup. 
> So performance can still be quite bad.
> * For parallel indexing, we split the flat file. This is not possible unless 
> we know the aggregation. Sometimes splitting is not possible.
> We want to explore using a tree store that would solve all of the above 
> problems.





[jira] [Resolved] (OAK-10913) SQL-2 grammar: remove documentation for "distinct"

2024-07-11 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10913.
--
Fix Version/s: 1.22.21
   Resolution: Fixed

> SQL-2 grammar: remove documentation for "distinct"
> --
>
> Key: OAK-10913
> URL: https://issues.apache.org/jira/browse/OAK-10913
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.22.21
>
>
> In https://jackrabbit.apache.org/oak/docs/query/grammar-sql2.html#query-1 we 
> document "distinct": 
> “distinct” ensures each row is only returned once.
> But the current implementation doesn't guarantee this: distinct is only 
> applicable to path, and even without distinct each path is only returned 
> once, because the path is basically the primary key.
> Originally, the idea was to add support for "distinct" on properties, but 
> this has not been implemented so far.
> It seems better to remove documentation for "distinct", so that users are not 
> confused.





[jira] [Commented] (OAK-10913) SQL-2 grammar: remove documentation for "distinct"

2024-06-24 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859593#comment-17859593
 ] 

Thomas Mueller commented on OAK-10913:
--

https://github.com/apache/jackrabbit-oak/pull/1552

> SQL-2 grammar: remove documentation for "distinct"
> --
>
> Key: OAK-10913
> URL: https://issues.apache.org/jira/browse/OAK-10913
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> In https://jackrabbit.apache.org/oak/docs/query/grammar-sql2.html#query-1 we 
> document "distinct": 
> “distinct” ensures each row is only returned once.
> But the current implementation doesn't guarantee this: distinct is only 
> applicable to path, and even without distinct each path is only returned 
> once, because the path is basically the primary key.
> Originally, the idea was to add support for "distinct" on properties, but 
> this has not been implemented so far.
> It seems better to remove documentation for "distinct", so that users are not 
> confused.





[jira] [Created] (OAK-10913) SQL-2 grammar: remove documentation for "distinct"

2024-06-24 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10913:


 Summary: SQL-2 grammar: remove documentation for "distinct"
 Key: OAK-10913
 URL: https://issues.apache.org/jira/browse/OAK-10913
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller


In https://jackrabbit.apache.org/oak/docs/query/grammar-sql2.html#query-1 we 
document "distinct": 

“distinct” ensures each row is only returned once.

But the current implementation doesn't guarantee this: distinct is only 
applicable to path, and even without distinct each path is only returned once, 
because the path is basically the primary key.

Originally, the idea was to add support for "distinct" on properties, but this 
has not been implemented so far.

It seems better to remove documentation for "distinct", so that users are not 
confused.





[jira] [Comment Edited] (OAK-8046) Result items are not always correctly counted against the configured read limit if a query uses a lucene index

2024-06-11 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854260#comment-17854260
 ] 

Thomas Mueller edited comment on OAK-8046 at 6/12/24 6:15 AM:
--

> in a content management system with 100.000nd of pages and assets, doing a 
> query that is below 200 items is not always feasible?

It is, using keyset pagination as documented in 
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#keyset-pagination

> There are even ootb features that read more nodes.

Queries that read more than 100'000 nodes need to be changed. This happened for 
example for "sling alias" queries and "vanity path" queries in Apache Sling.

It is fine to read more than 200 nodes per query. It is not good to read more 
than 100'000 nodes. There is a grey area in-between.

> Plus how would you influence the time a query takes, besides setting a good 
> index definition

In reality this is not a problem for queries that read few nodes.


was (Author: tmueller):
> in a content management system with 100.000nd of pages and assets, doing a 
> query that is below 200 items is not always feasible?

It is, using keyset pagination as documented in 
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#keyset-pagination

> There are even ootb features that read more nodes.

Queries that read more than 100'000 nodes need to be changed. This happened for 
example for "sling alias" queries and "vanity path" queries in Apache Sling.

It is fine to read more than 200 nodes per query. It is not good to read more 
than 100'000 nodes.

> Plus how would you influence the time a query takes, besides setting a good 
> index definition

In reality this is not a problem for queries that read few nodes.

> Result items are not always correctly counted against the configured read 
> limit if a query uses a lucene index 
> ---
>
> Key: OAK-8046
> URL: https://issues.apache.org/jira/browse/OAK-8046
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.8.7
>Reporter: Georg Henzler
>Assignee: Vikas Saurabh
>Priority: Major
> Fix For: 1.12.0, 1.10.1, 1.8.12
>
> Attachments: OAK-8046-take2.patch, OAK-8046.patch
>
>
> There are cases where an index is re-opened during query execution. In that 
> case, already returned entries are read again and skipped, so basically 
> counted twice. This should be fixed to only count entries once (see also [1])
> The issue most likely exists since the read limit was introduced with OAK-6875
> [1] 
> https://lists.apache.org/thread.html/dddf9834fee0bccb6e48f61ba2a01430e34fc0b464b12809f7dfe2eb@%3Coak-dev.jackrabbit.apache.org%3E





[jira] [Comment Edited] (OAK-8046) Result items are not always correctly counted against the configured read limit if a query uses a lucene index

2024-06-11 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854260#comment-17854260
 ] 

Thomas Mueller edited comment on OAK-8046 at 6/12/24 6:15 AM:
--

> in a content management system with 100.000nd of pages and assets, doing a 
> query that is below 200 items is not always feasible?

It is, using keyset pagination as documented in 
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#keyset-pagination

> There are even ootb features that read more nodes.

Queries that read more than 100'000 nodes need to be changed. This happened for 
example for "sling alias" queries and "vanity path" queries in Apache Sling.

It is best to read less than 200 nodes per query. It is not good to read more 
100'000 nodes. There is a grey area in-between.

> Plus how would you influence the time a query takes, besides setting a good 
> index definition

In reality this is not a problem for queries that read few nodes.


was (Author: tmueller):
> in a content management system with 100.000nd of pages and assets, doing a 
> query that is below 200 items is not always feasible?

It is, using keyset pagination as documented in 
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#keyset-pagination

> There are even ootb features that read more nodes.

Queries that read more than 100'000 nodes need to be changed. This happened for 
example for "sling alias" queries and "vanity path" queries in Apache Sling.

It is fine to read more than 200 nodes per query. It is not good to read more 
than 100'000 nodes. There is a grey area in-between.

> Plus how would you influence the time a query takes, besides setting a good 
> index definition

In reality this is not a problem for queries that read few nodes.

> Result items are not always correctly counted against the configured read 
> limit if a query uses a lucene index 
> ---
>
> Key: OAK-8046
> URL: https://issues.apache.org/jira/browse/OAK-8046
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.8.7
>Reporter: Georg Henzler
>Assignee: Vikas Saurabh
>Priority: Major
> Fix For: 1.12.0, 1.10.1, 1.8.12
>
> Attachments: OAK-8046-take2.patch, OAK-8046.patch
>
>
> There are cases where an index is re-opened during query execution. In that 
> case, already returned entries are read again and skipped, so basically 
> counted twice. This should be fixed to only count entries once (see also [1])
> The issue most likely exists since the read limit was introduced with OAK-6875
> [1] 
> https://lists.apache.org/thread.html/dddf9834fee0bccb6e48f61ba2a01430e34fc0b464b12809f7dfe2eb@%3Coak-dev.jackrabbit.apache.org%3E





[jira] [Commented] (OAK-8046) Result items are not always correctly counted against the configured read limit if a query uses a lucene index

2024-06-11 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854260#comment-17854260
 ] 

Thomas Mueller commented on OAK-8046:
-

> in a content management system with 100.000nd of pages and assets, doing a 
> query that is below 200 items is not always feasible?

It is, using keyset pagination as documented in 
https://jackrabbit.apache.org/oak/docs/query/query-engine.html#keyset-pagination

> There are even ootb features that read more nodes.

Queries that read more than 100'000 nodes need to be changed. This happened for 
example for "sling alias" queries and "vanity path" queries in Apache Sling.

It is fine to read more than 200 nodes per query. It is not good to read more 
than 100'000 nodes.

> Plus how would you influence the time a query takes, besides setting a good 
> index definition

In reality this is not a problem for queries that read few nodes.
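The keyset pagination referenced above can be illustrated with a small sketch: each page filters on the last key seen instead of using an offset, so every query stays small and cheap. The data, page size, and helper function are invented for the example; in JCR this corresponds to ordering by a unique key and adding a `key > :last` condition:

```python
# Keyset pagination sketch: instead of OFFSET-based paging, each page asks
# for keys strictly greater than the last key of the previous page
# (conceptually: WHERE key > :last ORDER BY key LIMIT :page_size).
def keyset_pages(sorted_keys, page_size):
    last = None
    while True:
        page = [k for k in sorted_keys if last is None or k > last][:page_size]
        if not page:
            return
        yield page
        last = page[-1]  # the keyset cursor for the next query

paths = sorted("/content/page-%03d" % i for i in range(5))
pages = list(keyset_pages(paths, 2))
assert pages[0] == ["/content/page-000", "/content/page-001"]
assert [p for page in pages for p in page] == paths  # nothing skipped or duplicated
```

Because each page re-filters from the key, no query ever reads more than `page_size` result rows, which keeps each query well under the read limit.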

> Result items are not always correctly counted against the configured read 
> limit if a query uses a lucene index 
> ---
>
> Key: OAK-8046
> URL: https://issues.apache.org/jira/browse/OAK-8046
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.8.7
>Reporter: Georg Henzler
>Assignee: Vikas Saurabh
>Priority: Major
> Fix For: 1.12.0, 1.10.1, 1.8.12
>
> Attachments: OAK-8046-take2.patch, OAK-8046.patch
>
>
> There are cases where an index is re-opened during query execution. In that 
> case, already returned entries are read again and skipped, so basically 
> counted twice. This should be fixed to only count entries once (see also [1])
> The issue most likely exists since the read limit was introduced with OAK-6875
> [1] 
> https://lists.apache.org/thread.html/dddf9834fee0bccb6e48f61ba2a01430e34fc0b464b12809f7dfe2eb@%3Coak-dev.jackrabbit.apache.org%3E





[jira] [Commented] (OAK-8046) Result items are not always correctly counted against the configured read limit if a query uses a lucene index

2024-06-10 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853671#comment-17853671
 ] 

Thomas Mueller commented on OAK-8046:
-

>  I guess the only thing we can do is move this class to an ignored log file 

[~royteeuwen] No. Best is if the queries read less than 200 nodes, and do so 
relatively quickly (within a second or so). 

> Result items are not always correctly counted against the configured read 
> limit if a query uses a lucene index 
> ---
>
> Key: OAK-8046
> URL: https://issues.apache.org/jira/browse/OAK-8046
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.8.7
>Reporter: Georg Henzler
>Assignee: Vikas Saurabh
>Priority: Major
> Fix For: 1.12.0, 1.10.1, 1.8.12
>
> Attachments: OAK-8046-take2.patch, OAK-8046.patch
>
>
> There are cases where an index is re-opened during query execution. In that 
> case, already returned entries are read again and skipped, so basically 
> counted twice. This should be fixed to only count entries once (see also [1])
> The issue most likely exists since the read limit was introduced with OAK-6875
> [1] 
> https://lists.apache.org/thread.html/dddf9834fee0bccb6e48f61ba2a01430e34fc0b464b12809f7dfe2eb@%3Coak-dev.jackrabbit.apache.org%3E





[jira] [Commented] (OAK-8046) Result items are not always correctly counted against the configured read limit if a query uses a lucene index

2024-05-21 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848193#comment-17848193
 ] 

Thomas Mueller commented on OAK-8046:
-

> Should a reindex be triggered

No. That won't help.

> Result items are not always correctly counted against the configured read 
> limit if a query uses a lucene index 
> ---
>
> Key: OAK-8046
> URL: https://issues.apache.org/jira/browse/OAK-8046
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.8.7
>Reporter: Georg Henzler
>Assignee: Vikas Saurabh
>Priority: Major
> Fix For: 1.12.0, 1.10.1, 1.8.12
>
> Attachments: OAK-8046-take2.patch, OAK-8046.patch
>
>
> There are cases where an index is re-opened during query execution. In that 
> case, already returned entries are read again and skipped, so basically 
> counted twice. This should be fixed to only count entries once (see also [1])
> The issue most likely exists since the read limit was introduced with OAK-6875
> [1] 
> https://lists.apache.org/thread.html/dddf9834fee0bccb6e48f61ba2a01430e34fc0b464b12809f7dfe2eb@%3Coak-dev.jackrabbit.apache.org%3E





[jira] [Commented] (OAK-8046) Result items are not always correctly counted against the configured read limit if a query uses a lucene index

2024-05-21 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848192#comment-17848192
 ] 

Thomas Mueller commented on OAK-8046:
-

[~royteeuwen] it means that while the query was still running (and reading more 
nodes), the index was updated concurrently. Indexes are updated every ~5 
seconds.

Best is if the queries read fewer than 200 nodes, and do so relatively quickly 
(within a second or so). If you have queries that read 100'000 or more nodes, it 
is quite easy to get into this situation. With fewer than 200 nodes, it's 
typically never a problem. (There's also the case where fewer than 200 nodes are 
read, but very slowly... however, that's unlikely.)
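The double-counting effect described in this issue can be illustrated with a toy model (the class and method names below are illustrative only, not Oak code): if the index is reopened after some rows were already returned, those rows are read and skipped again, so they count twice against the read limit.

```java
// Toy model of the double-counting: a reader that restarts once after
// `restartAt` rows re-reads (and re-counts) the rows it already returned.
public class ReadCountSketch {

    static int nodesReadCounted(int resultRows, int restartAt) {
        if (restartAt >= resultRows) {
            // no restart happened: each row is counted once
            return resultRows;
        }
        // rows before the restart are read twice (returned, then skipped)
        return resultRows + restartAt;
    }

    public static void main(String[] args) {
        System.out.println(nodesReadCounted(100, 1000)); // 100: no restart
        System.out.println(nodesReadCounted(1000, 300)); // 1300: 300 rows counted twice
    }
}
```

This is why small result sets rarely hit the limit: even with a restart, a 200-row query counts at most a few hundred reads, while a 100'000-row query can overshoot the configured limit substantially.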

> Result items are not always correctly counted against the configured read 
> limit if a query uses a lucene index 
> ---
>
> Key: OAK-8046
> URL: https://issues.apache.org/jira/browse/OAK-8046
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.8.7
>Reporter: Georg Henzler
>Assignee: Vikas Saurabh
>Priority: Major
> Fix For: 1.12.0, 1.10.1, 1.8.12
>
> Attachments: OAK-8046-take2.patch, OAK-8046.patch
>
>
> There are cases where an index is re-opened during query execution. In that 
> case, already returned entries are read again and skipped, so basically 
> counted twice. This should be fixed to only count entries once (see also [1])
> The issue most likely exists since the read limit was introduced with OAK-6875
> [1] 
> https://lists.apache.org/thread.html/dddf9834fee0bccb6e48f61ba2a01430e34fc0b464b12809f7dfe2eb@%3Coak-dev.jackrabbit.apache.org%3E





[jira] [Updated] (OAK-10713) oak-lucene: add test coverage for stack overflow based on very long and complex regexp

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10713:
-
Summary: oak-lucene: add test coverage for stack overflow based on very 
long and complex regexp  (was: oak-lucene: add test coverage for stack overflow 
based on complex regexp)

> oak-lucene: add test coverage for stack overflow based on very long and 
> complex regexp
> --
>
> Key: OAK-10713
> URL: https://issues.apache.org/jira/browse/OAK-10713
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>  Labels: candidate_oak_1_22
> Fix For: 1.62.0
>
>






[jira] [Updated] (OAK-10713) oak-lucene: add test coverage for stack overflow based on complex regexp

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10713:
-
Summary: oak-lucene: add test coverage for stack overflow based on complex 
regexp  (was: oak-lucene: add test coverage for potential DoS attack based on 
complex regexp)

> oak-lucene: add test coverage for stack overflow based on complex regexp
> 
>
> Key: OAK-10713
> URL: https://issues.apache.org/jira/browse/OAK-10713
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>  Labels: candidate_oak_1_22
> Fix For: 1.62.0
>
>






[jira] [Updated] (OAK-10719) oak-lucene uses Lucene version that can throw a StackOverflowException

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10719:
-
Description: 
See .

Analysis so far:

- oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has reached 
EOL a long time ago
- the lucene version can in some cases throw a StackOverflowException, see 
OAK-10713
- oak-lucene *embeds* and *exports* lucene-core
- update to version >= 4.8 non-trivial due to backwards compat breakage

Work in :

- inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into oak-lucene
- fixed two JDK11 compile issues (potentially uninitialized vars in finally 
block) 
- backported fix from https://github.com/apache/lucene/issues/11537
- enable test added in OAK-10713
- ran Oak integration tests

Open questions:

- Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate that
- should we ask Lucene team for a public release (might be hard sell)
- alternatively, as tried here, inline source code into oak-lucene (maybe add 
explainers to all source files)
- do we need to adopt the lucene test suite as well?
- lucene-core dependencies in other Oak modules to be checked (seems mostly for 
tests, or for run modules)





  was:
See .

Analysis so far:

- oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has reached 
EOL a long time ago
- the version is vulnerable to an DoS attack (regexp stack overflow), see 
OAK-10713
- oak-lucene *embeds* and *exports* lucene-core
- update to version >= 4.8 non-trivial due to backwards compat breakage

Work in :

- inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into oak-lucene
- fixed two JDK11 compile issues (potentially uninitialized vars in finally 
block) 
- backported fix from https://github.com/apache/lucene/issues/11537
- enable test added in OAK-10713
- ran Oak integration tests

Open questions:

- Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate that
- should we ask Lucene team for a public release (might be hard sell)
- alternatively, as tried here, inline source code into oak-lucene (maybe add 
explainers to all source files)
- do we need to adopt the lucene test suite as well?
- lucene-core dependencies in other Oak modules to be checked (seems mostly for 
tests, or for run modules)






> oak-lucene uses Lucene version that can throw a StackOverflowException
> --
>
> Key: OAK-10719
> URL: https://issues.apache.org/jira/browse/OAK-10719
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> See .
> Analysis so far:
> - oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has 
> reached EOL a long time ago
> - the lucene version can in some cases throw a StackOverflowException, see 
> OAK-10713
> - oak-lucene *embeds* and *exports* lucene-core
> - update to version >= 4.8 non-trivial due to backwards compat breakage
> Work in :
> - inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into 
> oak-lucene
> - fixed two JDK11 compile issues (potentially uninitialized vars in finally 
> block) 
> - backported fix from https://github.com/apache/lucene/issues/11537
> - enable test added in OAK-10713
> - ran Oak integration tests
> Open questions:
> - Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate 
> that
> - should we ask Lucene team for a public release (might be hard sell)
> - alternatively, as tried here, inline source code into oak-lucene (maybe add 
> explainers to all source files)
> - do we need to adopt the lucene test suite as well?
> - lucene-core dependencies in other Oak modules to be checked (seems mostly 
> for tests, or for run modules)





[jira] [Updated] (OAK-10719) oak-lucene uses lucene version that can throw a StackOverflowException

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10719:
-
Summary: oak-lucene uses lucene version that can throw a 
StackOverflowException  (was: oak-lucene uses lucene version vulnerable to DoS 
attack)

> oak-lucene uses lucene version that can throw a StackOverflowException
> --
>
> Key: OAK-10719
> URL: https://issues.apache.org/jira/browse/OAK-10719
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> See .
> Analysis so far:
> - oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has 
> reached EOL a long time ago
> - the version is vulnerable to an DoS attack (regexp stack overflow), see 
> OAK-10713
> - oak-lucene *embeds* and *exports* lucene-core
> - update to version >= 4.8 non-trivial due to backwards compat breakage
> Work in :
> - inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into 
> oak-lucene
> - fixed two JDK11 compile issues (potentially uninitialized vars in finally 
> block) 
> - backported fix from https://github.com/apache/lucene/issues/11537
> - enable test added in OAK-10713
> - ran Oak integration tests
> Open questions:
> - Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate 
> that
> - should we ask Lucene team for a public release (might be hard sell)
> - alternatively, as tried here, inline source code into oak-lucene (maybe add 
> explainers to all source files)
> - do we need to adopt the lucene test suite as well?
> - lucene-core dependencies in other Oak modules to be checked (seems mostly 
> for tests, or for run modules)





[jira] [Updated] (OAK-10719) oak-lucene uses Lucene version that can throw a StackOverflowException

2024-03-28 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10719:
-
Summary: oak-lucene uses Lucene version that can throw a 
StackOverflowException  (was: oak-lucene uses lucene version that can throw a 
StackOverflowException)

> oak-lucene uses Lucene version that can throw a StackOverflowException
> --
>
> Key: OAK-10719
> URL: https://issues.apache.org/jira/browse/OAK-10719
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> See .
> Analysis so far:
> - oak-lucene uses lucene-core (4.7.2) (see OAK-10716); that version has 
> reached EOL a long time ago
> - the version is vulnerable to an DoS attack (regexp stack overflow), see 
> OAK-10713
> - oak-lucene *embeds* and *exports* lucene-core
> - update to version >= 4.8 non-trivial due to backwards compat breakage
> Work in :
> - inlined lucene-core as of git tag "releases/lucene-solr/4.7.2" into 
> oak-lucene
> - fixed two JDK11 compile issues (potentially uninitialized vars in finally 
> block) 
> - backported fix from https://github.com/apache/lucene/issues/11537
> - enable test added in OAK-10713
> - ran Oak integration tests
> Open questions:
> - Lucene 4.7.2 builds with ant/ivy - does it make sense to try to replicate 
> that
> - should we ask Lucene team for a public release (might be hard sell)
> - alternatively, as tried here, inline source code into oak-lucene (maybe add 
> explainers to all source files)
> - do we need to adopt the lucene test suite as well?
> - lucene-core dependencies in other Oak modules to be checked (seems mostly 
> for tests, or for run modules)





[jira] [Commented] (OAK-10694) Clarify state of oak-search-mt

2024-03-08 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824653#comment-17824653
 ] 

Thomas Mueller commented on OAK-10694:
--

> So what do we do in 1.22? Remove as well?

Yes. I think that is simpler.

> Clarify state of oak-search-mt
> --
>
> Key: OAK-10694
> URL: https://issues.apache.org/jira/browse/OAK-10694
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: search-mt
>Reporter: Manfred Baedke
>Priority: Major
>  Labels: candidate_oak_1_22
>
> oak-search-mt depends on an artifact from the retired Apache Incubator 
> project 
> [Joshua|https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+Home],
>  which has a dependency to Guava 19.
> May it be deprecated/removed?





[jira] [Commented] (OAK-10694) Clarify state of oak-search-mt

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824036#comment-17824036
 ] 

Thomas Mueller commented on OAK-10694:
--

[~reschke] I would remove it already now. I don't think this is used by anyone 
(and it is not maintained). Deprecating but keeping it will likely result in 
more issues than removing it.

If someone requires this, then it would be their task to maintain it, in my view.

> Clarify state of oak-search-mt
> --
>
> Key: OAK-10694
> URL: https://issues.apache.org/jira/browse/OAK-10694
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: search-mt
>Reporter: Manfred Baedke
>Priority: Major
>  Labels: candidate_oak_1_22
>
> oak-search-mt depends on an artifact from the retired Apache Incubator 
> project 
> [Joshua|https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+Home],
>  which has a dependency to Guava 19.
> May it be deprecated/removed?





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823917#comment-17823917
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/6/24 8:58 AM:
--

I can add the method "expectedFpp()" in our code as well 
(getEstimatedEntryCount we already have), with documentation that this is O ( n 
). The implementation is pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

Actually I would suggest this method:

{noformat}
/**
 * Get the expected false positive rate for the current entries in the 
filter.
 * This will first calculate the estimated entry count, and then calculate 
the false positive probability from there.
...
 */
public double expectedFpp() {
return calculateFpp(getEstimatedEntryCount(), getBitCount(), getK());
}
{noformat}
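For illustration, here is a self-contained sketch of what calculateFpp could compute, using the standard Bloom filter false positive formula p = (1 - e^(-k*n/m))^k. The class and method signatures below are assumptions for this example, not the actual Oak or Guava API:

```java
// Standalone sketch (not the Oak implementation): expected false positive
// probability of a Bloom filter with bitCount bits, k hash functions, and
// entryCount entries, using p = (1 - e^(-k*n/m))^k.
public class FppSketch {

    // Standard Bloom filter false positive probability formula
    static double calculateFpp(double entryCount, double bitCount, int k) {
        return Math.pow(1 - Math.exp(-k * entryCount / bitCount), k);
    }

    public static void main(String[] args) {
        // e.g. 1000 entries in a 10000-bit filter with 7 hash functions
        double fpp = calculateFpp(1000, 10000, 7);
        System.out.println(fpp); // roughly 0.008 for these parameters
    }
}
```

Note that expectedFpp() as proposed above is only as accurate as getEstimatedEntryCount(), which is why documenting the O(n) cost of the estimation matters.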


was (Author: tmueller):
I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O ( n ). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823917#comment-17823917
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/6/24 8:52 AM:
--

I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O ( n ). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30


was (Author: tmueller):
I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O(n). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Commented] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-06 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823917#comment-17823917
 ] 

Thomas Mueller commented on OAK-10674:
--

I can add the methods "expectedFpp()" and "approximateElementCount()" in our 
code as well, with documentation that this is O(n). The implementation is 
pretty simple: see the Guava implementation here:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java#L190C17-L190C30

> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:44 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 * 
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}
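A minimal, self-contained sketch of how these convenience methods could behave on top of a bit-set-backed filter. The hash64 mixer below is a SplitMix64-style finalizer standing in for Oak's Hash.hash64 (an assumption for this example, not the actual Oak implementation):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: add(Object)/mayContain(Object) derive a
// high-quality 64-bit hash from hashCode(), as proposed in the comment.
public class BloomSketch {

    // SplitMix64-style finalizer: spreads entropy across all 64 bits.
    // Stand-in for Oak's Hash.hash64, not the actual implementation.
    static long hash64(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    BloomSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    public void add(Object obj) {
        long h = hash64(obj.hashCode());
        for (int i = 0; i < k; i++) {
            bits.set((int) Long.remainderUnsigned(h, m));
            h = hash64(h); // derive the next probe position
        }
    }

    public boolean mayContain(Object obj) {
        long h = hash64(obj.hashCode());
        for (int i = 0; i < k; i++) {
            if (!bits.get((int) Long.remainderUnsigned(h, m))) {
                return false; // definitely not added
            }
            h = hash64(h);
        }
        return true; // added, or a false positive
    }

    public static void main(String[] args) {
        BloomSketch f = new BloomSketch(1024, 3);
        f.add("oak");
        System.out.println(f.mayContain("oak")); // prints true
    }
}
```

The point of routing hashCode() through hash64 is exactly the one made above: Object.hashCode() values often have poor entropy in the high bits, and the filter needs all bits to be well mixed.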




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:44 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 * 
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}

I can work on this, no issue. We need to also move over some tests.




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set. This internally uses the hashCode()
 * method to derive a high-quality hash code.
 * 
 * @param obj the object (must not be null)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822576#comment-17822576
 ] 

Thomas Mueller edited comment on OAK-10674 at 3/1/24 1:43 PM:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(Hash.hash64(obj.hashCode()));
}
{noformat}




was (Author: tmueller):
[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(obj.hashCode());
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Commented] (OAK-10674) DocumentStore: verify that we could use Oak's Bloom filter

2024-03-01 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822576#comment-17822576
 ] 

Thomas Mueller commented on OAK-10674:
--

[~reschke] best would be to move over 
org.apache.jackrabbit.oak.index.indexer.document.flatfile.analysis.utils.Hash 
as well. And then we can add convenience methods:

{noformat}
/**
 * Add an entry. This internally uses the hashCode() method to derive a
 * high-quality hash code.
 *
 * @param obj the object (must not be null)
 */
public void add(@NotNull Object obj) {
add(Hash.hash64(obj.hashCode()));
}

/**
 * Whether the entry may be in the set.
 * 
 * @param hash the hash value (need to be a high quality hash code, with all
 * bits having high entropy)
 * @return true if the entry was added, or, with a certain false positive
 * probability, even if it was not added
 */
public boolean mayContain(@NotNull Object obj) {
return mayContain(obj.hashCode());
}
{noformat}



> DocumentStore: verify that we could use Oak's Bloom filter
> --
>
> Key: OAK-10674
> URL: https://issues.apache.org/jira/browse/OAK-10674
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
>
> Test that we can use 
> oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/analysis/utils/BloomFilter.java
>  (for now, by copying it over).
> Then decide about where to move it, and whether API changes are desired.





[jira] [Comment Edited] (OAK-10648) "IS NULL" (Null Props) Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817235#comment-17817235
 ] 

Thomas Mueller edited comment on OAK-10648 at 2/14/24 3:00 PM:
---

I didn't test this yet, but the following change seems to be necessary:

https://github.com/apache/jackrabbit-oak/blob/trunk/oak-search/src/main/java/org/apache/jackrabbit/oak/plugins/index/search/spi/query/FulltextIndexPlanner.java#L851

{noformat}
oak-search FulltextIndexPlanner

 if (pr.isNotNullRestriction()) {
// don't use weight for "is not null" restrictions
weight = 1;
 missing code start --
} else if (pr.isNullRestriction()) {
// don't use weight for "is null" restrictions
weight = 1;
 missing code end --
} else {
if (weight > 1) {
// for non-equality conditions such as
// where x > 1, x < 2, x like y,...:
// use a maximum weight of 3,
// so assume we read at least 30%
if (!isEqualityRestriction(pr)) {
weight = Math.min(3, weight);
}
}
}
{noformat}

We should probably add a feature toggle / system property so that we can switch 
back to the original behavior, in case an application relies on the current 
behavior.
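The weight selection sketched in the snippet above can be condensed into a standalone method (the method name and boolean flags here are illustrative; the real logic lives in FulltextIndexPlanner and operates on PropertyRestriction objects):

```java
// Sketch of the proposed weight selection: "is null" and "is not null"
// restrictions ignore the configured weight, and other non-equality
// restrictions are capped at weight 3 (i.e. assume at least 30% is read).
public class WeightSketch {

    static int adjustWeight(int weight, boolean isNotNull, boolean isNull,
                            boolean isEquality) {
        if (isNotNull || isNull) {
            // don't use weight for "is null" / "is not null" restrictions
            return 1;
        }
        if (weight > 1 && !isEquality) {
            // for non-equality conditions such as x > 1, x < 2, x like y:
            // use a maximum weight of 3
            return Math.min(3, weight);
        }
        return weight;
    }

    public static void main(String[] args) {
        System.out.println(adjustWeight(20, false, true, false));  // prints 1
        System.out.println(adjustWeight(20, false, false, false)); // prints 3
        System.out.println(adjustWeight(20, false, false, true));  // prints 20
    }
}
```

With the missing "is null" branch, a nullCheckEnabled property such as cq:movedTo no longer divides the estimated cost by its weight, which is what currently makes the non-union plan look 19 instead of much higher.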


was (Author: tmueller):
I didn't test this yet, but the following change seems to be necessary:

{noformat}
oak-search FulltextIndexPlanner

 if (pr.isNotNullRestriction()) {
// don't use weight for "is not null" restrictions
weight = 1;
 missing code start --
} else if (pr.isNullRestriction()) {
// don't use weight for "is null" restrictions
weight = 1;
 missing code end --
} else {
if (weight > 1) {
// for non-equality conditions such as
// where x > 1, x < 2, x like y,...:
// use a maximum weight of 3,
// so assume we read at least 30%
if (!isEqualityRestriction(pr)) {
weight = Math.min(3, weight);
}
}
}
{noformat}

We should probably add a feature toggle / system property so that we can switch 
back to the original behavior, in case an application relies on the current 
behavior.

> "IS NULL" (Null Props) Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
> If you look at the query plan below, the number of null-prop documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
> After removing the null check, the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}
> Index definition for the "cq:movedTo" property:
> {noformat}
> "cqMovedTo": {
> "notNullCheckEnabled": true,
> "nullCheckEnabled": true,
> "propertyIndex": true,
> "name": "cq:movedTo",
> "type": "String"
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OAK-10648) "IS NULL" (Null Props) Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10648:
-
Description: 
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

If you look at the query plan below, the number of null-prop documents is quite 
high, yet the cost for the query is only 19. When we execute the UNION query, 
the cost is 38, which is why it is not selected, when in reality the original 
cost should be much higher.

After removing the null check, the cost estimation is drastically different and 
correctly reflects the number of documents in the index.

Queries:
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
'%ksb1325bm%') 
{noformat}
 
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
UNION
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
{noformat}

Index definition for the "cq:movedTo" property:

{noformat}
"cqMovedTo": {
"notNullCheckEnabled": true,
"nullCheckEnabled": true,
"propertyIndex": true,
"name": "cq:movedTo",
"type": "String"
}
{noformat}

  was:
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

If you look at the query plan below, the number of null-prop documents is quite 
high, yet the cost for the query is only 19. When we execute the UNION query, 
the cost is 38, which is why it is not selected, when in reality the original 
cost should be much higher.

After removing the null check, the cost estimation is drastically different and 
correctly reflects the number of documents in the index.

Queries:
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
'%ksb1325bm%') 
{noformat}
 
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
UNION
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
{noformat}



> "IS NULL" (Null Props) Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
> If you look at the query plan below, the number of null-prop documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
> After removing the null check, the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}
> Index definition for the "cq:movedTo" property:
> {noformat}
> "cqMovedTo": {
> "notNullCheckEnabled": true,
> "nullCheckEnabled": true,
> "propertyIndex": true,
> "name": "cq:movedTo",
> "type": "String"
> }
> {noformat}





[jira] [Updated] (OAK-10648) "IS NULL" (Null Props) Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10648:
-
Summary: "IS NULL" (Null Props) Cause Incorrect Query Estimation  (was: 
Null Props Cause Incorrect Query Estimation)

> "IS NULL" (Null Props) Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
>  
> If you look at the query plan below, the number of null-prop documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
>  
> After removing the null check, the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}





[jira] [Updated] (OAK-10648) Null Props Cause Incorrect Query Estimation

2024-02-14 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10648:
-
Description: 
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

If you look at the query plan below, the number of null-prop documents is quite 
high, yet the cost for the query is only 19. When we execute the UNION query, 
the cost is 38, which is why it is not selected, when in reality the original 
cost should be much higher.

After removing the null check, the cost estimation is drastically different and 
correctly reflects the number of documents in the index.

Queries:
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
'%ksb1325bm%') 
{noformat}
 
{noformat}
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
UNION
SELECT * FROM [cq:Tag] 
WHERE [cq:movedTo] IS NULL 
AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
{noformat}


  was:
Using null props in a query can cause the query engine to incorrectly estimate 
the cost of a query plan, which can lead to a traversal and slow query 
execution.

If you look at the query plan below, the number of null-prop documents is quite 
high, yet the cost for the query is only 19. When we execute the UNION query, 
the cost is 38, which is why it is not selected, when in reality the original 
cost should be much higher.

After removing the null check, the cost estimation is drastically different and 
correctly reflects the number of documents in the index.


> Null Props Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of a query plan, which can lead to a traversal and slow 
> query execution.
>  
> If you look at the query plan below, the number of null-prop documents is 
> quite high, yet the cost for the query is only 19. When we execute the UNION 
> query, the cost is 38, which is why it is not selected, when in reality the 
> original cost should be much higher.
>  
> After removing the null check, the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.
> Queries:
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND (LOWER([jcr:title.en]) LIKE '%ksb1325bm%' OR LOWER([jcr:title]) LIKE 
> '%ksb1325bm%') 
> {noformat}
>  
> {noformat}
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title.en]) LIKE '%ksb1325bm%' 
> UNION
> SELECT * FROM [cq:Tag] 
> WHERE [cq:movedTo] IS NULL 
> AND LOWER([jcr:title]) LIKE '%ksb1325bm%'
> {noformat}





[jira] [Commented] (OAK-10648) Null Props Cause Incorrect Query Estimation

2024-02-13 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817235#comment-17817235
 ] 

Thomas Mueller commented on OAK-10648:
--

I didn't test this yet, but the following change seems to be necessary:

{noformat}
oak-search FulltextIndexPlanner

    if (pr.isNotNullRestriction()) {
        // don't use weight for "is not null" restrictions
        weight = 1;
// ---- missing code start ----
    } else if (pr.isNullRestriction()) {
        // don't use weight for "is null" restrictions
        weight = 1;
// ---- missing code end ----
    } else {
        if (weight > 1) {
            // for non-equality conditions such as
            // where x > 1, x < 2, x like y, ...:
            // use a maximum weight of 3,
            // so assume we read at least 30%
            if (!isEqualityRestriction(pr)) {
                weight = Math.min(3, weight);
            }
        }
    }
{noformat}

We should probably add a feature toggle / system property so that we can 
switch back to the original behavior, in case an application relies on the 
current behavior.
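A minimal sketch of how the proposed logic plus such a toggle could fit together. This is illustrative only: the system property name "oak.fulltext.isNullWeightFix" and the method shape are assumptions, not Oak's actual API.

```java
// Hedged sketch of the proposed "is null" weight handling with a feature
// toggle; the property name below is hypothetical, not an actual Oak setting.
public class WeightSketch {

    static final boolean IS_NULL_WEIGHT_FIX = Boolean.parseBoolean(
            System.getProperty("oak.fulltext.isNullWeightFix", "true"));

    static double adjustWeight(double weight, boolean isNotNullRestriction,
                               boolean isNullRestriction, boolean isEqualityRestriction) {
        if (isNotNullRestriction) {
            // don't use weight for "is not null" restrictions
            return 1;
        }
        if (IS_NULL_WEIGHT_FIX && isNullRestriction) {
            // proposed fix: treat "is null" like "is not null"
            return 1;
        }
        if (weight > 1 && !isEqualityRestriction) {
            // non-equality conditions (x > 1, x like y, ...):
            // cap the weight at 3, i.e. assume at least 30% is read
            return Math.min(3, weight);
        }
        return weight;
    }
}
```

With the toggle disabled, an "is null" restriction falls through to the old capping behavior, which is what an application relying on the current estimation would see.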

> Null Props Cause Incorrect Query Estimation
> ---
>
> Key: OAK-10648
> URL: https://issues.apache.org/jira/browse/OAK-10648
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: indexing
>Reporter: Patrique Legault
>Priority: Major
> Attachments: Non Union Query Plan.json, Non Union With Null 
> Check.json, Screenshot 2024-02-13 at 9.30.43 AM.png, Union Query Plan.json, 
> cqTagLucene.json
>
>
> Using null props in a query can cause the query engine to incorrectly 
> estimate the cost of query plan which can lead to a traversal and slow 
> queries to execute.
>  
> If you look at the query plan below the number of null props documents is 
> quiet high yet the cost for the query is only 19. When we execute the UNION 
> query the cost is 38 which is why it is not selected when in reality the 
> original cost should be much higher.
>  
> After removing the null check the cost estimation is drastically different 
> and correctly reflects the number of documents in the index.





[jira] [Commented] (OAK-10424) Allow Fast Query Size and Insecure Facets to be selectively enabled with query options for permitted principals

2024-01-15 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806843#comment-17806843
 ] 

Thomas Mueller commented on OAK-10424:
--

Documentation (proposal):  https://github.com/apache/jackrabbit-oak/pull/1269

https://jackrabbit.apache.org/oak/docs/query/query-engine.html#result-size 
source code in 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/query/query-engine.md



> Allow Fast Query Size and Insecure Facets to be selectively enabled with 
> query options for permitted principals 
> 
>
> Key: OAK-10424
> URL: https://issues.apache.org/jira/browse/OAK-10424
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Affects Versions: 1.56.0
>Reporter: Mark Adamcin
>Assignee: Mark Adamcin
>Priority: Major
>  Labels: query
> Fix For: 1.62.0
>
>
> Setting the global QueryEngineSettingsService.getFastQuerySize() value to 
> true is currently the only way to allow service users to leverage JCR query 
> for collecting accurate repository count metrics in a performant way. 
> However, doing so in a multiuser repository may be inadvisable because the 
> fast result size is returned to the caller without considering the caller's 
> read permissions over the paths returned in the result, which may allow less 
> privileged users to discover the presence of nodes that are not otherwise 
> visible to them.
> See 
> [https://jackrabbit.apache.org/oak/docs/query/query-engine.html#result-size]
> As an alternative to the global setting, Oak should provide a query option 
> alongside [TRAVERSAL, OFFSET / LIMIT, and INDEX 
> TAG|https://jackrabbit.apache.org/oak/docs/query/query-engine.html#query-options],
>  such as "INSECURE RESULT SIZE" .
> Similarly, IndexDefinition.SecureFacetConfiguration.MODE.INSECURE (insecure 
> facets) can provide extremely valuable counts for property value distribution 
> in large repositories. At the moment, it can only be defined on an index 
> definition, even though it governs the facet counts at query time and has no 
> effect on the persisted content of the index at all. Like fastQuerySize, Oak 
> should provide a query option such as "INSECURE FACETS", for permitted system 
> users to leverage insecure facets even when the query execution plan uses an 
> index definition that only allows secure or statistical facet security. 
> For example, 
> select a.[jcr:path] from [nt:base] as a where contains(a.[text], 'Hello 
> World') option(insecure result size, insecure facets, offset 10)
> To address the security risk, the application should also provide a 
> configuration of some kind to restrict the ability to effectively leverage 
> this option to permitted system users, which could be implemented as a JCR 
> repository privilege or an allowlist property in the 
> QueryEngineSettingsService configuration.
> I have provided a PR that adds support for an INSECURE RESULT SIZE query 
> option and an INSECURE FACETS query option, as well as an 
> "rep:insecureQueryOptions" repository privilege. I think the JCR 
> privilege-based approach for configuration of this permission is more aligned 
> with how system users are defined in practice, but this approach requires a 
> minor version increase in the following oak-security-spi packages:
>  * org.apache.jackrabbit.oak.spi.security.authorization.permission
>  * org.apache.jackrabbit.oak.spi.security.privilege
> Because all registered permissions are serialized into a long bitset, there 
> is clearly a premium on adding another built-in privilege, so I figured that 
> it would be better to choose a name for the privilege that would make it 
> applicable to both of these new options, and any future query options that 
> may involve a tradeoff between security and performance.
>  





[jira] [Resolved] (OAK-10577) Advanced repository statistics

2024-01-03 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10577.
--
Fix Version/s: 1.62.0
   Resolution: Fixed

> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.62.0
>
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.





[jira] [Commented] (OAK-10577) Advanced repository statistics

2024-01-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802241#comment-17802241
 ] 

Thomas Mueller commented on OAK-10577:
--

Merged on 2024-01-03

> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.





[jira] [Updated] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-3583:

Issue Type: Improvement  (was: Wish)

> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: cache
>Reporter: Philipp Suter
>Assignee: Thomas Mueller
>Priority: Major
>
> The currently used Guava Cache API should be replaced, so that we no longer 
> depend on Guava.
> The JCache API [1] might be a suitable replacement. The JCache API 
> implementation should be configurable/pluggable, so it could support one of 
> the available distributed implementations [2].
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html





[jira] [Assigned] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-3583:
---

Assignee: Thomas Mueller

> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Wish
>  Components: cache
>Reporter: Philipp Suter
>Assignee: Thomas Mueller
>Priority: Major
>
> The currently used Guava Cache API should be replaced, so that we no longer 
> depend on Guava.
> The JCache API [1] might be a suitable replacement. The JCache API 
> implementation should be configurable/pluggable, so it could support one of 
> the available distributed implementations [2].
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html





[jira] [Updated] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-3583:

Description: 
The currently used Guava Cache API should be replaced, so that we no longer 
depend on Guava.

The JCache API [1] might be a suitable replacement. The JCache API 
implementation should be configurable/pluggable, so it could support one of the 
available distributed implementations [2].

[1] https://jcp.org/en/jsr/detail?id=107
[2] https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html
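As a stdlib-only illustration of the direction (no Guava dependency), an access-ordered LinkedHashMap already gives a minimal LRU cache. This is a sketch of the idea only; Oak's actual replacement (e.g. its own CacheLIRS, or a JCache provider) may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal Guava-free LRU cache built on the JDK only; illustrative sketch,
// not the cache implementation Oak would ship.
public class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxSize;

    public SimpleLruCache(int maxSize) {
        // accessOrder = true: get() moves an entry to the most-recent position
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // evict the least recently used entry once the cache exceeds maxSize
        return size() > maxSize;
    }
}
```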

  was:
The JCache API [1] was finally released and is ready to be used. 

Ideally the currently used Guava Cache is replaced by the JCache API. The 
JCache API implementation should be configurable/pluggable so it could support 
one of the available distributed implementations [2].

The default should be a wrapper around the current Guava Cache and LIRSCache 
implementations.

[1] https://jcp.org/en/jsr/detail?id=107
[2] https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html


> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Wish
>  Components: cache
>Reporter: Philipp Suter
>Priority: Major
>
> The currently used Guava Cache API should be replaced, so that we no longer 
> depend on Guava.
> The JCache API [1] might be a suitable replacement. The JCache API 
> implementation should be configurable/pluggable, so it could support one of 
> the available distributed implementations [2].
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html





[jira] [Updated] (OAK-3583) Replace Guava API for caching

2023-12-20 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-3583:

Summary: Replace Guava API for caching  (was: Replace Guava API with JCache 
API)

> Replace Guava API for caching
> -
>
> Key: OAK-3583
> URL: https://issues.apache.org/jira/browse/OAK-3583
> Project: Jackrabbit Oak
>  Issue Type: Wish
>  Components: cache
>Reporter: Philipp Suter
>Priority: Major
>
> The JCache API [1] was finally released and is ready to be used. 
> Ideally the currently used Guava Cache is replaced by the JCache API. The 
> JCache API implementation should be configurable/pluggable so it could 
> support one of the available distributed implementations [2].
> The default should be a wrapper around the current Guava Cache and LIRSCache 
> implementations.
> [1] https://jcp.org/en/jsr/detail?id=107
> [2] 
> https://jcp.org/aboutJava/communityprocess/implementations/jsr107/index.html





[jira] [Commented] (OAK-10577) Advanced repository statistics

2023-12-05 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793295#comment-17793295
 ] 

Thomas Mueller commented on OAK-10577:
--

PR (work in progress): https://github.com/apache/jackrabbit-oak/pull/1247

> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.





[jira] [Updated] (OAK-10577) Advanced repository statistics

2023-12-04 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10577:
-
Description: 
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Approximate number and size of distinct binaries.
* Number of distinct values per (indexed) property, and the top values. This is 
useful to improve cost estimation (the "weight" property of indexes) and 
estimate index sizes.


  was:
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Number and size of distinct binaries.
* Number of distinct values per (indexed) property, and the top values. This is 
useful to improve cost estimation (the "weight" property of indexes) and 
estimate index sizes.



> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Approximate number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.





[jira] [Updated] (OAK-10577) Advanced repository statistics

2023-12-04 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10577:
-
Description: 
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Number and size of distinct binaries.
* Number of distinct values per (indexed) property, and the top values. This is 
useful to improve cost estimation (the "weight" property of indexes) and 
estimate index sizes.


  was:
Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Size of distinct binaries.


> Advanced repository statistics
> --
>
> Key: OAK-10577
> URL: https://issues.apache.org/jira/browse/OAK-10577
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: oak-run
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, we have very few metrics per repository, and most are for the 
> whole repository: total size, the total index sizes, datastore size. The only 
> metric we collect per path is the approximate number of nodes per path.
> I would like to collect more data, first via a "flat file store" (sorted list 
> of node data), e.g.
> * Approximate number of nodes per path.
> * Approximate size of binaries per path.
> * Histograms of binary sizes.
> * The same, but for a filtered set of binaries.
> * Number and size of distinct binaries.
> * Number of distinct values per (indexed) property, and the top values. This 
> is useful to improve cost estimation (the "weight" property of indexes) and 
> estimate index sizes.





[jira] [Created] (OAK-10577) Advanced repository statistics

2023-12-04 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10577:


 Summary: Advanced repository statistics
 Key: OAK-10577
 URL: https://issues.apache.org/jira/browse/OAK-10577
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: oak-run
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Currently, we have very few metrics per repository, and most are for the whole 
repository: total size, the total index sizes, datastore size. The only metric 
we collect per path is the approximate number of nodes per path.

I would like to collect more data, first via a "flat file store" (sorted list 
of node data), e.g.

* Approximate number of nodes per path.
* Approximate size of binaries per path.
* Histograms of binary sizes.
* The same, but for a filtered set of binaries.
* Size of distinct binaries.





[jira] [Commented] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-20 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787939#comment-17787939
 ] 

Thomas Mueller commented on OAK-10549:
--

To avoid OOME when running the tests, I changed the test case to use only 10 
facets:

https://github.com/apache/jackrabbit-oak/commit/79ac7fd718b1abb495635cec38f9887a4a2b9219

With 200 facets, the test required 190 MB (-mx190m); with 10 facets, only 25 MB.

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently, reading many facets (eg. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.
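Eager facet caching can be sketched as follows (illustrative code with invented names, not Oak's implementation): one pass over the matching documents counts values for all requested facet columns at once, so each additional facet column no longer costs a separate Lucene query:

```java
import java.util.*;

// Sketch of eager facet caching (invented names, not Oak's code): a single pass
// over the query's matching documents fills counts for every requested facet
// column; subsequent facet reads are served from the cached counts.
public class EagerFacetCache {
    private final Map<String, Map<String, Integer>> countsByColumn = new HashMap<>();

    // matches: each document is represented as a map of column name -> value
    EagerFacetCache(List<Map<String, String>> matches, Set<String> facetColumns) {
        for (Map<String, String> doc : matches) {          // one pass over the results
            for (String column : facetColumns) {
                String value = doc.get(column);
                if (value != null) {
                    countsByColumn
                        .computeIfAbsent(column, k -> new HashMap<>())
                        .merge(value, 1, Integer::sum);
                }
            }
        }
    }

    Map<String, Integer> facet(String column) {
        return countsByColumn.getOrDefault(column, Map.of());
    }
}
```

Such a cache would presumably be keyed per query, so repeated facet reads on the same result set never rerun the search.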





[jira] [Resolved] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-17 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10549.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently, reading many facets (eg. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.





[jira] [Commented] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-15 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786419#comment-17786419
 ] 

Thomas Mueller commented on OAK-10549:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/1215

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, reading many facets (eg. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.





[jira] [Updated] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-15 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10549:
-
Description: 
Currently, reading many facets (eg. 20) at a time is quite slow when using a 
Lucene index. We already cache the data, but performance is not all that great. 
One of the reasons is that we run one Lucene query per facet column. 

It is possible to speed this up, using eager facet caching.

  was:Currently, reading many facets (eg. 20) at a time is quite slow when 
using a Lucene index.


> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, reading many facets (eg. 20) at a time is quite slow when using a 
> Lucene index. We already cache the data, but performance is not all that 
> great. One of the reasons is that we run one Lucene query per facet column. 
> It is possible to speed this up, using eager facet caching.





[jira] [Commented] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785910#comment-17785910
 ] 

Thomas Mueller commented on OAK-10549:
--

The latest change in this area was here:
https://github.com/apache/jackrabbit-oak/compare/trunk...oak-indexing:jackrabbit-oak:OAK-8898

> Improve performance of facet count at scale (Lucene)
> 
>
> Key: OAK-10549
> URL: https://issues.apache.org/jira/browse/OAK-10549
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene, query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently, reading many facets (eg. 20) at a time is quite slow when using a 
> Lucene index.





[jira] [Created] (OAK-10549) Improve performance of facet count at scale (Lucene)

2023-11-14 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10549:


 Summary: Improve performance of facet count at scale (Lucene)
 Key: OAK-10549
 URL: https://issues.apache.org/jira/browse/OAK-10549
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: lucene, query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Currently, reading many facets (eg. 20) at a time is quite slow when using a 
Lucene index.





[jira] [Resolved] (OAK-10527) Improve readability of the explain query output

2023-11-07 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10527.
--
Resolution: Fixed

> Improve readability of the explain query output
> ---
>
> Key: OAK-10527
> URL: https://issues.apache.org/jira/browse/OAK-10527
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently the output "explain query" of Oak (the query plan) is hard to 
> interpret.
> A more human-readable output would be better. Example:
> Old:
> {noformat}
> [nt:base] as [nt:base] /* 
> lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
> sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
> ([nt:base].[sling:vanityPath] is not null) and 
> (first([nt:base].[sling:vanityPath]) > '') */
> {noformat}
> New:
> {noformat}
> [nt:base] as [nt:base] /* lucene:slingResourceResolver-1
> indexDefinition: /oak:index/slingResourceResolver-1
> estimatedEntries: 46
> luceneQuery: sling:vanityPath:[* TO *]
> synchronousPropertyCondition: sling:vanityPath is not null
>  */
> {noformat}
> Also, the formatting of the logged query statement should be improved: 
> instead of one single line with the whole statement, the statement should 
> contain line breaks before the important keywords. Example:
> Old:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
> isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
> FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
> {noformat}
> New:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus]
>   FROM [nt:base]
>   WHERE NOT isdescendantnode('/jcr:system')
>   AND [sling:vanityPath] IS NOT NULL
>   AND FIRST([sling:vanityPath]) > ''
>   ORDER BY FIRST([sling:vanityPath])
> {noformat}





[jira] [Updated] (OAK-10527) Improve readability of the explain query output

2023-11-07 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10527:
-
Fix Version/s: 1.60.0

> Improve readability of the explain query output
> ---
>
> Key: OAK-10527
> URL: https://issues.apache.org/jira/browse/OAK-10527
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> Currently the output "explain query" of Oak (the query plan) is hard to 
> interpret.
> A more human-readable output would be better. Example:
> Old:
> {noformat}
> [nt:base] as [nt:base] /* 
> lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
> sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
> ([nt:base].[sling:vanityPath] is not null) and 
> (first([nt:base].[sling:vanityPath]) > '') */
> {noformat}
> New:
> {noformat}
> [nt:base] as [nt:base] /* lucene:slingResourceResolver-1
> indexDefinition: /oak:index/slingResourceResolver-1
> estimatedEntries: 46
> luceneQuery: sling:vanityPath:[* TO *]
> synchronousPropertyCondition: sling:vanityPath is not null
>  */
> {noformat}
> Also, the formatting of the logged query statement should be improved: 
> instead of one single line with the whole statement, the statement should 
> contain line breaks before the important keywords. Example:
> Old:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
> isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
> FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
> {noformat}
> New:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus]
>   FROM [nt:base]
>   WHERE NOT isdescendantnode('/jcr:system')
>   AND [sling:vanityPath] IS NOT NULL
>   AND FIRST([sling:vanityPath]) > ''
>   ORDER BY FIRST([sling:vanityPath])
> {noformat}





[jira] [Resolved] (OAK-10420) Tool to compare Lucene index content

2023-11-07 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10420.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> Tool to compare Lucene index content
> 
>
> Key: OAK-10420
> URL: https://issues.apache.org/jira/browse/OAK-10420
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> I want to verify that an Oak Lucene index matches another index. Comparing 
> the number of documents in each index is possible, but this comparison is not 
> sufficient. 
> The main problem is that aggregation order depends on the order in which 
> child nodes are traversed, and this order is not guaranteed to always be the 
> same (e.g. the segment node store returns children in a different order than 
> the document node store). This makes file checksums differ, so checksums 
> alone cannot reliably be compared.
> I would like to create a tool that makes comparison of index content easy. 
> This tool needs to account for small differences caused by the above problem.





[jira] [Created] (OAK-10532) Cost estimation for "not(@x)" calculates cost for "@x='value'" instead

2023-11-03 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10532:


 Summary: Cost estimation for "not(@x)" calculates cost for 
"@x='value'" instead
 Key: OAK-10532
 URL: https://issues.apache.org/jira/browse/OAK-10532
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: lucene
Reporter: Thomas Mueller


The cost estimation for a query that uses a Lucene index calculates the cost 
incorrectly if there is a "not()" condition. Example:

{noformat}
/jcr:root//*[(not(@x)) and (not(@y))]
{noformat}

The Lucene query is then:
{noformat}
+:nullProps:x +:nullProps:y
{noformat}

But the cost estimation seems to take into account the number of documents for 
the fields "x" and "y", instead of the field ":nullProps".
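The intended behavior can be sketched with plain Java, using a map in place of the Lucene index's per-term document frequencies (illustrative only; names and the map representation are assumptions):

```java
import java.util.*;

// Illustrative sketch of the fix: the cost of a "not(@x)" condition should be
// based on the document frequency of the term ":nullProps:x" (the marker term
// written for documents where the property is missing), not on the document
// frequency of the property field "x" itself. The docFreq map stands in for
// what a Lucene IndexReader would report per term.
public class NotNullCost {
    static long estimateNotNullCost(Map<String, Long> docFreq, String property) {
        // correct: look up the ":nullProps" marker term for the property
        return docFreq.getOrDefault(":nullProps:" + property, 0L);
    }
}
```

With a repository where "x" is set on 100,000 documents but missing on only 3, the estimate for `not(@x)` should be 3, not 100,000.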





[jira] [Commented] (OAK-10527) Improve readability of the explain query output

2023-11-03 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782462#comment-17782462
 ] 

Thomas Mueller commented on OAK-10527:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/1187

> Improve readability of the explain query output
> ---
>
> Key: OAK-10527
> URL: https://issues.apache.org/jira/browse/OAK-10527
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> Currently the output "explain query" of Oak (the query plan) is hard to 
> interpret.
> A more human-readable output would be better. Example:
> Old:
> {noformat}
> [nt:base] as [nt:base] /* 
> lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
> sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
> ([nt:base].[sling:vanityPath] is not null) and 
> (first([nt:base].[sling:vanityPath]) > '') */
> {noformat}
> New:
> {noformat}
> [nt:base] as [nt:base] /* lucene:slingResourceResolver-1
> indexDefinition: /oak:index/slingResourceResolver-1
> estimatedEntries: 46
> luceneQuery: sling:vanityPath:[* TO *]
> synchronousPropertyCondition: sling:vanityPath is not null
>  */
> {noformat}
> Also, the formatting of the logged query statement should be improved: 
> instead of one single line with the whole statement, the statement should 
> contain line breaks before the important keywords. Example:
> Old:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
> isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
> FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
> {noformat}
> New:
> {noformat}
> Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
> [sling:redirect], [sling:redirectStatus]
>   FROM [nt:base]
>   WHERE NOT isdescendantnode('/jcr:system')
>   AND [sling:vanityPath] IS NOT NULL
>   AND FIRST([sling:vanityPath]) > ''
>   ORDER BY FIRST([sling:vanityPath])
> {noformat}





[jira] [Created] (OAK-10527) Improve readability of the explain query output

2023-11-02 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10527:


 Summary: Improve readability of the explain query output
 Key: OAK-10527
 URL: https://issues.apache.org/jira/browse/OAK-10527
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


Currently the output "explain query" of Oak (the query plan) is hard to 
interpret.
A more human-readable output would be better. Example:

Old:
{noformat}
[nt:base] as [nt:base] /* 
lucene:slingResourceResolver-1(/oak:index/slingResourceResolver-1) 
sling:vanityPath:[* TO *] sync:(sling:vanityPath is not null) where 
([nt:base].[sling:vanityPath] is not null) and 
(first([nt:base].[sling:vanityPath]) > '') */
{noformat}

New:
{noformat}
[nt:base] as [nt:base] /* lucene:slingResourceResolver-1
indexDefinition: /oak:index/slingResourceResolver-1
estimatedEntries: 46
luceneQuery: sling:vanityPath:[* TO *]
synchronousPropertyCondition: sling:vanityPath is not null
 */
{noformat}

Also, the formatting of the logged query statement should be improved: instead 
of one single line with the whole statement, the statement should contain line 
breaks before the important keywords. Example:

Old:
{noformat}
Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
[sling:redirect], [sling:redirectStatus] FROM [nt:base] WHERE NOT 
isdescendantnode('/jcr:system') AND [sling:vanityPath] IS NOT NULL AND 
FIRST([sling:vanityPath]) > '' ORDER BY FIRST([sling:vanityPath])
{noformat}

New:
{noformat}
Parsing JCR-SQL2 statement: explain SELECT [sling:vanityPath], 
[sling:redirect], [sling:redirectStatus]
  FROM [nt:base]
  WHERE NOT isdescendantnode('/jcr:system')
  AND [sling:vanityPath] IS NOT NULL
  AND FIRST([sling:vanityPath]) > ''
  ORDER BY FIRST([sling:vanityPath])
{noformat}
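The proposed statement formatting can be approximated by inserting a line break before each major keyword. The sketch below matches the example output above but is not the actual Oak code, and it is deliberately naive (it would also split on keywords that appear inside string literals):

```java
// Sketch of the proposed log formatting: insert line breaks before major
// JCR-SQL2 keywords so a long statement becomes readable. The keyword list and
// two-space indentation are guesses based on the example, not Oak's implementation.
public class StatementFormatter {
    static String format(String statement) {
        for (String keyword : new String[] {"FROM", "WHERE", "AND", "ORDER BY"}) {
            // only match keywords surrounded by spaces; naive w.r.t. string literals
            statement = statement.replace(" " + keyword + " ", "\n  " + keyword + " ");
        }
        return statement;
    }
}
```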






[jira] [Updated] (OAK-10265) Oak-run offline reindex - async lane revert not taking place for stored index def after index import

2023-10-30 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10265:
-
Component/s: query
 indexing

> Oak-run offline reindex - async lane revert not taking place for stored index 
> def after index import
> 
>
> Key: OAK-10265
> URL: https://issues.apache.org/jira/browse/OAK-10265
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: indexing, query
>Reporter: Nitin Gupta
>Assignee: Nitin Gupta
>Priority: Major
> Fix For: 1.54.0
>
>
> During offline reindex using oak-run,
> the index import phase first changes the async property to temp-async and 
> keeps the original value in async-previous property.
> This is reverted when the import is done. However, it appears that the revert 
> doesn't happen for the stored index definition and leaves it at 
> async = temp-async
> async-previous = [async, nrt]
> By setting "refresh=true", the stored index definition is copied to the 
> regular index definition.





[jira] [Resolved] (OAK-10518) IndexInfo should have a isActive() method

2023-10-26 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10518.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> IndexInfo should have a isActive() method
> -
>
> Key: OAK-10518
> URL: https://issues.apache.org/jira/browse/OAK-10518
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> With the composite node store, it is a bit hard to find out if an index is 
> active or not, as usually only the latest mounted version of an index is 
> active, unless there is a merges property that resolves this.
> The IndexInfoService / IndexInfo class should have a method isActive() so 
> it's easy to find out.





[jira] [Commented] (OAK-10518) IndexInfo should have a isActive() method

2023-10-25 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779531#comment-17779531
 ] 

Thomas Mueller commented on OAK-10518:
--

PR for review https://github.com/apache/jackrabbit-oak/pull/1180

> IndexInfo should have a isActive() method
> -
>
> Key: OAK-10518
> URL: https://issues.apache.org/jira/browse/OAK-10518
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: indexing
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> With the composite node store, it is a bit hard to find out if an index is 
> active or not, as usually only the latest mounted version of an index is 
> active, unless there is a merges property that resolves this.
> The IndexInfoService / IndexInfo class should have a method isActive() so 
> it's easy to find out.





[jira] [Created] (OAK-10518) IndexInfo should have a isActive() method

2023-10-25 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10518:


 Summary: IndexInfo should have a isActive() method
 Key: OAK-10518
 URL: https://issues.apache.org/jira/browse/OAK-10518
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: indexing
Reporter: Thomas Mueller
Assignee: Thomas Mueller


With the composite node store, it is a bit hard to find out if an index is 
active or not, as usually only the latest mounted version of an index is 
active, unless there is a merges property that resolves this.

The IndexInfoService / IndexInfo class should have a method isActive() so it's 
easy to find out.





[jira] [Resolved] (OAK-10497) Properties order in FFS can be different across runs

2023-10-25 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10497.
--
Resolution: Fixed

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when verifying across 
> different strategies that the FFS built is the same, this sometimes leads 
> to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.
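The fix amounts to serializing properties in sorted key order, so two builds of the same node produce byte-identical output. A simplified sketch (the real FFS stores JSON; the `key=value` line format here is for illustration only):

```java
import java.util.*;

// Sketch of deterministic property serialization: emitting properties in sorted
// key order makes two FFS builds of the same node byte-identical, regardless of
// the order in which the source store returned the properties.
public class SortedProperties {
    static String serialize(Map<String, String> properties) {
        StringBuilder line = new StringBuilder();
        for (String key : new TreeSet<>(properties.keySet())) {  // deterministic order
            if (line.length() > 0) {
                line.append(',');
            }
            line.append(key).append('=').append(properties.get(key));
        }
        return line.toString();
    }
}
```

Two maps with the same entries but different insertion order then serialize to the same line, so comparing FFS outputs no longer produces false failures.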





[jira] [Commented] (OAK-10497) Properties order in FFS can be different across runs

2023-10-25 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779395#comment-17779395
 ] 

Thomas Mueller commented on OAK-10497:
--

Merged on 2023-10-20

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when verifying across 
> different strategies that the FFS built is the same, this sometimes leads 
> to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.





[jira] [Updated] (OAK-10265) Oak-run offline reindex - async lane revert not taking place for stored index def after index import

2023-10-24 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-10265:
-
Description: 
During offline reindex using oak-run,
the index import phase first changes the async property to temp-async and keeps 
the original value in async-previous property.

This is reverted when the import is done. However, it appears that the revert 
doesn't happen for the stored index definition and leaves it at 
async = temp-async
async-previous = [async, nrt]

By setting "refresh=true", the stored index definition is copied to the regular 
index definition.

  was:
During offline reindex using oak-run,
the index import phase first changes the async property to temp-async and keeps 
the original value in async-previous property.

This is reverted when the import is done. However it appears that the revert 
doesn't happen for the stored index definition and leaves that at 
async = temp-async
async-previous = [async, nrt]

We should probably add refresh=true to avoid this.


> Oak-run offline reindex - async lane revert not taking place for stored index 
> def after index import
> 
>
> Key: OAK-10265
> URL: https://issues.apache.org/jira/browse/OAK-10265
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Nitin Gupta
>Priority: Major
> Fix For: 1.54.0
>
>
> During offline reindex using oak-run,
> the index import phase first changes the async property to temp-async and 
> keeps the original value in async-previous property.
> This is reverted when the import is done. However, it appears that the revert 
> doesn't happen for the stored index definition and leaves it at 
> async = temp-async
> async-previous = [async, nrt]
> By setting "refresh=true", the stored index definition is copied to the 
> regular index definition.





[jira] [Commented] (OAK-10497) Properties order in FFS can be different across runs

2023-10-19 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1393#comment-1393
 ] 

Thomas Mueller commented on OAK-10497:
--

New PRs:
https://github.com/apache/jackrabbit-oak/pull/1174 -- this I merged without 
running the tests :-/
https://github.com/apache/jackrabbit-oak/pull/1175 -- fixes the bug from the 
above PR

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when verifying across 
> different strategies that the FFS built is the same, this sometimes leads 
> to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.





[jira] [Assigned] (OAK-10497) Properties order in FFS can be different across runs

2023-10-19 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10497:


Assignee: Thomas Mueller  (was: Nitin Gupta)

> Properties order in FFS can be different across runs
> 
>
> Key: OAK-10497
> URL: https://issues.apache.org/jira/browse/OAK-10497
> Project: Jackrabbit Oak
>  Issue Type: Task
>Reporter: Nitin Gupta
>Assignee: Thomas Mueller
>Priority: Major
>
> While building the FFS, the order of the properties can be different for the 
> same node across different builds/runs.
>  
> This does not have any impact on indexing, but when verifying across 
> different strategies that the FFS built is the same, this sometimes leads 
> to false failures.
>  
> We should ensure a sorted order of the properties of every node in the FFS.





[jira] [Resolved] (OAK-10490) Suggest queries return duplicate entries if prefetch is enabled

2023-10-13 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10490.
--
Fix Version/s: 1.60.0
   Resolution: Fixed

> Suggest queries return duplicate entries if prefetch is enabled
> ---
>
> Key: OAK-10490
> URL: https://issues.apache.org/jira/browse/OAK-10490
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.60.0
>
>
> If prefetch is enabled, and prefetch count is larger than 0, then suggest 
> queries return duplicate results.
> This seems to be caused by oak-search FulltextIndex.FulltextPathCursor: 
> FulltextPathCursor.next() returns a new IndexRow that references currentRow, 
> but pathIterator.next() updates currentRow, so the following code can 
> return different results:
> {noformat}
> // here, excerpt1 and excerpt2 are different:
> IndexRow row1 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt2 = row2.getValue("rep:excerpt");
> // here, excerpt1 is equal to excerpt2:
> IndexRow row1 = fulltextPathCursor.next();
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> String excerpt2 = row2.getValue("rep:excerpt");
> {noformat}





[jira] [Commented] (OAK-10490) Suggest queries return duplicate entries if prefetch is enabled

2023-10-12 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774555#comment-17774555
 ] 

Thomas Mueller commented on OAK-10490:
--

PR https://github.com/apache/jackrabbit-oak/pull/1148

> Suggest queries return duplicate entries if prefetch is enabled
> ---
>
> Key: OAK-10490
> URL: https://issues.apache.org/jira/browse/OAK-10490
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> If prefetch is enabled, and prefetch count is larger than 0, then suggest 
> queries return duplicate results.
> This seems to be caused by oak-search FulltextIndex.FulltextPathCursor: 
> FulltextPathCursor.next() returns a new IndexRow that references currentRow, 
> but pathIterator.next() updates currentRow, so the following code can 
> return different results:
> {noformat}
> // here, excerpt1 and excerpt2 are different:
> IndexRow row1 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt2 = row2.getValue("rep:excerpt");
> // here, excerpt1 is equal to excerpt2:
> IndexRow row1 = fulltextPathCursor.next();
> IndexRow row2 = fulltextPathCursor.next();
> String excerpt1 = row1.getValue("rep:excerpt");
> String excerpt2 = row2.getValue("rep:excerpt");
> {noformat}





[jira] [Created] (OAK-10490) Suggest queries return duplicate entries if prefetch is enabled

2023-10-12 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10490:


 Summary: Suggest queries return duplicate entries if prefetch is 
enabled
 Key: OAK-10490
 URL: https://issues.apache.org/jira/browse/OAK-10490
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller
Assignee: Thomas Mueller


If prefetch is enabled, and prefetch count is larger than 0, then suggest 
queries return duplicate results.

This seems to be caused by oak-search FulltextIndex.FulltextPathCursor: 
FulltextPathCursor.next() returns a new IndexRow that references currentRow. 
But pathIterator.next() updates currentRow. So that the following code can 
return different results:

{noformat}
// here, excerpt1 and excerpt2 are different:
IndexRow row1 = fulltextPathCursor.next();
String excerpt1 = row1.getValue("rep:excerpt");
IndexRow row2 = fulltextPathCursor.next();
String excerpt2 = row2.getValue("rep:excerpt");

// here, excerpt1 is equal to excerpt2:
IndexRow row1 = fulltextPathCursor.next();
IndexRow row2 = fulltextPathCursor.next();
String excerpt1 = row1.getValue("rep:excerpt");
String excerpt2 = row2.getValue("rep:excerpt");
{noformat}
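The aliasing described above can be sketched in isolation. This is a minimal, hypothetical model (the class, field, and method names below are illustrative, not the actual Oak classes): the cursor keeps one mutable field that returned rows read lazily, so advancing the cursor retroactively changes rows that were already handed out. The fix is to snapshot the value at the time next() is called.

```java
import java.util.Iterator;
import java.util.List;

// Minimal sketch of the aliasing bug: hypothetical names, not the Oak API.
class AliasingSketch {

    interface Row {
        String getValue();
    }

    static class Cursor {
        private final Iterator<String> paths;
        private String currentPath; // shared mutable state, like currentRow

        Cursor(List<String> paths) {
            this.paths = paths.iterator();
        }

        // Buggy: the returned row reads currentPath lazily, not a copy.
        Row next() {
            currentPath = paths.next();
            return () -> currentPath;
        }

        // Fixed: snapshot the value at the time next() is called.
        Row nextFixed() {
            currentPath = paths.next();
            String snapshot = currentPath;
            return () -> snapshot;
        }
    }

    public static void main(String[] args) {
        Cursor buggy = new Cursor(List.of("/a", "/b"));
        Row r1 = buggy.next();
        Row r2 = buggy.next();
        // Both rows now report "/b": r1 was silently changed by the second next().
        System.out.println(r1.getValue() + " " + r2.getValue()); // /b /b

        Cursor fixed = new Cursor(List.of("/a", "/b"));
        Row f1 = fixed.nextFixed();
        Row f2 = fixed.nextFixed();
        System.out.println(f1.getValue() + " " + f2.getValue()); // /a /b
    }
}
```

The same two-calls-then-read pattern as in the snippets above shows the difference: with the buggy next(), reading after both calls yields equal values.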






[jira] [Resolved] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-30 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller resolved OAK-10399.
--
Fix Version/s: 1.58.0
   Resolution: Fixed

> Automatically pick a merged index over multiple levels
> --
>
> Key: OAK-10399
> URL: https://issues.apache.org/jira/browse/OAK-10399
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Priority: Major
> Fix For: 1.58.0
>
>
> When using the composite node store for blue-green deployments, multiple 
> versions of an index can exist at the same time, for a short period of time 
> (while both blue and green are running at the same time). In OAK-9301 we 
> support merged indexes.
> What we don't support currently is merged indexes over multiple levels. 
> Example:
> * /oak:index/index-1 (first version of the index)
> * /oak:index/index-1-custom-1 (customization of that index)
> * /oak:index/index-2 (new base version)
> * /oak:index/index-2-custom-1 (auto-merged index)
> * /oak:index/index-3 (the second new base version)
> * /oak:index/index-3-custom-1 (auto-merged index)
> In this case, index-3 is used for queries, instead of index-3-custom-1.
> The reason is the following: whenever we auto-merge, we set the merges 
> property to the previous base version, and the previous customization. This 
> works well for index-2-custom-1, but doesn't work for index-3-custom-1.
> We need to change the index picking algorithm, such that only one level of 
> base indexes is checked: only the existence of index-3. The existence of 
> index-2 must not be checked. 





[jira] [Assigned] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-30 Thread Thomas Mueller (Jira)


 [ 
https://issues.apache.org/jira/browse/OAK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller reassigned OAK-10399:


Assignee: Thomas Mueller

> Automatically pick a merged index over multiple levels
> --
>
> Key: OAK-10399
> URL: https://issues.apache.org/jira/browse/OAK-10399
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
> Fix For: 1.58.0
>
>
> When using the composite node store for blue-green deployments, multiple 
> versions of an index can exist at the same time, for a short period of time 
> (while both blue and green are running at the same time). In OAK-9301 we 
> support merged indexes.
> What we don't support currently is merged indexes over multiple levels. 
> Example:
> * /oak:index/index-1 (first version of the index)
> * /oak:index/index-1-custom-1 (customization of that index)
> * /oak:index/index-2 (new base version)
> * /oak:index/index-2-custom-1 (auto-merged index)
> * /oak:index/index-3 (the second new base version)
> * /oak:index/index-3-custom-1 (auto-merged index)
> In this case, index-3 is used for queries, instead of index-3-custom-1.
> The reason is the following: whenever we auto-merge, we set the merges 
> property to the previous base version, and the previous customization. This 
> works well for index-2-custom-1, but doesn't work for index-3-custom-1.
> We need to change the index picking algorithm, such that only one level of 
> base indexes is checked: only the existence of index-3. The existence of 
> index-2 must not be checked. 





[jira] [Commented] (OAK-10420) Tool to compare Lucene index content

2023-08-28 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759579#comment-17759579
 ] 

Thomas Mueller commented on OAK-10420:
--

PR https://github.com/apache/jackrabbit-oak/pull/1086

> Tool to compare Lucene index content
> 
>
> Key: OAK-10420
> URL: https://issues.apache.org/jira/browse/OAK-10420
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>Reporter: Thomas Mueller
>Assignee: Thomas Mueller
>Priority: Major
>
> I want to verify that an Oak Lucene index matches another index. Comparing 
> the number of documents in each index is possible, but this comparison is not 
> sufficient. 
> The main problem is that aggregation order depends on the order in which 
> child nodes are traversed, and this order is not guaranteed to be the same 
> across node stores (e.g. the segment node store returns children in a 
> different order than the document node store). This makes the checksums of 
> the index files differ, so file checksums cannot reliably be compared.
> I would like to create a tool that makes comparison of index content easy. 
> This tool needs to account for small differences caused by the above problem.





[jira] [Created] (OAK-10420) Tool to compare Lucene index content

2023-08-28 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10420:


 Summary: Tool to compare Lucene index content
 Key: OAK-10420
 URL: https://issues.apache.org/jira/browse/OAK-10420
 Project: Jackrabbit Oak
  Issue Type: Improvement
Reporter: Thomas Mueller
Assignee: Thomas Mueller


I want to verify that an Oak Lucene index matches another index. Comparing the 
number of documents in each index is possible, but this comparison is not 
sufficient. 

The main problem is that aggregation order depends on the order in which child 
nodes are traversed, and this order is not guaranteed to be the same across 
node stores (e.g. the segment node store returns children in a different order 
than the document node store). This makes the checksums of the index files 
differ, so file checksums cannot reliably be compared.

I would like to create a tool that makes comparison of index content easy. This 
tool needs to account for small differences caused by the above problem.
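One way to tolerate traversal-order differences is to combine per-document hashes with a commutative operation. This is only a sketch of that idea, under the assumption that each logical document can be serialized to a stable string; the class and method names are hypothetical, not the actual oak-run tool.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch: hash each document individually and sum the hashes.
// Addition is commutative, so the result does not depend on the order in
// which child nodes (and hence documents) were traversed.
class IndexChecksum {

    static long orderIndependentChecksum(List<String> documents) {
        long sum = 0;
        for (String doc : documents) {
            CRC32 crc = new CRC32();
            crc.update(doc.getBytes(StandardCharsets.UTF_8));
            sum += crc.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        long a = orderIndependentChecksum(List.of("/content/a", "/content/b"));
        long b = orderIndependentChecksum(List.of("/content/b", "/content/a"));
        System.out.println(a == b); // true: traversal order does not matter
    }
}
```

Two indexes with the same documents in a different aggregation order then produce the same checksum, while a genuinely different document set changes it.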






[jira] [Commented] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-14 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754144#comment-17754144
 ] 

Thomas Mueller commented on OAK-10399:
--

PR https://github.com/apache/jackrabbit-oak/pull/1066

> Automatically pick a merged index over multiple levels
> --
>
> Key: OAK-10399
> URL: https://issues.apache.org/jira/browse/OAK-10399
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Thomas Mueller
>Priority: Major
>
> When using the composite node store for blue-green deployments, multiple 
> versions of an index can exist at the same time, for a short period of time 
> (while both blue and green are running at the same time). In OAK-9301 we 
> support merged indexes.
> What we don't support currently is merged indexes over multiple levels. 
> Example:
> * /oak:index/index-1 (first version of the index)
> * /oak:index/index-1-custom-1 (customization of that index)
> * /oak:index/index-2 (new base version)
> * /oak:index/index-2-custom-1 (auto-merged index)
> * /oak:index/index-3 (the second new base version)
> * /oak:index/index-3-custom-1 (auto-merged index)
> In this case, index-3 is used for queries, instead of index-3-custom-1.
> The reason is the following: whenever we auto-merge, we set the merges 
> property to the previous base version, and the previous customization. This 
> works well for index-2-custom-1, but doesn't work for index-3-custom-1.
> We need to change the index picking algorithm, such that only one level of 
> base indexes is checked: only the existence of index-3. The existence of 
> index-2 must not be checked. 





[jira] [Created] (OAK-10399) Automatically pick a merged index over multiple levels

2023-08-14 Thread Thomas Mueller (Jira)
Thomas Mueller created OAK-10399:


 Summary: Automatically pick a merged index over multiple levels
 Key: OAK-10399
 URL: https://issues.apache.org/jira/browse/OAK-10399
 Project: Jackrabbit Oak
  Issue Type: Improvement
  Components: query
Reporter: Thomas Mueller


When using the composite node store for blue-green deployments, multiple 
versions of an index can exist at the same time, for a short period of time 
(while both blue and green are running at the same time). In OAK-9301 we 
support merged indexes.

What we don't support currently is merged indexes over multiple levels. Example:

* /oak:index/index-1 (first version of the index)
* /oak:index/index-1-custom-1 (customization of that index)
* /oak:index/index-2 (new base version)
* /oak:index/index-2-custom-1 (auto-merged index)
* /oak:index/index-3 (the second new base version)
* /oak:index/index-3-custom-1 (auto-merged index)

In this case, index-3 is used for queries, instead of index-3-custom-1.

The reason is the following: whenever we auto-merge, we set the merges property 
to the previous base version, and the previous customization. This works well 
for index-2-custom-1, but doesn't work for index-3-custom-1.

We need to change the index picking algorithm, such that only one level of base 
indexes is checked: only the existence of index-3. The existence of index-2 
must not be checked. 
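The proposed rule can be sketched as follows. This is a simplified, hypothetical illustration of the picking logic (names and helper are not the actual Oak implementation): find the highest base version "index-&lt;n&gt;" that exists, then prefer its merged customization "index-&lt;n&gt;-custom-1" if present; older base versions such as index-2 are never consulted.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the adjusted index-picking rule: only the highest
// existing base version matters; its customized variant wins if present.
class IndexPicker {

    static String pickIndex(Set<String> indexes) {
        Pattern base = Pattern.compile("index-(\\d+)");
        int best = -1;
        for (String name : indexes) {
            Matcher m = base.matcher(name);
            if (m.matches()) {
                best = Math.max(best, Integer.parseInt(m.group(1)));
            }
        }
        // Prefer the merged customization of the highest base version.
        String custom = "index-" + best + "-custom-1";
        return indexes.contains(custom) ? custom : "index-" + best;
    }

    public static void main(String[] args) {
        Set<String> indexes = Set.of(
                "index-1", "index-1-custom-1",
                "index-2", "index-2-custom-1",
                "index-3", "index-3-custom-1");
        System.out.println(pickIndex(indexes)); // index-3-custom-1
    }
}
```

With the example hierarchy above, this picks index-3-custom-1 rather than index-3, regardless of whether index-2 still exists.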






