[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372914#comment-16372914 ]

Thomas Mueller commented on OAK-7109:
-------------------------------------

[~diru] OK, I think I understand.

> Not sure if the index supports not()

For "contains", this is supported via "contains(..., '-exclude')". For generic conditions, it is not currently supported.

> rep:facet returns wrong results for complex queries
> ---------------------------------------------------
>
> Key: OAK-7109
> URL: https://issues.apache.org/jira/browse/OAK-7109
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: lucene
> Affects Versions: 1.6.7
> Reporter: Dirk Rudolph
> Priority: Major
> Labels: facet
> Attachments: facetsInMultipleRoots.patch, restrictionPropagationTest.patch
>
>
> Complex queries in this case are queries that are passed to Lucene without all of their original constraints. For example, queries with multiple path restrictions like:
> {code}
> select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 'ipsum') and (isdescendantnode(a,'/content1') or isdescendantnode(a,'/content2'))
> {code}
> In that particular case the index planner gives ":fulltext:ipsum" to Lucene even though the index supports evaluating path constraints.
> As counting the facets happens on the raw result of Lucene, the returned facets are incorrect. For example, given the following content
> {code}
> /content1/test/foo
>   + text = lorem ipsum
>   - simple/
>     + tags = tag1, tag2
> /content2/test/bar
>   + text = lorem ipsum
>   - simple/
>     + tags = tag1, tag2
> /content3/test/bar
>   + text = lorem ipsum
>   - simple/
>     + tags = tag1, tag2
> {code}
> the expected result for the dimensions of simple/tags and the query above is
> - tag1: 2
> - tag2: 2
> as the result set is 2 results long and all documents are equal. The actual result set is
> - tag1: 3
> - tag2: 3
> as the path constraint is not handled by Lucene.
> To work around that, the only solution that came to my mind is building the [disjunctive normal form|https://en.wikipedia.org/wiki/Disjunctive_normal_form] of my complex query and executing a query for each of the disjunctive statements. As this expands exponentially, it's only a theoretical solution, nothing for production.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
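The miscount described in the issue can be reproduced outside Oak with a few lines of Python. This is only a sketch of the reported behaviour, not Oak or Lucene code: the helper names (`fulltext_hits`, `under`, `facet_counts`) are made up for illustration, and the "index" is a plain dictionary holding the three nodes from the description.

```python
from collections import Counter

# Content from the issue description: path -> (text, tags)
DOCS = {
    "/content1/test/foo": ("lorem ipsum", ["tag1", "tag2"]),
    "/content2/test/bar": ("lorem ipsum", ["tag1", "tag2"]),
    "/content3/test/bar": ("lorem ipsum", ["tag1", "tag2"]),
}

def fulltext_hits(term):
    """Stand-in for the ':fulltext:ipsum' query: no path restriction at all."""
    return [p for p, (text, _) in DOCS.items() if term in text.split()]

def under(path, roots):
    """Stand-in for isdescendantnode(a, root) or-ed over several roots."""
    return any(path.startswith(root + "/") for root in roots)

def facet_counts(paths):
    """Count every tag value once per document in the given result set."""
    counts = Counter()
    for p in paths:
        counts.update(DOCS[p][1])
    return dict(counts)

raw = fulltext_hits("ipsum")                                  # what the index sees
filtered = [p for p in raw if under(p, ["/content1", "/content2"])]

actual = facet_counts(raw)         # facets counted on the raw index result
expected = facet_counts(filtered)  # facets counted on the real result set

print("actual:  ", actual)
print("expected:", expected)
```

Counting on `raw` yields 3 per tag, counting on `filtered` yields 2, which matches the numbers given in the description.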
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332289#comment-16332289 ]

Dirk Rudolph commented on OAK-7109:
-----------------------------------

Thanks for the response. Regarding 1), see https://issues.apache.org/jira/browse/OAK-7109?focusedCommentId=16309376=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16309376

The optimisation currently does the following:

A and (B or not(C and D)) => (A and B) or (A and not(C and D))

To get a DNF that can then be split into UNIONs of pure conjunctions, another step has to happen before the current optimisation: conversion to NNF (moving all negations down the tree of statements):

A and (B or not(C and D)) => A and (B or not(C) or not(D)) => (A and B) or (A and not(C)) or (A and not(D))

Not sure if the index supports not(), but if it does, the UNION of the three queries above would give exact facets, which simply need to be deduplicated.
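The two steps discussed here (first NNF, then distribution into a DNF) can be sketched on a toy expression tree. The tuple encoding and the function names `nnf` and `dnf` below are assumptions made for illustration; they are not the representation or the API used by Oak's query optimiser.

```python
# Expressions: an atom is a string; compound nodes are tuples
# ("not", e), ("and", e1, e2) or ("or", e1, e2).

def nnf(e):
    """Push negations down to the atoms (De Morgan, double negation)."""
    if isinstance(e, str):
        return e
    op = e[0]
    if op in ("and", "or"):
        return (op, nnf(e[1]), nnf(e[2]))
    inner = e[1]                      # op == "not"
    if isinstance(inner, str):
        return e                      # negated atom: already in NNF
    if inner[0] == "not":
        return nnf(inner[1])          # not(not(x)) => x
    flipped = "or" if inner[0] == "and" else "and"
    return (flipped, nnf(("not", inner[1])), nnf(("not", inner[2])))

def dnf(e):
    """Turn an NNF expression into a list of conjunctions (lists of literals)."""
    if isinstance(e, str) or e[0] == "not":
        return [[e]]
    if e[0] == "or":
        return dnf(e[1]) + dnf(e[2])
    # "and": combine every alternative on the left with every one on the right
    return [left + right for left in dnf(e[1]) for right in dnf(e[2])]

# A and (B or not(C and D))
expr = ("and", "A", ("or", "B", ("not", ("and", "C", "D"))))
branches = dnf(nnf(expr))
for branch in branches:
    print(branch)
```

Each entry of `branches` corresponds to one member of the UNION. The cross product in the "and" case is also where the exponential growth mentioned in the issue description comes from.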
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332267#comment-16332267 ]

Thomas Mueller commented on OAK-7109:
-------------------------------------

[~diru] I'm sorry for the delay. I'm afraid I can't follow you... Some links (more for myself; please tell me if I made a mistake):

CNF https://en.wikipedia.org/wiki/Conjunctive_normal_form
example: A and (B or C)

DNF https://en.wikipedia.org/wiki/Disjunctive_normal_form
example: (A and not(B) and not(C)) or (not(D) and E and F)

NNF https://en.wikipedia.org/wiki/Negation_normal_form
example: (A or B) and C

> all constraints have to be passed to lucene, so the query has to be in DNF, which is not the case at the moment

Only the filter is passed to Lucene currently, and that one doesn't have any "or" conditions (except for "x in(1, 2, 3)"). Changing that will be hard, and has some disadvantages. Other "or" conditions are currently only supported by using "union" (aggregation), or by not processing them in the index (filtering in the query engine). So I think it's not so much about "not" conditions.

> would require also a deduplication between the lucene result sets returned from each of the unions.

Yes. I think that's possible, even though it's not optimal.
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312900#comment-16312900 ]

Dirk Rudolph commented on OAK-7109:
-----------------------------------

[~tmueller] so adding the feature to aggregate the current rep:facet extraction from the UNION alternatives has two drawbacks:

1) as said above, all constraints have to be passed to Lucene, so the query has to be in DNF, which is not the case at the moment
2) even if it were, the conjunctions of the DNF are not mutually exclusive, which leads to inaccurate results as well
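The second drawback, that the members of a UNION are not mutually exclusive, can be made concrete with a small made-up data set. Everything below is hypothetical (the documents, the two branches `[a]=1` and `[b]=2`, and the helper names). It only illustrates why summing per-branch facet counts overcounts, and why deduplicating by path before counting does not.

```python
from collections import Counter

# Hypothetical documents: path -> properties ("tags" is the facet dimension)
DOCS = {
    "/x/1": {"a": 1, "b": 0, "tags": ["tag1"]},
    "/x/2": {"a": 0, "b": 2, "tags": ["tag1"]},
    "/x/3": {"a": 1, "b": 2, "tags": ["tag1", "tag2"]},  # matches both branches
}

# The two members of the UNION for "[a]=1 or [b]=2"
BRANCHES = [
    lambda d: d["a"] == 1,
    lambda d: d["b"] == 2,
]

def run(branch):
    """Paths matched by one branch of the union."""
    return [p for p, d in DOCS.items() if branch(d)]

def facets(paths):
    counts = Counter()
    for p in paths:
        counts.update(DOCS[p]["tags"])
    return dict(counts)

# Naive aggregation: add up the facet counts reported per branch.
summed = Counter()
for branch in BRANCHES:
    summed.update(facets(run(branch)))
naive = dict(summed)

# Deduplicate the matching paths first, then count once.
matched = set()
for branch in BRANCHES:
    matched.update(run(branch))
exact = facets(sorted(matched))

print("naive:", naive)
print("exact:", exact)
```

The document `/x/3` is returned by both branches, so the naive sum counts its tags twice. Counting after deduplication requires the matching paths of every branch, not just the per-branch facet totals.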
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312793#comment-16312793 ]

Thomas Mueller commented on OAK-7109:
-------------------------------------

[~catholicon] OK, I see that facets do not exactly match "group by" + "count".

So, what if we add a feature to aggregate the data from a "select [rep:facet(...)] ... UNION select [rep:facet(...)] ..." query? I believe aggregating that data in the query engine should be possible, as the data format of the facet feature is known.

>> What if Lucene doesn't index all the constraints?
> fail such queries

Sounds good to me. I believe right now, if a query uses "select [rep:facet(...)]", then only indexes that support that are used. If there is no index that supports facets, then the query should fail with an exception (if that's not the case yet, we should probably add that). If the Lucene index doesn't support some of the conditions, then it shouldn't return an index plan. That should solve the problem with "union" queries as well.
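The "fail rather than return wrong facets" rule proposed in this comment can be sketched as a planning step. This is a hypothetical model, not Oak's planner: `choose_plan`, `FacetQueryNotSupported`, and the constraint labels are invented for illustration, and an index is reduced to the set of constraint kinds it can evaluate.

```python
class FacetQueryNotSupported(Exception):
    """Hypothetical error for a facet query that no index can fully evaluate."""

def choose_plan(constraints, wants_facets, indexes):
    """Return (index name, constraints left for the query engine to filter).

    For a facet query, an index that cannot evaluate every constraint
    offers no plan; if no index qualifies, the query fails.
    """
    for name, supported in indexes.items():
        leftover = sorted(set(constraints) - set(supported))
        if wants_facets and leftover:
            continue  # this index must not return a plan for the facet query
        return name, leftover
    if wants_facets:
        raise FacetQueryNotSupported(
            "no index evaluates all of: %s" % sorted(constraints))
    return None, sorted(constraints)

indexes = {"lucene": {"fulltext"}}
constraints = ["fulltext", "path-or"]

# Without facets, the leftover constraint is filtered by the query engine.
print(choose_plan(constraints, False, indexes))

# With facets, the same query is rejected instead of being miscounted.
try:
    choose_plan(constraints, True, indexes)
except FacetQueryNotSupported as e:
    print("rejected:", e)
```

The point of the design is that post-filtering in the query engine is harmless for ordinary result rows but not for facet counts, which are computed inside the index.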
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312691#comment-16312691 ]

Dirk Rudolph commented on OAK-7109:
-----------------------------------

{quote}
I have a very pessimistic view that we should fail such queries - I mean it's better to fail and allow for right index def than giving incorrect results.
{quote}

+1
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312687#comment-16312687 ]

Vikas Saurabh commented on OAK-7109:
------------------------------------

{quote}
(I know the "group by" and "count" are not currently supported by Oak). Or are there other aspects I missed?
{quote}

Indeed, fundamentally that's what facets do - provide usually a few (not 'all', unlike group by) properties and count according to how many documents match the query. Lucene's faceting support also does ranges, although we don't support that yet - e.g. I could facet on "jcr:created" and the categories could turn out as "today", "within last week", etc. (I'm not completely sure about the API... I'm just trying to illustrate that faceted categories can potentially be not-the-actually-stored-value).

bq. What do you mean with "scoring"?

The scoring part is an entirely different issue, unrelated to facets - e.g. we won't (can't??) correctly order documents matching queries such as {{ WHERE (CONTAINS(., 'text') AND foo1='bar') OR (CONTAINS(., 'text') AND foo2='bar' AND foo3='bar')}} (foo=bar could be a different fulltext clause too... the issue is that we can't quite merge scores coming out of separate lucene queries). But let's ignore the scoring for this issue.

bq. What if Lucene doesn't index all the constraints?

I have a very pessimistic view that we should fail such queries - I mean it's better to fail and allow for right index def than giving incorrect results.
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312675#comment-16312675 ]

Thomas Mueller commented on OAK-7109:
-------------------------------------

I don't fully know how facets work. Could you help me a bit with this, please? The query

{noformat}
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 'ipsum') and (isdescendantnode(a,'/content1') or isdescendantnode(a,'/content2'))
{noformat}

converted to "regular SQL" would be this, right?

{noformat}
select [simple/tags], count(*) from [nt:base] as a where contains(a.[*], 'ipsum') and (isdescendantnode(a,'/content1') or isdescendantnode(a,'/content2')) group by [simple/tags]
{noformat}

(I know the "group by" and "count" are not currently supported by Oak). Or are there other aspects I missed? What do you mean with "scoring"?

If it's the same, then I guess we might want to support the "group by" and "count" features in Oak, or add custom logic to combine the results of

{noformat}
select [rep:facet(...)] ... UNION select [rep:facet(...)] ...
{noformat}

> passing all constraints to lucene

What if Lucene doesn't index all the constraints?
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312632#comment-16312632 ]

Vikas Saurabh commented on OAK-7109:
------------------------------------

[~diru], thanks for the investigation. I now see the issue, but unfortunately, with the current design of how the query engine parses queries and then passes sub-queries to index providers, it's almost impossible to have correct faceting for complex queries.

The way I see the fundamental problem is:
* facet is an aggregation function => any query with rep:facet must be completely resolved by a single index
* currently index providers only resolve ANDed clauses => so, complex queries never get all their clauses passed down to the (lucene) index

I really don't have any solution or work-around for your problem though :(.

[~tmueller], would you have any ideas about how we can make such cases work?

PS: Btw, [~diru], the scoring across UNIONed clauses won't work (as you mentioned in the mail) - but that's a digression and won't solve the problem at hand, as you correctly said that the different clauses across UNIONs won't be disjoint.
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309559#comment-16309559 ]

Dirk Rudolph commented on OAK-7109:
-----------------------------------

Here is an example where constraints get lost in the filter:

{code}
select * from [nt:base] where ([propa] = 'true' and [propb] in('foo','bar')) or ([propa] = 'false' and not([propb] in('foo','bar')))
{code}

It implements a kind of white-/blacklisting along the lines of "if a is set to true, b has to be in a configured set; if not, b must not be in the configured set."

It evaluates to:

{code}
[nt:base] as [nt:base] /* lucene:test2(/oak:index/test2) propa:[* TO *] where [nt:base].[propa] is not null */
{code}

This doesn't contain anything about propb, so in that case facet counting will be wrong as well.
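To see how far the counts can drift in the white-/blacklisting case, the plan's filter (`propa` is not null) can be compared with the full condition on a handful of made-up documents. The data and the `tags` facet dimension below are assumptions for illustration only. The query in the comment selects `*`, so a `rep:facet` column is imagined here.

```python
from collections import Counter

ALLOWED = {"foo", "bar"}

# Hypothetical documents; "tags" plays the role of the facet dimension.
DOCS = [
    {"propa": "true",  "propb": "foo", "tags": ["tag1"]},  # matches (whitelist)
    {"propa": "true",  "propb": "baz", "tags": ["tag1"]},  # does not match
    {"propa": "false", "propb": "baz", "tags": ["tag2"]},  # matches (blacklist)
    {"propa": "false", "propb": "bar", "tags": ["tag2"]},  # does not match
]

def full_condition(d):
    """The complete where clause of the query."""
    return ((d["propa"] == "true" and d["propb"] in ALLOWED)
            or (d["propa"] == "false" and d["propb"] not in ALLOWED))

def index_filter(d):
    """What the plan shows: propa:[* TO *], i.e. propa is not null."""
    return d.get("propa") is not None

def facets(docs):
    counts = Counter()
    for d in docs:
        counts.update(d["tags"])
    return dict(counts)

from_index = facets([d for d in DOCS if index_filter(d)])
correct = facets([d for d in DOCS if full_condition(d)])

print("from index filter:", from_index)
print("correct:          ", correct)
```

With this data every count from the weakened filter is twice the correct value, because half of the documents that pass `propa is not null` fail the `propb` part of the condition.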
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309376#comment-16309376 ]

Dirk Rudolph commented on OAK-7109:
---

Hi [~catholicon],

somehow the mail agent doesn't accept my mails to oak-dev (I'm subscribed and receiving mail, but sending doesn't work ... anyway).

I checked the implementation of the optimisation and its result is not in DNF, because the optimisation is not done on the negation normal form of the query (so not(a or b) is not expanded to not(a) and not(b)). For example (based on org.apache.jackrabbit.oak.query.SQL2OptimiseQueryTest#optimiseAndOrAnd()):

{code}
given
([a]=1 or [b]=2 or ([c]=3 and not([d]=4 or [e]=5))) and [x]=6
<=> ([a]=1 or [b]=2 or ([c]=3 and [d]<>4 and [e]<>5)) and [x]=6

expected
([a]=1 and [x]=6), ([b]=2 and [x]=6), ([c]=3 and [d]<>4 and [e]<>5 and [x]=6)

actual
((c = 3) and (not ((d = 4) or (e = 5)))) and (x = 6), (b = 2) and (x = 6), (a = 1) and (x = 6)
{code}

And even assuming the alternative were in DNF and facet counting across unions were supported by merging the results from each of the queries given to lucene, the result would still be wrong, as the disjunctive statements are not mutually exclusive (as they would be with xor). So from my perspective there is no way to get proper facet counts in that case from the consumer side, and only the options b) filtering the documents based on the filter or c) passing all constraints to lucene would work.

Regarding b): from what I can see in the code base, the nodes are not actually read; only the permissions on their path are checked in [FilteredSortedSetDocValuesFacetCounts.java#L91|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/util/FilteredSortedSetDocValuesFacetCounts.java#L91].

I will check further why our specific query doesn't get entirely passed to lucene (or rather, which constraints besides the path constraints are not taken into account). Anyway, as a user of the JCR API I would expect a RepositoryException (or any other exception) when I try to run a query with facet extraction that no index can provide, similar to the exception I get when the field I extract facets on is not stored.
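The concern in the comment above, that DNF disjuncts are not mutually exclusive, can be illustrated with a small self-contained sketch. It does not use the Oak or Lucene APIs: `Doc`, `countFacets` and `sumPerDisjunct` are hypothetical names, and the two predicates are made-up restrictions chosen so that they overlap on /content2/test/bar. Summing the per-disjunct facet counts counts that node twice, while counting over the combined condition does not.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical in-memory model; no Oak or Lucene classes are involved.
class FacetOverlapSketch {

    // Stand-in for an indexed node: its path and its facet labels.
    static final class Doc {
        final String path;
        final List<String> tags;

        Doc(String path, String... tags) {
            this.path = path;
            this.tags = Arrays.asList(tags);
        }
    }

    // Content taken from the issue description.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc("/content1/test/foo", "tag1", "tag2"),
                new Doc("/content2/test/bar", "tag1", "tag2"),
                new Doc("/content3/test/bar", "tag1", "tag2"));
    }

    // Count each facet label once per document that matches the condition.
    static Map<String, Integer> countFacets(Collection<Doc> docs, Predicate<Doc> query) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            if (query.test(doc)) {
                for (String tag : doc.tags) {
                    counts.merge(tag, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Naive consumer-side merge: run every disjunct on its own and add up the counts.
    static Map<String, Integer> sumPerDisjunct(Collection<Doc> docs, List<Predicate<Doc>> disjuncts) {
        Map<String, Integer> total = new TreeMap<>();
        for (Predicate<Doc> disjunct : disjuncts) {
            countFacets(docs, disjunct).forEach((label, n) -> total.merge(label, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        // Two overlapping disjuncts: both match /content2/test/bar.
        Predicate<Doc> underContent2 = d -> d.path.startsWith("/content2/");
        Predicate<Doc> namedBar = d -> d.path.endsWith("/bar");

        System.out.println("combined: " + countFacets(docs, underContent2.or(namedBar)));
        System.out.println("summed:   " + sumPerDisjunct(docs, Arrays.asList(underContent2, namedBar)));
    }
}
```

With the sample content, the combined condition yields {tag1=2, tag2=2}, while the naive sum yields {tag1=3, tag2=3}.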
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301777#comment-16301777 ]

Vikas Saurabh commented on OAK-7109:

[~diru],

bq. That basically works, but only in the case that both queries hit the same index as only then TF/IDF score is comparable (also across multiple queries). So the solutions I see are:

Umm... I don't think lucene scores across different queries from the same index are comparable (the first thing that comes to my mind is that normalization factors would be different for each query; there might be other reasons too).

bq. a) creating DNF disjunctive statements of a query as alternatives (not sure if the alternative currently created is DNF) and support proper counting over union queries

Well, the alternative is indeed very similar... although ORs are made into UNIONs. The bigger problem is that the current lucene cost estimation would give the same cost (at least for the example in the description) for both sub-queries... that would make the total cost of the UNION-ed execution double what the non-alternative version would give. Current (OAK-6776) would scale the cost for both components down... so the cost war would be fairer... but there would still be a chance that the original query wins the cost war.

bq. b) filtering the results in the using the query plans filter while counting facets, similar to the way its done for ACLs

I think that would be pretty bad for performance. I haven't looked closely at how the ACL filtering was done, but there definitely were concerns... not sure how they were avoided... or if that wasn't required at all.

bq. c) implementing a mode which translates any query as it is to its lucene equivalent

I'm not sure what you mean by "any query": as far as I know all reasonable constraints (property, ordering, fulltext) do get passed down well to lucene. Of course, that depends on the backing index definition being sufficient for the query. Imo, if both (or more... along with operators) could be passed down well, then this could have been solved, but we don't have that functionality yet.

bq. We tried already running one query for each path, but even with that the individual queries are too complex to be passed to lucene with all constraints. (not entirely sure why though ...)

I'd be interested to look at your query and the index def. Can you share some details in a mail to oak-dev?
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301325#comment-16301325 ]

Dirk Rudolph commented on OAK-7109:
---

Yeah, support of unions with facets doesn't work well, as facets are extracted on each row, though they relate to the result, not to the rows. I will open an improvement for that as well, as this has some cost: basically getTopChildren() is called for each row while iterating the result set.

With splitting the result I didn't mean running the query as a union, but running individual queries, merging their RowIterators manually, extracting facets only from the first hit of each, and merging those together as well. That basically works, but as I said I would have to rewrite the query in DNF like in the example:

{code}
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 'ipsum') and isdescendantnode(a,'/content1')

select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 'ipsum') and isdescendantnode(a,'/content2')
{code}

That basically works, but only if both queries hit the same index, as only then is the TF/IDF score comparable (also across multiple queries). So the solutions I see are:

a) creating DNF disjunctive statements of a query as alternatives (not sure if the alternative currently created is DNF) and supporting proper counting over union queries
b) filtering the results using the query plan's filter while counting facets, similar to the way it's done for ACLs
c) implementing a mode which translates any query as it is to its lucene equivalent

Both a) and b) probably come with a performance drawback. c) might not even be feasible.
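A minimal sketch of the manual split described above, assuming the query has the simple shape from the example (one facet dimension, one fulltext term and a list of root paths). `perPathStatements` and `escape` are hypothetical helpers, not part of the JCR or Oak API, and the escaping only covers single quotes in SQL-2 string literals. Executing the statements and reading the facets is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; it only builds query strings and runs nothing.
class PerPathFacetQueries {

    // Build one JCR-SQL2 statement per path, each with a single path restriction.
    static List<String> perPathStatements(String facetDimension, String fulltextTerm, List<String> paths) {
        List<String> statements = new ArrayList<>();
        for (String path : paths) {
            statements.add("select [rep:facet(" + facetDimension + ")] from [nt:base] as a"
                    + " where contains(a.[*], '" + escape(fulltextTerm) + "')"
                    + " and isdescendantnode(a, '" + escape(path) + "')");
        }
        return statements;
    }

    // Double single quotes so the value stays inside its string literal.
    static String escape(String value) {
        return value.replace("'", "''");
    }

    public static void main(String[] args) {
        List<String> statements =
                perPathStatements("simple/tags", "ipsum", Arrays.asList("/content1", "/content2"));
        for (String statement : statements) {
            System.out.println(statement);
        }
    }
}
```

Each generated statement carries exactly one path restriction, which is the form used in the example above.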
[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries
[ https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301319#comment-16301319 ]

Vikas Saurabh commented on OAK-7109:

bq. To workaround that the only solution that came to my mind is building the DNF of my complex query and executing a query for each of the disjunctive statements. As this is expanding exponentially its only a theoretical solution, nothing for production.

Interesting issue. Btw, the work-around you mentioned above would also most likely not work right away :(. Result from
{noformat}
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 'ipsum') and isdescendantnode(a,'/content1')
UNION
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 'ipsum') and isdescendantnode(a,'/content2')
{noformat}
would be
||path||facet||
|/content1/test/foo|{"tag1":1,"tag2":1}|
|/content2/test/bar|{"tag1":1,"tag2":1}|

Basically, afaict, you are hitting 2 issues:
* a single query passes only 1 path restriction down to the planner, so without a manual break into a UNION the single query would win the cost war (unfortunately) and give the result you have in the description ({{tag1:3, tag2:3}})
* otoh, with a manual break of the query, you'd get different facet results for each part of the UNION and you'd have to aggregate the result at your end

I don't see how to easily fix this issue though :(. [~tmueller], [~chetanm], [~teofili], you guys might be interested in this issue.

Otoh, btw, if we "accept" that you can break the query and aggregate facets once more at your end, even then I think what you should do is:
* run multiple queries, one for each path
* get the first row from each query and aggregate the facets
* run the normal query (without facet) with union/or/what-you-have-in-description, so that you still get the benefit of lucene scoring compared correctly across different paths.

Btw, the reason I think you should run separate queries and extract facets from the first result for each path is to avoid consuming all results from a single path before being able to get facet output from the second path. (... and, yes, I know, this is sub-optimal... but, afaict, that's the best possible way as of now).
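The aggregation step of the work-around above could look roughly like this. It is a sketch under two assumptions: the facet counts from the first row of each per-path query have already been read into plain maps (how they are read from the row is left out), and the paths do not overlap, since otherwise nodes would be counted twice. `aggregate` is a hypothetical helper, not an Oak API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the per-path facet counts are passed in as plain maps.
class FacetAggregationSketch {

    // Sum the per-label counts taken from the first row of each per-path query.
    static Map<String, Integer> aggregate(List<Map<String, Integer>> firstRowFacets) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> facets : firstRowFacets) {
            facets.forEach((label, count) -> total.merge(label, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Facets as reported for the /content1 query in the table above.
        Map<String, Integer> content1 = new HashMap<>();
        content1.put("tag1", 1);
        content1.put("tag2", 1);

        // Facets as reported for the /content2 query.
        Map<String, Integer> content2 = new HashMap<>();
        content2.put("tag1", 1);
        content2.put("tag2", 1);

        System.out.println(aggregate(Arrays.asList(content1, content2)));
    }
}
```

For the two per-path results shown in the table, this gives {tag1=2, tag2=2}, which matches the expected counts from the issue description.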