[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-02-22 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372914#comment-16372914
 ] 

Thomas Mueller commented on OAK-7109:
-

[~diru] OK, I think I understand.

> Not sure if the index supports not()

For "contains", this is supported via "contains(..., '-exclude')". But for 
generic conditions, no, it's not currently supported.

> rep:facet returns wrong results for complex queries
> ---
>
>                 Key: OAK-7109
>                 URL: https://issues.apache.org/jira/browse/OAK-7109
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>    Affects Versions: 1.6.7
>            Reporter: Dirk Rudolph
>            Priority: Major
>              Labels: facet
>         Attachments: facetsInMultipleRoots.patch, 
>                      restrictionPropagationTest.patch
>
>
> Complex queries in this case are queries that are passed to lucene without 
> all of their original constraints. For example, queries with multiple path 
> restrictions like:
> {code}
> select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 
> 'ipsum') and (isdescendantnode(a,'/content1') or 
> isdescendantnode(a,'/content2'))
> {code}
> In that particular case the index planner gives ":fulltext:ipsum" to lucene 
> even though the index supports evaluating path constraints. 
> As counting the facets happens on the raw result of lucene, the returned 
> facets are incorrect. For example, given the following content 
> {code}
> /content1/test/foo
>  + text = lorem ipsum
>  - simple/
>   + tags = tag1, tag2
> /content2/test/bar
>  + text = lorem ipsum
>  - simple/
>   + tags = tag1, tag2
> /content3/test/bar
>  + text = lorem ipsum
>  - simple/
>   + tags = tag1, tag2
> {code}
> the expected result for the dimensions of simple/tags and the query above is 
> - tag1: 2
> - tag2: 2
> as the result set is 2 results long and all documents are equal. The actual 
> result set is 
> - tag1: 3
> - tag2: 3
> as the path constraint is not handled by lucene.
> To work around that, the only solution that came to my mind is building the 
> [disjunctive normal 
> form|https://en.wikipedia.org/wiki/Disjunctive_normal_form] of my complex 
> query and executing a query for each of the disjunctive statements. As this 
> expands exponentially, it's only a theoretical solution, nothing for 
> production. 
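
The miscount described in the issue can be reproduced with a small simulation.
This is only a sketch in Python, not Oak code: DOCS mirrors the sample content
from the description, and fulltext_match, path_match and facet_counts are
hypothetical stand-ins for what the index and the query engine evaluate.

```python
from collections import Counter

# Sample content from the issue description: (path, text, tags).
DOCS = [
    ("/content1/test/foo", "lorem ipsum", ["tag1", "tag2"]),
    ("/content2/test/bar", "lorem ipsum", ["tag1", "tag2"]),
    ("/content3/test/bar", "lorem ipsum", ["tag1", "tag2"]),
]


def fulltext_match(doc, term):
    """Stand-in for the ':fulltext:ipsum' part that reaches the index."""
    return term in doc[1].split()


def path_match(doc, roots):
    """Stand-in for the OR of the isdescendantnode() restrictions."""
    return any(doc[0].startswith(root + "/") for root in roots)


def facet_counts(docs):
    """Count each tag once per matching document."""
    counts = Counter()
    for _path, _text, tags in docs:
        counts.update(tags)
    return dict(counts)


roots = ["/content1", "/content2"]

# What the index sees: only the fulltext constraint.
raw_hits = [d for d in DOCS if fulltext_match(d, "ipsum")]
# What the query asks for: fulltext AND (path1 OR path2).
query_hits = [d for d in raw_hits if path_match(d, roots)]

actual = facet_counts(raw_hits)      # counted on the raw index result
expected = facet_counts(query_hits)  # counted on the real result set
```

Counting on raw_hits includes /content3, which is why the facets come out one
too high for each tag.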



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-19 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332289#comment-16332289
 ] 

Dirk Rudolph commented on OAK-7109:
---

Thanks for the response. Regarding 1) see 
https://issues.apache.org/jira/browse/OAK-7109?focusedCommentId=16309376&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16309376

Optimisation does the following at the moment:

A and (B or not(C and D)) => (A and B) or (A and not(C and D))

To achieve an optimisation where the result is a DNF, which can then be split 
into UNIONs consisting exclusively of conjunctions, another step needs to happen 
before the current optimisation: conversion to NNF (moving all negations down 
the tree of statements).

A and (B or not(C and D)) => A and (B or not(C) or not(D)) => (A and B) or (A 
and not(C)) or (A and not(D)) 

Not sure if the index supports not(), but if it does, the UNION of the three 
queries above would give exact facets, which simply need to be deduplicated. 
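
The two steps (NNF first, then distribution into a DNF) can be sketched as
follows for the expression A and (B or not(C and D)). This is a minimal Python
illustration, not the Oak optimiser: the tuple-based expression format and the
names to_nnf, to_dnf and show are assumptions made for the example.

```python
def to_nnf(e):
    """Push negations down to the leaves (De Morgan, double negation)."""
    op = e[0]
    if op == "var":
        return e
    if op in ("and", "or"):
        return (op, to_nnf(e[1]), to_nnf(e[2]))
    # op == "not"
    inner = e[1]
    if inner[0] == "var":
        return e
    if inner[0] == "not":
        return to_nnf(inner[1])
    flipped = "or" if inner[0] == "and" else "and"
    return (flipped, to_nnf(("not", inner[1])), to_nnf(("not", inner[2])))


def to_dnf(e):
    """Return a list of conjunctions; each conjunction is a list of literals.

    Expects the input to be in NNF already.
    """
    op = e[0]
    if op in ("var", "not"):
        return [[e]]
    left, right = to_dnf(e[1]), to_dnf(e[2])
    if op == "or":
        return left + right
    # "and": distribute over every pair of conjunctions
    return [l + r for l in left for r in right]


def show(lit):
    """Render a literal as 'X' or 'not(X)'."""
    return lit[1] if lit[0] == "var" else "not(%s)" % lit[1][1]


A, B, C, D = (("var", n) for n in "ABCD")
expr = ("and", A, ("or", B, ("not", ("and", C, D))))
union_parts = [" and ".join(show(l) for l in conj) for conj in to_dnf(to_nnf(expr))]
```

Each entry of union_parts corresponds to one branch of the UNION. The
distribution step is also where the exponential growth mentioned in the issue
description comes from.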

 



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-19 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332267#comment-16332267
 ] 

Thomas Mueller commented on OAK-7109:
-

[~diru] I'm sorry for the delay. I'm afraid I can't follow you... Some links 
(more for myself, please tell me if I made a mistake):

CNF https://en.wikipedia.org/wiki/Conjunctive_normal_form
example: A and (B or C)

DNF https://en.wikipedia.org/wiki/Disjunctive_normal_form
example: (A and not(B) and not(C)) or (not(D) and E and F)

NNF https://en.wikipedia.org/wiki/Negation_normal_form
example: (A or B) and C

> all constraints have to be passed to lucene, so the query has to be in DNF, 
> which is not the case at the moment

Only the filter is passed to Lucene currently, and that one doesn't have any 
"or" conditions (except for "x in(1, 2, 3)"). Changing that will be hard, and 
has some disadvantages. Other "or" conditions are currently only supported by 
using "union" (aggregation), or by not processing them in the index (filtering 
in the query engine).

So I think it's not so much about "not" conditions.

> would require also a deduplication between the lucene result sets returned 
> from each of the unions.

Yes. I think that's possible, even though it's not optimal.
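
A sketch of what such a deduplication could look like, in Python. It assumes
(this is not how Oak exposes facets today) that each UNION branch returns the
matching paths together with their facet values, so that documents can be
merged by path before counting. merge_union_facets is a hypothetical name.

```python
from collections import Counter


def merge_union_facets(branches):
    """Merge per-branch hits, keyed by path, then count facet values once."""
    seen = {}
    for hits in branches:
        for path, tags in hits:
            seen.setdefault(path, tags)  # keep the first occurrence of a path
    counts = Counter()
    for tags in seen.values():
        counts.update(tags)
    return dict(counts)


# Two UNION branches whose result sets overlap on /content/b.
branch1 = [("/content/a", ["tag1"]), ("/content/b", ["tag1", "tag2"])]
branch2 = [("/content/b", ["tag1", "tag2"]), ("/content/c", ["tag2"])]

# Naive aggregation: simply add up the counts of every branch.
naive = Counter()
for hits in (branch1, branch2):
    for _path, tags in hits:
        naive.update(tags)

merged = merge_union_facets([branch1, branch2])
```

The naive sum counts /content/b twice. Merging by path avoids that, at the
price of reading every matching path instead of only the aggregated counts.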





[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-05 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312900#comment-16312900
 ] 

Dirk Rudolph commented on OAK-7109:
---

[~tmueller] so adding the feature to aggregate the current rep:facet extraction 
from the UNION alternatives has 2 drawbacks:

1) as said above, all constraints have to be passed to lucene, so the query has 
to be in DNF, which is not the case at the moment
2) even if this is the case, the conjunctions of the DNF are not mutually 
exclusive, leading to inaccurate results as well
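
Point 2) can be checked by brute force over a truth table. The Python sketch
below uses an illustrative DNF with three conjunctions; the dict-based
representation and the name overlapping_assignments are assumptions made for
this example only.

```python
from itertools import product


def overlapping_assignments(conjunctions, variables):
    """Yield truth assignments that satisfy more than one conjunction.

    Each conjunction is a dict mapping a variable name to the truth value
    it requires, e.g. {"A": True, "C": False} for "A and not(C)".
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        hits = sum(
            all(env[name] == wanted for name, wanted in conj.items())
            for conj in conjunctions
        )
        if hits > 1:
            yield env


# (A and B) or (A and not(C)) or (A and not(D))
dnf = [
    {"A": True, "B": True},
    {"A": True, "C": False},
    {"A": True, "D": False},
]
overlaps = list(overlapping_assignments(dnf, ["A", "B", "C", "D"]))
```

Every assignment in overlaps stands for documents that more than one UNION
branch would return, so adding up per-branch facet counts counts them more
than once.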





[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-05 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312793#comment-16312793
 ] 

Thomas Mueller commented on OAK-7109:
-

[~catholicon] OK, I see that facets do not exactly match "group by" + "count". 
So, what if we add a feature to aggregate the data from a "select 
[rep:facet(...)] ... UNION select [rep:facet(...)] ..." query? I believe 
aggregating that data in the query engine should be possible, as the data 
format of the facet feature is known.

>> What if Lucene doesn't index all the constraints?
> fail such queries

Sounds good to me. I believe right now, if a query uses "select 
[rep:facet(...)]", then only indexes that support that are used. If there is no 
index that supports facets, then the query should fail with an exception (if 
that's not the case yet, we should probably add that). If the Lucene index 
doesn't support some of the conditions, then it shouldn't return an index plan. 
That should solve the problem with "union" queries as well.
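
The "no plan, then fail" behaviour proposed above could look roughly like this.
It is a Python sketch of the idea only: plan_facet_query,
UnsupportedFacetQuery, the index name and the constraint kinds are all
hypothetical and do not correspond to Oak's planner API.

```python
class UnsupportedFacetQuery(Exception):
    """Raised when no index can evaluate every constraint of a facet query."""


def plan_facet_query(constraints, indexes):
    """Return the name of the first index that covers all constraints.

    `indexes` maps an index name to the set of constraint kinds it can
    evaluate. Both the names and the constraint kinds are hypothetical.
    """
    for name, supported in indexes.items():
        if set(constraints) <= supported:
            return name
    raise UnsupportedFacetQuery(
        "no index evaluates all of: " + ", ".join(sorted(constraints))
    )


indexes = {"lucene-test": {"fulltext", "path"}}

# All constraints are covered: the index offers a plan.
plan = plan_facet_query(["fulltext", "path"], indexes)

# One constraint kind is not covered: the facet query fails instead of
# silently returning counts for a superset of the result.
try:
    plan_facet_query(["fulltext", "path", "or"], indexes)
    failed = False
except UnsupportedFacetQuery:
    failed = True
```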



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-05 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312691#comment-16312691
 ] 

Dirk Rudolph commented on OAK-7109:
---

{quote}
I have a very pessimistic view that we should fail such queries - I mean it's 
better to fail and allow for right index def than giving incorrect results.
{quote}
+1



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-05 Thread Vikas Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312687#comment-16312687
 ] 

Vikas Saurabh commented on OAK-7109:


{quote}
(I know the "group by" and "count" are not currently supported by Oak).
Or are there other aspects I missed?
{quote}
Indeed, fundamentally that's what facets do: provide usually a few (not 'all', 
unlike group by) properties and count according to how many documents match the 
query. Lucene's faceting support also does ranges, although we don't support 
that yet - e.g. I could facet on "jcr:created" and the categories could turn 
out as "today", "within last week", etc (I'm not completely sure about the 
API... I'm just trying to illustrate that faceted categories can potentially be 
not-the-actually-stored-value).
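
To illustrate the point about categories that are not stored values, here is a
small Python sketch of range-style bucketing. It does not use Lucene's range
facet API; created_bucket, the bucket boundaries and the sample dates are
assumptions for illustration ("today" is simplified to "less than 24 hours
old").

```python
from collections import Counter
from datetime import datetime, timedelta


def created_bucket(created, now):
    """Map a timestamp to a label that is not itself a stored value."""
    age = now - created
    if age < timedelta(days=1):
        return "today"
    if age < timedelta(days=7):
        return "within last week"
    return "older"


now = datetime(2018, 1, 5, 12, 0)
created_dates = [
    datetime(2018, 1, 5, 9, 0),
    datetime(2018, 1, 2, 9, 0),
    datetime(2018, 1, 1, 9, 0),
    datetime(2017, 11, 20, 9, 0),
]
range_facets = dict(Counter(created_bucket(c, now) for c in created_dates))
```

The facet labels are computed from "jcr:created" rather than read from it,
which is one way facets differ from a plain "group by".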

bq. What do you mean with "scoring"?
The scoring part is an entirely different issue unrelated to facets - e.g. we 
won't (can't??) correctly order documents matching queries such as {{ WHERE 
(CONTAINS(., 'text') AND foo1='bar') OR (CONTAINS(., 'text') AND foo2='bar' AND 
foo3='bar')}} (foo=bar could be a different fulltext clause too... the issue is 
that we can't quite merge scores coming out of separate lucene queries).
But, let's ignore the scoring for this issue.

bq. What if Lucene doesn't index all the constraints?
I have a very pessimistic view that we should fail such queries - I mean it's 
better to fail and allow for right index def than giving incorrect results.



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-05 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312675#comment-16312675
 ] 

Thomas Mueller commented on OAK-7109:
-

I don't fully know how facets work. Could you help me a bit with this please. 
The query
{noformat}
select [rep:facet(simple/tags)] from [nt:base] as a 
where contains(a.[*], 'ipsum') 
and (isdescendantnode(a,'/content1') or isdescendantnode(a,'/content2'))
{noformat}

converted to "regular SQL" would be this, right?
{noformat}
select [simple/tags], count(*)
from [nt:base] as a 
where contains(a.[*], 'ipsum') 
and (isdescendantnode(a,'/content1') or isdescendantnode(a,'/content2'))
group by [simple/tags]
{noformat}

(I know the "group by" and "count" are not currently supported by Oak).
Or are there other aspects I missed? What do you mean with "scoring"?

If it's the same, then I guess we might want to support the "group by" and 
"count" features in Oak, or add custom logic to combine the results of 
{noformat}
select [rep:facet(...)] ... UNION select [rep:facet(...)] ...
{noformat}

> passing all constraints to lucene

What if Lucene doesn't index all the constraints?



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-04 Thread Vikas Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312632#comment-16312632
 ] 

Vikas Saurabh commented on OAK-7109:


[~diru], thanks for the investigation. I now see the issue, but unfortunately, 
with the current design of how the query engine parses queries and then passes 
sub-queries to index providers, it's almost impossible to have correct faceting 
for complex queries.

The way I see the fundamental problem is:
* facet is an aggregation function => any query with rep:facet must be 
completely resolved by a single index
* currently index providers only resolve ANDed clauses => so, complex queries 
never get all their clauses passed down to (lucene) index

I really don't have any solution or work-around for your problem though :(.

[~tmueller], would you have any ideas about how we can make such cases work?

PS: Btw, [~diru], the scoring across UNIONed clauses won't work (as you 
mentioned in the mail) - but that's a digression and won't solve the problem at 
hand as you correctly said that the different clauses across UNIONs won't be 
disjoint.



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-03 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309559#comment-16309559
 ] 

Dirk Rudolph commented on OAK-7109:
---

Here is an example where constraints get lost in the filter:

{code}
select * from [nt:base] where ([propa] = 'true' and [propb] in('foo','bar')) or 
([propa] = 'false' and not([propb] in('foo','bar')))
{code}

It implements a kind of white-/blacklisting à la "If a is set to true, b has to 
be in a configured set; if not, b must not be in the configured set." It 
evaluates to: 

{code}
[nt:base] as [nt:base] /* lucene:test2(/oak:index/test2) propa:[* TO *] where 
[nt:base].[propa] is not null */
{code}

This doesn't contain anything about propb, so in that case facet counting will 
be wrong as well.
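
The gap between the query and the reported plan can be made concrete with a few
sample nodes. This is a Python sketch, not Oak code: query_matches models the
white-/blacklist condition, index_matches models the "propa is not null"
restriction from the plan, and the sample nodes are invented. Nodes without
propb are left out on purpose, so null handling plays no role here.

```python
ALLOWED = {"foo", "bar"}


def query_matches(node):
    """Full condition: whitelist when propa is 'true', blacklist when 'false'."""
    a, b = node.get("propa"), node.get("propb")
    return (a == "true" and b in ALLOWED) or (a == "false" and b not in ALLOWED)


def index_matches(node):
    """What the reported plan keeps: only 'propa is not null'."""
    return node.get("propa") is not None


nodes = [
    {"propa": "true", "propb": "foo"},   # in the configured set: matches
    {"propa": "true", "propb": "baz"},   # not in the set: must not match
    {"propa": "false", "propb": "baz"},  # not in the set: matches
    {"propa": "false", "propb": "bar"},  # in the set: must not match
]

wanted = [n for n in nodes if query_matches(n)]
from_index = [n for n in nodes if index_matches(n)]
```

Facets counted on from_index would include the two nodes that the query engine
filters out afterwards.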





[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2018-01-03 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309376#comment-16309376
 ] 

Dirk Rudolph commented on OAK-7109:
---

Hi [~catholicon] somehow the mail agent doesn't accept my mailings to oak-dev 
(I'm subscribed and receiving mail but sending doesn't work ... anyway).

I checked the implementation of the optimisation and its result is not in DNF, 
as the optimisation is not done on the negation normal form of the query (so 
not(a or b) is not properly expanded to not(a) and not(b)). For example (based on 
org.apache.jackrabbit.oak.query.SQL2OptimiseQueryTest#optimiseAndOrAnd()):

{code}
given:    ([a]=1 or [b]=2 or ([c]=3 and not([d]=4 or [e]=5))) and [x]=6
      <=> ([a]=1 or [b]=2 or ([c]=3 and [d]<>4 and [e]<>5)) and [x]=6
expected: ([a]=1 and [x]=6), ([b]=2 and [x]=6), ([c]=3 and [d]<>4 and [e]<>5 and [x]=6)
actual:   ((c = 3) and (not ((d = 4) or (e = 5)))) and (x = 6), (b = 2) and (x = 6), (a = 1) and (x = 6)
{code}

And even assuming the alternatives were in DNF and facet counting across unions 
were supported by merging the results from each of the queries given to lucene, 
the result would still be wrong, as the disjunctive statements are not mutually 
exclusive (as they would be with xor). So from my perspective there is no way 
to get proper facet counts in that case from the consumer side, and only the 
options of 

b) filtering the documents based on the filter 
c) passing all constraints to lucene

would work. 

Regarding b): from what I can see in the code base, the nodes are not actually 
read; only the permissions on their path are checked in 
[FilteredSortedSetDocValuesFacetCounts.java#L91|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/util/FilteredSortedSetDocValuesFacetCounts.java#L91]

I will check further why our specific query doesn't get entirely passed to 
lucene (or rather, which constraints are not taken into account besides the 
path constraints). Anyway, as a user of the JCR API I would expect a 
RepositoryException (or any other exception) when I try to run a query with 
facet extraction that no index can provide - similar to the exception I get 
when the field I extract facets on is not stored. 


> rep:facet returns wrong results for complex queries
> ---
>
> Key: OAK-7109
> URL: https://issues.apache.org/jira/browse/OAK-7109
> Project: Jackrabbit Oak
>  Issue Type: Bug
>  Components: lucene
>Affects Versions: 1.6.7
>Reporter: Dirk Rudolph
>  Labels: facet
> Attachments: facetsInMultipleRoots.patch
>
>
> Complex queries in that case are queries which are passed to lucene without 
> all of their original constraints. For example queries with multiple path 
> restrictions like:
> {code}
> select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 
> 'ipsum') and (isdescendantnode(a,'/content1') or 
> isdescendantnode(a,'/content2'))
> {code}
> In that particular case the index planner gives ":fulltext:ipsum" to lucene 
> even though the index supports evaluating path constraints. 
> As counting the facets happens on the raw result of lucene, the returned 
> facets are incorrect. For example having the following content 
> {code}
> /content1/test/foo
>  + text = lorem ipsum
>  - simple/
>   + tags = tag1, tag2
> /content2/test/bar
>  + text = lorem ipsum
>  - simple/
>   + tags = tag1, tag2
> /content3/test/bar
>  + text = lorem ipsum
>  - simple/
>+ tags = tag1, tag2
> {code}
> the expected result for the dimensions of simple/tags and the query above is 
> - tag1: 2
> - tag2: 2
> as the result set is 2 results long and all documents are equal. The actual 
> result set is 
> - tag1: 3
> - tag2: 3
> as the path constraint is not handled by lucene.
> To work around that, the only solution that came to my mind is building the 
> [disjunctive normal 
> form|https://en.wikipedia.org/wiki/Disjunctive_normal_form] of my complex 
> query and executing a query for each of the disjunctive statements. As this 
> is expanding exponentially, it's only a theoretical solution, nothing for 
> production. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2017-12-22 Thread Vikas Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301777#comment-16301777
 ] 

Vikas Saurabh commented on OAK-7109:


[~diru],
bq. That basically works, but only in the case that both queries hit the same 
index as only then TF/IDF score is comparable (also across multiple queries). 
So the solutions I see are:
Umm... I don't think lucene scores across different queries on the same index 
are comparable (the first thing that comes to my mind is that normalization 
factors would be different for each query; there might be other reasons too).

bq. a) creating DNF disjunctive statements of a query as alternatives (not sure 
if the alternative currently created is DNF) and support proper counting over 
union queries
well, the alternative is indeed very similar... although, ORs are made into 
UNIONs. The bigger problem is that current lucene cost estimation would give 
same cost (at least for the example in description) for both sub-queries ... 
that would make total cost of UNION-ed execution double of what non-alternative 
version would give.
Current (OAK-6776) would scale cost for both components down... so, the cost 
war would be fairer... but still there would be chances that original query 
wins the cost war.

bq. b) filtering the results using the query plan's filter while counting 
facets, similar to the way it's done for ACLs
I think that would be pretty bad for performance. I haven't looked closely at 
how ACL filtering was done, but there definitely were concerns... not sure how 
they were avoided, or if that wasn't required at all.

bq. c) implementing a mode which translates any query as it is to its lucene 
equivalent
I'm not sure what you mean by "any query" - as far as I know all reasonable 
constraints (property, ordering, fulltext) do get passed down well to lucene. 
Of course, that depends on the backing index definition being sufficient for 
the query. Imo, if both (or more... along with operators) could be passed down 
well, then this could have been solved - but we don't have that functionality 
yet.

bq. We tried already running one query for each path, but even with that the 
individual queries are too complex to be passed to lucene with all constraints. 
(not entirely sure why though ...)
I'd be interested to look at your query and the index def. Can you share some 
details in a mail to oak-dev?



[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2017-12-22 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301325#comment-16301325
 ] 

Dirk Rudolph commented on OAK-7109:
---

Yeah, support of unions with facets doesn't work well, as facets are extracted 
on each row, though they relate to the result, not the rows. I will open an 
improvement for that as well, as this has some costs: basically calling 
getTopChildren() for each row while iterating the result set. 

With splitting the result I didn't mean running the query as a union but 
running individual queries, merging their RowIterators manually and extracting 
facets only from the first hit of each, merging them together as well. That 
basically works, but as I said I would have to rewrite the query in DNF like in 
the example:

{code}
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 
'ipsum') and isdescendantnode(a,'/content1')
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 
'ipsum') and isdescendantnode(a,'/content2')
{code}

That basically works, but only in the case that both queries hit the same index 
as only then TF/IDF score is comparable (also across multiple queries). So the 
solutions I see are:
a) creating DNF disjunctive statements of a query as alternatives (not sure if 
the alternative currently created is DNF) and support proper counting over 
union queries
b) filtering the results using the query plan's filter while counting facets, 
similar to the way it's done for ACLs
c) implementing a mode which translates any query as it is to its lucene 
equivalent

Both a) and b) probably come with a performance drawback. c) might not even be 
feasible. 




[jira] [Commented] (OAK-7109) rep:facet returns wrong results for complex queries

2017-12-22 Thread Vikas Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301319#comment-16301319
 ] 

Vikas Saurabh commented on OAK-7109:


bq. To workaround that the only solution that came to my mind is building the 
DNF of my complex query and executing a query for each of the disjunctive 
statements. As this is expanding exponentially its only a theoretical solution, 
nothing for production. 
Interesting issue. Btw, the work-around you mentioned above would also most 
likely not work right away :(. Result from
{noformat}
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 
'ipsum') and isdescendantnode(a,'/content1')
UNION
select [rep:facet(simple/tags)] from [nt:base] as a where contains(a.[*], 
'ipsum') and isdescendantnode(a,'/content2')
{noformat}
would be
||path||facet||
|/content1/test/foo|{"tag1":1,"tag2":1}|
|/content2/test/bar|{"tag1":1,"tag2":1}|

Basically, afaict, you are hitting 2 issues:
* a single query passes only 1 path restriction down to the planner - so, 
without a manual break into a UNION, the single query would win the cost war 
(unfortunately) and give the result you have in the description ({{tag1:3, 
tag2:3}})
* otoh, with a manual break into a UNION, you'd get different facet results for 
each part of the UNION and you'd have to aggregate the result at your end

I don't see how to easily fix this issue though :(. [~tmueller], [~chetanm], 
[~teofili], you guys might be interested in this issue.

Otoh, btw, if we "accept" that you can break the query and aggregate facets 
once more at your end, even then I think what you should do is:
* run multiple queries - one for each path
* get the first row from each query and aggregate the facets
* run a normal query (without facet) with union/or/what-you-have-in-description 
- so that you still get the benefit of lucene scores being compared correctly 
across different paths.

Btw, the reason I think you should run separate queries and extract facets from 
the first result of each path is to avoid consuming all results from a single 
path before being able to get facet output from the second path.
(... and, yes, I know, this is sub-optimal... but, afaict, that's the best 
possible way as of now).
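
The aggregation step could look roughly like the following Python sketch. The row shape (a dict with the facet column holding a JSON string) and the helper name `aggregate_facets` are assumptions for illustration, not the JCR or Oak API. Note that summing is only correct as long as the paths do not overlap.

```python
import json
from collections import Counter

FACET_COLUMN = "rep:facet(simple/tags)"


def aggregate_facets(per_path_results, facet_column=FACET_COLUMN):
    """Sum the facet counts found on the first row of each per-path result."""
    total = Counter()
    for rows in per_path_results:
        if not rows:
            continue  # this path had no hits, so it contributes no facets
        total.update(json.loads(rows[0][facet_column]))
    return total


# Assumed results of the two per-path queries for the example content.
content1_rows = [
    {"path": "/content1/test/foo", FACET_COLUMN: '{"tag1":1,"tag2":1}'},
]
content2_rows = [
    {"path": "/content2/test/bar", FACET_COLUMN: '{"tag1":1,"tag2":1}'},
]

merged = aggregate_facets([content1_rows, content2_rows])
print(dict(merged))
```

Only the first row of each result is read, so no per-path result has to be consumed completely before the facets of the next path are available.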
