[jira] [Comment Edited] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-20 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875182#comment-15875182
 ] 

Matt Weber edited comment on LUCENE-7638 at 2/21/17 12:38 AM:
--

[~jim.ferenczi] [~mikemccand] Could we apply the same articulation points logic 
in analyzeGraphPhrase but generate span queries to essentially act like a 
phrase? 

{code}
SpanNear[
  SpanOr[
SpanNear[SpanTerm[new], SpanTerm[york]]
SpanTerm[ny]
  ],
  SpanTerm[city]
]
{code}



was (Author: mattweber):
[~jim.ferenczi] [~mikemccand] Could we apply the same articulation points logic 
in analyzeGraphPhrase but generate span queries to essentially act like a 
phrase? 

SpanNear[
  SpanOr[
SpanNear[SpanTerm[new], SpanTerm[york]]
SpanTerm[ny]
  ],
  SpanTerm[city]
]



> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-17 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826085#comment-15826085
 ] 

Jim Ferenczi edited comment on LUCENE-7638 at 1/17/17 2:00 PM:
---

Thanks for taking a look [~mattweber] and [~mikemccand]

{quote}
Can this handle "side paths on side paths" graph structure (I think you called 
this "nested multi-term synonyms")? While no analysis chain can naturally 
produce this today (that I know of!), the TokenStream attributes can easily 
express it. And you could imagine it happening in the future, e.g. if you use 
Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter 
(except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which 
can correctly consume a graph). If this is expected to work maybe we should add 
a test case showing that?
{quote}

I'll add a test case because it's expected to work ;) . This is also the reason 
why this patch does not produce {code}PhraseQuery{code} for synonyms.   For 
simple "side paths" this is easy to do but we would need to switch to Span 
queries for "side paths on side paths" so I though that it could be done in 
another issue. 

{quote}
It seems like you don't need to be using Term here, except at the end to pass 
to the newXXXQuery, since everything is in a single field here, and we are 
hoping to move away from Term entirely (LUCENE-7632)?
{quote}

Thanks, I'll simplify the patch.

{quote}
Holes are challenging for graph token streams ... can you add a test case that 
encounters holes, e.g. simulated StopFilter? There are at least two "fun" 
cases: a hole that cuts the graph entirely into two partitions, and a synonym 
spanning over a hole ... CannedTokenStream is useful for feeding such 
"interesting" cases.
{quote}

I though that holes would not be a problem for boolean queries but now I am not 
sure. I'll test that.

{quote}
The Path.id seems to be used only for tie-breaking on compare, not for lookup 
in the TreeSet as the comment suggests?
{quote}

The comment is misleading. It is needed because I use 
{code}TreeSet#remove{code} which uses compare to check object equality. So the 
{code}Path.id{code} is the unique identifier for the path.
 





was (Author: jim.ferenczi):
Thanks for taking a look @mattweber and [~mikemccand]

{quote}
Can this handle "side paths on side paths" graph structure (I think you called 
this "nested multi-term synonyms")? While no analysis chain can naturally 
produce this today (that I know of!), the TokenStream attributes can easily 
express it. And you could imagine it happening in the future, e.g. if you use 
Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter 
(except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which 
can correctly consume a graph). If this is expected to work maybe we should add 
a test case showing that?
{quote}

I'll add a test case because it's expected to work ;) . This is also the reason 
why this patch does not produce {code}PhraseQuery{code} for synonyms.   For 
simple "side paths" this is easy to do but we would need to switch to Span 
queries for "side paths on side paths" so I though that it could be done in 
another issue. 

{quote}
It seems like you don't need to be using Term here, except at the end to pass 
to the newXXXQuery, since everything is in a single field here, and we are 
hoping to move away from Term entirely (LUCENE-7632)?
{quote}

Thanks, I'll simplify the patch.

{quote}
Holes are challenging for graph token streams ... can you add a test case that 
encounters holes, e.g. simulated StopFilter? There are at least two "fun" 
cases: a hole that cuts the graph entirely into two partitions, and a synonym 
spanning over a hole ... CannedTokenStream is useful for feeding such 
"interesting" cases.
{quote}

I though that holes would not be a problem for boolean queries but now I am not 
sure. I'll test that.

{quote}
The Path.id seems to be used only for tie-breaking on compare, not for lookup 
in the TreeSet as the comment suggests?
{quote}

The comment is misleading. It is needed because I use 
{code}TreeSet#remove{code} which uses compare to check object equality. So the 
{code}Path.id{code} is the unique identifier for the path.
 




> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph 

[jira] [Comment Edited] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-16 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824337#comment-15824337
 ] 

Jim Ferenczi edited comment on LUCENE-7638 at 1/16/17 5:44 PM:
---

{quote}
Maybe TermAutomatonQuery would be a good fit for that problem?
{quote}


For pure phrase query it's a good fit because it's a proximity query but for 
boolean queries the problem is different. We cannot build the 
TermAutomatonQuery directly, first we need to find the start and end state of 
each multi-term synonyms in the graph. That's what the attached patch is doing 
lazily, for each intersection point it creates a multi-term synonym query. 
Currently the multi-term synonym query is a boolean query but we could change 
the logic and use the TermAutomatonQuery instead or even create a PhaseQuery 
for each path in the multi-term synonym. This patch also handles nested 
multi-term synonyms which makes the detection of intersection points harder. 
Bottom point is that if we are able to extract the multi-term synonyms of the 
graph then we can choose more easily how we want to search and score these 
inner graph. Does this makes sense ?


was (Author: jim.ferenczi):
For pure phrase query it's a good fit because it's a proximity query but for 
boolean queries the problem is different. We cannot build the 
TermAutomatonQuery directly, first we need to find the start and end state of 
each multi-term synonyms in the graph. That's what the attached patch is doing 
lazily, for each intersection point it creates a multi-term synonym query. 
Currently the multi-term synonym query is a boolean query but we could change 
the logic and use the TermAutomatonQuery instead or even create a PhaseQuery 
for each path in the multi-term synonym. This patch also handles nested 
multi-term synonyms which makes the detection of intersection points harder. 
Bottom point is that if we are able to extract the multi-term synonyms of the 
graph then we can choose more easily how we want to search and score these 
inner graph. Does this makes sense ?

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org