[jira] [Comment Edited] (LUCENE-7638) Optimize graph query produced by QueryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875182#comment-15875182 ] Matt Weber edited comment on LUCENE-7638 at 2/21/17 12:38 AM: -- [~jim.ferenczi] [~mikemccand] Could we apply the same articulation points logic in analyzeGraphPhrase but generate span queries to essentially act like a phrase? {code} SpanNear[ SpanOr[ SpanNear[SpanTerm[new], SpanTerm[york]] SpanTerm[ny] ], SpanTerm[city] ] {code} was (Author: mattweber): [~jim.ferenczi] [~mikemccand] Could we apply the same articulation points logic in analyzeGraphPhrase but generate span queries to essentially act like a phrase? SpanNear[ SpanOr[ SpanNear[SpanTerm[new], SpanTerm[york]] SpanTerm[ny] ], SpanTerm[city] ] > Optimize graph query produced by QueryBuilder > - > > Key: LUCENE-7638 > URL: https://issues.apache.org/jira/browse/LUCENE-7638 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi > Attachments: LUCENE-7638.patch, LUCENE-7638.patch > > > The QueryBuilder creates a graph query when the underlying TokenStream > contains token with PositionLengthAttribute greater than 1. > These TokenStreams are in fact graphs (lattice to be more precise) where > synonyms can span on multiple terms. > Currently the graph query is built by visiting all the path of the graph > TokenStream. For instance if you have a synonym like "ny, new york" and you > search for "new york city", the query builder would produce two pathes: > "new york city", "ny city" > This can quickly explode when the number of multi terms synonyms increase. > The query "ny ny" for instance would produce 4 pathes and so on. > For boolean queries with should or must clauses it should be more efficient > to build a boolean query that merges all the intersections in the graph. So > instead of "new york city", "ny city" we could produce: > "+((+new +york) ny) +city" > The attached patch is a proposal to do that instead of the all path solution. > The patch transforms multi terms synonyms in graph query for each > intersection in the graph. This is not done in this patch but we could also > create a specialized query that gives equivalent scores to multi terms > synonyms like the SynonymQuery does for single term synonyms. > For phrase query this patch does not change the current behavior but we could > also use the new method to create optimized graph SpanQuery. > [~mattweber] I think this patch could optimize a lot of cases where multiple > muli-terms synonyms are present in a single request. Could you take a look ? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7638) Optimize graph query produced by QueryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826085#comment-15826085 ] Jim Ferenczi edited comment on LUCENE-7638 at 1/17/17 2:00 PM: --- Thanks for taking a look [~mattweber] and [~mikemccand] {quote} Can this handle "side paths on side paths" graph structure (I think you called this "nested multi-term synonyms")? While no analysis chain can naturally produce this today (that I know of!), the TokenStream attributes can easily express it. And you could imagine it happening in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which can correctly consume a graph). If this is expected to work maybe we should add a test case showing that? {quote} I'll add a test case because it's expected to work ;) . This is also the reason why this patch does not produce {code}PhraseQuery{code} for synonyms. For simple "side paths" this is easy to do but we would need to switch to Span queries for "side paths on side paths" so I though that it could be done in another issue. {quote} It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery, since everything is in a single field here, and we are hoping to move away from Term entirely (LUCENE-7632)? {quote} Thanks, I'll simplify the patch. {quote} Holes are challenging for graph token streams ... can you add a test case that encounters holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream is useful for feeding such "interesting" cases. {quote} I though that holes would not be a problem for boolean queries but now I am not sure. I'll test that. {quote} The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet as the comment suggests? {quote} The comment is misleading. It is needed because I use {code}TreeSet#remove{code} which uses compare to check object equality. So the {code}Path.id{code} is the unique identifier for the path. was (Author: jim.ferenczi): Thanks for taking a look @mattweber and [~mikemccand] {quote} Can this handle "side paths on side paths" graph structure (I think you called this "nested multi-term synonyms")? While no analysis chain can naturally produce this today (that I know of!), the TokenStream attributes can easily express it. And you could imagine it happening in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which can correctly consume a graph). If this is expected to work maybe we should add a test case showing that? {quote} I'll add a test case because it's expected to work ;) . This is also the reason why this patch does not produce {code}PhraseQuery{code} for synonyms. For simple "side paths" this is easy to do but we would need to switch to Span queries for "side paths on side paths" so I though that it could be done in another issue. {quote} It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery, since everything is in a single field here, and we are hoping to move away from Term entirely (LUCENE-7632)? {quote} Thanks, I'll simplify the patch. {quote} Holes are challenging for graph token streams ... can you add a test case that encounters holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream is useful for feeding such "interesting" cases. {quote} I though that holes would not be a problem for boolean queries but now I am not sure. I'll test that. {quote} The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet as the comment suggests? {quote} The comment is misleading. It is needed because I use {code}TreeSet#remove{code} which uses compare to check object equality. So the {code}Path.id{code} is the unique identifier for the path. > Optimize graph query produced by QueryBuilder > - > > Key: LUCENE-7638 > URL: https://issues.apache.org/jira/browse/LUCENE-7638 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi > Attachments: LUCENE-7638.patch > > > The QueryBuilder creates a graph query when the underlying TokenStream > contains token with PositionLengthAttribute greater than 1. > These TokenStreams are in fact graphs (lattice to be more precise) where > synonyms can span on multiple terms. > Currently the graph
[jira] [Comment Edited] (LUCENE-7638) Optimize graph query produced by QueryBuilder
[ https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824337#comment-15824337 ] Jim Ferenczi edited comment on LUCENE-7638 at 1/16/17 5:44 PM: --- {quote} Maybe TermAutomatonQuery would be a good fit for that problem? {quote} For pure phrase query it's a good fit because it's a proximity query but for boolean queries the problem is different. We cannot build the TermAutomatonQuery directly, first we need to find the start and end state of each multi-term synonyms in the graph. That's what the attached patch is doing lazily, for each intersection point it creates a multi-term synonym query. Currently the multi-term synonym query is a boolean query but we could change the logic and use the TermAutomatonQuery instead or even create a PhaseQuery for each path in the multi-term synonym. This patch also handles nested multi-term synonyms which makes the detection of intersection points harder. Bottom point is that if we are able to extract the multi-term synonyms of the graph then we can choose more easily how we want to search and score these inner graph. Does this makes sense ? was (Author: jim.ferenczi): For pure phrase query it's a good fit because it's a proximity query but for boolean queries the problem is different. We cannot build the TermAutomatonQuery directly, first we need to find the start and end state of each multi-term synonyms in the graph. That's what the attached patch is doing lazily, for each intersection point it creates a multi-term synonym query. Currently the multi-term synonym query is a boolean query but we could change the logic and use the TermAutomatonQuery instead or even create a PhaseQuery for each path in the multi-term synonym. This patch also handles nested multi-term synonyms which makes the detection of intersection points harder. Bottom point is that if we are able to extract the multi-term synonyms of the graph then we can choose more easily how we want to search and score these inner graph. Does this makes sense ? > Optimize graph query produced by QueryBuilder > - > > Key: LUCENE-7638 > URL: https://issues.apache.org/jira/browse/LUCENE-7638 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi > Attachments: LUCENE-7638.patch > > > The QueryBuilder creates a graph query when the underlying TokenStream > contains token with PositionLengthAttribute greater than 1. > These TokenStreams are in fact graphs (lattice to be more precise) where > synonyms can span on multiple terms. > Currently the graph query is built by visiting all the path of the graph > TokenStream. For instance if you have a synonym like "ny, new york" and you > search for "new york city", the query builder would produce two pathes: > "new york city", "ny city" > This can quickly explode when the number of multi terms synonyms increase. > The query "ny ny" for instance would produce 4 pathes and so on. > For boolean queries with should or must clauses it should be more efficient > to build a boolean query that merges all the intersections in the graph. So > instead of "new york city", "ny city" we could produce: > "+((+new +york) ny) +city" > The attached patch is a proposal to do that instead of the all path solution. > The patch transforms multi terms synonyms in graph query for each > intersection in the graph. This is not done in this patch but we could also > create a specialized query that gives equivalent scores to multi terms > synonyms like the SynonymQuery does for single term synonyms. > For phrase query this patch does not change the current behavior but we could > also use the new method to create optimized graph SpanQuery. > [~mattweber] I think this patch could optimize a lot of cases where multiple > muli-terms synonyms are present in a single request. Could you take a look ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org