Re: Fixing query-time multi-word synonym issue

2013-03-28 Thread Otis Gospodnetic
Ha!  Right, it's not about synonyms, it's about the QP doing the wrong
thing.

At this point I'm not even 100% sure which QP is the problem.  I think it's
the Lucene one, but also the SolrQueryParser, is this correct?

But isn't there a flex parser in Lucene contrib somewhere?  What is the
state of that one?  I could be mistaken, but it seems like that QP was
written, stuffed in the queryparser/, and is never used anywhere in Solr or
elsewhere?  Is this correct or am I missing something?

I'm referring to the Flexible query parser mentioned on
http://lucene.apache.org/core/4_2_0/queryparser/overview-summary.html (via
https://issues.apache.org/jira/browse/LUCENE-1567 )

I have not tried it to see whether it has the same problem with multi-word
synonyms, but I remember the authors being quite ambitious about it.
Does anyone happen to know?

Thanks,
Otis


On Thu, Jan 24, 2013 at 8:17 PM, Robert Muir rcm...@gmail.com wrote:

 On Thu, Jan 24, 2013 at 8:10 PM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Funny, I expected everyone would jump on this thread considering so many
  people hit this multi-word synonym issue...
 

 Probably because it's not a synonyms problem: it's just a bug in a
 specific queryparser.

 If you don't use that queryparser, you don't care: if you want to fix
 that one, well the only correct solution is obvious, fix the
 queryparser not to split on whitespace. Hacking other things around it
 is no option at all :)

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Fixing query-time multi-word synonym issue

2013-01-26 Thread Michael McCandless
On Fri, Jan 25, 2013 at 6:28 PM, Jack Krupansky j...@basetechnology.com wrote:

 Is there a decent writeup on PositionLengthAttribute? I mean, the Javadoc
 says "The positionLength determines how many positions this token spans",
 which doesn't sound very relevant to multi-term synonyms that span multiple
 positions.

I don't think we have good enough javadocs around PosLenAtt ... we
really should add some simple graph examples.  Though, it is a very
expert topic ...

The way to think of PosIncAtt is that it tells you the start node of
this arc (= token), while PosLenAtt tells you the end node, except
both atts are delta coded, making it hard to think about.
Furthermore, all arcs (tokens) must be emitted in order, i.e., smaller
numbered nodes (positions) must come first, and all arcs leaving each
node must be enumerated before you go to the next node.
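
To make the arc picture concrete, here is a minimal sketch in plain Python (not the Lucene API) of how delta-coded position increments and position lengths could be decoded into explicit arcs. The names `Token` and `to_arcs` and the sample token stream are hypothetical, chosen only for illustration; the starting value of -1 is an assumption that makes the first token with an increment of 1 land on node 0.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # delta from the previous token's start node
    pos_len: int  # number of positions (nodes) this token spans


def to_arcs(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Decode delta-coded tokens into (start_node, end_node, term) arcs."""
    arcs = []
    start = -1  # assumed initial position, so the first pos_inc=1 gives node 0
    for tok in tokens:
        start += tok.pos_inc              # PosIncAtt: where the arc starts
        arcs.append((start, start + tok.pos_len, tok.term))  # PosLenAtt: where it ends
    return arcs


# Hypothetical stream: "dns" stacked on "domain name service", spanning all three.
tokens = [
    Token("domain", 1, 1),
    Token("dns", 0, 3),
    Token("name", 1, 1),
    Token("service", 1, 1),
]
arcs = to_arcs(tokens)
```

Because increments are never negative, arcs come out ordered by start node, which matches the ordering rule described above.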

Mike McCandless

http://blog.mikemccandless.com




Re: Fixing query-time multi-word synonym issue

2013-01-26 Thread Jack Krupansky
Yeah, I suspected that it was going to be expert/cryptic. I think the real 
point, from my September proposal, was that we need a common piece of code 
that has all that expert smarts to reconstruct the graph and then can 
generate the Lucene Query structure that will match that reconstructed 
graph. It would also be great to have clear, detailed doc of the linear 
graph encoding, but it is the code to reconstitute the graph that is the 
real goal.


And, of course, we need Synonym filter to do the full graph encoding. Last 
time I looked (September), the PosLenAtt didn't seem to have a value that I 
could make sense of in terms of the graph (I think it always had the same 
value), but maybe I just wasn't interpreting it according to the 
undocumented rules.


I think a fair number of people want to be able to do query-time synonyms, 
so I wouldn't classify that as an expert use case, but I agree that 
dealing with graph encoding/decoding is more of an expert chore. Although, 
in the end, it is really just a small number of people who work on query 
parsers who actually have a need to know about synonym graphs.


-- Jack Krupansky




Re: Fixing query-time multi-word synonym issue

2013-01-26 Thread Michael McCandless
On Sat, Jan 26, 2013 at 11:16 AM, Jack Krupansky
j...@basetechnology.com wrote:
 Yeah, I suspected that it was going to be expert/cryptic. I think the real
 point, from my September proposal, was that we need a common piece of code
 that has all that expert smarts to reconstruct the graph and then can
 generate the Lucene Query structure that will match that reconstructed
 graph.

That would be nice to have.  Maybe a possible starting point is
TokenStreamToAutomaton?  The only catch is that this creates an
Automaton with bytes as the labels, but we (somehow) need tokens/terms
as the labels.  Then we just need AutomatonQuery ;)
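
As a rough illustration of the idea of a graph with terms as labels, the following plain-Python sketch (not TokenStreamToAutomaton and not any Lucene API) enumerates every path through a small token graph. The function name `enumerate_paths` and the arc tuples are hypothetical; the sketch assumes the graph is acyclic, which holds when every token spans at least one position.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (start_node, end_node, term)


def enumerate_paths(arcs: List[Arc]) -> List[List[str]]:
    """Return every term sequence leading from node 0 to the final node."""
    outgoing: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for start, end, term in arcs:
        outgoing[start].append((end, term))
    final = max(end for _, end, _ in arcs)

    def walk(node: int) -> List[List[str]]:
        if node == final:
            return [[]]  # one way to finish: the empty suffix
        paths = []
        for end, term in outgoing[node]:
            for rest in walk(end):
                paths.append([term] + rest)
        return paths

    return walk(0)


# Hypothetical graph: "dns" spans the same nodes as "domain name service".
arcs = [(0, 1, "domain"), (0, 3, "dns"), (1, 2, "name"), (2, 3, "service")]
paths = enumerate_paths(arcs)
```

Each enumerated path could then become one phrase alternative in a query; how that maps onto actual Lucene query classes is left open here.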

 It would also be great to have clear, detailed doc of the linear
 graph encoding, but it is the code to reconstitute the graph that is the
 real goal.

+1

 And, of course, we need Synonym filter to do the full graph encoding. Last
 time I looked (September), the PosLenAtt didn't seem to have a value that I
 could make sense of in terms of the graph (I think it always had the same
 value), but maybe I just wasn't interpreting it according to the
 undocumented rules.

At some point it was changed to set PosLenAtt, but only if the output
is a single token (e.g., domain name service -> dns; in this case the
dns token will have PosLenAtt=3).

Creating new positions (dns -> domain name service) is a harder
change; normally it's only Tokenizers that create positions.  But e.g.
WordDelimiterFilter also needs to create positions.

It would be nice to have some supporting infrastructure to make it
easier for token filters to create new positions; the trickiness is it
requires the token filter to be non-deterministic in general because
the pos len of a given token can be a function of whether future
tokens are split (creating new positions).

 I think a fair number of people want to be able to do query-time synonyms,
 so I wouldn't classify that as an expert use case, but I agree that
 dealing with graph encoding/decoding is more of an expert chore.

Right, it would be nice to support both, and since query-time could in
theory work correctly (unlike index time since we can't index the
graph), it's perhaps the more compelling / easier route for apps that
really need exact/expanding multi-word synonyms.

But we have to fix the QueryParsers first: indexing already cannot
properly encode the graph, and query-time cannot get past the
QueryParser ...

 Although,
 in the end, it is really just a small number of people who work on query
 parsers who actually have a need to know about synonym graphs.

True.

Mike McCandless

http://blog.mikemccandless.com




Re: Fixing query-time multi-word synonym issue

2013-01-25 Thread Jack Krupansky

Here's an example query with q.op=AND:

   causes of heart attack

And I have this synonym definition:

   heart attack, myocardial infarction

So, what is the alleged query parser fix so that the query is treated as:

   causes of (heart attack OR myocardial infarction)

The core problem with the synonym filter is that it mashes all the terms of
a multi-term synonym to be at the same position, so that the order ("attack"
after "heart" and "infarction" after "myocardial") is lost.


What is needed is a synonym filter with a notion of path, so that the term
sequence for each of the synonym alternatives is available for the query
parser to generate the OR alternative queries.


Granted, the query parser ALSO needs to present the full sequence of terms
to the analyzer as one string, "causes of heart attack", but that alone
doesn't address the synonym filter misbehavior.


-- Jack Krupansky

-Original Message-
From: Robert Muir
Sent: Friday, January 25, 2013 3:46 AM
To: dev@lucene.apache.org
Subject: Re: Fixing query-time multi-word synonym issue

On Fri, Jan 25, 2013 at 12:48 AM, Jack Krupansky
j...@basetechnology.com wrote:

 Otis, this is precisely why nothing will get done any time soon on the
 multi-term synonym issue - there isn't even common agreement that there is
 a problem, let alone common agreement on the specifics of the problem, let
 alone common agreement on a solution.

I think you are the only one arguing the bug is a synonymsfilter problem.

 Even though technically the Solr Query Parser is now separate from the
 Lucene Query Parser, the synonym filter is still strictly Lucene.
 Addressing the multi-term synonym feature requires enhancement to the
 synonym filter,

dude you have a bug in X, you fix the bug in X: you dont go hack around it
in Y.





Re: Fixing query-time multi-word synonym issue

2013-01-25 Thread Robert Muir
On Fri, Jan 25, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com wrote:
 Here's an example query with q.op=AND:

    causes of heart attack

 And I have this synonym definition:

    heart attack, myocardial infarction

 So, what is the alleged query parser fix so that the query is treated as:

    causes of (heart attack OR myocardial infarction)


That's actually inefficient and stupid to do. If you make a parser that
doesn't split on whitespace, you can just tell it to fold at index and
query time just like stemming. No OR necessary.

But I think you are trying to get off topic; again, the real problem
affecting 99%+ of users is that the lucene queryparser splits on
whitespace.

If this is fixed, then lots of things (not just synonyms, but other
basic shit that is broken today) start working too:
https://issues.apache.org/jira/browse/LUCENE-2605
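
The effect of splitting on whitespace before analysis can be sketched in a few lines of plain Python. This is a toy model, not the Lucene QueryParser: `SYNONYM` and `sees_synonym` are hypothetical names, and "analysis" is reduced to checking whether a multi-word rule can match at all.

```python
from typing import List

SYNONYM = ["heart", "attack"]  # multi-word left-hand side of a synonym rule


def sees_synonym(tokens: List[str]) -> bool:
    """True if the multi-word synonym occurs as a contiguous run of tokens."""
    n = len(SYNONYM)
    return any(tokens[i:i + n] == SYNONYM for i in range(len(tokens) - n + 1))


query = "causes of heart attack"

# Parser splits on whitespace first, then analyzes each chunk on its own:
# no chunk ever contains both "heart" and "attack".
split_first = [sees_synonym(chunk.split()) for chunk in query.split()]

# Parser hands the whole string to analysis in one step.
whole = sees_synonym(query.split())
```

In the first case every chunk is a single word, so a two-word rule can never fire; in the second the rule matches.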




Re: Fixing query-time multi-word synonym issue

2013-01-25 Thread Jack Krupansky
One clarification from my previous comment: one requirement is to prevent
false matches for instances of "heart infarction" and "myocardial attack";
the current synonym filter does not preserve the path or term ordering
within the multi-term phrases, even if the query parser does present the
full term sequence as a single input string.


Yes, the position information is preserved, but there is no path attribute
to be able to tell that "heart" was before "attack" as opposed to before
"infarction".
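
A small plain-Python sketch of the false-match concern, under the assumption that a consumer only sees which terms sit at which position (the `stacked` layout below is hypothetical and is not produced by any Lucene call):

```python
from itertools import product

# Terms grouped by position only, with no record of which path they belong to.
stacked = [
    ["heart", "myocardial"],   # position 0
    ["attack", "infarction"],  # position 1
]

# Any consumer that picks one term per position gets all combinations.
phrases = {" ".join(combo) for combo in product(*stacked)}

intended = {"heart attack", "myocardial infarction"}
false_matches = phrases - intended
```

With only per-position information, the two unintended phrases cannot be told apart from the two intended ones.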


-- Jack Krupansky




Re: Fixing query-time multi-word synonym issue

2013-01-25 Thread Michael McCandless
PositionLengthAttribute is sufficient to express the true graph, but
SynonymFilter has not been fully fixed to properly set it.
Specifically, it cannot create new positions, which is what's
necessary if you expand when applying synonyms (e.g., dns -> domain
name service).

It is better to do the reverse: map the multi-word phrase down to a
single token, at indexing time (domain name service -> dns): you get
accurate scoring (exact docFreq for how many docs have either dns or
domain name service) and faster search performance, and PosLenAtt is
properly set, and you work around the fact that the index cannot index
the position length att (since you never create alternate paths in
the token graph).  The downside is you must re-index if you change
your synonyms.
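
The contraction approach can be sketched as a greedy longest-match fold over a token list. This is plain Python for illustration only, not SynonymFilter; `fold_synonyms` and the `rules` mapping are hypothetical names, and the sketch assumes the same folding is applied to both indexed text and queries.

```python
from typing import Dict, List, Tuple


def fold_synonyms(tokens: List[str], rules: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace each multi-word phrase in `rules` with its single-token form."""
    max_len = max((len(key) for key in rules), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no rule matched here: keep the token
            i += 1
    return out


rules = {("domain", "name", "service"): "dns"}
indexed = fold_synonyms("the domain name service is down".split(), rules)
query = fold_synonyms("domain name service".split(), rules)
```

Since both sides end up with the single token `dns`, no alternate paths are ever created, which is the property the paragraph above relies on.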

However: once we fix QueryParser to stop splitting on whitespace (it's
really ridiculous that it does so: it causes so many problems), and
fix SynFilter to create positions, it is in theory possible to take
the resulting graph (if you expand when applying synonyms) and
enumerate the correct query (something like MultiPhraseQuery, or OR
of them, or something; maybe we'll need WordGraphQuery), and get the
correct results.

Mike McCandless

http://blog.mikemccandless.com




Re: Fixing query-time multi-word synonym issue

2013-01-25 Thread Jack Krupansky

Thanks for that insight, Mike!

I don't think anybody is in disagreement with the need for the query parser 
to present the full, white-space delimited pseudo-term sequence to analysis 
in one step. My proposal from last September recognized that.


Is there a decent writeup on PositionLengthAttribute? I mean, the Javadoc
says "The positionLength determines how many positions this token spans",
which doesn't sound very relevant to multi-term synonyms that span multiple
positions.


-- Jack Krupansky




Re: Fixing query-time multi-word synonym issue

2013-01-24 Thread Otis Gospodnetic
Funny, I expected everyone would jump on this thread considering so many
people hit this multi-word synonym issue...

Mikhail, I looked at the slides and while I understand that use case well,
I don't see where query-time multi-word synonyms come into play. Please
enlighten me :)  Thanks!

Otis
Solr & ElasticSearch Support
http://sematext.com/



Re: Fixing query-time multi-word synonym issue

2013-01-24 Thread Robert Muir
On Thu, Jan 24, 2013 at 8:10 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Funny, I expected everyone would jump on this thread considering so many
 people hit this multi-word synonym issue...


Probably because it's not a synonyms problem: it's just a bug in a
specific queryparser.

If you don't use that queryparser, you don't care: if you want to fix
that one, well the only correct solution is obvious, fix the
queryparser not to split on whitespace. Hacking other things around it
is no option at all :)




Re: Fixing query-time multi-word synonym issue

2013-01-24 Thread Jack Krupansky
Otis, this is precisely why nothing will get done any time soon on the 
multi-term synonym issue - there isn't even common agreement that there is a 
problem, let alone common agreement on the specifics of the problem, let 
alone common agreement on a solution.


Even though technically the Solr Query Parser is now separate from the 
Lucene Query Parser, the synonym filter is still strictly Lucene. Addressing 
the multi-term synonym feature requires enhancement to the synonym filter, 
so without common agreement on the nature of the problem (which I already 
outlined), the only path forward is for Solr to provide its own synonym 
filter and leave the existing Lucene synonym filter intact.


That said, I don't expect general community agreement on a path forward for 
multi-term synonyms any time soon.


-- Jack Krupansky




Re: Fixing query-time multi-word synonym issue

2013-01-24 Thread Varun Thacker
There is one more issue that stems from the same cause: the queryparser
splits on whitespace, so when the field being searched against holds
multi-word values, search doesn't behave correctly.

Ideally these problems should not arise. I want to be a little more
optimistic on this and see if there is a solution to these problems :)





-- 


Regards,
Varun Thacker
http://www.vthacker.in/


Re: Fixing query-time multi-word synonym issue

2013-01-22 Thread Mikhail Khludnev
FWIW, multi-word synonyms are a side benefit of the query parsing approach
implemented by my team.
Here is what it looks like:
https://docs.google.com/a/griddynamics.com/presentation/pub?id=1oifLFI0MiA3ZyXZWisHJVRK13P8cki5yCABvABPObKwstart=falseloop=falsedelayms=3000#slide=id.g1006de00_2_34fee
A typo that people frequently make has been replaced with the correct brand.
The five previous slides depict the approach; the main idea is to get rid of
good old Boolean retrieval and introduce our own notion of matching.
I can share more details if you wish.


On Tue, Jan 22, 2013 at 7:17 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hello,

 I'm looking for some guidance around solving the infamous index-time vs.
 query-time multi-word synonym problem.  Looking for help with understanding
 the pieces and effort involved, and also being on the lookout for any
 potential "man, it will take you forever, you'll have to do major Lucene
 surgery" type of warnings.

 I never looked deeply into this problem and my understanding is that
 multi-word synonyms don't work at query-time because QueryParser(?) simply
 breaks queries on spaces and thus makes it impossible for
 SynonymTokenFilter (?) to see the non-broken-up token sequence and do
 synonym expansion.

 I think this is also documented on the Wiki.
 Are there other pieces involved that I didn't mention, but should have?

 The following are 3 different efforts I found:
 https://issues.apache.org/jira/browse/LUCENE-4499
 http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
 http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html

 Plus Jack's proposal:
 http://search-lucene.com/m/Zkj0k15dDGP1

 Do any of the above approaches sound like the right one, or at least in
 the right direction, and stand a chance of being accepted?

 Thanks,
 Otis




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com