Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-20 Thread Morus Walter
David Spencer writes:
> > 
> > could you put the current version of your code on that website as a java
> 
> Weblog entry updated:
> 
> http://searchmorph.com/weblog/index.php?id=23
> 
thanks
> 
> Great suggestion and thanks for that idiom - I should know such things 
> by now. To clarify the "issue", it's just a performance one, not other 
> functionality...anyway I put in the code - and to be scientific I 
> benchmarked it two times before the change and two times after - and the 
> results were surprisingly the same both times (1:45 to 1:50 with an index 
> that takes up > 200MB). Probably there are cases where this will run 
> faster, and the code seems more "correct" now so it's in.
> 
Ahh, I see, you check the field later.
The logging made me think you were indexing all the fields you loop over, in which
case one might get unwanted words into the ngram index.
> 
> 
> > 
> > 
> > An interesting application of this might be an ngram-Index enhanced version
> > of the FuzzyQuery. While this introduces more complexity on the indexing
> > side, it might be a large speedup for fuzzy searches.
> 
> I was also thinking of reviewing the list to see if anyone had done a "Jaro 
> Winkler" fuzzy query yet, and doing that.
> 
I went in another direction, and changed the ngram index and search
to use a similarity that computes

   m * m / ( n1 * n2)

where m is the number of matching ngrams, n1 is the number of ngrams in the
query, and n2 is the number of ngrams in the word.
(At least if I got that right; I'm not sure if I understand all parts
of the similarity class correctly)
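For reference, a rough sketch of what such a similarity could look like, assuming the Lucene 1.4-era Similarity methods (lengthNorm, queryNorm, tf, sloppyFreq, idf, coord); this is a reconstruction of the idea, not the class attached to this message:

  import org.apache.lucene.search.Similarity;

  // Sketch only: aims at a score of roughly m*m/(n1*n2) for a BooleanQuery of ngrams,
  // where m = matching ngrams, n1 = ngrams in the query, n2 = ngrams in the indexed word.
  public class NGramSimilarity extends Similarity {
      // norm(t,d) contributes 1/n2 (note Lucene stores norms in a lossy single byte)
      public float lengthNorm(String fieldName, int numTokens) {
          return numTokens > 0 ? 1.0f / numTokens : 0.0f;
      }
      // with idf == 1 and no boosts, sumOfSquaredWeights == n1, so this contributes 1/n1
      public float queryNorm(float sumOfSquaredWeights) {
          return sumOfSquaredWeights > 0 ? 1.0f / sumOfSquaredWeights : 0.0f;
      }
      public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }       // presence only
      public float sloppyFreq(int distance) { return 1.0f; }
      public float idf(int docFreq, int numDocs) { return 1.0f; }          // ignore gram rarity
      public float coord(int overlap, int maxOverlap) { return overlap; }  // contributes m
  }

Each matched gram then contributes tf * idf^2 * norm = 1/n2, summing to m/n2 over the m matches; multiplying by queryNorm (1/n1) and coord (m) gives m * m / (n1 * n2).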

After removing the document boost in the ngram index based on the 
word frequency in the original index I find the results pretty good.
My data is a number of encyclopedias and dictionaries and I only use the
headwords for the ngram index. Term frequency doesn't seem relevant
in this case.

I still use the Levenshtein distance to modify the score and sort according
to score / distance, but in most cases this does not make a difference.
So I'll probably drop the distance calculation completely.
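For completeness, the distance referred to here is the standard dynamic-programming Levenshtein distance; a generic sketch, not necessarily the implementation used by the speller:

  // Classic Levenshtein distance: minimum number of insertions, deletions
  // and substitutions needed to turn s into t.
  static int levenshtein(String s, String t) {
      int n = s.length(), m = t.length();
      int[][] d = new int[n + 1][m + 1];
      for (int i = 0; i <= n; i++) d[i][0] = i;
      for (int j = 0; j <= m; j++) d[0][j] = j;
      for (int i = 1; i <= n; i++) {
          for (int j = 1; j <= m; j++) {
              int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
              d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                 d[i - 1][j - 1] + cost);
          }
      }
      return d[n][m];
  }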

I also see little difference between using 2- and 3-grams on the one hand
and only using 2-grams on the other. So I'll presumably drop the 3-grams.

I'm not sure if the similarity I use is useful in general, but I
attached it to this message in case someone is interested.
Note that you need to set the similarity for both the index writer and the searcher,
and thus have to reindex if you want to give it a try.

Morus



Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread David Spencer
Morus Walter wrote:
Hi David,
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

great :-)
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
could you put the current version of your code on that website as a java
Weblog entry updated:
http://searchmorph.com/weblog/index.php?id=23
To link to source code:
http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java
source also? At least until it's in the lucene sandbox.
I created an ngram index on one of my indexes and think I found an issue
in the indexing code:
There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration on this field.

So instead of 
		final TermEnum te = r.terms();
I'd suggest
		final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms if the enumerated term
still has fieldname field, e.g.
			Term t = te.term();
			if ( !t.field().equals(field) ) {
			break;
			}

otherwise you loop over all terms in all fields.
Great suggestion and thanks for that idiom - I should know such things 
by now. To clarify the "issue", it's just a performance one, not other 
functionality...anyway I put in the code - and to be scientific I 
benchmarked it two times before the change and two times after - and the 
results were surprisingly the same both times (1:45 to 1:50 with an index 
that takes up > 200MB). Probably there are cases where this will run 
faster, and the code seems more "correct" now so it's in.



An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.
I was also thinking of reviewing the list to see if anyone had done a "Jaro 
Winkler" fuzzy query yet, and doing that.



Thanks,
 Dave
Morus


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread Morus Walter
Hi David,
> 
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
> phases. First you build a "fast lookup index" as mentioned above. Then 
> to correct a word you do a query in this index based on the ngrams in 
> the misspelled word.
> 
> Let's see.
> 
> [1] Source is attached and I'd like to contribute it to the sandbox, esp 
> if someone can validate that what it's doing is reasonable and useful.
> 
great :-)
> 
> [4] Here's source in HTML:
> 
> http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
> 
could you put the current version of your code on that website as a java
source also? At least until it's in the lucene sandbox.


I created an ngram index on one of my indexes and think I found an issue
in the indexing code:

There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration on this field.

So instead of 
final TermEnum te = r.terms();
I'd suggest
final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms if the enumerated term
still has fieldname field, e.g.
Term t = te.term();
if ( !t.field().equals(field) ) {
break;
}

otherwise you loop over all terms in all fields.
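Put together, the restricted enumeration might look roughly like this (a sketch against the Lucene 1.4-era IndexReader/TermEnum API, using the reader the speller already has open):

  final TermEnum te = r.terms(new Term(field, ""));
  try {
      // terms(Term) positions the enumeration at the first term >= (field, ""),
      // so read the current term first and then advance.
      do {
          Term t = te.term();
          if (t == null || !t.field().equals(field)) {
              break;              // past the last term of this field
          }
          String word = t.text();
          // ... form the ngrams of `word` and add them to the ngram index ...
      } while (te.next());
  } finally {
      te.close();
  }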


An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.

Morus




RE: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread Aad Nales
Also,

You can also use an alternative spellchecker for the 'checking' part and
use the ngram algorithm for the 'suggestion' part. Only if the spell
'check' declares a word illegal would the 'suggestion' part perform its
magic.


cheers,
Aad

Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>> recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a
>> term in the index, but the next 2 are. So it ignores the last 2 terms

>> ("recursive" and "descent") and suggests alternatives to 
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in
>> the index and thus is sort of spelled correctly (as it exists in some
>> doc), then we use the heuristic that any sufficiently large doc
>> collection will have tons of misspellings, so we assume that rare
>> terms in the query might be misspelled (i.e. not what the user
>> intended) and we suggest alternatives to these words too (in addition
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say,
> >50% of the documents, is probably spelled correctly and "recursize",
> which is in zero documents, is probably misspelled.  But what about
> "purser"?  If we run the spell check algorithm on "purser" and generate
> "parser", should we show it to the user?  If "purser" occurs in 1% of
> documents and "parser" occurs in 5%, then we probably should, since
> "parser" is a more common word than "purser".  But if "parser" only
> occurs in 1% of the documents and purser occurs in 5%, then we probably
> shouldn't bother suggesting "parser".
> 
> If you wanted to get really fancy then you could check how frequently
> combinations of query terms occur, i.e., does "purser" or "parser" occur
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.

-- Dave

> 
> Doug
> 



Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of 
frequency,  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove
But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0
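The filter itself comes down to a docFreq comparison against the original index; a minimal sketch (variable and field names here are illustrative, not the actual NGramSpeller API):

  // Only keep a suggestion if it is more frequent than the word being corrected.
  int origFreq = origReader.docFreq(new Term(field, misspelledWord));
  int suggFreq = origReader.docFreq(new Term(field, suggestion));
  if (morePopularOnly && suggFreq <= origFreq) {
      continue;   // skip: the suggestion is no more popular than the input word
  }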
TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.

-- Dave
Doug


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words 
like this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts -> con ont nts
const -> con ons nst
Thus they share just one 3-gram, which is why it scores so low. This
is an interesting issue: how to tune the algorithm so that it might
rank words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):
Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I reformed the index w/ ngrams from length 2 to 5, 
and also store transpositions, and w/ appropriate boosts :) then "const" 
is returned 2nd.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
conts -> _c co on nt ts s_
const -> _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would 
be:

conts:
_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const:
_c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.

I guess one way is to add all simple transpositions to the lookup 
table (the "ngram index") so that these could easily be found, with 
the heuristic that "a frequent way of misspelling words is to 
transpose two adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure out what the problem might be:

1. my expectations (most likely ;-)
2. something in the code..

I enhanced the code to store simple transpositions also and I 
regenerated my site w/ ngrams from 2 to 5 chars. If you set the 
transposition boost up to 10 then "const" is returned 2nd...

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene

Aad Nales wrote:

David,
Perhaps I misunderstand something so please correct me if I do. I used

http://www.searchmorph.com/kat/spell.jsp to look for conts without 
changing any of the default values. What I got as results did not 
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes 
this behavior?





Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Andrzej Bialecki
David Spencer wrote:
To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts -> con ont nts
const -> con ons nst
Thus they share just one 3-gram, which is why it scores so low. This is
an interesting issue: how to tune the algorithm so that it might rank
words this close higher.
If you added 2-grams, then it would look like this (constructing also 
special start/end grams):

conts -> _c co on nt ts s_
const -> _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would be:

conts:
_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const:
_c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.
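As an aside, the decompositions above follow from padding the word with a single '_' at each end and sliding a window over it; a small sketch (helper name is illustrative):

  // n-grams of a word with special start/end markers, as in the examples above.
  static String[] ngrams(String word, int n) {
      String padded = "_" + word + "_";
      int count = Math.max(padded.length() - n + 1, 0);
      String[] grams = new String[count];
      for (int i = 0; i < count; i++) {
          grams[i] = padded.substring(i, i + n);
      }
      return grams;
  }
  // ngrams("conts", 2) -> _c co on nt ts s_
  // ngrams("conts", 3) -> _co con ont nts ts_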

I guess one way is to add all simple transpositions to the lookup table 
(the "ngram index") so that these could easily be found, with the 
heuristic that "a frequent way of misspelling words is to transpose two 
adjacent letters".
Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote:
Aad Nales wrote:
David,
Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.
I added a line of output to the right column under the 'details' box. 
"const" appears 216 times in the index (out of 96k docs), thus it is 
indeed kinda rare.

http://www.searchmorph.com/kat/spell.jsp?s=const

should have a pretty low levenshtein distance. Any idea what causes this
behavior?





Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure out what the problem might be:

1. my expectations (most likely ;-)
2. something in the code..

Good question.
If I use the form at the bottom of the page and ask for more results, 
the suggestion of "const" does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0
Even boosting the prefix match from 2.0 to 10.0 only changes the ranking 
a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0

To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts -> con ont nts
const -> con ons nst
Thus they share just one 3-gram, which is why it scores so low. This is
an interesting issue: how to tune the algorithm so that it might rank
words this close higher.

I guess one way is to add all simple transpositions to the lookup table 
(the "ngram index") so that these could easily be found, with the 
heuristic that "a frequent way of misspelling words is to transpose two 
adjacent letters".
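Generating those simple transpositions is cheap; a sketch of the idea (helper name is illustrative):

  // All single adjacent-letter transpositions of a word, e.g.
  // transpositions("const") -> ocnst, cnost, cosnt, conts
  static String[] transpositions(String word) {
      char[] c = word.toCharArray();
      String[] out = new String[Math.max(c.length - 1, 0)];
      for (int i = 0; i < c.length - 1; i++) {
          char[] t = (char[]) c.clone();
          char tmp = t[i];
          t[i] = t[i + 1];
          t[i + 1] = tmp;
          out[i] = new String(t);
      }
      return out;
  }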

Based on other mails I'll make some additions to the code and will 
report back if anything of interest changes here.



-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene

Aad Nales wrote:

David,
Perhaps I misunderstand something so please correct me if I do. I used

http://www.searchmorph.com/kat/spell.jsp to look for conts without 
changing any of the default values. What I got as results did not 
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes 
this behavior?





RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Aad Nales
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure out what the problem might be:

1. my expectations (most likely ;-)
2. something in the code..

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Aad Nales wrote:

> David,
> 
> Perhaps I misunderstand something so please correct me if I do. I used

> http://www.searchmorph.com/kat/spell.jsp to look for conts without 
> changing any of the default values. What I got as results did not 
> include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.

> should have a pretty low levenshtein distance. Any idea what causes 
> this behavior?



-- 
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)





Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Andrzej Bialecki
Aad Nales wrote:
David,
Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and
??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.

should have a pretty low levenshtein distance. Any idea what causes this
behavior?

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread Aad Nales
David,

Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and
should have a pretty low levenshtein distance. Any idea what causes this
behavior?

cheers,
Aad

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 21:23
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms
> to bi- or trigrams, create a separate lookup index, and then simply for
> each term ask a phrase query (phrase = all n-grams from an input term),
> with a slop > 0, to get similar existing terms. This should be fast, and
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in
the ngram index. The background is each word (er..term for a given
field) in the orig index is a separate Document in the ngram index. This
Doc contains all ngrams (in my test case, like #2 above, of length 3 and
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave









Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of 
frequency,  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".
OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.


If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
 it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).
Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
>50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.

Doug


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote:
I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.
Sloppy PhraseQuery's are slower than BooleanQueries, but not horribly 
slower.  The problem is that they don't handle the case where phrase 
elements are missing altogether, while a BooleanQuery does.  So what you 
really need is maybe a variation of a sloppy PhraseQuery that scores 
matches that do not contain all of the terms...
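For context, the BooleanQuery approach being compared here looks roughly like this (a sketch; the field name, the ngrams() helper and the handling of the bstart/bend boosts are illustrative, based on the parameters visible in the demo URLs):

  BooleanQuery q = new BooleanQuery();
  String[] grams = ngrams(word, 3);        // hypothetical helper: ngrams of the misspelled word
  for (int i = 0; i < grams.length; i++) {
      TermQuery tq = new TermQuery(new Term("gram3", grams[i]));
      if (i == 0) {
          tq.setBoost(bstart);             // e.g. 2.0f: extra weight on the leading gram
      } else if (i == grams.length - 1) {
          tq.setBoost(bend);               // e.g. 1.0f: weight on the trailing gram
      }
      q.add(tq, false, false);             // optional clause: missing grams just lower the score
  }
  Hits hits = ngramSearcher.search(q);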

Doug


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get similar existing terms.
This should be fast, and you could provide a "did you mean"
function too...
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above.
Then to correct a word you do a query in this index based on the
ngrams in the misspelled word.
The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)
Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95 to 100%.


[1] Source is attached and I'd like to contribute it to the sandbox,
esp if someone can validate that what it's doing is reasonable and
useful.

There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.

I also wonder how this algorithm would behave for smaller values of 
Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query 
page can test any contiguous combo).

start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

[2] Here's a demo page. I built an ngram index for ngrams of length 3
 and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like
"recursixe" or whatnot to see what suggestions it returns. Note this
is not a normal search index query -- rather this is a test page for
spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

Very nice demo! 
Thanks, kinda designed for ngram-nerds if you know what I mean :)
I bet it's running way faster than the linear search 
Indeed, this is almost zero time, whereas the simple and dumb linear
search was taking me 10 sec. I will have to redo the site's main search
page so it uses this new code, TBD, prob tomorrow.

over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

Based on a subsequent mail in this thread I set boosts for the words
in the ngram index. The background is each word (er..term for a given
 field) in the orig index is a separate Document in the ngram index.
This Doc contains all ngrams (in my test case, like #2 above, of
length 3 and 4) of the word. I also set a boost of
log(word_freq)/log(num_docs) so that more frequent words will tend
to be suggested more often.

You may want to experiment with 2 <= n <= 5. Some n-gram based 
Yep, will do prob tomorrow.
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

I think in "plain" English then the way a word is suggested as a 
spelling correction is: - frequently occurring words score higher -
words that share more ngrams with the orig word score higher - words
that share rare ngrams with the orig word score higher

I think this is a reasonable heuristics. Reading the code I would 
present it this way:
ok, thx, will update
- words that share more ngrams with the orig word score higher, and
  words that share rare ngrams with the orig word score higher
  (as a natural consequence of using BooleanQuery),
- and, frequently occuring words score higher (as a consequence of using
  per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
  to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about 
 the input n-grams you may be getting some really weird hits.
Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
Good point, I'll try to add that in as an optional parameter.
because you operate on a small result set.

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Andrzej Bialecki
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get similar existing terms.
This should be fast, and you could provide a "did you mean"
function too...
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above.
Then to correct a word you do a query in this index based on the
ngrams in the misspelled word.
The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)

[1] Source is attached and I'd like to contribute it to the sandbox,
esp if someone can validate that what it's doing is reasonable and
useful.
There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.

I also wonder how this algorithm would behave for smaller values of 
start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

[2] Here's a demo page. I built an ngram index for ngrams of length 3
 and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like
"recursixe" or whatnot to see what suggestions it returns. Note this
is not a normal search index query -- rather this is a test page for
spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
Very nice demo! I bet it's running way faster than the linear search 
over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

Based on a subsequent mail in this thread I set boosts for the words
in the ngram index. The background is each word (er..term for a given
 field) in the orig index is a separate Document in the ngram index.
This Doc contains all ngrams (in my test case, like #2 above, of
length 3 and 4) of the word. I also set a boost of
log(word_freq)/log(num_docs) so that more frequent words will tend
to be suggested more often.
You may want to experiment with 2 <= n <= 5. Some n-gram based 
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

I think in "plain" English then the way a word is suggested as a 
spelling correction is: - frequently occurring words score higher -
words that share more ngrams with the orig word score higher - words
that share rare ngrams with the orig word score higher
I think this is a reasonable heuristics. Reading the code I would 
present it this way:

- words that share more ngrams with the orig word score higher, and
  words that share rare ngrams with the orig word score higher
  (as a natural consequence of using BooleanQuery),
- and, frequently occuring words score higher (as a consequence of using
  per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
  to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about 
 the input n-grams you may be getting some really weird hits. You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
because you operate on a small result set.
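A cheap way to do that sequence check might be something like the following (a sketch; the helper name and the threshold handling are illustrative):

  // Count how many of the input word's ngrams occur in the candidate term in the
  // same left-to-right order; compare the count against a threshold fraction.
  // (Pad the candidate the same way as the input if the grams carry start/end markers.)
  static int inOrderMatches(String[] inputGrams, String candidate) {
      int pos = 0, matches = 0;
      for (int i = 0; i < inputGrams.length; i++) {
          int idx = candidate.indexOf(inputGrams[i], pos);
          if (idx >= 0) {
              matches++;
              pos = idx + 1;
          }
      }
      return matches;
  }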

[6]
If people want to vote me in as a committer to the sandbox then I can
Well, someone needs to maintain the code after all... ;-)
--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Tate Avery wrote:
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp
How embarrassing!
Sorry!
Fixed!


T
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:

David Spencer wrote:

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...


Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave



RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Tate Avery

I get a NullPointerException shown (via Apache) when I try to access 
http://www.searchmorph.com/kat/spell.jsp


T

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave







NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.
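In code the boost amounts to something like the following when each word of the original index becomes its own Document (a sketch against the Lucene 1.4-era API; field names and variables are illustrative):

  int wordFreq = origReader.docFreq(new Term(field, word));  // word came from the index, so > 0
  Document doc = new Document();
  doc.add(Field.Keyword("word", word));
  // ... plus one field per ngram of `word` (e.g. "gram3", "gram4") ...
  doc.setBoost((float) (Math.log(wordFreq) / Math.log(origReader.numDocs())));
  ngramWriter.addDocument(doc);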

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
 Dave

package org.apache.lucene.spell;

/* 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001-2003 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in
 *the documentation and/or other materials provided with the
 *distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *if any, must include the following acknowledgment:
 *   "This product includes software developed by the
 *Apache Software Foundation (http://www.apache.org/)."
 *Alternately, this acknowledgment may appear in the software itself,
 *if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *"Apache Lucene" must not be used to endorse or promote products
 *derived from this software without prior written permission. For
 *written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *"Apache Lucene", nor may "Apache" appear in their name, without
 *prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ===

RE: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-11 Thread Aad Nales
Doug Cutting wrote:

> David Spencer wrote:
> 
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user
>> misspells a word and the "did you mean" spelling correction algorithm
>> determines that a frequent term is a good suggestion, why not suggest
>> it? The very fact that it's common could mean that it's more likely
>> that the user wanted this word (well, the heuristic here is that users
>> frequently search for frequent terms, which is probably wrong, but
>> anyway..).
> 
> 
> I think you misunderstood me.  What I meant to say was that if the term
> the user enters is very common then spell correction may be skipped.
> Very common words which are similar to the term the user entered should
> of course be shown.  But if the user's term is very common one need not
> even attempt to find similarly-spelled words.  Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context was
something like this:

[1] The user enters a query like:
 recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term
in the index, but the next 2 are. So it ignores the last 2 terms
("recursive" and "descent") and suggests alternatives to
"recursize"...thus if any term is in the index, regardless of frequency,
it is left as-is.


My idea is to first execute the query and only execute the 'spell check'
if the number of results is lower than a certain threshold.
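That gate could be as small as this (a sketch; the threshold constant is illustrative):

  Hits hits = searcher.search(query);
  if (hits.length() < SUGGEST_THRESHOLD) {
      // results are sparse: run the ngram speller on the query terms
      // and offer "did you mean" alternatives
  }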

Secondly, I would like to use the 'stemming' functionality that MySpell
offers for everything that is written to the index, together
with the POS appearance.

Thirdly, I want to regularly scan the index for often-used words to be
added to the list of 'approved' terms. This would serve another purpose
of the customer, which is building a synonym index for Dutch words used
in an educational context.

But having read all the input I think that using the index itself for a
first spellcheck is probably not a bad start. 



I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare terms
in the query might be misspelled (i.e. not what the user intended) and
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).


> 
> Doug
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely 
that the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?
Yes, sure, thx, I understand now - but maybe not - the context I was 
thinking of was something like this:

[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
 it is left as-is.
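A small sketch of that per-term test, assuming the Lucene 1.4-era IndexReader API; the "contents" field name is only an example. Terms with a document frequency of zero are the ones handed to the speller:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class QueryTermCheck {

    // Returns the query words that do not occur in the index at all --
    // the candidates for "did you mean" suggestions.
    public static List findUnknownTerms(IndexReader reader, String[] queryWords)
            throws Exception {
        List unknown = new ArrayList();
        for (int i = 0; i < queryWords.length; i++) {
            Term t = new Term("contents", queryWords[i]); // example field name
            if (reader.docFreq(t) == 0) {
                unknown.add(queryWords[i]);
            }
        }
        return unknown;
    }
}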

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).


Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).
I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  

And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I know in other contexts of IR frequent terms are penalized but in this 
context it seems that frequent terms should be fine...

-- Dave

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
eks dev wrote:
Hi Doug,

Perhaps.  Are folks really better at spelling the
beginning of words?

Yes they are. There were some comprehensive empirical
studies on this topic. Winkler modification on Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars match).
Jaro-Winkler is well documented and some folks think
that it is much more efficient and precise than plain
Edit distance (of course for normal language, not
numbers or so).
I will try to dig-out some references from my disk on
this topic, if you are interested.

Good ole Citeseer finds 2 docs that seem relevant:
http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any

I have some of the ngram spelling suggestion stuff, based on earlier 
msgs in this thread, working in my dev tree. I'll try to get a test site 
up later today for people to fool around with.
On another note,
I would even suggest using Jaro-Winkler distance as
default for fuzzy query. (one could configure max
prefix required => prefix query to reduce number of
distance calculations). This could speed-up fuzzy
search dramatically.
Hope this was helpful,
Eks

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-10 Thread eks dev
Hi Doug,

> Perhaps.  Are folks really better at spelling the
> beginning of words?

Yes they are. There were some comprehensive empirical
studies on this topic. Winkler modification on Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars match).
Jaro-Winkler is well documented and some folks think
that it is much more efficient and precise than plain
Edit distance (of course for normal language, not
numbers or so).
I will try to dig-out some references from my disk on
this topic, if you are interested.

On another note,
I would even suggest using Jaro-Winkler distance as
default for fuzzy query. (one could configure max
prefix required => prefix query to reduce number of
distance calculations). This could speed-up fuzzy
search dramatically.
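For reference, a sketch of the textbook Jaro-Winkler calculation (the usual 0.1 scaling over at most 4 leading characters); this is the generic formulation, not code from this thread:

public class JaroWinkler {

    // Plain Jaro similarity in [0,1].
    public static double jaro(String s1, String s2) {
        if (s1.length() == 0 && s2.length() == 0) return 1.0;
        int window = Math.max(s1.length(), s2.length()) / 2 - 1;
        if (window < 0) window = 0;
        boolean[] m1 = new boolean[s1.length()];
        boolean[] m2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = true;
                    m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that are out of order (transpositions = t/2).
        int t = 0, k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) t++;
            k++;
        }
        double m = matches;
        return (m / s1.length() + m / s2.length() + (m - t / 2.0) / m) / 3.0;
    }

    // Winkler modification: boost when the first few characters agree.
    public static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}

Swapping something like this in for plain edit distance when scoring fuzzy-query candidates is the substitution being suggested here.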

Hope this was helpful,
Eks

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote:
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:
Not that I know of.
- ignore(penalize?) terms that are rare
I think this one is easy to threshold: ignore matching terms that are 
rarer than the term entered.

- ignore(penalize?) terms that are common
This, in effect, falls out of the previous criterion.  A term that is 
very common will not have any matching terms that are more common.  As 
an optimization, you could avoid even looking for matching terms when a 
term is very common.

- terms that are closer (string distance) to the term entered are better
This is the meaty one.
- terms that start w/ the same 'n' chars as the users term are better
Perhaps.  Are folks really better at spelling the beginning of words?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:

- ignore(penalize?) terms that are rare
- ignore(penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the users term are better


Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 
Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
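A sketch of those two frequency rules, assuming the Lucene 1.4-era IndexReader API (the field name and the "large proportion" cutoff are arbitrary illustrative choices): skip correction entirely for very common query terms, and only keep candidate corrections that are more frequent than what the user typed:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class FrequencyFilter {

    private static final String FIELD = "contents";  // example field
    private static final double TOO_COMMON = 0.10;   // arbitrary cutoff

    // Returns only the candidate corrections worth showing for 'entered'.
    public static List filter(IndexReader reader, String entered, String[] candidates)
            throws Exception {
        List keep = new ArrayList();
        int enteredFreq = reader.docFreq(new Term(FIELD, entered));
        // Don't even try to correct a term that occurs in a large
        // proportion of the collection.
        if (enteredFreq > TOO_COMMON * reader.numDocs()) {
            return keep;
        }
        for (int i = 0; i < candidates.length; i++) {
            int freq = reader.docFreq(new Term(FIELD, candidates[i]));
            // Only show "corrections" more frequent than what the user entered.
            if (freq > enteredFreq) {
                keep.add(candidates[i]);
            }
        }
        return keep;
    }
}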

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...
Sounds interesting/fun but I'm not sure if I'm following exactly.
Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a 
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...
And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad 
term and go thru the trigram-index, but there will be lots of 
suggestions. Now what - use string distance to score them? I guess that 
makes sense - plz confirm if I understand. And so I guess the point 
here is we precalculate the trigram->term mappings to avoid an expensive 
traversal of all terms in an index, but we still use string distance as 
a 2nd pass (and prob should force the matches to always match on the 1st 
n (3) chars, using the heuristic that people can usually start 
spelling a word correctly).
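A rough sketch of that two-pass lookup, assuming a trigram lookup index with "gram3" and "word" fields and the Lucene 1.4-era API (the field names and candidate cap are illustrative): OR together the trigrams of the misspelled word, then re-score the top hits with plain edit distance.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TrigramSuggest {

    // Suggest the indexed word closest to the (possibly misspelled) input.
    public static String suggest(IndexSearcher ngramSearcher, String misspelled)
            throws Exception {
        if (misspelled.length() < 3) return null; // too short for trigrams
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i + 3 <= misspelled.length(); i++) {
            String gram = misspelled.substring(i, i + 3);
            // optional clause: required=false, prohibited=false (1.4-era signature)
            query.add(new TermQuery(new Term("gram3", gram)), false, false);
        }
        Hits hits = ngramSearcher.search(query);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        int limit = Math.min(hits.length(), 50);   // cap the 2nd pass
        for (int i = 0; i < limit; i++) {
            String candidate = hits.doc(i).get("word");
            int d = levenshtein(misspelled, candidate);
            if (d < bestDist) {
                bestDist = d;
                best = candidate;
            }
        }
        return best;
    }

    // Standard dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}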




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread Andrzej Bialecki
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.
...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...
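A sketch of building such a lookup index with the Lucene 1.4-era API; the field names, the choice of trigrams and the whitespace-token trick are just one way to do it, and the slop-phrase lookup at the end follows the suggestion above:

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.PhraseQuery;

public class NGramLookupIndex {

    // Builds a separate lookup index: one document per term of the original
    // index, with the term itself plus its trigrams as whitespace tokens.
    public static void build(String origIndex, String lookupIndex) throws Exception {
        IndexReader reader = IndexReader.open(origIndex);
        IndexWriter writer = new IndexWriter(lookupIndex, new WhitespaceAnalyzer(), true);
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            if (!"contents".equals(t.field())) continue;   // example field
            String word = t.text();
            if (word.length() < 3) continue;
            StringBuffer grams = new StringBuffer();
            for (int i = 0; i + 3 <= word.length(); i++) {
                grams.append(word.substring(i, i + 3)).append(' ');
            }
            Document doc = new Document();
            doc.add(Field.Keyword("word", word));           // stored, not tokenized
            doc.add(Field.Text("gram3", grams.toString())); // tokenized trigrams
            writer.addDocument(doc);
        }
        terms.close();
        writer.optimize();
        writer.close();
        reader.close();
    }

    // The lookup idea above: a sloppy phrase of the input term's trigrams.
    public static PhraseQuery trigramPhrase(String word, int slop) {
        PhraseQuery pq = new PhraseQuery();
        for (int i = 0; i + 3 <= word.length(); i++) {
            pq.add(new Term("gram3", word.substring(i, i + 3)));
        }
        pq.setSlop(slop);
        return pq;
    }
}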

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote:
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 
I did a WordNet/synonym query expander. Search for "WordNet" on this 
page. Of interest is that it stores the WordNet info in a separate Lucene 
index, since at its essence an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/
Also, another variation, is to instead spell based on what terms are in 
the index, not what an external dictionary says. I've done this on my 
experimental site searchmorph.com in a dumb/inefficient way. Here's an 
example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz
After you click above it takes ~10sec as it produces terms close to 
"recursivz". Opps - looking at the output, it looks like the same word 
is suggest multiple times - ouch - I must be considering all fields, not 
just the contents field. TBD is fixing this. (or no wonder it's so slow :))

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.
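A condensed sketch of that brute-force pass, assuming the Lucene 1.4-era TermEnum API (field name and result limit are illustrative), with the usual dynamic-programming Levenshtein distance inlined:

import java.util.Iterator;
import java.util.TreeMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class BruteForceSuggest {

    // Walks every term of one field and keeps the closest few by edit distance.
    public static String[] suggest(IndexReader reader, String field, String badWord,
                                   int max) throws Exception {
        TreeMap byDistance = new TreeMap(); // distance -> term (ties overwrite; a sketch)
        TermEnum terms = reader.terms();
        try {
            while (terms.next()) {
                Term t = terms.term();
                if (!t.field().equals(field)) continue; // only the chosen field
                int d = levenshtein(badWord, t.text());
                byDistance.put(new Integer(d), t.text());
            }
        } finally {
            terms.close();
        }
        String[] best = new String[Math.min(max, byDistance.size())];
        Iterator it = byDistance.values().iterator();
        for (int i = 0; i < best.length; i++) best[i] = (String) it.next();
        return best;
    }

    // Standard dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}

Restricting the walk to terms sharing the first few characters of the bad word, as suggested above, would cut the number of distance calculations dramatically.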



Cheers,
Aad
--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


combining open office spellchecker with Lucene

2004-09-09 Thread Aad Nales
Hi All,

Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Cheers,
Aad


--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]