[ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated LUCENE-626:
-------------------------------
Description:
Extensive Javadocs are available in the patch, and I try to keep a compiled copy here:
http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
The patch spellcheck.diff should not depend on anything but Lucene trunk. It
has basic support for phrase suggestions and query goal detection, but is
pretty buggy and lacks features available in didyoumean.diff.bz2. The latter
depends on LUCENE-550.
Example:
{code:java}
public void testImportData() throws Exception {
  // load 200 000 user queries with session data and time stamp. no goals specified.
  System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
  importFile(new InputStreamReader(new GZIPInputStream(
      new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
  System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
  importFile(new InputStreamReader(new GZIPInputStream(
      new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
  System.out.println("Done.");

  // run some tests without the second level suggestions,
  // i.e. user behavioral data only. no ngrams or so.
  assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));
  assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));
  assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
  assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
  assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
  assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));
  assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));
  assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
  assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));
  assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
  assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
  assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));

  // depending on how many hits and goals are noted with these two queries
  // perhaps the delta should be added to a synonym dictionary?
  assertEquals("homm iv", facade.didYouMean("homm 4"));

  // not yet known.. and we have no second level yet.
  assertNull(facade.didYouMean("the pilates"));

  // use the dictionary built from user queries to build the token phrase and ngram suggester.
  facade.getDictionary().getPrioritesBySecondLevelSuggester().put(
      Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);

  // now it's learned
  assertEquals("the pirates", facade.didYouMean("the pilates"));

  // typos
  assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
  assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
  assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));

  // composite dictionary key not learned yet..
  assertEquals(null, facade.didYouMean("heroesof lightand magik"));
  // learn
  assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
  // test
  assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));

  // wrong term order
  assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));
}
{code}
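The two-level lookup exercised in the test above can be sketched roughly like this. It is only an illustration, not code from either patch: the class TwoLevelSuggester and its methods are hypothetical names, the first level is reduced to a plain map of learned corrections, and a second-level suggester is modelled as a function that returns null when it has nothing to offer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a two-level "did you mean" lookup; not the patch API. */
class TwoLevelSuggester {
    // First level: corrections learned from user sessions (query -> correction).
    private final Map<String, String> learned = new LinkedHashMap<>();
    // Second level: fallback suggesters, each mapped to a priority.
    private final Map<Function<String, String>, Double> secondLevel = new LinkedHashMap<>();

    void learn(String query, String correction) {
        learned.put(query, correction);
    }

    void addSecondLevel(Function<String, String> suggester, double priority) {
        secondLevel.put(suggester, priority);
    }

    /** Returns a suggestion, or null if neither level knows the query. */
    String didYouMean(String query) {
        String known = learned.get(query);
        if (known != null) {
            return known;
        }
        // Ask the second-level suggesters in descending priority order
        // and take the first non-null answer.
        return secondLevel.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .map(e -> e.getKey().apply(query))
            .filter(s -> s != null)
            .findFirst()
            .orElse(null);
    }
}
```

With no second-level suggester registered, an unknown query such as "the pilates" yields null. Once a fallback is added, the same query can be answered.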
was:
Some minor changes to how the single-token ngram spell checker in
contrib/spellcheck works, but nothing that breaks any old implementation, I think.
Also fixed the broken test.
NgramPhraseSuggestier tokenizes a query and suggests combinations drawn from the
matrix of single-token suggestions.
Each combination must match as a query against an a priori index. With a span near
query (the default) you get features like this:
assertEquals("lost in translation", ngramSuggester.didYouMean("lost on translation"));
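A minimal sketch of the combination step, under loud assumptions: PhraseCombiner is a hypothetical class, the suggestion matrix is a list of per-token alternatives, and a plain set of known phrases stands in for the a priori index. Exact set membership replaces the span near query, so this shows only the matrix expansion and the "must match" filter, not the real matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; a Set of phrases stands in for the a priori index. */
class PhraseCombiner {
    /** Expands a matrix of per-token suggestions into every candidate phrase. */
    static List<String> combinations(List<List<String>> matrix) {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (List<String> column : matrix) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String token : column) {
                    next.add(prefix.isEmpty() ? token : prefix + " " + token);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    /** Returns the first candidate that occurs in the known phrases, or null. */
    static String didYouMean(List<List<String>> matrix, Set<String> knownPhrases) {
        for (String candidate : combinations(matrix)) {
            if (knownPhrases.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

For "lost on translation" the matrix could hold the alternatives {lost, last}, {on, in} and {translation}, giving four candidates of which only "lost in translation" is in the index.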
If term position vectors are available, it is possible to make it context
sensitive (or whatever one may call it) and suggest a new term order.
assertEquals("heroes might magic", ngramSuggester.didYouMean("magic light heros"));
assertEquals("heroes of might and magic", ngramSuggester.didYouMean("heros on light and magik"));
assertEquals("best game made", ngramSuggester.didYouMean("game best made"));
assertEquals("game made", ngramSuggester.didYouMean("made game"));
assertEquals("game made", ngramSuggester.didYouMean("made lame"));
assertEquals("the game", ngramSuggester.didYouMean("the game"));
assertEquals("in the fame", ngramSuggester.didYouMean("in the game"));
assertEquals("game", ngramSuggester.didYouMean("same"));
assertEquals(0, ngramSuggester.suggest("may game").size());
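The term order suggestion can be illustrated with a much cruder stand-in than term position vectors. In this sketch (TermOrderSuggester is a hypothetical name, and the list of indexed phrases is an assumption) a query is compared to each stored phrase as a bag of tokens, and the stored order is returned when the bags are equal. It handles neither typos nor sub-phrases, which the real suggester evidently does.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: reorder a query to the token order of a stored phrase. */
class TermOrderSuggester {
    /** Returns the stored phrase with the same tokens as the query, or null. */
    static String reorder(String query, List<String> indexedPhrases) {
        String[] queryTokens = query.split(" ");
        Arrays.sort(queryTokens);
        for (String phrase : indexedPhrases) {
            String[] phraseTokens = phrase.split(" ");
            Arrays.sort(phraseTokens);
            // Same tokens regardless of order: suggest the indexed order.
            if (Arrays.equals(queryTokens, phraseTokens)) {
                return phrase;
            }
        }
        return null;
    }
}
```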
SessionAnalyzedDictionary is the adaptive layer that learns from how users
changed their queries, what data they inspected, etc. It will automagically
find and suggest synonyms, decomposed words, and probably a lot of other neat
features I still have not detected.
Depending a bit on the situation, ignored suggestions get suppressed and
followed suggestions get suggested even more.
assertEquals("the da vinci code", dictionary.didYouMean("thedavincicode"));
assertEquals("the da vinci code", dictionary.didYouMean("the davinci code"));
assertEquals("homm", dictionary.didYouMean("heroes of might and magic"));
assertEquals("heroes of might and magic", dictionary.didYouMean("homm"));
assertEquals("heroes of might and magic 2", dictionary.didYouMean("heroes of might and magic ii"));
assertEquals("heroes of might and magic ii", dictionary.didYouMean("heroes of might and magic 2"));
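The suppress-and-reinforce behaviour could look something like the following. This is a sketch under assumptions, not how SessionAnalyzedDictionary is implemented: SuggestionFeedback is a hypothetical class, each followed suggestion adds one point, each ignored suggestion subtracts one, and only suggestions with a positive score are offered.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of feedback-weighted suggestions; weights are arbitrary. */
class SuggestionFeedback {
    // query -> (suggestion -> accumulated score)
    private final Map<String, Map<String, Double>> scores = new HashMap<>();

    /** The user accepted the suggestion: reinforce it. */
    void followed(String query, String suggestion) {
        adjust(query, suggestion, 1.0);
    }

    /** The user ignored the suggestion: suppress it. */
    void ignored(String query, String suggestion) {
        adjust(query, suggestion, -1.0);
    }

    private void adjust(String query, String suggestion, double delta) {
        scores.computeIfAbsent(query, q -> new HashMap<>())
              .merge(suggestion, delta, Double::sum);
    }

    /** Returns the best suggestion with a positive score, or null. */
    String didYouMean(String query) {
        Map<String, Double> candidates = scores.get(query);
        if (candidates == null) {
            return null;
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : candidates.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```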
The adaptive layer is not yet(tm) persistent, but it is soft referenced so that the
dictionary doesn't eat up all your RAM.
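For readers unfamiliar with soft references, here is a small sketch of the idea using java.lang.ref.SoftReference. SoftSuggestionCache is a hypothetical class and says nothing about how the patch organizes its data; it only shows that softly referenced values may be cleared by the garbage collector under memory pressure, after which the entry is treated as missing.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: suggestions held through soft references. */
class SoftSuggestionCache {
    private final Map<String, SoftReference<String>> cache = new HashMap<>();

    void put(String query, String suggestion) {
        cache.put(query, new SoftReference<>(suggestion));
    }

    /** Returns the cached suggestion, or null if absent or already collected. */
    String get(String query) {
        SoftReference<String> ref = cache.get(query);
        if (ref == null) {
            return null;
        }
        String value = ref.get();
        if (value == null) {
            // The garbage collector cleared the referent; drop the stale entry.
            cache.remove(query);
        }
        return value;
    }
}
```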
> Extended spell checker with phrase support and adaptive user session analysis.
> ------------------------------------------------------------------------------
>
> Key: LUCENE-626
> URL: https://issues.apache.org/jira/browse/LUCENE-626
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Karl Wettin
> Assigned To: Karl Wettin
> Priority: Minor
> Attachments: didyoumean.patch.bz2, spellchecker.diff