[ 
https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment: LUCENE-626_20070817.patch

The phrase-suggestion layer on top of contrib/spell in this patch has been 
mentioned in a number of forums over the last few weeks, so I've removed the 
LUCENE-550 dependency and brought the patch up to date with trunk.

Second-level suggesting (ngram token, phrase) can run standalone; see 
TestTokenPhraseSuggester. However, I recommend the adaptive dictionary, as it 
acts as a cache on top of the second-level suggestions. (See the docs.)
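
To illustrate the caching idea, here is a minimal sketch of a dictionary that remembers what a slow second-level suggester returned. All names here (CachingSuggester, secondLevelCalls) are hypothetical and are not the classes in the patch; the real adaptive dictionary does a good deal more than this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not the API of the patch.
final class CachingSuggester {
    // Slow second-level suggester; returns null when it has no suggestion.
    private final Function<String, String> secondLevel;
    // Suggestions learned so far, keyed by the query as typed.
    private final Map<String, String> learned = new HashMap<>();
    // Counts second-level invocations, only to make the caching visible.
    int secondLevelCalls = 0;

    CachingSuggester(Function<String, String> secondLevel) {
        this.secondLevel = secondLevel;
    }

    String didYouMean(String query) {
        String known = learned.get(query);
        if (known != null) {
            return known; // answered from the dictionary, no second-level work
        }
        secondLevelCalls++;
        String suggestion = secondLevel.apply(query);
        if (suggestion != null) {
            learned.put(query, suggestion); // only positive results are remembered
        }
        return suggestion;
    }

    public static void main(String[] args) {
        CachingSuggester s = new CachingSuggester(
            q -> "the pilates".equals(q) ? "the pirates" : null);
        System.out.println(s.didYouMean("the pilates")); // computed by the second level
        System.out.println(s.didYouMean("the pilates")); // served from the cache
        System.out.println(s.secondLevelCalls);          // prints 1
    }
}
```

Only non-null results are stored in this sketch, so a query that got no suggestion is tried again the next time it is seen.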

Output from using the adaptive layer only, i.e. suggestions based on how users 
previously behaved. About half a million user queries were analyzed to build 
the dictionary (which takes 30 seconds to build on my dual core):

3ms      pirates ofthe caribbean -> pirates of the caribbean
2ms      pirates of the carribbean -> pirates of the caribbean
0ms      pirates carricean -> pirates caribbean
1ms      pirates of the carriben -> pirates of the caribbean
0ms      pirates of the carabien -> pirates of the caribbean
0ms      pirates of the carabbean -> pirates of the caribbean
1ms      pirates og carribean -> pirates of the caribbean
0ms      pirates of the caribbean music -> pirates of the caribbean soundtrack
0ms      pirates of the caribbean soundtrack -> pirates of the caribbean score
0ms      pirate of carabian -> pirate of caribbean
0ms      pirate of caribbean -> pirates of caribbean
0ms      pirates of caribbean -> pirates of caribbean
0ms      homm 4 -> homm iv
0ms      the pilates -> null
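
As a rough illustration of what "based on how users previously behaved" can mean, the sketch below counts which query users typed next within a session and suggests the most frequent follow-up. This is an assumption made for illustration only: the class and method names are hypothetical, and the patch also takes goals and time stamps into account, which this ignores.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the session analysis implemented in the patch.
final class SessionDictionary {
    // query -> (follow-up query -> number of sessions steps where it came next)
    private final Map<String, Map<String, Integer>> followUps = new HashMap<>();

    // Records every consecutive pair of distinct queries in one user session.
    void addSession(List<String> queriesInOrder) {
        for (int i = 0; i + 1 < queriesInOrder.size(); i++) {
            String from = queriesInOrder.get(i);
            String to = queriesInOrder.get(i + 1);
            if (from.equals(to)) {
                continue; // a repeated query is not a reformulation
            }
            followUps.computeIfAbsent(from, k -> new HashMap<>())
                     .merge(to, 1, Integer::sum);
        }
    }

    // Returns the most frequent follow-up, or null if the query was never seen.
    String didYouMean(String query) {
        Map<String, Integer> counts = followUps.get(query);
        if (counts == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SessionDictionary d = new SessionDictionary();
        d.addSession(Arrays.asList("pirates ofthe caribbean", "pirates of the caribbean"));
        d.addSession(Arrays.asList("pirates ofthe caribbean", "pirates of the caribbean"));
        d.addSession(Arrays.asList("pirates ofthe caribbean", "pirates"));
        System.out.println(d.didYouMean("pirates ofthe caribbean"));
        System.out.println(d.didYouMean("the pilates"));
    }
}
```

Lookups are a couple of hash map reads, which is in line with why the adaptive layer answers in a few milliseconds or less once the dictionary is built.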


Output from the phrase ngram token suggester, which builds token matrices and 
checks them against an a priori index. One suggestion requires a lot of 
queries. Using an instantiated index as the a priori index saves plenty of 
milliseconds. This is expensive stuff, but it works pretty well.

72ms     the pilates -> the pirates
440ms    heroes of fight and magic -> heroes of might and magic
417ms    heroes of right and magic -> heroes of might and magic
383ms    heroes of magic and light -> heroes of might and magic
20ms     heroesof lightand magik -> null
385ms    heroes of light and magik -> heroes of might and magic
0ms      heroesof lightand magik -> heroes of might and magic
385ms    heroes of magic and might -> heroes of might and magic 

(That 0ms is because the previous suggestion was cached. One does not have to 
use this cache.)
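
For readers wondering why the token matrix approach needs so many queries, here is a small sketch of the idea under simplifying assumptions: each query token gets a column of similar known tokens (edit distance of at most 1 here, rather than ngram lookups), and every combination of one token per column is checked against a set of known phrases that stands in for the a priori index. The names are hypothetical, and unlike the patch this sketch does not handle wrong term order or split and merged tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the suggester implemented in the patch.
final class TokenMatrixSuggester {
    private final Set<String> knownPhrases;                 // stand-in for the a priori index
    private final Set<String> vocabulary = new TreeSet<>(); // all tokens of the known phrases

    TokenMatrixSuggester(Set<String> knownPhrases) {
        this.knownPhrases = knownPhrases;
        for (String phrase : knownPhrases) {
            vocabulary.addAll(Arrays.asList(phrase.split(" ")));
        }
    }

    // Plain Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    String suggest(String query) {
        // Build the matrix: one column of candidate tokens per query token.
        List<List<String>> matrix = new ArrayList<>();
        for (String token : query.split(" ")) {
            List<String> column = new ArrayList<>();
            for (String word : vocabulary) {
                if (editDistance(token, word) <= 1) column.add(word);
            }
            if (column.isEmpty()) return null; // no candidate for this position
            matrix.add(column);
        }
        return search(matrix, 0, "", query);
    }

    // Depth-first walk over all combinations; each leaf is one index lookup.
    private String search(List<List<String>> matrix, int column, String prefix, String query) {
        if (column == matrix.size()) {
            return knownPhrases.contains(prefix) && !prefix.equals(query) ? prefix : null;
        }
        for (String candidate : matrix.get(column)) {
            String next = prefix.isEmpty() ? candidate : prefix + " " + candidate;
            String hit = search(matrix, column + 1, next, query);
            if (hit != null) return hit;
        }
        return null;
    }

    public static void main(String[] args) {
        TokenMatrixSuggester s = new TokenMatrixSuggester(new HashSet<>(
            Arrays.asList("the pirates", "heroes of might and magic")));
        System.out.println(s.suggest("the pilates"));
        System.out.println(s.suggest("heroes of fight and magic"));
    }
}
```

The number of combinations is the product of the column sizes, so the lookup count grows quickly with phrase length and dictionary size. That is where the cost goes, and why a fast a priori index and a cache in front of it pay off.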

> Extended spell checker with phrase support and adaptive user session analysis.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: didyoumean.patch.bz2, LUCENE-626_20070817.patch, 
> spellchecker.diff
>
>
> Extensive javadocs are available in the patch, but I try to keep a compiled copy here: 
> http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
> The patch spellchecker.diff should not depend on anything but Lucene trunk. It 
> has basic support for phrase suggestions and query goal detection, but is 
> pretty buggy and lacks features available in didyoumean.patch.bz2. The latter 
> depends on LUCENE-550.
> Example:
> {code:java}
> public void testImportData() throws Exception {
>     // load 200 000 user queries with session data and time stamp. no goals specified.
>     System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
>     importFile(new InputStreamReader(new GZIPInputStream(
>         new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
>     System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
>     importFile(new InputStreamReader(new GZIPInputStream(
>         new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
>     System.out.println("Done.");
>     // run some tests without the second level suggestions,
>     // i.e. user behavioral data only. no ngrams or so.
>     
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));
>     assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));
>     assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));
>     assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
>     assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));
>     assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
>     assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
>     assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));
>     // depending on how many hits and goals are noted with these two queries
>     // perhaps the delta should be added to a synonym dictionary? 
>     assertEquals("homm iv", facade.didYouMean("homm 4"));
>     // not yet known.. and we have no second level yet.
>     assertNull(facade.didYouMean("the pilates"));
>     // use the dictionary built from user queries to build the token phrase and ngram suggester.
>     facade.getDictionary().getPrioritesBySecondLevelSuggester().put(
>         Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);
>     // now it's learned
>     assertEquals("the pirates", facade.didYouMean("the pilates"));
>     // typos
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));
>     // composite dictionary key not learned yet..
>     assertEquals(null, facade.didYouMean("heroesof lightand magik"));
>     // learn
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
>     // test
>     assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));
>     // wrong term order
>     assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

