Hello all,
I am trying to write a simple autosuggest functionality. I was looking
at some autosuggest code, and came across this post:
http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
I have been stuck on some strange words, trying to see how they
are generated. Here's the Analyzer:
public class AutoCompleteAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
        result = new EdgeNGramTokenFilter(result, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
        return result;
    }
}
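My understanding of EdgeNGramTokenFilter with Side.FRONT, minGram 1 and maxGram 20 is that it emits only the leading prefixes of each token. A plain-Java sketch of what I expect it to produce (no Lucene involved, just my mental model of the filter):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {
    // What I expect EdgeNGramTokenFilter(Side.FRONT, minGram, maxGram) to emit
    // for one token: every leading prefix from length minGram up to
    // min(maxGram, token length).
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // For "lord" I expect: [l, lo, lor, lord] -- nothing like "lordne".
        System.out.println(frontEdgeNGrams("lord", 1, 20));
    }
}
```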
And this is the relevant method that does the indexing. It's called
with reindexOn("title"):
private void reindexOn(String keyword) throws CorruptIndexException, IOException {
    log.info("indexing on " + keyword);
    Analyzer analyzer = new AutoCompleteAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory, config);
    analyticalWriter.commit(); // needed to create the initial index
    IndexReader indexReader = IndexReader.open(productsIndexDirectory);
    Map<String, Integer> wordsMap = new HashMap<String, Integer>();
    LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
    BytesRefIterator iter = dict.getWordsIterator();
    BytesRef ref;
    while ((ref = iter.next()) != null) {
        String word = new String(ref.bytes);
        if (word.length() < 3) {
            continue;
        }
        if (wordsMap.containsKey(word)) {
            throw new IllegalStateException("Word " + word + " Already Exists");
        }
        wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
    }
    for (String word : wordsMap.keySet()) {
        Document doc = new Document();
        doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES, Field.Index.ANALYZED));
        String count = Integer.toString(wordsMap.get(word));
        doc.add(new Field(COUNT_FIELD, count, Field.Store.NO, Field.Index.NOT_ANALYZED));
        analyticalWriter.addDocument(doc);
    }
    analyticalWriter.commit();
    analyticalWriter.close();
    indexReader.close();
}
private static final String GRAMMED_WORDS_FIELD = "words";
private static final String SOURCE_WORD_FIELD = "sourceWord";
private static final String COUNT_FIELD = "count";
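For context, the lookup side is not shown above; the plan (following the linked post) is to query GRAMMED_WORDS_FIELD with the typed prefix and rank hits by COUNT_FIELD. A plain-Java sketch of just that ranking step, using a hypothetical in-memory map instead of the real index:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SuggestSketch {
    // Rank candidate words for a typed prefix by descending count.
    // In the real version, prefix matching is done by the edge-ngram field
    // (GRAMMED_WORDS_FIELD) and the count comes from COUNT_FIELD.
    static List<String> suggest(final Map<String, Integer> wordCounts, String prefix) {
        List<String> matches = new ArrayList<String>();
        for (String word : wordCounts.keySet()) {
            if (word.startsWith(prefix)) {
                matches.add(word);
            }
        }
        Collections.sort(matches, new Comparator<String>() {
            public int compare(String a, String b) {
                return wordCounts.get(b) - wordCounts.get(a); // higher count first
            }
        });
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("apache", 2);
        counts.put("apple", 1);
        counts.put("apples", 1);
        System.out.println(suggest(counts, "ap"));
    }
}
```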
And now, my unit test setup:
@BeforeClass
public static void setUp() throws CorruptIndexException, IOException {
    String idxFileName = "myIndexDirectory";
    Indexer indexer = new Indexer(idxFileName);
    indexer.addDoc("Apache Lucene in Action");
    indexer.addDoc("Lord of the Rings");
    indexer.addDoc("Apache Solr in Action");
    indexer.addDoc("apples and Oranges");
    indexer.addDoc("apple iphone");
    indexer.reindexKeywords();
    search = new SearchEngine(idxFileName);
}
The strange part: looking into the index, I found sourceWords like
"lordne", "applee", and "solres". I understand that the n-gram filter
will produce the leading parts of each word, e.g. for "lord":
l
lo
lor
lord
All of these go into one field. But what about "lordne" and "solres"?
I checked the docs for this, and looked into Jira, but didn't find
relevant info.
Is there something I am missing?
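Re-reading my loop, one thing I am not sure about: BytesRef carries bytes plus an offset and length over a buffer that may be reused between terms, so new String(ref.bytes) converts the entire backing array, possibly including stale bytes from an earlier, longer term. I don't know whether the dictionary iterator actually reuses its buffer this way, but here is a plain-Java illustration of the effect (the terms are made up to match my data):

```java
import java.nio.charset.Charset;

public class StaleBytesSketch {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Simulate BytesRef-style buffer reuse: a shorter term is written over
    // the start of a buffer that still holds a previous, longer term.
    static byte[] reuse(String previous, String next) {
        byte[] buffer = previous.getBytes(UTF8);
        byte[] nextBytes = next.getBytes(UTF8);
        System.arraycopy(nextBytes, 0, buffer, 0, nextBytes.length);
        return buffer;
    }

    public static void main(String[] args) {
        byte[] buffer = reuse("apples", "solr"); // buffer now holds: s o l r e s
        // Converting the whole array (like new String(ref.bytes)) keeps the stale tail:
        System.out.println(new String(buffer, UTF8));       // prints "solres"
        // Respecting the valid region (offset/length) recovers the real term:
        System.out.println(new String(buffer, 0, 4, UTF8)); // prints "solr"
    }
}
```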
I understand there could be easier ways to create this functionality
(http://wiki.apache.org/lucene-java/SpellChecker), but I would like to
resolve this issue and understand whether I am doing something wrong.
Thank you in advance.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]