Here is my attempt at a dictionary lookup using Lucene. I need some
pointers on optimising it. Currently it takes 30 secs for a million lookups
against a dictionary of 500K words, about 30x the time of a HashMap. The
space used, though, is almost the same as far as I can see from the memory
sizes in the process manager.


 import java.io.IOException;

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.WhitespaceAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queryParser.ParseException;
 import org.apache.lucene.search.BooleanClause.Occur;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.RAMDirectory;

  private static final String ID = "id";
  private static final String WORD = "word";
  private IndexWriter iwriter;
  private IndexSearcher isearcher;
  private RAMDirectory idx = new RAMDirectory();
  private Analyzer analyzer = new WhitespaceAnalyzer();

  public void init() throws Exception {
    this.iwriter = new IndexWriter(idx, analyzer, true,
        IndexWriter.MaxFieldLength.LIMITED);
  }

  public void destroy() throws Exception {
    // The writer is already closed in ready(), so only close the searcher here.
    isearcher.close();
  }

  public void ready() throws Exception {
    // Merge down to a single segment and close the writer before
    // opening a read-only searcher over the RAMDirectory.
    iwriter.optimize();
    iwriter.close();
    this.isearcher = new IndexSearcher(idx, true);
  }

  public void addToDictionary(String word, Integer id) throws IOException {
    Document doc = new Document();
    doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
    // Is there a way other than storing the id as a string?
    doc.add(new Field(ID, id.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    iwriter.addDocument(doc);
  }
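
On the question in the comment above: assuming Lucene 2.9 or later, NumericField
lets you set the id as an int instead of formatting it into a string. A minimal
sketch (the method name here is hypothetical):

 import org.apache.lucene.document.NumericField;

  // Sketch, assuming Lucene 2.9+: index/store the id as a NumericField.
  public void addToDictionaryNumeric(String word, int id) throws IOException {
    Document doc = new Document();
    doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
    // Trie-encoded numeric value instead of a plain string term.
    doc.add(new NumericField(ID, Field.Store.YES, true).setIntValue(id));
    iwriter.addDocument(doc);
  }

Whether the stored value comes back cheaper than the string form is something
I'd benchmark. For comparison, the lookup side as it stands: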

  public Integer get(String word) throws IOException, ParseException {
    // A single exact-match term, wrapped in a BooleanQuery.
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
    TopDocs top = isearcher.search(query, null, 1);
    ScoreDoc[] hits = top.scoreDocs;
    if (hits.length == 0) return null;
    // Every hit costs a stored-field read to recover the id.
    return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
  }
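
Since every get() runs a BooleanQuery through the full scoring machinery for
what is really a single exact term, one thing worth trying is reading the
posting list directly. A sketch against the Lucene 2.9/3.x IndexReader API
(the method name is mine):

 import org.apache.lucene.index.TermDocs;

  // Sketch: exact-term lookup via TermDocs, skipping query and scoring overhead.
  public Integer getByTermDocs(String word) throws IOException {
    TermDocs td = isearcher.getIndexReader().termDocs(new Term(WORD, word));
    try {
      if (!td.next()) {
        return null;  // no document indexed for this word
      }
      // One stored-field read to recover the id of the first match.
      return Integer.valueOf(isearcher.doc(td.doc()).get(ID));
    } finally {
      td.close();
    }
  }

If profiling shows the stored-field read dominating, FieldCache.DEFAULT.getInts
on the ID field could serve the ids straight out of an array instead, at the
cost of keeping them all in memory.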

On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
> be very efficient.  Plus, there is virtually no code to write.  I've seen
> bare bones indexes be as little as 20% of the original w/ very fast lookup.
>  Furthermore, there are many options available for controlling how much is
> loaded into memory, etc.  Finally, it will handle all the languages you
> throw at it.
>
> -Grant
>
> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>
> > Currently Java strings use double the space of the characters in them
> > because it's all UTF-16. A 190MB dictionary file therefore uses around
> > 600MB when loaded into a HashMap<String, Integer>. Is there some
> > optimization we could do in terms of storing them, while ensuring that
> > Chinese, Devanagari and other characters don't get messed up in the
> > process?
> >
> > Some options Benson suggested were: storing just the byte[] form and
> > adding the option of supplying the hash function in
> > OpenObjectIntHashmap, or even using a UTF-8 string.
> >
> > Or we could leave this alone. I currently estimate the memory
> > requirement using the formula 8 * ((int) (num_chars * 2 + 45) / 8)
> > for strings when generating the dictionary split for the vectorizer.
> >
> > Robin
>
>
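
For what it's worth, Robin's estimate as a runnable sketch (the helper name
is mine), with one worked case in the comment:

  // Hypothetical helper mirroring the estimate above; the int division
  // truncates, so a 10-char word yields 8 * (65 / 8) = 8 * 8 = 64 bytes.
  static long estimatedStringBytes(int numChars) {
    return 8L * ((numChars * 2 + 45) / 8);
  }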
