Here is my attempt at a dictionary lookup using Lucene; I need some pointers
on optimising it. Currently it takes 30 seconds for a million lookups against
a dictionary of 500K words, about 30x slower than a HashMap. Memory use looks
almost the same as the HashMap's, as far as I can tell from the process
manager.
private static final String ID = "id";
private static final String WORD = "word";

private IndexWriter iwriter;
private IndexSearcher isearcher;
private RAMDirectory idx = new RAMDirectory();
private Analyzer analyzer = new WhitespaceAnalyzer();

public void init() throws Exception {
    this.iwriter = new IndexWriter(idx, analyzer, true,
            IndexWriter.MaxFieldLength.LIMITED);
}

public void destroy() throws Exception {
    iwriter.close();
    isearcher.close();
}

public void ready() throws Exception {
    iwriter.optimize();
    iwriter.close();
    this.isearcher = new IndexSearcher(idx, true);
}

public void addToDictionary(String word, Integer id) throws IOException {
    Document doc = new Document();
    doc.add(new Field(WORD, word, Field.Store.NO,
            Field.Index.NOT_ANALYZED));
    // ?? Is there a way other than storing the id as a string?
    doc.add(new Field(ID, id.toString(), Field.Store.YES,
            Field.Index.NOT_ANALYZED));
    iwriter.addDocument(doc);
}

public Integer get(String word) throws IOException, ParseException {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
    TopDocs top = isearcher.search(query, null, 1);
    ScoreDoc[] hits = top.scoreDocs;
    if (hits.length == 0) {
        return null;
    }
    return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
}
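On the ?? above and on lookup speed, one untested idea (Lucene 3.x API
assumed) is to skip both the stored field and the query machinery. If the
ids are 0-based and words are added in id order, then after optimize() the
Lucene docID *is* the dictionary id, and TermDocs gives a direct
term-to-docID lookup with no scoring, no stored-field load, and no
String-to-Integer parse:

```java
// Untested sketch, Lucene 3.x API assumed. Preconditions: ids are 0-based,
// words are added in id order, there are no deletions, and optimize() has
// run so the index is a single segment (docIDs then match insertion order).
public void addToDictionary(String word) throws IOException {
    Document doc = new Document();
    doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
    iwriter.addDocument(doc); // docID == dictionary id
}

public Integer get(String word) throws IOException {
    // TermDocs walks the term dictionary directly: no BooleanQuery wrapper,
    // no scoring, no Document load.
    TermDocs td = isearcher.getIndexReader().termDocs(new Term(WORD, word));
    try {
        return td.next() ? Integer.valueOf(td.doc()) : null;
    } finally {
        td.close();
    }
}
```

If the ids can't be made sequential, NumericField (Lucene 2.9+) would at
least avoid the string round-trip, though I suspect the stored-field load in
get() is the bigger cost.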
On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <[email protected]> wrote:
> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
> be very efficient. Plus, there is virtually no code to write. I've seen
> bare bones indexes be as little as 20% of the original w/ very fast lookup.
> Furthermore, there are many options available for controlling how much is
> loaded into memory, etc. Finally, it will handle all the languages you
> throw at it.
>
> -Grant
>
> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>
> > Currently Java strings use double the space of the characters in them
> > because it's all UTF-16. A 190MB dictionary file therefore uses around
> > 600MB when loaded into a HashMap<String, Integer>. Is there some
> > optimization we could do in terms of storing them while ensuring that
> > Chinese, Devanagari and other characters don't get messed up in the
> > process?
> >
> > Some options Benson suggested were: storing just the byte[] form and
> > adding the option of supplying the hash function in OpenObjectIntHashmap,
> > or even using a UTF-8 string.
> >
> > Or we could leave this alone. I currently estimate the memory requirement
> > using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
> > generating the dictionary split for the vectorizer.
> >
> > Robin
>
>
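On the byte[]/UTF-8 option quoted above, here is a quick sanity check of the
sizes involved (plain Java; the class and method names are mine, and the
estimate method is just the formula from my earlier mail):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Size {
    // Per-string heap estimate from the earlier mail:
    // 8 * ((int) (num_chars * 2 + 45) / 8)
    static int stringEstimate(int numChars) {
        return 8 * ((numChars * 2 + 45) / 8);
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII word: 10 UTF-16 code units -> ~64 bytes estimated on heap,
        // but only 10 bytes as UTF-8.
        System.out.println(stringEstimate("dictionary".length())); // 64
        System.out.println(utf8Bytes("dictionary"));               // 10

        // CJK word: UTF-8 costs 3 bytes per character vs 2 in UTF-16,
        // so the byte[] trick helps ASCII-heavy dictionaries most.
        System.out.println(utf8Bytes("\u8bcd\u5178")); // 6
    }
}
```

So a UTF-8 byte[] would roughly halve the character payload for ASCII
dictionaries, but grows it by about 50% for Chinese; the fixed per-object
overhead (the +45 in the estimate) is what dominates for short words either
way.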