Here is my attempt at a dictionary lookup using Lucene. I need some pointers on optimizing it. Currently it takes 30 seconds for a million lookups against a dictionary of 500K words, about 30x slower than a HashMap. But the space used looks almost the same, as far as I can tell from the memory sizes in the process manager.
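(For anyone who wants to reproduce the comparison, a minimal timing harness along these lines should do. It is not part of the original measurement; the class name LuceneDictionary, the generated words, and the counts are made up, and it assumes the dictionary class pasted below.)

import java.util.HashMap;
import java.util.Map;

public class LookupBenchmark {
  public static void main(String[] args) throws Exception {
    // Build both structures with the same 500K synthetic words.
    LuceneDictionary dict = new LuceneDictionary();
    dict.init();
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (int i = 0; i < 500000; i++) {
      String word = "word" + i;
      dict.addToDictionary(word, i);
      map.put(word, i);
    }
    dict.ready();

    // One million lookups against the Lucene-backed dictionary.
    long start = System.currentTimeMillis();
    for (int i = 0; i < 1000000; i++) {
      dict.get("word" + (i % 500000));
    }
    System.out.println("lucene:  " + (System.currentTimeMillis() - start) + " ms");

    // Same million lookups against the HashMap.
    start = System.currentTimeMillis();
    for (int i = 0; i < 1000000; i++) {
      map.get("word" + (i % 500000));
    }
    System.out.println("hashmap: " + (System.currentTimeMillis() - start) + " ms");
  }
}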
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class LuceneDictionary {

  private static final String ID = "id";
  private static final String WORD = "word";

  private IndexWriter iwriter;
  private IndexSearcher isearcher;
  private RAMDirectory idx = new RAMDirectory();
  private Analyzer analyzer = new WhitespaceAnalyzer();

  // Open a writer on a fresh in-memory index.
  public void init() throws Exception {
    this.iwriter = new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  }

  public void destroy() throws Exception {
    iwriter.close();
    isearcher.close();
  }

  // Call once after all words have been added: optimize, close the writer,
  // and open a read-only searcher.
  public void ready() throws Exception {
    iwriter.optimize();
    iwriter.close();
    this.isearcher = new IndexSearcher(idx, true);
  }

  public void addToDictionary(String word, Integer id) throws IOException {
    Document doc = new Document();
    doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
    // ?? Is there a way other than storing the id as a String?
    doc.add(new Field(ID, id.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    iwriter.addDocument(doc);
  }

  public Integer get(String word) throws IOException, ParseException {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
    TopDocs top = isearcher.search(query, null, 1);
    ScoreDoc[] hits = top.scoreDocs;
    if (hits.length == 0) {
      return null;
    }
    return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
  }
}

On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> A Lucene index, with no storage and with positions etc. (optionally) turned
> off, will be very efficient. Plus, there is virtually no code to write. I've
> seen bare-bones indexes be as little as 20% of the original, with very fast
> lookup. Furthermore, there are many options available for controlling how
> much is loaded into memory, etc. Finally, it will handle all the languages
> you throw at it.
>
> -Grant
>
> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>
> > Currently Java strings use double the space of the characters in them,
> > because it's all UTF-16. A 190MB dictionary file therefore uses around
> > 600MB when loaded into a HashMap<String, Integer>. Is there some
> > optimization we could do in terms of storing them, while ensuring that
> > Chinese, Devanagari and other characters don't get mangled in the process?
> >
> > Some options Benson suggested were: storing just the byte[] form and
> > adding the option of supplying the hash function in OpenObjectIntHashMap,
> > or even using a UTF-8 string.
> >
> > Or we could leave this alone. I currently estimate the memory requirement
> > for strings using the formula 8 * ((int) (num_chars * 2 + 45) / 8) when
> > generating the dictionary split for the vectorizer.
> >
> > Robin
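Following up on Grant's suggestion, here is a sketch of what a leaner version might look like, assuming Lucene 2.9/3.0-era APIs: index both fields with NOT_ANALYZED_NO_NORMS and term freqs/positions omitted, drop the stored field entirely, and resolve a word to its id via the FieldCache plus a raw TermDocs seek instead of a BooleanQuery search. This also answers the inline question above: the id never needs to be fetched back as a String, because FieldCache.getInts() un-inverts the indexed id terms into an int[] once, up front. This is untested, so treat it as a starting point rather than a drop-in replacement. The methods below would replace the corresponding ones in the class above and assume its WORD/ID constants, iwriter, and idx fields.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.FieldCache;

private IndexReader reader;
private int[] ids;  // Lucene docId -> dictionary id

public void addToDictionary(String word, Integer id) throws IOException {
  Document doc = new Document();
  Field w = new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS);
  w.setOmitTermFreqAndPositions(true);  // per Grant: freqs/positions off
  doc.add(w);
  // The id is still written as a string term, but never stored; the
  // FieldCache parses the terms back into an int[] once, in ready().
  Field i = new Field(ID, id.toString(), Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS);
  i.setOmitTermFreqAndPositions(true);
  doc.add(i);
  iwriter.addDocument(doc);
}

public void ready() throws Exception {
  iwriter.optimize();
  iwriter.close();
  this.reader = IndexReader.open(idx, true);            // read-only reader
  this.ids = FieldCache.DEFAULT.getInts(reader, ID);    // one-time un-invert
}

public Integer get(String word) throws IOException {
  TermDocs td = reader.termDocs(new Term(WORD, word));  // exact-term seek
  try {
    return td.next() ? Integer.valueOf(ids[td.doc()]) : null;
  } finally {
    td.close();
  }
}

The TermDocs seek skips scoring and BooleanQuery overhead entirely, and the int[] from the FieldCache replaces the per-lookup stored-document fetch, which is usually the dominant cost in a tight loop like this.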