On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> On the indexing side, add in batches and reuse the document and fields.

Done. That squeezed out 5 secs there (30 down to 25), and increasing max
merge docs brought it further down to 22.

> On the search side, no need for a BooleanQuery and no need for scoring, so
> you will likely want your own Collector (dead simple to write).

Brought it down to 15 secs from 30 for 1 mil lookups, using a TermQuery and
a Collector that is instantiated once and reused. It looks roughly like the
sketch below.
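(A minimal sketch only; the class name FirstDocCollector and the exact
shape are mine, not committed code. It targets the Lucene 2.9/3.0 Collector
API used elsewhere in this thread.)

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;

  // Non-scoring collector that records the first matching (global) doc id.
  public class FirstDocCollector extends Collector {
    private int docBase;          // doc id offset of the current segment
    private int matchingDoc = -1; // -1 means "no hit yet"

    @Override
    public void setScorer(Scorer scorer) {
      // Scoring is skipped entirely; the scorer is never consulted.
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
    }

    @Override
    public void collect(int doc) {
      if (matchingDoc == -1) {
        matchingDoc = docBase + doc; // segment-local id -> global id
      }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
      return true; // order is irrelevant for an exact single-term lookup
    }

    public int getMatchingDoc() { return matchingDoc; }

    public void reset() { matchingDoc = -1; } // lets one instance be reused
  }

The lookup is then isearcher.search(new TermQuery(new Term(WORD, word)),
collector) followed by collector.getMatchingDoc(), calling reset() between
lookups; the BooleanQuery wrapper is gone entirely.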
> It _MAY_ even be faster to simply do the indexing as a word w/ the id as a
> payload and then use TermPositions (and no query at all) and forgo
> searching altogether. Then you just need an IndexReader. First search will
> always be slow, unless you "warm" it first. This should help avoid the
> cost of going to document storage, which is almost always the most
> expensive thing one does in Lucene due to its random nature. Might even be
> beneficial to be able to retrieve IDs in batches (sorted
> lexicographically, too).

Since all the words already have unique ids, I don't think there is any
need for assigning ids at all; I will reuse the Lucene document id. Testing
shows that this decreased index time to 13 sec and lookup time to 11 sec.
But I still don't get the "not searching" part; I will take a look at
TermPositions and how it's done. Is the idea something like the sketch
below?
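(A sketch only, assuming the Lucene 2.9/3.0 IndexReader/TermDocs API.
Since each word is its own document and the doc id is the id, plain
TermDocs should be enough here; TermPositions with a payload would only be
needed if the ids were assigned independently of the doc ids. WORD is the
field name from the code quoted further down.)

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  // Looks up a word by seeking it directly in the term dictionary and
  // returning the Lucene doc id of its single posting -- no Query, no
  // Searcher, no scoring involved.
  public Integer get(IndexReader reader, String word) throws IOException {
    TermDocs termDocs = reader.termDocs(new Term(WORD, word));
    try {
      return termDocs.next() ? Integer.valueOf(termDocs.doc()) : null;
    } finally {
      termDocs.close();
    }
  }

For batched lookups a single TermDocs could be reused via seek(new
Term(WORD, word)) instead of opening a new one per call. Doc ids are only
stable while the (optimized) index is not modified, which holds here since
the dictionary is built once and then only read.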
> Don't get me wrong, it will likely be slower than a hash map, but the hash
> map won't scale, and the Lucene term dictionary is delta encoded, so it
> will compress a fair amount. Also, as you grow, you will need to use an
> FSDirectory.

I still haven't measured the size difference for what I was doing
previously, but after removing the ID field I get a 1/3 saving (220 MB) for
a 5 million word dictionary compared to a HashMap. With 5 mil words and
10 mil lookups, the HashMap is 4x faster on add and 6x faster on lookup.
The in-memory Lucene dict gives around 100K lookups per second, which is
about 1 MB/s for 10-byte tokens -- a fair way off the 50 MB/s disk speed
limit. Then again, it only needs to match the speed at which the Lucene
Analyzer processes tokens.

> -Grant
>
> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>
> > Here is my attempt at making a dictionary lookup using Lucene. Need some
> > pointers on optimising. Currently it takes 30 secs for a million lookups
> > using a dictionary of 500K words, about 30x that of a hashmap. But the
> > space used looks almost the same, as far as I can see from the memory
> > sizes in the process manager.
> >
> >   private static final String ID = "id";
> >   private static final String WORD = "word";
> >   private IndexWriter iwriter;
> >   private IndexSearcher isearcher;
> >   private RAMDirectory idx = new RAMDirectory();
> >   private Analyzer analyzer = new WhitespaceAnalyzer();
> >
> >   public void init() throws Exception {
> >     this.iwriter =
> >         new IndexWriter(idx, analyzer, true,
> >                         IndexWriter.MaxFieldLength.LIMITED);
> >   }
> >
> >   public void destroy() throws Exception {
> >     iwriter.close();
> >     isearcher.close();
> >   }
> >
> >   public void ready() throws Exception {
> >     iwriter.optimize();
> >     iwriter.close();
> >
> >     this.isearcher = new IndexSearcher(idx, true);
> >   }
> >
> >   public void addToDictionary(String word, Integer id) throws IOException {
> >     Document doc = new Document();
> >     doc.add(new Field(WORD, word, Field.Store.NO,
> >         Field.Index.NOT_ANALYZED));
> >     doc.add(new Field(ID, id.toString(), Store.YES,
> >         Field.Index.NOT_ANALYZED));
> >     // ?? Is there a way other than storing the id as a string?
> >     iwriter.addDocument(doc);
> >   }
> >
> >   public Integer get(String word) throws IOException, ParseException {
> >     BooleanQuery query = new BooleanQuery();
> >     query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
> >     TopDocs top = isearcher.search(query, null, 1);
> >     ScoreDoc[] hits = top.scoreDocs;
> >     if (hits.length == 0) return null;
> >     return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
> >   }
> >
> > On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> >> A Lucene index, w/ no storage, positions, etc. (optionally) turned off
> >> will be very efficient. Plus, there is virtually no code to write. I've
> >> seen bare-bones indexes be as little as 20% of the original, w/ very
> >> fast lookup. Furthermore, there are many options available for
> >> controlling how much is loaded into memory, etc. Finally, it will handle
> >> all the languages you throw at it.
> >>
> >> -Grant
> >>
> >> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
> >>
> >>> Currently Java strings use double the space of the characters in them,
> >>> because it is all UTF-16. A 190MB dictionary file therefore uses around
> >>> 600MB when loaded into a HashMap<String, Integer>. Is there some
> >>> optimization we could do in terms of storing them, while ensuring that
> >>> Chinese, Devanagari and other characters don't get messed up in the
> >>> process?
> >>>
> >>> Some options Benson suggested were: storing just the byte[] form and
> >>> adding the option of supplying the hash function in
> >>> OpenObjectIntHashMap, or even using a UTF-8 string.
> >>>
> >>> Or we could leave this alone. I currently estimate the memory
> >>> requirement using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for
> >>> strings when generating the dictionary split for the vectorizer.
> >>>
> >>> Robin
> >>
> >>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>