On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll <[email protected]> wrote:
> On the indexing side, add in batches and reuse the document and fields.
>
Done. Squeezed out 5 secs there, 25 from 30, and got it further down to 22 by
increasing max merge docs.
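
In case it helps, the reuse part boils down to something like this (a sketch
against the 3.0-era API, not the exact code; `words` stands in for the token
source, and the exact maxMergeDocs value I used may differ):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Reuse one Document/Field pair across addDocument() calls instead of
    // allocating fresh objects per word; Field.setValue() swaps the value.
    void addAll(IndexWriter iwriter, Iterable<String> words) throws IOException {
      iwriter.setMaxMergeDocs(Integer.MAX_VALUE); // assumption: any large cap works
      Document doc = new Document();
      Field wordField = new Field("word", "", Field.Store.NO, Field.Index.NOT_ANALYZED);
      doc.add(wordField);
      for (String word : words) {
        wordField.setValue(word); // swap the value, keep the objects
        iwriter.addDocument(doc);
      }
    }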
>
> On the search side, no need for a BooleanQuery and no need for scoring, so
> you will likely want your own Collector (dead simple to write).
>
Brought it down to 15 secs from 30 for 1 mil lookups using a TermQuery and a
Collector which is instantiated just once.
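
The lookup path is now roughly (a sketch against the 2.9/3.0 Collector API,
not my exact class):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Records the matching doc id without any scoring; with unique words
    // there is at most one hit. Reset found to -1 before reusing the instance.
    class IdCollector extends Collector {
      private int docBase;
      int found = -1;

      public void setScorer(Scorer scorer) {} // scores are never read
      public void collect(int doc) { found = docBase + doc; }
      public void setNextReader(IndexReader reader, int docBase) { this.docBase = docBase; }
      public boolean acceptsDocsOutOfOrder() { return true; } // order doesn't matter here
    }

Each lookup is then isearcher.search(new TermQuery(new Term(WORD, word)), collector).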
>
> It _MAY_ even be faster to simply do the indexing as a word w/ the id as a
> payload and then use TermPositions (and no query at all) and forgo searching
> altogether. Then you just need an IndexReader. First search will always
> be slow, unless you "warm" it first. This should help avoid the cost of
> going to document storage, which is almost always the most expensive thing
> one does in Lucene due to its random nature. Might even be beneficial to be
> able to retrieve IDs in batches (sorted lexicographically, too).
>
Since all the words have unique ids, I don't think there is any need for
assigning ids. Will re-use the Lucene document id.
Testing shows this decreased index time to 13 sec and lookup time to 11 sec.
But I still don't get the "not searching" part. Will take a look at
TermPositions and how it's done.
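
From a first read, with the doc id doubling as the dictionary id the lookup
needs no Query or Searcher at all, just the reader's postings (a sketch,
please correct me if I have the TermDocs usage wrong):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Walk the term's postings directly; the first (and only) doc id is the answer.
    Integer get(IndexReader reader, String word) throws IOException {
      TermDocs termDocs = reader.termDocs(new Term("word", word));
      try {
        return termDocs.next() ? termDocs.doc() : null;
      } finally {
        termDocs.close();
      }
    }

TermPositions would only come in if the id were stored as a payload instead.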
>
> Don't get me wrong, it will likely be slower than a hash map, but the hash
> map won't scale and the Lucene term dictionary is delta encoded, so it will
> compress a fair amount. Also, as you grow, you will need to use an
> FSDirectory.
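
(For when that time comes, the switch should just be the 2.9+ factory method, e.g.

    import java.io.File;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Swap the RAMDirectory for an on-disk index once the dictionary outgrows
    // the heap; the path here is only illustrative.
    Directory idx = FSDirectory.open(new File("/tmp/dictionary-index"));

with the rest of the code unchanged.)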
I still haven't seen the size diff for what I was doing previously. But after
I removed the ID field I get a 1/3 saving (220MB) for a 5 million word
dictionary as compared to a HashMap.
With 5 mil words and 10 mil lookups, the HashMap is 4x faster in add and 6x
faster in lookup.
The in-memory Lucene dict gives around 100K lookups per second, which is like
1MB/s for 10-byte tokens, still a fair way from the 50MB/s disk speed limit.
Then again, it only needs to match the speed at which the Lucene Analyzer
processes tokens.
> -Grant
>
> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>
> > Here is my attempt at making a dictionary lookup using Lucene. Need some
> > pointers on optimising. Currently it takes 30 secs for a million lookups
> > using a dictionary of 500K words, about 30x that of a HashMap. But the
> > space used is almost the same as far as I can tell from the memory sizes
> > (in the process manager).
> >
> >
> >   private static final String ID = "id";
> >   private static final String WORD = "word";
> >   private IndexWriter iwriter;
> >   private IndexSearcher isearcher;
> >   private RAMDirectory idx = new RAMDirectory();
> >   private Analyzer analyzer = new WhitespaceAnalyzer();
> >
> >   public void init() throws Exception {
> >     this.iwriter =
> >         new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
> >   }
> >
> >   public void destroy() throws Exception {
> >     iwriter.close();
> >     isearcher.close();
> >   }
> >
> >   public void ready() throws Exception {
> >     iwriter.optimize();
> >     iwriter.close();
> >     this.isearcher = new IndexSearcher(idx, true);
> >   }
> >
> >   public void addToDictionary(String word, Integer id) throws IOException {
> >     Document doc = new Document();
> >     doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
> >     doc.add(new Field(ID, id.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
> >     // ?? Is there a way other than storing the id as a string?
> >     iwriter.addDocument(doc);
> >   }
> >
> >   public Integer get(String word) throws IOException, ParseException {
> >     BooleanQuery query = new BooleanQuery();
> >     query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
> >     TopDocs top = isearcher.search(query, null, 1);
> >     ScoreDoc[] hits = top.scoreDocs;
> >     if (hits.length == 0) return null;
> >     return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
> >   }
> >
> > On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> A Lucene index, w/ no storage, positions, etc. (optionally) turned off
> >> will be very efficient. Plus, there is virtually no code to write. I've
> >> seen bare-bones indexes be as little as 20% of the original w/ very fast
> >> lookup. Furthermore, there are many options available for controlling how
> >> much is loaded into memory, etc. Finally, it will handle all the
> >> languages you throw at it.
> >>
> >> -Grant
> >>
> >> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
> >>
> >>> Currently Java strings use double the space of their characters because
> >>> it's all UTF-16. A 190MB dictionary file therefore uses around 600MB
> >>> when loaded into a HashMap<String, Integer>. Is there some optimization
> >>> we could do in terms of storing them while ensuring that Chinese,
> >>> Devanagari and other characters don't get messed up in the process?
> >>>
> >>> Some options Benson suggested were: storing just the byte[] form and
> >>> adding the option of supplying the hash function in OpenObjectIntHashMap,
> >>> or even using a UTF-8 string.
> >>>
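
(A minimal sketch of the byte[] option, hypothetical code rather than
anything in Mahout today:

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    // UTF-8 bytes roughly halve the footprint for mostly-ASCII dictionaries
    // and round-trip Chinese, Devanagari, etc. losslessly.
    final class Utf8Key {
      final byte[] bytes;

      Utf8Key(String word) throws UnsupportedEncodingException {
        this.bytes = word.getBytes("UTF-8");
      }

      public int hashCode() { return Arrays.hashCode(bytes); }

      public boolean equals(Object o) {
        return o instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) o).bytes);
      }
    }

Such a key could back a HashMap<Utf8Key, Integer> until OpenObjectIntHashMap
grows a pluggable hash function.)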
> >>> Or we could leave this alone. I currently estimate the memory
> >>> requirement using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for
> >>> strings when generating the dictionary split for the vectorizer.
> >>>
> >>> Robin
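
(Taking that formula at face value, with Java integer division it rounds each
string's footprint down to a multiple of 8 bytes:

    // e.g. a 10-char word: 8 * ((int) (10 * 2 + 45) / 8) = 8 * (65 / 8) = 64 bytes.
    static long estimatedStringBytes(int numChars) {
      return 8L * ((numChars * 2 + 45) / 8);
    }
)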
> >>
> >>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>