On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> On the indexing side, add in batches and reuse the document and fields.
>
Done. Squeezed out 5 secs there, from 30 down to 25, and further down to 22 by
increasing max merge docs.

>
> On the search side, no need for a BooleanQuery and no need for scoring, so
> you will likely want your own Collector (dead simple to write).
>
Brought it down to 15 secs from 30 for 1 mil lookups, using a TermQuery and a
Collector that is instantiated once.
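For reference, a scoring-free, reusable collector along these lines might look
like the following sketch (assumes the Lucene 3.0-era Collector API; the class
name is mine):

```java
// Sketch: collects only the first matching doc id; scores are never computed.
// Reuse one instance across lookups via reset() instead of allocating per query.
public class FirstDocCollector extends Collector {
  private int docBase;
  private int firstDoc = -1;

  @Override public void setScorer(Scorer scorer) { /* scoring not needed */ }
  @Override public void collect(int doc) {
    if (firstDoc < 0) firstDoc = docBase + doc;
  }
  @Override public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;  // doc ids passed to collect() are segment-relative
  }
  @Override public boolean acceptsDocsOutOfOrder() { return true; }

  public int getFirstDoc() { return firstDoc; }
  public void reset() { firstDoc = -1; }
}
```

Then `isearcher.search(new TermQuery(new Term(WORD, word)), collector)` followed
by `collector.getFirstDoc()` replaces the TopDocs round trip.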


>
> It _MAY_ even be faster to simply do the indexing as a word w/ the id as a
> payload and then use TermPositions (and no query at all) and forgo searching
> altogether.  Then you just need an IndexReader.  First search will always
> be slow, unless you "warm" it first.  This should help avoid the cost of
> going to document storage, which is almost always the most expensive thing
> one does in Lucene due to its random nature.  Might even be beneficial to be
> able to retrieve IDs in batches (sorted lexicographically, too).
>

Since all the words have unique ids, I don't think there is any need for
assigning ids; I will re-use the Lucene document id.
Testing shows that it decreased index time to 13 secs and lookup time to 11
secs.

But I still don't get the "not searching" part. Will take a look at
TermPositions and how it's done.
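The "not searching" approach Grant describes could look something like this
sketch (Lucene 3.0-era IndexReader API; assumes one document per word, so the
Lucene doc id doubles as the dictionary id as discussed above):

```java
// Sketch: look a word up straight from the term dictionary via TermDocs,
// with no Query, Collector, or scoring involved.
public Integer get(IndexReader reader, String word) throws IOException {
  TermDocs termDocs = reader.termDocs(new Term(WORD, word));
  try {
    // If the term exists, its single posting's doc id is the dictionary id.
    return termDocs.next() ? termDocs.doc() : null;
  } finally {
    termDocs.close();
  }
}
```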

>
> Don't get me wrong, it will likely be slower than a hash map, but the hash
> map won't scale and the Lucene term dictionary is delta encoded, so it will
> compress a fair amount.  Also, as you grow, you will need to use an
> FSDirectory.

I still haven't seen the size diff for what I was doing previously. But after
I removed the ID field I get 1/3 savings (220MB) for a 5 million word dictionary
as compared to a HashMap.

With 5 mil words and 10 mil lookups, the HashMap is 4x faster at adds and 6x
faster at lookups.
The in-memory Lucene dict gives around 100K lookups per second, which is about
1MB/s for 10-byte tokens, a fair way from the 50MB/s disk speed limit. Then
again, it only needs to match the speed at which the Lucene Analyzer produces
tokens.
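The arithmetic behind those numbers works out roughly like this (illustrative
only; class and method names are mine):

```java
// Rough throughput arithmetic: 100K lookups/sec over 10-byte tokens is about
// 1 MB/s of token data, roughly 50x short of a ~50 MB/s sequential disk read.
public class LookupThroughput {
  static long bytesPerSec(long lookupsPerSec, long tokenBytes) {
    return lookupsPerSec * tokenBytes;
  }

  public static void main(String[] args) {
    long perSec = bytesPerSec(100000, 10);   // 1,000,000 B/s, ~1 MB/s
    long diskLimit = 50L * 1000 * 1000;      // ~50 MB/s
    System.out.println(diskLimit / perSec);  // the gap to disk speed
  }
}
```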

> -Grant
>
> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>
> > Here is my attempt at making a dictionary lookup using lucene. Need some
> > pointers in optimising. Currently it takes 30 secs for a million lookups
> > using a dictionary of 500K words, about 30x that of a hashmap. But the
> > space used is almost the same as far as I can see from memory sizes (in
> > the process manager).
> >
> >
> > private static final String ID = "id";
> >  private static final String WORD = "word";
> >  private IndexWriter iwriter;
> >  private IndexSearcher isearcher;
> >  private RAMDirectory idx = new RAMDirectory();
> >  private Analyzer analyzer = new WhitespaceAnalyzer();
> >
> >  public void init() throws Exception {
> >    this.iwriter =
> >        new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
> >
> >  }
> >
> >  public void destroy() throws Exception {
> >    iwriter.close();
> >    isearcher.close();
> >  }
> >
> >  public void ready() throws Exception {
> >    iwriter.optimize();
> >    iwriter.close();
> >
> >    this.isearcher = new IndexSearcher(idx, true);
> >  }
> >
> >  public void addToDictionary(String word, Integer id) throws IOException {
> >    Document doc = new Document();
> >    doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
> >    doc.add(new Field(ID, id.toString(), Store.YES, Field.Index.NOT_ANALYZED));
> > ?? Is there a way other than storing the id as string ?
> >    iwriter.addDocument(doc);
> >  }
> >
> >  public Integer get(String word) throws IOException, ParseException {
> >    BooleanQuery query = new BooleanQuery();
> >    query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
> >    TopDocs top = isearcher.search(query, null, 1);
> >    ScoreDoc[] hits = top.scoreDocs;
> >    if (hits.length == 0) return null;
> >    return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
> >  }
> >
> > On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> >> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
> >> be very efficient.  Plus, there is virtually no code to write.  I've seen
> >> bare bones indexes be as little as 20% of the original w/ very fast lookup.
> >> Furthermore, there are many options available for controlling how much is
> >> loaded into memory, etc.  Finally, it will handle all the languages you
> >> throw at it.
> >>
> >> -Grant
> >>
> >> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
> >>
> >>> Currently java strings use double the space of the characters in them
> >>> because it's all in utf-16. A 190MB dictionary file therefore uses around
> >>> 600MB when loaded into a HashMap<String, Integer>.  Is there some
> >>> optimization we could do in terms of storing them while ensuring that
> >>> chinese, devanagiri and other characters don't get messed up in the
> >>> process?
> >>>
> >>> Some options benson suggested were: storing just the byte[] form and
> >>> adding the option of supplying the hash function in OpenObjectIntHashmap,
> >>> or even using a UTF-8 string.
> >>>
> >>> Or we could leave this alone. I currently estimate the memory requirement
> >>> using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
> >>> generating the dictionary split for the vectorizer.
> >>>
> >>> Robin
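The per-String estimate in that formula works out like this (a quick sketch;
the class name is mine):

```java
// Illustrative check of the estimate above: per-String heap footprint
// approximated as 8 * ((int) (num_chars * 2 + 45) / 8) bytes, i.e. UTF-16
// chars plus object/array overhead, truncated to an 8-byte multiple.
public class StringMemEstimate {
  static int estimate(int numChars) {
    return 8 * ((numChars * 2 + 45) / 8);
  }

  public static void main(String[] args) {
    System.out.println(estimate(5));    // a 5-char word like "hello" -> 48 bytes
    System.out.println(estimate(10));   // a 10-char word -> 64 bytes
  }
}
```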
> >>
> >>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
