On Jan 17, 2010, at 5:34 AM, Robin Anil wrote:

> Hi Grant,
> I tried with IndexReader and got around a 2x boost in speed, i.e. around
> 200K lookups/s, versus 600K lookups/s for the hashmap.
> I can't seem to reuse the Term object, which is a major bottleneck. Also,
> TermPositions wasn't able to give me the doc id; it did give the payload
> in the form of a byte array which I have no idea how to decipher, so I
> stuck with TermDocs instead.

Right, no need for the payloads if you use the id, but can you guarantee that 
no merges take place?  I think the optimize will do that.  What do you use the 
ID for?

I still doubt it will be faster than a pure hashmap, but that wasn't your 
original question.  The tradeoff of going this route is that it scales much 
better: you get Lucene's delta compression, so it shouldn't take up as much 
memory.

Under the hood, Lucene is doing a binary search of a sublist of the terms 
and then a linear scan of at most 128 terms (by default).
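
Something like this in toy form (just the idea, not Lucene's actual code; 
the 128 comes from IndexWriter's term index interval, and the names here 
are made up):

  // terms: all terms, sorted; indexTerms: every 128th term, held in RAM.
  static int lookup(String[] terms, String[] indexTerms, String target) {
    // 1. Binary search the in-memory index for the last checkpoint <= target.
    int lo = 0, hi = indexTerms.length - 1, block = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (indexTerms[mid].compareTo(target) <= 0) { block = mid; lo = mid + 1; }
      else hi = mid - 1;
    }
    // 2. Linear scan of at most 128 terms starting at that checkpoint.
    int start = block * 128, end = Math.min(terms.length, start + 128);
    for (int i = start; i < end; i++) {
      if (terms[i].equals(target)) return i;
    }
    return -1;  // not in the dictionary
  }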

> 
> Here is the code
> 
>  private static final String WORD = "word";
>  private IndexWriter iwriter;
>  private IndexReader ireader;
>  private RAMDirectory idx = new RAMDirectory();
>  private Analyzer analyzer = new KeywordAnalyzer();
>  private Document doc = new Document();
>  private Field wordField =
>      new Field(WORD, "", Field.Store.NO,
> Field.Index.NOT_ANALYZED_NO_NORMS);
>  private Term queryTerm = new Term(WORD, "");
> 
>  public void readyForImport() throws Exception {
>    this.iwriter =
>        new IndexWriter(idx, analyzer, true, new NoDeletionPolicy(),
>            IndexWriter.MaxFieldLength.LIMITED);
>    this.iwriter.setMaxFieldLength(200);
>    this.iwriter.setMaxMergeDocs(10000000);
>    this.iwriter.setUseCompoundFile(false);
>    doc.add(wordField);
>  }
> 
>  public void destroy() throws Exception {
>    ireader.close();
>    iwriter.close();
>  }
> 
>  public void readyForRead() throws Exception {
>    iwriter.optimize();
>    iwriter.close();
>    this.ireader = IndexReader.open(idx, true);
>  }
> 
>  public void addToDictionary(String word, int id) throws IOException {
>    if (id < 0) throw new IllegalArgumentException("ID cannot be negative");
>    // Relies on words arriving in id order, so the Lucene doc id equals id.
>    wordField.setValue(word);
>    iwriter.addDocument(doc);
>  }
> 
>  public int get(String word) throws IOException {
>    Term t = queryTerm.createTerm(word);
>    TermDocs docs = ireader.termDocs(t);
>    try {
>      return docs.next() ? docs.doc() : -1;
>    } finally {
>      docs.close();  // TermDocs holds resources and should be closed
>    }
>  }
> 
> 
> 
> 
> On Sun, Jan 17, 2010 at 6:36 AM, Robin Anil <robin.a...@gmail.com> wrote:
> 
>> 
>> On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll <gsing...@apache.org>wrote:
>> 
>>> On the indexing side, add in batches and reuse the document and fields.
>>> 
>> Done. That squeezed out 5 secs there, 25 from 30, and further down to 22
>> by increasing max merge docs.
>> 
>>> 
>>> On the search side, no need for a BooleanQuery and no need for scoring, so
>>> you will likely want your own Collector (dead simple to write).
>>> 
>> Brought it down to 15 secs from 30 for 1 mil lookups, using a TermQuery
>> and a Collector that is instantiated only once.
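
Something along these lines is what I had in mind; an untested sketch 
against the 3.x Collector API (FirstDocCollector is a made-up name):

  // Score-free collector that records the single matching doc id.
  // Uses org.apache.lucene.search.{Collector,Scorer} and
  // org.apache.lucene.index.IndexReader.
  class FirstDocCollector extends Collector {
    private int docBase;
    int found = -1;  // doc id of the hit, -1 if none; reset before each search

    public void setScorer(Scorer scorer) {}                 // scores unused
    public void collect(int doc) { found = docBase + doc; }
    public void setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
    }
    public boolean acceptsDocsOutOfOrder() { return true; } // order irrelevant
  }

  // Per lookup: collector.found = -1; isearcher.search(query, collector);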
>> 
>> 
>>> 
>>> It _MAY_ even be faster to simply do the indexing as a word w/ the id as
>>> a payload and then use TermPositions (and no query at all) and forgo
>>> searching altogether.  Then you just need an IndexReader.  The first
>>> search will always be slow, unless you "warm" it first.  This should help
>>> avoid the cost of going to document storage, which is almost always the
>>> most expensive thing one does in Lucene due to its random nature.  Might
>>> even be beneficial to be able to retrieve IDs in batches (sorted
>>> lexicographically, too).
>>> 
>> 
>> Since all the words have unique ids, I don't think there is any need for
>> assigning ids. I will re-use the Lucene document id.
>> Testing shows that it decreased index time to 13 secs and lookup time to
>> 11 secs.
>> 
>> But I still don't get the "not searching" part. Will take a look at
>> TermPositions and how it's done.
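
Roughly this; an untested 3.x sketch (getting the payload in at index time 
needs a small custom TokenStream that sets PayloadAttribute, omitted here):

  // Index side: pack the id into 4 bytes and attach it as the payload.
  byte[] b = { (byte)(id >>> 24), (byte)(id >>> 16), (byte)(id >>> 8), (byte) id };
  Payload payload = new Payload(b);  // org.apache.lucene.index.Payload

  // Lookup side: no Query, no Searcher, just the postings.
  public int get(String word) throws IOException {
    TermPositions tp = ireader.termPositions(queryTerm.createTerm(word));
    try {
      if (!tp.next()) return -1;
      tp.nextPosition();  // must advance to a position before reading payload
      byte[] p = tp.getPayload(new byte[4], 0);
      return ((p[0] & 0xFF) << 24) | ((p[1] & 0xFF) << 16)
           | ((p[2] & 0xFF) << 8) | (p[3] & 0xFF);
    } finally {
      tp.close();
    }
  }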
>> 
>>> 
>>> Don't get me wrong, it will likely be slower than a hash map, but the hash
>>> map won't scale and the Lucene term dictionary is delta encoded, so it will
>>> compress a fair amount.  Also, as you grow, you will need to use an
>>> FSDirectory.
>> 
>> I still haven't seen the size diff for what I was doing previously. But
>> after I removed the ID field I get 1/3 savings (220MB) for a
>> 5-million-word dictionary as compared to a HashMap.
>> 
>> With 5 mil words and 10 mil lookups, the HashMap is 4x faster in add and
>> 6x faster in lookup.
>> The in-memory Lucene dict gives around 100K lookups per second, which is
>> like 1MB/s for 10-byte tokens, a fair way from the 50MB/s disk speed
>> limit. Then again, it just needs to match the speed at which the Lucene
>> Analyzer produces tokens.
>> 
>>> -Grant
>>> 
>>> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>>> 
>>>> Here is my attempt at making a dictionary lookup using Lucene. Need some
>>>> pointers on optimising. Currently it takes 30 secs for a million lookups
>>>> using a dictionary of 500K words, about 30x that of a hashmap. But the
>>>> space used is almost the same, as far as I can tell from memory sizes
>>>> (in the process manager).
>>>> 
>>>> 
>>>> private static final String ID = "id";
>>>> private static final String WORD = "word";
>>>> private IndexWriter iwriter;
>>>> private IndexSearcher isearcher;
>>>> private RAMDirectory idx = new RAMDirectory();
>>>> private Analyzer analyzer = new WhitespaceAnalyzer();
>>>> 
>>>> public void init() throws Exception {
>>>>   this.iwriter =
>>>>       new IndexWriter(idx, analyzer, true,
>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>> 
>>>> }
>>>> 
>>>> public void destroy() throws Exception {
>>>>   iwriter.close();
>>>>   isearcher.close();
>>>> }
>>>> 
>>>> public void ready() throws Exception {
>>>>   iwriter.optimize();
>>>>   iwriter.close();
>>>> 
>>>>   this.isearcher = new IndexSearcher(idx, true);
>>>> }
>>>> 
>>>> public void addToDictionary(String word, Integer id) throws IOException {
>>>> 
>>>>   Document doc = new Document();
>>>>   doc.add(new Field(WORD, word, Field.Store.NO,
>>>> Field.Index.NOT_ANALYZED));
>>>>   doc.add(new Field(ID, id.toString(), Store.YES,
>>>> Field.Index.NOT_ANALYZED));
>>>>   // ?? Is there a way other than storing the id as a string ?
>>>>   iwriter.addDocument(doc);
>>>> }
>>>> 
>>>> public Integer get(String word) throws IOException, ParseException {
>>>>   BooleanQuery query = new BooleanQuery();
>>>>   query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>>>>   TopDocs top = isearcher.search(query, null, 1);
>>>>   ScoreDoc[] hits = top.scoreDocs;
>>>>   if (hits.length == 0) return null;
>>>>   return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
>>>> }
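
(On the "store the id as a string" question in the code: Lucene 2.9+ has 
NumericField, so something like the untested line below should work at 
index time, though as discussed above you may not need a stored id at all:)

  // Alternative to id.toString(): index/store the id as a numeric field.
  doc.add(new NumericField(ID, Field.Store.YES, true).setIntValue(id));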
>>>> 
>>>> On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org>
>>>> wrote:
>>>> 
>>>>> A Lucene index, w/ no storage and w/ positions, etc. (optionally)
>>>>> turned off, will be very efficient.  Plus, there is virtually no code
>>>>> to write.  I've seen bare-bones indexes be as little as 20% of the
>>>>> original, w/ very fast lookup.  Furthermore, there are many options
>>>>> available for controlling how much is loaded into memory, etc.
>>>>> Finally, it will handle all the languages you throw at it.
>>>>> 
>>>>> -Grant
>>>>> 
>>>>> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>>>>> 
>>>>>> Currently Java strings use double the space of the characters in them
>>>>>> because it's all UTF-16. A 190MB dictionary file therefore uses around
>>>>>> 600MB when loaded into a HashMap<String, Integer>. Is there some
>>>>>> optimization we could do in terms of storing them, while ensuring that
>>>>>> Chinese, Devanagari and other characters don't get messed up in the
>>>>>> process?
>>>>>> 
>>>>>> Some options Benson suggested were: storing just the byte[] form and
>>>>>> adding the option of supplying the hash function in
>>>>>> OpenObjectIntHashMap, or even using a UTF-8 string.
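
(A quick sketch of the byte[] idea; Utf8Key is a made-up wrapper class, 
needed because raw arrays hash by identity:)

  // Wraps UTF-8 bytes so a plain HashMap hashes and compares by content.
  final class Utf8Key {
    final byte[] bytes;
    Utf8Key(String s) throws java.io.UnsupportedEncodingException {
      this.bytes = s.getBytes("UTF-8");  // ~1 byte/char for ASCII vs 2 in String
    }
    public int hashCode() { return java.util.Arrays.hashCode(bytes); }
    public boolean equals(Object o) {
      return o instanceof Utf8Key
          && java.util.Arrays.equals(bytes, ((Utf8Key) o).bytes);
    }
  }
  // Usage: Map<Utf8Key, Integer> dict = new HashMap<Utf8Key, Integer>();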
>>>>>> 
>>>>>> Or we could leave this alone. I currently estimate the memory
>>>>>> requirement using the formula 8 * ((int) (num_chars * 2 + 45) / 8)
>>>>>> for strings when generating the dictionary split for the vectorizer.
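
(Plugging in a 10-character word: 8 * ((int) (10*2 + 45) / 8) = 8 * 8 = 64 
bytes, versus 10 bytes of raw UTF-8 for ASCII text.)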
>>>>>> 
>>>>>> Robin
>>>>> 
>>>>> 
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>> 
>>> Search the Lucene ecosystem using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>> 
>>> 
>> 
