Mike,
For now I'm running just a SpanQuery over a ~600MB index segment,
single-threaded (one segment per thread; the complete setup is 30 segments
totalling 20GB).
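For illustration, the per-segment fan-out looks roughly like this (plain Java
sketch; searchAllSegments and countMatches are hypothetical stand-ins for my
actual per-segment query code, not anything from Lucene):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

public class PerSegmentSearch {

    // One task per segment; countMatches stands in for the real
    // per-segment SpanQuery evaluation.
    static long searchAllSegments(List<String> segments,
                                  ToLongFunction<String> countMatches) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(segments.size(), 30));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String seg : segments) {
                futures.add(pool.submit(() -> countMatches.applyAsLong(seg)));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // sum the per-segment match counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy stand-in: the "match count" is just the segment name length.
        System.out.println(
                searchAllSegments(List.of("seg1", "seg02"), String::length)); // 9
    }
}
```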
I'm trying to use Lucene for a morphologically annotated text corpus (namely,
the Russian National Corpus).
The main query type there is co-occurrence search: tokens with the desired
morphological features within a given distance of each other.
In my test case I work with a single field, grammar (it is word-level: every
word in the corpus has one). The full grammar annotation of a word is a set
of atomic grammar features.
For example, the verb "book" has in its grammar:
- POS tag (V);
- tense (pres);
and the noun "book":
- POS tag (N);
- number (sg).
In general one grammar annotation has approximately 8 atomic features.
Words are treated as initially ambiguous, so for an occurrence of the word
"book" in the text we get the grammar tokens:
V pres N sg
The 2 parses, "V,pres" and "N,sg", are indexed as independent tokens with
positionIncrement=0.
Moreover, each such token carries a parse bitmask in its payload:
V|0001 pres|0001 N|0010 sg|0010
Here, V and pres appear in the 1st parse and N and sg in the 2nd, with a
maximum of 4 parse variants per word. This lets me find the word "book" for
the query "V" & "pres" but not for the query "V" & "sg".
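The payload check itself is just a bitwise intersection of the two masks; a
minimal sketch (the names and the 4-bit mask width come from the example
above, the sameParse helper is hypothetical):

```java
public class ParseBitmask {

    // Two tokens belong to the same parse of a word iff their
    // parse bitmasks (stored in the payloads) intersect.
    static boolean sameParse(int maskA, int maskB) {
        return (maskA & maskB) != 0;
    }

    public static void main(String[] args) {
        int v = 0b0001, pres = 0b0001, sg = 0b0010;
        System.out.println(sameParse(v, pres)); // true: "V" & "pres" match
        System.out.println(sameParse(v, sg));   // false: "V" & "sg" do not
    }
}
```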
So I'm running a SpanNearQuery ("A,sg" immediately followed by "N,sg") with
position and payload checking over a 600MB segment, and I get the exact doc
hit count and the overall match count by iterating over getSpans().
This takes about 20 seconds, even with everything in RAM.
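The counting pass over the spans can be sketched like this (a toy stand-in
for the Spans stream; Hit and count are hypothetical, only the doc/match
bookkeeping mirrors what I do):

```java
import java.util.List;

public class SpanCounting {

    // Minimal stand-in for a spans stream: (doc, position) pairs in doc order.
    record Hit(int doc, int start) {}

    // Count distinct matching docs and total matches in one pass,
    // as in the getSpans() iteration.
    static int[] count(List<Hit> spans) {
        int docHits = 0, matches = 0, lastDoc = -1;
        for (Hit h : spans) {
            if (h.doc() != lastDoc) { // new document reached
                docHits++;
                lastDoc = h.doc();
            }
            matches++;
        }
        return new int[] { docHits, matches };
    }

    public static void main(String[] args) {
        int[] r = count(List.of(new Hit(1, 3), new Hit(1, 7), new Hit(4, 0)));
        System.out.println(r[0] + " docs, " + r[1] + " matches"); // 2 docs, 3 matches
    }
}
```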
The next thing I'm going to explore is compression: I'll try
DirectPostingsFormat as you suggested.
--
Best Regards,
Igor
17.10.2013, 20:26, "Michael McCandless" <[email protected]>:
> DirectPostingsFormat holds all postings in RAM, uncompressed, as
> simple java arrays. But it's quite RAM heavy...
>
> The hotspots may also be in the queries you are running ... maybe you
> can describe more how you're using Lucene?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
> <[email protected]> wrote:
>
>> Hello!
>>
>> I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both
>> perform the same for me (equally badly :( ).
>> Thus, I think my problem is not disk access (although getPayload() is
>> always at the top of the VisualVM profile).
>> So maybe the expensive part of the postings traversal is decompression?
>> Are there Lucene codecs that use light postings compression (or maybe none
>> at all)?
>>
>> And, getting back to the in-memory index topic, is lucene.codecs.memory
>> somewhat similar to RAMDirectory?
>>
>> --
>> Best Regards,
>> Igor
>>
>> 10.10.2013, 03:01, "Vitaly Funstein" <[email protected]>:
>>> I don't think you want to load indexes of this size into a RAMDirectory.
>>> The reasons have been listed multiple times here... in short, just use
>>> MMapDirectory.
>>>
>>> On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>> <[email protected]>wrote:
>>>> Hello!
>>>>
>>>> I need to perform an experiment of loading the entire index in RAM and
>>>> seeing how the search performance changes.
>>>> My index has TermVectors with payload and position info, StoredFields,
>>>> and DocValues. It takes ~30GB on disk (the server has 48GB of RAM).
>>>>
>>>> _indexDirectoryReader = DirectoryReader.open(
>>>>     new RAMDirectory(FSDirectory.open(new File(_indexDirectory)),
>>>>                      IOContext.DEFAULT));
>>>>
>>>> Is the line above the only thing I have to do to complete my goal?
>>>>
>>>> And also:
>>>> - will all the data be loaded in the RAM right after opening, or during
>>>> the reading stage?
>>>> - will the index data be stored in RAM as it is on disk, or will it be
>>>> uncompressed first?
>>>>
>>>> --
>>>> Best Regards,
>>>> Igor
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>