Mike,
For now I'm running just a SpanQuery over a ~600MB index segment,
single-threaded (one segment per thread; the complete setup is 30 segments
totalling 20GB).
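For illustration, the per-segment fan-out looks roughly like this (plain Java
sketch; searchAllSegments and countMatches are hypothetical stand-ins for my
actual per-segment query code, not anything from Lucene):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

public class PerSegmentSearch {

    // One task per segment; countMatches stands in for the real
    // per-segment SpanQuery evaluation.
    static long searchAllSegments(List<String> segments,
                                  ToLongFunction<String> countMatches) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(segments.size(), 30));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String seg : segments) {
                futures.add(pool.submit(() -> countMatches.applyAsLong(seg)));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // sum the per-segment match counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy stand-in: the "match count" is just the segment name length.
        System.out.println(
                searchAllSegments(List.of("seg1", "seg02"), String::length)); // 9
    }
}
```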
I'm trying to use Lucene for a morphologically annotated text corpus (namely,
the Russian National Corpus).
The main query type there is co-occurrence search: tokens with the desired
morphological features within a given distance of each other.
In my test case I work with a single field, grammar (it is word-level: every
word in the corpus has one). The full grammar annotation of a word is a set
of atomic grammar features.
For example, the verb "book" has in its grammar:
- POS tag (V);
- tense (pres);
and the noun "book":
- POS tag (N);
- number (sg).
In general one grammar annotation has approximately 8 atomic features.
Words are treated as initially ambiguous, so for an occurrence of the word
"book" in the text we get the grammar tokens:
V pres N sg
The 2 parses, "V,pres" and "N,sg", are indexed as independent tokens with
positionIncrement=0.
Moreover, each such token carries a parse bitmask in its payload:
V|0001 pres|0001 N|0010 sg|0010
Here, V and pres appear in the 1st parse and N and sg in the 2nd, with a
maximum of 4 parse variants per word. This lets me find the word "book" for
the query "V" & "pres" but not for the query "V" & "sg".
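The payload check itself is just a bitwise intersection of the two masks; a
minimal sketch (the names and the 4-bit mask width come from the example
above, the sameParse helper is hypothetical):

```java
public class ParseBitmask {

    // Two tokens belong to the same parse of a word iff their
    // parse bitmasks (stored in the payloads) intersect.
    static boolean sameParse(int maskA, int maskB) {
        return (maskA & maskB) != 0;
    }

    public static void main(String[] args) {
        int v = 0b0001, pres = 0b0001, sg = 0b0010;
        System.out.println(sameParse(v, pres)); // true: "V" & "pres" match
        System.out.println(sameParse(v, sg));   // false: "V" & "sg" do not
    }
}
```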
So I'm running a SpanNearQuery ("A,sg" immediately followed by "N,sg") with
position and payload checking over a 600MB segment, and I get the exact doc
hit count and the overall match count by iterating over getSpans().
This takes about 20 seconds, even with everything in RAM.
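The counting pass over the spans can be sketched like this (a toy stand-in
for the Spans stream; Hit and count are hypothetical, only the doc/match
bookkeeping mirrors what I do):

```java
import java.util.List;

public class SpanCounting {

    // Minimal stand-in for a spans stream: (doc, position) pairs in doc order.
    record Hit(int doc, int start) {}

    // Count distinct matching docs and total matches in one pass,
    // as in the getSpans() iteration.
    static int[] count(List<Hit> spans) {
        int docHits = 0, matches = 0, lastDoc = -1;
        for (Hit h : spans) {
            if (h.doc() != lastDoc) { // new document reached
                docHits++;
                lastDoc = h.doc();
            }
            matches++;
        }
        return new int[] { docHits, matches };
    }

    public static void main(String[] args) {
        int[] r = count(List.of(new Hit(1, 3), new Hit(1, 7), new Hit(4, 0)));
        System.out.println(r[0] + " docs, " + r[1] + " matches"); // 2 docs, 3 matches
    }
}
```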
The next thing I'm going to explore is compression: I'll try
DirectPostingsFormat as you suggested.
--
Best Regards,
Igor
17.10.2013, 20:26, "Michael McCandless" <[email protected]>:
> DirectPostingsFormat holds all postings in RAM, uncompressed, as
> simple java arrays. But it's quite RAM heavy...
>
> The hotspots may also be in the queries you are running ... maybe you
> can describe more how you're using Lucene?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
> <[email protected]> wrote:
>
>> Hello!
>>
>> I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both
>> perform the same for me (equally badly :( ).
>> Thus, I think my problem is not disk access (although getPayload() is
>> always at the top of the VisualVM profile).
>> So maybe the expensive part of the postings traversal is decompression?
>> Are there Lucene codecs that use light postings compression (or maybe none
>> at all)?
>>
>> And, getting back to the in-memory index topic, is lucene.codecs.memory
>> somewhat similar to RAMDirectory?
>>
>> --
>> Best Regards,
>> Igor
>>
>> 10.10.2013, 03:01, "Vitaly Funstein" <[email protected]>:
>>> I don't think you want to load indexes of this size into a RAMDirectory.
>>> The reasons have been listed multiple times here... in short, just use
>>> MMapDirectory.
>>>
>>> On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>> <[email protected]>wrote:
>>>> Hello!
>>>>
>>>> I need to perform an experiment of loading the entire index in RAM and
>>>> seeing how the search performance changes.
>>>> My index has TermVectors with payload and position info, StoredFields,
>>>> and DocValues. It takes ~30GB on disk (the server has 48GB of RAM).
>>>>
>>>> _indexDirectoryReader = DirectoryReader.open(
>>>>     new RAMDirectory(FSDirectory.open(new File(_indexDirectory)),
>>>>                      IOContext.DEFAULT));
>>>>
>>>> Is the line above the only thing I have to do to complete my goal?
>>>>
>>>> And also:
>>>> - will all the data be loaded in the RAM right after opening, or during
>>>> the reading stage?
>>>> - will the index data be stored in RAM as it is on disk, or will it be
>>>> uncompressed first?
>>>>
>>>> --
>>>> Best Regards,
>>>> Igor
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>