Hi Danil,

Thank you once again for your reply.

You are right that we always know the file we are searching; the file
location is stored in a database.

Having done some testing, it seems that the index-per-file approach yields
reasonable performance, just as you suggested.

For an index of 500K documents, I measured the index load time plus running
a query and getting the result back: around 350 milliseconds in total. The
memory footprint is around 1.5 MB.

Many thanks,
Rui Wang 
On 7 Dec 2011, at 07:46, Danil ŢORIN wrote:

> 10B documents is a lot of data.
> 
> One index per file won't scale on its own: you will not be able to open all
> the indexes at the same time (file handle limits, memory limits, etc.), and
> if you search through them sequentially, it will take a long time.
> 
> Unless, in your use case, you always know which file you are searching: then
> you could open just one index at a time, search it, and close it. In that
> case index-per-file is a good and scalable solution.
> (There will be a penalty for a fresh open of the index, but an index of 500K
> docs should be quite quick to open. You may want to maintain a pool of opened
> indexes with LFU eviction, so that repeated requests reuse an already opened
> IndexReader while old/unused IndexReaders are closed to free the resources.)
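> 
> A minimal sketch of such a pool, assuming Lucene 3.x (the class name, the
> size limit, and the simpler LRU eviction standing in for LFU are all
> illustrative, not from this thread):
> 
>   import java.io.File;
>   import java.io.IOException;
>   import java.util.LinkedHashMap;
>   import java.util.Map;
>   import org.apache.lucene.index.IndexReader;
>   import org.apache.lucene.store.FSDirectory;
> 
>   // Keeps at most MAX_OPEN readers; the least-recently-used one is closed
>   // and evicted when the pool overflows. Real code would reference-count
>   // readers so one is never closed while another thread is searching it.
>   public class ReaderPool {
>       private static final int MAX_OPEN = 100; // tune to your file-handle budget
> 
>       private final Map<String, IndexReader> pool =
>           new LinkedHashMap<String, IndexReader>(16, 0.75f, true) {
>               @Override
>               protected boolean removeEldestEntry(Map.Entry<String, IndexReader> eldest) {
>                   if (size() > MAX_OPEN) {
>                       try { eldest.getValue().close(); } catch (IOException ignored) {}
>                       return true;
>                   }
>                   return false;
>               }
>           };
> 
>       public synchronized IndexReader get(String indexPath) throws IOException {
>           IndexReader reader = pool.get(indexPath);
>           if (reader == null) {
>               reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
>               pool.put(indexPath, reader);
>           }
>           return reader;
>       }
>   }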
> 
> 
> 
> Ehcache has the ability to keep some entries in memory (say, a few thousand)
> and persist the rest of the cache on disk.
> So memory usage is not the issue: you could run it with a 64M JVM heap and
> let the OS handle the rest.
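> 
> For example, a minimal sketch assuming Ehcache 2.x (the cache name, sizes,
> and the key/value encoding are all illustrative):
> 
>   import net.sf.ehcache.Cache;
>   import net.sf.ehcache.CacheManager;
>   import net.sf.ehcache.Element;
>   import net.sf.ehcache.config.CacheConfiguration;
> 
>   CacheManager manager = CacheManager.create();
>   Cache positions = new Cache(
>       new CacheConfiguration("positions", 5000) // ~5000 entries held in memory
>           .overflowToDisk(true)                 // spill everything else to disk
>           .diskPersistent(true)                 // keep the disk store across restarts
>           .eternal(true));                      // never expire entries
>   manager.addCache(positions);
> 
>   // key: "filename:docId", value: "start:stop" byte offsets
>   positions.put(new Element("experiment_01.xml:P12345", "1048576:1049600"));
>   Element hit = positions.get("experiment_01.xml:P12345");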
> 
> On Tue, Dec 6, 2011 at 12:26, Rui Wang <rw...@ebi.ac.uk> wrote:
> 
>> Hi Danil,
>> 
>> Thank you for your suggestions.
>> 
>> We will have approximately half a million documents per file, so using your
>> calculation, 20000 files * 500000 = 10,000,000,000 documents. And we are
>> likely to get more files in the future, so a scalable solution is most
>> desirable.
>> 
>> The document IDs are not unique between files, so we will have to filter by
>> file name as well. Ehcache is certainly an interesting idea: does it have a
>> load speed comparable to a Lucene index, and what about the memory
>> footprint?
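>> 
>> In Lucene terms, that filtered lookup would be a conjunction of two terms.
>> A minimal sketch, assuming Lucene 3.x and made-up field names and values:
>> 
>>   import org.apache.lucene.index.Term;
>>   import org.apache.lucene.search.BooleanClause;
>>   import org.apache.lucene.search.BooleanQuery;
>>   import org.apache.lucene.search.TermQuery;
>>   import org.apache.lucene.search.TopDocs;
>> 
>>   // Both terms must match: the document ID and the file it came from.
>>   BooleanQuery query = new BooleanQuery();
>>   query.add(new TermQuery(new Term("id", "P12345")), BooleanClause.Occur.MUST);
>>   query.add(new TermQuery(new Term("filename", "experiment_01.xml")),
>>             BooleanClause.Occur.MUST);
>>   // "searcher" is an already opened IndexSearcher over the index
>>   TopDocs hits = searcher.search(query, 1);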
>> 
>> Another thing I should have mentioned before: we will add a few files (say
>> 10) per day, which means we need to update the indices on a regular basis;
>> hence the reason we were thinking of generating one index per file.
>> 
>> Am I right to say that you would definitely not go for the one-index-per-file
>> solution? Is that also due to memory consumption?
>> 
>> Many thanks,
>> Rui Wang
>> 
>> 
>> On 6 Dec 2011, at 10:05, Danil ŢORIN wrote:
>> 
>>> How many documents are there in the system?
>>> Approximate it by: 20000 files * avg(docs/file).
>>> 
>>> From my understanding, your queries will just be lookups for a document ID.
>>> (Q: are those IDs unique between files, or do you need to filter by
>>> filename as well?)
>>> If that will be the only use case, then maybe you should consider some other
>>> lookup system; an Ehcache store offloaded to and persisted on disk might
>>> work just as well.
>>> 
>>> If you are anywhere under 200 mln documents, I'd say you should go with a
>>> single index that contains all the data on a decent box (2-4 CPUs, 4-8 GB
>>> RAM). On a slightly beefier host with Lucene 4 (try various codecs for
>>> speed/memory trade-offs), I think you could go up to 1 bln documents.
>>> 
>>> If you plan on more complex queries, like "given a position in a file,
>>> identify the document that contains it", then the number of documents
>>> should be reconsidered.
>>> 
>>> In the worst-case scenario, I would go with a partitioned index (5-10
>>> partitions, but not thousands).
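>>> 
>>> A partitioned search could then be a MultiReader over the partitions. A
>>> minimal sketch, assuming Lucene 3.x and illustrative paths:
>>> 
>>>   import java.io.File;
>>>   import org.apache.lucene.index.IndexReader;
>>>   import org.apache.lucene.index.MultiReader;
>>>   import org.apache.lucene.search.IndexSearcher;
>>>   import org.apache.lucene.store.FSDirectory;
>>> 
>>>   // Open a handful of partition indexes and search them as one logical index.
>>>   IndexReader[] parts = new IndexReader[8]; // e.g. 8 partitions
>>>   for (int i = 0; i < parts.length; i++) {
>>>       parts[i] = IndexReader.open(FSDirectory.open(new File("/data/index-part-" + i)));
>>>   }
>>>   IndexSearcher searcher = new IndexSearcher(new MultiReader(parts));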
>>> 
>>> 
>>> On Tue, Dec 6, 2011 at 11:03, Rui Wang <rw...@ebi.ac.uk> wrote:
>>> 
>>>> Hi Guys,
>>>> 
>>>> Thank you very much for your answers.
>>>> 
>>>> I will do some profiling on memory usage, but is there any documentation
>>>> on how Lucene uses/allocates memory?
>>>> 
>>>> Best wishes,
>>>> Rui Wang
>>>> 
>>>> 
>>>> On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:
>>>> 
>>>>> hi
>>>>> 
>>>>>>> would the memory usage go through the roof?
>>>>> 
>>>>> Yup ....
>>>>> 
>>>>> My past experience got me into a pickle there...
>>>>> 
>>>>> 
>>>>> 
>>>>> with regards
>>>>> karthik
>>>>> 
>>>>> On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rw...@ebi.ac.uk> wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> We are planning to use Lucene in our project, but we are not entirely sure
>>>>>> about some of the design decisions we have made. Below are the details; any
>>>>>> comments/suggestions are more than welcome.
>>>>>> 
>>>>>> The requirements of the project are below:
>>>>>> 
>>>>>> 1. We have tens of thousands of files, ranging in size from 500 MB to a
>>>>>> few terabytes, and the majority of the content in these files will not be
>>>>>> accessed frequently.
>>>>>> 
>>>>>> 2. We are planning to keep the less frequently accessed content outside of
>>>>>> our database and store it on the file system.
>>>>>> 
>>>>>> 3. We also have code to get the binary positions of these contents in the
>>>>>> files. Using these binary positions, we can quickly retrieve the contents
>>>>>> and convert them into our domain objects.
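>>>>>> 
>>>>>> For point 3, a minimal sketch of the retrieval step, assuming plain JDK
>>>>>> I/O and a stop-exclusive byte offset convention (both assumptions):
>>>>>> 
>>>>>>   import java.io.IOException;
>>>>>>   import java.io.RandomAccessFile;
>>>>>> 
>>>>>>   // Given start/stop byte positions from the index, pull one content
>>>>>>   // block out of the large source file.
>>>>>>   public static byte[] readContent(String path, long start, long stop)
>>>>>>           throws IOException {
>>>>>>       RandomAccessFile raf = new RandomAccessFile(path, "r");
>>>>>>       try {
>>>>>>           byte[] buffer = new byte[(int) (stop - start)];
>>>>>>           raf.seek(start);
>>>>>>           raf.readFully(buffer);
>>>>>>           return buffer;
>>>>>>       } finally {
>>>>>>           raf.close();
>>>>>>       }
>>>>>>   }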
>>>>>> 
>>>>>> We think Lucene provides a scalable solution for storing and indexing
>>>>>> these binary positions. The idea is that each piece of content in the
>>>>>> files will be a document; each document will have at least an ID field to
>>>>>> identify the content and a binary position field containing the start and
>>>>>> stop positions of the content. Having done some performance testing, it
>>>>>> seems to us that Lucene is well capable of doing this.
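>>>>>> 
>>>>>> To make the document layout concrete, a minimal indexing sketch assuming
>>>>>> Lucene 3.x; the field names, the "start:stop" encoding of the position,
>>>>>> and the paths are illustrative only:
>>>>>> 
>>>>>>   import java.io.File;
>>>>>>   import org.apache.lucene.analysis.KeywordAnalyzer;
>>>>>>   import org.apache.lucene.document.Document;
>>>>>>   import org.apache.lucene.document.Field;
>>>>>>   import org.apache.lucene.index.IndexWriter;
>>>>>>   import org.apache.lucene.index.IndexWriterConfig;
>>>>>>   import org.apache.lucene.store.FSDirectory;
>>>>>>   import org.apache.lucene.util.Version;
>>>>>> 
>>>>>>   // One index per source file; one document per content block.
>>>>>>   IndexWriter writer = new IndexWriter(
>>>>>>       FSDirectory.open(new File("/indexes/experiment_01")),
>>>>>>       new IndexWriterConfig(Version.LUCENE_34, new KeywordAnalyzer()));
>>>>>> 
>>>>>>   Document doc = new Document();
>>>>>>   // The ID is searched on, so index it (untokenized) as well as store it.
>>>>>>   doc.add(new Field("id", "P12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
>>>>>>   // The byte offsets are only retrieved, never searched on, so store-only.
>>>>>>   doc.add(new Field("position", "1048576:1049600", Field.Store.YES, Field.Index.NO));
>>>>>>   writer.addDocument(doc);
>>>>>>   writer.close();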
>>>>>> 
>>>>>> At the moment, we are planning to create one Lucene index per file, so if
>>>>>> we have new files to add to the system, we can simply generate a new
>>>>>> index. The problem is to do with searching: this approach means that we
>>>>>> need to create a new IndexSearcher every time a file is accessed through
>>>>>> our web service. We know that it is rather expensive to open a new
>>>>>> IndexSearcher, and we are thinking of using some kind of pooling
>>>>>> mechanism.
>>>>>> Our questions are:
>>>>>> 
>>>>>> 1. Is this one-index-per-file approach a viable solution? What do you
>>>>>> think about pooling IndexSearchers?
>>>>>> 
>>>>>> 2. If we have many IndexSearchers open at the same time, would the memory
>>>>>> usage go through the roof? I couldn't find any documentation on how
>>>>>> Lucene uses/allocates memory.
>>>>>> 
>>>>>> Thank you very much for your help.
>>>>>> 
>>>>>> Many thanks,
>>>>>> Rui Wang
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> *N.S.KARTHIK
>>>>> R.M.S.COLONY
>>>>> BEHIND BANK OF INDIA
>>>>> R.M.V 2ND STAGE
>>>>> BANGALORE
>>>>> 560094*
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
>> 
>> 


