10B documents is a lot of data.

Index/file won't scale: you will not be able to open all the indexes at the
same time (file handle limits, memory limits, etc.), and if you search
through them sequentially, it will take a long time.

Unless, in your use case, you always know which file you are searching: in
that case you could open just one index at a time, search it, and close it.
Then index/file is a good and scalable solution.
(There will be a penalty for a fresh open of the index, but 500K docs/index
should be quite quick to open. You may want to maintain a pool of opened
indexes with LFU eviction, so repeated requests reuse an already opened
IndexReader, and old/unused IndexReaders are closed to free the
resources.)
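
To make that concrete, here is a minimal sketch of such a pool against the
Lucene 3.x API of the era. The class and names are made up for illustration,
and it uses LRU eviction (via LinkedHashMap's access order) as a simpler
stand-in for the LFU policy mentioned above:

    import java.io.File;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Hypothetical pool: keeps at most maxOpen readers, evicting the least
    // recently used one and closing it to free file handles and heap.
    public class ReaderPool {
        private final int maxOpen;
        private final Map<String, IndexReader> readers;

        public ReaderPool(final int maxOpen) {
            this.maxOpen = maxOpen;
            // accessOrder=true => iteration order is least-recently-used first
            this.readers = new LinkedHashMap<String, IndexReader>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, IndexReader> eldest) {
                    if (size() > ReaderPool.this.maxOpen) {
                        try {
                            eldest.getValue().close();
                        } catch (IOException e) {
                            // nothing useful to do if close fails; log it
                        }
                        return true;
                    }
                    return false;
                }
            };
        }

        // Returns a cached reader for the index directory, opening on demand.
        public synchronized IndexReader get(String indexPath) throws IOException {
            IndexReader reader = readers.get(indexPath);
            if (reader == null) {
                reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
                readers.put(indexPath, reader);
            }
            return reader;
        }
    }

A real pool would also need reference counting, so an evicted reader is not
closed while another thread is still searching on it.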



ehcache can keep some entries in memory (say, a few thousand) and persist
the rest of the cache on disk.
So memory usage is not the issue: you could run it with a 64M JVM heap and
let the OS handle the rest.
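
For example, a rough sketch against the Ehcache 2.x API of the time (the
class name, cache name, and key format here are invented; note that
disk-persisted values must be Serializable):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;
    import net.sf.ehcache.config.CacheConfiguration;

    public class PositionCacheDemo {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();

            // Keep ~2000 hot entries on the heap, spill the rest to disk,
            // and keep the disk store across JVM restarts.
            Cache positions = new Cache(
                new CacheConfiguration("positions", 2000)
                    .overflowToDisk(true)
                    .diskPersistent(true)
                    .eternal(true));
            manager.addCache(positions);

            // Key by filename + document ID, since the IDs are not unique
            // across files; the value is the start/stop byte positions.
            positions.put(new Element("file-00042:DOC123",
                                      new long[] {1024L, 2048L}));

            Element hit = positions.get("file-00042:DOC123");
            if (hit != null) {
                long[] span = (long[]) hit.getObjectValue();
                System.out.println("start=" + span[0] + " stop=" + span[1]);
            }

            manager.shutdown();
        }
    }

A lookup then never touches Lucene at all, which fits the
single-document-ID query pattern described below.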

On Tue, Dec 6, 2011 at 12:26, Rui Wang <rw...@ebi.ac.uk> wrote:

> Hi Danil,
>
> Thank you for your suggestions.
>
> We will have approximately half a million documents per file, so using your
> calculation, 20000 files * 500000 = 10,000,000,000. And we are likely to
> get more files in the future, so a scalable solution is most desirable.
>
> The document IDs are not unique between files, so we will have to filter
> by file name as well. ehcache is certainly an interesting idea; does it
> have load speed comparable to a Lucene index, and what about its memory
> footprint?
>
> Another thing I should have mentioned before: we will add a few files (say
> 10) per day, which means we need to update indices on a regular basis,
> hence the reason why we were thinking of generating one index per file.
>
> Am I right to say that you would definitely not go for the one index per
> file solution? Is that also due to memory consumption?
>
> Many thanks,
> Rui Wang
>
>
> On 6 Dec 2011, at 10:05, Danil ŢORIN wrote:
>
> > How many documents are there in the system?
> > Approximate it by: 20000 files * avg(docs/file)
> >
> > From my understanding your queries will just be lookups for a document ID
> > (Q: are those IDs unique between files, or do you need to filter by
> > filename?)
> > If that will be the only use case, then maybe you should consider some
> > other lookup system; an ehcache offloaded to and persisted on disk might
> > work just as well.
> >
> > If you are anywhere < 200 mln documents, I'd say you should go with a
> > single index that contains all the data on a decent box (2-4 CPUs,
> > 4-8GB RAM).
> > On a slightly beefier host with Lucene4 (try various codecs for
> > speed/memory usage), I think you could go to 1 bln documents.
> >
> > If you plan on more complex queries, like: given a position in a file,
> > identify the document that contains it, then the number of documents
> > should be reconsidered.
> >
> > In the worst-case scenario I would go with a partitioned index (5-10
> > partitions, but not thousands).
> >
> >
> > On Tue, Dec 6, 2011 at 11:03, Rui Wang <rw...@ebi.ac.uk> wrote:
> >
> >> Hi Guys,
> >>
> >> Thank you very much for your answers.
> >>
> >> I will do some profiling on memory usage, but is there any documentation
> >> on how Lucene uses/allocates memory?
> >>
> >> Best wishes,
> >> Rui Wang
> >>
> >>
> >> On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:
> >>
> >>> hi
> >>>
> >>>>> would the memory usage go through the roof?
> >>>
> >>> Yup ....
> >>>
> >>> My past experience got me into a pickle there...
> >>>
> >>>
> >>>
> >>> with regards
> >>> karthik
> >>>
> >>> On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rw...@ebi.ac.uk> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> We are planning to use Lucene in our project, but we are not entirely
> >>>> sure about some of the design decisions we made. Below are the
> >>>> details; any comments/suggestions are more than welcome.
> >>>>
> >>>> The requirements of the project are below:
> >>>>
> >>>> 1. We have tens of thousands of files, their sizes ranging from 500M
> >>>> to a few terabytes, and the majority of the contents in these files
> >>>> will not be accessed frequently.
> >>>>
> >>>> 2. We are planning to keep less frequently accessed contents outside
> >>>> of our database and store them on the file system.
> >>>>
> >>>> 3. We also have code to get the binary positions of these contents in
> >>>> the files. Using these binary positions, we can quickly retrieve the
> >>>> contents and convert them into our domain objects.
> >>>>
> >>>> We think Lucene provides a scalable solution for storing and indexing
> >>>> these binary positions, so the idea is that each piece of content in
> >>>> the files will be a document; each document will have at least an ID
> >>>> field to identify the content and a binary position field containing
> >>>> the start and stop positions of the content. Having done some
> >>>> performance testing, it seems to us that Lucene is well capable of
> >>>> doing this.
> >>>>
> >>>> At the moment, we are planning to create one Lucene index per file,
> >>>> so if we have new files to be added to the system, we can simply
> >>>> generate a new index. The problem is with searching: this approach
> >>>> means that we need to create a new IndexSearcher every time a file is
> >>>> accessed through our web service. We know that it is rather expensive
> >>>> to open a new IndexSearcher, and we are thinking of using some kind of
> >>>> pooling mechanism. Our questions are:
> >>>>
> >>>> 1. Is this one index per file approach a viable solution? What do you
> >>>> think about pooling IndexSearchers?
> >>>>
> >>>> 2. If we have many IndexSearchers opened at the same time, would the
> >>>> memory usage go through the roof? I couldn't find any documentation
> >>>> on how Lucene allocates memory.
> >>>>
> >>>> Thank you very much for your help.
> >>>>
> >>>> Many thanks,
> >>>> Rui Wang
> >>>
> >>>
> >>> --
> >>> *N.S.KARTHIK
> >>> R.M.S.COLONY
> >>> BEHIND BANK OF INDIA
> >>> R.M.V 2ND STAGE
> >>> BANGALORE
> >>> 560094*
> >>
> >>
>
>
