Hi Danil,

Thank you for your suggestions.
We will have approximately half a million documents per file, so using your calculation: 20,000 files * 500,000 = 10,000,000,000. We are also likely to get more files in the future, so a scalable solution is most desirable.

The document IDs are not unique between files, so we will have to filter by file name as well.

ehcache is certainly an interesting idea. Does it have a load speed comparable to a Lucene index, and what about the memory footprint? (To check my understanding, I have put a rough sketch of what I think you mean at the bottom of this mail.)

Another thing I should have mentioned before: we will add a few files (say 10) per day, which means we need to update the indices on a regular basis. That is the reason we were thinking of generating one index per file.

Am I right to say that you would definitely not go for the one-index-per-file solution? Is that also due to memory consumption?
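To make sure we are talking about the same alternative, below is roughly how I picture the single shared index, with a file name field next to the document ID so we can filter on both. This is only a sketch against the Lucene 3.5 API as I understand it from the javadocs (SearcherManager is new in 3.5, if I read the changelog correctly), and the class name, field names and the "pos" encoding are just made up for illustration:

import java.io.File;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PositionIndex {

    private final IndexWriter writer;
    private final SearcherManager manager;

    public PositionIndex(File indexDir) throws Exception {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_35, new KeywordAnalyzer());
        writer = new IndexWriter(FSDirectory.open(indexDir), cfg);
        // One shared, reopenable searcher for the whole index, instead of
        // a pool of per-file IndexSearchers (warmer/executor left null).
        manager = new SearcherManager(writer, true, null, null);
    }

    // Called for every document of a newly arrived file (~10 files/day).
    public void add(String fileName, String docId, long start, long stop)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("file", fileName,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("docId", docId,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // The positions are only retrieved, never searched on, so the
        // field is stored but not indexed.
        doc.add(new Field("pos", start + ":" + stop,
                Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
    }

    // Called once per finished file: commit, then make it searchable.
    public void fileDone() throws Exception {
        writer.commit();
        manager.maybeReopen();
    }

    // Lookups filter on file name as well, since docIds repeat across files.
    public String lookup(String fileName, String docId) throws Exception {
        IndexSearcher searcher = manager.acquire();
        try {
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("file", fileName)),
                  BooleanClause.Occur.MUST);
            q.add(new TermQuery(new Term("docId", docId)),
                  BooleanClause.Occur.MUST);
            ScoreDoc[] hits = searcher.search(q, 1).scoreDocs;
            return hits.length == 0 ? null
                                    : searcher.doc(hits[0].doc).get("pos");
        } finally {
            manager.release(searcher);
        }
    }
}

If this is what you have in mind, the per-request cost is just acquire()/release() on the one shared searcher, so our IndexSearcher pooling question would largely disappear.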
Many thanks,
Rui Wang

On 6 Dec 2011, at 10:05, Danil ŢORIN wrote:

> How many documents are there in the system?
> Approximate it by: 20000 files * avg(docs/file)
>
> From my understanding, your queries will be just lookups for a document ID.
> (Q: are those IDs unique between files? or do you need to filter by filename?)
> If that will be the only use case, then maybe you should consider some other
> lookup system; an ehcache, offloaded and persistent on disk, might work just
> as well.
>
> If you are anywhere < 200 mln documents, I'd say you should go with a single
> index that contains all the data on a decent box (2-4 CPUs, 4-8 GB RAM).
> On a slightly beefier host, with Lucene 4 (try various codecs for speed/memory
> usage), I think you could go to 1 bln documents.
>
> If you plan on more complex queries, like: given a position in a file,
> identify the document that contains it, then the number of documents should
> be reconsidered.
>
> In the worst case scenario I would go with a partitioned index (5-10
> partitions, but not thousands).
>
>
> On Tue, Dec 6, 2011 at 11:03, Rui Wang <rw...@ebi.ac.uk> wrote:
>
>> Hi Guys,
>>
>> Thank you very much for your answers.
>>
>> I will do some profiling on memory usage, but is there any documentation
>> on how Lucene uses/allocates memory?
>>
>> Best wishes,
>> Rui Wang
>>
>>
>> On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:
>>
>>> hi
>>>
>>>>> would the memory usage go through the roof?
>>>
>>> Yup ....
>>>
>>> My past experience got me into a pickle there...
>>>
>>>
>>> with regards
>>> karthik
>>>
>>> On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rw...@ebi.ac.uk> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We are planning to use Lucene in our project, but we are not entirely
>>>> sure about some of the design decisions we have made. Below are the
>>>> details; any comments/suggestions are more than welcome.
>>>>
>>>> The requirements of the project are:
>>>>
>>>> 1. We have tens of thousands of files, their sizes ranging from 500MB
>>>> to a few terabytes, and the majority of the contents in these files
>>>> will not be accessed frequently.
>>>>
>>>> 2. We are planning to keep the less frequently accessed contents
>>>> outside of our database and store them on the file system.
>>>>
>>>> 3. We also have code to get the binary position of these contents in
>>>> the files. Using these binary positions, we can quickly retrieve the
>>>> contents and convert them into our domain objects.
>>>>
>>>> We think Lucene provides a scalable solution for storing and indexing
>>>> these binary positions, so the idea is that each piece of content in the
>>>> files will be a document; each document will have at least an ID field to
>>>> identify the content and a binary position field containing the start and
>>>> stop positions of the content. Having done some performance testing, it
>>>> seems to us that Lucene is well capable of doing this.
>>>>
>>>> At the moment, we are planning to create one Lucene index per file, so if
>>>> we have new files to be added to the system, we can simply generate a new
>>>> index. The problem is to do with searching: this approach means that we
>>>> need to create a new IndexSearcher every time a file is accessed through
>>>> our web service. We know that it is rather expensive to open a new
>>>> IndexSearcher, and we are thinking of using some kind of pooling
>>>> mechanism. Our questions are:
>>>>
>>>> 1. Is this one-index-per-file approach a viable solution? What do you
>>>> think about pooling IndexSearchers?
>>>>
>>>> 2. If we have many IndexSearchers open at the same time, would the
>>>> memory usage go through the roof? I couldn't find any documentation on
>>>> how Lucene allocates memory.
>>>>
>>>> Thank you very much for your help.
>>>>
>>>> Many thanks,
>>>> Rui Wang
>>>
>>>
>>> --
>>> N.S.KARTHIK
>>> R.M.S.COLONY
>>> BEHIND BANK OF INDIA
>>> R.M.V 2ND STAGE
>>> BANGALORE
>>> 560094
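P.S. So that I am sure I understood the ehcache suggestion, this is roughly what I imagine it would look like. Again only a sketch, against the Ehcache 2.x API as I read the docs: the cache name, capacity and key scheme are made up, and I assume diskPersistent also needs a diskStore path configured (e.g. in ehcache.xml).

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.config.CacheConfiguration;

public class PositionCache {

    private final Cache cache;

    public PositionCache() {
        CacheManager manager = CacheManager.create();
        // Bounded heap usage: keep the hot entries in memory, overflow the
        // rest to disk, and keep the disk store across restarts.
        cache = new Cache(new CacheConfiguration("positions", 100000)
                .overflowToDisk(true)
                .diskPersistent(true));
        manager.addCache(cache);
    }

    // The key has to combine file name and document ID, because the IDs
    // are not unique between files.
    public void put(String fileName, String docId, long start, long stop) {
        cache.put(new Element(fileName + "|" + docId, new long[] {start, stop}));
    }

    public long[] get(String fileName, String docId) {
        Element e = cache.get(fileName + "|" + docId);
        return e == null ? null : (long[]) e.getValue();
    }
}

My main worry with this route would be the initial load speed and the disk store size at ~10 bln entries, so any numbers you have seen in practice would be very helpful.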