Hi Danil, Thank you for answering once again.
You are right that we always know which file we are searching; the file location is stored in a database. Having done some testing, it seems to me that using one index per file yields reasonable performance, just as you suggested. For an index of 500K documents, I measured the index load time plus querying and getting the result back: it takes around 350 milliseconds. The memory footprint is around 1.5 M.

Many thanks,
Rui Wang

On 7 Dec 2011, at 07:46, Danil ŢORIN wrote:

> 10B documents is a lot of data.
>
> Index/file won't scale: you will not be able to open all the indexes at the
> same time (file handle limits, memory limits, etc.), and if you
> search through them sequentially, it will take a lot of time.
>
> Unless in your use case you always know the file you are searching; in this
> case you could open just one index at a time, search it, and close it.
> In this case index/file is a good and scalable solution.
> (There will be a penalty for a fresh open of the index, but 500K docs/index
> should be quite quick to open. You may want to maintain a pool of opened
> indexes with LFU eviction, so repeated requests will reuse an already opened
> IndexReader, and old/unused IndexReaders will be closed to free the
> resources.)
>
> ehcache can keep some entries in memory (let's say a few
> thousand) and persist the rest of the cache on disk.
> So the memory usage is not the issue; you could run it with a 64M JVM and
> let the OS handle the rest.
>
> On Tue, Dec 6, 2011 at 12:26, Rui Wang <rw...@ebi.ac.uk> wrote:
>
>> Hi Danil,
>>
>> Thank you for your suggestions.
>>
>> We will have approximately half a million documents per file, so using your
>> calculation, 20000 files * 500000 = 10,000,000,000. And we are likely to
>> get more files in the future, so a scalable solution is most desirable.
>>
>> The document IDs are not unique between files, so we will have to filter
>> by file name as well.
>> ehcache is certainly an interesting idea. Does it
>> have a load speed comparable to a Lucene index? What about memory
>> footprint?
>>
>> Another thing I should have mentioned before: we will add a few files (say
>> 10) per day. This means we need to update indices on a regular basis, hence
>> the reason why we were thinking of generating one index per file.
>>
>> Am I right to say that you would definitely not go for the one index per file
>> solution? Is it also due to memory consumption?
>>
>> Many thanks,
>> Rui Wang
>>
>>
>> On 6 Dec 2011, at 10:05, Danil ŢORIN wrote:
>>
>>> How many documents are there in the system?
>>> Approximate it by: 20000 files * avg(docs/file)
>>>
>>> From my understanding your queries will be just lookups for a document ID
>>> (Q: are those IDs unique between files, or do you need to filter by filename?)
>>> If that will be the only use case then maybe you should consider some other
>>> lookup systems; an ehcache offloaded and persistent on disk might work just
>>> as well.
>>>
>>> If you are anywhere < 200 mln documents I'd say you should go with a single
>>> index that contains all the data on a decent box (2-4 CPU, 4-8Gb RAM).
>>> On a slightly beefier host and Lucene4 (try various codecs for speed/memory
>>> usage) I think you could go to 1 bln documents.
>>>
>>> If you plan on more complex queries, like given a position in a file,
>>> identify a document that contains it, then the number of documents should
>>> be reconsidered.
>>>
>>> In the worst case scenario I would go with a partitioned index (5-10
>>> partitions, but not thousands).
>>>
>>>
>>> On Tue, Dec 6, 2011 at 11:03, Rui Wang <rw...@ebi.ac.uk> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> Thank you very much for your answers.
>>>>
>>>> I will do some profiling on memory usage, but is there any documentation
>>>> on how Lucene uses/allocates the memory?
>>>>
>>>> Best wishes,
>>>> Rui Wang
>>>>
>>>>
>>>> On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:
>>>>
>>>>> hi
>>>>>
>>>>>>> would the memory usage go through the roof?
>>>>>
>>>>> Yup ....
>>>>>
>>>>> My past experience got me pickles in there...
>>>>>
>>>>> with regards
>>>>> karthik
>>>>>
>>>>> On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rw...@ebi.ac.uk> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We are planning to use Lucene in our project, but are not entirely sure about
>>>>>> some of the design decisions we made. Below are the details; any
>>>>>> comments/suggestions are more than welcome.
>>>>>>
>>>>>> The requirements of the project are below:
>>>>>>
>>>>>> 1. We have tens of thousands of files, their size ranging from 500M to a
>>>>>> few terabytes, and the majority of the contents in these files will not be
>>>>>> accessed frequently.
>>>>>>
>>>>>> 2. We are planning to keep less accessed contents outside of our database,
>>>>>> and store them on the file system.
>>>>>>
>>>>>> 3. We also have code to get the binary position of these contents in the
>>>>>> files. Using these binary positions, we can quickly retrieve the contents
>>>>>> and convert them into our domain objects.
>>>>>>
>>>>>> We think Lucene provides a scalable solution for storing and indexing
>>>>>> these binary positions, so the idea is that each piece of the content in
>>>>>> the files will be a document. Each document will have at least an ID field to
>>>>>> identify the content and a binary position field that contains the start
>>>>>> and stop positions of the content. Having done some performance testing, it
>>>>>> seems to us that Lucene is well capable of doing this.
>>>>>>
>>>>>> At the moment, we are planning to create one Lucene index per file, so if
>>>>>> we have new files to be added to the system, we can simply generate a new
>>>>>> index.
>>>>>> The problem is to do with searching: this approach means that we need
>>>>>> to create a new IndexSearcher every time a file is accessed through our
>>>>>> web service. We know that it is rather expensive to open a new
>>>>>> IndexSearcher, and are thinking of using some kind of pooling mechanism.
>>>>>> Our questions are:
>>>>>>
>>>>>> 1. Is this one index per file approach a viable solution? What do you
>>>>>> think about pooling IndexSearchers?
>>>>>>
>>>>>> 2. If we have many IndexSearchers open at the same time, would the
>>>>>> memory usage go through the roof? I couldn't find any documentation on how
>>>>>> Lucene allocates memory.
>>>>>>
>>>>>> Thank you very much for your help.
>>>>>>
>>>>>> Many thanks,
>>>>>> Rui Wang
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>> --
>>>>> *N.S.KARTHIK
>>>>> R.M.S.COLONY
>>>>> BEHIND BANK OF INDIA
>>>>> R.M.V 2ND STAGE
>>>>> BANGALORE
>>>>> 560094*
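
For reference, the pooling idea from this thread (keep a bounded set of open indexes, reuse them on repeated requests, and close the least frequently used one when the pool is full) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names LfuResourcePool and Opener are made up for this example and are not Lucene APIs, and the pooled resource is any Closeable (in our case it would presumably be an IndexReader opened for one file's index). A real service would also have to make sure a reader is not closed while a search on it is still running.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bounded pool of open resources with LFU eviction. */
final class LfuResourcePool<K, V extends Closeable> {

    /** Hypothetical callback that opens the resource for a key, e.g. an index path. */
    interface Opener<K, V> {
        V open(K key) throws IOException;
    }

    /** An open resource plus the number of times it has been requested. */
    private static final class Entry<V> {
        final V resource;
        long uses;
        Entry(V resource) { this.resource = resource; }
    }

    private final int capacity;
    private final Opener<K, V> opener;
    private final Map<K, Entry<V>> entries = new HashMap<>();

    LfuResourcePool(int capacity, Opener<K, V> opener) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.opener = opener;
    }

    /** Returns the already open resource for key, or opens it, evicting first if full. */
    synchronized V acquire(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            if (entries.size() >= capacity) {
                evictLeastFrequentlyUsed();
            }
            try {
                entry = new Entry<>(opener.open(key));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            entries.put(key, entry);
        }
        entry.uses++;
        return entry.resource;
    }

    /** Closes and removes the entry with the fewest requests (linear scan). */
    private void evictLeastFrequentlyUsed() {
        K victim = null;
        long fewest = Long.MAX_VALUE;
        for (Map.Entry<K, Entry<V>> e : entries.entrySet()) {
            if (e.getValue().uses < fewest) {
                fewest = e.getValue().uses;
                victim = e.getKey();
            }
        }
        try {
            entries.remove(victim).resource.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized int size() { return entries.size(); }

    synchronized boolean contains(K key) { return entries.containsKey(key); }
}
```

The linear scan on eviction keeps the sketch short; with a pool of a few hundred readers that cost should be negligible next to opening an index. The key would be whatever identifies the file's index, for example the location we already store in the database.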