Hi Hrishi,

The only way you'll know is to try it with some subset of your data - some queries can be very expensive, some are really easy. It'll depend on your document size, the vocabulary (total number and distribution of terms), and the kinds of queries, as well as, of course, your hardware. I would start out indexing the sizes you mention (10-1000GB), run queries like those you expect to be running in production against it, and measure your TPS after it's been running for a while under load.
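For the query side of that, a rough harness might look like the following - untested, off the top of my head against the recent 2.9 APIs, and the "body" field, the hardcoded queries, and the index path are just placeholders for your real setup:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class QueryBench {
      public static void main(String[] args) throws Exception {
        // Open the test index you built from your data sample, read-only.
        IndexSearcher searcher =
            new IndexSearcher(FSDirectory.open(new File(args[0])), true);
        QueryParser parser = new QueryParser(Version.LUCENE_29, "body",
            new StandardAnalyzer(Version.LUCENE_29));

        // Swap in queries representative of your production traffic.
        String[] queries = { "foo bar", "title:baz", "+foo +bar -baz" };

        // Warm up the searcher (and the OS's IO cache) before timing.
        for (String q : queries) searcher.search(parser.parse(q), 10);

        int iterations = 10000;
        long totalHits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
          Query q = parser.parse(queries[i % queries.length]);
          TopDocs td = searcher.search(q, 10);
          totalHits += td.totalHits; // touch the results
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println((iterations / seconds) + " queries/sec ("
            + totalHits + " hits scored)");
        searcher.close();
      }
    }

Run several copies of that loop in parallel threads, too - single-threaded numbers can look very different from what you'll see under concurrent production load.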
To index even 1TB in any reasonable time, you should probably do this in parallel and then merge afterwards (see the P.S. below for a sketch of what I mean), but that final merge of the last two segments in your 1TB index is gonna be a killer.

One of the big problems you'll run into at this index size is that you'll never have enough RAM to give your OS's IO cache enough room to keep much of the index in memory, so you're going to be seeking in this monster file a lot. I'm not saying that you need to keep your index in RAM for good performance, but I've always tried to keep my individual indexes within a (binary) order of magnitude of the available RAM - if I'm on a box with 16GB of memory, then an index bigger than 32GB is getting dangerously big for my taste. Really fast disks may mitigate this, which is yet another reason you'll need to do performance profiling on a variety of sizes with similar-to-production data sets.

I wish I could be of more help, but at this size I think you'll need to play with it to see what works. We here on the list would be *very* interested to hear what you find, because I'll bet the reason you're not getting many responses to this question is not that nobody cares, but that most of us don't really know whether you can ever really search multi-TB *single* indexes, or what kind of cluster configuration works best for searching a 75 TB distributed Lucene index!

-jake
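P.S. By "do this in parallel and then merge" I mean roughly the following - again just a sketch, not tested, and nextDocument() is a stand-in for however you slice up and read your corpus:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ParallelBuild {
      public static void main(String[] args) throws Exception {
        final int numShards = 8; // roughly one per core/disk, tune to taste
        final Directory[] parts = new Directory[numShards];
        Thread[] workers = new Thread[numShards];

        for (int i = 0; i < numShards; i++) {
          final int shard = i;
          parts[shard] = FSDirectory.open(new File("part-" + shard));
          workers[shard] = new Thread() {
            public void run() {
              try {
                IndexWriter w = new IndexWriter(parts[shard],
                    new StandardAnalyzer(Version.LUCENE_29), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
                for (Document d; (d = nextDocument(shard)) != null; ) {
                  w.addDocument(d);
                }
                w.close();
              } catch (Exception e) {
                throw new RuntimeException(e);
              }
            }
          };
          workers[shard].start();
        }
        for (Thread t : workers) t.join();

        // The final merge - this is the step that's going to hurt at 1TB.
        IndexWriter merged = new IndexWriter(
            FSDirectory.open(new File("merged")),
            new StandardAnalyzer(Version.LUCENE_29), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
        merged.addIndexesNoOptimize(parts);
        merged.optimize(); // collapses everything down - the killer merge
        merged.close();
      }

      // Placeholder: feed each worker its own slice of the corpus.
      static Document nextDocument(int shard) { return null; }
    }

On Thu, Oct 22, 2009 at 11:29 PM, Hrishikesh Agashe <
hrishikesh_aga...@persistent.co.in> wrote: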
> Thanks Jake.
>
> I have around 75 TB of data to be indexed. So even though I do the
> sharding, the individual index file sizes might still be pretty high.
> That's why I wanted to find out whether there is any limit as such, and
> obviously whether such huge index files can be searched at all.
>
> From your response it appears that 1 TB in one index file is too much. Is
> there any guideline as to what kind of hardware would be required to
> handle a given index file size (10GB, 50GB, 100GB, 500GB, etc.) with
> sensible search times?
>
> --Hrishi
>
> -----Original Message-----
> From: Jake Mannix [mailto:jake.man...@gmail.com]
> Sent: Friday, October 23, 2009 11:09 AM
> To: java-user@lucene.apache.org
> Subject: Re: Maximum index file size
>
> On Thu, Oct 22, 2009 at 10:29 PM, Hrishikesh Agashe <
> hrishikesh_aga...@persistent.co.in> wrote:
>
> > Can I create an index file with a very large size, like 1 TB or so? Is
> > there any limit on how large an index file one can create? Also, will I
> > be able to search on this 1 TB index file at all?
>
> Leaving aside the question of hardware or JVM limits on monstrous files,
> this question (can you search this file) is easier: if you've got, say,
> ten billion documents in one index, and you have a query which is going
> to hit maybe even just 0.1% of the documents, you'll need to score 10
> million hits in the course of that query. To do this in under a second
> means you only have 100 nanoseconds to look at each document. If your
> query hits 1% of your documents, you're down to 10 ns per document. I've
> never tried searching a 1TB index, but I'd say that's pushing it.
>
> Is there a reason you can't shard your index, and instead put maybe 20
> shards of 50GB (or better - 100 shards of 10GB) each on a variety of
> machines, and just merge results?
>
> -jake
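P.P.S. For anyone wondering what the "just merge results" part of my earlier message (quoted above) looks like in code: MultiSearcher, or ParallelMultiSearcher if you want the shards queried concurrently, already does the hit merging for you. Same caveats as the other sketches - this one opens local shard directories, where in real life each shard would sit on its own box behind something like RemoteSearchable from contrib:

    import java.io.File;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
      public static void main(String[] args) throws Exception {
        // One searcher per shard directory passed on the command line.
        Searchable[] shards = new Searchable[args.length];
        for (int i = 0; i < args.length; i++) {
          shards[i] =
              new IndexSearcher(FSDirectory.open(new File(args[i])), true);
        }
        // Queries every shard in parallel and merges the top hits into
        // one relevance-ordered list.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
        TopDocs td =
            searcher.search(new TermQuery(new Term("body", "foo")), 10);
        System.out.println(td.totalHits + " hits across "
            + args.length + " shards");
        searcher.close();
      }
    }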