Indexing a fraction of the data, such as 10% or 5%, is probably the best way to do size estimation.
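As a rough sketch of that approach: index a few sample sizes, fit a straight line through the measured index sizes, and extrapolate to the full corpus. The data points below are the measurements quoted further down in this thread; the function names are made up for illustration, and the linear model is only an assumption that the growth stays linear beyond the sampled range.

```python
def fit_line(samples):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two sample points")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("sample sizes must differ")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_index_size(samples, target_docs):
    """Predict the index size at target_docs from (docs, size) samples."""
    slope, intercept = fit_line(samples)
    return slope * target_docs + intercept


# (records in millions, measured index size in MB), as reported in this thread
MEASURED = [
    (1, 255.0),
    (5, 1361.92),
    (10, 2703.36),
    (25, 7239.68),
    (50, 15257.6),
    (75, 26009.6),
    (100, 32256.0),
    (125, 39526.4),
]

if __name__ == "__main__":
    estimate_mb = extrapolate_index_size(MEASURED, 250)
    print(f"estimated index size at 250M records: {estimate_mb / 1024:.1f} GB")
```

With the numbers from this thread the fit lands at roughly 80 GB for 250M records. Treat that as a disk figure for this particular schema and analyzer only; a different set of stored or indexed fields needs its own samples.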
The only real caveat is that you need to look at RAM as well. Most modern hardware has huge mass storage capacity relative to the CPU requirements for Lucene to process that data, while IT staffs tend to be very, very stingy with RAM (or they give you big, fat nodes, but way too few of them). So even though most hardware easily has the disk space for 32 or 64 or 128 GB of index, getting that much RAM can be problematic, especially when the IT staff has drunk heavily of the "hey, everything runs great on commodity hardware!" Kool-Aid. IOW, running a 32 GB index on a 16 GB box is probably not a great idea if you need low latency.

-- Jack Krupansky

On Tue, Mar 24, 2015 at 8:37 AM, Gaurav gupta <gupta.gaurav0...@gmail.com> wrote:

> Erick,
>
> When further testing the index sizes using the Lucene APIs (I am using Lucene directly, not through Solr), I found that the index sizes are quite huge compared to the formula (I have attached the Excel sheet). But one thing I observe is that the index size increases linearly with the number of input records/documents, so can I tell the customer to create indexes of 1M, 5M and 10M records and then extrapolate to 250M records? Basically the customer wants to do capacity planning for disk etc., and that's why he is looking to somehow reasonably predict the Lucene index size.
>
> Lucene index size calculation
>
> # of indexed fields: 11
> # of stored fields: 11
>
> Note: I am using the standard Analyzer for all fields. I am indexing the records from a CSV file and each record is 0.2 KB (size of each doc = total file size / no. of records).
>
> Records (million)   Actual index size (MB)   Size as per formula (for Optimize)
>   1                     255                     31.1897507
>   5                    1361.92                  75.9487534
>  10                    2703.36                 131.897507
>  25                    7239.68                 299.743767
>  50                   15257.6                  579.487534
>  75                   26009.6                  859.2313
> 100                   32256                   1138.97507
> 125                   39526.4                 1418.71883
>
> Thanks
>
> On Tue, Mar 10, 2015 at 9:08 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> In a word... no.
>> There are simply too many variables here to give any decent estimate.
>>
>> The spreadsheet is, at best, an estimate. It hasn't been put through any rigorous QA so the fact that it's off in your situation is not surprising. I wish we had a better answer.
>>
>> And the disk size isn't particularly interesting anyway. The *.fdt and *.fdx files contain compressed copies of the raw data in _stored_ fields. If I index the same data with all fields set stored="true" then stored="false", my disk size may vary by a large factor. And the stored data has very little memory cost, memory usually being the limiting factor in your Solr installation.
>>
>> Are you storing position information? Term vectors? Are you ngramming your fields? And on and on. Each and every one of these changes the memory requirements...
>>
>> Sorry we can't be more help
>> Erick
>>
>> On Mon, Mar 9, 2015 at 12:20 PM, Gaurav gupta <gupta.gaurav0...@gmail.com> wrote:
>> > Could you please guide me on how to reasonably estimate the disk size for Lucene 4.x (precisely version 4.8.1), including the worst-case scenario?
>> >
>> > I have referred to the formula and Excel sheet shared at
>> > https://lucidworks.com/blog/estimating-memory-and-storage-for-lucenesolr/
>> >
>> > I think it was devised for Lucene 2.9. I am not sure if it holds true for the 4.x versions.
>> > In my case, the actual index size is coming either close to the worst case or higher than that. One of our enterprise customers has even observed an index size 3 times higher than the estimated index size (based on the Excel sheet).
>> >
>> > Alternatively, can I know the average doc size in a Lucene index (for a reasonable amount of data) so that I can extrapolate that to the complete 250 million documents?
>> >
>> > Thanks
>> > Gaurav
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
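
Erick's point about stored fields can be checked on a sample index by breaking the index directory down by file extension. The following is a minimal sketch, not part of the original thread: the function names are hypothetical, and it assumes the index was written without the compound file format, since a .cfs file bundles a segment's files together and hides the per-extension breakdown.

```python
import os
from collections import defaultdict

# Stored-field files, per Erick's note above (*.fdt and *.fdx).
STORED_FIELD_EXTS = {".fdt", ".fdx"}


def size_by_extension(index_dir):
    """Return total bytes per file extension for the files in index_dir."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            # Files without an extension (e.g. segments_N) are keyed by name.
            ext = os.path.splitext(name)[1] or name
            totals[ext] += os.path.getsize(path)
    return dict(totals)


def stored_fraction(index_dir):
    """Fraction of the on-disk index size taken by stored-field files."""
    totals = size_by_extension(index_dir)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    stored = sum(size for ext, size in totals.items()
                 if ext in STORED_FIELD_EXTS)
    return stored / total
```

Running `stored_fraction` on a sample index built with stored="true" and again with stored="false" gives a quick view of how much of the disk footprint is stored data rather than index structures.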