Erick,

On further testing of index sizes using the Lucene APIs (I am using Lucene directly, not through Solr), I found that the actual index sizes are much larger than the formula predicts (I have attached the Excel sheet). One thing I do observe is that the index size grows roughly linearly with the number of input records/documents, so can I advise the customer to build indexes of 1M, 5M and 10M records and then extrapolate to 250M records? Basically the customer wants to do capacity planning for disk etc., which is why he is looking for a way to reasonably predict the Lucene index size.
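Just to make concrete what I mean by extrapolating, here is a minimal sketch (plain Java, nothing Lucene-specific; the class name is my own and the hard-coded numbers are simply the measurements from the table below) that fits a least-squares line through the measured points and projects it out to 250M documents:

import java.util.Locale;

/**
 * Rough capacity-planning sketch: fit a least-squares line through the
 * measured (millions of docs, optimized index size in MB) points and
 * project it to a larger document count. Not a Lucene API -- just
 * arithmetic over the measurements from the table below.
 */
public class IndexSizeExtrapolation {

    public static void main(String[] args) {
        // Measured points: {millions of documents, optimized index size in MB}
        double[][] measured = {
            {1, 255}, {5, 1361.92}, {10, 2703.36}, {25, 7239.68},
            {50, 15257.6}, {75, 26009.6}, {100, 32256}, {125, 39526.4}
        };

        // Ordinary least-squares fit: sizeMb = slope * millions + intercept
        double n = measured.length, sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (double[] p : measured) {
            sumX += p[0];
            sumY += p[1];
            sumXY += p[0] * p[1];
            sumXX += p[0] * p[0];
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;

        double target = 250; // millions of documents to plan for
        double projectedMb = slope * target + intercept;
        System.out.printf(Locale.ROOT,
            "~%.0f MB per million docs; projected size for %.0fM docs: ~%.0f GB%n",
            slope, target, projectedMb / 1024);
    }
}

As a back-of-envelope check, those measurements work out to roughly 0.25-0.32 KB of index per document, i.e. on the order of the raw CSV record size (0.2 KB), rather than the few percent the spreadsheet formula predicts.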
Lucene Index size calculation

# of Indexed Fields: 11
# of Stored Fields:  11

Note: I am using the StandardAnalyzer for all fields. I am indexing the
records from a CSV file and each record is about 0.2 KB
(size of each doc = total file size / number of records).

Records (million)   Actual Index Size (MB)   Size per formula (MB, optimized)
        1                  255                        31.1897507
        5                 1361.92                     75.9487534
       10                 2703.36                    131.897507
       25                 7239.68                    299.743767
       50                15257.6                     579.487534
       75                26009.6                     859.2313
      100                32256                      1138.97507
      125                39526.4                    1418.71883

Thanks

On Tue, Mar 10, 2015 at 9:08 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> In a word... no. There are simply too many variables here to give any
> decent estimate.
>
> The spreadsheet is, at best, an estimate. It hasn't been put through
> any rigorous QA, so the fact that it's off in your situation is not
> surprising. I wish we had a better answer.
>
> And the disk size isn't particularly interesting anyway. The *.fdt and
> *.fdx files contain compressed copies of the raw data in _stored_
> fields. If I index the same data with all fields set stored="true"
> and then with stored="false", my disk size may vary by a large factor.
> And the stored data has very little memory cost, memory usually being
> the limiting factor in your Solr installation.
>
> Are you storing position information? Term vectors? Are you ngramming
> your fields? And on and on. Each and every one of these changes the
> memory requirements...
>
> Sorry we can't be more help.
> Erick
>
> On Mon, Mar 9, 2015 at 12:20 PM, Gaurav gupta
> <gupta.gaurav0...@gmail.com> wrote:
> > Could you please guide me on how to reasonably estimate the disk size for
> > Lucene 4.x (precisely version 4.8.1), including the worst-case scenario?
> >
> > I have referred to the formula and Excel sheet shared at
> > https://lucidworks.com/blog/estimating-memory-and-storage-for-lucenesolr/
> >
> > It seems to have been devised for Lucene 2.9, and I am not sure whether it
> > holds true for the 4.x versions.
> > In my case, the actual index size either comes close to the worst case
> > or exceeds it. One of our enterprise customers has even observed an index
> > 3 times larger than the size estimated from the Excel sheet.
> >
> > Alternatively, can I find the average document size in a Lucene index (of
> > a reasonable amount of data) so that I can extrapolate it to the full
> > 250 million documents?
> >
> > Thanks
> > Gaurav
Attachment: Lucene index size - forum.xlsx (MS-Excel 2007 spreadsheet)