Walter: What did you change? I might like to put that in my bag of tricks ;)
Erick

On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> That approach doesn’t work very well for estimates.
>
> Some parts of the index size and speed scale with the vocabulary instead of
> the number of documents. Vocabulary usually grows at about the square root
> of the total amount of text in the index. OCR’ed text breaks that estimate
> badly, with huge vocabularies.
>
> Also, it is common to find non-linear jumps in performance. I’m benchmarking
> a change in a 12 million document index. It improves the 95th percentile
> response time for one style of query from 3.8 seconds to 2 milliseconds. I’m
> testing with a log of 200k queries from a production host, so I’m pretty
> sure that is accurate.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>
>> In short, if you want your estimate to be closer, run an actual ingestion
>> for, say, 1-5% of your total docs and extrapolate, since every search
>> product may have a different schema, a different set of fields, different
>> indexed vs. stored fields, copy fields, a different analysis chain, etc.
>>
>> If you just want a very quick rough estimate, create a few flat JSON
>> sample files (see below) with field names and key values (actual data for
>> a better estimate). Put in all the field names which you are going to
>> index/put into Solr and check the JSON file size. This gives you the
>> average size of a doc; multiply that by the # of docs to get a rough
>> index size.
>>
>> {
>>   "id":"product12345",
>>   "name":"productA",
>>   "category":"xyz",
>>   ...
>>   ...
>> }
>>
>> Thanks,
>> Susheel
>>
>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>
>>> This doesn't answer your question, but Erick Erickson's blog on this
>>> topic is invaluable:
>>>
>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>
>>> -----Original Message-----
>>> From: Vasu Y [mailto:vya...@gmail.com]
>>> Sent: Monday, October 3, 2016 2:09 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: SOLR Sizing
>>>
>>> Hi,
>>> I am trying to estimate disk space requirements for the documents
>>> indexed to SOLR.
>>> I went through the LucidWorks blog (
>>> https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>>> and am using it as the template. I have a question regarding estimating
>>> "Avg. Document Size (KB)".
>>>
>>> When calculating disk storage requirements, can we use the Java types
>>> sizing (
>>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>> to come up with an average document size?
>>>
>>> Please let me know if the following assumptions are correct.
>>>
>>> Data Type           Size
>>> -----------------   ------
>>> long                8 bytes
>>> tint                4 bytes
>>> tdate               8 bytes (stored as a long?)
>>> string              1 byte per char for ASCII chars, 2 bytes per char for non-ASCII (double-byte) chars
>>> text                1 byte per char for ASCII chars, 2 bytes per char for non-ASCII (double-byte) chars (for both with & without norms?)
>>> ICUCollationField   2 bytes per char for non-ASCII (double-byte) chars
>>> boolean             1 bit?
>>>
>>> Thanks,
>>> Vasu
>>>
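
[Editorial note] A minimal sketch of the quick estimate Susheel describes above: measure a few flat JSON sample files containing the fields you plan to send to Solr, take the average size, and multiply by the planned document count. The directory name, the document count, and the script itself are illustrative assumptions, not something from the thread; the 1-5% trial ingestion remains the more reliable approach, for the reasons Walter gives.

    # Sketch of the "flat JSON sample" estimate described in the thread.
    # Assumptions: sample_docs/ holds a few representative JSON files and
    # TOTAL_DOCS is the planned corpus size. This only extrapolates raw
    # document bytes; it does not model stored vs. indexed fields,
    # copyFields, or vocabulary growth.
    import os

    SAMPLE_DIR = "sample_docs"   # hypothetical directory of sample JSON docs
    TOTAL_DOCS = 12_000_000      # hypothetical target document count

    sample_files = [os.path.join(SAMPLE_DIR, f)
                    for f in os.listdir(SAMPLE_DIR) if f.endswith(".json")]

    total_bytes = sum(os.path.getsize(f) for f in sample_files)
    avg_doc_bytes = total_bytes / len(sample_files)
    rough_estimate_bytes = avg_doc_bytes * TOTAL_DOCS

    print("avg sample doc size : %.0f bytes" % avg_doc_bytes)
    print("rough index estimate: %.1f GiB" % (rough_estimate_bytes / 1024**3))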