Walter:

What did you change? I might like to put that in my bag of tricks ;)

Erick

On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> That approach doesn’t work very well for estimates.
>
> Some parts of the index size and speed scale with the vocabulary instead of
> the number of documents. Vocabulary usually grows at about the square root
> of the total amount of text in the index. OCR’ed text breaks that estimate
> badly, with huge vocabularies.
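>
> As a back-of-the-envelope sketch of that square-root rule of thumb (roughly
> Heaps' law with an exponent near 0.5), in Python, where the constant K and
> the token count are assumptions rather than measurements from any real index:
>
>     K = 30                        # assumed constant; varies by corpus and analyzer
>     total_tokens = 2_000_000_000  # hypothetical total amount of text, in tokens
>     estimated_vocab = K * total_tokens ** 0.5    # ~ square root of total text
>     print(f"Estimated unique terms: {estimated_vocab:,.0f}")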
>
> Also, it is common to find non-linear jumps in performance. I’m benchmarking
> a change in a 12 million document index. It improves the 95th percentile
> response time for one style of query from 3.8 seconds to 2 milliseconds. I’m
> testing with a log of 200k queries from a production host, so I’m pretty
> sure that is accurate.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>
>> In short, if you want your estimate to be closer to reality, run an actual
>> ingestion for, say, 1-5% of your total docs and extrapolate, since every
>> search product may have a different schema, a different set of fields,
>> different indexed vs. stored fields, copy fields, a different analysis
>> chain, etc.
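>>
>> A tiny Python sketch of that extrapolation; every number below is a
>> hypothetical placeholder, not a measurement:
>>
>>     sample_docs = 50_000          # e.g. ~1% of a 5M-doc corpus
>>     sample_index_bytes = 1.2e9    # on-disk size of the sample core's index dir
>>     total_docs = 5_000_000
>>     estimate_bytes = sample_index_bytes * total_docs / sample_docs
>>     print(f"Rough index size: {estimate_bytes / 1e9:.1f} GB")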
>>
>> If you just want a very quick rough estimate, create a few flat JSON sample
>> files (like the one below) with your field names and key values (actual
>> data gives a better estimate). Put in all the field names you are going to
>> index into Solr and check the JSON file size. That gives you the average
>> size of a doc; multiply it by the number of docs to get a rough index size
>> (a small sketch of that math follows the sample).
>>
>> {
>> "id":"product12345",
>> "name":"productA",
>> "category":"xyz",
>> ...
>> ...
>> }
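>>
>> And a small Python sketch of the "average doc size times doc count" math;
>> the file names and doc count are made up for illustration:
>>
>>     import os
>>
>>     sample_files = ["sample1.json", "sample2.json", "sample3.json"]
>>     avg_doc_bytes = sum(os.path.getsize(f) for f in sample_files) / len(sample_files)
>>     total_docs = 10_000_000
>>     print(f"Rough index size: {avg_doc_bytes * total_docs / 1e9:.1f} GB")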
>>
>> Thanks,
>> Susheel
>>
>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
>> wrote:
>>
>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>> is invaluable:
>>>
>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>
>>> -----Original Message-----
>>> From: Vasu Y [mailto:vya...@gmail.com]
>>> Sent: Monday, October 3, 2016 2:09 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: SOLR Sizing
>>>
>>> Hi,
>>> I am trying to estimate disk space requirements for the documents indexed
>>> to SOLR.
>>> I went through the LucidWorks blog
>>> (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>>> and am using it as the template. I have a question regarding estimating
>>> "Avg. Document Size (KB)".
>>>
>>> When calculating disk storage requirements, can we use the Java primitive
>>> type sizes
>>> (https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>> to come up with an average document size?
>>>
>>> Please let me know if the following assumptions are correct.
>>>
>>> Data Type          Size
>>> -----------------  -----------------------------------------------------
>>> long               8 bytes
>>> tint               4 bytes
>>> tdate              8 bytes (stored as a long?)
>>> string             1 byte per char for ASCII chars, 2 bytes per char for
>>>                    non-ASCII (double-byte) chars
>>> text               1 byte per char for ASCII chars, 2 bytes per char for
>>>                    non-ASCII (double-byte) chars (for both with & without
>>>                    norms?)
>>> ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
>>> boolean            1 bit?
>>>
>>> Thanks,
>>> Vasu
>>>
>
