I did not believe the benchmark results the first time, but they seem to hold
up. Nobody gets a speedup of over a thousand (unless you are going from that
Oracle search thing to Solr).

It probably won’t help most people. We have one service with very, very long
queries, up to 1000 words of free text. We also do as-you-type instant results,
so we have been using edge ngrams. Dropping the edge ngrams is what produced
the huge speedup.
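
For context, the instant-results field has been using the usual edge-ngram
recipe, roughly like the sketch below. The type name and gram sizes are
illustrative, not our exact schema. The point is that every indexed token gets
expanded into all of its prefixes, so the posting lists behind ordinary query
terms are far larger than they would be for a plain text field, which is
presumably why dropping the ngrams helps thousand-word queries so much.

<!-- Illustrative index-time edge-ngram chain; names and sizes are made up. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>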

The query result cache hit rate almost doubled, which accounts for part of the
non-linear speedup.
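
(For anyone following along, that is Solr’s queryResultCache, the one
configured in solrconfig.xml roughly as below; the sizes here are illustrative,
not ours.)

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="128"/>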

We already trim the number of terms passed to Solr to a reasonable number.
Google cuts off at 32 terms; we use a few more.
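
The trimming itself is trivial and happens before the Solr request is built.
A minimal sketch of the idea; the 40-term cap and the function name are made
up for illustration:

# Hypothetical query-trimming helper; the cap is illustrative.
MAX_TERMS = 40

def trim_query(q, max_terms=MAX_TERMS):
    # Keep only the first max_terms whitespace-separated terms.
    return " ".join(q.split()[:max_terms])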

We’re running a relevance A/B test for dropping the ngrams. If that doesn’t
pass, we’ll try something else, like only ngramming the first few words. Or
something.

I wanted to use MLT to extract the best terms out of the long queries.
Unfortunately, you can’t combine highlighting and MLT (MLT was never moved to
the new component system), and the MLT handler was really slow. Dang.

I still might do an outboard MLT with a snapshot of the high-IDF terms.
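
Roughly the kind of thing I mean, as a sketch only: periodically export
per-term document frequencies from the index, then keep just the most
selective (highest-IDF) terms from each long query and send only those to
Solr. The snapshot format, names, and numbers below are all made up.

import math

def load_df_snapshot(path):
    # Hypothetical snapshot file: one "term<TAB>docfreq" pair per line,
    # dumped from the index every so often.
    df = {}
    with open(path) as f:
        for line in f:
            term, count = line.rstrip("\n").split("\t")
            df[term] = int(count)
    return df

def best_terms(query, df, num_docs, keep=40):
    # Rank query terms by IDF and keep the most selective ones.
    # Terms missing from the snapshot get the maximum IDF.
    def idf(term):
        return math.log(num_docs / (1.0 + df.get(term, 0)))
    terms = set(query.lower().split())
    return sorted(terms, key=idf, reverse=True)[:keep]

Doing this outboard means the thousand-word query never reaches Solr at full
length.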

The queries are for homework help. I’ve only found one other search that had to
deal with this. I was talking with someone who worked on Encarta, and they had
the same challenge.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 3, 2016, at 8:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Walter:
> 
> What did you change? I might like to put that in my bag of tricks ;)
> 
> Erick
> 
> On Mon, Oct 3, 2016 at 6:30 PM, Walter Underwood <wun...@wunderwood.org> 
> wrote:
>> That approach doesn’t work very well for estimates.
>> 
>> Some parts of the index size and speed scale with the vocabulary instead of 
>> the number of documents.
>> Vocabulary usually grows at about the square root of the total amount of 
>> text in the index. OCR’ed text
>> breaks that estimate badly, with huge vocabularies.
>> 
>> Also, it is common to find non-linear jumps in performance. I’m benchmarking 
>> a change in a 12 million
>> document index. It improves the 95th percentile response time for one style 
>> of query from 3.8 seconds
>> to 2 milliseconds. I’m testing with a log of 200k queries from a production 
>> host, so I’m pretty sure that
>> is accurate.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>> 
>>> In short, if you want your estimate to be closer, run an actual ingestion
>>> for, say, 1-5% of your total docs and extrapolate, since every search
>>> product may have a different schema, different sets of fields, different
>>> indexed vs. stored fields, copy fields, different analysis chains, etc.
>>> 
>>> If you just want a very quick rough estimate, create a few flat JSON
>>> sample files (below) with field names and values (actual data gives a
>>> better estimate). Put in all the field names you are going to index/put
>>> into Solr and check the JSON file size. This gives you the average size of
>>> a doc; multiply by the number of docs to get a rough index size.
>>> 
>>> {
>>> "id":"product12345"
>>> "name":"productA",
>>> "category":"xyz",
>>> ...
>>> ...
>>> }
>>> 
>>> Thanks,
>>> Susheel
>>> 
>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <talli...@mitre.org>
>>> wrote:
>>> 
>>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>>> is invaluable:
>>>> 
>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>> 
>>>> -----Original Message-----
>>>> From: Vasu Y [mailto:vya...@gmail.com]
>>>> Sent: Monday, October 3, 2016 2:09 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: SOLR Sizing
>>>> 
>>>> Hi,
>>>> I am trying to estimate disk space requirements for the documents indexed
>>>> to SOLR.
>>>> I went through the LucidWorks blog
>>>> (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>>>> and am using it as the template. I have a question regarding estimating
>>>> "Avg. Document Size (KB)".
>>>> 
>>>> When calculating disk storage requirements, can we use the Java primitive
>>>> type sizes
>>>> (https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>>> and come up with an average document size?
>>>> 
>>>> Please let me know if the following assumptions are correct.
>>>> 
>>>> Data Type          Size
>>>> -----------------  ------------------------------------------------------
>>>> long               8 bytes
>>>> tint               4 bytes
>>>> tdate              8 bytes (stored as a long?)
>>>> string             1 byte per char for ASCII chars, 2 bytes per char for
>>>>                    non-ASCII (double-byte) chars
>>>> text               1 byte per char for ASCII chars, 2 bytes per char for
>>>>                    non-ASCII (double-byte) chars (both with & without norms?)
>>>> ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
>>>> boolean            1 bit?
>>>> 
>>>> Thanks,
>>>> Vasu
>>>> 
>> 
