Re: Performance if there is a large number of field

Erick Erickson Fri, 11 May 2018 19:40:15 -0700

Deepak:

I would strongly urge you to consider changing your solution so that
it does _not_ need 35,000 fields. Needing that many usually indicates
that there are much better ways of tackling the problem. As Shawn says,
35,000 fields won't make much difference for an individual search. But
35,000 fields _do_ take up meta-data space, because there has to be a
catalog of all the possibilities somewhere.


The question about missing fields is tricky. For the inverted index,
consider the structure. For each _field_, every term has a list of the
documents that contain it:
term: doc1, doc45, doc93, ...

So really, a doc not having the field is much like a doc not having a
particular term in that field: it just doesn't appear in the list.
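
To make that structure concrete, here is a minimal Python sketch of a
per-field inverted index. It is a toy model, not how Lucene actually
stores postings; the function name, field names and sample documents
are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Return {field: {term: [doc_id, ...]}} for whitespace-tokenized docs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in sorted(docs.items()):
        for field, text in doc.items():
            # Each distinct term in the field gets this doc appended once.
            for term in sorted(set(text.lower().split())):
                index[field][term].append(doc_id)
    return index

# Hypothetical documents; doc 45 has no "color" field at all.
docs = {
    1: {"title": "red apple", "color": "red"},
    45: {"title": "green apple"},
    93: {"title": "red pear", "color": "green"},
}

index = build_inverted_index(docs)
print(index["title"]["apple"])  # [1, 45]
print(index["color"]["red"])    # [1]
```

Doc 45 costs nothing in the "color" structure: it is simply absent
from every list there, exactly as if it lacked one particular term.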

But back to your problem. Think hard about _why_ you think you need
35,000 fields. Could you tag the values instead? Say you are storing
prices for stores for some item. Instead of having one field per store
(store1_price, store2_price, ...), what about having a single field
whose values are tagged tokens such as store1_price_1.53
store2_price_2.35, etc.?
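
As a rough illustration of the tagging idea, the sketch below turns a
per-store price mapping into tokens for one multi-valued field. The
helper name and the "store_prices" field name are hypothetical, and the
two-decimal formatting is an assumption about how prices are written.

```python
def to_tagged_tokens(prices):
    """Turn {"store1": 1.53, ...} into ["store1_price_1.53", ...]."""
    return [f"{store}_price_{price:.2f}" for store, price in sorted(prices.items())]

# One document with a single multi-valued field instead of one field per store.
doc = {
    "id": "item-1",
    "store_prices": to_tagged_tokens({"store1": 1.53, "store2": 2.35}),
}
print(doc["store_prices"])  # ['store1_price_1.53', 'store2_price_2.35']
```

However many stores there are, the schema only needs the one field.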

Or consider payloads: index tokens such as store1_price|1.53
store2_price|2.35 and use the payload value at query time.
See: https://lucidworks.com/2017/09/14/solr-payloads/
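
Conceptually, a payload is just a value attached to a term. The sketch
below only mimics that idea in plain Python by splitting each token on
the "|" delimiter; it does not use any Solr API, and the function name
is made up for illustration.

```python
def parse_payload_tokens(text, delimiter="|"):
    """Map each term to the float payload that follows the delimiter."""
    payloads = {}
    for token in text.split():
        term, _, raw = token.partition(delimiter)
        payloads[term] = float(raw)
    return payloads

prices = parse_payload_tokens("store1_price|1.53 store2_price|2.35")
print(prices["store2_price"])  # 2.35
```

The term (store2_price) is what gets matched, and the number rides
along with it, so again a single field covers every store.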

I've rarely seen situations where having that many fields is an
optimal solution.

Best,
Erick

On Fri, May 11, 2018 at 12:20 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 5/11/2018 9:26 AM, Andy C wrote:
>> Why are range searches more efficient than wildcard searches? I guess I
>> would have expected that they just provide different mechanism for defining
>> the range of unique terms that are of interest, and that the merge
>> processing would be identical.
>
> I hope I can explain the reason that wildcard queries tend to be slow.
> I will use an example field from one of my own indexes.
>
> Choosing one of the shards of my main index, and focusing on the
> "keywords" field for that Solr core:  Here's the histogram data that the
> Luke handler gives for this field:
>
>       "histogram":[
>         "1",14095268,
>         "2",767777,
>         "4",425610,
>         "8",312156,
>         "16",236743,
>         "32",177718,
>         "64",122603,
>         "128",80513,
>         "256",52746,
>         "512",34925,
>         "1024",24770,
>         "2048",17516,
>         "4096",11467,
>         "8192",7748,
>         "16384",5210,
>         "32768",3433,
>         "65536",2164,
>         "131072",1280,
>         "262144",688,
>         "524288",355,
>         "1048576",163,
>         "2097152",53,
>         "4194304",12]}},
>
>
> The first entry means that there are 14 million terms that only appear
> once in the keywords field across the whole index. The last entry means
> that there are twelve terms that appear 4 million times in the keywords
> field across the whole index.
>
> Adding this all up, I can see that there are a little more than 16
> million unique terms in this field.
>
> This means that when I do a "keywords:*" query, that Solr/Lucene will
> expand this query such that the query literally contains 16 million
> individual terms.  It's going to take time just to make the query.  And
> then that query will have to be executed.  No matter how quickly each
> term in the query executes, doing 16 million of them is going to be slow.
>
> Just for giggles, I used my dev server to execute that "keywords:*"
> query on this single shard.  The reported QTime in the response was
> 18017 milliseconds.  Then I ran the full range query.  The reported
> QTime for that was 14569 milliseconds.  Which is honestly slower than I
> thought it would be, but faster than the wildcard.  The number of unique
> terms in the field affects both kinds of queries, but the effect of a
> large number of terms on the wildcard is usually greater than the effect
> on the range.
>
>> Would a search such as:
>>
>> field:c*
>>
>> be more efficient if rewritten as:
>>
>> field:[c TO d}
>
> On most indexes, probably.  That would depend on the number of terms in
> the field, I think.  But there's something to consider:  Not every
> wildcard query can be easily rewritten as a range.  I think this one is
> impossible to rewrite as a range:  field:abc*xyz
>
> I tried your c* example as well on my keywords field.  The wildcard had
> a QTime of 1702 milliseconds.  The range query had a QTime of 1434
> milliseconds.  The numFound on both queries was identical, at 16399711.
>
> Thanks,
> Shawn
>
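
As a check on the arithmetic in the quoted message, the unique-term
count can be recomputed from the Luke histogram. A minimal sketch,
assuming the histogram is the flat list of alternating bucket labels
and term counts shown in the quote (the function name is mine):

```python
# Flat [bucket_label, term_count, ...] list copied from the quoted histogram.
histogram = [
    "1", 14095268, "2", 767777, "4", 425610, "8", 312156,
    "16", 236743, "32", 177718, "64", 122603, "128", 80513,
    "256", 52746, "512", 34925, "1024", 24770, "2048", 17516,
    "4096", 11467, "8192", 7748, "16384", 5210, "32768", 3433,
    "65536", 2164, "131072", 1280, "262144", 688, "524288", 355,
    "1048576", 163, "2097152", 53, "4194304", 12,
]

def total_unique_terms(flat):
    """Sum the counts, which sit at the odd positions of the flat list."""
    return sum(flat[1::2])

print(total_unique_terms(histogram))  # 16380918
```

That agrees with the "a little more than 16 million" figure above.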
