Did you say what the memory profile of your machine is? How much memory is there, and how large are the shards? This is just a guess, but if you are memory-constrained, there may be a lot of thrashing caused by paging (swapping) the sharded indexes in and out, while a single index can be scanned linearly even if it does need to be paged in.
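If you want to check, watch swap activity while the slow query runs (standard
Linux tooling, nothing Solr-specific):

    vmstat 1    # sustained nonzero si/so (swap in/out) columns would confirm paging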

-Mike

On 11/14/2013 8:10 AM, Elran Dvir wrote:
Hi,

We tried returning just the id field and got exactly the same performance.
Our system is distributed, but all shards are on a single machine, so network
issues are not a factor.
The code we found where Solr is spending its time is on the shard, not on
the routing core (again, all shards are local).
We investigated the getFirstMatch() method and noticed that
MultiTermsEnum.reset (inside MultiTerms.iterator) and MultiTermsEnum.seekExact
take 99% of the time.
Inside these methods, the call to
BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock takes most
of the time.
Out of the 7-second run, these methods take ~5 seconds and
BinaryResponseWriter.write takes the rest (~2 seconds).
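For reference, the id-only test was essentially our original query with fl=id
added:

    http://127.0.0.1:8983/solr/template/select?rows=5000&q=*:*&fl=id&shards=127.0.0.1:8983/solr/core1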

We tried increasing cache sizes and got hits, but it only improved the query
time by about a second (to ~6s), so no major effect.
We are not indexing during our tests; the performance is similar either way.
(How do we measure doc size? Does it matter, given that the performance is the
same when returning only the id field?)
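For reference, the cache settings involved live in solrconfig.xml; the entries
below are the stock ones with illustrative sizes, not our exact values. One
thing worth noting: queryResultMaxDocsCached must be at least rows, or a
5000-row result is never cached at all:

    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultMaxDocsCached>5000</queryResultMaxDocsCached>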

We still don't completely understand why the query takes so much longer when
the cores are all on the same machine.

Is there a way to improve the performance (code, configuration, query)?

-----Original Message-----
From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel Le 
Normand
Sent: Thursday, November 14, 2013 1:30 AM
To: solr-user@lucene.apache.org
Subject: Re: distributed search is significantly slower than direct search

It's surprising that such a query takes this long. I would assume that after
running q=*:* repeatedly you would start getting cache hits and times would
drop. Check in the admin UI how your query/doc caches perform.
Moreover, the query in itself just asks for the first 5000 docs that were
indexed (returning the first [docid]s), so it seems all this time is spent on
transfer. Out of these 7 seconds, how much is spent in the method above? What
do you return by default? How big is each doc you display in your results?
It might also be that both collections are competing for the same resources.
Try elaborating on your use case.
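If you prefer raw numbers over the admin UI, the mbeans stats handler exposes
the same information per core (standard Solr 4.x handler):

    http://127.0.0.1:8983/solr/core1/admin/mbeans?stats=true&cat=CACHE&wt=json

Look at hitratio and evictions for queryResultCache and documentCache.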

Anyway, it seems you ran this test to see what the performance hit of a
distributed environment would be, so I'll try to explain some things we
encountered in our benchmarks, with a case that at least matches yours in the
number of docs fetched.

We fetch 2000 docs on every query, running over 40 shards. This means every
shard actually transfers 2000 ids to our frontend on every document-match
request (the first phase you were referring to). Even if lazily loaded,
reading 2000 ids (on 40 servers) and lazy-loading the fields is a tough job.
Waiting for the slowest shard to respond, then sorting the ids and loading
(lazily or not) the top 2000 docs can take a long time.
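Schematically, each shard sees two requests per user query (parameter names
from Solr's distributed-search internals, simplified here):

    # phase 1: ids and sort values only
    http://shard:8983/solr/core1/select?q=...&rows=2000&fl=id,score&isShard=true
    # phase 2: stored fields for the ids that made the merged top 2000
    http://shard:8983/solr/core1/select?ids=id1,id2,...&isShard=true

The frontend waits for all 40 id lists, merges them, and only then issues the
second round.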

Our times are 4-8 seconds, but the cases aren't directly comparable. We've
taken a few steps along the way that improved things, each leading to others.
These were our starting points:

    1. Profile these queries from different servers and Solr instances, and
    try to put your finger on which collection is working hard and why. Check
    whether you're stuck with components that add no value for you but are
    enabled by default.
    2. Consider eliminating the document cache. It holds lots of (partly
    lazy) documents whose probability of being used again is low; there is no
    such thing as "popular docs" when you request this many per query. That
    memory may be better used elsewhere.
    3. Bottleneck check: server-level metrics such as CPU user time / iowait,
    packets transferred over the network, page faults, etc. (from vmstat,
    iostat, sar and the like) are excellent for understanding whether disk,
    network, or CPU is slowing you down. Then upgrade the hardware on one
    shard and compare that shard's qTime against the others to see if it
    helps.
    4. Warm up the index after committing: benchmark how queries perform
    before and after some warm-up, say a few hundred queries (taken from your
    previous system), in order to warm the OS cache (assuming you're using
    NRTCachingDirectoryFactory); a rough SolrJ sketch follows this list.
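A minimal warm-up sketch in SolrJ (the core URL and query file are
placeholders; point it at each shard and a dump of queries from your previous
system, and run it after commits):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class WarmUp {
      public static void main(String[] args) throws Exception {
        // placeholder core URL and query file; adjust to your setup
        HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core1");
        for (String q : Files.readAllLines(Paths.get("past-queries.txt"),
                                           StandardCharsets.UTF_8)) {
          server.query(new SolrQuery(q));  // responses discarded; goal is hot OS/Solr caches
        }
        server.shutdown();
      }
    }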


Good luck,
Manu


On Wed, Nov 13, 2013 at 2:38 PM, Erick Erickson <erickerick...@gmail.com> wrote:

One thing you can try, and this is more diagnostic than a cure, is to
return just the id field (and ensure that lazy field loading is true).
That'll tell you whether the issue is actually fetching the documents
off disk and decompressing them, although frankly that's unlikely since
you can get your 5,000 rows from a single machine quickly.
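For reference, that's the enableLazyFieldLoading flag in solrconfig.xml (the
stock example config ships with it on):

    <enableLazyFieldLoading>true</enableLazyFieldLoading>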

The code you found where Solr is spending its time, is that on the
"routing" core or on the shards? I actually have a hard time
understanding how that code could take so long; it doesn't seem
right.

You are transferring 5,000 docs across the network, so it's possible
that your network is just slow; that's certainly a difference between
the local and remote cases, but that's a stab in the dark.

Not much help I know,
Erick



On Wed, Nov 13, 2013 at 2:52 AM, Elran Dvir <elr...@checkpoint.com> wrote:

Erick, Thanks for your response.

We are upgrading our system to use Solr, and we need to preserve the old
functionality: our client displays 5K documents and groups them.

Is there a way to refactor the code to improve distributed document
fetching?

Thanks.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, October 30, 2013 3:17 AM
To: solr-user@lucene.apache.org
Subject: Re: distributed search is significantly slower than direct
search
You can't. There will inevitably be some overhead in the distributed
case.
That said, 7 seconds is quite long.

5,000 rows is excessive, and probably where your issue is. You're
having to go out and fetch the docs across the wire. Perhaps there is
some batching that could be done there; I don't know whether this is
one document per request or not.

Why 5K docs?

Best,
Erick


On Tue, Oct 29, 2013 at 2:54 AM, Elran Dvir <elr...@checkpoint.com>
wrote:
Hi all,

I am using Solr 4.4 with multiple cores. One core (called template)
is my "routing" core.

When I run
http://127.0.0.1:8983/solr/template/select?rows=5000&q=*:*&shards=127.0.0.1:8983/solr/core1,
it consistently takes about 7s.
When I run
http://127.0.0.1:8983/solr/core1/select?rows=5000&q=*:*,
it consistently takes about 40ms.

I profiled the distributed query.
This is the distributed query process (I hope the terms are accurate):
when Solr identifies a distributed query, it first sends the query to
the shard and gets back the matching doc ids.
Then it sends another query to the shard to fetch the actual Solr documents.
Most of the time is spent in that last stage, in the process() method of
QueryComponent, in:

for (int i = 0; i < idArr.size(); i++) {
    // one term-dictionary lookup per returned id, i.e. 5000 seeks for rows=5000
    int id = req.getSearcher().getFirstMatch(
            new Term(idField.getName(), idField.getType().toInternal(idArr.get(i))));

How can I make my distributed query as fast as the direct one?

Thanks.

