Re: Searching documents on big index by using ParallelMultiSearcher is slow...

Scott Wed, 04 Oct 2006 06:32:25 -0700

Indeed, I am using a somewhat complex Query (4 fields combined with OR).

My index has the fields Title, Sub-title, Content, and Author,
and I search all of them with one query, like a web search engine.

Thank you for the details about Weight.

So I need to avoid the remote calls to rewrite() and docFreq().
I'll try building the Hits object remotely on each slave; the SearchMaster
then collects the top N hits from each Hits and sorts them itself.
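A minimal sketch of the merge step on the SearchMaster side, assuming each slave returns its own top hits as plain values. The names TopNMerge, ScoredHit, and mergeTopN are hypothetical and not part of Lucene. Note also that if each slave scores with its own local docFreq, scores from different slaves may not be strictly comparable, so this is an illustration of the merge only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNMerge {
    /** Hypothetical value type: one hit as a search slave might return it. */
    public static final class ScoredHit {
        public final int shard;   // which slave the hit came from
        public final int doc;     // document id local to that slave
        public final float score; // score computed on that slave

        public ScoredHit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Merges the per-slave hit lists and returns the n best hits, best first. */
    public static List<ScoredHit> mergeTopN(List<List<ScoredHit>> perShard, int n) {
        List<ScoredHit> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        Comparator<ScoredHit> byScore = (a, b) -> Float.compare(a.score, b.score);
        // Min-heap: the weakest hit kept so far sits at the head.
        PriorityQueue<ScoredHit> heap = new PriorityQueue<>(byScore);
        for (List<ScoredHit> hits : perShard) {
            for (ScoredHit h : hits) {
                if (heap.size() < n) {
                    heap.add(h);
                } else if (h.score > heap.peek().score) {
                    heap.poll(); // drop the current weakest hit
                    heap.add(h);
                }
            }
        }
        result.addAll(heap);
        result.sort(byScore.reversed()); // highest score first
        return result;
    }
}
```

With 10 slaves each returning N hits, the master only handles 10 * N small objects per query, independent of the index size.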

I tested ParallelMultiSearcher performance.
It creates and starts the threads serially,
then waits for all threads to finish.
But since it is threaded, the searches on the remote servers run in parallel.

I inserted debug code that measures elapsed time
into the methods below.

ParallelMultiSearcher.java:
public TopDocs search(Weight weight, Filter filter, int nDocs);

public TopFieldDocs search(Weight weight, Filter filter, int nDocs, Sort sort);


Searcher.java:
public Hits search(Query query, Filter filter);
public Hits search(Query query, Sort sort);
public Hits search(Query query, Filter filter, Sort sort);

The debug code is:
--------
long startTime = System.currentTimeMillis();
System.out.println("Start ClassNameHere search");

... original main routine ...

long endTime = System.currentTimeMillis();
// Note: dividing by 1000 yields seconds, so the "ms" label printed below is misleading.
float totalTime = (endTime - startTime) / 1000.0f;
System.out.println("End ClassNameHere search in " + totalTime + "ms");
--------

The result is as follows.

Start Searcher search
Start ParallelMultiSearcher search
End ParallelMultiSearcher search in 0.049ms
End Searcher search in 0.449ms

I think the difference, 0.449 - 0.049 = 0.400 (these values are seconds,
since the debug code divides by 1000), is the Weight calculation,
so I need some trick to reduce it...
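One possible trick, sketched under assumptions: send the docFreq() call for a term to all remotes concurrently instead of one after another, and sum the results. The Shard interface and the ParallelDocFreq class are hypothetical stand-ins for a remote searchable; this is not how ParallelMultiSearcher itself works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDocFreq {
    /** Hypothetical stand-in for one remote index (e.g. behind RMI). */
    public interface Shard {
        int docFreq(String term);
    }

    /** Asks every shard for the term's document frequency in parallel and sums the answers. */
    public static int totalDocFreq(List<Shard> shards, final String term) {
        if (shards.isEmpty()) {
            return 0;
        }
        // One thread per shard, so all remote calls are in flight at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (final Shard shard : shards) {
                tasks.add(() -> shard.docFreq(term));
            }
            int total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape the wall-clock cost per term is roughly that of the slowest remote rather than the sum over all remotes. A real version would reuse one thread pool across queries instead of creating one per call.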

Haines, Ronald C. (LNG-DAY) wrote:

Keep in mind, that depending on your queries (lots of terms, wildcards,
date ranges), you can spend quite a bit of time during the 'Weight'
calculation...this all happens pre-search.  During the Weight
calculation, you will be making remote calls to the rewrite() and
docFreq() methods.  There will be (# of terms * # of remotes) of these
remote calls made for each of the above methods.

And, I think the ParallelMultiSearcher will make all of these calls
serially before it starts to thread the search process.  I have found
that this, serially, can account for quite a bit of the overall response
time.

I too am interested in learning more about a large scale distributed
Lucene model.

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 04, 2006 7:33 AM
To: java-user@lucene.apache.org
Subject: Re: Searching documents on big index by using ParallelMultiSearcher is slow...

OK, you're now officially beyond my competence, so I'll have to wait for
people who actually know <G>....

Although if I read your stats right, you're getting approximately 1 sec
response time over 10M documents on a 10G index. That's not bad at all.
What kind of response time do you need?

On 10/3/06, Scott <[EMAIL PROTECTED]> wrote:

Hi,

Well, the first question is always "are you opening/closing your
IndexSearchers for each request on your remote machines?". This is
always a no-no. This is also a question for your single-searcher version.

Yes, I know. Each search slave (RMI server) has a single instance
  of IndexSearcher, and it is opened once when the RMI server starts.

What is your performance if you only go to one server? I'd start by
finding...

Performance on one server with the FULL index (not divided by 10)
  is about 2500 ms.
On one server with the split index (1/10 of the full index) it is about 50 ms.

With ParallelMultiSearcher over 10 remote searchables,
  each RemoteSearchable returns in about 50 - 100 ms,
  and ParallelMultiSearcher also returns in about 50 - 100 ms, because of
  threading.
But Hits Searcher.search(Query, Sort) responds in about 500 - 1000 ms.

I think that Searcher.search with Sort reads all of the SortFields from
  the IndexReader, and that is the bottleneck.

Are there any reports of high-performance distributed Lucene with
ParallelMultiSearcher?
Or do I need Hadoop?

Erick Erickson wrote:

Well, the first question is always "are you opening/closing your
IndexSearchers for each request on your remote machines?". This is
always a no-no. This is also a question for your single-searcher version.

What is your performance if you only go to one server? I'd start by
finding out what happens when you forget all the ParallelMultiSearcher
stuff, all the RMI stuff etc, and just see what your performance is on
one of your index parts locally. Once that is answered, extend to RMI,
then the Parallel...., at each step seeing if your performance degrades
unacceptably. That'll at least give you a clue what part of the process
is the biggest problem.

And without knowing a LOT more about your searches, and your index,
it's kind of hard to come up with solutions <G>....

Best
Erick

On 10/3/06, Scott <[EMAIL PROTECTED]> wrote:

Hi,

I have a question about ParallelMultiSearcher performance.

I want to search documents in an index of about 10 gigabytes.
(The index has 10,000,000 documents.)

I get very slow performance using a plain IndexSearcher on ONE index.

So I tried using ParallelMultiSearcher with 10 remote searchable servers.

Index:
Each search slave has 1/10 of the index.
(ONE index divided across 10 servers.)

Search slave:
Each search slave starts a remote searchable RMI server
and waits for a connection from the search master.

Search master:
The search master uses Naming.lookup() to get the remote searchables.
It gets 10 remote searchables, one from each search slave, and builds a
ParallelMultiSearcher.
Then it searches.

Any solution?

--
Scott

---------------------------------------------------------------------

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
