Re: cache persistent Hits

Erick Erickson Tue, 26 Sep 2006 13:42:28 -0700

See below.

On 9/26/06, Gaston <[EMAIL PROTECTED]> wrote:


hi,

First, thank you for the fast reply.

I use a MultiSearcher that opens 3 indexes, so this surely makes the whole
operation slower, but 20 seconds for 5260 results out of a 212MB
index is much too slow.
Another reason could, of course, be my ISP.

Here is my code:

        IndexSearcher[] searchers;
        searchers=new IndexSearcher[3];
        String path="/home/sn/public_html/";
        searchers[0]=new IndexSearcher(path+"index1");
        searchers[1]=new IndexSearcher(path+"index2");
        searchers[2]=new IndexSearcher(path+"index3");
        MultiSearcher searcher=new MultiSearcher(searchers);




Above you've opened the searcher for each search, exactly as I feared. This
is a major hit. Don't do this, but keep the searchers open between calls.
You can demonstrate this to yourself by returning time intervals in your
HTML page. Take one timestamp right here, one after a new dummy query that
you make up and hard-code, and one after the "real" query you already have
below. Return them all in your HTML page and take a look. I think you'll see
that the first query takes a while, and the second is very fast. And don't
iterate over all the hits (more below).
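
One way to keep the searchers open between calls is to open them once and hand the same instance to every request. The sketch below shows only that pattern and deliberately avoids any Lucene calls: SharedResource and its opener are hypothetical names, and in the servlet the opener would be whatever code builds the MultiSearcher.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Opens an expensive resource once and hands the same instance to every caller. */
final class SharedResource<T> {
    // Hypothetical opener; in the servlet this would build the MultiSearcher.
    private final Supplier<T> opener;
    private T instance; // stays null until the first get()

    SharedResource(Supplier<T> opener) {
        this.opener = opener;
    }

    /** Returns the cached instance, running the opener on first use only. */
    synchronized T get() {
        if (instance == null) {
            instance = opener.get();
        }
        return instance;
    }
}

class SharedResourceDemo {
    public static void main(String[] args) {
        AtomicInteger opens = new AtomicInteger();
        SharedResource<String> shared = new SharedResource<>(() -> {
            opens.incrementAndGet(); // stands in for the slow index open
            return "searcher";
        });
        shared.get();
        shared.get();
        System.out.println("opened " + opens.get() + " time(s)"); // opened 1 time(s)
    }
}
```

Held in a static field or in the servlet context, such a holder pays the open cost on the first request only. It does not handle re-opening after the index changes; that is left out of this sketch.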


            QueryParser parser=new QueryParser("content",new StandardAnalyzer());
            parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

            Query query=parser.parse("urlName:"+userInput+" OR "+"content:"+userInput);

            Hits hits=searcher.search(query);

            for(int i=0;i<hits.length();i++)
            {

                Document doc=hits.doc(i);


            }



What is the purpose of the iteration above? It does nothing except waste time.
I'd just remove it (unless there's something else you're doing here that you
left out). If you're trying to get to the startPoint below, well, there's no
reason to iterate above; just go directly to the loop below. For 5000 hits,
you're repeating the search 50 times or so, as has been discussed in these
archives repeatedly. See my previous mail.....
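
Going directly to the page loop only needs the first and last index of the requested page. A minimal sketch of that arithmetic, assuming a page size of 10 as in the code that follows; PageWindow is a hypothetical helper, not part of Lucene, and the clamping is an addition so the loop cannot run past the last hit.

```java
/** Computes which hit indexes belong on one results page. */
final class PageWindow {
    private PageWindow() {}

    /** First index to display, clamped to the range [0, totalHits]. */
    static int start(int totalHits, int startPoint) {
        return Math.max(0, Math.min(startPoint, totalHits));
    }

    /** One past the last index to display; never exceeds totalHits. */
    static int end(int totalHits, int startPoint, int pageSize) {
        return Math.min(totalHits, start(totalHits, startPoint) + pageSize);
    }

    public static void main(String[] args) {
        // Last page of 5260 hits: only 5 documents are left to show.
        int from = start(5260, 5255);
        int to = end(5260, 5255, 10);
        System.out.println(from + ".." + to); // 5255..5260
    }
}
```

The display loop would then run from start(...) up to, but not including, end(...), touching only the documents that are actually printed.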


      // Print only 10 results per page


    for(int i=startPoint;i<startPoint+10;i++)
            {

                Document doc=hits.doc(i);

                    out.println(escapeHTML(doc.get("description"))+"<p>");
                    out.println("<a href="+doc.get("url")+">"+doc.get("url").substring(7)+"</a>");
                    out.println("<p><p><p>");

            }

Perhaps somebody sees the reason why it is so slow.

Thank you in advance

Greetings Gaston



I'm assuming that your ISP comment is just where you're getting your page
from, and that your searchers and indexes are at least on the same network
and NOT separated by the web, as that would be slow and hard to fix.

To get a sense of where you're really spending your time, I'd actually get
the system time at various points in the process and send the *times* back
in your HTML page. You can't really tell anything by measuring how long
it takes to get your HTML page back; you've *got* to measure at discrete
points in the code and return those.
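
A minimal sketch of that kind of measurement, using only the JDK. The Checkpoints class and the labels in the usage note are made up for illustration; the labels are assumed to be hard-coded strings, so they are not HTML-escaped here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Records elapsed milliseconds between named checkpoints in one request. */
final class Checkpoints {
    // LinkedHashMap keeps the checkpoints in the order they were recorded.
    private final Map<String, Long> elapsedMillis = new LinkedHashMap<>();
    private long last = System.nanoTime();

    /** Stores the time since the previous mark (or construction) under the label. */
    void mark(String label) {
        long now = System.nanoTime();
        elapsedMillis.put(label, (now - last) / 1_000_000);
        last = now;
    }

    /** Renders the timings as an HTML list that can be appended to the page. */
    String toHtml() {
        StringBuilder sb = new StringBuilder("<ul>");
        for (Map.Entry<String, Long> e : elapsedMillis.entrySet()) {
            sb.append("<li>").append(e.getKey()).append(": ")
              .append(e.getValue()).append(" ms</li>");
        }
        return sb.append("</ul>").toString();
    }
}
```

Create one Checkpoints per request, call mark("open searchers"), mark("dummy query"), mark("real query") and mark("render") right after each step, then write toHtml() at the bottom of the page.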

5,000+ results should not be taking 20 seconds. I strongly suspect that the
fact that you're opening your searchers every time and uselessly iterating
through all the hits is the culprit. If I remember correctly, and you have
5,000 documents, you're executing the query about 50 times when you iterate
through all the hits. Under the covers, Hits is optimized for about 100
results. As you iterate through, each "next 100" re-executes the query. You
could search the mail archive for this topic, maybe "hits slow" or some such,
for a fuller explanation.
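
The "about 50 times" figure is just a ceiling division if the rule of thumb above is taken at face value. The sketch below does that arithmetic only; the batch size of 100 is the approximate number from this message, not a verified Lucene constant, and the real Hits class may grow its cache differently. RequeryEstimate is a hypothetical name.

```java
/** Back-of-the-envelope estimate based on the "one re-execution per batch" rule of thumb. */
final class RequeryEstimate {
    private RequeryEstimate() {}

    /** Number of batches needed to walk every hit: ceil(totalHits / batchSize). */
    static int batchesNeeded(int totalHits, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (totalHits + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        System.out.println(batchesNeeded(5260, 100)); // 53: walking all hits
        System.out.println(batchesNeeded(10, 100));   // 1: showing a single page
    }
}
```

Under that assumption, iterating over all 5260 hits costs on the order of 53 query executions, while displaying one page of 10 costs one.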

Hope this helps
Erick


Erick Erickson wrote:


> Well, my index is over 1.4G, and others are reporting very large indexes in
> the 10s of gigabytes. So I suspect your index size isn't the issue. I'd be
> very, very, very surprised if it was.
>
> Three things spring immediately to mind.
>
> First, opening an IndexSearcher is a slow operation. Are you opening a new
> IndexSearcher for each query? If so, don't <G>. You can re-use the same
> searcher across threads without fear and you should *definitely* keep it
> open between queries.
>
> Second, your query could just be very, very interesting. It would be more
> helpful if you posted an example of the code where you take your timings
> (including opening the IndexSearcher).
>
> Third, if you're using a Hits object to iterate over many documents, be
> aware that it re-executes the query every hundred results or so. You want to
> use one of the HitCollector/TopDocs/TopDocsCollector classes if you are
> iterating over all the returned documents. And you really *don't* want to do
> an IndexReader.doc(doc#) or Searcher.doc(doc#) on every document.
>
> If none of this helps, please post some code fragments and I'm sure others
> will chime in.
>
> Best
> Erick
>
> On 9/26/06, Gaston <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi,
>>
>> Lucene itself has a volatile caching mechanism provided by a weak
>> HashMap. Is it possible to serialize the Hits object? I am thinking of
>> a HashMap that, for each query, caches the first 100 results. Is
>> it possible to implement such a feature, or is there such an extension?
>> My problem is that searching in my application, with an index
>> of 212MB, takes too much time, even though I changed the Boolean operator
>> from OR to AND.
>>
>> I am happy about every suggestion.
>>
>> Greetings
>>
>> Gaston.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>

