At 02:39 PM 03/26/02 +0800, Stas Bekman wrote:

>Per Einar Ellefsen wrote:
>> What I can suggest: as we generate our HTML from POD files, knowing what 
>> is code, could there maybe be some possibility of putting some <div> 
>> tags around the <pre> ones, and then patch Swish in some way to get it 
>> to treat those parts as searchable but not displayable? If I understood 
>> it right, it's already using some <div> tags to know what to index, so 
>> maybe it would be possible to make it a little more advanced?

You mean not displaying the context because it doesn't look nice in the
summary?  I think I'd rather have it show the context even if it is ugly.

The highlighting code is designed to just show the first X words or X
characters (depending on which highlight module is used) if no matching
context is found to display, but I'd still rather see the word hits, if
possible.
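
As a rough illustration of that fallback (a Python sketch, not the actual
highlight modules; the function name, the punctuation stripping, and the
default sizes are made up for the example):

```python
def summarize(text, query_words, max_words=8, context=3):
    # Show a few words around the first hit; if no query word is found,
    # fall back to the first max_words words of the text.
    words = text.split()
    wanted = {w.lower() for w in query_words}
    for i, word in enumerate(words):
        # Strip simple punctuation so "registry," still counts as a hit.
        if word.strip(".,;:!?\"'()").lower() in wanted:
            start = max(0, i - context)
            return " ".join(words[start:i + context + 1])
    return " ".join(words[:max_words])
```

With a hit you get the neighbouring words; without one you just get the
start of the text, which is the "first X words" behaviour described above.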

 Search Google for: ["the guide" AND "light registry"]

It does basically the same thing.

>I don't think this is possible, since the hit doesn't happen in the 
>sentence but an index which points to the section which includes this 
>sentence.

Right.  It isn't grep.
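
A toy picture of the difference, in Python (this is not swish's real index
format; the section ids and function names are invented for the example).
The index only knows which sections contain a word, not which sentence:

```python
from collections import defaultdict

def build_index(sections):
    # sections: mapping of section id -> section text
    index = defaultdict(set)
    for section_id, text in sections.items():
        for word in text.lower().split():
            index[word].add(section_id)
    return index

def search(index, word):
    # Returns the ids of the sections containing the word, nothing finer.
    return sorted(index.get(word.lower(), set()))
```

Any context shown in the results has to be dug out of the section text
afterwards, which is what the highlighting code does.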

Also, if you start trying to preserve HTML then it becomes a bit more
tricky and slow to do the phrase highlighting, since a phrase match in
swish might match across HTML formatting.  For example imagine highlighting
the *phrase* that matches the last word of one link, and the first word of
a link that follows.  Matching "foo bar":

      <a href="first">bla bla foo</a>  <a href="second">bar bla bla</a>

ends up as:

      <a href="first">bla bla <span class="mark">foo</span></a><span class="mark"> </span><a href="second"><span class="mark">bar</span> bla bla</a>

And it gets even harder when that first link might have looked like:

    <a href="first">bla <em>bla fo</em>o</a>

Not very likely, but you can see why you then need HTML::TreeBuilder to do
that kind of rewriting of the HTML.  The phrase highlighting code is messy
enough working with just plain text.  So if you start highlighting code in
50K or 100K documents where you first need to build an HTML tree, then
parsing time becomes quite noticeable.
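
To make the cross-tag case concrete, here is a small Python sketch (not the
swish highlighting code, and not HTML::TreeBuilder).  It assumes very simple
markup and splits tags from text with a regex, which a real parser would do
properly; the function name and the "mark" class default are just for the
example.  The phrase is matched against the text with the tags removed, and
each text piece that overlaps the match gets its own span, so no span ever
crosses a tag boundary:

```python
import re

TAG = re.compile(r"(<[^>]+>)")

def highlight_phrase(html, phrase, css_class="mark"):
    parts = TAG.split(html)          # even indexes: text, odd indexes: tags
    text = "".join(parts[0::2])      # the text as the indexer would see it
    pattern = r"\s+".join(re.escape(w) for w in phrase.split())
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return html
    out, pos = [], 0
    for i, part in enumerate(parts):
        if i % 2:                    # a tag: copy it through untouched
            out.append(part)
            continue
        start, end = pos, pos + len(part)
        lo, hi = max(start, match.start()), min(end, match.end())
        if lo < hi:                  # this text piece overlaps the match
            a, b = lo - start, hi - start
            part = (part[:a] + '<span class="' + css_class + '">'
                    + part[a:b] + "</span>" + part[b:])
        out.append(part)
        pos = end
    return "".join(out)
```

It only marks the first match and knows nothing about comments, entities,
or a ">" inside an attribute value.  Handling those correctly is where a
real parser, and its cost, comes in.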

Currently, even without parsing the HTML, all the slowness you see in
returning results comes from the highlighting code (at least that's what I
see on my LAN; you may also have a slow connection).  Turn off highlighting
and the results are returned very fast.

I'm sure Google is much smarter about highlighting than swish (considering
swish doesn't do highlighting).  But Google doesn't try too hard either.
Here's how it highlighted the phrase "light set", which included HTML
formatting:

     <i><B style="color:black;background-color:#99ff99">light</i> </B>
    <B style="color:black;background-color:#99ff99">set</B>

A little nesting problem there, I think.
         

>I've another suggesting: is it possible to distinguish between sentences 
>(or parts of) when presenting the hit's context? If so we could add 
><br>'s after each sentence/part of and therefore make it more readable. 
>I know you said that \n are removed, but if there is a way to keep the 
>original strings as tokens in the index, this will improve the 
>readability a lot.

I probably could substitute \n inside of <pre> sections with %0A or
something like that in the swish parser code.  Or, in the code that splits
the documents into sections, I could replace \n inside <pre> with a set of
chars that won't be indexed but can then be used as a flag to show where
the \n were (and thus be replaced with <br>).  But I think that's overkill.
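
For what it's worth, the flag idea would look roughly like this in Python
(a sketch only, not swish code).  The marker character is an arbitrary
private-use code point that I'm assuming never occurs in the documents and
isn't indexed, and the function names are made up:

```python
import re

NEWLINE_FLAG = "\ue000"   # hypothetical marker, assumed absent from documents

PRE = re.compile(r"(<pre>)(.*?)(</pre>)", re.DOTALL | re.IGNORECASE)

def flag_pre_newlines(html):
    # Before indexing: swap newlines inside <pre> sections for the marker.
    return PRE.sub(
        lambda m: m.group(1) + m.group(2).replace("\n", NEWLINE_FLAG) + m.group(3),
        html,
    )

def collapse_whitespace(text):
    # Rough equivalent of Perl's s/\s+/ /g; the marker is not whitespace,
    # so it survives this step.
    return re.sub(r"\s+", " ", text)

def render_summary(text):
    # At display time: turn the markers back into line breaks.
    return text.replace(NEWLINE_FLAG, "<br>")
```

The marker has to survive the whitespace collapsing, so it can't be
anything that \s matches.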

Look at http://hank.org:5000/search/swish.cgi?query=registry

I'll put back in the \n below -- swish does a s/\s+/ /g in some cases (when
joining text together), so swish would need to be modified to keep
whitespace, too, inside of <pre> sections.

    ... the light set we are going to use the registry.pl script running
under Apache::Registry: benchmarks/registry.pl
----------------------
use strict; 
print "Content-type: text/plain ... pl file:
use Apache::RegistryLoader ();
Apache::RegistryLoader->new->handler( 
"/perl/benchmarks/registry.pl",
"/home/httpd/perl/benchmarks/registry .pl");
To create the heavy benchmark set let ...
results:------------------------------
name | avtime rps
------------------------------
light handler | 15 911
light registry | 21 680
------------------------------
heavy handler | 183 81
heavy registry | 191 77
------------------------------
Let's look at the results ... comparison:
------------------------------
name | avtime rps
------------------------------
light handler | 50 196
light registry | 160 61
------------------------------
heavy handler | 149 67

Even if it was formatted correctly (without swish doing s/\s+/ /g;) you end
up with a much longer summary for little readability gain.

The idea is not that the search results show the page correctly, rather
that it just shows some content to help you decide if you should follow the
link in the search result.

If I'm missing what you are suggesting, send what you think

   http://hank.org:5000/search/swish.cgi?query=registry

should look like.


-- 
Bill Moseley
mailto:[EMAIL PROTECTED]
