According to Geoff Hutchison:
> As far as using the HEAD for the checksum, my point is that most documents
> we already GET, so we don't save any bandwidth. I'm also not completely
> sure that there's enough to ensure the checksums are unique. (This is why
> I'd want to test the feature very thoroughly.)
At 2:56 AM -0500 12/15/98, Walter Hafner wrote:
>And a second thought on document checksums: Quite often I see root
>documents more than 100 kb in size. Why not just compute a checksum of
>the header? An HTTP 1.1 'HEAD' command would be sufficient, and IMHO this
>would save a lot of bandwidth and
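The HEAD-based idea above can be sketched roughly like this. This is a minimal illustration, not anything ht://Dig actually implements; the choice of header fields to hash is an assumption:

```python
import hashlib

def header_checksum(headers):
    """Hash a few HEAD response headers as a cheap change-detection key.
    The field list here is an assumption for illustration."""
    fields = ("last-modified", "content-length", "etag")
    blob = "\n".join(f"{name}:{headers.get(name, '')}" for name in fields)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()

# Example: a changed Content-Length yields a different checksum,
# so the full body need not be fetched to notice the change.
old = {"last-modified": "Tue, 15 Dec 1998 02:56:00 GMT", "content-length": "104857"}
new = {"last-modified": "Tue, 15 Dec 1998 02:56:00 GMT", "content-length": "99999"}
print(header_checksum(old) != header_checksum(new))  # prints True
```

The caveat Geoff raises still applies: headers alone may not change when the body does, so such checksums are not guaranteed unique or reliable without thorough testing.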
At 8:00 AM -0500 12/15/98, John Grohol PsyD wrote:
>How about a file_aliases option? For instance, on our server,
>the index.html file is nearly always a symbolic link to the
>actual file, which is named something different. If I could
>put "index.html" into a file_aliases option, I would solve a
At 5:28 AM -0500 12/15/98, Walter Hafner wrote:
>I'd like to have real substring search and case sensitive search. And
>while I'm dreaming, a regexp subset would be nice. :-)
There's already a substring search. As for case_sensitive, it's an idea,
but currently all words are stored as lowercase.
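For context, lowercase storage works roughly like this. A minimal sketch of the general idea, not ht://Dig's actual indexing code:

```python
def add_to_index(index, word, doc_id):
    # Words are folded to lowercase at index time, so the index itself
    # cannot distinguish "Dig" from "dig".
    index.setdefault(word.lower(), set()).add(doc_id)

def lookup(index, word):
    # Lookups fold the same way, making every search case-insensitive.
    return index.get(word.lower(), set())

index = {}
add_to_index(index, "Dig", 1)
add_to_index(index, "dig", 2)
print(lookup(index, "DIG"))  # both documents match: {1, 2}
```

Because the original case is discarded at index time, adding a case_sensitive option would mean changing what is stored, not just how it is queried.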
Geoff Hutchison writes:
> > We have lots of links on our website and it's annoying to see duplicates in
> > search results. But the problem with duplicate detection is deciding which
> > duplicate to use! My current thought is to use the document with the lower
> > hopcount.
Walter Hafner replies:
Geoff Hutchison writes:
> At 5:28 AM -0500 12/14/98, Walter Hafner wrote:
> >1) The lack of support for German umlauts (äöüß)
>
> My suggestion would be to look at the locale option.
Oops, sorry. I stand corrected. Missed that one.
> >2) The somewhat limited queries.
>
> I think you'll have to be more specific.
Webmaster writes:
> Geoff Hutchison writes:
> >We have lots of links on our website and it's annoying to see duplicates in
> >search results. But the problem with duplicate detection is deciding which
> >duplicate to use! My current thought is to use the document with the lower
> >hopcount.
Earlier part of message deleted for brevity's sake
>Actually NCS is being pretty naive in just using the size. The best way to
>detect exact duplicates is with a checksum (e.g. md5sum). Since it's pretty
>quick to generate a checksum, this isn't too slow.
>
>Though checking the root documents for
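Putting the two ideas in this thread together — body checksums to detect exact duplicates, and Geoff's preference for the lower hopcount to decide which copy to keep — a rough sketch might look like this. The function names are illustrative, not ht://Dig internals:

```python
import hashlib

def dedupe(documents):
    """documents: list of (url, hopcount, body) tuples.
    Keeps one document per distinct body, preferring the lower hopcount."""
    best = {}
    for url, hopcount, body in documents:
        # MD5 of the full body: identical documents collide exactly,
        # and computing the digest is cheap relative to fetching.
        key = hashlib.md5(body).hexdigest()
        if key not in best or hopcount < best[key][1]:
            best[key] = (url, hopcount, body)
    return [url for url, _, _ in best.values()]

docs = [
    ("http://www/a.html", 2, b"same page"),
    ("http://www/b.html", 1, b"same page"),   # duplicate, but shallower
    ("http://www/c.html", 3, b"other page"),
]
print(dedupe(docs))  # keeps b.html (lower hopcount) and c.html
```

This is what makes checksums stronger than NCS's size comparison: two different pages of the same size hash differently, while byte-identical pages always collide.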
At 5:28 AM -0500 12/14/98, Walter Hafner wrote:
>1) The lack of support for German umlauts (äöüß)
My suggestion would be to look at the locale option.
>2) The somewhat limited queries.
I think you'll have to be more specific. I'd say we easily cover the 80/20
rule. From my search logs, most people
Hi!
I'm in the process of evaluating Webcrawler software for full-text
indexing purposes.
Currently we use ht://Dig 3.1.0b2 for indexing the whole
*.tu-muenchen.de domain. The domain consists of ~300 WWW servers that
answer to ~540 names (virtual hosts _and_ server aliases). All in all
there are