Geoff Hutchison <[EMAIL PROTECTED]> writes:
> But to clarify your point, zlib and u_p_a and c_u_p are used on
> different things. The first is used *solely* on document excerpts
> (the DocHead field), while the latter two are used on URLs in both
> the document database and the document index (the URL->DocID list).
I should probably do more research before asking further questions, as
I don't even know what your database schema looks like (one of the
reasons why I suggested having an "architectural overview" document).
But are document titles compressed using one of these schemes? Is the
title part of the DocHead that is compressed with zlib?
> So there are two steps to decoding an entry--first decoding based on
> url_part_aliases and common_url_parts, then decompressing the
> DocHead field if it's compressed.
Well, if one goal is just to get a report of indexed URLs, I can
probably forgo dealing with zlib.
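
If the excerpts are wanted later, the zlib step could be handled roughly
as below. This is only a sketch: it assumes the DocHead field holds an
ordinary zlib stream when compressed and plain bytes otherwise, which I
have not verified against the database. The function name
decode_doc_head is made up for illustration.

```python
import zlib


def decode_doc_head(raw: bytes) -> bytes:
    """Return the excerpt, decompressing only if it parses as zlib.

    Assumption: a compressed DocHead is a plain zlib stream and an
    uncompressed one is stored as-is. Not checked against htdig.
    """
    try:
        return zlib.decompress(raw)
    except zlib.error:
        # Not a valid zlib stream, so treat it as uncompressed text.
        return raw
```

Trying the decompression and falling back on failure avoids having to
know in advance whether a given database was built with compression
turned on.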
Decoding common_url_parts reliably with an external script will be
tricky: even if the script parses htdig.conf and finds the attribute
absent, it still needs to stay in sync with htdig's compiled-in
defaults, which may have changed since the script was written.
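
To make the sync problem concrete, here is a rough sketch of what such a
script would have to do. The encoding shown is hypothetical (each common
part replaced by a single byte, numbered from 1 in list order); I don't
know the real on-disk format, and FIRST_CODE and expand_url are names
invented for this example. The point is only that the parts list has to
be supplied from outside and must match the one used at index time.

```python
from typing import Sequence

# Hypothetical: byte value standing in for the first entry of the list.
FIRST_CODE = 1


def expand_url(encoded: bytes, parts: Sequence[str]) -> str:
    """Expand stand-in bytes back into the URL parts they replace.

    `parts` must be the same list, in the same order, that was in
    effect when the database was built (from htdig.conf or from the
    compiled-in defaults of that htdig build).
    """
    table = {FIRST_CODE + i: part for i, part in enumerate(parts)}
    pieces = []
    for byte in encoded:
        # Bytes that are not stand-ins pass through unchanged.
        pieces.append(table.get(byte, chr(byte)))
    return "".join(pieces)


parts = ["http://", "www.", ".com"]
encoded = bytes([1, 2]) + b"example" + bytes([3]) + b"/index.html"
print(expand_url(encoded, parts))
```

With a different parts list, or the same entries in a different order,
the same bytes expand to a different URL, which is why a hard-coded copy
of the defaults would quietly go stale.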
Are you aware of any Perl scripts that have been written to decompress
the URLs?
-Tom
--
Tom Metro
Venture Logic [EMAIL PROTECTED]
Newton, MA, USA