Geoff Hutchison <[EMAIL PROTECTED]> writes:
> But to clarify your point, zlib and u_p_a and c_u_p are used on 
> different things. The first is used *solely* on document excerpts 
> (the DocHead field), while the latter two are used on URLs in both 
> the document database and the document index (the URL->DocID list).
I should probably do more research before asking further questions, as 
I don't even know what your database schema looks like (one of the 
reasons why I suggested having an "architectural overview" document), 
but are document titles compressed using one of these schemes? Are they 
part of the DocHead that is compressed with zlib?

> So there are two steps to decoding an entry--first decoding based on 
> url_part_aliases and common_url_parts, then decompressing the 
> DocHead field if it's compressed.
Well, if the goal is just to get a report of indexed URLs, I can 
probably forgo dealing with zlib.

Decoding common_url_parts reliably with an external script will be 
tricky because even if you parse htdig.conf and find the attribute 
absent, you still need to stay in sync with htdig's compiled-in 
defaults, which may have changed since the script was written.
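
To make the concern concrete, here is a rough sketch (in Python, purely 
for illustration) of the first thing such a script would have to do: read 
the attribute from htdig.conf and notice when it has to fall back on the 
compiled-in defaults. The function names and the ASSUMED_DEFAULTS table 
are hypothetical. The parser assumes the simple "name: value" syntax with 
backslash line continuations and whitespace-separated lists, and it 
ignores include directives and variable expansion. It does not attempt 
the actual URL decoding.

```python
# Hypothetical placeholder. htdig's real compiled-in defaults are NOT
# reproduced here; they would have to be copied from the exact htdig
# version that built the database, which is the fragile part.
ASSUMED_DEFAULTS = {
    "common_url_parts": None,
    "url_part_aliases": None,
}


def parse_htdig_conf(text):
    """Return {attribute: value} parsed from htdig.conf-style text."""
    attrs = {}
    pending = ""  # accumulates a logical line across "\" continuations

    def store(logical_line):
        # Split on the first colon only, since values may contain "://".
        if ":" not in logical_line:
            return
        name, value = logical_line.split(":", 1)
        attrs[name.strip()] = " ".join(value.split())

    for raw in text.splitlines():
        line = raw.rstrip()
        # Skip comment lines that are not part of a continued value.
        if not pending and line.lstrip().startswith("#"):
            continue
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        store(pending + line)
        pending = ""
    if pending:
        store(pending)  # file ended in the middle of a continuation
    return attrs


def lookup_url_attribute(attrs, name):
    """Return (source, value) so the caller can tell a fallback happened."""
    if name in attrs:
        return "config", attrs[name].split()
    return "compiled-in default", ASSUMED_DEFAULTS.get(name)


if __name__ == "__main__":
    sample = "database_dir: /tmp/db\ncommon_url_parts: http://www. \\\n  .html\n"
    conf = parse_htdig_conf(sample)
    for attr in ("common_url_parts", "url_part_aliases"):
        print(attr, lookup_url_attribute(conf, attr))
```

Returning the source alongside the value lets a report script at least 
warn when it is relying on defaults it cannot verify against the 
installed htdig.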

Are you aware of any Perl scripts that have been written to decompress 
the URLs?

 -Tom

-- 
Tom Metro
Venture Logic                                     [EMAIL PROTECTED]
Newton, MA, USA

