Re: [htdig-dev] Retriever/Parser

Neal Richter Tue, 29 Jan 2002 17:47:05 -0800


On Tue, 29 Jan 2002, Geoff Hutchison wrote:


> The Retriever class isn't really built around much of anything IMHO. It
> requires that documents have a URL and that the URLs can be grouped into
> Server objects. 

        True, but the main Start function assumes a spidering
approach.  What if you just want to index a list of documents
already in memory (fetched from another source)?  The Start function is
cumbersome and there is no clear function that seems to say "here is
some data, please index it".  At least in my current reading it looks like
the core fetching + parsing + indexing + write-to-db process is shared
between Start & RetrievedDocument.  Correct?

        Also, there are a few features of Retriever that are not useful
in other contexts... max_hop_count for instance.  Definitely a usable
class, but it's overkill for a very basic document whose source is outside
the Transport context.

> you're talking more about Transport-type concepts. 

        Yes, I was speaking in generalities.  I am basically thinking of
how HtDig can be used as a general-purpose Information Retrieval tool.  I
probably switched topics a bit there.

        It's the difference between telling htdig "go over here, fetch the
data and index it all by yourself" vs "please index this data as I provide
it to you".  That is, using htdig as an 'application' vs. as 'a text
indexing & query component of another system'.

> Again, I think it's the URL that's the critical point. Otherwise how are
> the search results useful? How do you "jump to" a particular result from
> the output?

        For this project, all I really store as a 'URL' is part of the
path to an XML file, so by itself the URL is useless to any transport
object.  For that matter, you could use the URL simply as a document ID
in a separate system.

        Integrating the necessary external code to find/fetch,
transcode (character set switch), parse via XSLT, etc. into a new
Transport class (along with many external libraries) would take as much
coding as it would to:

Define a BasicDocument class with no bells and whistles other than a
Parser binding.

Define a TextCollecter (cousin of Retriever) whose sole job is to
facilitate parsing of documents and update the index.  No need to make
network connections, look at server codes, examine the document for links
to other documents, etc.  No 'document fetch' loop anywhere.   The
'index_doc' routine is called as needed, once per document by an outside
piece of code.
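
        As a rough sketch of the two classes described above (this is
not htdig code): only the names BasicDocument, TextCollecter and index_doc
come from the proposal.  The parse_words helper, the lookup method and the
in-memory map are assumptions made purely for illustration; a real version
would bind to an htdig Parser and write to the word database instead.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical: a document that is already in memory. The id plays the role
// of the URL field and is only an opaque identifier (for example, part of a
// path to an XML file); nothing here ever tries to fetch it.
struct BasicDocument {
    std::string id;    // stored where a URL would normally go
    std::string text;  // contents, already fetched and transcoded by the caller
};

// Hypothetical stand-in for a Parser binding: splits text into lowercase
// alphanumeric words.
inline std::vector<std::string> parse_words(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Hypothetical "cousin of Retriever": no network connections, no server
// codes, no link extraction, and no fetch loop.
class TextCollecter {
public:
    // Called as needed, once per document, by an outside piece of code.
    void index_doc(const BasicDocument& doc) {
        for (const std::string& word : parse_words(doc.text)) {
            index_[word].insert(doc.id);
        }
    }

    // Returns the ids of documents containing the given lowercase word,
    // sorted and without duplicates.
    std::vector<std::string> lookup(const std::string& word) const {
        auto it = index_.find(word);
        if (it == index_.end()) return {};
        return std::vector<std::string>(it->second.begin(), it->second.end());
    }

private:
    // Toy in-memory index, standing in for the real word database.
    std::map<std::string, std::set<std::string>> index_;
};
```

        The calling application would construct one TextCollecter, then
call index_doc for each document it has already loaded; there is no Start
function and nothing resembling max_hop_count to configure.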

        The file is viewed via another piece of code that loads an XML
file, reads the fields contained via a specialized parser and does a kind
of Rendering/Formatting to present the information in a specific UI
complete with other bells and whistles.

        Similarly, the query process is integrated inside another
UI.  A query is received via user input, passed to the htdig search APIs,
and the results are repackaged within the existing UI.


> Your work is appreciated. I'm just trying to point out a few things as
> someone who's been around for a while.

        Great, it's good input, and definitely helpful in understanding
htdig and the project team's conceptualization of it.

> 2) It's better not to reinvent the wheel. The less code that needs to be
> maintained, generally the better. Do we really need new Retriever classes,
> or do we need to refactor what we have?

        It's very powerful now, and very useful in a network-centric
document environment.

        At some point the Retriever-as-swiss-army-knife approach can be
overly complex.  A more basic class for optional use can be good for a
narrow set of uses.

        What it comes down to is that I'm suggesting that
libhtdig.so could use two additional classes that are very basic.  These
classes aren't really useful to anyone who isn't using htdig as a separate
Information Retrieval component of another app.

        One could make an argument that mifluz could be used directly for
this.  Very true, but mifluz is a bucket of nice parts.  Htdig is a
working tool with the wrappers that make mifluz useful quickly.

Thanks.
-- 
Neal Richter 
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site





_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
