Tillman, James writes:
> Loic: I guess we're approaching this work from different viewpoints
> (although this doesn't lessen my enthusiasm for working with you!). I was
> concentrating more on allowing htDig to do its thing with the indexing and
> just providing Perl scripters with a module for querying. The idea was to
> come up with a compatible data structure for returning a "result set" from
> the C++ querying class to Perl after a search had been run, and then
> allowing Perl to request the desired document data at will.
Ok. That means that we have distinct goals as far as the Perl interface
is concerned. That's good, it will require less cooperation on that side.
We could just concentrate on module naming, publication, packaging. If
you're not yet a registered CPAN author, I suggest you register now because
it takes some time and two or three mails to get it :-}
Do you already have an idea of the name to use? Search::Htdig? Htdig?
(AltaVista has a top level name after all.) I vote for Htdig. Lower level
interfaces could have names such as Htdig::Word, Htdig::Documents, etc.
> Although I didn't have any plans to rework the query syntax or directly
> access the htDig database, my current plans to "class out" the query parser
> would make providing an interface to Perl users a snap. We could simply
> wrap the class in XS and let Perl access the query parser directly. Again,
> what I wanted to do was make use of what code htDig already had and get my
> modifications rolled into the distribution. That way when changes were
> made, I didn't have so much work to do. Internal processes change often,
> but interfaces (i.e., public class methods and properties) should seldom
> change.
Right. This leaves us with the need to define a syntax tree. I tend to
like the Text::Query syntax tree (for obvious reasons :-) but you may think
differently.
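
To make the discussion concrete, a syntax tree for boolean queries could
look something like the sketch below. This is not the Text::Query
representation, only a hypothetical one written in Python; the node names
(Term, Not, And, Or) and the matches() helper are assumptions made for
illustration and do not exist in htDig.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Term:
    """Leaf node: a single word to look up."""
    word: str


@dataclass
class Not:
    """Negation of one sub-expression."""
    child: "Node"


@dataclass
class And:
    """All children must match."""
    children: List["Node"]


@dataclass
class Or:
    """At least one child must match."""
    children: List["Node"]


Node = Union[Term, Not, And, Or]


def matches(node, words):
    """Evaluate the tree against the set of lowercase words of one document."""
    if isinstance(node, Term):
        return node.word.lower() in words
    if isinstance(node, Not):
        return not matches(node.child, words)
    if isinstance(node, And):
        return all(matches(child, words) for child in node.children)
    if isinstance(node, Or):
        return any(matches(child, words) for child in node.children)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical query: (perl OR xs) AND NOT java
tree = And([Or([Term("perl"), Term("xs")]), Not(Term("java"))])

doc_words = set("wrap the query parser in xs for perl".split())
print(matches(tree, doc_words))
```

A parser would only have to build such a tree; the evaluation against the
index could then change without touching the query syntax.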
> Oh, and on the subject of the database indexing: I had figured on the system
> not really being able to handle the database fields, and was anticipating
> some sort of <meta> tag usage for providing indexing capability on the data
> fields. What would really make things easier is the indexing of XML in
> addition to HTML, which would make indexing individual records of data, and
> enumerations of primary keys, very simple. As long as your URL handler
> returned valid XML, it could be indexed and searched and linked to the
> primary key's "URL". Much better IMHO than trying to retrofit the system
> for database specific structures. This is a text indexer, after all.
XML/RDF would be the format of choice to represent the database
contents. A DBI-based Perl CGI that translates any SQL database
into a single XML/RDF format could be the solution. Two drawbacks:
it is very inefficient (impractical for databases that contain more
than 100,000 records) and it does not solve the problem that the
inverted index is not suited to database fields. The other approach
is to take mysql++, get rid of templates, add an interface for at
least one other database (postgres), change the interface to be as
close as possible to the DBI interface, rename it sql++ (or dbi++ ;-),
and plug it into htdig. Then add a word.desc in htcommon suited to
databases so that WordKey is able to handle it, and generalize the
htdig code so that it does not depend on a specific word key layout.
The bonus of the latter approach is that htdig will be ready to switch
from the Berkeley DB formatted document database to an SQL database.
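
As a rough illustration of the first approach (dumping database rows as
XML that a text indexer can read), here is a minimal sketch. It uses
Python's sqlite3 module as a stand-in for DBI and emits plain XML rather
than RDF; the table_to_xml() function, the element names (records,
record, field) and the sample table are all assumptions made for the
example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table, key_column):
    """Dump every row of `table` as one <record> element keyed by `key_column`.

    `table` and `key_column` are interpolated into the SQL text, so they
    must be trusted identifiers, never user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [description[0] for description in cursor.description]

    root = ET.Element("records", {"table": table})
    for row in cursor:
        values = dict(zip(columns, row))
        # The primary key becomes the record id, which an indexer could
        # map to the record's "URL".
        record = ET.SubElement(root, "record", {"id": str(values[key_column])})
        for name in columns:
            field = ET.SubElement(record, "field", {"name": name})
            field.text = "" if values[name] is None else str(values[name])
    return ET.tostring(root, encoding="unicode")


# Sample data in an in-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "Programming Perl", "Wall"), (2, "Advanced Perl Programming", "Srinivasan")],
)

xml_text = table_to_xml(conn, "books", "id")
print(xml_text)
```

The whole table is read and serialized in one pass, which is the
inefficiency mentioned above for large databases.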
Cheers,
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/