Seeing as people are talking about this, I thought I'd relate my hack at 
displaying & searching on specific metadata fields.  We had a requirement 
to restrict queries to pages containing a specific meta-tag element or 
elements, and also to display meta-tag content in the output.

First I added a meta-data List to DocumentRef, with methods to return the 
list, and add strings to the list (DocumentRef is what's stored in the 
db.docdb for each document).  Then I added a got_metatag method to 
Retriever to store a meta-tag's content in the DocumentRef.  Then in the 
HTML parser I modified do_tag to call Retriever::got_metatag when it sees 
an appropriate meta-tag.  (I made a config attribute meta_tag_store, which 
takes a StringList of meta-tag names to store, or 'all'; only meta-tags 
whose NAME matches are stored.  I only store the content attribute, 
indexed by the name attribute.  This might be a bit non-robust, but it's 
good enough for 2 hrs of hacking.)  This gives us the ability to store 
arbitrary meta-data.
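
As a rough illustration of the filtering step described above (Python 
rather than the real C++, and not the actual patch), the selection could 
look something like this.  The function name select_meta_tags is 
hypothetical, and the case-insensitive name matching is an assumption.

```python
def select_meta_tags(meta_tags, store_setting):
    """Return {name: content} for the meta-tags that should be stored.

    meta_tags     -- list of (name, content) pairs seen by the parser
    store_setting -- value of meta_tag_store: a list of names, or ['all']
    """
    wanted = {s.lower() for s in store_setting}
    keep_all = "all" in wanted
    stored = {}
    for name, content in meta_tags:
        key = name.lower()  # assumption: names are matched case-insensitively
        if keep_all or key in wanted:
            # Only the content attribute is kept, indexed by the name.
            stored[key] = content
    return stored
```

Calling select_meta_tags with meta_tag_store set to ['platform'] keeps 
only the Platform tag; setting it to ['all'] keeps every named tag.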

Then I modified htsearch/Display::displayMatch to read another config 
attribute meta_tag_display, which is a list of tag names to make into 
variables suitable for inclusion in output templates (again 'all' is an 
option too).  For each meta-tag stored in the retrieved DocumentRef, if 
its name matches one in the list of tags to display, the code prefixes 
the name with mt_ and puts the new name and the content in the vars 
structure, making it available to the page writer.
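
A minimal sketch of that mapping, again in Python for brevity and not 
the real Display::displayMatch code.  The name meta_template_vars is 
hypothetical, and a plain dict stands in for the vars structure.

```python
def meta_template_vars(stored, display_setting):
    """Build template variables from a document's stored meta-tags.

    stored          -- {name: content} as kept in the DocumentRef
    display_setting -- value of meta_tag_display: list of names, or ['all']
    """
    wanted = {s.lower() for s in display_setting}
    show_all = "all" in wanted
    template_vars = {}
    for name, content in stored.items():
        if show_all or name.lower() in wanted:
            # Prefix with mt_ so the variable cannot clash with built-ins.
            template_vars["mt_" + name] = content
    return template_vars
```

With a stored tag named platform and meta_tag_display set to 
['platform'], the template would see a variable called mt_platform.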

This means that we can surface meta-data, but we still can't search on it. 
 I decided that introducing new terms into the main index was the easiest 
way - it's not flash, but it does us for now.  As part of the HTML parser 
do_tag, I read another config attribute meta_tag_index which has a list of 
all the tag names which will be indexed.  When a matching tag comes up, I 
make up a keyword mt_<tag-name>_<tag-content> and add that to the index 
for the current document (I use the existing word breaking code to break 
up multi-word tag contents, so a tag <meta name="Platform" content="Linux, 
Solaris, Irix"> would turn into three words mt_platform_linux, 
mt_platform_solaris, mt_platform_irix - they're all forced to lower case). 
 Then I just use the keywords= CGI parameter to htsearch to include the 
keywords I want to restrict - we've got a php advanced search page with a 
bunch of list selects on it, and a redirect page which flattens the 
multiple options into a single keywords entry (sometime I'd like to modify 
the keywords parameter handling to allow it to take boolean queries, but 
that can wait for a bit).  I needed to play with the punctuation 
characters to allow _ in words, and Bob's your uncle.
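
To make the keyword scheme concrete, here is a small Python sketch of 
both halves: generating the index terms and flattening form selections 
into one keywords value.  It is illustrative only.  The function names 
are hypothetical, the regular expression is a simple stand-in for the 
existing word-breaking code, and joining terms with spaces is an 
assumption about how the keywords value is written.

```python
import re


def meta_index_terms(name, content, index_setting):
    """Return the mt_<tag-name>_<word> terms to index for one meta-tag."""
    if name.lower() not in {s.lower() for s in index_setting}:
        return []
    # Stand-in word breaker: split the content on anything non-alphanumeric.
    words = re.findall(r"[A-Za-z0-9]+", content)
    return ["mt_%s_%s" % (name.lower(), w.lower()) for w in words]


def flatten_keywords(selections):
    """Flatten {tag name: [chosen values]} into a single keywords string."""
    terms = []
    for name, values in selections.items():
        for value in values:
            terms.append("mt_%s_%s" % (name.lower(), value.lower()))
    return " ".join(terms)
```

For the Platform example above, meta_index_terms yields 
mt_platform_linux, mt_platform_solaris and mt_platform_irix, and 
flatten_keywords produces matching terms from the search form's 
selections.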

I hope this is of use or interest to someone.  I've implemented this on 
our 3.2.x-based tree (I won't post a patch because our tree has diverged 
too far; soon I'll have to base it on the snapshots again), but 
something similar should work on 3.1.x too.

Jamie Anstice
Search Scientist,  S.L.I. Systems, Inc
[EMAIL PROTECTED]
ph:  64 961 3262
mobile: 64 21 264 9347


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
