Seeing as people are talking about this, I thought I'd relate my hack for displaying and searching on specific metadata fields. We had a requirement to restrict queries to pages containing a specific meta-tag element or elements, and also to display meta-tag content in the output.
First I added a metadata List to DocumentRef, with methods to return the list and to add strings to it (DocumentRef is what's stored in db.docdb for each document). Then I added a got_metatag method to Retriever to store a meta-tag's content in the DocumentRef. Then in the HTML parser I modified do_tag to call Retriever::got_metatag when it sees an appropriate meta tag. I made a config attribute, meta_tag_store, which takes a StringList of meta-tag names to store, or 'all'; only meta tags with a matching NAME are stored. I only store the meta tag's content attribute, indexed by the name attribute. This might be a bit non-robust, but it's good enough for 2 hrs of hacking. This gives us the ability to store arbitrary metadata.
Then I modified htsearch/Display::displayMatch to read another config attribute, meta_tag_display, which is a list of tag names to make into variables suitable for inclusion in output templates (again, 'all' is an option too). For each meta tag stored in the retrieved DocumentRef, if it matches a name in the list of tags to display, it prefixes the name with mt_ and puts the new name and the content in the vars structure to make it available to the page writer.
This means that we can surface metadata, but we still can't search on it. I decided that introducing new terms into the main index was the easiest way - it's not flash, but it does us for now. As part of the HTML parser's do_tag, I read another config attribute, meta_tag_index, which lists the tag names to be indexed. When a matching tag comes up, I make up a keyword mt_<tag-name>_<tag-content> and add that to the index for the current document. I use the existing word-breaking code to break up multi-word tag contents, so a tag <meta name="Platform" content="Linux, Solaris, Irix"> would turn into three words: mt_platform_linux, mt_platform_solaris, mt_platform_irix (they're all forced to lower case).
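To make the naming scheme concrete, here is a rough Python sketch of the two mappings described above. It is not the ht://Dig C++ code: the function names (wanted, index_keywords, display_vars), the regex word breaker, and the case-insensitive matching of tag names are all assumptions made for illustration.

```python
import re

# Assumption: a "word" is a run of letters, digits and "_". ht://Dig's real
# word breaker is configurable; this regex only stands in for it.
_WORD = re.compile(r"[A-Za-z0-9_]+")


def wanted(name, configured):
    """True if `name` is selected by a config list; 'all' selects every tag.

    Matching is case-insensitive here, which is an assumption.
    """
    names = [n.lower() for n in configured]
    return "all" in names or name.lower() in names


def index_keywords(name, content, meta_tag_index):
    """Build the mt_<tag-name>_<word> index terms for one meta tag."""
    if name.lower() not in [n.lower() for n in meta_tag_index]:
        return []
    # One term per word of the content, everything forced to lower case.
    return [f"mt_{name.lower()}_{w.lower()}" for w in _WORD.findall(content)]


def display_vars(stored, meta_tag_display):
    """Map stored meta tags to mt_-prefixed template variables."""
    return {f"mt_{n}": c for n, c in stored.items() if wanted(n, meta_tag_display)}
```

With the Platform example from above, index_keywords("Platform", "Linux, Solaris, Irix", ["platform"]) gives the three terms mt_platform_linux, mt_platform_solaris and mt_platform_irix, and display_vars only passes through the tags named in meta_tag_display (or everything, if the list contains 'all').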
Then I just use the keywords= CGI parameter to htsearch to include the keywords I want to restrict on - we've got a PHP advanced search page with a bunch of list selects on it, and a redirect page which flattens the multiple options into a single keywords entry (sometime I'd like to modify the keywords parameter handling to allow it to take boolean queries, but that can wait for a bit). I needed to play with the punctuation characters to allow _ in words, and Bob's your uncle.
I hope this is of use or interest to someone. I've implemented this on our 3.2.x-based tree (and I won't post a patch because our tree has diverged too far - soon I'll have to rebase it on the snapshots again), but something similar should work on 3.1.x too.
Jamie Anstice
Search Scientist, S.L.I. Systems, Inc
[EMAIL PROTECTED]
ph: 64 961 3262
mobile: 64 21 264 9347
