On Thu, 25 Mar 2004, Jeff Kirby wrote:

> Here is a brief description of what I'm trying to accomplish:
> We have about 60,000 documents that we are indexing, most of them have
> statute numbers (similar to "356.47(b)(a)")... you'll notice a problem
> right off the bat when looking at this... and that is the period. Now
> if I include the period and open/close parentheses, then I'm going to be
> indexing invalid words as well...
>
> So, I thought of two possible solutions, but I don't think they are
> implemented in ht://Dig. One would be the ability to include a list of
> valid words to search and index (i.e. these would be recognized in a
> document before the removal of punctuation). The second would be to
> have a regular expression that also searches for valid words.

You might take a look at htdig's external parser support. If you are up for writing a bit of code, this should provide you with complete control over what is passed to htdig for indexing. Check the following for more info on external parser support.

http://www.htdig.org/attrs.html#external_parsers

Jim

_______________________________________________
ht://Dig general mailing list
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.): https://lists.sourceforge.net/lists/listinfo/htdig-general
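
To make the external parser suggestion concrete, here is a rough Python sketch of what such a parser might look like for the statute-number case. It is not an official ht://Dig example. It assumes the interface described on the external_parsers page as I understand it: the program receives the input file path as its first argument and writes tab-separated records to stdout, with word records of the form `w<TAB>word<TAB>location<TAB>heading`, where location runs from 0 to 1000. Please check that page before relying on this. The regular expression, the function names, and the script itself are hypothetical, and whether the indexer keeps the punctuation in words handed over by a parser may still depend on settings such as valid_punctuation, which I have not verified.

```python
#!/usr/bin/env python3
"""Sketch of an ht://Dig external parser that keeps statute numbers whole."""
import re
import sys

# Hypothetical pattern for citations such as "356.47(b)(a)": digits, a period,
# digits, then zero or more parenthesized subsections. Adjust to your data.
STATUTE_RE = re.compile(r"\d+\.\d+(?:\([0-9A-Za-z]+\))*")
WORD_RE = re.compile(r"[A-Za-z0-9]+")
# The statute alternative comes first so it wins over the plain-word pattern.
TOKEN_RE = re.compile(STATUTE_RE.pattern + "|" + WORD_RE.pattern)


def tokenize(text):
    """Return (lowercased token, character offset) pairs in document order."""
    return [(m.group(0).lower(), m.start()) for m in TOKEN_RE.finditer(text)]


def word_records(text):
    """Yield word records in the assumed 'w<TAB>word<TAB>location<TAB>heading' form."""
    length = max(len(text), 1)
    for token, offset in tokenize(text):
        location = offset * 1000 // length  # relative position, 0..1000
        yield f"w\t{token}\t{location}\t0"  # heading 0: assumed to mean "no heading"


def main(argv):
    # argv[1] is assumed to be the path of the document to parse; the other
    # arguments (content type, URL, config file) are ignored in this sketch.
    with open(argv[1], encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    for record in word_records(text):
        print(record)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Such a script would be hooked up through the external_parsers attribute, which pairs a content type with the program to run for it. Users would then have to search for the citation in the same form the parser emits, so it may be worth normalizing both sides the same way.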

