On Wed, 24 Nov 2004, Stefan Seiz wrote:
Replying to my own post:
Contrary to what I have written above, having:

  extra_word_characters: _-
  valid_punctuation: ./!#$%^&'

does not take care of making words with hyphens searchable with htdig.
Then there might be some other issue that is creating problems for you. I haven't tried the exact settings that you have above, but in general the most correct solution to the problem you describe is to add '-' to the extra_word_characters attribute and remove it from valid_punctuation. The latter is necessary to prevent the '-' from being thrown away when the attempt is made to break the term into parts on punctuation delimiters. The former allows the term to retain the '-' when inserted into the word
database.
To really make it work, the hyphen only has to be taken out of the valid_punctuation! As soon as you add a hyphen to the extra_word_characters, it would break again (I tried)!
These attributes affect both indexing and searching. Are you sure that you reindexed after adding the '-' to extra_word_characters? If you indexed with the '-' removed from valid_punctuation and then searched with it added to extra_word_characters, I would not expect a hit.
While only removing the '-' from valid_punctuation will typically work, it is probably not really doing what you want. If you index a term of form part1-part2 in this manner, it is indexed as two separate terms (i.e. the word database contains a part1 term and a part2 term). The reason that you see a hit is that the search side splits the query in the same manner. If you check the query reported on the result page, you will see something like 'Search results for part1 and part2'. If that is good enough, you are done :)
This seems contradictory to what the htdig documentation says.
Here is the way it works, to the best of my understanding. I will assume the part1-part2 term example and break things out into the four cases.
1. The default case ('-' in valid_punctuation, no extra_word_characters). The indexer both breaks up and collapses the term. For this example, you end up with three terms added to the word database (i.e. part1, part2, and part1part2). On the search side the term is only collapsed, so the actual search query is part1part2.
2. A '-' added to extra_word_characters only (default valid_punctuation). The indexer collapses the term, so only part1part2 ends up in the word database. I think what happens here is that the '-' in valid_punctuation interferes with correct processing of the one in extra_word_characters. It certainly results in behavior that is at odds with what one would expect from reading only the extra_word_characters documentation. On the search side the same processing occurs, so the search query becomes part1part2.
3. The '-' removed from valid_punctuation only (default extra_word_characters). Because the '-' is now neither valid punctuation nor a word character, the indexer treats it as a word separator and splits the term there. This results in two terms being added to the word database (i.e. part1 and part2). On the search side, using the default search type, a query for part1-part2 becomes an AND of the two parts, so you end up searching for 'part1 and part2' rather than 'part1-part2'.
4. A '-' added to extra_word_characters and removed from valid_punctuation. The indexer leaves the term as is because the '-' being removed from valid_punctuation allows the '-' in the term to survive long enough to be processed as an extra word character. The term part1-part2 is the only thing that ends up in the word database. On the search side the same thing happens, so the search query really is part1-part2.
In theory, any of the four combinations should work for a term like part1-part2. But this only holds if you use the same settings for both indexing and search. Any mixing and matching opens the door for problems.
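To make the four cases easier to compare, here is a minimal Python sketch of the behavior as I described it above. It is a toy model only, not htdig's actual word-handling code: the function names (raw_tokens, collapse, index_terms, search_terms, would_hit) are made up for illustration, "./" is a shortened stand-in for the rest of the valid_punctuation set, and the rule that a character listed in both attributes is collapsed but not split on is an assumption chosen to reproduce case 2.

```python
def raw_tokens(text, valid_punct, extra_chars):
    """Split text at separator characters, i.e. anything that is not
    alphanumeric, not an extra word character, and not valid punctuation."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in extra_chars or ch in valid_punct:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def collapse(token, valid_punct):
    """Drop every valid punctuation character from the token."""
    return "".join(ch for ch in token if ch not in valid_punct)


def index_terms(text, valid_punct, extra_chars):
    """Words the toy indexer would store: the collapsed token plus the
    pieces obtained by breaking it on valid punctuation.
    Assumption: a character in both attributes is collapsed, not split on."""
    split_chars = set(valid_punct) - set(extra_chars)
    words = set()
    for token in raw_tokens(text, valid_punct, extra_chars):
        pieces, part = [token], []
        for ch in token:
            if ch in split_chars:
                pieces.append("".join(part))
                part = []
            else:
                part.append(ch)
        pieces.append("".join(part))
        for piece in pieces:
            word = collapse(piece, valid_punct)
            if word:
                words.add(word)
    return words


def search_terms(query, valid_punct, extra_chars):
    """Words the toy search side would look up: collapsed tokens only."""
    words = [collapse(t, valid_punct)
             for t in raw_tokens(query, valid_punct, extra_chars)]
    return [w for w in words if w]


def would_hit(text, query, index_settings, search_settings):
    """True if every search word (ANDed together) is in the toy index."""
    index = index_terms(text, *index_settings)
    return all(w in index for w in search_terms(query, *search_settings))


# (valid_punctuation, extra_word_characters) for the four cases
CASES = {
    1: ("./-", ""),   # default: '-' is valid punctuation
    2: ("./-", "-"),  # '-' added to extra_word_characters only
    3: ("./", ""),    # '-' removed from valid_punctuation only
    4: ("./", "-"),   # both changes
}

if __name__ == "__main__":
    for number, settings in CASES.items():
        print(number,
              sorted(index_terms("part1-part2", *settings)),
              search_terms("part1-part2", *settings))
    # Mixing settings: indexed as in case 3, searched as in case 4
    print(would_hit("part1-part2", "part1-part2", CASES[3], CASES[4]))
```

In this model every case finds part1-part2 when the same settings are used on both sides, while indexing with case 3 and searching with case 4 does not, which is the kind of mismatch I would check for first.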
With that said, I would still claim that 4 is the most correct solution
for the case you described since it preserves the term as is, both on the
index and search sides. The other cases construct queries that might not
be exactly what you intend and the result highlighting might not work as
expected.
The above will change somewhat for cases where one of the parts is less than your minimum_word_length setting (default 3). In particular it will affect what is indexed in 1 and 3, where the term is broken up. However I believe searches should still return results in all cases.
The most likely cause of the confusion you are running into is using different settings for search and index (or changing the search settings after indexing), though I am sure there are other possibilities.
Jim
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general

