On Wed, 24 Nov 2004, Stefan Seiz wrote:

> Replying to my own post:
>
> Contrary to what I have written above, having:
>     extra_word_characters: _-
>     valid_punctuation: ./!#$%^&'
> does not take care of making words with hyphens searchable with htdig.

Then there might be some other issue that is creating problems for you. I
haven't tried the exact settings that you have above, but in general the
most correct solution to the problem you describe is to add '-' to the
extra_word_characters attribute and remove it from valid_punctuation. The
latter is necessary to prevent the '-' from being thrown away when the
attempt is made to break the term into parts on punctuation delimiters.
The former allows the term to retain the '-' when inserted into the word
database.

> To really make it work, the hyphen only has to be taken out of the
> valid_punctuation! As soon as you add a hyphen to the extra_word_characters,
> it would break again (I tried)!

These attributes affect both indexing and searching. Are you sure that you
reindexed after adding the '-' to extra_word_characters? If you indexed
with the '-' removed from valid_punctuation and then searched with it
added to extra_word_characters, I would not expect a hit.

While only removing the '-' from valid_punctuation will typically work, it
is probably not really doing what you want. If you index a term of the form
part1-part2 in this manner, it is indexed as two separate terms (i.e. the
word database contains a part1 term and a part2 term). The reason that you
see a hit is that the search side splits the query in the same manner. If
you check the query reported on the result page, you will see something
like 'Search results for part1 and part2'. If that is good enough, you are
done :)

> This seems contradictory to what the htdig documentation says.

Here is the way it works, to the best of my understanding. I will assume
the part1-part2 term example and break things out into the four cases.

1. The default case ('-' in valid_punctuation, no extra_word_characters).
   The indexer both breaks up and collapses the term. For this example, you
   end up with three terms added to the word database (i.e. part1, part2,
   and part1part2).
   On the search side the term is only collapsed, so the actual search
   query is part1part2.

2. A '-' added to extra_word_characters only (default valid_punctuation).
   The indexer collapses the term, so only part1part2 ends up in the word
   database. I think what happens here is that the '-' in valid_punctuation
   interferes with correct processing of the one in extra_word_characters.
   It certainly results in behavior that is at odds with what one would
   expect from reading only the extra_word_characters documentation.
   On the search side the same processing occurs, so the search query
   becomes part1part2.

3. The '-' removed from valid_punctuation only (default extra_word_characters).
   The indexer splits the term on valid punctuation. This results in two
   terms being added to the word database (i.e. part1 and part2).
   On the search side, using the default search type, a query for
   part1-part2 becomes an AND of the two parts, so you end up searching
   for 'part1 and part2' rather than 'part1-part2'.

4. A '-' added to extra_word_characters and removed from valid_punctuation.
   The indexer leaves the term as is because the '-' being removed from
   valid_punctuation allows the '-' in the term to survive long enough to
   be processed as an extra word character. The term part1-part2 is the
   only thing that ends up in the word database.
   On the search side the same thing happens, so the search query really is
   part1-part2.
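
As a rough illustration, the four cases above can be written down as a
small Python model. This is only a sketch of the behavior as I described
it, not ht://Dig's actual code. The function names, the CASES table and the
two boolean flags (whether '-' is in valid_punctuation and whether it is in
extra_word_characters) are made up for the example.

```python
# Toy model of the four cases; NOT the real ht://Dig word processing.
# in_punct: '-' is listed in valid_punctuation
# in_extra: '-' is listed in extra_word_characters

def split_parts(term, ch="-"):
    """Break a term on the given character, dropping empty pieces."""
    return [p for p in term.split(ch) if p]

def index_terms(term, in_punct, in_extra, ch="-"):
    """Words the model adds to the word database for one term."""
    parts = split_parts(term, ch)
    collapsed = "".join(parts)
    if in_punct and not in_extra:   # case 1: break up and collapse
        return set(parts) | {collapsed}
    if in_punct and in_extra:       # case 2: collapse only
        return {collapsed}
    if not in_extra:                # case 3: break up only
        return set(parts)
    return {term}                   # case 4: keep the term as is

def search_terms(term, in_punct, in_extra, ch="-"):
    """Words the model searches for; several words are ANDed together."""
    parts = split_parts(term, ch)
    if in_punct:                    # cases 1 and 2: collapse only
        return {"".join(parts)}
    if not in_extra:                # case 3: AND of the parts
        return set(parts)
    return {term}                   # case 4: keep the term as is

def is_hit(term, index_cfg, search_cfg):
    """A hit needs every search word to be present in the word database."""
    return search_terms(term, *search_cfg) <= index_terms(term, *index_cfg)

# (in_punct, in_extra) for each numbered case
CASES = {1: (True, False), 2: (True, True), 3: (False, False), 4: (False, True)}

for number, cfg in CASES.items():
    indexed = sorted(index_terms("part1-part2", *cfg))
    searched = sorted(search_terms("part1-part2", *cfg))
    print(number, "indexed:", indexed, "searched:", searched)
```

In this model, is_hit("part1-part2", CASES[3], CASES[4]) reports a miss:
the term was indexed as two parts but is searched as one hyphenated word.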

In theory, any of the four combinations should work for a term like
part1-part2, but only if you use the same settings for both indexing and
searching. Any mixing and matching opens the door to problems.

With that said, I would still claim that case 4 is the most correct
solution for the situation you described, since it preserves the term as
is on both the index and search sides. The other cases construct queries
that might not be exactly what you intend, and the result highlighting
might not work as expected.


The above will change somewhat for cases where one of the parts is shorter
than your minimum_word_length setting (default 3). In particular, it will
affect what is indexed in cases 1 and 3, where the term is broken up.
However, I believe searches should still return results in all cases.

The most likely cause of the confusion you are running into is using
different settings for searching and indexing (or changing the search
settings after indexing), though I am sure there are other possibilities.
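
One way to check for that kind of mismatch is to compare the relevant
attributes in the configuration used at index time against the one the
search uses now. A minimal Python sketch follows. It assumes plain
'name: value' lines with whole-line '#' comments, and it ignores
continuation lines and included files. The function names and the list of
watched attributes are my own choice, not part of ht://Dig.

```python
# Hypothetical helper: compare word-handling attributes in two config texts.
WATCHED = ("valid_punctuation", "extra_word_characters", "minimum_word_length")

def parse_attributes(text):
    """Parse simple 'name: value' lines into a dict."""
    attrs = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Only whole-line comments are skipped, because a value such as
        # valid_punctuation may itself contain a '#'.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            attrs[name.strip()] = value.strip()
    return attrs

def differing_attributes(index_conf, search_conf, names=WATCHED):
    """Return {name: (index value, search value)} for attributes that differ.

    An attribute missing from a config shows up as None, meaning that side
    falls back to whatever the built-in default is.
    """
    index_attrs = parse_attributes(index_conf)
    search_attrs = parse_attributes(search_conf)
    return {
        name: (index_attrs.get(name), search_attrs.get(name))
        for name in names
        if index_attrs.get(name) != search_attrs.get(name)
    }

index_conf = """\
# settings in effect when the index was built
extra_word_characters: -
valid_punctuation: ./!#$%^&'
"""

search_conf = """\
# settings in effect for the search
valid_punctuation: ./!#$%^&'
"""

print(differing_attributes(index_conf, search_conf))
```

An empty result means the watched attributes agree. Anything else is a
reason to reindex with the settings the search is using.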

Jim

