On 3/9/06, Raul Raja Martinez <[EMAIL PROTECTED]> wrote:
> Hi I have a lot of html indexed such as:
>
> Mart&iacute;nez
>
> Of course my users are gonna search for Martínez and they're not gonna
> get a match.
>
> Is there a common approach to solve this kind of problem in lucene,
> Maybe some utility class or something?

If you might have other random HTML markup as well as entities, check out Solr's HTMLStrip* tokenizers:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

It's good if your input is dirty: if you don't know whether it's HTML or not, or if there are HTML fragments that would cause a normal HTML parser to choke. If you actually have HTML documents, I would go with an HTML parser. If you have *just* entities, there is probably a simpler approach.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
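
For the "just entities" case, one possible shape of the simpler approach is to decode entities before the text reaches the analyzer, so the index holds "Martínez" rather than "Mart&iacute;nez". The sketch below is illustrative only: it uses Python's standard library rather than Lucene or Solr, and the helper names `decode_entities` and `strip_html` are hypothetical, not part of either project.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and drops tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text data.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def decode_entities(text: str) -> str:
    """Input contains only entities: decode named and numeric references."""
    return html.unescape(text)


def strip_html(markup: str) -> str:
    """Input is an HTML document or fragment: drop tags, decode entities."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(decode_entities("Mart&iacute;nez"))
    print(strip_html("<p>Mart&iacute;nez</p>"))
```

Whichever variant is used, it would run on the raw field value before indexing, so that what gets indexed and what users type go through the same analysis.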