[ https://issues.apache.org/jira/browse/JSPWIKI-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905634#comment-16905634 ]
Juan Pablo Santos Rodríguez commented on JSPWIKI-893:
-----------------------------------------------------

Hi Ulf,

regarding
{quote}As of removing the __ sequence, I think it wouldn't suffice, as you'd also need to remove all the wiki markup, while at the same time, not removing those characters when they are not part of the wiki markup (f.ex., an opening __ with no closing sequence shouldn't be deleted ( ? ) ).
{quote}
that was poorly expressed, let me try to rephrase it.

The wiki page content _is_ getting indexed; the problem lies with the analyzer used to index that text. Some analyzers are able to discard non-alphanumeric characters from tokens ({{ClassicAnalyzer}}), others are not. Some are able to maintain URLs and email addresses, others (like {{WhitespaceAnalyzer}}) are not. You're going to get different results based on the {{Analyzer}} that you choose.

If we remove the __ sequence, then unusual texts like {{wiki__woko}} won't yield any results when looking for either {{wiki}} or {{woko}}, as opposed to the behaviour shown by the {{ClassicAnalyzer}}. Yeah, I know it's a dumb example, but you get the point. We can try to format the text so it gets indexed the same on most analyzers, but we're going to get this issue in different flavours anyway (i.e. "doesn't keep urls", or "doesn't take into account hyphens", or compound words, etc.), and that's what got me thinking that perhaps the way to fix this is to provide a custom {{Analyzer}} which takes into account that we're indexing wiki markup, not plain text.

The fact that (most probably) all LanguageAnalyzers use a {{StandardTokenizer}} instead of a {{ClassicTokenizer}}, and that there isn't an easy way to override this behaviour, is unfortunate, because at first glance they seem to be a perfect drop-in replacement for the {{ClassicAnalyzer}}.

So, one way of fixing this would be providing a custom {{Analyzer}}.
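To make the analyzer point concrete, here is a toy sketch in plain Java. It deliberately does not use Lucene, and {{TokenizerSketch}} and its two methods are hypothetical names made up for this illustration; they only mimic the two kinds of behaviour described above (splitting on whitespace only versus discarding non-alphanumeric characters) and are not how any real Lucene tokenizer is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: these are NOT Lucene classes, just an illustration of why
// the choice of tokenization decides whether "mysearchterm" can be found.
class TokenizerSketch {

    // Splits on whitespace only, so markup characters stay glued to the word.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Splits on anything that is not a letter or digit, so "__" acts as a
    // separator and never ends up inside a token.
    static List<String> splitOnNonAlphanumeric(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "some __mysearchterm__ and wiki__woko";
        System.out.println(splitOnWhitespace(text));
        System.out.println(splitOnNonAlphanumeric(text));
    }
}
```

With the first method the indexed token keeps its underscores, so a query for {{mysearchterm}} has nothing to match; with the second, both {{mysearchterm}} and the two halves of {{wiki__woko}} become searchable. A markup-aware custom {{Analyzer}} would have to make this kind of decision for every piece of wiki syntax, not just for the __ sequence.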
Another could be stripping all markup from the wiki page body and then indexing plain text instead of markup text. Right now, there isn't a direct way other than going through the JSPWiki renderers, but that has other implications (plugins or filters being run, ACLs, etc.). I'm looking for a more direct way to clean the markup, but right now I'm not seeing how that could be done. Also, I'm not sure whether we'd still be getting this type of issue, to a lesser degree. It feels to me like we would be patching the symptoms (forcing the text to be indexed the same by different analyzers) instead of fixing the root cause (analyzers behaving differently).

Hope I'm making more sense now.

best regards,
juan pablo

> Cannot search for bold words with GermanAnalyzer
> ------------------------------------------------
>
>                 Key: JSPWIKI-893
>                 URL: https://issues.apache.org/jira/browse/JSPWIKI-893
>             Project: JSPWiki
>          Issue Type: Bug
>    Affects Versions: 2.10
>            Reporter: Arend v. Reinersdorff
>            Priority: Major
>
> Reproduce:
> * in jspwiki-custom.properties set
> {{jspwiki.lucene.analyzer = org.apache.lucene.analysis.de.GermanAnalyzer}}
> * in a wiki page include {{\_\_mysearchterm\_\_}}
> Result:
> * Searching for {{mysearchterm}} doesn't find the text
> * Searching for {{\_\_mysearchterm\_\_}} finds the text
> Expected:
> Searching for {{mysearchterm}} should find the text, as it does with
> {{org.apache.lucene.analysis.standard.StandardAnalyzer}}
> See also the thread on the mailinglist:
> http://mail-archives.apache.org/mod_mbox/jspwiki-user/201506.mbox/%3CCAJCBYx0M_Xqm1jt8Pr466rRb8sLLLf28eygn9FrA4%3Dhrb6aHeg%40mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)