[ https://issues.apache.org/jira/browse/JSPWIKI-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904719#comment-16904719 ]
Juan Pablo Santos Rodríguez commented on JSPWIKI-893:
-----------------------------------------------------

Having touched the {{LuceneSearchProvider}} lately, I went to see what is happening with this one. Right now, I fear that {{GermanAnalyzer}} won't suffice. As it is also final (like most analyzers, it seems), you can't extend or override it, which means that, as of now, you'd need your own custom {{Analyzer}}. {{GermanAnalyzer}} uses a {{StandardTokenizer}} to tokenize text, and that tokenizer treats underscores as part of the word, hence the behaviour you are seeing (there's a small comparison sketch at the end of this comment).

If you copy and rename {{GermanAnalyzer}} to something else (like {{WikiGermanAnalyzer}}, or whatever) and override the {{createComponents}} method so it looks like:
{code:java}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // ClassicTokenizer splits on '_', unlike StandardTokenizer
    final ClassicTokenizer source = new ClassicTokenizer();
    source.setMaxTokenLength(255); // from ClassicAnalyzer
    TokenStream tok = new ClassicFilter(source);
    TokenStream result = new LowerCaseFilter(tok);
    result = new StopFilter(result, stopwords);
    result = new SetKeywordMarkerFilter(result, exclusionSet);
    result = new GermanNormalizationFilter(result);
    result = new GermanLightStemFilter(result);
    return new TokenStreamComponents(source, result);
}
{code}
that should be enough to solve your issue. Note that this issue also happens with the {{SpanishAnalyzer}}, the {{EnglishAnalyzer}} and, most probably, with the other language analyzers as well, as they all seem to use {{StandardTokenizer}} instead of {{ClassicTokenizer}} - which probably makes sense in the context of analyzing plain text.

As for fixing this issue in JSPWiki, I'm not really sure how we could solve it other than pointing to a custom/parallel language analyzer. It seems there isn't an easy way to either
* extend or compose analyzers,
* compose {{TokenStreamComponents}}, or
* fetch the sink associated with the {{TokenStreamComponents}} and iterate over it to build a custom {{TokenStreamComponents}} that would use a {{ClassicTokenizer}} set by JSPWiki plus all the filters associated with the given Lucene analyzer (= the tokenizer lines at the top of the method above set by JSPWiki, the filters set by the Lucene analyzer).

So JSPWiki bundling a custom analyzer doesn't seem like a viable option (we'd also have to decide whether we want to keep e-mails and URLs, whether to index compound words, etc.).

As for removing the __ sequence before indexing, I think it wouldn't suffice, as you'd also need to remove all the other wiki markup while, at the same time, not removing those characters when they are not part of wiki markup (f.ex., an opening __ with no closing sequence shouldn't be deleted(?)).

I also thought of not indexing the page markup, but getting the associated WikiDocument instead, running it through the {{CleanTextRenderer}}, and then indexing the resulting text. But that also means we'd take on a lot of other weird associated issues: ACLs interfering with indexing, plugins with side effects (like increasing counters or setting WikiVariables), filters being run, etc. I have to take a closer look to see if we could somehow find a way to clean the markup, though.

I'd propose adding a note on {{jspwiki.lucene.analyzer}} pointing here, stating that an {{Analyzer}} which uses a {{ClassicTokenizer}} should be preferred, and marking this issue as Won't Fix, as the problem lies in some Lucene analyzers, about which we can hardly do anything. What do others think?
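To make the tokenizer difference concrete, here's a minimal, self-contained sketch that runs the reporter's {{\_\_mysearchterm\_\_}} example through both tokenizers. The class name is made up for illustration, and it assumes a Lucene version contemporary with this issue (one where {{createComponents}} takes just the field name and both tokenizers have no-arg constructors):
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of JSPWiki
public class TokenizerComparison {

    // Prints every token the given tokenizer produces for the given text
    static void dump(Tokenizer tokenizer, String text) throws Exception {
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println("  [" + term + "]");
        }
        tokenizer.end();
        tokenizer.close();
    }

    public static void main(String[] args) throws Exception {
        String markup = "some __mysearchterm__ here";
        System.out.println("StandardTokenizer:"); // expected: [__mysearchterm__]
        dump(new StandardTokenizer(), markup);
        System.out.println("ClassicTokenizer:"); // expected: [mysearchterm]
        dump(new ClassicTokenizer(), markup);
    }
}
{code}
If those assumptions hold, {{StandardTokenizer}} should emit the single token {{\_\_mysearchterm\_\_}} (its UAX#29 word-break rules treat underscores as word-joining), while {{ClassicTokenizer}} should emit just {{mysearchterm}}, which is why searching for the bare word works with it.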
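For completeness, wiring in the renamed analyzer would use the same property as in the reproduction steps below; a sketch, assuming the hypothetical {{WikiGermanAnalyzer}} from above lives in a package of your choosing:
{code}
# jspwiki-custom.properties - the package name is an assumption, use your own
jspwiki.lucene.analyzer = com.example.WikiGermanAnalyzer
{code}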
> Cannot search for bold words with GermanAnalyzer
> ------------------------------------------------
>
>                 Key: JSPWIKI-893
>                 URL: https://issues.apache.org/jira/browse/JSPWIKI-893
>             Project: JSPWiki
>          Issue Type: Bug
>    Affects Versions: 2.10
>            Reporter: Arend v. Reinersdorff
>            Priority: Major
>
> Reproduce:
> * in jspwiki-custom.properties set {{jspwiki.lucene.analyzer = org.apache.lucene.analysis.de.GermanAnalyzer}}
> * in a wiki page include {{\_\_mysearchterm\_\_}}
>
> Result:
> * Searching for {{mysearchterm}} doesn't find the text
> * Searching for {{\_\_mysearchterm\_\_}} finds the text
>
> Expected:
> Searching for {{mysearchterm}} should find the text, as it does with {{org.apache.lucene.analysis.standard.StandardAnalyzer}}
>
> See also the thread on the mailing list:
> http://mail-archives.apache.org/mod_mbox/jspwiki-user/201506.mbox/%3CCAJCBYx0M_Xqm1jt8Pr466rRb8sLLLf28eygn9FrA4%3Dhrb6aHeg%40mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)