[ https://issues.apache.org/jira/browse/JSPWIKI-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904719#comment-16904719 ]

Juan Pablo Santos Rodríguez commented on JSPWIKI-893:
-----------------------------------------------------

Having recently touched the {{LuceneSearchProvider}}, I went to see what is 
happening with this one.

Right now, I'm afraid {{GermanAnalyzer}} won't suffice. As it is also final 
(like most analyzers, it seems), you can't extend or override it, which means 
that, as of now, you'd need your own / another custom {{Analyzer}}. 
{{GermanAnalyzer}} uses a {{StandardTokenizer}} to tokenize texts, which 
treats underscores as part of the word, hence the behaviour you are seeing. 
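
Just to illustrate the difference outside of JSPWiki (this snippet is not part 
of our codebase, and it assumes a Lucene 7/8-style API where tokenizers have 
no-arg constructors and get their input through {{setReader}}):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerComparison {

    // print every token the given tokenizer produces for the given text
    static void printTokens(final Tokenizer tokenizer, final String text) throws Exception {
        tokenizer.setReader(new StringReader(text));
        final CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }

    public static void main(final String[] args) throws Exception {
        final String markup = "some __mysearchterm__ in bold";
        printTokens(new StandardTokenizer(), markup); // some, __mysearchterm__, in, bold
        printTokens(new ClassicTokenizer(), markup);  // some, mysearchterm, in, bold
    }
}
{code}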

If you copy and rename {{GermanAnalyzer}} to something else (say, 
{{WikiGermanAnalyzer}}) and override the {{createComponents}} method so it 
looks like this:

{code:java}
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // ClassicTokenizer splits on underscores, unlike StandardTokenizer
        final ClassicTokenizer source = new ClassicTokenizer();
        source.setMaxTokenLength(255); // same default as ClassicAnalyzer
        TokenStream tok = new ClassicFilter(source);
        TokenStream result = new LowerCaseFilter(tok);
        result = new StopFilter(result, stopwords);
        result = new SetKeywordMarkerFilter(result, exclusionSet);
        // same German-specific filters used by GermanAnalyzer
        result = new GermanNormalizationFilter(result);
        result = new GermanLightStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
{code}

that should be enough to solve your issue. Note that this issue also happens 
with {{SpanishAnalyzer}}, {{EnglishAnalyzer}} and, most probably, with the 
other language analyzers as well, as they all seem to use 
{{StandardTokenizer}} instead of {{ClassicTokenizer}} (which probably makes 
sense in the context of analyzing plain text).
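
You'd then point JSPWiki at the new class through the same property mentioned 
in the report; the package / class name below is just the hypothetical one 
from above:

{code}
jspwiki.lucene.analyzer = com.example.WikiGermanAnalyzer
{code}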

As for how to fix this issue in JSPWiki, I'm not really sure how we could 
solve it other than pointing to a custom/parallel language analyzer. It seems 
there isn't an easy way to either
 * extend or compose analyzers,
 * compose {{TokenStreamComponents}}, or
 * fetch the sink associated with a {{TokenStreamComponents}} and iterate over 
it to build a custom {{TokenStreamComponents}}, which would use a 
{{ClassicTokenizer}} set by JSPWiki plus all the filters associated with the 
given Lucene analyzer (i.e. the first three lines of the method above set by 
JSPWiki, the rest set by the other Lucene analyzer).

So, JSPWiki bundling a custom analyzer doesn't seem like a viable option (we'd 
also have to decide whether we want to keep e-mails and URLs, whether to index 
compound words, etc.).

As for removing the {{__}} sequence, I think it wouldn't suffice, as you'd 
also need to remove all the wiki markup while, at the same time, not removing 
those characters when they are not part of the wiki markup (f.ex. an opening 
{{__}} with no closing sequence shouldn't be deleted (?)). 

I also thought of not indexing the page markup, but instead getting the 
associated {{WikiDocument}}, parsing it through the {{CleanTextRenderer}} and 
then indexing the resulting text. But that also means we'd take on a lot of 
other associated weird issues: ACLs interfering with indexing, plugins that 
may have side effects (like increasing counters or setting WikiVariables), 
filters being run, etc. I have to take a better look at it to see if we could 
somehow find a way to clean the markup, though. 
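
For reference, the kind of indexing path I have in mind would be roughly along 
these lines (a sketch only, from memory of the 2.10 APIs; exact constructors 
and signatures would need checking):

{code:java}
import java.io.StringReader;

import org.apache.wiki.WikiContext;
import org.apache.wiki.WikiEngine;
import org.apache.wiki.WikiPage;
import org.apache.wiki.parser.JSPWikiMarkupParser;
import org.apache.wiki.parser.WikiDocument;
import org.apache.wiki.render.CleanTextRenderer;

public class CleanTextIndexingSketch {

    // rough sketch, not actual LuceneSearchProvider code: render the page through
    // the markup parser and CleanTextRenderer, then index the resulting plain text
    static String cleanTextFor(final WikiEngine engine, final WikiPage page, final String pageText) throws Exception {
        final WikiContext context = new WikiContext( engine, page );
        final JSPWikiMarkupParser parser = new JSPWikiMarkupParser( context, new StringReader( pageText ) );
        final WikiDocument doc = parser.parse();
        return new CleanTextRenderer( context, doc ).getString();
    }
}
{code}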

I'd propose adding a note on {{jspwiki.lucene.analyzer}} pointing here, noting 
that an {{Analyzer}} which uses a {{ClassicTokenizer}} should be preferred, 
and marking this as Won't Fix, as the problem lies in some Lucene analyzers, 
about which we can hardly do anything. What do others think?

> Cannot search for bold words with GermanAnalyzer
> ------------------------------------------------
>
>                 Key: JSPWIKI-893
>                 URL: https://issues.apache.org/jira/browse/JSPWIKI-893
>             Project: JSPWiki
>          Issue Type: Bug
>    Affects Versions: 2.10
>            Reporter: Arend v. Reinersdorff
>            Priority: Major
>
> Reproduce:
> * in jspwiki-custom.properties set
>    {{jspwiki.lucene.analyzer = org.apache.lucene.analysis.de.GermanAnalyzer}}
> * in a wiki page include {{\_\_mysearchterm\_\_}}
> Result:
> * Searching for {{mysearchterm}} doesn't find the text
> * Searching for {{\_\_mysearchterm\_\_}} finds the text
> Expected:
> Searching for {{mysearchterm}} should find the text, as it does with 
> {{org.apache.lucene.analysis.standard.StandardAnalyzer}}
> See also the thread on the mailinglist:
> http://mail-archives.apache.org/mod_mbox/jspwiki-user/201506.mbox/%3CCAJCBYx0M_Xqm1jt8Pr466rRb8sLLLf28eygn9FrA4%3Dhrb6aHeg%40mail.gmail.com%3E


