[jira] [Commented] (JSPWIKI-893) Cannot search for bold words with GermanAnalyzer

JIRA Mon, 12 Aug 2019 15:22:08 -0700


    [ 
https://issues.apache.org/jira/browse/JSPWIKI-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905634#comment-16905634
 ]


Juan Pablo Santos Rodríguez commented on JSPWIKI-893:
-----------------------------------------------------

Hi Ulf,

regarding
{quote}As of removing the __ sequence, I think it wouldn't suffice, as you'd 
also need to remove all the wiki markup, while at the same time, not removing 
those characters when they are not part of the wiki markup (f.ex., an opening 
__ with no closing sequence shouldn't be deleted ( ? ) ).
{quote}
that was poorly expressed on my part, so let me try to rephrase it. The wiki 
page content _is_ getting indexed; the problem lies with the analyzer used to 
index that text.

Some analyzers are able to discard non-alphanumeric characters from tokens 
({{ClassicAnalyzer}}), others are not. Some are able to maintain URLs and email 
addresses, others (like {{WhitespaceAnalyzer}}) are not. You're going to get 
different results based on the {{Analyzer}} that you choose. If we remove the 
__ sequence, then unusual texts like {{wiki__woko}} won't yield any results 
when looking for either {{wiki}} or {{woko}}, as opposed to the behaviour shown 
by the {{ClassicAnalyzer}}. Yeah, I know it's a dumb example, but you get the 
point.
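
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the two regex patterns are simplified stand-ins I made up for "a tokenizer that keeps underscores inside a token" and "a tokenizer that treats them as separators". Which real analyzer behaves in which way is an assumption based on the behaviour described in this issue, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-ins for two tokenization styles; NOT the Lucene implementations.
class UnderscoreTokenizingSketch {

    // Assumption: underscores count as part of a word, so "__term__" stays one token.
    private static final Pattern WORD_WITH_UNDERSCORES =
            Pattern.compile("[\\p{L}\\p{N}_]+");

    // Assumption: only letters and digits form a token; underscores act as separators.
    private static final Pattern ALPHANUMERIC_ONLY =
            Pattern.compile("[\\p{L}\\p{N}]+");

    // Collects every match of the given token pattern, lowercased.
    static List<String> tokenize(Pattern tokenPattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    static List<String> keepingUnderscores(String text) {
        return tokenize(WORD_WITH_UNDERSCORES, text);
    }

    static List<String> splittingOnUnderscores(String text) {
        return tokenize(ALPHANUMERIC_ONLY, text);
    }

    public static void main(String[] args) {
        String page = "some __mysearchterm__ and wiki__woko";
        // prints [some, __mysearchterm__, and, wiki__woko]
        System.out.println(keepingUnderscores(page));
        // prints [some, mysearchterm, and, wiki, woko]
        System.out.println(splittingOnUnderscores(page));
    }
}
```

With the first rule, a query for {{mysearchterm}} has no matching token in the index, which is the symptom reported here. With the second rule it does, and {{wiki}} and {{woko}} become searchable too.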

We can try to format the text so it gets indexed the same on most analyzers, 
but we're going to get this issue under different flavours anyway (e.g. 
"doesn't keep URLs", or "doesn't take into account hyphens", or compound words, 
etc.), and that's what got me thinking that perhaps the way to fix this is to 
provide a custom {{Analyzer}} whose behaviour takes into account that we're 
indexing wiki markup, not plain text. The fact that (most probably) all 
LanguageAnalyzers use a {{StandardTokenizer}} instead of the 
{{ClassicTokenizer}} used by {{ClassicAnalyzer}}, and that there isn't an easy 
way to override this behaviour, is unfortunate, because at first glance they 
seem to be a perfect drop-in replacement for the {{ClassicAnalyzer}}.

So, one way of fixing this would be providing a custom {{Analyzer}}. Another 
could be stripping all markup from the wikipage body and then indexing plain 
text instead of markup text. Right now, there isn't a direct way other than 
going through the JSPWiki renderers, but that has other implications (plugins 
or filters being run, ACLs, etc.). I'm looking for a more direct way to clean 
the markup, but right now I'm not seeing how that could be done. Also, I'm not 
sure whether we'd still be getting this type of issue, to a lesser degree. It 
feels to me like we would be patching the symptoms (forcing the text to be 
indexed the same by different analyzers) instead of fixing the root cause 
(analyzers behaving differently).
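
As an illustration of the "strip the markup" route, and of the earlier point that an opening __ without a closing one should be left alone, here is a minimal sketch in plain Java. It only handles the bold {{__text__}} sequence on a single line; the class and method names are hypothetical, and it is nowhere near a real markup cleaner (links, plugins, other styles, etc. are all ignored).

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes only paired bold markers, nothing else.
class BoldMarkupStripperSketch {

    // "__", then one or more characters that neither start another "__"
    // nor are line breaks, then the closing "__".
    private static final Pattern PAIRED_BOLD =
            Pattern.compile("__((?:(?!__)[^\\r\\n])+)__");

    // Replaces each __text__ pair with its inner text; unpaired "__" stays.
    static String stripPairedBold(String markup) {
        return PAIRED_BOLD.matcher(markup).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: a mysearchterm here
        System.out.println(stripPairedBold("a __mysearchterm__ here"));
        // prints: wiki__woko (a single "__" is not a pair, so nothing changes)
        System.out.println(stripPairedBold("wiki__woko"));
    }
}
```

Note that {{wiki__woko}} comes out unchanged, so with an analyzer that keeps underscores inside tokens it still wouldn't be found when searching for {{wiki}} or {{woko}}. That is the "lesser degree" version of the issue mentioned above.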

Hope I'm making more sense now.

best regards,
 juan pablo

> Cannot search for bold words with GermanAnalyzer
> ------------------------------------------------
>
>                 Key: JSPWIKI-893
>                 URL: https://issues.apache.org/jira/browse/JSPWIKI-893
>             Project: JSPWiki
>          Issue Type: Bug
>    Affects Versions: 2.10
>            Reporter: Arend v. Reinersdorff
>            Priority: Major
>
> Reproduce:
> * in jspwiki-custom.properties set
>    {{jspwiki.lucene.analyzer = org.apache.lucene.analysis.de.GermanAnalyzer}}
> * in a wiki page include {{\_\_mysearchterm\_\_}}
> Result:
> * Searching for {{mysearchterm}} doesn't find the text
> * Searching for {{\_\_mysearchterm\_\_}} finds the text
> Expected:
> Searching for {{mysearchterm}} should find the text, as it does with 
> {{org.apache.lucene.analysis.standard.StandardAnalyzer}}
> See also the thread on the mailinglist:
> http://mail-archives.apache.org/mod_mbox/jspwiki-user/201506.mbox/%3CCAJCBYx0M_Xqm1jt8Pr466rRb8sLLLf28eygn9FrA4%3Dhrb6aHeg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
