Hello

I am trying to filter out characters by Unicode block or script before
tokenization, so I use "PatternReplaceCharFilterFactory". In the end, I want
to filter out all non-CJK characters, basically the Latin, Greek, Arabic and
Hebrew scripts.

The problem is that PatternReplaceCharFilterFactory does not seem to fully
support the block or script pattern notation. Example:
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="\p{InBasic_Latin}"
            replacement=""
            replace="all"
/>
This works. The other patterns I tried were \p{InLatin-1_Supplement} and
\p{Latin}. Both throw an exception. From the log:
***
Mar 29, 2012 5:56:45 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType:Plugin init failure for [schema.xml]
analyzer/charFilter:Configuration Error: 'pattern' can not be parsed in
org.apache.solr.analysis.PatternReplaceCharFilterFactory
***

I am running the latest 4.0 nightly (version 4.0.0.2012.03.09.11.46.05)

Can anybody help? Or might this be a Java issue?
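
One way to narrow this down would be to compile the same patterns directly
with java.util.regex.Pattern, outside of Solr. The following is only a minimal
sketch, assuming the factory passes the pattern string to Pattern.compile
unchanged. The class name PatternCheck and the helper compiles are made up for
illustration, and the last three candidates are alternative spellings based on
my reading of the Pattern and Character.UnicodeBlock javadoc ("In" prefix for
blocks, "Is" prefix for scripts), not something I have confirmed in Solr.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class for this test only; it is not part of Solr.
final class PatternCheck {

    // Returns true if java.util.regex accepts the expression,
    // false if it throws a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "\\p{InBasic_Latin}",        // works in my schema.xml
            "\\p{InLatin-1_Supplement}", // fails in my schema.xml
            "\\p{Latin}",                // fails in my schema.xml
            "\\p{InLatin_1_Supplement}", // assumption: block name with underscores only
            "\\p{InLatin-1Supplement}",  // assumption: block name with the space dropped
            "\\p{IsLatin}"               // assumption: script syntax, needs Java 7 or later
        };
        for (String c : candidates) {
            System.out.println(c + " -> " + (compiles(c) ? "compiles" : "rejected"));
        }
    }
}
```

If the patterns are rejected here as well, the error comes from the Java
regex syntax rather than from Solr, and any spelling that compiles on the JVM
running Solr may be worth trying in the charFilter.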

Thanks a lot
Oliver

--
View this message in context: 
http://lucene.472066.n3.nabble.com/pattern-error-in-PatternReplaceCharFilterFactory-tp3868174p3868174.html
Sent from the Solr - User mailing list archive at Nabble.com.