[
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197952#comment-13197952
]
Christian Moen commented on SOLR-3056:
--------------------------------------
Robert, let's enable stop-words and stop-tags by default.
The stopwords list in the Lucene analyzer looks too small unless it's always
used in combination with a stoptags filter. I'll look into both of these.
Also, if we're using search mode, part-of-speech F-measure will decrease so we might
want to rely more on stopwords rather than stoptags if it goes down by a whole
lot. However, since tokens agree in 99.7% of the cases based on the tests I
did earlier and the part-of-speech tags we'd typically use as stop tags aren't
involved with tokens split by search mode, I don't expect this to be a real
issue, but it's something to keep in mind.
I'll do some testing to verify this and I'll follow up with further
improvements to configuration.
I'll also open up a separate JIRA for stopwords and stoptags, and aligning the
Solr and Lucene default configuration.
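As a rough sketch of what enabling both by default might look like in the
{{text_ja}} analyzer chain: the Kuromoji class names and {{stopTags.txt}} come
from the field type proposed in this issue, while {{stopwords_ja.txt}} is a
hypothetical file name and the filter order is an assumption, not a tested
recommendation.

{code:xml}
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Stop-tags: drop tokens whose part-of-speech tag is listed in stopTags.txt -->
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
            tags="stopTags.txt" enablePositionIncrements="true"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Stop-words: stopwords_ja.txt is a hypothetical file name (assumption) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}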
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
> Key: SOLR-3056
> URL: https://issues.apache.org/jira/browse/SOLR-3056
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Affects Versions: 3.6, 4.0
> Reporter: Christian Moen
> Attachments: SOLR-3056_move.patch, SOLR-3056_schema40.patch,
> SOLR-3056_schema40.patch
>
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again
> Robert, Uwe and Simon). It would be very good to get a default field type
> defined for Japanese in {{schema.xml}} so we can have good Japanese
> out-of-the-box support in Solr.
> I've been playing with the below configuration today, which I think is a
> reasonable starting point for Japanese. There's a lot to be said about various
> considerations necessary when searching Japanese, but perhaps a wiki page is
> more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and
> its analyzers need to be seen by the Solr classloader. However, these are
> currently in contrib and I'm wondering if we should consider moving them to
> core to make them directly available. If there are concerns with additional
> memory usage, etc. for non-Japanese users, we can make sure resources are
> loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type suitable for Japanese text, using morphological
> analysis
> NOTE: Please copy files
> contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
> dist/apache-solr-analysis-extras-x.y.z.jar
> to your Solr lib directory (i.e. example/solr/lib) before
> starting Solr.
> (x.y.z refers to a version number)
> If you would like to optimize for precision, set the default operator to
> AND with
> <solrQueryParser defaultOperator="AND"/>
> below (in this file). Use "OR" if you would like to optimize for recall
> (default).
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="false">
> <analyzer>
> <!-- Kuromoji Japanese morphological analyzer/tokenizer
> Use search-mode to get a noun-decompounding effect useful for search.
> Example:
> 関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際
> (International) 空港 (airport)
> so we get a match for 空港 (airport) as we would expect from a good
> search engine
> Valid values for mode are:
> normal: default segmentation
> search: segmentation useful for search (extra compound splitting)
> extended: search mode with unigramming of unknown words
> (experimental)
> NOTE: Search mode improves segmentation for search at the expense of
> part-of-speech accuracy
> -->
> <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
> <!-- Reduces inflected verbs and adjectives to their base/dictionary
> forms (辞書形) -->
> <filter class="solr.KuromojiBaseFormFilterFactory"/>
> <!-- Optionally remove tokens with certain parts of speech
> <filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
> tags="stopTags.txt" enablePositionIncrements="true"/> -->
> <!-- Normalizes full-width romaji to half-width and half-width kana to
> full-width (Unicode NFKC subset) -->
> <filter class="solr.CJKWidthFilterFactory"/>
> <!-- Lower-case romaji characters -->
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
> {code}
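> As a possible alternative to copying the jars by hand, the {{<lib>}}
> directive in {{solrconfig.xml}} could be used to put them on the Solr
> classloader. This is only a sketch: the relative paths assume the stock
> {{example/solr}} instance directory layout, and the regular expressions are
> assumptions based on the jar names mentioned in the note above.
> {code:xml}
> <!-- Sketch only: paths are resolved relative to the Solr instance directory -->
> <lib dir="../../contrib/analysis-extras/lucene-libs" regex="lucene-kuromoji-.*\.jar"/>
> <lib dir="../../dist" regex="apache-solr-analysis-extras-.*\.jar"/>
> {code}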