[jira] [Commented] (SOLR-3056) Introduce Japanese field type in schema.xml

Christian Moen (Commented) (JIRA) Wed, 01 Feb 2012 08:59:25 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197952#comment-13197952
 ]


Christian Moen commented on SOLR-3056:
--------------------------------------

Robert, let's enable stopwords and stoptags by default.

The stopwords list in the Lucene analyzer looks too small unless it's always 
used in combination with a stoptags filter.  I'll look into both of these.

Also, if we're using search mode, part-of-speech F-measure will decrease, so we 
might want to rely more on stopwords than on stoptags if it drops a lot.  
However, I don't expect this to be a real issue: tokens agree in 99.7% of the 
cases based on the tests I did earlier, and the part-of-speech tags we'd 
typically use as stoptags aren't involved with the tokens split by search 
mode.  It's still something to keep in mind.

I'll do some testing to verify this and I'll follow up with further 
improvements to configuration.

I'll also open a separate JIRA for stopwords and stoptags, and for aligning 
the Solr and Lucene default configurations.
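
As a rough illustration of what enabling both by default could look like, here 
is a sketch based on the text_ja field type from the issue description.  It is 
not a tested or agreed configuration.  Assumptions: stopwords_ja.txt is a 
placeholder file name, stopTags.txt is the file name already used in the 
description, and the position of the two stop filters in the chain is only one 
plausible ordering.

```xml
<!-- Sketch only: text_ja with stoptags and stopwords both enabled.
     ASSUMPTIONS: stopwords_ja.txt is a placeholder file name, and the
     filter order shown here is one plausible choice, not a verified one. -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Stoptags: remove tokens by part-of-speech tag -->
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Stopwords: remove tokens by surface form, after width normalization -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If part-of-speech quality in search mode turns out to be a problem, the 
stoptags filter could be dropped from this chain and the stopwords list 
extended instead.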
                
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
>                 Key: SOLR-3056
>                 URL: https://issues.apache.org/jira/browse/SOLR-3056
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: SOLR-3056_move.patch, SOLR-3056_schema40.patch, 
> SOLR-3056_schema40.patch
>
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again 
> Robert, Uwe and Simon). It would be very good to get a default field type 
> defined for Japanese in {{schema.xml}} so we can get good Japanese 
> out-of-the-box support in Solr.
> I've been playing with the below configuration today, which I think is a 
> reasonable starting point for Japanese.  There's a lot to be said about various 
> considerations necessary when searching Japanese, but perhaps a wiki page is 
> more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and 
> its analyzers need to be seen by the Solr classloader.  However, these are 
> currently in contrib and I'm wondering if we should consider moving them to 
> core to make them directly available.  If there are concerns with additional 
> memory usage, etc. for non-Japanese users, we can make sure resources are 
> loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type is suitable for Japanese text using morphological 
> analysis
>      NOTE: Please copy files
>        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
>        dist/apache-solr-analysis-extras-x.y.z.jar
>      to your Solr lib directory (i.e. example/solr/lib) before 
> starting Solr.
>      (x.y.z refers to a version number)
>      If you would like to optimize for precision, set the default operator to AND with
>        <solrQueryParser defaultOperator="AND"/>
>      below (this file).  Use "OR" if you would like to optimize for recall 
> (default).
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
> autoGeneratePhraseQueries="false">
>   <analyzer>
>     <!-- Kuromoji Japanese morphological analyzer/tokenizer
>          Use search-mode to get a noun-decompounding effect useful for search.
>          Example:
>            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 
> (International) 空港 (airport)
>            so we get a match for 空港 (airport) as we would expect from a good 
> search engine
>          Valid values for mode are:
>             normal: default segmentation
>             search: segmentation useful for search (extra compound splitting)
>           extended: search mode with unigramming of unknown words 
> (experimental)
>          NOTE: Search mode improves segmentation for search at the expense of 
> part-of-speech accuracy
>     -->
>     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
>     <!-- Reduces inflected verbs and adjectives to their base/dictionary 
> forms (辞書形) -->
>     <filter class="solr.KuromojiBaseFormFilterFactory"/>
>     <!-- Optionally remove tokens with certain parts of speech
>     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" 
> tags="stopTags.txt" enablePositionIncrements="true"/> -->
>     <!-- Normalizes full-width romaji to half-width and half-width kana to 
> full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Lower-case romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}



