[jira] [Updated] (SOLR-3056) Introduce Japanese field type in schema.xml

Robert Muir (Updated) (JIRA) Wed, 08 Feb 2012 05:14:25 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated SOLR-3056:
------------------------------

    Attachment: SOLR-3056.patch

Attached is Christian's patch, synced up to trunk.

Additionally, I modified the factory to be lazier, so that you pay no RAM cost 
unless you actually use text_ja.

Segmenter itself is very lightweight (except the first time it is called, when 
the classloader ensures the singletons are loaded). In fact, the Lucene 
tokenizer even has a no-arg ctor that calls "new Segmenter()".

Because tokenstreams are reused anyway via threadlocal, we only call create() 
once per thread... and again, it's just a lightweight Segmenter, which is likely 
cheaper than even all the attributesource stuff already needed for the 
tokenstream.

So this has no impact on kuromoji's performance, just defers the initialization 
so that if you don't use text_ja the resources are not loaded.
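
To illustrate the deferral described above, here is a minimal sketch of the 
initialization-on-demand holder idiom. This is not the code in the patch: 
HeavyDictionary, LightSegmenter and LazyTokenizerFactory are hypothetical 
stand-ins, and the real Kuromoji classes may be organized differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the heavy, shared dictionary data (not a Kuromoji class). */
final class HeavyDictionary {
    /** Counts how many times the expensive data has been built. */
    static final AtomicInteger LOADS = new AtomicInteger();
    final int[] data;

    private HeavyDictionary() {
        LOADS.incrementAndGet();
        data = new int[1 << 16]; // pretend this is the large lookup table
    }

    // Initialization-on-demand holder: the JVM initializes Holder (and therefore
    // builds INSTANCE) only when getInstance() is first called, thread-safely.
    private static final class Holder {
        static final HeavyDictionary INSTANCE = new HeavyDictionary();
    }

    static HeavyDictionary getInstance() {
        return Holder.INSTANCE;
    }
}

/** Hypothetical lightweight object that only references the shared data. */
final class LightSegmenter {
    final HeavyDictionary dict = HeavyDictionary.getInstance();
}

/** Hypothetical factory: constructing it loads nothing; only create() touches the data. */
final class LazyTokenizerFactory {
    LightSegmenter create() {
        return new LightSegmenter();
    }
}

class LazyInitDemo {
    public static void main(String[] args) {
        LazyTokenizerFactory factory = new LazyTokenizerFactory();
        System.out.println("loads after factory construction: " + HeavyDictionary.LOADS.get());
        factory.create();
        factory.create();
        System.out.println("loads after two create() calls: " + HeavyDictionary.LOADS.get());
    }
}
```

In this sketch the load counter stays at 0 until the first create() and stays 
at 1 afterwards, so a schema that never uses the field type never triggers the 
load.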

I reviewed the fieldtype, and only have one last question! (I didn't change 
anything from your configuration)

I noticed the order of the tokenfilters is different from the order defined in 
KuromojiAnalyzer. This order can be important in some situations, so I think we 
should correct one or the other to be consistent?
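
As a generic illustration of why filter order can matter, here is a small 
sketch using plain Java functions rather than the Lucene or Kuromoji filters. 
The lowerCase and removeStopWords helpers and the stop word list are 
assumptions made up for this example; it does not reproduce the order used in 
KuromojiAnalyzer or in the proposed field type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class FilterOrderDemo {
    // Hypothetical stop word list, stored in lower case only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the"));

    /** Lower-cases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Drops tokens that exactly match an entry in STOP_WORDS (case-sensitive). */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Airport");
        // Lower-casing first lets the stop filter see "the" and remove it.
        System.out.println(removeStopWords(lowerCase(tokens))); // [airport]
        // Filtering first misses "The", so it survives and is lower-cased later.
        System.out.println(lowerCase(removeStopWords(tokens))); // [the, airport]
    }
}
```

The same two filters give different token streams depending on their order, 
which is why the analyzer and the schema should agree.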

                
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
>                 Key: SOLR-3056
>                 URL: https://issues.apache.org/jira/browse/SOLR-3056
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: SOLR-3056.patch, SOLR-3056_move.patch, 
> SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch
>
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again 
> Robert, Uwe and Simon). It would be very good to get a default field type 
> defined for Japanese in {{schema.xml}} so we can get good Japanese 
> out-of-the-box support in Solr.
> I've been playing with the below configuration today, which I think is a 
> reasonable starting point for Japanese.  There's a lot to be said about various 
> considerations necessary when searching Japanese, but perhaps a wiki page is 
> more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and 
> its analyzers need to be seen by the Solr classloader.  However, these are 
> currently in contrib and I'm wondering if we should consider moving them to 
> core to make them directly available.  If there are concerns with additional 
> memory usage, etc. for non-Japanese users, we can make sure resources are 
> loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type is suitable for Japanese text using morphological 
> analysis
>      NOTE: Please copy files
>        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
>        dist/apache-solr-analysis-extras-x.y.z.jar
>      to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
>      (x.y.z refers to a version number)
>      If you would like to optimize for precision, set the default operator to
>      AND with
>        <solrQueryParser defaultOperator="AND"/>
>      below (in this file).  Use "OR" (the default) if you would like to
>      optimize for recall.
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
> autoGeneratePhraseQueries="false">
>   <analyzer>
>     <!-- Kuromoji Japanese morphological analyzer/tokenizer
>          Use search-mode to get a noun-decompounding effect useful for search.
>          Example:
>            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 
> (International) 空港 (airport)
>            so we get a match for 空港 (airport) as we would expect from a good 
> search engine
>          Valid values for mode are:
>             normal: default segmentation
>             search: segmentation useful for search (extra compound splitting)
>           extended: search mode with unigramming of unknown words 
> (experimental)
>          NOTE: Search mode improves segmentation for search at the expense of 
> part-of-speech accuracy
>     -->
>     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
>     <!-- Reduces inflected verbs and adjectives to their base/dictionary 
> forms (辞書形) -->
>     <filter class="solr.KuromojiBaseFormFilterFactory"/>
>     <!-- Optionally remove tokens with certain parts of speech
>     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" 
> tags="stopTags.txt" enablePositionIncrements="true"/> -->
>     <!-- Normalizes full-width romaji to half-width and half-width kana to 
> full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Lower-case romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
