[ https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197952#comment-13197952 ]
Christian Moen edited comment on SOLR-3056 at 2/1/12 5:06 PM:
--------------------------------------------------------------

Robert, let's enable stop-words and stop-tags by default. The stopwords list in the Lucene analyzer looks too small unless it's always used in combination with a stoptags filter. I'll look into both of these.

Also, if we're using search mode, part-of-speech F will decrease so we might want to rely more on stopwords rather than stoptags if it goes down by a whole lot. However, since tokens agree in 99.7% of the cases based on the tests I did earlier -- and the part-of-speech tags we'd typically use as stop tags aren't involved with token-splits done by search mode, I don't expect this to be an issue, but it's something to keep in mind. I'll run some tests to verify this and follow up by suggesting configuration.

I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations.

      was (Author: cm):
    Robert,

Let's enable stop-words and stop-tags by default. The stopwords list in the Lucene analyzer looks too small unless it's always used in combination with a stoptags filter. I'll look into both of these.

Also, if we're using search mode, part-of-speech F will decrease so we might want to rely more on stopwords rather than stoptags if it goes down by a whole lot. However, since tokens agree in 99.7% of the cases based on the tests I did earlier and the part-of-speech tags we'd typically use as stop tags aren't involved with tokens split by search mode, I don't expect this to be a real issue, but it's something to keep in mind. I'll do some testing to verify this and I'll follow up with further improvements to configuration.

I'll also open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configuration.

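For reference, enabling both stop-words and stop-tags in the proposed field type might look roughly like the fragment below. This is a sketch under assumptions, not the configuration that was eventually committed: `stopwords_ja.txt` is a hypothetical file name, `stopTags.txt` is the name used in the commented-out filter of the issue description, and the position of the stop-word filter in the chain is a judgment call.

```xml
<!-- Sketch only: text_ja with both stop filters enabled.
     stopwords_ja.txt is a hypothetical file name. -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Drop tokens whose part-of-speech tag is listed in stopTags.txt -->
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Drop tokens listed in the stop-word file (hypothetical file name) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Placing the stop-word filter after the base-form and width filters means the word list only needs to contain normalized dictionary forms, which keeps it shorter; whether that trade-off is right would depend on the tests mentioned in the comment.
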
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
>                 Key: SOLR-3056
>                 URL: https://issues.apache.org/jira/browse/SOLR-3056
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: SOLR-3056_move.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch
>
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again Robert, Uwe and Simon). It would be very good to get a default field type defined for Japanese in {{schema.xml}} so we can get good Japanese out-of-the-box support in Solr.
> I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type is suitable for Japanese text using morphological analysis
>      NOTE: Please copy files
>        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
>        dist/apache-solr-analysis-extras-x.y.z.jar
>      to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
>      (x.y.z refers to a version number)
>      If you would like to optimize for precision, set the default operator to AND with
>        <solrQueryParser defaultOperator="AND"/>
>      below (in this file). Use "OR" if you would like to optimize for recall (default).
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <!-- Kuromoji Japanese morphological analyzer/tokenizer
>          Use search-mode to get a noun-decompounding effect useful for search.
>          Example:
>            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 (International) 空港 (airport)
>            so we get a match for 空港 (airport) as we would expect from a good search engine
>          Valid values for mode are:
>            normal: default segmentation
>            search: segmentation useful for search (extra compound splitting)
>            extended: search mode with unigramming of unknown words (experimental)
>          NOTE: Search mode improves segmentation for search at the expense of part-of-speech accuracy
>     -->
>     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
>     <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
>     <filter class="solr.KuromojiBaseFormFilterFactory"/>
>     <!-- Optionally remove tokens with certain part-of-speech tags
>     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/> -->
>     <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Lower-case romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
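
To show how the {{text_ja}} type from the issue description would be put to use, the schema.xml fragments below declare a field of that type and, optionally, switch the default operator as the description's note suggests. This is a sketch only: the field name `body_ja` is a hypothetical example and does not appear in the issue.

```xml
<!-- Hypothetical field using the text_ja type; the name "body_ja" is an assumption -->
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>

<!-- Optional: favor precision by requiring all query terms to match (the default, OR, favors recall) -->
<solrQueryParser defaultOperator="AND"/>
```

With search mode enabled in the tokenizer, a document containing 関西国際空港 in such a field would be expected to match a query for 空港, per the decompounding example in the description.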