[jira] [Created] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method --- Key: LUCENE-3935 URL: https://issues.apache.org/jira/browse/LUCENE-3935 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen I've been profiling Kuromoji and, not very surprisingly, the method {{ConnectionCosts.get(int forwardId, int backwardId)}}, which looks up costs during the Viterbi search, is called a great many times and contributes more to processing time than I had expected. This method is currently backed by a {{short[][]}}. The data structure stored here is a two-dimensional array with both dimensions fixed at 1316 elements. (The data is {{matrix.def}} in MeCab-IPADIC.) We can rewrite this to use a single one-dimensional array instead, which will save at least one bounds check and one pointer dereference per lookup, and we should also get much better cache utilization since the structure is then likely to sit in a CPU cache close to the core. I think this will be a nice optimization. Working on it... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
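The flattening proposed in LUCENE-3935 can be sketched roughly as follows. This is a minimal illustration and not the actual Lucene code: the class name {{FlatConnectionCosts}}, the constructor taking the old {{short[][]}}, and the choice of {{forwardId}} as the row index are all assumptions made for the example.

```java
/** Hypothetical sketch of a flattened cost table; not the real Lucene class. */
final class FlatConnectionCosts {
    // Number of ids per dimension (1316 for the matrix described in the issue).
    private final int size;
    // Row-major storage: the cost for (forwardId, backwardId) lives at
    // costs[forwardId * size + backwardId]. The row/column order is an assumption.
    private final short[] costs;

    FlatConnectionCosts(short[][] matrix) {
        this.size = matrix.length;
        this.costs = new short[size * size];
        for (int forwardId = 0; forwardId < size; forwardId++) {
            if (matrix[forwardId].length != size) {
                throw new IllegalArgumentException("matrix must be square");
            }
            // Copy each row of the two-dimensional array into its slot in the flat array.
            System.arraycopy(matrix[forwardId], 0, costs, forwardId * size, size);
        }
    }

    /** One array access (and one bounds check) instead of two. */
    int get(int forwardId, int backwardId) {
        return costs[forwardId * size + backwardId];
    }

    public static void main(String[] args) {
        short[][] matrix = new short[1316][1316];
        matrix[3][7] = 42;
        matrix[1315][1315] = -5;
        FlatConnectionCosts flat = new FlatConnectionCosts(matrix);
        System.out.println(flat.get(3, 7));
        System.out.println(flat.get(1315, 1315));
    }
}
```

One trade-off of this layout: an out-of-range {{backwardId}} is no longer rejected on its own, because the combined index may still fall inside the flat array and silently read a neighbouring row. Callers therefore need to pass valid ids.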
[jira] [Created] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
Perform Kuromoji/Japanese stability test before 3.6 freeze -- Key: SOLR-3282 URL: https://issues.apache.org/jira/browse/SOLR-3282 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Kuromoji might be used by many and also in mission-critical systems. I'd like to run a stability test before we freeze 3.6. My thinking is to test the out-of-the-box configuration using field type {{text_ja}} as follows:
# Index all Japanese Wikipedia documents (approx. 1.4M documents) in a never-ending loop
# Simultaneously run 1 million or so typical Japanese queries against the index at 3-5 queries per second
While Solr is indexing and searching, I'd like to verify that:
* Indexing and queries are working as expected
* Memory and heap usage look stable over time
* Garbage collection is overall low over time -- no Full-GC issues
I'll post findings to this JIRA as I get things going.
[jira] [Created] (SOLR-3276) Update ja_text entry in schema.xml with useful info
Update ja_text entry in schema.xml with useful info --- Key: SOLR-3276 URL: https://issues.apache.org/jira/browse/SOLR-3276 Project: Solr Issue Type: Improvement Components: documentation Affects Versions: 3.6, 4.0 Reporter: Christian Moen Searching Japanese text is a big topic with many considerations. I think it's helpful to add a link to the wiki in a comment near {{text_ja}} in {{schema.xml}} to guide users to detailed information on the features available, how to use them, etc. I've made a placeholder page on [http://wiki.apache.org/solr/JapaneseLanguageSupport] and I'll add details post-release.
[jira] [Created] (LUCENE-3916) Consider different query and index segmentation for Japanese
Consider different query and index segmentation for Japanese Key: LUCENE-3916 URL: https://issues.apache.org/jira/browse/LUCENE-3916 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Priority: Minor Kuromoji today uses search mode segmentation both at query and index time. The benefit of search mode segmentation is that it segments compounds such as 関西国際空港 (Kansai International Airport) into 関西 (Kansai), 国際 (international), 空港 (airport), and leaves the compound 関西国際空港 as a synonym to 関西. This segmentation allows us to get a match for 空港 (airport), which is good for recall, and we'd get good precision when searching for the compound 関西国際空港 because of IDF. However, if we search for the compound 関西国際空港 (Kansai International Airport) our query becomes (by default) an OR-query with terms 関西 (Kansai), 関西国際空港 (Kansai International Airport), 国際 (international) and 空港 (airport). This behaviour is by design when using OR as the default operator, but it also has the effect of returning generic hits like 空港 (airport) when the user searches for something very specific like 関西国際空港 (Kansai International Airport) -- and these hits are also highlighted. This doesn't necessarily mean that ranking is flawed per se, but a user or application might prefer precision over recall. In order to favour precision, we can consider using normal mode segmentation for queries, but retain search mode segmentation on the indexing side. Does anyone have any general opinion on this? What would we do for other languages in the case of compound splitting? Perhaps this can be dealt with as a documentation issue with a comment in {{schema.xml}} while keeping the current behaviour? Many thanks for any input.
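The precision/recall trade-off discussed in LUCENE-3916 can be illustrated with a toy model. This is only a sketch: the two "segmenters" below are hard-coded stand-ins that return the token lists described in the issue for 関西国際空港, and the class and method names are hypothetical. No Kuromoji or Lucene API is used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of index-side search mode vs. query-side normal mode. */
final class SegmentationModeSketch {
    static final String COMPOUND = "関西国際空港";

    // Stand-in for search mode: splits the one known compound and keeps it as a synonym.
    static List<String> searchMode(String text) {
        if (COMPOUND.equals(text)) {
            return Arrays.asList("関西", COMPOUND, "国際", "空港");
        }
        return Arrays.asList(text);
    }

    // Stand-in for normal mode: keeps the text as a single token.
    static List<String> normalMode(String text) {
        return Arrays.asList(text);
    }

    // OR semantics: a document matches if it contains at least one query term.
    static boolean orMatch(List<String> queryTerms, Set<String> indexedTerms) {
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Both documents are indexed with search mode segmentation.
        Set<String> kansaiDoc = new HashSet<>(searchMode(COMPOUND));
        Set<String> airportDoc = new HashSet<>(searchMode("空港"));

        System.out.println("search-mode query hits generic airport doc: "
                + orMatch(searchMode(COMPOUND), airportDoc));
        System.out.println("normal-mode query hits generic airport doc: "
                + orMatch(normalMode(COMPOUND), airportDoc));
        System.out.println("normal-mode query hits Kansai airport doc: "
                + orMatch(normalMode(COMPOUND), kansaiDoc));
        System.out.println("query for 空港 still hits Kansai airport doc: "
                + orMatch(normalMode("空港"), kansaiDoc));
    }
}
```

In this model, the search-mode query for the compound also matches the generic airport document, while the normal-mode query matches only the document that contains the compound. A query for 空港 alone still finds the Kansai document because the index side keeps the decompounded terms.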
[jira] [Created] (LUCENE-3915) Add Japanese filter to replace term attribute with readings
Add Japanese filter to replace term attribute with readings --- Key: LUCENE-3915 URL: https://issues.apache.org/jira/browse/LUCENE-3915 Project: Lucene - Java Issue Type: New Feature Reporter: Christian Moen Priority: Minor Koji and Robert are working on LUCENE-3888, which allows spell-checkers to do their similarity matching using a form of the word other than its surface form. This approach is very useful for languages such as Japanese where the surface form and the form we'd like to use for similarity matching are very different. For Japanese, it's useful to use readings for this -- probably with some normalization.
[jira] [Created] (LUCENE-3909) Move Kuromoji to analysis.ja and introduce Japanese* naming
Move Kuromoji to analysis.ja and introduce Japanese* naming --- Key: LUCENE-3909 URL: https://issues.apache.org/jira/browse/LUCENE-3909 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Lucene/Solr 3.6 and 4.0 will get out-of-the-box Japanese language support through {{KuromojiAnalyzer}}, {{KuromojiTokenizer}} and various other filters. These filters currently live in {{org.apache.lucene.analysis.kuromoji}}. I'm proposing that we move Kuromoji to a new Japanese package {{org.apache.lucene.analysis.ja}} in line with how other languages are organized. As part of this, I also think we should rename {{KuromojiAnalyzer}} to {{JapaneseAnalyzer}}, etc. to further align naming to our conventions by making it very clear that these analyzers are for Japanese. (As much as I like the name "Kuromoji", I think "Japanese" is more fitting.) A potential issue I see with this that I'd like to raise and get feedback on, is that end-users in Japan and elsewhere who use lucene-gosen could have issues after an upgrade since lucene-gosen is in fact releasing its analyzers under the {{org.apache.lucene.analysis.ja}} namespace (and we'd have a name clash). I believe users should have the freedom to choose whichever Japanese analyzer, filter, etc. they'd like to use, and I don't want to propose a name change that just creates unnecessary problems for users, but I think the naming proposed above is most fitting for a Lucene/Solr release.
[jira] [Created] (LUCENE-3901) Add katakana filter to better deal with katakana spelling variants
Add katakana filter to better deal with katakana spelling variants -- Key: LUCENE-3901 URL: https://issues.apache.org/jira/browse/LUCENE-3901 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Christian Moen Fix For: 3.6, 4.0 Many Japanese katakana words end in a long sound that is sometimes optional. For example, パーティー and パーティ are both perfectly valid for "party". Similarly we have センター and センタ that are variants of "center" as well as サーバー and サーバ for "server". I'm proposing that we add a katakana stemmer that removes this long sound if the terms are longer than a configurable length. It's also possible to add the variant as a synonym, but I think stemming is preferred from a ranking point of view.
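The stemming rule proposed in LUCENE-3901 could look roughly like the sketch below. It is a plain string function rather than a Lucene {{TokenFilter}}, and several details are assumptions made for the example: the class and method names, reading "longer than a configurable length" as "at least {{minLength}} characters", the threshold of 4 used in the demo, and the restriction to terms written entirely in katakana.

```java
/** Hypothetical sketch of the proposed katakana stemming rule; not a Lucene filter. */
final class KatakanaStemSketch {
    // KATAKANA-HIRAGANA PROLONGED SOUND MARK, the trailing long sound "ー".
    static final char PROLONGED_SOUND_MARK = '\u30FC';

    /** Removes one trailing long sound mark from katakana terms of at least minLength chars. */
    static String stem(String term, int minLength) {
        if (term.length() >= minLength
                && isKatakana(term)
                && term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** True if every character is in the Unicode Katakana block (which includes "ー"). */
    static boolean isKatakana(String term) {
        if (term.isEmpty()) {
            return false;
        }
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int minLength = 4; // assumed threshold for this demo
        for (String term : new String[] {"パーティー", "センター", "サーバー", "コピー"}) {
            System.out.println(term + " -> " + stem(term, minLength));
        }
    }
}
```

With a threshold of 4, パーティー, センター and サーバー lose their final long sound, while the three-character コピー ("copy") is left unchanged. This is the reason for making the length configurable: short words are more likely to become ambiguous when stemmed.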
[jira] [Created] (SOLR-3115) Improve default Japanese stopwords.txt description
Improve default Japanese stopwords.txt description -- Key: SOLR-3115 URL: https://issues.apache.org/jira/browse/SOLR-3115 Project: Solr Issue Type: Improvement Components: Rules Reporter: Christian Moen Priority: Minor As discussed in SOLR-3056, the description in the default Japanese stopwords.txt should be improved to describe case- and width-handling.
[jira] [Created] (SOLR-3107) Disable random sampling in LangDetectLanguageIdentifierUpdateProcessor
Disable random sampling in LangDetectLanguageIdentifierUpdateProcessor -- Key: SOLR-3107 URL: https://issues.apache.org/jira/browse/SOLR-3107 Project: Solr Issue Type: Improvement Components: contrib - LangId Affects Versions: 3.6, 4.0 Reporter: Christian Moen Priority: Minor The {{language-detection}} library used by {{LangDetectLanguageIdentifierUpdateProcessor}} has a random sampling feature, enabled by default, as a means of avoiding local noise in input. The feature has its merits, but it can also be confusing to users who aren't aware of it since it may give different results for the same input. I recommend turning it off to prevent confusion.
[jira] [Created] (SOLR-3097) Introduce default Japanese stoptags and stopwords to Solr's example configuration
Introduce default Japanese stoptags and stopwords to Solr's example configuration - Key: SOLR-3097 URL: https://issues.apache.org/jira/browse/SOLR-3097 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen SOLR-3056 discusses introducing a default field type {{text_ja}} for Japanese in {{schema.xml}}. This configuration will be improved by also introducing default stopwords and stoptags configuration for the field type. I believe this configuration should be easily available and tunable to Solr users and I'm proposing that we introduce the same stopwords and stoptags provided in LUCENE-3745 to Solr's example configuration. I'm proposing that the files live in {{solr/example/solr/conf}} as {{stopwords_ja.txt}} and {{stoptags_ja.txt}} alongside {{stopwords_en.txt}} for English. (Longer term, I think we should reconsider our overall approach to this across all languages, but that's perhaps a separate discussion.)
[jira] [Created] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr
Align default Japanese configurations for Lucene and Solr - Key: LUCENE-3751 URL: https://issues.apache.org/jira/browse/LUCENE-3751 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen The {{KuromojiAnalyzer}} in Lucene should have the same default configuration as the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.
[jira] [Created] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration
Need stopwords and stoptags lists for default Japanese configuration Key: LUCENE-3745 URL: https://issues.apache.org/jira/browse/LUCENE-3745 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Christian Moen Stopwords and stoptags lists for Japanese need to be developed, tested and integrated into Lucene.
[jira] [Created] (LUCENE-3730) Improved Kuromoji search mode segmentation/decompounding
Improved Kuromoji search mode segmentation/decompounding Key: LUCENE-3730 URL: https://issues.apache.org/jira/browse/LUCENE-3730 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Kuromoji has a segmentation mode for search that uses a heuristic to promote additional segmentation of long candidate tokens to get a decompounding effect. This heuristic has been improved. Patch is coming up.
[jira] [Created] (SOLR-3056) Introduce Japanese field type in schema.xml
Introduce Japanese field type in schema.xml --- Key: SOLR-3056 URL: https://issues.apache.org/jira/browse/SOLR-3056 Project: Solr Issue Type: New Feature Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again Robert, Uwe and Simon). It would be very good to get a default field type defined for Japanese in {{schema.xml}} so we can get good out-of-the-box Japanese support in Solr. I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about the various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable to cover the wider topic? In order to make the below {{text_ja}} field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available. If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land. Any thoughts? {code:xml} {code}