[jira] [Created] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

2012-03-29 Thread Christian Moen (Created) (JIRA)
Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
---

 Key: LUCENE-3935
 URL: https://issues.apache.org/jira/browse/LUCENE-3935
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


I've been profiling Kuromoji and, not very surprisingly, the method 
{{ConnectionCosts.get(int forwardId, int backwardId)}}, which looks up costs in 
the Viterbi, is called a great many times and contributes more to processing 
time than I had expected.

This method is currently backed by a {{short[][]}}.  The data structure stored 
here is a two-dimensional array with both dimensions fixed at 1316 elements.  
(The data is {{matrix.def}} in MeCab-IPADIC.)

We can rewrite this to use a single one-dimensional array instead.  This will 
save at least one bounds check and one pointer dereference, and we should also 
get much better cache utilization, since this structure is likely to sit in a 
CPU cache close to the core.
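
A minimal sketch of the flattened layout, assuming row-major order with {{forwardId}} selecting the row.  The class name {{FlatConnectionCosts}} and the {{put}} method are hypothetical and only for illustration; this is not the existing Kuromoji code.

```java
/**
 * Sketch of a connection-cost table stored in one flat array.
 * Class and method names are hypothetical, not the Kuromoji API.
 */
final class FlatConnectionCosts {
    private final int size;       // ids per dimension (1316 for MeCab-IPADIC)
    private final short[] costs;  // row-major: one row per forwardId (an assumption)

    FlatConnectionCosts(int size) {
        this.size = size;
        this.costs = new short[size * size];
    }

    /** Stores a cost; used here only to populate the table. */
    void put(int forwardId, int backwardId, short cost) {
        costs[forwardId * size + backwardId] = cost;
    }

    /** One multiply-add and a single array access instead of two. */
    short get(int forwardId, int backwardId) {
        return costs[forwardId * size + backwardId];
    }
}
```

One thing to keep in mind with this layout is that only the combined index is bounds-checked, so an out-of-range {{backwardId}} can silently read a neighbouring row.  The sketch assumes ids are always valid.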

I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Created) (JIRA)
Perform Kuromoji/Japanese stability test before 3.6 freeze
--

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


Kuromoji might be used by many users, including in mission-critical systems.  
I'd like to run a stability test before we freeze 3.6.

My thinking is to test the out-of-the-box configuration using fieldtype 
{{text_ja}} as follows:

# Index all Japanese Wikipedia documents (approx. 1.4M documents) in a 
never-ending loop
# Simultaneously run 1 million or so typical Japanese queries against the index 
at 3-5 queries per second

While Solr is indexing and searching, I'd like to verify that:

* Indexing and queries are working as expected
* Memory and heap usage look stable over time
* Garbage collection overhead stays low over time -- no Full-GC issues

I'll post findings to this JIRA as I get things going.
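
For the heap and GC checks, a rough sketch of how the counters could be sampled from inside a JVM using the standard {{java.lang.management}} beans.  The class {{JvmStabilityProbe}} is a hypothetical helper; it reads the JVM it runs in, so for a Solr instance it would have to run in the same process or be adapted to a remote JMX connection.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that samples heap usage and cumulative GC counters. */
final class JvmStabilityProbe {

    /** One sample: used heap in bytes plus cumulative GC count and time. */
    static final class Sample {
        final long heapUsedBytes;
        final long gcCount;
        final long gcTimeMillis;

        Sample(long heapUsedBytes, long gcCount, long gcTimeMillis) {
            this.heapUsedBytes = heapUsedBytes;
            this.gcCount = gcCount;
            this.gcTimeMillis = gcTimeMillis;
        }

        @Override
        public String toString() {
            return "heapUsed=" + heapUsedBytes + "B gcCount=" + gcCount
                + " gcTime=" + gcTimeMillis + "ms";
        }
    }

    /** Reads the current values from the platform MXBeans. */
    static Sample sample() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // A collector reports -1 when a counter is undefined; treat that as 0.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        return new Sample(heap.getUsed(), count, time);
    }
}
```

Logging {{JvmStabilityProbe.sample()}} at a fixed interval during the run would give a time series in which steadily growing heap usage or rapidly growing GC time stands out.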




[jira] [Created] (SOLR-3276) Update ja_text entry in schema.xml with useful info

2012-03-26 Thread Christian Moen (Created) (JIRA)
Update ja_text entry in schema.xml with useful info
---

 Key: SOLR-3276
 URL: https://issues.apache.org/jira/browse/SOLR-3276
 Project: Solr
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


Searching Japanese text is a big topic with many considerations to take into 
account.  I think it's helpful to add a link to the wiki in a comment near 
{{text_ja}} in {{schema.xml}} to guide users to detailed information on 
features available, how to use them, etc.

I've made a placeholder page on 
[http://wiki.apache.org/solr/JapaneseLanguageSupport] and I'll add details 
post-release.





[jira] [Created] (LUCENE-3916) Consider different query and index segmentation for Japanese

2012-03-25 Thread Christian Moen (Created) (JIRA)
Consider different query and index segmentation for Japanese


 Key: LUCENE-3916
 URL: https://issues.apache.org/jira/browse/LUCENE-3916
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Priority: Minor


Kuromoji today uses search mode segmentation both at query and index time.

The benefit of search mode segmentation is that it segments compounds such as 
関西国際空港 (Kansai International Airport) into 関西 (Kansai), 国際 (international), 空港 
(airport), and leaves the compound 関西国際空港 as a synonym to 関西.

This segmentation allows us to get a match for 空港 (airport), which is good for 
recall, and we'd get good precision when searching for the compound 関西国際空港 
because of IDF.

However, if we search for the compound 関西国際空港 (Kansai International Airport) 
our query becomes (by default) an OR-query with terms 関西 (Kansai), 関西国際空港 
(Kansai International Airport), 国際 (international) and 空港 (airport).

This behaviour is by design when using OR as the default operator, but it 
also has the effect of returning generic hits like 空港 (airport) when the user 
searches for something very specific like 関西国際空港 (Kansai International Airport) 
-- and these hits are also highlighted.

This doesn't necessarily mean that ranking is flawed per se, but a user or 
application might prefer precision over recall.  In order to favour precision, 
we can consider using normal mode segmentation for queries, but retain search 
mode segmentation on the indexing side.
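
The effect can be illustrated with a toy model.  The sketch below does not call Kuromoji: {{searchMode}} and {{normalMode}} are hypothetical stand-ins with the segmentation for 関西国際空港 hard-coded from the example above, and {{matchesAny}} models an OR-query as "at least one shared term".

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Toy model of index-time vs. query-time segmentation; not the Kuromoji API. */
final class SegmentationDemo {

    /** Stand-in for search mode: decompounds the one compound it knows. */
    static List<String> searchMode(String text) {
        if (text.equals("関西国際空港")) {
            return Arrays.asList("関西", "関西国際空港", "国際", "空港");
        }
        return Collections.singletonList(text);
    }

    /** Stand-in for normal mode: keeps the compound as a single token. */
    static List<String> normalMode(String text) {
        return Collections.singletonList(text);
    }

    /** OR-query semantics: a document matches if it shares any term with the query. */
    static boolean matchesAny(Collection<String> docTerms, Collection<String> queryTerms) {
        return !Collections.disjoint(docTerms, queryTerms);
    }
}
```

With both documents indexed in search mode, a search mode query for 関西国際空港 matches a document that only contains 空港, while a normal mode query for the same compound does not.  A query for 空港 still matches the 関西国際空港 document in either case, so recall for the short query is unaffected in this model.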

Does anyone have any general opinion on this?  What would we do for other 
languages in the case of compound splitting?

Perhaps this can be dealt with as a documentation issue with a comment in 
{{schema.xml}} while keeping the current behaviour?

Many thanks for any input.




[jira] [Created] (LUCENE-3915) Add Japanese filter to replace term attribute with readings

2012-03-24 Thread Christian Moen (Created) (JIRA)
Add Japanese filter to replace term attribute with readings
---

 Key: LUCENE-3915
 URL: https://issues.apache.org/jira/browse/LUCENE-3915
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Christian Moen
Priority: Minor


Koji and Robert are working on LUCENE-3888, which allows spell-checkers to do 
their similarity matching using a different form of a word than its surface 
form.

This approach is very useful for languages such as Japanese, where the surface 
form and the form we'd like to use for similarity matching are very different.  
For Japanese, it's useful to use readings for this -- probably with some 
normalization.




[jira] [Created] (LUCENE-3909) Move Kuromoji to analysis.ja and introduce Japanese* naming

2012-03-24 Thread Christian Moen (Created) (JIRA)
Move Kuromoji to analysis.ja and introduce Japanese* naming
---

 Key: LUCENE-3909
 URL: https://issues.apache.org/jira/browse/LUCENE-3909
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


Lucene/Solr 3.6 and 4.0 will get out-of-the-box Japanese language support 
through {{KuromojiAnalyzer}}, {{KuromojiTokenizer}} and various other filters.  
These filters currently live in {{org.apache.lucene.analysis.kuromoji}}.

I'm proposing that we move Kuromoji to a new Japanese package 
{{org.apache.lucene.analysis.ja}} in line with how other languages are 
organized.  As part of this, I also think we should rename {{KuromojiAnalyzer}} 
to {{JapaneseAnalyzer}}, etc. to further align naming to our conventions by 
making it very clear that these analyzers are for Japanese.  (As much as I like 
the name "Kuromoji", I think "Japanese" is more fitting.)

A potential issue I see with this that I'd like to raise and get feedback on, 
is that end-users in Japan and elsewhere who use lucene-gosen could have issues 
after an upgrade since lucene-gosen is in fact releasing its analyzers under 
the {{org.apache.lucene.analysis.ja}} namespace (and we'd have a name clash).

I believe users should have the freedom to choose whichever Japanese analyzer, 
filter, etc. they'd like to use, and I don't want to propose a name change that 
just creates unnecessary problems for users, but I think the naming proposed 
above is most fitting for a Lucene/Solr release.





[jira] [Created] (LUCENE-3901) Add katakana filter to better deal with katakana spelling variants

2012-03-22 Thread Christian Moen (Created) (JIRA)
Add katakana filter to better deal with katakana spelling variants
--

 Key: LUCENE-3901
 URL: https://issues.apache.org/jira/browse/LUCENE-3901
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Christian Moen
 Fix For: 3.6, 4.0


Many Japanese katakana words end in a long sound that is sometimes optional.

For example, パーティー and パーティ are both perfectly valid for "party".  Similarly we 
have センター and センタ that are variants of "center" as well as サーバー and サーバ for 
"server".

I'm proposing that we add a katakana stemmer that removes this long sound if 
the terms are longer than a configurable length.  It's also possible to add the 
variant as a synonym, but I think stemming is preferred from a ranking point of 
view.
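
A minimal sketch of the stemming rule, assuming that terms shorter than the configured minimum are left unchanged and that only all-katakana terms are touched.  The class name {{KatakanaStemSketch}} and the {{minimumLength}} parameter are hypothetical; in Lucene this logic would live in a token filter.

```java
/** Sketch of the proposed katakana stemming rule; names are hypothetical. */
final class KatakanaStemSketch {
    // KATAKANA-HIRAGANA PROLONGED SOUND MARK, the trailing long sound.
    private static final char PROLONGED_SOUND_MARK = '\u30FC';

    /** Drops one trailing long sound mark from katakana terms of sufficient length. */
    static String stem(String term, int minimumLength) {
        if (term.isEmpty()
                || term.length() < minimumLength
                || term.charAt(term.length() - 1) != PROLONGED_SOUND_MARK
                || !isKatakana(term)) {
            return term;
        }
        return term.substring(0, term.length() - 1);
    }

    /** True if every character belongs to the Unicode Katakana block. */
    private static boolean isKatakana(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }
}
```

With a minimum length of 4, this maps パーティー to パーティ and サーバー to サーバ, while a short term such as コピー is left as it is.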






[jira] [Created] (SOLR-3115) Improve default Japanese stopwords.txt description

2012-02-09 Thread Christian Moen (Created) (JIRA)
Improve default Japanese stopwords.txt description
--

 Key: SOLR-3115
 URL: https://issues.apache.org/jira/browse/SOLR-3115
 Project: Solr
  Issue Type: Improvement
  Components: Rules
Reporter: Christian Moen
Priority: Minor


As discussed in SOLR-3056, the description in the default Japanese 
stopwords.txt should be improved to describe case- and width-handling.




[jira] [Created] (SOLR-3107) Disable random sampling in LangDetectLanguageIdentifierUpdateProcessor

2012-02-07 Thread Christian Moen (Created) (JIRA)
Disable random sampling in LangDetectLanguageIdentifierUpdateProcessor
--

 Key: SOLR-3107
 URL: https://issues.apache.org/jira/browse/SOLR-3107
 Project: Solr
  Issue Type: Improvement
  Components: contrib - LangId
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Priority: Minor


The {{language-detection}} library used by 
{{LangDetectLanguageIdentifierUpdateProcessor}} has a random sampling feature, 
enabled by default, as a means of avoiding local noise in input.  The feature 
has its merits, but it can also be confusing to users who aren't aware of it, 
since it may give different results for the same input.  I recommend turning it 
off to prevent confusion.




[jira] [Created] (SOLR-3097) Introduce default Japanese stoptags and stopwords to Solr's example configuration

2012-02-04 Thread Christian Moen (Created) (JIRA)
Introduce default Japanese stoptags and stopwords to Solr's example 
configuration
-

 Key: SOLR-3097
 URL: https://issues.apache.org/jira/browse/SOLR-3097
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


SOLR-3056 discusses introducing a default field type {{text_ja}} for Japanese 
in {{schema.xml}}.  This configuration will be improved by also introducing 
default stopwords and stoptags configuration for the field type.  

I believe this configuration should be easily available and tunable to Solr 
users and I'm proposing that we introduce the same stopwords and stoptags 
provided in LUCENE-3745 to Solr's example configuration.  I'm proposing that 
the files live in {{solr/example/solr/conf}} as {{stopwords_ja.txt}} and 
{{stoptags_ja.txt}} alongside {{stopwords_en.txt}} for English.  (Longer term, 
I think we should reconsider our overall approach to this across all languages, 
but that's perhaps a separate discussion.)





[jira] [Created] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr

2012-02-04 Thread Christian Moen (Created) (JIRA)
Align default Japanese configurations for Lucene and Solr
-

 Key: LUCENE-3751
 URL: https://issues.apache.org/jira/browse/LUCENE-3751
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


The {{KuromojiAnalyzer}} in Lucene should have the same default configuration as 
the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

2012-02-01 Thread Christian Moen (Created) (JIRA)
Need stopwords and stoptags lists for default Japanese configuration


 Key: LUCENE-3745
 URL: https://issues.apache.org/jira/browse/LUCENE-3745
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen


Stopwords and stoptags lists for Japanese need to be developed, tested and 
integrated into Lucene.




[jira] [Created] (LUCENE-3730) Improved Kuromoji search mode segmentation/decompounding

2012-01-29 Thread Christian Moen (Created) (JIRA)
Improved Kuromoji search mode segmentation/decompounding


 Key: LUCENE-3730
 URL: https://issues.apache.org/jira/browse/LUCENE-3730
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


Kuromoji has a segmentation mode for search that uses a heuristic to promote 
additional segmentation of long candidate tokens to get a decompounding effect.  
This heuristic has been improved.  Patch is coming up.




[jira] [Created] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-01-22 Thread Christian Moen (Created) (JIRA)
Introduce Japanese field type in schema.xml
---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again 
Robert, Uwe and Simon). It would be very good to get a default field type 
defined for Japanese in {{schema.xml}} so we can get good Japanese 
out-of-the-box support in Solr.

I've been playing with the below configuration today, which I think is a 
reasonable starting point for Japanese.  There's a lot to be said about various 
considerations necessary when searching Japanese, but perhaps a wiki page is 
more suitable to cover the wider topic?

In order to make the below {{text_ja}} field type work, Kuromoji itself and its 
analyzers need to be seen by the Solr classloader.  However, these are 
currently in contrib and I'm wondering if we should consider moving them to 
core to make them directly available.  If there are concerns with additional 
memory usage, etc. for non-Japanese users, we can make sure resources are 
loaded lazily and only when needed in factory-land.

Any thoughts?

{code:xml}


  









  

{code}

