[jira] Issue Comment Edited: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707821#action_12707821 ]

Earwin Burrfoot edited comment on LUCENE-1387 at 5/10/09 11:16 AM:
---
LatLonDistanceFilter.java:

  public BitSet bits(IndexReader reader) throws IOException {
    /* Create a BitSet to store the result */
    int maxdocs = reader.numDocs(); - probably reader.maxDoc ?
    BitSet bits = new BitSet(maxdocs);

was (Author: earwin):
LatLonDistanceFilter.java:

  public BitSet bits(IndexReader reader) throws IOException {
    /* Create a BitSet to store the result */
    int maxdocs = reader.numDocs(); - probably reader.maxDocs ?
    BitSet bits = new BitSet(maxdocs);

Add LocalLucene
---
Key: LUCENE-1387
URL: https://issues.apache.org/jira/browse/LUCENE-1387
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/spatial
Reporter: Grant Ingersoll
Assignee: Ryan McKinley
Priority: Minor
Fix For: 2.9
Attachments: spatial-lucene.zip, spatial.tar.gz, spatial.zip

Local Lucene (Geo-search) has been donated to the Lucene project, per https://issues.apache.org/jira/browse/INCUBATOR-77. This issue is to handle the Lucene portion of integration.
See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
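For readers following the numDocs/maxDoc remark: in Lucene, doc IDs run from 0 to maxDoc() - 1, while numDocs() counts only live documents, so after deletions a live doc ID can be larger than numDocs(). The sketch below tries to show that with a hypothetical ToyReader stand-in (it is not the Lucene IndexReader API, and MaxDocDemo is likewise an invented name). Note that java.util.BitSet grows on demand, so the constructor argument is only an initial size there; a fixed-size bit set sized by numDocs() would be too small.

```java
import java.util.BitSet;

/** Hypothetical stand-in for an index reader; NOT the Lucene IndexReader API. */
class ToyReader {
    private final boolean[] deleted; // deleted[i] is true if doc i was deleted

    ToyReader(boolean[] deleted) {
        this.deleted = deleted.clone();
    }

    /** One greater than the largest doc ID ever assigned. */
    int maxDoc() {
        return deleted.length;
    }

    /** Number of live (non-deleted) documents. */
    int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        return live;
    }

    boolean isDeleted(int doc) {
        return deleted[doc];
    }
}

class MaxDocDemo {
    /** Builds a bit set of all live documents, sized by maxDoc(). */
    static BitSet liveDocs(ToyReader reader) {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (!reader.isDeleted(doc)) bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        // Five docs were added; docs 0 and 1 were deleted later.
        ToyReader reader = new ToyReader(new boolean[] {true, true, false, false, false});
        BitSet bits = liveDocs(reader);
        System.out.println("numDocs=" + reader.numDocs() + " maxDoc=" + reader.maxDoc());
        // BitSet.length() is the index of the highest set bit plus one.
        System.out.println("highest live doc ID=" + (bits.length() - 1));
    }
}
```

In this toy index the highest live doc ID exceeds numDocs(), which is why maxDoc() is the safer bound for anything indexed by doc ID.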
[jira] Commented: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707821#action_12707821 ]

Earwin Burrfoot commented on LUCENE-1387:
-
LatLonDistanceFilter.java:

  public BitSet bits(IndexReader reader) throws IOException {
    /* Create a BitSet to store the result */
    int maxdocs = reader.numDocs(); - probably reader.maxDocs ?
    BitSet bits = new BitSet(maxdocs);
[jira] Commented: (LUCENE-1632) boolean docid set iterator improvement
[ https://issues.apache.org/jira/browse/LUCENE-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707827#action_12707827 ]

Paul Elschot commented on LUCENE-1632:
--
The real performance improvement here is in the disjunctions, is that correct?
In that case the patch for performance could be simplified by only inlining the heap in ScorerDocQueue, in a similar way as in the patch here. ScorerDocQueue might even disappear completely.

However, by concentrating on disjunctions only, one would lose, for example, AndDocIdSetIterator from the patch, which might be useful as a superclass (or attribute) of all current Scorers that perform conjunctions. Actually, this is (yet) another issue.

For the longer class names, DISI might be preferable over DocIdSetIterator as a name component.

boolean docid set iterator improvement
--
Key: LUCENE-1632
URL: https://issues.apache.org/jira/browse/LUCENE-1632
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.4
Reporter: John Wang
Attachments: Lucene-1632-patch.txt

This was first brought up in Lucene-1345, but the Lucene-1345 conversation has digressed. As suggested, I am creating a separate issue to track this.
Added perf comparisons of the boolean set iterators with the current scorers. See patch.

System: Ubuntu, java version 1.6.0_11, Intel Core 2 Duo 2.44 GHz

Times in milliseconds, 10 runs each:

new                    470  534  450  443  444  445  449  441  444  445   total=4565
old                    529  491  428  549  427  424  420  424  423  422   total=4537
New/Old Time 4565/4537 (100.61715%)

OrDocIdSetIterator    1138 1106 1065 1066 1065 1067 1072 1118 1065 1069   total=10831
DisjunctionMaxScorer  1914 1981 1861 1893 1886 1885 1887 1889 1891 1888   total=18975
Or/DisjunctionMax Time 10831/18975 (57.080368%)

OrDocIdSetIterator    1079 1075 1076 1093 1077 1074 1078 1075 1074 1074   total=10775
DisjunctionSumScorer  1398 1322 1320 1305 1304 1301 1304 1300 1301 1317   total=13172
Or/DisjunctionSum Time 10775/13172 (81.80231%)

AndDocIdSetIterator    330  336  298  299  310  298  298  334  298  299   total=3100
ConjunctionScorer      332  307  302  350  300  304  305  303  303  299   total=3105
And/Conjunction Time 3100/3105 (99.83897%)

main contributors to the patch:
Anmol Bhasin
Yasuhiro Matsuda
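As a rough illustration of the two iterator shapes discussed in this thread, the sketch below merges sorted doc ID lists with a min-heap (disjunction) and leapfrogs cursors to a common doc ID (conjunction). It is not the code from Lucene-1632-patch.txt: ToyDocIdIterators, Cursor, and the int[] inputs are invented for the example, and doc IDs are assumed to be sorted, duplicate-free, and smaller than Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy doc ID iterators over sorted int arrays; names are illustrative only. */
class ToyDocIdIterators {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cursor over one sorted, duplicate-free doc ID list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        int doc() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }

        /** Moves to the first doc ID >= target and returns it. */
        int advance(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return doc();
        }
    }

    /** Disjunction: a min-heap keyed on each cursor's current doc ID. */
    static List<Integer> or(int[]... lists) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] l : lists) {
            if (l.length > 0) heap.add(new Cursor(l));
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            out.add(doc);
            // Remove every cursor sitting on this doc, advance it, and re-insert it.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor c = heap.poll();
                if (c.advance(doc + 1) != NO_MORE_DOCS) heap.add(c);
            }
        }
        return out;
    }

    /** Conjunction: leapfrog all cursors until they agree on a doc ID. */
    static List<Integer> and(int[]... lists) {
        List<Integer> out = new ArrayList<>();
        if (lists.length == 0) return out;
        Cursor[] cs = new Cursor[lists.length];
        for (int i = 0; i < lists.length; i++) cs[i] = new Cursor(lists[i]);
        int candidate = cs[0].doc();
        while (candidate != NO_MORE_DOCS) {
            boolean all = true;
            for (Cursor c : cs) {
                int d = c.advance(candidate);
                if (d != candidate) { // this cursor skipped past the candidate
                    candidate = d;
                    all = false;
                    break;
                }
            }
            if (all) {
                out.add(candidate);
                candidate = cs[0].advance(candidate + 1);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7};
        int[] b = {3, 4, 7};
        System.out.println("or  = " + or(a, b));
        System.out.println("and = " + and(a, b));
    }
}
```

The disjunction pays a heap operation per matching posting, while the conjunction only skips forward. That difference may be one reason the disjunction is where a leaner heap could matter most, though the timings above are the only evidence given here.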
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-
Attachment: (was: LUCENE-1629-java1.4.patch)

contrib intelligent Analyzer for Chinese
Key: LUCENE-1629
URL: https://issues.apache.org/jira/browse/LUCENE-1629
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Affects Versions: 2.4.1
Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
Assignee: Michael McCandless
Fix For: 2.9
Attachments: analysis-data.zip

I wrote an Analyzer for Apache Lucene for analyzing sentences in Chinese. It is called imdict-chinese-analyzer; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am) 中国人 (Chinese), not 我 是中 国人. So the analyzer must segment each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be seriously affected!

Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This does not match how Chinese words are actually formed, and the strategy also increases the index size and hurts performance badly.

The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' accuracy is about 60%.

As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
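To make the tokenization contrast in the description concrete, here is a small sketch of the two baseline strategies (one token per character, and overlapping pairs of adjoining characters) next to a word-level segmentation. The word-level part uses greedy forward maximum matching against a tiny hypothetical word list; this is a much simpler stand-in chosen for brevity and is not the HMM-based algorithm of imdict-chinese-analyzer. ToySegmenter and its dictionary are invented for the example, and the sample sentence is written with Unicode escapes so the source compiles under any file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ToySegmenter {
    /** One token per character. */
    static List<String> unigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) out.add(s.substring(i, i + 1));
        return out;
    }

    /** Overlapping pairs of adjoining characters. */
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) out.add(s.substring(i, i + 2));
        return out;
    }

    /** Greedy forward maximum matching against a word list; a stand-in, NOT the HMM approach. */
    static List<String> maxMatch(String s, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = Math.min(s.length(), i + maxLen);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(s.substring(i, end))) end--;
            out.add(s.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // The sample sentence from the issue description ("I am Chinese").
        String sentence = "\u6211\u662f\u4e2d\u56fd\u4eba";
        // Hypothetical two-entry dictionary: "China" and "Chinese (person)".
        Set<String> dict = new HashSet<>(Arrays.asList("\u4e2d\u56fd", "\u4e2d\u56fd\u4eba"));
        System.out.println("unigrams: " + unigrams(sentence));
        System.out.println("bigrams:  " + bigrams(sentence));
        System.out.println("words:    " + maxMatch(sentence, dict, 3));
    }
}
```

For the five-character sample sentence, the per-character strategy yields five tokens and the pair strategy four, while the word-level pass yields the three words given in the description.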
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-
Attachment: (was: LUCENE-1629.patch)
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-
Attachment: LUCENE-1629-java1.4.patch

changes
1. Add two binary dictionary files into the java package: coredict.mem (1.6M) and bigramdict.mem (4.7M). I'll post them after this.
2. Use Class.getResourceAsStream() to load the dictionary, so users don't need to download dictionaries manually.
3. Turn TestSmartChineseAnalyzer into a real JUnit test case.

Attachments: analysis-data.zip, LUCENE-1629-java1.4.patch
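The second change above (loading the dictionary through Class.getResourceAsStream()) could look roughly like the following. This is a minimal sketch, not the code from the patch: DictionaryLoader and its load method are hypothetical names, and the resource name passed in main is only borrowed from the attachment list for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; the patch's actual loading code may differ. */
class DictionaryLoader {
    /** Reads a classpath resource, resolved relative to the given class, into memory. */
    static byte[] load(Class<?> owner, String resourceName) throws IOException {
        try (InputStream in = owner.getResourceAsStream(resourceName)) {
            if (in == null) { // getResourceAsStream returns null when nothing is found
                throw new FileNotFoundException("resource not found: " + resourceName);
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }

    public static void main(String[] args) {
        try {
            // Resolved relative to the package of DictionaryLoader on the classpath.
            byte[] data = load(DictionaryLoader.class, "coredict.mem");
            System.out.println("loaded " + data.length + " bytes");
        } catch (IOException e) {
            System.out.println("could not load dictionary: " + e.getMessage());
        }
    }
}
```

Because a relative resource name is resolved against the package of the owning class, the dictionary files have to sit in the matching package directory inside the jar or source tree for the lookup to succeed.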
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-
Attachment: bigramdict.mem
            coredict.mem

two binary dictionary files, please put them into contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/

Attachments: analysis-data.zip, bigramdict.mem, coredict.mem, LUCENE-1629-java1.4.patch