[jira] Issue Comment Edited: (LUCENE-1387) Add LocalLucene

2009-05-10 Thread Earwin Burrfoot (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707821#action_12707821 ]

Earwin Burrfoot edited comment on LUCENE-1387 at 5/10/09 11:16 AM:
---

LatLonDistanceFilter.java:

  public BitSet bits(IndexReader reader) throws IOException {

    /* Create a BitSet to store the result */
    int maxdocs = reader.numDocs();   // <- probably reader.maxDoc() ?
    BitSet bits = new BitSet(maxdocs);
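
To illustrate the distinction being raised: reader.numDocs() excludes deleted documents, while doc IDs still range up to reader.maxDoc() - 1, so numDocs() under-sizes the bit set whenever deletions exist. This self-contained sketch uses java.util.BitSet (which grows silently; a fixed-size bit set would not) with made-up maxDoc/numDocs values, purely for illustration:

```java
import java.util.BitSet;

public class BitSetSizing {
    public static void main(String[] args) {
        // Hypothetical index state: 10 docs were added (maxDoc == 10),
        // 3 were deleted, so numDocs == 7 but doc IDs still span 0..9.
        int maxDoc = 10;
        int numDocs = 7;

        // Sizing by numDocs under-allocates: doc ID 9 lies past the
        // initial capacity.  java.util.BitSet grows on demand, but a
        // fixed-size structure sized this way would fail.
        BitSet byNumDocs = new BitSet(numDocs);
        byNumDocs.set(maxDoc - 1);   // doc ID 9 is still a valid doc

        // Sizing by maxDoc covers every possible doc ID up front.
        BitSet byMaxDoc = new BitSet(maxDoc);
        byMaxDoc.set(maxDoc - 1);

        System.out.println(byNumDocs.get(9) && byMaxDoc.get(9));
    }
}
```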


  was (Author: earwin):
LatLonDistanceFilter.java:

  public BitSet bits(IndexReader reader) throws IOException {

    /* Create a BitSet to store the result */
    int maxdocs = reader.numDocs();   // <- probably reader.maxDocs ?
    BitSet bits = new BitSet(maxdocs);

  
 Add LocalLucene
 ---

 Key: LUCENE-1387
 URL: https://issues.apache.org/jira/browse/LUCENE-1387
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spatial
Reporter: Grant Ingersoll
Assignee: Ryan McKinley
Priority: Minor
 Fix For: 2.9

 Attachments: spatial-lucene.zip, spatial.tar.gz, spatial.zip


 Local Lucene (Geo-search) has been donated to the Lucene project, per 
 https://issues.apache.org/jira/browse/INCUBATOR-77.  This issue is to handle 
 the Lucene portion of integration.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1387) Add LocalLucene

2009-05-10 Thread Earwin Burrfoot (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707821#action_12707821 ]

Earwin Burrfoot commented on LUCENE-1387:
-

LatLonDistanceFilter.java:

  public BitSet bits(IndexReader reader) throws IOException {

    /* Create a BitSet to store the result */
    int maxdocs = reader.numDocs();   // <- probably reader.maxDocs ?
    BitSet bits = new BitSet(maxdocs);





[jira] Commented: (LUCENE-1632) boolean docid set iterator improvement

2009-05-10 Thread Paul Elschot (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707827#action_12707827 ]

Paul Elschot commented on LUCENE-1632:
--

The real performance improvement here is in the disjunctions, is that correct?
In that case the patch for performance could be simplified by only inlining the 
heap in ScorerDocQueue, in a similar way as in the patch here; ScorerDocQueue 
might even disappear completely.

However, by concentrating on disjunctions only, one would lose for example 
AndDocIdSetIterator from the patch, which might be useful as a superclass (or 
attribute) of all current Scorers that perform conjunctions. Actually this is 
(yet) another issue.

For the longer class names, DISI might be preferable to DocIdSetIterator as a 
name component.
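
For context, a disjunction over several sorted doc-ID streams is conventionally driven by a min-heap keyed on each stream's current doc, which is the role ScorerDocQueue plays. The following self-contained sketch (plain int arrays standing in for DocIdSetIterators; this is a generic illustration, not the patch's actual code) shows the core loop:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of an OR (disjunction) over sorted doc-ID lists using a heap. */
public class OrMerge {
    public static List<Integer> or(int[][] postings) {
        // Heap entries are {currentDoc, listIndex, position}, ordered by doc.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (int i = 0; i < postings.length; i++)
            if (postings[i].length > 0)
                heap.add(new int[]{postings[i][0], i, 0});

        List<Integer> out = new ArrayList<>();
        int last = -1;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            if (top[0] != last) {        // emit each matching doc once,
                out.add(top[0]);         // even if several lists contain it
                last = top[0];
            }
            int next = top[2] + 1;       // advance the list we popped from
            if (next < postings[top[1]].length)
                heap.add(new int[]{postings[top[1]][next], top[1], next});
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(or(new int[][]{{1, 5, 9}, {2, 5, 7}, {9}}));
    }
}
```

Inlining the heap means replacing the boxed PriorityQueue objects above with parallel primitive arrays maintained directly by the iterator, which is where the patch gets its win.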

 boolean docid set iterator improvement
 --

 Key: LUCENE-1632
 URL: https://issues.apache.org/jira/browse/LUCENE-1632
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: John Wang
 Attachments: Lucene-1632-patch.txt


 This was first brought up in LUCENE-1345, but the LUCENE-1345 conversation has 
 digressed. As suggested, creating a separate issue to track.
 Added perf comparisons of the boolean set iterators with the current scorers; 
 see patch.
 System: Ubuntu, 
 java version 1.6.0_11
 Intel Core 2 Duo 2.44 GHz
 new milliseconds=470
 new milliseconds=534
 new milliseconds=450
 new milliseconds=443
 new milliseconds=444
 new milliseconds=445
 new milliseconds=449
 new milliseconds=441
 new milliseconds=444
 new milliseconds=445
 new total milliseconds=4565
 old milliseconds=529
 old milliseconds=491
 old milliseconds=428
 old milliseconds=549
 old milliseconds=427
 old milliseconds=424
 old milliseconds=420
 old milliseconds=424
 old milliseconds=423
 old milliseconds=422
 old total milliseconds=4537
 New/Old Time 4565/4537 (100.61715%)
 OrDocIdSetIterator milliseconds=1138
 OrDocIdSetIterator milliseconds=1106
 OrDocIdSetIterator milliseconds=1065
 OrDocIdSetIterator milliseconds=1066
 OrDocIdSetIterator milliseconds=1065
 OrDocIdSetIterator milliseconds=1067
 OrDocIdSetIterator milliseconds=1072
 OrDocIdSetIterator milliseconds=1118
 OrDocIdSetIterator milliseconds=1065
 OrDocIdSetIterator milliseconds=1069
 OrDocIdSetIterator total milliseconds=10831
 DisjunctionMaxScorer milliseconds=1914
 DisjunctionMaxScorer milliseconds=1981
 DisjunctionMaxScorer milliseconds=1861
 DisjunctionMaxScorer milliseconds=1893
 DisjunctionMaxScorer milliseconds=1886
 DisjunctionMaxScorer milliseconds=1885
 DisjunctionMaxScorer milliseconds=1887
 DisjunctionMaxScorer milliseconds=1889
 DisjunctionMaxScorer milliseconds=1891
 DisjunctionMaxScorer milliseconds=1888
 DisjunctionMaxScorer total milliseconds=18975
 Or/DisjunctionMax Time 10831/18975 (57.080368%)
 OrDocIdSetIterator milliseconds=1079
 OrDocIdSetIterator milliseconds=1075
 OrDocIdSetIterator milliseconds=1076
 OrDocIdSetIterator milliseconds=1093
 OrDocIdSetIterator milliseconds=1077
 OrDocIdSetIterator milliseconds=1074
 OrDocIdSetIterator milliseconds=1078
 OrDocIdSetIterator milliseconds=1075
 OrDocIdSetIterator milliseconds=1074
 OrDocIdSetIterator milliseconds=1074
 OrDocIdSetIterator total milliseconds=10775
 DisjunctionSumScorer milliseconds=1398
 DisjunctionSumScorer milliseconds=1322
 DisjunctionSumScorer milliseconds=1320
 DisjunctionSumScorer milliseconds=1305
 DisjunctionSumScorer milliseconds=1304
 DisjunctionSumScorer milliseconds=1301
 DisjunctionSumScorer milliseconds=1304
 DisjunctionSumScorer milliseconds=1300
 DisjunctionSumScorer milliseconds=1301
 DisjunctionSumScorer milliseconds=1317
 DisjunctionSumScorer total milliseconds=13172
 Or/DisjunctionSum Time 10775/13172 (81.80231%)
 AndDocIdSetIterator milliseconds=330
 AndDocIdSetIterator milliseconds=336
 AndDocIdSetIterator milliseconds=298
 AndDocIdSetIterator milliseconds=299
 AndDocIdSetIterator milliseconds=310
 AndDocIdSetIterator milliseconds=298
 AndDocIdSetIterator milliseconds=298
 AndDocIdSetIterator milliseconds=334
 AndDocIdSetIterator milliseconds=298
 AndDocIdSetIterator milliseconds=299
 AndDocIdSetIterator total milliseconds=3100
 ConjunctionScorer milliseconds=332
 ConjunctionScorer milliseconds=307
 ConjunctionScorer milliseconds=302
 ConjunctionScorer milliseconds=350
 ConjunctionScorer milliseconds=300
 ConjunctionScorer milliseconds=304
 ConjunctionScorer milliseconds=305
 ConjunctionScorer milliseconds=303
 ConjunctionScorer milliseconds=303
 ConjunctionScorer milliseconds=299
 ConjunctionScorer total milliseconds=3105
 And/Conjunction Time 3100/3105 (99.83897%)
 main contributors to the patch: Anmol Bhasin and Yasuhiro Matsuda




[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-10 Thread Xiaoping Gao (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: (was: LUCENE-1629-java1.4.patch)

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: analysis-data.zip


 I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
 language. It's called imdict-chinese-analyzer; the project on Google Code 
 is here: http://code.google.com/p/imdict-chinese-analyzer/
 In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我(I) 是(am) 
 中国人(Chinese), not 我 是中 国人. So the analyzer must handle each sentence 
 properly, or there will be misunderstandings everywhere in the index 
 constructed by Lucene, and the accuracy of the search engine will be 
 seriously affected!
 Although there are two analyzer packages in the Apache repository which can 
 handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
 every two adjoining characters as a single word. This is obviously not true 
 in reality; this strategy also increases the index size and hurts 
 performance badly.
 The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
 (HMM), so it can tokenize a Chinese sentence in a really intelligent way. 
 Tokenization accuracy of this model is above 90% according to the paper 
 "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is 
 about 60%.
 As imdict-chinese-analyzer is really fast and intelligent, I want to 
 contribute it to the Apache Lucene repository.
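
For illustration only: the word-level (rather than per-character) segmentation described above can be sketched even with a toy greedy longest-match segmenter. This is not the HMM approach imdict-chinese-analyzer actually uses (an HMM resolves ambiguities this greedy sketch cannot), and the dictionary entries here are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy longest-match segmenter: emits whole words, not single characters. */
public class GreedySegmenter {
    // Hypothetical mini-dictionary, just enough for the example sentence.
    static final Set<String> DICT =
        new HashSet<>(Arrays.asList("我", "是", "中国", "中国人", "国人"));

    public static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int best = 1;  // fall back to a single character
            // Prefer the longest dictionary word starting at position i.
            for (int len = text.length() - i; len >= 1; len--) {
                if (DICT.contains(text.substring(i, i + len))) {
                    best = len;
                    break;
                }
            }
            tokens.add(text.substring(i, i + best));
            i += best;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人"));  // the example from the issue
    }
}
```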




[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-10 Thread Xiaoping Gao (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: (was: LUCENE-1629.patch)




[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-10 Thread Xiaoping Gao (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: LUCENE-1629-java1.4.patch

Changes:
1. Add two binary dictionary files into the Java package: coredict.mem (1.6M) and 
bigramdict.mem (4.7M); I'll post them after this.
2. Use Class.getResourceAsStream() to load the dictionaries, so users don't 
need to download them manually.
3. Switch TestSmartChineseAnalyzer into a real JUnit test case.
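
Item 2 presumably loads the bundled dictionaries roughly like the sketch below. The class and method names here are hypothetical (the actual patch code may differ), and since coredict.mem is not available outside the patch, the demo loads the class's own .class file as a stand-in classpath resource:

```java
import java.io.IOException;
import java.io.InputStream;

/** Sketch of loading a dictionary bundled on the classpath. */
public class DictLoader {
    public static byte[] load(String resource) throws IOException {
        // Class.getResourceAsStream resolves relative names against this
        // class's package, so a dictionary placed next to the analyzer
        // classes needs no user-visible file path at all.
        try (InputStream in = DictLoader.class.getResourceAsStream(resource)) {
            if (in == null)
                throw new IOException("resource not found: " + resource);
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // A real caller would pass "coredict.mem"; we load our own
        // .class file just to demonstrate the mechanism.
        byte[] data = load("DictLoader.class");
        System.out.println(data.length > 0);
    }
}
```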


 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: analysis-data.zip, LUCENE-1629-java1.4.patch






[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-10 Thread Xiaoping Gao (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: bigramdict.mem
coredict.mem

Two binary dictionary files; please put them into 
contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: analysis-data.zip, bigramdict.mem, coredict.mem, 
 LUCENE-1629-java1.4.patch


