[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Xiaoping Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: analysis-data.zip

Lexical dictionary files. Unzip the archive somewhere, then run TestSmartChineseAnalyzer 
with this command:
java -Danalysis.data.dir=/path/to/analysis-data/ org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer
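
For reference, a minimal sketch of exercising the analyzer directly against the Lucene 2.4
TokenStream API, showing the tokenization described in the issue below. The package and class
name are inferred from the test class mentioned above, and the no-argument constructor is an
assumption, not confirmed by the patch:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class SmartAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical entry point for the contributed analyzer.
            Analyzer analyzer = new org.apache.lucene.analysis.cn.SmartChineseAnalyzer();
            TokenStream ts = analyzer.tokenStream("content", new StringReader("我是中国人"));
            // Lucene 2.4-era iteration: next() returns null at the end of the stream.
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.term());   // expected output: 我, 是, 中国人
            }
        }
    }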


 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch


 I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
 language. It's called imdict-chinese-analyzer, and the project on Google Code 
 is here: http://code.google.com/p/imdict-chinese-analyzer/
 In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我(I) 是(am) 
 中国人(Chinese), not 我 是中 国人. So the analyzer must handle each sentence 
 properly, or there will be misunderstandings everywhere in the index 
 constructed by Lucene, and the accuracy of the search engine will suffer 
 seriously.
 Although there are two analyzer packages in the Apache repository that can 
 handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
 every two adjoining characters as a single word. This is obviously not true 
 in reality, and this strategy also increases the index size and hurts 
 performance badly.
 The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model 
 (HMM), so it can tokenize Chinese sentences in a genuinely intelligent way. 
 Tokenization accuracy of this model is above 90% according to the paper 
 "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers reach about 
 60%.
 As imdict-chinese-analyzer is really fast and intelligent, I want to 
 contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706782#action_12706782
 ] 

Michael McCandless commented on LUCENE-1629:


Patch looks good -- thanks Xiaoping!

One problem is that contrib/analyzers is currently limited to Java 1.4, and I 
don't think we should change that at this point (though in 3.0, we will change 
it to 1.5).  How hard would it be to switch your sources to use only Java 1.4?

A couple other issues:

  * Each copyright header is missing the starting 'S' in the sentence 'ee the 
License for the specific language governing permissions and'

  * Can you remove the @author tags?  (Lucene sources don't include author tags 
anymore)

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch






[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706887#action_12706887
 ] 

Uwe Schindler commented on LUCENE-1629:
---

Hi Xiaoping,

looks good, but I have some suggestions:
- Making the data file readable only through a RandomAccessFile makes it hard to, 
e.g., move the data file directly into the jar file. I would like to put the 
data file directly into the package directory and load it with 
Class.getResourceAsStream(). That way, the binary Lucene analyzer jar would 
be ready to use and the analyzer would run out of the box. Configuring 
external files in, e.g., web applications is often complicated; automatically 
loading the file from the JAR would be better.
- I have seen some singleton implementations where the static getInstance() 
method is not synchronized. Without synchronization there may be more than one 
instance if different threads call getInstance() at the same time or close together.
- Do we compile the source files with a fixed encoding of UTF-8 (in build.xml)? 
If not, there may be problems if the Java compiler uses another encoding 
(the platform default).
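
A minimal sketch of the first two suggestions combined (illustrative only; the class name
WordDictionary and the resource name coredict.mem are placeholders, not the actual names
in the patch):

    import java.io.IOException;
    import java.io.InputStream;

    public final class WordDictionary {
        private static WordDictionary instance;

        // Synchronized so that two threads calling getInstance() at the same
        // time cannot both see null and create two instances.
        public static synchronized WordDictionary getInstance() throws IOException {
            if (instance == null) {
                instance = new WordDictionary();
            }
            return instance;
        }

        private WordDictionary() throws IOException {
            // Load the data file from the same package inside the jar, so no
            // external configuration (system property, file path) is needed.
            InputStream in = WordDictionary.class.getResourceAsStream("coredict.mem");
            if (in == null) {
                throw new IOException("dictionary resource not found on classpath");
            }
            try {
                // ... parse the dictionary from the stream ...
            } finally {
                in.close();
            }
        }
    }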

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch






[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Xiaoping Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706928#action_12706928
 ] 

Xiaoping Gao commented on LUCENE-1629:
--

To McCandless:
There is a lot of code depending on Java 1.5; I use enums and generics 
frequently. I did this because I saw these points on the Apache wiki:
* All core code to be included in 2.X releases should be compatible with 
Java 1.4.
* All contrib code should be compatible with *either Java 5 or 1.4*.
I have corrected the copyright headers and removed the @author tags, thank you.

To Schindler:
1. This is really a good idea. I want to move the data file into the jar in the next 
development cycle, but right now I need to make some changes to the data files 
independently. Can I just commit the code now?
2. I have changed the getInstance() method to be synchronized.
3. All the source files are encoded in UTF-8, and I have put a notice 
in package.html. Should I do something else?

Thank you all!
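
Regarding point 3 above, the encoding can also be pinned at compile time so the platform
default never matters. A minimal Ant fragment, assuming the usual contrib layout (the
srcdir/destdir values are illustrative, not taken from the actual build.xml):

    <!-- Illustrative only: force the compiler to read sources as UTF-8,
         independent of the platform default encoding. -->
    <javac srcdir="src/java" destdir="build/classes"
           encoding="UTF-8" source="1.4" target="1.4" debug="on"/>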

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch






[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Xiaoping Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: (was: LUCENE-1629.patch)

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip






[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Xiaoping Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: LUCENE-1629.patch

New patch in reply to Michael McCandless's and Uwe Schindler's comments.

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch






[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706948#action_12706948
 ] 

Robert Muir commented on LUCENE-1629:
-

Hi,

I see in the paper that lexical resources were also developed for Big5 
(Traditional Chinese). Are you able to acquire these resources under a BSD license 
as well?

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch






[jira] Updated: (LUCENE-1594) Use source code specialization to maximize search performance

2009-05-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1594:
---

Attachment: LUCENE-1594.patch

New patch attached:

  * Specialize for the no norms cases

  * N-clause BooleanQuery of TermQuerys now handled

  * Handle setMinimumNumberShouldMatch

  * MUST_NOT clauses handled

  * Allow total hits to NOT be computed, and then when sorting by
field, do a fail fast on a doc while iterating the TermDocs if the
doc can't compete in the current PQ (discussed under LUCENE-1593)

  * Pre-replace nulls with U+ in StringIndex

  * Other random optimizations

Patch is small because I'm not including all generated sources (there
are too many).

This patch always pre-fills the queue w/ sentinel values.

These optimizations result in very sizable performance gains,
especially for OR queries that sort by field and do not require the total
hit count (with or without filtering, deletions, scoring, etc.).  In
these cases the specialized code runs 2.5-3.5X faster than Lucene
core.
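
As a rough illustration of the "pre-fill the queue with sentinel values" idea mentioned
above (a generic sketch, not the code in the patch; the collector logic here is invented
for illustration):

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Sketch only: pre-fill a top-N queue with sentinel entries so the hot
    // collection loop never needs an "is the queue full yet?" branch.
    public class SentinelTopNDemo {
        public static void main(String[] args) {
            final int n = 3;
            PriorityQueue<float[]> queue = new PriorityQueue<float[]>(n,
                new Comparator<float[]>() {
                    public int compare(float[] a, float[] b) {
                        return Float.compare(a[0], b[0]); // smallest score at the head
                    }
                });
            for (int i = 0; i < n; i++) {
                queue.add(new float[] { Float.NEGATIVE_INFINITY, -1 }); // {score, docID}
            }
            float[] scores = { 1.5f, 0.2f, 3.1f, 2.7f };
            for (int doc = 0; doc < scores.length; doc++) {
                // The head is always the entry to beat; no size check needed.
                if (scores[doc] > queue.peek()[0]) {
                    queue.poll();
                    queue.add(new float[] { scores[doc], doc });
                }
            }
            while (!queue.isEmpty()) {
                float[] e = queue.poll();
                System.out.println("doc=" + (int) e[1] + " score=" + e[0]);
            }
        }
    }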


 Use source code specialization to maximize search performance
 -

 Key: LUCENE-1594
 URL: https://issues.apache.org/jira/browse/LUCENE-1594
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: FastSearchTask.java, LUCENE-1594.patch, 
 LUCENE-1594.patch, LUCENE-1594.patch


 Toward eking out the absolute best search performance, and after seeing the
 Java ghosts in LUCENE-1575, I decided to build a simple prototype
 source code specializer for Lucene's searches.
 The idea is to write dynamic Java code, specialized to run a very
 specific query context (eg TermQuery, collecting top N by field, no
 filter, no deletions), compile that Java code, and run it.
 Here're the performance gains when compared to trunk:
 ||Query||Sort||Filt||Deletes||Scoring||Hits||QPS (base)||QPS (new)||%||
 |1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}|
 |1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}|
 |1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}|
 |1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}|
 |1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}|
 |1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}|
 |1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}|
 |1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}|
 |1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}|
 |1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}|
 |1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}|
 |1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}|
 |1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}|
 |1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}|
 |1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}|
 |1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}|
 |1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}|
 |1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}|
 Those tests were run on a 19M doc wikipedia index (splitting each
 Wikipedia doc @ ~1024 chars), on Linux, Java 1.6.0_10
 But: it only works with TermQuery for now; it's just a start.
 It should be easy for others to run this test:
   * apply patch
   * cd contrib/benchmark
   * run python -u bench.py -delindex /path/to/index/with/deletes
 -nodelindex /path/to/index/without/deletes
 (You can leave off one of -delindex or -nodelindex and it'll skip
 those tests).
 For each test, bench.py generates a single Java source file that runs
 that one query; you can open
 contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java
 to see it.  I'll attach an example.  It writes results.txt, in Jira
 table format, which you should be able to copy/paste back here.
 The specializer uses pretty much every search speedup I can think of
 -- the ones from LUCENE-1575 (to score or not, to maxScore or not),
 the ones suggested in the spinoff LUCENE-1593 (pre-fill w/ sentinels,
 don't use docID for tie breaking), LUCENE-1536 (random access
 filters).  It bypasses TermDocs and interacts directly with the
 IndexInput, and with BitVector for deletions.  It directly folds in
 the collector, if possible.  A filter if used must be random access,
 and is assumed to pre-multiply-in the deleted docs.
 Current status:
   * I only handle TermQuery.  I'd like to add others over time...
   * It can 

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707042#action_12707042
 ] 

Michael McCandless commented on LUCENE-1629:


bq. There is a lot of code depending on Java 1.5; I use enums and generics 
frequently. I did this because I saw these points on the Apache wiki:

Well... in general contrib packages can be 1.5, but the analyzers contrib 
package is widely used and is not 1.5 now, so it's a biggish change to force 
it to 1.5 with this.  We should at least discuss it separately on java-dev if we 
want to consider allowing 1.5 code into contrib-analyzers.

We could hold off on committing this until 3.0?

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629.patch






[jira] Updated: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-05-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1618:
-

Attachment: LUCENE-1618.patch

Added a fileExists check in getDirectory before deciding based on
the extension. This is useful when IndexFileDeleter
uses FSD as a way to combine directories in LUCENE-1313.

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1618.patch, LUCENE-1618.patch, LUCENE-1618.patch, 
 LUCENE-1618.patch, LUCENE-1618.patch, MemoryCachedDirectory.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1594) Use source code specialization to maximize search performance

2009-05-07 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707116#action_12707116
 ] 

Eks Dev commented on LUCENE-1594:
-

Huh, it reduces hardware costs 2-3 times for larger setups! Great.

 Use source code specialization to maximize search performance
 -

 Key: LUCENE-1594
 URL: https://issues.apache.org/jira/browse/LUCENE-1594
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: FastSearchTask.java, LUCENE-1594.patch, 
 LUCENE-1594.patch, LUCENE-1594.patch


 Toward eking out the absolute best search performance, and after seeing the
 Java ghosts in LUCENE-1575, I decided to build a simple prototype
 source code specializer for Lucene's searches.
 The idea is to write dynamic Java code, specialized to run a very
 specific query context (eg TermQuery, collecting top N by field, no
 filter, no deletions), compile that Java code, and run it.
 Here're the performance gains when compared to trunk:
 ||Query||Sort||Filt||Deletes||Scoring||Hits||QPS (base)||QPS (new)||%||
 |1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}|
 |1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}|
 |1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}|
 |1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}|
 |1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}|
 |1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}|
 |1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}|
 |1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}|
 |1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}|
 |1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}|
 |1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}|
 |1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}|
 |1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}|
 |1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}|
 |1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}|
 |1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}|
 |1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}|
 |1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}|
 Those tests were run on a 19M doc wikipedia index (splitting each
 Wikipedia doc @ ~1024 chars), on Linux, Java 1.6.0_10
 But: it only works with TermQuery for now; it's just a start.
 It should be easy for others to run this test:
   * apply patch
   * cd contrib/benchmark
   * run python -u bench.py -delindex /path/to/index/with/deletes
 -nodelindex /path/to/index/without/deletes
 (You can leave off one of -delindex or -nodelindex and it'll skip
 those tests).
 For each test, bench.py generates a single Java source file that runs
 that one query; you can open
 contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java
 to see it.  I'll attach an example.  It writes results.txt, in Jira
 table format, which you should be able to copy/paste back here.
 The specializer uses pretty much every search speedup I can think of
 -- the ones from LUCENE-1575 (to score or not, to maxScore or not),
 the ones suggested in the spinoff LUCENE-1593 (pre-fill w/ sentinels,
 don't use docID for tie breaking), LUCENE-1536 (random access
 filters).  It bypasses TermDocs and interacts directly with the
 IndexInput, and with BitVector for deletions.  It directly folds in
 the collector, if possible.  A filter if used must be random access,
 and is assumed to pre-multiply-in the deleted docs.
 Current status:
   * I only handle TermQuery.  I'd like to add others over time...
   * It can collect by score, or single field (with the 3 scoring
 options in LUCENE-1575).  It can't do reverse field sort nor
 multi-field sort now.
   * The auto-gen code (gen.py) is rather hideous.  It could use some
 serious refactoring, etc.; I think we could get it to the point
 where each Query can gen its own specialized code, maybe.  It also
 needs to be eventually ported to Java.
   * The script runs old, then new, then checks that the topN results
 are identical, and aborts if not.  So I'm pretty sure the
 specialized code is working correctly, for the cases I'm testing.
   * The patch includes a few small changes to core, mostly to open up
 package protected APIs so I can access stuff
 I think this is an interesting effort for several reasons:
   * It gives us a best-case upper bound 

Re: [jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread DM Smith
I'd prefer it to stay 1.4 for now and would be willing to make the  
change, if needed.


-- DM

On May 7, 2009, at 3:04 PM, Michael McCandless (JIRA) wrote:



   [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707042 
#action_12707042 ]


Michael McCandless commented on LUCENE-1629:


bq. There is a lot of code depending on Java 1.5; I use enums and generics 
frequently. I did this because I saw these points on the Apache wiki:


Well... in general contrib packages can be 1.5, but the analyzers 
contrib package is widely used and is not 1.5 now, so it's a 
biggish change to force it to 1.5 with this.  We should at least 
discuss it separately on java-dev if we want to consider allowing 1.5 
code into contrib-analyzers.


We could hold off on committing this until 3.0?


contrib intelligent Analyzer for Chinese


   Key: LUCENE-1629
   URL: https://issues.apache.org/jira/browse/LUCENE-1629
   Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
  Affects Versions: 2.4.1
   Environment: for java 1.5 or higher, lucene 2.4.1
  Reporter: Xiaoping Gao
   Attachments: analysis-data.zip, LUCENE-1629.patch







[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Xiaoping Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707235#action_12707235
 ] 

Xiaoping Gao commented on LUCENE-1629:
--

I have ported the code to Java 1.4 today; fortunately there were not many 
problems.

LUCENE-1629-java1.4.patch contains all the code working on Java 1.4. I have just 
changed it to fit the Java 1.4 code style; data structures and algorithms are not 
modified.
It has been tested to produce the very same results, with only a slight 
effect on speed.
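
As an illustration of the kind of mechanical change such a port involves (the class and
constant names below are invented for the example, not taken from the patch):

    import java.util.ArrayList;
    import java.util.List;

    // Java 1.5 version: enum CharType { DELIMITER, LETTER, DIGIT, OTHER }
    // Java 1.4 version: plain int constants, raw collections, explicit casts.
    public class CharTypeConstants {
        public static final int DELIMITER = 0;
        public static final int LETTER = 1;
        public static final int DIGIT = 2;
        public static final int OTHER = 3;

        public static void main(String[] args) {
            List types = new ArrayList();           // raw type, no generics
            types.add(new Integer(LETTER));         // no autoboxing in 1.4
            int first = ((Integer) types.get(0)).intValue();
            System.out.println(first == LETTER);    // prints: true
        }
    }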

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629-java1.4.patch, 
 LUCENE-1629.patch






[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Xiaoping Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: LUCENE-1629-java1.4.patch

All the code working on Java 1.4.

 contrib intelligent Analyzer for Chinese
 

 Key: LUCENE-1629
 URL: https://issues.apache.org/jira/browse/LUCENE-1629
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
 Attachments: analysis-data.zip, LUCENE-1629-java1.4.patch, 
 LUCENE-1629.patch






[jira] Commented: (LUCENE-1313) Realtime Search

2009-05-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707243#action_12707243
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

Something in the DocumentsWriter API we may need to change is to
allow passing a directory through the IndexingChain. In the RAM
NRT case, which directory we write to can change depending on whether
a RAM buffer has exceeded its maximum available size. If it is
under half the available RAM it will go to the RAM dir; if not,
the new segment will be written to disk. For this reason we
can't simply pass a directory into the constructor of
DocumentsWriter, nor can we rely on calling
IW.getFlushDirectory. We should be able to rely on the directory
in SegmentWriteState?

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.
