[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
---------------------------------

    Attachment: analysis-data.zip

Lexical dictionary files. Unzip the archive somewhere, then run TestSmartChineseAnalyzer with this command (note the -D system property must come before the class name; a usage sketch follows the issue description below):

java -Danalysis.data.dir=/path/to/analysis-data/ org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer

contrib intelligent Analyzer for Chinese
-----------------------------------------

                 Key: LUCENE-1629
                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
    Affects Versions: 2.4.1
         Environment: Java 1.5 or higher, Lucene 2.4.1
            Reporter: Xiaoping Gao
         Attachments: analysis-data.zip, LUCENE-1629.patch

I wrote an Analyzer for Apache Lucene that segments sentences in the Chinese language. It is called imdict-chinese-analyzer; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我(I) 是(am) 中国人(Chinese), not 我 是中 国人. The analyzer must segment each sentence properly, or misleading tokens will appear throughout the index constructed by Lucene, and the accuracy of the search engine will suffer seriously.

Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they treat each single character or each pair of adjoining characters as a word. This is obviously wrong in practice, and the strategy also inflates the index size and hurts performance badly.

The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a genuinely intelligent way. Tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while the other analyzers reach about 60%.

As imdict-chinese-analyzer is really fast and accurate, I want to contribute it to the Apache Lucene repository.
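As a rough usage sketch against the Lucene 2.4 TokenStream API: the analyzer class name SmartChineseAnalyzer and its no-arg constructor are inferred from the test class named above, not confirmed by the patch.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical demo: tokenize the example sentence from the issue description.
public class SmartAnalyzerDemo {
  public static void main(String[] args) throws Exception {
    // Assumption: the patch's analyzer is org.apache.lucene.analysis.cn.SmartChineseAnalyzer
    Analyzer analyzer = new org.apache.lucene.analysis.cn.SmartChineseAnalyzer();
    TokenStream ts = analyzer.tokenStream("content", new StringReader("我是中国人"));
    Token token;
    while ((token = ts.next()) != null) {
      // Expected segmentation per the description: 我 / 是 / 中国人
      System.out.println(token.termText());
    }
  }
}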
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706782#action_12706782 ]

Michael McCandless commented on LUCENE-1629:
--------------------------------------------

Patch looks good -- thanks Xiaoping!

One problem is that contrib/analyzers is currently limited to Java 1.4, and I don't think we should change that at this point (though in 3.0 we will change it to 1.5). How hard would it be to switch your sources to use only Java 1.4?

A couple of other issues:
* Each copyright header is missing the starting 'S' in the sentence 'ee the License for the specific language governing permissions and'
* Can you remove the @author tags? (Lucene sources don't include author tags anymore)
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706887#action_12706887 ]

Uwe Schindler commented on LUCENE-1629:
---------------------------------------

Hi Xiaoping, looks good, but I have some suggestions (a sketch of the first two is below):
- Making the data file readable only through a RandomAccessFile makes it hard to, e.g., move the data file directly into the jar file. I would like to put the data file directly into the package directory and load it with Class.getResourceAsStream(). That way the binary Lucene analyzer jar would be ready to use and the analyzer would run out of the box. Configuring external files in, e.g., web applications is often complicated, so an automatic way to load the file from the JAR would be fine.
- I have seen some singleton implementations where the static getInstance() method is not synchronized. Without synchronization there may be more than one instance if different threads call getInstance() at the same time or close together.
- Do we compile the source files with a fixed encoding of UTF-8 (build.xml)? If not, there may be problems if the Java compiler uses another encoding (the platform default).
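A minimal sketch of what the first two suggestions could look like together; the class name WordDictionary and the resource name coredict.mem are assumptions for illustration, not names from the patch.

import java.io.IOException;
import java.io.InputStream;

// Hypothetical synchronized singleton that loads its dictionary from the
// classpath instead of a RandomAccessFile.
public final class WordDictionary {
  private static WordDictionary instance;

  private WordDictionary(InputStream data) throws IOException {
    // ... parse the dictionary stream here ...
  }

  // synchronized prevents two threads from racing and creating two instances
  public static synchronized WordDictionary getInstance() throws IOException {
    if (instance == null) {
      // Resolved relative to this class's package, so the data file can ship
      // inside the analyzer jar and work with no external configuration.
      InputStream in = WordDictionary.class.getResourceAsStream("coredict.mem");
      if (in == null) throw new IOException("dictionary resource not found");
      instance = new WordDictionary(in);
    }
    return instance;
  }
}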
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706928#action_12706928 ]

Xiaoping Gao commented on LUCENE-1629:
--------------------------------------

To McCandless:
There is a lot of code depending on Java 1.5; I use enums and generics frequently, because I saw these points on the Apache wiki:
* All core code to be included in 2.X releases should be compatible with Java 1.4.
* All contrib code should be compatible with *either Java 5 or 1.4*.
I have corrected the copyright headers and removed the @author tags, thank you.

To Schindler:
1. This is really a good idea. I want to move the data file into the jar in the next development cycle, but right now I need to make some changes to the data files independently. Can I just commit the code now?
2. I have changed the getInstance() method to be synchronized.
3. All the source files are encoded in UTF-8, and I have put a notice in package.html. Should I do something else?

Thank you all!
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
---------------------------------

    Attachment: (was: LUCENE-1629.patch)
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
---------------------------------

    Attachment: LUCENE-1629.patch

New patch in reply to Michael McCandless's and Uwe Schindler's comments.
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706948#action_12706948 ]

Robert Muir commented on LUCENE-1629:
-------------------------------------

Hi, I see in the paper that lexical resources were also developed for Big5 (traditional Chinese). Are you able to acquire these resources under the BSD license as well?
[jira] Updated: (LUCENE-1594) Use source code specialization to maximize search performance
[ https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1594:
---------------------------------------

    Attachment: LUCENE-1594.patch

New patch attached:
* Specialize for the no-norms cases
* N-clause BooleanQuery of TermQuerys now handled
* Handle setMinimumNumberShouldMatch
* MUST_NOT clauses handled
* Allow total hits to NOT be computed, and then when sorting by field, fail fast on a doc while iterating the TermDocs if the doc can't compete in the current PQ (discussed under LUCENE-1593)
* Pre-replace nulls with U+0000 in StringIndex
* Other random optimizations

Patch is small because I'm not including all generated sources (there are too many). This patch always pre-fills the queue with sentinel values.

These optimizations result in very sizable performance gains, especially with OR queries that sort by field and do not require a total hit count (with or without filtering, deletions, scoring, etc.). In these cases the specialized code runs 2.5-3.5X faster than Lucene core.

Use source code specialization to maximize search performance
--------------------------------------------------------------

                 Key: LUCENE-1594
                 URL: https://issues.apache.org/jira/browse/LUCENE-1594
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Search
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
         Attachments: FastSearchTask.java, LUCENE-1594.patch, LUCENE-1594.patch, LUCENE-1594.patch

Towards eking out the absolute best search performance, and after seeing the Java ghosts in LUCENE-1575, I decided to build a simple prototype source code specializer for Lucene's searches. The idea is to write dynamic Java code, specialized to run a very specific query context (e.g. TermQuery, collecting top N by field, no filter, no deletions), compile that Java code, and run it (see the sketch after this issue's description).

Here are the performance gains when compared to trunk:

||Query||Sort||Filt||Deletes||Scoring||Hits||QPS (base)||QPS (new)||%||
|1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}|
|1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}|
|1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}|
|1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}|
|1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}|
|1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}|
|1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}|
|1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}|
|1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}|
|1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}|
|1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}|
|1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}|
|1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}|
|1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}|
|1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}|
|1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}|
|1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}|
|1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}|

Those tests were run on a 19M doc Wikipedia index (splitting each Wikipedia doc @ ~1024 chars), on Linux, Java 1.6.0_10.

But: it only works with TermQuery for now; it's just a start.

It should be easy for others to run this test:
* apply patch
* cd contrib/benchmark
* run python -u bench.py -delindex /path/to/index/with/deletes -nodelindex /path/to/index/without/deletes

(You can leave off one of -delindex or -nodelindex and it'll skip those tests.)

For each test, bench.py generates a single Java source file that runs that one query; you can open contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java to see it. I'll attach an example. It writes results.txt, in Jira table format, which you should be able to copy/paste back here.

The specializer uses pretty much every search speedup I can think of -- the ones from LUCENE-1575 (to score or not, to maxScore or not), the ones suggested in the spinoff LUCENE-1593 (pre-fill w/ sentinels, don't use docID for tie breaking), LUCENE-1536 (random access filters). It bypasses TermDocs and interacts directly with the IndexInput, and with BitVector for deletions. It directly folds in the collector, if possible. A filter, if used, must be random access, and is assumed to pre-multiply-in the deleted docs.

Current status:
* I only handle TermQuery. I'd like to add others over time...
* It can collect by score, or single field (with the 3 scoring options in LUCENE-1575). It can't do reverse field sort nor multi-field sort now.
* The auto-gen code (gen.py) is rather hideous. It could use some serious refactoring, etc.; I think we could get it to the point where each Query can gen its own specialized code, maybe. It also needs to be eventually ported to Java.
* The script runs old, then new, then checks that the topN results are identical, and aborts if not. So I'm pretty sure the specialized code is working correctly, for the cases I'm testing.
* The patch includes a few small changes to core, mostly to open up package protected APIs so I can access stuff.

I think this is an interesting effort for several reasons:
* It gives us a best-case upper bound
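The generated sources aren't shown here. As a rough illustration of the compile-then-run idea only (not the patch's actual code), a specializer could use the javax.tools API available since Java 6; the class and interface choices below are assumptions.

import java.io.File;
import java.io.FileWriter;
import java.net.URL;
import java.net.URLClassLoader;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical sketch: write a specialized search class to disk, compile it
// with the JDK's in-process compiler, load it, and hand it back for execution.
public class SpecializerSketch {
  public static Runnable compileAndLoad(String className, String javaSource,
                                        File workDir) throws Exception {
    File src = new File(workDir, className + ".java");
    FileWriter w = new FileWriter(src);
    w.write(javaSource);
    w.close();

    // ToolProvider returns the compiler bundled with the JDK (Java 6+).
    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    int result = compiler.run(null, null, null, src.getPath());
    if (result != 0) throw new IllegalStateException("compile failed");

    // Load the freshly compiled class from the work directory.
    URLClassLoader loader = new URLClassLoader(new URL[] { workDir.toURI().toURL() });
    Class clazz = loader.loadClass(className);
    return (Runnable) clazz.newInstance();  // assumes the generated class implements Runnable
  }
}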
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707042#action_12707042 ]

Michael McCandless commented on LUCENE-1629:
--------------------------------------------

bq. There is a lot of code depending on Java 1.5; I use enums and generics frequently, because I saw these points on the Apache wiki:

Well... in general contrib packages can be 1.5, but the analyzers contrib package is widely used and is not 1.5 now, so forcing it to 1.5 with this is a biggish change. We should at least discuss separately on java-dev whether we want to allow 1.5 code into contrib-analyzers. We could hold off on committing this until 3.0?
[jira] Updated: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1618:
-------------------------------------

    Attachment: LUCENE-1618.patch

Added fileExists checking in getDirectory before routing by the extension. This is useful when IndexFileDeleter uses FSD as a way to combine directories in LUCENE-1313. A sketch of the idea is below.

Allow setting the IndexWriter docstore to be a different directory
-------------------------------------------------------------------

                 Key: LUCENE-1618
                 URL: https://issues.apache.org/jira/browse/LUCENE-1618
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.4.1
            Reporter: Jason Rutherglen
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.9
         Attachments: LUCENE-1618.patch, LUCENE-1618.patch, LUCENE-1618.patch, LUCENE-1618.patch, LUCENE-1618.patch, MemoryCachedDirectory.java

   Original Estimate: 336h
  Remaining Estimate: 336h

Add an IndexWriter.setDocStoreDirectory method that allows doc stores to be placed in a different directory than the IW default dir.
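A minimal sketch of the getDirectory change described in the comment; the class, fields, and the extension rule are assumptions for illustration, not the patch's code. Directory.fileExists(String) is the Lucene 2.x API.

import java.io.IOException;
import org.apache.lucene.store.Directory;

// Hypothetical sketch: before routing a file by its extension, check whether
// it already exists in one of the two directories, so files created earlier
// are found where they actually live.
public class FileSwitchSketch {
  private final Directory primary;    // e.g. an FSDirectory
  private final Directory secondary;  // e.g. a separate docstore directory

  public FileSwitchSketch(Directory primary, Directory secondary) {
    this.primary = primary;
    this.secondary = secondary;
  }

  Directory getDirectory(String name) throws IOException {
    // fileExists check first: an existing file wins over the extension rule.
    if (primary.fileExists(name)) return primary;
    if (secondary.fileExists(name)) return secondary;
    // Fall back to routing by extension (.fdt/.fdx are stored-fields files;
    // this particular rule is an assumption for the sketch).
    return name.endsWith(".fdt") || name.endsWith(".fdx") ? secondary : primary;
  }
}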
[jira] Commented: (LUCENE-1594) Use source code specialization to maximize search performance
[ https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707116#action_12707116 ]

Eks Dev commented on LUCENE-1594:
---------------------------------

Huh, it reduces hardware costs 2-3 times for larger setups! Great.
Re: [jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
I'd prefer it to stay 1.4 for now and would be willing to make the change, if needed.

-- DM

On May 7, 2009, at 3:04 PM, Michael McCandless (JIRA) wrote:

> Well... in general contrib packages can be 1.5, but the analyzers contrib package is widely used and is not 1.5 now, so forcing it to 1.5 with this is a biggish change. We should at least discuss separately on java-dev whether we want to consider allowing 1.5 code into contrib-analyzers. We could hold off on committing this until 3.0?
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707235#action_12707235 ]

Xiaoping Gao commented on LUCENE-1629:
--------------------------------------

I have ported the code to Java 1.4 today; fortunately there were not many problems. LUCENE-1629-java1.4.patch contains all the code working on Java 1.4. I have only changed it to fit the Java 1.4 code style; data structures and algorithms are not modified (a sketch of the kind of change involved is below). It has been tested to produce the very same results, with only a slight effect on speed.
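To illustrate the kind of mechanical change such a port involves, a Java 5 enum can be rewritten as a class of int constants for 1.4. This example is invented for illustration, not taken from the patch.

// Hypothetical illustration of a Java 5 -> 1.4 port.
//
// Java 5 original:
//   public enum CharType { DELIMITER, LETTER, DIGIT, OTHER }
//
// Java 1.4 replacement: a non-instantiable holder of int constants.
public final class CharType {
  public static final int DELIMITER = 0;
  public static final int LETTER = 1;
  public static final int DIGIT = 2;
  public static final int OTHER = 3;

  private CharType() {}  // no instances; this class only names constants
}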
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoping Gao updated LUCENE-1629:
---------------------------------

    Attachment: LUCENE-1629-java1.4.patch

All the code working on Java 1.4.
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707243#action_12707243 ]

Jason Rutherglen commented on LUCENE-1313:
------------------------------------------

Something we may need to change in the DocumentsWriter API is to allow passing a directory through the IndexingChain. In the RAM NRT case, the directory we write to can change depending on whether the RAM buffer has exceeded its maximum available size: if it is under half the available RAM, the new segment goes to the RAM dir; if not, it is written to disk. For this reason we can't simply pass a directory into the constructor of DocumentsWriter, nor can we rely on calling IW.getFlushDirectory. Can we rely on the directory in SegmentWriteState? A sketch of the selection rule is below.

Realtime Search
---------------

                 Key: LUCENE-1313
                 URL: https://issues.apache.org/jira/browse/LUCENE-1313
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
    Affects Versions: 2.4.1
            Reporter: Jason Rutherglen
            Priority: Minor
             Fix For: 2.9
         Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch

Realtime search with transactional semantics.

Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.

I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
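A minimal sketch of the flush-directory selection rule described in the comment; the class, field, and method names are invented for illustration, not the patch's actual API.

import org.apache.lucene.store.Directory;

// Hypothetical policy object encoding the rule from the comment: under half
// the available RAM -> RAM dir, otherwise the new segment goes to disk.
public class FlushDirectoryPolicy {
  private final Directory ramDir;   // in-memory directory for small segments
  private final Directory fsDir;    // on-disk directory for large segments
  private final long maxRamBytes;   // total RAM available to the buffer

  public FlushDirectoryPolicy(Directory ramDir, Directory fsDir, long maxRamBytes) {
    this.ramDir = ramDir;
    this.fsDir = fsDir;
    this.maxRamBytes = maxRamBytes;
  }

  // ramBufferBytes is the current size of the RAM buffer being flushed.
  Directory chooseFlushDirectory(long ramBufferBytes) {
    return ramBufferBytes < maxRamBytes / 2 ? ramDir : fsDir;
  }
}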