[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857107#action_12857107 ]

Otis Gospodnetic commented on LUCENE-2393:

I think creating a small index with a couple of docs would be the way to go.

Utility to output total term frequency and df from a lucene index

Key: LUCENE-2393
URL: https://issues.apache.org/jira/browse/LUCENE-2393
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial
Attachments: LUCENE-2393.patch

This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and the consequent disk I/O demands.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
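For readers unfamiliar with the two statistics the utility reports, here is a minimal self-contained sketch of what it computes. The `Map` stands in for a Lucene postings list (in the real utility this data would come from the index reader, not an in-memory map); the class and method names are illustrative, not the patch's API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one term's postings: docId -> term frequency in that doc.
// df = number of documents containing the term; totalTf = sum of per-doc tf.
public class TermStats {
    public static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    public static int totalTermFreq(Map<Integer, Integer> postings) {
        int total = 0;
        for (int tf : postings.values()) total += tf;
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new HashMap<>();
        postings.put(0, 3);  // term occurs 3 times in doc 0
        postings.put(4, 1);  // once in doc 4
        postings.put(7, 5);  // five times in doc 7
        // df=3 totalTf=9
        System.out.println("df=" + docFreq(postings)
                + " totalTf=" + totalTermFreq(postings));
    }
}
```

The total term frequency is what drives the size of the term's *prx (positions) entry, since every occurrence stores at least one position.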
[jira] Commented: (LUCENE-2127) Improved large result handling
[ https://issues.apache.org/jira/browse/LUCENE-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797776#action_12797776 ]

Otis Gospodnetic commented on LUCENE-2127:

+1 for Aaron's patch in a separate issue, too.

Improved large result handling

Key: LUCENE-2127
URL: https://issues.apache.org/jira/browse/LUCENE-2127
Project: Lucene - Java
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: LUCENE-2127.patch, LUCENE-2127.patch

Per http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5, it would be nice to offer some other Collectors that are better at handling really large numbers of results. This could be implemented in a variety of ways via Collectors. For instance, we could have a raw collector that does no sorting and just returns the ScoreDocs, or we could do as Mike suggests and have Collectors with heuristics about memory tradeoffs that only heapify when appropriate.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
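The tradeoff the description mentions can be sketched without any Lucene code. Below is an illustrative comparison (class and method names are hypothetical, not the patch's API): a size-bounded min-heap keeps memory at O(k) when only the top-k hits matter, while "raw" collection gathers every hit unsorted at O(n) memory but with less work per hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Two collecting strategies for large result sets, sketched over plain scores.
public class CollectStrategies {
    // Top-k via a bounded min-heap: evict the smallest score once full.
    public static List<Float> topK(float[] scores, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap
        for (float s : scores) {
            if (heap.size() < k) heap.add(s);
            else if (s > heap.peek()) { heap.poll(); heap.add(s); }
        }
        List<Float> out = new ArrayList<>(heap);
        Collections.sort(out, Collections.reverseOrder());
        return out;
    }

    // "Raw" collection: no sorting, no heap -- just gather everything.
    public static List<Float> collectAll(float[] scores) {
        List<Float> out = new ArrayList<>();
        for (float s : scores) out.add(s);
        return out;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f, 0.1f};
        System.out.println(topK(scores, 2));           // [0.9, 0.7]
        System.out.println(collectAll(scores).size()); // 5
    }
}
```

A heuristic collector in the spirit of the comment would pick between these: heapify only when k is much smaller than the expected hit count.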
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790889#action_12790889 ]

Otis Gospodnetic commented on LUCENE-1910:

* I'll second Mark's suggestion to extract the Information Gain piece of the patch into separate class(es), so we can reuse it in other places. It looks like it's currently an integral part of the MoreLikeThisUsingTags class. Would that be possible?
* I noticed the code needs the ASL (the Apache Software License) added.
* Also, could you please use the Lucene code format? (Eclipse/IntelliJ templates are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute )

Extension to MoreLikeThis to use tag information

Key: LUCENE-1910
URL: https://issues.apache.org/jira/browse/LUCENE-1910
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
Attachments: LUCENE-1910.patch

I would like to contribute a class based on the MoreLikeThis class in contrib/queries that generates a query based on the tags associated with a document. The class assumes that documents are tagged with a set of tags (which are stored in the index in a separate Field). The class determines the top document terms associated with a given tag using the information gain metric. While generating a MoreLikeThis query for a document, the tags associated with the document are used to determine the terms in the query. This class is useful for finding documents similar to a document that does not have many relevant terms but was tagged.
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785473#action_12785473 ]

Otis Gospodnetic commented on LUCENE-2091:

+1 for skipping BM25 and going straight to BM25F. I think the answer to Uwe's question about why this can't just be a different Similarity or some such is that BM25 requires some data that Lucene currently doesn't collect. That's why there were some of those static methods in the examples on the author's site. I *think* what I'm saying is correct. :)

Add BM25 Scoring to Lucene

Key: LUCENE-2091
URL: https://issues.apache.org/jira/browse/LUCENE-2091
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Yuval Feinstein
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-2091.patch, persianlucene.jpg
Original Estimate: 48h
Remaining Estimate: 48h

http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework, as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TF-IDF). I have refactored this a bit, added unit tests and improved the runtime somewhat. I would like to contribute the code to Lucene under contrib.
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785690#action_12785690 ]

Otis Gospodnetic commented on LUCENE-2091:

Joaquin - could you please explain what you mean by "saturate the effect of frequency with k1"? Thanks.
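For readers following this thread: "saturation" refers to the standard BM25 term-frequency factor, in which the parameter k1 bounds how much repeated occurrences of a term can add to the score. The sketch below (not code from the patch) evaluates the textbook formula tf·(k1+1) / (tf + k1·(1 − b + b·dl/avgdl)); as tf grows, the factor flattens toward k1 + 1 instead of growing without bound.

```java
// BM25 tf saturation demo: b is the length-normalization parameter,
// dlRatio is the document length divided by the average document length.
public class Bm25Demo {
    public static double tfFactor(double tf, double k1, double b, double dlRatio) {
        return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dlRatio));
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        for (double tf : new double[] {1, 2, 5, 20, 100}) {
            // Climbs quickly at first, then flattens toward k1 + 1 = 2.2.
            System.out.printf("tf=%5.0f -> %.3f%n", tf, tfFactor(tf, k1, b, 1.0));
        }
    }
}
```

With k1 = 1.2 and an average-length document, tf = 1 gives exactly 1.0 and tf = 100 gives roughly 2.17, still below the k1 + 1 = 2.2 asymptote.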
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ]

Otis Gospodnetic commented on LUCENE-2091:

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ]

Otis Gospodnetic edited comment on LUCENE-2091 at 11/30/09 4:21 AM:

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Aha, I found something: http://markmail.org/message/c2r4v7zj7mduzs5d

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.

was (Author: otis):
Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.
[jira] Resolved: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1491.

Resolution: Fixed

Thanks Todd & Co.

Sending CHANGES.txt
Sending analyzers/src/java/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.java
Sending analyzers/src/java/org/apache/lucene/analysis/ngram/NGramTokenFilter.java
Sending analyzers/src/test/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilterTest.java
Sending analyzers/src/test/org/apache/lucene/analysis/ngram/NGramTokenFilterTest.java
Transmitting file data .
Committed revision 794034.

EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

Key: LUCENE-1491
URL: https://issues.apache.org/jira/browse/LUCENE-1491
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.4, 2.4.1, 2.9, 3.0
Reporter: Todd Feak
Assignee: Otis Gospodnetic
Fix For: 2.9
Attachments: LUCENE-1491.patch

If a token is encountered in the stream that is shorter in length than the min gram size, the filter will stop processing the token stream. Working up a unit test now, but it may be a few days before I can provide it. Wanted to get it in the system.
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716428#action_12716428 ]

Otis Gospodnetic commented on LUCENE-1677:

In my ca. 10-year history of being around Lucene, I think I saw GCJ mentioned only about half a dozen times.

Remove GCJ IndexReader specializations

Key: LUCENE-1677
URL: https://issues.apache.org/jira/browse/LUCENE-1677
Project: Lucene - Java
Issue Type: Task
Reporter: Earwin Burrfoot
Fix For: 2.9

These specializations are outdated, unsupported, most probably pointless due to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you were going to ask people on java-user - did anybody reply that they need it?). While giving nothing, they make the SegmentReader instantiation code look real ugly. If nobody objects, I'm going to post a patch that removes these from Lucene.
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716053#action_12716053 ]

Otis Gospodnetic commented on LUCENE-1491:

I'm getting convinced to just drop ngrams < minNgram. If nobody complains by the end of the week, I'll commit.
[jira] Resolved: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1378.

Resolution: Fixed

Done. Thank you Paul.

Sending src/java/org/apache/lucene/analysis/package.html
Sending src/java/org/apache/lucene/analysis/standard/package.html
Sending src/java/org/apache/lucene/index/package.html
Sending src/java/org/apache/lucene/queryParser/package.html
Sending src/java/org/apache/lucene/search/package.html
Sending src/java/org/apache/lucene/store/package.html
Sending src/java/org/apache/lucene/util/package.html
Sending src/test/org/apache/lucene/search/TestBooleanOr.java
Transmitting file data
Committed revision 781055.

Remove remaining @author references

Key: LUCENE-1378
URL: https://issues.apache.org/jira/browse/LUCENE-1378
Project: Lucene - Java
Issue Type: Task
Reporter: Otis Gospodnetic
Assignee: Otis Gospodnetic
Priority: Trivial
Fix For: 2.9
Attachments: LUCENE-1378.patch, LUCENE-1378.patch, LUCENE-1378b.patch, LUCENE-1378c.patch

$ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ \...@author.*//'
[jira] Resolved: (LUCENE-898) contrib/javascript is not packaged into releases
[ https://issues.apache.org/jira/browse/LUCENE-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-898.

Resolution: Fixed

Done.

D javascript/queryEscaper/luceneQueryEscaper.js
D javascript/queryEscaper/testQueryEscaper.html
D javascript/queryEscaper
D javascript/queryConstructor/luceneQueryConstructor.js
D javascript/queryConstructor/luceneQueryConstructor.html
D javascript/queryConstructor/testQueryConstructor.html
D javascript/queryConstructor
D javascript/queryValidator/luceneQueryValidator.js
D javascript/queryValidator/testQueryValidator.html
D javascript/queryValidator
D javascript
Committed revision 781057.

contrib/javascript is not packaged into releases

Key: LUCENE-898
URL: https://issues.apache.org/jira/browse/LUCENE-898
Project: Lucene - Java
Issue Type: Bug
Components: Build
Reporter: Hoss Man
Assignee: Otis Gospodnetic
Priority: Trivial

The contrib/javascript directory is (apparently) a collection of javascript utilities for lucene, but it has no build files or any mechanism to package it, so it is excluded from releases.
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715551#action_12715551 ]

Otis Gospodnetic commented on LUCENE-1491:

I agree this is an improvement, but like Hoss I'm worried about silently skipping shorter-than-specified-min-ngram-size tokens. Perhaps we need a boolean keepSmaller somewhere, so we can explicitly control the behaviour?
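The behaviour under discussion can be sketched in a few lines. This is an illustrative model, not the patch's code: `frontEdgeNGrams` and the `keepSmaller` flag are hypothetical names, showing how a token shorter than minGram could either be passed through unchanged or dropped (the silent data loss Hoss and Otis are worried about).

```java
import java.util.ArrayList;
import java.util.List;

// Front-edge n-grams of one token, with explicit control over short tokens.
public class EdgeNGrams {
    public static List<String> frontEdgeNGrams(String token, int minGram, int maxGram,
                                               boolean keepSmaller) {
        List<String> grams = new ArrayList<>();
        if (token.length() < minGram) {
            if (keepSmaller) grams.add(token); // pass the short token through
            return grams;                      // otherwise drop it silently
        }
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(frontEdgeNGrams("or", 4, 6, false));     // []
        System.out.println(frontEdgeNGrams("or", 4, 6, true));      // [or]
        System.out.println(frontEdgeNGrams("lucene", 4, 6, false)); // [luce, lucen, lucene]
    }
}
```

With keepSmaller=false and minGram=4, a token like "or" produces no output at all, which is exactly how short terms vanish from phrase queries.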
[jira] Commented: (LUCENE-1272) Support for boost factor in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1271#action_1271 ]

Otis Gospodnetic commented on LUCENE-1272:

Jonathan, would it be possible for you to update this patch to work with the trunk, so I can apply it? Thanks!

Support for boost factor in MoreLikeThis

Key: LUCENE-1272
URL: https://issues.apache.org/jira/browse/LUCENE-1272
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Otis Gospodnetic
Priority: Minor
Fix For: 2.9
Attachments: morelikethis_boostfactor.patch

This is a patch I made to be able to boost the terms with a specific factor besides the relevancy returned by MoreLikeThis. This is helpful when having more than one MoreLikeThis in the query, so words in field A (i.e. Title) can be boosted more than words in field B (i.e. Description).
[jira] Commented: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715578#action_12715578 ]

Otis Gospodnetic commented on LUCENE-1378:

I think a bunch of that xdocs stuff under site should/will really be removed with time, as some of it is out of date (e.g. benchmarks, contrib) and harder to maintain than Wiki pages.
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715661#action_12715661 ]

Otis Gospodnetic commented on LUCENE-1491:

I'm not 100% sure - I'm not using ngrams at the moment, so I have no place to test this out - but skipping shorter-than-minimal ngrams seems like it would result in silent data loss. Ah, here, an example: what would happen to "to be or not to be" if min=4 and we relied on ngrams to perform phrase queries? All of those terms would be dropped, so a search for "to be or not to be" would result in 0 hits. If the above is correct, I think this sounds like a bad thing that one wouldn't expect...
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715763#action_12715763 ]

Otis Gospodnetic commented on LUCENE-1491:

Karl - LUCENE-1306 - I agree, I think the existing edge and non-edge ngram stuff should be folded into LUCENE-1306 (or the other way around, if it's easier). But won't the question of what to do with the chunks shorter than the min ngram remain? Does adding that boolean hurt anything? (other than an if test for every ngram :) )
[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR
[ https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715780#action_12715780 ]

Otis Gospodnetic commented on LUCENE-1377:

Could we make this even more generic and say that all basic tokenizers and filters that currently live in Solr should really move to Lucene?

Add HTMLStripReader and WordDelimiterFilter from SOLR

Key: LUCENE-1377
URL: https://issues.apache.org/jira/browse/LUCENE-1377
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.3.2
Reporter: Jason Rutherglen
Priority: Minor
Original Estimate: 24h
Remaining Estimate: 24h

SOLR has two classes, HTMLStripReader and WordDelimiterFilter, which are very useful for a wide variety of use cases. It would be good to place them into core Lucene.
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714285#action_12714285 ]

Otis Gospodnetic commented on LUCENE-1629:

I just got to look at this code and I only scanned it quickly. Is all of the code really Chinese-specific? Would any of it be applicable to other languages, say Japanese or Korean? (assuming we have dictionaries in a suitable format)

contrib intelligent Analyzer for Chinese

Key: LUCENE-1629
URL: https://issues.apache.org/jira/browse/LUCENE-1629
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Affects Versions: 2.4.1
Environment: for java 1.5 or higher, lucene 2.4.1
Reporter: Xiaoping Gao
Assignee: Michael McCandless
Fix For: 2.9
Attachments: analysis-data.zip, bigramdict.mem, build-resources-with-folder.patch, build-resources.patch, build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, LUCENE-1629-java1.4.patch

I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called imdict-chinese-analyzer; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am) 中国人 (Chinese), not 我 是中 国人. So the analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously!

Although there are two analyzer packages in the Apache repository which can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word; this is obviously not true in reality, and this strategy will also increase the index size and hurt performance badly.

The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize a Chinese sentence in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 60%.

As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
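The bigram strategy the description criticizes is easy to demonstrate. The sketch below (illustrative, not CJKAnalyzer's actual code) emits every two adjoining characters as a "word", the way CJK bigram tokenizers do; for 我是中国人 only 中国 and 国人 happen to be real words, whereas a dictionary/HMM-based segmenter would produce 我 / 是 / 中国人.

```java
import java.util.ArrayList;
import java.util.List;

// Naive CJK bigram tokenization: every pair of adjoining characters.
// (Works for BMP characters, which covers this example.)
public class CjkBigrams {
    public static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("我是中国人")); // [我是, 是中, 中国, 国人]
    }
}
```

Note how a 5-character sentence yields 4 bigram "terms", which is also why the description says the bigram approach inflates the index.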
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702192#action_12702192 ]

Otis Gospodnetic commented on LUCENE-1284:

Hm, I feel that because of these command-line, non-Java, GPLed tools it may not be possible (or will be very clunky) to integrate this with Lucene. What do others think?

Felipe, although Java equivalents of those command-line tools don't exist currently, do you think one could implement them in Java (and release them under the ASL)? I don't know what exactly is in those tools and what it would take to port them to Java. Thanks.

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org)

Key: LUCENE-1284
URL: https://issues.apache.org/jira/browse/LUCENE-1284
Project: Lucene - Java
Issue Type: New Feature
Environment: New feature developed under GNU/Linux, but it should work on any other Java-compliant platform
Reporter: Felipe Sánchez Martínez
Assignee: Otis Gospodnetic
Attachments: apertium-morph.0.9.0.tgz

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms. The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future.
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697952#action_12697952 ] Otis Gospodnetic commented on LUCENE-1284: -- Hi Felipe, OK, I looked at this some more. So the Java code you contributed is ASL and Apertium's tools (and data?) is GPL v2? The thing that puzzles me are the language pairs themselves. Why are they in pairs? Is that simply for the translation part of Apertium, and something that's ignored when you use the pair for Lucene and morphological analysis? If I'm interested in, say, French morphological analyzer, why do I need any other language? For French, I see: * br-fr * en-fr * fr-ca * fr-es If I'm interested in French, which of the 4 above is the right one to use? The one with the highest number of lemmata? I had a look at the Indexer and Searcher to get an idea about the usage. Those classes are really just for demonstration, right? Still, do you mind replacing the deprecated Hits object in the Searcher class? In the README you mention this: {quote} 2. The Spanish morphological dictionary must be preprocessed in advance to remove multiword expressions: $ java -classpath lucene-apertium-morph-2.4-dev.jar \ org.apache.lucene.apertium.tools.RemoveMultiWordsFromDix \ --dix apertium-es-ca.es.dix apertium-es-ca.es-nomw.dix {quote} Could you explain why the removal of multiword expressions is needed? Is that Spanish-specific or something one needs to do regardless of the language? Also: {quote} 4. Each file to be indexed must be preprocessed using the Apertium tools: $ cat file.txt | apertium-destxt | lt-proc -a es-ca-nomw.automorf.bin | apertium-tagger -g -f es-ca.prob file.pos.txt {quote} So these are a few command-line tools that end up marking up the input text with POS? (I seem to be missing some libraries and can't compile Apterium locally to check what that this marked up file looks like). 
But my main question here is whether there are Java equivalents of these command-line tools, so that one can easily use them from Java. Or is one forced to use Runtime.exec(...)? Thanks.

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org) -- Key: LUCENE-1284 URL: https://issues.apache.org/jira/browse/LUCENE-1284 Project: Lucene - Java Issue Type: New Feature Environment: New feature developed under GNU/Linux, but it should work on any other Java-compliant platform Reporter: Felipe Sánchez Martínez Assignee: Otis Gospodnetic Attachments: apertium-morph.0.9.0.tgz

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms. The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
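Since the thread asks whether these tools can be driven from Java at all, here is a minimal, hypothetical sketch that shells out to the README's pipeline via ProcessBuilder. The class and method names are invented, and the final redirect to file.pos.txt is an assumption (the quoted command appears to have lost its `>` in transit):

```java
import java.io.IOException;
import java.util.Arrays;

public class ApertiumPreprocess {
  // Builds the README's shell pipeline. The tool names and model files come
  // from the README quoted above; the trailing "> out" redirect is assumed.
  static String buildCommand(String in, String out) {
    return "cat " + in
        + " | apertium-destxt"
        + " | lt-proc -a es-ca-nomw.automorf.bin"
        + " | apertium-tagger -g -f es-ca.prob > " + out;
  }

  // Runs the pipeline through sh; assumes the Apertium tools are on the PATH.
  public static void preprocess(String in, String out)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(Arrays.asList("sh", "-c", buildCommand(in, out)))
        .inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Apertium pipeline failed for " + in);
    }
  }
}
```

Absent real Java bindings, something like this is essentially the Runtime.exec(...) route Otis mentions, just with ProcessBuilder's slightly safer API.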
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697185#action_12697185 ] Otis Gospodnetic commented on LUCENE-1284: -- Felipe: I took another look at this. I spotted mentions of GPL, but it's not clear to me what's GPLed. We can't have GPL software in Apache, unfortunately. Could you please explain which pieces are GPLed and tell us if this is something that could be changed to ASL? Thanks.
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697208#action_12697208 ] Otis Gospodnetic commented on LUCENE-1284: -- One more for Felipe. Is there a page on http://wiki.apertium.org/ that lists the definitive, up-to-date list of supported languages, perhaps with some kind of status indicator (e.g. whether anyone is actively working on the language) and the level of support? I see http://wiki.apertium.org/wiki/List_of_language_pairs and http://wiki.apertium.org/wiki/Language_and_pair_maintainer ...but I can't quite translate (no pun intended) those numbers into the level of support for a language. Could you please shed some light on this?
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688385#action_12688385 ] Otis Gospodnetic commented on LUCENE-1561: -- Might be good to keep a consistent name across Lucene/Solr. More info coming up in SOLR-1079. Maybe rename Field.omitTf, and strengthen the javadocs -- Key: LUCENE-1561 URL: https://issues.apache.org/jira/browse/LUCENE-1561 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1561.patch Spinoff from here: http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html Maybe rename omitTf to something like omitTermPositions, and make it clear what queries will silently fail to work as a result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675604#action_12675604 ] Otis Gospodnetic commented on LUCENE-1284: -- Felipe - I'll have a look at this next week, thanks for the reminder!
[jira] Commented: (LUCENE-1519) Change Primitive Data Types from int to long in class SegmentMerger.java
[ https://issues.apache.org/jira/browse/LUCENE-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12663633#action_12663633 ] Otis Gospodnetic commented on LUCENE-1519: -- Deepak - could you please bring this up on the java-user mailing list instead and close this issue?

Change Primitive Data Types from int to long in class SegmentMerger.java Key: LUCENE-1519 URL: https://issues.apache.org/jira/browse/LUCENE-1519 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.4 Environment: lucene 2.4.0, jdk1.6.0_03/07/11 Reporter: Deepak Original Estimate: 4h Remaining Estimate: 4h

Hi, we are getting an exception while optimizing: "mergeFields produced an invalid result: docCount is 385282378 but fdx file size is 3082259028; now aborting this merge to prevent index corruption". I have checked the code of class SegmentMerger.java and found this check:

if (4+docCount*8 != fdxFileLength)
  // This is most likely a bug in Sun JRE 1.6.0_04/_05;
  // we detect that the bug has struck, here, and
  // throw an exception to prevent the corruption from
  // entering the index. See LUCENE-1282 for details.
  throw new RuntimeException("mergeFields produced an invalid result: docCount is " + docCount + " but fdx file size is " + fdxFileLength + "; now aborting this merge to prevent index corruption");

In our case docCount is 385282378 and fdxFileLength is 3082259028. Even though 4+385282378*8 is equal to 3082259028, the check above fails because 3082259028 is out of int range, so 4+docCount*8 overflows when evaluated in int arithmetic. The type of variable docCount (or at least of this expression) needs to be changed to long. I have written a small test for this:

public class SegmentMergerTest {
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;
    if (4+docCount*8 != fdxFileLength)
      System.out.println("No Match " + (4+docCount*8));
    else
      System.out.println("Match " + (4+docCount*8));
  }
}

The test above prints "No Match", but if you change the data type of docCount to long, it prints "Match". Can you please advise us whether this issue will be fixed in the next release? Regards, Deepak
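The widening fix Deepak describes can be shown in isolation: casting docCount to long before the multiplication makes the whole expression evaluate in 64-bit arithmetic. A minimal sketch, assuming nothing about SegmentMerger itself (class name invented):

```java
public class FdxLengthCheck {
  // Reproduces the int overflow from the report and the widened fix.
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;

    // Buggy: docCount*8 is computed in 32-bit int arithmetic and wraps
    // before being widened to long for the comparison.
    long buggy = 4 + docCount * 8;

    // Fixed: cast first, so the multiplication happens in long arithmetic.
    long fixed = 4 + (long) docCount * 8;

    System.out.println(buggy == fdxFileLength);  // false
    System.out.println(fixed == fdxFileLength);  // true
  }
}
```

The same one-character-class fix (a cast at the comparison site) avoids changing the type of docCount everywhere else in the class.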
[jira] Commented: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661302#action_12661302 ] Otis Gospodnetic commented on LUCENE-1513: -- I feel like I missed some FastSS discussion on the list; was there one? I took a quick look at the paper and the code. Is the following the general idea:
# index fuzzy/misspelled terms in addition to the normal terms (= larger index, slower indexing). How much fuzziness one wants to allow or handle is decided at index time.
# rewrite the query to include variations/misspellings of each term and use that to search (= more clauses, slower than normal search, but faster than the normal fuzzy query whose speed depends on the number of indexed terms)?
Quick code comments:
* Need to add ASL
* Need to replace tabs with 2 spaces and fix formatting in FuzzyHitCollector
* No @author
* Unit test if possible
* Should FastSSwC not be able to take a variable K?
* Should variables named after types (e.g. set in public static String getNeighborhoodString(Set<String> set) { ) be renamed, so they describe what's in them instead? (easier to understand API?)

fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: fastSSfuzzy.zip

Code for doing fuzzy queries with the FastSSwC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. This list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this so it's pretty much untested. But at least you can see what's going on or fix it up.
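For context on the approach being discussed: the FastSS family indexes each term's "deletion neighborhood" (every variant obtained by deleting up to k characters) and finds fuzzy candidates via neighborhood overlap, which are then verified with Levenshtein, exactly the verification step Robert describes. A hypothetical sketch of the k=1 neighborhood (not code from the attached patch):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DeletionNeighborhood {
  // All strings reachable from `term` by deleting at most one character.
  // Two terms within edit distance 1 have overlapping k=1 neighborhoods,
  // so intersecting neighborhoods yields the candidate list to verify.
  public static Set<String> k1(String term) {
    Set<String> out = new LinkedHashSet<>();
    out.add(term);
    for (int i = 0; i < term.length(); i++) {
      out.add(term.substring(0, i) + term.substring(i + 1));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(k1("cat"));  // [cat, at, ct, ca]
  }
}
```

Indexing these neighborhoods alongside the normal terms is what makes the index larger and indexing slower, the trade-off noted in point 1 of the comment.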
[jira] Commented: (LUCENE-1511) Improve Java packages (remove shared/split packages, refactore naming scheme)
[ https://issues.apache.org/jira/browse/LUCENE-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660808#action_12660808 ] Otis Gospodnetic commented on LUCENE-1511: -- Perhaps this should have been brought up on java-dev first... How does one deal with package-private classes/methods then?

Improve Java packages (remove shared/split packages, refactore naming scheme) - Key: LUCENE-1511 URL: https://issues.apache.org/jira/browse/LUCENE-1511 Project: Lucene - Java Issue Type: Wish Components: contrib/*, Search Affects Versions: 2.4 Reporter: Gunnar Wagenknecht

I recently prepared Lucene OSGi bundles for the Eclipse Orbit repository. During the preparation I discovered that some packages (e.g. org.apache.lucene.search) are shared between different JARs, i.e. the package is in Lucene Core and in a contrib lib. While this is perfectly fine, it just makes OSGi packaging more complex, and complexity also has a higher potential for errors. Thus, my wish for Lucene 3.0 would be to rename some packages. For example, all contribs/extensions could be moved into their own package namespace. (Apologies if this has been reported elsewhere. I did a search in JIRA but did not find a similar issue.)
[jira] Updated: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1491: - Lucene Fields: [New, Patch Available] (was: [New]) Fix Version/s: 2.9 Assignee: Otis Gospodnetic

EdgeNGramTokenFilter stops on tokens smaller than minimum gram size. Key: LUCENE-1491 URL: https://issues.apache.org/jira/browse/LUCENE-1491 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.4, 2.4.1, 2.9, 3.0 Reporter: Todd Feak Assignee: Otis Gospodnetic Fix For: 2.9 Attachments: LUCENE-1491.patch

If a token is encountered in the stream that is shorter in length than the min gram size, the filter will stop processing the token stream. Working up a unit test now, but it may be a few days before I can provide it. Wanted to get it in the system.
[jira] Commented: (LUCENE-1487) FieldCacheTermsFilter
[ https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655517#action_12655517 ] Otis Gospodnetic commented on LUCENE-1487: -- Would it be possible to reformat to use Lucene code style and add a bit of javadoc/unit test? Eclipse and IDEA styles are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute FieldCacheTermsFilter - Key: LUCENE-1487 URL: https://issues.apache.org/jira/browse/LUCENE-1487 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4 Reporter: Tim Sturge Fix For: 2.9 Attachments: FieldCacheTermsFilter.java This is a companion to FieldCacheRangeFilter except it operates on a set of terms rather than a range. It works best when the set is comparatively large or the terms are comparatively common. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)
[ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1380: - Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.4.1 Patch for ShingleFilter.enablePositions (or PositionFilter) --- Key: LUCENE-1380 URL: https://issues.apache.org/jira/browse/LUCENE-1380 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Mck SembWever Priority: Trivial Fix For: 2.4.1 Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other. Today the shingles generated are synonyms only to the first term in the shingle. For example the query abcd efgh ijkl results in: (abcd abcd efgh abcd efgh ijkl) (efgh efgh ijkl) (ijkl) where abcd efgh and abcd efgh ijkl are synonyms of abcd, and efgh ijkl is a synonym of efgh. There exists no way today to alter which token a particular shingle is a synonym for. This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other. See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
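For readers unfamiliar with how tokens become "synonyms" of each other: a position increment of 0 places a token at the same position as the previous token, and that is the mechanism a PositionFilter uses to stack all shingles (and unigrams) at one position. A hypothetical, Lucene-free sketch of the bookkeeping (names invented):

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrementDemo {
  // A token plus its position increment, as in an analysis token stream.
  record Token(String text, int posIncr) {}

  // Resolves increments into absolute positions: increment 0 stacks a
  // token on the previous position, making the two tokens synonyms.
  static List<String> positions(List<Token> tokens) {
    List<String> out = new ArrayList<>();
    int pos = -1;
    for (Token t : tokens) {
      pos += t.posIncr();
      out.add(pos + ":" + t.text());
    }
    return out;
  }

  public static void main(String[] args) {
    // All shingles forced to the same position, as the patch proposes.
    List<Token> all = List.of(
        new Token("abcd", 1),
        new Token("abcd efgh", 0),
        new Token("abcd efgh ijkl", 0),
        new Token("efgh", 0),
        new Token("efgh ijkl", 0),
        new Token("ijkl", 0));
    System.out.println(positions(all));
  }
}
```

In today's ShingleFilter output only the shingles starting at a term get increment 0 relative to that term; the patch makes every token after the first carry increment 0, so everything lands at one position.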
[jira] Commented: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12655525#action_12655525 ] Otis Gospodnetic commented on LUCENE-1026: -- My impression was that this didn't stick, so I'd drop it. Provide a simple way to concurrently access a Lucene index from multiple threads Key: LUCENE-1026 URL: https://issues.apache.org/jira/browse/LUCENE-1026 Project: Lucene - Java Issue Type: New Feature Components: Index, Search Reporter: Mark Miller Priority: Minor Attachments: DefaultIndexAccessor.java, DefaultMultiIndexAccessor.java, IndexAccessor-02.04.2008.zip, IndexAccessor-02.07.2008.zip, IndexAccessor-02.28.2008.zip, IndexAccessor-05.27.2008.zip, IndexAccessor-1.26.2008.zip, IndexAccessor-2.15.2008.zip, IndexAccessor.04032008.zip, IndexAccessor.java, IndexAccessor.zip, IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor-2.zip, shai-IndexAccessor.zip, shai-IndexAccessor3.zip, SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java For building interactive indexes accessed through a network/internet (multiple threads). This builds upon the LuceneIndexAccessor patch. That patch was not very newbie friendly and did not properly handle MultiSearchers (or at the least made it easy to get into trouble). This patch simplifies things and provides out of the box support for sharing the IndexAccessors across threads. There is also a simple test class and example SearchServer to get you started. Future revisions will be zipped. Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653900#action_12653900 ] Otis Gospodnetic commented on LUCENE-855: - Hi Matt! :) Tim, want to benchmark the two? (since you already benchmarked 1461, you should be able to plug in Matt's thing and see how it compares)

MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java

Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all docId, value pairs of a given field, sorts by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5-year range. Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there's no longer a need to make the values lexicographically comparable, i.e. padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is that it has a fairly significant memory requirement, so it's designed for situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while on large datasets (it took 40s on a 3M document corpus). Warmup can be called explicitly, or it is automatically called the first time MemoryCachedRangeFilter is applied to a given field. In summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- The field contains many unique numeric values
- The index contains a large number of documents
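The bits() strategy described above, sort (docId, value) pairs by value once at warmup, then binary-search the lower bound and set a bit for every docId up to the upper bound, can be sketched independently of Lucene. A simplified illustration (class name and representation invented, not the attached patch):

```java
import java.util.Arrays;
import java.util.BitSet;

public class SortedValueRangeFilter {
  private final long[] values; // field values, sorted ascending
  private final int[] docIds;  // docIds[i] is the doc holding values[i]

  // pairs[i] = {docId, value}; sorting happens once, at warmup time.
  public SortedValueRangeFilter(long[][] pairs) {
    long[][] sorted = pairs.clone();
    Arrays.sort(sorted, (a, b) -> Long.compare(a[1], b[1]));
    values = new long[sorted.length];
    docIds = new int[sorted.length];
    for (int i = 0; i < sorted.length; i++) {
      docIds[i] = (int) sorted[i][0];
      values[i] = sorted[i][1];
    }
  }

  // Set a bit for every doc whose value lies in [lower, upper].
  public BitSet bits(long lower, long upper) {
    BitSet result = new BitSet();
    for (int i = lowerBound(lower); i < values.length && values[i] <= upper; i++) {
      result.set(docIds[i]);
    }
    return result;
  }

  // Binary search: first index whose value is >= key.
  private int lowerBound(long key) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
}
```

The (sizeof(int) + sizeof(long)) * numDocs memory figure quoted above corresponds to exactly these two parallel arrays.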
[jira] Commented: (LUCENE-1461) Cached filter for a single term field
[ https://issues.apache.org/jira/browse/LUCENE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12653361#action_12653361 ] Otis Gospodnetic commented on LUCENE-1461: -- Is this related to LUCENE-855? The same? Aha, I see Paul asked the reverse question in LUCENE-855 already... Tim?

Cached filter for a single term field - Key: LUCENE-1461 URL: https://issues.apache.org/jira/browse/LUCENE-1461 Project: Lucene - Java Issue Type: New Feature Reporter: Tim Sturge Assignee: Michael McCandless Fix For: 2.9 Attachments: DisjointMultiFilter.java, FieldCacheRangeFilter.patch, LUCENE-1461.patch, LUCENE-1461a.patch, LUCENE-1461b.patch, LUCENE-1461c.patch, RangeMultiFilter.java, RangeMultiFilter.java, TermMultiFilter.java, TestFieldCacheRangeFilter.patch

These classes implement inexpensive range filtering over a field containing a single term. They do this by building an integer array of term numbers (storing the term-number mapping in a TreeMap) and then implementing a fast integer-comparison-based DocIdSetIterator. This code is currently being used to do age range filtering, but could also be used to do other date filtering or in any application where there need to be multiple filters based on the same single-term field. I have an untested implementation of single term filtering and have considered but not yet implemented term set filtering (useful for location-based searches) as well. The code here is fairly rough; it works but lacks javadocs and toString() and hashCode() methods etc. I'm posting it here to discover if there is other interest in this feature; I don't mind fixing it up but would hate to go to the effort if it's not going to make it into Lucene.
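The term-ordinal idea described above, map each document's single term to an int via a sorted map, then answer any range filter with one int comparison per document, might look like the following. This is a hypothetical sketch: the names are invented, and the real patch exposes a DocIdSetIterator rather than building a BitSet.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class TermOrdinalRangeFilter {
  private final int[] ords;                         // ords[doc] = ordinal of the doc's single term
  private final TreeMap<String, Integer> termToOrd; // term -> ordinal, ordinals follow term order

  // terms[doc] is the single term that document `doc` holds in the field.
  public TermOrdinalRangeFilter(String[] terms) {
    termToOrd = new TreeMap<>();
    for (String t : terms) termToOrd.put(t, 0);
    int ord = 0;
    for (String t : termToOrd.keySet()) termToOrd.put(t, ord++);
    ords = new int[terms.length];
    for (int doc = 0; doc < terms.length; doc++) ords[doc] = termToOrd.get(terms[doc]);
  }

  // Docs whose term falls in [lowerTerm, upperTerm]: one int comparison per doc,
  // so many different filters can reuse the same cached ordinal array.
  public BitSet range(String lowerTerm, String upperTerm) {
    BitSet bits = new BitSet();
    Map.Entry<String, Integer> lo = termToOrd.ceilingEntry(lowerTerm);
    Map.Entry<String, Integer> hi = termToOrd.floorEntry(upperTerm);
    if (lo == null || hi == null || lo.getValue() > hi.getValue()) return bits;
    int loOrd = lo.getValue(), hiOrd = hi.getValue();
    for (int doc = 0; doc < ords.length; doc++) {
      if (ords[doc] >= loOrd && ords[doc] <= hiOrd) bits.set(doc);
    }
    return bits;
  }
}
```

Because ordinals follow term order, a term range becomes an ordinal range; the TreeMap is only consulted twice per filter, never per document.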
[jira] Assigned: (LUCENE-689) NullPointerException thrown by equals method in SpanOrQuery
[ https://issues.apache.org/jira/browse/LUCENE-689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-689: --- Assignee: Otis Gospodnetic (was: Steven Parkes)

NullPointerException thrown by equals method in SpanOrQuery --- Key: LUCENE-689 URL: https://issues.apache.org/jira/browse/LUCENE-689 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.1 Environment: Java 1.5.0_09, RHEL 3 Linux, Tomcat 5.0.28 Reporter: Michael Goddard Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-689.txt

Part of our code utilizes the equals method in SpanOrQuery and, in certain cases (details to follow, if necessary), a NullPointerException gets thrown as a result of the String field being null. After applying the following patch, the problem disappeared:

Index: src/java/org/apache/lucene/search/spans/SpanOrQuery.java
===
--- src/java/org/apache/lucene/search/spans/SpanOrQuery.java (revision 465065)
+++ src/java/org/apache/lucene/search/spans/SpanOrQuery.java (working copy)
@@ -121,7 +121,8 @@
     final SpanOrQuery that = (SpanOrQuery) o;
     if (!clauses.equals(that.clauses)) return false;
-    if (!field.equals(that.field)) return false;
+    if (field != null && !field.equals(that.field)) return false;
+    if (field == null && that.field != null) return false;
     return getBoost() == that.getBoost();
   }
[jira] Reopened: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reopened LUCENE-1378: -- Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Assignee: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch, LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//' -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1439) Inconsistent API
[ https://issues.apache.org/jira/browse/LUCENE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647672#action_12647672 ] Otis Gospodnetic commented on LUCENE-1439: -- Wiki may be more suitable for that. Note that it may be better to work on getting more of the pending patches reviewed and tested, so they can be committed faster. That way we can then proceed to making API changes that won't break existing/pending patches.

Inconsistent API - Key: LUCENE-1439 URL: https://issues.apache.org/jira/browse/LUCENE-1439 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Environment: any Reporter: Ivan.S Priority: Minor

The API of Lucene is totally inconsistent:
1) There are a lot of containers which don't implement an interface indicating this fact (for pre-java-1.5 Lucene it could be Collection, for post-java-1.5 Lucene it could be the more general Iterable). Example: IndexSearcher: int maxDoc() and doc(int i)
2) There are a lot of classes having non-final publicly accessible fields.
3) Some methods which return values are named something(), others are named getSomething(). The best one is Fieldable: without get: String stringValue(), Reader readerValue(), byte[] binaryValue(), ... with get: byte[] getBinaryValue(), int getBinaryLength(), ...
[jira] Issue Comment Edited: (LUCENE-524) Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites
[ https://issues.apache.org/jira/browse/LUCENE-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647181#action_12647181 ] otis edited comment on LUCENE-524 at 11/13/08 7:06 AM: --- Based on the description, yes. Doesn't this also sound a lot like that old Mark H's LUCENE-329 issue? was (Author: otis): Based on the description, yes. Doesn't this also sound a lot like that old Mark H's issue that you commented on earlier? Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites -- Key: LUCENE-524 URL: https://issues.apache.org/jira/browse/LUCENE-524 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9 Reporter: Randy Puttick Priority: Minor Attachments: MultiTermQuery.java, MultiTermScorer.java The implementation of MultiTermQuery in terms of BooleanQuery introduces several problems: 1) Collisions with maximum clause limit on boolean queries which throws an exception. This is most problematic because it is difficult to ascertain in advance how many terms a fuzzy query or wildcard query might involve. 2) The boolean disjunctive scoring is not appropriate for either fuzzy or wildcard queries. In effect the score is divided by the number of terms in the query which has nothing to do with the relevancy of the results. 3) Performance of disjunctive boolean queries for large term sets is quite sub-optimal -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1417) Allowing for distance measures that incorporate frequency/popularity for SuggestWord comparison
[ https://issues.apache.org/jira/browse/LUCENE-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1417: - Priority: Minor (was: Major) I agree with Grant. I like the idea of having a pluggable distance metric, for example. Allowing for distance measures that incorporate frequency/popularity for SuggestWord comparison --- Key: LUCENE-1417 URL: https://issues.apache.org/jira/browse/LUCENE-1417 Project: Lucene - Java Issue Type: Improvement Components: contrib/spellchecker Affects Versions: 2.4 Reporter: Jason Rennie Priority: Minor Original Estimate: 4h Remaining Estimate: 4h Spelling suggestions are currently ordered first by a string edit distance measure, then by popularity/frequency. This limits the ability of popularity/frequency to affect suggestions. I think it would be better for the distance measure to accept popularity/frequency as an argument and provide a distance/score that incorporates any popularity/frequency considerations. I.e. change StringDistance.getDistance to accept an additional argument: frequency of the potential suggestion. The new SuggestWord.compareTo function would only order by score. We could achieve the existing behavior by adding a small inverse frequency value to the distances. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1296) Allow use of compact DocIdSet in CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1296: - Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.9 Allow use of compact DocIdSet in CachingWrapperFilter - Key: LUCENE-1296 URL: https://issues.apache.org/jira/browse/LUCENE-1296 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Paul Elschot Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: cachedFilter20080529.patch, cachedFilter20080605.patch Extends CachingWrapperFilter with a protected method to determine the DocIdSet to be cached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-518) document field lengths count analyzer synonym overlays
[ https://issues.apache.org/jira/browse/LUCENE-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-518. - Resolution: Fixed I think LUCENE-1420 fixed this. document field lengths count analyzer synonym overlays -- Key: LUCENE-518 URL: https://issues.apache.org/jira/browse/LUCENE-518 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 1.9 Environment: N/A Reporter: Randy Puttick Priority: Minor Using a synonym expansion analyzer to add tokens with zero offset from the substituted token should not extend the length of the field in the document (for scoring purposes) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
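The fix described above hinges on counting positions rather than tokens. The following standalone sketch (not Lucene's actual scoring code; names are illustrative) shows why a synonym emitted with a position increment of zero should not change the field length used for length normalization:

```java
// Hypothetical sketch, not Lucene's implementation: field length for
// scoring counted as the sum of position increments, so zero-increment
// synonym tokens stacked on the previous token add no length.
public class FieldLengthSketch {
    // Each token carries the position increment its analyzer assigned:
    // 1 for a normal token, 0 for a synonym overlaid on the previous one.
    public static int fieldLength(int[] positionIncrements) {
        int length = 0;
        for (int inc : positionIncrements) {
            length += inc; // synonyms (inc == 0) contribute nothing
        }
        return length;
    }

    public static void main(String[] args) {
        // "big apartment" with synonym "flat" stacked on "apartment"
        int[] increments = {1, 1, 0};
        System.out.println(fieldLength(increments)); // prints 2, not 3
    }
}
```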
[jira] Updated: (LUCENE-1413) Creating PlainTextDictionary with UTF8 files
[ https://issues.apache.org/jira/browse/LUCENE-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1413: - Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: (was: 2.3.3) 2.9 Issue Type: Improvement (was: New Feature) Creating PlainTextDictionary with UTF8 files Key: LUCENE-1413 URL: https://issues.apache.org/jira/browse/LUCENE-1413 Project: Lucene - Java Issue Type: Improvement Components: contrib/spellchecker Affects Versions: 2.3.2 Environment: All platform / operating systems Reporter: YourSoft Fix For: 2.9 Generating indexes from text files is good, but the current code can't read UTF-8 files. This can easily be fixed by adding the following constructor to PlainTextDictionary.java: public PlainTextDictionary(InputStream dictFile, String fileEncoding) throws UnsupportedEncodingException { in = new BufferedReader(new InputStreamReader(dictFile, fileEncoding)); } -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
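The proposed constructor boils down to wrapping the stream in an InputStreamReader with an explicit encoding rather than the platform default. A minimal self-contained sketch of the same idea (class and method names here are illustrative, not from the patch):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed change: read dictionary words
// through an InputStreamReader with an explicit character encoding.
public class Utf8DictionarySketch {
    public static List<String> readWords(InputStream dictFile, String fileEncoding) throws IOException {
        List<String> words = new ArrayList<>();
        // The explicit encoding is the whole point: without it, the
        // platform default charset would mangle non-ASCII entries.
        BufferedReader in = new BufferedReader(new InputStreamReader(dictFile, fileEncoding));
        String line;
        while ((line = in.readLine()) != null) {
            words.add(line);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "caf\u00e9\nna\u00efve\n".getBytes(StandardCharsets.UTF_8);
        List<String> words = readWords(new ByteArrayInputStream(utf8), "UTF-8");
        System.out.println(words); // [café, naïve]
    }
}
```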
[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12646971#action_12646971 ] Otis Gospodnetic commented on LUCENE-1306: -- Could/should this not be folded into the existing Ngram code in contrib? CombinedNGramTokenFilter Key: LUCENE-1306 URL: https://issues.apache.org/jira/browse/LUCENE-1306 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Karl Wettin Assignee: Karl Wettin Priority: Trivial Attachments: LUCENE-1306.txt, LUCENE-1306.txt Alternative NGram filter that produce tokens with composite prefix and suffix markers. {code:java} ts = new WhitespaceTokenizer(new StringReader("hello")); ts = new CombinedNGramTokenFilter(ts, 2, 2); assertNext(ts, "^h"); assertNext(ts, "he"); assertNext(ts, "el"); assertNext(ts, "ll"); assertNext(ts, "lo"); assertNext(ts, "o$"); assertNull(ts.next()); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
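The expected tokens in the test above can be reproduced with a plain standalone sketch of the idea (not the attached filter's implementation): mark the token with '^' and '$' before slicing it into n-grams, so edge grams carry their position:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the combined n-gram idea: prefix the token
// with '^' and suffix it with '$' before taking bigrams, so grams that
// touch the word boundary are distinguishable from interior grams.
public class CombinedNGramSketch {
    public static List<String> bigrams(String token) {
        String marked = "^" + token + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= marked.length(); i++) {
            grams.add(marked.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("hello")); // [^h, he, el, ll, lo, o$]
    }
}
```

The output matches the token sequence asserted in the issue's test snippet.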
[jira] Resolved: (LUCENE-548) Sort bug using ParallelMultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-548. - Resolution: Won't Fix I agree, doesn't seem worth fixing. Explicit STRING is recommended. Sort bug using ParallelMultiSearcher Key: LUCENE-548 URL: https://issues.apache.org/jira/browse/LUCENE-548 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.9 Environment: Linux FC2 Java 1.4.9 Reporter: dan Priority: Minor Output: java.lang.ClassCastException: java.lang.String at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:119) at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:61) at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:271) Input: - This only occurs when searching more than one index using ParallelMultiSearcher - I use the signature new Sort("date", true) - The values in "date" are strings in the form "20060419" - The call to getType in FieldDocSortedHitQueue misinterprets the value as an INT, then the exception is thrown Available workaround - I use the signature new Sort(new SortField("date", SortField.STRING, true)) and the problem goes away. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
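The recommended workaround (an explicit STRING sort) is safe because dates formatted as yyyyMMdd sort chronologically under plain lexicographic comparison, so no type auto-detection is needed. A small illustration (not Lucene code):

```java
import java.util.Arrays;

// Illustration of why an explicit STRING sort works for yyyyMMdd
// values: lexicographic order equals chronological order for this
// fixed-width format, so the auto-detected INT type is unnecessary.
public class DateStringSortSketch {
    public static String[] sorted(String[] dates) {
        String[] copy = dates.clone();
        Arrays.sort(copy); // plain lexicographic String ordering
        return copy;
    }

    public static void main(String[] args) {
        String[] dates = {"20060419", "20051231", "20060101"};
        System.out.println(Arrays.toString(sorted(dates))); // [20051231, 20060101, 20060419]
    }
}
```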
[jira] Resolved: (LUCENE-711) BooleanWeight should size the weights Vector correctly
[ https://issues.apache.org/jira/browse/LUCENE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-711. - Resolution: Fixed Assignee: Otis Gospodnetic Sending src/java/org/apache/lucene/search/BooleanQuery.java Transmitting file data . Committed revision 713634. BooleanWeight should size the weights Vector correctly -- Key: LUCENE-711 URL: https://issues.apache.org/jira/browse/LUCENE-711 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9, 2.0.0, 2.1 Reporter: paul constantinides Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-711.patch, vector_sizing.patch The weights field on BooleanWeight uses a Vector that will always be sized exactly the same as the outer class' clauses Vector, therefore can be sized correctly in the constructor. This is a trivial memory saving enhancement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
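The enhancement is simply passing the known clause count as the Vector's initial capacity so it never over-allocates or grows. A minimal sketch with plain java.util.Vector (the real change lives inside BooleanWeight's constructor):

```java
import java.util.Vector;

// Sketch of the sizing enhancement: when the final element count is
// known up front, constructing the Vector with that capacity avoids
// the default capacity (10) and any intermediate resizes.
public class PresizedVectorSketch {
    public static Vector<String> presized(int clauseCount) {
        // capacity == clauseCount, so subsequent add() calls never
        // trigger a backing-array reallocation
        return new Vector<String>(clauseCount);
    }

    public static void main(String[] args) {
        Vector<String> weights = presized(3);
        System.out.println(weights.capacity()); // 3
    }
}
```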
[jira] Commented: (LUCENE-524) Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites
[ https://issues.apache.org/jira/browse/LUCENE-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647181#action_12647181 ] Otis Gospodnetic commented on LUCENE-524: - Based on the description, yes. Doesn't this also sound a lot like that old Mark H's issue that you commented on earlier? Current implementation of fuzzy and wildcard queries inappropriately implemented as Boolean query rewrites -- Key: LUCENE-524 URL: https://issues.apache.org/jira/browse/LUCENE-524 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9 Reporter: Randy Puttick Priority: Minor Attachments: MultiTermQuery.java, MultiTermScorer.java The implementation of MultiTermQuery in terms of BooleanQuery introduces several problems: 1) Collisions with maximum clause limit on boolean queries which throws an exception. This is most problematic because it is difficult to ascertain in advance how many terms a fuzzy query or wildcard query might involve. 2) The boolean disjunctive scoring is not appropriate for either fuzzy or wildcard queries. In effect the score is divided by the number of terms in the query which has nothing to do with the relevancy of the results. 3) Performance of disjunctive boolean queries for large term sets is quite sub-optimal -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-38) RangeQuery without lower term and inclusive=false skips blank fields
[ https://issues.apache.org/jira/browse/LUCENE-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647185#action_12647185 ] Otis Gospodnetic commented on LUCENE-38: This thing is 6+ years old and I don't recall this being mentioned on the list in the last half a decade. I'll leave you the Won't Fix pleasure, Mark. RangeQuery without lower term and inclusive=false skips blank fields Key: LUCENE-38 URL: https://issues.apache.org/jira/browse/LUCENE-38 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Assignee: Lucene Developers Priority: Minor Attachments: TestRangeQuery.patch This was reported by James Ricci [EMAIL PROTECTED] at: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=1835 When you create a ranged query and omit the lower term, my expectation would be that I would find everything less than the upper term. Now if I pass false for the inclusive term, then I would expect that I would find all terms less than the upper term excluding the upper term itself. What is happening in the case of lower_term=null, upper_term=x, inclusive=false is that empty strings are being excluded because inclusive is set false, and the implementation of RangedQuery creates a default lower term of Term(fieldName, ""). Since it's not inclusive, it excludes "". This isn't what I intended, and I don't think it's what most people would imagine RangedQuery would do in the case I've mentioned. I equate lower=null, upper=x, inclusive=false to Field < x. lower=null, upper=x, inclusive=true would be Field <= x. In both cases, the only difference should be whether or not Field = x is true for the query. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
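The reported behavior is easy to reproduce with plain string comparison (this is a standalone illustration, not Lucene's RangeQuery): when the missing lower term defaults to the empty string and the range is exclusive, the empty string itself no longer matches, so documents with blank fields silently drop out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration of the bug's mechanics: an exclusive range
// whose lower bound defaults to "" excludes "" itself, i.e. blank fields.
public class RangeBoundSketch {
    public static List<String> inRange(List<String> terms, String lower, String upper, boolean inclusive) {
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            int lo = t.compareTo(lower);
            int hi = t.compareTo(upper);
            boolean lowerOk = inclusive ? lo >= 0 : lo > 0;
            boolean upperOk = inclusive ? hi <= 0 : hi < 0;
            if (lowerOk && upperOk) hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("", "apple", "x", "zebra");
        // lower=null mapped to "", inclusive=false: the blank field "" is skipped
        System.out.println(inRange(terms, "", "x", false)); // [apple]
        System.out.println(inRange(terms, "", "x", true));  // [, apple, x]
    }
}
```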
[jira] Resolved: (LUCENE-1180) Syns2Index fails
[ https://issues.apache.org/jira/browse/LUCENE-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1180. -- Resolution: Fixed It looks like Mike fixed this 2 months ago: r694222 | mikemccand | 2008-09-11 08:11:03 -0400 (Thu, 11 Sep 2008) | 1 line fix wordnet's Syns2Index to not fiddle with mergeFactor maxBuffereDocs (the latter was hitting an exception) Syns2Index fails Key: LUCENE-1180 URL: https://issues.apache.org/jira/browse/LUCENE-1180 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.3 Reporter: Jeffrey Yang Assignee: Otis Gospodnetic Priority: Minor Attachments: syns2index_fix_2.3.patch, syns2index_fix_2.4-dev.patch Original Estimate: 1h Remaining Estimate: 1h Running Syns2Index fails with a java.lang.IllegalArgumentException: maxBufferedDocs must at least be 2 when enabled exception. at org.apache.lucene.index.IndexWriter.setMaxBufferedDocs(IndexWriter.java:883) at org.apache.lucene.wordnet.Syns2Index.index(Syns2Index.java:249) at org.apache.lucene.wordnet.Syns2Index.main(Syns2Index.java:208) The code is here // blindly up these parameters for speed writer.setMergeFactor( writer.getMergeFactor() * 2); writer.setMaxBufferedDocs( writer.getMaxBufferedDocs() * 2); It looks like getMaxBufferedDocs used to return 10, and now it returns -1, not sure when that started happening. My suggestion would be to just remove these three lines. Since speed has already improved vastly, there isn't a need to speed things up. To run this, Syns2Index requires two args. The first is the location of the wn_s.pl file, and the second is the directory to create the index in. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
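The failure mode is a sentinel value: getMaxBufferedDocs() began returning -1 (disabled in favor of RAM-based flushing), and blindly doubling that produced a value setMaxBufferedDocs rejects. The committed fix simply removed the tuning lines; a defensive alternative would guard on the sentinel, as in this hypothetical sketch:

```java
// Hypothetical sketch (not the committed fix, which removed the tuning
// lines entirely): only scale a tuning parameter when it is actually
// enabled, leaving the "disabled" sentinel (-1) untouched.
public class SentinelGuardSketch {
    public static int doubledIfEnabled(int maxBufferedDocs) {
        // -1 means the setting is disabled; doubling it would yield -2,
        // which a setter like setMaxBufferedDocs rejects with
        // IllegalArgumentException
        return maxBufferedDocs > 0 ? maxBufferedDocs * 2 : maxBufferedDocs;
    }

    public static void main(String[] args) {
        System.out.println(doubledIfEnabled(10)); // 20
        System.out.println(doubledIfEnabled(-1)); // -1 (left alone)
    }
}
```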
[jira] Resolved: (LUCENE-896) Let users set Similarity for MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-896. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Actually, my copy of MLT already takes Similarity in ctor and has set/getSimilarity, so no patch is needed. You want/need that isNoise method protected? Let users set Similarity for MoreLikeThis - Key: LUCENE-896 URL: https://issues.apache.org/jira/browse/LUCENE-896 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Ryan McKinley Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-896-MoreLikeThisSimilarity.patch Let users set Similarity used for MoreLikeThis For discussion, see: http://www.nabble.com/MoreLikeThis-API-changes--tf3838535.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1272) Support for boost factor in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1272: - Priority: Minor (was: Major) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.9 Assignee: Otis Gospodnetic I don't see any harm in this, I'll make the change later this week. Support for boost factor in MoreLikeThis Key: LUCENE-1272 URL: https://issues.apache.org/jira/browse/LUCENE-1272 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Jonathan Leibiusky Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.9 Attachments: morelikethis_boostfactor.patch This is a patch I made to be able to boost the terms with a specific factor beside the relevancy returned by MoreLikeThis. This is helpful when having more then 1 MoreLikeThis in the query, so words in the field A (i.e. Title) can be boosted more than words in the field B (i.e. Description). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
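The mechanics of the patch amount to multiplying each interesting term's relevance score by a per-field boost factor. A standalone sketch of that idea (names here are made up for illustration, not the patch's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: scale the per-term relevance scores by a boost
// factor, so terms drawn from one field (e.g. Title) can outweigh
// terms drawn from another (e.g. Description) in the combined query.
public class BoostFactorSketch {
    public static Map<String, Float> applyBoost(Map<String, Float> termScores, float boostFactor) {
        Map<String, Float> boosted = new HashMap<>();
        for (Map.Entry<String, Float> e : termScores.entrySet()) {
            boosted.put(e.getKey(), e.getValue() * boostFactor);
        }
        return boosted;
    }

    public static void main(String[] args) {
        Map<String, Float> titleTerms = new HashMap<>();
        titleTerms.put("lucene", 0.5f);
        // boost Title terms 3x relative to unboosted Description terms
        System.out.println(applyBoost(titleTerms, 3.0f).get("lucene")); // 1.5
    }
}
```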
[jira] Updated: (LUCENE-1424) Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode
[ https://issues.apache.org/jira/browse/LUCENE-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1424: - Summary: Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode (was: Change all mutli term querys so that they extend MultiTermQuery and allow for a constant score mode) Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode --- Key: LUCENE-1424 URL: https://issues.apache.org/jira/browse/LUCENE-1424 Project: Lucene - Java Issue Type: New Feature Reporter: Mark Miller Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch, LUCENE-1424.patch Cleans up a bunch of code duplication, closer to how things should be - design wise, gives us constant score for all the multi term queries, and allows us at least the option of highlighting the constant score queries without much further work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12636686#action_12636686 ] Otis Gospodnetic commented on LUCENE-1410: -- For people not intimately familiar with PFOR (like me), I found the following to be helpful: http://cis.poly.edu/cs912/indexcomp.pdf PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: LUCENE-1410b.patch, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
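For readers new to the technique, the core of frame-of-reference coding is storing one base value and small offsets from it. The sketch below is deliberately simplified: it is plain FOR with no bit packing and no "patched" exception mechanism (PFOR's distinguishing feature, which patches outliers that don't fit the chosen bit width), so it only illustrates the base-plus-offsets idea, not the attached implementation.

```java
import java.util.Arrays;

// Greatly simplified frame-of-reference sketch: store the minimum once
// and keep only offsets from it. Real PFOR additionally bit-packs the
// offsets at a fixed width and stores exceptions for outliers.
public class ForSketch {
    public static int[] encode(int[] values) {
        int min = Arrays.stream(values).min().orElse(0);
        int[] out = new int[values.length + 1];
        out[0] = min; // the frame of reference
        for (int i = 0; i < values.length; i++) {
            out[i + 1] = values[i] - min; // small non-negative offsets
        }
        return out;
    }

    public static int[] decode(int[] encoded) {
        int min = encoded[0];
        int[] values = new int[encoded.length - 1];
        for (int i = 0; i < values.length; i++) {
            values[i] = encoded[i + 1] + min;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] docIds = {1000, 1003, 1001, 1007};
        // offsets 0, 3, 1, 7 would fit in 3 bits each after packing
        System.out.println(Arrays.toString(decode(encode(docIds))));
    }
}
```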
[jira] Commented: (LUCENE-1409) read past EOF
[ https://issues.apache.org/jira/browse/LUCENE-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12636068#action_12636068 ] Otis Gospodnetic commented on LUCENE-1409: -- Since Lucene 2.4 is about to be released, if I were you I would get Lucene from trunk, build the jar, and replace your 2.3.2 version. If that eliminates this error, could you please close this issue? read past EOF - Key: LUCENE-1409 URL: https://issues.apache.org/jira/browse/LUCENE-1409 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.3.2 Environment: jdk 1.5.0_08 Reporter: Adam Łączyński I create an index with a lot of documents (~500 000). While adding documents, a "read past EOF" error occurred. It occurs after a random number of indexed documents. I used Lucene with the Compass framework, but I think that is not important. Here is a link to the Compass forum where the problem was reported: http://forum.compass-project.org/thread.jspa?threadID=215641tstart=0 java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38) at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76) at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:304) at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:59) at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:298) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:197) at org.apache.lucene.index.MultiSegmentReader.init(MultiSegmentReader.java:109) at org.apache.lucene.index.MultiSegmentReader.doReopen(MultiSegmentReader.java:203) at org.apache.lucene.index.DirectoryIndexReader$2.doBody(DirectoryIndexReader.java:98) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636) at 
org.apache.lucene.index.DirectoryIndexReader.reopen(DirectoryIndexReader.java:92) at org.compass.core.lucene.engine.manager.DefaultLuceneSearchEngineIndexManager.internalRefreshCache(DefaultLuceneSearchEngineIndexManager.java:368) at org.compass.core.lucene.engine.manager.DefaultLuceneSearchEngineIndexManager.refreshCache(DefaultLuceneSearchEngineIndexManager.java:358) at org.compass.core.lucene.engine.transaction.readcommitted.ReadCommittedTransaction$CommitCallable.call(ReadCommittedTransaction.java:422) at org.compass.core.transaction.context.TransactionalCallable$1.doInTransaction(TransactionalCallable.java:44) at org.compass.core.impl.DefaultCompass$CompassTransactionContext.execute(DefaultCompass.java:342) at org.compass.core.transaction.context.TransactionalCallable.call(TransactionalCallable.java:41) at org.compass.core.executor.DefaultExecutorManager.invokeAllWithLimit(DefaultExecutorManager.java:104) at org.compass.core.executor.DefaultExecutorManager.invokeAllWithLimitBailOnException(DefaultExecutorManager.java:73) at org.compass.core.lucene.engine.transaction.readcommitted.ReadCommittedTransaction.doCommit(ReadCommittedTransaction.java:142) at org.compass.core.lucene.engine.transaction.AbstractTransaction.commit(AbstractTransaction.java:98) at org.compass.core.lucene.engine.LuceneSearchEngine.commit(LuceneSearchEngine.java:172) at org.compass.core.transaction.LocalTransaction.doCommit(LocalTransaction.java:97) at org.compass.core.transaction.AbstractTransaction.commit(AbstractTransaction.java:46) at org.compass.core.CompassTemplate.execute(CompassTemplate.java:131) at org.compass.core.CompassTemplate.execute(CompassTemplate.java:112) at asl.simplesearch.compass.CompassService.createCall(Unknown Source) at asl.util.IndexCreator.createIndex(Unknown Source) at asl.util.IndexCreator.start(Unknown Source) at asl.util.IndexCreatorTestCase.main(IndexCreatorTestCase.java:20) -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1390: - Priority: Minor (was: Major) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Fix Version/s: 2.9 add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter Key: LUCENE-1390 URL: https://issues.apache.org/jira/browse/LUCENE-1390 Project: Lucene - Java Issue Type: Improvement Components: Analysis Environment: any Reporter: Andi Vajda Priority: Minor Fix For: 2.9 Attachments: ISOLatinAccentFilter.java The ISOLatin1AccentFilter is removing accents from accented characters in the ISO Latin 1 character set. It does what it does and there is no bug with it. It would be nicer, though, if there was a more comprehensive version of this code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin Extended A unicode blocks. See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block That way, all languages using roman characters are covered. A new class, ISOLatinAccentFilter is attached. It is intended to supercede ISOLatin1AccentFilter which should get deprecated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
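One broader-coverage approach to the same goal (this is a sketch of an alternative technique, not the attached ISOLatinAccentFilter's code) is Unicode decomposition: NFD splits accented characters into base letter plus combining marks, and stripping the combining-marks block then handles Latin-1 Supplement and much of Latin Extended-A in one pass. Note it does not cover ligature-style mappings such as "æ" to "ae", which a table-driven filter can.

```java
import java.text.Normalizer;

// Alternative accent-stripping sketch using java.text.Normalizer:
// decompose to NFD, then delete the combining diacritical marks the
// decomposition leaves behind.
public class AccentStripSketch {
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // U+0300..U+036F: the combining marks split off the base letters
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // é and è are Latin-1 Supplement; ő (U+0151) is Latin Extended-A
        System.out.println(stripAccents("\u00e9l\u00e8ve \u0151")); // eleve o
    }
}
```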
[jira] Commented: (LUCENE-112) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale
[ https://issues.apache.org/jira/browse/LUCENE-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12630265#action_12630265 ] Otis Gospodnetic commented on LUCENE-112: - +1 for closing it. Half a decade ago [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale Key: LUCENE-112 URL: https://issues.apache.org/jira/browse/LUCENE-112 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: CVS Nightly - Specify date in submission Environment: Operating System: All Platform: All Reporter: Eric Isakson Priority: Minor Attachments: IdleTimeoutRefreshingIndexReader.html, IdleTimeoutRefreshingIndexReader.java Here is a little something I worked on this weekend that I wanted to contribute back as I think others might find it very useful. I extended IndexReader and added support for configuring an idle timeout and refresh interval. It uses a monitoring thread to watch for the reader going idle. When the reader goes idle it is closed. When the index is read again it is re-opened. It uses another thread to periodically check when the reader needs to be refreshed due to a change to the index. When the reader is stale, it closes the reader and reopens the index. It is actually delegating all the work to another IndexReader implementation and just handling the threading and synchronization. When it closes a reader, it delegates the close to another thread that waits a bit (configurable how long) before actually closing the reader it was delegating to. This gives any consumers of the original reader a chance to finish up their last action on the reader. 
This implementation sacrifices a little bit of speed since there is a bit more synchronization to deal with and the delegation model puts extra calls on the stack, but it should spare long-running applications that have idle periods or frequently changing indices from having to open and close readers all the time or hold open unused resources. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
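The idle-timeout mechanism described above can be sketched minimally (this is not the attached IdleTimeoutRefreshingIndexReader; time is passed in explicitly here so the sketch is testable without real threads): the wrapper records the last access time, a monitor check closes the underlying resource once the idle window elapses, and the next access reopens it lazily.

```java
// Minimal sketch of the idle-timeout pattern: track last access,
// close on the monitor's tick when idle too long, reopen on next use.
public class IdleCloseSketch {
    private final long idleTimeoutMillis;
    private long lastAccessMillis;
    private boolean open = true;

    public IdleCloseSketch(long idleTimeoutMillis, long nowMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
        this.lastAccessMillis = nowMillis;
    }

    public void access(long nowMillis) { // a read hits the wrapper
        if (!open) open = true;          // reopen lazily on demand
        lastAccessMillis = nowMillis;
    }

    public void monitorTick(long nowMillis) { // the monitor thread's check
        if (open && nowMillis - lastAccessMillis >= idleTimeoutMillis) {
            open = false;                 // close the idle resource
        }
    }

    public boolean isOpen() { return open; }

    public static void main(String[] args) {
        IdleCloseSketch r = new IdleCloseSketch(1000, 0);
        r.monitorTick(500);              // still within the idle window
        System.out.println(r.isOpen()); // true
        r.monitorTick(1500);             // idle too long: closed
        System.out.println(r.isOpen()); // false
        r.access(1600);                  // next read reopens
        System.out.println(r.isOpen()); // true
    }
}
```

In the real contribution the monitor runs on its own thread and the close is deferred briefly so in-flight readers can finish.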
[jira] Issue Comment Edited: (LUCENE-1381) Hanging while indexing/digesting on multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12630270#action_12630270 ] otis edited comment on LUCENE-1381 at 9/11/08 10:47 AM: David, why not bring this up on java-user list first? Are your 4 IndexWriters writing to the same index? Is this really a Lucene problem? (I don't see any mentions of Lucene in those traces) was (Author: otis): David, why not bring this up on java-user list first? Are your 4 IndexWriters writing to the same index? Hanging while indexing/digesting on multiple threads Key: LUCENE-1381 URL: https://issues.apache.org/jira/browse/LUCENE-1381 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.3.2 Environment: Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed mode) on 2.6.9-78.0.1.ELsmp #1 SMP x86_64 x86_64 x86_64 GNU/Linux Reporter: David Fertig With several older lucene projects already running and stable, I have recently written a multi-threading indexer using the 2.3.2 release. My volume is in the millions of documents indexed daily and I have been stress testing for a while now. My current setup has 3 JVMs, each running 6 threads indexing different documents, with 1 IndexWriter per JVM. For stability testing, the indexer shuts down and exits every 5-10 minutes, and a new JVM is started again for a clean restart. At this rate, I have noticed a rare, but eventually consistent internal hang/deadlock in all indexer threads while parsing documents. My 'manager' thread is alive and regularly polling the indexer threads and displaying their state variables, but the indexer threads themselves appear not to be making progress while using up nearly 100% of available CPU. Memory usage is relatively low and stable at 481m out of 2048m available. 
Most stack traces look like the following, and stay in this state even after repeated inspections (pressing CTRL-\ in the active JVM window): -- Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed mode): Thread-6 prio=1 tid=0x002b25750920 nid=0x34f6 runnable [0x41465000..0x41465db0] at java.util.WeakHashMap.eq(WeakHashMap.java:254) at java.util.WeakHashMap.get(WeakHashMap.java:345) at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530) at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209) at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625) at org.apache.commons.digester.Rule.end(Rule.java:230) at org.apache.commons.digester.Digester.endElement(Digester.java:1130) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) at org.apache.commons.digester.Digester.parse(Digester.java:1685) ...
Thread-5 prio=1 tid=0x002b25754eb0 nid=0x34f5 runnable [0x41364000..0x41364d30] at java.lang.String.equals(String.java:858) at org.apache.commons.beanutils.MethodUtils$MethodDescriptor.equals(MethodUtils.java:833) at java.util.WeakHashMap.eq(WeakHashMap.java:254) at java.util.WeakHashMap.get(WeakHashMap.java:345) at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530) at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209) at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625) at org.apache.commons.digester.Rule.end(Rule.java:230) at org.apache.commons.digester.Digester.endElement(Digester.java:1130) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) at
[jira] Commented: (LUCENE-1381) Hanging while indexing/digesting on multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12630270#action_12630270 ] Otis Gospodnetic commented on LUCENE-1381: -- David, why not bring this up on java-user list first? Are your 4 IndexWriters writing to the same index?
[jira] Resolved: (LUCENE-1381) Hanging while indexing/digesting on multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1381. -- Resolution: Invalid This is a new piece of code and the stack trace doesn't show Lucene, so I'm marking this as Invalid for now.
[jira] Commented: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12629631#action_12629631 ] Otis Gospodnetic commented on LUCENE-1378: -- Eh, rusty perl $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/\* @author.*//' Doesn't work -- that \* in front of @author doesn't cut it. Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Assignee: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'
[jira] Created: (LUCENE-1378) Remove remaining @author references
Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'
[jira] Updated: (LUCENE-1378) Remove remaining @author references
[ https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1378: - Attachment: LUCENE-1378.patch Remove remaining @author references --- Key: LUCENE-1378 URL: https://issues.apache.org/jira/browse/LUCENE-1378 Project: Lucene - Java Issue Type: Task Reporter: Otis Gospodnetic Priority: Trivial Fix For: 2.4 Attachments: LUCENE-1378.patch $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl -pi -e 's/ [EMAIL PROTECTED]//'
[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12628355#action_12628355 ] Otis Gospodnetic commented on LUCENE-1131: -- I think so - applies and compiles. Add numDeletedDocs to IndexReader - Key: LUCENE-1131 URL: https://issues.apache.org/jira/browse/LUCENE-1131 Project: Lucene - Java Issue Type: New Feature Reporter: Shai Erera Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: LUCENE-1131.patch Add numDeletedDocs to IndexReader. Basically, the implementation is as simple as doing: public int numDeletedDocs() { return deletedDocs == null ? 0 : deletedDocs.count(); } in SegmentReader. Patch to follow to include in all IndexReader extensions.
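The SegmentReader-level count above can be expressed equivalently from numbers any reader already exposes; a minimal sketch (class and method names hypothetical, not from the patch):

```java
// Sketch: the deleted-doc count is the gap between the doc id bound
// (maxDoc) and the number of live documents (numDocs). When there are
// no deletions the gap is zero, matching deletedDocs == null above.
public class NumDeletedDocsSketch {
    public static int numDeletedDocs(int maxDoc, int numDocs) {
        return maxDoc - numDocs;
    }

    public static void main(String[] args) {
        System.out.println(numDeletedDocs(100, 97)); // 3 docs deleted
    }
}
```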
[jira] Commented: (LUCENE-1366) Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS
[ https://issues.apache.org/jira/browse/LUCENE-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12626654#action_12626654 ] Otis Gospodnetic commented on LUCENE-1366: -- I like the name choices - they read nicely, are easy to understand, and match what actually happens. Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS -- Key: LUCENE-1366 URL: https://issues.apache.org/jira/browse/LUCENE-1366 Project: Lucene - Java Issue Type: Improvement Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.4 Attachments: LUCENE-1366.patch There is confusion about these current Field options and I think we should rename them, deprecating the old names in 2.4/2.9 and removing them in 3.0. How about this: {code} TOKENIZED -> ANALYZED UN_TOKENIZED -> NOT_ANALYZED NO_NORMS -> NOT_ANALYZED_NO_NORMS {code} Should we also add ANALYZED_NO_NORMS? Spinoff from here: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200808.mbox/%3C48a3076a.2679420a.1c53.a5c4%40mx.google.com%3E
[jira] Assigned: (LUCENE-1360) A Similarity class which has unique length norms for numTerms &lt;= 10
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1360: Assignee: Otis Gospodnetic A Similarity class which has unique length norms for numTerms &lt;= 10 --- Key: LUCENE-1360 URL: https://issues.apache.org/jira/browse/LUCENE-1360 Project: Lucene - Java Issue Type: Improvement Reporter: Sean Timm Assignee: Otis Gospodnetic Priority: Trivial Attachments: ShortFieldNormSimilarity.java A Similarity class which extends DefaultSimilarity and simply overrides lengthNorm. lengthNorm is implemented as a lookup for numTerms &lt;= 10, else as {{1/sqrt(numTerms)}}. This avoids term counts below 11 mapping to the same lengthNorm after being stored as a single byte in the index. This is useful if your search is only on short fields such as titles or product descriptions. See mailing list discussion: http://www.nabble.com/How-to-boost-the-score-higher-in-case-user-query-matches-entire-field-value-than-just-some-words-within-a-field-td19079221.html
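The attached ShortFieldNormSimilarity.java is not reproduced in this thread; the following is only a sketch of the idea under stated assumptions (the lookup values are hypothetical, chosen to stay distinct after Lucene's 8-bit norm encoding):

```java
// Sketch of a lengthNorm with a lookup table for short fields
// (numTerms <= 10), falling back to 1/sqrt(numTerms) otherwise.
public class ShortFieldNormSketch {
    // Hypothetical, strictly decreasing norms for lengths 0..10 so that
    // each short length keeps a distinct value when quantized to a byte.
    private static final float[] NORMS = {
        0.0f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f,
        0.4375f, 0.375f, 0.3125f, 0.25f, 0.1875f
    };

    public static float lengthNorm(int numTerms) {
        if (numTerms >= 0 && numTerms <= 10) {
            return NORMS[numTerms];
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + " terms -> norm " + lengthNorm(n));
        }
    }
}
```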
[jira] Closed: (LUCENE-1275) Expose Document Number
[ https://issues.apache.org/jira/browse/LUCENE-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic closed LUCENE-1275. Resolution: Invalid Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Expose Document Number -- Key: LUCENE-1275 URL: https://issues.apache.org/jira/browse/LUCENE-1275 Project: Lucene - Java Issue Type: New Feature Components: Index, Store Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.9, 3.0 Environment: All Reporter: Hasan Diwan Priority: Minor Attachments: lucene.pat Lucene maintains an internal document number, which this patch exposes using a mutator/accessor pair of methods. The field is set on document addition. This creates a unique way to refer to a document for editing and updating individual documents.
[jira] Commented: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12623433#action_12623433 ] Otis Gospodnetic commented on LUCENE-1219: -- Eks Dev: out of curiosity, did you ever measure the before/after performance difference? If so, what numbers did you see? support array/offset/length setters for Field with binary data --- Key: LUCENE-1219 URL: https://issues.apache.org/jira/browse/LUCENE-1219 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Eks Dev Assignee: Michael McCandless Priority: Minor Fix For: 2.4 Attachments: LUCENE-1219.extended.patch, LUCENE-1219.extended.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.take2.patch, LUCENE-1219.take3.patch Currently the Field/Fieldable interface supports only compact, zero-based byte arrays. This forces end users to create and copy the content of new objects before passing them to Lucene, as such fields are often of variable size. Depending on the use case, this can bring a far from negligible performance improvement. This approach extends the Fieldable interface with 3 new methods: getOffset(), getLength(), and getBinaryValue() (the last only returns a reference to the array).
[jira] Updated: (LUCENE-1359) FrenchAnalyzer's tokenStream method does not honour the contract of Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1359: - Summary: FrenchAnalyzer's tokenStream method does not honour the contract of Analyzer (was: FrenchAnalyzer's tokenStream method does not honour the contact of Analyzer) FrenchAnalyzer's tokenStream method does not honour the contract of Analyzer Key: LUCENE-1359 URL: https://issues.apache.org/jira/browse/LUCENE-1359 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.2 Reporter: Andrew Lynch In {{Analyzer}}: {code} /** Creates a TokenStream which tokenizes all the text in the provided Reader. Default implementation forwards to tokenStream(Reader) for compatibility with older versions. Override to allow Analyzer to choose strategy based on document and/or field. Must be able to handle null field name for backward compatibility. */ public abstract TokenStream tokenStream(String fieldName, Reader reader); {code} and in {{FrenchAnalyzer}} {code} public final TokenStream tokenStream(String fieldName, Reader reader) { if (fieldName == null) throw new IllegalArgumentException("fieldName must not be null"); if (reader == null) throw new IllegalArgumentException("reader must not be null"); {code}
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1124: - Summary: short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity (was: short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity --- Key: LUCENE-1124 URL: https://issues.apache.org/jira/browse/LUCENE-1124 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Hoss Man Attachments: LUCENE-1124.patch, LUCENE-1124.patch I found this (unreplied to) email floating around in my Lucene folder from during the holidays... {noformat} From: Timo Nentwig To: java-dev Subject: Fuzzy makes no sense for short tokens Date: Mon, 31 Dec 2007 16:01:11 +0100 Message-Id: [EMAIL PROTECTED] Hi! It generally makes no sense to search fuzzy for short tokens, because changing even only a single character of course already results in a high edit distance. So it actually only makes sense in this case: if (token.length() > 1f / (1f - minSimilarity)) E.g. changing one character in a 3-letter token (foo) results in a similarity of 0.6. And if minSimilarity (which is by default 0.5 :-) is higher, we can save all the expensive rewrite() logic. {noformat} I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter than some simple math on the minSimilarity. (I'm not smart enough to be certain that the math above is right, however ... it's been a while since I looked at Levenshtein distances ... tests needed)
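The proposed guard can be sketched as a standalone helper (names hypothetical, not Lucene's actual rewrite code; boundary behavior depends on whether minSimilarity is an exclusive threshold):

```java
// Sketch: decide whether a fuzzy rewrite can be skipped entirely.
// When token.length() * (1 - minSimilarity) < 1, even a single edit
// drops the similarity below minSimilarity, so only an exact match
// can qualify and the expensive term enumeration is unnecessary.
public class FuzzyShortCircuit {
    public static boolean canSkipRewrite(String term, float minSimilarity) {
        return term.length() < 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        // With minSimilarity = 0.5 the threshold is 2: a 1-char token can
        // be short-circuited, a 3-char token still needs the rewrite.
        System.out.println(canSkipRewrite("f", 0.5f));
        System.out.println(canSkipRewrite("foo", 0.5f));
    }
}
```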
[jira] Commented: (LUCENE-1358) Deadlock for some Query objects in the equals method (f.ex. PhraseQuery) in a concurrent environment
[ https://issues.apache.org/jira/browse/LUCENE-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12623437#action_12623437 ] Otis Gospodnetic commented on LUCENE-1358: -- It sounds like you are simply demonstrating an old bug, right? If so, then we can close this issue, since LUCENE-1346 fixed the bug you described (I didn't verify that). Deadlock for some Query objects in the equals method (f.ex. PhraseQuery) in a concurrent environment Key: LUCENE-1358 URL: https://issues.apache.org/jira/browse/LUCENE-1358 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 2.3.2 Reporter: Torbjørn Køhler Priority: Minor Attachments: TestDeadLock.java Original Estimate: 0h Remaining Estimate: 0h Some Query objects in Lucene 2.3.2 (and previous versions) have internal variables using Vector. These variables are used during the call to the equals method. In a concurrent environment a deadlock might occur. The attached code example shows this happening in Lucene 2.3.2, but the patch in LUCENE-1346 fixes this issue (though that doesn't seem to be the intention of that patch according to the description :-)
[jira] Commented: (LUCENE-1275) Expose Document Number
[ https://issues.apache.org/jira/browse/LUCENE-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622903#action_12622903 ] Otis Gospodnetic commented on LUCENE-1275: -- Hasan, please see Hoss' and my comments above. Expose Document Number -- Key: LUCENE-1275 URL: https://issues.apache.org/jira/browse/LUCENE-1275 Project: Lucene - Java Issue Type: New Feature Components: Index, Store Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.9, 3.0 Environment: All Reporter: Hasan Diwan Priority: Minor Attachments: lucene.pat Lucene maintains an internal document number, which this patch exposes using a mutator/accessor pair of methods. The field is set on document addition. This creates a unique way to refer to a document for editing and updating individual documents.
[jira] Commented: (LUCENE-1308) Remove String.intern() from Field.java to increase performance and lower contention
[ https://issues.apache.org/jira/browse/LUCENE-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12605832#action_12605832 ] Otis Gospodnetic commented on LUCENE-1308: -- Rene, can you provide a patch along with unit tests? Have you or can you run contrib/benchmarks and include your before-the-changes and after-the-changes results here, so we can see what difference this change makes? Thanks. Remove String.intern() from Field.java to increase performance and lower contention --- Key: LUCENE-1308 URL: https://issues.apache.org/jira/browse/LUCENE-1308 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.3.2 Reporter: Rene S Right now, document.Field is interning all field names. While this makes sense because it lowers the overall memory consumption, the String method intern() is known to be difficult to handle: 1) it is a native call and therefore slower than anything on the Java level; 2) the String pool is part of the perm space and not of the general heap, so its size is more restricted and needs extra VM params to be managed; 3) some VMs show GC problems with strings in the string pool. The suggested solution is a WeakHashMap instead, which takes care of unifying the String instances while keeping the pool in the heap space and releasing a String when it is no longer needed. For extra performance in a concurrent environment, a ConcurrentHashMap-like implementation of a weak hashmap is recommended, because we mostly read from the pool. We saw a 10% improvement in throughput and response time of our application, and the application is not only doing searches (we read a lot of documents from the results). So a single measurement test case could show even more improvement in single and concurrent usage.
The cache: /** Cache to replace the expensive String.intern() call with a Java-level version */ private final static Map<String, WeakReference<String>> unifiedStringsCache = Collections.synchronizedMap(new WeakHashMap<String, WeakReference<String>>(109)); The access to it, instead of this.name = name.intern(): // unify the strings, but do not use the expensive String.intern() version // which is not weak enough, uses the perm space and is a native call String unifiedName = null; WeakReference<String> ref = unifiedStringsCache.get(name); if (ref != null) { unifiedName = ref.get(); } if (unifiedName == null) { unifiedStringsCache.put(name, new WeakReference<String>(name)); unifiedName = name; } this.name = unifiedName; I guess it is sufficient to have mostly all field names interned, so I skipped the additional synchronization around the access and take the risk that only 99.99% :) of all field names are interned.
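The snippet above can be assembled into a small self-contained class; this is a sketch of the described technique (class name hypothetical), not the actual Field.java patch:

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the weak interning cache described above: unify equal String
// instances on the heap instead of calling the native String.intern().
public class WeakInterner {
    private static final Map<String, WeakReference<String>> CACHE =
        Collections.synchronizedMap(
            new WeakHashMap<String, WeakReference<String>>(109));

    public static String intern(String s) {
        WeakReference<String> ref = CACHE.get(s);
        String unified = (ref != null) ? ref.get() : null;
        if (unified == null) {
            // Key and value both refer to the same string only weakly, so
            // the entry can be collected once no caller holds the string.
            CACHE.put(s, new WeakReference<String>(s));
            unified = s;
        }
        return unified;
    }

    public static void main(String[] args) {
        String a = new String("title");
        String b = new String("title");
        // Distinct objects, equal content: both map to one canonical instance.
        System.out.println(intern(a) == intern(b));
    }
}
```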
[jira] Updated: (LUCENE-1142) Updated Snowball package
[ https://issues.apache.org/jira/browse/LUCENE-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1142: - Fix Version/s: 2.4 Updated Snowball package Key: LUCENE-1142 URL: https://issues.apache.org/jira/browse/LUCENE-1142 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Karl Wettin Priority: Minor Fix For: 2.4 Attachments: snowball.tartarus.txt Updated Snowball contrib package: * New org.tartarus.snowball java package with SnowballProgram patched to be abstract, to avoid using reflection. * Introduces Hungarian, Turkish and Romanian stemmers. * Introduces the constructor SnowballFilter(SnowballProgram). It is possible that some changes have been made to some of the stemmer algorithms between this patch and the current SVN trunk of Lucene; an index might thus not be compatible with the new stemmers! The API is backward compatible and the tests pass.
[jira] Assigned: (LUCENE-1180) Syns2Index fails
[ https://issues.apache.org/jira/browse/LUCENE-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1180: Assignee: Otis Gospodnetic Syns2Index fails Key: LUCENE-1180 URL: https://issues.apache.org/jira/browse/LUCENE-1180 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.3 Reporter: Jeffrey Yang Assignee: Otis Gospodnetic Priority: Minor Attachments: syns2index_fix_2.3.patch, syns2index_fix_2.4-dev.patch Original Estimate: 1h Remaining Estimate: 1h Running Syns2Index fails with a java.lang.IllegalArgumentException: "maxBufferedDocs must at least be 2 when enabled". at org.apache.lucene.index.IndexWriter.setMaxBufferedDocs(IndexWriter.java:883) at org.apache.lucene.wordnet.Syns2Index.index(Syns2Index.java:249) at org.apache.lucene.wordnet.Syns2Index.main(Syns2Index.java:208) The code is here: // blindly up these parameters for speed writer.setMergeFactor( writer.getMergeFactor() * 2); writer.setMaxBufferedDocs( writer.getMaxBufferedDocs() * 2); It looks like getMaxBufferedDocs() used to return 10, and now it returns -1; not sure when that started happening. My suggestion would be to just remove these three lines. Since speed has already improved vastly, there isn't a need to speed things up. To run it, Syns2Index requires two args: the first is the location of the wn_s.pl file, and the second is the directory to create the index in.
[jira] Created: (LUCENE-1307) Remove Contributions page
Remove Contributions page - Key: LUCENE-1307 URL: https://issues.apache.org/jira/browse/LUCENE-1307 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Otis Gospodnetic Priority: Minor On Fri, May 16, 2008 at 10:06 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hola, Does anyone think the Contributions page should be removed? http://lucene.apache.org/java/2_3_2/contributions.html It looks so outdated that I think it may give newcomers a bad impression of Lucene ("What, this is it for contributions?"). The only really valuable piece there is Luke, but Luke must be mentioned in a dozen places on the Wiki anyway. Should we remove the Contributions page? Yonik and Grant gave their +1s. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12605120#action_12605120 ] Otis Gospodnetic commented on LUCENE-1306: -- Should there be a way for the client of this class to specify the prefix and suffix char? Is having, for example, ^h as the first bi-gram token really the right thing to do? Would ^he make more sense? I know that makes it 3 characters long, but it's 2 chars from the input string. Not sure, so I'm asking. Is this primarily to distinguish between the edge and inner n-grams? If so, would it make more sense to just make use of the Token type variable instead? CombinedNGramTokenFilter Key: LUCENE-1306 URL: https://issues.apache.org/jira/browse/LUCENE-1306 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Karl Wettin Assignee: Karl Wettin Priority: Trivial Attachments: LUCENE-1306.txt Alternative NGram filter that produces tokens with composite prefix and suffix markers. {code:java} ts = new WhitespaceTokenizer(new StringReader("hello")); ts = new CombinedNGramTokenFilter(ts, 2, 2); assertNext(ts, "^h"); assertNext(ts, "he"); assertNext(ts, "el"); assertNext(ts, "ll"); assertNext(ts, "lo"); assertNext(ts, "o$"); assertNull(ts.next()); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
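The token stream quoted in the issue can be reproduced by a minimal standalone sketch (hypothetical code, not the attached patch): pad the term with the '^' and '$' markers, then emit every n-gram of the padded string.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the behavior shown in the issue's test: "hello" with
// n=2 yields ^h, he, el, ll, lo, o$ (edge grams carry the boundary markers).
public class CombinedNGramDemo {
    public static List<String> ngrams(String term, int n) {
        String padded = "^" + term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("hello", 2)); // [^h, he, el, ll, lo, o$]
    }
}
```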
[jira] Updated: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-1297: - Attachment: LUCENE-1297.patch Attaching a new version (only added ASL 2.0 to StringDistance + typo fix) Question (why - what does it do?) about this TRStringDistance change: -return p[n]; +return 1.0f - ((float) p[n] / Math.min(other.length(), sa.length)); Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: LUCENE-1297.patch, LUCENE-1297.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
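To the "what does it do?" question: p[n] holds the raw Levenshtein edit distance, so the new return value rescales it into a similarity-like score. A hypothetical illustration (not the Lucene source; the distances are assumed, not computed here):

```java
// Sketch of the quoted diff's arithmetic. The raw distance between "kitten"
// (6 chars) and "sitting" (7 chars) is 3; the diff divides by the *minimum*
// of the two lengths, which can push the score below 0 for very different
// strings, whereas a max-length denominator would keep it in [0, 1].
public class NormalizationDemo {
    public static float similarity(int distance, int lenA, int lenB) {
        return 1.0f - (float) distance / Math.min(lenA, lenB);
    }

    public static void main(String[] args) {
        System.out.println(similarity(3, 6, 7)); // 1 - 3/6 = 0.5
        // distance 3 between "a" and "xyz": 1 - 3/1 = -2.0 (negative!)
        System.out.println(similarity(3, 1, 3));
    }
}
```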
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12604538#action_12604538 ] Otis Gospodnetic commented on LUCENE-1297: -- Tom, I agree with Grant and I'll assume you'll update the patch. As for that TRStringDistance -> LevensteinDistance rename, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit. Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance3.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1178) Hits does not use MultiSearcher's createWeight
[ https://issues.apache.org/jira/browse/LUCENE-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1178. -- Resolution: Won't Fix With Hits getting deprecated, I think it doesn't make sense to pursue this. If anyone disagrees, we can reopen. Hits does not use MultiSearcher's createWeight -- Key: LUCENE-1178 URL: https://issues.apache.org/jira/browse/LUCENE-1178 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Reporter: Israel Tsadok Assignee: Otis Gospodnetic Attachments: hits.diff I am developing a distributed index, using MultiSearcher and RemoteSearcher. When investigating some performance issues, I noticed that there is a lot of back-and-forth traffic between the servers during the weight calculation. Although MultiSearcher has a method called createWeight that minimizes the calls to the sub-searchers, this method never actually gets called when I call search(query). From what I can tell, this is fixable by changing in Hits.java the line: weight = q.weight(s); to: weight = s.createWeight(q); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12604121#action_12604121 ] Otis Gospodnetic commented on LUCENE-1297: -- Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page. I see JaroWinklerDistance.java doesn't have ASL on top. Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply. Thanks. Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance.patch2 Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12602151#action_12602151 ] Otis Gospodnetic commented on LUCENE-1297: -- Thomas - any chance you can write a simple unit test that exercises JaroWinkler? Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12601335#action_12601335 ] Otis Gospodnetic commented on LUCENE-1297: -- You read my mind, Thomas. Would it be appropriate to add and try Jaccard index and Dice coefficient, too, then? Allow other string distance measures in spellchecker Key: LUCENE-1297 URL: https://issues.apache.org/jira/browse/LUCENE-1297 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.4 Environment: n/a Reporter: Thomas Morton Assignee: Otis Gospodnetic Priority: Minor Fix For: 2.4 Attachments: string_distance.patch Updated spelling code to allow for other string distance measures to be used. Created StringDistance interface. Modified existing Levenshtein distance measure to implement interface (and renamed class). Verified that change to Levenshtein distance didn't impact runtime performance. Implemented Jaro/Winkler distance metric. Modified SpellChecker to take distance measure in constructor or in set method and to use interface when calling. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
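For the two measures mentioned above, a minimal sketch (hypothetical code, not a Lucene StringDistance implementation) computed over character bigrams, one common choice for string similarity:

```java
import java.util.HashSet;
import java.util.Set;

// Jaccard index = |X ∩ Y| / |X ∪ Y|; Dice coefficient = 2|X ∩ Y| / (|X| + |Y|),
// where X and Y are the bigram sets of the two strings.
public class SetOverlapDemo {
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    public static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        int total = x.size() + y.size();
        return total == 0 ? 1.0 : 2.0 * inter.size() / total;
    }

    public static void main(String[] args) {
        // "night" vs "nacht": bigram sets share only "ht" (1 of 7 in the union)
        System.out.println(jaccard("night", "nacht"));
        System.out.println(dice("night", "nacht")); // 2*1/(4+4) = 0.25
    }
}
```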
[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12601340#action_12601340 ] Otis Gospodnetic commented on LUCENE-1295: -- I think cosmetic changes are OK if: * they are not mixed with functional changes * there are no patches for the cleaned-up class(es) in JIRA In this case I see only a couple of MLT issues, all of which look like we can take care of them quickly, and then somebody can clean up a little if we feel like it. Anyhow... Make retrieveTerms(int docNum) public in MoreLikeThis - Key: LUCENE-1295 URL: https://issues.apache.org/jira/browse/LUCENE-1295 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial Attachments: LUCENE-1295.patch It would be useful if {code} private PriorityQueue retrieveTerms(int docNum) throws IOException { {code} were public, since it is similar in use to {code} public PriorityQueue retrieveTerms(Reader r) throws IOException { {code} It also seems useful to add {code} public String [] retrieveInterestingTerms(int docNum) throws IOException{ {code} to mirror the one that works on Reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text
[ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-725: --- Assignee: Otis Gospodnetic NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all boilerplate text --- Key: LUCENE-725 URL: https://issues.apache.org/jira/browse/LUCENE-725 Project: Lucene - Java Issue Type: New Feature Components: Analysis Reporter: Mark Harwood Assignee: Otis Gospodnetic Priority: Minor Attachments: NovelAnalyzer.java, NovelAnalyzer.java This is a class I have found to be useful for analyzing small (in the hundreds) collections of documents and removing any duplicate content such as standard disclaimers or repeated text in an exchange of emails. This has applications in sampling query results to identify key phrases, improving speed-reading of results with similar content (eg email threads/forum messages) or just removing duplicated noise from a search index. To be more generally useful it needs to scale to millions of documents - in which case an alternative implementation is required. See the notes in the Javadocs for this class for more discussion on this -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis
[ https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12600679#action_12600679 ] Otis Gospodnetic commented on LUCENE-1295: -- Perque no. I see MLT is full of tabs, should you feel like fixing the formatting. Make retrieveTerms(int docNum) public in MoreLikeThis - Key: LUCENE-1295 URL: https://issues.apache.org/jira/browse/LUCENE-1295 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial Attachments: LUCENE-1295.patch It would be useful if {code} private PriorityQueue retrieveTerms(int docNum) throws IOException { {code} were public, since it is similar in use to {code} public PriorityQueue retrieveTerms(Reader r) throws IOException { {code} It also seems useful to add {code} public String [] retrieveInterestingTerms(int docNum) throws IOException{ {code} to mirror the one that works on Reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-1178) Hits does not use MultiSearcher's createWeight
[ https://issues.apache.org/jira/browse/LUCENE-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1178: Assignee: Otis Gospodnetic Hits does not use MultiSearcher's createWeight -- Key: LUCENE-1178 URL: https://issues.apache.org/jira/browse/LUCENE-1178 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Reporter: Israel Tsadok Assignee: Otis Gospodnetic Attachments: hits.diff I am developing a distributed index, using MultiSearcher and RemoteSearcher. When investigating some performance issues, I noticed that there is a lot of back-and-forth traffic between the servers during the weight calculation. Although MultiSearcher has a method called createWeight that minimizes the calls to the sub-searchers, this method never actually gets called when I call search(query). From what I can tell, this is fixable by changing in Hits.java the line: weight = q.weight(s); to: weight = s.createWeight(q); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-954) Toggle score normalization in Hits
[ https://issues.apache.org/jira/browse/LUCENE-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12600366#action_12600366 ] Otis Gospodnetic commented on LUCENE-954: - I suppose there is now suddenly no need to work on Hits. I'll resolve this as Won't Fix in a few days, unless somebody has some more thoughts on this. Toggle score normalization in Hits -- Key: LUCENE-954 URL: https://issues.apache.org/jira/browse/LUCENE-954 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.2, 2.3, 2.3.1, 2.4 Environment: any Reporter: Christian Kohlschütter Assignee: Otis Gospodnetic Fix For: 2.4 Attachments: hits-scoreNorm.patch, LUCENE-954.patch The current implementation of the Hits class sometimes performs score normalization. In particular, whenever the top-ranked score is bigger than 1.0, it is normalized to a maximum of 1.0. In this case, Hits may return different score results than TopDocs-based methods. In my scenario (a federated search system), Hits delivered just plain wrong results. I was merging results from several sources, all having homogeneous statistics (similar to MultiSearcher, but over the Internet using HTTP/XML-based protocols). Sometimes, some of the sources had a top-score greater than 1, so I ended up with garbled results. I suggest to add a switch to enable/disable this score-normalization at runtime. My patch (attached) has an additional performance benefit, since score normalization now occurs only when Hits#score() is called, not when creating the Hits result list. Whenever scores are not required, you save one multiplication per retrieved hit (i.e., at least 100 multiplications with the current implementation of Hits). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-691) Bob Carpenter's FuzzyTermEnum refactoring
[ https://issues.apache.org/jira/browse/LUCENE-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-691. - Resolution: Duplicate Assignee: Otis Gospodnetic The patch for Bob's change suggestions is in LUCENE-1183, so this issue is redundant. Bob Carpenter's FuzzyTermEnum refactoring - Key: LUCENE-691 URL: https://issues.apache.org/jira/browse/LUCENE-691 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Otis Gospodnetic Assignee: Otis Gospodnetic Priority: Minor I'll just paste Bob's complete email here. I refactored the org.apache.lucene.search.FuzzyTermEnum edit distance implementation. It now only uses a single pair of arrays, and those never get resized. That required turning the order of text/target around in the loops. You'll see that with the pair of arrays method, they get re-used hand-over-hand, and are assigned to local variables in the tight loops. I removed the calculation of min distance and replaced it with a boolean -- the min wasn't needed, only the test vs. the max. I also flipped some variables around so there's one less addition in the very inner loop and the arrays are now looping in the ordinary way (starting at 0 with a comparison). I also commented out the redundant definition of the public close() [which just called super.close() and had none of its own doc.] I also just compute the max distance each time rather than fiddling with an array -- it's just a little arithmetic done once per term, but that could be put back. I also rewrote min(int,int,int) to get rid of intermediate assignments. Is there a lib for this kind of thing? An intermediate refactoring that does the hand-over-hand with the existing array and resizing strategy is in FuzzyTermEnum.intermed.java. 
I ran the unit tests as follows on both versions (my hat's off to the build.xml author(s) -- this all just worked out of the box and was really easy to follow the first time through): C:\java\lucene-2.0.0>ant -Djunit.includes= -Dtestcase=TestFuzzyQuery test Buildfile: build.xml javacc-uptodate-check: javacc-notice: init: common.compile-core: [javac] Compiling 1 source file to C:\java\lucene-2.0.0\build\classes\java compile-core: compile-demo: common.compile-test: compile-test: test: [junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.453 sec BUILD SUCCESSFUL Total time: 2 seconds Does anyone have regression/performance test harnesses? The unit tests were pretty minimal (which is a good thing!). It'd be nice to test the min impl (ternary vs. if/then) and the assumption about not allocating an array of max distances. All told, the refactored version should be a modest speed improvement, mainly from unfolding the arrays so they're local one-dimensional refs. I don't know what the protocol is for one-off contributions. I'm happy with the Apache license, so that shouldn't be a problem. I also don't know whether you use tabs or spaces -- I untabified the final version and used your two-space format in emacs. - Bob Carpenter package org.apache.lucene.search; /** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
*/ import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.Term; import java.io.IOException; /** Subclass of FilteredTermEnum for enumerating all terms that are similar * to the specified filter term. * * <p>Term enumerations are always ordered by Term.compareTo(). Each term in * the enumeration is greater than all that precede it. */ public final class FuzzyTermEnum extends FilteredTermEnum { /* This should be somewhere around the average long word. * If it is longer, we waste time and space. If it is shorter, we waste a * little bit of time growing the array as we encounter longer words. */ private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19; /* Allows us to save the time required to create a new array * every time similarity is called. These are slices that * will be reused during dynamic programming hand-over-hand * style. */ private final int[] d0; private final int[] d1; private
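The hand-over-hand pattern Bob describes can be sketched on its own, without the Lucene scaffolding (a minimal illustration, not the attached refactoring): two rows are reused by swapping references each iteration instead of allocating a full distance matrix.

```java
// Two-row Levenshtein distance: prev holds row i-1, curr is filled as row i,
// then the references are swapped hand-over-hand. Memory is O(min row),
// never resized, matching the single-pair-of-arrays idea described above.
public class TwoRowLevenshtein {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // empty prefix of a
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // hand-over-hand swap
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```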
[jira] Commented: (LUCENE-1293) Tweaks to PhraseQuery.explain()
[ https://issues.apache.org/jira/browse/LUCENE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12599557#action_12599557 ] Otis Gospodnetic commented on LUCENE-1293: -- Itamar - could you explain, in plain English, why the above is better? (sorry, I'm not terribly familiar with PhraseQuery's explain(), so I can't tell why this reordering makes the explain output better). Also, if you have more changes to make, please go ahead and put them in a patch. Thanks! Tweaks to PhraseQuery.explain() --- Key: LUCENE-1293 URL: https://issues.apache.org/jira/browse/LUCENE-1293 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4 Reporter: Itamar Syn-Hershko Priority: Minor Fix For: 2.3.2, 2.4 The explain() function in PhraseQuery.java is very clumsy and could use many optimizations. Perhaps it is only because it is intended for use while debugging? Here's an example: {noformat} result.addDetail(fieldExpl); // combine them result.setValue(queryExpl.getValue() * fieldExpl.getValue()); if (queryExpl.getValue() == 1.0f) return fieldExpl; return result; } {noformat} Can easily be tweaked and become: {noformat} if (queryExpl.getValue() == 1.0f) { return fieldExpl; } result.addDetail(fieldExpl); // combine them result.setValue(queryExpl.getValue() * fieldExpl.getValue()); return result; } {noformat} And that's really just for a start... Itamar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1152) SpellChecker does not work properly on calling indexDictionary after clearIndex
[ https://issues.apache.org/jira/browse/LUCENE-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1152. -- Resolution: Fixed Thank you for the patch! Committed revision 659013. SpellChecker does not work properly on calling indexDictionary after clearIndex --- Key: LUCENE-1152 URL: https://issues.apache.org/jira/browse/LUCENE-1152 Project: Lucene - Java Issue Type: Bug Components: contrib/spellchecker Affects Versions: 2.3 Reporter: Naveen Belkale Assignee: Otis Gospodnetic Priority: Minor Attachments: spellchecker.diff, spellchecker.diff We have to call clearIndex and indexDictionary to rebuild the dictionary from scratch. The call to SpellChecker clearIndex() function does not reset the searcher. Hence, when we call indexDictionary after that, many entries that are already in the stale searcher will not be indexed. Also, I see that IndexReader reader is used for the sole purpose of obtaining the docFreq of a given term in exist() function. This functionality can also be obtained by using just the searcher by calling searcher.docFreq. Thus, can we do away completely with the reader and the code associated with it, like if (IndexReader.isLocked(spellIndex)){ IndexReader.unlock(spellIndex); } and the reader related code in finalize? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)
[ https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598918#action_12598918 ] Otis Gospodnetic commented on LUCENE-1183: -- Committed the TRStringDistance patch -- thank you! Committed revision 659016. I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, Cédrik? TRStringDistance uses way too much memory (with patch) -- Key: LUCENE-1183 URL: https://issues.apache.org/jira/browse/LUCENE-1183 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3 Reporter: Cédrik LIME Assignee: Otis Gospodnetic Priority: Minor Attachments: FuzzyTermEnum.patch, TRStringDistance.java, TRStringDistance.patch Original Estimate: 0.17h Remaining Estimate: 0.17h The implementation of TRStringDistance is based on version 2.1 of org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), which uses an un-optimized implementation of the Levenshtein Distance algorithm (it uses way too much memory). Please see Bug 38911 (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more information. The commons-lang implementation has been heavily optimized as of version 2.2 (3x speed-up). I have ported the new implementation to TRStringDistance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1046) Dead code in SpellChecker.java (branch never executes)
[ https://issues.apache.org/jira/browse/LUCENE-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-1046. -- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Danke schön. Committed revision 659019. Dead code in SpellChecker.java (branch never executes) -- Key: LUCENE-1046 URL: https://issues.apache.org/jira/browse/LUCENE-1046 Project: Lucene - Java Issue Type: Bug Components: contrib/spellchecker Affects Versions: 2.2 Reporter: Joe Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-1046.diff SpellChecker contains the following lines of code: final int goalFreq = (morePopular && ir != null) ? ir.docFreq(new Term(field, word)) : 0; // if the word exists in the real index and we don't care for word frequency, return the word itself if (!morePopular && goalFreq > 0) { return new String[] { word }; } The branch will never execute: the only way for goalFreq to be greater than zero is if morePopular is true, but if morePopular is true, the expression in the if statement evaluates to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
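The dead branch can be demonstrated exhaustively with a tiny stand-in (sketch only, not the SpellChecker source; `wordInIndex` stands in for `ir != null` plus a non-zero docFreq):

```java
// goalFreq can only be non-zero when morePopular is true, so the condition
// (!morePopular && goalFreq > 0) is false for every input combination.
public class DeadBranchDemo {
    static boolean branchTaken(boolean morePopular, boolean wordInIndex) {
        int goalFreq = (morePopular && wordInIndex) ? 1 : 0; // stand-in for ir.docFreq(...)
        return !morePopular && goalFreq > 0;
    }

    public static void main(String[] args) {
        for (boolean mp : new boolean[] {false, true})
            for (boolean in : new boolean[] {false, true})
                System.out.println(branchTaken(mp, in)); // always false
    }
}
```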
[jira] Updated: (LUCENE-852) spellchecker: make hard-coded values configurable
[ https://issues.apache.org/jira/browse/LUCENE-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-852: Attachment: LUCENE-852.patch spellchecker: make hard-coded values configurable - Key: LUCENE-852 URL: https://issues.apache.org/jira/browse/LUCENE-852 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: karin Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-852.patch, LUCENE-852.patch the class org.apache.lucene.search.spell.SpellChecker uses the following hard-coded values in its method indexDictionary: writer.setMergeFactor(300); writer.setMaxBufferedDocs(150); this poses problems when the spellcheck index is created on systems with certain limits, i.e. in unix environments where the ulimit settings are restricted for the user (http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428). there are several ways to circumvent this: 1. add another indexDictionary method with additional parameters: public void indexDictionary (Dictionary dict, int mergeFactor, int maxBufferedDocs) throws IOException 2. add setter methods for mergeFactor and maxBufferedDocs (see code in http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428 ) 3. Make SpellChecker subclassing easier as suggested by Chris Hostetter (see reply http://www.gossamer-threads.com/lists/lucene/java-dev/47463#47463) thanx, karin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
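Option 2 above (setters) could look like the following sketch (hypothetical class and names, no Lucene dependency; the defaults are the hard-coded values quoted in the report):

```java
// Expose the two tuning values through setters, keeping the old hard-coded
// numbers as defaults so existing behavior is unchanged unless overridden.
public class SpellIndexConfig {
    private int mergeFactor = 300;     // previously hard-coded in indexDictionary
    private int maxBufferedDocs = 150; // previously hard-coded in indexDictionary

    public void setMergeFactor(int mergeFactor) { this.mergeFactor = mergeFactor; }
    public void setMaxBufferedDocs(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }
    public int getMergeFactor() { return mergeFactor; }
    public int getMaxBufferedDocs() { return maxBufferedDocs; }
}
```

A restricted environment (e.g. a low ulimit on open files) would then lower mergeFactor before indexing instead of patching the source.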
[jira] Commented: (LUCENE-898) contrib/javascript is not packaged into releases
[ https://issues.apache.org/jira/browse/LUCENE-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598924#action_12598924 ] Otis Gospodnetic commented on LUCENE-898: - I'll take care of this in a few days...it looks like nobody will miss it. contrib/javascript is not packaged into releases Key: LUCENE-898 URL: https://issues.apache.org/jira/browse/LUCENE-898 Project: Lucene - Java Issue Type: Bug Components: Build Reporter: Hoss Man Assignee: Otis Gospodnetic Priority: Trivial The contrib/javascript directory is (apparently) a collection of JavaScript utilities for Lucene, but it has no build files or any mechanism to package it, so it is excluded from releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-852) spellchecker: make hard-coded values configurable
[ https://issues.apache.org/jira/browse/LUCENE-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-852. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Thanks for the patch, Otis. Committed revision 659021. spellchecker: make hard-coded values configurable - Key: LUCENE-852 URL: https://issues.apache.org/jira/browse/LUCENE-852 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: karin Assignee: Otis Gospodnetic Priority: Minor Attachments: LUCENE-852.patch, LUCENE-852.patch The class org.apache.lucene.search.spell.SpellChecker uses the following hard-coded values in its method indexDictionary: writer.setMergeFactor(300); writer.setMaxBufferedDocs(150); This poses problems when the spellcheck index is created on systems with certain limits, e.g. in Unix environments where the ulimit settings are restricted for the user (http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428). There are several ways to circumvent this: 1. add another indexDictionary method with additional parameters: public void indexDictionary(Dictionary dict, int mergeFactor, int maxBufferedDocs) throws IOException 2. add setter methods for mergeFactor and maxBufferedDocs (see code in http://www.gossamer-threads.com/lists/lucene/java-dev/47428#47428 ) 3. Make SpellChecker subclassing easier, as suggested by Chris Hostetter (see reply http://www.gossamer-threads.com/lists/lucene/java-dev/47463#47463) thanx, karin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types
[ https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598419#action_12598419 ] Otis Gospodnetic commented on LUCENE-1285: -- Mark, are you done with this/would you like to commit this? Or should I? (Asking because of SOLR-553) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types -- Key: LUCENE-1285 URL: https://issues.apache.org/jira/browse/LUCENE-1285 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Reporter: Andrzej Bialecki Fix For: 2.4 Attachments: highlighter-test.patch, highlighter.patch Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of clauses. Consequently, the results of highlighting are incorrect. Example text: t1 t2 t3 t4 t2 Example query: t2 t3 t1 t2 Current highlighting: [t1 t2] [t3] t4 t2 Correct highlighting: [t1 t2] [t3] t4 [t2] The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the resulting WeightedSpanTerm will have positionSensitive=true, whereas terms added from a TermQuery have positionSensitive=false. The end result for this particular term will depend on the order in which the clauses are processed. My fix is to use a subclass of Map whose put() always sets the result to the most lax setting, i.e. if we already have a term with positionSensitive=true and we try to put() a term with positionSensitive=false, we set the result to positionSensitive=false, as it will match both cases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
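The Map-subclass fix described above can be sketched in plain Java. This is a hedged illustration of the idea, not the attached highlighter.patch: the tiny `Term` class stands in for Lucene's WeightedSpanTerm, and `put()` merges duplicate entries toward the laxer setting (positionSensitive=false) so the outcome no longer depends on clause order.

```java
import java.util.HashMap;

// Stand-in for Lucene's WeightedSpanTerm: just the text and the
// position-sensitivity flag relevant to the bug.
class Term {
    final String text;
    boolean positionSensitive;
    Term(String text, boolean positionSensitive) {
        this.text = text;
        this.positionSensitive = positionSensitive;
    }
}

// Map whose put() always resolves duplicates to the most lax setting:
// if either occurrence of a term was position-insensitive, the merged
// entry must be too, since it has to match both cases.
class PositionCheckingMap extends HashMap<String, Term> {
    @Override
    public Term put(String key, Term value) {
        Term prev = super.put(key, value);
        if (prev != null && !prev.positionSensitive) {
            value.positionSensitive = false; // laxer setting wins
        }
        return prev;
    }
}
```

With this map, inserting ("t2", positionSensitive=true) from the phrase clause and ("t2", positionSensitive=false) from the term clause yields positionSensitive=false regardless of which clause is processed first.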
[jira] Updated: (LUCENE-112) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale
[ https://issues.apache.org/jira/browse/LUCENE-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-112: Assignee: (was: Eric Isakson) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale Key: LUCENE-112 URL: https://issues.apache.org/jira/browse/LUCENE-112 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: CVS Nightly - Specify date in submission Environment: Operating System: All Platform: All Reporter: Eric Isakson Priority: Minor Attachments: IdleTimeoutRefreshingIndexReader.html, IdleTimeoutRefreshingIndexReader.java Here is a little something I worked on this weekend that I wanted to contribute back as I think others might find it very useful. I extended IndexReader and added support for configuring an idle timeout and refresh interval. It uses a monitoring thread to watch for the reader going idle. When the reader goes idle, it is closed. When the index is read again, it is re-opened. It uses another thread to periodically check when the reader needs to be refreshed due to a change to the index. When the reader is stale, it closes the reader and reopens the index. It is actually delegating all the work to another IndexReader implementation and just handling the threading and synchronization. When it closes a reader, it delegates the close to another thread that waits a bit (configurable how long) before actually closing the reader it was delegating to. This gives any consumers of the original reader a chance to finish up their last action on the reader. This implementation sacrifices a little bit of speed, since there is a bit more synchronization to deal with and the delegation model puts extra calls on the stack, but it should spare long-running applications that have idle periods or frequently changing indices from having to open and close readers all the time or hold open unused resources.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
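The idle-close behavior described above can be sketched deterministically. This is an illustration only, not the attached IdleTimeoutRefreshingIndexReader: the monitoring thread is replaced by an explicit `checkIdle(now)` call so the state transitions are easy to follow, and the delegated IndexReader is reduced to an open/closed flag.

```java
// Deterministic sketch of the idle-timeout idea: reads mark activity
// and (re)open the reader; a periodic check closes it once it has been
// idle for at least the configured timeout. The real contribution does
// this from a monitor thread and delegates to a full IndexReader.
class IdleClosingReaderSketch {
    private final long idleTimeoutMillis;
    private long lastAccessMillis;
    private boolean open = false;

    IdleClosingReaderSketch(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // Every read re-opens the underlying reader if needed and records activity.
    void read(long nowMillis) {
        open = true;
        lastAccessMillis = nowMillis;
    }

    // Stand-in for the monitor thread's periodic check: close when idle.
    void checkIdle(long nowMillis) {
        if (open && nowMillis - lastAccessMillis >= idleTimeoutMillis) {
            open = false; // real code would close the delegate reader here
        }
    }

    boolean isOpen() { return open; }
}
```

The delayed-close refinement (waiting a configurable grace period before actually closing, so in-flight consumers can finish) would add a second timestamp and check, but the open/idle/close cycle above is the core of it.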
[jira] Assigned: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types
[ https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1285: Assignee: Otis Gospodnetic WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types -- Key: LUCENE-1285 URL: https://issues.apache.org/jira/browse/LUCENE-1285 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4 Reporter: Andrzej Bialecki Assignee: Otis Gospodnetic Fix For: 2.4 Attachments: highlighter-test.patch, highlighter.patch Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of clauses. Consequently, the results of highlighting are incorrect. Example text: t1 t2 t3 t4 t2 Example query: t2 t3 t1 t2 Current highlighting: [t1 t2] [t3] t4 t2 Correct highlighting: [t1 t2] [t3] t4 [t2] The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the resulting WeightedSpanTerm will have positionSensitive=true, whereas terms added from a TermQuery have positionSensitive=false. The end result for this particular term will depend on the order in which the clauses are processed. My fix is to use a subclass of Map whose put() always sets the result to the most lax setting, i.e. if we already have a term with positionSensitive=true and we try to put() a term with positionSensitive=false, we set the result to positionSensitive=false, as it will match both cases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www
[ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12597979#action_12597979 ] Otis Gospodnetic commented on LUCENE-1284: -- Thanks, I'll have a look later this week. Note that if you always use the same file name for attachments, JIRA will manage them for you and you won't have to delete old ones. Use a name such as LUCENE-1284.patch or LUCENE-1284.tgz or some such. Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org) -- Key: LUCENE-1284 URL: https://issues.apache.org/jira/browse/LUCENE-1284 Project: Lucene - Java Issue Type: New Feature Environment: New feature developed under GNU/Linux, but it should work on any other Java-compliant platform Reporter: Felipe Sánchez Martínez Assignee: Otis Gospodnetic Attachments: apertium-morph.2008-05-19.tgz Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms. The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future. -- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1290) Deprecate Hits
[ https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598170#action_12598170 ] Otis Gospodnetic commented on LUCENE-1290: -- I'm actually feeling -1-ish about this. I don't think Hits are hurting those who are truly concerned about performance. Those who want performance have other API options. But Hits is so nice and simple, and that must be valuable to a large portion of Lucene users (think CD searches, site searches, desktop search apps, etc., not massive distributed searches and such). Why can't we let Hits live? If we are concerned about its performance, we can easily javadoc and Wiki that. Deprecate Hits -- Key: LUCENE-1290 URL: https://issues.apache.org/jira/browse/LUCENE-1290 Project: Lucene - Java Issue Type: Task Components: Search Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.4 Attachments: lucene-1290.patch The Hits class has several drawbacks as pointed out in LUCENE-954. The other search APIs that use TopDocCollector and TopDocs should be used instead. This patch: - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as the Searcher.search( * ) methods which return a Hits object. - removes all references to Hits from the core and uses TopDocs and ScoreDoc instead - changes the demo SearchFiles: adds the two modes 'paging search' and 'streaming search', each of which demonstrates a different way of using the search APIs. The former uses TopDocs and a TopDocCollector, the latter a custom HitCollector implementation. - updates the online tutorial that describes the demo. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
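The two demo modes mentioned in the patch description can be contrasted with a small stand-in sketch. This uses hypothetical minimal types, not Lucene's real Searcher/HitCollector API: "streaming" pushes every hit to a callback as it is collected, never buffering the full result set, while "paging" gathers just the first page into a list.

```java
import java.util.ArrayList;
import java.util.List;

// Callback in the spirit of a custom HitCollector: one call per hit.
interface HitCallback {
    void collect(int doc, float score);
}

class SearchModesSketch {
    // Pretend result stream; in Lucene these would come from the index.
    static final float[] SCORES = { 0.3f, 0.9f, 0.1f };

    // Streaming mode: every hit goes straight to the callback, so memory
    // use is constant no matter how many hits there are.
    static void streamingSearch(HitCallback callback) {
        for (int doc = 0; doc < SCORES.length; doc++) {
            callback.collect(doc, SCORES[doc]);
        }
    }

    // Paging mode: buffer only the first pageSize hits, analogous to
    // asking a TopDocCollector for the top n and rendering one page.
    static List<Integer> pagingSearch(int pageSize) {
        List<Integer> page = new ArrayList<>();
        streamingSearch((doc, score) -> {
            if (page.size() < pageSize) {
                page.add(doc);
            }
        });
        return page;
    }
}
```

The point of the distinction is the one made in the comment thread: simple apps can keep a convenient buffered view, while performance-sensitive callers stream hits without materializing them all.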