[jira] [Updated] (LUCENE-3690) JFlex-based HTMLStripCharFilter replacement
[ https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-3690: Attachment: LUCENE-3690.patch Here is the final patch. {quote} bq. sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice... https://svn.apache.org/viewvc?view=revision&revision=658003 {quote} I took the same approach - here are the changes from the previous version of the patch: # The previous {{HTMLStripCharFilter}} implementation is moved to Solr, renamed to {{LegacyHTMLStripCharFilter}}, and deprecated, and a Factory is added for it. # {{JFlexHTMLStripCharFilter}} is renamed to {{HTMLStripCharFilter}}. # Support for {{HTMLStripCharFilter}}'s "escapedTags" functionality is added to {{HTMLStripCharFilterFactory}}. # Added {{TestHTMLStripCharFilterFactory}}. # Solr and Lucene {{CHANGES.txt}} entries are added. 
Run the following svn copy script before applying the patch:
{noformat}
svn cp modules/analysis/common/src/java/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilter.java
svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/htmlStripReaderTest.html solr/core/src/test/org/apache/solr/analysis/
svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterTest.java solr/core/src/test/org/apache/solr/analysis/LegacyHTMLStripCharFilterTest.java
svn cp solr/core/src/java/org/apache/solr/analysis/HTMLStripCharFilterFactory.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilterFactory.java
{noformat}
I plan to commit to trunk shortly, then backport and commit to branch_3x. > JFlex-based HTMLStripCharFilter replacement > --- > > Key: LUCENE-3690 > URL: https://issues.apache.org/jira/browse/LUCENE-3690 > Project: Lucene - Java > Issue Type: New Feature > Components: modules/analysis >Affects Versions: 3.5, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.6, 4.0 > > Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java, > JFlexHTMLStripCharFilterWarcTest.java, LUCENE-3690.patch, LUCENE-3690.patch, > LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch > > > A JFlex-based HTMLStripCharFilter replacement would be more performant and > easier to understand and maintain. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts
[ https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-3703. Resolution: Fixed Committed revision 1234450 (3x), 1234451 (trunk). Thanks Doron ! > DirectoryTaxonomyReader.refresh misbehaves with ref counts > -- > > Key: LUCENE-3703 > URL: https://issues.apache.org/jira/browse/LUCENE-3703 > Project: Lucene - Java > Issue Type: Bug > Components: modules/facet >Reporter: Shai Erera >Assignee: Shai Erera > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3703.patch, LUCENE-3703.patch > > > DirectoryTaxonomyReader uses the internal IndexReader in order to track its > own reference counting. However, when you call refresh(), it reopens the > internal IndexReader, and from that point, all previous reference counting > gets lost (since the new IndexReader's refCount is 1). > The solution is to track reference counting in DTR itself. I wrote a simple > unit test which exposes the bug (will be attached with the patch shortly).
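The fix described in the issue above — tracking the reference count in the taxonomy reader itself rather than delegating it to an inner reader that gets replaced on refresh — can be sketched as follows. This is an illustrative, Lucene-free sketch; the class and method names are hypothetical stand-ins, not the actual DirectoryTaxonomyReader code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountingWrapper {
    // The wrapper owns the refCount, so swapping the inner resource on
    // refresh() leaves counts held by existing callers intact. Delegating
    // counting to the inner object is exactly the bug described above:
    // a reopened inner object starts over at refCount == 1.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private Object inner = new Object(); // stands in for the internal IndexReader

    public void incRef() { refCount.incrementAndGet(); }

    public void decRef() {
        if (refCount.decrementAndGet() == 0) {
            inner = null; // last reference released: close underlying resources
        }
    }

    public void refresh() {
        inner = new Object(); // "reopen" the inner resource; refCount untouched
    }

    public int getRefCount() { return refCount.get(); }

    public static void main(String[] args) {
        RefCountingWrapper r = new RefCountingWrapper();
        r.incRef();   // a second holder takes a reference
        r.refresh();  // reopening must not reset the count
        System.out.println(r.getRefCount()); // 2
    }
}
```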
[jira] [Commented] (SOLR-1283) Mark Invalid error on indexing
[ https://issues.apache.org/jira/browse/SOLR-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190605#comment-13190605 ] Steven Rowe commented on SOLR-1283: --- The below-listed exception, which appears to be the same as that in other reports on this issue, is triggered when processing with {{HTMLStripCharFilter}} the ClueWeb09 documents with TREC-IDs clueweb09-en-00-14171, clueweb09-en-00-14228, clueweb09-en-00-14235, clueweb09-en-00-14240, clueweb09-en-00-14248, and clueweb09-en-00-14265: {noformat} java.io.IOException: Mark invalid at java.io.BufferedReader.reset(BufferedReader.java:485) at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69) at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171) at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734) {noformat} Once LUCENE-3690 has been committed, this will only affect the (deprecated) old implementation, which will be renamed to {{LegacyHTMLStripCharFilter}}. > Mark Invalid error on indexing > -- > > Key: SOLR-1283 > URL: https://issues.apache.org/jira/browse/SOLR-1283 > Project: Solr > Issue Type: Bug >Affects Versions: 1.3 > Environment: Ubuntu 8.04, Sun Java 6 >Reporter: solrize >Assignee: Yonik Seeley > Fix For: 3.1, 4.0 > > Attachments: SOLR-1283.modules.patch, SOLR-1283.patch > > > When indexing large (1 megabyte) documents I get a lot of exceptions with > stack traces like the below. It happens both in the Solr 1.3 release and in > the July 9 1.4 nightly. I believe this to NOT be the same issue as SOLR-42. > I found some further discussion on solr-user: > http://www.nabble.com/IOException:-Mark-invalid-while-analyzing-HTML-td17052153.html > > In that discussion, Grant asked the original poster to open a Jira issue, but > I didn't see one so I'm opening one; please feel free to merge or close if > it's redundant. > My stack trace follows. 
> Jul 15, 2009 8:36:42 AM org.apache.solr.core.SolrCore execute > INFO: [] webapp=/solr path=/update params={} status=500 QTime=3 > Jul 15, 2009 8:36:42 AM org.apache.solr.common.SolrException log > SEVERE: java.io.IOException: Mark invalid > at java.io.BufferedReader.reset(BufferedReader.java:485) > at > org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171) > at > org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728) > at > org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742) > at java.io.Reader.read(Reader.java:123) > at > org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:108) > at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:178) > at > org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84) > at > org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53) > at > org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:347) > at > org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159) > at > org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36) > at > org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234) > at > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765) > at > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:748) > at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2512) > at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2484) > at > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240) > at > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) > at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140) > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) > at > 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1292) > at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089) > at > org.mortbay.jetty.ser
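The "Mark invalid" error in the trace above is standard `java.io.BufferedReader` behavior: `mark(readAheadLimit)` only guarantees `reset()` works if at most `readAheadLimit` characters are read afterwards; once the buffer has to refill past that limit, the mark is discarded and `reset()` throws. A self-contained demonstration, using a deliberately tiny buffer (nothing Lucene-specific):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidDemo {
    // Returns true if reset() failed because the mark was invalidated.
    static boolean markGetsInvalidated() {
        try {
            // Small buffer (16 chars) so reading past the read-ahead limit
            // forces a refill, which discards the mark.
            BufferedReader r = new BufferedReader(
                    new StringReader("x".repeat(100)), 16);
            r.mark(4);   // mark is valid for at most 4 chars of read-ahead
            r.skip(50);  // read far past the limit -> buffer refilled
            r.reset();   // throws IOException("Mark invalid")
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(markGetsInvalidated()); // true
    }
}
```

This is why the old HTMLStripReader/HTMLStripCharFilter, which relied on mark/reset of the input reader, fails on inputs that look far ahead.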
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 1577 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/1577/ 1 tests failed. REGRESSION: org.apache.solr.search.TestRealTimeGet.testStressGetRealtime Error Message: java.lang.AssertionError: Some threads threw uncaught exceptions! Stack Trace: java.lang.RuntimeException: java.lang.AssertionError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:658) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:86) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) at org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:686) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:630) Build Log (for compile errors): [...truncated 9820 lines...]
[jira] [Commented] (LUCENE-3690) JFlex-based HTMLStripCharFilter replacement
[ https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190588#comment-13190588 ] Steven Rowe commented on LUCENE-3690: - bq. AFAICT, SOLR-2891 will be fixed by this implementation. I misspoke, having misread that issue - despite the reference to {{HTMLStripCharFilter}} in the most recent comment on the issue, SOLR-2891 is not about {{HTMLStripCharFilter}}.
[jira] [Created] (SOLR-3055) Use NGramPhraseQuery in Solr
Use NGramPhraseQuery in Solr Key: SOLR-3055 URL: https://issues.apache.org/jira/browse/SOLR-3055 Project: Solr Issue Type: New Feature Components: Schema and Analysis, search Reporter: Koji Sekiguchi Priority: Minor Solr should use NGramPhraseQuery when searching with default slop on n-gram field.
[jira] [Commented] (LUCENE-3426) optimizer for n-gram PhraseQuery
[ https://issues.apache.org/jira/browse/LUCENE-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190560#comment-13190560 ] Koji Sekiguchi commented on LUCENE-3426: bq. Is this automatic in SOLR? No. I've opened SOLR-3055. > optimizer for n-gram PhraseQuery > > > Key: LUCENE-3426 > URL: https://issues.apache.org/jira/browse/LUCENE-3426 > Project: Lucene - Java > Issue Type: Improvement > Components: core/search >Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3, 3.4, 4.0 >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Trivial > Fix For: 3.5, 4.0 > > Attachments: LUCENE-3426.patch, LUCENE-3426.patch, LUCENE-3426.patch, > LUCENE-3426.patch, LUCENE-3426.patch, LUCENE-3426.patch, PerfTest.java, > PerfTest.java > > > If 2-gram is used and the length of query string is 4, for example q="ABCD", > QueryParser generates (when autoGeneratePhraseQueries is true) > PhraseQuery("AB BC CD") with slop 0. But it can be optimized PhraseQuery("AB > CD") with appropriate positions. > The idea came from the Japanese paper "N.M-gram: Implementation of Inverted > Index Using N-gram with Hash Values" by Mikio Hirabayashi, et al. (The main > theme of the paper is different from the idea that I'm using here, though)
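The optimization in the issue description can be illustrated without Lucene: keeping every n-th gram plus the final gram still covers every character of the query, so a phrase match on the kept grams at their original positions is equivalent to matching all overlapping grams. A minimal sketch (helper names are hypothetical; the real rewrite lives in the attached LUCENE-3426 patches):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NGramPhraseOptimizer {
    // Given the n-grams of a query string (at positions 0,1,2,...),
    // keep only every n-th gram plus the final gram; together they still
    // cover every character, so the intermediate overlapping grams are
    // redundant for an exact (slop 0) phrase match.
    static Map<Integer, String> optimize(List<String> grams, int n) {
        Map<Integer, String> kept = new LinkedHashMap<>();
        for (int pos = 0; pos < grams.size(); pos++) {
            if (pos % n == 0 || pos == grams.size() - 1) {
                kept.put(pos, grams.get(pos)); // gram kept at its original position
            }
        }
        return kept;
    }

    static List<String> ngrams(String s, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) out.add(s.substring(i, i + n));
        return out;
    }

    public static void main(String[] args) {
        // "ABCD" with 2-grams: AB BC CD -> keep AB@0 and CD@2
        System.out.println(optimize(ngrams("ABCD", 2), 2)); // {0=AB, 2=CD}
    }
}
```

Fewer phrase terms means fewer postings to intersect, which is where the speedup in the attached PerfTest comes from.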
[jira] [Commented] (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190559#comment-13190559 ] Uwe Schindler commented on LUCENE-2858: --- I created the branch at [https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2858] and committed my first steps: - Add CompositeIndexReader and AtomicIndexReader - Moved methods around, still not yet finished (see below) - DirectoryReader is public now and is returned by IR.open() and IW.getReader() TODO: - IR.openIfChanged makes no sense for any reader other than DirectoryReader, let's move it also there - isCurrent and getVersion() is also useless for atomic readers and composite readers except DR - The strange generics in ReaderContext caused by the final field will go away, when changing reader field to accessor method returning the correct type (by return type overloading). Comments welcome and also heavy committing. > Separate SegmentReaders (and other atomic readers) from composite IndexReaders > -- > > Key: LUCENE-2858 > URL: https://issues.apache.org/jira/browse/LUCENE-2858 > Project: Lucene - Java > Issue Type: Task >Reporter: Uwe Schindler >Assignee: Uwe Schindler >Priority: Blocker > Fix For: 4.0 > > > With current trunk, whenever you open an IndexReader on a directory you get > back a DirectoryReader which is a composite reader. The interface of > IndexReader has now lots of methods that simply throw UOE (in fact more than > 50% of all methods that are commonly used ones are unusable now). This > confuses users and makes the API hard to understand. > This issue should split "atomic readers" from "reader collections" with a > separate API. After that, you are no longer able to get TermsEnum without > wrapping from those composite readers. 
We currently have helper classes for > wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or > Multi*), those should be retrofitted to implement the correct classes > (SlowMultiReaderWrapper would be an atomic reader but takes a composite > reader as ctor param, maybe it could also simply take a List). > In my opinion, maybe composite readers could implement some collection APIs > and also have the ReaderUtil method directly built in (possibly as a "view" > in the util.Collection sense). In general composite readers do not really > need to look like the previous IndexReaders, they could simply be a > "collection" of SegmentReaders with some functionality like reopen. > On the other side, atomic readers do not need reopen logic anymore? When a > segment changes, you need a new atomic reader? - maybe because of deletions > that's not the best idea, but we should investigate. Maybe make the whole > reopen logic simpler to use (at least on the collection reader level). > We should decide about good names, I have no preference at the moment.
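The split being proposed can be reduced to a plain-Java shape (all names here are placeholder sketches, not the actual LUCENE-2858 API): atomic readers expose per-segment functionality directly, a composite reader is little more than a collection of atomic leaves, and a "slow wrapper" presents a composite as a single atomic view by merging its leaves on demand:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReaderSplitSketch {
    // Atomic level: per-segment APIs live only here, so nothing throws UOE.
    interface AtomicR {
        List<String> terms();
    }

    // Composite level: essentially a collection of atomic leaves.
    static class CompositeR {
        final List<AtomicR> leaves;
        CompositeR(List<AtomicR> leaves) { this.leaves = List.copyOf(leaves); }
        List<AtomicR> getLeaves() { return leaves; }
    }

    // Atomic view over a composite: merges leaves on demand, which is the
    // cost that makes such a wrapper "slow".
    static class SlowCompositeWrapper implements AtomicR {
        final CompositeR in;
        SlowCompositeWrapper(CompositeR in) { this.in = in; }
        public List<String> terms() {
            List<String> merged = new ArrayList<>();
            for (AtomicR leaf : in.getLeaves()) merged.addAll(leaf.terms());
            Collections.sort(merged); // simulate the merge work
            return merged;
        }
    }

    public static void main(String[] args) {
        AtomicR seg1 = () -> List.of("apache", "lucene");
        AtomicR seg2 = () -> List.of("index", "solr");
        CompositeR dir = new CompositeR(List.of(seg1, seg2));
        System.out.println(new SlowCompositeWrapper(dir).terms());
        // [apache, index, lucene, solr]
    }
}
```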
[jira] [Issue Comment Edited] (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190559#comment-13190559 ] Uwe Schindler edited comment on LUCENE-2858 at 1/21/12 11:51 PM: - I created the branch at [https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2858] and committed my first steps: - Add CompositeIndexReader and AtomicIndexReader - Moved methods around, still not yet finished (see below) - DirectoryReader is public now and is returned by IR.open() and IW.getReader() TODO: - IR.openIfChanged makes no sense for any reader other than DirectoryReader, let's move it also there - isCurrent and getVersion() is also useless for atomic readers and composite readers except DR - The strange generics in ReaderContext caused by the final field will go away, when changing reader field to aaccessor method returning the correct type (by return type overloading). Comments welcome and also heavy committing. was (Author: thetaphi): I created the branch at [https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2858] and committed my first steps: - Add CompositeIndexReader and AtomicIndexReader - Moved methods around, still now finished (see below) - DirectoryReader is public now and is returned by IR.open() and IW.getReader() TODO: - IR.openIfChanged makes no sense for any reader other than DirectoryReader, let's move it also there - isCurrent and getVersion() is also useless for atomic readers and composite readers except DR - The strange generics in ReaderContext caused by the final field will go away, when changing reader field to aaccessor method returning the correct type (by return type overloading). Comments welcome and also heavy committing. 
[jira] [Commented] (LUCENE-3426) optimizer for n-gram PhraseQuery
[ https://issues.apache.org/jira/browse/LUCENE-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190550#comment-13190550 ] Bill Bell commented on LUCENE-3426: --- Is this automatic in SOLR? Or do we need to add a feature to support this in SOLR?
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190547#comment-13190547 ] Dawid Weiss commented on LUCENE-3714: - If my feeling is right and the PQ can be kept constant-size then it won't matter much at runtime I think. With realistic data distributions the number of elements to be inserted into the PQ before you reach the top-N will be pretty much the same (?). And the benefit would be a much cleaner traversal (no need to deal with buckets, early termination, etc.). > add suggester that uses shortest path/wFST instead of buckets > - > > Key: LUCENE-3714 > URL: https://issues.apache.org/jira/browse/LUCENE-3714 > Project: Lucene - Java > Issue Type: New Feature > Components: modules/spellchecker >Reporter: Robert Muir > Attachments: LUCENE-3714.patch, out.png > > > Currently the FST suggester (really an FSA) quantizes weights into buckets > (e.g. single byte) and puts them in front of the word. > This makes it fast, but you lose granularity in your suggestions. > Lately the question was raised, if you build lucene's FST with > positiveintoutputs, does it behave the same as a tropical semiring wFST? > In other words, after completing the word, we instead traverse min(output) at > each node to find the 'shortest path' to the > best suggestion (with the highest score). > This means we wouldn't need to quantize weights at all and it might make some > operations (e.g. adding fuzzy matching etc) a lot easier.
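The PQ-based traversal under discussion can be illustrated on a toy weighted trie (not Lucene's FST API; all names below are a sketch of the idea only): each arc is labeled with the minimum weight of any completion below it, so a best-first search pops partial paths from a priority queue in order of that lower bound and can stop as soon as N final outputs have been emitted — no buckets, no quantizing:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class TopNBestFirst {
    static class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        final Map<Character, Integer> arcMin = new HashMap<>(); // min weight below each arc
        int finalWeight = Integer.MAX_VALUE;                    // weight if a word ends here
    }

    static void add(Node root, String word, int weight) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n.arcMin.merge(c, weight, Math::min); // arc carries min(output) of its subtree
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.finalWeight = Math.min(n.finalWeight, weight);
    }

    // Entries with node == null are completed words (exact priority); entries
    // with a node carry a lower bound, so popping in priority order never
    // emits a word while a cheaper one is still hidden in the queue.
    record Path(int priority, String word, Node node) {}

    static List<String> topN(Node root, int n) {
        PriorityQueue<Path> pq =
                new PriorityQueue<>(Comparator.comparingInt(Path::priority));
        pq.add(new Path(0, "", root));
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty() && out.size() < n) {
            Path p = pq.poll();
            if (p.node() == null) { out.add(p.word()); continue; }
            if (p.node().finalWeight != Integer.MAX_VALUE)
                pq.add(new Path(p.node().finalWeight, p.word(), null));
            for (var e : p.node().children.entrySet())
                pq.add(new Path(p.node().arcMin.get(e.getKey()),
                        p.word() + e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node();
        add(root, "car", 3);
        add(root, "cat", 1);
        add(root, "cart", 2);
        System.out.println(topN(root, 2)); // [cat, cart]
    }
}
```

The open question raised in the comments — how large the queue can grow on degenerate inputs — corresponds here to how many lower-bound entries sit in `pq` before the N-th null entry is popped.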
[jira] [Created] (SOLR-3054) Add a TypeTokenFilterFactory
Add a TypeTokenFilterFactory Key: SOLR-3054 URL: https://issues.apache.org/jira/browse/SOLR-3054 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Tommaso Teofili Fix For: 3.6, 4.0 Create a TypeTokenFilterFactory to make the TypeTokenFilter (filtering tokens depending on token types, see LUCENE-3671) available in Solr too.
[jira] [Commented] (LUCENE-3671) Add a TypeTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190544#comment-13190544 ] Tommaso Teofili commented on LUCENE-3671: - Sure Uwe, I'll open a new one for the related Solr factory > Add a TypeTokenFilter > - > > Key: LUCENE-3671 > URL: https://issues.apache.org/jira/browse/LUCENE-3671 > Project: Lucene - Java > Issue Type: New Feature > Components: core/queryparser >Reporter: Santiago M. Mola >Assignee: Uwe Schindler > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3671.patch, LUCENE-3671_2.patch, > LUCENE-3671_3.patch > > > It would be convenient to have a TypeTokenFilter that filters tokens by its > type, either with an exclude or include list. This might be a stupid thing to > provide for people who use Lucene directly, but it would be very useful to > later expose it to Solr and other Lucene-backed search solutions.
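A Lucene-free sketch of what such a type-based filter does (the actual TypeTokenFilter and its factory live in the LUCENE-3671 patches; `useWhiteList` is an assumed parameter name): each token carries a type string, and the filter keeps a token exactly when its membership in the type set matches the include/exclude mode:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TypeFilterSketch {
    // (token, type) pairs stand in for a TokenStream with a TypeAttribute.
    static List<String> filterByType(List<Map.Entry<String, String>> tokens,
                                     Set<String> types, boolean useWhiteList) {
        return tokens.stream()
                // include mode: keep listed types; exclude mode: drop them
                .filter(t -> types.contains(t.getValue()) == useWhiteList)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> tokens = List.of(
                Map.entry("apache", "<ALPHANUM>"),
                Map.entry("3", "<NUM>"),
                Map.entry("solr", "<ALPHANUM>"));
        // Exclude numeric tokens:
        System.out.println(filterByType(tokens, Set.of("<NUM>"), false));
        // [apache, solr]
    }
}
```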
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190543#comment-13190543 ] Robert Muir commented on LUCENE-3714: - {quote} we could combine the two approaches: still use buckets, but within each bucket we have a wFST (ie, use the "true" score), so we don't actually do any quantizing in the end results. Then bucketing is purely an optimization... {quote} I like this idea!
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190541#comment-13190541 ] Dawid Weiss commented on LUCENE-3714: - If I seem inconsistent above then it's because I don't have ready-to-use answers and I'm sort of thinking out loud :)
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190539#comment-13190539 ] Dawid Weiss commented on LUCENE-3714: - I thought you had a solution that collects top-N, but your patch selects one (best) matching solution only. I don't know how you planned to go around selecting top-N, but in my understanding (at that moment) top-N selection is not going to work via recursive scan because an output at the given level doesn't tell you much about which arcs to follow. I can see how this can be solved by picking the arc/direction with the "next smallest/largest" output among all arcs traversed so far but this will be more complex and I cannot provide any bounds on how large the queue can be or what the worst case lookup then is. I do have a feeling a degenerate example can be devised, but then I also have a feeling these are uncommon in practice. Sorting arcs by score doesn't help if you use the pq -- you need to add all of them to the pq and then pick the smallest path. In a way it is like what you did, but the pq is maintaining fast access to the next-smaller-cost path. Another feeling is that the PQ can be bound to a maximum size of N? Every arc leads to at least one leaf so while traversing you'd drop those arcs that definitely would have fallen out of the first N smallest/largest weights... Yes, this could work. I'd still try to devise a degenerate example to see what the cost of maintaining the PQ can be. > add suggester that uses shortest path/wFST instead of buckets > - > > Key: LUCENE-3714 > URL: https://issues.apache.org/jira/browse/LUCENE-3714 > Project: Lucene - Java > Issue Type: New Feature > Components: modules/spellchecker >Reporter: Robert Muir > Attachments: LUCENE-3714.patch, out.png > > > Currently the FST suggester (really an FSA) quantizes weights into buckets > (e.g. single byte) and puts them in front of the word. 
> This makes it fast, but you lose granularity in your suggestions. > Lately the question was raised, if you build lucene's FST with > positiveintoutputs, does it behave the same as a tropical semiring wFST? > In other words, after completing the word, we instead traverse min(output) at > each node to find the 'shortest path' to the > best suggestion (with the highest score). > This means we wouldnt need to quantize weights at all and it might make some > operations (e.g. adding fuzzy matching etc) a lot easier. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
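The PQ-based top-N search Dawid and Mike discuss above can be sketched as a best-first search over a weighted trie, where each node's `minBelow` plays the role of the FST's `min(output)` guidance. This is a standalone illustration with hypothetical names, not Lucene's actual FST API:

```java
import java.util.*;

public class WfstTopN {

  /** Trie node; minBelow plays the role of the FST's min(output) guidance. */
  static final class Node {
    final TreeMap<Character, Node> arcs = new TreeMap<>();
    Integer cost;     // non-null if a word ends here (smaller = better)
    int minBelow;     // smallest cost of any word in this subtree
  }

  static Node build(Map<String, Integer> wordCost) {
    Node root = new Node();
    for (Map.Entry<String, Integer> e : wordCost.entrySet()) {
      Node cur = root;
      for (char c : e.getKey().toCharArray()) {
        cur = cur.arcs.computeIfAbsent(c, k -> new Node());
      }
      cur.cost = e.getValue();
    }
    computeMinBelow(root);
    return root;
  }

  private static int computeMinBelow(Node n) {
    int m = (n.cost == null) ? Integer.MAX_VALUE : n.cost;
    for (Node child : n.arcs.values()) {
      m = Math.min(m, computeMinBelow(child));
    }
    n.minBelow = m;
    return m;
  }

  /** Best-first search: always pop the partial path with the smallest bound. */
  static List<String> topN(Node root, int n) {
    // Entry: {bound, prefix, node, emit?}. An "emit" entry's bound is the
    // exact word cost, so popping in bound order yields words cheapest-first.
    PriorityQueue<Object[]> pq =
        new PriorityQueue<>((a, b) -> Integer.compare((int) a[0], (int) b[0]));
    pq.add(new Object[]{root.minBelow, "", root, false});
    List<String> out = new ArrayList<>();
    while (!pq.isEmpty() && out.size() < n) {
      Object[] e = pq.poll();
      String prefix = (String) e[1];
      Node node = (Node) e[2];
      if ((boolean) e[3]) {          // a completed word surfaced
        out.add(prefix);
        continue;
      }
      if (node.cost != null) {       // word ends here: queue it for emission
        pq.add(new Object[]{node.cost, prefix, node, true});
      }
      for (Map.Entry<Character, Node> arc : node.arcs.entrySet()) {
        Node child = arc.getValue();
        pq.add(new Object[]{child.minBelow, prefix + arc.getKey(), child, false});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Node root = build(Map.of("aa", 9, "ab", 1, "ba", 9, "bb", 2, "ca", 9, "cb", 3));
    System.out.println(topN(root, 3)); // prints [ab, bb, cb]
  }
}
```

As Dawid notes, the cost of this approach is the size the queue can grow to: every arc whose bound is popped gets its children pushed, even if most of them never produce a top-N result.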
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190536#comment-13190536 ] Michael McCandless commented on LUCENE-3714: Dawid, by problematic example, you mean you think this approach is functionally correct but may not perform very well...? That is definitely the worst-case performance (for either top-1 or top-K on a wFST with a simple PQ), but this (the number of non-competitive arcs you have to scan and discard) is a constant factor on the overall complexity, right? I think we should at least test the simple PQ on a PositiveIntOutputs wFST and see how it performs in practice. If indeed having everything "in one bucket" is too slow, we could combine the two approaches: still use buckets, but within each bucket we have a wFST (ie, use the "true" score), so we don't actually do any quantizing in the end results. Then bucketing is purely an optimization... Or, maybe, we could keep one bucket but sort each node's arcs by their output instead of by label. This'd mean the initial lookup-by-prefix gets slower (linear scan instead of a binary search, assuming those nodes had array'd arcs), but then producing the top-N is very fast (no wasted arcs need to be scanned). Maybe we could keep the by-label sort for nodes within depth N, and then sort by output beyond that... Or we could change the outputs algebra so that more "lookahead" is stored in each output so we have more guidance on which arcs are worth pursuing...
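Dawid's earlier idea of bounding the queue to a maximum size of N can be sketched separately: since every arc leads to at least one leaf, a candidate whose cost lower bound is worse than all N live candidates can never produce a top-N result and may be dropped on arrival. A minimal standalone sketch (hypothetical names, not Lucene code):

```java
import java.util.*;

/** Keeps at most n live candidates; prunes offers that cannot reach the top-N. */
class BoundedCandidates {
  private final int n;
  // Max-heap on bound, so the worst live candidate is cheap to inspect/evict.
  private final PriorityQueue<int[]> worstFirst =
      new PriorityQueue<>((a, b) -> Integer.compare(b[0], a[0]));

  BoundedCandidates(int n) { this.n = n; }

  /** Offer a candidate path with the given cost lower bound (smaller = better);
   *  returns false if it was pruned. */
  boolean offer(int bound, int pathId) {
    if (worstFirst.size() < n) {
      worstFirst.add(new int[]{bound, pathId});
      return true;
    }
    if (bound >= worstFirst.peek()[0]) {
      return false;                       // worse than all N live bounds: prune
    }
    worstFirst.poll();                    // evict the current worst candidate
    worstFirst.add(new int[]{bound, pathId});
    return true;
  }

  int size() { return worstFirst.size(); }
}
```

This caps the queue at N entries, but as discussed above, a degenerate input can still force many offers to be made (and pruned) before the true top-N emerges.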
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190535#comment-13190535 ] Robert Muir commented on LUCENE-3714: - Yeah I think we should try that first, and see how it performs.
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190526#comment-13190526 ] Dawid Weiss commented on LUCENE-3714: - I'm sure there are solutions to the problem if you change algebra ops -- the pq is a naive solution that would work on top of positive outputs.
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190524#comment-13190524 ] Dawid Weiss commented on LUCENE-3714: - The patch works because it finds the first (topmost) suggestion, but collecting suggestions with max-N (or min-N) will require a priority queue so that one knows which arc to follow next (and this will also require storing partially collected paths for pointers in the fst/queue)?
[jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190525#comment-13190525 ] Robert Muir commented on LUCENE-3714: - Not sure it requires one, http://www.cs.nyu.edu/~mohri/pub/nbest.ps has some solutions.
[jira] [Updated] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-3714: Attachment: out.png A problematic example where root arcs, traversed min-to-max, collect outputs, but every outgoing arc only leads to a single better suggestion (and should skip possibly lots of other suggestions). This is created by the following input: aa|N ab|1 ba|N bb|2 ca|N cb|3 .. collecting the K-th suggestion with the smallest score will pessimistically require scanning all of the arcs. Note that you can put arbitrarily large subtrees on _a|N nodes like: aaa|N aab|N aac|N etc.
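The degenerate input Dawid describes can be generated programmatically for worst-case testing. A sketch (standalone, hypothetical helper -- not part of the patch): under each of `branches` root arcs it places one good word (cost 1, 2, 3, ...) buried among `fillPerBranch` words that all share the worst cost N:

```java
import java.util.*;

public class DegenerateInput {
  /** branches root arcs; under each, one good word (cost 1,2,3,...) plus
   *  fillPerBranch filler words that all carry the worst cost n. */
  static Map<String, Integer> generate(int branches, int fillPerBranch, int n) {
    Map<String, Integer> words = new LinkedHashMap<>();
    for (int i = 0; i < branches; i++) {
      char c = (char) ('a' + i);
      words.put("" + c + 'b', i + 1);                    // ab|1, bb|2, cb|3 ...
      for (int j = 0; j < fillPerBranch; j++) {
        words.put("" + c + 'a' + (char) ('a' + j), n);   // aaa|N, aab|N ...
      }
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println(generate(3, 2, 100));
  }
}
```

Collecting the top-K from this input forces the search across K distinct root subtrees, since each root arc's min(output) guidance points at exactly one competitive word.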
[jira] [Created] (LUCENE-3715) TestStressIndexing2 fails with AssertionFailedError
TestStressIndexing2 failes with AssertionFailedError Key: LUCENE-3715 URL: https://issues.apache.org/jira/browse/LUCENE-3715 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 JENKINS reported this lately, I suspect a test issue due to the RandomDWPThreadPool but I need to dig deeper. here is the failure to reproduce: {noformat} [junit] Testcase: testMultiConfig(org.apache.lucene.index.TestStressIndexing2): FAILED [junit] r1 is not empty but r2 is [junit] junit.framework.AssertionFailedError: r1 is not empty but r2 is [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) [junit] at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:339) [junit] at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:277) [junit] at org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:126) [junit] at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529) [junit] [junit] [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 2.598 sec [junit] [junit] - Standard Error - [junit] NOTE: reproduce with: ant test -Dtestcase=TestStressIndexing2 -Dtestmethod=testMultiConfig -Dtests.seed=5df78431615a5fbf:45b35512c8b8741a:235b5758de97148e -Dtests.multiplier=3 -Dtests.nightly=true -Dargs="-Dfile.encoding=ISO8859-1" [junit] NOTE: test params are: codec=Lucene3x, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {f34=DFR GZ(0.3), f33=IB SPL-D2, f32=DFR I(n)B2, f31=DFR I(ne)B1, f30=IB LL-L2, f79=DFR I(n)3(800.0), f78=DFR I(F)L2, f75=DFR I(n)BZ(0.3), f76=DFR GLZ(0.3), f39=DFR I(n)BZ(0.3), f38=DFR I(F)3(800.0), f73=DFR I(ne)L1, f74=DFR I(F)3(800.0), f37=DFR I(ne)L1, f36=DFR I(ne)3(800.0), f71=DFR I(F)B3(800.0), f35=DFR I(F)B3(800.0), f72=DFR I(ne)3(800.0), f81=DFR GZ(0.3), 
f80=IB SPL-D2, f43=DFR I(ne)BZ(0.3), f42=DFR I(F)Z(0.3), f45=IB SPL-L2, f41=DFR I(F)BZ(0.3), f40=DFR I(n)B1, f86=DFR I(ne)B3(800.0), f87=DFR GB1, f88=IB SPL-D3(800.0), f89=DFR I(F)L3(800.0), f82=DFR GL2, f47=DFR I(ne)LZ(0.3), f46=DFR GL2, f83=DFR I(ne)LZ(0.3), f49=DFR I(ne)Z(0.3), f84=DFR I(F)B2, f48=DFR I(F)B2, f85=DFR I(ne)Z(0.3), f90=DFR I(ne)BZ(0.3), f92=IB SPL-L2, f91=DFR I(n)Z(0.3), f59=DFR G2, f6=IB SPL-DZ(0.3), f7=IB LL-L1, f57=IB LL-L3(800.0), f8=DFR I(n)L3(800.0), f58=DFR I(n)LZ(0.3), f12=DFR I(F)1, f11=DFR I(n)L2, f10=DFR I(F)LZ(0.3), f51=DFR I(n)L1, f15=DFR I(n)L1, f52=DFR I(F)L2, f14=DFR GLZ(0.3), f13=DFR I(n)BZ(0.3), f55=DFR GL3(800.0), f19=DFR GL3(800.0), f56=IB LL-L2, f53=DFR I(F)L1, f18=BM25(k1=1.2,b=0.75), f17=DFR I(F)L1, f54=BM25(k1=1.2,b=0.75), id=DFR I(F)L2, f1=DFR I(n)B3(800.0), f0=DFR G2, f3=DFR I(ne)3(800.0), f2=DFR I(F)B3(800.0), f5=DFR I(F)3(800.0), f4=DFR I(ne)L1, f68=DFR I(n)2, f69=DFR I(ne)2, f21=IB LL-LZ(0.3), f20=DFR I(n)1, f23=DFR GB2, f22=DFR I(ne)B2, f60=DFR I(ne)B3(800.0), f25=DFR GB1, f61=DFR GB1, f24=DFR I(ne)B3(800.0), f62=IB SPL-D3(800.0), f27=DFR I(F)L3(800.0), f26=IB SPL-D3(800.0), f63=DFR I(F)L3(800.0), f64=DFR GL1, f29=DFR I(ne)1, f65=DFR I(ne)1, f28=DFR GL1, f66=DFR I(n)B1, f67=DFR I(F)BZ(0.3), f98=DFR I(n)LZ(0.3), f97=IB LL-L3(800.0), f99=DFR G2, f94=DefaultSimilarity, f93=DFR I(n)3(800.0), f70=DFR GB2, f96=LM Jelinek-Mercer(0.70), f95=DFR GBZ(0.3)}, locale=ms, timezone=Africa/Bangui [junit] NOTE: all tests run in this JVM: [junit] [TestDemo, TestSearch, TestCachingTokenFilter, TestSurrogates, TestPulsingReuse, TestAddIndexes, TestBinaryTerms, TestCodecs, TestCrashCausesCorruptIndex, TestDocsAndPositions, TestFieldInfos, TestFilterIndexReader, TestFlex, TestIndexReader, TestIndexWriterMergePolicy, TestIndexWriterNRTIsCurrent, TestIndexWriterOnJRECrash, TestIndexWriterWithThreads, TestNeverDelete, TestNoDeletionPolicy, TestOmitNorms, TestParallelReader, TestPayloads, TestRandomStoredFields, TestRollback, 
TestRollingUpdates, TestSegmentInfo, TestStressIndexing2] [junit] NOTE: FreeBSD 8.2-RELEASE amd64/Sun Microsystems Inc. 1.6.0 (64-bit)/cpus=16,threads=1,free=349545000,total=477233152 {noformat} this failed on revision: http://svn.apache.org/repos/asf/lucene/dev/trunk : 1233708
[jira] [Updated] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
[ https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3714: Attachment: LUCENE-3714.patch A patch that Mike and I came up with that finds the minimal output from an arc, plus a random test showing it works.
[jira] [Resolved] (LUCENE-3713) TestIndexWriterOnDiskFull.testAddIndexOnDiskFull fails with java.lang.IllegalStateException: CFS has pending open files
[ https://issues.apache.org/jira/browse/LUCENE-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3713. - Resolution: Fixed > TestIndexWriterOnDiskFull.testAddIndexOnDiskFull fails with > java.lang.IllegalStateException: CFS has pending open files > > > Key: LUCENE-3713 > URL: https://issues.apache.org/jira/browse/LUCENE-3713 > Project: Lucene - Java > Issue Type: Bug > Components: core/index >Affects Versions: 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer > Fix For: 4.0 > > Attachments: LUCENE-3713.patch > > > {noformat} > Testsuite: org.apache.lucene.index.TestIndexWriterOnDiskFull > [junit] Testcase: > testAddIndexOnDiskFull(org.apache.lucene.index.TestIndexWriterOnDiskFull): > Caused an ERROR > [junit] CFS has pending open files > [junit] java.lang.IllegalStateException: CFS has pending open files > [junit] at > org.apache.lucene.store.CompoundFileWriter.close(CompoundFileWriter.java:162) > [junit] at > org.apache.lucene.store.CompoundFileDirectory.close(CompoundFileDirectory.java:206) > [junit] at > org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:4099) > [junit] at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3661) > [junit] at > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3260) > [junit] at > org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37) > [junit] at > org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1902) > [junit] at > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1716) > [junit] at > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1670) > [junit] at > org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull(TestIndexWriterOnDiskFull.java:304) > [junit] at > org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529) > [junit] at > org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) > 
[junit] at > org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) > [junit] > [junit] > [junit] Tests run: 4, Failures: 0, Errors: 1, Time elapsed: 31.96 sec > [junit] > [junit] - Standard Error - > [junit] NOTE: reproduce with: ant test > -Dtestcase=TestIndexWriterOnDiskFull -Dtestmethod=testAddIndexOnDiskFull > -Dtests.seed=-7dd066d256827211:127c018cbf5b0975:20481cd18a7d8b6e > -Dtests.multiplier=3 -Dtests.nightly=true -Dargs="-Dfile.encoding=ISO8859-1" > [junit] NOTE: test params are: codec=SimpleText, > sim=RandomSimilarityProvider(queryNorm=true,coord=false): {field=DFR GB1, > id=DFR I(F)L1, content=IB SPL-D3(800.0), f=DFR G2}, locale=de_AT, > timezone=America/Cambridge_Bay > [junit] NOTE: all tests run in this JVM: > [junit] [TestAssertions, TestSearchForDuplicates, TestMockAnalyzer, > TestDocValues, TestPerFieldPostingsFormat, TestDocument, TestAddIndexes, > TestConcurrentMergeScheduler, TestCrashCausesCorruptIndex, TestDocCount, > TestDocumentsWriterDeleteQueue, TestFieldInfos, TestFilterIndexReader, > TestFlex, TestIndexInput, TestIndexWriter, TestIndexWriterMergePolicy, > TestIndexWriterMerging, TestIndexWriterNRTIsCurrent, > TestIndexWriterOnDiskFull] > [junit] NOTE: FreeBSD 8.2-RELEASE amd64/Sun Microsystems Inc. 1.6.0 > (64-bit)/cpus=16,threads=1,free=39156976,total=180748288 > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
add suggester that uses shortest path/wFST instead of buckets - Key: LUCENE-3714 URL: https://issues.apache.org/jira/browse/LUCENE-3714 Project: Lucene - Java Issue Type: New Feature Components: modules/spellchecker Reporter: Robert Muir Currently the FST suggester (really an FSA) quantizes weights into buckets (e.g. single byte) and puts them in front of the word. This makes it fast, but you lose granularity in your suggestions. Lately the question was raised, if you build lucene's FST with positiveintoutputs, does it behave the same as a tropical semiring wFST? In other words, after completing the word, we instead traverse min(output) at each node to find the 'shortest path' to the best suggestion (with the highest score). This means we wouldn't need to quantize weights at all and it might make some operations (e.g. adding fuzzy matching etc) a lot easier.
[jira] [Resolved] (LUCENE-3671) Add a TypeTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-3671. --- Resolution: Fixed Committed trunk revision: 1234396 Committed 3.x revision: 1234397 Tommaso: Can you maybe provide a Solr factory in a separate Solr issue (or reopen this one)? > Add a TypeTokenFilter > - > > Key: LUCENE-3671 > URL: https://issues.apache.org/jira/browse/LUCENE-3671 > Project: Lucene - Java > Issue Type: New Feature > Components: core/queryparser >Reporter: Santiago M. Mola >Assignee: Uwe Schindler > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3671.patch, LUCENE-3671_2.patch, > LUCENE-3671_3.patch > > > It would be convenient to have a TypeTokenFilter that filters tokens by its > type, either with an exclude or include list. This might be a stupid thing to > provide for people who use Lucene directly, but it would be very useful to > later expose it to Solr and other Lucene-backed search solutions.
[jira] [Updated] (LUCENE-3713) TestIndexWriterOnDiskFull.testAddIndexOnDiskFull fails with java.lang.IllegalStateException: CFS has pending open files
[ https://issues.apache.org/jira/browse/LUCENE-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3713: Attachment: LUCENE-3713.patch Sneaky -- a great example of why random testing rocks! I really wonder why this took so long to fail right there. Here is a patch; it's kind of obvious what went wrong. Essentially, we don't release the "direct output" lock, since the assignment to the flag marking the lock as taken happens after the IO resource is accessed. I plan to commit shortly.
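The bug pattern Simon describes can be sketched in isolation (hypothetical names, not the actual CompoundFileWriter code): the flag marking the "direct output" as taken is only cleared after the IO call, so when the IO call throws (e.g. on a simulated disk-full), the flag stays set and close() later reports pending open files:

```java
public class DirectOutputLock {
  private boolean taken;

  void acquire() { taken = true; }
  boolean isTaken() { return taken; }

  /** Buggy ordering: the flag is cleared only if the IO call succeeds. */
  void releaseBuggy(Runnable io) {
    io.run();       // may throw, e.g. on a simulated disk-full
    taken = false;  // skipped on exception: the "lock" leaks
  }

  /** Fix: clear the flag in finally, whether or not the IO call throws. */
  void releaseFixed(Runnable io) {
    try {
      io.run();
    } finally {
      taken = false;
    }
  }

  public static void main(String[] args) {
    DirectOutputLock lock = new DirectOutputLock();
    lock.acquire();
    try {
      lock.releaseBuggy(() -> { throw new RuntimeException("disk full"); });
    } catch (RuntimeException expected) { }
    System.out.println("buggy leaks lock: " + lock.isTaken());   // true

    lock.acquire();
    try {
      lock.releaseFixed(() -> { throw new RuntimeException("disk full"); });
    } catch (RuntimeException expected) { }
    System.out.println("fixed leaks lock: " + lock.isTaken());   // false
  }
}
```

This also explains why only a rare random schedule triggered the failure: the leak is only visible when an IO exception lands between taking and clearing the flag.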
[jira] [Updated] (LUCENE-3713) TestIndexWriterOnDiskFull.testAddIndexOnDiskFull fails with java.lang.IllegalStateException: CFS has pending open files
[ https://issues.apache.org/jira/browse/LUCENE-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3713: Lucene Fields: New,Patch Available (was: New)
Re: [JENKINS] Lucene-trunk - Build # 1805 - Still Failing
I opened LUCENE-3713 for this failure. On Sat, Jan 21, 2012 at 6:20 AM, Apache Jenkins Server wrote: > Build: https://builds.apache.org/job/Lucene-trunk/1805/ > > 1 tests failed. > REGRESSION: > org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull > > Error Message: > CFS has pending open files > > Stack Trace: > java.lang.IllegalStateException: CFS has pending open files > at > org.apache.lucene.store.CompoundFileWriter.close(CompoundFileWriter.java:162) > at > org.apache.lucene.store.CompoundFileDirectory.close(CompoundFileDirectory.java:206) > at > org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:4099) > at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3661) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3260) > at > org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37) > at > org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1902) > at > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1716) > at > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1670) > at > org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull(TestIndexWriterOnDiskFull.java:304) > at > org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529) > at > org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) > at > org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) > > > > > Build Log (for compile errors): > [...truncated 13077 lines...]
[jira] [Commented] (SOLR-2983) Unable to load custom MergePolicy
[ https://issues.apache.org/jira/browse/SOLR-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190500#comment-13190500 ] Simon Willnauer commented on SOLR-2983: --- Tommaso, can you update CHANGES.txt too? Once this is done I can just commit it, thanks! > Unable to load custom MergePolicy > - > > Key: SOLR-2983 > URL: https://issues.apache.org/jira/browse/SOLR-2983 > Project: Solr > Issue Type: Bug >Reporter: Mathias Herberts >Assignee: Simon Willnauer >Priority: Minor > Fix For: 3.6, 4.0 > > Attachments: SOLR-2983.patch > > > As part of a recent upgrade to Solr 3.5.0 we encountered an error related to > our use of LinkedIn's ZoieMergePolicy. > It seems the code that loads a custom MergePolicy was at some point moved > into SolrIndexConfig.java from SolrIndexWriter.java, but as this code was > copied verbatim it now contains a bug: > try { > policy = (MergePolicy) > schema.getResourceLoader().newInstance(mpClassName, null, new > Class[]{IndexWriter.class}, new Object[]{this}); > } catch (Exception e) { > policy = (MergePolicy) > schema.getResourceLoader().newInstance(mpClassName); > } > 'this' is no longer an IndexWriter but a SolrIndexConfig, therefore the call > to newInstance will always throw an exception and the catch clause will be > executed. If the custom MergePolicy does not have a default constructor > (which is the case of ZoieMergePolicy), the second attempt to create the > MergePolicy will also fail and Solr won't start.
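The failure mode described in the issue can be reproduced with plain reflection, independent of Solr. The sketch below uses hypothetical stand-in classes (IndexWriterStandIn, SolrIndexConfigStandIn, WriterOnlyPolicy) in place of the real Lucene/Solr types, to show why passing the wrong `this` defeats both constructor lookups:

```java
import java.lang.reflect.Constructor;

public class MergePolicyLoaderDemo {
    // Stand-ins for the Lucene/Solr types involved (hypothetical, for illustration only).
    static class IndexWriterStandIn {}
    static class SolrIndexConfigStandIn {}

    // Like ZoieMergePolicy: only a constructor taking an IndexWriter, no default one.
    public static class WriterOnlyPolicy {
        public WriterOnlyPolicy(IndexWriterStandIn writer) {}
    }

    // Mirrors the two-step loading logic quoted in the issue: try the
    // IndexWriter-arg constructor, and on any failure fall back to the
    // no-arg constructor.
    static Object load(Class<?> clazz, Object ctorArg) throws Exception {
        try {
            Constructor<?> c = clazz.getConstructor(IndexWriterStandIn.class);
            return c.newInstance(ctorArg); // throws if ctorArg is not an IndexWriter
        } catch (Exception e) {
            // With the bug, execution always ends up here, and this line throws
            // NoSuchMethodException for policies without a default constructor.
            return clazz.getConstructor().newInstance();
        }
    }

    public static void main(String[] args) throws Exception {
        // Correct call site: passing an actual IndexWriter works.
        System.out.println(load(WriterOnlyPolicy.class,
                new IndexWriterStandIn()).getClass().getSimpleName());

        // Buggy call site (the 'this' in SolrIndexConfig): the first attempt
        // throws, and the fallback fails because there is no default constructor.
        try {
            load(WriterOnlyPolicy.class, new SolrIndexConfigStandIn());
        } catch (NoSuchMethodException e) {
            System.out.println("fallback failed: no default constructor");
        }
    }
}
```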
[jira] [Commented] (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190456#comment-13190456 ] Uwe Schindler commented on LUCENE-2858: --- Simon: Just to inform you, I am working on this. Currently I have a heavily broken checkout that no longer compiles at all :( Working, working, working... It's a mess! Once I have something initially compiling for core (not tests), I will create a branch! > Separate SegmentReaders (and other atomic readers) from composite IndexReaders > -- > > Key: LUCENE-2858 > URL: https://issues.apache.org/jira/browse/LUCENE-2858 > Project: Lucene - Java > Issue Type: Task >Reporter: Uwe Schindler >Assignee: Uwe Schindler >Priority: Blocker > Fix For: 4.0 > > > With current trunk, whenever you open an IndexReader on a directory you get > back a DirectoryReader which is a composite reader. The interface of > IndexReader now has lots of methods that simply throw UOE (in fact more than > 50% of the commonly used methods are unusable now). This > confuses users and makes the API hard to understand. > This issue should split "atomic readers" from "reader collections" with a > separate API. After that, you are no longer able to get TermsEnum without > wrapping from those composite readers. We currently have helper classes for > wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or > Multi*), those should be retrofitted to implement the correct classes > (SlowMultiReaderWrapper would be an atomic reader but takes a composite > reader as ctor param, maybe it could also simply take a List). > In my opinion, maybe composite readers could implement some collection APIs > and also have the ReaderUtil method directly built in (possibly as a "view" > in the util.Collection sense). 
In general composite readers do not really > need to look like the previous IndexReaders, they could simply be a > "collection" of SegmentReaders with some functionality like reopen. > On the other side, atomic readers do not need reopen logic anymore? When a > segment changes, you need a new atomic reader? - maybe because of deletions > that's not the best idea, but we should investigate. Maybe make the whole > reopen logic simpler to use (at least on the collection reader level). > We should decide about good names, I have no preference at the moment.
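The proposed atomic/composite split can be sketched in miniature. All class names below are hypothetical placeholders, not the API that eventually landed; the sketch only illustrates the idea that term-level access belongs on atomic readers while a composite reader acts as a collection of sub-readers:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Conceptual sketch of the proposed split (names hypothetical, not the
// final Lucene API).
public class ReaderSplitSketch {
    abstract static class Reader {
        abstract int maxDoc();
    }

    abstract static class AtomicReaderSketch extends Reader {
        // Term-level access lives only on atomic readers, so composite
        // readers no longer need to throw UnsupportedOperationException.
        abstract List<String> terms(String field);
    }

    static class CompositeReaderSketch extends Reader {
        private final List<AtomicReaderSketch> leaves;
        CompositeReaderSketch(List<AtomicReaderSketch> leaves) {
            this.leaves = leaves;
        }
        // The "collection view" idea from the issue: expose the sub-readers
        // directly instead of gathering them via ReaderUtil.
        List<AtomicReaderSketch> leaves() {
            return Collections.unmodifiableList(leaves);
        }
        @Override int maxDoc() {
            int sum = 0;
            for (AtomicReaderSketch r : leaves) sum += r.maxDoc();
            return sum;
        }
    }

    // Trivial atomic reader over an in-memory term list, just for the demo.
    static class InMemoryAtomic extends AtomicReaderSketch {
        private final int maxDoc;
        private final List<String> terms;
        InMemoryAtomic(int maxDoc, List<String> terms) {
            this.maxDoc = maxDoc;
            this.terms = terms;
        }
        @Override int maxDoc() { return maxDoc; }
        @Override List<String> terms(String field) { return terms; }
    }

    public static void main(String[] args) {
        CompositeReaderSketch dir = new CompositeReaderSketch(
            Arrays.<AtomicReaderSketch>asList(
                new InMemoryAtomic(3, Arrays.asList("apache", "lucene")),
                new InMemoryAtomic(2, Arrays.asList("solr"))));
        System.out.println(dir.leaves().size() + " leaves, maxDoc=" + dir.maxDoc());
    }
}
```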
[jira] [Updated] (LUCENE-3706) add offsets into lucene40 postings
[ https://issues.apache.org/jira/browse/LUCENE-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3706: Attachment: LUCENE-3706.patch Updated patch with tests for skipping and offsets + payloads. This found a bad assert in FieldInfosWriter; I think it's ready now. > add offsets into lucene40 postings > -- > > Key: LUCENE-3706 > URL: https://issues.apache.org/jira/browse/LUCENE-3706 > Project: Lucene - Java > Issue Type: New Feature >Affects Versions: 4.0 >Reporter: Robert Muir > Attachments: LUCENE-3706.patch, LUCENE-3706.patch > > > LUCENE-3684 added support for > IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, but > only SimpleText implements it. > I think we should implement it in the other 4.0 codecs (starting with > Lucene40PostingsFormat).
[jira] [Commented] (LUCENE-3712) Remove unused (and untested) methods from ReaderUtil that are also veeeeery ineffective
[ https://issues.apache.org/jira/browse/LUCENE-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190429#comment-13190429 ] Robert Muir commented on LUCENE-3712: - +1, untested and unused, nuke it. > Remove unused (and untested) methods from ReaderUtil that are also very > ineffective > --- > > Key: LUCENE-3712 > URL: https://issues.apache.org/jira/browse/LUCENE-3712 > Project: Lucene - Java > Issue Type: Task > Components: core/other >Affects Versions: 3.5 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3712.patch > > > ReaderUtil contains two methods that are nowhere used and not even tested. > Additionally those are implemented with useless List->array copying; > ineffective docStart calculation for a binary search later instead directly > returning the reader while scanning -- and I am not sure if they really work > as expected. As ReaderUtil is @lucene.internal we should remove them in 3.x > and trunk, alternatively the useless array copy / docStarts handling should > be removed and tests added: > {code:java} > public static IndexReader subReader(int doc, IndexReader reader) > public static IndexReader subReader(IndexReader reader, int subIndex) > {code}
[jira] [Updated] (LUCENE-3712) Remove unused (and untested) methods from ReaderUtil that are also veeeeery ineffective
[ https://issues.apache.org/jira/browse/LUCENE-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3712: -- Affects Version/s: 3.5 Fix Version/s: 4.0 3.6 > Remove unused (and untested) methods from ReaderUtil that are also very > ineffective > --- > > Key: LUCENE-3712 > URL: https://issues.apache.org/jira/browse/LUCENE-3712 > Project: Lucene - Java > Issue Type: Task > Components: core/other >Affects Versions: 3.5 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3712.patch > > > ReaderUtil contains two methods that are nowhere used and not even tested. > Additionally those are implemented with useless List->array copying; > ineffective docStart calculation for a binary search later instead directly > returning the reader while scanning -- and I am not sure if they really work > as expected. As ReaderUtil is @lucene.internal we should remove them in 3.x > and trunk, alternatively the useless array copy / docStarts handling should > be removed and tests added: > {code:java} > public static IndexReader subReader(int doc, IndexReader reader) > public static IndexReader subReader(IndexReader reader, int subIndex) > {code}
[jira] [Updated] (LUCENE-3712) Remove unused (and untested) methods from ReaderUtil that are also veeeeery ineffective
[ https://issues.apache.org/jira/browse/LUCENE-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3712: -- Attachment: LUCENE-3712.patch > Remove unused (and untested) methods from ReaderUtil that are also very > ineffective > --- > > Key: LUCENE-3712 > URL: https://issues.apache.org/jira/browse/LUCENE-3712 > Project: Lucene - Java > Issue Type: Task > Components: core/other >Affects Versions: 3.5 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3712.patch > > > ReaderUtil contains two methods that are nowhere used and not even tested. > Additionally those are implemented with useless List->array copying; > ineffective docStart calculation for a binary search later instead directly > returning the reader while scanning -- and I am not sure if they really work > as expected. As ReaderUtil is @lucene.internal we should remove them in 3.x > and trunk, alternatively the useless array copy / docStarts handling should > be removed and tests added: > {code:java} > public static IndexReader subReader(int doc, IndexReader reader) > public static IndexReader subReader(IndexReader reader, int subIndex) > {code}
[jira] [Created] (LUCENE-3712) Remove unused (and untested) methods from ReaderUtil that are also veeeeery ineffective
Remove unused (and untested) methods from ReaderUtil that are also very ineffective --- Key: LUCENE-3712 URL: https://issues.apache.org/jira/browse/LUCENE-3712 Project: Lucene - Java Issue Type: Task Components: core/other Reporter: Uwe Schindler Assignee: Uwe Schindler ReaderUtil contains two methods that are nowhere used and not even tested. Additionally those are implemented with useless List->array copying; ineffective docStart calculation for a binary search later instead of directly returning the reader while scanning -- and I am not sure if they really work as expected. As ReaderUtil is @lucene.internal we should remove them in 3.x and trunk; alternatively, the useless array copy / docStarts handling should be removed and tests added: {code:java} public static IndexReader subReader(int doc, IndexReader reader) public static IndexReader subReader(IndexReader reader, int subIndex) {code}
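The inefficiency the issue describes (copying the readers into an array and computing docStarts only to binary-search them afterwards) can be contrasted with returning the sub-reader during a single scan. A minimal sketch using a hypothetical stand-in Sub type, not ReaderUtil's actual signatures:

```java
public class SubReaderScan {
    // Stand-in for a sub-reader; only maxDoc matters here (hypothetical type).
    static class Sub {
        final int maxDoc;
        Sub(int maxDoc) { this.maxDoc = maxDoc; }
    }

    // Returns the sub-reader containing the given doc by scanning once and
    // accumulating a doc base: no List->array copy, no docStarts array,
    // no second binary-search pass.
    static Sub subReaderFor(int doc, Sub[] subs) {
        int base = 0;
        for (Sub s : subs) {
            if (doc < base + s.maxDoc) return s;
            base += s.maxDoc;
        }
        throw new IllegalArgumentException("doc " + doc + " out of range");
    }

    public static void main(String[] args) {
        Sub[] subs = { new Sub(10), new Sub(5), new Sub(20) };
        // Doc 12 falls into the second segment (doc bases 0, 10, 15).
        System.out.println(subReaderFor(12, subs) == subs[1]);
    }
}
```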
[jira] [Commented] (LUCENE-3671) Add a TypeTokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190425#comment-13190425 ] Tommaso Teofili commented on LUCENE-3671: - Thanks Uwe for taking care of it :) > Add a TypeTokenFilter > - > > Key: LUCENE-3671 > URL: https://issues.apache.org/jira/browse/LUCENE-3671 > Project: Lucene - Java > Issue Type: New Feature > Components: core/queryparser >Reporter: Santiago M. Mola >Assignee: Uwe Schindler > Fix For: 3.6, 4.0 > > Attachments: LUCENE-3671.patch, LUCENE-3671_2.patch, > LUCENE-3671_3.patch > > > It would be convenient to have a TypeTokenFilter that filters tokens by their > type, with either an exclude or include list. This might be a stupid thing to > provide for people who use Lucene directly, but it would be very useful to > later expose it to Solr and other Lucene-backed search solutions.
[jira] [Commented] (SOLR-3045) eDismax: Allow virtual fields
[ https://issues.apache.org/jira/browse/SOLR-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190371#comment-13190371 ] Jan Høydahl commented on SOLR-3045: --- Alternatively, Hoss' suggestion from SOLR-3026, with per-field override syntax for the virtual fields that will cause DMQ sub-queries. I like this syntax better than mine :) {noformat} q=elephant title:dumbo who:george &qf=title^3 firstname lastname^2 description^2 catchall &uf=title^5 who^2 * &f.who.qf=firstname lastname^10 {noformat} > eDismax: Allow virtual fields > - > > Key: SOLR-3045 > URL: https://issues.apache.org/jira/browse/SOLR-3045 > Project: Solr > Issue Type: New Feature > Components: search >Reporter: Jan Høydahl > > Imagine a one-field yellow page search using eDisMax across fields > {noformat} > qf=firstname middlename lastname companyname category^10.0 subcategory > products address street zip city^5.0 state > {noformat} > Now this of course works well. But what if I want to offer my users fielded > search on "who", "what" and "where". > A way to do this now is copyField into three new fields with these names. But > then you lose the internal weight between the sub fields. > A more elegant way would be allowing virtual field names mapping to multiple > fields, so user can search where:london and match address, street, zip, city > or state automatically.
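The per-field qf override discussed above amounts to expanding a fielded term like who:george into a disjunction over the configured sub-fields with their boosts. A toy sketch of that expansion; the generated query string is purely illustrative (eDisMax builds DisjunctionMaxQuery objects internally, not strings), and the field names and boosts mirror the f.who.qf example in the comment:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class VirtualFieldExpansion {
    // Expands a term queried against a virtual field into a dismax-style
    // disjunction over the configured sub-fields, applying per-field boosts.
    static String expand(String term, Map<String, Float> qf) {
        StringJoiner clauses = new StringJoiner(" OR ", "(", ")");
        for (Map.Entry<String, Float> e : qf.entrySet()) {
            String clause = e.getKey() + ":" + term;
            if (e.getValue() != 1.0f) clause += "^" + e.getValue();
            clauses.add(clause);
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // Sub-field config for the virtual "who" field, as in f.who.qf.
        Map<String, Float> whoQf = new LinkedHashMap<>();
        whoQf.put("firstname", 1.0f);
        whoQf.put("lastname", 10.0f);
        // who:george expands to a boosted disjunction over the sub-fields.
        System.out.println(expand("george", whoQf));
    }
}
```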
[jira] [Commented] (SOLR-3026) eDismax: Locking down which fields can be explicitly queried (user fields aka uf)
[ https://issues.apache.org/jira/browse/SOLR-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190369#comment-13190369 ] Jan Høydahl commented on SOLR-3026: --- I like the f.who.qf style, and the fact that you can then boost the whole DMQ clause as a whole. I'll add that to SOLR-3045 as a suggestion. But it's a bit overkill to spin up a DMQ for simple single-field aliasing, i.e. my example &uf=title:searchable_title_t. Ideally such simple field name aliasing should be supported at the Lucene parser level. Alternatively it could be another per-field param {noformat} &f.title.fmap=searchable_title_t {noformat} I'm still not sure how to use the built-in aliasing to implement this. > eDismax: Locking down which fields can be explicitly queried (user fields aka > uf) > - > > Key: SOLR-3026 > URL: https://issues.apache.org/jira/browse/SOLR-3026 > Project: Solr > Issue Type: Improvement > Components: search >Affects Versions: 3.1, 3.2, 3.3, 3.4, 3.5 >Reporter: Jan Høydahl >Assignee: Jan Høydahl > Fix For: 3.6, 4.0 > > Attachments: SOLR-3026.patch > > > We need a way to specify exactly what fields should be available to the end > user as fielded search. > In the original SOLR-1553, there's a patch implementing "user fields", but it > was never committed even though that issue was closed.
[jira] [Updated] (SOLR-3045) eDismax: Allow virtual fields
[ https://issues.apache.org/jira/browse/SOLR-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-3045: -- Description: Imagine a one-field yellow page search using eDisMax across fields {noformat} qf=firstname middlename lastname companyname category^10.0 subcategory products address street zip city^5.0 state {noformat} Now this of course works well. But what if I want to offer my users fielded search on "who", "what" and "where". A way to do this now is copyField into three new fields with these names. But then you lose the internal weight between the sub fields. A more elegant way would be allowing virtual field names mapping to multiple fields, so user can search where:london and match address, street, zip, city or state automatically. was: Imagine a one-field yellow page search using eDisMax across fields {noformat} qf=firstname middlename lastname companyname category^10.0 subcategory products address street zip city^5.0 state {noformat} Now this of course works well. But what if I want to offer my users fielded search on "who", "what" and "where". A way to do this now is copyField into three new fields with these names. But then you lose the internal weight between the sub fields. A more elegant way would be allowing virtual field names mapping to multiple fields. 
Imagine uf extended further: {noformat} &uf=who:firstname,middlename,lastname^2.0,companyname what:category,subcategory,products where:address,street,zip,city^10.0,state {noformat} This could probably be solved by adding each as a dismax sub-Query > eDismax: Allow virtual fields > - > > Key: SOLR-3045 > URL: https://issues.apache.org/jira/browse/SOLR-3045 > Project: Solr > Issue Type: New Feature > Components: search >Reporter: Jan Høydahl > > Imagine a one-field yellow page search using eDisMax across fields > {noformat} > qf=firstname middlename lastname companyname category^10.0 subcategory > products address street zip city^5.0 state > {noformat} > Now this of course works well. But what if I want to offer my users fielded > search on "who", "what" and "where". > A way to do this now is copyField into three new fields with these names. But > then you lose the internal weight between the sub fields. > A more elegant way would be allowing virtual field names mapping to multiple > fields, so user can search where:london and match address, street, zip, city > or state automatically.