[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546148 ] Michael Busch commented on LUCENE-584:
--------------------------------------

{quote}
1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently.
2. introduce MatchFilter as superclass of Filter and add a minimal DefaultMatcher to be used in IndexSearcher, i.e. add BitSetMatcher
{quote}

Paul, I like the iterative plan you suggested. I started reviewing Matcher-20071122-1ground.patch. I have some questions:
- Is the API fully backwards compatible?
- Did you run performance tests to check whether BitSetMatcher is slower than using a BitSet directly?
- With just the mentioned patch applied I get compile errors, because the DefaultMatcher is missing. Could you provide a patch that also includes the BitSetMatcher and has Filter#getMatcher() return it?

Also, I believe the patch should modify Hits.java to use MatchFilter instead of Filter? And a unit test that tests the BitSetMatcher would be nice!

> Decouple Filter from BitSet
> ---------------------------
>
>                 Key: LUCENE-584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-584
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, Matcher-20071122-1ground.patch, Some Matchers.zip
>
> {code}
> package org.apache.lucene.search;
>
> public abstract class Filter implements java.io.Serializable
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> public interface AbstractBitSet
> {
>   public boolean get(int index);
> }
> {code}
>
> It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with a smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
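[Editor's note] The use case in the issue description (few visible documents in a huge index) is exactly where an abstract interface pays off. As a hedged illustration — SparseBitSet is a hypothetical name, not part of any attached patch — the proposed interface could be backed by a sorted array of set doc ids, so memory scales with the number of visible documents rather than the index size:

```java
import java.util.Arrays;

// The proposed abstract interface, as given in the issue description.
interface AbstractBitSet {
    boolean get(int index);
}

// Hypothetical sparse implementation: stores only the indices of set bits.
class SparseBitSet implements AbstractBitSet {
    private final int[] setBits; // must be sorted ascending

    SparseBitSet(int[] sortedSetBits) {
        this.setBits = sortedSetBits;
    }

    // Binary search: O(log n) in the number of set bits, independent of index size.
    public boolean get(int index) {
        return Arrays.binarySearch(setBits, index) >= 0;
    }
}
```

The default implementation could still delegate to java.util.BitSet behind the same interface.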
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546072 ] Michael Busch commented on LUCENE-1058:
---------------------------------------

I'm quite busy currently with other stuff. Feel free to go ahead ;)

> New Analyzer for buffering tokens
> ---------------------------------
>
>                 Key: LUCENE-1058
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1058
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that could siphon off certain tokens and store them in a buffer to be used later in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but all the other analysis is the same, then you could save off the tokens to be output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how it plays with the new reuse API.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546069 ] Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

OK, looks good to me and is much simpler. The only thing that gets complicated is the constructors, but that should be manageable. Thanks for bearing w/ me :-)

Does one of you want to whip up a patch w/ tests, or do you want me to do it?
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546062 ] Michael Busch commented on LUCENE-1058:
---------------------------------------

I like the TeeTokenFilter! +1
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546058 ] Yonik Seeley commented on LUCENE-1058:
--------------------------------------

I think having the "tee" solves the many-to-many case... you can have many fields contribute tokens to a new field.

{code}
ListTokenizer sink1 = new ListTokenizer(null);
ListTokenizer sink2 = new ListTokenizer(null);
TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

// now sink1 and sink2 will both get tokens from both reader1 and reader2 after the whitespace tokenizer
// now we can further wrap any of these in extra analysis, and more "tees" can be inserted if desired
TokenStream final1 = new LowerCaseFilter(source1);
TokenStream final2 = source2;
TokenStream final3 = new EntityDetect(sink1);
TokenStream final4 = new URLDetect(sink2);

d.add(new Field("f1", final1));
d.add(new Field("f2", final2));
d.add(new Field("f3", final3));
d.add(new Field("f4", final4));
{code}
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546052 ] Yonik Seeley commented on LUCENE-1058:
--------------------------------------

Very similar to what I came up with, I think... (all untested, etc.)

{code}
class ListTokenizer extends Tokenizer {
  protected List lst = new ArrayList();
  protected Iterator iter;

  public ListTokenizer(List input) {
    this.lst = input;
    if (this.lst == null) this.lst = new ArrayList();
  }

  /** Only valid if tokens have not been consumed,
   * i.e. if this tokenizer is not part of another tokenstream. */
  public List getTokens() {
    return lst;
  }

  public Token next(Token result) throws IOException {
    if (iter == null) iter = lst.iterator();
    // return null at end-of-stream instead of throwing NoSuchElementException
    return iter.hasNext() ? (Token) iter.next() : null;
  }

  /** Override this method to cache only certain tokens, or new tokens based
   * on the old tokens. */
  public void add(Token t) {
    if (t == null) return;
    lst.add((Token) t.clone());
  }

  public void reset() throws IOException {
    iter = lst.iterator();
  }
}

class TeeTokenFilter extends TokenFilter {
  ListTokenizer sink;

  protected TeeTokenFilter(TokenStream input, ListTokenizer sink) {
    super(input);
    this.sink = sink;
  }

  public Token next(Token result) throws IOException {
    Token t = input.next(result);
    sink.add(t);
    return t;
  }
}
{code}
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546051 ] Doug Cutting commented on LUCENE-1044:
--------------------------------------

> How about if we don't sync every single commit point?

I'm confused. The semantics of commit should be that all changes prior are made permanent, and no subsequent changes are permanent until the next commit. So syncs, if any, should map 1:1 to commits, no? Folks can make indexing faster by committing/syncing less often.

> Behavior on hard power shutdown
> -------------------------------
>
>                 Key: LUCENE-1044
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1044
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>         Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
>            Reporter: venkat rangan
>            Assignee: Michael McCandless
>             Fix For: 2.3
>
>         Attachments: FSyncPerfTest.java, LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch
>
> When indexing a large number of documents, upon a hard power failure (e.g. pull the power cord), the index seems to get corrupted. We start a Java application as a Windows Service, and feed it documents. In some cases (after an index size of 1.7GB, with 30-40 index segment .cfs files), the following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct. After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our customer deployments to 1.9 or later versions, but would be happy to back-port a patch, if the patch is small enough and if this problem is already solved.
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546050 ] Michael Busch commented on LUCENE-1058:
---------------------------------------

We need to change the CachingTokenFilter a bit (untested code):

{code:java}
public class CachingTokenFilter extends TokenFilter {
  private List cache;
  private Iterator iterator;

  public CachingTokenFilter(TokenStream input) {
    super(input);
    this.cache = new LinkedList();
  }

  public Token next() throws IOException {
    if (iterator != null) {
      if (!iterator.hasNext()) {
        // the cache is exhausted, return null
        return null;
      }
      return (Token) iterator.next();
    } else {
      Token token = input.next();
      addTokenToCache(token);
      return token;
    }
  }

  public void reset() throws IOException {
    if (cache != null) {
      iterator = cache.iterator();
    }
  }

  protected void addTokenToCache(Token token) {
    if (token != null) {
      cache.add(token);
    }
  }
}
{code}

Then you can implement the ProperNounTF:

{code:java}
class ProperNounTF extends CachingTokenFilter {
  protected void addTokenToCache(Token token) {
    if (token != null && isProperNoun(token)) {
      cache.add(token);
    }
  }

  private boolean isProperNoun(Token token) {...}
}
{code}

And then you add everything to Document:

{code:java}
Document d = new Document();
TokenStream properNounTf = new ProperNounTF(new StandardTokenizer(reader));
TokenStream stdTf = new CachingTokenFilter(new StopTokenFilter(properNounTf));
TokenStream lowerCaseTf = new LowerCaseTokenFilter(stdTf);

d.add(new Field("std", stdTf));
d.add(new Field("nouns", properNounTf));
d.add(new Field("lowerCase", lowerCaseTf));
{code}

Again, this is untested, but I believe it should work?
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546042 ] Michael McCandless commented on LUCENE-1044:
--------------------------------------------

How about if we don't sync every single commit point? I think on a crash what's important when you come back up is 1) the index is consistent, and 2) you have not lost that many docs from your index. Losing the last N (up to mergeFactor) flushes might be acceptable?

EG we could force a full sync only when we commit the merge, before we remove the merged segments. This would mean on a crash that you're "guaranteed" to have the last successfully committed & sync'd merge to fall back to, and possibly a newer commit point if the OS had sync'd those files on its own.

That would be a big simplification, because I think we could just do the sync() in the foreground, since ConcurrentMergeScheduler is already using BG threads to do merges.

This would also mean we cannot delete the commit points that were not sync'd. So the first 10 flushes would result in 10 segments_N files. But then when the merge of these segments completes, and the result is sync'd, those files could all be deleted. Plus we would have to fix the retry logic on loading the segments file to try more than just the 2 most recent commit points, but that's a pretty minor change.

I think it should mean better performance, because the longer you wait to call sync(), presumably the more likely it is a no-op if the OS has already sync'd the file.
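[Editor's note] For context on the sync being discussed: in Java, forcing a file's contents to stable storage is done with FileChannel.force() (or FileDescriptor.sync()). A minimal sketch of the primitive — not taken from any of the attached patches; DurableWrite is a hypothetical name:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DurableWrite {
    // Writes data to path and forces it to stable storage before returning.
    // force(true) also syncs file metadata; this is the fsync that a commit
    // must perform to survive a hard power loss.
    static void writeAndSync(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // cheap if the OS has already flushed these pages
        }
    }
}
```

The cost argument above hinges on the last comment in the block: a deferred force() is more likely to find the pages already flushed by the OS.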
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546040 ] Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

OK, I am trying not to be fixated on the Analyzer. I guess I haven't fully synthesized the new TokenStream use in DocsWriter. I agree, I don't like the no-value Field, and am open to suggestions.

So, I guess I am going to push back and ask: how would you solve the case where you have two fields and the analysis is given by:

source field: StandardTokenizer, Proper Noun TF, LowerCaseTF, StopTF
buffered1 field: Proper Noun Cache TF (cache of all terms found to be proper nouns by the Proper Noun TF)
buffered2 field: all terms lower cased

And the requirement is that you only do the analysis phase once (i.e. for the source field) and the other two fields are from memory. I am just not seeing it yet, so I appreciate the explanation, as it will better cement my understanding of the new TokenStream stuff and DocsWriter.
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546039 ] Michael McCandless commented on LUCENE-1044:
--------------------------------------------

Woops, the last line in the table above is wrong (it's a copy of the line before it). I'll re-run the test.
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546031 ] Michael Busch commented on LUCENE-1058:
---------------------------------------

I think the ideas here make sense, e.g. to have a buffering TokenFilter that doesn't buffer all tokens but enables the user to control which tokens to buffer. What is still not clear to me is why we have to introduce a new API for this and a new kind of analyzer. Allowing the creation of a no-value field seems strange.

Can't we achieve all this by using the Field(String, TokenStream) API without the analyzer indirection? The javadocs should make clear that the IndexWriter processes fields in the same order the user added them. So if a user adds TokenStream ts1 and thereafter ts2, they can be sure that ts1 is processed first. With that knowledge, ts1 can buffer certain tokens that ts2 then uses. Adding even more fields that use the same tokens is straightforward.
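[Editor's note] The ordering contract described above can be modeled without any Lucene classes. This is a hedged, standalone sketch — SimpleStream, BufferingStream, and ReplayStream are invented stand-ins, not Lucene API: the first stream is consumed fully while buffering selected tokens into a shared sink, which a later stream then replays:

```java
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for a token stream: next() returns null at end.
interface SimpleStream { String next(); }

// Passes tokens through while copying the "interesting" ones (here: tokens
// starting with an upper-case letter, a toy proper-noun test) into a sink.
class BufferingStream implements SimpleStream {
    private final Iterator<String> source;
    private final List<String> sink;
    BufferingStream(List<String> tokens, List<String> sink) {
        this.source = tokens.iterator();
        this.sink = sink;
    }
    public String next() {
        if (!source.hasNext()) return null;
        String t = source.next();
        if (Character.isUpperCase(t.charAt(0))) sink.add(t);
        return t;
    }
}

// Replays whatever the earlier stream left in the sink; valid only if the
// buffering stream was fully consumed first (the ordering guarantee above).
class ReplayStream implements SimpleStream {
    private final List<String> sink;
    private Iterator<String> it;
    ReplayStream(List<String> sink) { this.sink = sink; }
    public String next() {
        if (it == null) it = sink.iterator();
        return it.hasNext() ? it.next() : null;
    }
}
```

With Field(String, TokenStream), the IndexWriter's documented field ordering plays the role of the "consume first stream fully before the second" step.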
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546028 ] Yonik Seeley commented on LUCENE-1058:
--------------------------------------

I dunno... it feels like we should have the right generic solution (many-to-many) before committing anything in this case, simply because this is all user-level code (the absence of this patch doesn't prohibit the user from doing anything... no package protected access rights are needed, etc).
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546022 ] Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

Any objection to me committing the CachedAnalyzer and CachedTokenizer pieces of this patch? I don't think they are affected by the other parts of this, and they solve the pre-analysis portion of this discussion. In the meantime, I will think some more about the generic field case, as I do think it is useful. I am also trying out some basic benchmarking on this.

{quote}
Things like entity extraction are normally not done by lucene analyzers AFAIK
{quote}

Consider yourself in the "know" now, as I have done this on a few occasions. But, yes, I do agree a one-to-many approach is probably better if it can be done in a generic way.
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546004 ] Yonik Seeley commented on LUCENE-1058:
--------------------------------------

{quote}
To some extent, I was thinking that this could help optimize Solr's copyField mechanism.
{quote}

Maybe... it would take quite a bit of work to automate it, though, I think.

As far as pre-analysis costs go: iteration is pretty much free in comparison to everything else. Memory is the big factor.

Things like entity extraction are normally not done by lucene analyzers AFAIK... but if one wanted a framework to do that, the problem is more generic. You really want to be able to add to multiple fields from multiple other fields.
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546002 ] Grant Ingersoll commented on LUCENE-1058: -

{quote} What if they wanted 3 fields instead of two? {quote}

True. I'll have to think about a more generic approach. In some sense, I think 2 is often sufficient, but you are right that it isn't totally generic in the spirit of Lucene.

To some extent, I was thinking that this could help optimize Solr's copyField mechanism. In Solr's case, I think you often have copy fields that have marginal differences in the filters that are applied. It would be useful for Solr to be able to optimize these so that it doesn't have to go through the whole analysis chain again.

{quote} Isn't this what your current code does? {quote}

No. In my main use case (# of buffered tokens is << # of source tokens) the only tokens kept around are the (much) smaller subset of buffered tokens. In the pre-analysis approach you have to keep both the source field tokens and the buffered tokens, not to mention that you increase the work by having to iterate over the cached tokens in the list in Lucene. Thus, you have the cost of the analysis in your application, plus the storage of both token lists (one large, one small, likely), and then in Lucene you have the cost of iterating over two lists. In my approach, I think, you have the cost of analysis plus the cost of storing one (small) list of tokens and the cost of iterating that list.
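Grant's siphoning idea can be illustrated without Lucene's API at all. The following is a simplified, hypothetical sketch (tokens are plain Strings and the "filter" is just an iterator wrapper, not a real TokenFilter): as the main stream is consumed, tokens matching a predicate are copied into a small side buffer that a second field can consume later, so only the (much smaller) buffered subset is kept in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the siphoning idea: pass tokens through unchanged, but copy
// the ones matching a predicate into a side buffer for a second field.
public class SiphoningFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final List<String> buffer = new ArrayList<String>();

    public SiphoningFilter(Iterator<String> input) {
        this.input = input;
    }

    // Hypothetical predicate: treat capitalized tokens as "proper nouns".
    private boolean shouldBuffer(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    public boolean hasNext() {
        return input.hasNext();
    }

    public String next() {
        String token = input.next();
        if (shouldBuffer(token)) {
            buffer.add(token); // siphon off a copy; the main stream is unaffected
        }
        return token;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // After the main field has been fully consumed, the side field reads this.
    public List<String> getBuffered() {
        return buffer;
    }
}
```

Unlike caching the entire source stream, the side buffer here holds only the extracted subset, which matches the cost argument above when buffered tokens are far fewer than source tokens.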
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545999 ] Yonik Seeley commented on LUCENE-1058: --

{quote}As for the convoluted cross-field logic, I don't think it is all that convoluted.{quote}

But it's baked into CollaboratingAnalyzer... it seems like this is better left to the user. What if they wanted 3 fields instead of two?

{quote} I do agree somewhat about the pre-analysis approach, except for the case where there may be a large number of tokens in the source field, in which case, you are holding them around in memory {quote}

Isn't this what your current code does?
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545995 ] Grant Ingersoll commented on LUCENE-1058: -

{quote} Maybe I'm missing something? {quote}

No, I don't think you are missing anything in that use case; it's just an example of its use. And I am not totally sold on this approach, but mostly am :-)

I had originally considered your option, but didn't feel it was satisfactory for the case where you are extracting things like proper nouns, or maybe generating a category value. The more general case is where not all the tokens are needed (in fact, very few are). In those cases, you have to go back through the whole list of cached tokens in order to extract the ones you want.

In fact, thinking some more on it, I am not sure my patch goes far enough, in the sense of: what if you want it to buffer in mid-stream? For example, if you had:

StandardTokenizer
Proper Noun TF
LowerCaseTF
StopTF

and Proper Noun TF is solely responsible for setting aside proper nouns as it comes across them in the stream.

As for the convoluted cross-field logic, I don't think it is all that convoluted. There are only two fields, and the implementing Analyzer takes care of all of it. The only real requirement the application has is that the fields be ordered correctly.

I do agree somewhat about the pre-analysis approach, except for the case where there may be a large number of tokens in the source field, in which case you are holding them around in memory (maxFieldLength mitigates this to some extent). Also, it puts the onus on the app writer to do it, when it could be pretty straightforward for Lucene to do it w/o its usual analysis pipeline.

At any rate, separate from the CollaboratingAnalyzer, I do think the CachedTokenFilter is useful, especially in supporting the pre-analysis approach.
[jira] Updated: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1044: --- Attachment: LUCENE-1044.take4.patch

OK, I did a simplistic patch (attached) whereby FSDirectory has a background thread that re-opens, syncs, and closes those files that Lucene has written. (I'm using a modified version of the class from Doron's test.)

This patch is nowhere near ready to commit; I just coded up enough so we could get a rough measure of the performance cost of syncing. E.g., we must prevent deletion of a commit point until a future commit point is fully sync'd to stable storage; we must also take care not to sync a file that has been deleted before we sync'd it; don't sync until the end when running with autoCommit=false; merges, if run by ConcurrentMergeScheduler, should [maybe] sync in the foreground; maybe forcefully throttle back updates if syncing is falling too far behind; etc.

I ran the same alg as the tests above (index first 150K docs of Wikipedia). For each IO system I ran CFS and non-CFS crossed with sync and nosync (4 tests). Time is the fastest of 2 runs:

|| IO System || CFS sync || CFS nosync || CFS % slower || non-CFS sync || non-CFS nosync || non-CFS % slower ||
| ReiserFS 6-drive RAID5 array Linux (2.6.22.1) | 188 | 157 | 19.7% | 143 | 147 | -2.7% |
| EXT3 single internal drive Linux (2.6.22.1) | 173 | 157 | 10.2% | 136 | 132 | 3.0% |
| 4 drive RAID0 array Mac Pro (10.4 Tiger) | 153 | 152 | 0.7% | 150 | 149 | 0.7% |
| Win XP Pro laptop, single drive | 463 | 352 | 31.5% | 343 | 335 | 2.4% |
| Mac Pro single external drive | 463 | 352 | 31.5% | 343 | 335 | 2.4% |

The good news is, the non-CFS case shows very little cost when we do BG sync'ing! The bad news is, the CFS case still shows a high cost. However, by not sync'ing the files that go into the CFS (and also not committing a new segments_N file until after the CFS is written) I expect that cost to go way down.
One caveat: I'm using an 8 MB RAM buffer for all of these tests. As Yonik pointed out, if you have a smaller buffer, or you add just a few docs and then close your writer, the sync cost as a percentage of net indexing time will be quite a bit higher.

> Behavior on hard power shutdown
> -------------------------------
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
> Reporter: venkat rangan
> Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: FSyncPerfTest.java, LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch
>
> When indexing a large number of documents, upon a hard power failure (e.g. pull the power cord), the index seems to get corrupted. We start a Java application as a Windows Service, and feed it documents. In some cases (after an index size of 1.7GB, with 30-40 index segment .cfs files), the following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct. After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our customer deployments to 1.9 or a later version, but would be happy to back-port a patch, if the patch is small enough and if this problem is already solved.

-- This message is automatically generated by JIRA.
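The background-syncer idea described above can be approximated with plain java.io/java.nio. This is a rough, self-contained sketch with hypothetical names, not the actual patch (which must also handle commit points, deleted files, and autoCommit=false): a worker thread drains a queue of already-written files and forces each to stable storage, so the indexing thread never blocks on fsync.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Background thread that re-opens, fsyncs, and closes files handed to it.
public class BackgroundSyncer extends Thread {
    private final BlockingQueue<File> pending = new LinkedBlockingQueue<File>();
    private volatile boolean stop = false;

    // Called by the writer after it finishes writing a file.
    public void enqueue(File f) {
        pending.add(f);
    }

    // Signal shutdown and wait until all pending files have been synced.
    public void close() throws InterruptedException {
        stop = true;
        this.join();
    }

    public void run() {
        while (true) {
            File f = pending.poll();
            if (f == null) {
                if (stop) break; // queue drained and shutdown requested
                try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                continue;
            }
            try {
                // Re-open the already-written file and force it to stable storage.
                RandomAccessFile raf = new RandomAccessFile(f, "rw");
                try {
                    raf.getChannel().force(true);
                } finally {
                    raf.close();
                }
            } catch (IOException e) {
                // The file may have been deleted before we got to it; skip it.
            }
        }
    }
}
```

As the message notes, a real implementation additionally needs to hold back deletion of a commit point until the next one is fully synced, and possibly throttle writers if syncing falls too far behind.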
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545966 ] Yonik Seeley commented on LUCENE-1058: --

Maybe I'm not looking at it the right way yet, but I'm not sure this feels "right"... Since Field has a tokenStreamValue(), wouldn't it be easiest to just use that? If the tokens of two fields are related, one could just pre-analyze those fields and set the token streams appropriately. Seems more flexible, and keeps any convoluted cross-field logic in the application domain.
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545959 ] Michael Busch commented on LUCENE-1058: ---

Grant, I'm not sure why we need this patch. For the test case that you're describing:

{quote} For example, if you want to have two fields, one lowercased and one not, but all the other analysis is the same, then you could save off the tokens to be output for a different field. {quote}

can't you simply do something like this:

{code:java}
Document d = new Document();
TokenStream t1 = new CachingTokenFilter(new WhitespaceTokenizer(reader));
TokenStream t2 = new LowerCaseFilter(t1);
d.add(new Field("f1", t1));
d.add(new Field("f2", t2));
{code}

Maybe I'm missing something?
[jira] Created: (LUCENE-1070) DateTools with DAY resolution doesn't work depending on your timezone
DateTools with DAY resolution doesn't work depending on your timezone --- Key: LUCENE-1070 URL: https://issues.apache.org/jira/browse/LUCENE-1070 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.2 Reporter: Mike Baroukh

Hi. There is another issue, closed, that introduced a bug: https://issues.apache.org/jira/browse/LUCENE-491

Here is a simple test case:

DateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm");
Date d1 = df.parse("10/10/2008 10:00");
System.err.println(DateTools.dateToString(d1, Resolution.DAY));
Date d2 = df.parse("10/10/2008 00:00");
System.err.println(DateTools.dateToString(d2, Resolution.DAY));

This outputs:

20081010
20081009

So the input days are the same, but with DAY resolution the indexed values don't refer to the same day. This is because of DateTools.round(): using a Calendar initialized to GMT can mean that the given Date lands on the previous day, depending on my timezone.

The part I don't understand is why take a Date as input, then convert it to a Calendar, then convert it again before printing? This operation is supposed to "round" the date, but simply using a DateFormat to format the date and print only the wanted fields does the same work, doesn't it?

The problem is: I see absolutely no solution at the moment. We could have a workaround if dateToString() took a Date as input, but with a long the timezone is lost. I also suspect that the correction made on the other issue (https://issues.apache.org/jira/browse/LUCENE-491) is worse than the bug, because it is correct only for those who use dates with a different timezone than the local timezone of the JVM.

So, my solution: add a DateTools.dateToString() that takes a Date as parameter, and deprecate the version that takes a long.

-- This message is automatically generated by JIRA.
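The day shift described above can be reproduced with plain java.util.Calendar, independent of Lucene. A minimal sketch (the method name is illustrative, not DateTools' API) that rounds a Date to DAY resolution inside a given zone, mimicking what a GMT-initialized calendar does to a local midnight:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class DayRounding {
    // Round a Date down to DAY resolution in the given zone and format it
    // as yyyyMMdd, mimicking DAY-resolution rounding in that zone.
    static String roundToDayInZone(Date d, TimeZone zone) {
        Calendar cal = Calendar.getInstance(zone);
        cal.setTime(d);
        // Clear everything below DAY resolution.
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(zone);
        return fmt.format(cal.getTime());
    }

    public static void main(String[] args) throws Exception {
        // Midnight local time in a zone east of GMT (Paris is UTC+2 on this date).
        SimpleDateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm");
        df.setTimeZone(TimeZone.getTimeZone("Europe/Paris"));
        Date d = df.parse("10/10/2008 00:00");
        // In GMT the same instant is still Oct 9, so GMT rounding shifts the day.
        System.out.println(roundToDayInZone(d, TimeZone.getTimeZone("GMT")));          // 20081009
        System.out.println(roundToDayInZone(d, TimeZone.getTimeZone("Europe/Paris"))); // 20081010
    }
}
```

The same instant rounds to different days depending on which zone the calendar uses, which is exactly why a GMT-only round() surprises users whose JVM runs in another timezone.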
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch

Fixed a failing test.
[jira] Commented: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption
[ https://issues.apache.org/jira/browse/LUCENE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545921 ] Michael McCandless commented on LUCENE-1069: This is the thread that spawned this issue: http://www.gossamer-threads.com/lists/lucene/java-user/55124 > CheckIndex incorrectly sees deletes as index corruption > --- > > Key: LUCENE-1069 > URL: https://issues.apache.org/jira/browse/LUCENE-1069 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1069.patch > > > There is a silly bug in CheckIndex whereby any segment with deletes is > considered corrupt. > Thanks to Bogdan Ghidireac for reporting this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption
[ https://issues.apache.org/jira/browse/LUCENE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1069. Resolution: Fixed

I just committed this. Thanks for catching this & reporting it, Bogdan.
[jira] Commented: (LUCENE-935) Improve maven artifacts
[ https://issues.apache.org/jira/browse/LUCENE-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545919 ] Grant Ingersoll commented on LUCENE-935: Done. > Improve maven artifacts > --- > > Key: LUCENE-935 > URL: https://issues.apache.org/jira/browse/LUCENE-935 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Attachments: lucene-935-new.patch, lucene-935-rename-poms.patch, > lucene-935.patch > > > There are a couple of things we can improve for the next release: > - "*pom.xml" files should be renamed to "*pom.xml.template" > - artifacts "lucene-parent" should extend "apache-parent" > - add source jars as artifacts > - update task to work with latest version of > maven-ant-tasks.jar > - metadata filenames should not contain "local" -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1067) TestStressIndexing has intermittent failures
[ https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1067. Resolution: Fixed I just committed this. Thanks Grant for catching this. > TestStressIndexing has intermittent failures > > > Key: LUCENE-1067 > URL: https://issues.apache.org/jira/browse/LUCENE-1067 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1067.patch > > > See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below: > OK, I have seen this twice in the last two days: > Testsuite: org.apache.lucene.index.TestStressIndexing > [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58 > sec > [junit] > [junit] - Standard Output --- > [junit] java.lang.NullPointerException > [junit] at > org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67) > [junit] at > org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66) > [junit] at org.apache.lucene.index.SegmentInfos > $FindSegmentsFile.run(SegmentInfos.java:544) > [junit] at > org > .apache > .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:209) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:192) > [junit] at > org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:56) > [junit] at org.apache.lucene.index.TestStressIndexing > $SearcherThread.doWork(TestStressIndexing.java:111) > [junit] at org.apache.lucene.index.TestStressIndexing > $TimedThread.run(TestStressIndexing.java:55) > [junit] - --- > [junit] Testcase: > testStressIndexAndSearching > (org.apache.lucene.index.TestStressIndexing): FAILED > [junit] hit unexpected exception in search1 > [junit] junit.framework.AssertionFailedError: hit unexpected > exception in search1 > [junit] at > org > .apache > 
.lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java: > 159) > [junit] at > org > .apache > .lucene > .index > .TestStressIndexing > .testStressIndexAndSearching(TestStressIndexing.java:187) > [junit] > [junit] > [junit] Test org.apache.lucene.index.TestStressIndexing FAILED > Subsequent runs have, however passed. Has anyone else hit this on > trunk? > I am running using "ant clean test" > I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure > how to reproduce at this point, but strikes me as a threading issue. > Oh joy! > I'll try to investigate more tomorrow to see if I can dream up a test > case. > -Grant -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures
[ https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545909 ] Michael McCandless commented on LUCENE-1067: Thanks for the review Yonik! I'll commit shortly.
[jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures
[ https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545904 ] Yonik Seeley commented on LUCENE-1067: -- Looks good, +1
Re: [jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures
On Nov 27, 2007 11:20 AM, robert engels <[EMAIL PROTECTED]> wrote: > Can you describe exactly how the lockless commits affects this? Or > could a reader be accessing the same RAMFile as a writer? No read/commit lock exists any more... so a writer could be in the process of writing the segments.nnn file and the reader may try to read it. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption
[ https://issues.apache.org/jira/browse/LUCENE-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1069: --- Attachment: LUCENE-1069.patch

Attached patch (with new unit test) fixes it. I plan to commit shortly...
[jira] Created: (LUCENE-1069) CheckIndex incorrectly sees deletes as index corruption
CheckIndex incorrectly sees deletes as index corruption --- Key: LUCENE-1069 URL: https://issues.apache.org/jira/browse/LUCENE-1069 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.3 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.3 Attachments: LUCENE-1069.patch There is a silly bug in CheckIndex whereby any segment with deletes is considered corrupt. Thanks to Bogdan Ghidireac for reporting this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
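The bug class described above can be illustrated with a small sketch. This is a hypothetical simplification, not Lucene's actual CheckIndex code: a naive consistency check flags any segment whose live-doc count differs from maxDoc as corrupt, even though deletions legitimately cause exactly that difference; the corrected form accounts for the recorded delete count.

```java
// Hypothetical sketch of the reported bug class; not Lucene's actual
// CheckIndex implementation.
public class SegmentCheckSketch {

    // Buggy form: any segment with deletions is reported as corrupt,
    // because liveDocs < maxDoc whenever delCount > 0.
    public static boolean naiveLooksCorrupt(int maxDoc, int liveDocs) {
        return liveDocs != maxDoc;
    }

    // Corrected form: only a mismatch remaining after accounting for the
    // segment's recorded delete count indicates real corruption.
    public static boolean fixedLooksCorrupt(int maxDoc, int liveDocs, int delCount) {
        return liveDocs != maxDoc - delCount;
    }
}
```

With 100 docs and 10 deletes, the naive check wrongly reports corruption while the corrected check passes.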
Re: [jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures
Can you describe exactly how the lockless commits affects this? Or could a reader be accessing the same RAMFile as a writer? Seems that this really deviates from the simplicity of the write-once design of the original Lucene. Do writers share the same underlying RAMDirectory? Seems this would cause a lot of contention. Or point me to the relevant documentation. On Nov 27, 2007, at 10:11 AM, Michael McCandless (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1067? page=com.atlassian.jira.plugin.system.issuetabpanels:comment- tabpanel#action_12545885 ] Michael McCandless commented on LUCENE-1067: OK I think this is just a thread safety issue on RAMFile. That class has these comments: // Only one writing stream with no concurrent reading streams, so no file synchronization required // Direct read-only access to state supported for streams since a writing stream implies no other concurrent streams which were true before lockless commits but after lockless commits are not true, specifically for the segments_N and segments.gen files. I think this fix is to make "ArrayList buffers" private (it's package private now), add methods to get a buffer & get the number of buffers, and make sure all methods that access "buffers" are synchronized. 
TestStressIndexing has intermittent failures Key: LUCENE-1067 URL: https://issues.apache.org/jira/browse/LUCENE-1067 Project: Lucene - Java Issue Type: Bug Reporter: Grant Ingersoll Assignee: Michael McCandless Priority: Minor Fix For: 2.3 See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below: OK, I have seen this twice in the last two days:
Testsuite: org.apache.lucene.index.TestStressIndexing
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58 sec
[junit]
[junit] - Standard Output ---
[junit] java.lang.NullPointerException
[junit] at org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
[junit] at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
[junit] at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:544)
[junit] at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
[junit] at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
[junit] at org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
[junit] at org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:56)
[junit] at org.apache.lucene.index.TestStressIndexing$SearcherThread.doWork(TestStressIndexing.java:111)
[junit] at org.apache.lucene.index.TestStressIndexing$TimedThread.run(TestStressIndexing.java:55)
[junit] - ---
[junit] Testcase: testStressIndexAndSearching (org.apache.lucene.index.TestStressIndexing): FAILED
[junit] hit unexpected exception in search1
[junit] junit.framework.AssertionFailedError: hit unexpected exception in search1
[junit] at org.apache.lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:159)
[junit] at org.apache.lucene.index.TestStressIndexing.testStressIndexAndSearching(TestStressIndexing.java:187)
[junit]
[junit]
[junit] Test org.apache.lucene.index.TestStressIndexing FAILED
Subsequent runs have, however, passed. Has anyone else hit this on trunk?
I am running using "ant clean test" I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure how to reproduce at this point, but strikes me as a threading issue. Oh joy! I'll try to investigate more tomorrow to see if I can dream up a test case. -Grant -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1067) TestStressIndexing has intermittent failures
[ https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1067: --- Attachment: LUCENE-1067.patch Attached patch. All tests pass. I plan to commit in a day or two. With this fix I can't get the test to fail after running 90+ times on the MacPro quad. > TestStressIndexing has intermittent failures > > > Key: LUCENE-1067 > URL: https://issues.apache.org/jira/browse/LUCENE-1067 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1067.patch > > > See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below: > OK, I have seen this twice in the last two days: > Testsuite: org.apache.lucene.index.TestStressIndexing > [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58 > sec > [junit] > [junit] - Standard Output --- > [junit] java.lang.NullPointerException > [junit] at > org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67) > [junit] at > org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66) > [junit] at org.apache.lucene.index.SegmentInfos > $FindSegmentsFile.run(SegmentInfos.java:544) > [junit] at > org > .apache > .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:209) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:192) > [junit] at > org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:56) > [junit] at org.apache.lucene.index.TestStressIndexing > $SearcherThread.doWork(TestStressIndexing.java:111) > [junit] at org.apache.lucene.index.TestStressIndexing > $TimedThread.run(TestStressIndexing.java:55) > [junit] - --- > [junit] Testcase: > testStressIndexAndSearching > (org.apache.lucene.index.TestStressIndexing): FAILED > [junit] hit unexpected exception in search1 > [junit] 
junit.framework.AssertionFailedError: hit unexpected > exception in search1 > [junit] at > org > .apache > .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java: > 159) > [junit] at > org > .apache > .lucene > .index > .TestStressIndexing > .testStressIndexAndSearching(TestStressIndexing.java:187) > [junit] > [junit] > [junit] Test org.apache.lucene.index.TestStressIndexing FAILED > Subsequent runs have, however passed. Has anyone else hit this on > trunk? > I am running using "ant clean test" > I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure > how to reproduce at this point, but strikes me as a threading issue. > Oh joy! > I'll try to investigate more tomorrow to see if I can dream up a test > case. > -Grant -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1067) TestStressIndexing has intermittent failures
[ https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545885 ] Michael McCandless commented on LUCENE-1067: OK I think this is just a thread safety issue on RAMFile. That class has these comments: // Only one writing stream with no concurrent reading streams, so no file synchronization required // Direct read-only access to state supported for streams since a writing stream implies no other concurrent streams which were true before lockless commits but after lockless commits are not true, specifically for the segments_N and segments.gen files. I think this fix is to make "ArrayList buffers" private (it's package private now), add methods to get a buffer & get the number of buffers, and make sure all methods that access "buffers" are synchronized. > TestStressIndexing has intermittent failures > > > Key: LUCENE-1067 > URL: https://issues.apache.org/jira/browse/LUCENE-1067 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > > See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below: > OK, I have seen this twice in the last two days: > Testsuite: org.apache.lucene.index.TestStressIndexing > [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58 > sec > [junit] > [junit] - Standard Output --- > [junit] java.lang.NullPointerException > [junit] at > org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67) > [junit] at > org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66) > [junit] at org.apache.lucene.index.SegmentInfos > $FindSegmentsFile.run(SegmentInfos.java:544) > [junit] at > org > .apache > .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:209) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:192) > [junit] at > 
org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:56) > [junit] at org.apache.lucene.index.TestStressIndexing > $SearcherThread.doWork(TestStressIndexing.java:111) > [junit] at org.apache.lucene.index.TestStressIndexing > $TimedThread.run(TestStressIndexing.java:55) > [junit] - --- > [junit] Testcase: > testStressIndexAndSearching > (org.apache.lucene.index.TestStressIndexing): FAILED > [junit] hit unexpected exception in search1 > [junit] junit.framework.AssertionFailedError: hit unexpected > exception in search1 > [junit] at > org > .apache > .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java: > 159) > [junit] at > org > .apache > .lucene > .index > .TestStressIndexing > .testStressIndexAndSearching(TestStressIndexing.java:187) > [junit] > [junit] > [junit] Test org.apache.lucene.index.TestStressIndexing FAILED > Subsequent runs have, however passed. Has anyone else hit this on > trunk? > I am running using "ant clean test" > I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure > how to reproduce at this point, but strikes me as a threading issue. > Oh joy! > I'll try to investigate more tomorrow to see if I can dream up a test > case. > -Grant -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
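The fix Michael describes above — making the buffer list private instead of package-private, adding accessors, and synchronizing every method that touches it — can be sketched as follows. This is a simplified stand-in, not Lucene's actual RAMFile class; field and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the described RAMFile fix, not the actual Lucene
// class: "buffers" is private rather than package-private, and every method
// that touches it is synchronized, so a reader opening segments_N while a
// writer is still appending no longer observes a half-updated list (the
// NullPointerException seen in the stress test).
public class SyncRamFileSketch {
    private final List<byte[]> buffers = new ArrayList<>();
    private long length;

    public synchronized void addBuffer(byte[] buffer) {
        buffers.add(buffer);
        length += buffer.length;
    }

    public synchronized byte[] getBuffer(int index) {
        return buffers.get(index);
    }

    public synchronized int numBuffers() {
        return buffers.size();
    }

    public synchronized long getLength() {
        return length;
    }
}
```

Synchronizing on the file object itself is the coarsest possible fix, but since only segments_N and segments.gen are contended this way, the extra locking cost should be negligible.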
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch Added some more documentation, plus a test showing it is bad to use the no value Field constructor w/o support from the Analyzer to produce tokens. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Issue Comment Edited: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545846 ] gsingers edited comment on LUCENE-1058 at 11/27/07 6:11 AM: --- Added some more documentation, plus a test showing it is bad to use the no value Field constructor w/o support from the Analyzer to produce tokens. If no objections, I will commit on Thursday or Friday of this week. was (Author: gsingers): Added some more documentation, plus a test showing it is bad to use the no value Field constructor w/o support from the Analyzer to produce tokens. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
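The "siphon off certain tokens into a buffer" idea behind LUCENE-1058 can be illustrated in plain Java. This sketch deliberately avoids Lucene's TokenStream/TokenFilter API (whose reuse semantics are exactly what the issue is still working out); the names and the predicate-based selection are illustrative assumptions, not the patch's actual design.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Plain-Java illustration of a "tee" filter, not Lucene's TokenFilter API:
// tokens flow through unchanged, while the ones matching a predicate are
// also copied into a buffer so a second field (e.g. a non-lowercased copy)
// can consume them later without re-running the whole analysis chain.
public class TeeTokens {
    public static List<String> tee(Iterator<String> tokens,
                                   Queue<String> buffer,
                                   java.util.function.Predicate<String> siphon) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (siphon.test(t)) {
                buffer.add(t);   // saved for the second field
            }
            out.add(t);          // always emitted downstream
        }
        return out;
    }
}
```

For example, teeing with a predicate that selects capitalized tokens leaves the main stream untouched while the buffer collects only the capitalized ones for the second field.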
Re: Potential bug in StandardTokenizerImpl
I am the guy who threw out the question about the acronym/host detection anomaly in the StandardAnalyzer class. Thanks to Shai Erera for translating the discussion onto the developers' list. I am surprised by Chris Hostetter's response, as this issue was treated by Erik Hatcher on November 22, 2005. I am exploring Hatcher's superb book, Lucene in Action, trying to work around this issue, but I can't believe it hasn't been fixed yet. As I explained on the users' list, I've found that indexing fails to include certain emails and words that are present in the logfile when I launch an IndexWriter over a huge directory of logs. As I tried to isolate this bug, I got the acronym-interpretation issue. Maybe there are more hidden anomalies in the StandardAnalyzer behavior under such a huge load. At this moment I can say this behavior is deterministic, so I can reproduce it over subsequent index and search calls, and it occurs with the same words and emails over and over. Could it be a side effect of document vectorization, since the logs are not natural language? Given that Lucene computes whether a token conveys relevant info (as the vector space model states), what if Lucene decided the token was not relevant? All of this supposing it works well, of course... Any ideas about this, or have you heard of it? Thanks and regards. Eugenio F. Martínez Pacheco Fundación Instituto Tecnológico de Galicia - Área TIC TFN: 981 173 206 FAX: 981 173 223 VIDEOCONFERENCIA: 981 173 596 [EMAIL PROTECTED]
Re: (LUCENE-1067) - Make TopDocs constructor public
Oops, the issue number is 1064, not 1067. Sorry for the confusion. On Nov 27, 2007 2:10 PM, Shai Erera <[EMAIL PROTECTED]> wrote: > Hey guys, > > No one has commented on this feature yet. The change is very simple. I > don't mind doing it myself, if you explain the process to me ... do I just > commit the change and then one of the committers needs to approve, or is my part > in this issue the patch I sent? > > Cheers, > > Shai Erera > -- Regards, Shai Erera
(LUCENE-1067) - Make TopDocs constructor public
Hey guys, No one has commented on this feature yet. The change is very simple. I don't mind doing it myself, if you explain the process to me ... do I just commit the change and then one of the committers needs to approve, or is my part in this issue the patch I sent? Cheers, Shai Erera
Re: Potential bug in StandardTokenizerImpl
Ok I opened https://issues.apache.org/jira/browse/LUCENE-1068 and attached the patch files. I don't know if and how you can deprecate a JFlex grammar though. On Nov 27, 2007 1:43 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Yes, please open a JIRA issue and submit your patches. > > I wonder if there is anyway to deprecate functionality in a JFlex > grammar? That is, is there anyway we can communicate to people that > both will be supported through 2.9 and then the correct way will be > supported in 3.x? > > -Grant > > On Nov 27, 2007, at 2:18 AM, Shai Erera wrote: > > > I understand it would change the behavior of existing search > > solutions, > > however the current behavior is just wrong. An ACRONYM cannot be > > ABC.DEF. If > > you look up acronym in Wikipedia, you find only examples of I.B.M. / > > U.S.A. > > like, or NATO, IBM, USA, but nothing of the form StandardAnalyzer > > currently > > recognizes. > > > > There are several ways to solve this change: > > 1. Create a new analyzer that fixes the problem - that way, > > applications > > that don't want to use it will not have to, if they feel ok with the > > current > > behavior. However, for those who would like to get a correct behavior, > > they'll be able to. This is not my favorite solution, but I think it > > would > > be preferable than simply not fixing it. > > 2. Fix it in the new version (2.3) and specifically mention that in > > the > > release notes. Aren't there releases where applications need to re- > > build the > > index because of fundamental changes? > > > > Am I the only one who thinks that? > > > > BTW, I changed the definition in the jflex file and recompiled using > > jflex > > and it indeed solved the problem. It now recognizes www.abc.com. and > > www.abc.com as hosts. I can attach the 'patch' files if you'd like to > > compare. 
> > > > On Nov 27, 2007 9:07 AM, Chris Hostetter <[EMAIL PROTECTED]> > > wrote: > > > >> > >> : If you pass "www.abc.com", the output is (www.abc.com, > >> 0,11,type=) > >> : (which is correct in my opinion). > >> : However, if you pass "www.abc.com." (notice the extra '.' at the > >> end), > >> the > >> : output is (wwwabccom,0,12,type=). > >> > >> see also... > >> > >> > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 > >> > >> > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 > >> > >> one hitch which potentially changing this now is that it would break > >> some searches in applications that have existing indexes built using > >> previous versions. > >> > >> > >> > >> -Hoss > >> > >> > >> - > >> To unsubscribe, e-mail: [EMAIL PROTECTED] > >> For additional commands, e-mail: [EMAIL PROTECTED] > >> > >> > > > > > > -- > > Regards, > > > > Shai Erera > > -- > Grant Ingersoll > http://lucene.grantingersoll.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- Regards, Shai Erera
[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1068: --- Attachment: standardTokenizerImpl.patch This is the result of re-compiling the JFlex fixed file. Not sure how useful this patch is, but I'm attaching it anyway. > Invalid behavior of StandardTokenizerImpl > - > > Key: LUCENE-1068 > URL: https://issues.apache.org/jira/browse/LUCENE-1068 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Reporter: Shai Erera > Attachments: standardTokenizerImpl.jflex.patch, > standardTokenizerImpl.patch > > > The following code prints the output of StandardAnalyzer: > Analyzer analyzer = new StandardAnalyzer(); > TokenStream ts = analyzer.tokenStream("content", new > StringReader("")); > Token t; > while ((t = ts.next()) != null) { > System.out.println(t); > } > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) > (which is correct in my opinion). > However, if you pass "www.abc.com." (notice the extra '.' at the end), the > output is (wwwabccom,0,12,type=). > I think the behavior in the second case is incorrect for several reasons: > 1. It recognizes the string incorrectly (no argue on that). > 2. It kind of prevents you from putting URLs at the end of a sentence, which > is perfectly legal. > 3. An ACRONYM, at least to the best of my understanding, is of the form > A.B.C. and not ABC.DEF. > I looked at StandardTokenizerImpl.jflex and I think the problem comes from > this definition: > // acronyms: U.S.A., I.B.M., etc. > // use a post-filter to remove dots > ACRONYM= {ALPHA} "." ({ALPHA} ".")+ > Notice how the comment relates to acronym as U.S.A., I.B.M. and not something > else. I changed the definition to > ACRONYM= {LETTER} "." ({LETTER} ".")+ > and it solved the problem. 
> This was also reported here: > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1068: --- Attachment: standardTokenizerImpl.jflex.patch This fixes the JFlex definition file. The change simply replaces: ACRONYM= {ALPHA} "." ({ALPHA} ".")+ with ACRONYM= {LETTER} "." ({LETTER} ".")+ > Invalid behavior of StandardTokenizerImpl > - > > Key: LUCENE-1068 > URL: https://issues.apache.org/jira/browse/LUCENE-1068 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Reporter: Shai Erera > Attachments: standardTokenizerImpl.jflex.patch, > standardTokenizerImpl.patch > > > The following code prints the output of StandardAnalyzer: > Analyzer analyzer = new StandardAnalyzer(); > TokenStream ts = analyzer.tokenStream("content", new > StringReader("")); > Token t; > while ((t = ts.next()) != null) { > System.out.println(t); > } > If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) > (which is correct in my opinion). > However, if you pass "www.abc.com." (notice the extra '.' at the end), the > output is (wwwabccom,0,12,type=). > I think the behavior in the second case is incorrect for several reasons: > 1. It recognizes the string incorrectly (no argue on that). > 2. It kind of prevents you from putting URLs at the end of a sentence, which > is perfectly legal. > 3. An ACRONYM, at least to the best of my understanding, is of the form > A.B.C. and not ABC.DEF. > I looked at StandardTokenizerImpl.jflex and I think the problem comes from > this definition: > // acronyms: U.S.A., I.B.M., etc. > // use a post-filter to remove dots > ACRONYM= {ALPHA} "." ({ALPHA} ".")+ > Notice how the comment relates to acronym as U.S.A., I.B.M. and not something > else. I changed the definition to > ACRONYM= {LETTER} "." ({LETTER} ".")+ > and it solved the problem. 
> This was also reported here: > http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 > http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-1067) TestStressIndexing has intermittent failures
[ https://issues.apache.org/jira/browse/LUCENE-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1067: -- Assignee: Michael McCandless > TestStressIndexing has intermittent failures > > > Key: LUCENE-1067 > URL: https://issues.apache.org/jira/browse/LUCENE-1067 > Project: Lucene - Java > Issue Type: Bug >Reporter: Grant Ingersoll >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > > See http://www.gossamer-threads.com/lists/lucene/java-dev/55092 copied below: > OK, I have seen this twice in the last two days: > Testsuite: org.apache.lucene.index.TestStressIndexing > [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58 > sec > [junit] > [junit] - Standard Output --- > [junit] java.lang.NullPointerException > [junit] at > org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67) > [junit] at > org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66) > [junit] at org.apache.lucene.index.SegmentInfos > $FindSegmentsFile.run(SegmentInfos.java:544) > [junit] at > org > .apache > .lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:209) > [junit] at > org.apache.lucene.index.IndexReader.open(IndexReader.java:192) > [junit] at > org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:56) > [junit] at org.apache.lucene.index.TestStressIndexing > $SearcherThread.doWork(TestStressIndexing.java:111) > [junit] at org.apache.lucene.index.TestStressIndexing > $TimedThread.run(TestStressIndexing.java:55) > [junit] - --- > [junit] Testcase: > testStressIndexAndSearching > (org.apache.lucene.index.TestStressIndexing): FAILED > [junit] hit unexpected exception in search1 > [junit] junit.framework.AssertionFailedError: hit unexpected > exception in search1 > [junit] at > org > .apache > .lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java: > 159) > [junit] at > org > 
.apache > .lucene > .index > .TestStressIndexing > .testStressIndexAndSearching(TestStressIndexing.java:187) > [junit] > [junit] > [junit] Test org.apache.lucene.index.TestStressIndexing FAILED > Subsequent runs have, however passed. Has anyone else hit this on > trunk? > I am running using "ant clean test" > I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure > how to reproduce at this point, but strikes me as a threading issue. > Oh joy! > I'll try to investigate more tomorrow to see if I can dream up a test > case. > -Grant -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
Invalid behavior of StandardTokenizerImpl - Key: LUCENE-1068 URL: https://issues.apache.org/jira/browse/LUCENE-1068 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Shai Erera The following code prints the output of StandardAnalyzer:
Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader(""));
Token t;
while ((t = ts.next()) != null) {
  System.out.println(t);
}
If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) (which is correct in my opinion). However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=). I think the behavior in the second case is incorrect for several reasons: 1. It recognizes the string incorrectly (no argument there). 2. It effectively prevents you from putting URLs at the end of a sentence, which is perfectly legal. 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF. I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM= {ALPHA} "." ({ALPHA} ".")+
Notice how the comment refers to acronyms such as U.S.A. and I.B.M., and nothing else. I changed the definition to
ACRONYM= {LETTER} "." ({LETTER} ".")+
and it solved the problem. This was also reported here: http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
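The grammar difference at the heart of this issue can be demonstrated with plain java.util.regex, standing in for the JFlex rules. Assuming {ALPHA} denotes a run of letters/digits and {LETTER} a single letter (a simplification of the actual JFlex macros), the old ACRONYM rule swallows a trailing-dot host like "www.abc.com." while the proposed rule accepts only true dotted acronyms:

```java
import java.util.regex.Pattern;

// Sketch of the two ACRONYM rules using plain regex, not JFlex. The
// character classes are simplified stand-ins for the {ALPHA} and {LETTER}
// macros in StandardTokenizerImpl.jflex.
public class AcronymRuleSketch {
    // Old rule: ACRONYM = {ALPHA} "." ({ALPHA} ".")+  (alphanumeric runs)
    static final Pattern OLD_RULE =
        Pattern.compile("[A-Za-z0-9]+\\.(?:[A-Za-z0-9]+\\.)+");

    // Proposed rule: ACRONYM = {LETTER} "." ({LETTER} ".")+  (single letters)
    static final Pattern NEW_RULE =
        Pattern.compile("[A-Za-z]\\.(?:[A-Za-z]\\.)+");

    public static boolean oldMatches(String s) {
        return OLD_RULE.matcher(s).matches();
    }

    public static boolean newMatches(String s) {
        return NEW_RULE.matcher(s).matches();
    }
}
```

Both rules accept "U.S.A.", but only the old rule also matches "www.abc.com.", which is why the tokenizer strips the dots and emits "wwwabccom" for that input.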
Re: Potential bug in StandardTokenizerImpl
Yes, please open a JIRA issue and submit your patches. I wonder if there is any way to deprecate functionality in a JFlex grammar? That is, is there any way we can communicate to people that both will be supported through 2.9 and then the correct way will be supported in 3.x? -Grant On Nov 27, 2007, at 2:18 AM, Shai Erera wrote: I understand it would change the behavior of existing search solutions; however, the current behavior is just wrong. An ACRONYM cannot be ABC.DEF. If you look up acronym in Wikipedia, you find only examples like I.B.M. / U.S.A., or NATO, IBM, USA, but nothing of the form StandardAnalyzer currently recognizes. There are several ways to solve this: 1. Create a new analyzer that fixes the problem - that way, applications that don't want to use it will not have to, if they feel OK with the current behavior. However, those who would like correct behavior will be able to get it. This is not my favorite solution, but I think it would be preferable to simply not fixing it. 2. Fix it in the new version (2.3) and specifically mention that in the release notes. Aren't there releases where applications need to re-build the index because of fundamental changes? Am I the only one who thinks that? BTW, I changed the definition in the jflex file and recompiled using jflex and it indeed solved the problem. It now recognizes www.abc.com. and www.abc.com as hosts. I can attach the 'patch' files if you'd like to compare. 
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383 http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926 one hitch with potentially changing this now is that it would break some searches in applications that have existing indexes built using previous versions. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Shai Erera -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1067) TestStressIndexing has intermittent failures
TestStressIndexing has intermittent failures

Key: LUCENE-1067
URL: https://issues.apache.org/jira/browse/LUCENE-1067
Project: Lucene - Java
Issue Type: Bug
Reporter: Grant Ingersoll
Priority: Minor
Fix For: 2.3

See http://www.gossamer-threads.com/lists/lucene/java-dev/55092, copied below:

OK, I have seen this twice in the last two days:

Testsuite: org.apache.lucene.index.TestStressIndexing
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 18.58 sec
[junit]
[junit] - Standard Output ---
[junit] java.lang.NullPointerException
[junit]   at org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:67)
[junit]   at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:66)
[junit]   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:544)
[junit]   at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
[junit]   at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
[junit]   at org.apache.lucene.index.IndexReader.open(IndexReader.java:192)
[junit]   at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:56)
[junit]   at org.apache.lucene.index.TestStressIndexing$SearcherThread.doWork(TestStressIndexing.java:111)
[junit]   at org.apache.lucene.index.TestStressIndexing$TimedThread.run(TestStressIndexing.java:55)
[junit] - ---
[junit] Testcase: testStressIndexAndSearching(org.apache.lucene.index.TestStressIndexing): FAILED
[junit] hit unexpected exception in search1
[junit] junit.framework.AssertionFailedError: hit unexpected exception in search1
[junit]   at org.apache.lucene.index.TestStressIndexing.runStressTest(TestStressIndexing.java:159)
[junit]   at org.apache.lucene.index.TestStressIndexing.testStressIndexAndSearching(TestStressIndexing.java:187)
[junit]
[junit] Test org.apache.lucene.index.TestStressIndexing FAILED

Subsequent runs have, however, passed. Has anyone else hit this on trunk?
I am running using "ant clean test". I'm on a Mac Pro, 4-core, 4GB machine, if that helps at all. Not sure how to reproduce at this point, but it strikes me as a threading issue. Oh joy! I'll try to investigate more tomorrow to see if I can dream up a test case. -Grant

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
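The pattern visible in the stack trace (a TimedThread driving repeated doWork() calls, with any unexpected exception surfaced as a test failure) can be sketched as follows. This is a hypothetical reconstruction modeled on the TestStressIndexing$TimedThread frames above, not Lucene's actual test code.

```java
// Hedged sketch: names and structure are assumptions modeled on the
// TestStressIndexing$TimedThread frames in the stack trace above.
public abstract class TimedThread extends Thread {
    private final long runTimeMillis;
    public volatile Throwable failure; // the test asserts this stayed null

    public TimedThread(long runTimeMillis) { this.runTimeMillis = runTimeMillis; }

    // One unit of work: a search or an indexing operation in the real test.
    public abstract void doWork() throws Exception;

    @Override public void run() {
        final long stop = System.currentTimeMillis() + runTimeMillis;
        try {
            while (System.currentTimeMillis() < stop) {
                doWork();
            }
        } catch (Throwable t) {
            failure = t; // e.g. the NullPointerException from RAMInputStream
        }
    }

    public static void main(String[] args) throws Exception {
        java.util.concurrent.atomic.AtomicInteger ops =
            new java.util.concurrent.atomic.AtomicInteger();
        TimedThread worker = new TimedThread(100) {
            public void doWork() { ops.incrementAndGet(); } // stand-in work
        };
        worker.start();
        worker.join();
        System.out.println("worked: " + (ops.get() > 0) + ", failure: " + worker.failure);
    }
}
```

A failure like the one reported shows up as a non-null `failure` after `join()`, which is what "hit unexpected exception in search1" corresponds to.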
Re: Occasional failure in TestStressIndexing.java
I opened https://issues.apache.org/jira/browse/LUCENE-1067 to track the issue.

On Nov 27, 2007, at 6:10 AM, Michael McCandless wrote:

OK I just ran the test 5 times, also on quad Mac Pro, and got the error to occur as well! Ugh. I will track it down.

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:

OK, I have seen this twice in the last two days: [quoted stack trace omitted; identical to the report in LUCENE-1067 above] Subsequent runs have, however, passed. Has anyone else hit this on trunk?
I am running using "ant clean test". I'm on a Mac Pro, 4-core, 4GB machine, if that helps at all. Not sure how to reproduce at this point, but it strikes me as a threading issue. Oh joy! I'll try to investigate more tomorrow to see if I can dream up a test case.

-Grant
Re: Occasional failure in TestStressIndexing.java
OK I just ran the test 5 times, also on quad Mac Pro, and got the error to occur as well! Ugh. I will track it down.

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> OK, I have seen this twice in the last two days:
> [quoted stack trace omitted; identical to the report in LUCENE-1067 above]
> Subsequent runs have, however passed. Has anyone else hit this on trunk?
>
> I am running using "ant clean test"
>
> I'm on a Mac Pro 4 core, 4GB machine, if that helps at all. Not sure
> how to reproduce at this point, but strikes me as a threading issue.
> Oh joy!
>
> I'll try to investigate more tomorrow to see if I can dream up a test
> case.
>
> -Grant
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
"Doug Cutting (JIRA)" <[EMAIL PROTECTED]> wrote on 26/11/2007 20:14:43:

> > I found out however that delaying the syncs (but intending to sync) also
> > means keeping the file handles open [...]
>
> Not necessarily. You could just queue the file names for sync,
> close them, and then have the background thread open, sync and
> close them. The close could trigger the OS to sync things
> faster in the background. Then the open/sync/close could
> mostly be a no-op. Might be worth a try.

Good point. Actually, even with a background thread we must use file names, because otherwise there's no control over the number of open file handles. In addition, my tests on XP indicated that this way many syncs were no-ops - i.e. close() and later open+sync+close was faster than flush() and later sync+close. On both XP and Linux, a background thread was faster than a sync-at-end.

Some numbers (no-sync, immediate-sync, at-end, background):

100 files of 10K:
  Linux: 5.7, 5.8, 6.4, 5.9
  XP: 6.6, 11.1, 7.7, 6.8
1,000 files of 1K:
  Linux: 5.8, 13.8, 11.2, 6.0
  XP: 8.1, 44.5, 19.2, 15.0
10,000 files of 100 chars:
  Linux: 7.0, 89.9, 68.0, 60.3

So, as much as I am not happy about adding a thread, it seems to be faster, at least for this synthetic test. I'm curious to see Mike's actual Lucene numbers. In any case, we should not sync files saved during non-commit writes; these are most of the writes for large indexes with autoCommit=false.

Doron
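Doug's queue-the-names approach can be sketched in a few lines. This is a hypothetical illustration, not the patch under discussion: the class and method names (BackgroundSyncer, enqueue) are invented, and a real implementation would need error handling and a retry policy.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the approach discussed above: write and close files immediately,
// queue only their names, and let a background thread reopen, fsync, and
// close each one. Names here are invented for illustration.
public class BackgroundSyncer implements Runnable {
    private final BlockingQueue<Path> pending = new LinkedBlockingQueue<>();
    private static final Path POISON = Path.of(""); // sentinel to stop the thread

    public void enqueue(Path file) { pending.add(file); }
    public void shutdown() { pending.add(POISON); }

    @Override public void run() {
        try {
            Path p;
            while ((p = pending.take()) != POISON) {
                // Reopen, fsync, close: often a near no-op if the OS already
                // flushed the pages in the background after the earlier close().
                try (FileOutputStream fos = new FileOutputStream(p.toFile(), true)) {
                    fos.getFD().sync();
                }
            }
        } catch (InterruptedException | IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        BackgroundSyncer syncer = new BackgroundSyncer();
        Thread t = new Thread(syncer);
        t.start();
        Path f = Files.createTempFile("seg", ".tmp");
        Files.write(f, new byte[] {1, 2, 3}); // write and close immediately
        syncer.enqueue(f);                    // queue the name, keep no handle
        syncer.shutdown();
        t.join();
        System.out.println("synced: " + Files.size(f)); // prints "synced: 3"
    }
}
```

Because only file names are queued, the number of open handles is bounded by the background thread itself, which matches Doron's observation that names, not handles, must be queued.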