Cannot Escape Special Characters in Search with Lucene.Net 2.0
Hi, my name is Abhilash and I work as a .NET developer/support engineer. I have come across an issue with the search option in our application, which uses Lucene.Net version 2.0. The scenario: if I try to search for the text TestTest (it is actually TestTest.doc that is being searched for), it returns 0 hits. While debugging I could see that the line written to parse the query is causing the problem. Here is the offending code:

    Query q = null;
    q = new global::Lucene.Net.QueryParsers.QueryParser(content, new StandardAnalyzer()).Parse(query);

At that point the variable query contains this: (title:(TestTest) shorttitle:(TestTest) content:(TestTest) keywords:(TestTest) description:(TestTest) ) and q ends up as this: title:test test shorttitle:test test content:test test keywords:test test description:test test. Hence the hit length is 0 at:

    IndexSearcher searcher = new IndexSearcher(indexPath);
    Hits hits = searcher.Search(q);

I tried adding \ before the special character, tried escaping, and tried enclosing the text in quotes, but all give the same outcome. Could anyone please help me with a fix? If required, I can post the full code here. Hope to hear from Lucene.Net. Many thanks, Abhilash
Re: Cannot Escape Special Characters in Search with Lucene.Net 2.0
Hi Abhilash,

Try with: Test\\Test

From the documentation (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html):

== Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ To escape these characters, use the \ before the character. For example, to search for (1+1):2 use the query: \(1\+1\)\:2 ==

Regards,
Vincent DARON
ASK

On 17/12/10 12:29, abhilash ramachandran wrote: [original question quoted in full above]
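To make that concrete, here is a minimal sketch in Java against the 2.x-era QueryParser API (the Lucene.Net 2.0 API mirrors it nearly method-for-method); QueryParser.escape() applies exactly the backslash escaping quoted above. Note, as later replies in this thread point out, escaping only fixes the query syntax: if the analyzer dropped the special character at index time, the escaped query still won't match anything.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class EscapeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical user input containing query-syntax metacharacters.
            String userInput = "(1+1):2";
            // escape() prefixes each special character with a backslash: \(1\+1\)\:2
            String escaped = QueryParser.escape(userInput);
            Query q = new QueryParser("content", new StandardAnalyzer()).parse(escaped);
            System.out.println(q);
        }
    }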
[jira] Created: (LUCENENET-385) Searching string with special character not working
Searching string with special character not working
---
Key: LUCENENET-385 URL: https://issues.apache.org/jira/browse/LUCENENET-385 Project: Lucene.Net Issue Type: Task Environment: .NET Framework 2.0+, C#.NET, ASP.NET, Webservices Reporter: Abhilash C R

[Issue description: identical to the mailing-list post above.]
[jira] Created: (LUCENE-2816) MMapDirectory speedups
MMapDirectory speedups
--
Key: LUCENE-2816 URL: https://issues.apache.org/jira/browse/LUCENE-2816 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 3.1, 4.0 Reporter: Robert Muir Assignee: Robert Muir

MMapDirectory has some performance problems:
# When the file is larger than Integer.MAX_VALUE, we use MultiMMapIndexInput, which does a lot of unnecessary bounds checks for its buffer switching etc. Instead, like MMapIndexInput, it should rely upon the contract of these operations in ByteBuffer (which will always do a bounds check and throw BufferUnderflowException). Our 'buffer' is so large (Integer.MAX_VALUE) that it's rare this happens, and doing our own bounds checks just slows things down.
# readInt()/readLong()/readShort() are slow and should just defer to ByteBuffer.getInt(), etc. This isn't very important since we don't use these much, but I think there's no reason users (e.g. codec writers) should have to readBytes() + wrap as a ByteBuffer + get an IntBuffer view when readInt() can be almost as fast...
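A rough sketch of the pattern point 1 describes, with hypothetical member names (curBuf, nextBuffer): let ByteBuffer do the bounds check on the hot path, and handle the rare chunk-boundary crossing only in the exception handler.

    // Sketch only -- curBuf and nextBuffer() are illustrative names, not the
    // actual members of MultiMMapIndexInput.
    public byte readByte() throws IOException {
        try {
            return curBuf.get();              // hot path: ByteBuffer bounds-checks for us
        } catch (BufferUnderflowException e) {
            nextBuffer();                     // rare path: we ran off the end of this mmap chunk
            return curBuf.get();
        }
    }

    public int readInt() throws IOException {
        try {
            return curBuf.getInt();           // point 2: defer to ByteBuffer.getInt()
        } catch (BufferUnderflowException e) {
            return super.readInt();           // the int straddles a chunk boundary: read byte-wise
        }
    }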
[jira] Updated: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2816: Attachment: LUCENE-2816.patch

Here's the most important benchmark: speeding up MultiMMap's readByte(s) in general.

MultiMMapIndexInput readByte(s) improvements [trunk, Standard codec]
||Query||QPS trunk||QPS patch||Pct diff||
|spanFirst(unit, 5)|12.72|12.85|1.0%|
|+nebraska +state|137.47|139.33|1.3%|
|spanNear([unit, state], 10, true)|2.90|2.94|1.4%|
|unit state|5.88|5.99|1.8%|
|unit~2.0|7.06|7.20|2.0%|
|+unit +state|8.68|8.87|2.2%|
|unit state|8.00|8.23|2.9%|
|unit~1.0|7.19|7.41|3.0%|
|unit*|22.66|23.41|3.3%|
|uni*|12.54|13.12|4.6%|
|united~1.0|10.61|11.12|4.8%|
|united~2.0|2.52|2.65|5.1%|
|state|28.72|30.23|5.3%|
|un*d|44.84|48.06|7.2%|
|u*d|13.17|14.51|10.2%|

In the bulk postings branch, I've been experimenting with various techniques for FOR/PFOR, and one thing I tried was simply decoding with readInt() from the DataInput. So I adapted For/PFOR to take DataInput and work on it directly, instead of reading into a byte[], wrapping it with a ByteBuffer, and working on an IntBuffer view. But when I did this, I found that MMap was slow for readInt(), etc., so now we implement these primitives with ByteBuffer.getInt(). This isn't very important since Lucene doesn't use these much, and it's mostly theoretical, but I still think things like readInt(), readShort(), and readLong() should be fast... For example, just earlier today someone posted an alternative PFOR implementation on LUCENE-1410 that uses DataInput.readInt().
MMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]
||Query||QPS branch||QPS patch||Pct diff||
|spanFirst(unit, 5)|12.14|11.99|-1.2%|
|united~1.0|11.32|11.33|0.1%|
|united~2.0|2.51|2.56|2.1%|
|unit~1.0|6.98|7.19|3.0%|
|unit~2.0|6.88|7.11|3.3%|
|spanNear([unit, state], 10, true)|2.81|2.96|5.2%|
|unit state|8.04|8.59|6.8%|
|+unit +state|10.97|12.12|10.5%|
|unit*|26.67|29.80|11.7%|
|unit state|5.59|6.27|12.3%|
|uni*|15.10|17.51|15.9%|
|state|33.20|38.72|16.6%|
|+nebraska +state|59.17|71.45|20.8%|
|un*d|35.98|47.14|31.0%|
|u*d|9.48|12.46|31.4%|

Here's the same benchmark of DataInput.readInt(), but with the MultiMMapIndexInput:

MultiMMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]
||Query||QPS branch||QPS patch||Pct diff||
|united~2.0|2.43|2.54|4.3%|
|united~1.0|10.78|11.39|5.7%|
|unit~1.0|6.81|7.21|5.8%|
|unit~2.0|6.62|7.05|6.5%|
|spanNear([unit, state], 10, true)|2.77|2.96|6.6%|
|unit state|7.85|8.53|8.7%|
|spanFirst(unit, 5)|10.50|11.71|11.5%|
|+unit +state|10.26|11.94|16.3%|
|unit state|5.39|6.31|17.0%|
|state|31.95|39.17|22.6%|
|unit*|24.39|31.02|27.2%|
|+nebraska +state|54.68|71.98|31.6%|
|u*d|9.53|12.62|32.5%|
|uni*|13.72|18.23|32.9%|
|un*d|35.87|48.19|34.3%|

Just to be sure, I ran this last one on sparc64 (big endian) also.

MultiMMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]
||Query||QPS branch||QPS patch||Pct diff||
|united~2.0|2.23|2.26|1.5%|
|unit~2.0|6.37|6.47|1.6%|
|united~1.0|11.33|11.59|2.3%|
|unit~1.0|9.68|10.05|3.7%|
|spanNear([unit, state], 10, true)|15.60|17.54|12.5%|
|unit*|127.14|144.08|13.3%|
|unit state|44.93|51.30|14.2%|
|spanFirst(unit, 5)|58.42|68.37|17.0%|
|uni*|56.66|67.53|19.2%|
|+nebraska +state|215.62|262.99|22.0%|
|+unit +state|63.18|77.86|23.2%|
|unit state|32.24|40.05|24.2%|
|u*d|29.13|36.69|26.0%|
|state|145.99|188.33|29.0%|
|un*d|65.27|87.20|33.6%|

I think some of these benchmarks also show that MultiMMapIndexInput might now be essentially just as fast as MMapIndexInput... but let's not go there yet and keep them separate for now.
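For reference, the two decode styles compared in the comment above look roughly like this (a sketch; in, dest, and blockSize are placeholder names):

    // Style 1: bulk-read the block, then decode through an IntBuffer view.
    byte[] scratch = new byte[blockSize * 4];
    in.readBytes(scratch, 0, scratch.length);
    IntBuffer view = ByteBuffer.wrap(scratch).asIntBuffer();
    view.get(dest, 0, blockSize);

    // Style 2: decode via DataInput.readInt() directly -- only competitive once
    // readInt() stops being implemented byte-at-a-time, which is what this patch fixes.
    for (int i = 0; i < blockSize; i++) {
        dest[i] = in.readInt();
    }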
[jira] Commented: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972398#action_12972398 ] Simon Willnauer commented on LUCENE-2816: Awesome results, Robert!! :)
Re: QParserPlugin as jar?
That was how I expected it to work (and yes, I registered it in solrconfig.xml). I wonder why it did not work when I tested it; I had to apply the patch and recompile. Guess I'll have to give it another try.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 15 Dec 2010, at 19:19, Erik Hatcher wrote:

On Dec 15, 2010, at 12:47, Jan Høydahl / Cominvent wrote: Hi, I tried to package the edismax QParser (SOLR-1553) as a .jar file for inclusion in an already installed Solr 1.4.1, and dropped my new jar in SOLR_HOME/lib. However it failed with an exception. I suspect this is because the patch modifies o.a.s.s.QParserPlugin, which already exists on the classpath. Is there a way to dynamically initialize new plugins without statically updating the QParserPlugin class?

Yes, you can simply register it in solrconfig.xml:

    <queryParser name="lucene" class="org.apache.solr.search.LuceneQParserPlugin"/>

The statically registered qparsers in QParserPlugin are just a convenience, so those come built in as registered (though they can be overridden by registering a different class with the same name).

Erik
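Following Erik's example, registering the SOLR-1553 parser from a plugin jar dropped in SOLR_HOME/lib should then be a one-line addition to solrconfig.xml; the class name below is an assumption taken from the patch, so verify it against the jar actually built:

    <!-- in solrconfig.xml; ExtendedDismaxQParserPlugin is assumed from the SOLR-1553 patch -->
    <queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin"/>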
[jira] Created: (LUCENE-2817) SimpleText has a bulk enum buffer reuse bug
SimpleText has a bulk enum buffer reuse bug
---
Key: LUCENE-2817 URL: https://issues.apache.org/jira/browse/LUCENE-2817 Project: Lucene - Java Issue Type: Bug Components: Codecs Affects Versions: Bulk Postings branch Reporter: Robert Muir

testBulkPostingsBufferReuse fails with SimpleText codec.
[jira] Commented: (LUCENE-2817) SimpleText has a bulk enum buffer reuse bug
[ https://issues.apache.org/jira/browse/LUCENE-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972410#action_12972410 ] Robert Muir commented on LUCENE-2817:

junit-sequential:
    [junit] Testsuite: org.apache.lucene.index.TestCodecs
    [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.492 sec
    [junit]
    [junit] - Standard Error -
    [junit] NOTE: reproduce with: ant test -Dtestcase=TestCodecs -Dtestmethod=testBulkPostingsBufferReuse -Dtests.seed=8878914233730236058:-5578535381307601353
    [junit] NOTE: test params are: codec=RandomCodecProvider: {field=SimpleText}, locale=mk, timezone=America/Indiana/Tell_City
    [junit] NOTE: all tests run in this JVM:
    [junit] [TestCodecs]
    [junit] - ---
    [junit] Testcase: testBulkPostingsBufferReuse(org.apache.lucene.index.TestCodecs): FAILED
    [junit] expected:org.apache.lucene.index.codecs.simpletext.simpletextfieldsreader$simpletextbulkpostingse...@1a42792 but was:org.apache.lucene.index.codecs.simpletext.simpletextfieldsreader$simpletextbulkpostingse...@2200d5
    [junit] junit.framework.AssertionFailedError: expected:org.apache.lucene.index.codecs.simpletext.simpletextfieldsreader$simpletextbulkpostingse...@1a42792 but was:org.apache.lucene.index.codecs.simpletext.simpletextfieldsreader$simpletextbulkpostingse...@2200d5
    [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1043)
    [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:981)
    [junit] at org.apache.lucene.index.TestCodecs.testBulkPostingsBufferReuse(TestCodecs.java:671)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.index.TestCodecs FAILED
[jira] Commented: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972413#action_12972413 ] Uwe Schindler commented on LUCENE-2816:

Awesome, Robert is turning into the MMap Performance Policeman! I like the idea of simply delegating the methods and catching the exception to fall back to a manual read with boundary transition! I just wanted to be sure that the position pointer in the buffer does not partly advance when a read request fails at a buffer boundary, but that seems to be the case.
[jira] Commented: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972416#action_12972416 ] Uwe Schindler commented on LUCENE-2816:

One thing to add: when using readFloat & co., we should make sure that we set the endianness explicitly in the ctor. I just want to make sure that the endianness is correct and document that it is big endian for Lucene.
[jira] Issue Comment Edited: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972416#action_12972416 ] Uwe Schindler edited comment on LUCENE-2816 at 12/17/10 4:43 AM:

-One thing to add: when using readFloat & co., we should make sure that we set the endianness explicitly in the ctor. I just want to make sure that the endianness is correct and document that it is big endian for Lucene.-

We don't need that: the initial order of a byte buffer is always BIG_ENDIAN.
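Uwe's point is easy to verify; per the ByteBuffer javadocs the initial order is always big endian, though it could still be pinned explicitly if we wanted the contract documented in code (a trivial sketch):

    ByteBuffer bb = ByteBuffer.allocate(8);
    assert bb.order() == ByteOrder.BIG_ENDIAN;  // always true for a freshly created buffer
    bb.order(ByteOrder.BIG_ENDIAN);             // optional: makes the assumption explicit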
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972417#action_12972417 ] Michael McCandless commented on LUCENE-2814:

bq. Probably still a smaller change than flex indexing

Yes, true!

bq. But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far.

Well, not a mistake... this was unavoidable given that trunk was so far from what DWPT needs. But with per-seg deletes (LUCENE-2680) done, and no more doc stores (this issue), I think we've got DWPT down to about as bite-sized as it can be (it's still gonna be big!). I can help merge too! I think coordinating on IRC #lucene is a good idea? It seems like LUCENE-2573 needs to be incorporated into IW's new FlushControl class (which is already coordinating flush-due-to-deletes and flush-due-to-added-docs-of-one-DWPT).

stop writing shared doc stores across segments
--
Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch

Shared doc stores enable the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1, I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices of a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So I think we should turn off the write side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy.
[jira] Commented: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972418#action_12972418 ] Robert Muir commented on LUCENE-2816:

bq. I just wanted to be sure that the position pointer in the buffer does not partly advance when a read request fails at a buffer boundary, but that seems to be the case.

Yes, this is guaranteed in the APIs, and also tested well by TestMultiMMap, which uses random chunk sizes between 20 and 100 (including odd numbers etc.). Though we should enhance this test; I think it just retrieves documents at the moment... probably better if it did some searches too.
[jira] Commented: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972420#action_12972420 ] Michael McCandless commented on LUCENE-2816:

Good grief! What amazing gains, especially w/ the PFor codec, which of course makes super heavy use of .readInt(). Awesome Robert! This will mean that w/ the cutover to the FOR/PFOR codec for 4.0, MMapDir will likely have a huge edge over NIOFSDir?
[jira] Commented: (LUCENE-2817) SimpleText has a bulk enum buffer reuse bug
[ https://issues.apache.org/jira/browse/LUCENE-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972421#action_12972421 ] Michael McCandless commented on LUCENE-2817:

Duh, silliness. I just added this test (to assert BulkPostingsEnum/buffer reuse) but SimpleText never re-uses. I'll add an Assume.
[jira] Commented: (LUCENE-2816) MMapDirectory speedups
[ https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972422#action_12972422 ] Robert Muir commented on LUCENE-2816:

{quote} Good grief! What amazing gains, especially w/ the PFor codec, which of course makes super heavy use of .readInt(). Awesome Robert! This will mean that w/ the cutover to the FOR/PFOR codec for 4.0, MMapDir will likely have a huge edge over NIOFSDir? {quote}

This isn't really a 'gain' for the bulkpostings branch? This is just making DataInput.readInt() faster. Currently the bulkpostings branch uses readBytes(byte[]), then wraps into a ByteBuffer and processes an IntBuffer view of that. I switched to just using readInt() from DataInput directly [FrameOfRefDataInput] and found it to be much slower than this IntBuffer method. This whole benchmark is just benching DataInput.readInt()... So we shouldn't change anything in bulkpostings; this isn't faster than the IntBuffer method in my tests, at best it's equivalent... but we should fix this slowdown in our APIs.
[jira] Resolved: (LUCENE-2817) SimpleText has a bulk enum buffer reuse bug
[ https://issues.apache.org/jira/browse/LUCENE-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2817. Resolution: Fixed Fix Version/s: Bulk Postings branch

Fixed. In fact SimpleText does try to reuse, and it was indeed buggy! I fixed it.
[jira] Resolved: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2680. Resolution: Fixed

Improve how IndexWriter flushes deletes against existing segments
-
Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch

IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index, so we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But most merges are small, yet in a big index we apply deletes to all of the segments, which is very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do: we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex, since opening the terms index is much faster.
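A toy sketch of the "generations of pending deletions" bookkeeping described above; all names are hypothetical, and the real patch is considerably more involved:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;

    // Each merge/flush pinches off the current buffer and starts the next generation.
    class BufferedDeletes {
        final long gen;                                  // generation of this delete buffer
        final Map<Term, Integer> terms = new HashMap<Term, Integer>();

        BufferedDeletes(long gen) {
            this.gen = gen;
        }

        // A segment born at generation segGen already reflects every delete up to and
        // including segGen, so only newer generations still apply to it.
        boolean appliesTo(long segGen) {
            return gen > segGen;
        }
    }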
Lucene-Solr-tests-only-3.x - Build # 2623 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2623/

1 tests failed.
REGRESSION: org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback

Error Message:
ConcurrentMergeScheduler hit unhandled exceptions

Stack Trace:
junit.framework.AssertionFailedError: ConcurrentMergeScheduler hit unhandled exceptions
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:891)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:829)
    at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:375)

Build Log (for compile errors):
[...truncated 4528 lines...]
Lucene-Solr-tests-only-trunk - Build # 2651 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2651/

1 tests failed.
REGRESSION: org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads

Error Message:
CheckIndex failed

Stack Trace:
java.lang.RuntimeException: CheckIndex failed
    at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:87)
    at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:73)
    at org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:131)
    at org.apache.lucene.index.TestIndexWriterOnJRECrash.checkIndexes(TestIndexWriterOnJRECrash.java:137)
    at org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads(TestIndexWriterOnJRECrash.java:61)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1048)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:986)

Build Log (for compile errors):
[...truncated 3301 lines...]
[jira] Commented: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972505#action_12972505 ] Yonik Seeley commented on LUCENE-2815:

Hmmm, this patch causes test failures because ConcurrentHashMap doesn't handle null values.

MultiFields not thread safe
---
Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 Attachments: LUCENE-2815.patch

MultiFields looks like it has thread safety issues
[jira] Updated: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-2815: Attachment: LUCENE-2815.patch

Here's an updated patch that avoids containsKey() followed by get() (just an optimization) and avoids caching null Terms instances. This is the right thing to do anyway, since one could easily blow up the cache with fields that don't exist.
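The pattern Yonik describes looks roughly like this (a sketch with hypothetical names; note that ConcurrentHashMap.put() throws NullPointerException on a null value, which is what tripped up the first patch):

    private final ConcurrentHashMap<String, Terms> termsCache = new ConcurrentHashMap<String, Terms>();

    public Terms terms(String field) throws IOException {
        Terms result = termsCache.get(field);   // one get() instead of containsKey() + get()
        if (result == null) {
            result = buildTerms(field);         // hypothetical helper that may return null
            if (result != null) {
                termsCache.put(field, result);  // cache hits only; null is never stored, so
            }                                   // nonexistent fields can't blow up the cache
        }
        return result;
    }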
[jira] Updated: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2723: Attachment: LUCENE-2723_wastedint.patch

Patch with more refactoring of the For/PFor decompression:
* The decompressors take DataInput, but still use the IntBuffer technique for now.
* I removed the wasted int-per-block in For.

Speed up Lucene's low level bulk postings read API
--
Key: LUCENE-2723 URL: https://issues.apache.org/jira/browse/LUCENE-2723 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch

Spinoff from LUCENE-1410. The flex DocsEnum has a simple bulk-read API that reads the next chunk of docs/freqs, but it's a poor fit for int-block codecs like FOR/PFOR (from LUCENE-1410). This is not unlike sucking coffee through those tiny plastic coffee stirrers they hand out on airplanes that, surprisingly, also happen to function as a straw. As a result we see no perf gain from using FOR/PFOR. I had hacked up a fix for this, described in my blog post at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html I'm opening this issue to get that work to a committable point. So... I've worked out a new bulk-read API to address the performance bottleneck. It has some big changes over the current bulk-read API:
* You can now also bulk-read positions (but not payloads); however, I have yet to cut over positional queries.
* The buffer contains doc deltas, not absolute values, for docIDs and positions (freqs are absolute).
* Deleted docs are not filtered out.
* The doc freq buffers need not be aligned. For fixed int-block codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16, group varint, etc.) they won't be.
It's still a work in progress...
[jira] Resolved: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved LUCENE-2815. Resolution: Fixed

Committed.
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972531#action_12972531 ] Jason Rutherglen commented on LUCENE-2814:

bq. I think we've got DWPT down to about as bite-sized as it can be (it's still gonna be big!)

Indeed!

bq. I think coordinating on IRC #lucene is a good idea?

It'd be nice if there were a log of IRC #lucene; otherwise I prefer Jira.

bq. It seems like LUCENE-2573 needs to be incorporated into IW's new FlushControl class

Right, into the DWPT branch.
RE: Cannot Escape Special Characters in Search with Lucene.Net 2.0
Robert's correct: the StandardAnalyzer will split the input text at those characters, so your index will not contain them. As in this simple example:

    StandardAnalyzer aa = new StandardAnalyzer();
    System.IO.StringReader srs = new System.IO.StringReader("aaa bbb testtest ccc ddd");
    Lucene.Net.Analysis.TokenStream ts = aa.TokenStream(srs);
    Lucene.Net.Analysis.Token tk;
    while ((tk = ts.Next()) != null)
    {
        System.Console.WriteLine(String.Format("Token: \"{0}\": S:{1}, E:{2}", tk.TermText(), tk.StartOffset(), tk.EndOffset()));
    }

The output looks like this:

    Token: "aaa": S:0, E:3
    Token: "bbb": S:4, E:7
    Token: "test": S:8, E:12
    Token: "test": S:14, E:18
    Token: "ccc": S:19, E:22
    Token: "ddd": S:23, E:26

You can see that those characters were identified as separators and two "test" tokens were emitted, not the single "testtest" you expected.

- Neal

-----Original Message-----
From: Robert Jordan [mailto:robe...@gmx.net]
Sent: Friday, December 17, 2010 6:25 AM
To: lucene-net-...@incubator.apache.org
Subject: Re: Cannot Escape Special Characters in Search with Lucene.Net 2.0

On 17.12.2010 12:29, abhilash ramachandran wrote:
q = new global::Lucene.Net.QueryParsers.QueryParser(content, new StandardAnalyzer()).Parse(query);

I believe the issue has nothing to do with your query syntax. StandardAnalyzer is skipping such chars during the indexing process, so you can't search for them.

Robert
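If the punctuation has to be searchable, the usual way out is to index the field with an analyzer that keeps it. A sketch of the same-era API, shown in Java (the Lucene.Net API mirrors it); '&' is just an example special character here, and WhitespaceAnalyzer splits on whitespace only, so the term survives intact and an escaped query can then match it:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    public class KeepPunctuation {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new WhitespaceAnalyzer()
                    .tokenStream("content", new StringReader("aaa test&test bbb"));
            Token tk;
            while ((tk = ts.next()) != null) {
                System.out.println(tk.termText());  // prints: aaa, test&test, bbb
            }
        }
    }

The same analyzer must then be used at query time too, or the indexed and queried tokens won't line up.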
WARNING: re-index all trunk indices!
If you are using Lucene's trunk (nightly build) release, read on...

I just committed a change (for LUCENE-2811) that changes the index format on trunk, thus breaking (w/ likely strange exceptions on reading the segments_N file) any trunk indices created in the past week or so.

Mike
[jira] Resolved: (LUCENE-2811) SegmentInfo should explicitly track whether that segment wrote term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2811. Resolution: Fixed

SegmentInfo should explicitly track whether that segment wrote term vectors
---
Key: LUCENE-2811 URL: https://issues.apache.org/jira/browse/LUCENE-2811 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1, 4.0 Attachments: LUCENE-2811.patch

Today SegmentInfo doesn't know if it has vectors, which means its files() method must check whether the files exist. This leads to subtle bugs, because SI.files() caches the files but we then fail to invalidate that cache later when the term vectors files are created. It also leads to sloppy code, e.g. TermVectorsReader gracefully handles being opened when the files do not exist. I don't like that; it should only be opened if they exist. This also fixes these intermittent failures we've been seeing:

{noformat}
junit.framework.AssertionFailedError: IndexFileDeleter doesn't know about file _1e.tvx
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:979)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:917)
    at org.apache.lucene.index.IndexWriter.filesExist(IndexWriter.java:3633)
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:3699)
    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2407)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2478)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2460)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2444)
    at org.apache.lucene.index.TestIndexWriterExceptions.testRandomExceptionsThreads(TestIndexWriterExceptions.java:213)
{noformat}
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972538#action_12972538 ] Steven Rowe commented on LUCENE-2814:

{quote} bq. I think coordinating on IRC #lucene is a good idea?

It'd be nice if there were a log of IRC #lucene; otherwise I prefer Jira. {quote}

#lucene-dev is logged.
Re: WARNING: re-index all trunk indices!
On Fri, Dec 17, 2010 at 11:18 AM, Michael McCandless luc...@mikemccandless.com wrote:

If you are using Lucene's trunk (nightly build) release, read on... I just committed a change (for LUCENE-2811) that changes the index format on trunk, thus breaking (w/ likely strange exceptions on reading the segments_N file) any trunk indices created in the past week or so.

For reference, the exception I got trying to start Solr with an older index on Windows is below.

-Yonik
http://www.lucidimagination.com

SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:587)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:224)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:194)
    at org.mortbay.start.Main.start(Main.java:534)
    at org.mortbay.start.Main.start(Main.java:441)
    at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(MMapDirectory.java:242)
    at org.apache.lucene.store.ChecksumIndexInput.readBytes(ChecksumIndexInput.java:48)
    at org.apache.lucene.store.DataInput.readString(DataInput.java:121)
    at org.apache.lucene.store.DataInput.readStringStringMap(DataInput.java:148)
    at org.apache.lucene.index.SegmentInfo.init(SegmentInfo.java:192)
    at org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:57)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:220)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:90)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:437)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084)
    ... 31 more
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972543#action_12972543 ] Steven Rowe commented on LUCENE-2814:

bq. Steven, is that on a wiki page?
bq. The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on

Yeah, it's very rarely used. Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want.
[jira] Issue Comment Edited: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972543#action_12972543 ] Steven Rowe edited comment on LUCENE-2814 at 12/17/10 11:49 AM: bq. Steven, is that on a wiki page? I don't know, I never put it anywhere, just discussed on d...@l.a.o. Feel free to do so if you like. bq. The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on Yeah, it's very rarely used. Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want. was (Author: steve_rowe): bq. Steven, is that on a wiki page? bq. The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on Yeah, it's very rarely used. Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enable the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1, I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices of a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2289) The example documents have the same lat/lon for the store field for several stores. Space them out.
The example documents have the same lat/lon for the store field for several stores. Space them out. - Key: SOLR-2289 URL: https://issues.apache.org/jira/browse/SOLR-2289 Project: Solr Issue Type: Improvement Components: documentation Affects Versions: Next Environment: All Reporter: Erick Erickson Assignee: Erick Erickson Priority: Trivial A half-dozen or so of the documents in the exampledocs directory all have the same location, which makes it a bit confusing when playing with geospatial; at least I scratched my head wondering whether it was working. Note that this is another reason to include the distance in the returned doc :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Cannot Escape Special charectors Search with Lucene.Net 2.0
N.G -- You can see that the special characters were identified as separators and two test tokens were emitted, not the single testtest you expected. A.R -- The scenario is if I try search a text TestTest But the query TestTest will also be parsed as test test by StandardAnalyzer. Since there are 2 successive tests in the index, there must be a hit. DIGY

-Original Message- From: Granroth, Neal V. [mailto:neal.granr...@thermofisher.com] Sent: Friday, December 17, 2010 6:06 PM To: lucene-net-...@lucene.apache.org Subject: RE: Cannot Escape Special charectors Search with Lucene.Net 2.0

Robert's correct; the StandardAnalyzer will split the input text at the special characters, so your index will not contain them. As in this simple example:

StandardAnalyzer aa = new StandardAnalyzer();
System.IO.StringReader srs = new System.IO.StringReader("aaa bbb testtest ccc ddd");
Lucene.Net.Analysis.TokenStream ts = aa.TokenStream(srs);
Lucene.Net.Analysis.Token tk;
while( (tk = ts.Next()) != null )
{
    System.Console.WriteLine(String.Format("Token: \"{0}\": S:{1}, E:{2}", tk.TermText(), tk.StartOffset(), tk.EndOffset()));
}

The output looks like this:

Token: "aaa": S:0, E:3
Token: "bbb": S:4, E:7
Token: "test": S:8, E:12
Token: "test": S:14, E:18
Token: "ccc": S:19, E:22
Token: "ddd": S:23, E:26

You can see that the special characters were identified as separators and two test tokens were emitted, not the single testtest you expected.

- Neal

-Original Message- From: Robert Jordan [mailto:robe...@gmx.net] Sent: Friday, December 17, 2010 6:25 AM To: lucene-net-...@incubator.apache.org Subject: Re: Cannot Escape Special charectors Search with Lucene.Net 2.0

On 17.12.2010 12:29, abhilash ramachandran wrote: q = new global::Lucene.Net.QueryParsers.QueryParser(content, new StandardAnalyzer()).Parse(query);

I believe the issue has nothing to do with your query syntax. StandardAnalyzer is skipping such chars during the indexing process, so you can't search for them.

Robert
[jira] Updated: (SOLR-2289) The example documents have the same lat/lon for the store field for several stores. Space them out.
[ https://issues.apache.org/jira/browse/SOLR-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated SOLR-2289: - Attachment: SOLR-2289.patch Patch attached; moves the stores that used to be identical NW along 55. Some are in farm fields, but what the heck. Patch made with Tortoise SVN rather than the usual IntelliJ, but the format looks OK (tm). Anybody want to pick it up and commit? The example documents have the same lat/lon for the store field for several stores. Space them out. - Key: SOLR-2289 URL: https://issues.apache.org/jira/browse/SOLR-2289 Project: Solr Issue Type: Improvement Components: documentation Affects Versions: Next Environment: All Reporter: Erick Erickson Assignee: Erick Erickson Priority: Trivial Attachments: SOLR-2289.patch Original Estimate: 1h Remaining Estimate: 1h A half-dozen or so of the documents in the exampledocs directory all have the same location, which makes it a bit confusing when playing with geospatial; at least I scratched my head wondering whether it was working. Note that this is another reason to include the distance in the returned doc :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Cannot Escape Special charectors Search with Lucene.Net 2.0
On 17.12.2010 17:59, Digy wrote: N.G -- You can see that the special characters were identified as separators and two test tokens were emitted, not the single testtest you expected. A.R -- The scenario is if I try search a text TestTest But the query TestTest will also be parsed as test test by StandardAnalyzer. Since there are 2 successive tests in the index, there must be a hit.

Or he doesn't use the same analyzer for indexing and searching.

Robert

DIGY -Original Message- From: Granroth, Neal V. [mailto:neal.granr...@thermofisher.com] Sent: Friday, December 17, 2010 6:06 PM To: lucene-net-...@lucene.apache.org Subject: RE: Cannot Escape Special charectors Search with Lucene.Net 2.0

Robert's correct; the StandardAnalyzer will split the input text at the special characters, so your index will not contain them. As in this simple example:

StandardAnalyzer aa = new StandardAnalyzer();
System.IO.StringReader srs = new System.IO.StringReader("aaa bbb testtest ccc ddd");
Lucene.Net.Analysis.TokenStream ts = aa.TokenStream(srs);
Lucene.Net.Analysis.Token tk;
while( (tk = ts.Next()) != null )
{
    System.Console.WriteLine(String.Format("Token: \"{0}\": S:{1}, E:{2}", tk.TermText(), tk.StartOffset(), tk.EndOffset()));
}

The output looks like this:

Token: "aaa": S:0, E:3
Token: "bbb": S:4, E:7
Token: "test": S:8, E:12
Token: "test": S:14, E:18
Token: "ccc": S:19, E:22
Token: "ddd": S:23, E:26

You can see that the special characters were identified as separators and two test tokens were emitted, not the single testtest you expected.

- Neal

-Original Message- From: Robert Jordan [mailto:robe...@gmx.net] Sent: Friday, December 17, 2010 6:25 AM To: lucene-net-...@incubator.apache.org Subject: Re: Cannot Escape Special charectors Search with Lucene.Net 2.0

On 17.12.2010 12:29, abhilash ramachandran wrote: q = new global::Lucene.Net.QueryParsers.QueryParser(content, new StandardAnalyzer()).Parse(query);

I believe the issue has nothing to do with your query syntax. StandardAnalyzer is skipping such chars during the indexing process, so you can't search for them.

Robert
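The behavior discussed in this thread is straightforward to reproduce on the Java side as well. Below is a minimal sketch against a 2.x-era Lucene API (the pre-Version constructors and the old Token-based TokenStream); "Test&Test" is an illustrative stand-in, since the archive stripped the actual special character from the messages above. It shows that the analyzer, not the query syntax, discards the character, so escaping cannot help; only an analyzer that keeps the token whole, used at both index and query time, preserves it:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class AnalyzerSplitDemo {
  public static void main(String[] args) throws Exception {
    String text = "Test&Test"; // illustrative stand-in for the stripped character

    // StandardAnalyzer discards the '&' and emits two lowercased tokens;
    // escaping the query string cannot undo this, because analysis runs
    // on the term text after the parser has handled the escapes.
    TokenStream ts = new StandardAnalyzer().tokenStream("content", new StringReader(text));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println(t.termText()); // prints "test" twice
    }

    // An analyzer that keeps the token whole must be used at BOTH index and
    // query time; WhitespaceAnalyzer is one option. (A lone '&' is not a
    // query-parser metacharacter, so no escaping is needed here.)
    QueryParser qp = new QueryParser("content", new WhitespaceAnalyzer());
    System.out.println(qp.parse(text)); // content:Test&Test
  }
}

If the index was already built with StandardAnalyzer, switching the query-time analyzer alone is not enough: the index no longer contains the character, so the affected fields have to be reindexed as well.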
RE: Is it possible to set the merge policy setMaxMergeMB from Solr
I'm a bit confused. There are some examples in the JIRA issue for SOLR-1447, but I can't tell from reading it what the final allowed syntax is. I see

<!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">-->
<!--<double name="maxMergeMB">64.0</double>-->
<!--</mergePolicy>-->

in the JIRA issue and in what I think is the test case config file: http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/test/test-files/solr/conf/solrconfig-propinject.xml?view=log

Lance's example is

<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy
  <maxMergeMB>1024</maxMergeMB>
</mergePolicy>

Which one is correct?

Tom

-Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Tuesday, December 07, 2010 10:48 AM To: dev@lucene.apache.org Subject: Re: Is it possible to set the merge policy setMaxMergeMB from Solr

SOLR-1447 added this functionality.

On Mon, Dec 6, 2010 at 2:34 PM, Burton-West, Tom tburt...@umich.edu wrote: Lucene has this method to set the maximum size of a segment when merging: LogByteSizeMergePolicy.setMaxMergeMB (http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29 ) I would like to be able to set this in my solrconfig.xml. Is this possible? If not, should I open a JIRA issue, or is there some gotcha I am unaware of? Tom Tom Burton-West - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2694) MTQ rewrite + weight/scorer init should be single pass
[ https://issues.apache.org/jira/browse/LUCENE-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972586#action_12972586 ] Simon Willnauer commented on LUCENE-2694: - FYI - there is a coding error in the latest patch that causes the TermState to be ignored - TermWeight uses the wrong reference to the PerReaderTermState. I will upload a new patch later this weekend. simon MTQ rewrite + weight/scorer init should be single pass -- Key: LUCENE-2694 URL: https://issues.apache.org/jira/browse/LUCENE-2694 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch Spinoff of LUCENE-2690 (see the hacked patch on that issue)... Once we fix MTQ rewrite to be per-segment, we should take it further and make weight/scorer init also run in the same single pass as rewrite. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2694) MTQ rewrite + weight/scorer init should be single pass
[ https://issues.apache.org/jira/browse/LUCENE-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2694: Attachment: LUCENE-2694.patch here is a new patch. I removed the hacky TermWeight part to make only MTQ single pass for now. Other TermQueries will hit the TermCache for now. No nocommit left. Currently there is some duplication / unnecessary classes in the TermState hierarchy - that needs cleanup. BTW. I see some highlighter tests failing - will look into this later... simon MTQ rewrite + weight/scorer init should be single pass -- Key: LUCENE-2694 URL: https://issues.apache.org/jira/browse/LUCENE-2694 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch Spinoff of LUCENE-2690 (see the hacked patch on that issue)... Once we fix MTQ rewrite to be per-segment, we should take it further and make weight/scorer init also run in the same single pass as rewrite. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-3.x - Build # 214 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/214/

4 tests failed.

REGRESSION: org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull
Error Message: addIndexes(Directory[]) + optimize() hit IOException after disk space was freed up
Stack Trace:
junit.framework.AssertionFailedError: addIndexes(Directory[]) + optimize() hit IOException after disk space was freed up
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:891)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:829)
    at org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull(TestIndexWriterOnDiskFull.java:323)

REGRESSION: org.apache.lucene.index.TestIndexWriterOnDiskFull.testCorruptionAfterDiskFullDuringMerge
Error Message: Some threads threw uncaught exceptions!
Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:891)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:829)
    at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:354)

REGRESSION: org.apache.lucene.index.TestIndexWriterOnDiskFull.testImmediateDiskFull
Error Message: ConcurrentMergeScheduler hit unhandled exceptions
Stack Trace:
junit.framework.AssertionFailedError: ConcurrentMergeScheduler hit unhandled exceptions
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:891)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:829)
    at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:375)

REGRESSION: org.apache.lucene.index.TestIndexWriterOnJRECrash.testNRTThreads
Error Message: Some threads threw uncaught exceptions!
Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:891)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:829)
    at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:354)

Build Log (for compile errors): [...truncated 6950 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2818) abort() method for IndexOutput
abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see an abort() method on IndexOutput that silently (no exceptions) closes the IO and then does a silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated SOLR-2282: - Attachment: SOLR-2282.patch A patch attached. Currently, carrot.produceSummary doesn't work in distributed mode:

{code:title=ClusteringComponent.finishStage()}
// TODO: Currently, docIds is set to null in the distributed environment.
// This causes CarrotParams.PRODUCE_SUMMARY to not work.
// To make CarrotParams.PRODUCE_SUMMARY work in distributed mode, we can choose either one of:
// (a) In each shard, ClusteringComponent produces the summary and finishStage()
//     merges these summaries.
// (b) Adding a doHighlighting(SolrDocumentList, ...) method to SolrHighlighter and
//     making SolrHighlighter use external text rather than stored values to produce snippets.
{code}

Distributed Support for Search Result Clustering Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282.patch Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter. The closed files list was unused, and is silently gone. The open files list was used to:

* prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter.
** no shared docstores, no need + IFD no longer requires a reference to DW
* delete already opened docstore files, when aborting.
** index chain now handles this on its own + has cleaner error handling code.

stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enable the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1, I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices of a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Is it possible to set the merge policy setMaxMergeMB from Solr
I worked on the patch and I can't keep it straight either. We need to update the wiki? I believe

<double name="maxMergeMB">64.0</double>

is correct.

On Fri, Dec 17, 2010 at 10:25 AM, Burton-West, Tom tburt...@umich.edu wrote: I'm a bit confused. There are some examples in the JIRA issue for SOLR-1447, but I can't tell from reading it what the final allowed syntax is. I see

<!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">-->
<!--<double name="maxMergeMB">64.0</double>-->
<!--</mergePolicy>-->

in the JIRA issue and in what I think is the test case config file: http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/test/test-files/solr/conf/solrconfig-propinject.xml?view=log

Lance's example is

<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy
  <maxMergeMB>1024</maxMergeMB>
</mergePolicy>

Which one is correct? Tom

-Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Tuesday, December 07, 2010 10:48 AM To: dev@lucene.apache.org Subject: Re: Is it possible to set the merge policy setMaxMergeMB from Solr

SOLR-1447 added this functionality.

On Mon, Dec 6, 2010 at 2:34 PM, Burton-West, Tom tburt...@umich.edu wrote: Lucene has this method to set the maximum size of a segment when merging: LogByteSizeMergePolicy.setMaxMergeMB (http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29 ) I would like to be able to set this in my solrconfig.xml. Is this possible? If not, should I open a JIRA issue, or is there some gotcha I am unaware of? Tom Tom Burton-West - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972737#action_12972737 ] Shai Erera commented on LUCENE-2818: bq. but constitutes an API backcompat break Can abort() have a default impl in IndexOutput, such as close() followed by deleteFile() maybe? If so, then it won't break anything. Anyway, I think we can make an exception in this case - only those who impl Directory and provide their own IndexOutput extension will be affected, which I think is a relatively low number of applications? bq. What do you think? Would abort() on Directory fit better? E.g., it can abort all currently open and modified files, instead of the caller calling abort() on each IndexOutput? Are you thinking of a case where a write failed, and the caller would call abort() immediately, instead of some higher-level code? If so, would rollback() be a better name? I always thought of IndexOutput as a means for writing bytes, no special semantic logic coded and executed by it. The management code IMO should be maintained by higher-level code, such as Directory or even higher (today IndexWriter, but that's what you're trying to remove :)). So on one hand, I'd like to see IndexWriter's code simplified (this class has become a monster), but on the other, it doesn't feel right to me to add this logic in IndexOutput. Maybe I don't understand the use case for it well though. I do think though, that abort() on IndexOutput has a specific, clearer meaning, where on Directory it can be perceived as kinda vague (what exactly is it aborting, reading / writing?). And maybe aborting a Directory is not good, if say you want to abort/rollback the changes done to a particular file. All in all, I'm +1 for simplifying IW, but am still +0 on transferring the logic to IndexOutput, unless I misunderstand the use case. abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see an abort() method on IndexOutput that silently (no exceptions) closes the IO and then does a silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
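To make the proposal concrete, a default implementation along the lines Shai suggests might look like the sketch below. This is hypothetical: the real IndexOutput does not keep a back-reference to its Directory or file name, so the dir and name fields, the class name, and the constructor here are all assumptions of this sketch, not the actual API.

import java.io.IOException;

import org.apache.lucene.store.Directory;

// Hypothetical shape of LUCENE-2818: an IndexOutput-like base class that knows
// which Directory and file it writes to, so abort() can clean up after itself.
public abstract class AbortableIndexOutput {
  protected final Directory dir;  // assumed: the parent directory
  protected final String name;    // assumed: the name of the file being written

  protected AbortableIndexOutput(Directory dir, String name) {
    this.dir = dir;
    this.name = name;
  }

  public abstract void close() throws IOException;

  // Best-effort cleanup for error paths: close silently, then delete the
  // partial file silently. Deliberately never throws.
  public void abort() {
    try {
      close();
    } catch (IOException ignored) {
      // we are already handling a failure; suppress secondary exceptions
    }
    try {
      dir.deleteFile(name);
    } catch (IOException ignored) {
      // a stray partial file is tolerable; normal deletion policy can reap it later
    }
  }
}

With something like this in place, error-recovery code in IndexWriter could shrink from nested try/finally blocks that track file names by hand to a single abort() call per open output.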
Re: Is it possible to set the merge policy setMaxMergeMB from Solr
Probably best to add something here as it currently has nothing regarding merge policies and has a long-standing TODO on the indexDefaults. http://wiki.apache.org/solr/SolrConfigXml

On Fri, Dec 17, 2010 at 7:22 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I worked on the patch and I can't keep it straight either. We need to update the wiki? I believe <double name="maxMergeMB">64.0</double> is correct.

On Fri, Dec 17, 2010 at 10:25 AM, Burton-West, Tom tburt...@umich.edu wrote: I'm a bit confused. There are some examples in the JIRA issue for SOLR-1447, but I can't tell from reading it what the final allowed syntax is. I see

<!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">-->
<!--<double name="maxMergeMB">64.0</double>-->
<!--</mergePolicy>-->

in the JIRA issue and in what I think is the test case config file: http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/test/test-files/solr/conf/solrconfig-propinject.xml?view=log

Lance's example is

<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy
  <maxMergeMB>1024</maxMergeMB>
</mergePolicy>

Which one is correct? Tom

-Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Tuesday, December 07, 2010 10:48 AM To: dev@lucene.apache.org Subject: Re: Is it possible to set the merge policy setMaxMergeMB from Solr

SOLR-1447 added this functionality.

On Mon, Dec 6, 2010 at 2:34 PM, Burton-West, Tom tburt...@umich.edu wrote: Lucene has this method to set the maximum size of a segment when merging: LogByteSizeMergePolicy.setMaxMergeMB (http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29 ) I would like to be able to set this in my solrconfig.xml. Is this possible? If not, should I open a JIRA issue, or is there some gotcha I am unaware of? Tom Tom Burton-West - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
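For the wiki, a minimal sketch of how the syntax confirmed above would sit inside solrconfig.xml may be useful; the surrounding indexDefaults values below are illustrative placeholders, not recommendations:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <!-- SOLR-1447: bean-style property injection on the merge policy -->
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">64.0</double>
  </mergePolicy>
</indexDefaults>

Note that the child element is typed (double), matching the setMaxMergeMB(double) setter the value is injected into.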
Do we want 'nocommit' to fail the commit?
Hi Out of curiosity, I looked into whether we can have a nocommit comment in the code fail the commit. As far as I can see, we try to avoid accidental commits (of, say, debug messages) by putting a nocommit comment, but I don't know if svn ci would fail in the presence of such a comment - I guess not, because we've seen some accidental nocommits checked in already in the past. So I Googled around and found that if we have control of the svn repo, we can add a pre-commit hook that will check and fail the commit. Here is a nice article that explains how to add pre-commit hooks in general (http://wordaligned.org/articles/a-subversion-pre-commit-hook). I didn't try it yet (on our local svn instance), so I cannot say how well it works, but perhaps someone has experience with it ... So if this is interesting, and is doable for Lucene (say, open a JIRA issue for Infra?), I don't mind investigating it further and writing the script (which can be as simple as 'grep the changed files and fail on the presence of the nocommit string'). Shai
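As a concrete sketch of what such a hook could check, here is one way to do the grep via svnlook, Subversion's standard tool for inspecting an in-flight commit transaction. This is hypothetical: the class name and the choice of Java over a few lines of shell or Python are assumptions, made only for consistency with the rest of this list.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal pre-commit check: fail the commit when any changed file contains
// the string "nocommit". A repo's hooks/pre-commit script would invoke it as:
//   java NoCommitCheck "$REPOS" "$TXN" || exit 1
public class NoCommitCheck {
  public static void main(String[] args) throws Exception {
    String repos = args[0], txn = args[1];

    // "svnlook changed" lists the files touched by this transaction,
    // e.g. "U   lucene/src/java/Foo.java" (status code, spaces, path).
    Process changed = new ProcessBuilder("svnlook", "changed", "-t", txn, repos).start();
    BufferedReader files = new BufferedReader(new InputStreamReader(changed.getInputStream()));
    String entry;
    while ((entry = files.readLine()) != null) {
      String status = entry.substring(0, 1);
      String path = entry.substring(4);
      if (status.equals("D") || path.endsWith("/")) continue; // skip deletions and directories

      // "svnlook cat" streams the file's content as of this transaction.
      Process cat = new ProcessBuilder("svnlook", "cat", "-t", txn, repos, path).start();
      BufferedReader content = new BufferedReader(new InputStreamReader(cat.getInputStream()));
      String line;
      int lineNo = 0;
      while ((line = content.readLine()) != null) {
        lineNo++;
        if (line.contains("nocommit")) {
          // Anything written to stderr is relayed back to the committer,
          // and a non-zero exit code fails the commit.
          System.err.println(path + ":" + lineNo + ": contains 'nocommit'");
          System.exit(1);
        }
      }
    }
  }
}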
Lucene-trunk - Build # 1398 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1398/

All tests passed

Build Log (for compile errors): [...truncated 18389 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2657: Attachment: LUCENE-2657.patch Updated to trunk, including addition of a POM for {{solr/contrib/analysis-extras/}} and upgraded dependencies. All tests pass. Replace Maven POM templates with full POMs, and change documentation accordingly Key: LUCENE-2657 URL: https://issues.apache.org/jira/browse/LUCENE-2657 Project: Lucene - Java Issue Type: Improvement Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Assignee: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. Full Maven POMs will include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-2657: Description:

The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. The full Maven POMs in the attached patch include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates.

Several dependencies are not available through public maven repositories. A profile in the top-level POM can be activated to install these dependencies from the various {{lib/}} directories into your local repository. From the top-level directory:

{code}
mvn -N -Pbootstrap install
{code}

Once these non-Maven dependencies have been installed, to run all Lucene/Solr tests via Maven's surefire plugin, and populate your local repository with all artifacts, from the top level directory, run:

{code}
mvn install
{code}

When one Lucene/Solr module depends on another, the dependency is declared on the *artifact(s)* produced by the other module and deposited in your local repository, rather than on the other module's un-jarred compiler output in the {{build/}} directory, so you must run {{mvn install}} on the other module before its changes are visible to the module that depends on it.

To create all the artifacts without running tests:

{code}
mvn -DskipTests install
{code}

I almost always include the {{clean}} phase when I do a build, e.g.:

{code}
mvn -DskipTests clean install
{code}

was: The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. Full Maven POMs will include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates.

Replace Maven POM templates with full POMs, and change documentation accordingly Key: LUCENE-2657 URL: https://issues.apache.org/jira/browse/LUCENE-2657 Project: Lucene - Java Issue Type: Improvement Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Assignee: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch

The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. The full Maven POMs in the attached patch include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates.

Several dependencies are not available through public maven repositories. A profile in the top-level POM can be activated to install these dependencies from the various {{lib/}} directories into your local repository. From the top-level directory:

{code}
mvn -N -Pbootstrap install
{code}

Once these non-Maven dependencies have been installed, to run all Lucene/Solr tests via Maven's surefire plugin, and populate your local repository with all artifacts, from the top level directory, run:

{code}
mvn install
{code}

When one Lucene/Solr module depends on another, the dependency is declared on the *artifact(s)* produced by the other module and deposited in your local repository, rather than on the other module's un-jarred compiler output in the {{build/}} directory, so you must run {{mvn install}} on the other module before its changes are visible to the module that depends on it.

To create all the artifacts without running tests:

{code}
mvn -DskipTests install
{code}

I almost always include the {{clean}} phase when I do a build, e.g.:

{code}
mvn -DskipTests clean install
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Solr-3.x - Build # 200 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-3.x/200/

All tests passed

Build Log (for compile errors): [...truncated 20521 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org