[jira] [Updated] (SOLR-3948) Calculate/display deleted documents in admin interface
[ https://issues.apache.org/jira/browse/SOLR-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shawn Heisey updated SOLR-3948:
-------------------------------
Attachment: SOLR-3948.patch

A patch against branch_4x that puts Deleted Docs into the admin interface. I may not have gotten everything that needs to be touched; this is my first look at the code that builds the GUI.

Calculate/display deleted documents in admin interface
------------------------------------------------------
Key: SOLR-3948
URL: https://issues.apache.org/jira/browse/SOLR-3948
Project: Solr
Issue Type: Improvement
Components: web gui
Affects Versions: 4.0
Reporter: Shawn Heisey
Priority: Minor
Fix For: 4.1
Attachments: SOLR-3948.patch

The admin interface shows you two totals that let you infer how many deleted documents exist in the index, by subtracting Num Docs from Max Doc. It would make things much easier for novice users, and for automated statistics gathering, if the number of deleted documents were calculated for you and displayed:

Last Modified: 3 minutes ago
Num Docs: 12924551
Max Doc: 13011778
Version: 862
Segment Count: 23
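The arithmetic itself is trivial once a reader is open; a minimal sketch against the Lucene 4.x API (standalone tool reading an index path from the command line - the attached patch wires this into the admin UI instead):

{code}
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class DeletedDocsStat {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    try {
      int maxDoc = reader.maxDoc();    // all documents, including deleted
      int numDocs = reader.numDocs();  // live documents only
      // The value the patch displays; equivalent to reader.numDeletedDocs()
      System.out.println("Deleted Docs: " + (maxDoc - numDocs));
    } finally {
      reader.close();
    }
  }
}
{code}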
[jira] [Updated] (SOLR-3951) wt=json should set application/json as content-type
[ https://issues.apache.org/jira/browse/SOLR-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fredrik Rodland updated SOLR-3951:
----------------------------------
Description:

the result with wt=json has content-type text/plain. Should be application/json. see SOLR-1123 (which seemed to be fixed for 4.0-ALPHA).

reproduce: load all tutorial data, then request
http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true

info on request/response:

{code}
9:42:14.681[31ms][total 69ms] Status: 200[OK]
GET http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
Content Size[-1] Mime Type[text/plain]

Request Headers:
  Host[localhost:8983]
  User-Agent[Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:16.0) Gecko/20100101 Firefox/16.0]
  Accept[text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8]
  Accept-Language[en-US,en;q=0.5]
  Accept-Encoding[gzip, deflate]
  Connection[keep-alive]
  Referer[http://localhost:8983/solr/]
  Cache-Control[max-age=0]

Response Headers:
  Content-Type[text/plain;charset=UTF-8]
  Transfer-Encoding[chunked]
{code}

was: the same description, with the request/response details as plain text instead of a {code} block.

wt=json should set application/json as content-type
---------------------------------------------------
Key: SOLR-3951
URL: https://issues.apache.org/jira/browse/SOLR-3951
Project: Solr
Issue Type: Improvement
Affects Versions: 4.0
Environment: Darwin SCH-BP-2003.local 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64, SOLR 4.0.0
Reporter: Fredrik Rodland
[jira] [Created] (SOLR-3951) wt=json should set application/json as content-type
Fredrik Rodland created SOLR-3951:
----------------------------------

Summary: wt=json should set application/json as content-type
Key: SOLR-3951
URL: https://issues.apache.org/jira/browse/SOLR-3951
Project: Solr
Issue Type: Improvement
Affects Versions: 4.0
Environment: Darwin SCH-BP-2003.local 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64, SOLR 4.0.0
Reporter: Fredrik Rodland

the result with wt=json has content-type text/plain. Should be application/json. see SOLR-1123 (which seemed to be fixed for 4.0-ALPHA).

reproduce: load all tutorial data, then request
http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true

info on request/response: as captured in the {code} block of the description update above.
[jira] [Updated] (SOLR-3951) wt=json should set application/json as content-type
[ https://issues.apache.org/jira/browse/SOLR-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fredrik Rodland updated SOLR-3951:
----------------------------------
Environment: max osx 10.7.5, SOLR 4.0.0
(was: Darwin SCH-BP-2003.local 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64, SOLR 4.0.0)

wt=json should set application/json as content-type
---------------------------------------------------
Key: SOLR-3951
URL: https://issues.apache.org/jira/browse/SOLR-3951
Project: Solr
Issue Type: Improvement
Affects Versions: 4.0
Environment: max osx 10.7.5, SOLR 4.0.0
Reporter: Fredrik Rodland
[jira] [Resolved] (SOLR-3951) wt=json should set application/json as content-type
[ https://issues.apache.org/jira/browse/SOLR-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fredrik Rodland resolved SOLR-3951.
-----------------------------------
Resolution: Not A Problem

hm - reading a bit more - it seems that this is intended, and that you must manually specify that you want the content-type to be application/json when you use wt=json. Seems like an awkward decision.

{code}
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
  <!-- For the purposes of the tutorial, JSON responses are written as
       plain text so that they are easy to read in *any* browser.
       If you expect a MIME type of application/json just remove this override. -->
  <str name="content-type">application/json; charset=UTF-8</str>
</queryResponseWriter>
{code)

resolving issue as not a problem

wt=json should set application/json as content-type
---------------------------------------------------
Key: SOLR-3951
URL: https://issues.apache.org/jira/browse/SOLR-3951
Project: Solr
Issue Type: Improvement
Affects Versions: 4.0
Environment: max osx 10.7.5, SOLR 4.0.0
Reporter: Fredrik Rodland
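For reference, the three shapes this config entry can take - sketched from memory of the stock 4.0 example solrconfig.xml, so verify against your own file before relying on it:

{code:xml}
<!-- As shipped with the 4.0 example: forces text/plain for browser readability -->
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
  <str name="content-type">text/plain; charset=UTF-8</str>
</queryResponseWriter>

<!-- Option 1: remove the override so the writer's default (application/json) applies -->
<queryResponseWriter name="json" class="solr.JSONResponseWriter"/>

<!-- Option 2: override explicitly, as quoted in the comment above -->
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
  <str name="content-type">application/json; charset=UTF-8</str>
</queryResponseWriter>
{code}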
[jira] [Comment Edited] (SOLR-3951) wt=json should set application/json as content-type
[ https://issues.apache.org/jira/browse/SOLR-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476833#comment-13476833 ]

Fredrik Rodland edited comment on SOLR-3951 at 10/16/12 7:53 AM:
-----------------------------------------------------------------

hm - reading a bit more - it seems that this is intended, and that you must manually specify that you want the content-type to be application/json when you use wt=json. Seems like an awkward decision.

{code}
from solrconfig.xml
...
For the purposes of the tutorial, JSON responses are written as
plain text so that they are easy to read in *any* browser.
If you expect a MIME type of application/json just remove this override.
...
{code)

resolving issue as not a problem

was (Author: fmr):
hm - reading a bit more - it seems that this is intended, and that you must manually specify that you want the content-type to be application/json when you use wt=json. Seems like an awkward decision.

{code}
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
  <!-- For the purposes of the tutorial, JSON responses are written as
       plain text so that they are easy to read in *any* browser.
       If you expect a MIME type of application/json just remove this override. -->
  <str name="content-type">application/json; charset=UTF-8</str>
</queryResponseWriter>
{code)

resolving issue as not a problem

wt=json should set application/json as content-type
---------------------------------------------------
Key: SOLR-3951
URL: https://issues.apache.org/jira/browse/SOLR-3951
Project: Solr
Issue Type: Improvement
Affects Versions: 4.0
Environment: max osx 10.7.5, SOLR 4.0.0
Reporter: Fredrik Rodland
[jira] [Comment Edited] (SOLR-3951) wt=json should set application/json as content-type
[ https://issues.apache.org/jira/browse/SOLR-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476833#comment-13476833 ]

Fredrik Rodland edited comment on SOLR-3951 at 10/16/12 7:53 AM:
-----------------------------------------------------------------

hm - reading a bit more - it seems that this is intended, and that you must manually specify that you want the content-type to be application/json when you use wt=json. Seems like an awkward decision.

{code}
from solrconfig.xml
...
For the purposes of the tutorial, JSON responses are written as
plain text so that they are easy to read in *any* browser.
If you expect a MIME type of application/json just remove this override.
...
{code}

resolving issue as not a problem

was (Author: fmr):
hm - reading a bit more - it seems that this is intended, and that you must manually specify that you want the content-type to be application/json when you use wt=json. Seems like an awkward decision.

{code}
from solrconfig.xml
...
For the purposes of the tutorial, JSON responses are written as
plain text so that they are easy to read in *any* browser.
If you expect a MIME type of application/json just remove this override.
...
{code)

resolving issue as not a problem

wt=json should set application/json as content-type
---------------------------------------------------
Key: SOLR-3951
URL: https://issues.apache.org/jira/browse/SOLR-3951
Project: Solr
Issue Type: Improvement
Affects Versions: 4.0
Environment: max osx 10.7.5, SOLR 4.0.0
Reporter: Fredrik Rodland
[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476854#comment-13476854 ]

Mark Harwood commented on SOLR-3950:
------------------------------------

BloomFilterPostingsFormat is designed to wrap another choice of PostingsFormat, adding .blm files alongside the files created by the chosen delegate. However, your code has instantiated a BloomFilterPostingsFormat without passing a delegate - presumably using the zero-arg constructor. The comments in the code for this zero-arg constructor state:

{code}
// Used only by core Lucene at read-time via Service Provider instantiation -
// do not use at Write-time in application code.
{code}

Attempting postings=BloomFilter results in UnsupportedOperationException
-------------------------------------------------------------------------
Key: SOLR-3950
URL: https://issues.apache.org/jira/browse/SOLR-3950
Project: Solr
Issue Type: Bug
Affects Versions: 4.1
Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@bigindy5 ~]# java -version
java version 1.7.0_07
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
Reporter: Shawn Heisey
Fix For: 4.1

Tested on branch_4x, checked out after BlockPostingsFormat was made the default by LUCENE-4446. I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and copied it into my sharedLib directory. When I subsequently tried postings=BloomFilter I got the following exception in the log:

{code}
Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.UnsupportedOperationException: Error - org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been constructed without a choice of PostingsFormat
{code}
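For reference, the write-time construction the comment calls for - a minimal sketch assuming Lucene 4.1's Lucene41PostingsFormat as the delegate (any concrete PostingsFormat would do):

{code}
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;

public final class BloomWrapping {
  // Write-time: wrap a concrete delegate; the bloom filter adds .blm
  // files alongside whatever files the delegate creates.
  public static PostingsFormat newWriteTimeFormat() {
    return new BloomFilteringPostingsFormat(new Lucene41PostingsFormat());
  }
  // By contrast, the zero-arg new BloomFilteringPostingsFormat() is reserved
  // for read-time SPI loading and fails with UnsupportedOperationException
  // if asked to write.
}
{code}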
Re: [JENKINS] Lucene-Solr-NightlyTests-trunk - Build # 62 - Failure
I opened LUCENE-4484 for this.

Mike McCandless
http://blog.mikemccandless.com

On Sun, Oct 14, 2012 at 2:21 AM, Apache Jenkins Server <jenk...@builds.apache.org> wrote:

Build: https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/62/

1 tests failed.
REGRESSION: org.apache.lucene.index.Test4GBStoredFields.test

Error Message: Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
    at __randomizedtesting.SeedInfo.seed([2D89DD229CD304F5:A5DDE2F8322F690D]:0)
    at org.apache.lucene.store.RAMFile.newBuffer(RAMFile.java:75)
    at org.apache.lucene.store.RAMFile.addBuffer(RAMFile.java:48)
    at org.apache.lucene.store.RAMOutputStream.switchCurrentBuffer(RAMOutputStream.java:139)
    at org.apache.lucene.store.RAMOutputStream.writeBytes(RAMOutputStream.java:125)
    at org.apache.lucene.store.MockIndexOutputWrapper.writeBytes(MockIndexOutputWrapper.java:123)
    at org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsWriter.writeField(Lucene40StoredFieldsWriter.java:180)
    at org.apache.lucene.index.StoredFieldsConsumer.finishDocument(StoredFieldsConsumer.java:120)
    at org.apache.lucene.index.DocFieldProcessor.finishDocument(DocFieldProcessor.java:339)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:263)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1443)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1122)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1103)
    at org.apache.lucene.index.Test4GBStoredFields.test(Test4GBStoredFields.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
    at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)

Build Log:
[...truncated 419 lines...]
[junit4:junit4] Suite: org.apache.lucene.index.Test4GBStoredFields
[junit4:junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
[junit4:junit4]   2> NOTE: reproduce with: ant test -Dtestcase=Test4GBStoredFields -Dtests.method=test -Dtests.seed=2D89DD229CD304F5 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=ru -Dtests.timezone=Asia/Vladivostok -Dtests.file.encoding=UTF-8
[junit4:junit4] ERROR   3.95s J0 | Test4GBStoredFields.test
[junit4:junit4]    > Throwable #1: java.lang.OutOfMemoryError: Java heap space
[junit4:junit4]    >    at __randomizedtesting.SeedInfo.seed([2D89DD229CD304F5:A5DDE2F8322F690D]:0)
[junit4:junit4]    >    at org.apache.lucene.store.RAMFile.newBuffer(RAMFile.java:75)
[junit4:junit4]    >    at org.apache.lucene.store.RAMFile.addBuffer(RAMFile.java:48)
[junit4:junit4]    >    at org.apache.lucene.store.RAMOutputStream.switchCurrentBuffer(RAMOutputStream.java:139)
[junit4:junit4]    >    at org.apache.lucene.store.RAMOutputStream.writeBytes(RAMOutputStream.java:125)
[junit4:junit4]    >    at
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476908#comment-13476908 ]

Michael McCandless commented on LUCENE-4484:
--------------------------------------------

bq. Can uncache() be changed to return the still-open newly created IndexOutput?

I think we'd have to wrap the RAMOutputStream ... then we could 1) know when too many bytes have been written, 2) close the wrapped RAMOutputStream and call uncache to move it to disk, 3) fix uncache to not close the IO (return it), 4) cut the wrapper over to the new on-disk IO. And all of this would have to be done inside a writeByte/s call (from the caller's standpoint) ... it seems hairy.

We could also just leave it be, ie advertise this limitation. NRTCachingDir is already hairy enough... The purpose of this directory is to be used in an NRT setting where you have relatively frequent reopens compared to the indexing rate, and this naturally keeps files plenty small. It's also particularly unusual to index only stored fields in an NRT setting (which is what this test is doing).

Yet another option would be to somehow have the indexer be able to flush based on the size of stored fields / term vectors files ... today of course we completely disregard these in the RAM accounting since we write their bytes directly to disk. Maybe ... the app could pass the indexer an AtomicInt/Long recording bytes held elsewhere in RAM, and the indexer would add that in its logic for when to trigger a flush...

NRTCachingDir can't handle large files
--------------------------------------
Key: LUCENE-4484
URL: https://issues.apache.org/jira/browse/LUCENE-4484
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless

I dug into this OOME, which easily repros for me on rev 1398268:

{noformat}
ant test -Dtestcase=Test4GBStoredFields -Dtests.method=test -Dtests.seed=2D89DD229CD304F5 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=ru -Dtests.timezone=Asia/Vladivostok -Dtests.file.encoding=UTF-8 -Dtests.verbose=true
{noformat}

The problem is the test got NRTCachingDir ... which cannot handle large files because it decides up front (when createOutput is called) whether the file will be in the RAMDir vs the wrapped dir ... so if that file turns out to be immense (which this test does, since stored fields files can grow arbitrarily huge w/o any flush happening) then it takes unbounded RAM.
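A rough sketch of the wrapper idea above - illustrative only, with method names from the Lucene 4.x IndexOutput API; the uncache hand-off, the genuinely hairy step, is left abstract:

{code}
import java.io.IOException;

import org.apache.lucene.store.IndexOutput;

/**
 * Sketch: an IndexOutput that counts bytes written to a RAM-backed delegate
 * and, once past a threshold, migrates the bytes to an on-disk output.
 */
abstract class SpillingIndexOutput extends IndexOutput {
  private IndexOutput delegate;          // starts as the wrapped RAMOutputStream
  private final long maxCachedBytes;

  SpillingIndexOutput(IndexOutput ramOutput, long maxCachedBytes) {
    this.delegate = ramOutput;
    this.maxCachedBytes = maxCachedBytes;
  }

  /** Steps 2+3 above: close the RAM output, move its bytes to disk, and
   *  return the still-open on-disk output. Left abstract: this is the
   *  part the comment calls hairy. */
  protected abstract IndexOutput uncacheToDisk(IndexOutput ramOutput) throws IOException;

  private void maybeSpill() throws IOException {
    if (delegate.getFilePointer() > maxCachedBytes) {
      delegate = uncacheToDisk(delegate);   // step 4: cut over to the on-disk IO
    }
  }

  @Override
  public void writeByte(byte b) throws IOException {
    delegate.writeByte(b);
    maybeSpill();   // the check has to happen inside writeByte/s, per the comment
  }

  @Override
  public void writeBytes(byte[] b, int offset, int length) throws IOException {
    delegate.writeBytes(b, offset, length);
    maybeSpill();
  }

  @Override
  public void flush() throws IOException { delegate.flush(); }

  @Override
  public void close() throws IOException { delegate.close(); }

  @Override
  public long getFilePointer() { return delegate.getFilePointer(); }

  @Override
  public long length() throws IOException { return delegate.length(); }
}
{code}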
[jira] [Commented] (LUCENE-4472) Add setting that prevents merging on updateDocument
[ https://issues.apache.org/jira/browse/LUCENE-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476915#comment-13476915 ]

Michael McCandless commented on LUCENE-4472:
--------------------------------------------

I like the new MergeCause enum!

But, instead of folding all parameters into a MergeContext and exposing a single MergePolicy.findMerges method, can we keep the methods we have today and just add MergeCause as another parameter? This is a very expert API and I think it's fine to simply change it.

I think this approach is more type-safe for the future, ie if we need to add something important such that a custom merge policy should pay attention to it ... apps will see compilation errors on upgrading and know they have to handle the new parameter.

Add setting that prevents merging on updateDocument
---------------------------------------------------
Key: LUCENE-4472
URL: https://issues.apache.org/jira/browse/LUCENE-4472
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Simon Willnauer
Fix For: 4.1, 5.0
Attachments: LUCENE-4472.patch, LUCENE-4472.patch

Currently we always call maybeMerge if a segment was flushed after updateDocument. Some apps, in particular ElasticSearch, use some hacky workarounds to disable that, ie for merge throttling. It should be easier to enable this kind of behavior.
[jira] [Commented] (LUCENE-4472) Add setting that prevents merging on updateDocument
[ https://issues.apache.org/jira/browse/LUCENE-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476920#comment-13476920 ]

Michael McCandless commented on LUCENE-4472:
--------------------------------------------

Actually I think we only need to add the MergeCause (maybe rename this to MergeTrigger?) param to findMerges? That method is invoked for natural merges, and knowing the trigger for the natural merge is useful... The other two methods (findForcedMerges, findForcedDeletesMerges) are only triggered when the app explicitly asked IndexWriter to do so.

Add setting that prevents merging on updateDocument
---------------------------------------------------
Key: LUCENE-4472
URL: https://issues.apache.org/jira/browse/LUCENE-4472
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Simon Willnauer
Fix For: 4.1, 5.0
Attachments: LUCENE-4472.patch, LUCENE-4472.patch
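Roughly what that narrower change would look like - a sketch only, with illustrative enum values; the committed patch on this issue is authoritative:

{code}
import java.io.IOException;

import org.apache.lucene.index.MergePolicy.MergeSpecification;
import org.apache.lucene.index.SegmentInfos;

/** Sketch of the narrowed API change: only the natural-merge entry point
 *  learns why it is being called; the forced-merge methods keep their
 *  existing signatures. */
interface NaturalMergeSource {

  /** Illustrative trigger values; the committed enum may differ. */
  enum MergeTrigger { SEGMENT_FLUSH, FULL_FLUSH, EXPLICIT, MERGE_FINISHED, CLOSING }

  /** IndexWriter passes the trigger for natural merges, e.g. SEGMENT_FLUSH
   *  after updateDocument, letting a policy skip or throttle those merges. */
  MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos segmentInfos)
      throws IOException;
}
{code}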
Re: [JENKINS] Lucene-trunk-Linux-Java6-64-test-only - Build # 9932 - Failure!
Hmmm I'll dig.

Mike McCandless
http://blog.mikemccandless.com

On Mon, Oct 15, 2012 at 7:35 PM, <buil...@flonkings.com> wrote:

Build: builds.flonkings.com/job/Lucene-trunk-Linux-Java6-64-test-only/9932/

1 tests failed.
REGRESSION: org.apache.lucene.index.TestNRTThreads.testNRTThreads

Error Message: saw non-zero open-but-deleted count

Stack Trace:
java.lang.AssertionError: saw non-zero open-but-deleted count
    at __randomizedtesting.SeedInfo.seed([447148DE18F87BA8:DFA85CC559036DC3]:0)
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.assertTrue(Assert.java:43)
    at org.junit.Assert.assertFalse(Assert.java:68)
    at org.apache.lucene.index.TestNRTThreads.doSearching(TestNRTThreads.java:89)
    at org.apache.lucene.index.ThreadedIndexingAndSearchingTestCase.runTest(ThreadedIndexingAndSearchingTestCase.java:507)
    at org.apache.lucene.index.TestNRTThreads.testNRTThreads(TestNRTThreads.java:127)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
    at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:648)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:682)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:693)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
    at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:43)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
    at java.lang.Thread.run(Thread.java:662)

Build Log:
[...truncated 335 lines...]
[junit4:junit4] Suite:
[jira] [Commented] (LUCENE-4472) Add setting that prevents merging on updateDocument
[ https://issues.apache.org/jira/browse/LUCENE-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476928#comment-13476928 ]

Simon Willnauer commented on LUCENE-4472:
-----------------------------------------

bq. The other two methods (findForcedMerges, findForcedDeletesMerges) are only triggered when the app explicitly asked IndexWriter to do so.

I am not sure we should really do that. I'd rather make those two methods protected and make them an impl detail of the merge policy. I think the specialized methods are a poor man's approach to the MergeContext, and the API is rather clumsy along those lines. I'd be happy to not break bw. compat but only add a more flexible API that is the authoritative source / single entry point for the IndexWriter. If you think this through, findForcedDeletesMerges and findForcedMerges really are an impl detail of the current IndexWriter, and if we modularized it this would become even more obvious.

Add setting that prevents merging on updateDocument
---------------------------------------------------
Key: LUCENE-4472
URL: https://issues.apache.org/jira/browse/LUCENE-4472
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Simon Willnauer
Fix For: 4.1, 5.0
Attachments: LUCENE-4472.patch, LUCENE-4472.patch
[jira] [Commented] (SOLR-3881) frequent OOM in LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476942#comment-13476942 ]

Jan Høydahl commented on SOLR-3881:
-----------------------------------

I'm sure it's possible to optimize the memory footprint somehow. The reason why we concatenate all fl fields before detection was originally that Tika's detector gets better and better the longer the input text. So while detection on individual short fields has a high risk of mis-detection, the resulting concatenated string has a better chance.

A configurable max-cap on the concatenation may make sense, as detection accuracy flattens out after some threshold. Perhaps we could also avoid the expandCapacity() and Arrays.copyOf() calls if we pre-allocate the StringBuffer with the theoretical max size, being the size of our SolrInputDoc. If a StringBuffer is at 10kb and needs an extra 10b for an append, it will allocate a new buffer of (10kb+1)*2 capacity, which is a waste. We should also switch to StringBuilder, which is more performant.

frequent OOM in LanguageIdentifierUpdateProcessor
-------------------------------------------------
Key: SOLR-3881
URL: https://issues.apache.org/jira/browse/SOLR-3881
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.0
Environment: CentOS 6.x, JDK 1.6, (java -server -Xms2G -Xmx2G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=)
Reporter: Rob Tulloh

We are seeing frequent failures from Solr causing it to OOM. Here is the stack trace we observe when this happens:

{noformat}
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuffer.append(StringBuffer.java:224)
    at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.concatFields(LanguageIdentifierUpdateProcessor.java:286)
    at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.process(LanguageIdentifierUpdateProcessor.java:189)
    at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:171)
    at org.apache.solr.handler.BinaryUpdateRequestHandler$2.update(BinaryUpdateRequestHandler.java:90)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:140)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:120)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:105)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:147)
    at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:100)
    at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:47)
    at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:58)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at
{noformat}
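Putting the two suggestions together - a hedged sketch, not the actual patch; the cap constant and helper names are made up for illustration:

{code}
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;

final class ConcatForDetection {
  // Illustrative cap: detection accuracy flattens out past some threshold.
  private static final int MAX_DETECT_CHARS = 20 * 1024;

  static String concatFields(SolrInputDocument doc, String[] fieldNames) {
    // Pre-size to the theoretical max so append() never triggers the
    // expandCapacity()/Arrays.copyOf() re-allocations seen in the trace.
    int capacity = 0;
    for (String name : fieldNames) {
      SolrInputField f = doc.getField(name);
      if (f != null && f.getValue() instanceof String) {
        capacity += ((String) f.getValue()).length() + 1;
      }
    }
    StringBuilder sb = new StringBuilder(Math.min(capacity, MAX_DETECT_CHARS + 1));
    for (String name : fieldNames) {
      SolrInputField f = doc.getField(name);
      if (f != null && f.getValue() instanceof String) {
        if (sb.length() >= MAX_DETECT_CHARS) {
          break;   // max-cap on the concatenation
        }
        sb.append((String) f.getValue()).append(' ');
      }
    }
    return sb.length() > MAX_DETECT_CHARS
        ? sb.substring(0, MAX_DETECT_CHARS)
        : sb.toString();
  }
}
{code}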
[jira] [Created] (LUCENE-4485) CheckIndex's term stats should not include deleted docs
Michael McCandless created LUCENE-4485:
---------------------------------------

Summary: CheckIndex's term stats should not include deleted docs
Key: LUCENE-4485
URL: https://issues.apache.org/jira/browse/LUCENE-4485
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless

I was looking at the CheckIndex output on an index that has deletions, eg:

{noformat}
  4 of 30: name=_90 docCount=588408
    codec=Lucene41
    compound=false
    numFiles=14
    size (MB)=265.318
    diagnostics = {os=Linux, os.version=3.2.0-23-generic, mergeFactor=10, source=merge, lucene.version=5.0-SNAPSHOT, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_07, java.vendor=Oracle Corporation}
    has deletions [delGen=1]
    test: open reader.........OK [39351 deleted docs]
    test: fields..............OK [8 fields]
    test: field norms.........OK [2 fields]
    test: terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 65597188 tokens]
    test (ignoring deletes): terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 70293065 tokens]
    test: stored fields.......OK [1647171 total field count; avg 3 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 1 docvalues fields]
{noformat}

If you compare the "test: terms, freq, prox" line (which includes deletions) and the next line (which doesn't), it's confusing because only the 3rd number (tokens) reflects deletions. I think the first two numbers should also reflect deletions? This way an app could get a sense of how much deadweight is in the index due to un-reclaimed deletions...
[jira] [Created] (SOLR-3952) TextResponseWriter/XMLWriter: Make escaping deactivatable
Sebastian Lutze created SOLR-3952:
----------------------------------

Summary: TextResponseWriter/XMLWriter: Make escaping deactivatable
Key: SOLR-3952
URL: https://issues.apache.org/jira/browse/SOLR-3952
Project: Solr
Issue Type: Improvement
Components: Response Writers
Affects Versions: 1.4
Reporter: Sebastian Lutze
Priority: Minor
Fix For: 4.1
Attachments: disable_escape.patch

Since we have full control over what is stored in our indexes, we want to retrieve highlighted terms or phrases in real XML tags ...

{code:xml}
<str>
  <em>Napoleon</em>
</str>
{code}

... rather than in escaped sequences:

{code:xml}
<str>
  &lt;em&gt;Napoleon&lt;/em&gt;
</str>
{code}

Until now I haven't discovered any solution that solves this problem out-of-the-box. We patched together a very crude workaround involving Cocoon's ServletService, an XSLT stylesheet and disableOutputEscaping=yes.

Therefore this patch provides:
- a field doEscape in TextResponseWriter and corresponding getters/setters
- support for a request parameter escape=off to disable escaping

I'm not sure if I have chosen the optimal approach to address this issue, or if the issue is even an issue. Maybe there is a better way with Formatters/Encoders or something else?
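The shape of the change described above might look like the following sketch (field and parameter names follow the issue text; the authoritative change is the attached disable_escape.patch):

{code}
import java.io.IOException;
import java.io.Writer;

import org.apache.solr.common.util.XML;

/** Sketch of the toggle the patch adds to TextResponseWriter. */
class EscapableWriter {
  private final Writer writer;
  private boolean doEscape = true;      // new field; default behavior unchanged

  EscapableWriter(Writer writer, boolean doEscape) {
    this.writer = writer;
    this.doEscape = doEscape;           // driven by the escape=off request param
  }

  void writeStr(String val) throws IOException {
    if (doEscape) {
      XML.escapeCharData(val, writer);  // default: &lt;em&gt;Napoleon&lt;/em&gt;
    } else {
      writer.write(val);                // escaping off: <em>Napoleon</em> verbatim
    }
  }
}
{code}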
[jira] [Updated] (SOLR-3952) TextResponseWriter/XMLWriter: Make escaping deactivatable
[ https://issues.apache.org/jira/browse/SOLR-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Lutze updated SOLR-3952:
----------------------------------
Attachment: disable_escape.patch

TextResponseWriter/XMLWriter: Make escaping deactivatable
---------------------------------------------------------
Key: SOLR-3952
URL: https://issues.apache.org/jira/browse/SOLR-3952
Project: Solr
Issue Type: Improvement
Components: Response Writers
Affects Versions: 3.6
Reporter: Sebastian Lutze
Priority: Minor
Labels: escaping, response, xml
Fix For: 4.1
Attachments: disable_escape.patch
[jira] [Updated] (SOLR-3952) TextResponseWriter/XMLWriter: Make escaping deactivatable
[ https://issues.apache.org/jira/browse/SOLR-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Lutze updated SOLR-3952:
----------------------------------
Affects Version/s: 3.6
                   (was: 1.4)

TextResponseWriter/XMLWriter: Make escaping deactivatable
---------------------------------------------------------
Key: SOLR-3952
URL: https://issues.apache.org/jira/browse/SOLR-3952
Project: Solr
Issue Type: Improvement
Components: Response Writers
Affects Versions: 3.6
Reporter: Sebastian Lutze
Priority: Minor
Labels: escaping, response, xml
Fix For: 4.1
Attachments: disable_escape.patch
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476964#comment-13476964 ]

Robert Muir commented on LUCENE-4484:
-------------------------------------

{quote}
... it seems hairy. We could also just leave it be, ie advertise this limitation. NRTCachingDir is already hairy enough... The purpose of this directory is to be used in an NRT setting where you have relatively frequent reopens compared to the indexing rate, and this naturally keeps files plenty small.
{quote}

This seems fine to me. I think let's just do javadocs? Because in general there are lots of other combinations of stupid parameters that can cause OOM/Out of Open Files/etc, and we can't prevent all of them.

NRTCachingDir can't handle large files
--------------------------------------
Key: LUCENE-4484
URL: https://issues.apache.org/jira/browse/LUCENE-4484
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
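For context, the sizing knobs live in the directory's constructor; a usage sketch with the thresholds its own javadocs suggest (the index path is illustrative):

{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

class NRTDirSetup {
  static Directory open(File indexPath) throws IOException {
    // Cache files expected to stay small (here <= 5 MB each, <= 60 MB total)
    // in RAM; everything else goes straight to the wrapped FSDirectory. The
    // RAM/disk choice is made up front in createOutput, which is exactly the
    // limitation this issue describes.
    return new NRTCachingDirectory(FSDirectory.open(indexPath), 5.0, 60.0);
  }
}
{code}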
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476969#comment-13476969 ]

Mark Miller commented on LUCENE-4484:
-------------------------------------

Doesn't seem like a great answer to me - "if you want to use NRTCachingDir, please make sure you are constantly indexing and reopening so that you don't run into problems"... that sounds hairy as well...

NRTCachingDir can't handle large files
--------------------------------------
Key: LUCENE-4484
URL: https://issues.apache.org/jira/browse/LUCENE-4484
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476973#comment-13476973 ]

Robert Muir commented on LUCENE-4484:
-------------------------------------

The test in question is extreme in that it doesn't actually index anything; it's just adding stored fields.

NRTCachingDir can't handle large files
--------------------------------------
Key: LUCENE-4484
URL: https://issues.apache.org/jira/browse/LUCENE-4484
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476977#comment-13476977 ]

Mark Miller commented on LUCENE-4484:
-------------------------------------

Yeah, I know - it's a special case of stored fields and term vectors - but it would still be great if it was a special case you didn't have to worry about. It's not the end of the world - if someone has problems we can tell them to stop using NRTCachingDir - but it would also be great if it just worked well in that case too. (Solr defaults to NRTCachingDir)

NRTCachingDir can't handle large files
--------------------------------------
Key: LUCENE-4484
URL: https://issues.apache.org/jira/browse/LUCENE-4484
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476981#comment-13476981 ]

Robert Muir commented on LUCENE-4484:
-------------------------------------

I know it does: I think a much safer general solution to keep e.g. file counts low would be to just match the Lucene defaults: FSDirectory.open and CFS enabled.

I tend to agree with Mike that NRTCachingDirectory can really be specifically for the NRT use case, because otherwise I think it's going to be ugly to make it work well for all use-cases... and even then, not OOM'ing doesn't necessarily mean working well. If it's always overflowing its cache and having to uncache files because it's not really an NRT use case, that doesn't seem great.

But I don't disagree with trying to make it more general either; I do just think that this should be done in NRTCachingDir itself and not hacked into IndexWriter (flushing when stored files get too large is illogical outside of hacking around this).

NRTCachingDir can't handle large files
--------------------------------------
Key: LUCENE-4484
URL: https://issues.apache.org/jira/browse/LUCENE-4484
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476989#comment-13476989 ] Robert Muir commented on LUCENE-4484: - {quote} And all of this would have to be done inside a writeByte/s call (from the caller's standpoint) {quote} In trunk at least this could be done in switchBuffer or whatever instead. Not that it makes it cleaner, just less ugly. NRTCachingDir can't handle large files -- Key: LUCENE-4484 URL: https://issues.apache.org/jira/browse/LUCENE-4484 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476999#comment-13476999 ] Shawn Heisey commented on SOLR-3950: bq. However your code has instantiated a BloomFilterPostingsFormat without passing a choice of delegate - presumably using the zero-arg constructor. In this case, my code is Solr, source code unmodified. From my schema.xml:
{code}
<fieldType name="bloomLong" class="solr.TrieLongField" precisionStep="0"
           omitNorms="true" positionIncrementGap="0" postingsFormat="BloomFilter"/>
<fieldType name="bloomLowercase" class="solr.TextField" sortMissingLast="true"
           positionIncrementGap="0" omitNorms="true" postingsFormat="BloomFilter">
  . . snip . .
</fieldType>
{code}
If there is some schema config that will tell Solr to do the right thing, please let me know. Attempting postings=BloomFilter results in UnsupportedOperationException -- Key: SOLR-3950 URL: https://issues.apache.org/jira/browse/SOLR-3950 Project: Solr Issue Type: Bug Affects Versions: 4.1 Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux [root@bigindy5 ~]# java -version java version "1.7.0_07" Java(TM) SE Runtime Environment (build 1.7.0_07-b10) Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode) Reporter: Shawn Heisey Fix For: 4.1 Tested on branch_4x, checked out after BlockPostingsFormat was made the default by LUCENE-4446. I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and copied it into my sharedLib directory. When I subsequently tried postings=BloomFilter I got the following exception in the log: {code} Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log SEVERE: java.lang.UnsupportedOperationException: Error - org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been constructed without a choice of PostingsFormat {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-3953) postingsFormat doesn't work on field, only on fieldType
Shawn Heisey created SOLR-3953: -- Summary: postingsFormat doesn't work on field, only on fieldType Key: SOLR-3953 URL: https://issues.apache.org/jira/browse/SOLR-3953 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.1 Reporter: Shawn Heisey Priority: Minor Fix For: 4.1 The following schema config (adding postingsFormat) produces no change in Solr's behavior. If postingsFormat="BloomFilter" is instead added to a new fieldType and that fieldType is used, then Solr's behavior changes. In my pre-deployment tests, it results in SOLR-3950.
{code}
<field name="did" type="long" indexed="true" stored="true" postingsFormat="BloomFilter"/>
{code}
Having to add a new fieldType for an alternate codec leads to configuration duplication and the potential for confusing problems. I would imagine that most people who are interested in alternate codecs will want to continue using an existing type. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
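Until per-field postingsFormat support exists, the duplication looks roughly like this hypothetical schema fragment (the "longBloom" type name is invented):
{code:xml}
<!-- A near-copy of the stock "long" type, differing only in postingsFormat -->
<fieldType name="longBloom" class="solr.TrieLongField" precisionStep="0"
           positionIncrementGap="0" postingsFormat="BloomFilter"/>
<field name="did" type="longBloom" indexed="true" stored="true"/>
{code}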
[jira] [Updated] (SOLR-3926) solrj should support better way of finding active sorts
[ https://issues.apache.org/jira/browse/SOLR-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eirik Lygre updated SOLR-3926: -- Affects Version/s: (was: 4.0-BETA) 4.0 solrj should support better way of finding active sorts --- Key: SOLR-3926 URL: https://issues.apache.org/jira/browse/SOLR-3926 Project: Solr Issue Type: Improvement Components: clients - java Affects Versions: 4.0 Reporter: Eirik Lygre Priority: Minor The Solrj api uses orthogonal concepts for setting/removing and getting sort information. Setting/removing uses a combination of (name, order), while getters return a single "name order" string:
{code}
public SolrQuery setSortField(String field, ORDER order);
public SolrQuery addSortField(String field, ORDER order);
public SolrQuery removeSortField(String field, ORDER order);
public String[] getSortFields();
public String getSortField();
{code}
If you want to use the current sort information to present a list of active sorts, with the possibility to remove them, you need to manually parse the string(s) returned from getSortFields, to recreate the information required by removeSortField(). Not difficult, but not convenient either :-) Therefore this suggestion: Add a new method {{public Map<String,ORDER> getSortFieldMap();}} which returns an ordered map of active sort fields. An example implementation is shown below (here as a utility method living outside SolrQuery; the rewrite should be trivial)
{code}
public Map<String, ORDER> getSortFieldMap(SolrQuery query) {
  String[] actualSortFields = query.getSortFields();
  if (actualSortFields == null || actualSortFields.length == 0)
    return Collections.emptyMap();
  Map<String, ORDER> sortFieldMap = new LinkedHashMap<String, ORDER>();
  for (String sortField : actualSortFields) {
    String[] fieldSpec = sortField.split(" ");
    sortFieldMap.put(fieldSpec[0], ORDER.valueOf(fieldSpec[1]));
  }
  return sortFieldMap;
}
{code}
For what it's worth, this is possible client code:
{code}
System.out.println("Active sorts");
Map<String, ORDER> fieldMap = getSortFieldMap(query);
for (String field : fieldMap.keySet()) {
  System.out.println("- " + field + "; dir=" + fieldMap.get(field));
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3950) Attempting postings=BloomFilter results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477036#comment-13477036 ] Mark Harwood commented on SOLR-3950: bq. If there is some schema config that will tell Solr to do the right thing, please let me know. Right now BloomPF is like an abstract class - you need to fill in the blanks as to what delegate it will use before you can use it at write-time. I think we have 3 options:
1) Solr (or you) provide a new PF impl that weds BloomPF with a choice of PF, e.g. Lucene40PF, so you would have a zero-arg-constructor class named something like BloomLucene40PF, or...
2) Solr extends the config file format to provide a generic means of assembling wrapper PFs like Bloom in their config, e.g. postingsFormat="BloomFilter" delegatePostingsFormat="FooPF", and Solr then does reflection magic to call constructors appropriately, or...
3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. Lucene40PF) if users, e.g. Solr, fail to nominate a choice of delegate for BloomPF.
Of these, 1) feels like the right thing. Cheers Mark Attempting postings=BloomFilter results in UnsupportedOperationException -- Key: SOLR-3950 URL: https://issues.apache.org/jira/browse/SOLR-3950 Project: Solr Issue Type: Bug Affects Versions: 4.1 Reporter: Shawn Heisey Fix For: 4.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
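For the record, option 1 amounts to only a few lines of code; a rough sketch (class and SPI name invented here, assuming the BloomFilteringPostingsFormat(PostingsFormat) constructor on branch_4x):
{code}
public final class BloomLucene40PostingsFormat extends PostingsFormat {
  // Weds BloomPF to a concrete delegate, so the zero-arg constructor
  // that SPI lookup needs is enough.
  private final PostingsFormat delegate =
      new BloomFilteringPostingsFormat(new Lucene40PostingsFormat());

  public BloomLucene40PostingsFormat() {
    super("BloomLucene40"); // the name you would reference from schema.xml
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state);
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    return delegate.fieldsProducer(state);
  }
}
{code}
The class would still need to be registered in META-INF/services/org.apache.lucene.codecs.PostingsFormat so the SPI lookup can find it by name.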
[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields
[ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477038#comment-13477038 ] Radim Kolar commented on LUCENE-4226: - is there an example config provided? Efficient compression of small to medium stored fields -- Key: LUCENE-4226 URL: https://issues.apache.org/jira/browse/LUCENE-4226 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.1, 5.0 Attachments: CompressionBenchmark.java, CompressionBenchmark.java, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java I've been doing some experiments with stored fields lately. It is very common for an index with stored fields enabled to have most of its space used by the .fdt index file. To prevent this .fdt file from growing too much, one option is to compress stored fields. Although compression works rather well for large fields, this is not the case for small fields and the compression ratio can be very close to 100%, even with efficient compression algorithms. In order to improve the compression ratio for small fields, I've written a {{StoredFieldsFormat}} that compresses several documents in a single chunk of data. To see how it behaves in terms of document deserialization speed and compression ratio, I've run several tests with different index compression strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text were indexed and stored):
- no compression,
- docs compressed with deflate (compression level = 1),
- docs compressed with deflate (compression level = 9),
- docs compressed with Snappy,
- using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and chunks of 6 docs,
- using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and chunks of 6 docs,
- using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 docs.
For those who don't know Snappy, it is a compression algorithm from Google which doesn't achieve very high compression ratios, but compresses and decompresses data very quickly.
{noformat}
Format           Compression ratio   IndexReader.document time
uncompressed     100%                100%
doc/deflate 1     59%                616%
doc/deflate 9     58%                595%
doc/snappy        80%                129%
index/deflate 1   49%                966%
index/deflate 9   46%                938%
index/snappy      65%                264%
{noformat}
(doc = doc-level compression, index = index-level compression) I find it interesting because it makes it possible to trade speed for space (with deflate, the .fdt file shrinks by a factor of 2, much better than with doc-level compression). One other interesting thing is that {{index/snappy}} is almost as compact as {{doc/deflate}} while it is more than 2x faster at retrieving documents from disk. These tests have been done on a hot OS cache, which is the worst case for compressed fields (one can expect better results for formats that have a high compression ratio since they probably require fewer read/write operations from disk). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields
[ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477041#comment-13477041 ] Simon Willnauer commented on LUCENE-4226: - @adrien I deleted the jenkins job for this. Efficient compression of small to medium stored fields -- Key: LUCENE-4226 URL: https://issues.apache.org/jira/browse/LUCENE-4226 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.1, 5.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-NightlyTests-trunk - Build # 63 - Still Failing
I think this is really https://issues.apache.org/jira/browse/LUCENE-4182 ? It seemed to be triggered several times before by NGramTokenizer with crazy params: e.g. large docs. So maybe this test is provoking it too for the same reason. I've never been able to reproduce these fails. On Sun, Oct 14, 2012 at 11:50 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: Build: https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/63/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestBagOfPositions.test Error Message: Captured an uncaught exception in thread: Thread[id=644, name=Thread-561, state=RUNNABLE, group=TGRP-TestBagOfPositions] Stack Trace: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=644, name=Thread-561, state=RUNNABLE, group=TGRP-TestBagOfPositions] Caused by: java.lang.AssertionError: ram was 33879456 expected: 33851840 flush mem: 18092896 activeMem: 15786560 pendingMem: 0 flushingMem: 3 blockedMem: 0 peakDeltaMem: 99136 at __randomizedtesting.SeedInfo.seed([11A534B74B63930E]:0) at org.apache.lucene.index.DocumentsWriterFlushControl.assertMemory(DocumentsWriterFlushControl.java:114) at org.apache.lucene.index.DocumentsWriterFlushControl.doAfterDocument(DocumentsWriterFlushControl.java:181) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:384) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1443) at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1122) at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:201) at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:160) at org.apache.lucene.index.TestBagOfPositions$1.run(TestBagOfPositions.java:110) Build Log: [...truncated 420 lines...] [junit4:junit4] Suite: org.apache.lucene.index.TestBagOfPositions [junit4:junit4] 2 NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory. 
[junit4:junit4] 2 NOTE: reproduce with: ant test -Dtestcase=TestBagOfPositions -Dtests.method=test -Dtests.seed=11A534B74B63930E -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=fi -Dtests.timezone=Africa/Conakry -Dtests.file.encoding=ISO-8859-1 [junit4:junit4] ERROR206s J0 | TestBagOfPositions.test [junit4:junit4] Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=644, name=Thread-561, state=RUNNABLE, group=TGRP-TestBagOfPositions] [junit4:junit4] Caused by: java.lang.AssertionError: ram was 33879456 expected: 33851840 flush mem: 18092896 activeMem: 15786560 pendingMem: 0 flushingMem: 3 blockedMem: 0 peakDeltaMem: 99136 [junit4:junit4]at __randomizedtesting.SeedInfo.seed([11A534B74B63930E]:0) [junit4:junit4]at org.apache.lucene.index.DocumentsWriterFlushControl.assertMemory(DocumentsWriterFlushControl.java:114) [junit4:junit4]at org.apache.lucene.index.DocumentsWriterFlushControl.doAfterDocument(DocumentsWriterFlushControl.java:181) [junit4:junit4]at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:384) [junit4:junit4]at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1443) [junit4:junit4]at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1122) [junit4:junit4]at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:201) [junit4:junit4]at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:160) [junit4:junit4]at org.apache.lucene.index.TestBagOfPositions$1.run(TestBagOfPositions.java:110) [junit4:junit4] Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=643, name=Thread-560, state=RUNNABLE, group=TGRP-TestBagOfPositions] [junit4:junit4] Caused by: java.lang.AssertionError: ram was 33879456 expected: 33851840 flush mem: 18092896 activeMem: 15786560 pendingMem: 0 flushingMem: 3 blockedMem: 0 peakDeltaMem: 99136 [junit4:junit4]at __randomizedtesting.SeedInfo.seed([11A534B74B63930E]:0) [junit4:junit4]at org.apache.lucene.index.DocumentsWriterFlushControl.assertMemory(DocumentsWriterFlushControl.java:114) [junit4:junit4]at org.apache.lucene.index.DocumentsWriterFlushControl.doAfterDocument(DocumentsWriterFlushControl.java:181) [junit4:junit4]at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:384) [junit4:junit4]at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1443) [junit4:junit4]
[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields
[ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477062#comment-13477062 ] Adrien Grand commented on LUCENE-4226: -- @radim you can have a look at CompressingCodec in lucene/test-framework. @Simon ok, thanks! Efficient compression of small to medium stored fields -- Key: LUCENE-4226 URL: https://issues.apache.org/jira/browse/LUCENE-4226 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 4.1, 5.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
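For anyone else looking for a starting point before this lands: the test-framework CompressingCodec boils down to a FilterCodec that swaps in the compressing StoredFieldsFormat. A hedged sketch (codec/format names invented, chunk size illustrative, constructor as in the current patch):
{code}
public class MyCompressingCodec extends FilterCodec {
  private final StoredFieldsFormat storedFields =
      new CompressingStoredFieldsFormat("MyStoredFields", CompressionMode.FAST, 1 << 14);

  public MyCompressingCodec() {
    super("MyCompressingCodec", new Lucene40Codec()); // delegate everything else
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
{code}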
[jira] [Updated] (SOLR-3952) TextResponseWriter/XMLWriter: Make escaping deactivatable
[ https://issues.apache.org/jira/browse/SOLR-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Lutze updated SOLR-3952: -- Attachment: disable_escape.patch TextResponseWriter/XMLWriter: Make escaping deactivatable - Key: SOLR-3952 URL: https://issues.apache.org/jira/browse/SOLR-3952 Project: Solr Issue Type: Improvement Components: Response Writers Affects Versions: 3.6 Reporter: Sebastian Lutze Priority: Minor Labels: escaping, response, xml Fix For: 4.1 Attachments: disable_escape.patch, disable_escape.patch Since we have full control over what is stored in our indexes, we want to retrieve highlighted terms or phrases in real XML tags ...
{code:xml}
<str>
  <em>Napoleon</em>
</str>
{code}
... rather than in escaped sequences:
{code:xml}
<str>
  &lt;em&gt;Napoleon&lt;/em&gt;
</str>
{code}
Until now I haven't discovered any solution that solves this problem out-of-the-box. We patched together a very crude workaround involving Cocoon's ServletService, an XSLT stylesheet and disableOutputEscaping="yes". Therefore this patch provides:
- a field doEscape in TextResponseWriter and corresponding getters/setters
- support for a request parameter escape=off to disable escaping
I'm not sure if I have chosen the optimal approach to address this issue, or if the issue is even an issue. Maybe there is a better way with Formatters/Encoders or something else? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
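The gist of the patch, as described, is a flag consulted before escaping. A simplified sketch of the idea (method shape approximate, not the patch verbatim):
{code}
void writeStr(String name, String val) throws IOException {
  // escape=off in the request disables escaping; the default stays on.
  boolean doEscape = req.getParams().getBool("escape", true);
  if (doEscape) {
    XML.escapeCharData(val, writer); // current behavior: emits &lt;em&gt;...
  } else {
    writer.write(val);               // raw pass-through: emits <em>...</em>
  }
}
{code}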
Re: [JENKINS] Lucene-trunk-Linux-Java6-64-test-only - Build # 9932 - Failure!
I committed a fix. Mike McCandless http://blog.mikemccandless.com On Tue, Oct 16, 2012 at 7:23 AM, Michael McCandless luc...@mikemccandless.com wrote: Hmmm I'll dig. Mike McCandless http://blog.mikemccandless.com On Mon, Oct 15, 2012 at 7:35 PM, buil...@flonkings.com wrote: Build: builds.flonkings.com/job/Lucene-trunk-Linux-Java6-64-test-only/9932/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestNRTThreads.testNRTThreads Error Message: saw non-zero open-but-deleted count Stack Trace: java.lang.AssertionError: saw non-zero open-but-deleted count at __randomizedtesting.SeedInfo.seed([447148DE18F87BA8:DFA85CC559036DC3]:0) at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.junit.Assert.assertFalse(Assert.java:68) at org.apache.lucene.index.TestNRTThreads.doSearching(TestNRTThreads.java:89) at org.apache.lucene.index.ThreadedIndexingAndSearchingTestCase.runTest(ThreadedIndexingAndSearchingTestCase.java:507) at org.apache.lucene.index.TestNRTThreads.testNRTThreads(TestNRTThreads.java:127) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559) at com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79) at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737) at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773) at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787) at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50) at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51) at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55) at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70) at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358) at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782) at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442) at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746) at com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:648) at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:682) at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:693) at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42) at 
com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55) at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39) at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:43) at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48) at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70) at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at
Re: [JENKINS] Lucene-trunk-Linux-Java7-64-test-only - Build # 9691 - Failure!
On Sat, Oct 13, 2012 at 12:05 PM, Robert Muir rcm...@gmail.com wrote: This one is now a nightly-only test! So maybe we can safely enable this for the hourly builds? +1 Seems like we just need something to prune them if disk is getting full? Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4485) CheckIndex's term stats should not include deleted docs
[ https://issues.apache.org/jira/browse/LUCENE-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4485: --- Attachment: LUCENE-4485.patch Simple patch ... CheckIndex's term stats should not include deleted docs --- Key: LUCENE-4485 URL: https://issues.apache.org/jira/browse/LUCENE-4485 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-4485.patch I was looking at the CheckIndex output on an index that has deletions, eg:
{noformat}
  4 of 30: name=_90 docCount=588408
    codec=Lucene41
    compound=false
    numFiles=14
    size (MB)=265.318
    diagnostics = {os=Linux, os.version=3.2.0-23-generic, mergeFactor=10, source=merge, lucene.version=5.0-SNAPSHOT, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_07, java.vendor=Oracle Corporation}
    has deletions [delGen=1]
    test: open reader.........OK [39351 deleted docs]
    test: fields..............OK [8 fields]
    test: field norms.........OK [2 fields]
    test: terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 65597188 tokens]
    test (ignoring deletes): terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 70293065 tokens]
    test: stored fields.......OK [1647171 total field count; avg 3 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 1 docvalues fields]
{noformat}
If you compare the {{test: terms, freq, prox}} line (includes deletions) and the next line (doesn't include deletions), it's confusing because only the 3rd number (tokens) reflects deletions. I think the first two numbers should also reflect deletions? This way an app could get a sense of how much deadweight is in the index due to un-reclaimed deletions... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
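For reference, the report above comes from CheckIndex, which an app can also run programmatically if it wants these stats (a minimal sketch):
{code}
Directory dir = FSDirectory.open(new File("/path/to/index"));
CheckIndex checker = new CheckIndex(dir);
checker.setInfoStream(System.out);            // prints the per-segment report shown above
CheckIndex.Status status = checker.checkIndex();
System.out.println("index clean? " + status.clean);
{code}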
[jira] [Commented] (SOLR-3881) frequent OOM in LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477104#comment-13477104 ] Hoss Man commented on SOLR-3881: bq. The reason why we concat all fl fields before detection was originally because Tika's detector gets better and better the longer the input text you have. But is it possible to give Tika a String[] or List<String> instead of concatenating everything into a single String? frequent OOM in LanguageIdentifierUpdateProcessor - Key: SOLR-3881 URL: https://issues.apache.org/jira/browse/SOLR-3881 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: CentOS 6.x, JDK 1.6, (java -server -Xms2G -Xmx2G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=) Reporter: Rob Tulloh We are seeing frequent failures from Solr causing it to OOM. Here is the stack trace we observe when this happens:
{noformat}
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.concatFields(LanguageIdentifierUpdateProcessor.java:286)
        at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.process(LanguageIdentifierUpdateProcessor.java:189)
        at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:171)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$2.update(BinaryUpdateRequestHandler.java:90)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:140)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:120)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:105)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:147)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:100)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:47)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:58)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
Shawn Heisey created SOLR-3954: -- Summary: Option to have updateHandler and DIH skip updateLog Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477106#comment-13477106 ] Shawn Heisey commented on SOLR-3954: I was unsure what to put for the priority. Minor seems slightly too low and Major seems too high. Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4485) CheckIndex's term stats should not include deleted docs
[ https://issues.apache.org/jira/browse/LUCENE-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477112#comment-13477112 ] Robert Muir commented on LUCENE-4485: - +1 CheckIndex's term stats should not include deleted docs --- Key: LUCENE-4485 URL: https://issues.apache.org/jira/browse/LUCENE-4485 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-4485.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3881) frequent OOM in LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477123#comment-13477123 ] Robert Muir commented on SOLR-3881: --- The langdetect implementation can append each piece one at a time. It can also take a Reader: append(Reader), but that is really just syntactic sugar that forwards to append(String) without exceeding Detector.max_text_length. Seems like the concatenating stuff should be pushed out of the base class into the Tika impl. frequent OOM in LanguageIdentifierUpdateProcessor - Key: SOLR-3881 URL: https://issues.apache.org/jira/browse/SOLR-3881 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: CentOS 6.x, JDK 1.6, (java -server -Xms2G -Xmx2G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=) Reporter: Rob Tulloh -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
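If the concatenation were pushed down as Robert suggests, the langdetect path could feed field values piecewise and stay within a bounded budget. A hedged sketch (the field iteration and variable names are invented; Detector/DetectorFactory are the langdetect API):
{code}
Detector detector = DetectorFactory.create();
detector.setMaxTextLength(10000);              // bound how much text is examined
for (String fieldName : inputFields) {         // the configured fl fields
  Collection<Object> values = doc.getFieldValues(fieldName);
  if (values == null) continue;
  for (Object value : values) {
    if (value instanceof String) {
      detector.append((String) value);         // no giant StringBuffer in between
    }
  }
}
String language = detector.detect();
{code}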
[jira] [Reopened] (SOLR-3843) Add lucene-codecs to Solr libs?
[ https://issues.apache.org/jira/browse/SOLR-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley reopened SOLR-3843: Reopening. Core codecs and Solr should just work w/o requiring users to copy any jar files around. Add lucene-codecs to Solr libs? --- Key: SOLR-3843 URL: https://issues.apache.org/jira/browse/SOLR-3843 Project: Solr Issue Type: Wish Affects Versions: 4.0 Reporter: Adrien Grand Priority: Minor Fix For: 4.1 Solr gives the ability to its users to select the postings format to use on a per-field basis but only Lucene40PostingsFormat is available by default (unless users add lucene-codecs to the Solr lib directory). Maybe we should add lucene-codecs to Solr libs (I mean in the WAR file) so that people can try our non-default postings formats with minimum effort? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-3843) Add lucene-codecs to Solr libs?
[ https://issues.apache.org/jira/browse/SOLR-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-3843: --- Priority: Critical (was: Minor) Affects Version/s: 4.0 Fix Version/s: 4.1 Add lucene-codecs to Solr libs? --- Key: SOLR-3843 URL: https://issues.apache.org/jira/browse/SOLR-3843 Project: Solr Issue Type: Wish Affects Versions: 4.0 Reporter: Adrien Grand Priority: Critical Fix For: 4.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477196#comment-13477196 ] Mark Miller commented on SOLR-3954: --- What config are you using? The updateLog should not normally have this kind of performance penalty. In any case, I don't think we would add an option to skip the update log - you can remove it if the performance is unacceptable. Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
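For reference, the transaction log is turned on by this block in solrconfig.xml; removing or commenting it out disables the log entirely (at the cost of SolrCloud recovery and realtime-get):
{code:xml}
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Remove/comment this element to disable the transaction log. -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
{code}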
[jira] [Commented] (SOLR-3939) Solr Cloud recovery and leader election when unloading leader core
[ https://issues.apache.org/jira/browse/SOLR-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477222#comment-13477222 ] Joel Bernstein commented on SOLR-3939: -- It looks like after the leader is unloaded, the replica attempts to sync to the unloaded leader as part of the process to determine if it can be leader. When this fails, it thinks that there are better candidates to become leader. Then it goes into a recovery loop. Solr Cloud recovery and leader election when unloading leader core -- Key: SOLR-3939 URL: https://issues.apache.org/jira/browse/SOLR-3939 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.0-BETA, 4.0 Reporter: Joel Bernstein Assignee: Mark Miller Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: cloud.log, SOLR-3939.patch When a leader core is unloaded using the core admin api, the followers in the shard go into recovery but do not come out. Leader election doesn't take place and the shard goes down. This affects the ability to move a micro-shard from one Solr instance to another Solr instance. The problem does not occur 100% of the time, but a large % of the time. To set up a test, start up Solr Cloud with a single shard. Add cores to that shard as replicas using core admin. Then unload the leader core using core admin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3939) Solr Cloud recovery and leader election when unloading leader core
[ https://issues.apache.org/jira/browse/SOLR-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477247#comment-13477247 ] Mark Miller commented on SOLR-3939: --- That's what I see when I have an empty index. The leader sync fails because sync always fails with no local versions. The case with docs is perhaps a bit trickier since my simple test passes. I'll take a look at the logs. Solr Cloud recovery and leader election when unloading leader core -- Key: SOLR-3939 URL: https://issues.apache.org/jira/browse/SOLR-3939 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.0-BETA, 4.0 Reporter: Joel Bernstein Assignee: Mark Miller Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: cloud.log, SOLR-3939.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3939) Solr Cloud recovery and leader election when unloading leader core
[ https://issues.apache.org/jira/browse/SOLR-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477250#comment-13477250 ] Mark Miller commented on SOLR-3939: --- I think I see the issue. While we have talked about it, we don't currently try to populate the transaction log after a replication. So, the second core replica is replicating, it's got docs but no versions, then it tries to become the leader - but just like with the empty index, it cannot successfully sync with no versions as a frame of reference. Solr Cloud recovery and leader election when unloading leader core -- Key: SOLR-3939 URL: https://issues.apache.org/jira/browse/SOLR-3939 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.0-BETA, 4.0 Reporter: Joel Bernstein Assignee: Mark Miller Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: cloud.log, SOLR-3939.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-3939) Solr Cloud recovery and leader election when unloading leader core
[ https://issues.apache.org/jira/browse/SOLR-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-3939: -- Priority: Critical (was: Major) Solr Cloud recovery and leader election when unloading leader core -- Key: SOLR-3939 URL: https://issues.apache.org/jira/browse/SOLR-3939 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.0-BETA, 4.0 Reporter: Joel Bernstein Assignee: Mark Miller Priority: Critical Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: cloud.log, SOLR-3939.patch When a leader core is unloaded using the core admin api, the followers in the shard go into recovery but do not come out. Leader election doesn't take place and the shard goes down. This affects the ability to move a micro-shard from one Solr instance to another Solr instance. The problem does not occur 100% of the time, but a large percentage of the time. To set up a test, start up Solr Cloud with a single shard. Add cores to that shard as replicas using core admin. Then unload the leader core using core admin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3939) Solr Cloud recovery and leader election when unloading leader core
[ https://issues.apache.org/jira/browse/SOLR-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477252#comment-13477252 ] Mark Miller commented on SOLR-3939: --- (My test was passing because I had the replica up initially, so it got the docs from the leader rather than through replication.) Solr Cloud recovery and leader election when unloading leader core -- Key: SOLR-3939 URL: https://issues.apache.org/jira/browse/SOLR-3939 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.0-BETA, 4.0 Reporter: Joel Bernstein Assignee: Mark Miller Priority: Critical Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: cloud.log, SOLR-3939.patch When a leader core is unloaded using the core admin api, the followers in the shard go into recovery but do not come out. Leader election doesn't take place and the shard goes down. This affects the ability to move a micro-shard from one Solr instance to another Solr instance. The problem does not occur 100% of the time, but a large percentage of the time. To set up a test, start up Solr Cloud with a single shard. Add cores to that shard as replicas using core admin. Then unload the leader core using core admin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-3955) Return only matched multiValued field
Dotan Cohen created SOLR-3955: - Summary: Return only matched multiValued field Key: SOLR-3955 URL: https://issues.apache.org/jira/browse/SOLR-3955 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Dotan Cohen Assuming a multivalued, stored and indexed field named comment. When performing a search, it would be very helpful if there were a way to return only the values of comment which contain the match. For example, when searching for gold, instead of getting this result:
<doc>
  <arr name="comment">
    <str>Theres a lady whos sure</str>
    <str>all that glitters is gold</str>
    <str>and shes buying a stairway to heaven</str>
  </arr>
</doc>
I would prefer to get this result:
<doc>
  <arr name="comment">
    <str>all that glitters is gold</str>
  </arr>
</doc>
(pseudo-XML from memory; it may not be accurate, but it illustrates the point) Thanks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
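Until a feature like this exists, the highlighting component can approximate it, since highlighting returns only the matching values/snippets of a stored field. The parameters below are the stock highlighter parameters; the collection and field names are taken from the example above:
{code}
http://localhost:8983/solr/collection1/select?q=comment:gold&hl=true&hl.fl=comment&hl.snippets=10
{code}
The matching values come back under the highlighting section of the response rather than in the doc itself, so this is a workaround, not a substitute for the requested feature.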
[jira] [Created] (SOLR-3956) group.facet and facet.limit=-1 returns no facet counts
Mike Spencer created SOLR-3956: -- Summary: group.facet and facet.limit=-1 returns no facet counts Key: SOLR-3956 URL: https://issues.apache.org/jira/browse/SOLR-3956 Project: Solr Issue Type: Bug Components: search Affects Versions: 4.0 Reporter: Mike Spencer Attempting to use group.facet=true and facet.limit=-1 to return all facets from a grouped result ends up with no facet counts being returned. Adjusting facet.limit to any number greater than 0 returns the facet counts as expected. This does not appear to be limited to a specific field type; I have tried text, string, boolean, and double types (both multiValued and not). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
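For anyone trying to reproduce, a request of this shape should show the problem (group.field/facet.field values are placeholders; the parameter names themselves are standard):
{code}
http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=groupfield&group.facet=true&facet=true&facet.field=facetfield&facet.limit=-1
{code}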
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477283#comment-13477283 ] Shawn Heisey commented on SOLR-3954: Which specific configuration bits would you like to see? My solrconfig.xml file is heavily split into separate files and uses xinclude. I will go ahead and paste my best guesses now.
{code}
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">4</int>
    <int name="maxThreadCount">4</int>
  </mergeScheduler>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>32768</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>native</lockType>
</indexDefaults>
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>0</maxDocs>
    <maxTime>0</maxTime>
  </autoCommit>
  <!-- <updateLog /> -->
</updateHandler>
{code}
My schema has 47 fields defined. Not all fields in a typical document will be there, but at least half of them usually will be present. I use the ICU classes for lowercasing, and most of the text fieldTypes use WordDelimiterFilter.
{code}
<fields>
  <field name="catchall" type="genText" indexed="true" stored="false" multiValued="true" termVectors="true"/>
  <field name="doc_date" type="tdate" indexed="true" stored="true"/>
  <field name="pd" type="tdate" indexed="true" stored="true"/>
  <field name="ft_text" type="ignored"/>
  <field name="mime_type" type="mimeText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="ft_dname" type="genText" indexed="true" stored="true"/>
  <field name="ft_subject" type="genText" indexed="true" stored="true"/>
  <field name="action" type="keyText" indexed="true" stored="true"/>
  <field name="attribute" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="category" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="caption_writer" type="keyText" indexed="true" stored="true"/>
  <field name="doc_id" type="keyText" indexed="true" stored="true"/>
  <field name="ft_owner" type="keyText" indexed="true" stored="true"/>
  <field name="location" type="keyText" indexed="true" stored="true"/>
  <field name="special" type="keyText" indexed="true" stored="true"/>
  <field name="special_cats" type="keyText" indexed="true" stored="true"/>
  <field name="selector" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="scode" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="byline" type="sourceText" indexed="true" stored="true"/>
  <field name="credit" type="sourceText" indexed="true" stored="false"/>
  <field name="keywords" type="sourceText" indexed="true" stored="true"/>
  <field name="source" type="sourceText" indexed="true" stored="true"/>
  <field name="sg" type="lcsemi" indexed="true" stored="false" omitTermFreqAndPositions="true"/>
  <field name="aimcode" type="lowercase" indexed="true" stored="false" omitTermFreqAndPositions="true"/>
  <field name="nc_lang" type="lowercase" indexed="true" stored="false" omitTermFreqAndPositions="true"/>
  <field name="tag_id" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="collection" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="feature" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="ip" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="longdim" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="webtable" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="set_name" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
  <field name="did" type="long" indexed="true" stored="true" postingsFormat="BloomFilter"/>
  <field name="doc_size" type="long" indexed="true" stored="true"/>
  <field name="post_date" type="tlong" indexed="true" stored="true"/>
  <field name="post_hour" type="tlong" indexed="true" stored="true"/>
  <field name="set_count" type="int" indexed="false" stored="true"/>
  <field name="set_lead" type="boolean" indexed="true" stored="true" default="true"/>
  <field name="format" type="string" indexed="false" stored="true"/>
  <field name="ft_sfname" type="string" indexed="false" stored="true"/>
  <field name="text_preview" type="string" indexed="false" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="headline" type="keyText" indexed="true" stored="true"/>
  <field name="mood" type="keyText" indexed="true" stored="true"/>
  <field name="object" type="keyText" indexed="true" stored="true"/>
  <field name="personality" type="keyText" indexed="true" stored="true"/>
  <field name="poster" type="keyText" indexed="true" stored="true"/>
</fields>
{code}
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477289#comment-13477289 ] Shawn Heisey commented on SOLR-3954: You'll notice that one field has postingsFormat. That was for another bug that I filed; it doesn't make any difference for this issue. I will set up my import again so I can illustrate the performance impact from updateLog. Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477293#comment-13477293 ] Shawn Heisey commented on SOLR-3954: This is my most intense fieldType definition:
{code}
<fieldType name="genText" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2" allowempty="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2" allowempty="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
{code}
Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
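To make the chain above concrete, here is roughly what the index-time analyzer would do to one whitespace token (an illustrative trace, not actual analysis-page output):
{code}
input token:          "Wi-Fi,"
PatternReplaceFilter: Wi-Fi                   (leading/trailing punctuation stripped)
WordDelimiterFilter:  Wi-Fi, Wi, Fi, WiFi     (preserveOriginal + word parts + catenateWords)
ICUFoldingFilter:     wi-fi, wi, fi, wifi     (case/diacritic folding)
LengthFilter:         unchanged               (all tokens are within 1..512 chars)
{code}
Note that at query time catenateWords=0, so the combined wifi token is produced only on the index side.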
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477326#comment-13477326 ] Shawn Heisey commented on SOLR-3954: A completed import with updateLog turned off:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">dih-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">12947488</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-10-16 07:46:01</str>
    <str name="">Indexing completed. Added/Updated: 12947488 documents. Deleted 0 documents.</str>
    <str name="Committed">2012-10-16 11:17:48</str>
    <str name="Total Documents Processed">12947488</str>
    <str name="Time taken">3:31:47.508</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
{code}
Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
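For scale, the run above works out to roughly 1,019 documents per second: 3:31:47.508 is about 12,707 seconds, and 12,947,488 / 12,707 ≈ 1,019 docs/sec with the update log disabled.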
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477333#comment-13477333 ] David Smiley commented on SOLR-3954: FWIW, I've seen the updateLog grow to huge sizes during my bulk imports. I commit at the end (of course); there are no soft commits or auto commits in between. The updateLog is a hindrance during bulk imports. Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-3957) Remove response WARNING of This response format is experimental
Erik Hatcher created SOLR-3957: -- Summary: Remove response WARNING of This response format is experimental Key: SOLR-3957 URL: https://issues.apache.org/jira/browse/SOLR-3957 Project: Solr Issue Type: Wish Affects Versions: 4.0 Reporter: Erik Hatcher Priority: Minor Fix For: 5.0 Remove all the useless (which I daresay is all of them) response WARNINGs stating This response format is experimental. At this point, all of these formats are more than just experimental; and even where things are still subject to change, the changes can in most cases be made in a compatible manner anyway. Less noise. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4472) Add setting that prevents merging on updateDocument
[ https://issues.apache.org/jira/browse/LUCENE-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477418#comment-13477418 ] Michael McCandless commented on LUCENE-4472: I think forced merges or forced reclaiming of deletions, both invoked by explicit app request, are very different use cases from the natural merging Lucene does during indexing (not directly invoked by the app, but as a side effect of other API calls). So I think it makes sense that the MP has separate methods to handle these very different use cases. I don't think we should use a single-param / single-method XXXContext approach to bypass back compat. We already tried this with ScorerContext but backed it out because of the loss of type safety... for expert APIs like this one I think it's actually good to require apps to revisit their impls on upgrading, if we've added parameters: it gives them a chance to improve their impls. Plus this API is already marked @experimental... Also, a single method taking a single XXXContext obj means that method would have to have a switch or a bunch of if statements to handle what are in fact very different use cases, which is rather awkward. Still, separately I would love to make forceMerge/Deletes un-public so you have to work harder to invoke them (eg maybe you invoke the merge policy directly and then call IW.maybeMerge ... or something). We can do that separately... Add setting that prevents merging on updateDocument --- Key: LUCENE-4472 URL: https://issues.apache.org/jira/browse/LUCENE-4472 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.1, 5.0 Attachments: LUCENE-4472.patch, LUCENE-4472.patch Currently we always call maybeMerge if a segment was flushed after updateDocument. Some apps, in particular ElasticSearch, use some hacky workarounds to disable that, e.g. for merge throttling. It should be easier to enable this kind of behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
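For readers not steeped in this code: the separation Mike describes is already the shape of the MergePolicy API. A simplified sketch (the method names are real; the signatures are trimmed for illustration and vary across 4.x releases):
{code}
// Simplified from Lucene's MergePolicy; signatures trimmed, not copy-paste code.
public abstract class MergePolicySketch {
  // Natural merges: a side effect of segment flushes during normal indexing,
  // never invoked directly by the application.
  public abstract MergeSpecification findMerges(SegmentInfos infos);

  // Explicit application request via IndexWriter.forceMerge(maxNumSegments).
  public abstract MergeSpecification findForcedMerges(SegmentInfos infos, int maxNumSegments);

  // Explicit application request via IndexWriter.forceMergeDeletes().
  public abstract MergeSpecification findForcedDeletesMerges(SegmentInfos infos);
}
{code}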
[jira] [Commented] (SOLR-3881) frequent OOM in LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477426#comment-13477426 ] Jan Høydahl commented on SOLR-3881: --- Probably built-in truncation is enough to avoid the OOMs, and we could refactor the multi string append if necessary later. frequent OOM in LanguageIdentifierUpdateProcessor - Key: SOLR-3881 URL: https://issues.apache.org/jira/browse/SOLR-3881 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: CentOS 6.x, JDK 1.6, (java -server -Xms2G -Xmx2G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=) Reporter: Rob Tulloh We are seeing frequent failures from Solr causing it to OOM. Here is the stack trace we observe when this happens:
{noformat}
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.concatFields(LanguageIdentifierUpdateProcessor.java:286)
        at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.process(LanguageIdentifierUpdateProcessor.java:189)
        at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:171)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$2.update(BinaryUpdateRequestHandler.java:90)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:140)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:120)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:105)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:147)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:100)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:47)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:58)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
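Jan's built-in truncation suggestion could look roughly like the following inside a concatFields-style method (a sketch only; maxAppendPerField is an invented knob, not an existing Solr setting):
{code}
import org.apache.solr.common.SolrInputDocument;

// Hedged sketch of per-field truncation for language-detection input.
class ConcatFieldsSketch {
  static String concatFields(SolrInputDocument doc, String[] inputFields, int maxAppendPerField) {
    StringBuilder sb = new StringBuilder();
    for (String fieldName : inputFields) {
      Object value = doc.getFieldValue(fieldName);
      if (value instanceof String) {
        String text = (String) value;
        // Language identification only needs a prefix of the text, so cap what
        // is appended instead of buffering entire multi-megabyte field values.
        sb.append(text, 0, Math.min(text.length(), maxAppendPerField)).append(' ');
      }
    }
    return sb.toString();
  }
}
{code}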
[jira] [Resolved] (LUCENE-4485) CheckIndex's term stats should not include deleted docs
[ https://issues.apache.org/jira/browse/LUCENE-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-4485. Resolution: Fixed Fix Version/s: 5.0 4.1 CheckIndex's term stats should not include deleted docs --- Key: LUCENE-4485 URL: https://issues.apache.org/jira/browse/LUCENE-4485 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.1, 5.0 Attachments: LUCENE-4485.patch I was looking at the CheckIndex output on an index that has deletions, eg:
{noformat}
  4 of 30: name=_90 docCount=588408
    codec=Lucene41
    compound=false
    numFiles=14
    size (MB)=265.318
    diagnostics = {os=Linux, os.version=3.2.0-23-generic, mergeFactor=10, source=merge, lucene.version=5.0-SNAPSHOT, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_07, java.vendor=Oracle Corporation}
    has deletions [delGen=1]
    test: open reader.........OK [39351 deleted docs]
    test: fields..............OK [8 fields]
    test: field norms.........OK [2 fields]
    test: terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 65597188 tokens]
    test (ignoring deletes): terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 70293065 tokens]
    test: stored fields.......OK [1647171 total field count; avg 3 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 1 docvalues fields]
{noformat}
If you compare the {{test: terms, freq, prox}} line (which includes deletions) with the next line (which doesn't), it's confusing because only the 3rd number (tokens) reflects deletions. I think the first two numbers should also reflect deletions. This way an app could get a sense of how much deadweight is in the index due to un-reclaimed deletions... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
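For reference, output like the above comes from running CheckIndex standalone; the invocation is roughly as follows (jar name and index path are illustrative):
{code}
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index
{code}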
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477434#comment-13477434 ] Michael McCandless commented on LUCENE-4484: bq. (Solr defaults to NRTCachingDir) Maybe it shouldn't? Or ... does it also default to NRT searching, like ElasticSearch (I think), i.e. frequently opening a new searcher? In which case it's a good default I think... NRTCachingDir can't handle large files -- Key: LUCENE-4484 URL: https://issues.apache.org/jira/browse/LUCENE-4484 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless I dug into this OOME, which easily repros for me on rev 1398268: {noformat} ant test -Dtestcase=Test4GBStoredFields -Dtests.method=test -Dtests.seed=2D89DD229CD304F5 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=ru -Dtests.timezone=Asia/Vladivostok -Dtests.file.encoding=UTF-8 -Dtests.verbose=true {noformat} The problem is the test got NRTCachingDir ... which cannot handle large files because it decides up front (when createOutput is called) whether the file will be in RAMDir vs wrapped dir ... so if that file turns out to be immense (which this test does since stored fields files can grow arbitrarily huge w/o any flush happening) then it takes unbounded RAM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
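For context on the trade-off being discussed: NRTCachingDirectory wraps a delegate directory and keeps small, newly flushed files in RAM so frequent NRT reopens avoid disk I/O. A minimal Java sketch (the 5.0/60.0 thresholds are illustrative, not recommendations):
{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

class NrtCachingSketch {
  static Directory open(String path) throws IOException {
    Directory fsDir = FSDirectory.open(new File(path));
    // Cache files expected to be <= 5 MB each, up to 60 MB of RAM in total;
    // anything larger is written straight through to the wrapped directory.
    return new NRTCachingDirectory(fsDir, 5.0, 60.0);
  }
}
{code}
The bug in this issue is that the keep-in-RAM decision is made once, at createOutput time, so a file that grows far past the estimate stays in RAM and memory use is unbounded.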
[jira] [Created] (SOLR-3958) Solr should log a warning when old healthcheck method configured
Shawn Heisey created SOLR-3958: -- Summary: Solr should log a warning when old healthcheck method configured Key: SOLR-3958 URL: https://issues.apache.org/jira/browse/SOLR-3958 Project: Solr Issue Type: Improvement Affects Versions: 4.0 Reporter: Shawn Heisey Priority: Minor Fix For: 4.1 The old (3.x and earlier) way of handling a health check (with enable/disable functionality) has changed in Solr 4.0. If you are upgrading and still have the old method in the admin section, I believe that Solr should put a warning in the log. Currently it is just ignored. I do not believe it should keep Solr from starting, just log a warning. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477445#comment-13477445 ] Shawn Heisey commented on SOLR-3954: Here's a direct comparison on the same hardware. It might be important to know that when my import gets kicked off, there are actually four imports running. One of them is small -- during the second test (updateLog off), it imported 687765 rows in 10 minutes and 8 seconds. I did not check how long it took during the first test. The other three imports are all nearly 13 million records each. A du on the completed index directory with 12.9 million records shows 23520900 KB. I ran the first test and grabbed stats after an hour. Then I killed Solr, commented out updateLog, started it up again, kicked off the full-import, and again grabbed stats after an hour. Comparing the two shows that it is about twice as fast with updateLog turned off. With updateLog turned on:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">dih-config.xml</str>
    </lst>
  </lst>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">1:0:1.762</str>
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">2052096</str>
    <str name="Total Documents Processed">2052095</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-10-16 14:59:01</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
{code}
With updateLog turned off:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">dih-config.xml</str>
    </lst>
  </lst>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">1:0:0.434</str>
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">4167525</str>
    <str name="Total Documents Processed">4167524</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-10-16 16:05:01</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
{code}
Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
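Put differently, the two runs above show 2,052,096 rows fetched in the first hour with the update log on versus 4,167,525 with it off, a factor of about 2.03, which matches the "about twice as fast" observation.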
[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477462#comment-13477462 ] Shawn Heisey commented on SOLR-3954: bq. In any case, I don't think we would add an option to skip the update log - you can remove it if the performance is unacceptable. When I revamp my SolrJ application, I plan to use soft commit on a very short interval (maybe 10 seconds) but only do a hard commit every five minutes, possibly even less often. If I understand the updateLog functionality right, and I don't claim that I do, it would mean that my SolrJ code would not need to keep separate track of which updates succeeded with soft commit and which ones succeeded with hard commit. If the server went down four minutes and 55 seconds after the last hard commit, I would have a reasonable expectation that when it came back up, all those soft commits would get properly applied to my index. Assuming I have a proper understanding above, I want the updateLog for my incremental updates. It makes the bulk import take at least twice as long, and I do not need it there, because if that fails, I will just start it over. If I am going to benefit from updateLog, I need to be able to turn it off for bulk indexing. Is there a way to create a second updateHandler that does not have updateLog enabled and tell DIH to use that handler? Option to have updateHandler and DIH skip updateLog --- Key: SOLR-3954 URL: https://issues.apache.org/jira/browse/SOLR-3954 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Shawn Heisey Fix For: 4.1 The updateLog feature makes updates take longer, likely because of the I/O time required to write the additional information to disk. It may take as much as three times as long for the indexing portion of the process. I'm not sure whether it affects the time to commit, but I would imagine that the difference there is small or zero. When doing incremental updates/deletes on an existing index, the time lag is probably very small and unimportant. When doing a full reindex (which may happen via DIH), especially if this is done in a build core that is then swapped with a live core, this performance hit is unacceptable. It seems to make the import take about three times as long. An option to have an update skip the updateLog would be very useful for these situations. It should have a method in SolrJ and be exposed in DIH as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
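The incremental-update half of that plan maps onto solrconfig.xml roughly as follows (a sketch using the 10-second/5-minute numbers from the comment above; openSearcher=false is an assumption that hard commits are for durability only, with visibility left to the soft commits):
{code}
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>
  <autoSoftCommit>
    <maxTime>10000</maxTime>    <!-- visibility: soft commit every 10 seconds -->
  </autoSoftCommit>
  <autoCommit>
    <maxTime>300000</maxTime>   <!-- durability: hard commit every 5 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
{code}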
[jira] [Created] (SOLR-3959) csv output is invalid csv if there is a currency field
Robert Muir created SOLR-3959: - Summary: csv output is invalid csv if there is a currency field Key: SOLR-3959 URL: https://issues.apache.org/jira/browse/SOLR-3959 Project: Solr Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Like in the example: http://localhost:8983/solr/collection1/select?q=*%3A*&fl=price_c&wt=csv -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
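A plausible mechanism, offered as an assumption rather than anything stated in the report: the stored value of a currency field such as price_c contains an embedded comma (amount,currency code), so emitting it unquoted shifts every following CSV column. Illustratively:
{code}
id,name,price_c
1,first doc,1.0,USD     <- unquoted embedded comma: parsers see four columns, not three
{code}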
[JENKINS] Lucene-trunk-Linux-Java7-64-test-only - Build # 10035 - Failure!
Build: builds.flonkings.com/job/Lucene-trunk-Linux-Java7-64-test-only/10035/
1 tests failed. REGRESSION: org.apache.lucene.search.TestTimeLimitingCollector.testSearchMultiThreaded
Error Message: Captured an uncaught exception in thread: Thread[id=255, name=Thread-198, state=RUNNABLE, group=TGRP-TestTimeLimitingCollector]
Stack Trace:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=255, name=Thread-198, state=RUNNABLE, group=TGRP-TestTimeLimitingCollector]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at __randomizedtesting.SeedInfo.seed([41D53676D3187506]:0)
        at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>(BlockTreeTermsReader.java:2266)
        at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>(BlockTreeTermsReader.java:1275)
        at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator(BlockTreeTermsReader.java:525)
        at org.apache.lucene.index.FilterAtomicReader$FilterTerms.iterator(FilterAtomicReader.java:86)
        at org.apache.lucene.index.AssertingAtomicReader$AssertingTerms.iterator(AssertingAtomicReader.java:99)
        at org.apache.lucene.index.MultiTerms.iterator(MultiTerms.java:103)
        at org.apache.lucene.index.TermContext.build(TermContext.java:94)
        at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:167)
        at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:186)
        at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:400)
        at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:648)
        at org.apache.lucene.search.AssertingIndexSearcher.createNormalizedWeight(AssertingIndexSearcher.java:60)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:293)
        at org.apache.lucene.search.TestTimeLimitingCollector.search(TestTimeLimitingCollector.java:124)
        at org.apache.lucene.search.TestTimeLimitingCollector.doTestSearch(TestTimeLimitingCollector.java:139)
        at org.apache.lucene.search.TestTimeLimitingCollector.access$200(TestTimeLimitingCollector.java:42)
        at org.apache.lucene.search.TestTimeLimitingCollector$1.run(TestTimeLimitingCollector.java:292)
Build Log:
[...truncated 1072 lines...]
[junit4:junit4] Suite: org.apache.lucene.search.TestTimeLimitingCollector
[junit4:junit4]   2> oct 16, 2012 5:21:37 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
[junit4:junit4]   2> Warning: Uncaught exception in thread: Thread[Thread-198,5,TGRP-TestTimeLimitingCollector]
[junit4:junit4]   2> java.lang.OutOfMemoryError: Java heap space
[junit4:junit4]   2>    at __randomizedtesting.SeedInfo.seed([41D53676D3187506]:0)
[junit4:junit4]   2>    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>(BlockTreeTermsReader.java:2266)
[junit4:junit4]   2>    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>(BlockTreeTermsReader.java:1275)
[junit4:junit4]   2>    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator(BlockTreeTermsReader.java:525)
[junit4:junit4]   2>    at org.apache.lucene.index.FilterAtomicReader$FilterTerms.iterator(FilterAtomicReader.java:86)
[junit4:junit4]   2>    at org.apache.lucene.index.AssertingAtomicReader$AssertingTerms.iterator(AssertingAtomicReader.java:99)
[junit4:junit4]   2>    at org.apache.lucene.index.MultiTerms.iterator(MultiTerms.java:103)
[junit4:junit4]   2>    at org.apache.lucene.index.TermContext.build(TermContext.java:94)
[junit4:junit4]   2>    at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:167)
[junit4:junit4]   2>    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:186)
[junit4:junit4]   2>    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:400)
[junit4:junit4]   2>    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:648)
[junit4:junit4]   2>    at org.apache.lucene.search.AssertingIndexSearcher.createNormalizedWeight(AssertingIndexSearcher.java:60)
[junit4:junit4]   2>    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:293)
[junit4:junit4]   2>    at org.apache.lucene.search.TestTimeLimitingCollector.search(TestTimeLimitingCollector.java:124)
[junit4:junit4]   2>    at org.apache.lucene.search.TestTimeLimitingCollector.doTestSearch(TestTimeLimitingCollector.java:139)
[junit4:junit4]   2>    at org.apache.lucene.search.TestTimeLimitingCollector.access$200(TestTimeLimitingCollector.java:42)
[junit4:junit4]   2>    at org.apache.lucene.search.TestTimeLimitingCollector$1.run(TestTimeLimitingCollector.java:292)
[junit4:junit4]   2>
[junit4:junit4]   2> oct 16, 2012 5:21:43 P.M.
[jira] [Commented] (LUCENE-4484) NRTCachingDir can't handle large files
[ https://issues.apache.org/jira/browse/LUCENE-4484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477536#comment-13477536 ] Mark Miller commented on LUCENE-4484: - Right, we have changed the defaults to favor NRT. We can always tell people to switch that default if they run into a problem, but of course it would be nicer if NRTCachingDir were more versatile and could deal well with term vectors / stored fields. I agree it's more of a niche situation (it's not likely a common problem), but that would be my preference. NRTCachingDir can't handle large files -- Key: LUCENE-4484 URL: https://issues.apache.org/jira/browse/LUCENE-4484 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless I dug into this OOME, which easily repros for me on rev 1398268: {noformat} ant test -Dtestcase=Test4GBStoredFields -Dtests.method=test -Dtests.seed=2D89DD229CD304F5 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=ru -Dtests.timezone=Asia/Vladivostok -Dtests.file.encoding=UTF-8 -Dtests.verbose=true {noformat} The problem is the test got NRTCachingDir ... which cannot handle large files because it decides up front (when createOutput is called) whether the file will be in RAMDir vs wrapped dir ... so if that file turns out to be immense (which this test does since stored fields files can grow arbitrarily huge w/o any flush happening) then it takes unbounded RAM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Welcome Alan Woodward as Lucene/Solr committer
I'm pleased to announce that the Lucene PMC has voted Alan in as a Lucene/Solr committer. Alan has been contributing patches on various tricky stuff: positions iterators, span queries, highlighters, codecs, and so on. Alan: it's tradition that you introduce yourself with your background. I think your account is fully working, and you should be able to add yourself to the who we are page on the website as well. Congratulations! - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4479) TokenSources.getTokenStream() doesn't return correctly for termvectors with positions but no offsets
[ https://issues.apache.org/jira/browse/LUCENE-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-4479: --- Assignee: Alan Woodward TokenSources.getTokenStream() doesn't return correctly for termvectors with positions but no offsets Key: LUCENE-4479 URL: https://issues.apache.org/jira/browse/LUCENE-4479 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0 Reporter: Alan Woodward Assignee: Alan Woodward Priority: Minor Attachments: LUCENE-4479.patch, LUCENE-4479.patch The javadocs for TokenSources.getTokenStream(Terms, boolean) state: Low level api. Returns a token stream or null if no offset info available in index. This can be used to feed the highlighter with a pre-parsed token stream. However, if the Terms instance passed in has positions but no offsets stored, a TokenStream is incorrectly returned rather than null. This has the effect of incorrectly highlighting fields with term vectors and positions but no offsets: all highlighting markup is prepended to the beginning of the field. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
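The fix presumably amounts to a guard along these lines (a sketch of what the javadocs imply, not the attached patch):
{code}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Terms;

// Sketch: without stored offsets the highlighter cannot place markup, so the
// method should return null instead of a misleading TokenStream.
class TokenSourcesGuardSketch {
  static TokenStream tokenStreamOrNull(Terms vector, TokenStream streamFromVector) {
    if (!vector.hasOffsets()) {
      return null; // positions alone are not enough for highlighting
    }
    return streamFromVector;
  }
}
{code}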