Re: NullPointerException in PeerSync.handleUpdates
Right, if there's no "fixed version" mentioned and the resolution is "unresolved", it's not in the code base at all. But that JIRA is apparently not reproducible, especially on versions more recent than 6.2. Is it possible to test a more recent version? (6.6.2 would be my recommendation.)

Erick

On Tue, Nov 21, 2017 at 9:58 PM, S G wrote:
> My bad. I found it at https://issues.apache.org/jira/browse/SOLR-9453
> But I could not find it in CHANGES.txt, perhaps because it's not yet resolved.
>
> On Tue, Nov 21, 2017 at 9:15 AM, Erick Erickson wrote:
>
>> Did you check the JIRA list? Or CHANGES.txt in more recent versions?
>>
>> On Tue, Nov 21, 2017 at 1:13 AM, S G wrote:
>> > Hi,
>> >
>> > We are running the 6.2 version of Solr and hitting this error frequently.
>> >
>> > Error while trying to recover. core=my_core:java.lang.NullPointerException
>> >         at org.apache.solr.update.PeerSync.handleUpdates(PeerSync.java:605)
>> >         at org.apache.solr.update.PeerSync.handleResponse(PeerSync.java:344)
>> >         at org.apache.solr.update.PeerSync.sync(PeerSync.java:257)
>> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
>> >         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
>> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >         at java.lang.Thread.run(Thread.java:745)
>> >
>> > Is this a known issue and fixed in some newer version?
>> >
>> > Thanks
>> > SG
Re: Recovery Issue - Solr 6.6.1 and HDFS
Well, you can always manually change the ZK nodes, but whether just setting a node's state to "leader" in ZK and then starting the Solr instance hosting that node would work... I don't know. Do consider running CheckIndex on one of the replicas in question first, though.

Best,
Erick

On Tue, Nov 21, 2017 at 3:06 PM, Joe Obernberger wrote:
> One other data point I just saw on one of the nodes. It has the following error:
>
> 2017-11-21 22:59:48.886 ERROR (coreZkRegister-1-thread-1-processing-n:leda:9100_solr) [c:UNCLASS s:shard14 r:core_node175 x:UNCLASS_shard14_replica3] o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Leader Initiated Recovery prevented leadership
>         at org.apache.solr.cloud.ShardLeaderElectionContext.checkLIR(ElectionContext.java:521)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:424)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>
> This stack trace repeats for a long while; looks like a recursive call.
>
> -Joe
>
> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>
>> We sometimes also have replicas not recovering. If one replica is left
>> active, the easiest is then to delete the replica and create a new one.
>> When all replicas are down, it helps most of the time to restart one of the
>> nodes that contains a replica in the down state. If that also doesn't get the
>> replica to recover, I would check the logs of the node and also those of the
>> overseer node. I have seen the same issue on Solr using local storage. The
>> main HDFS-related issues we have had so far were the lock files, and that if
>> you delete and recreate collections/cores, it sometimes happens that the data
>> was not cleaned up in HDFS and then causes a conflict.
>>
>> Hendrik
>>
>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>>
>>> We've never run an index this size in anything but HDFS, so I have no
>>> comparison. What we've been doing is keeping two main collections - all
>>> data, and the last 30 days of data. Then we handle queries based on date
>>> range. The 30-day index is significantly faster.
>>>
>>> My main concern right now is that 6 of the 100 shards are not coming back
>>> because of no leader. I've never seen this error before. Any ideas?
>>> ClusterStatus shows all three replicas with state 'down'.
>>>
>>> Thanks!
>>>
>>> -joe
>>>
>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>> We actually also have some performance issues with HDFS at the moment.
>>>> We are doing lots of soft commits for NRT search. Those seem to be slower
>>>> than with local storage. The investigation is, however, not really far along yet.
>>>>
>>>> We have a setup with 2000 collections, with one shard each and a
>>>> replication factor of 2 or 3. When we restart nodes too fast, that causes
>>>> problems with the overseer queue, which can lead to the queue getting out
>>>> of control and Solr pretty much dying. We are still on Solr 6.3; 6.6 has
>>>> some improvements and should handle these actions faster. I would check
>>>> what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json".
>>>> The critical part is the "overseer_queue_size" value. If this goes up to
>>>> about 1 it is pretty much game over on our setup. In that case it seems
>>>> to be best to stop all nodes, clear the queue in ZK and then restart the
>>>> nodes one by one with a gap of like 5 min. That normally recovers pretty well.
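Erick's CheckIndex suggestion can be run straight from the command line. A hedged sketch — the jar path and index directory below are hypothetical (adjust them to your installation), the replica should be offline while it runs, and an HDFS-backed index would first need to be copied to local disk:

```shell
# Read-only check first. Only add -exorcise (which drops broken segments,
# permanently losing the documents in them) if you accept that data loss.
java -cp /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.1.jar \
     org.apache.lucene.index.CheckIndex \
     /var/solr/data/UNCLASS_shard14_replica3/data/index
```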
Re: NullPointerException in PeerSync.handleUpdates
My bad. I found it at https://issues.apache.org/jira/browse/SOLR-9453
But I could not find it in CHANGES.txt, perhaps because it's not yet resolved.

On Tue, Nov 21, 2017 at 9:15 AM, Erick Erickson wrote:
> Did you check the JIRA list? Or CHANGES.txt in more recent versions?
>
> On Tue, Nov 21, 2017 at 1:13 AM, S G wrote:
> > Hi,
> >
> > We are running the 6.2 version of Solr and hitting this error frequently.
> >
> > Error while trying to recover. core=my_core:java.lang.NullPointerException
> >         at org.apache.solr.update.PeerSync.handleUpdates(PeerSync.java:605)
> >         at org.apache.solr.update.PeerSync.handleResponse(PeerSync.java:344)
> >         at org.apache.solr.update.PeerSync.sync(PeerSync.java:257)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
> >         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > Is this a known issue and fixed in some newer version?
> >
> > Thanks
> > SG
Re: tokenstream reusable
Hello, Roxana.
You're probably looking for TeeSinkTokenFilter, but I believe the idea is cumbersome to implement in Solr. Also, there is the PreAnalyzed field type, which can keep a token stream in an external form.
Re: Merging of index in Solr
Hi,

I have encountered this error during the merging of the 3.5TB of index. What could be the cause that led to this?

Exception in thread "main" Exception in thread "Lucene Merge Thread #8" java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757 _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1]
        at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
        at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871)
        at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57)
Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.write(Unknown Source)
        at sun.nio.ch.FileChannelImpl.write(Unknown Source)
        at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
        at java.nio.channels.Channels.writeFully(Unknown Source)
        at java.nio.channels.Channels.access$000(Unknown Source)
        at java.nio.channels.Channels$1.write(Unknown Source)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
        at java.util.zip.CheckedOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.write(Unknown Source)
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
        at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
        at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
        at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175)
        at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286)
        at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156)
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866)
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
        at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The requested operation could not be completed due to a file system limitation
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.write(Unknown Source)
        at sun.nio.ch.FileChannelImpl.write(Unknown Source)
        at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
        at java.nio.channels.Channels.writeFully(Unknown Source)
        at java.nio.channels.Channels.access$000(Unknown Source)
        at java.nio.channels.Channels$1.write(Unknown Source)

Regards,
Edwin

On 22 November 2017 at 00:10, Zheng Lin Edwin Yeo wrote:
> I am using the IndexMergeTool from Solr, from the command below:
>
> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar org.apache.lucene.misc.IndexMergeTool
>
> The heap size is 32GB. There are more than 20 million documents in the two cores.
>
> Regards,
> Edwin
>
> On 21 November 2017 at 21:54, Shawn Heisey wrote:
>
>> On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote:
>>
>>> Does anyone know how long
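For reference, the quoted IndexMergeTool invocation omits the tool's arguments: it expects the merged output directory first, then two or more source index directories. A sketch with hypothetical paths (the `;` classpath separator matches the Windows-style command quoted above; use `:` on Linux):

```shell
# Merged output directory comes first, followed by the source indexes.
# All paths here are examples only.
java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar org.apache.lucene.misc.IndexMergeTool C:\merged\index C:\core1\data\index C:\core2\data\index
```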
FORCELEADER not working - solr 6.6.1
Hi All - sorry for the repeat, but I'm at a complete loss on this. I have a collection with 100 shards and 3 replicas each. 6 of the shards will not elect a leader. I've tried the FORCELEADER command, but nothing changes. The log shows 'Force leader attempt 1. Waiting 5 secs for an active leader'. It tries 9 times, and then stops.

The error that I get for a shard in question is:

org.apache.solr.common.SolrException: Error getting leader from zk for shard shard21
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
        at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
        ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)
        ... 9 more

Please help. Thank you!

-Joe
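Since the NoNodeException points at a missing leader znode, one way to see what is actually in ZooKeeper is Solr's bundled zkcli script. A hedged sketch — the script path matches a stock Solr 6.x install, but the ZK address is an example and the exact znode contents vary by deployment:

```shell
# Inspect the leader node the error says is missing (NoNode).
./server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
    -cmd get /collections/UNCLASS/leaders/shard21/leader
```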
Re: Solr 7.x: Issues with unique()/hll() function on a string field nested in a range facet
I opened https://issues.apache.org/jira/browse/SOLR-11664 to track this.
I should be able to look into this shortly if no one else does.

-Yonik

On Tue, Nov 21, 2017 at 6:02 PM, Yonik Seeley wrote:
> Thanks for the complete info that allowed me to easily reproduce this!
> The bug seems to extend beyond hll/unique... I tried min(string_s) and
> got wonky results as well.
>
> -Yonik
>
> On Tue, Nov 21, 2017 at 7:47 AM, Volodymyr Rudniev wrote:
>> Hello,
>>
>> I've encountered 2 issues while trying to apply the unique()/hll() function
>> to a string field inside a range facet:
>>
>> 1. Results are incorrect for a single-valued string field.
>> 2. I'm getting an ArrayIndexOutOfBoundsException for a multi-valued string field.
>>
>> How to reproduce:
>>
>> 1. Create a core based on the default configSet.
>> 2. Add several simple documents to the core, like these:
>>
>> [
>>   {"id": "14790", "int_i": 2010, "date_dt": "2010-01-01T00:00:00Z", "string_s": "a", "string_ss": ["a", "b"]},
>>   {"id": "12254", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["b", "c"]},
>>   {"id": "12937", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "c", "string_ss": ["c", "d"]},
>>   {"id": "10575", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "b", "string_ss": ["d", "e"]},
>>   {"id": "13644", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["e", "a"]},
>>   {"id": "8405",  "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "d", "string_ss": ["a", "b"]},
>>   {"id": "6128",  "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "a", "string_ss": ["b", "c"]},
>>   {"id": "5220",  "int_i": 2015, "date_dt": "2015-01-01T00:00:00Z", "string_s": "d", "string_ss": ["c", "d"]},
>>   {"id": "6850",  "int_i": 2012, "date_dt": "2012-01-01T00:00:00Z", "string_s": "b", "string_ss": ["d", "e"]},
>>   {"id": "5748",  "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["e", "a"]}
>> ]
>>
>> 3. Try queries like the following for a single-valued string field:
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>>
>> Distinct counts returned are incorrect in general. For example, for the set
>> of documents above, the response will contain:
>>
>> {
>>   "val": 2010,
>>   "count": 1,
>>   "distinct_count": 0
>> }
>>
>> and
>>
>> "between": {
>>   "count": 10,
>>   "distinct_count": 1
>> }
>>
>> (there should be 5 distinct values).
>>
>> Note, the result depends on the order in which the documents are added.
>>
>> 4. Try queries like the following for a multi-valued string field:
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>>
>> I'm getting an ArrayIndexOutOfBoundsException for such queries.
>>
>> Note, everything looks OK for other field types (I tried single- and
>> multi-valued ints, doubles and dates) or when the enclosing facet is a terms
>> facet or there is no enclosing facet at all.
>>
>> I can reproduce these issues both for Solr 7.0.1 and 7.1.0. Solr 6.x and
>> 5.x, as it seems, do not have such issues.
>>
>> Is it a bug? Or, maybe, I've missed something?
>>
>> Thanks,
>>
>> Volodymyr
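For reference, the distinct counts those facet buckets should return can be checked from the sample documents with a few lines of plain Python (no Solr involved; `distinct_in_bucket` is just an illustrative helper, not part of any Solr API):

```python
# Compute the expected unique() results for the sample documents above.
docs = [
    {"id": "14790", "int_i": 2010, "string_s": "a", "string_ss": ["a", "b"]},
    {"id": "12254", "int_i": 2014, "string_s": "e", "string_ss": ["b", "c"]},
    {"id": "12937", "int_i": 2008, "string_s": "c", "string_ss": ["c", "d"]},
    {"id": "10575", "int_i": 2008, "string_s": "b", "string_ss": ["d", "e"]},
    {"id": "13644", "int_i": 2014, "string_s": "e", "string_ss": ["e", "a"]},
    {"id": "8405",  "int_i": 2014, "string_s": "d", "string_ss": ["a", "b"]},
    {"id": "6128",  "int_i": 2008, "string_s": "a", "string_ss": ["b", "c"]},
    {"id": "5220",  "int_i": 2015, "string_s": "d", "string_ss": ["c", "d"]},
    {"id": "6850",  "int_i": 2012, "string_s": "b", "string_ss": ["d", "e"]},
    {"id": "5748",  "int_i": 2014, "string_s": "e", "string_ss": ["e", "a"]},
]

def distinct_in_bucket(docs, field, lo, hi):
    """Count the unique values of `field` among docs with lo <= int_i < hi."""
    vals = set()
    for d in docs:
        if lo <= d["int_i"] < hi:
            v = d[field]
            vals.update(v if isinstance(v, list) else [v])
    return len(vals)

# The 2010 bucket holds one doc with string_s == "a", so the distinct count
# should be 1 (the report above shows Solr 7.x returning 0), and the whole
# 2008..2016 range covers all five values a..e.
print(distinct_in_bucket(docs, "string_s", 2010, 2011))  # 1
print(distinct_in_bucket(docs, "string_s", 2008, 2016))  # 5
```

This confirms the "(there should be 5 distinct values)" claim in the report for the "between" bucket.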
Re: Recovery Issue - Solr 6.6.1 and HDFS
One other data point I just saw on one of the nodes. It has the following error:

2017-11-21 22:59:48.886 ERROR (coreZkRegister-1-thread-1-processing-n:leda:9100_solr) [c:UNCLASS s:shard14 r:core_node175 x:UNCLASS_shard14_replica3] o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Leader Initiated Recovery prevented leadership
        at org.apache.solr.cloud.ShardLeaderElectionContext.checkLIR(ElectionContext.java:521)
        at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:424)
        at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
        at org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
        at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
        at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
        at org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
        at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
        at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)

This stack trace repeats for a long while; looks like a recursive call.

-Joe

On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
> We sometimes also have replicas not recovering. If one replica is left
> active, the easiest is then to delete the replica and create a new one.
> When all replicas are down, it helps most of the time to restart one of the
> nodes that contains a replica in the down state. If that also doesn't get the
> replica to recover, I would check the logs of the node and also those of the
> overseer node. I have seen the same issue on Solr using local storage. The
> main HDFS-related issues we have had so far were the lock files, and that if
> you delete and recreate collections/cores, it sometimes happens that the data
> was not cleaned up in HDFS and then causes a conflict.
>
> Hendrik
>
> On 21.11.2017 21:07, Joe Obernberger wrote:
>> We've never run an index this size in anything but HDFS, so I have no
>> comparison. What we've been doing is keeping two main collections - all
>> data, and the last 30 days of data. Then we handle queries based on date
>> range. The 30-day index is significantly faster.
>>
>> My main concern right now is that 6 of the 100 shards are not coming back
>> because of no leader. I've never seen this error before. Any ideas?
>> ClusterStatus shows all three replicas with state 'down'.
>>
>> Thanks!
>>
>> -joe
>>
>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>> We actually also have some performance issues with HDFS at the moment.
>>> We are doing lots of soft commits for NRT search. Those seem to be slower
>>> than with local storage. The investigation is, however, not really far along yet.
>>>
>>> We have a setup with 2000 collections, with one shard each and a
>>> replication factor of 2 or 3. When we restart nodes too fast, that causes
>>> problems with the overseer queue, which can lead to the queue getting out
>>> of control and Solr pretty much dying. We are still on Solr 6.3; 6.6 has
>>> some improvements and should handle these actions faster. I would check
>>> what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json".
>>> The critical part is the "overseer_queue_size" value. If this goes up to
>>> about 1 it is pretty much game over on our setup. In that case it seems
>>> to be best to stop all nodes, clear the queue in ZK and then restart the
>>> nodes one by one with a gap of like 5 min. That normally recovers pretty well.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>> We set the hard commit time long because we were having performance
>>>> issues with HDFS, and thought that since the block size is 128M, having
>>>> a longer hard commit made sense. That was our hypothesis anyway. Happy
>>>> to switch it back and see what happens. I don't know what caused the
>>>> cluster to go into recovery in the first place. We had a server die over
>>>> the weekend, but it's just one out of ~50. Every shard is 3x replicated
>>>> (and 3x replicated in HDFS... so 9 copies). It was at this point that
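Hendrik's OVERSEERSTATUS check can be polled with a small script. A sketch assuming the response carries a top-level `overseer_queue_size` field (verify the exact response shape against your Solr version; the URL and `queue_size` helper are illustrative, not an official API):

```python
import json
from urllib.request import urlopen  # only needed for the live call below

def queue_size(status: dict) -> int:
    # Pull the overseer queue depth out of a parsed OVERSEERSTATUS response;
    # 0 if the field is absent.
    return int(status.get("overseer_queue_size", 0))

# Live call against a cluster (commented out so the sketch stands alone):
# url = "http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json"
# print(queue_size(json.load(urlopen(url))))

# Offline demonstration on a sample response:
sample = {"responseHeader": {"status": 0}, "overseer_queue_size": 42}
print(queue_size(sample))  # 42
```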
Re: Solr 7.x: Issues with unique()/hll() function on a string field nested in a range facet
Thanks for the complete info that allowed me to easily reproduce this!
The bug seems to extend beyond hll/unique... I tried min(string_s) and got wonky results as well.

-Yonik

On Tue, Nov 21, 2017 at 7:47 AM, Volodymyr Rudniev wrote:
> Hello,
>
> I've encountered 2 issues while trying to apply the unique()/hll() function
> to a string field inside a range facet:
>
> 1. Results are incorrect for a single-valued string field.
> 2. I'm getting an ArrayIndexOutOfBoundsException for a multi-valued string field.
>
> How to reproduce:
>
> 1. Create a core based on the default configSet.
> 2. Add several simple documents to the core, like these:
>
> [
>   {"id": "14790", "int_i": 2010, "date_dt": "2010-01-01T00:00:00Z", "string_s": "a", "string_ss": ["a", "b"]},
>   {"id": "12254", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["b", "c"]},
>   {"id": "12937", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "c", "string_ss": ["c", "d"]},
>   {"id": "10575", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "b", "string_ss": ["d", "e"]},
>   {"id": "13644", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["e", "a"]},
>   {"id": "8405",  "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "d", "string_ss": ["a", "b"]},
>   {"id": "6128",  "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "a", "string_ss": ["b", "c"]},
>   {"id": "5220",  "int_i": 2015, "date_dt": "2015-01-01T00:00:00Z", "string_s": "d", "string_ss": ["c", "d"]},
>   {"id": "6850",  "int_i": 2012, "date_dt": "2012-01-01T00:00:00Z", "string_s": "b", "string_ss": ["d", "e"]},
>   {"id": "5748",  "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["e", "a"]}
> ]
>
> 3. Try queries like the following for a single-valued string field:
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>
> Distinct counts returned are incorrect in general. For example, for the set
> of documents above, the response will contain:
>
> {
>   "val": 2010,
>   "count": 1,
>   "distinct_count": 0
> }
>
> and
>
> "between": {
>   "count": 10,
>   "distinct_count": 1
> }
>
> (there should be 5 distinct values).
>
> Note, the result depends on the order in which the documents are added.
>
> 4. Try queries like the following for a multi-valued string field:
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>
> I'm getting an ArrayIndexOutOfBoundsException for such queries.
>
> Note, everything looks OK for other field types (I tried single- and
> multi-valued ints, doubles and dates) or when the enclosing facet is a terms
> facet or there is no enclosing facet at all.
>
> I can reproduce these issues both for Solr 7.0.1 and 7.1.0. Solr 6.x and
> 5.x, as it seems, do not have such issues.
>
> Is it a bug? Or, maybe, I've missed something?
>
> Thanks,
>
> Volodymyr
Re: Data inconsistencies and updates in solrcloud
Thanks Erick! As I said, user error! ;)

Tom

On 21/11/17 22:41, Erick Erickson wrote:
> I think you're confusing shards with replicas. numShards is 2, each with
> one replica. Therefore half of your docs will wind up on one replica and
> half on the other.
>
> If you're adding a single doc, by definition it'll be placed on only one
> of the two shards. If your shards had multiple replicas, all of the
> replicas associated with that shard would change.
>
> Best,
> Erick
>
> On Tue, Nov 21, 2017 at 12:56 PM, Tom Barber wrote:
>> Hi folks
>>
>> I can't find an answer to this, and it's clearly user error. We have a
>> collection in SolrCloud that is started with numShards=2 and
>> replicationFactor=1; Solr seems happy and the collection seems happy. Yet
>> when we post an update to it and then look at the record again, it seems
>> to only affect one core and not the second.
>>
>> What are we likely to be doing wrong in our config or update to prevent
>> the replication?
>>
>> Thanks
>>
>> Tom

--
Spicule Limited is registered in England & Wales. Company Number: 09954122. Registered office: First Floor, Telecom House, 125-135 Preston Road, Brighton, England, BN1 6AF. VAT No. 251478891. All engagements are subject to Spicule Terms and Conditions of Business. This email and its contents are intended solely for the individual to whom it is addressed and may contain information that is confidential, privileged or otherwise protected from disclosure, distribution or copying. Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of Spicule Limited. The company accepts no liability for any damage caused by any virus transmitted by this email. If you have received this message in error, please notify us immediately by reply email before deleting it from your system. Service of legal notice cannot be effected on Spicule Limited by email.
Re: Recovery Issue - Solr 6.6.1 and HDFS
Hi Hendrik - the shards in question have three replicas. I tried restarting each one (one by one) - no luck. No leader is found. I deleted one of the replicas and added a new one, and the new one also shows as 'down'. I also tried the FORCELEADER call, but that had no effect. I checked the OVERSEERSTATUS, but there is nothing unusual there. I don't see anything useful in the logs except the error:

org.apache.solr.common.SolrException: Error getting leader from zk for shard shard21
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
        at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
        ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)
        ... 9 more

Can I modify ZooKeeper to force a leader? Is there any other way to recover from this?

Thanks very much!

-Joe

On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
> We sometimes also have replicas not recovering. If one replica is left
> active, the easiest is then to delete the replica and create a new one.
> When all replicas are down, it helps most of the time to restart one of the
> nodes that contains a replica in the down state. If that also doesn't get the
> replica to recover, I would check the logs of the node and also those of the
> overseer node. I have seen the same issue on Solr using local storage. The
> main HDFS-related issues we have had so far were the lock files, and that if
> you delete and recreate collections/cores, it sometimes happens that the data
> was not cleaned up in HDFS and then causes a conflict.
>
> Hendrik
>
> On 21.11.2017 21:07, Joe Obernberger wrote:
>> We've never run an index this size in anything but HDFS, so I have no
>> comparison. What we've been doing is keeping two main collections - all
>> data, and the last 30 days of data. Then we handle queries based on date
>> range. The 30-day index is significantly faster.
>>
>> My main concern right now is that 6 of the 100 shards are not coming back
>> because of no leader. I've never seen this error before. Any ideas?
>> ClusterStatus shows all three replicas with state 'down'.
>>
>> Thanks!
>>
>> -joe
>>
>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>> We actually also have some performance issues with HDFS at the moment.
>>> We are doing lots of soft commits for NRT search. Those seem to be slower
>>> than with local storage. The investigation is, however, not really far along yet.
>>>
>>> We have a setup with 2000 collections, with one shard each and a
>>> replication factor of 2 or 3. When we restart nodes too fast, that causes
>>> problems with the overseer queue, which can lead to the queue getting out
>>> of control and Solr pretty much dying. We are still on Solr 6.3; 6.6 has
>>> some improvements and should handle these actions faster. I would check
>>> what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json".
>>> The critical part is the "overseer_queue_size" value. If this goes up to
>>> about 1 it is pretty much game over on our setup. In that case it seems
>>> to be best to stop all nodes, clear the queue in ZK and then restart the
>>> nodes one by one with a gap of like 5 min. That normally recovers pretty well.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>> We set the hard commit time long because we were having performance
>>>> issues with HDFS, and thought that since the block size is 128M, having
>>>> a longer hard commit made sense. That was our hypothesis anyway. Happy
>>>> to switch it back and see what happens. I don't know what caused
Re: Data inconsistencies and updates in solrcloud
I think you're confusing shards with replicas. numShards is 2, each shard with one replica. Therefore half of your docs will wind up on one shard and half on the other. If you're adding a single doc, by definition it'll be placed on only one of the two shards. If your shards had multiple replicas, all of the replicas associated with that shard would change. Best, Erick On Tue, Nov 21, 2017 at 12:56 PM, Tom Barber wrote: > Hi folks > > I can't find an answer to this, and it's clearly user error: we have a > collection in SolrCloud created with numShards=2 and replicationFactor=1. Solr > seems happy and the collection seems happy. Yet when we post an update to it and > then look at the record again, the update seems to only affect one core and not the > second. > > What are we likely to be doing wrong in our config or update that prevents the > replication? > > Thanks > > Tom
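Erick's point above - one doc lands on exactly one shard - can be illustrated with a toy router. This is a deliberate simplification for illustration only: Solr's real compositeId router assigns each shard a contiguous slice of the 32-bit MurmurHash3 range of the document id rather than taking a modulus.

```java
// Toy hash router illustrating why a single-doc update touches exactly one
// shard when numShards=2 and replicationFactor=1. Simplified sketch, not
// Solr's actual compositeId routing.
public class ToyRouter {
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the shard index non-negative even if hashCode() < 0
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        // The same id always routes to the same shard...
        System.out.println(shardFor("doc-42", 2) == shardFor("doc-42", 2)); // true
        // ...and with replicationFactor=1 that shard has a single core, so an
        // update shows up in exactly one of the two cores, as Tom observed.
        int shard = shardFor("doc-42", 2);
        System.out.println(shard == 0 || shard == 1); // true
    }
}
```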
Re: Possible to disable SynonymQuery and get legacy behavior?
I have submitted a patch to make the query generated for overlapping query terms somewhat configurable (w/ default being SynonymQuery), based on practices I've seen in the field. I'd love to hear feedback: https://issues.apache.org/jira/browse/SOLR-11662 On Tue, Nov 21, 2017 at 12:37 PM Doug Turnbull < dturnb...@opensourceconnections.com> wrote: > We help clients that perform semantic expansion to hypernyms at > index time. For example, they will have a synonyms file that does the > following: > > wing_tips => wing_tips, dress_shoes, shoes > dress_shoes => dress_shoes, shoes > oxfords => oxfords, dress_shoes, shoes > > Then at query time, we rely on the differing IDF of these terms in the same > position to bring up the rare, specific term matches, followed by > increasingly semantically broad matches. Previously, for example, a search > for wing_tips would get turned into "wing_tips OR dress_shoes OR shoes". > Shoes, being very common, would get scored lowest. Wing tips, being very > specific, would get scored very highly. > > (I have a blog post about this (which uses Elasticsearch): > http://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/ > ) > > As our clients upgrade to Solr 6 and above, we're noticing our technique > no longer works due to SynonymQuery, which blends the doc freq of synonyms > at query time. SynonymQuery seems to be the right > direction for most people :) Still, I would like to figure out how/if > there's a setting anywhere to return to the legacy behavior (a boolean > query of term queries) so I don't have to go back to the drawing board for > clients that rely on this technique. > > I've been going through QueryBuilder and I don't see where we could go > back to the legacy behavior. It seems to be based on position overlap. > > Thanks! > -Doug > > > > -- > Consultant, OpenSource Connections. 
Contact info at > http://o19s.com/about-us/doug-turnbull/; Free/Busy ( > http://bit.ly/dougs_cal) > -- Consultant, OpenSource Connections. Contact info at http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)
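The differing-IDF effect Doug describes can be seen numerically with Lucene's BM25 idf formula, log(1 + (N - df + 0.5)/(df + 0.5)): the rare specific term outscores the broad hypernym. This is a hand-rolled sketch with invented document frequencies, not actual Lucene scoring code.

```java
// Sketch of why index-time hypernym expansion ranks specific matches first:
// when rare and common terms share a position, IDF alone separates them.
// Formula is Lucene's BM25 idf: log(1 + (N - df + 0.5) / (df + 0.5)).
// Doc frequencies below are invented for illustration.
public class IdfSketch {
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long n = 100_000;                  // total docs (assumed)
        double wingTips = idf(n, 50);      // rare, specific
        double dressShoes = idf(n, 2_000); // broader
        double shoes = idf(n, 30_000);     // common, broadest
        // In "wing_tips OR dress_shoes OR shoes" the rare term dominates
        System.out.println(wingTips > dressShoes && dressShoes > shoes); // true
    }
}
```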
Re: Recovery Issue - Solr 6.6.1 and HDFS
Thank you Erick. I've set the RamBufferSize to 1G; perhaps higher would be beneficial. One more data point is that if I restart a node, more often than not, it goes into recovery, beats up the network for a while, and then goes green. This happens even if I do no indexing between restarts. Is that expected? Sometimes this can take longer than 20 minutes. No new data was added to the index between the restarts. -Joe On 11/21/2017 3:43 PM, Erick Erickson wrote: bq: We are doing lots of soft commits for NRT search... It's not surprising that this is slower than local storage, especially if you have any autowarming going on. Opening new searchers will need to read data from disk for the new segments, and HDFS may be slower here. As far as the commit interval, an under-appreciated event is that when RAMBufferSizeMB is exceeded (default 100M last I knew) new segments are written _anyway_, they're just a little invisible. That is, the segments_n file isn't updated even though they're closed IIUC at least. So that very long interval isn't helping with that problem I don't think Evidence to the contrary trumps my understanding of course. About starting all these collections up at once and the Overseer queue. I've seen this in similar situations. There are a _lot_ of messages flying back and forth for each replica on startup, and the Overseer processing was very inefficient historically so that queue could get in the 100s of K, I've seen some pathological situations where it's over 1M. SOLR-10524 made this a lot better. There are still a lot of messages written in a case like yours, but at least the Overseer has a much better chance to keep up Solr 6.6... At that point bringing up Solr took a very long time. Erick On Tue, Nov 21, 2017 at 12:24 PM, Hendrik Haddorpwrote: We sometimes also have replicas not recovering. If one replica is left active the easiest is to then to delete the replica and create a new one. 
When all replicas are down it helps most of the time to restart one of the nodes that contains a replica in down state. If that also doesn't get the replica to recover I would check the logs of the node and also that of the overseer node. I have seen the same issue on Solr using local storage. The main HDFS related issues we had so far was those lock files and if you delete and recreate collections/cores and it sometimes happens that the data was not cleaned up in HDFS and then causes a conflict. Hendrik On 21.11.2017 21:07, Joe Obernberger wrote: We've never run an index this size in anything but HDFS, so I have no comparison. What we've been doing is keeping two main collections - all data, and the last 30 days of data. Then we handle queries based on date range. The 30 day index is significantly faster. My main concern right now is that 6 of the 100 shards are not coming back because of no leader. I've never seen this error before. Any ideas? ClusterStatus shows all three replicas with state 'down'. Thanks! -joe On 11/21/2017 2:35 PM, Hendrik Haddorp wrote: We actually also have some performance issue with HDFS at the moment. We are doing lots of soft commits for NRT search. Those seem to be slower then with local storage. The investigation is however not really far yet. We have a setup with 2000 collections, with one shard each and a replication factor of 2 or 3. When we restart nodes too fast that causes problems with the overseer queue, which can lead to the queue getting out of control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some improvements and should handle these actions faster. I would check what you see for "/solr/admin/collections?action=OVERSEERSTATUS=json". The critical part is the "overseer_queue_size" value. If this goes up to about 1 it is pretty much game over on our setup. In that case it seems to be best to stop all nodes, clear the queue in ZK and then restart the nodes one by one with a gap of like 5min. 
That normally recovers pretty well. regards, Hendrik On 21.11.2017 20:12, Joe Obernberger wrote: We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens. I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). It was at this point that we noticed lots of network activity, and most of the shards in this recovery, fail, retry loop. That is when we decided to shut it down resulting in zombie lock files. I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time.
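For reference, the FORCELEADER call Joe mentions is a Collections API action; a request for one stuck shard looks like the URL built below (host, collection, and shard names are placeholders). Note that FORCELEADER only forces an election among replicas that are live but leaderless; it does not bring back replicas that are down, which may be why it had no visible effect here.

```java
// Builds the Collections API FORCELEADER request Joe refers to.
// Host, collection, and shard names are placeholders for illustration.
public class ForceLeaderUrl {
    static String forceLeaderUrl(String solrBase, String collection, String shard) {
        return solrBase + "/admin/collections?action=FORCELEADER"
                + "&collection=" + collection + "&shard=" + shard;
    }

    public static void main(String[] args) {
        System.out.println(forceLeaderUrl("http://localhost:8983/solr", "UNCLASS", "shard21"));
    }
}
```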
Data inconsistencies and updates in solrcloud
Hi folks I can't find an answer to this, and it's clearly user error: we have a collection in SolrCloud created with numShards=2 and replicationFactor=1. Solr seems happy and the collection seems happy. Yet when we post an update to it and then look at the record again, the update seems to only affect one core and not the second. What are we likely to be doing wrong in our config or update that prevents the replication? Thanks Tom
Re: Recovery Issue - Solr 6.6.1 and HDFS
bq: We are doing lots of soft commits for NRT search... It's not surprising that this is slower than local storage, especially if you have any autowarming going on. Opening new searchers will need to read data from disk for the new segments, and HDFS may be slower here. As far as the commit interval, an under-appreciated event is that when RAMBufferSizeMB is exceeded (default 100M last I knew) new segments are written _anyway_, they're just a little invisible. That is, the segments_n file isn't updated even though they're closed, IIUC at least. So that very long interval isn't helping with that problem, I don't think. Evidence to the contrary trumps my understanding, of course. About starting all these collections up at once and the Overseer queue: I've seen this in similar situations. There are a _lot_ of messages flying back and forth for each replica on startup, and the Overseer processing was very inefficient historically, so that queue could get into the 100s of K; I've seen some pathological situations where it's over 1M. SOLR-10524 made this a lot better. There are still a lot of messages written in a case like yours, but at least the Overseer has a much better chance to keep up in Solr 6.6... Before that, bringing up Solr took a very long time. Erick On Tue, Nov 21, 2017 at 12:24 PM, Hendrik Haddorp wrote: > We sometimes also have replicas not recovering. If one replica is left > active, the easiest fix is then to delete the replica and create a new one. > When all replicas are down it helps most of the time to restart one of the > nodes that contains a replica in down state. If that also doesn't get the > replica to recover I would check the logs of the node and also that of the > overseer node. I have seen the same issue on Solr using local storage. The > main HDFS-related issues we have had so far were those lock files, and if you > delete and recreate collections/cores it sometimes happens that the data > was not cleaned up in HDFS, which then causes a conflict. 
> > Hendrik > > > On 21.11.2017 21:07, Joe Obernberger wrote: >> >> We've never run an index this size in anything but HDFS, so I have no >> comparison. What we've been doing is keeping two main collections - all >> data, and the last 30 days of data. Then we handle queries based on date >> range. The 30 day index is significantly faster. >> >> My main concern right now is that 6 of the 100 shards are not coming back >> because of no leader. I've never seen this error before. Any ideas? >> ClusterStatus shows all three replicas with state 'down'. >> >> Thanks! >> >> -joe >> >> >> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote: >>> >>> We actually also have some performance issue with HDFS at the moment. We >>> are doing lots of soft commits for NRT search. Those seem to be slower then >>> with local storage. The investigation is however not really far yet. >>> >>> We have a setup with 2000 collections, with one shard each and a >>> replication factor of 2 or 3. When we restart nodes too fast that causes >>> problems with the overseer queue, which can lead to the queue getting out of >>> control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some >>> improvements and should handle these actions faster. I would check what you >>> see for "/solr/admin/collections?action=OVERSEERSTATUS=json". The >>> critical part is the "overseer_queue_size" value. If this goes up to about >>> 1 it is pretty much game over on our setup. In that case it seems to be >>> best to stop all nodes, clear the queue in ZK and then restart the nodes one >>> by one with a gap of like 5min. That normally recovers pretty well. >>> >>> regards, >>> Hendrik >>> >>> On 21.11.2017 20:12, Joe Obernberger wrote: We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens. 
I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). It was at this point that we noticed lots of network activity, and most of the shards in this recovery, fail, retry loop. That is when we decided to shut it down resulting in zombie lock files. I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time. Thanks again Erick! -Joe On 11/21/2017 2:00 PM, Erick Erickson wrote: > > Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)... > > I need to back up a bit. Once nodes are in this state it's not
Re: Recovery Issue - Solr 6.6.1 and HDFS
We sometimes also have replicas not recovering. If one replica is left active, the easiest fix is then to delete the replica and create a new one. When all replicas are down, it helps most of the time to restart one of the nodes that contains a replica in the down state. If that also doesn't get the replica to recover, I would check the logs of that node and also those of the overseer node. I have seen the same issue on Solr using local storage. The main HDFS-related issues we have had so far were those lock files, and that when you delete and recreate collections/cores it sometimes happens that the data was not cleaned up in HDFS, which then causes a conflict. Hendrik On 21.11.2017 21:07, Joe Obernberger wrote: We've never run an index this size in anything but HDFS, so I have no comparison. What we've been doing is keeping two main collections - all data, and the last 30 days of data. Then we handle queries based on date range. The 30-day index is significantly faster. My main concern right now is that 6 of the 100 shards are not coming back because of no leader. I've never seen this error before. Any ideas? ClusterStatus shows all three replicas with state 'down'. Thanks! -joe On 11/21/2017 2:35 PM, Hendrik Haddorp wrote: We actually also have some performance issue with HDFS at the moment. We are doing lots of soft commits for NRT search. Those seem to be slower than with local storage. The investigation is however not really far along yet. We have a setup with 2000 collections, with one shard each and a replication factor of 2 or 3. When we restart nodes too fast that causes problems with the overseer queue, which can lead to the queue getting out of control and Solr pretty much dying. We are still on Solr 6.3; 6.6 has some improvements and should handle these actions faster. I would check what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical part is the "overseer_queue_size" value. If this goes up to about 1 it is pretty much game over on our setup. 
In that case it seems to be best to stop all nodes, clear the queue in ZK and then restart the nodes one by one with a gap of like 5min. That normally recovers pretty well. regards, Hendrik On 21.11.2017 20:12, Joe Obernberger wrote: We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens. I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). It was at this point that we noticed lots of network activity, and most of the shards in this recovery, fail, retry loop. That is when we decided to shut it down resulting in zombie lock files. I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time. Thanks again Erick! -Joe On 11/21/2017 2:00 PM, Erick Erickson wrote: Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)... I need to back up a bit. Once nodes are in this state it's not surprising that they need to be forcefully killed. I was more thinking about how they got in this situation in the first place. _Before_ you get into the nasty state how are the Solr nodes shut down? Forcefully? Your hard commit is far longer than it needs to be, resulting in much larger tlog files etc. I usually set this at 15-60 seconds with local disks, not quite sure whether longer intervals are helpful on HDFS. What this means is that you can spend up to 30 minutes when you restart solr _replaying the tlogs_! If Solr is killed, it may not have had a chance to fsync the segments and may have to replay on startup. 
If you have openSearcher set to false, the hard commit operation is not horribly expensive, it just fsync's the current segments and opens new ones. It won't be a total cure, but I bet reducing this interval would help a lot. Also, if you stop indexing there's no need to wait 30 minutes if you issue a manual commit, something like .../collection/update?commit=true. Just reducing the hard commit interval will make the wait between stopping indexing and restarting shorter all by itself if you don't want to issue the manual commit. Best, Erick On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorpwrote: Hi, the write.lock issue I see as well when Solr is not been stopped gracefully. The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the
Re: Recovery Issue - Solr 6.6.1 and HDFS
We've never run an index this size in anything but HDFS, so I have no comparison. What we've been doing is keeping two main collections - all data, and the last 30 days of data. Then we handle queries based on date range. The 30 day index is significantly faster. My main concern right now is that 6 of the 100 shards are not coming back because of no leader. I've never seen this error before. Any ideas? ClusterStatus shows all three replicas with state 'down'. Thanks! -joe On 11/21/2017 2:35 PM, Hendrik Haddorp wrote: We actually also have some performance issue with HDFS at the moment. We are doing lots of soft commits for NRT search. Those seem to be slower then with local storage. The investigation is however not really far yet. We have a setup with 2000 collections, with one shard each and a replication factor of 2 or 3. When we restart nodes too fast that causes problems with the overseer queue, which can lead to the queue getting out of control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some improvements and should handle these actions faster. I would check what you see for "/solr/admin/collections?action=OVERSEERSTATUS=json". The critical part is the "overseer_queue_size" value. If this goes up to about 1 it is pretty much game over on our setup. In that case it seems to be best to stop all nodes, clear the queue in ZK and then restart the nodes one by one with a gap of like 5min. That normally recovers pretty well. regards, Hendrik On 21.11.2017 20:12, Joe Obernberger wrote: We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens. I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). 
It was at this point that we noticed lots of network activity, and most of the shards in this recovery, fail, retry loop. That is when we decided to shut it down resulting in zombie lock files. I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time. Thanks again Erick! -Joe On 11/21/2017 2:00 PM, Erick Erickson wrote: Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)... I need to back up a bit. Once nodes are in this state it's not surprising that they need to be forcefully killed. I was more thinking about how they got in this situation in the first place. _Before_ you get into the nasty state how are the Solr nodes shut down? Forcefully? Your hard commit is far longer than it needs to be, resulting in much larger tlog files etc. I usually set this at 15-60 seconds with local disks, not quite sure whether longer intervals are helpful on HDFS. What this means is that you can spend up to 30 minutes when you restart solr _replaying the tlogs_! If Solr is killed, it may not have had a chance to fsync the segments and may have to replay on startup. If you have openSearcher set to false, the hard commit operation is not horribly expensive, it just fsync's the current segments and opens new ones. It won't be a total cure, but I bet reducing this interval would help a lot. Also, if you stop indexing there's no need to wait 30 minutes if you issue a manual commit, something like .../collection/update?commit=true. Just reducing the hard commit interval will make the wait between stopping indexing and restarting shorter all by itself if you don't want to issue the manual commit. Best, Erick On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorpwrote: Hi, the write.lock issue I see as well when Solr is not been stopped gracefully. 
The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the state stored in ZooKeeper as the owner and is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr start up script which scans the log files in HDFS and compares that with the state in ZooKeeper and then delete all lock files that belong to the node that I'm starting. regards, Hendrik On 21.11.2017 14:07, Joe Obernberger wrote: Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running
Re: Recovery Issue - Solr 6.6.1 and HDFS
We actually also have some performance issue with HDFS at the moment. We are doing lots of soft commits for NRT search. Those seem to be slower than with local storage. The investigation is however not really far along yet. We have a setup with 2000 collections, with one shard each and a replication factor of 2 or 3. When we restart nodes too fast that causes problems with the overseer queue, which can lead to the queue getting out of control and Solr pretty much dying. We are still on Solr 6.3; 6.6 has some improvements and should handle these actions faster. I would check what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical part is the "overseer_queue_size" value. If this goes up to about 1 it is pretty much game over on our setup. In that case it seems to be best to stop all nodes, clear the queue in ZK and then restart the nodes one by one with a gap of like 5min. That normally recovers pretty well. regards, Hendrik On 21.11.2017 20:12, Joe Obernberger wrote: We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens. I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). It was at this point that we noticed lots of network activity, and most of the shards in this recovery, fail, retry loop. That is when we decided to shut it down, resulting in zombie lock files. I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time. Thanks again Erick! 
-Joe On 11/21/2017 2:00 PM, Erick Erickson wrote: Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)... I need to back up a bit. Once nodes are in this state it's not surprising that they need to be forcefully killed. I was more thinking about how they got in this situation in the first place. _Before_ you get into the nasty state how are the Solr nodes shut down? Forcefully? Your hard commit is far longer than it needs to be, resulting in much larger tlog files etc. I usually set this at 15-60 seconds with local disks, not quite sure whether longer intervals are helpful on HDFS. What this means is that you can spend up to 30 minutes when you restart solr _replaying the tlogs_! If Solr is killed, it may not have had a chance to fsync the segments and may have to replay on startup. If you have openSearcher set to false, the hard commit operation is not horribly expensive, it just fsync's the current segments and opens new ones. It won't be a total cure, but I bet reducing this interval would help a lot. Also, if you stop indexing there's no need to wait 30 minutes if you issue a manual commit, something like .../collection/update?commit=true. Just reducing the hard commit interval will make the wait between stopping indexing and restarting shorter all by itself if you don't want to issue the manual commit. Best, Erick On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorpwrote: Hi, the write.lock issue I see as well when Solr is not been stopped gracefully. The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the state stored in ZooKeeper as the owner and is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. 
I added some logic to my Solr start up script which scans the log files in HDFS and compares that with the state in ZooKeeper and then delete all lock files that belong to the node that I'm starting. regards, Hendrik On 21.11.2017 14:07, Joe Obernberger wrote: Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart, we have lock files (write.lock) inside of HDFS. The process at this point is to
Re: Recovery Issue - Solr 6.6.1 and HDFS
Unfortunately I cannot upload my cleanup code, but the steps I'm doing are quite simple. I wrote it in Java using the HDFS API and Curator for ZooKeeper. The steps are:
- read the children of /collections in ZK so you know all the collection names
- read /collections/<collection>/state.json to get the state
- find the replicas in the state and filter out those that have a "node_name" matching your local node (the node name is basically a combination of your host name and the Solr port)
- if the replica data has "dataDir" set, then you basically only need to append "index/write.lock" to it and you have the lock location
- if "dataDir" is not set (not really sure why), then you need to construct the path yourself: <hdfs base path>/<collection>/<core name>/data/index/write.lock
- if the lock file exists, delete it
I believe there is a small race condition in case you use replica auto-failover, so I try to keep the time between checking the state in ZooKeeper and deleting the lock file as short as possible; that is, not first determining all lock file locations and only then deleting them, but deleting each one while checking the state. regards, Hendrik On 21.11.2017 19:53, Joe Obernberger wrote: A clever idea. Normally when we need to do a restart, we halt indexing and then wait about 30 minutes. If we do not wait and stop the cluster, the default script's 180-second timeout is not enough and we'll have lock files to clean up. We use puppet to start and stop the nodes, but at this point that is not working well since we need to start one node at a time. With each one taking hours, this is a lengthy process! I'd love to see your script! This new error is now coming up - see screen shot. For some reason some of the shards have no leader assigned: http://lovehorsepower.com/SolrClusterErrors.jpg -Joe On 11/21/2017 1:34 PM, Hendrik Haddorp wrote: Hi, the write.lock issue I see as well when Solr has not been stopped gracefully. 
The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the state stored in ZooKeeper as the owner and is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr start up script which scans the log files in HDFS and compares that with the state in ZooKeeper and then delete all lock files that belong to the node that I'm starting. regards, Hendrik On 21.11.2017 14:07, Joe Obernberger wrote: Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart, we have lock files (write.lock) inside of HDFS. The process at this point is to start one node, find out the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes. The smaller indexes recover much faster (less than 5 minutes). Should we have not used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster? Thank you for any insight! -Joe --- This email has been checked for viruses by AVG. http://www.avg.com
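Hendrik's lock-location step above can be sketched as a pure function. All names and the dataDir-missing fallback layout here are assumptions for illustration, not his actual code; the full script would also read /collections/<collection>/state.json via Curator, filter replicas by node_name, and delete the resulting path from HDFS.

```java
// Sketch of the write.lock location step in Hendrik's cleanup procedure
// (hypothetical names; the real script talks to ZooKeeper and HDFS).
public class LockLocator {
    static String writeLockPath(String dataDir, String hdfsBase,
                                String collection, String coreName) {
        String base;
        if (dataDir != null && !dataDir.isEmpty()) {
            // replica state has "dataDir": just append index/write.lock
            base = dataDir.endsWith("/")
                    ? dataDir.substring(0, dataDir.length() - 1) : dataDir;
        } else {
            // fallback layout when "dataDir" is absent (assumed layout)
            base = hdfsBase + "/" + collection + "/" + coreName + "/data";
        }
        return base + "/index/write.lock";
    }

    public static void main(String[] args) {
        System.out.println(writeLockPath("hdfs://nn/solr/UNCLASS/core_node1/data",
                null, null, null));
        System.out.println(writeLockPath(null, "hdfs://nn/solr",
                "UNCLASS", "core_node1"));
        // both print hdfs://nn/solr/UNCLASS/core_node1/data/index/write.lock
    }
}
```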
Re: Recovery Issue - Solr 6.6.1 and HDFS
We set the hard commit time long because we were having performance issues with HDFS, and thought that since the block size is 128M, having a longer hard commit made sense. That was our hypothesis anyway. Happy to switch it back and see what happens. I don't know what caused the cluster to go into recovery in the first place. We had a server die over the weekend, but it's just one out of ~50. Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies). It was at this point that we noticed lots of network activity, and most of the shards in this recovery, fail, retry loop. That is when we decided to shut it down resulting in zombie lock files. I tried using the FORCELEADER call, which completed, but doesn't seem to have any effect on the shards that have no leader. Kinda out of ideas for that problem. If I can get the cluster back up, I'll try a lower hard commit time. Thanks again Erick! -Joe On 11/21/2017 2:00 PM, Erick Erickson wrote: Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)... I need to back up a bit. Once nodes are in this state it's not surprising that they need to be forcefully killed. I was more thinking about how they got in this situation in the first place. _Before_ you get into the nasty state how are the Solr nodes shut down? Forcefully? Your hard commit is far longer than it needs to be, resulting in much larger tlog files etc. I usually set this at 15-60 seconds with local disks, not quite sure whether longer intervals are helpful on HDFS. What this means is that you can spend up to 30 minutes when you restart solr _replaying the tlogs_! If Solr is killed, it may not have had a chance to fsync the segments and may have to replay on startup. If you have openSearcher set to false, the hard commit operation is not horribly expensive, it just fsync's the current segments and opens new ones. It won't be a total cure, but I bet reducing this interval would help a lot. 
Also, if you stop indexing there's no need to wait 30 minutes if you issue a manual commit, something like .../collection/update?commit=true. Just reducing the hard commit interval will make the wait between stopping indexing and restarting shorter all by itself if you don't want to issue the manual commit. Best, Erick On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorpwrote: Hi, the write.lock issue I see as well when Solr is not been stopped gracefully. The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the state stored in ZooKeeper as the owner and is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr start up script which scans the log files in HDFS and compares that with the state in ZooKeeper and then delete all lock files that belong to the node that I'm starting. regards, Hendrik On 21.11.2017 14:07, Joe Obernberger wrote: Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart, we have lock files (write.lock) inside of HDFS. 
The process at this point is to start one node, find out the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes. The smaller indexes recover much faster (less than 5 minutes). Should we have not used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster? Thank you for any insight! -Joe --- This email has been checked for viruses by AVG. http://www.avg.com
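For reference, the FORCELEADER call Joe mentions is a Collections API action; a request looks roughly like the following (the collection and shard names are just examples taken from this thread):

```
http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=UNCLASS&shard=shard21
```

It only attempts to force a leader for the one shard named, so it has to be repeated for each leaderless shard.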
NPE in modifyCollection
Hi, I'm trying to set a replica placement rule on an existing collection and getting an NPE. It looks like the update code assumes there's a current value for the property being modified; collection.get(updateKey) returns null for a property that was never set, so the .equals call below throws.

Collection: highspot_test operation: modifycollection failed:java.lang.NullPointerException
    at org.apache.solr.cloud.OverseerCollectionMessageHandler.modifyCollection(OverseerCollectionMessageHandler.java:677)

    if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
        && !updateKey.equals(Overseer.QUEUE_OPERATION)
        && !collection.get(updateKey).equals(updateEntry.getValue())) {
      areChangesVisible = false;
      break;
    }

I'm on 6.5.1, but the code looks the same in head. I didn't see anything related in Jira; does this warrant a new ticket? Thanks, Nate -- Nate Dire Software Engineer Highspot
tokenstream reusable
Hello all, I would like to reuse the token stream generated in one field to create a new token stream for another field, without executing the whole analysis again. The particular application is: - I have a field *tokens* with an analyzer that generates the tokens (and maintains the token type attributes) - I would like to have two new fields: *verbs* and *adjectives*. These should reuse the token stream generated for the field *tokens* and filter the verbs and adjectives into the respective fields. Is this feasible? How should it be implemented? Many thanks. Roxana
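Lucene does have building blocks in this direction - TeeSinkTokenFilter can feed one analysis chain into several sinks, and TypeTokenFilter keeps or removes tokens by their type attribute - though whether Solr exposes that across multiple schema fields is the open question here. The core idea, splitting one analyzed stream by the type attribute, can be shown with a deliberately Lucene-free sketch in which (term, type) pairs stand in for a TokenStream with its type attribute:

```python
# Illustration only: a "token stream" as (term, type) pairs, the way a
# POS-tagging TokenFilter might set the Lucene type attribute on each token.

def split_by_type(tokens, wanted_types):
    """Keep only the terms whose type attribute is in wanted_types."""
    return [term for term, ttype in tokens if ttype in wanted_types]

stream = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
          ("jumps", "VERB"), ("lazy", "ADJ"), ("sleeps", "VERB")]

verbs = split_by_type(stream, {"VERB"})       # -> ["jumps", "sleeps"]
adjectives = split_by_type(stream, {"ADJ"})   # -> ["quick", "lazy"]
```

In Lucene terms, the expensive tagging analysis runs once, and each derived field's chain is just a cheap type filter over the shared stream.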
Re: Recovery Issue - Solr 6.6.1 and HDFS
Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)... I need to back up a bit. Once nodes are in this state it's not surprising that they need to be forcefully killed. I was more thinking about how they got in this situation in the first place. _Before_ you get into the nasty state how are the Solr nodes shut down? Forcefully? Your hard commit is far longer than it needs to be, resulting in much larger tlog files etc. I usually set this at 15-60 seconds with local disks, not quite sure whether longer intervals are helpful on HDFS. What this means is that you can spend up to 30 minutes when you restart solr _replaying the tlogs_! If Solr is killed, it may not have had a chance to fsync the segments and may have to replay on startup. If you have openSearcher set to false, the hard commit operation is not horribly expensive, it just fsync's the current segments and opens new ones. It won't be a total cure, but I bet reducing this interval would help a lot. Also, if you stop indexing there's no need to wait 30 minutes if you issue a manual commit, something like .../collection/update?commit=true. Just reducing the hard commit interval will make the wait between stopping indexing and restarting shorter all by itself if you don't want to issue the manual commit. Best, Erick On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp wrote: > Hi, > > the write.lock issue I see as well when Solr has not been stopped gracefully. > The write.lock files are then left in HDFS as they do not get removed > automatically when the client disconnects like an ephemeral node in > ZooKeeper. Unfortunately Solr also does not realize that it should be owning > the lock as it is marked in the state stored in ZooKeeper as the owner and > is also not willing to retry, which is why you need to restart the whole > Solr instance after the cleanup. 
I added some logic to my Solr start up > script which scans the log files in HDFS and compares that with the state in > ZooKeeper and then delete all lock files that belong to the node that I'm > starting. > > regards, > Hendrik > > > On 21.11.2017 14:07, Joe Obernberger wrote: >> >> Hi All - we have a system with 45 physical boxes running solr 6.6.1 using >> HDFS as the index. The current index size is about 31TBytes. With 3x >> replication that takes up 93TBytes of disk. Our main collection is split >> across 100 shards with 3 replicas each. The issue that we're running into >> is when restarting the solr6 cluster. The shards go into recovery and start >> to utilize nearly all of their network interfaces. If we start too many of >> the nodes at once, the shards will go into a recovery, fail, and retry loop >> and never come up. The errors are related to HDFS not responding fast >> enough and warnings from the DFSClient. If we stop a node when this is >> happening, the script will force a stop (180 second timeout) and upon >> restart, we have lock files (write.lock) inside of HDFS. >> >> The process at this point is to start one node, find out the lock files, >> wait for it to come up completely (hours), stop it, delete the write.lock >> files, and restart. Usually this second restart is faster, but it still can >> take 20-60 minutes. >> >> The smaller indexes recover much faster (less than 5 minutes). Should we >> have not used so many replicas with HDFS? Is there a better way we should >> have built the solr6 cluster? >> >> Thank you for any insight! >> >> -Joe >> >
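Erick's hard-commit advice corresponds to an autoCommit block along these lines in solrconfig.xml (the 15-second value is taken from the low end of his 15-60 second range; tune it for your own setup):

```xml
<autoCommit>
  <!-- hard commit every 15s: fsyncs segments and truncates the tlog -->
  <maxTime>15000</maxTime>
  <!-- don't open a new searcher on hard commit; keeps the commit cheap -->
  <openSearcher>false</openSearcher>
</autoCommit>
```

Keeping the tlog small this way is what bounds the replay time on restart.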
Re: Recovery Issue - Solr 6.6.1 and HDFS
A clever idea. Normally what we do when we need to do a restart, is to halt indexing, and then wait about 30 minutes. If we do not wait, and stop the cluster, the default scripts 180 second timeout is not enough and we'll have lock files to clean up. We use puppet to start and stop the nodes, but at this point that is not working well since we need to start one node at a time. With each one taking hours, this is a lengthy process! I'd love to see your script! This new error is now coming up - see screen shot. For some reason some of the shards have no leader assigned: http://lovehorsepower.com/SolrClusterErrors.jpg -Joe On 11/21/2017 1:34 PM, Hendrik Haddorp wrote: Hi, the write.lock issue I see as well when Solr is not been stopped gracefully. The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the state stored in ZooKeeper as the owner and is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr start up script which scans the log files in HDFS and compares that with the state in ZooKeeper and then delete all lock files that belong to the node that I'm starting. regards, Hendrik On 21.11.2017 14:07, Joe Obernberger wrote: Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. 
The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart, we have lock files (write.lock) inside of HDFS. The process at this point is to start one node, find out the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes. The smaller indexes recover much faster (less than 5 minutes). Should we have not used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster? Thank you for any insight! -Joe --- This email has been checked for viruses by AVG. http://www.avg.com
Re: Recovery Issue - Solr 6.6.1 and HDFS
Hi, the write.lock issue I see as well when Solr is not been stopped gracefully. The write.lock files are then left in the HDFS as they do not get removed automatically when the client disconnects like a ephemeral node in ZooKeeper. Unfortunately Solr does also not realize that it should be owning the lock as it is marked in the state stored in ZooKeeper as the owner and is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr start up script which scans the log files in HDFS and compares that with the state in ZooKeeper and then delete all lock files that belong to the node that I'm starting. regards, Hendrik On 21.11.2017 14:07, Joe Obernberger wrote: Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart, we have lock files (write.lock) inside of HDFS. The process at this point is to start one node, find out the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes. The smaller indexes recover much faster (less than 5 minutes). Should we have not used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster? Thank you for any insight! -Joe
Re: Recovery Issue - Solr 6.6.1 and HDFS
Erick - thank you very much for the reply. I'm still working through restarting the nodes one by one. I'm stopping the nodes with the script, but yes - they are being killed forcefully because they are in this recovery, failed, retry loop. I could increase the timeout, but they never seem to recover. The largest tlog file that I see currently is 222MBytes. Autocommit is set to 180 and autoSoftCommit is set to 12. Errors when they are in the long recovery are things like: 2017-11-20 21:41:29.755 ERROR (recoveryExecutor-3-thread-4-processing-n:frodo:9100_solr x:UNCLASS_shard37_replica1 s:shard37 c:UNCLASS r:core_node196) [c:UNCLASS s:shard37 r:core_node196 x:UNCLASS_shard37_replica1] o.a.s.h.IndexFetcher Error closing file: _8dmn.cfs org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /solr6.6.0/UNCLASS/core_node196/data/index.20171120195705961/_8dmn.cfs could only be replicated to 0 nodes instead of minReplication (=1). There are 39 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1716) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3385) Complete log is here for one of the shards that was forcefully stopped. http://lovehorsepower.com/solr.log As to what is in the logs when it is recovering for several hours, it's many WARN messages from the DFSClient such as: Abandoning BP-1714598269-10.2.100.220-1341346046854:blk_4366207808_1103082741732 and Excluding datanode DatanodeInfoWithStorage[172.16.100.229:50010,DS-5985e40d-830a-44e7-a2ea-fc60bebabc30,DISK] or from the IndexFetcher: File _a96y.cfe did not match. expected checksum is 3502268220 and actual is checksum 2563579651. 
expected length is 405 and actual length is 405 Unfortunately, I'm now getting errors from some of the nodes (still working through restarting them) about zookeeper: org.apache.solr.common.SolrException: Could not get leader props at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043) at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007) at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963) at org.apache.solr.cloud.ZkController.register(ZkController.java:902) at org.apache.solr.cloud.ZkController.register(ZkController.java:846) at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60) at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354) at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021) Any idea what those could be? Those shards are not coming back up. Sorry so many questions! -Joe On 11/21/2017 12:11 PM, Erick Erickson wrote: How are you stopping Solr? Nodes should not go into recovery on startup unless Solr was killed un-gracefully (i.e. 
kill -9 or the like). If you use the bin/solr script to stop Solr and see a message about "killing XXX forcefully" then you can lengthen out the time the script waits for shutdown (there's a sysvar you can set, look in the script). Actually I'll correct myself a bit. Shards _do_ go into recovery but it should be very short in the graceful shutdown case. Basically shards temporarily go into recovery as part of normal processing just long enough to see there's no recovery necessary, but that should be measured in a few seconds. What it sounds like from this "The shards go into recovery and start to utilize nearly all of their network" is that your nodes go into "full recovery" where the entire index is copied down because the replica thinks it's "too far" out of date. That indicates something weird about the state when the Solr nodes stopped. wild-shot-in-the-dark question. How big are your tlogs? If you don't hard commit very often, the tlogs can replay at startup for a very long time. This makes no sense to me, I'm surely missing something: The
Possible to disable SynonymQuery and get legacy behavior?
We help clients that perform index-time semantic expansion to hypernyms. For example, they will have a synonyms file that does the following:

wing_tips => wing_tips, dress_shoes, shoes
dress_shoes => dress_shoes, shoes
oxfords => oxfords, dress_shoes, shoes

Then at query time, we rely on the differing IDF of these terms in the same position to bring up the rare, specific term matches first, followed by increasingly semantically broad matches. For example, a search for wing_tips would previously get turned into "wing_tips OR dress_shoes OR shoes". Shoes, being very common, would get scored lowest; wing_tips, being very specific, would get scored very highly. (I have a blog post about this, which uses Elasticsearch: http://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/ ) As our clients upgrade to Solr 6 and above, we're noticing our technique no longer works due to SynonymQuery, which blends the doc freq of the synonyms at query time. SynonymQuery seems to be the right direction for most people :) Still, I would like to figure out how/if there's a setting anywhere to return to the legacy behavior (a boolean query of term queries) so I don't have to go back to the drawing board for clients that rely on this technique. I've been going through QueryBuilder and I don't see where we could go back to the legacy behavior. It seems to be based on position overlap. Thanks! -Doug -- Consultant, OpenSource Connections. Contact info at http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)
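The scoring effect Doug relies on falls out of the classic IDF formula. A toy calculation (the corpus size and doc freqs here are invented purely for illustration) shows why the rare specific term dominates when each expanded term keeps its own doc freq, and why blending the group into a single doc freq, roughly what SynonymQuery does, flattens that signal:

```python
import math

N = 1_000_000  # total docs (made-up corpus)
df = {"wing_tips": 50, "dress_shoes": 5_000, "shoes": 200_000}

# Classic Lucene-style IDF: 1 + ln(N / (df + 1))
idf = {t: 1 + math.log(N / (d + 1)) for t, d in df.items()}

# With a BooleanQuery of TermQuerys, each term keeps its own IDF,
# so an exact wing_tips match far outscores the broad hypernym "shoes".
assert idf["wing_tips"] > idf["dress_shoes"] > idf["shoes"]

# SynonymQuery scores the whole group with one blended doc freq
# (roughly the max across the terms), collapsing that ranking signal.
blended_idf = 1 + math.log(N / (max(df.values()) + 1))
assert blended_idf < idf["wing_tips"]
```

With per-term IDF, the expansion acts as a built-in taxonomy booster; with the blended doc freq, all three terms in the position score as if they were "shoes".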
Re: OutOfMemoryError in 6.5.1
On 11/21/2017 9:17 AM, Walter Underwood wrote: > All our customizations are in solr.in.sh. We’re using the one we configured > for 6.3.0. I’ll check for any differences between that and the 6.5.1 script. The order looks correct to me -- the arguments for the OOM killer are listed *before* the "-jar start.jar" part of the command, so they should be taking effect. Take a look at /apps/solr6/bin/oom_solr.sh and make sure it's marked as executable for the user that Solr is running under, and that the shebang at the top of the script is correct and executable as well. > I don’t see any arguments at all in the dashboard. I do see them in a ps > listing, right at the end. This UI problem is documented/handled in SOLR-11645. Your argument list includes "-Dsolr.log.muteconsole" twice, which triggers the problem. https://issues.apache.org/jira/browse/SOLR-11645 The fix isn't available in a released version yet, but the patch can easily be applied to a downloaded/installed Solr without compiling source code. Your browser will be caching the old version, so you'll have to deal with that. > I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0. Our > load benchmarks use prod logs. We added suggesters, but those use analyzing > infix, so they are search indexes, not in-memory. It can be very difficult to figure out what's causing OOM issues, especially if the config, index, and queries are identical between one version without the problem and another version with the problem. It sounds like you and Erick have some theories about it. What is the exact message on the OOME that you're getting? Thanks, Shawn
Re: NullPointerException in PeerSync.handleUpdates
Did you check the JIRA list? Or CHANGES.txt in more recent versions? On Tue, Nov 21, 2017 at 1:13 AM, S G wrote: > Hi, > > We are running 6.2 version of Solr and hitting this error frequently. > > Error while trying to recover. core=my_core:java.lang.NullPointerException > at org.apache.solr.update.PeerSync.handleUpdates(PeerSync.java:605) > at org.apache.solr.update.PeerSync.handleResponse(PeerSync.java:344) > at org.apache.solr.update.PeerSync.sync(PeerSync.java:257) > at > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376) > at > org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > > > Is this a known issue and fixed in some newer version? > > > Thanks > SG
Re: OutOfMemoryError in 6.5.1
Walter: Yeah, I've seen this on occasion. IIRC, the OOM exception will be specific to running out of stack space, or at least slightly different than the "standard" OOM error. That would be the "smoking gun" for too many threads Erick On Tue, Nov 21, 2017 at 9:00 AM, Walter Underwoodwrote: > I do have one theory about the OOM. The server is running out of memory > because there are too many threads. Instead of queueing up overload in the > load balancer, it is queue in new threads waiting to run. Setting > solr.jetty.threads.max to 10,000 guarantees this will happen under overload. > > New Relic shows this clearly. CPU hits 100% at 15:40, thread count and load > average start climbing. At 15:43, it reaches 3000 threads and starts throwing > OOM. After that, the server is in a stable congested state. > > I understand why the Jetty thread max was set so high, but I think the cure > is worse than the disease. We’ll run another load benchmark with thread max > at something realistic, like 200. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > >> On Nov 21, 2017, at 8:17 AM, Walter Underwood wrote: >> >> All our customizations are in solr.in.sh. We’re using the one we configured >> for 6.3.0. I’ll check for any differences between that and the 6.5.1 script. >> >> I don’t see any arguments at all in the dashboard. I do see them in a ps >> listing, right at the end. 
>> >> java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled >> -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages >> -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc >> -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps >> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution >> -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr/logs/solr_gc.log >> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M >> -Dcom.sun.management.jmxremote >> -Dcom.sun.management.jmxremote.local.only=false >> -Dcom.sun.management.jmxremote.ssl=false >> -Dcom.sun.management.jmxremote.authenticate=false >> -Dcom.sun.management.jmxremote.port=18983 >> -Dcom.sun.management.jmxremote.rmi.port=18983 >> -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com >> -DzkClientTimeout=15000 >> -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud >> -Dsolr.log.level=WARN -Dsolr.log.dir=/solr/logs -Djetty.port=8983 >> -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks >> -Dhost=new-solr-c01.test3.cloud.cheggnet.com -Duser.timezone=UTC >> -Djetty.home=/apps/solr6/server -Dsolr.solr.home=/apps/solr6/server/solr >> -Dsolr.install.dir=/apps/solr6 -Dgraphite.prefix=solr-cloud.new-solr-c01 >> -Dgraphite.host=influx.test.cheggnet.com >> -javaagent:/apps/solr6/newrelic/newrelic.jar -Dnewrelic.environment=test3 >> -Dsolr.log.muteconsole -Xss256k -Dsolr.log.muteconsole >> -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh 8983 /solr/logs -jar >> start.jar --module=http >> >> I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0. Our >> load benchmarks use prod logs. We added suggesters, but those use analyzing >> infix, so they are search indexes, not in-memory. 
>> >> wunder >> Walter Underwood >> wun...@wunderwood.org >> http://observer.wunderwood.org/ (my blog) >> >> >>> On Nov 21, 2017, at 5:46 AM, Shawn Heisey wrote: >>> >>> On 11/20/2017 6:17 PM, Walter Underwood wrote: When I ran load benchmarks with 6.3.0, an overloaded cluster would get super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting OOMs. That is really bad, because it means we need to reboot every node in the cluster. Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an 8G heap. >>> This is not good behavior in prod. The process goes to the bad place, then we need to wait until someone is paged and kills it manually. Luckily, it usually drops out of the live nodes for each collection and doesn’t take user traffic. >>> >>> There was a bug, fixed long before 6.3.0, where the OOM killer script >>> wasn't working because the arguments enabling it were in the wrong place. >>> It was fixed in 5.5.1 and 6.0. >>> >>> https://issues.apache.org/jira/browse/SOLR-8145 >>> >>> If the scripts that you are using to get Solr started originated with a >>> much older version of Solr than you are currently running, maybe you've got >>> the arguments in the wrong order. >>> >>> Do you see the commandline arguments for the OOM killer (only available on >>> *NIX systems, not Windows) on the admin UI dashboard? If they are properly >>> placed, you will see them on the dashboard, but if they aren't properly >>> placed, then you won't
Re: Recovery Issue - Solr 6.6.1 and HDFS
How are you stopping Solr? Nodes should not go into recovery on startup unless Solr was killed un-gracefully (i.e. kill -9 or the like). If you use the bin/solr script to stop Solr and see a message about "killing XXX forcefully" then you can lengthen out the time the script waits for shutdown (there's a sysvar you can set, look in the script). Actually I'll correct myself a bit. Shards _do_ go into recovery but it should be very short in the graceful shutdown case. Basically shards temporarily go into recovery as part of normal processing just long enough to see there's no recovery necessary, but that should be measured in a few seconds. What it sounds like from this "The shards go into recovery and start to utilize nearly all of their network" is that your nodes go into "full recovery" where the entire index is copied down because the replica thinks it's "too far" out of date. That indicates something weird about the state when the Solr nodes stopped. wild-shot-in-the-dark question. How big are your tlogs? If you don't hard commit very often, the tlogs can replay at startup for a very long time. This makes no sense to me, I'm surely missing something: The process at this point is to start one node, find out the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes. When you start one node it may take a few minutes for leader electing to kick in (the default is 180 seconds) but absent replication it should just be there. Taking hours totally violates my expectations. What does Solr _think_ it's doing? What's in the logs at that point? And if you stop solr gracefully, there shouldn't be a problem with write.lock. You could also try increasing the timeouts, and the HDFS directory factory has some parameters to tweak that are a mystery to me... All in all, this is behavior that I find mystifying. 
Best, Erick On Tue, Nov 21, 2017 at 5:07 AM, Joe Obernbergerwrote: > Hi All - we have a system with 45 physical boxes running solr 6.6.1 using > HDFS as the index. The current index size is about 31TBytes. With 3x > replication that takes up 93TBytes of disk. Our main collection is split > across 100 shards with 3 replicas each. The issue that we're running into > is when restarting the solr6 cluster. The shards go into recovery and start > to utilize nearly all of their network interfaces. If we start too many of > the nodes at once, the shards will go into a recovery, fail, and retry loop > and never come up. The errors are related to HDFS not responding fast > enough and warnings from the DFSClient. If we stop a node when this is > happening, the script will force a stop (180 second timeout) and upon > restart, we have lock files (write.lock) inside of HDFS. > > The process at this point is to start one node, find out the lock files, > wait for it to come up completely (hours), stop it, delete the write.lock > files, and restart. Usually this second restart is faster, but it still can > take 20-60 minutes. > > The smaller indexes recover much faster (less than 5 minutes). Should we > have not used so many replicas with HDFS? Is there a better way we should > have built the solr6 cluster? > > Thank you for any insight! > > -Joe >
Re: Custom analyzer & frequency
One thing you might do is use the termfreq function to see that it looks like in the index. Also the schema/analysis page will put terms in "buckets" by power-of-2 so that might help too. Best, Erick On Tue, Nov 21, 2017 at 7:55 AM, Barbet Alainwrote: > You rock, thank you so much for this clear answer, I loose 2 days for > nothing as I've already the term freq but now I've an answer :-) > (And yes I check it's the doc freq, not the term freq). > > Thank you again ! > > 2017-11-21 16:34 GMT+01:00 Emir Arnautović : >> Hi Alain, >> As explained in prev mail that is doc frequency and each doc is counted >> once. I am not sure if Luke can provide you information about overall term >> frequency - sum of term frequency of all docs. >> >> Regards, >> Emir >> -- >> Monitoring - Log Management - Alerting - Anomaly Detection >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ >> >> >> >>> On 21 Nov 2017, at 16:30, Barbet Alain wrote: >>> >>> $ cat add_test.sh >>> DATA=' >>> >>> >>>666 >>>toto titi tata toto tutu titi >>> >>> >>> ' >>> $ sh add_test.sh >>> >>> >>> 0>> name="QTime">484 >>> >>> >>> >>> $ curl >>> 'http://localhost:8983/solr/alian_test/terms?terms.fl=titi_txt_fr=index' >>> >>> >>> 0>> name="QTime">0>> name="titi_txt_fr">1>> name="titi">11>> name="tutu">1 >>> >>> >>> >>> So it's not only on Luke Side, it's come from Solr. Does it sound normal ? >>> >>> 2017-11-21 11:43 GMT+01:00 Barbet Alain : Hi, I build a custom analyzer & setup it in solr, but doesn't work as I expect. I always get 1 as frequency for each word even if it's present multiple time in the text. So I try with default analyzer & find same behavior: My schema >>> stored="true"/> alian@yoda:~/solr> cat add_test.sh DATA=' 666 toto titi tata toto tutu titi ' curl -X POST -H 'Content-Type: text/xml' 'http://localhost:8983/solr/alian_test/update?commit=true' --data-binary "$DATA" When I test in solr interface / analyze, I find the right behavior (find titi & toto 2 times). 
But when I look in solr index with Luke or solr interface / schema, the top term always get 1 as frequency. Can someone give me the thing I forget ? (solr 6.5) Thank you ! >>
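Erick's termfreq suggestion above can be tried with a query along the following lines. This is a sketch, not a verified setup: the collection (`alian_test`), field (`titi_txt_fr`), document id (`666`), and term (`toto`) are all taken from the examples earlier in the thread, so substitute your own names.

```shell
# Sketch: ask Solr to report the per-document term frequency of "toto"
# via the termfreq() function query in the fl list. Requires a running
# Solr instance hosting this (hypothetical) collection.
curl 'http://localhost:8983/solr/alian_test/select' \
  --data-urlencode 'q=id:666' \
  --data-urlencode "fl=id,termfreq(titi_txt_fr,'toto')" \
  --data-urlencode 'wt=json'
```

If term frequencies are being indexed correctly, the function should return 2 for the sample document rather than 1.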
Re: OutOfMemoryError in 6.5.1
I do have one theory about the OOM. The server is running out of memory because there are too many threads. Instead of queueing up overload in the load balancer, it is queued in new threads waiting to run. Setting solr.jetty.threads.max to 10,000 guarantees this will happen under overload. New Relic shows this clearly. CPU hits 100% at 15:40, thread count and load average start climbing. At 15:43, it reaches 3000 threads and starts throwing OOM. After that, the server is in a stable congested state. I understand why the Jetty thread max was set so high, but I think the cure is worse than the disease. We’ll run another load benchmark with thread max at something realistic, like 200. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 21, 2017, at 8:17 AM, Walter Underwood wrote: > > All our customizations are in solr.in.sh. We’re using the one we configured > for 6.3.0. I’ll check for any differences between that and the 6.5.1 script. > > I don’t see any arguments at all in the dashboard. I do see them in a ps > listing, right at the end. 
> > java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled > -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages > -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc > -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr/logs/solr_gc.log > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M > -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.local.only=false > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.port=18983 > -Dcom.sun.management.jmxremote.rmi.port=18983 > -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com > -DzkClientTimeout=15000 > -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud > -Dsolr.log.level=WARN -Dsolr.log.dir=/solr/logs -Djetty.port=8983 > -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks > -Dhost=new-solr-c01.test3.cloud.cheggnet.com -Duser.timezone=UTC > -Djetty.home=/apps/solr6/server -Dsolr.solr.home=/apps/solr6/server/solr > -Dsolr.install.dir=/apps/solr6 -Dgraphite.prefix=solr-cloud.new-solr-c01 > -Dgraphite.host=influx.test.cheggnet.com > -javaagent:/apps/solr6/newrelic/newrelic.jar -Dnewrelic.environment=test3 > -Dsolr.log.muteconsole -Xss256k -Dsolr.log.muteconsole > -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh 8983 /solr/logs -jar > start.jar --module=http > > I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0. Our > load benchmarks use prod logs. We added suggesters, but those use analyzing > infix, so they are search indexes, not in-memory. 
> > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > >> On Nov 21, 2017, at 5:46 AM, Shawn Heisey wrote: >> >> On 11/20/2017 6:17 PM, Walter Underwood wrote: >>> When I ran load benchmarks with 6.3.0, an overloaded cluster would get >>> super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start >>> getting OOMs. That is really bad, because it means we need to reboot every >>> node in the cluster. >>> Also, the JVM OOM hook isn’t running the process killer (JVM >>> 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an >>> 8G heap. >> >>> This is not good behavior in prod. The process goes to the bad place, then >>> we need to wait until someone is paged and kills it manually. Luckily, it >>> usually drops out of the live nodes for each collection and doesn’t take >>> user traffic. >> >> There was a bug, fixed long before 6.3.0, where the OOM killer script wasn't >> working because the arguments enabling it were in the wrong place. It was >> fixed in 5.5.1 and 6.0. >> >> https://issues.apache.org/jira/browse/SOLR-8145 >> >> If the scripts that you are using to get Solr started originated with a much >> older version of Solr than you are currently running, maybe you've got the >> arguments in the wrong order. >> >> Do you see the commandline arguments for the OOM killer (only available on >> *NIX systems, not Windows) on the admin UI dashboard? If they are properly >> placed, you will see them on the dashboard, but if they aren't properly >> placed, then you won't see them. This is what the argument looks like for >> one of my Solr installs: >> >> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs >> >> Something which you probably already know: If you're hitting OOM, you need >> a larger heap, or you need to adjust the config so it uses less memory. >> There are no other ways to "fix" OOM problems. >> >> Thanks, >> Shawn >
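A config sketch of the thread cap Walter proposes: `solr.jetty.threads.max` is the property his message names, and 200 is his suggested realistic ceiling. Exactly where this belongs depends on how your install passes JVM options; this assumes a solr.in.sh-based setup.

```shell
# In solr.in.sh: cap Jetty's request thread pool so that overload queues
# in the load balancer instead of as thousands of waiting server threads.
SOLR_OPTS="$SOLR_OPTS -Dsolr.jetty.threads.max=200"
```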
Re: OutOfMemoryError in 6.5.1
bq: but those use analyzing infix, so they are search indexes, not in-memory Sure, but they still can consume heap. Most of the index is MMapped of course, but there are some control structures, indexes and the like still kept on the heap. I suppose not using the suggester would nail it though. I guess the second thing I'd be interested in is a heap dump of the two to get a sense of whether something really wonky crept in between those versions. Certainly nothing intentional that I know of. Erick On Tue, Nov 21, 2017 at 8:17 AM, Walter Underwoodwrote: > All our customizations are in solr.in.sh. We’re using the one we configured > for 6.3.0. I’ll check for any differences between that and the 6.5.1 script. > > I don’t see any arguments at all in the dashboard. I do see them in a ps > listing, right at the end. > > java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled > -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages > -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc > -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr/logs/solr_gc.log > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M > -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.local.only=false > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.port=18983 > -Dcom.sun.management.jmxremote.rmi.port=18983 > -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com > -DzkClientTimeout=15000 > -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud > -Dsolr.log.level=WARN -Dsolr.log.dir=/solr/logs -Djetty.port=8983 > -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks > -Dhost=new-solr-c01.test3.cloud.cheggnet.com -Duser.timezone=UTC > 
-Djetty.home=/apps/solr6/server -Dsolr.solr.home=/apps/solr6/server/solr > -Dsolr.install.dir=/apps/solr6 -Dgraphite.prefix=solr-cloud.new-solr-c01 > -Dgraphite.host=influx.test.cheggnet.com > -javaagent:/apps/solr6/newrelic/newrelic.jar -Dnewrelic.environment=test3 > -Dsolr.log.muteconsole -Xss256k -Dsolr.log.muteconsole > -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh 8983 /solr/logs -jar > start.jar --module=http > > I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0. Our > load benchmarks use prod logs. We added suggesters, but those use analyzing > infix, so they are search indexes, not in-memory. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > >> On Nov 21, 2017, at 5:46 AM, Shawn Heisey wrote: >> >> On 11/20/2017 6:17 PM, Walter Underwood wrote: >>> When I ran load benchmarks with 6.3.0, an overloaded cluster would get >>> super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start >>> getting OOMs. That is really bad, because it means we need to reboot every >>> node in the cluster. >>> Also, the JVM OOM hook isn’t running the process killer (JVM >>> 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an >>> 8G heap. >> >>> This is not good behavior in prod. The process goes to the bad place, then >>> we need to wait until someone is paged and kills it manually. Luckily, it >>> usually drops out of the live nodes for each collection and doesn’t take >>> user traffic. >> >> There was a bug, fixed long before 6.3.0, where the OOM killer script wasn't >> working because the arguments enabling it were in the wrong place. It was >> fixed in 5.5.1 and 6.0. >> >> https://issues.apache.org/jira/browse/SOLR-8145 >> >> If the scripts that you are using to get Solr started originated with a much >> older version of Solr than you are currently running, maybe you've got the >> arguments in the wrong order. 
>> >> Do you see the commandline arguments for the OOM killer (only available on >> *NIX systems, not Windows) on the admin UI dashboard? If they are properly >> placed, you will see them on the dashboard, but if they aren't properly >> placed, then you won't see them. This is what the argument looks like for >> one of my Solr installs: >> >> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs >> >> Something which you probably already know: If you're hitting OOM, you need >> a larger heap, or you need to adjust the config so it uses less memory. >> There are no other ways to "fix" OOM problems. >> >> Thanks, >> Shawn >
Re: Solr cloud in kubernetes
We hopefully will switch to Kubernetes/Rancher 2.0 from Rancher 1.x/Docker, soon. Here are some utilities that we've used as run-once containers to start everything up: https://github.com/odoko-devops/solr-utils Using a single image, run with many different configurations, we have been able to stand up an entire Solr stack, from scratch, including ZooKeeper, Solr, solr.xml, config upload, collection creation, replica creation, content indexing, etc. It is a delight to see when it works. Upayavira On Mon, 20 Nov 2017, at 09:30 AM, Björn Häuser wrote: > Hi Raja, > > we are using solrcloud as a statefulset and every pod has its own storage > attached to it. > > Thanks > Björn > > > On 20. Nov 2017, at 05:59, rajasaurwrote: > > > > Hi Bjorn, > > > > Im trying a similar approach now (to get solrcloud working on kubernetes). I > > have run Zookeeper as a statefulset, but not running SolrCloud, which is > > causing an issue when my pods get destroyed and restarted. > > I will try running the -h option so that the SOLR_HOST is used when > > connecting to itself (and to zookeeper). > > > > On another note, how do you store the indexes ? I had an issue with my GCE > > node (Node NotReady), which had its kubelet to be restarted, but with that, > > since solrcloud pods were restarted, all the data got wiped out. Just > > wondering how you have setup your indexes with the solrcloud kubernetes > > setup. > > > > Thanks > > Raja > > > > > > > > > > -- > > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html >
Re: OutOfMemoryError in 6.5.1
All our customizations are in solr.in.sh. We’re using the one we configured for 6.3.0. I’ll check for any differences between that and the 6.5.1 script. I don’t see any arguments at all in the dashboard. I do see them in a ps listing, right at the end. java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr/logs/solr_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=18983 -Dcom.sun.management.jmxremote.rmi.port=18983 -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com -DzkClientTimeout=15000 -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud -Dsolr.log.level=WARN -Dsolr.log.dir=/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=new-solr-c01.test3.cloud.cheggnet.com -Duser.timezone=UTC -Djetty.home=/apps/solr6/server -Dsolr.solr.home=/apps/solr6/server/solr -Dsolr.install.dir=/apps/solr6 -Dgraphite.prefix=solr-cloud.new-solr-c01 -Dgraphite.host=influx.test.cheggnet.com -javaagent:/apps/solr6/newrelic/newrelic.jar -Dnewrelic.environment=test3 -Dsolr.log.muteconsole -Xss256k -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh 8983 /solr/logs -jar start.jar --module=http I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0. Our load benchmarks use prod logs. We added suggesters, but those use analyzing infix, so they are search indexes, not in-memory. 
wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 21, 2017, at 5:46 AM, Shawn Heiseywrote: > > On 11/20/2017 6:17 PM, Walter Underwood wrote: >> When I ran load benchmarks with 6.3.0, an overloaded cluster would get super >> slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting >> OOMs. That is really bad, because it means we need to reboot every node in >> the cluster. >> Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13). >> Using the G1 collector with the Shawn Heisey settings in an 8G heap. > >> This is not good behavior in prod. The process goes to the bad place, then >> we need to wait until someone is paged and kills it manually. Luckily, it >> usually drops out of the live nodes for each collection and doesn’t take >> user traffic. > > There was a bug, fixed long before 6.3.0, where the OOM killer script wasn't > working because the arguments enabling it were in the wrong place. It was > fixed in 5.5.1 and 6.0. > > https://issues.apache.org/jira/browse/SOLR-8145 > > If the scripts that you are using to get Solr started originated with a much > older version of Solr than you are currently running, maybe you've got the > arguments in the wrong order. > > Do you see the commandline arguments for the OOM killer (only available on > *NIX systems, not Windows) on the admin UI dashboard? If they are properly > placed, you will see them on the dashboard, but if they aren't properly > placed, then you won't see them. This is what the argument looks like for > one of my Solr installs: > > -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs > > Something which you probably already know: If you're hitting OOM, you need a > larger heap, or you need to adjust the config so it uses less memory. There > are no other ways to "fix" OOM problems. > > Thanks, > Shawn
Re: Merging of index in Solr
I am using the IndexMergeTool from Solr, from the command below: java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar org.apache.lucene.misc.IndexMergeTool The heap size is 32GB. There are more than 20 million documents in the two cores. Regards, Edwin On 21 November 2017 at 21:54, Shawn Heiseywrote: > On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote: > >> Does anyone knows how long usually the merging in Solr will take? >> >> I am currently merging about 3.5TB of data, and it has been running for >> more than 28 hours and it is not completed yet. The merging is running on >> SSD disk. >> > > The following will apply if you mean Solr's "optimize" feature when you > say "merging". > > In my experience, merging proceeds at about 20 to 30 megabytes per second > -- even if the disks are capable of far faster data transfer. Merging is > not just copying the data. Lucene is completely rebuilding very large data > structures, and *not* including data from deleted documents as it does so. > It takes a lot of CPU power and time. > > If we average the data rates I've seen to 25, then that would indicate > that an optimize on a 3.5TB is going to take about 39 hours, and might take > as long as 48 hours. And if you're running SolrCloud with multiple > replicas, multiply that by the number of copies of the 3.5TB index. An > optimize on a SolrCloud collection handles one shard replica at a time and > works its way through the entire collection. > > If you are merging different indexes *together*, which a later message > seems to state, then the actual Lucene operation is probably nearly > identical, but I'm not really familiar with it, so I cannot say for sure. > > Thanks, > Shawn > >
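Note that the `;` classpath separator in the command above is the Windows form; on Linux/macOS the separator is `:`. A fuller sketch of the same invocation follows; the index directory paths are hypothetical, and no Solr process should have any of these directories open while the tool runs.

```shell
# Merge source indexes into a new destination index. The first path
# argument is the destination; the remaining paths are the sources.
# Paths here are illustrative, not from the thread.
java -Xmx32g \
  -classpath lucene-core-6.5.1.jar:lucene-misc-6.5.1.jar \
  org.apache.lucene.misc.IndexMergeTool \
  /data/merged/index \
  /data/collection1/index \
  /data/collection2/index
```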
Re: Custom analyzer & frequency
You rock, thank you so much for this clear answer, I loose 2 days for nothing as I've already the term freq but now I've an answer :-) (And yes I check it's the doc freq, not the term freq). Thank you again ! 2017-11-21 16:34 GMT+01:00 Emir Arnautović: > Hi Alain, > As explained in prev mail that is doc frequency and each doc is counted once. > I am not sure if Luke can provide you information about overall term > frequency - sum of term frequency of all docs. > > Regards, > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > >> On 21 Nov 2017, at 16:30, Barbet Alain wrote: >> >> $ cat add_test.sh >> DATA=' >> >> >>666 >>toto titi tata toto tutu titi >> >> >> ' >> $ sh add_test.sh >> >> >> 0> name="QTime">484 >> >> >> >> $ curl >> 'http://localhost:8983/solr/alian_test/terms?terms.fl=titi_txt_fr=index' >> >> >> 0> name="QTime">0> name="titi_txt_fr">1> name="titi">11> name="tutu">1 >> >> >> >> So it's not only on Luke Side, it's come from Solr. Does it sound normal ? >> >> 2017-11-21 11:43 GMT+01:00 Barbet Alain : >>> Hi, >>> >>> I build a custom analyzer & setup it in solr, but doesn't work as I expect. >>> I always get 1 as frequency for each word even if it's present >>> multiple time in the text. >>> >>> So I try with default analyzer & find same behavior: >>> My schema >>> >>> >>> >>> >>> >> stored="true"/> >>> >>> >>> alian@yoda:~/solr> cat add_test.sh >>> DATA=' >>> >>> >>>666 >>>toto titi tata toto tutu titi >>> >>> >>> ' >>> curl -X POST -H 'Content-Type: text/xml' >>> 'http://localhost:8983/solr/alian_test/update?commit=true' >>> --data-binary "$DATA" >>> >>> When I test in solr interface / analyze, I find the right behavior >>> (find titi & toto 2 times). >>> But when I look in solr index with Luke or solr interface / schema, >>> the top term always get 1 as frequency. Can someone give me the thing >>> I forget ? >>> >>> (solr 6.5) >>> >>> Thank you ! >
Re: Custom analyzer & frequency
Hi Alain, As explained in prev mail that is doc frequency and each doc is counted once. I am not sure if Luke can provide you information about overall term frequency - sum of term frequency of all docs. Regards, Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > On 21 Nov 2017, at 16:30, Barbet Alainwrote: > > $ cat add_test.sh > DATA=' > > >666 >toto titi tata toto tutu titi > > > ' > $ sh add_test.sh > > > 0 name="QTime">484 > > > > $ curl > 'http://localhost:8983/solr/alian_test/terms?terms.fl=titi_txt_fr=index' > > > 0 name="QTime">0 name="titi_txt_fr">1 name="titi">11 name="tutu">1 > > > > So it's not only on Luke Side, it's come from Solr. Does it sound normal ? > > 2017-11-21 11:43 GMT+01:00 Barbet Alain : >> Hi, >> >> I build a custom analyzer & setup it in solr, but doesn't work as I expect. >> I always get 1 as frequency for each word even if it's present >> multiple time in the text. >> >> So I try with default analyzer & find same behavior: >> My schema >> >> >> >> >> > stored="true"/> >> >> >> alian@yoda:~/solr> cat add_test.sh >> DATA=' >> >> >>666 >>toto titi tata toto tutu titi >> >> >> ' >> curl -X POST -H 'Content-Type: text/xml' >> 'http://localhost:8983/solr/alian_test/update?commit=true' >> --data-binary "$DATA" >> >> When I test in solr interface / analyze, I find the right behavior >> (find titi & toto 2 times). >> But when I look in solr index with Luke or solr interface / schema, >> the top term always get 1 as frequency. Can someone give me the thing >> I forget ? >> >> (solr 6.5) >> >> Thank you !
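The distinction Emir describes can be illustrated outside Solr with the thread's sample text: term frequency counts every occurrence of a term, while doc frequency counts each document at most once per term.

```shell
# One "document" from the thread's example.
text="toto titi tata toto tutu titi"
# Term frequency of "toto": every occurrence in the document counts.
tf=$(echo "$text" | tr ' ' '\n' | grep -c '^toto$')
# Doc frequency of "toto": the document counts once no matter how often
# the term appears, so deduplicate the terms first.
df=$(echo "$text" | tr ' ' '\n' | sort -u | grep -c '^toto$')
echo "tf=$tf df=$df"   # prints: tf=2 df=1
```

This is exactly the difference between what the analysis page shows (occurrences) and what "Load Term Info" / Luke's top terms show (documents).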
Re: Custom analyzer & frequency
Hi Alain, I haven’t been using Luke UI in a while, but if you are talking about top terms for some field, that might be doc freq, not term freq and every doc is counted once - that is equivalent to “Load Term Info” in “Schema” in Solr Admin console. HTH, Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > On 21 Nov 2017, at 16:21, Barbet Alainwrote: > > Thank you very much for your answer. > > It was an error on copy / paste on my mail sorry about that ! > So it was already a text field, so omitTermFrequenciesAndPosition was > already on “false” > > So I forget my custom analyzer and try to test with an already defined > field_type (text_fr) and see same behaviour in luke ! > So I look better. > On Luke when I took term one by one on "Document" tab, I see my > frequency set to 2. > But in first panel of Luke "overview", in "show top terms" Freq is > still at 1 for all values. > > I use Solr 6.5 & Luke 7.1. It didn't see this behavior if I open a > Lucene base I build outside Solr, I see top terms freq same on 2 > panels. > Do you know a reason for that ? > Does this have an impact on Solr search ? Does bad freq in "top terms" > come from Luke or Solr ? > > > 2017-11-21 12:08 GMT+01:00 Emir Arnautović : >> Hi Alain, >> You did not provided definition of used field type - you use “nametext” type >> and pasted “text_ami” field type. It is possible that you have >> omitTermFrequenciesAndPosition=“true” on nametext field type. The default >> value for text fields should be false. >> >> HTH, >> Emir >> -- >> Monitoring - Log Management - Alerting - Anomaly Detection >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ >> >> >> >>> On 21 Nov 2017, at 11:43, Barbet Alain wrote: >>> >>> Hi, >>> >>> I build a custom analyzer & setup it in solr, but doesn't work as I expect. 
>>> I always get 1 as frequency for each word even if it's present >>> multiple time in the text. >>> >>> So I try with default analyzer & find same behavior: >>> My schema >>> >>> >>> >>> >>> >> stored="true"/> >>> >>> >>> alian@yoda:~/solr> cat add_test.sh >>> DATA=' >>> >>> >>> 666 >>> toto titi tata toto tutu titi >>> >>> >>> ' >>> curl -X POST -H 'Content-Type: text/xml' >>> 'http://localhost:8983/solr/alian_test/update?commit=true' >>> --data-binary "$DATA" >>> >>> When I test in solr interface / analyze, I find the right behavior >>> (find titi & toto 2 times). >>> But when I look in solr index with Luke or solr interface / schema, >>> the top term always get 1 as frequency. Can someone give me the thing >>> I forget ? >>> >>> (solr 6.5) >>> >>> Thank you ! >>
Re: Custom analyzer & frequency
$ cat add_test.sh DATA=' 666 toto titi tata toto tutu titi ' $ sh add_test.sh 0484 $ curl 'http://localhost:8983/solr/alian_test/terms?terms.fl=titi_txt_fr=index' 00 So it's not only on Luke Side, it's come from Solr. Does it sound normal ? 2017-11-21 11:43 GMT+01:00 Barbet Alain: > Hi, > > I build a custom analyzer & setup it in solr, but doesn't work as I expect. > I always get 1 as frequency for each word even if it's present > multiple time in the text. > > So I try with default analyzer & find same behavior: > My schema > > > > >stored="true"/> > > > alian@yoda:~/solr> cat add_test.sh > DATA=' > > > 666 > toto titi tata toto tutu titi > > > ' > curl -X POST -H 'Content-Type: text/xml' > 'http://localhost:8983/solr/alian_test/update?commit=true' > --data-binary "$DATA" > > When I test in solr interface / analyze, I find the right behavior > (find titi & toto 2 times). > But when I look in solr index with Luke or solr interface / schema, > the top term always get 1 as frequency. Can someone give me the thing > I forget ? > > (solr 6.5) > > Thank you !
Re: Custom analyzer & frequency
Thank you very much for your answer. It was an error on copy / paste on my mail sorry about that ! So it was already a text field, so omitTermFrequenciesAndPosition was already on “false” So I forget my custom analyzer and try to test with an already defined field_type (text_fr) and see same behaviour in luke ! So I look better. On Luke when I took term one by one on "Document" tab, I see my frequency set to 2. But in first panel of Luke "overview", in "show top terms" Freq is still at 1 for all values. I use Solr 6.5 & Luke 7.1. It didn't see this behavior if I open a Lucene base I build outside Solr, I see top terms freq same on 2 panels. Do you know a reason for that ? Does this have an impact on Solr search ? Does bad freq in "top terms" come from Luke or Solr ? 2017-11-21 12:08 GMT+01:00 Emir Arnautović: > Hi Alain, > You did not provided definition of used field type - you use “nametext” type > and pasted “text_ami” field type. It is possible that you have > omitTermFrequenciesAndPosition=“true” on nametext field type. The default > value for text fields should be false. > > HTH, > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > >> On 21 Nov 2017, at 11:43, Barbet Alain wrote: >> >> Hi, >> >> I build a custom analyzer & setup it in solr, but doesn't work as I expect. >> I always get 1 as frequency for each word even if it's present >> multiple time in the text. >> >> So I try with default analyzer & find same behavior: >> My schema >> >> >> >> >> > stored="true"/> >> >> >> alian@yoda:~/solr> cat add_test.sh >> DATA=' >> >> >>666 >>toto titi tata toto tutu titi >> >> >> ' >> curl -X POST -H 'Content-Type: text/xml' >> 'http://localhost:8983/solr/alian_test/update?commit=true' >> --data-binary "$DATA" >> >> When I test in solr interface / analyze, I find the right behavior >> (find titi & toto 2 times). 
>> But when I look in solr index with Luke or solr interface / schema, >> the top term always get 1 as frequency. Can someone give me the thing >> I forget ? >> >> (solr 6.5) >> >> Thank you ! >
Re: Merging of index in Solr
On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote: Does anyone knows how long usually the merging in Solr will take? I am currently merging about 3.5TB of data, and it has been running for more than 28 hours and it is not completed yet. The merging is running on SSD disk. The following will apply if you mean Solr's "optimize" feature when you say "merging". In my experience, merging proceeds at about 20 to 30 megabytes per second -- even if the disks are capable of far faster data transfer. Merging is not just copying the data. Lucene is completely rebuilding very large data structures, and *not* including data from deleted documents as it does so. It takes a lot of CPU power and time. If we average the data rates I've seen to 25, then that would indicate that an optimize on a 3.5TB is going to take about 39 hours, and might take as long as 48 hours. And if you're running SolrCloud with multiple replicas, multiply that by the number of copies of the 3.5TB index. An optimize on a SolrCloud collection handles one shard replica at a time and works its way through the entire collection. If you are merging different indexes *together*, which a later message seems to state, then the actual Lucene operation is probably nearly identical, but I'm not really familiar with it, so I cannot say for sure. Thanks, Shawn
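Shawn's 39-hour figure follows from straightforward arithmetic on the rates he quotes, sketched here with 3.5 TB taken as 3,500,000 MB (decimal units):

```shell
# Rewriting 3.5 TB at ~25 MB/s (the average optimize rate quoted above):
size_mb=3500000
rate_mb_per_s=25
seconds=$(( size_mb / rate_mb_per_s ))   # 140000 s
hours=$(( seconds / 3600 ))              # 38 by integer division, i.e. ~39 hours
echo "~${hours} hours"
```

At the slow end of the 20-30 MB/s range the same arithmetic gives roughly 48 hours, matching the upper bound in the message.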
Re: Merging of index in Solr
Hi Edwin, I’ll let somebody with more knowledge about merge to comment merge aspects. What do you use to merge those cores - merge tool or you run it using Solr’s core API? What is the heap size? How many documents are in those two cores? Regards, Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > On 21 Nov 2017, at 14:20, Zheng Lin Edwin Yeowrote: > > Hi Emir, > > Thanks for your reply. > > There are only 1 host, 1 nodes and 1 shard for these 3.5TB. > The merging has already written the additional 3.5TB to another segment. > However, it is still not a single segment, and the size of the folder where > the merged index is supposed to be is now 4.6TB, This excludes the original > 3.5TB, meaning it is already using up 8.1TB of space, but the merging is > still going on. > > The index are currently updates free. We have only index the data in 2 > different collections, and we now need to merge them into a single > collection. > > Regards, > Edwin > > On 21 November 2017 at 16:52, Emir Arnautović > wrote: > >> Hi Edwin, >> How many host/nodes/shard are those 3.5TB? I am not familiar with merge >> code, but trying to think what it might include, so don’t take any of >> following as ground truth. >> Merging for sure will include segments rewrite, so you better have >> additional 3.5TB if you are merging it to a single segment. But that should >> not last days on SSD. My guess is that you are running on the edge of your >> heap and doing a lot GCs and maybe you will OOM at some point. I would >> guess that merging is memory intensive operation and even if not holding >> large structures in memory, it will probably create a lot of garbage. >> Merging requires a lot of comparison so it is also a possibility that you >> are exhausting CPU resources. >> Bottom line - without more details and some monitoring tool, it is hard to >> tell why it is taking that much. 
>> And there is also a question if merging is good choice in you case - is >> index static/updates free? >> >> Regards, >> Emir >> -- >> Monitoring - Log Management - Alerting - Anomaly Detection >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ >> >> >> >>> On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo >> wrote: >>> >>> Hi, >>> >>> Does anyone knows how long usually the merging in Solr will take? >>> >>> I am currently merging about 3.5TB of data, and it has been running for >>> more than 28 hours and it is not completed yet. The merging is running on >>> SSD disk. >>> >>> I am using Solr 6.5.1. >>> >>> Regards, >>> Edwin >> >>
Re: OutOfMemoryError in 6.5.1
On 11/20/2017 6:17 PM, Walter Underwood wrote: When I ran load benchmarks with 6.3.0, an overloaded cluster would get super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting OOMs. That is really bad, because it means we need to reboot every node in the cluster. Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an 8G heap. This is not good behavior in prod. The process goes to the bad place, then we need to wait until someone is paged and kills it manually. Luckily, it usually drops out of the live nodes for each collection and doesn’t take user traffic. There was a bug, fixed long before 6.3.0, where the OOM killer script wasn't working because the arguments enabling it were in the wrong place. It was fixed in 5.5.1 and 6.0. https://issues.apache.org/jira/browse/SOLR-8145 If the scripts that you are using to get Solr started originated with a much older version of Solr than you are currently running, maybe you've got the arguments in the wrong order. Do you see the commandline arguments for the OOM killer (only available on *NIX systems, not Windows) on the admin UI dashboard? If they are properly placed, you will see them on the dashboard, but if they aren't properly placed, then you won't see them. This is what the argument looks like for one of my Solr installs: -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs Something which you probably already know: If you're hitting OOM, you need a larger heap, or you need to adjust the config so it uses less memory. There are no other ways to "fix" OOM problems. Thanks, Shawn
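The argument-placement issue from SOLR-8145 that Shawn describes comes down to where the flag sits on the java command line: JVM flags must precede `-jar start.jar`, because everything after the jar is passed to the application rather than to the JVM. A sketch, using the same hook path as Shawn's example (heap sizes are illustrative):

```shell
# Correct: the OOM hook is a JVM flag, so it must appear before -jar.
# Anything placed after "start.jar" would be ignored by the JVM.
java -Xms8g -Xmx8g \
  '-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs' \
  -jar start.jar --module=http
```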
Re: Merging of index in Solr
Hi Emir,

Thanks for your reply. There is only 1 host, 1 node and 1 shard for these 3.5TB. The merging has already written an additional 3.5TB to another segment. However, it is still not a single segment, and the size of the folder where the merged index is supposed to be is now 4.6TB. This excludes the original 3.5TB, meaning it is already using up 8.1TB of space, but the merging is still going on.

The index is currently update-free. We have only indexed the data into 2 different collections, and we now need to merge them into a single collection.

Regards,
Edwin

On 21 November 2017 at 16:52, Emir Arnautović wrote:
> Hi Edwin,
> How many host/nodes/shard are those 3.5TB? I am not familiar with merge
> code, but trying to think what it might include, so don’t take any of
> following as ground truth.
> Merging for sure will include segments rewrite, so you better have
> additional 3.5TB if you are merging it to a single segment. But that should
> not last days on SSD. My guess is that you are running on the edge of your
> heap and doing a lot GCs and maybe you will OOM at some point. I would
> guess that merging is memory intensive operation and even if not holding
> large structures in memory, it will probably create a lot of garbage.
> Merging requires a lot of comparison so it is also a possibility that you
> are exhausting CPU resources.
> Bottom line - without more details and some monitoring tool, it is hard to
> tell why it is taking that much.
> And there is also a question if merging is good choice in you case - is
> index static/updates free?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > Does anyone knows how long usually the merging in Solr will take?
> > > > I am currently merging about 3.5TB of data, and it has been running for > > more than 28 hours and it is not completed yet. The merging is running on > > SSD disk. > > > > I am using Solr 6.5.1. > > > > Regards, > > Edwin > >
Recovery Issue - Solr 6.6.1 and HDFS
Hi All - we have a system with 45 physical boxes running Solr 6.6.1 using HDFS for the index. The current index size is about 31TBytes. With 3x replication that takes up 93TBytes of disk. Our main collection is split across 100 shards with 3 replicas each.

The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough, plus warnings from the DFSClient. If we stop a node while this is happening, the script will force a stop (180 second timeout) and upon restart we have lock files (write.lock) inside of HDFS.

The process at this point is to start one node, identify the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it can still take 20-60 minutes. The smaller indexes recover much faster (less than 5 minutes).

Should we have not used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster?

Thank you for any insight!

-Joe
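The lock-file hunt described above can be scripted. This is a sketch against a staged local directory (the core name and layout below are made up for the demo); on a real HDFS-backed install the listing step would use the hdfs CLI instead of find, e.g. `hdfs dfs -ls -R /solr | grep write.lock`:

```shell
# Stage a fake core directory, then locate stale Lucene write.lock files in
# it -- the same search you would run for real after a forced stop.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/mycollection_shard1_replica1/data/index"
touch "$DATA_DIR/mycollection_shard1_replica1/data/index/write.lock"
find "$DATA_DIR" -name write.lock
```

Only delete the files this turns up while the node is stopped; removing a lock from under a live core is unsafe.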
Solr 7.x: Issues with unique()/hll() function on a string field nested in a range facet
Hello,

I've encountered 2 issues while trying to apply the unique()/hll() function to a string field inside a range facet:
1. Results are incorrect for a single-valued string field.
2. I’m getting ArrayIndexOutOfBoundsException for a multi-valued string field.

How to reproduce:
1. Create a core based on the default configSet.
2. Add several simple documents to the core, like these:
[
{ "id": "14790", "int_i": 2010, "date_dt": "2010-01-01T00:00:00Z", "string_s": "a", "string_ss": ["a", "b"] },
{ "id": "12254", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["b", "c"] },
{ "id": "12937", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "c", "string_ss": ["c", "d"] },
{ "id": "10575", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "b", "string_ss": ["d", "e"] },
{ "id": "13644", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["e", "a"] },
{ "id": "8405", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "d", "string_ss": ["a", "b"] },
{ "id": "6128", "int_i": 2008, "date_dt": "2008-01-01T00:00:00Z", "string_s": "a", "string_ss": ["b", "c"] },
{ "id": "5220", "int_i": 2015, "date_dt": "2015-01-01T00:00:00Z", "string_s": "d", "string_ss": ["c", "d"] },
{ "id": "6850", "int_i": 2012, "date_dt": "2012-01-01T00:00:00Z", "string_s": "b", "string_ss": ["d", "e"] },
{ "id": "5748", "int_i": 2014, "date_dt": "2014-01-01T00:00:00Z", "string_s": "e", "string_ss": ["e", "a"] }
]
3.
Try queries like the following for a single-valued string field:

q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_s)"}}}}

q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_s)"}}}}

The distinct counts returned are incorrect in general. For example, for the set of documents above, the response will contain:

{ "val": 2010, "count": 1, "distinct_count": 0 }

and

"between": { "count": 10, "distinct_count": 1 }

(there should be 5 distinct values). Note, the result depends on the order in which the documents are added.

4. Try queries like the following for a multi-valued string field:

q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}

q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}

I’m getting ArrayIndexOutOfBoundsException for such queries.

Note, everything looks OK for other field types (I tried single- and multi-valued ints, doubles and dates), or when the enclosing facet is a terms facet, or when there is no enclosing facet at all. I can reproduce these issues on both Solr 7.0.1 and 7.1.0. Solr 6.x and 5.x, as it seems, do not have these issues. Is it a bug? Or maybe I’ve missed something?
Thanks,
Volodymyr
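As a sanity check on the expected numbers in the report above: deduplicating the string_s values of the ten sample documents (copied here by hand, in insertion order) reproduces the distinct count the "between" bucket should report:

```shell
# string_s values of the ten sample docs, in insertion order.
# Deduplicating them gives the correct bucket-wide distinct count.
printf '%s\n' a e c b e d a d b e | sort -u | wc -l
```

This prints 5 (the values a-e), which is what unique(string_s) should return for the all-buckets "between" range, rather than the 1 Solr 7.x reports.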
Re: Issue facing with spell text field containing hyphen
I was about to suggest the same - the Analysis panel is the savior in such cases of doubt.

-Atita

On Tue, Nov 21, 2017 at 7:26 AM, Rick Leir wrote:
> Chirag
> Look in Solr Admin, the Analysis panel. Put spider-man in the left and
> right text inputs, and see how it gets analysed. Cheers -- Rick
>
> On November 20, 2017 10:00:49 PM EST, Chirag garg wrote:
> >Hi Rick,
> >
> >Actually my spell field also contains text with hyphen i.e. it contains
> >"spider-man" even then also i am not able to search it.
> >
> >Regards,
> >Chirag
> >
> >--
> >Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: Issue facing with spell text field containing hyphen
Chirag
Look in Solr Admin, the Analysis panel. Put spider-man in the left and right text inputs, and see how it gets analysed.
Cheers -- Rick

On November 20, 2017 10:00:49 PM EST, Chirag garg wrote:
>Hi Rick,
>
>Actually my spell field also contains text with hyphen i.e. it contains
>"spider-man" even then also i am not able to search it.
>
>Regards,
>Chirag
>
>--
>Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: Please help me with solr plugin
Zara,
If you're looking for custom search components, request handlers or update processors, you can check out my GitHub repo with examples here: https://github.com/bdalal/SolrPluginsExamples/

On Tue, Nov 21, 2017 at 3:58 PM Emir Arnautović < emir.arnauto...@sematext.com> wrote:
> Hi Zara,
> What sort of plugins are you trying to build? What sort of issues did you
> run into? Maybe you are not too far from having a running custom plugin. I
> would recommend you try running some of the existing plugins as your own - just
> to make sure that you are able to build and configure a custom plugin. After
> that you can concentrate on the custom logic.
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 21 Nov 2017, at 11:22, Zara Parst wrote:
> >
> > Hi,
> >
> > I have spent too much time learning plugin for Solr. I am about to give up.
> > If someone has experience writing it, please contact me. I am open to all
> > options. I want to learn it at any cost.
> >
> > Thanks
> > Zara

--
Regards,
Binoy Dalal
Re: Custom analyzer & frequency
Hi Alain,
You did not provide the definition of the field type used - you use the “nametext” type but pasted the “text_ami” field type. It is possible that you have omitTermFreqAndPositions=“true” on the nametext field type. The default value for text fields should be false.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 21 Nov 2017, at 11:43, Barbet Alain wrote:
>
> Hi,
>
> I build a custom analyzer & setup it in solr, but doesn't work as I expect.
> I always get 1 as frequency for each word even if it's present
> multiple time in the text.
>
> So I try with default analyzer & find same behavior:
> My schema
>
> alian@yoda:~/solr> cat add_test.sh
> DATA='
> 666
> toto titi tata toto tutu titi
> '
> curl -X POST -H 'Content-Type: text/xml'
> 'http://localhost:8983/solr/alian_test/update?commit=true'
> --data-binary "$DATA"
>
> When I test in solr interface / analyze, I find the right behavior
> (find titi & toto 2 times).
> But when I look in solr index with Luke or solr interface / schema,
> the top term always get 1 as frequency. Can someone give me the thing
> I forget ?
>
> (solr 6.5)
>
> Thank you !
Custom analyzer & frequency
Hi,

I built a custom analyzer & set it up in Solr, but it doesn't work as I expect. I always get 1 as the frequency for each word even if it's present multiple times in the text.

So I tried with the default analyzer & found the same behavior:

My schema

alian@yoda:~/solr> cat add_test.sh
DATA='
666
toto titi tata toto tutu titi
'
curl -X POST -H 'Content-Type: text/xml' 'http://localhost:8983/solr/alian_test/update?commit=true' --data-binary "$DATA"

When I test in the Solr interface / Analysis, I find the right behavior (titi & toto are found 2 times). But when I look in the Solr index with Luke or the Solr interface / Schema, the top term always gets 1 as the frequency. Can someone tell me what I forgot?

(Solr 6.5)

Thank you!
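For reference, here is a quick shell approximation of what the analysis chain - and therefore the index - should report for the sample field value, assuming simple whitespace tokenization:

```shell
# Whitespace-tokenize the sample field value and count term frequencies.
# "toto" and "titi" should each come out with a frequency of 2, the rest 1.
echo 'toto titi tata toto tutu titi' | tr ' ' '\n' | sort | uniq -c | sort -rn
```

If the index shows 1 for every term instead of these counts, term frequencies are being dropped at index time rather than by the analyzer.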
Re: Please help me with solr plugin
Hi Zara,
What sort of plugins are you trying to build? What sort of issues did you run into? Maybe you are not too far from having a running custom plugin. I would recommend you try running some of the existing plugins as your own - just to make sure that you are able to build and configure a custom plugin. After that you can concentrate on the custom logic.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 21 Nov 2017, at 11:22, Zara Parst wrote:
>
> Hi,
>
> I have spent too much time learning plugin for Solr. I am about to give up.
> If someone has experience writing it, please contact me. I am open to all
> options. I want to learn it at any cost.
>
> Thanks
> Zara
Please help me with solr plugin
Hi,

I have spent too much time learning plugin development for Solr. I am about to give up. If someone has experience writing plugins, please contact me. I am open to all options. I want to learn it at any cost.

Thanks
Zara
NullPointerException in PeerSync.handleUpdates
Hi,

We are running the 6.2 version of Solr and hitting this error frequently.

Error while trying to recover. core=my_core:java.lang.NullPointerException
        at org.apache.solr.update.PeerSync.handleUpdates(PeerSync.java:605)
        at org.apache.solr.update.PeerSync.handleResponse(PeerSync.java:344)
        at org.apache.solr.update.PeerSync.sync(PeerSync.java:257)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Is this a known issue, fixed in some newer version?

Thanks
SG
Re: Merging of index in Solr
Hi Edwin,
How many hosts/nodes/shards are those 3.5TB? I am not familiar with the merge code, but I am trying to think through what it might involve, so don’t take any of the following as ground truth.
Merging will certainly involve a segment rewrite, so you had better have an additional 3.5TB available if you are merging to a single segment. But that should not take days on SSD. My guess is that you are running on the edge of your heap and doing a lot of GCs, and maybe you will OOM at some point. I would guess that merging is a memory-intensive operation, and even if it does not hold large structures in memory, it will probably create a lot of garbage. Merging also requires a lot of comparisons, so it is possible that you are exhausting CPU resources.
Bottom line - without more details and some monitoring tool, it is hard to tell why it is taking that long.
And there is also the question of whether merging is a good choice in your case - is the index static/update-free?

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo wrote:
>
> Hi,
>
> Does anyone knows how long usually the merging in Solr will take?
>
> I am currently merging about 3.5TB of data, and it has been running for
> more than 28 hours and it is not completed yet. The merging is running on
> SSD disk.
>
> I am using Solr 6.5.1.
>
> Regards,
> Edwin
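Emir's disk-space point can be made concrete with a back-of-envelope sketch (the 2x factor is a rough rule of thumb for a full rewrite down to a single segment, not an exact figure):

```shell
# Rough worst-case disk headroom for merging a 3.5TB index into one segment:
# the old segments stay on disk until the rewritten segment is complete.
INDEX_TB=3.5
awk -v sz="$INDEX_TB" 'BEGIN { printf "budget at least %.1f TB free during the merge\n", sz * 2 }'
```

Deleted-document overhead and merging across collections (as in Edwin's case) can push the transient footprint higher still, which is consistent with the 8.1TB observed mid-merge.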