Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
This happened when the second time I'm performing restart.  But after that,
every time this collection is stuck at here.  If I restart the leader node
as well, the core can get out of the recovering state

On Mon, May 16, 2016 at 5:00 PM, Li Ding <li.d...@bloomreach.com> wrote:

> Hi Anshum,
>
> This is for restart solr with 1000 collections.  I created an environment
> with 1023 collections today All collections are empty.  During repeated
> restart test, one of the cores are marked as "recovering" and stuck there
> for ever.   The solr is 4.6.1 and we have 3 zk hosts and 8 solr hosts, here
> is the relevant logs:
>
> ---This is the logs for the core stuck at "recovering"
>
> INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=down
>
> INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2]  CLOSING SolrCore
> org.apache.solr.core.SolrCore@1e48619
>
> INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2] Closing main searcher on request.
>
> INFO  - 2016-05-16 22:47:06.001;
> org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt
> /solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index
> [CachedDir<

Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
Hi Anshum,

This is for restart solr with 1000 collections.  I created an environment
with 1023 collections today All collections are empty.  During repeated
restart test, one of the cores are marked as "recovering" and stuck there
for ever.   The solr is 4.6.1 and we have 3 zk hosts and 8 solr hosts, here
is the relevant logs:

---This is the logs for the core stuck at "recovering"

INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
publishing core=test_collection_112_shard1_replica2 state=down

INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2]  CLOSING SolrCore
org.apache.solr.core.SolrCore@1e48619

INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2] Closing main searcher on request.

INFO  - 2016-05-16 22:47:06.001;
org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt
/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index

Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
Hi all,

We have an unique scenario where we don't need leaders in every collection
to recover from failures.  The indexing never changes.  But we have faced
problems where either zk marked a core as down while the core is fine in
non-distributed query or during restart, the core never comes up.  My
question is that is there any simple way to disable those leaders and
leaders election in SolrCloud,  We do use multi-shard and distributed
queries.  But with our unique situation, we don't need leaders to maintain
the correct status of the index.  So if we can get rid of that part, our
solr restart will be more robust.

Any suggestions will be appreciated.

Thanks,

Li


Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-27 Thread Li Ding
Hi Erick,

I don't have the GC log.  But after the GC finished.  Isn't zk ping
succeeds and the core should be back to normal state?  From the log I
posted.  The sequence is:

1) Solr Detects itself can't connect to ZK and reconnect to ZK
2) Solr marked all cores are down
3) Solr recovery each cores, some succeeds, some failed.
4) After 30 minutes, the cores that are failed still marked as down.

So my questions is, during the 30 minutes interval, if GC takes too long,
all cores should failed.  And GC doesn't take longer than a minute since
all serving requests to other calls succeeds and the next zk ping should
bring the core back to normal? right?  We have an active monitor running at
the same time querying every core in distrib=false mode and every query
succeeds.

Thanks,

Li

On Tue, Apr 26, 2016 at 6:20 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> One of the reasons this happens is if you have very
> long GC cycles, longer than the Zookeeper "keep alive"
> timeout. During a full GC pause, Solr is unresponsive and
> if the ZK ping times out, ZK assumes the machine is
> gone and you get into this recovery state.
>
> So I'd collect GC logs and see if you have any
> stop-the-world GC pauses that take longer than the ZK
> timeout.
>
> see Mark Millers primer on GC here:
> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>
> Best,
> Erick
>
> On Tue, Apr 26, 2016 at 2:13 PM, Li Ding <li.d...@bloomreach.com> wrote:
> > Thank you all for your help!
> >
> > The zookeeper log rolled over, thisis from Solr.log:
> >
> > Looks like the solr and zk connection is gone for some reason
> >
> > INFO  - 2016-04-21 12:37:57.536;
> > org.apache.solr.common.cloud.ConnectionManager; Watcher
> > org.apache.solr.common.cloud.ConnectionManager@19789a96
> > name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
> > state:Disconnected type:None path:null path:null type:None
> >
> > INFO  - 2016-04-21 12:37:57.536;
> > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
> >
> > INFO  - 2016-04-21 12:38:24.248;
> > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection
> expired
> > - starting a new one...
> >
> > INFO  - 2016-04-21 12:38:24.262;
> > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
> > connect to ZooKeeper
> >
> > INFO  - 2016-04-21 12:38:24.269;
> > org.apache.solr.common.cloud.ConnectionManager; Connected:true
> >
> >
> > Then it publishes all cores on the hosts are down.  I just list three
> cores
> > here:
> >
> > INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
> > publishing core=product1_shard1_replica1 state=down
> >
> > INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
> > publishing core=collection1 state=down
> >
> > INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
> > numShards not found on descriptor - reading it from system property
> >
> > INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
> > publishing core=product2_shard5_replica1 state=down
> >
> > INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
> > publishing core=product2_shard13_replica1 state=down
> >
> >
> > product1 has only one shard one replica and it's able to be active
> > successfully:
> >
> > INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
> > Register replica - core:product1_shard1_replica1 address:http://
> > {internalIp}:8983/solr collection:product1 shard:shard1
> >
> > WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
> > cancelElection did not find election node to remove
> >
> > INFO  - 2016-04-21 12:38:26.393;
> > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
> > process for shard shard1
> >
> > INFO  - 2016-04-21 12:38:26.399;
> > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found
> to
> > continue.
> >
> > INFO  - 2016-04-21 12:38:26.399;
> > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new
> leader -
> > try and sync
> >
> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
> > replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
> >
> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
> > Success - now sync replicas to me
> >
> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy;
> > http://{internalIp}:898

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-26 Thread Li Ding
;
org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
shard5

INFO  - 2016-04-21 12:38:26.632; org.apache.solr.common.cloud.SolrZkClient;
makePath: /collections/product2_shard5_replica1/leaders/shard5

INFO  - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We are
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ and
leader is http://{internalIp}:8983/solr
product2_shard5_replica1_shard5_replica1/

INFO  - 2016-04-21 12:38:26.646;
org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
ZooKeeper...


Before I restarted this server, a bunch of queries failed for this
collection product2.  But I don't think it will affect the core status.


Do you guys have any idea about why this particular core is not published
as active since from the log, most steps are done except the very last one
to publish info to ZK.


Thanks,


Li
On Thu, Apr 21, 2016 at 7:08 AM, Rajesh Hazari <rajeshhaz...@gmail.com>
wrote:

> Hi Li,
>
> Do you see timeouts liek "CLUSTERSTATUS the collection time out:180s"
> if its the case, this may be related to
> https://issues.apache.org/jira/browse/SOLR-7940,
> and i would say either use the patch file or upgrade.
>
>
> *Thanks,*
> *Rajesh,*
> *8328789519,*
> *If I don't answer your call please leave a voicemail with your contact
> info, *
> *will return your call ASAP.*
>
> On Thu, Apr 21, 2016 at 6:02 AM, YouPeng Yang <yypvsxf19870...@gmail.com>
> wrote:
>
> > Hi
> >We have used Solr4.6 for 2 years,If you post more logs ,maybe we can
> > fixed it.
> >
> > 2016-04-21 6:50 GMT+08:00 Li Ding <li.d...@bloomreach.com>:
> >
> > > Hi All,
> > >
> > > We are using SolrCloud 4.6.1.  We have observed following behaviors
> > > recently.  A Solr node in a Solrcloud cluster is up but some of the
> cores
> > > on the nodes are marked as down in Zookeeper.  If the cores are parts
> of
> > a
> > > multi-sharded collection with one replica,  the queries to that
> > collection
> > > will fail.  However, when this happened, if we issue queries to the
> core
> > > directly, it returns 200 and correct info.  But once Solr got into the
> > > state, the core will be marked down forever unless we do a restart on
> > Solr.
> > >
> > > Has anyone seen this behavior before?  Is there any to get out of the
> > state
> > > on its own?
> > >
> > > Thanks,
> > >
> > > Li
> > >
> >
>


Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-20 Thread Li Ding
Hi All,

We are using SolrCloud 4.6.1.  We have observed following behaviors
recently.  A Solr node in a Solrcloud cluster is up but some of the cores
on the nodes are marked as down in Zookeeper.  If the cores are parts of a
multi-sharded collection with one replica,  the queries to that collection
will fail.  However, when this happened, if we issue queries to the core
directly, it returns 200 and correct info.  But once Solr got into the
state, the core will be marked down forever unless we do a restart on Solr.

Has anyone seen this behavior before?  Is there any to get out of the state
on its own?

Thanks,

Li