Why does CLUSTERSTATUS return different information than the web cloud view?

2014-08-23 Thread Nathan Neulinger

In particular, a shard being 'active' vs. 'gone'.

The web UI is clearly showing the given replicas as being in Gone state when I shut down a server, yet
CLUSTERSTATUS says that each replica has state: active.


Is there any way to ask it for status that will reflect that the replica is 
gone?

This is with 4.8.0.

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Why does CLUSTERSTATUS return different information than the web cloud view?

2014-08-23 Thread Nathan Neulinger
Is there a way to query the 'live node' state without sending a query to every node myself? i.e. to get the same data 
that is used for that cloud status screen?


-- Nathan

On 08/23/2014 06:39 PM, Mark Miller wrote:

The state is actually a combo of the state in clusterstate and the live nodes. 
If the live node is not there, it's gone regardless of the last state it 
published.

- Mark
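Mark's rule (the published state only counts if the hosting node is live) can be sketched in a few lines of Python. This is a hypothetical helper, not code from Solr; the field names such as node_name and the JSON layout are assumptions based on typical clusterstate output:

```python
# Sketch: derive the "effective" replica state the way the cloud UI does,
# by combining the published replica state with the live_nodes list.

def effective_state(replica: dict, live_nodes: set) -> str:
    """A replica is only as alive as its node: if the node is absent from
    live_nodes, treat the replica as 'gone' regardless of published state."""
    if replica.get("node_name") not in live_nodes:
        return "gone"
    return replica.get("state", "unknown")

# Hypothetical usage against a parsed CLUSTERSTATUS response (layout assumed):
# status = json.loads(urllib.request.urlopen(
#     "http://host:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json").read())
# live = set(status["cluster"]["live_nodes"])
# for coll in status["cluster"]["collections"].values():
#     for shard in coll["shards"].values():
#         for replica in shard["replicas"].values():
#             print(replica["node_name"], effective_state(replica, live))
```

If the CLUSTERSTATUS response includes a live_nodes list (as later Solr versions do), one call may be enough to do the cross-check without querying every node.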


On Aug 23, 2014, at 6:00 PM, Nathan Neulinger nn...@neulinger.org wrote:

In particular, a shard being 'active' vs. 'gone'.

The web UI is clearly showing the given replicas as being in Gone state when I shut
down a server, yet CLUSTERSTATUS says that each replica has state: active.

Is there any way to ask it for status that will reflect that the replica is 
gone?

This is with 4.8.0.

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: /solr/admin/ping causing exceptions in log?

2014-07-28 Thread Nathan Neulinger

Thing is - I wouldn't expect any of the default options mentioned to change the 
behavior intermittently.

i.e. it's working for 95% of the health check requests, it's just the intermittent ones that seem to be cut off... I'm 
inquiring with haproxy devs since it appears that at least one other person on #haproxy is seeing the same behavior. 
Doesn't appear to be specific to solr.


-- Nathan

On 07/27/2014 10:44 PM, Shawn Heisey wrote:

On 7/27/2014 7:23 PM, Nathan Neulinger wrote:

Unfortunately, doesn't look like this clears the symptom.

The ping is responding almost instantly every time. I've tried setting a
15 second timeout on the check, with no change in occurrences of the error.

Looking at a packet capture on the server side, there is a clear
distinction between working and failing/error-triggering connections.

It looks like in a working case, I see two packets immediately back to
back (one with header, and next a continuation with content) with no ack
in between, followed by ack, rst+ack, rst.

In the failing request, I see the GET request, acked, then the HTTP/1.1
200 OK response from Solr, a single ack, and then an almost
instantaneous reset sent by the client.


I'm only seeing this on traffic to/from haproxy checks. If I do a simple:

 while [ true ]; do curl -s http://host:8983/solr/admin/ping; done

from the same box, that flood runs with generally 10-20ms request times
and zero errors.


I won't claim to understand what's going on here, but it might be a
matter of the haproxy options.  Here are the options I'm using in the
defaults section of the config:

defaults
 log global
 mode    http
 option  httplog
 option  dontlognull
 option  redispatch
 option  abortonclose
 option  http-server-close
 option  http-pretend-keepalive
 retries 1
 maxconn 1024
 timeout connect 1s
 timeout client  5s
 timeout server  30s

One bit of information I came across when I first started setting
haproxy up for Solr is that servlet containers like Jetty and Tomcat
require the http-pretend-keepalive option to work properly.  Are you
using this option?

Thanks,
Shawn



--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: /solr/admin/ping causing exceptions in log?

2014-07-27 Thread Nathan Neulinger
Cool. That's likely exactly it, since I don't have one set, it's using the check interval, and occasionally must just be 
too short.
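For reference, pinning the check timeout explicitly in haproxy, rather than letting it default to the check interval, might look like this (the backend name, server names, addresses, and interval values here are hypothetical):

```
backend solr_nodes
    option httpchk GET /solr/admin/ping
    timeout check 15s
    server solr1 10.0.0.1:8983 check inter 5s
    server solr2 10.0.0.2:8983 check inter 5s
```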


Thank you!

-- Nathan



I assume that this is the httpchk config to make sure that the server is
operational.  If so, you need to increase the timeout check value,
because it is too small.  The ping request is taking longer to run than
you have allowed in the timeout.  Here's part of my haproxy config:



--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: /solr/admin/ping causing exceptions in log?

2014-07-27 Thread Nathan Neulinger

Unfortunately, doesn't look like this clears the symptom.

The ping is responding almost instantly every time. I've tried setting a 15 second timeout on the check, with no change 
in occurrences of the error.


Looking at a packet capture on the server side, there is a clear distinction between working and 
failing/error-triggering connections.


It looks like in a working case, I see two packets immediately back to back (one with header, and next a continuation 
with content) with no ack in between, followed by ack, rst+ack, rst.


In the failing request, I see the GET request, acked, then the HTTP/1.1 200 OK response from Solr, a single ack, and
then an almost instantaneous reset sent by the client.



I'm only seeing this on traffic to/from haproxy checks. If I do a simple:

while [ true ]; do curl -s http://host:8983/solr/admin/ping; done

from the same box, that flood runs with generally 10-20ms request times and 
zero errors.

-- Nathan

On 07/27/2014 07:12 PM, Nathan Neulinger wrote:

Cool. That's likely exactly it, since I don't have one set, it's using the 
check interval, and occasionally must just be
too short.

Thank you!

-- Nathan



I assume that this is the httpchk config to make sure that the server is
operational.  If so, you need to increase the timeout check value,
because it is too small.  The ping request is taking longer to run than
you have allowed in the timeout.  Here's part of my haproxy config:





--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


solr-working.cap
Description: application/vnd.tcpdump.pcap


solr-cutoff2.cap
Description: application/vnd.tcpdump.pcap


Re: /solr/admin/ping causing exceptions in log?

2014-07-27 Thread Nathan Neulinger

Either way, looks like this is not a SOLR issue, but rather haproxy.

Thanks.

-- Nathan

On 07/27/2014 08:23 PM, Nathan Neulinger wrote:

Unfortunately, doesn't look like this clears the symptom.

The ping is responding almost instantly every time. I've tried setting a 15 
second timeout on the check, with no change
in occurrences of the error.

Looking at a packet capture on the server side, there is a clear distinction 
between working and
failing/error-triggering connections.

It looks like in a working case, I see two packets immediately back to back 
(one with header, and next a continuation
with content) with no ack in between, followed by ack, rst+ack, rst.

In the failing request, I see the GET request, acked, then the HTTP/1.1 200 OK
response from Solr, a single ack, and
then an almost instantaneous reset sent by the client.


I'm only seeing this on traffic to/from haproxy checks. If I do a simple:

 while [ true ]; do curl -s http://host:8983/solr/admin/ping; done

from the same box, that flood runs with generally 10-20ms request times and 
zero errors.

-- Nathan

On 07/27/2014 07:12 PM, Nathan Neulinger wrote:

Cool. That's likely exactly it, since I don't have one set, it's using the 
check interval, and occasionally must just be
too short.

Thank you!

-- Nathan



I assume that this is the httpchk config to make sure that the server is
operational.  If so, you need to increase the timeout check value,
because it is too small.  The ping request is taking longer to run than
you have allowed in the timeout.  Here's part of my haproxy config:







--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


/solr/admin/ping causing exceptions in log?

2014-07-26 Thread Nathan Neulinger
Recently deployed haproxy in front of my solr instances, and seeing a large number of exceptions in the logs now... 
Example below. I can pound the server with requests against /solr/admin/ping via curl, with no obvious issue, but the 
haproxy checks appear to be aggravating something.


Solr 4.8.0 w/ solr cloud, 2 nodes, 3 zk, linux x86_64

It seems like when the issue occurs, I get a set of the errors all in a burst 
(below), never just one.

Suggestions?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412



2014-07-26 23:04:36,506 ERROR qtp1532385072-4864 [g.apache.solr.servlet.SolrDispatchFilter]  - 
null:org.eclipse.jetty.io.EofException

at 
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
at 
org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:443)
at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:100)
at 
org.eclipse.jetty.server.AbstractHttpConnection$Output.flush(AbstractHttpConnection.java:1094)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at org.apache.solr.util.FastWriter.flush(FastWriter.java:137)
at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:763)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:431)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:339)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at 
org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375)
at 
org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164)
at 
org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:194)
at 
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)
... 36 more

2014-07-26 23:04:36,513 ERROR qtp1532385072-4864 [g.apache.solr.servlet.SolrDispatchFilter]  - 
null:org.eclipse.jetty.io.EofException

at 
org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
at 
org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:443

Re: /solr/admin/ping causing exceptions in log?

2014-07-26 Thread Nathan Neulinger
Tried changing to use /solr/admin/cores instead as a test - still see the
same issue, though much less frequent.


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412




On Sat, Jul 26, 2014 at 6:15 PM, Nathan Neulinger nn...@neulinger.org
wrote:

 Recently deployed haproxy in front of my solr instances, and seeing a
 large number of exceptions in the logs now... Example below. I can pound
 the server with requests against /solr/admin/ping via curl, with no obvious
 issue, but the haproxy checks appear to be aggravating something.

 Solr 4.8.0 w/ solr cloud, 2 nodes, 3 zk, linux x86_64

 It seems like when the issue occurs, I get a set of the errors all in a
 burst (below), never just one.

 Suggestions?

 -- Nathan

 
 Nathan Neulinger   nn...@neulinger.org
 Neulinger Consulting   (573) 612-1412



 2014-07-26 23:04:36,506 ERROR qtp1532385072-4864 
 [g.apache.solr.servlet.SolrDispatchFilter]
  - null:org.eclipse.jetty.io.EofException
 at org.eclipse.jetty.http.HttpGenerator.flushBuffer(
 HttpGenerator.java:914)
 at org.eclipse.jetty.http.AbstractGenerator.flush(
 AbstractGenerator.java:443)
 at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:100)
 at org.eclipse.jetty.server.AbstractHttpConnection$Output.
 flush(AbstractHttpConnection.java:1094)
 at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
 at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
 at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
 at org.apache.solr.util.FastWriter.flush(FastWriter.java:137)
 at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(
 SolrDispatchFilter.java:763)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:431)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:339)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:207)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.
 doFilter(ServletHandler.java:1419)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(
 ServletHandler.java:455)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(
 ScopedHandler.java:137)
 at org.eclipse.jetty.security.SecurityHandler.handle(
 SecurityHandler.java:557)
 at org.eclipse.jetty.server.session.SessionHandler.
 doHandle(SessionHandler.java:231)
 at org.eclipse.jetty.server.handler.ContextHandler.
 doHandle(ContextHandler.java:1075)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(
 ServletHandler.java:384)
 at org.eclipse.jetty.server.session.SessionHandler.
 doScope(SessionHandler.java:193)
 at org.eclipse.jetty.server.handler.ContextHandler.
 doScope(ContextHandler.java:1009)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(
 ScopedHandler.java:135)
 at org.eclipse.jetty.server.handler.ContextHandlerCollection.
 handle(ContextHandlerCollection.java:255)
 at org.eclipse.jetty.server.handler.HandlerCollection.
 handle(HandlerCollection.java:154)
 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
 HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(
 AbstractHttpConnection.java:489)
 at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(
 BlockingHttpConnection.java:53)
 at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(
 AbstractHttpConnection.java:942)
 at org.eclipse.jetty.server.AbstractHttpConnection$
 RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
 at org.eclipse.jetty.http.HttpParser.parseNext(
 HttpParser.java:640)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(
 HttpParser.java:235)
 at org.eclipse.jetty.server.BlockingHttpConnection.handle(
 BlockingHttpConnection.java:72)
 at org.eclipse.jetty.server.bio.SocketConnector$
 ConnectorEndPoint.run(SocketConnector.java:264)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
 QueuedThreadPool.java:608)
 at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(
 QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.net.SocketException: Connection reset
 at java.net.SocketOutputStream.socketWrite(
 SocketOutputStream.java:118)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
 at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(
 ByteArrayBuffer.java:375)
 at org.eclipse.jetty.io.bio.StreamEndPoint.flush(
 StreamEndPoint.java:164

Re: problem with replication/solrcloud - getting 'missing required field' during update intermittently (SOLR-6251)

2014-07-16 Thread Nathan Neulinger
FYI. We finally tracked down the problem at least 99.9% sure at this point, and it was staring me in the face the 
whole time - just never noticed:


[{"id": "4b2c4d09-31e2-4fe2-b767-3868efbdcda1", "channel": {"add": "preet"}, "channel": {"add": "adam"}}]

Look at the JSON... It's trying to add two channel array elements... Should 
have been:

[{"id": "4b2c4d09-31e2-4fe2-b767-3868efbdcda1", "channel": {"add": "preet"}},
 {"id": "4b2c4d09-31e2-4fe2-b767-3868efbdcda1", "channel": {"add": "adam"}}]

I half wonder how it chose to interpret that particular chunk of json, but either way, I think the origin of our issue 
is resolved.



From what I'm reading on JSON, this isn't valid syntax at all. I'm guessing that Solr doesn't actually validate the
JSON, and its parser is just creating something weird in that situation, like a new request for a whole new document.
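Strictly, most parsers tolerate duplicate keys rather than rejecting them (the JSON spec only says names should be unique). Solr's own parser may behave differently, but the silent data loss is easy to reproduce with Python's json module, which simply keeps the last occurrence:

```python
import json

# Duplicate keys are not rejected by typical JSON parsers; Python's json
# module silently keeps the LAST occurrence, so the first "add" vanishes.
doc = ('[{"id": "4b2c4d09-31e2-4fe2-b767-3868efbdcda1",'
       ' "channel": {"add": "preet"}, "channel": {"add": "adam"}}]')
parsed = json.loads(doc)
print(parsed)  # the "preet" add is silently dropped; only "adam" survives
```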


-- Nathan


On 07/15/2014 07:19 PM, Nathan Neulinger wrote:

Issue was closed in Jira requesting it be discussed here first. Looking for any 
diagnostic assistance on this issue with
4.8.0 since it is intermittent and occurs without warning.

Setup is two nodes, with external zk ensemble. Nodes are accessed round-robin 
on EC2 behind an ELB.

Schema has:

<schema name="hive" version="1.5">
...
<field name="timestamp" type="long" indexed="false" stored="true" required="true"
       multiValued="false" omitNorms="true" />
...


Most of the updates are working without issue, but randomly we'll get the above 
failure, even though searches before and
after the update clearly indicate that the document had the timestamp field in 
it. The error occurs when the second node
does its distrib operation against the first node.

Diagnostic details are all in the jira issue. Can provide more as needed, but 
would appreciate any suggestions on what
to try or to help diagnose this other than just trying to throw thousands of 
requests at it in round-robin between the
two instances to see if it's possible to reproduce the issue.

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


problem with replication/solrcloud - getting 'missing required field' during update intermittently (SOLR-6251)

2014-07-15 Thread Nathan Neulinger
Issue was closed in Jira requesting it be discussed here first. Looking for any diagnostic assistance on this issue with 
4.8.0 since it is intermittent and occurs without warning.


Setup is two nodes, with external zk ensemble. Nodes are accessed round-robin 
on EC2 behind an ELB.

Schema has:

<schema name="hive" version="1.5">
...
<field name="timestamp" type="long" indexed="false" stored="true" required="true"
       multiValued="false" omitNorms="true" />

...


Most of the updates are working without issue, but randomly we'll get the above failure, even though searches before and 
after the update clearly indicate that the document had the timestamp field in it. The error occurs when the second node 
does its distrib operation against the first node.


Diagnostic details are all in the jira issue. Can provide more as needed, but would appreciate any suggestions on what 
to try or to help diagnose this other than just trying to throw thousands of requests at it in round-robin between the 
two instances to see if it's possible to reproduce the issue.


-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: What is the right way to bring a failed SolrCloud node back online?

2014-01-26 Thread Nathan Neulinger
Thanks, yeah, I did just that, and sent the script in on SOLR-5665 if anyone wants a copy. The script is trivial, but
you're welcome to stick it in contrib or something if it's at all useful to anyone.


-- Nathan

On 01/26/2014 08:28 AM, Mark Miller wrote:

We are working on a new mode (which should become the default) where ZooKeeper 
will be treated as the truth for a cluster.

This mode will be able to handle situations like this - if the cluster state 
says a core should exist on a node and it doesn’t, it will be created on 
startup.

The way things work currently is this kind of hybrid situation where the truth 
is partly in ZooKeeper partly on each node. This is not ideal at all.

I think this new mode is very important, and it will be coming shortly. Until 
then, I’d recommend writing this logic externally as you suggest (I’ve seen it 
done before).

- Mark

http://about.me/markrmiller
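The external logic Mark recommends might look roughly like the sketch below. The clusterstate layout and core naming here are assumptions for illustration; the actual script Nathan posted is attached to SOLR-5665:

```python
# Sketch: find cores that the cluster state says should live on a node
# but that the node does not actually have, so they can be recreated.

def missing_cores(clusterstate: dict, node_name: str, local_cores: set) -> list:
    """Return (collection, core) pairs assigned to node_name in the
    cluster state but absent from the node's local core list."""
    missing = []
    for coll_name, coll in clusterstate.items():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("node_name") == node_name:
                    core = replica.get("core")
                    if core and core not in local_cores:
                        missing.append((coll_name, core))
    return missing

# Each missing core could then be recreated with a CoreAdmin call along
# the lines of:
#   /solr/admin/cores?action=CREATE&name=<core>&collection=<collection>&shard=<shard>
# after which SolrCloud should start recovery for it.
```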

On Jan 24, 2014, at 12:01 PM, Nathan Neulinger nn...@neulinger.org wrote:


I have an environment where new collections are being added frequently 
(isolated per customer), and the backup is virtually guaranteed to be missing 
some of them.

As it stands, bringing up the restored/out-of-date instance results in those
collections being stuck in 'Recovering' state, because the cores don't exist on 
the resulting server. This can also be extended to the case of restoring a 
completely blank instance.

Is there any way to tell SolrCloud "Try recreating any missing cores for this
collection based on where you know they should be located"?

Or do I need to actually determine a list of cores (..._shardX_replicaY) and 
trigger the core creates myself, at which point I gather that it will start 
recovery for each of them?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412




--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Replica not consistent after update request?

2014-01-25 Thread Nathan Neulinger

Ok, so our issue sounds like a combination of not having softCommits properly
configured and SOLR-4260.

Thanks everyone!

On 01/24/2014 11:04 PM, Erick Erickson wrote:

Right. The updates are guaranteed to be on the replicas and in their
transaction logs. That doesn't mean they're searchable, however. For a
document to be found in a search there must be a commit, either soft,
or hard with openSearcher=true. Here's a post that outlines all this.



If you have discrepancies after commits, that's a problem.

Best,
Erick
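The automatic soft commits mentioned in this thread are configured in solrconfig.xml. A typical sketch (the interval values are illustrative, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush to stable storage, but don't open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: make recent updates searchable on every replica -->
  <autoSoftCommit>
    <maxTime>2000</maxTime>
  </autoSoftCommit>
</updateHandler>
```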

On Fri, Jan 24, 2014 at 8:52 PM, Nathan Neulinger nn...@neulinger.org wrote:

How can we issue an update request and be certain that all of the replicas
in the SolrCloud cluster are up to date?

I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed
before it returns to client that the operation succeeded - but we've been
seeing behavior lately (until we configured automatic soft commits) where
the replicas were almost always not current - i.e. the replicas were
missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I
misinterpreting the way that updates in a cloud deployment are supposed to
function?

If it's a problem with our cloud setup, do you have any suggestions on
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


What is the right way to bring a failed SolrCloud node back online?

2014-01-24 Thread Nathan Neulinger
I have an environment where new collections are being added frequently (isolated per customer), and the backup is 
virtually guaranteed to be missing some of them.


As it stands, bringing up the restored/out-of-date instance results in those collections being stuck in 'Recovering'
state, because the cores don't exist on the resulting server. This can also be extended to the case of restoring a 
completely blank instance.


Is there any way to tell SolrCloud "Try recreating any missing cores for this collection based on where you know they
should be located"?


Or do I need to actually determine a list of cores (..._shardX_replicaY) and trigger the core creates myself, at which 
point I gather that it will start recovery for each of them?


-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Replica not consistent after update request?

2014-01-24 Thread Nathan Neulinger

How can we issue an update request and be certain that all of the replicas in 
the SolrCloud cluster are up to date?

I found this post:

http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed before it returns to the client that the operation
succeeded - but we've been seeing behavior lately (until we configured automatic soft commits) where the replicas were 
almost always not current - i.e. the replicas were missing documents/etc.


Is this something wrong with our cloud setup/replication, or am I misinterpreting the way that updates in a cloud 
deployment are supposed to function?


If it's a problem with our cloud setup, do you have any suggestions on 
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Replica not consistent after update request?

2014-01-24 Thread Nathan Neulinger

Wow, the detail in that jira issue makes my brain hurt... Great to see it's got 
a quick answer/fix!

Thank you!

-- Nathan

On 01/24/2014 09:43 PM, Joel Bernstein wrote:

If you're on Solr 4.6 then this is likely the issue:
https://issues.apache.org/jira/browse/SOLR-4260.

The issue is resolved for Solr 4.6.1 which should be out next week.


Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 24, 2014 at 9:52 PM, Nathan Neulinger nn...@neulinger.orgwrote:


How can we issue an update request and be certain that all of the replicas
in the SolrCloud cluster are up to date?

I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.
solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed
before it returns to client that the operation succeeded - but we've been
seeing behavior lately (until we configured automatic soft commits) where
the replicas were almost always not current - i.e. the replicas were
missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I
misinterpreting the way that updates in a cloud deployment are supposed to
function?

If it's a problem with our cloud setup, do you have any suggestions on
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412





--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Replica not consistent after update request?

2014-01-24 Thread Nathan Neulinger

It's 4.6.0. Pair of servers with an external 3-node zk ensemble.

SOLR-4260 looks like a very promising answer. Will check it out as soon as 
4.6.1 is released.

May also check out the nightly builds since this is still just 
development/prototype usage.

-- Nathan

On 01/24/2014 09:45 PM, Anshum Gupta wrote:

Hi Nathan,

It'd be great to have more information about your setup, Solr Version?
Depending upon your version, you might want to also look at:
https://issues.apache.org/jira/browse/SOLR-4260 (which is now fixed).


On Fri, Jan 24, 2014 at 6:52 PM, Nathan Neulinger nn...@neulinger.orgwrote:


How can we issue an update request and be certain that all of the replicas
in the SolrCloud cluster are up to date?

I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.
solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed
before it returns to client that the operation succeeded - but we've been
seeing behavior lately (until we configured automatic soft commits) where
the replicas were almost always not current - i.e. the replicas were
missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I
misinterpreting the way that updates in a cloud deployment are supposed to
function?

If it's a problem with our cloud setup, do you have any suggestions on
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412







--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412