Shawn,

Apologies, I should have explained this more clearly.

To clarify: manually deleting the 'version-2' directory is not something
that ever happened when I first observed this behavior.
I only did it in this example because it's the fastest and simplest way to
demonstrate the behavior.

What I have experienced is attempting to add a zookeeper to the ensemble
and, despite the container starting, it never actually joining the
ensemble, probably due to iptables/firewall rules. That zookeeper is then
"empty" because it could not sync with the others.
However, the deploy doesn't stop, because it sees the container running and
assumes everything is OK. The deploy then continues and restarts the solr
container (because the new ZK_HOST configuration has to be provided; as far
as I know this cannot be changed dynamically).
At that point, for some reason, solr connects to that specific zookeeper
first, gets an empty configuration and deletes the folders. This I have
seen happen.

This is the "split brain thing" I referred to in my first email.

I have also seen a move from native to container-based zookeepers where
the zookeeper 'version-2' data folder was not properly mounted: the native
zookeepers used a different data location than the one provided to the
containers.
The automated deploy process checks whether the folder exists, creates an
empty one if it doesn't, and just continues to the next step.
This also resulted in solr connecting to an "empty" zookeeper and deleting
all the folders.


So what I needed to simulate was a Solr that had cores connecting to an
empty zookeeper and then losing those cores.
Manually deleting the zookeeper data is a far simpler and more easily
reproducible way of demonstrating this.
I could have started a second zookeeper with a different data mount and
then updated ZK_HOST to point only to that one, I suppose.
It's just an example of how to arrive at these events.
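
For completeness, that alternative would have looked roughly like this. The
container names, network name and image tags below are made up, and it
assumes both containers share a docker network, as in the original
instructions:

    # start a second zookeeper with its own, empty data mount
    docker run -d --name zookeeper2 --network solrnet \
        -v /data/zookeeper2:/data zookeeper:3.4

    # recreate the solr container (reusing its original data volume) and
    # point ZK_HOST only at the new, empty zookeeper
    docker stop solr1 && docker rm solr1
    docker run -d --name solr1 --network solrnet \
        -e ZK_HOST=zookeeper2:2181 solr:7.7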

That being explained, am I right in understanding that there is currently
no way to configure Solr so that it won't delete the folders in this
situation?

I'm in the process of writing a script that basically runs "docker exec
<zookeeper container> bin/zkCli.sh ls <path>" against every known zookeeper
container; if they don't all return what I expect, the deploy stops right
before starting the solr container. That should serve as a safeguard for
now, I suppose?
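
The check I have in mind is roughly the following. The container names, the
znode path and the expected entry are placeholders here, not the real
values:

    #!/usr/bin/env bash
    # abort the deploy unless every known zookeeper container returns the
    # expected listing for the given znode path
    ZK_CONTAINERS="zookeeper1 zookeeper2 zookeeper3"
    ZNODE_PATH="/collections"
    EXPECTED="mycollection"

    for c in $ZK_CONTAINERS; do
        listing=$(docker exec "$c" bin/zkCli.sh ls "$ZNODE_PATH" 2>/dev/null)
        if ! echo "$listing" | grep -q "$EXPECTED"; then
            echo "ERROR: $c does not show '$EXPECTED' under $ZNODE_PATH," \
                 "stopping the deploy before starting solr" >&2
            exit 1
        fi
    done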

Side note for anyone reading this later in the archives: the instructions
tar.gz in my previous message contains the output of an audit rule that was
placed on the data folder.
That output shows that the process performing the deletes is, in fact, the
solr (java) process, issuing the "unlink" and "rmdir" syscalls on the
specific files and directories.
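
For reference, such an audit rule can be set up roughly like this; the data
path and key name are placeholders and this is not the exact rule that was
used:

    # record every unlink/unlinkat/rmdir performed anywhere under the
    # solr data folder
    auditctl -a always,exit -F arch=b64 -S unlink -S unlinkat -S rmdir \
        -F dir=/var/solr/data -k solr-data-delete

    # the recorded events, including the pid and comm (java) of the
    # deleting process, can then be read back with:
    ausearch -k solr-data-delete -i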

Regards,
Koen

On Thu, Apr 11, 2019 at 7:00 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/11/2019 3:17 AM, Koen De Groote wrote:
> > The basic steps are: set up zookeeper, set up solr root, set up solr.
> > Create dummy collection with example data. Stop the containers. Delete
> > the zookeeper 'version-2' folder. Recreate zookeeper container. Redo the
> > mkroot, recreate solr container. At this point, solr will start
> > complaining about the cores after a bit and then the data folders will
> > be deleted.
>
> By deleting the "version-2" folder, you deleted the entire ZooKeeper
> database.  All of the information that makes up your entire SolrCloud
> cluster is *gone*.
>
> We are trying as hard as we can to move to a "ZooKeeper As Truth" model.
>   Right now, the truth of a SolrCloud cluster is a combination of what's
> in ZooKeeper and what actually exists on disk, rather than just what's
> in ZK.
>
> It surprises me greatly that Solr is deleting data.  I would expect it
> to simply ignore cores during startup if there is no corresponding data
> in ZooKeeper.  In the past, I have seen evidence of it doing exactly that.
>
> So although it sounds like we do have a bug that needs fixing (SolrCloud
> should never delete data unless it has been explicitly asked to do so),
> you created this problem yourself by deleting all of your ZooKeeper data
> -- in essence, deleting the entire cluster.
>
> Thanks,
> Shawn
>
