To illustrate ONE problem we have (another problem is that the data returned is sometimes garbage):
john@app-001:~$ curl -I http://localhost:8098/luwak/a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1
HTTP/1.1 200 OK
Vary: Accept-Encoding
Transfer-Encoding: chunked
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Last-Modified: Mon, 20 Dec 2010 16:23:11 GMT
Date: Wed, 09 Nov 2011 09:28:00 GMT
Content-Type: application/postscript
Connection: close

ok good it exists according to riak

john@app-001:~$ curl -O http://localhost:8098/luwak/a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0

nothing saved to disk

john@app-001:~$ curl -i http://localhost:8098/luwak/a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1
HTTP/1.1 200 OK
Vary: Accept-Encoding
Transfer-Encoding: chunked
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Last-Modified: Mon, 20 Dec 2010 16:23:11 GMT
Date: Wed, 09 Nov 2011 09:34:37 GMT
Content-Type: application/postscript
Connection: close

john@app-001:~$

just an empty response - seriously how does this happen?

Doing this several times yields the same result so there doesn't seem to be any read-repair going on. Is there nothing we can do to get riak in a consistent state again? (Other than going through all the 40 000 files and trying to determine which ones aren't there anymore or are just garbage…).
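If it does come to going through them all, the only semi-automatic thing I can think of is to script the check. A rough sketch - assuming we can dump the Luwak key names from our metadata bucket into a file (the keys.txt and suspect-keys.log names below are just placeholders), and assuming an empty body is a reliable sign of a broken file:

# hypothetical keys.txt: one Luwak key per line, exported from our metadata bucket
while read -r key; do
  # count how many bytes riak actually returns for this key
  bytes=$(curl -s "http://localhost:8098/luwak/$key" | wc -c)
  if [ "$bytes" -eq 0 ]; then
    echo "EMPTY: $key"
  fi
done < keys.txt > suspect-keys.log

That would pull all ~30GB over the wire and only catches the empty responses - the files that come back as garbage would still need some per-file check, e.g. comparing the first bytes against what the Content-Type claims they should be.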
John

On 8 Nov 2011, at 11:35, John Axel Eriksson wrote:

> Thanks for the emails detailing this issue - private and to the list. I've
> got a question for the list on our situation:
>
> As stated, we did an upgrade from 0.14.2 to 1.0.1 and after that we added a
> new node to our cluster. This really messed things up and nodes started
> crashing. In the end I opted to remove the added node, and after quite a
> short while things settled down. The cluster is responding again. What we
> see now are corrupted files.
>
> We've tried to determine how many of them there are, but it's been a bit
> difficult. What we know is that there ARE corrupted files (or at least
> files returned in an inconsistent state). I was wondering if there is
> anything we can do to get the cluster into a proper state again without
> having to manually delete everything that's corrupted? Is it possible that
> the data is actually there but not returned in a proper state by Riak? I
> think it's only the larger files stored in Luwak that have this problem.
>
> John
>
>
> On 29 Oct 2011, at 01:03, John Axel Eriksson wrote:
>
>> I've got the utmost respect for developers such as yourselves (Basho) and
>> we've had great success using Riak - we have been using it in production
>> since 0.11. We've had our share of problems with it during this whole
>> time, but none as big as this. I can't understand why this wasn't posted
>> somewhere using the blink tag and big red bold text. I mean, if I try to
>> fsck a mounted disk in use in Linux I get:
>>
>> "WARNING!!! The filesystem is mounted. If you continue you ***WILL***
>> cause ***SEVERE*** filesystem damage."
>>
>> I understand why I don't get a warning like that when trying to run
>> "riak-admin join [email protected]" on Riak 1.0.1, but something similar to
>> it happens.
>>
>> It goes against the whole idea of Riak being an ops dream - a distributed,
>> fault-tolerant system - to have a bug such as this without disclosing it
>> more openly than an entry in a bug tracking system. I don't want to be
>> afraid of adding nodes to my cluster, but that is the result of this bug
>> and the lack of communication about it. The 1.0.1 release should have been
>> pulled, in my opinion.
>>
>> To sum it up, this was a nightmare for us: I didn't get much sleep last
>> night and I woke up in hell. All of that - corrupted data, downtime and
>> lost customer confidence - could have been avoided by better communication.
>>
>> I don't want to be too hard on you fine people of Basho - you provide a
>> really great system in Riak and I understand what you're aiming for - but
>> if anything as bad as this ever happens in the future, you might want to
>> communicate it better and consider pulling the release.
>>
>> Thanks,
>> John
>>
>>
>> On 28 Oct 2011, at 17:51, Kelly McLaughlin wrote:
>>
>>> John,
>>>
>>> It appears you've run into a race condition with adding and leaving nodes
>>> that's present in 1.0.1. The problem happens during handoff and can cause
>>> bitcask directories to be unexpectedly deleted. We have identified the
>>> issue and we are in the process of correcting it, testing, and generating
>>> a new point release containing the fix. In the meantime, we apologize for
>>> the inconvenience and irritation this has caused.
>>>
>>> Kelly
>>>
>>>
>>> On Oct 28, 2011, at 9:14 AM, John Axel Eriksson wrote:
>>>
>>>> Last night we did two things. First we upgraded our entire cluster from
>>>> riak-search 0.14.2 to 1.0.1. This process went pretty well and the
>>>> cluster was responding correctly after it was completed.
>>>>
>>>> In our cluster we have around 40 000 files stored in Luwak (we also have
>>>> about the same number of keys, or more, in Riak, which is mostly the
>>>> metadata for the files in Luwak). The files range in size from around
>>>> 50K to around 400MB, though most of them are pretty small. I think we're
>>>> up to a total of around 30GB now.
>>>>
>>>> Anyway, upon adding a new node to the now-1.0.1 cluster, I saw the
>>>> beam.smp processes on all the servers, including the new one, taking up
>>>> almost all available CPU. It stayed in this state for around an hour,
>>>> and the cluster was slow to respond and occasionally timed out. During
>>>> the process, Riak crashed on random nodes from time to time and I had to
>>>> restart it. After about an hour things settled down. I added this new
>>>> node to our load balancer so it too could serve requests. When testing
>>>> our apps against the cluster we still got lots of timeouts, and
>>>> something seemed very, very wrong.
>>>>
>>>> After a while I did a "riak-admin leave" on the node that was added
>>>> (kind of a panic move, I guess). Around 20 minutes after I did this, the
>>>> cluster started responding correctly again. All was not well though -
>>>> files seemed to be corrupted (not sure what percentage, but it could be
>>>> 1% or more). I have no idea how that could happen, but files that we had
>>>> accessed before now contained garbage. I haven't thoroughly researched
>>>> exactly WHAT garbage they contain, but they're not in a usable state
>>>> anymore. Is this something that could happen under any circumstances in
>>>> Riak?
>>>>
>>>> I'm afraid of adding a node at all now, since it resulted in downtime
>>>> and corruption when I tried it. I checked and rechecked the
>>>> configuration files and really - they're the same on all the nodes
>>>> (except for vm.args, where they have different names, of course). Has
>>>> anyone ever seen anything like this? Could it somehow be related to the
>>>> fact that I did an upgrade from 0.14.2 to 1.0.1 and maybe an hour later
>>>> added a new 1.0.1 node?
>>>>
>>>> Thanks for any input!
>>>>
>>>> John

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
