Thanks for the feedback on this, Paul. This is outside my area of expertise, so it's good to know that this isn't symptomatic of using delayed_commits = true.

I also agree that this appears to be a hardware issue, and the only way to confirm would be to mirror the setup on separate hardware and see if the issue persists.

Wendall

On 04/19/2013 12:40 PM, Paul Davis wrote:
Doubtful that delayed commits would cause this. This isn't a matter of
reordered writes or of some writes not making it to disk. The binary
would've been pushed to disk in a single write request, and the
corruption appears to be in the middle of valid data, which is a bit
weird.

My guess is this was either corrupted in RAM before it hit disk, or
the disk is somehow returning bad reads. I've seen similar things
before that ended up preceding disk death, but I'm also running a
comparatively older code base (most importantly, no snappy).

On Fri, Apr 19, 2013 at 11:42 AM, Wendall Cada <[email protected]> wrote:
If you're using the defaults, isn't this still set to delayed_commits = true?
Couldn't that lead to exactly this type of data corruption? I'd like to see
this run with delayed_commits = false to check whether it still happens.
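
For anyone reproducing this, a minimal sketch of flipping that setting,
assuming a stock CouchDB 1.x install listening on localhost:5984 (host
and port are placeholders):

    # Persist it in local.ini so it survives restarts:
    #   [couchdb]
    #   delayed_commits = false

    # Or flip it at runtime via the _config API (returns the old value):
    curl -X PUT http://localhost:5984/_config/couchdb/delayed_commits -d '"false"'

    # Verify the running value:
    curl http://localhost:5984/_config/couchdb/delayed_commits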

I'd also be keen to see this data replicated to a different piece of
hardware with the same compaction schedule, to see if the issue persists.
I'm inclined to point the finger at a hard disk issue, but would like some
confirmation that this can be reproduced with the exact same code on
different hardware.
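
Something like this would do it, assuming a second box (here called
otherhost, a placeholder) running the same CouchDB version:

    # Create the target database, then push prod-folder to it one-shot:
    curl -X PUT http://otherhost:5984/prod-folder
    curl -X POST http://localhost:5984/_replicate \
         -H 'Content-Type: application/json' \
         -d '{"source": "prod-folder", "target": "http://otherhost:5984/prod-folder"}'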

I've run this same version heavily in production on several different
systems doing essentially the same thing and have never seen data
corruption. The main difference is that I always use delayed_commits = false.

Wendall


On 04/19/2013 01:31 AM, Dave Cottlehuber wrote:
On 19 April 2013 00:41, Victor Nicollet <[email protected]> wrote:
I searched the logs for any signs of error. The operations performed on the
prod-folder database in the two hours before the first crash were:

https://gist.github.com/VictorNicollet/878d0176960cc71d9ac1

The compact at 10:54:08 finished without a hitch.
The compact at 11:54:07 finished with:

https://gist.github.com/VictorNicollet/4d6ccd60bec2ae922a32

Hi Victor,

Thanks for that information.

Can we get a working copy of the database, so we can compare the
corrupt compressed documents with the working ones and see if there's
any pattern?
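
Failing that, a low-tech first pass, assuming you still have a known-good
copy of the file (paths are placeholders):

    # Find where the two copies first diverge:
    cmp good/prod-folder.couch bad/prod-folder.couch

    # Then eyeball the diverging region in hex:
    diff <(xxd good/prod-folder.couch) <(xxd bad/prod-folder.couch) | head -n 40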

I recommend you assume there's some storage-system issue and run through the following (a command sketch follows the list):

- check dmesg / syslog for disk-related errors
- fsck the filesystem where the couches are
- if this is a managed / hosted server, ask the supplier to check
whether there are any disk / storage issues
- if it's not virtualised hardware, see if smartmontools tells you
anything useful
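
A rough sketch of those checks; device names are placeholders, and fsck
should only run against an unmounted (or read-only) filesystem:

    # Kernel ring buffer and syslog, filtered for disk errors:
    dmesg | grep -iE 'error|ata|scsi|i/o'
    grep -iE 'error' /var/log/syslog | grep -iE 'sd|ata'

    # Read-only filesystem check (no repairs attempted):
    fsck -n /dev/sdXN

    # SMART health report, bare metal only:
    smartctl -a /dev/sdX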

If you wish, you can encrypt the files using my public key (dch@apache.org),
available from http://www.apache.org/dist/couchdb/KEYS.
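
For example, assuming GnuPG is installed (the file name is a placeholder):

    # Import the key from the Apache KEYS file, then encrypt to it:
    curl -O http://www.apache.org/dist/couchdb/KEYS
    gpg --import KEYS
    gpg --encrypt --recipient dch@apache.org prod-folder.couch
    # produces prod-folder.couch.gpg, ready to upload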

A+
Dave

