> However, I've been seeing advice that we shouldn't use this because it's not
> ECC memory.  That "memory errors are a lot bigger than people think", and
> that this can cause data to rot over time.
> 

Super-redundant servers really negate the n=whatever idea in the first place.
The reason I wanted to use Riak a long time ago was that it was about using
whatever was lying around the server room (decent, but not the best,
hardware).


> So, my question is this:  When you issue a read request to Riak, and the
> data is stored on 3 nodes, does any kind of a error check code ever get
> generated and compared?
> 
> Suppose I had an address record on three nodes, but at the moment the record
> was being written to one of the nodes a cosmic ray flipped a bit and instead
> of it being 123 Main street, the address read 223 main street.
> 

The write verification should take care of this.  IIRC, the hash is
generated in response to the data being written to disk, so the hashes would
differ.  But this may have changed.
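
As a rough illustration of the idea (this is not Riak's actual code; the
function names are made up and CRC32 is only a stand-in for whatever hash is
really used), a checksum computed at write time will no longer match if a
bit flips afterwards:

```python
import zlib


def store(value: bytes):
    """Hypothetical write path: return the value plus a CRC32 taken at write time."""
    return value, zlib.crc32(value)


def verify(value: bytes, checksum: int) -> bool:
    """Recompute the CRC32 and compare it with the one saved at write time."""
    return zlib.crc32(value) == checksum


def flip_bit(value: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate a single flipped bit in the stored bytes."""
    corrupted = bytearray(value)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


record, crc = store(b"123 Main street")
damaged = flip_bit(record, 0, 0)  # '1' (0x31) becomes '0' (0x30)

print(verify(record, crc))   # True: the intact copy matches its checksum
print(verify(damaged, crc))  # False: the flipped bit is caught
```

Note this only helps if the flip happens after the checksum is computed.  If
the data is already wrong in memory when it gets hashed, the checksum will
happily match the wrong data.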

> When I read that record, and all three nodes respond, will I simply get the
> result of whichever node is fastest?
> 
> If, when I read, I say that R=2 and so 2 nodes have to respond, is the
> result from the two nodes compared?
> 
> I know the vector clock will be compared to make sure to return the latest
> record, but in this situation the vector clocks would be the same even
> though the data isn't.
> 
> Assuming there's no hash generated from the data that would catch or correct
> this type of error, I'm interested in hearing from people with largish
> clusters and knowing whether you use ECC RAM on them or not.
> 
> Basically looking for some advice from people with more experience, as the
> ones advocating ECC are pretty fervent but the cost difference is
> significant.
> 

Personally, I think the servers you are looking at are more than fine.
Others may disagree and cite things like "production data", etc.  Those who
believe ECC is the cure-all also forget to tell you that even data at rest
on disk can, in theory, change over time.  You can also construct scenarios
where cosmic rays hit both a data chip and the ECC-specific chip and flip
several bits in a way that still passes the check.  That would not generate
an error, since the check would still match.
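
To illustrate how more than one flipped bit can slip past a check, here is a
toy sketch using a single parity bit.  This is much weaker than real ECC
memory (which typically corrects single-bit errors and detects double-bit
ones), so treat it as an analogy only, not a model of any actual hardware:

```python
def parity(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of set bits."""
    return bin(byte).count("1") % 2


stored = 0x31             # ASCII '1'
check = parity(stored)    # parity bit saved alongside the byte

one_flip = stored ^ 0b00000001   # 0x30, ASCII '0'
two_flips = stored ^ 0b00000011  # 0x32, ASCII '2'

print(parity(one_flip) == check)   # False: a single flip is detected
print(parity(two_flips) == check)  # True: two flips cancel out and pass
```

Incidentally, in ASCII the jump from '1' to '2' is a two-bit change, so the
"123 Main street" to "223 main street" example is exactly the kind of error
a single parity bit would miss.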

I'm not knocking good hardware; I have several high-end 64GB and 128GB servers
with ECC.  I'm just saying the purpose of Riak was to work around the problem of
having to worry about these cases.

Let me also qualify that I'm no longer using Riak in production.  I only
tinker with it for small projects and testing.  I was looking for a local
replacement for S3, and this looked promising (for the reasons of cheap
hardware, etc.), but in the end the large-file issue was the sticking point.
Otherwise it's a pretty good product.

Gary Smith




_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
