Hi All,

Praveen sent me the necessary config files and logs. Together with our
support team (thanks Jimmy) we were able to identify a probable cause of
this issue. First, we do not recommend doing any serious testing on a
single machine with n_val=1 (the default replica count, or n_val, in Riak
is 3). That said, we do take missing data very seriously. Riak is
specifically designed to remain available and recover from the failure of
one or more physical machines in a cluster. Running on a single machine
with a replica count of one negates all that awesomeness.

One thing immediately stood out in the config file: a ring size of 512.
Ring size is the number of virtual nodes, or vnodes[1], in the cluster.
Vnodes are the default level of abstraction in Riak. When planning a
cluster[2], the general rule of thumb is to have between 10 and 50 vnodes
per physical machine. There are a number of reasons for this but, imho, it
mostly comes down to finding the performance sweet spot between optimally
loading your hardware and omg wt&*%*# overloading your hardware. 512 vnodes
on a single machine is categorically the latter. One reason to limit the
number of vnodes per machine is simply file descriptor limits, which Linux
manages through ulimit: 512 vnodes turn into a ton of fds. So let's take a
look at the logs.
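
To make the rule of thumb concrete, here is a minimal Python sketch of the
arithmetic. The function names and the "under"/"within"/"over" labels are
made up for illustration; only the 10 to 50 range and the ring size of 512
come from the discussion above.

```python
# Illustrative only: these helpers are hypothetical, not part of Riak.
LOW, HIGH = 10, 50  # rule-of-thumb range of vnodes per physical machine


def vnodes_per_machine(ring_size: int, machines: int) -> float:
    """Average number of vnodes each physical machine has to host."""
    if machines < 1:
        raise ValueError("a cluster needs at least one machine")
    return ring_size / machines


def assess(ring_size: int, machines: int) -> str:
    """Classify a ring size against the 10-50 vnodes-per-machine guideline."""
    per_machine = vnodes_per_machine(ring_size, machines)
    if per_machine < LOW:
        return "under"
    if per_machine > HIGH:
        return "over"
    return "within"


if __name__ == "__main__":
    # The configuration from this thread: ring size 512 on one machine.
    print(vnodes_per_machine(512, 1), assess(512, 1))
```

With a ring size of 512 on a single machine, every one of the 512 vnodes
lands on the same box, roughly ten times the upper end of the guideline.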

In the error log we see this:

=====
===== LOGGING STARTED Thu Jun  4 08:29:17 GMT 2015
=====
Node 'riak@vps1' not responding to pings.
config is OK
!!!!
!!!! WARNING: ulimit -n is 1024; 4096 is the recommended minimum.
!!!!
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.4.12/riak -config /etc/riak/app.config -pa /usr/lib/riak/lib/basho-patches -args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]


It is not always simple to figure out what the ulimit is for a given
process. Thankfully, Erlang is easy to work with in this case. You simply
attach to the running Erlang node[3] and run the following:

os:cmd('ulimit -n').

This confirms the open-file limit actually in effect for the Erlang VM. Too
few fds may well have been the issue. Recall that the specific concern was
that, out of 8 thousand keys, 2 or 3 were no longer available three weeks
after they were uploaded. Assuming no keys were deleted via Riak, what can
happen to data on disk over time? Either disk errors at the OS/hardware
level or backend compaction issues within Riak. Without adequate file
descriptors the Erlang process can crash, and if it crashes in the middle
of a compaction operation there is a chance you could lose data. Having
multiple copies of your data on multiple machines, or even on one machine
(in different Bitcask or LevelDB backend files), allows Riak to recover
from these failures via read repair[4] and active anti-entropy[5]. AAE was
turned off here (of course, it too consumes fds).

At this point, having seen Riak crashes in the error log and the ulimit
warning, and assuming you are still working with a single instance with an
n_val of 1, my advice is simply to return the ring size to its default, 64,
and to make sure Riak runs with a high fd limit by increasing the
ulimit[6].
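
If you want to double-check the limit after changing it, without attaching
to the node, one option on Linux is to read /proc/<pid>/limits for the
Erlang VM's process. The Python below is a rough sketch under that
assumption: the helper names are made up, you have to supply the pid
yourself, and the 4096 threshold is just the minimum quoted in the startup
warning above.

```python
from pathlib import Path
from typing import Optional, Tuple

RECOMMENDED_MIN = 4096  # minimum named in the startup warning in this thread


def _to_int(value: str) -> Optional[int]:
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_open_files(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) from the 'Max open files' row of a limits file."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', soft, hard, 'files']
            return _to_int(fields[3]), _to_int(fields[4])
    raise ValueError("no 'Max open files' row found")


def open_files_limit(pid: str = "self") -> Tuple[Optional[int], Optional[int]]:
    """Read the open-file limits for a pid (Linux only; needs /proc)."""
    return parse_open_files(Path(f"/proc/{pid}/limits").read_text())


def meets_minimum(soft: Optional[int], minimum: int = RECOMMENDED_MIN) -> bool:
    """True if the soft limit is unlimited or at least the minimum."""
    return soft is None or soft >= minimum


if __name__ == "__main__":
    # 'self' inspects this Python process; pass the Erlang VM's pid instead.
    soft, hard = open_files_limit("self")
    print("soft:", soft, "hard:", hard, "ok:", meets_minimum(soft))
```

The soft limit is the one that matters for a running process, so that is
the value to compare against the recommended minimum.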

Give that a shot.
Best,
Alexander


[1] http://docs.basho.com/riak/latest/theory/concepts/vnodes/
[2] http://docs.basho.com/riak/latest/ops/building/planning/cluster/#Ring-Size-Number-of-Partitions
[3] http://docs.basho.com/riak/latest/ops/running/tools/riak/#attach
[4] http://docs.basho.com/riak/latest/theory/concepts/Replication/#Read-Repair
[5] http://docs.basho.com/riak/latest/theory/concepts/aae/
[6] http://docs.basho.com/riak/latest/ops/tuning/open-files-limit/

On Tue, Jun 9, 2015 at 12:38 PM, Praveen Baratam <praveen.bara...@gmail.com>
wrote:

> Hello everyone,
>
> I have set up a Riak test node on a VPS with n = 1, r = 1 and w = 1, the
> Bitcask engine and AAE turned off. I loaded some 8k blobs into it and
> everything was fine...
>
> Today, after three weeks, I noticed that a few of those 8K blobs are
> missing (not found)...
>
> I also see a lot of invalid HintFile errors in the console.log.
>
> Can anybody explain why this is happening?
>
> Thanks in advance.
>
> Best,
>
> Praveen Baratam
>
> about.me <http://about.me/praveen.baratam>
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>