Hi All,

Praveen sent me the necessary config files and logs. Together with our support team (thanks Jimmy) we were able to identify a probable cause of this issue.

Firstly, we do not recommend doing any serious testing on a single machine with n_val=1 (the default replica count, or n_val, in Riak is 3). That said, we do take missing data very seriously. Riak is specifically designed to remain available and recover from the failure of one or more physical machines in a cluster. Running on a single machine with a replica count of one gives up all that awesomeness.
One thing immediately stood out in the config file: a ring size of 512. Ring size is the number of virtual nodes or vnodes[1] in the cluster. Vnodes are the default level of abstraction in Riak. When planning a cluster[2], the general rule of thumb is that you have between 10 and 50 vnodes per physical machine. There are a number of reasons for this but imho, it mostly has to do with finding the sweet spot performance-wise between optimally loading your hardware and omg wt&*%*# overloading your hardware. 512 vnodes on a single machine is categorically the latter.

One reason you want to limit the max number of vnodes per machine is simply file descriptor limitations, handled in Linux through ulimit. 512 vnodes turns into a ton of fd's.

So let's take a look at the logs. In the error log we see this:

    =====
    ===== LOGGING STARTED Thu Jun 4 08:29:17 GMT 2015
    =====
    Node 'riak@vps1' not responding to pings.
    config is OK
    !!!!
    !!!! WARNING: ulimit -n is 1024; 4096 is the recommended minimum.
    !!!!
    Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.4.12/riak -config /etc/riak/app.config -pa /usr/lib/riak/lib/basho-patches -args_file /etc/riak/vm.args -- console
    Root: /usr/lib/riak
    Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]

It is not the simplest thing to figure out what the ulimit is for a given process. Thankfully, Erlang is easy to work with in this case. You simply attach to the running Erlang process[3] and run the following:

    os:cmd('ulimit -n').

This will confirm the specific ulimit settings available to Erlang. Not enough fd's may well have been the issue.

Recall, the specific concern was that out of 8 thousand keys, 2 or 3 keys were not available three weeks after uploading them. Assuming no key deletes via Riak, what can happen to disk over time? Either disk errors at the OS/hardware level, or backend compaction-related issues within Riak.
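To make the two numbers above concrete, here is a minimal Python sketch. It is not part of Riak, and the function names are made up for illustration. It divides ring size by node count to compare against the 10 to 50 vnodes-per-machine rule of thumb, and it compares a file descriptor limit against the 4096 minimum from the startup warning. Note that resource.getrlimit reports the limit of the Python process itself, so it is only a rough proxy for what Riak sees; the os:cmd check from inside the Erlang VM is the reliable one.

```python
import resource  # Unix-only standard library module

# Values taken from this message: the rule of thumb and the startup warning.
VNODES_PER_NODE_RANGE = (10, 50)
RECOMMENDED_MIN_FDS = 4096


def vnodes_per_node(ring_size, node_count):
    """Average number of vnodes each physical machine has to host."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return ring_size / node_count


def ring_size_verdict(ring_size, node_count):
    """Return 'below', 'within' or 'above' relative to the rule of thumb."""
    low, high = VNODES_PER_NODE_RANGE
    per_node = vnodes_per_node(ring_size, node_count)
    if per_node < low:
        return "below"
    if per_node > high:
        return "above"
    return "within"


def fd_limit_ok(soft_limit, minimum=RECOMMENDED_MIN_FDS):
    """True if the soft open-files limit meets the recommended minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum


if __name__ == "__main__":
    # Limit of *this* Python process, not of the running Riak node.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("vnodes per node:", vnodes_per_node(512, 1), ring_size_verdict(512, 1))
    print("open files soft limit:", soft, "ok" if fd_limit_ok(soft) else "too low")
```

For the setup discussed here, ring_size_verdict(512, 1) comes out as "above" (512 vnodes on one machine, roughly ten times the upper bound), and fd_limit_ok(1024) is False.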
Without adequate file descriptors, the Erlang process can crash, and if it crashes in the middle of a compaction operation there is a chance you could lose data. Having multiple copies of your data on multiple machines, or even on one machine (in different backend Bitcask or LevelDB files), allows Riak to recover from these failures via read repair[4] and active anti-entropy (AAE)[5]. AAE was turned off here (of course, it too consumes fd's).

At this point, having seen Riak crashes in the error log and the ulimit warning, and assuming you are still working with a single instance with an n_val of 1, my advice is to simply return the ring size to its default, 64, and ensure that Riak is running with a high number of fd's by increasing the ulimit[6].

Give that a shot.

Best,
Alexander

[1] http://docs.basho.com/riak/latest/theory/concepts/vnodes/
[2] http://docs.basho.com/riak/latest/ops/building/planning/cluster/#Ring-Size-Number-of-Partitions
[3] http://docs.basho.com/riak/latest/ops/running/tools/riak/#attach
[4] http://docs.basho.com/riak/latest/theory/concepts/Replication/#Read-Repair
[5] http://docs.basho.com/riak/latest/theory/concepts/aae/
[6] http://docs.basho.com/riak/latest/ops/tuning/open-files-limit/

On Tue, Jun 9, 2015 at 12:38 PM, Praveen Baratam <praveen.bara...@gmail.com> wrote:

> Hello everyone,
>
> I have setup a Riak test node in a VPS with n = 1, r = 1 and w =1, Bitcask
> engine and AAE turned off.. loaded some 8k blobs into it and everything was
> fine...
>
> Today, after three weeks, I noticed that a few of those 8K blobs are
> missing - not found...
>
> I also see a lot of invalid HintFile error in the console.log
>
> Can anybody explain why this is happening?
>
> Thanks in advance.
>
> Best,
>
> Praveen Baratam
>
> about.me <http://about.me/praveen.baratam>
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com