Kevin -
The test client is part of a bigger system and would be a bit too much
to send to you. The method that calls Riak looks like this:
import com.basho.riak.client.*;
import java.util.ArrayList;
import java.util.List;
.
.
public List<Document> lookupDocuments(String personId, String url) {
    RiakClient riak = new RiakClient(url);
    WalkResponse walkResponse = riak.walk("person", personId, "document,_,_");
    if (walkResponse.isSuccess()) {
        List<Document> out = new ArrayList<Document>();
        List<? extends List<RiakObject>> steps = walkResponse.getSteps();
        if (steps.size() != 1) {
            throw new RuntimeException("Expected to walk one link. Walked "
                    + steps.size());
        }
        List<RiakObject> step = steps.get(0);
        for (RiakObject o : step) {
            try {
                // The stored value is a JSON-serialized protobuf document
                String chars = o.getValue();
                Builder builder = Protos.Document.newBuilder();
                JsonFormat2.merge(chars, builder);
                out.add(((Document) builder.build()).getDocument());
            } catch (ParseException e) {
                throw new DocumentServiceException("Error parsing document", e);
            }
        }
        return out;
    } else {
        throw new RuntimeException("Walk error: "
                + walkResponse.getHttpHeaders());
    }
}
It could be interesting to repeat your test on our cluster to see if we
get the same numbers as you do. Would it be possible for you to send the
code behind your test?
--
Jan Buchholdt
Software Pilot
Trifork A/S
Cell +45 50761121
On 2010-11-09 15:47, Karsten Thygesen wrote:
On Nov 9, 2010, at 14:58 , Kevin Smith wrote:
On Nov 9, 2010, at 5:01 AM, Karsten Thygesen wrote:
Hi
OK, we will use a larger ringsize next time and will consider a data reload.
Regarding the metrics: the servers are dedicated to Riak and are not used
for anything else. They are new HP servers with 8 cores each and 4x146GB 10K
RPM SAS disks in a concatenated mirror setup. We use Solaris with ZFS as the
filesystem, and I have turned off atime updates on the data partition.
The pool is built as such:
  pool: pool01
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Tue Oct 26 21:25:05 2010
config:

        NAME          STATE     READ WRITE CKSUM
        pool01        ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c0t0d0s7  ONLINE       0     0     0
            c0t1d0s7  ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0

errors: No known data errors
so it is as fast as possible.
However - we use the ZFS default record size (recordsize), which is 128 KB - is
that optimal with Bitcask as the backend? It is rather large, but what record
size is optimal for Bitcask?
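Should a smaller record size turn out to help, it can be changed per dataset;
note it only applies to newly written blocks, so a data reload would be needed
for existing files to pick it up. The dataset name pool01/riak below is just a
placeholder for wherever the Bitcask data lives:

```shell
# Inspect the current ZFS record size (default is 128K)
zfs get recordsize pool01/riak

# Lower it, e.g. to 16K; only newly written blocks are affected,
# so existing Bitcask files keep their old record size until rewritten
zfs set recordsize=16K pool01/riak
```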
I don't have much experience tuning Solaris or ZFS for Riak. This is a question
best asked of Ryan and I will make sure he sees this.
Thanks!
The cluster is 4 servers with gigabit connections located in the same datacenter
on the same switch. The loadbalancer is a Zeus ZTM, which does quite a few HTTP
optimizations, including extensive reuse of HTTP connections, and we usually see
far better response times through the loadbalancer than when hitting a node
directly.
Hmmm. Can you share what the performance times are like for direct cluster
access?
In this case, there is no measurable difference whether we query a cluster node
directly or go through the loadbalancer. The largest difference shows up when we
hit it with a lot of small requests, but that is not the case here.
When we run the test, each Riak node is only about 100% CPU loaded (which on
Solaris means that it only uses one of the 8 cores). We have seen spikes in
the 160% area, but anything below 800% is not CPU bound. So, all in all, the
CPU load is between 5 and 10%.
Can you send me the code you're using for the performance test? I'd like to run
the exact code on my test hardware and see if that reveals anything.
Jan, can you please provide the test client?
Also, low CPU usage might indicate you are IO bound. Do you know if Riak
processes are spending much time waiting for IO to complete?
It does not seem so. The servers are not IO bound; there is plenty of network
capacity, and the disks are only around 10% loaded.
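For reference, per-disk load like this can be read with standard Solaris
iostat; the %b column is the busy percentage, and the wait and actv columns
show queued and in-flight requests:

```shell
# Extended per-device statistics, refreshed every 30 seconds
# (%b = percent busy, asvc_t = average service time in ms)
iostat -xn 30
```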
My largest suspicion is about the datamodel - when a 4-node cluster has to do a
linkwalk that combines around 500-600 documents, it will take quite some time,
but we still feel that the numbers are very high.
Perhaps we should consider a datamodel where we collect, say, 100 documents in
a basket and then only have to linkwalk 4-5 baskets to return an answer?
Tempting, performance-wise, but it makes the data a lot harder to maintain
afterwards, as we can not just use map/reduce and similar techniques to handle
the data...
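As a rough sketch of that basket idea (the class and method names here are
hypothetical, not from our codebase): grouping the documents 100 to a basket
turns a ~550-document walk into a handful of basket fetches:

```java
import java.util.ArrayList;
import java.util.List;

public class Baskets {
    static final int BASKET_SIZE = 100;

    // Partition document keys into baskets of BASKET_SIZE; a person
    // would then link to each basket instead of to each document.
    static List<List<String>> toBaskets(List<String> docKeys) {
        List<List<String>> baskets = new ArrayList<List<String>>();
        for (int i = 0; i < docKeys.size(); i += BASKET_SIZE) {
            baskets.add(new ArrayList<String>(
                    docKeys.subList(i, Math.min(i + BASKET_SIZE, docKeys.size()))));
        }
        return baskets;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i < 550; i++) keys.add("doc-" + i);
        List<List<String>> baskets = toBaskets(keys);
        // prints: 6 baskets, last has 50
        System.out.println(baskets.size() + " baskets, last has "
                + baskets.get(baskets.size() - 1).size());
    }
}
```

The trade-off stands as noted above: fewer links to walk per request, at the
cost of keeping the basket memberships up to date on every write.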
Karsten
--Kevin
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com