Hello,

I took into account the uniqueness of the hostnames/actual traffic in the test. That is to say, I ran the test using both unique and unfiltered hostnames. The results you see in the spreadsheet are from the case where I did not filter the names. We also made sure to capture the logs from 3 different perimeter proxies, each servicing a different DNS record type (wildcard: *.tumblr.com, A records, e.g. ablog.com, and CNAMEs, e.g. willy.tarreau.com).
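For anyone reproducing this, the measurement boils down to the sketch below: hash every host name read from stdin (either the raw, frequency-weighted list or the sort -u'd one), count how many land on each server, and report the standard deviation of the per-server counts. This is only an illustration of the idea, not the harness code itself; the pool size and the exact sdbm variant are assumptions on my part.

    /* build: cc -O2 -o dist dist.c -lm */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NSRV 16  /* assumed pool size; set to the real varnish count */

    /* classic sdbm string hash, one of the two algorithms compared */
    static uint32_t sdbm(const char *s)
    {
        uint32_t h = 0;
        while (*s)
            h = (unsigned char)*s++ + (h << 6) + (h << 16) - h;
        return h;
    }

    int main(void)
    {
        char host[1024];
        uint64_t cnt[NSRV] = { 0 };
        uint64_t total = 0;
        double mean, var = 0.0;
        int i;

        /* one host name per line; feed either the raw or the unique list */
        while (fgets(host, sizeof(host), stdin)) {
            host[strcspn(host, "\r\n")] = '\0';
            if (!host[0])
                continue;
            cnt[sdbm(host) % NSRV]++;
            total++;
        }

        mean = (double)total / NSRV;
        for (i = 0; i < NSRV; i++)
            var += ((double)cnt[i] - mean) * ((double)cnt[i] - mean);
        var /= NSRV;

        printf("requests=%llu mean=%.1f stddev=%.1f\n",
               (unsigned long long)total, mean, sqrt(var));
        return 0;
    }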
On blogs that get extremely popular, we use an entirely different backend of varnish nodes with consistent hashing over the whole URI. We have a model that determines these blogs, and traffic to them is redirected to the "special" backend by means of an ACL in the haproxy config. At the time I grabbed these logs, no blogs were being directed to the heavy-traffic backend. Even in the scenario where a blog were getting more traffic than others, I would expect that the algorithm with the larger standard deviation when using unique hostnames would also have worse outcomes when testing with non-unique hostnames.

I will let you know once I run these through haproxy and capture from the logs which backend was selected, probably later today.
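For concreteness, the split looks roughly like the sketch below. The names, addresses and the pattern file are invented for illustration; in production the list of heavy blogs is generated from the popularity model.

    frontend blogs
        bind :80
        # hypothetical pattern file produced by the popularity model
        acl heavy_blog hdr(host) -i -f /etc/haproxy/heavy_blogs.lst
        use_backend varnish_heavy if heavy_blog
        default_backend varnish_default

    backend varnish_default
        # normal case: consistent-hash on the Host header
        balance hdr(host)
        hash-type consistent
        server v1 10.0.0.1:6081 check
        server v2 10.0.0.2:6081 check

    backend varnish_heavy
        # popular blogs: hash the whole URI instead, so a single blog's
        # objects spread across every cache in this pool
        balance uri whole
        hash-type consistent
        server h1 10.0.1.1:6081 check
        server h2 10.0.1.2:6081 check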
The test harness code is at [1].

Thanks
-Bhaskar

[1] https://gist.github.com/maddalab/7136792

On Thu, Oct 24, 2013 at 2:01 AM, Willy Tarreau <w...@1wt.eu> wrote:
> Hello Bhaskar,
>
> On Wed, Oct 23, 2013 at 06:15:31PM -0400, Bhaskar Maddala wrote:
> > Hello,
> >
> > Apologies for the delay in responding. The trouble was largely to do
> > with being able to reproduce the results with the test harness. I made
> > a couple of changes to the test harness you provided, for standard
> > deviation and variance; the results are at [1]. The sheet titles
> > hopefully make sense, let me know in case they don't.
>
> Great, could you please post the updated source somewhere? It would
> help others contribute and test with their logs.
>
> > It took a while to find the data that we based our decision on; the
> > last sheet contains this data. The std dev for SDBM was 47 and for
> > DJB2 was 30. This difference correlates with connections from varnish
> > to our application backends (see [2]). I realize this is a
> > second-order metric, however it makes sense in that the smaller
> > standard deviation relates to a smoother distribution of load from
> > haproxy to varnish, which in turn relates to connection counts to the
> > application webs converging across the varnish pool.
> >
> > I spent some time looking for how this data was obtained, and just
> > found out that it was done by performing HTTP requests. I used this
> > data against the test harness [3] and immediately noticed the
> > difference in std dev results.
> >
> > As a next step, I will use the same methodology used previously
> > (performing GET requests) to determine the efficacy of the algorithms.
> > However, are there substantial differences between the test harness
> > and the code in haproxy that would in any way explain this difference?
> > Any other thoughts?
>
> Yes, there is a simple reason, which is that some host names probably
> cause many more requests than other ones. So if you did like me (sort -u
> on the host names), you're considering that they're all used at the
> exact same frequency, which is not true. That said, I still think that
> this should not be considered when designing a hash function for the
> job. If a hash algorithm is perfect, you will still see differences
> caused by the variations between the smallest and largest sites you're
> hosting.
>
> In my opinion, the best solution should probably be to hash the full
> request and not just the host. But with many blogs, I suspect that if
> one of them becomes very popular, it's still possible to see differences
> in cache loads.
>
> One solution when you know that *some* blogs are very popular is to
> dedicate them a backend section in which you use another algorithm,
> probably round robin. Indeed, if a few host names constitute more than
> a few percent of your traffic, it makes sense to spread them over all
> the caches and allow their contents to be replicated across the caches.
>
> I have not yet tried to apply murmurhash3 nor siphash, both of which
> seem very promising. Maybe you would like to experiment with them?
>
> Best regards,
> Willy
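P.S. For anyone who wants to experiment along the lines Willy suggests, below is MurmurHash3 (the x86_32 variant), adapted from Austin Appleby's public domain reference implementation. The (key, len, seed) signature and the zero seed are my choices, not anything the test harness already defines, so adapt the wrapper to whatever interface the harness expects.

    #include <stdint.h>
    #include <string.h>

    #define ROTL32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))

    /* MurmurHash3 x86_32, adapted from the public domain reference
     * implementation in smhasher. Assumes a little-endian CPU. */
    static uint32_t mmh3_32(const void *key, int len, uint32_t seed)
    {
        const uint8_t *data = (const uint8_t *)key;
        const uint8_t *tail = data + (len & ~3);
        const uint32_t c1 = 0xcc9e2d51;
        const uint32_t c2 = 0x1b873593;
        uint32_t h1 = seed;
        uint32_t k1 = 0;
        int i;

        /* body: process all full 4-byte blocks */
        for (i = 0; i + 4 <= len; i += 4) {
            memcpy(&k1, data + i, 4);
            k1 *= c1;
            k1 = ROTL32(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = ROTL32(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        /* tail: 0 to 3 remaining bytes */
        k1 = 0;
        switch (len & 3) {
        case 3: k1 ^= (uint32_t)tail[2] << 16; /* fall through */
        case 2: k1 ^= (uint32_t)tail[1] << 8;  /* fall through */
        case 1: k1 ^= tail[0];
                k1 *= c1;
                k1 = ROTL32(k1, 15);
                k1 *= c2;
                h1 ^= k1;
        }

        /* finalization: force all bits to avalanche */
        h1 ^= (uint32_t)len;
        h1 ^= h1 >> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >> 16;
        return h1;
    }

    /* string wrapper; the zero seed is chosen arbitrarily */
    static uint32_t mmh3_str(const char *s)
    {
        return mmh3_32(s, (int)strlen(s), 0);
    }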