Re: page allocation failures on osd nodes
On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang <sam.l...@inktank.com> wrote:
> On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>> Sorry, I wrote too little yesterday because I was sleepy. It's obviously cache pressure, since dropping caches made these errors disappear for a long period. I'm not very familiar with kernel memory mechanisms, but shouldn't the kernel try to allocate memory on the second node (if that is not prohibited by the process's cpuset) before reporting an allocation failure? As can be seen, only node 0 is involved in the failures. I really have no idea where NUMA awareness comes in for the osd daemons.
>
> Hi Andrey,
>
> You said that the allocation failure doesn't occur if you flush caches, but the kernel should evict pages from the cache as needed so that the osd can allocate more memory (unless they're dirty, and it doesn't look like you have many dirty pages in this case). It looks like you have plenty of reclaimable pages as well. Does the osd remain running after that error occurs?

Yes, it keeps running flawlessly, without even a change in the osdmap, but unfortunately logging wasn't turned on at the time. As soon as I finish the massive test for the "suicide timeout" bug I'll check your idea with dd, and also rerun the test below with "debug osd = 20".

My thought is that the kernel has ready-to-be-freed memory on node 1, but for some strange reason the osd process tries to reserve pages on node 0 (where its memory was obviously allocated at start, since node 1's memory starts only at high addresses, above 32G), and the kernel then refuses to free cache on that specific node. It's quite unclear, at least to me, why the kernel does not just invalidate some buffers, even ones more deserving of staying in RAM than the tail of the LRU.
Allocation looks like the following on most nodes:

MemTotal:       66081396 kB
MemFree:          278216 kB
Buffers:           15040 kB
Cached:         62422368 kB
SwapCached:            0 kB
Active:          2063908 kB
Inactive:       60876892 kB
Active(anon):     509784 kB
Inactive(anon):       56 kB
Active(file):    1554124 kB
Inactive(file): 60876836 kB

OSD-node free memory, with two osd processes on each node (libvirt prints the "Free" field here):

0:      207500 KiB
1:       72332 KiB
Total:  279832 KiB

0:      208528 KiB
1:       80692 KiB
Total:  289220 KiB

Since the kernel is known to reserve more memory on the node with higher memory pressure, this seems legitimate: the osd processes work mostly with node 0's memory, so there is a bigger gap there than on node 1, which holds almost nothing but fs cache.

> I wonder if you see the same error if you do a long write-intensive workload on the local disk for the osd in question, maybe:
>
>     dd if=/dev/zero of=/data/osd.0/foo
>
> -sam
>
> On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>> Hi,
>>
>> Those traces happen only under constant heavy writes and seem to be very rare. OSD processes do not consume more memory after this event, and no peaks are distinguishable in monitoring. I was able to catch it after four hours of constant writes on the cluster.
>>
>> http://xdel.ru/downloads/ceph-log/allocation-failure/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
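The per-node figures above came from libvirt; on Linux the same information is exposed in /sys/devices/system/node/node<N>/meminfo. As a minimal sketch of checking how much of a node's memory should actually be reclaimable (the helper names and the sample figures below are made up for illustration, not taken from the thread):

```python
# Hypothetical helper: parse meminfo-style lines of the form
# "Node 0 MemFree: 207500 kB", as found in
# /sys/devices/system/node/node<N>/meminfo on Linux.

def parse_node_meminfo(text):
    """Return a {field: kB} dict from a per-node meminfo dump."""
    stats = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        # Lines look like: Node 0 MemFree: 207500 kB
        if len(parts) >= 4 and parts[0] == "Node":
            stats[parts[2]] = int(parts[3])
    return stats

def reclaimable_kb(stats):
    """Free pages plus inactive file pages: a rough lower bound on
    what the allocator should be able to reclaim without writeback."""
    return stats.get("MemFree", 0) + stats.get("Inactive(file)", 0)

sample = """\
Node 0 MemFree:        207500 kB
Node 0 Inactive(file): 30438418 kB
Node 0 Dirty:              12 kB
"""

stats = parse_node_meminfo(sample)
print(reclaimable_kb(stats))
```

If the reclaimable figure for node 0 is large while allocations on node 0 still fail, that would support the theory above that reclaim is not keeping up on that specific node.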
Re: RadosGW performance and disk space usage
Dear Sam, Dan and Marcus,

Thank you a lot for the replies. I'll do more tests today. The length of each object used in my test is just 20 bytes. I'm glad you got 400 objects/s! If I get that with a length of 8 KB using a 2-node cluster, then ceph with rados will already be faster than my current solution, and I'll be able to present it to my boss. :-) I'll try rest-bench later. Thanks for the help!

Best regards
Mello

On Sat, Jan 26, 2013 at 3:43 AM, Marcus Sorensen <shadow...@gmail.com> wrote:
> Have you tried rest-bench on localhost at the rados gateway? I was playing with the rados gateway in a VM the other day, and was getting up to 400/s on 4k objects. Above that I was getting connection failures, but I think it was just due to a default max-connections setting somewhere. My VM is on SSD, though. I was just thinking it may help isolate the issue.
>
> On Fri, Jan 25, 2013 at 4:14 PM, Sam Lang <sam.l...@inktank.com> wrote:
>> On Thu, Jan 24, 2013 at 9:27 AM, Cesar Mello <cme...@gmail.com> wrote:
>>> Hi! I have successfully prototyped read/write access to ceph from Windows using the S3 API, thanks so much for the help. Now I would like to do some prototypes targeting performance evaluation. My scenario typically requires parallel storage of data from tens of thousands of loggers, but scalability to hundreds of thousands is the main reason for investigating ceph. My tests using a single laptop running ceph with 2 local OSDs and a local radosgw allow writing on average 2.5 small objects per second (100 objects in 40 seconds). Is this the expected performance? It seems to be I/O bound, because the HDD LED stays on during the PutObject requests. Any suggestions or documentation pointers for profiling are much appreciated.
>>
>> Hi Mello,
>>
>> 2.5 objects/sec seems terribly slow, even on your laptop. How small are these objects?
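The 2.5 objects/s figure follows directly from 100 objects in 40 seconds. A small, hedged harness for measuring that rate with any put function could look like the sketch below; `fake_put` is a no-op stand-in, and in a real test you would swap in a function that performs the actual PutObject request against the gateway:

```python
import time

def measure_put_rate(put_object, n_objects=100, payload=b"x" * 20):
    """Time n_objects sequential puts and return objects/sec."""
    start = time.time()
    for i in range(n_objects):
        put_object("obj-%d" % i, payload)
    elapsed = time.time() - start
    return n_objects / elapsed if elapsed > 0 else float("inf")

# Stand-in for a real gateway call; replace with code that issues
# an HTTP PutObject to radosgw to benchmark the real path.
def fake_put(key, data):
    pass

rate = measure_put_rate(fake_put)
print("%.1f objects/s" % rate)
```

Sequential puts measure per-request latency; issuing them in parallel (as tens of thousands of loggers would) usually gives a much higher aggregate rate.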
>> You might try to benchmark without the disk as a potential bottleneck, by putting your osd data and journals in /tmp (for benchmarking only, of course), or by creating and mounting a tmpfs and pointing your osd backends there.
>>
>>> I am afraid the S3 API is not good for my scenario, because there is no way to append data to existing objects (so I won't be able to model a single object per data collector). In that case I would need to store billions of small objects, and I would like to know how much disk space each object instance requires beyond the object's content length. If the S3 API is not well suited to my scenario, then my effort would be better directed at porting or writing a native ceph client for Windows. I just need an API to read and write/append blocks to files. Any comments are really appreciated.
>>
>> Hopefully someone with more Windows experience will give you better info/advice than I can. You could try to port the rados API to Windows. It's purely userspace, but does rely on pthreads and other libc/gcc specifics. With something like Cygwin a port might not be too hard, though. If you decide to go that route, let us know how you progress!
>> -sam
>
>>> Thank you a lot for the attention!
>>> Best regards
>>> Mello
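At billions of objects, per-object overhead dominates the 20-byte payloads. A back-of-the-envelope sketch of the footprint (the 4 KiB overhead figure is a placeholder assumption, e.g. one filesystem block per object on the osd backend, not a measured radosgw number):

```python
def total_storage_bytes(n_objects, payload_bytes, per_object_overhead,
                        replicas=2):
    """Rough cluster footprint: (payload + fixed per-object overhead)
    multiplied by the replication factor."""
    return n_objects * (payload_bytes + per_object_overhead) * replicas

# 1 billion 20-byte objects with an assumed 4 KiB per-object overhead
# and 2x replication.
footprint = total_storage_bytes(1_000_000_000, 20, 4096, replicas=2)
print(footprint / 2**40, "TiB")
```

Under these assumptions the overhead, not the payload, accounts for nearly all of the disk usage, which is exactly why appending to a few large objects would be so much cheaper than storing billions of tiny ones.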
[PATCH 0/4] rbd: keep reference to lingering object requests
This series applies on top of the new rbd request code. When an osd request is marked to linger, the osd client will keep a copy of the request and will resubmit it if necessary. If it gets resubmitted, it will also call the completion routine again, and because of that we need to make sure the associated object request structure remains valid. The last patch in this series ensures that by taking an extra reference for an object request set to linger.

					-Alex

[PATCH 1/4] rbd: unregister linger in watch sync routine
[PATCH 2/4] rbd: track object rather than osd request for watch
[PATCH 3/4] rbd: decrement obj request count when deleting
[PATCH 4/4] rbd: don't drop watch requests on completion
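The ownership rule in this series (the osd client keeps a pointer to a lingering request, so the submitter must hold an extra reference until teardown) can be sketched generically. The Python model below is only an analogy for the kernel's get/put-style refcounting, with hypothetical class and method names; it is not the rbd code itself:

```python
class ObjRequest:
    """Toy refcounted request, mirroring kref-style get/put semantics."""
    def __init__(self, name):
        self.name = name
        self.refcount = 1          # creator's reference
        self.freed = False

    def get(self):
        assert not self.freed
        self.refcount += 1

    def put(self):
        assert self.refcount > 0
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True      # the structure would be freed in C

class OsdClient:
    """Keeps lingering requests around so they can be resubmitted."""
    def __init__(self):
        self.lingering = []

    def set_linger(self, req):
        req.get()                  # client now co-owns the request
        self.lingering.append(req)

    def unregister_linger(self, req):
        self.lingering.remove(req)
        req.put()                  # drop the client's reference

# Set up a watch: the completion callback may run many times on
# resubmit, but the request stays valid because a reference is held
# for the osd client until the watch is torn down.
client = OsdClient()
watch = ObjRequest("header-watch")
client.set_linger(watch)
watch.put()                        # submitter drops its own reference
assert not watch.freed             # still alive: lingering ref remains

client.unregister_linger(watch)    # teardown drops the last reference
assert watch.freed
```

Without the extra `get()` at linger time, the submitter's `put()` would free the request while the osd client still planned to resubmit it, which is the use-after-free the last patch prevents.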
[PATCH 1/4] rbd: unregister linger in watch sync routine
Move the code that unregisters an rbd device's lingering header object watch request into rbd_dev_header_watch_sync(), so it occurs in the same function that originally sets up that request.

Signed-off-by: Alex Elder <el...@inktank.com>
---
 drivers/block/rbd.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 47e5798..363a813 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1730,6 +1730,10 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
 	if (start) {
 		rbd_dev->watch_request = obj_request->osd_req;
 		ceph_osdc_set_request_linger(osdc, rbd_dev->watch_request);
+	} else {
+		ceph_osdc_unregister_linger_request(osdc,
+						rbd_dev->watch_request);
+		rbd_dev->watch_request = NULL;
 	}
 	ret = rbd_obj_request_submit(osdc, obj_request);
 	if (ret)
@@ -4040,12 +4044,6 @@ static void rbd_dev_release(struct device *dev)
 {
 	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);

-	if (rbd_dev->watch_request) {
-		struct ceph_client *client = rbd_dev->rbd_client->client;
-
-		ceph_osdc_unregister_linger_request(&client->osdc,
-						rbd_dev->watch_request);
-	}
 	if (rbd_dev->watch_event)
 		rbd_dev_header_watch_sync(rbd_dev, 0);
-- 
1.7.9.5
[PATCH 4/4] rbd: don't drop watch requests on completion
The new request code arranges to get a callback for every osd request we submit (this was not the case previously). We register a lingering object watch request on the header object for each mapped rbd image. If a connection problem occurs, the osd client will re-submit lingering requests, and each time such a request is re-submitted its callback function will get called again. We therefore need to ensure the object request associated with the lingering osd request stays valid, and the way to do that is to hold an extra reference to it. So when a request to initiate a watch has completed, do not drop a reference as one normally would; instead, hold off dropping that reference until the request to tear down the watch is done.

Also, only set the rbd device's watch_request pointer after the watch request has completed successfully, and clear the pointer once the watch has been torn down.

Signed-off-by: Alex Elder <el...@inktank.com>
---
 drivers/block/rbd.c |   31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 340773f..177ba0c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1716,6 +1716,7 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
 						&rbd_dev->watch_event);
 		if (ret < 0)
 			return ret;
+		rbd_assert(rbd_dev->watch_event != NULL);
 	}

 	ret = -ENOMEM;
@@ -1735,32 +1736,44 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
 	if (!obj_request->osd_req)
 		goto out_cancel;

-	if (start) {
+	if (start)
 		ceph_osdc_set_request_linger(osdc, obj_request->osd_req);
-		rbd_dev->watch_request = obj_request;
-	} else {
+	else
 		ceph_osdc_unregister_linger_request(osdc,
 					rbd_dev->watch_request->osd_req);
-		rbd_dev->watch_request = NULL;
-	}

 	ret = rbd_obj_request_submit(osdc, obj_request);
 	if (ret)
 		goto out_cancel;
 	ret = rbd_obj_request_wait(obj_request);
 	if (ret)
 		goto out_cancel;
-
 	ret = obj_request->result;
 	if (ret)
 		goto out_cancel;

-	if (start)
-		goto done;	/* Done if setting up the watch request */
+	/*
+	 * Since a watch request is set to linger the osd client
+	 * will hang onto it in case it needs to be re-sent in the
+	 * event of connection loss.  If we're initiating the watch
+	 * we therefore do *not* want to drop our reference to the
+	 * object request now; we'll effectively transfer ownership
+	 * of it to the osd client instead.  Instead, we'll drop
+	 * that reference when the watch request gets torn down.
+	 */
+	if (start) {
+		rbd_dev->watch_request = obj_request;
+
+		return 0;
+	}
+
+	/* We have successfully torn down the watch request */
+
+	rbd_obj_request_put(rbd_dev->watch_request);
+	rbd_dev->watch_request = NULL;
 out_cancel:
 	/* Cancel the event if we're tearing down, or on error */
 	ceph_osdc_cancel_event(rbd_dev->watch_event);
 	rbd_dev->watch_event = NULL;
-done:
 	if (obj_request)
 		rbd_obj_request_put(obj_request);
-- 
1.7.9.5