Re: page allocation failures on osd nodes

2013-01-26 Thread Andrey Korolyov
On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang sam.l...@inktank.com wrote:
 On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov and...@xdel.ru wrote:
 Sorry, I wrote too little yesterday because I was sleepy. This is
 obviously cache pressure, since dropping the caches made these errors
 disappear for a long period. I'm not very familiar with the kernel's
 memory management, but shouldn't the kernel first try to allocate
 memory on the second node, if that is not prohibited by the process'
 cpuset, and only then report an allocation failure (as far as I can
 see, only node 0 is involved in the failures)? I have no idea where
 NUMA awareness would come into play for the osd daemons.

 Hi Andrey,

 You said that the allocation failure doesn't occur if you flush
 caches, but the kernel should evict pages from the cache as needed so
 that the osd can allocate more memory (unless they're dirty, but it
 doesn't look like you have many dirty pages in this case).  It looks
 like you have plenty of reclaimable pages as well.  Does the osd
 remain running after that error occurs?

Yes, it keeps running flawlessly, without even a bit changing in the
osdmap, but unfortunately logging wasn't turned on at that moment. As
soon as I finish the long test for the ``suicide timeout'' bug, I'll
check your idea with dd and also rerun the test below with
``debug osd = 20''.

My thought is that the kernel has ready-to-be-freed memory on node 1,
but for some strange reason the osd process tries to reserve pages from
node 0 (where its memory was obviously allocated at start, since node
1's memory begins only at high addresses, above 32G), and the kernel
then refuses to free cache on that specific node. It is quite unclear,
at least to me, why the kernel does not simply invalidate some buffers,
even ones that deserve to stay in RAM more than those at the tail of
the LRU.
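
For reference, a rough sketch (not from the original thread) of how the
per-node picture the osd sees can be inspected with libnuma: it prints,
for each NUMA node, the total and free memory and whether the process'
mempolicy/cpuset allows allocations there. The file name and build line
are illustrative only.

/* nodemem.c - hypothetical helper, not part of ceph.
 * Prints which NUMA nodes this process may allocate from and how much
 * memory each node reports free.
 * Build with: gcc -o nodemem nodemem.c -lnuma
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not available on this system\n");
                return 1;
        }

        struct bitmask *allowed = numa_get_mems_allowed();
        int max_node = numa_max_node();

        for (int node = 0; node <= max_node; node++) {
                long long free_bytes = 0;
                long long total = numa_node_size64(node, &free_bytes);

                printf("node %d: total %lld MiB, free %lld MiB, allowed: %s\n",
                       node, total >> 20, free_bytes >> 20,
                       numa_bitmask_isbitset(allowed, node) ? "yes" : "no");
        }
        return 0;
}

If the mask reported for the osd process really does include node 1,
then failures on node 0 alone would point at reclaim behaviour rather
than at a policy restriction.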

Allocation looks like the following on most of the nodes:
MemTotal:   66081396 kB
MemFree:  278216 kB
Buffers:   15040 kB
Cached: 62422368 kB
SwapCached:0 kB
Active:  2063908 kB
Inactive:   60876892 kB
Active(anon): 509784 kB
Inactive(anon):   56 kB
Active(file):1554124 kB
Inactive(file): 60876836 kB

Free memory on the OSD nodes, each running two osd processes; libvirt
reports the per-NUMA-node ``Free'' values:


0: 207500 KiB
1:  72332 KiB

Total: 279832 KiB

0: 208528 KiB
1:  80692 KiB

Total: 289220 KiB

Since the kernel is known to reserve more memory on the node under
higher memory pressure, this seems quite plausible: the osd processes
work mostly with node 0's memory, so the gap there is bigger than on
node 1, which holds almost nothing but fs cache.
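
The per-node reserves referred to here can be compared directly via the
zone watermarks. Below is a rough, generic sketch (not from the thread)
that dumps the min/low/high watermarks for every node from
/proc/zoneinfo; the field names are stable, but the exact layout varies
between kernel versions.

/* zonewatermarks.c - rough sketch: print the per-zone min/low/high
 * watermarks that determine how many pages the kernel keeps in
 * reserve on each NUMA node.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/zoneinfo", "r");
        char line[256];

        if (!f) {
                perror("/proc/zoneinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* Zone headers look like "Node 0, zone   Normal". */
                if (strncmp(line, "Node", 4) == 0) {
                        fputs(line, stdout);
                        continue;
                }
                /* Watermark lines look like "        min      11154" (pages). */
                if (strstr(line, " min ") || strstr(line, " low ") ||
                    strstr(line, " high "))
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}

Roughly speaking, a page allocation failure trace is what appears when
an allocation cannot be satisfied above these watermarks and cannot, or
is not allowed to, reclaim in time.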



 I wonder if you see the same error if you do a long write intensive
 workload on the local disk for the osd in question, maybe dd
 if=/dev/zero of=/data/osd.0/foo

 -sam



 On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hi,

 Those traces happen only under constant heavy writes and seem to be
 very rare. OSD processes do not consume more memory after this event,
 and no peaks are distinguishable in the monitoring. I was able to
 catch it during four hours of constant writes on the cluster.

 http://xdel.ru/downloads/ceph-log/allocation-failure/


Re: RadosGW performance and disk space usage

2013-01-26 Thread Cesar Mello
Dear Sam, Dan and Marcus,

Thank you a lot for the replies. I'll do more tests today.

The length of each object used in my test is just 20 bytes. I'm glad
you got 400 objects/s! If I get that with a length of 8 KB using a
2-node cluster, then ceph with rados will already be faster than my
current solution. And then I will be able to present it to my boss.
:-)

I'll try rest-bench later. Thanks for the help!

Best regards
Mello

On Sat, Jan 26, 2013 at 3:43 AM, Marcus Sorensen shadow...@gmail.com wrote:
 Have you tried rest-bench on localhost at the rados gateway? I was playing
 with the rados gateway in a VM the other day, and was getting up to 400/s on
 4k objects. Above that I was getting connection failures, but I think it was
 just due to a default max connections setting somewhere or something. My VM
 is on SSD though. I was just thinking it may help isolate the issue.


 On Fri, Jan 25, 2013 at 4:14 PM, Sam Lang sam.l...@inktank.com wrote:

 On Thu, Jan 24, 2013 at 9:27 AM, Cesar Mello cme...@gmail.com wrote:
  Hi!
 
  I have successfully prototyped read/write access to ceph from Windows
  using the S3 API, thanks so much for the help.
 
  Now I would like to do some prototypes targeting performance
  evaluation. My scenario typically requires parallel storage of data
  from tens of thousands of loggers, but scalability to hundreds of
  thousands is the main reason for investigating ceph.
 
  My tests on a single laptop running ceph with 2 local OSDs and a
  local radosgw allow writing on average 2.5 small objects per second
  (100 objects in 40 seconds). Is this the expected performance? It
  seems to be I/O bound, because the HDD LED stays on during the
  PutObject requests. Any suggestions or documentation pointers for
  profiling are very much appreciated.

 Hi Mello,

 2.5 objects/sec seems terribly slow, even on your laptop.  How small
 are these objects?  You might try to benchmark without the disk as a
 potential bottleneck, by putting your osd data and journals in /tmp
 (for benchmarking only, of course) or by creating/mounting a tmpfs and
 pointing your osd backends there.

 
  I am afraid the S3 API is not good for my scenario, because there is
  no way to append data to existing objects (so I won't be able to model
  a single object for each data collector). If this is the case, then I
  would need to store billions of small objects. I would like to know
  how much disk space each object instance requires other than the
  object content length.
 
  If the S3 API is not well suited to my scenario, then my effort would
  be better directed toward porting or writing a native ceph client for
  Windows. I just need an API to read and write/append blocks to files.
  Any comments are really appreciated.

 Hopefully someone with more Windows experience will give you better
 info/advice than I can.

 You could try to port the rados API to Windows.  It's purely userspace,
 but it does rely on pthreads and other libc/gcc specifics.  With
 something like Cygwin, a port might not be too hard, though.  If you
 decide to go that route, let us know how you progress!

 -sam
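
For illustration only (not part of the original exchange), here is a
minimal librados client along the lines Sam describes: it writes a
batch of 20-byte objects and reports objects/s. The pool name "data",
the object count, and the default ceph.conf lookup are placeholders;
build with gcc -lrados (add -lrt on older glibc for clock_gettime).

/* smallobjs.c - sketch of a native librados client for small-object
 * write throughput; not a port, just the API surface such a port
 * would need to cover.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <rados/librados.h>

int main(void)
{
        rados_t cluster;
        rados_ioctx_t io;
        const int count = 1000;
        char buf[20];                   /* 20-byte payload, as in the test */
        char oid[64];
        struct timespec t0, t1;

        memset(buf, 'x', sizeof(buf));

        if (rados_create(&cluster, NULL) < 0 ||
            rados_conf_read_file(cluster, NULL) < 0 ||  /* default ceph.conf */
            rados_connect(cluster) < 0) {
                fprintf(stderr, "failed to connect to the cluster\n");
                return 1;
        }
        if (rados_ioctx_create(cluster, "data", &io) < 0) {
                fprintf(stderr, "failed to open pool\n");
                rados_shutdown(cluster);
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < count; i++) {
                snprintf(oid, sizeof(oid), "bench-obj-%d", i);
                if (rados_write_full(io, oid, buf, sizeof(buf)) < 0)
                        fprintf(stderr, "write of %s failed\n", oid);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d objects in %.2f s (%.1f objects/s)\n",
               count, secs, count / secs);

        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
}

Run against the same cluster, this gives a baseline for the native
rados path without the S3/HTTP layer, which should help separate
radosgw overhead from osd/disk limits.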


 
  Thank you a lot for the attention!
 
  Best regards
  Mello


[PATCH 0/4] rbd: keep reference to lingering object requests

2013-01-26 Thread Alex Elder
This series applies on top of the new rbd request code.

When an osd request is marked to linger the osd client will keep
a copy of the request, and will resubmit it if necessary.  If it
gets resubmitted, it will also call the completion routine again,
and because of that we need to make sure the associated object
request structure remains valid.  The last patch in this series
ensures that by taking an extra reference for an object request
set to linger.

-Alex

[PATCH 1/4] rbd: unregister linger in watch sync routine
[PATCH 2/4] rbd: track object rather than osd request for watch
[PATCH 3/4] rbd: decrement obj request count when deleting
[PATCH 4/4] rbd: don't drop watch requests on completion


[PATCH 1/4] rbd: unregister linger in watch sync routine

2013-01-26 Thread Alex Elder
Move the code that unregisters an rbd device's lingering header
object watch request into rbd_dev_header_watch_sync(), so it
occurs in the same function that originally sets up that request.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 47e5798..363a813 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1730,6 +1730,10 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
         if (start) {
                 rbd_dev->watch_request = obj_request->osd_req;
                 ceph_osdc_set_request_linger(osdc, rbd_dev->watch_request);
+        } else {
+                ceph_osdc_unregister_linger_request(osdc,
+                                        rbd_dev->watch_request);
+                rbd_dev->watch_request = NULL;
         }
         ret = rbd_obj_request_submit(osdc, obj_request);
         if (ret)
@@ -4040,12 +4044,6 @@ static void rbd_dev_release(struct device *dev)
 {
         struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
 
-        if (rbd_dev->watch_request) {
-                struct ceph_client *client = rbd_dev->rbd_client->client;
-
-                ceph_osdc_unregister_linger_request(client->osdc,
-                                        rbd_dev->watch_request);
-        }
         if (rbd_dev->watch_event)
                 rbd_dev_header_watch_sync(rbd_dev, 0);
 
-- 
1.7.9.5



[PATCH 4/4] rbd: don't drop watch requests on completion

2013-01-26 Thread Alex Elder
The new request code arranges to get a callback for every osd
request we submit (this was not the case previously).

We register a lingering object watch request for the header object
for each mapped rbd image.

If a connection problem occurs, the osd client will re-submit
lingering requests.  And each time such a request is re-submitted,
its callback function will get called again.

We therefore need to ensure the object request associated with the
lingering osd request stays valid, and the way to do that is to have
an extra reference to the lingering osd request.

So when a request to initiate a watch has completed, do not drop a
reference as one normally would.  Instead, hold off dropping that
reference until the request to tear down that watch request is done.

Also, only set the rbd device's watch_request pointer after the
watch request has been completed successfully, and clear the
pointer once it's been torn down.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   31 ++++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 340773f..177ba0c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1716,6 +1716,7 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
                                                 &rbd_dev->watch_event);
                 if (ret < 0)
                         return ret;
+                rbd_assert(rbd_dev->watch_event != NULL);
         }
 
         ret = -ENOMEM;
@@ -1735,32 +1736,44 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
         if (!obj_request->osd_req)
                 goto out_cancel;
 
-        if (start) {
+        if (start)
                 ceph_osdc_set_request_linger(osdc, obj_request->osd_req);
-                rbd_dev->watch_request = obj_request;
-        } else {
+        else
                 ceph_osdc_unregister_linger_request(osdc,
                                         rbd_dev->watch_request->osd_req);
-                rbd_dev->watch_request = NULL;
-        }
         ret = rbd_obj_request_submit(osdc, obj_request);
         if (ret)
                 goto out_cancel;
         ret = rbd_obj_request_wait(obj_request);
         if (ret)
                 goto out_cancel;
-
         ret = obj_request->result;
         if (ret)
                 goto out_cancel;
 
-        if (start)
-                goto done;      /* Done if setting up the watch request */
+        /*
+         * Since a watch request is set to linger the osd client
+         * will hang onto it in case it needs to be re-sent in the
+         * event of connection loss.  If we're initiating the watch
+         * we therefore do *not* want to drop our reference to the
+         * object request now; we'll effectively transfer ownership
+         * of it to the osd client instead.  Instead, we'll drop
+         * that reference when the watch request gets torn down.
+         */
+        if (start) {
+                rbd_dev->watch_request = obj_request;
+
+                return 0;
+        }
+
+        /* We have successfully torn down the watch request */
+
+        rbd_obj_request_put(rbd_dev->watch_request);
+        rbd_dev->watch_request = NULL;
 out_cancel:
         /* Cancel the event if we're tearing down, or on error */
         ceph_osdc_cancel_event(rbd_dev->watch_event);
         rbd_dev->watch_event = NULL;
-done:
         if (obj_request)
                 rbd_obj_request_put(obj_request);
 
-- 
1.7.9.5
