Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Priebe - Profihost AG
Not sure about off_t. What are its minimum and maximum sizes?
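
For reference, a quick way to check what off_t actually is on a given build
host (a minimal sketch, not part of the patch; the temp path is arbitrary):

cat > /tmp/off_t_check.c <<'EOF'
#include <sys/types.h>
#include <stdio.h>
/* off_t is 8 bytes when _FILE_OFFSET_BITS=64 is defined (QEMU's configure
   normally does this), but it can be 4 bytes on 32-bit hosts without it. */
int main(void)
{
    printf("sizeof(off_t) = %zu bytes\n", sizeof(off_t));
    return 0;
}
EOF
gcc -o /tmp/off_t_check /tmp/off_t_check.c && /tmp/off_t_check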

Stefan

On 21.11.2012 at 18:03, Stefan Weil wrote:

> On 20.11.2012 13:44, Stefan Priebe wrote:
>> rbd / rados quite often returns the length of writes
>> or discarded blocks. These values might be bigger than an int.
>> 
>> Signed-off-by: Stefan Priebe 
>> ---
>>  block/rbd.c |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/block/rbd.c b/block/rbd.c
>> index f57d0c6..6bf9c2e 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -69,7 +69,7 @@ typedef enum {
>>  typedef struct RBDAIOCB {
>>  BlockDriverAIOCB common;
>>  QEMUBH *bh;
>> -int ret;
>> +int64_t ret;
>>  QEMUIOVector *qiov;
>>  char *bounce;
>>  RBDAIOCmd cmd;
>> @@ -87,7 +87,7 @@ typedef struct RADOSCB {
>>  int done;
>>  int64_t size;
>>  char *buf;
>> -int ret;
>> +int64_t ret;
>>  } RADOSCB;
>>
>>   #define RBD_FD_READ 0
> 
> 
> Why do you use int64_t instead of off_t?
> If the value is related to file sizes, off_t would be a good choice.
> 
> Stefan W.
> 
> 


Re: is it still not recommended to place rbd devices on nodes where osd daemons are located?

2012-11-21 Thread Dan Mick
Still not certain I'm understanding *just* what you mean, but I'll point 
out that you can set up a cluster with rbd images, mount them from a 
separate non-virtualized host with kernel rbd, and expand those images 
and take advantage of the newly-available space on the separate host, 
just as though you were expanding a RAID device.  Maybe that fits your 
use case, Ruslan?
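
For example, with a kernel-mapped image the expansion looks roughly like this
(image name, target size and filesystem are assumptions, not from this thread):

rbd resize --size 20480 rbd/myimage      # grow the image to 20 GB
# on the host where the image is mapped with kernel rbd:
blockdev --getsize64 /dev/rbd0           # check that the new size is visible
resize2fs /dev/rbd0                      # grow the filesystem (ext4 example)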


On 11/21/2012 12:05 PM, ruslan usifov wrote:

Yes, I mean exactly this. It's a great pity :-( Maybe there is some Ceph
equivalent that solves my problem?

2012/11/21 Gregory Farnum :

On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov  wrote:

So, is it not possible to use Ceph as a scalable block device without virtualization?


I'm not sure I understand, but if you're trying to take a bunch of
compute nodes and glue their disks together, no, that's not a
supported use case at this time. There are a number of deadlock issues
caused by this sort of loopback; it's the same reason you shouldn't
mount NFS on the server host.
We may in the future manage to release an rbd-fuse client that you can
use to do this with a little less pain, but it's not ready at this
point.
-Greg



Re: is it still not recommended to place rbd devices on nodes where osd daemons are located?

2012-11-21 Thread ruslan usifov
Yes, I mean exactly this. It's a great pity :-( Maybe there is some Ceph
equivalent that solves my problem?

2012/11/21 Gregory Farnum :
> On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov  
> wrote:
>> So, is it not possible to use Ceph as a scalable block device without virtualization?
>
> I'm not sure I understand, but if you're trying to take a bunch of
> compute nodes and glue their disks together, no, that's not a
> supported use case at this time. There are a number of deadlock issues
> caused by this sort of loopback; it's the same reason you shouldn't
> mount NFS on the server host.
> We may in the future manage to release an rbd-fuse client that you can
> use to do this with a little less pain, but it's not ready at this
> point.
> -Greg


Re: is it still not recommended to place rbd devices on nodes where osd daemons are located?

2012-11-21 Thread Gregory Farnum
On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov  wrote:
> So, is it not possible to use Ceph as a scalable block device without virtualization?

I'm not sure I understand, but if you're trying to take a bunch of
compute nodes and glue their disks together, no, that's not a
supported use case at this time. There are a number of deadlock issues
caused by this sort of loopback; it's the same reason you shouldn't
mount NFS on the server host.
We may in the future manage to release an rbd-fuse client that you can
use to do this with a little less pain, but it's not ready at this
point.
-Greg


Re: Files lost after mds rebuild

2012-11-21 Thread Gregory Farnum
On Tue, Nov 20, 2012 at 8:28 PM, Drunkard Zhang  wrote:
> 2012/11/21 Gregory Farnum :
>> No, absolutely not. There is no relationship between different RADOS
>> pools. If you've been using the cephfs tool to place some filesystem
>> data in different pools then your configuration is a little more
>> complicated (have you done that?), but deleting one pool is never
>> going to remove data from the others.
>> -Greg
>>
> I think this might be a bug. Here's what I did:
> I created a directory 'audit' in a running ceph filesystem, and put
> some data into the directory (about 100GB) before these commands:
> ceph osd pool create audit
> ceph mds add_data_pool 4
> cephfs /mnt/temp/audit/ set_layout -p 4
>
> log3 ~ # ceph osd dump | grep audit
> pool 4 'audit' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
> 8 pgp_num 8 last_change 1558 owner 0
>
> At this point, all data in audit was still usable. After 'ceph osd pool
> delete data', the disk space was reclaimed (I forgot to test whether the data
> was still usable); only 200MB used, according to 'ceph -s'. So here's what I'm
> thinking: the data stored before the pool was created doesn't follow the pool,
> it still follows the default pool 'data'. Is this a bug, or intended
> behavior?

Oh, I see. Data is not moved when you set directory layouts; it only
impacts files created after that point. This is intended behavior —
Ceph would need to copy the data around anyway in order to make it
follow the pool. There's no sense in hiding that from the user,
especially given the complexity involved in doing so safely —
especially when there are many use cases where you want the files in
different pools.
-Greg
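
To connect this back to the commands quoted above: the layout only applies to
newly created files, so existing files have to be rewritten by a client if you
want them in the new pool. A hedged sketch (pool id 4 and the mount point
follow the example above; the copy-and-rename loop is just one way to force a
rewrite and only handles regular files):

ceph osd pool create audit
ceph mds add_data_pool 4
cephfs /mnt/temp/audit/ set_layout -p 4
# files created from now on land in pool 'audit'; pre-existing files stay in
# the default 'data' pool until they are rewritten, e.g.:
for f in /mnt/temp/audit/*; do
    cp -a "$f" "$f.migrate" && mv "$f.migrate" "$f"
done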


Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
With 8 successful installs already done, I'm reasonably confident that
it's patch #50.  I'm making another build which applies all patches
from the 3.5 backport branch, excluding that specific one.  I'll let
you know if that turns up any unexpected failures.

What will the potential fall out be for removing that specific patch?
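
For reference, the build being described is essentially the following (a
sketch; the kernel tree and patch directory paths are assumptions):

cd linux-3.5.7
for p in ../backport-3.5/*.patch; do
    case "$p" in
        # apply the whole series except patch #50
        *50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch)
            echo "skipping $p"; continue ;;
    esac
    patch -p1 < "$p" || { echo "FAILED: $p"; break; }
done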


On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  wrote:
> It's really looking like it's the
> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>  So far I have gone through 4 successful installs with no hang with
> only 1-49 applied.  I'm still leaving my test run to make sure it's
> not a fluke, but since previously it hangs within the first couple of
> builds, it really looks like this is where the problem originated.
>
> 1-libceph_eliminate_connection_state_DEAD.patch
> 2-libceph_kill_bad_proto_ceph_connection_op.patch
> 3-libceph_rename_socket_callbacks.patch
> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
> 6-libceph_start_separating_connection_flags_from_state.patch
> 7-libceph_start_tracking_connection_socket_state.patch
> 8-libceph_provide_osd_number_when_creating_osd.patch
> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
> 11-libceph_drop_connection_refcounting_for_mon_client.patch
> 12-libceph_init_monitor_connection_when_opening.patch
> 13-libceph_fully_initialize_connection_in_con_init.patch
> 14-libceph_tweak_ceph_alloc_msg.patch
> 15-libceph_have_messages_point_to_their_connection.patch
> 16-libceph_have_messages_take_a_connection_reference.patch
> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
> 19-libceph_fix_overflow_in___decode_pool_names.patch
> 20-libceph_fix_overflow_in_osdmap_decode.patch
> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
> 24-libceph_use_con_get_put_methods.patch
> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
> 26-libceph_encapsulate_out_message_data_setup.patch
> 27-libceph_encapsulate_advancing_msg_page.patch
> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
> 29-libceph_move_init_bio__functions_up.patch
> 30-libceph_move_init_of_bio_iter.patch
> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
> 33-libceph_don_t_change_socket_state_on_sock_event.patch
> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
> 37-libceph_clear_NEGOTIATING_when_done.patch
> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
> 39-libceph_separate_banner_and_connect_writes.patch
> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
> 41-libceph_small_changes_to_messenger.c.patch
> 42-libceph_add_some_fine_ASCII_art.patch
> 43-libceph_set_peer_name_on_con_open_not_init.patch
> 44-libceph_initialize_mon_client_con_only_once.patch
> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
> 46-libceph_initialize_msgpool_message_types.patch
> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
> 48-libceph_report_socket_read_write_error_message.patch
> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>
>
> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil  wrote:
>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>
>> sage


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Weil

On 20.11.2012 13:44, Stefan Priebe wrote:

rbd / rados quite often returns the length of writes
or discarded blocks. These values might be bigger than an int.

Signed-off-by: Stefan Priebe 
---
  block/rbd.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f57d0c6..6bf9c2e 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -69,7 +69,7 @@ typedef enum {
  typedef struct RBDAIOCB {
  BlockDriverAIOCB common;
  QEMUBH *bh;
-int ret;
+int64_t ret;
  QEMUIOVector *qiov;
  char *bounce;
  RBDAIOCmd cmd;
@@ -87,7 +87,7 @@ typedef struct RADOSCB {
  int done;
  int64_t size;
  char *buf;
-int ret;
+int64_t ret;
  } RADOSCB;
  
  #define RBD_FD_READ 0



Why do you use int64_t instead of off_t?
If the value is related to file sizes, off_t would be a good choice.

Stefan W.




Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
It's really looking like it's the
libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
 So far I have gone through 4 successful installs with no hang with
only 1-49 applied.  I'm still leaving my test run to make sure it's
not a fluke, but since previously it hangs within the first couple of
builds, it really looks like this is where the problem originated.

1-libceph_eliminate_connection_state_DEAD.patch
2-libceph_kill_bad_proto_ceph_connection_op.patch
3-libceph_rename_socket_callbacks.patch
4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
6-libceph_start_separating_connection_flags_from_state.patch
7-libceph_start_tracking_connection_socket_state.patch
8-libceph_provide_osd_number_when_creating_osd.patch
9-libceph_set_CLOSED_state_bit_in_con_init.patch
10-libceph_embed_ceph_connection_structure_in_mon_client.patch
11-libceph_drop_connection_refcounting_for_mon_client.patch
12-libceph_init_monitor_connection_when_opening.patch
13-libceph_fully_initialize_connection_in_con_init.patch
14-libceph_tweak_ceph_alloc_msg.patch
15-libceph_have_messages_point_to_their_connection.patch
16-libceph_have_messages_take_a_connection_reference.patch
17-libceph_make_ceph_con_revoke_a_msg_operation.patch
18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
19-libceph_fix_overflow_in___decode_pool_names.patch
20-libceph_fix_overflow_in_osdmap_decode.patch
21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
22-libceph_transition_socket_state_prior_to_actual_connect.patch
23-libceph_fix_NULL_dereference_in_reset_connection.patch
24-libceph_use_con_get_put_methods.patch
25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
26-libceph_encapsulate_out_message_data_setup.patch
27-libceph_encapsulate_advancing_msg_page.patch
28-libceph_don_t_mark_footer_complete_before_it_is.patch
29-libceph_move_init_bio__functions_up.patch
30-libceph_move_init_of_bio_iter.patch
31-libceph_don_t_use_bio_iter_as_a_flag.patch
32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
33-libceph_don_t_change_socket_state_on_sock_event.patch
34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
35-libceph_don_t_touch_con_state_in_con_close_socket.patch
36-libceph_clear_CONNECTING_in_ceph_con_close.patch
37-libceph_clear_NEGOTIATING_when_done.patch
38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
39-libceph_separate_banner_and_connect_writes.patch
40-libceph_distinguish_two_phases_of_connect_sequence.patch
41-libceph_small_changes_to_messenger.c.patch
42-libceph_add_some_fine_ASCII_art.patch
43-libceph_set_peer_name_on_con_open_not_init.patch
44-libceph_initialize_mon_client_con_only_once.patch
45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
46-libceph_initialize_msgpool_message_types.patch
47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
48-libceph_report_socket_read_write_error_message.patch
49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch


On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil  wrote:
> Thanks for hunting this down.  I'm very curious what the culprit is...
>
> sage


Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Sage Weil
On Tue, 20 Nov 2012, Nick Bartos wrote:
> Since I now have a decent script which can reproduce this, I decided
> to re-test with the same 3.5.7 kernel, but just not applying the
> patches from the wip-3.5 branch.  With the patches, I can only go 2
> builds before I run into a hang.  Without the patches, I have gone 9
> consecutive builds (and still going) without seeing the hang.  So it
> seems like a reasonable assumption that the problem was introduced in
> one of those patches.
> 
> We started seeing the problem before applying all the 3.5 patches, so
> it seems like one of these is the culprit:
> 
> 1-libceph-encapsulate-out-message-data-setup.patch
> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> 3-libceph-move-init-of-bio_iter.patch
> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> 8-libceph-protect-ceph_con_open-with-mutex.patch
> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> 11-rbd-set-image-size-when-header-is-updated.patch
> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> 17-libceph-check-for-invalid-mapping.patch
> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> 19-rbd-BUG-on-invalid-layout.patch
> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> 21-ceph-avoid-32-bit-page-index-overflow.patch
> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> 
> I'll start doing some other builds to try and narrow down the patch
> introducing the problem more specifically.

Thanks for hunting this down.  I'm very curious what the culprit is...

sage



> 
> 
> On Tue, Nov 20, 2012 at 1:53 PM, Nick Bartos  wrote:
> > I reproduced the problem and got several sysrq states captured.
> > During this run, the monitor running on the host complained a few
> > times about the clocks being off, but all messages were for under 0.55
> > seconds.
> >
> > Here are the kernel logs.  Note that there are several traces, I
> > thought multiple during the incident may help:
> > https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt
> >
> >
> > On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum  wrote:
> >> Hmm, yep, that param is actually only used for the warning; I guess
> >> we forgot what it actually covers. :(
> >>
> >> Have your monitor clocks been off by more than 5 seconds at any point?
> >>
> >> On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos  wrote:
> >>> Making 'mon clock drift allowed' very small (0.1) does not
> >>> reliably reproduce the hang.  I started looking at the code for 0.48.2
> >>> and it looks like this is only used in Paxos::warn_on_future_time,
> >>> which only handles the warning, nothing else.
> >>>
> >>>
> >>> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil  wrote:
>  On Fri, 16 Nov 2012, Nick Bartos wrote:
> > Should I be lowering the clock drift allowed, or the lease interval to
> > help reproduce it?
> 
>  clock drift allowed.
> 
> 
> 
> >
> > On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil  wrote:
> > > You can safely set the clock drift allowed as high as 500ms.  The real
> > > limitation is that it needs to be well under the lease interval, 
> > > which is
> > > currently 5 seconds by default.
> > >
> > > You might be able to reproduce more easily by lowering the 
> > > threshold...
> > >
> > > sage
> > >
> > >
> > > On Fri, 16 Nov 2012, Nick Bartos wrote:
> > >
> > >> How far off do the clocks need to be before there is a problem?  It
> > >> would seem to be hard to ensure a very large cluster has all of it's
> > >> nodes synchronized within 50ms (which seems to be the default for 
> > >> "mon
> > >> clock drift allowed").  Does the mon clock drift allowed parameter
> > >> change anything other than the log messages?  Are there any other
> > >> tuning options that may help, assuming that this is the issue and 
> > >> it's
> > >> not feasible to get the clocks more than 500ms in sync between all
> > >> nodes?
> > >>
> > >> I'm trying to get a good way of reproducing this and get a trace on
> > >> the ceph processes to see what they're waiting on.  I'll let you know
> > >> when I have more info.
> > >>
> > >>
> > >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil  wrote:
> > >> > I just realized I was mixing up this thread with the other deadlock
> > >> > thread.
> > >> >
> > >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
> > >> >> Turns 
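
For reference, the option being tuned throughout this exchange lives in
ceph.conf. A minimal sketch of lowering it for a reproduction attempt (per the
thread it only moves the warning threshold, and it must stay well under the
5-second lease interval):

cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
    ; lower the clock drift warning threshold (default is 0.05 s);
    ; alternatively edit the existing [mon] section instead of appending
    mon clock drift allowed = 0.1
EOF
# restart the monitors so the new value takes effect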

Re: Hadoop and Ceph client/mds view of modification time

2012-11-21 Thread Noah Watkins
(Sorry for the dupe message. vger rejected due to HTML).

Thanks, I'll try this patch this morning.

Client B should perform a single stat after a notification from Client
A. But, won't Sage's patch still be required, since Client A needs the
MDS time to pass to Client B?

On Tue, Nov 20, 2012 at 12:20 PM, Sam Lang  wrote:
> On 11/20/2012 01:44 PM, Noah Watkins wrote:
>>
>> This is a description of the clock synchronization issue we are facing
>> in Hadoop:
>>
>> Components of Hadoop use mtime as a versioning mechanism. Here is an
>> example where Client B tests the expected 'version' of a file created
>> by Client A:
>>
>>Client A: create file, write data into file.
>>Client A: expected_mtime <-- lstat(file)
>>Client A: broadcast expected_mtime to client B
>>...
>>Client B: mtime <-- lstat(file)
>>Client B: test expected_mtime == mtime
>
>
> Here's a patch that might work to push the setattr out to the mds every time
> (the same as Sage's patch for getattr).  This isn't quite writeback, as it
> waits for the setattr at the server to complete before returning, but I
> think that's actually what you want in this case.  It needs to be enabled by
> setting client setattr writethru = true in the config.  Also, I haven't
> tested that it sends the setattr, just a basic test of functionality.
>
> BTW, if its always client B's first stat of the file, you won't need Sage's
> patch.
>
> -sam
>
> diff --git a/src/client/Client.cc b/src/client/Client.cc
> index 8d4a5ac..a7dd8f7 100644
> --- a/src/client/Client.cc
> +++ b/src/client/Client.cc
> @@ -4165,6 +4165,7 @@ int Client::_getattr(Inode *in, int mask, int uid, int
> gid)
>
>  int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int
> gid)
>  {
> +  int orig_mask = mask;
>int issued = in->caps_issued();
>
>ldout(cct, 10) << "_setattr mask " << mask << " issued " <<
> ccap_string(issued) << dendl;
> @@ -4219,7 +4220,7 @@ int Client::_setattr(Inode *in, struct stat *attr, int
> mask, int uid, int gid)
>mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME);
>  }
>}
> -  if (!mask)
> +  if (!cct->_conf->client_setattr_writethru && !mask)
>  return 0;
>
>MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR);
> @@ -4229,6 +4230,10 @@ int Client::_setattr(Inode *in, struct stat *attr,
> int mask, int uid, int gid)
>req->set_filepath(path);
>req->inode = in;
>
> +  // reset mask back to original if we're meant to do writethru
> +  if (cct->_conf->client_setattr_writethru)
> +mask = orig_mask;
> +
>if (mask & CEPH_SETATTR_MODE) {
>  req->head.args.setattr.mode = attr->st_mode;
>  req->inode_drop |= CEPH_CAP_AUTH_SHARED;
> diff --git a/src/common/config_opts.h b/src/common/config_opts.h
> index cc05095..51a2769 100644
> --- a/src/common/config_opts.h
> +++ b/src/common/config_opts.h
> @@ -178,6 +178,7 @@ OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100)
> // MB * n  (dirty OR tx.
>  OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep
> this smallish)
>  OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0)  // max age in cache
> before writeback
>  OPTION(client_oc_max_objects, OPT_INT, 1000)  // max objects in cache
> +OPTION(client_setattr_writethru, OPT_BOOL, false)  // send the attributes
> to the mds server
>  // note: the max amount of "in flight" dirty data is roughly (max - target)
>  OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ invalidate
> callback to keep page cache consistent
>  OPTION(fuse_big_writes, OPT_BOOL, true)
>
>
>>
>> Since mtime may be set in Ceph by both client and MDS, inconsistent
>> mtime view is possible when clocks are not adequately synchronized.
>>
>> Here is a test that reproduces the problem. In the following output,
>> issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
>> set to an hour earlier than the MDS node.
>>
>> nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
>> Tue Nov 20 11:40:28 PST 2012   // MDS TIME
>> local time: Tue Nov 20 10:42:47 2012  // Client TIME
>> fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
>> creation (MDS time)
>> lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
>> (client time)
>>
>> Here is the code used to produce that output.
>>
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> int main(int argc, char **argv)
>> {
>>  struct stat st;
>>  struct ceph_mount_info *cmount;
>>  struct timeval tv;
>>
>>  /* setup */
>>  ceph_create(&cmount, "admin");
>>  ceph_conf_read_file(cmount,
>> "/users/nwatkins/Projects/ceph.conf");
>>  ceph_mount(cmount, "/");
>>
>>  /* print local time for reference */
>>  gettimeofday(&tv, NULL);
>>  printf("local time: %s", ctime(&tv.tv_sec));
>>
>>  /* create a file */
>>

Re: RBD fio Performance concerns

2012-11-21 Thread Mark Nelson

Responding to my own message. :)

Talked to Sage a bit offline about this.  I think there are two opposing 
forces:


On one hand, random IO may be spreading reads/writes out across more 
OSDs than sequential IO that presumably would be hitting a single OSD 
more regularly.


On the other hand, you'd expect that sequential writes would be getting 
coalesced either at the RBD layer or on the OSD, and that the 
drive/controller/filesystem underneath the OSD would be doing some kind 
of readahead or prefetching.


On the third hand, maybe coalescing/prefetching is in fact happening but 
we are IOP limited by some per-osd limitation.


It could be interesting to do the test with a single OSD and see what 
happens.


Mark

On 11/21/2012 09:52 AM, Mark Nelson wrote:

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are
getting higher performance with random reads/writes vs sequential!  It
would be interesting to see what kind of throughput smalliobench reports
(should be packaged in bobtail) and also see if this behavior happens
with cephfs.  It's still too early in the morning for me right now to
come up with a reasonable explanation for what's going on.  It might be
worth running blktrace and seekwatcher to see what the io patterns on
the underlying disk look like in each case.  Maybe something unexpected
is going on.

Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:

Which iodepth did you use for those benchmarks?



I really don't understand why I can't get more rand read iops with 4K
block ...


Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Best regards.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
 wrote:

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?


rand read 4K : 6000 iops
seq read 4K : 3500 iops
seq read 4M : 31 iops (1 gigabit client bandwidth limit)

rand write 4k: 6000 iops (tmpfs journal)
seq write 4k: 1600 iops
seq write 4M : 31 iops (1 gigabit client bandwidth limit)


I really don't understand why I can't get more rand read iops with 4K
block ...

I tried with a high-end CPU for the client; it doesn't change anything.
But the test cluster uses old 8-core E5420s @ 2.50GHz (CPU is around
15% on the cluster during the read bench).


- Original message -

From: "Sébastien Han" 
To: "Mark Kampe" 
Cc: "Alexandre DERUMIER" , "ceph-devel"

Sent: Monday, 19 November 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :)
@Mark:


If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).


The original benchmark has been performed with 4M block size. And as
you can see I still get more IOPS with rand than seq... I just tried
with 4M without direct I/O, still the same. I can print fio results if
it's needed.


We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.


I know why I use direct I/O. These are synthetic benchmarks; they're far away
from a real-life scenario and from how common applications work. I just
try to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?

Thanks to all of you..


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe 
wrote:

Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object. All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.



That's correct for some of the benchmarks. However even with 4K for
seq, I still get less IOPS. See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
fio 1.59
Starting 4 processes
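
(The output is cut off here. For reference, a job file matching the headers
above would look roughly like the sketch below; the original rbd-bench.fio was
not posted, so the target device is an assumption.)

cat > rbd-bench.fio <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=256
runtime=60
; the RBD-backed block device under test is an assumption
filename=/dev/vdb

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
EOF
fio rbd-bench.fio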

Re: RBD fio Performance concerns

2012-11-21 Thread Mark Nelson

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are 
getting higher performance with random reads/writes vs sequential!  It 
would be interesting to see what kind of throughput smalliobench reports 
(should be packaged in bobtail) and also see if this behavior happens 
with cephfs.  It's still too early in the morning for me right now to 
come up with a reasonable explanation for what's going on.  It might be 
worth running blktrace and seekwatcher to see what the io patterns on 
the underlying disk look like in each case.  Maybe something unexpected 
is going on.
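
A sketch of that blktrace/seekwatcher run, against the disk backing one OSD
(device and output names are assumptions):

# trace the OSD's disk while the benchmark runs
blktrace -d /dev/sdb -o osd0 -w 60 &
fio rbd-bench.fio
wait
# turn the trace into a seek/IO-pattern graph
seekwatcher -t osd0 -o osd0-io-pattern.png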


Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:

Which iodepth did you use for those benchmarks?



I really don't understand why I can't get more rand read iops with 4K block ...


Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Best regards.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER  wrote:

@Alexandre: is it the same for you? or do you always get more IOPS with seq?


rand read 4K : 6000 iops
seq read 4K : 3500 iops
seq read 4M : 31 iops (1 gigabit client bandwidth limit)

rand write 4k: 6000 iops (tmpfs journal)
seq write 4k: 1600 iops
seq write 4M : 31 iops (1 gigabit client bandwidth limit)


I really don't understand why I can't get more rand read iops with 4K block ...

I tried with a high-end CPU for the client; it doesn't change anything.
But the test cluster uses old 8-core E5420s @ 2.50GHz (CPU is around 15% on
the cluster during the read bench).


- Original message -

From: "Sébastien Han" 
To: "Mark Kampe" 
Cc: "Alexandre DERUMIER" , "ceph-devel"

Sent: Monday, 19 November 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :)
@Mark:


If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).


The original benchmark has been performed with 4M block size. And as
you can see I still get more IOPS with rand than seq... I just tried
with 4M without direct I/O, still the same. I can print fio results if
it's needed.


We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.


I know why I use direct I/O. These are synthetic benchmarks; they're far away
from a real-life scenario and from how common applications work. I just
try to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS with seq?

Thanks to all of you..


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe  wrote:

Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object. All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.



That's correct for some of the benchmarks. However even with 4K for
seq, I still get less IOPS. See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
stdev=6239.06
cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%

issued r/w/d: total=200473/0/0, short=0/0/0

lat (msec): 

Re: RBD Backup

2012-11-21 Thread Wido den Hollander

Hi,

On 11/21/2012 09:56 PM, Stefan Priebe - Profihost AG wrote:

Hi Wido,

thanks for all your explanations.

This doesn't seem to work:

rbd export --snap BACKUP  


rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /vm-101-disk-1.img
rbd: error setting snapshot context: (2) No such file or directory

Or should I still create and delete a snapshot named BACKUP before doing
this?



Yes, you should create the snapshot first before exporting it. Export 
does not create the snapshot for you.
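
Putting the pieces of this thread together, a full backup cycle would look
roughly like this (pool/image names follow the examples above; the guest-side
sync is best effort and the target path is an assumption):

ssh vm-101 sync      # flush buffers inside the guest just before the snapshot
rbd snap create --snap BACKUP kvmpool1/vm-101-disk-1
rbd export --snap BACKUP kvmpool1/vm-101-disk-1 /backup/vm-101-disk-1.img
rbd snap rm --snap BACKUP kvmpool1/vm-101-disk-1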


Wido


Greets,
Stefan


Re: how to create snapshots

2012-11-21 Thread Wido den Hollander

Hi,

On 11/21/2012 10:07 PM, Stefan Priebe - Profihost AG wrote:

Hello list,

I tried to create a snapshot of my disk vm-113-disk-1:

[: ~]# rbd -p kvmpool1 ls
vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create BACKUP vm-113-disk-1
rbd: extraneous parameter vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create vm-113-disk-1 BACKUP
rbd: extraneous parameter BACKUP

What's wrong here?


Use:

$ rbd -p kvmpool1 snap create --image vm-113-disk-1 BACKUP

"rbd -h" also tells:

,  are [pool/]name[@snap], or you may specify
individual pieces of names with -p/--pool, --image, and/or --snap.

Never tried it, but you might be able to use:

$ rbd -p kvmpool1 snap create vm-113-disk-1@BACKUP

I don't have access to a running Ceph cluster now to verify this.
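
If that form works on your version, the whole sequence with a quick
verification step would be (a sketch):

$ rbd -p kvmpool1 snap create vm-113-disk-1@BACKUP
$ rbd -p kvmpool1 snap ls vm-113-disk-1     # should list the BACKUP snapshot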

Wido



Stefan


how to create snapshots

2012-11-21 Thread Stefan Priebe - Profihost AG

Hello list,

I tried to create a snapshot of my disk vm-113-disk-1:

[: ~]# rbd -p kvmpool1 ls
vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create BACKUP vm-113-disk-1
rbd: extraneous parameter vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create vm-113-disk-1 BACKUP
rbd: extraneous parameter BACKUP

What's wrong here?

Stefan


Re: RBD Backup

2012-11-21 Thread Stefan Priebe - Profihost AG

Hi Wido,

thanks for all your explanations.

This doesn't seem to work:

rbd export --snap BACKUP  


rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /vm-101-disk-1.img 


rbd: error setting snapshot context: (2) No such file or directory

Or should I still create and delete a snapshot named BACKUP before doing
this?


Greets,
Stefan


Re: RBD Backup

2012-11-21 Thread Wido den Hollander

Hi,

On 11/21/2012 09:37 PM, Stefan Priebe - Profihost AG wrote:

Hello list,

is there a recommended way to back up rbd images / disks?

Or is it just
rbd snap create BACKUP
rbd export BACKUP


You should use:

rbd export --snap BACKUP  


rbd snap rm BACKUP

Is the snap needed at all? Or is an export safe on its own? Is there a way to make
sure the image is consistent?



While reading rbd.cc it doesn't seem like running export on a running VM 
is safe, so you should snapshot before.


The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.


To make it consistent you have to run "sync" (in the VM) just prior to
creating the snapshot.



Is it possible to use the BACKUP file as a loop device or something else
so that I'm able to mount the partitions from the backup file?



You can do something like:

rbd export --snap BACKUP image1 /mnt/backup/image1.img
losetup /mnt/backup/image1.img
kpartx -a /dev/loop0

Now you will have the partitions from the RBD image available in 
/dev/mapper/loop0pX
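
To mount, inspect and clean up afterwards (the mount point is an assumption):

mount -o ro /dev/mapper/loop0p1 /mnt/restore
# ... copy out whatever is needed ...
umount /mnt/restore
kpartx -d /dev/loop0
losetup -d /dev/loop0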


Wido


Thanks!

Greets Stefan


RBD Backup

2012-11-21 Thread Stefan Priebe - Profihost AG

Hello list,

is there a recommended way to back up rbd images / disks?

Or is it just
rbd snap create BACKUP
rbd export BACKUP
rbd snap rm BACKUP

Is the snap needed at all? Or is an export safe on its own? Is there a way to make
sure the image is consistent?


Is it possible to use the BACKUP file as a loop device or something else
so that I'm able to mount the partitions from the backup file?


Thanks!

Greets Stefan


'zombie snapshot' problem

2012-11-21 Thread Andrey Korolyov
Hi,

Somehow I have managed to produce an unkillable snapshot, which can neither
be removed itself nor allows removing its parent image:

$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.
$ rbd rm dev-rack0/vm2
2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots
- not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge'
before the image can be removed.
$ rbd snap ls dev-rack0/vm2
SNAPID NAME   SIZE
   188 vm2.snap-yxf 16384 MB
$ rbd info dev-rack0/vm2
rbd image 'vm2':
size 16384 MB in 4096 objects
order 22 (4096 KB objects)
block_name_prefix: rbd_data.1fa164c960874
format: 2
features: layering
$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to remove snapshot: (2) No such file or directory
$ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to create snapshot: (17) File exists
$ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
Rolling back to snapshot: 100% complete...done.
$ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
$ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2


Meanwhile, ``rbd ls -l dev-rack0'' segfaults; the log is attached.
Is there any reliable way to kill the problematic snap?


[Attachment: log-crash.txt.gz (GNU zip compressed data)]


Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible

2012-11-21 Thread Danny Al-Gaaf
Hi,

No, I have it basically ready, but I have to run some tests first.
You'll have it in the next few days!

Danny

On 21.11.2012 01:23, Sage Weil wrote:
> If you haven't gotten to this yet, I'll go ahead and jump on it..
> let me know!
> 
> Thanks- sage
> 
> 
> On Thu, 9 Aug 2012, Danny Kukawka wrote:
> 
>> Remove btrfs specific keys and replace them by more generic keys
>> to be able to replace btrfs with e.g. xfs or ext4 easily.
>> 
>> Add new key to define the osd fs type: 'fstype', which can get 
>> defined in the [osd] section for all OSDs.
>> 
>> Replace: - 'btrfs devs' -> 'devs' - 'btrfs path' -> 'fs path' -
>> 'btrfs options' -> 'fs options' - mkcephfs: replace --mkbtrfs
>> with --mkfs - init-ceph: replace --btrfs with --fsmount,
>> --nobtrfs with --nofsmount, --btrfsumount with --fsumount
>> 
>> Update documentation, manpage and example config files.
>> 
>> Signed-off-by: Danny Kukawka 
>> ---
>>  doc/man/8/mkcephfs.rst  |   17 +++-
>>  man/mkcephfs.8  |   15 +++
>>  src/ceph.conf.twoosds   |7 ++--
>>  src/init-ceph.in|   50 +-
>>  src/mkcephfs.in |   60 +--
>>  src/sample.ceph.conf |   15 ---
>>  src/test/cli/osdmaptool/ceph.conf.withracks |3 +-
>>  7 files changed, 95 insertions(+), 72 deletions(-)
>> 
>> diff --git a/doc/man/8/mkcephfs.rst b/doc/man/8/mkcephfs.rst 
>> index ddc378a..dd3fbd5 100644 --- a/doc/man/8/mkcephfs.rst +++
>> b/doc/man/8/mkcephfs.rst @@ -70,20 +70,15 @@ Options default is
>> ``/etc/ceph/keyring`` (or whatever is specified in the config
>> file).
>> 
>> -.. option:: --mkbtrfs +.. option:: --mkfs
>> 
>> -   Create and mount the any btrfs file systems specified in the 
>> -   ceph.conf for OSD data storage using mkfs.btrfs. The "btrfs
>> devs" -   and (if it differs from "osd data") "btrfs path"
>> options must be -   defined. +   Create and mount any file system
>> specified in the ceph.conf for +   OSD data storage using mkfs.
>> The "devs" and (if it differs from +   "osd data") "fs path"
>> options must be defined.
>> 
>> **NOTE** Btrfs is still considered experimental.  This option -
>> can ease some configuration pain, but is the use of btrfs is not 
>> -   required when ``osd data`` directories are mounted manually
>> by the -   adminstrator. - -   **NOTE** This option is deprecated
>> and will be removed in a future -   release. +   can ease some
>> configuration pain, but is not required when +   ``osd data``
>> directories are mounted manually by the adminstrator.
>> 
>> .. option:: --no-copy-conf
>> 
>> diff --git a/man/mkcephfs.8 b/man/mkcephfs.8 index
>> 8544a01..22a5335 100644 --- a/man/mkcephfs.8 +++
>> b/man/mkcephfs.8 @@ -32,7 +32,7 @@ level margin:
>> \\n[rst2man-indent\\n[rst2man-indent-level]] . .SH SYNOPSIS .nf 
>> -\fBmkcephfs\fP [ \-c \fIceph.conf\fP ] [ \-\-mkbtrfs ] [ \-a,
>> \-\-all\-hosts [ \-k +\fBmkcephfs\fP [ \-c \fIceph.conf\fP ] [
>> \-\-mkfs ] [ \-a, \-\-all\-hosts [ \-k 
>> \fI/path/to/admin.keyring\fP ] ] .fi .sp @@ -111,19 +111,16 @@
>> config file). .UNINDENT .INDENT 0.0 .TP -.B \-\-mkbtrfs -Create
>> and mount the any btrfs file systems specified in the -ceph.conf
>> for OSD data storage using mkfs.btrfs. The "btrfs devs" -and (if
>> it differs from "osd data") "btrfs path" options must be +.B
>> \-\-mkfs +Create and mount any file systems specified in the 
>> +ceph.conf for OSD data storage using mkfs.*. The "devs" +and (if
>> it differs from "osd data") "fs path" options must be defined. 
>> .sp \fBNOTE\fP Btrfs is still considered experimental.  This
>> option -can ease some configuration pain, but is the use of btrfs
>> is not +can ease some configuration pain, but the use of this
>> option is not required when \fBosd data\fP directories are
>> mounted manually by the adminstrator. -.sp -\fBNOTE\fP This
>> option is deprecated and will be removed in a future -release. 
>> .UNINDENT .INDENT 0.0 .TP diff --git a/src/ceph.conf.twoosds
>> b/src/ceph.conf.twoosds index c0cfc68..05ca754 100644 ---
>> a/src/ceph.conf.twoosds +++ b/src/ceph.conf.twoosds @@ -67,7
>> +67,8 @@ debug journal = 20 log dir = /data/cosd$id osd data =
>> /mnt/osd$id -btrfs options = "flushoncommit,usertrans" + fs
>> options = "flushoncommit,usertrans" +fstype = btrfs ;user =
>> root
>> 
>> ;osd journal = /mnt/osd$id/journal @@ -75,8 +76,8 @@ osd journal
>> = "/dev/disk/by-path/pci-:05:02.0-scsi-6:0:0:0" ;filestore
>> max sync interval = 1
>> 
>> -btrfs devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0" 
>> -;   btrfs devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0
>> \ +  devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0" +;
>> devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0 \ ;
>> /dev/disk/by-path/pci-:05:01.0-scsi-3:0:0:0 \ ;
>> /dev/disk/by-path/pci-:05:01.0-scsi-4:0:0:0 \ ;
>> /dev/disk/by-path/pci-:05:0
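
For reference, with this patch applied an [osd] section would be shaped
roughly like the sketch below (reconstructed from the ceph.conf.twoosds hunk
above; the device path is a placeholder):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    osd data = /mnt/osd$id
    fstype = btrfs
    fs options = "flushoncommit,usertrans"

[osd.0]
    devs = /dev/sdb
EOF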

Re: [Openstack] Ceph + Nova

2012-11-21 Thread Sébastien Han
Hi,

I don't think this is the best place to ask your question, since it's not
directly related to OpenStack but more about Ceph; I have put the ceph ML
in c/c. Anyway, CephFS is not ready for production yet, though I have
heard that some people use it. People from Inktank (the company behind
Ceph) don't recommend it; AFAIR they expect something more production-
ready for Q2 2013. You can use it (I did, for testing purposes), but
it's at your own risk.
Besides this, RBD and RADOS are robust and stable now, so you can go
with the Cinder and Glance integration without any problems.

Cheers!
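
As a rough sketch of what that integration involves on the Folsom-era
releases (option names should be double-checked against the Ceph and
OpenStack docs; pool and user names are assumptions):

# glance-api.conf
default_store = rbd
rbd_store_pool = images
rbd_store_user = images

# cinder.conf
volume_driver = cinder.volume.driver.RBDDriver
rbd_pool = volumes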

On Wed, Nov 21, 2012 at 9:37 AM, JuanFra Rodríguez Cardoso
 wrote:
> Hi everyone:
>
> I'd like to know your opinion as nova experts:
>
> Would you recommend CephFS as shared storage in /var/lib/nova/instances?
> Another option would be to use GlusterFS or MooseFS for the
> /var/lib/nova/instances directory and Ceph RBD for Glance and Nova volumes,
> don't you think?
>
> Thanks for your attention.
>
> Best regards,
> JuanFra
>


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 09:33:08AM +0100, Stefan Priebe - Profihost AG wrote:
> On 21.11.2012 09:26, Stefan Hajnoczi wrote:
> >On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
> >>On 21.11.2012 07:41, Stefan Hajnoczi wrote:
> >QEMU is currently in hard freeze and only critical patches should go in.
> >Providing steps to reproduce the bug helps me decide that this patch
> >should still be merged for QEMU 1.3-rc1.
> >
> >Anyway, the patch is straightforward, I have applied it to my block tree
> >and it will be in QEMU 1.3-rc1:
> >https://github.com/stefanha/qemu/commits/block
> 
> Thanks!
> 
> The steps to reproduce are:
> mkfs.xfs -f a whole device bigger than int in bytes; mkfs.xfs sends
> a discard. It is important that you use scsi-hd and set
> discard_granularity=512; otherwise rbd disables discard support.

Excellent, thanks!  I will add it to the commit description.

> Might you have a look at my other rbd fix too? It fixes a race
> between task cancellation and writes. The same race was fixed in
> iscsi this summer.

Yes.

Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Priebe - Profihost AG

On 21.11.2012 09:26, Stefan Hajnoczi wrote:

On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:

On 21.11.2012 07:41, Stefan Hajnoczi wrote:

We're going in circles here.  I know the types are wrong in the code and
your patch fixes it, that's why I said it looks good in my first reply.


Sorry, I'm not so familiar with processes like these.



QEMU is currently in hard freeze and only critical patches should go in.
Providing steps to reproduce the bug helps me decide that this patch
should still be merged for QEMU 1.3-rc1.

Anyway, the patch is straightforward, I have applied it to my block tree
and it will be in QEMU 1.3-rc1:
https://github.com/stefanha/qemu/commits/block


Thanks!

The steps to reproduce are:
mkfs.xfs -f a whole device bigger than int in bytes; mkfs.xfs sends a
discard. It is important that you use scsi-hd and set
discard_granularity=512; otherwise rbd disables discard support.
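
A sketch of that setup (controller/drive layout and the multi-terabyte image
size are assumptions; the point is that a whole-device mkfs.xfs issues a
discard larger than 2^31 bytes):

qemu-system-x86_64 -m 1024 \
    -drive format=rbd,file=rbd:kvmpool1/vm-101-disk-1,if=none,id=drive0,cache=writeback \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,bus=scsi0.0,drive=drive0,discard_granularity=512
# inside the guest (device node is an assumption):
mkfs.xfs -f /dev/sda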


Might you have a look at my other rbd fix too? It fixes a race between 
task cancellation and writes. The same race was fixed in iscsi this summer.


Greets,
Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
> On 21.11.2012 07:41, Stefan Hajnoczi wrote:
> >On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe  wrote:
> >>Hi Stefan,
> >>
> >>On 20.11.2012 17:29, Stefan Hajnoczi wrote:
> >>
> >>>On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
> 
> rbd / rados quite often returns the length of writes
> or discarded blocks. These values might be bigger than an int.
> 
> Signed-off-by: Stefan Priebe 
> ---
>    block/rbd.c |4 ++--
>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>>
> >>>Looks good but I want to check whether this fixes an bug you've hit?
> >>>Please indicate details of the bug and how to reproduce it in the commit
> >>>message.
> >>
> >>
> >>You get various I/O errors in the client, since negative return values indicate
> >>I/O errors. When a big positive value is returned by librbd, block/rbd tries
> >>to store it in acb->ret, which is an int. Then it wraps around and becomes
> >>negative. After that block/rbd thinks this is an I/O error and reports it
> >>to the guest.
> >
> >It's still not clear whether this is a bug that you can reproduce.
> >After all, the ret value would have to be >2^31 which is a 2+ GB
> >request!
> Yes, and that is exactly what happens.
> 
> Look here:
>     if (acb->cmd == RBD_AIO_WRITE || acb->cmd == RBD_AIO_DISCARD) {
>         if (r < 0) {
>             acb->ret = r;
>             acb->error = 1;
>         } else if (!acb->error) {
>             acb->ret = rcb->size;
>         }
> 
> It sets acb->ret to rcb->size. But the size of a DISCARD, if you
> DISCARD a whole device, might be 500GB or today even several TB.
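
(A tiny standalone illustration of that wrap-around, outside of QEMU; on
common two's-complement builds the truncated value goes negative and is then
treated as an I/O error:)

cat > /tmp/wrap.c <<'EOF'
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    int64_t size = 3LL * 1024 * 1024 * 1024;  /* a 3 GB discard, > 2^31 bytes */
    int ret = (int)size;                      /* what the old 'int ret' field stored */
    printf("size=%lld  truncated ret=%d\n", (long long)size, ret);
    return 0;
}
EOF
gcc -o /tmp/wrap /tmp/wrap.c && /tmp/wrap
# prints: size=3221225472  truncated ret=-1073741824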

We're going in circles here.  I know the types are wrong in the code and
your patch fixes it, that's why I said it looks good in my first reply.

QEMU is currently in hard freeze and only critical patches should go in.
Providing steps to reproduce the bug helps me decide that this patch
should still be merged for QEMU 1.3-rc1.

Anyway, the patch is straightforward, I have applied it to my block tree
and it will be in QEMU 1.3-rc1:
https://github.com/stefanha/qemu/commits/block

Stefan