Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
Not sure about off_t. What are its min and max sizes?

Stefan

On 21.11.2012 18:03, Stefan Weil wrote:
> On 20.11.2012 13:44, Stefan Priebe wrote:
>> rbd / rados tends to return the length of writes or discarded blocks
>> quite often. These values might be bigger than int.
>>
>> Signed-off-by: Stefan Priebe
>> ---
>>  block/rbd.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index f57d0c6..6bf9c2e 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -69,7 +69,7 @@ typedef enum {
>>  typedef struct RBDAIOCB {
>>      BlockDriverAIOCB common;
>>      QEMUBH *bh;
>> -    int ret;
>> +    int64_t ret;
>>      QEMUIOVector *qiov;
>>      char *bounce;
>>      RBDAIOCmd cmd;
>> @@ -87,7 +87,7 @@ typedef struct RADOSCB {
>>      int done;
>>      int64_t size;
>>      char *buf;
>> -    int ret;
>> +    int64_t ret;
>>  } RADOSCB;
>>
>>  #define RBD_FD_READ 0
>
> Why do you use int64_t instead of off_t?
> If the value is related to file sizes, off_t would be a good choice.
>
> Stefan W.
Re: is it still not recommended to place rbd devices on nodes where osd daemons are located?
Still not certain I'm understanding *just* what you mean, but I'll point out that you can set up a cluster with rbd images, mount them from a separate non-virtualized host with kernel rbd, and expand those images and take advantage of the newly available space on that separate host, just as though you were expanding a RAID device. Maybe that fits your use case, Ruslan?

On 11/21/2012 12:05 PM, ruslan usifov wrote:
> Yes, I mean exactly this. It's a great pity :-( Is there some Ceph equivalent that solves my problem?
>
> 2012/11/21 Gregory Farnum :
>> On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov wrote:
>>> So, is it not possible to use Ceph as a scalable block device without virtualization?
>>
>> I'm not sure I understand, but if you're trying to take a bunch of
>> compute nodes and glue their disks together, no, that's not a
>> supported use case at this time. There are a number of deadlock issues
>> caused by this sort of loopback; it's the same reason you shouldn't
>> mount NFS on the server host.
>> We may in the future manage to release an rbd-fuse client that you can
>> use to do this with a little less pain, but it's not ready at this
>> point.
>> -Greg
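For anyone wanting to try that workflow, a rough command sketch follows. The pool, image name, sizes and mount point are made-up examples, and it assumes an XFS filesystem and a kernel rbd client recent enough to notice the size change:

$ rbd create --size 10240 rbd/img1          # 10 GB image
$ rbd map rbd/img1                          # on the separate, non-virtualized host
$ mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt/img1
  ... later, to grow it ...
$ rbd resize --size 20480 rbd/img1
$ xfs_growfs /mnt/img1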
Re: is it still not recommended to place rbd devices on nodes where osd daemons are located?
Yes, I mean exactly this. It's a great pity :-( Is there some Ceph equivalent that solves my problem?

2012/11/21 Gregory Farnum :
> On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov wrote:
>> So, is it not possible to use Ceph as a scalable block device without virtualization?
>
> I'm not sure I understand, but if you're trying to take a bunch of
> compute nodes and glue their disks together, no, that's not a
> supported use case at this time. There are a number of deadlock issues
> caused by this sort of loopback; it's the same reason you shouldn't
> mount NFS on the server host.
> We may in the future manage to release an rbd-fuse client that you can
> use to do this with a little less pain, but it's not ready at this
> point.
> -Greg
Re: is it still not recommended to place rbd devices on nodes where osd daemons are located?
On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov wrote:
> So, is it not possible to use Ceph as a scalable block device without virtualization?

I'm not sure I understand, but if you're trying to take a bunch of compute nodes and glue their disks together, no, that's not a supported use case at this time. There are a number of deadlock issues caused by this sort of loopback; it's the same reason you shouldn't mount NFS on the server host.

We may in the future manage to release an rbd-fuse client that you can use to do this with a little less pain, but it's not ready at this point.

-Greg
Re: Files lost after mds rebuild
On Tue, Nov 20, 2012 at 8:28 PM, Drunkard Zhang wrote: > 2012/11/21 Gregory Farnum : >> No, absolutely not. There is no relationship between different RADOS >> pools. If you've been using the cephfs tool to place some filesystem >> data in different pools then your configuration is a little more >> complicated (have you done that?), but deleting one pool is never >> going to remove data from the others. >> -Greg >> > I think that should be a bug. Here's the story I did: > I created one directory 'audit' in running ceph filesystem, and put > some data into the directory (about 100GB) before these commands: > ceph osd pool create audit > ceph mds add_data_pool 4 > cephfs /mnt/temp/audit/ set_layout -p 4 > > log3 ~ # ceph osd dump | grep audit > pool 4 'audit' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num > 8 pgp_num 8 last_change 1558 owner 0 > > at this time, all data in audit still usable, after 'ceph osd pool > delete data', the disk space recycled (forgot to test if the data > still usable), only 200MB used, from 'ceph -s'. So, here's what I'm > thinking, the data stored before pool created won't follow the pool, > it still follows the default pool 'data', is this a bug, or intended > behavior? Oh, I see. Data is not moved when you set directory layouts; it only impacts files created after that point. This is intended behavior — Ceph would need to copy the data around anyway in order to make it follow the pool. There's no sense in hiding that from the user, especially given the complexity involved in doing so safely — especially when there are many use cases where you want the files in different pools. -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
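To illustrate the behavior Greg describes, here is a hedged sketch built from the commands in the report above (pool id 4 and the mount point come from the original message; the cp step is just one assumed way to rewrite a file so its objects land in the new pool):

$ ceph osd pool create audit
$ ceph mds add_data_pool 4
$ cephfs /mnt/temp/audit/ set_layout -p 4
  # files that already existed in the directory keep their objects in the old 'data' pool;
  # only files created after set_layout are placed in 'audit'
$ cp /mnt/temp/audit/old.log /mnt/temp/audit/old.log.migrated   # rewriting a file creates new objects in 'audit'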
Re: rbd map command hangs for 15 minutes during system start up
With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fall out be for removing that specific patch? On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos wrote: > It's really looking like it's the > libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When > patches 1-50 (listed below) are applied to 3.5.7, the hang is present. > So far I have gone through 4 successful installs with no hang with > only 1-49 applied. I'm still leaving my test run to make sure it's > not a fluke, but since previously it hangs within the first couple of > builds, it really looks like this is where the problem originated. > > 1-libceph_eliminate_connection_state_DEAD.patch > 2-libceph_kill_bad_proto_ceph_connection_op.patch > 3-libceph_rename_socket_callbacks.patch > 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch > 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch > 6-libceph_start_separating_connection_flags_from_state.patch > 7-libceph_start_tracking_connection_socket_state.patch > 8-libceph_provide_osd_number_when_creating_osd.patch > 9-libceph_set_CLOSED_state_bit_in_con_init.patch > 10-libceph_embed_ceph_connection_structure_in_mon_client.patch > 11-libceph_drop_connection_refcounting_for_mon_client.patch > 12-libceph_init_monitor_connection_when_opening.patch > 13-libceph_fully_initialize_connection_in_con_init.patch > 14-libceph_tweak_ceph_alloc_msg.patch > 15-libceph_have_messages_point_to_their_connection.patch > 16-libceph_have_messages_take_a_connection_reference.patch > 17-libceph_make_ceph_con_revoke_a_msg_operation.patch > 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch > 19-libceph_fix_overflow_in___decode_pool_names.patch > 20-libceph_fix_overflow_in_osdmap_decode.patch > 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch > 22-libceph_transition_socket_state_prior_to_actual_connect.patch > 23-libceph_fix_NULL_dereference_in_reset_connection.patch > 24-libceph_use_con_get_put_methods.patch > 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch > 26-libceph_encapsulate_out_message_data_setup.patch > 27-libceph_encapsulate_advancing_msg_page.patch > 28-libceph_don_t_mark_footer_complete_before_it_is.patch > 29-libceph_move_init_bio__functions_up.patch > 30-libceph_move_init_of_bio_iter.patch > 31-libceph_don_t_use_bio_iter_as_a_flag.patch > 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch > 33-libceph_don_t_change_socket_state_on_sock_event.patch > 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch > 35-libceph_don_t_touch_con_state_in_con_close_socket.patch > 36-libceph_clear_CONNECTING_in_ceph_con_close.patch > 37-libceph_clear_NEGOTIATING_when_done.patch > 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch > 39-libceph_separate_banner_and_connect_writes.patch > 40-libceph_distinguish_two_phases_of_connect_sequence.patch > 41-libceph_small_changes_to_messenger.c.patch > 42-libceph_add_some_fine_ASCII_art.patch > 43-libceph_set_peer_name_on_con_open_not_init.patch > 44-libceph_initialize_mon_client_con_only_once.patch > 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch > 46-libceph_initialize_msgpool_message_types.patch > 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch > 48-libceph_report_socket_read_write_error_message.patch > 
49-libceph_fix_mutex_coverage_for_ceph_con_close.patch > 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch > > > On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil wrote: >> Thanks for hunting this down. I'm very curious what the culprit is... >> >> sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
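Not shown in the thread, but the kind of loop used to test "patches 1-49 applied" presumably looks something like this. This is only a sketch; the patches/ directory layout, job count and install step are assumptions:

$ cd linux-3.5.7
$ for p in $(ls ../backport-patches/ | sort -n | head -49); do
      patch -p1 < ../backport-patches/$p || break
  done
$ make -j8 && make modules_install && make install
  # then rebuild the node image and watch whether 'rbd map' hangs during startup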
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On 20.11.2012 13:44, Stefan Priebe wrote:
> rbd / rados tends to return the length of writes or discarded blocks
> quite often. These values might be bigger than int.
>
> Signed-off-by: Stefan Priebe
> ---
>  block/rbd.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index f57d0c6..6bf9c2e 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -69,7 +69,7 @@ typedef enum {
>  typedef struct RBDAIOCB {
>      BlockDriverAIOCB common;
>      QEMUBH *bh;
> -    int ret;
> +    int64_t ret;
>      QEMUIOVector *qiov;
>      char *bounce;
>      RBDAIOCmd cmd;
> @@ -87,7 +87,7 @@ typedef struct RADOSCB {
>      int done;
>      int64_t size;
>      char *buf;
> -    int ret;
> +    int64_t ret;
>  } RADOSCB;
>
>  #define RBD_FD_READ 0

Why do you use int64_t instead of off_t?
If the value is related to file sizes, off_t would be a good choice.

Stefan W.
Re: rbd map command hangs for 15 minutes during system start up
It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make sure it's not a fluke, but since previously it hangs within the first couple of builds, it really looks like this is where the problem originated. 1-libceph_eliminate_connection_state_DEAD.patch 2-libceph_kill_bad_proto_ceph_connection_op.patch 3-libceph_rename_socket_callbacks.patch 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch 6-libceph_start_separating_connection_flags_from_state.patch 7-libceph_start_tracking_connection_socket_state.patch 8-libceph_provide_osd_number_when_creating_osd.patch 9-libceph_set_CLOSED_state_bit_in_con_init.patch 10-libceph_embed_ceph_connection_structure_in_mon_client.patch 11-libceph_drop_connection_refcounting_for_mon_client.patch 12-libceph_init_monitor_connection_when_opening.patch 13-libceph_fully_initialize_connection_in_con_init.patch 14-libceph_tweak_ceph_alloc_msg.patch 15-libceph_have_messages_point_to_their_connection.patch 16-libceph_have_messages_take_a_connection_reference.patch 17-libceph_make_ceph_con_revoke_a_msg_operation.patch 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch 19-libceph_fix_overflow_in___decode_pool_names.patch 20-libceph_fix_overflow_in_osdmap_decode.patch 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch 22-libceph_transition_socket_state_prior_to_actual_connect.patch 23-libceph_fix_NULL_dereference_in_reset_connection.patch 24-libceph_use_con_get_put_methods.patch 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch 26-libceph_encapsulate_out_message_data_setup.patch 27-libceph_encapsulate_advancing_msg_page.patch 28-libceph_don_t_mark_footer_complete_before_it_is.patch 29-libceph_move_init_bio__functions_up.patch 30-libceph_move_init_of_bio_iter.patch 31-libceph_don_t_use_bio_iter_as_a_flag.patch 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch 33-libceph_don_t_change_socket_state_on_sock_event.patch 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch 35-libceph_don_t_touch_con_state_in_con_close_socket.patch 36-libceph_clear_CONNECTING_in_ceph_con_close.patch 37-libceph_clear_NEGOTIATING_when_done.patch 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch 39-libceph_separate_banner_and_connect_writes.patch 40-libceph_distinguish_two_phases_of_connect_sequence.patch 41-libceph_small_changes_to_messenger.c.patch 42-libceph_add_some_fine_ASCII_art.patch 43-libceph_set_peer_name_on_con_open_not_init.patch 44-libceph_initialize_mon_client_con_only_once.patch 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch 46-libceph_initialize_msgpool_message_types.patch 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch 48-libceph_report_socket_read_write_error_message.patch 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil wrote: > Thanks for hunting this down. I'm very curious what the culprit is... > > sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rbd map command hangs for 15 minutes during system start up
On Tue, 20 Nov 2012, Nick Bartos wrote: > Since I now have a decent script which can reproduce this, I decided > to re-test with the same 3.5.7 kernel, but just not applying the > patches from the wip-3.5 branch. With the patches, I can only go 2 > builds before I run into a hang. Without the patches, I have gone 9 > consecutive builds (and still going) without seeing the hang. So it > seems like a reasonable assumption that the problem was introduced in > one of those patches. > > We started seeing the problem before applying all the 3.5 patches, so > it seems like one of these is the culprit: > > 1-libceph-encapsulate-out-message-data-setup.patch > 2-libceph-dont-mark-footer-complete-before-it-is.patch > 3-libceph-move-init-of-bio_iter.patch > 4-libceph-dont-use-bio_iter-as-a-flag.patch > 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch > 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch > 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch > 8-libceph-protect-ceph_con_open-with-mutex.patch > 9-libceph-reset-connection-retry-on-successfully-negotiation.patch > 10-rbd-only-reset-capacity-when-pointing-to-head.patch > 11-rbd-set-image-size-when-header-is-updated.patch > 12-libceph-fix-crypto-key-null-deref-memory-leak.patch > 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch > 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch > 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch > 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch > 17-libceph-check-for-invalid-mapping.patch > 18-ceph-propagate-layout-error-on-osd-request-creation.patch > 19-rbd-BUG-on-invalid-layout.patch > 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch > 21-ceph-avoid-32-bit-page-index-overflow.patch > 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch > > I'll start doing some other builds to try and narrow down the patch > introducing the problem more specifically. Thanks for hunting this down. I'm very curious what the culprit is... sage > > > On Tue, Nov 20, 2012 at 1:53 PM, Nick Bartos wrote: > > I reproduced the problem and got several sysrq states captured. > > During this run, the monitor running on the host complained a few > > times about the clocks being off, but all messages were for under 0.55 > > seconds. > > > > Here are the kernel logs. Note that there are several traces, I > > thought multiple during the incident may help: > > https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt > > > > > > On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum wrote: > >> Hmm, yep ? that param is actually only used for the warning; I guess > >> we forgot what it actually covers. :( > >> > >> Have your monitor clocks been off by more than 5 seconds at any point? > >> > >> On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos wrote: > >>> Making 'mon clock drift allowed' very small (0.1) does not > >>> reliably reproduce the hang. I started looking at the code for 0.48.2 > >>> and it looks like this is only used in Paxos::warn_on_future_time, > >>> which only handles the warning, nothing else. > >>> > >>> > >>> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil wrote: > On Fri, 16 Nov 2012, Nick Bartos wrote: > > Should I be lowering the clock drift allowed, or the lease interval to > > help reproduce it? > > clock drift allowed. > > > > > > > On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil wrote: > > > You can safely set the clock drift allowed as high as 500ms. 
The real > > > limitation is that it needs to be well under the lease interval, > > > which is > > > currently 5 seconds by default. > > > > > > You might be able to reproduce more easily by lowering the > > > threshold... > > > > > > sage > > > > > > > > > On Fri, 16 Nov 2012, Nick Bartos wrote: > > > > > >> How far off do the clocks need to be before there is a problem? It > > >> would seem to be hard to ensure a very large cluster has all of it's > > >> nodes synchronized within 50ms (which seems to be the default for > > >> "mon > > >> clock drift allowed"). Does the mon clock drift allowed parameter > > >> change anything other than the log messages? Are there any other > > >> tuning options that may help, assuming that this is the issue and > > >> it's > > >> not feasible to get the clocks more than 500ms in sync between all > > >> nodes? > > >> > > >> I'm trying to get a good way of reproducing this and get a trace on > > >> the ceph processes to see what they're waiting on. I'll let you know > > >> when I have more info. > > >> > > >> > > >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil wrote: > > >> > I just realized I was mixing up this thread with the other deadlock > > >> > thread. > > >> > > > >> > On Fri, 16 Nov 2012, Nick Bartos wrote: > > >> >> Turns
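For reference, the knobs being discussed sit in ceph.conf roughly like this. A sketch only; option names as used in the 0.48-era configuration, and the values here are examples, not recommendations:

[mon]
    mon clock drift allowed = 0.5   ; seconds of skew before the monitors start warning (default 0.05)
    mon lease = 5                   ; default lease interval; clock skew must stay well under this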
Re: Hadoop and Ceph client/mds view of modification time
(Sorry for the dupe message. vger rejected due to HTML). Thanks, I'll try this patch this morning. Client B should perform a single stat after a notification from Client A. But, won't Sage's patch still be required, since Client A needs the MDS time to pass to Client B? On Tue, Nov 20, 2012 at 12:20 PM, Sam Lang wrote: > On 11/20/2012 01:44 PM, Noah Watkins wrote: >> >> This is a description of the clock synchronization issue we are facing >> in Hadoop: >> >> Components of Hadoop use mtime as a versioning mechanism. Here is an >> example where Client B tests the expected 'version' of a file created >> by Client A: >> >>Client A: create file, write data into file. >>Client A: expected_mtime <-- lstat(file) >>Client A: broadcast expected_mtime to client B >>... >>Client B: mtime <-- lstat(file) >>Client B: test expected_mtime == mtime > > > Here's a patch that might work to push the setattr out to the mds every time > (the same as Sage's patch for getattr). This isn't quite writeback, as it > waits for the setattr at the server to complete before returning, but I > think that's actually what you want in this case. It needs to be enabled by > setting client setattr writethru = true in the config. Also, I haven't > tested that it sends the setattr, just a basic test of functionality. > > BTW, if its always client B's first stat of the file, you won't need Sage's > patch. > > -sam > > diff --git a/src/client/Client.cc b/src/client/Client.cc > index 8d4a5ac..a7dd8f7 100644 > --- a/src/client/Client.cc > +++ b/src/client/Client.cc > @@ -4165,6 +4165,7 @@ int Client::_getattr(Inode *in, int mask, int uid, int > gid) > > int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int > gid) > { > + int orig_mask = mask; >int issued = in->caps_issued(); > >ldout(cct, 10) << "_setattr mask " << mask << " issued " << > ccap_string(issued) << dendl; > @@ -4219,7 +4220,7 @@ int Client::_setattr(Inode *in, struct stat *attr, int > mask, int uid, int gid) >mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME); > } >} > - if (!mask) > + if (!cct->_conf->client_setattr_writethru && !mask) > return 0; > >MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR); > @@ -4229,6 +4230,10 @@ int Client::_setattr(Inode *in, struct stat *attr, > int mask, int uid, int gid) >req->set_filepath(path); >req->inode = in; > > + // reset mask back to original if we're meant to do writethru > + if (cct->_conf->client_setattr_writethru) > +mask = orig_mask; > + >if (mask & CEPH_SETATTR_MODE) { > req->head.args.setattr.mode = attr->st_mode; > req->inode_drop |= CEPH_CAP_AUTH_SHARED; > diff --git a/src/common/config_opts.h b/src/common/config_opts.h > index cc05095..51a2769 100644 > --- a/src/common/config_opts.h > +++ b/src/common/config_opts.h > @@ -178,6 +178,7 @@ OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100) > // MB * n (dirty OR tx. 
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep > this smallish) > OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0) // max age in cache > before writeback > OPTION(client_oc_max_objects, OPT_INT, 1000) // max objects in cache > +OPTION(client_setattr_writethru, OPT_BOOL, false) // send the attributes > to the mds server > // note: the max amount of "in flight" dirty data is roughly (max - target) > OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ invalidate > callback to keep page cache consistent > OPTION(fuse_big_writes, OPT_BOOL, true) > > >> >> Since mtime may be set in Ceph by both client and MDS, inconsistent >> mtime view is possible when clocks are not adequately synchronized. >> >> Here is a test that reproduces the problem. In the following output, >> issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time >> set to an hour earlier than the MDS node. >> >> nwatkins@issdm-22:~$ ssh issdm-18 date && ./test >> Tue Nov 20 11:40:28 PST 2012 // MDS TIME >> local time: Tue Nov 20 10:42:47 2012 // Client TIME >> fstat time: Tue Nov 20 11:40:28 2012 // mtime seen after file >> creation (MDS time) >> lstat time: Tue Nov 20 10:42:47 2012 // mtime seen after file write >> (client time) >> >> Here is the code used to produce that output. >> >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> >> int main(int argc, char **argv) >> { >> struct stat st; >> struct ceph_mount_info *cmount; >> struct timeval tv; >> >> /* setup */ >> ceph_create(&cmount, "admin"); >> ceph_conf_read_file(cmount, >> "/users/nwatkins/Projects/ceph.conf"); >> ceph_mount(cmount, "/"); >> >> /* print local time for reference */ >> gettimeofday(&tv, NULL); >> printf("local time: %s", ctime(&tv.tv_sec)); >> >> /* create a file */ >>
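If you try Sam's patch, the new option is switched on from ceph.conf along these lines (a sketch; the section placement is assumed):

[client]
    client setattr writethru = true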
Re: RBD fio Performance concerns
Responding to my own message. :) Talked to Sage a bit offline about this. I think there are two opposing forces: On one hand, random IO may be spreading reads/writes out across more OSDs than sequential IO that presumably would be hitting a single OSD more regularly. On the other hand, you'd expect that sequential writes would be getting coalesced either at the RBD layer or on the OSD, and that the drive/controller/filesystem underneath the OSD would be doing some kind of readahead or prefetching. On the third hand, maybe coalescing/prefetching is in fact happening but we are IOP limited by some per-osd limitation. It could be interesting to do the test with a single OSD and see what happens. Mark On 11/21/2012 09:52 AM, Mark Nelson wrote: Hi Guys, I'm late to this thread but thought I'd chime in. Crazy that you are getting higher performance with random reads/writes vs sequential! It would be interesting to see what kind of throughput smalliobench reports (should be packaged in bobtail) and also see if this behavior happens with cephfs. It's still too early in the morning for me right now to come up with a reasonable explanation for what's going on. It might be worth running blktrace and seekwatcher to see what the io patterns on the underlying disk look like in each case. Maybe something unexpected is going on. Mark On 11/19/2012 02:57 PM, Sébastien Han wrote: Which iodepth did you use for those benchs? I really don't understand why I can't get more rand read iops with 4K block ... Me neither, hope to get some clarification from the Inktank guys. It doesn't make any sense to me... -- Bien cordialement. Sébastien HAN. On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER wrote: @Alexandre: is it the same for you? or do you always get more IOPS with seq? rand read 4K : 6000 iops seq read 4K : 3500 iops seq read 4M : 31iops (1gigabit client bandwith limit) rand write 4k: 6000iops (tmpfs journal) seq write 4k: 1600iops seq write 4M : 31iops (1gigabit client bandwith limit) I really don't understand why I can't get more rand read iops with 4K block ... I try with high end cpu for client, it doesn't change nothing. But test cluster use old 8 cores E5420 @ 2.50GHZ (But cpu is around 15% on cluster during read bench) - Mail original - De: "Sébastien Han" À: "Mark Kampe" Cc: "Alexandre DERUMIER" , "ceph-devel" Envoyé: Lundi 19 Novembre 2012 19:03:40 Objet: Re: RBD fio Performance concerns @Sage, thanks for the info :) @Mark: If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). The original benchmark has been performed with 4M block size. And as you can see I still get more IOPS with rand than seq... I just tried with 4M without direct I/O, still the same. I can print fio results if it's needed. We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. I know why I use direct I/O. It's synthetic benchmarks, it's far away from a real life scenario and how common applications works. I just try to see the maximum I/O throughput that I can get from my RBD. All my applications use buffered I/O. @Alexandre: is it the same for you? or do you always get more IOPS with seq? Thanks to all of you.. 
On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe wrote: Recall: 1. RBD volumes are striped (4M wide) across RADOS objects 2. distinct writes to a single RADOS object are serialized Your sequential 4K writes are direct, depth=256, so there are (at all times) 256 writes queued to the same object. All of your writes are waiting through a very long line, which is adding horrendous latency. If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. That's correct for some of the benchmarks. However even with 4K for seq, I still get less IOPS. See below my last fio: # fio rbd-bench.fio seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 fio 1.59 Starting 4 processes
Re: RBD fio Performance concerns
Hi Guys, I'm late to this thread but thought I'd chime in. Crazy that you are getting higher performance with random reads/writes vs sequential! It would be interesting to see what kind of throughput smalliobench reports (should be packaged in bobtail) and also see if this behavior happens with cephfs. It's still too early in the morning for me right now to come up with a reasonable explanation for what's going on. It might be worth running blktrace and seekwatcher to see what the io patterns on the underlying disk look like in each case. Maybe something unexpected is going on. Mark On 11/19/2012 02:57 PM, Sébastien Han wrote: Which iodepth did you use for those benchs? I really don't understand why I can't get more rand read iops with 4K block ... Me neither, hope to get some clarification from the Inktank guys. It doesn't make any sense to me... -- Bien cordialement. Sébastien HAN. On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER wrote: @Alexandre: is it the same for you? or do you always get more IOPS with seq? rand read 4K : 6000 iops seq read 4K : 3500 iops seq read 4M : 31iops (1gigabit client bandwith limit) rand write 4k: 6000iops (tmpfs journal) seq write 4k: 1600iops seq write 4M : 31iops (1gigabit client bandwith limit) I really don't understand why I can't get more rand read iops with 4K block ... I try with high end cpu for client, it doesn't change nothing. But test cluster use old 8 cores E5420 @ 2.50GHZ (But cpu is around 15% on cluster during read bench) - Mail original - De: "Sébastien Han" À: "Mark Kampe" Cc: "Alexandre DERUMIER" , "ceph-devel" Envoyé: Lundi 19 Novembre 2012 19:03:40 Objet: Re: RBD fio Performance concerns @Sage, thanks for the info :) @Mark: If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). The original benchmark has been performed with 4M block size. And as you can see I still get more IOPS with rand than seq... I just tried with 4M without direct I/O, still the same. I can print fio results if it's needed. We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. I know why I use direct I/O. It's synthetic benchmarks, it's far away from a real life scenario and how common applications works. I just try to see the maximum I/O throughput that I can get from my RBD. All my applications use buffered I/O. @Alexandre: is it the same for you? or do you always get more IOPS with seq? Thanks to all of you.. On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe wrote: Recall: 1. RBD volumes are striped (4M wide) across RADOS objects 2. distinct writes to a single RADOS object are serialized Your sequential 4K writes are direct, depth=256, so there are (at all times) 256 writes queued to the same object. All of your writes are waiting through a very long line, which is adding horrendous latency. If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). 
We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. That's correct for some of the benchmarks. However even with 4K for seq, I still get less IOPS. See below my last fio: # fio rbd-bench.fio seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 fio 1.59 Starting 4 processes Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s] seq-read: (groupid=0, jobs=1): err= 0: pid=15096 read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90 clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63 lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62 bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06 cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, =64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.1% issued r/w/d: total=200473/0/0, short=0/0/0 lat (msec):
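The rbd-bench.fio job file itself is not shown in the thread; based on the output above it presumably looks something like the following reconstruction. The target filename and runtime are guesses, not the original:

[global]
ioengine=libaio
direct=1
bs=4k
iodepth=256
runtime=60
filename=/dev/vdb        ; the RBD-backed disk inside the guest (assumption)

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall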
Re: RBD Backup
Hi,

On 11/21/2012 09:56 PM, Stefan Priebe - Profihost AG wrote:
> Hi Wido,
>
> thanks for all your explanations. This doesn't seem to work:
>
> rbd export --snap BACKUP
>
> rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /vm-101-disk-1.img
> rbd: error setting snapshot context: (2) No such file or directory
>
> Or should I still create and delete a snapshot named BACKUP before doing this?

Yes, you should create the snapshot first before exporting it. Export does not create the snapshot for you.

Wido

> Greets,
> Stefan
Re: how to create snapshots
Hi,

On 11/21/2012 10:07 PM, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> I tried to create a snapshot of my disk vm-113-disk-1:
>
> [: ~]# rbd -p kvmpool1 ls
> vm-113-disk-1
> [: ~]# rbd -p kvmpool1 snap create BACKUP vm-113-disk-1
> rbd: extraneous parameter vm-113-disk-1
> [: ~]# rbd -p kvmpool1 snap create vm-113-disk-1 BACKUP
> rbd: extraneous parameter BACKUP
>
> What's wrong here?

Use:

$ rbd -p kvmpool1 snap create --image vm-113-disk-1 BACKUP

"rbd -h" also tells you that image and snapshot names are [pool/]name[@snap], or you may specify individual pieces of names with -p/--pool, --image, and/or --snap.

Never tried it, but you might be able to use:

$ rbd -p kvmpool1 snap create vm-113-disk-1@BACKUP

I don't have access to a running Ceph cluster now to verify this.

Wido

> Stefan
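For completeness, the fully qualified form suggested by that help text would look like this (also untested here, so treat it as a sketch):

$ rbd snap create kvmpool1/vm-113-disk-1@BACKUP
$ rbd snap ls kvmpool1/vm-113-disk-1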
how to create snapshots
Hello list,

I tried to create a snapshot of my disk vm-113-disk-1:

[: ~]# rbd -p kvmpool1 ls
vm-113-disk-1
[: ~]# rbd -p kvmpool1 snap create BACKUP vm-113-disk-1
rbd: extraneous parameter vm-113-disk-1
[: ~]# rbd -p kvmpool1 snap create vm-113-disk-1 BACKUP
rbd: extraneous parameter BACKUP

What's wrong here?

Stefan
Re: RBD Backup
Hi Wido,

thanks for all your explanations. This doesn't seem to work:

rbd export --snap BACKUP

rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /vm-101-disk-1.img
rbd: error setting snapshot context: (2) No such file or directory

Or should I still create and delete a snapshot named BACKUP before doing this?

Greets,
Stefan
Re: RBD Backup
Hi,

On 11/21/2012 09:37 PM, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> is there a recommended way to back up rbd images / disks? Or is it just
>
> rbd snap create BACKUP
> rbd export BACKUP

You should use:

rbd export --snap BACKUP

> rbd snap rm BACKUP
>
> Is the snap needed at all? Or is an export alone safe? Is there a way to make sure the image is consistent?

While reading rbd.cc it doesn't seem like running export on a running VM is safe, so you should snapshot before. The snapshot isn't consistent since it has no way of telling the VM to flush its buffers. To make it consistent you have to run "sync" (in the VM) just prior to creating the snapshot.

> Is it possible to use the BACKUP file as a loop device or something else so that I'm able to mount the partitions from the backup file?

You can do something like:

rbd export --snap BACKUP image1 /mnt/backup/image1.img
losetup /mnt/backup/image1.img
kpartx -a /dev/loop0

Now you will have the partitions from the RBD image available in /dev/mapper/loop0pX

Wido

> Thanks!
>
> Greets Stefan
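Putting Wido's pieces together, a complete pass might look like the sketch below. Pool, image and paths are examples from the thread, and losetup normally needs a loop device or -f to pick one, which is assumed here:

# inside the VM: flush buffers just before snapshotting
$ sync
# on the admin host
$ rbd -p kvmpool1 snap create --snap BACKUP vm-101-disk-1
$ rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /mnt/backup/vm-101-disk-1.img
$ rbd -p kvmpool1 snap rm --snap BACKUP vm-101-disk-1
# inspect the exported image
$ losetup -f --show /mnt/backup/vm-101-disk-1.img     # prints e.g. /dev/loop0
$ kpartx -a /dev/loop0
$ mount /dev/mapper/loop0p1 /mnt/inspect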
RBD Backup
Hello list,

is there a recommended way to back up rbd images / disks? Or is it just

rbd snap create BACKUP
rbd export BACKUP
rbd snap rm BACKUP

Is the snap needed at all? Or is an export alone safe? Is there a way to make sure the image is consistent?

Is it possible to use the BACKUP file as a loop device or something else so that I'm able to mount the partitions from the backup file?

Thanks!

Greets
Stefan
'zombie snapshot' problem
Hi,

Somehow I have managed to produce an unkillable snapshot, which can neither be removed itself nor allows its parent image to be removed:

$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.

$ rbd rm dev-rack0/vm2
2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots - not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.

$ rbd snap ls dev-rack0/vm2
SNAPID NAME          SIZE
   188 vm2.snap-yxf  16384 MB

$ rbd info dev-rack0/vm2
rbd image 'vm2':
        size 16384 MB in 4096 objects
        order 22 (4096 KB objects)
        block_name_prefix: rbd_data.1fa164c960874
        format: 2
        features: layering

$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to remove snapshot: (2) No such file or directory

$ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to create snapshot: (17) File exists

$ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
Rolling back to snapshot: 100% complete...done.

$ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
$ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2

Meanwhile, ``rbd ls -l dev-rack0'' segfaults; a log is attached.

Is there any reliable way to kill the problematic snap?

[Attachment: log-crash.txt.gz (gzip compressed data)]
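No answer appears in this digest, but a hedged first step when filing this kind of report is to capture librbd and messenger debug output for the failing command, roughly like this (the log path is an example):

$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2 \
      --debug-rbd 20 --debug-ms 1 --log-file /tmp/rbd-snap-rm.log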
Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible
Hi, no, I have it basically ready but I have to run some tests before. You'll have it in the next days! Danny Am 21.11.2012 01:23, schrieb Sage Weil: > If you haven't gotten to this yet, I'll go ahead and jump on it.. > let me know! > > Thanks- sage > > > On Thu, 9 Aug 2012, Danny Kukawka wrote: > >> Remove btrfs specific keys and replace them by more generic keys >> to be able to replace btrfs with e.g. xfs or ext4 easily. >> >> Add new key to define the osd fs type: 'fstype', which can get >> defined in the [osd] section for all OSDs. >> >> Replace: - 'btrfs devs' -> 'devs' - 'btrfs path' -> 'fs path' - >> 'btrfs options' -> 'fs options' - mkcephfs: replace --mkbtrfs >> with --mkfs - init-ceph: replace --btrfs with --fsmount, >> --nobtrfs with --nofsmount, --btrfsumount with --fsumount >> >> Update documentation, manpage and example config files. >> >> Signed-off-by: Danny Kukawka --- >> doc/man/8/mkcephfs.rst | 17 +++- >> man/mkcephfs.8 | 15 +++ >> src/ceph.conf.twoosds |7 ++-- >> src/init-ceph.in| 50 >> +- src/mkcephfs.in >> | 60 +-- src/sample.ceph.conf >> | 15 --- src/test/cli/osdmaptool/ceph.conf.withracks |3 >> +- 7 Dateien ge?ndert, 95 Zeilen hinzugef?gt(+), 72 Zeilen >> entfernt(-) >> >> diff --git a/doc/man/8/mkcephfs.rst b/doc/man/8/mkcephfs.rst >> index ddc378a..dd3fbd5 100644 --- a/doc/man/8/mkcephfs.rst +++ >> b/doc/man/8/mkcephfs.rst @@ -70,20 +70,15 @@ Options default is >> ``/etc/ceph/keyring`` (or whatever is specified in the config >> file). >> >> -.. option:: --mkbtrfs +.. option:: --mkfs >> >> - Create and mount the any btrfs file systems specified in the >> - ceph.conf for OSD data storage using mkfs.btrfs. The "btrfs >> devs" - and (if it differs from "osd data") "btrfs path" >> options must be - defined. + Create and mount any file system >> specified in the ceph.conf for + OSD data storage using mkfs. >> The "devs" and (if it differs from + "osd data") "fs path" >> options must be defined. >> >> **NOTE** Btrfs is still considered experimental. This option - >> can ease some configuration pain, but is the use of btrfs is not >> - required when ``osd data`` directories are mounted manually >> by the - adminstrator. - - **NOTE** This option is deprecated >> and will be removed in a future - release. + can ease some >> configuration pain, but is not required when + ``osd data`` >> directories are mounted manually by the adminstrator. >> >> .. option:: --no-copy-conf >> >> diff --git a/man/mkcephfs.8 b/man/mkcephfs.8 index >> 8544a01..22a5335 100644 --- a/man/mkcephfs.8 +++ >> b/man/mkcephfs.8 @@ -32,7 +32,7 @@ level margin: >> \\n[rst2man-indent\\n[rst2man-indent-level]] . .SH SYNOPSIS .nf >> -\fBmkcephfs\fP [ \-c \fIceph.conf\fP ] [ \-\-mkbtrfs ] [ \-a, >> \-\-all\-hosts [ \-k +\fBmkcephfs\fP [ \-c \fIceph.conf\fP ] [ >> \-\-mkfs ] [ \-a, \-\-all\-hosts [ \-k >> \fI/path/to/admin.keyring\fP ] ] .fi .sp @@ -111,19 +111,16 @@ >> config file). .UNINDENT .INDENT 0.0 .TP -.B \-\-mkbtrfs -Create >> and mount the any btrfs file systems specified in the -ceph.conf >> for OSD data storage using mkfs.btrfs. The "btrfs devs" -and (if >> it differs from "osd data") "btrfs path" options must be +.B >> \-\-mkfs +Create and mount any file systems specified in the >> +ceph.conf for OSD data storage using mkfs.*. The "devs" +and (if >> it differs from "osd data") "fs path" options must be defined. >> .sp \fBNOTE\fP Btrfs is still considered experimental. 
This >> option -can ease some configuration pain, but is the use of btrfs >> is not +can ease some configuration pain, but the use of this >> option is not required when \fBosd data\fP directories are >> mounted manually by the adminstrator. -.sp -\fBNOTE\fP This >> option is deprecated and will be removed in a future -release. >> .UNINDENT .INDENT 0.0 .TP diff --git a/src/ceph.conf.twoosds >> b/src/ceph.conf.twoosds index c0cfc68..05ca754 100644 --- >> a/src/ceph.conf.twoosds +++ b/src/ceph.conf.twoosds @@ -67,7 >> +67,8 @@ debug journal = 20 log dir = /data/cosd$id osd data = >> /mnt/osd$id -btrfs options = "flushoncommit,usertrans" + fs >> options = "flushoncommit,usertrans" +fstype = btrfs ;user = >> root >> >> ;osd journal = /mnt/osd$id/journal @@ -75,8 +76,8 @@ osd journal >> = "/dev/disk/by-path/pci-:05:02.0-scsi-6:0:0:0" ;filestore >> max sync interval = 1 >> >> -btrfs devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0" >> -; btrfs devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0 >> \ + devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0" +; >> devs = "/dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0 \ ; >> /dev/disk/by-path/pci-:05:01.0-scsi-3:0:0:0 \ ; >> /dev/disk/by-path/pci-:05:01.0-scsi-4:0:0:0 \ ; >> /dev/disk/by-path/pci-:05:0
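As a quick illustration of what the patch changes for users, a sketch assembled from the description above rather than an excerpt from the diff; device paths and option values here are made up:

; before the patch (btrfs only):
[osd.0]
    btrfs devs = /dev/sdb
    btrfs options = "flushoncommit"
    btrfs path = /mnt/osd0

; after the patch (any filesystem):
[osd]
    fstype = xfs
[osd.0]
    devs = /dev/sdb
    fs options = "noatime"
    fs path = /mnt/osd0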
Re: [Openstack] Ceph + Nova
Hi,

I don't think this is the best place to ask your question, since it's not directly related to OpenStack but more about Ceph. I've added the ceph ML in CC.

Anyway, CephFS is not ready yet for production, although I have heard that some people use it. The people from Inktank (the company behind Ceph) don't recommend it; AFAIR they expect something more production-ready for Q2 2013. You can use it (I did, for testing purposes) but it's at your own risk.

Besides this, RBD and RADOS are robust and stable now, so you can go with the Cinder and Glance integration without any problems.

Cheers!

On Wed, Nov 21, 2012 at 9:37 AM, JuanFra Rodríguez Cardoso wrote:
> Hi everyone:
>
> I'd like to know your opinion as nova experts:
>
> Would you recommend CephFS as shared storage in /var/lib/nova/instances?
> Another option would be to use GlusterFS or MooseFS for the
> /var/lib/nova/instances directory and Ceph RBD for Glance and Nova volumes,
> don't you think?
>
> Thanks for your attention.
>
> Best regards,
> JuanFra
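For the Glance/Cinder integration mentioned above, the Folsom-era configuration looks roughly like the following. This is a sketch from memory, not from this thread: pool names, the user and the secret UUID are placeholders, and the option names should be checked against the release you actually run:

# glance-api.conf
default_store = rbd
rbd_store_user = glance
rbd_store_pool = images
rbd_store_ceph_conf = /etc/ceph/ceph.conf

# cinder.conf
volume_driver = cinder.volume.driver.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <uuid-of-the-libvirt-secret>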
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On Wed, Nov 21, 2012 at 09:33:08AM +0100, Stefan Priebe - Profihost AG wrote: > Am 21.11.2012 09:26, schrieb Stefan Hajnoczi: > >On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote: > >>Am 21.11.2012 07:41, schrieb Stefan Hajnoczi: > >QEMU is currently in hard freeze and only critical patches should go in. > >Providing steps to reproduce the bug helps me decide that this patch > >should still be merged for QEMU 1.3-rc1. > > > >Anyway, the patch is straightforward, I have applied it to my block tree > >and it will be in QEMU 1.3-rc1: > >https://github.com/stefanha/qemu/commits/block > > Thanks! > > The steps to reproduce are: > mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends > a discard. Important is that you use scsi-hd and set > discard_granularity=512. Otherwise rbd disabled discard support. Excellent, thanks! I will add it to the commit description. > Might you have a look at my other rbd fix too? It fixes a race > between task cancellation and writes. The same race was fixed in > iscsi this summer. Yes. Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
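A sketch of how such a guest might be wired up to reproduce this; the pool/image names and the rest of the command line are illustrative, only the scsi-hd + discard_granularity=512 part comes from the report:

qemu-system-x86_64 ... \
    -drive file=rbd:rbd/testimg:conf=/etc/ceph/ceph.conf,format=raw,if=none,id=drive0 \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,drive=drive0,discard_granularity=512

# inside the guest, on a disk larger than 2 GB:
mkfs.xfs -f /dev/sda        # the whole-device discard is what used to overflow 'int'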
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
Am 21.11.2012 09:26, schrieb Stefan Hajnoczi: On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote: Am 21.11.2012 07:41, schrieb Stefan Hajnoczi: We're going in circles here. I know the types are wrong in the code and your patch fixes it, that's why I said it looks good in my first reply. Sorry not so familiar with processes like these. QEMU is currently in hard freeze and only critical patches should go in. Providing steps to reproduce the bug helps me decide that this patch should still be merged for QEMU 1.3-rc1. Anyway, the patch is straightforward, I have applied it to my block tree and it will be in QEMU 1.3-rc1: https://github.com/stefanha/qemu/commits/block Thanks! The steps to reproduce are: mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends a discard. Important is that you use scsi-hd and set discard_granularity=512. Otherwise rbd disabled discard support. Might you have a look at my other rbd fix too? It fixes a race between task cancellation and writes. The same race was fixed in iscsi this summer. Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote: > Am 21.11.2012 07:41, schrieb Stefan Hajnoczi: > >On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe wrote: > >>Hi Stefan, > >> > >>Am 20.11.2012 17:29, schrieb Stefan Hajnoczi: > >> > >>>On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote: > > rbd / rados tends to return pretty often length of writes > or discarded blocks. These values might be bigger than int. > > Signed-off-by: Stefan Priebe > --- > block/rbd.c |4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> > >>>Looks good but I want to check whether this fixes an bug you've hit? > >>>Please indicate details of the bug and how to reproduce it in the commit > >>>message. > >> > >> > >>you get various I/O errors in client. As negative return values indicate I/O > >>errors. When now a big positive value is returned by librbd block/rbd tries > >>to store this one in acb->ret which is an int. Then it wraps around and is > >>negative. After that block/rbd thinks this is an I/O error and report this > >>to the guest. > > > >It's still not clear whether this is a bug that you can reproduce. > >After all, the ret value would have to be >2^31 which is a 2+ GB > >request! > Yes and that is the fact. > > Look here: >if (acb->cmd == RBD_AIO_WRITE || > acb->cmd == RBD_AIO_DISCARD) { > if (r < 0) { > acb->ret = r; > acb->error = 1; > } else if (!acb->error) { > acb->ret = rcb->size; > } > > It sets acb->ret to rcb->size. But the size from a DISCARD if you > DISCARD a whole device might be 500GB or today even some TB. We're going in circles here. I know the types are wrong in the code and your patch fixes it, that's why I said it looks good in my first reply. QEMU is currently in hard freeze and only critical patches should go in. Providing steps to reproduce the bug helps me decide that this patch should still be merged for QEMU 1.3-rc1. Anyway, the patch is straightforward, I have applied it to my block tree and it will be in QEMU 1.3-rc1: https://github.com/stefanha/qemu/commits/block Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
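To make the wrap-around concrete, here is a small standalone sketch; this is not QEMU code, and the 3 GiB figure is just an example of a discard larger than 2^31 bytes:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t rcb_size = 3ULL << 30;        /* 3 GiB discard, e.g. mkfs.xfs trimming a whole device */

    int     ret32 = (int)rcb_size;         /* old field type: int ret;     */
    int64_t ret64 = (int64_t)rcb_size;     /* patched type:   int64_t ret; */

    /* The truncated 32-bit value comes out negative, so the completion
     * path treats a successful discard as an I/O error and reports it
     * to the guest. */
    printf("as int:     %d\n", ret32);
    printf("as int64_t: %lld\n", (long long)ret64);
    return 0;
}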