crash report: paging request errors in various krbd contexts

2014-05-16 Thread Hannes Landeholm
A production server just locked up and had to be hard rebooted. It had
these various rbd-related crash signatures in its system log within
the same 10 second interval:

kernel: BUG: unable to handle kernel paging request at 
kernel: IP: [812832dd] strnlen+0xd/0x40
kernel: PGD 180f067 PUD 1811067 PMD 0
kernel: Oops:  [#1] PREEMPT SMP
kernel: Modules linked in: xt_recent xt_limit veth ipt_MASQUERADE cbc
ipt_REJECT xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp
iptable_mangle ip_tables x_tables coretemp hwmon x86_pkg_temp_thermal
crct10dif_pclmul crct10di
f_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode evdev
pcspkr xen_netback xen_blkback xen_gntalloc xenfs xen_gntdev
xen_evtchn rbd libceph crc32c libcrc32c ext4 crc16 mbcache jbd2
ixgbevf xen_privcmd xen_pcifront xen_netfront xen_kbdfro
nt xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops
xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio
ipmi_poweroff ipmi_msghandler
kernel: CPU: 3 PID: 2887 Comm: mysqld Not tainted 3.14.1-1-js #1
kernel: task: 880342ce1d70 ti: 88033e17e000 task.ti: 88033e17e000
kernel: RIP: e030:[812832dd]  [812832dd] strnlen+0xd/0x40
kernel: RSP: e02b:88033e17fa78  EFLAGS: 00010286
kernel: RAX: 816feed6 RBX: 8800ffb60068 RCX: fffe
kernel: RDX:  RSI:  RDI: 
kernel: RBP: 88033e17fa78 R08:  R09: 
kernel: R10: 756d696e612d736a R11: 2f30622d637a762f R12: 
kernel: R13: 8800ffb600cd R14:  R15: 
kernel: FS:  7f98dec7f700() GS:8803b5d8()
knlGS:
kernel: CS:  e033 DS:  ES:  CR0: 8005003b
kernel: CR2:  CR3: 000335dcc000 CR4: 2660
kernel: Stack:
kernel:  88033e17fab0 8128562b 8800ffb60068 8800ffb600cd
kernel:  88033e17fb28 a0196869 a0196869 88033e17fb18
kernel:  81286ac1 880389525960 814de243 
kernel: Call Trace:
kernel:  [8128562b] string.isra.6+0x3b/0xf0
kernel:  [81286ac1] vsnprintf+0x1c1/0x610
kernel:  [814de243] ? _raw_spin_unlock_irq+0x13/0x30
kernel:  [81286fd9] snprintf+0x39/0x40
kernel:  [a01911e0] ? rbd_img_request_fill+0x100/0x6d0 [rbd]
kernel:  [a019122a] rbd_img_request_fill+0x14a/0x6d0 [rbd]
kernel:  [a018f4d5] ? rbd_img_request_create+0x155/0x220 [rbd]
kernel:  [8125cab9] ? blk_add_timer+0x19/0x20
kernel:  [a0194a1d] rbd_request_fn+0x1ed/0x330 [rbd]
kernel:  [81252f13] __blk_run_queue+0x33/0x40
kernel:  [8125411e] queue_unplugged+0x2e/0xd0
kernel:  [81256cf0] blk_flush_plug_list+0x1f0/0x230
kernel:  [812570a4] blk_finish_plug+0x14/0x40
kernel:  [a00b9d6e] ext4_writepages+0x48e/0xd50 [ext4]
kernel:  [81136aae] ? generic_file_aio_write+0x5e/0xe0
kernel:  [811417ae] do_writepages+0x1e/0x40
kernel:  [811363d9] __filemap_fdatawrite_range+0x59/0x60
kernel:  [811364da] filemap_write_and_wait_range+0x2a/0x70
kernel:  [a00b149a] ext4_sync_file+0xba/0x360 [ext4]
kernel:  [811d50ce] do_fsync+0x4e/0x80
kernel:  [811d5373] SyS_fdatasync+0x13/0x20
kernel:  [814e66e9] system_call_fastpath+0x16/0x1b
kernel: Code: c0 01 80 38 00 75 f7 48 29 f8 5d c3 31 c0 5d c3 66 66 66
66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74
2a 80 3f 00 74 25 48 89 f8 31 d2 eb 10 0f 1f 80 00 00 00 00 48 83
kernel: RIP  [812832dd] strnlen+0xd/0x40
kernel:  RSP 88033e17fa78
kernel: CR2: 
kernel: ---[ end trace 83a2fd2a9969b20d ]---

kernel: BUG: unable to handle kernel paging request at 177b
kernel: IP: [814dd33c] down_read+0xc/0x20
kernel: PGD 39c86b067 PUD 39c86a067 PMD 0
kernel: Oops: 0002 [#2] PREEMPT SMP
kernel: Modules linked in: xt_recent xt_limit veth ipt_MASQUERADE cbc
ipt_REJECT xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp
iptable_mangle ip_tables x_tables coretemp hwmon x86_pkg_temp_thermal
crct10dif_pclmul crct10di
f_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode evdev
pcspkr xen_netback xen_blkback xen_gntalloc xenfs xen_gntdev
xen_evtchn rbd libceph crc32c libcrc32c ext4 crc16 mbcache jbd2
ixgbevf xen_privcmd xen_pcifront xen_netfront xen_kbdfro
nt xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops
xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio
ipmi_poweroff ipmi_msghandler
kernel: CPU: 3 PID: 16277 Comm: jbd2/rbd115-8 Tainted: G  D
3.14.1-1-js #1
kernel: task: 8800e6f13110 ti: 8800f9d52000 

Re: [ceph-users] Does CEPH rely on any multicasting?

2014-05-16 Thread David McBride
On 15/05/14 18:07, Dietmar Maurer wrote:

 Besides, it would be great if ceph could use existing cluster stacks like 
 corosync, ...
 Is there any plan to support that?

For clarity: To what end?

Recall that Ceph already incorporates its own cluster-management
framework, and the various Ceph daemons already operate in a clustered
manner.

(The documentation at
http://ceph.com/docs/firefly/architecture/#scalability-and-high-availability
may be helpful if you are not already familiar with this.)

Kind regards,
David
-- 
David McBride dw...@cam.ac.uk
Unix Specialist, University Information Services


RE: [ceph-users] Does CEPH rely on any multicasting?

2014-05-16 Thread Dietmar Maurer
 Recall that Ceph already incorporates its own cluster-management framework,
 and the various Ceph daemons already operate in a clustered manner.

Sure. But I guess it could reduce the 'ceph' code size if it used an existing
framework.

We (Proxmox VE) run corosync by default on all nodes, so it would also make
configuration easier.



Re: Radosgw - bucket index

2014-05-16 Thread Sage Weil
Hi Guang,

[I think the problem is that your email is HTML formatted, and vger 
silently drops those.  Make sure your mailer is set to plain text mode.]

On Fri, 16 May 2014, Guang wrote:

   * *Key/value OSD backend* (experimental): An alternative storage backend
     for Ceph OSD processes that puts all data in a key/value database like
     leveldb.  This provides better performance for workloads dominated by
     key/value operations (like radosgw bucket indices).
 
 Hi Yehuda and Haomai, I managed to set up a K/V store backend and played
 around with it. As Sage mentioned in the release note, I thought the K/V store
 could be the solution for radosgw's bucket indexing feature, which currently
 has scaling problems [1]. However, after playing around with the K/V store and
 understanding the requirements for bucket indexing, I think that at least for
 now there is still a gap in fixing bucket indexing by leveraging the K/V store.
 
 In my opinion, one requirement (API) for implementing bucket indexing is
 support for ordered scans (prefix filter), which is not part of the rados API.
 As the K/V store does not extend the rados API (it is not supposed to) but
 only changes the underlying object store strategy, it is not likely to help
 with bucket indexing, except if we keep the original approach of using omap to
 store the bucket index, with each bucket corresponding to one object.

The rados omap API does allow a prefix filter, although it's somewhat 
implicit:

/**
 * omap_get_keys: keys from the object omap
 *
 * Get up to max_return keys beginning after start_after
 *
 * @param start_after [in] list keys starting after start_after
 * @param max_return [in] list no more than max_return keys
 * @param out_keys [out] place returned values in out_keys on completion
 * @param prval [out] place error code in prval upon completion
 */
void omap_get_keys(const std::string &start_after,
                   uint64_t max_return,
                   std::set<std::string> *out_keys,
                   int *prval);

Since all keys are sorted alphanumerically, you simply have to set
start_after == your prefix, and start ignoring the results once you get a
key that no longer starts with your prefix.  This could be improved by having
an explicit prefix argument that does this server-side, but for now you
can get the right data (plus a bit of extra at the end).
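
To make the client-side filtering concrete, here is a minimal, self-contained
C++ sketch of the loop described above. A std::map stands in for the object's
omap, and a hypothetical fetch_keys_after() helper stands in for one
omap_get_keys round trip; the helper name and the batch size are illustrative
only and not part of the rados API.

#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in for one omap_get_keys round trip (hypothetical helper): return up
// to max_return keys strictly after start_after, in sorted order.
static std::set<std::string> fetch_keys_after(const std::map<std::string, std::string> &omap,
                                              const std::string &start_after,
                                              uint64_t max_return) {
    std::set<std::string> out;
    for (auto it = omap.upper_bound(start_after);
         it != omap.end() && out.size() < max_return; ++it)
        out.insert(it->first);
    return out;
}

// Client-side prefix scan: start just after the prefix itself and stop as
// soon as a returned key no longer starts with it.
static std::vector<std::string> prefix_scan(const std::map<std::string, std::string> &omap,
                                            const std::string &prefix,
                                            uint64_t batch = 2) {
    std::vector<std::string> result;
    std::string cursor = prefix;          // keys are sorted, so matches sort right after the prefix
    for (;;) {
        std::set<std::string> keys = fetch_keys_after(omap, cursor, batch);
        if (keys.empty())
            break;
        for (const std::string &k : keys) {
            if (k.compare(0, prefix.size(), prefix) != 0)
                return result;            // past the prefix range: done
            result.push_back(k);
        }
        cursor = *keys.rbegin();          // resume after the last key seen
    }
    return result;
}

int main() {
    std::map<std::string, std::string> omap = {
        {"bucket1_a", ""}, {"bucket1_b", ""}, {"bucket1_c", ""}, {"bucket2_a", ""}};
    for (const auto &k : prefix_scan(omap, "bucket1_"))
        std::cout << k << "\n";           // prints the three bucket1_ keys
}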

Is that what you mean by prefix scan, or are you referring to the ability 
to scan for rados objects that begin with a prefix?  If it's the latter, 
you are right: objects are hashed across nodes and there is no sorted 
object name index to allow prefix filtering.  There is a list_objects 
filter option, but it is still O(objects in the pool).

 Did I miss anything obvious here?
 
 We are very interested in the effort to improve the scalability of the bucket
 index [1], as the blueprint mentioned. Here are my thoughts on top of this:
  1. It would be nice if we could refactor the interface so that it is easy to
 switch to a different underlying storage system for bucket indexing; for
 example, DynamoDB seems to be used for S3's implementation [2], and SWIFT
 uses sqlite [3] and has a flat namespace for listing purposes (with prefix
 and delimiter).

radosgw is using the omap key/value API for objects, which is more or less 
equivalent to what swift is doing with sqlite.  This data passes straight 
into leveldb on the backend (or whatever other backend you are using).  
Using something like rocksdb in its place is pretty simple and there are
unmerged patches to do that; the user would just need to adjust their 
crush map so that the rgw index pool is mapped to a different set of OSDs 
with the better k/v backend.

  2. As mentioned in the blueprint, if we go with the approach of sharding
 the bucket index object, what is the design choice? Are we going to
 maintain a B-tree structure to keep all keys sorted and shard on demand,
 like having a background thread do the sharding when it reaches a certain
 threshold?

I don't know... I'm sure Yehuda has a more well-formed opinion on this.  I
suspect something simpler than a B-tree (like a single-level hash-based
fan-out) would be sufficient, although you'd pay a bit of a price for
object enumeration.

sage



 
 [1] https://wiki.ceph.com/Planning/Sideboard/rgw%3A_bucket_index_scalability
 [2] http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
 [3] https://swiftstack.com/openstack-swift/architecture/
 
 Thanks,
 Guang
 
 On May 7, 2014, at 9:05 AM, Sage Weil s...@inktank.com wrote:
 
    We did it!  Firefly v0.80 is built and pushed out to the ceph.com
    repositories.
 
    This release will form the basis for our long-term supported release
    Firefly, v0.80.x.  The big new features are support for erasure coding
    and cache tiering, although a broad range of other features,
 

Re: crash report: paging request errors in various krbd contexts

2014-05-16 Thread Alex Elder
On 05/16/2014 01:00 AM, Hannes Landeholm wrote:
 A production server just locked up and had to be hard rebooted. It had
 these various rbd-related crash signatures in its system log within
 the same 10 second interval:

I'll try to provide a quick summary of what's likely
happened in each of these.

The bottom line is that it appears that some
memory used by rbd and/or libceph has become
corrupted, or there is something (or more than
one thing) that is being used after it's been
freed.  Either way this sort of thing will be
difficult to try to understand; it would be
great if it could be reproduced independently.

 kernel: BUG: unable to handle kernel paging request at 
 kernel: IP: [812832dd] strnlen+0xd/0x40
 kernel: PGD 180f067 PUD 1811067 PMD 0
 kernel: Oops:  [#1] PREEMPT SMP
 kernel: Modules linked in: xt_recent xt_limit veth ipt_MASQUERADE cbc
 ipt_REJECT xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4
 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp
 iptable_mangle ip_tables x_tables coretemp hwmon x86_pkg_temp_thermal
 crct10dif_pclmul crct10di
 f_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode evdev
 pcspkr xen_netback xen_blkback xen_gntalloc xenfs xen_gntdev
 xen_evtchn rbd libceph crc32c libcrc32c ext4 crc16 mbcache jbd2
 ixgbevf xen_privcmd xen_pcifront xen_netfront xen_kbdfro
 nt xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops
 xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio
 ipmi_poweroff ipmi_msghandler
 kernel: CPU: 3 PID: 2887 Comm: mysqld Not tainted 3.14.1-1-js #1
 kernel: task: 880342ce1d70 ti: 88033e17e000 task.ti: 88033e17e000
 kernel: RIP: e030:[812832dd]  [812832dd] strnlen+0xd/0x40

We're calling strnlen() (ultimately) from snprintf().  The
format provided will be %s.%012llx (or similar).  The
string provided for the %s is rbd_dev->header.object_prefix,
which is a dynamically allocated string initialized once
for the rbd device, and which will be NUL-terminated and
unchanging until the device gets unmapped.

Either the rbd device got unmapped while still
in use, or the memory holding this rbd_dev structure
got corrupted somehow.
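
As a rough illustration only (not the actual rbd code), the name construction
described above amounts to something like the sketch below; object_prefix is
the pointer that strnlen() ends up walking, so a stale or corrupted rbd_dev is
enough to fault exactly where the oops points. The buffer size and prefix
value here are made up.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Sketch only: approximate shape of how an rbd object name is built from the
// image's object prefix and an object/segment number (see rbd_segment_name()
// in drivers/block/rbd.c for the real code).
static void format_object_name(char *buf, std::size_t len,
                               const char *object_prefix, std::uint64_t obj_num) {
    // snprintf -> vsnprintf -> string() -> strnlen(object_prefix, ...):
    // if object_prefix points at freed or corrupted memory, strnlen faults,
    // which matches the RIP in the oops above.
    std::snprintf(buf, len, "%s.%012llx", object_prefix,
                  (unsigned long long)obj_num);
}

int main() {
    char name[64];
    format_object_name(name, sizeof(name), "rbd_data.1234abcd", 42);
    std::puts(name);   // e.g. rbd_data.1234abcd.00000000002a
}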

 kernel: RSP: e02b:88033e17fa78  EFLAGS: 00010286
 kernel: RAX: 816feed6 RBX: 8800ffb60068 RCX: fffe
 kernel: RDX:  RSI:  RDI: 
 kernel: RBP: 88033e17fa78 R08:  R09: 
 kernel: R10: 756d696e612d736a R11: 2f30622d637a762f R12: 
 kernel: R13: 8800ffb600cd R14:  R15: 
 kernel: FS:  7f98dec7f700() GS:8803b5d8()
 knlGS:
 kernel: CS:  e033 DS:  ES:  CR0: 8005003b
 kernel: CR2:  CR3: 000335dcc000 CR4: 2660
 kernel: Stack:
 kernel:  88033e17fab0 8128562b 8800ffb60068 8800ffb600cd
 kernel:  88033e17fb28 a0196869 a0196869 88033e17fb18
 kernel:  81286ac1 880389525960 814de243 
 kernel: Call Trace:
 kernel:  [8128562b] string.isra.6+0x3b/0xf0
 kernel:  [81286ac1] vsnprintf+0x1c1/0x610
 kernel:  [814de243] ? _raw_spin_unlock_irq+0x13/0x30
 kernel:  [81286fd9] snprintf+0x39/0x40
 kernel:  [a01911e0] ? rbd_img_request_fill+0x100/0x6d0 [rbd]
 kernel:  [a019122a] rbd_img_request_fill+0x14a/0x6d0 [rbd]
 kernel:  [a018f4d5] ? rbd_img_request_create+0x155/0x220 [rbd]
 kernel:  [8125cab9] ? blk_add_timer+0x19/0x20
 kernel:  [a0194a1d] rbd_request_fn+0x1ed/0x330 [rbd]
 kernel:  [81252f13] __blk_run_queue+0x33/0x40
 kernel:  [8125411e] queue_unplugged+0x2e/0xd0
 kernel:  [81256cf0] blk_flush_plug_list+0x1f0/0x230
 kernel:  [812570a4] blk_finish_plug+0x14/0x40
 kernel:  [a00b9d6e] ext4_writepages+0x48e/0xd50 [ext4]
 kernel:  [81136aae] ? generic_file_aio_write+0x5e/0xe0
 kernel:  [811417ae] do_writepages+0x1e/0x40
 kernel:  [811363d9] __filemap_fdatawrite_range+0x59/0x60
 kernel:  [811364da] filemap_write_and_wait_range+0x2a/0x70
 kernel:  [a00b149a] ext4_sync_file+0xba/0x360 [ext4]
 kernel:  [811d50ce] do_fsync+0x4e/0x80
 kernel:  [811d5373] SyS_fdatasync+0x13/0x20
 kernel:  [814e66e9] system_call_fastpath+0x16/0x1b
 kernel: Code: c0 01 80 38 00 75 f7 48 29 f8 5d c3 31 c0 5d c3 66 66 66
 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74
 2a 80 3f 00 74 25 48 89 f8 31 d2 eb 10 0f 1f 80 00 00 00 00 48 83
 kernel: RIP  [812832dd] strnlen+0xd/0x40
 kernel:  RSP 88033e17fa78
 kernel: CR2: 
 kernel: ---[ end trace 83a2fd2a9969b20d ]---
 
 kernel: BUG: unable to handle kernel paging request at 177b
 kernel: IP: [814dd33c] down_read+0xc/0x20
 

Re: crash report: paging request errors in various krbd contexts

2014-05-16 Thread Hannes Landeholm
 The bottom line is that it appears that some
 memory used by rbd and/or libceph has become
 corrupted, or there is something (or more than
 one thing) that is being used after it's been
 freed.  Either way this sort of thing will be
 difficult to try to understand; it would be
 great if it could be reproduced independently.

 We're calling strnlen() (ultimately) from snprintf().  The
 format provided will be %s.%012llx (or similar).  The
 string provided for the %s is rbd_dev->header.object_prefix,
 which is a dynamically allocated string initialized once
 for the rbd device, and which will be NUL-terminated and
 unchanging until the device gets unmapped.

 Either the rbd device got unmapped while still
 in use, or the memory holding this rbd_dev structure
 got corrupted somehow.

Yes, with my limited knowledge of the kernel I would have guessed that
it was some form of memory allocation problem as well, since it crashed in
wildly different contexts and it crashed right after a memory
allocation in the snprintf() case.

Is it possible to configure the kernel when building it so it sanity
checks memory allocations that are freed and/or reserved? I have
implemented my own free-list-based VM in userspace and I find it very
useful to insert a header with a magic canary value that I set before
giving out memory and check when I get the memory back. This lets me
crash with the offending code in the backtrace instead of crashing in
a wildly different context.
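
For reference, here is a minimal userspace C++ sketch of the canary-header
idea described above (illustrative only, not kernel code; the magic value and
helper names are made up).

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical canary-guarded allocator: each block gets a small header with
// a magic value that must still be intact when the block comes back.
namespace guarded {

constexpr std::uint64_t kCanary = 0xfeedfacecafebeefULL;

struct Header {
    std::uint64_t canary;
    std::size_t   size;
};

void *alloc(std::size_t size) {
    Header *h = static_cast<Header *>(std::malloc(sizeof(Header) + size));
    if (!h)
        return nullptr;
    h->canary = kCanary;
    h->size = size;
    return h + 1;                        // hand out the bytes after the header
}

void free(void *p) {
    if (!p)
        return;
    Header *h = static_cast<Header *>(p) - 1;
    // If a caller scribbled before its buffer, or frees a pointer it never
    // got from alloc(), this fires with the offender on the stack.
    assert(h->canary == kCanary && "heap canary clobbered or bad free");
    h->canary = 0;                       // helps catch double frees too
    std::free(h);
}

} // namespace guarded

int main() {
    char *buf = static_cast<char *>(guarded::alloc(32));
    std::memset(buf, 'x', 32);           // normal use: header stays intact
    guarded::free(buf);                  // passes; underflowing buf would assert
}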

 I don't know if you've supplied this before, but can
 you describe the way the rbd device(s) in question
 is configured?  How many devices, how big are they,
 and *especially*, are they using layering and if so
 what the relationships are between them.

It's something like ~100-200 mappings that are 10 GB each. They use
layering and generally share the same parent, with varying distance to
the common ancestor snapshot, but it's unlikely to be more than ~20
layers at the moment. More than 75% probably share the same common
ancestor. We don't have rbd caching enabled.

Thank you for your time,
Hannes


Re: Radosgw - bucket index

2014-05-16 Thread Yehuda Sadeh
On Fri, May 16, 2014 at 6:09 AM, Sage Weil s...@inktank.com wrote:
 Hi Guang,

 [I think the problem is that your email is HTML formatted, and vger
 silently drops those.  Make sure your mailer is set to plain text mode.]

 On Fri, 16 May 2014, Guang wrote:

    * *Key/value OSD backend* (experimental): An alternative storage backend
      for Ceph OSD processes that puts all data in a key/value database like
      leveldb.  This provides better performance for workloads dominated by
      key/value operations (like radosgw bucket indices).

 Hi Yehuda and Haomai, I managed to set up a K/V store backend and played
 around with it. As Sage mentioned in the release note, I thought the K/V store
 could be the solution for radosgw's bucket indexing feature, which currently
 has scaling problems [1]. However, after playing around with the K/V store and
 understanding the requirements for bucket indexing, I think that at least for
 now there is still a gap in fixing bucket indexing by leveraging the K/V store.

 In my opinion, one requirement (API) for implementing bucket indexing is
 support for ordered scans (prefix filter), which is not part of the rados API.
 As the K/V store does not extend the rados API (it is not supposed to) but
 only changes the underlying object store strategy, it is not likely to help
 with bucket indexing, except if we keep the original approach of using omap to
 store the bucket index, with each bucket corresponding to one object.

 The rados omap API does allow a prefix filter, although it's somewhat
 implicit:

 /**
  * omap_get_keys: keys from the object omap
  *
  * Get up to max_return keys beginning after start_after
  *
  * @param start_after [in] list keys starting after start_after
  * @param max_return [in] list no more than max_return keys
  * @param out_keys [out] place returned values in out_keys on completion
  * @param prval [out] place error code in prval upon completion
  */
 void omap_get_keys(const std::string &start_after,
                    uint64_t max_return,
                    std::set<std::string> *out_keys,
                    int *prval);

 Since all keys are sorted alphanumerically, you simply have to set
 start_after == your prefix, and start ignoring the results once you get a
 key that no longer starts with your prefix.  This could be improved by having
 an explicit prefix argument that does this server-side, but for now you
 can get the right data (plus a bit of extra at the end).

 Is that what you mean by prefix scan, or are you referring to the ability
 to scan for rados objects that begin with a prefix?  If it's the latter,
 you are right: objects are hashed across nodes and there is no sorted
 object name index to allow prefix filtering.  There is a list_objects
 filter option, but it is still O(objects in the pool).

 Did I miss anything obvious here?

 We are very interested in the effort to improve the scalability of the bucket
 index [1], as the blueprint mentioned. Here are my thoughts on top of this:
  1. It would be nice if we could refactor the interface so that it is easy to
 switch to a different underlying storage system for bucket indexing; for
 example, DynamoDB seems to be used for S3's implementation [2], and SWIFT
 uses sqlite [3] and has a flat namespace for listing purposes (with prefix
 and delimiter).

 radosgw is using the omap key/value API for objects, which is more or less
 equivalent to what swift is doing with sqlite.  This data passes straight
 into leveldb on the backend (or whatever other backend you are using).
 Using something like rocksdb in its place is pretty simple and there are
 unmerged patches to do that; the user would just need to adjust their
 crush map so that the rgw index pool is mapped to a different set of OSDs
 with the better k/v backend.

  2. As mentioned in the blueprint, if we go with the approach of sharding
 the bucket index object, what is the design choice? Are we going to
 maintain a B-tree structure to keep all keys sorted and shard on demand,
 like having a background thread do the sharding when it reaches a certain
 threshold?

 I don't know... I'm sure Yehuda has a more well-formed opinion on this.  I
 suspect something simpler than a B-tree (like a single-level hash-based
 fan-out) would be sufficient, although you'd pay a bit of a price for
 object enumeration.


My more well-formed opinion is that we need to come up with a good
design. It needs to be flexible enough to be able to grow (and maybe
shrink), and I assume there would be some kind of background operation
that will enable that. I also believe that making it hash-based is the
way to go. It looks like the more complicated issue here is
how to handle the transition in which we shard buckets.
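
To make the fan-out idea and the transition concern concrete, here is a toy,
self-contained C++ sketch (purely illustrative, not radosgw code; every name
in it is made up): writes touch a single hash-selected shard, a sorted listing
has to merge every shard, and changing the shard count reassigns a large
fraction of keys, which is why resharding needs some migration plan.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy single-level hash fan-out over N ordered shards (illustrative only).
struct ShardedIndex {
    std::vector<std::map<std::string, std::string>> shards;
    explicit ShardedIndex(std::size_t n) : shards(n) {}

    std::size_t shard_of(const std::string &key) const {
        return std::hash<std::string>{}(key) % shards.size();
    }
    // Insert is cheap: exactly one shard object is touched per key.
    void insert(const std::string &key, const std::string &val) {
        shards[shard_of(key)][key] = val;
    }
    // Enumeration pays the price: every shard has to be read and merged to
    // produce a globally sorted listing.
    std::vector<std::string> list_sorted() const {
        std::vector<std::string> all;
        for (const auto &s : shards)
            for (const auto &kv : s)
                all.push_back(kv.first);
        std::sort(all.begin(), all.end());
        return all;
    }
};

int main() {
    std::vector<std::string> keys = {"alpha", "bravo", "charlie", "delta",
                                     "echo", "foxtrot", "golf", "hotel"};
    ShardedIndex idx(4);
    for (const auto &k : keys)
        idx.insert(k, "");
    for (const auto &k : idx.list_sorted())
        std::cout << k << " ";            // sorted listing merged across shards
    std::cout << "\n";

    // The transition problem: with shard = hash(key) % N, growing N reassigns
    // keys, so resharding a live bucket index means moving entries around.
    std::size_t moved = 0;
    for (const auto &k : keys) {
        std::size_t h = std::hash<std::string>{}(k);
        if (h % 4 != h % 8)
            ++moved;
    }
    std::cout << moved << " of " << keys.size()
              << " keys change shards when going from 4 to 8 shards\n";
}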

Yehuda

Re: XioMessenger (RDMA) Status Update

2014-05-16 Thread Sage Weil
Hi Matt,

I've rebased this branch on top of master and pushed it to wip-xio in 
ceph.git, and then opened a pull request to capture review:

https://github.com/ceph/ceph/pull/1819

I would like to get some of the preliminary pieces into master sooner 
rather than later so we can start cutting down on the size of the branch.  
I've started by looking at just the first few patches that modify the Messenger
and made a few cleanups:

 - use SimplePolicyMessenger for SimpleMessenger, too, to avoid dup code
 - move PipeConnection into a separate header and tweak a few things to 
   remove it from the generic Messenger/Message/Connection interface
 - cleaned up a bit of cruft from the original patch
 - resolved a few merge conflicts between firefly and master

Before I get too far into this can you take a look?  I'd like to pull 
*all* of the non-xio stuff to the top of the branch and get it in good 
shape first.

I think the next step for me is to look at how you've instantiated the 
alternate xio messenger and clean up that interface.  This is probably 
also a good time for us to get the entity_addr_t and entity_name_t stuff 
sorted out.

Thanks!
sage


On Tue, 13 May 2014, Matt W. Benjamin wrote:

 Hi Ceph Devs,
 
 I've pushed two Ceph+Accelio branches, xio-firefly and xio-firefly-cmake, to
 our public ceph repository https://github.com/linuxbox2/linuxbox-ceph.git .
 
 These branches are pulled up to the HEAD of ceph/firefly, and also have
 improvements to XioMessenger which allow:
 
 1. operation of (at least) a minimal Ceph Mon and OSD cluster over Xio
    messaging alone
 2. initial support for Accelio in the MDS server and Client/libcephfs (not
    formally tested)
 
 There are some new config options:
 
 (global addr) rdma local sets a local rdma interface address
 (global bool) cluster rdma selects Accelio for intra-cluster communication
 (global bool) client rdma selects Accelio for libcephfs communications
 
 These changes haven't been strenuously tested; we expect to have additional
 information, and likely new, simple rados bench results against this code,
 in the next several days at the latest.
 
 Thanks,
 
 Matt
 
 -- 
 Matt Benjamin
 CohortFS, LLC.
 206 South Fifth Ave. Suite 150
 Ann Arbor, MI  48104
 
 http://cohortfs.com
 
 tel.  734-761-4689 
 fax.  734-769-8938 
 cel.  734-216-5309 


Re: XioMessenger (RDMA) Status Update

2014-05-16 Thread Gregory Farnum
On Fri, May 16, 2014 at 11:21 AM, Sage Weil s...@inktank.com wrote:
 Hi Matt,

 I've rebased this branch on top of master and pushed it to wip-xio in
 ceph.git, and then opened a pull request to capture review:

 https://github.com/ceph/ceph/pull/1819

 I would like to get some of the preliminary pieces into master sooner
 rather than later so we can start cutting down on the size of the branch.
 I've started by looking at just the first few patches that modify the Messenger
 and made a few cleanups:

  - use SimplePolicyMessenger for SimpleMessenger, too, to avoid dup code
  - move PipeConnection into a separate header and tweak a few things to
remove it from the generic Messenger/Message/Connection interface
  - cleaned up a bit of cruft from the original patch
  - resolved a few merge conflicts between firefly and master

 Before I get too far into this can you take a look?  I'd like to pull
 *all* of the non-xio stuff to the top of the branch and get it in good
 shape first.

 I think the next step for me is to look at how you've instantiated the
 alternate xio messenger and clean up that interface.  This is probably
 also a good time for us to get the entity_addr_t and entity_name_t stuff
 sorted out.

Unfortunately, I think we need to get farther along before pulling any
of this into master. In particular, unless it's been updated since I
looked at it last, the SimplePolicyMessenger doesn't do anything right
now and we're not sure whether the interface is actually helpful.
We can perhaps do some of the multi-messenger support bits, but the
internal interface changes really need to wait until we've verified
that it's possible to implement a non-TCP messenger with them before
it's worth making the TCP stuff follow their rules.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: XioMessenger (RDMA) Status Update

2014-05-16 Thread Sage Weil
On Fri, 16 May 2014, Gregory Farnum wrote:
 On Fri, May 16, 2014 at 11:21 AM, Sage Weil s...@inktank.com wrote:
  Hi Matt,
 
  I've rebased this branch on top of master and pushed it to wip-xio in
  ceph.git, and then opened a pull request to capture review:
 
  https://github.com/ceph/ceph/pull/1819
 
  I would like to get some of the preliminary pieces into master sooner
  rather than later so we can start cutting down on the size of the branch.
  I've started by looking at just the first few patches that modify the Messenger
  and made a few cleanups:
 
   - use SimplePolicyMessenger for SimpleMessenger, too, to avoid dup code
   - move PipeConnection into a separate header and tweak a few things to
 remove it from the generic Messenger/Message/Connection interface
   - cleaned up a bit of cruft from the original patch
   - resolved a few merge conflicts between firefly and master
 
  Before I get too far into this can you take a look?  I'd like to pull
  *all* of the non-xio stuff to the top of the branch and get it in good
  shape first.
 
  I think the next step for me is to look at how you've instantiated the
  alternate xio messenger and clean up that interface.  This is probably
  also a good time for us to get the entity_addr_t and entity_name_t stuff
  sorted out.
 
 Unfortunately, I think we need to get farther along before pulling any
 of this into master.

I agree. I'm just trying to get this cleaned up and ordered so that it can 
be sanely reviewed.  :)

 In particular, unless it's been updated since I looked at it last, the 
 SimplePolicyMessenger doesn't do anything right now and we're not sure
 whether the interface is actually helpful. We can perhaps do some of
 the multi-messenger support bits, but the internal interface changes 
 really need to wait until we've verified that it's possible to implement 
 a non-TCP messenger with them before it's worth making the TCP stuff 
 follow their rules.

The SimplePolicyMessenger piece just pulls the get/set Policy stuff out of
SimpleMessenger into a parent class; nothing more.  The multiplexer is
something else (which I can't find in that branch at the moment!).
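
Roughly, that refactor amounts to hoisting the per-peer-type policy
bookkeeping into a shared base class. The sketch below shows the pattern only;
the fields and type codes are made up and it is not the actual Ceph interface.

#include <map>

// Sketch of the refactor: policy bookkeeping lives in a shared base class so
// every Messenger implementation (TCP, xio, ...) gets identical get/set
// behavior.
struct Policy {
    bool lossy = false;
    // ... throttle limits, reconnect behavior, etc. would live here
};

class Messenger {
public:
    virtual ~Messenger() = default;
    // transport-specific operations (bind, send_message, ...) would live here
};

// The piece under discussion: only the get/set Policy logic, nothing else.
class SimplePolicyMessenger : public Messenger {
    std::map<int, Policy> policy_map;   // keyed by peer entity type
    Policy default_policy;
public:
    void set_default_policy(const Policy &p) { default_policy = p; }
    void set_policy(int type, const Policy &p) { policy_map[type] = p; }
    Policy get_policy(int type) const {
        auto it = policy_map.find(type);
        return it == policy_map.end() ? default_policy : it->second;
    }
};

// Concrete transports then add only their transport logic:
class SimpleMessenger : public SimplePolicyMessenger { /* TCP pipes ... */ };
class XioMessenger    : public SimplePolicyMessenger { /* Accelio/RDMA ... */ };

int main() {
    SimpleMessenger m;
    Policy p;
    p.lossy = true;
    m.set_policy(42 /* illustrative peer-type id */, p);
    return m.get_policy(42).lossy ? 0 : 1;
}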

So far everything I've looked at has been a nice cleanup of the abstract 
interface.  The next thing I'm concerned about is something akin to a sane 
factory method so that (hopefully) the explicit SimpleMessenger references 
can be dropped from other parts of the code.  I'm not touching the xio 
bits yet, although it looks like that will quickly lead us to some 
other core pieces like buffer support for xio.
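
A sketch of the kind of factory being hinted at here (hypothetical; the real
interface would come out of review): callers ask for a messenger by transport
name instead of naming SimpleMessenger directly.

#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical factory so callers stop naming SimpleMessenger explicitly;
// the transport is chosen from a type string (e.g. driven by config).
class Messenger {
public:
    virtual ~Messenger() = default;
    static std::unique_ptr<Messenger> create(const std::string &type);
};

class SimpleMessenger : public Messenger { /* TCP implementation */ };
class XioMessenger    : public Messenger { /* Accelio/RDMA implementation */ };

std::unique_ptr<Messenger> Messenger::create(const std::string &type) {
    if (type == "simple")
        return std::make_unique<SimpleMessenger>();
    if (type == "xio")
        return std::make_unique<XioMessenger>();
    throw std::invalid_argument("unknown messenger type: " + type);
}

int main() {
    // e.g. the type string could come from a config option
    auto msgr = Messenger::create("simple");
    return msgr ? 0 : 1;
}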

Don't worry--none of this will get merged until it is cleaned up, 
reviewed, and tested!

sage



[GIT PULL] Ceph fixes for -rc6

2014-05-16 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The first patch fixes a problem when we have a page count of 0 for 
sendpage which is triggered by zfs.  The second fixes a bug in CRUSH that 
was resolved in the userland code a while back but fell through the cracks 
on the kernel side.

Thanks!
sage



Chunwei Chen (1):
  libceph: fix corruption when using page_count 0 page in rbd

Ilya Dryomov (1):
  crush: decode and initialize chooseleaf_vary_r

 net/ceph/messenger.c | 20 ++++++++++++++++++++-
 net/ceph/osdmap.c    |  5 +++++
 2 files changed, 24 insertions(+), 1 deletion(-)


bloom filters

2014-05-16 Thread Loic Dachary
Hi Sahid,

Here are the files implementing the bloom filter we discussed tonight

  https://github.com/ceph/ceph/blob/master/src/common/bloom_filter.hpp
  https://github.com/ceph/ceph/blob/master/src/common/bloom_filter.cc

and the associated unit tests

  https://github.com/ceph/ceph/blob/master/src/test/common/test_bloom_filter.cc

Improving code coverage would be a nice way for you to learn the code base ;-)
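
For anyone new to the structure, here is a tiny generic sketch of the idea
(illustrative only; the API and internals of Ceph's bloom_filter.hpp differ):
K hash probes are set per insert, and a lookup can return false positives but
never false negatives.

#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Minimal illustrative bloom filter: M bits, K hash probes per element.
template <std::size_t M = 1024, std::size_t K = 4>
class TinyBloom {
    std::bitset<M> bits;
    static std::size_t probe(const std::string &s, std::size_t i) {
        // Derive K probe positions from two hashes (double hashing).
        std::size_t h1 = std::hash<std::string>{}(s);
        std::size_t h2 = std::hash<std::string>{}(s + "#salt");
        return (h1 + i * h2) % M;
    }
public:
    void insert(const std::string &s) {
        for (std::size_t i = 0; i < K; ++i)
            bits.set(probe(s, i));
    }
    // false => definitely absent; true => probably present (false positives possible)
    bool maybe_contains(const std::string &s) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits.test(probe(s, i)))
                return false;
        return true;
    }
};

int main() {
    TinyBloom<> bf;
    bf.insert("osd.0");
    std::cout << bf.maybe_contains("osd.0") << " "    // 1
              << bf.maybe_contains("osd.1") << "\n";  // almost certainly 0
}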

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Ceph low hanging fruit

2014-05-16 Thread Loic Dachary
Hi Florent,

Here is a low hanging fruit for you to walk through the Ceph contribution 
process:

  http://tracker.ceph.com/issues/7725

I'm announcing it on the development mailing list so that it is not taken from 
you ;-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



