Re: [ceph-users] OSD Weights

2013-02-14 Thread Sébastien Han
Hi,

As far as I know, Ceph won't attempt to make any weight modifications itself. If
you use the default CRUSH map, every device gets a default weight of
1. However, this value can be modified while the cluster runs. Simply
update the CRUSH map like so:

# ceph osd crush reweight {name} {weight}
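
For example (osd.2 and osd.3 are just hypothetical names here), to weight a
3 TB and a 4 TB drive in proportion to their capacity you could run something
like:

# ceph osd crush reweight osd.2 3.0
# ceph osd crush reweight osd.3 4.0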

If you need more input, have a look at the documentation ;-)

http://ceph.com/docs/master/rados/operations/crush-map/?highlight=crush#adjust-an-osd-s-crush-weight

Cheers,
--
Regards,
Sébastien Han.


On Wed, Feb 13, 2013 at 4:23 PM, sheng qiu herbert1984...@gmail.com wrote:
 Hi Gregory,

 once ceph is running, will ceph change the weight dynamically (if it is
 not set properly), or can it only be changed by the user through the
 command line, or can it not be changed online at all?

 Thanks,
 Sheng

 On Mon, Feb 11, 2013 at 3:31 PM, Gregory Farnum g...@inktank.com wrote:
 On Mon, Feb 11, 2013 at 12:43 PM, Holcombe, Christopher
 cholc...@cscinfo.com wrote:
 Hi Everyone,

 I just wanted to confirm my thoughts on the ceph osd weightings.  My 
 understanding is they are a statistical distribution number.  My current 
 setup has 3TB hard drives and they all have the default weight of 1.  I was 
 thinking that if I mixed in 4TB hard drives in the future it would only put 
 3TB of data on them.  I thought if I changed the weight to 3 for the 3TB 
 hard drives and 4 for the 4TB hard drives it would correctly use the larger 
 storage disks.  Is that correct?

 Yep, looks good.
 -Greg
 PS: This is a good question for the new ceph-users list.
 (http://ceph.com/community/introducing-ceph-users/)
 :)



 --
 Sheng Qiu
 Texas A&M University
 Room 332B Wisenbaker
 email: herbert1984...@gmail.com
 College Station, TX 77843-3259


urgent journal conf on ceph.conf

2013-02-14 Thread charles L

Please can someone help me with the ceph.conf for 0.56.2? I have two servers for 
storage, each with 3 TB hard drives and two SSDs.  I want to put the OSD data 
on the hard drives and the OSD journal on an SSD. 

I want to know how the osd journal configuration is pointed at the SSD. My SSD 
is /dev/sdb.

I have tried the osd data configuration devs = /dev/sda and it worked just 
fine. 

Is this line correct: osd journal = /dev/osd$id/journal? Or should it be osd 
journal = /dev/sdb?

[global]

                auth cluster required = cephx
                auth service required = cephx
                auth client required = cephx
                debug ms = 1

[osd]
                osd journal size = 1000
                osd journal = /dev/osd$id/journal
                filestore xattr use omap = true
                osd mkfs type = xfs
                osd mkfs options xfs = -f
                osd mount options xfs = rw,noatime,

[osd.0]
                host = server04
                devs = /dev/sda
                osd journal = /dev/sdb
[osd.1]
                host = server05
                devs = /dev/sda
                osd journal = /dev/sdb

THANKS.


Re: OSD dies after seconds

2013-02-14 Thread Jesus Cuenca
I upgraded to ceph 0.56-3 but the problem persists...

The OSD starts but exits after a second:

2013-02-14 12:18:34.504391 7fae613ea760 10 journal _open journal is
not a block device, NOT checking disk write cache on
'/var/lib/ceph/osd/ceph-0/jour
nal'
2013-02-14 12:18:34.504400 7fae613ea760  1 journal _open
/var/lib/ceph/osd/ceph-0/journal fd 17: 1048576 bytes, block size
4096 bytes, directio = 1
, aio = 0
2013-02-14 12:18:34.504458 7fae613ea760 10 journal journal_start
2013-02-14 12:18:34.504506 7fae5d3c6700 10 journal write_thread_entry start
2013-02-14 12:18:34.504515 7fae5d3c6700 20 journal write_thread_entry
going to sleep
2013-02-14 12:18:34.504706 7fae5cbc5700 10 journal
write_finish_thread_entry enter
2013-02-14 12:18:34.504716 7fae5cbc5700 20 journal
write_finish_thread_entry sleeping
2013-02-14 12:18:34.504893 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry start
2013-02-14 12:18:34.504903 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry sleeping
2013-02-14 12:18:34.505013 7fae613ea760  5
filestore(/var/lib/ceph/osd/ceph-0) umount /var/lib/ceph/osd/ceph-0
2013-02-14 12:18:34.505036 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry awoke
2013-02-14 12:18:34.505044 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry finish
2013-02-14 12:18:34.505113 7fae5dbc7700 20
filestore(/var/lib/ceph/osd/ceph-0) sync_entry force_sync set
2013-02-14 12:18:34.505129 7fae5dbc7700 10 journal commit_start
max_applied_seq 2, open_ops 0
2013-02-14 12:18:34.505136 7fae5dbc7700 10 journal commit_start
blocked, all open_ops have completed
2013-02-14 12:18:34.505138 7fae5dbc7700 10 journal commit_start nothing to do
2013-02-14 12:18:34.505141 7fae5dbc7700 10 journal commit_start
2013-02-14 12:18:34.505506 7fae613ea760 10 journal journal_stop
2013-02-14 12:18:34.505698 7fae613ea760  1 journal close
/var/lib/ceph/osd/ceph-0/journal
2013-02-14 12:18:34.505787 7fae5d3c6700 20 journal write_thread_entry woke up
2013-02-14 12:18:34.505796 7fae5d3c6700 10 journal write_thread_entry finish
2013-02-14 12:18:34.505845 7fae5cbc5700 10 journal
write_finish_thread_entry exit


On Wed, Feb 13, 2013 at 6:28 PM, Jesus Cuenca jcue...@cnb.csic.es wrote:
 thanks for the fast answer.

 no, it does not segfault:

 gdb --args /usr/local/bin/ceph-osd -i 0
 ...
 (gdb) run
 Starting program: /usr/local/bin/ceph-osd -i 0
 [Thread debugging using libthread_db enabled]
 [New Thread 0x75fce700 (LWP 8920)]
 starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
 /var/lib/ceph/osd/ceph-0/journal
 [Thread 0x75fce700 (LWP 8920) exited]

 Program exited normally.

 --



 On Wed, Feb 13, 2013 at 6:21 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 13 Feb 2013, Jesus Cuenca wrote:
 Hi,

 I'm setting up a small ceph 0.56.2 cluster on 3 64-bit Debian 6
 servers with kernel 3.7.2.

 This might be

 http://tracker.ceph.com/issues/3595

 which is a problem with google perftools (which we use by default): the
 version in squeeze is buggy.  This doesn't seem to affect all
 squeeze users.

 Does it seg fault?

 sage



 My problem is that the OSDs die. First I try to start them with the init script:

  /etc/init.d/ceph start osd.0
 ...
 starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
 /var/lib/ceph/osd/ceph-0/journal

  ps -ef | grep ceph
 (No ceph-osd process)

 I then run with debugging:

  ceph-osd -i 0 --debug_ms 20 --debug_osd 20 --debug_filestore 20 
  --debug_journal 20 -d
 starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
 /var/lib/ceph/osd/ceph-0/journal
 2013-02-13 18:04:40.351830 7fe98cd8a760 10 -- :/0 rank.bind :/0
 2013-02-13 18:04:40.351895 7fe98cd8a760 10 accepter.accepter.bind
 2013-02-13 18:04:40.351910 7fe98cd8a760 10 accepter.accepter.bind
 bound on random port 0.0.0.0:6800/0
 2013-02-13 18:04:40.351919 7fe98cd8a760 10 accepter.accepter.bind
 bound to 0.0.0.0:6800/0
 2013-02-13 18:04:40.351930 7fe98cd8a760  1 accepter.accepter.bind
 my_inst.addr is 0.0.0.0:6800/8438 need_addr=1
 2013-02-13 18:04:40.351935 7fe98cd8a760 10 -- :/0 rank.bind :/0
 2013-02-13 18:04:40.351938 7fe98cd8a760 10 accepter.accepter.bind
 2013-02-13 18:04:40.351943 7fe98cd8a760 10 accepter.accepter.bind
 bound on random port 0.0.0.0:6801/0
 2013-02-13 18:04:40.351946 7fe98cd8a760 10 accepter.accepter.bind
 bound to 0.0.0.0:6801/0
 2013-02-13 18:04:40.351952 7fe98cd8a760  1 accepter.accepter.bind
 my_inst.addr is 0.0.0.0:6801/8438 need_addr=1
 2013-02-13 18:04:40.351959 7fe98cd8a760 10 -- :/0 rank.bind :/0
 2013-02-13 18:04:40.351961 7fe98cd8a760 10 accepter.accepter.bind
 2013-02-13 18:04:40.351966 7fe98cd8a760 10 accepter.accepter.bind
 bound on random port 0.0.0.0:6802/0
 2013-02-13 18:04:40.351969 7fe98cd8a760 10 accepter.accepter.bind
 bound to 0.0.0.0:6802/0
 2013-02-13 18:04:40.351975 7fe98cd8a760  1 accepter.accepter.bind
 my_inst.addr is 0.0.0.0:6802/8438 need_addr=1
 2013-02-13 18:04:40.352636 7fe98cd8a760  5
 filestore(/var/lib/ceph/osd/ceph-0) basedir 

Re: urgent journal conf on ceph.conf

2013-02-14 Thread Wido den Hollander

On 02/14/2013 11:24 AM, charles L wrote:


Please can someone help me with the ceph.conf for 0.56.2? I have two servers for 
storage, each with 3 TB hard drives and two SSDs.  I want to put the OSD data 
on the hard drives and the OSD journal on an SSD.

I want to know how the osd journal configuration is pointed at the SSD. My SSD 
is /dev/sdb.

I have tried the osd data configuration devs = /dev/sda and it worked just 
fine.

Is this line correct: osd journal = /dev/osd$id/journal? Or should it be osd 
journal = /dev/sdb?



In the osd specific (osd.0 and osd.1) sections you override the journal 
settings made in the [osd] section.


They are not needed, since you give the whole block device (/dev/sdb) to 
the OSD as a journal.


Are you sure /dev/sda is available for the OSD and it's not your boot 
device?
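
Something like this would be a minimal sketch (assuming /dev/sdb really is a 
dedicated SSD on both hosts and /dev/sda holds the OSD data):

[osd]
        filestore xattr use omap = true
        osd mkfs type = xfs
        osd mkfs options xfs = -f
        osd mount options xfs = rw,noatime

[osd.0]
        host = server04
        devs = /dev/sda
        osd journal = /dev/sdb    # raw SSD used as this OSD's journal

[osd.1]
        host = server05
        devs = /dev/sda
        osd journal = /dev/sdb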


Wido


[global]

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 debug ms = 1

[osd]
 osd journal size = 1000
 osd journal = /dev/osd$id/journal
 filestore xattr use omap = true
 osd mkfs type = xfs
 osd mkfs options xfs = -f
 osd mount options xfs = rw,noatime,

[osd.0]
 host = server04
 devs = /dev/sda
 osd journal = /dev/sdb
[osd.1]
 host = server05
 devs = /dev/sda
 osd journal = /dev/sdb

THANKS.




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: urgent journal conf on ceph.conf

2013-02-14 Thread Joao Eduardo Luis

Including ceph-users, as it feels like this belongs there :-)


On 02/14/2013 01:47 PM, Wido den Hollander wrote:

On 02/14/2013 11:24 AM, charles L wrote:


Please can someone help me with the ceph.conf for 0.56.2? I have two
servers for storage, each with 3 TB hard drives and two SSDs.  I want
to put the OSD data on the hard drives and the OSD journal on an SSD.

I want to know how the osd journal configuration is pointed at the SSD.
My SSD is /dev/sdb.

I have tried the osd data configuration devs = /dev/sda and it worked
just fine.

Is this line correct: osd journal = /dev/osd$id/journal? Or should it
be osd journal = /dev/sdb?



In the osd specific (osd.0 and osd.1) sections you override the journal
settings made in the [osd] section.

They are not needed, since you give the whole block device (/dev/sdb) to
the OSD as a journal.

Are you sure /dev/sda is available for the OSD and it's not your boot
device?

Wido


[global]

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx
 debug ms = 1

[osd]
 osd journal size = 1000
 osd journal = /dev/osd$id/journal
 filestore xattr use omap = true
 osd mkfs type = xfs
 osd mkfs options xfs = -f
 osd mount options xfs = rw,noatime,

[osd.0]
 host = server04
 devs = /dev/sda
 osd journal = /dev/sdb
[osd.1]
 host = server05
 devs = /dev/sda
 osd journal = /dev/sdb

THANKS.








Re: [ceph-users] urgent journal conf on ceph.conf

2013-02-14 Thread Sébastien Han
+1 for Wido

Moreover, if you want to store the journal on a block device, you
should partition your journal disk and assign one partition per OSD,
like /dev/sdb1, /dev/sdb2, /dev/sdb3, ...

Again, osd journal = /dev/osd$id/journal is wrong here: if you use this
directive, it must point to a path on a filesystem, because the journal
will be a file.

Anyway, as far as I'm concerned, I didn't notice that much performance
gain from putting the journal on a raw block device. In the end, just
put the journal on a dedicated, formatted partition, since the
filesystem overhead is not that big. So keep the osd journal =
/dev/osd$id/journal form, but change it to something like osd journal =
/srv/ceph/journals/osd$id/journal.
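
To illustrate both options (device names and paths are only examples):

# journal on raw SSD partitions, one partition per OSD on the same host:
[osd.0]
        osd journal = /dev/sdb1
[osd.1]
        osd journal = /dev/sdb2

# or journal as a plain file on a filesystem created on the SSD:
[osd]
        osd journal = /srv/ceph/journals/osd$id/journal
        osd journal size = 1000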

Cheers.
--
Regards,
Sébastien Han.


On Thu, Feb 14, 2013 at 2:52 PM, Joao Eduardo Luis
joao.l...@inktank.com wrote:
 Including ceph-users, as it feels like this belongs there :-)



 On 02/14/2013 01:47 PM, Wido den Hollander wrote:

 On 02/14/2013 11:24 AM, charles L wrote:


 Please can someone help me with the ceph.conf for 0.56.2? I have two
 servers for storage, each with 3 TB hard drives and two SSDs.  I want
 to put the OSD data on the hard drives and the OSD journal on an SSD.

 I want to know how the osd journal configuration is pointed at the SSD.
 My SSD is /dev/sdb.

 I have tried the osd data configuration devs = /dev/sda and it worked
 just fine.

 Is this line correct: osd journal = /dev/osd$id/journal? Or should it
 be osd journal = /dev/sdb?


 In the osd specific (osd.0 and osd.1) sections you override the journal
 settings made in the [osd] section.

 They are not needed, since you give the whole block device (/dev/sdb) to
 the OSD as a journal.

 Are you sure /dev/sda is available for the OSD and it's not your boot
 device?

 Wido

 [global]

  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  debug ms = 1

 [osd]
  osd journal size = 1000
  osd journal = /dev/osd$id/journal
  filestore xattr use omap = true
  osd mkfs type = xfs
  osd mkfs options xfs = -f
  osd mount options xfs = rw,noatime,

 [osd.0]
  host = server04
  devs = /dev/sda
  osd journal = /dev/sdb
 [osd.1]
  host = server05
  devs = /dev/sda
  osd journal = /dev/sdb

 THANKS.






radosgw: Update a key's meta data

2013-02-14 Thread Sylvain Munaut
Hi,

I was wondering how I could update a key's metadata like the Content-Type.

The solution on S3 seems to be to copy the key onto itself, replacing the
metadata. If I do that with Ceph, will it work? And more importantly,
will it be done intelligently (i.e. without copying the actual object
data around)?

I tried reading the code, but although parts of it seem to hint
at support for this (in rgw_rest_s3.cc), other parts don't seem to
check at all whether src == dst (like rgw_op.cc).
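
For reference, the S3-level request I have in mind is a copy-onto-itself with
replaced metadata, roughly like this (bucket, key and Content-Type are just
placeholders; auth headers omitted):

PUT /mybucket/mykey HTTP/1.1
Host: rgw.example.com
x-amz-copy-source: /mybucket/mykey
x-amz-metadata-directive: REPLACE
Content-Type: image/png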

Cheers,

 Sylvain Munaut


osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)

2013-02-14 Thread Martin Mailand
Hi List,

I can reproduce this assertion; how can I help debug it?


-martin

(Lese Datenbank ... 52246 Dateien und Verzeichnisse sind derzeit
installiert.)
Vorbereitung zum Ersetzen von linux-firmware 1.79 (durch
.../linux-firmware_1.79.1_all.deb) ...
Ersatz für linux-firmware wird entpackt ...
osdc/ObjectCacher.cc: In function 'void
ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
tid_t, int)' thread 7f72b7fff700 time 2013-02-14 16:04:48.867285
osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
 ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
 1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long,
unsigned long, int)+0xd68) [0x7f72d4050848]
 2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f72d405742b]
 3: (Context::complete(int)+0xa) [0x7f72d400f9ba]
 4: (librbd::C_Request::finish(int)+0x85) [0x7f72d403f145]
 5: (Context::complete(int)+0xa) [0x7f72d400f9ba]
 6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f72d40241b7]
 7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f72d33db16d]
 8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f72d3444e50]
 9: (()+0x7e9a) [0x7f72d03c7e9a]
 10: (clone()+0x6d) [0x7f72d00f4cbd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted


Re: osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)

2013-02-14 Thread Sage Weil
Hi Martin-

On Thu, 14 Feb 2013, Martin Mailand wrote:
 Hi List,
 
 I can reproduce this assertion; how can I help debug it?

Can you describe the workload?  Are the OSDs also running 0.56.2(+)?  Any 
other activity on the server side (data migration, OSD failure, etc.) that 
may have contributed?

We just reopened http://tracker.ceph.com/issues/2947 to track this.  I'm 
working on reproducing it now as well.

Thanks!
sage



 
 
 -martin
 
 (Lese Datenbank ... 52246 Dateien und Verzeichnisse sind derzeit
 installiert.)
 Vorbereitung zum Ersetzen von linux-firmware 1.79 (durch
 .../linux-firmware_1.79.1_all.deb) ...
 Ersatz für linux-firmware wird entpackt ...
 osdc/ObjectCacher.cc: In function 'void
 ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
 tid_t, int)' thread 7f72b7fff700 time 2013-02-14 16:04:48.867285
 osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
  ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
  1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long,
 unsigned long, int)+0xd68) [0x7f72d4050848]
  2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f72d405742b]
  3: (Context::complete(int)+0xa) [0x7f72d400f9ba]
  4: (librbd::C_Request::finish(int)+0x85) [0x7f72d403f145]
  5: (Context::complete(int)+0xa) [0x7f72d400f9ba]
  6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f72d40241b7]
  7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f72d33db16d]
  8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f72d3444e50]
  9: (()+0x7e9a) [0x7f72d03c7e9a]
  10: (clone()+0x6d) [0x7f72d00f4cbd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.
 terminate called after throwing an instance of 'ceph::FailedAssertion'
 Aborted
 
 


Re: Simple doc update pull request

2013-02-14 Thread Sage Weil
Merged. Thanks, Travis!

On Thu, 14 Feb 2013, Travis Rhoden wrote:

 Hey folks,
 
 I submitted a pull-request for some simple doc updates related to
 cephx and creating new keys/clients.  Please take a look when
 possible.
 
 https://github.com/ceph/ceph/pull/56
 
  - Travis
 
 


Re: Questions on some minor issues when upgrading from 0.48 to 0.56

2013-02-14 Thread Daniel Hoang


Thanks Wido for the clarifications. 

I guess this means that I can update the OSD cluster to 0.56, and a client that 
was compiled with the old librados2 0.48 should still be able to access the 
cluster without any change. A client compiled against the new librados2 0.56 
(API level 0.48) has to fix that rados_pool_list call before it can access the 
cluster.

DanielH


- Original Message -
From: Wido den Hollander w...@42on.com
To: Daniel Hoang daniel_m_ho...@yahoo.com
Cc: ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
Sent: Wednesday, February 13, 2013 11:18 PM
Subject: Re: Questions on some minor issues when upgrading from 0.48 to 0.56

Hi,

On 02/13/2013 08:26 PM, Daniel Hoang wrote:


 Hi All,

 Just in case these issues have not been reported yet: I am on ubuntu 12.04, 
 upgraded librados2/librados-dev from 0.48 to 0.56, and I noticed the following 
 issues:

 1. librados2 / librados-dev still reports minor version as 48

 Should the minor version have changed to 56?

No. From what I understand the librados version is only bumped when the 
API actually changes.

This indicates that the API is still the same as 0.48


 2. In 0.48, rados_pool_list(cluster, NULL, 0) could be used like a buffer-size 
 query: it would return the buffer size required for the pool list buffer 
 string. In 0.56, this call now returns error -22 instead, and I have to pass 
 in a tmp_buf[32] and len = 32 in order for the call to return successfully. I 
 checked the current rados_pool_list API, and it does not mention that buffer 
 and len must not be NULL, 0.

 Maybe this was a bug in 0.48, and we should not have passed in NULL, 0?


Take a look at this commit: 
https://github.com/ceph/ceph/commit/a677f47926b9640c53fbd00c94d6eb7a590a94fc

I ran into this with phprados as well: 
https://github.com/ceph/phprados/commit/ee8b87fe93f87f92a7c3fa197a33b3d2de2fc4b6

Wido

 Thanks for your help,
 DanielH




-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on



Re: osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)

2013-02-14 Thread Martin Mailand
Hi Sage,

everything is on 0.56.2 and the cluster is healthy.
I can reproduce it with an apt-get upgrade within the VM; the VM OS is
12.04. Most of the time the assertion happens when the firmware .deb is
updated. See the log in my first email.
But I use a custom-built qemu version (1.4-rc1), which was built against
0.56.2.


root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=192.168.195.33:6789/0}, election epoch 1,
quorum 0 a
   osdmap e160: 20 osds: 20 up, 20 in
pgmap v28314: 3264 pgs: 3264 active+clean; 437 GB data, 1027 GB
used, 144 TB / 145 TB avail
   mdsmap e1: 0/0/1 up

root@store1:~# ceph --version
ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)


root@compute4:~# dpkg -l|grep 'rbd\|rados\|qemu'
ii  librados20.56.2-1precise
RADOS distributed object store client library
ii  librbd1  0.56.2-1precise
RADOS block device client library
ii  qemu-common  1.4.0-rc1-vdsp1.0
qemu common functionality (bios, documentation, etc)
ii  qemu-kvm 1.4.0-rc1-vdsp1.0
Full virtualization on i386 and amd64 hardware
ii  qemu-utils   1.4.0-rc1-vdsp1.0
qemu utilities


-martin

On 14.02.2013 18:18, Sage Weil wrote:
 Hi Martin-
 
 On Thu, 14 Feb 2013, Martin Mailand wrote:
 Hi List,

  I can reproduce this assertion; how can I help debug it?
 
 Can you describe the workload?  Are the OSDs also running 0.56.2(+)?  Any 
 other activity on the server side (data migration, OSD failure, etc.) that 
 may have contributed?
 
 We just reopened http://tracker.ceph.com/issues/2947 to track this.  I'm 
 working on reproducing it now as well.
 
 Thanks!
 sage
 
 
 


 -martin

 (Lese Datenbank ... 52246 Dateien und Verzeichnisse sind derzeit
 installiert.)
 Vorbereitung zum Ersetzen von linux-firmware 1.79 (durch
 .../linux-firmware_1.79.1_all.deb) ...
  Ersatz für linux-firmware wird entpackt ...
 osdc/ObjectCacher.cc: In function 'void
 ObjectCacher::bh_write_commit(int64_t, sobject_t, loff_t, uint64_t,
 tid_t, int)' thread 7f72b7fff700 time 2013-02-14 16:04:48.867285
  osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid)
  ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
  1: (ObjectCacher::bh_write_commit(long, sobject_t, long, unsigned long,
 unsigned long, int)+0xd68) [0x7f72d4050848]
  2: (ObjectCacher::C_WriteCommit::finish(int)+0x6b) [0x7f72d405742b]
  3: (Context::complete(int)+0xa) [0x7f72d400f9ba]
  4: (librbd::C_Request::finish(int)+0x85) [0x7f72d403f145]
  5: (Context::complete(int)+0xa) [0x7f72d400f9ba]
  6: (librbd::rados_req_cb(void*, void*)+0x47) [0x7f72d40241b7]
  7: (librados::C_AioSafe::finish(int)+0x1d) [0x7f72d33db16d]
  8: (Finisher::finisher_thread_entry()+0x1c0) [0x7f72d3444e50]
  9: (()+0x7e9a) [0x7f72d03c7e9a]
  10: (clone()+0x6d) [0x7f72d00f4cbd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.
 terminate called after throwing an instance of 'ceph::FailedAssertion'
 Aborted




Re: [ceph] Fix more performance issues found by cppcheck (#51)

2013-02-14 Thread Gregory Farnum
Hey Danny,
I've merged in most of these (commit
ffda2eab4695af79abdc9ed9bf001c3cd662a1f2) but had comments on a
couple:
d99764e8c72a24eaba0542944f497cc2d9e154b4 is a patch on gtest. We did
import that wholesale into our repository as that's what they
recommend, but I'd prefer to get patches by re-importing rather than
by applying them to our tree. That patch should go upstream. :)
3fc14b3470748578840ed9374db53e9ef9926382 and
7ca6e5d8875d06aa61ce35b727ce7ee219838c69 are patches to remove the
useless definition of a declared variable in cases like:

bool success = false;
...
 // nothing that reads success
...
success = function();

where the proposed fix is just doing:
bool success;
...
 // nothing that reads success
...
success = function();

However, we'd prefer for defensive programming reasons that variables
be defined on declaration whenever possible. For these patches it's
appropriate enough to just move the declaration to the first real
definition (and I've done so), but for cases where that's not
acceptable we'd prefer to take whatever performance hit there is in
order to not have random garbage in the stack. ;)


I did opt to leave the wireshark patch alone, because...C. But if you
or Yehuda want to take another pass through that, I notice that it's
still somewhat inconsistent about variable assignments on
initialization.
-Greg


On Wed, Feb 13, 2013 at 12:47 PM, Danny Al-Gaaf
notificati...@github.com wrote:
 Here some more patches to fix performance issues found by cppcheck. This
 should now cover - together with wip-da-sca-cppcheck-performance - the
 following issues:

 use empty() instead of size() to check for emptiness
 don't pass string::c_str() to string arguments
 prevent useless value assignment
 pass some objects by reference instead of by-value

 

 You can merge this Pull Request by running

   git pull https://github.com/dalgaaf/ceph wip-da-sca-cppcheck-performance-2

 Or view, comment on, or merge it at:

   https://github.com/ceph/ceph/pull/51

 Commit Summary

 CephxProtocol.h: pass CryptoKey by reference to decode_decrypt()
 CInode.h: use !old_inodes.empty() instead of size()
 AuthMonitor.cc: use !pending_auth.empty() instead of 'size() > 0'
 OSDMonitor.h: use !reporters.empty() instead of size()
 MonCaps.cc: use !empty() instead of size()
 Monitor.cc: use empty() instead of size()
 OSDMonitor.cc: use !empty() instead of size()
 PGMonitor.cc: use !empty() instead of size() to check for emptiness
 monmaptool.cc: use empty() instead of size() to check for emptiness
 DBObjectMap.cc: use empty() instead of size() to check for emptiness
 FileStore.cc: use empty() instead of size() to check for emptiness
 HashIndex.cc: use empty() instead of size() to check for emptiness
 LFNIndex.cc: use !holes.empty() instead of 'size() > 0'
 OSD.cc: use empty() instead of size() to check for emptiness
 PG.cc: use empty() instead of size() to check for emptiness
 ReplicatedPG.cc: use empty() instead of size() to check for emptiness
 ObjectCacher.cc: use empty() instead of !size() to check for emptiness
 Objecter.cc: use !empty() instead of size() to check for emptiness
 Objecter.cc: prevent useless value assignment
 osdmaptool.cc: use empty() instead of 'size() < 1'
 rados.cc: use omap.empty() instead of size() to check for emptiness
 rbd.cc: use empty() instead of size() to check for emptiness
 rgw/rgw_admin.cc: prevent useless value assignment
 rgw/rgw_admin.cc: use empty() instead of size() to check for emptiness
 rgw/rgw_gc.cc: use !empty() instead of size() to check for emptiness
 rgw/rgw_log.cc: don't pass c_str() result to std::string argument
 cls/rbd/cls_rbd.cc: use !empty() instead of 'size() > 0'
 cls_refcount.cc: use empty() instead of !size() to check for emptiness
 common/WorkQueue.cc: use !empty() instead of size() to check for emptiness
 obj_bencher.cc: use empty() instead of 'size() == 0' to check for emptiness
 crush/CrushWrapper.cc: don't pass c_str() result to std::string argument
 crushtool.cc: use !empty() instead of 'size() > 0' to check for emptiness
 use empty() instead of 'size() == 0' to check for emptiness
 cls_kvs.cc: use !empty() instead of 'size() > 0' to check for emptiness
 kv_flat_btree_async.cc: use empty() instead of size() to check for emptiness
 librbd/internal.cc: use !empty() instead of size()
 mds/CDir.cc: use !empty() instead of size()
 mds/CInode.cc: use !empty() instead of size()
 mds/Locker.cc: use !empty() instead of size()
 mds/MDCache.cc: use empty() instead of size() to check for emptiness
 mds/MDS.cc: use !empty() instead of size() to check for emptiness
 mds/MDSMap.cc: use !empty() instead of size() to check for emptiness
 mds/SnapServer.cc: use !empty() instead of size() to check for emptiness
 mds/journal.cc: use !empty() instead of size() to check for emptiness
 rgw/rgw_main.cc: use empty() instead of size() to check for emptiness
 rgw/rgw_op.cc: use empty() instead of size() to check for emptiness
 rgw/rgw_rados.cc: : 

Re: [ceph-commit] [ceph/ceph] e330b7: mon: create fail_mds_gid() helper; make 'ceph mds ...

2013-02-14 Thread Gregory Farnum
On Thu, Feb 14, 2013 at 11:39 AM, GitHub nore...@github.com wrote:
   Branch: refs/heads/master
   Home:   https://github.com/ceph/ceph
   Commit: e330b7ec54f89ca799ada376d5615e3c1dfc54f0
   
 https://github.com/ceph/ceph/commit/e330b7ec54f89ca799ada376d5615e3c1dfc54f0
   Author: Sage Weil s...@inktank.com
   Date:   2013-01-17 (Thu, 17 Jan 2013)

   Changed paths:
 M src/mon/MDSMonitor.cc
 M src/mon/MDSMonitor.h

   Log Message:
   ---
   mon: create fail_mds_gid() helper; make 'ceph mds rm ...' more generic

 Take a gid or a rank or a name.  Use a nicer helper.

 Signed-off-by: Sage Weil s...@inktank.com


   Commit: 2e11297750a1b683c41f58c3fae05321fc49
   
 https://github.com/ceph/ceph/commit/2e11297750a1b683c41f58c3fae05321fc49
   Author: Sage Weil s...@inktank.com
   Date:   2013-01-17 (Thu, 17 Jan 2013)

   Changed paths:
 M src/common/config_opts.h
 M src/mds/MDSMap.h
 M src/mon/MDSMonitor.cc

   Log Message:
   ---
   mon: enforce unique name in mdsmap

 Add 'mds enforce unique name' option, defaulting to true.

 If set, when an MDS boots, it will kick any previous mds with the same
 name from the mdsmap.  This is possibly less confusing for users.  If
 an mds daemon restarts, it will immediately replace its previous
 instantiation.

 Two misconfigured daemons running with the same name will fight over
 the same role.

 Fixes: #3857
 Signed-off-by: Sage Weil s...@inktank.com


   Commit: dd7caf5f411696f8e7dc108270a8e85a34f3e80c
   
 https://github.com/ceph/ceph/commit/dd7caf5f411696f8e7dc108270a8e85a34f3e80c
   Author: Sage Weil s...@inktank.com
   Date:   2013-01-17 (Thu, 17 Jan 2013)

   Changed paths:
 M src/mds/MDS.cc

   Log Message:
   ---
   mds: gracefully exit if newer gid replaces us by name

 If 'mds enforce unique name' is set, and another MDS with the same name
 kicks us out of the MDSMap, gracefully exit instead of respawning and
 fighting over our position.

 Signed-off-by: Sage Weil s...@inktank.com


   Commit: 6f28faf9e6613bff403bcd958818d8dccd004f9d
   
 https://github.com/ceph/ceph/commit/6f28faf9e6613bff403bcd958818d8dccd004f9d
   Author: Sage Weil s...@inktank.com
   Date:   2013-01-17 (Thu, 17 Jan 2013)

   Changed paths:
 M src/mds/MDCache.cc

   Log Message:
   ---
   mds: open mydir after replay

 In certain cases, we may replay the journal and not end up with the
 dirfrag for mydir open.  This is fine--we just need to open it up and
 fetch it below.

 Signed-off-by: Sage Weil s...@inktank.com

In the tests I ran last night on this branch I saw some Valgrind
warnings in the OSDs and Monitors, but I couldn't figure out any way
for this series to have caused them so I assume they're latent and pop
up occasionally in master? In any case, please keep an eye out in case
I was wrong. :)
-Greg


Re: [ceph-commit] [ceph/ceph] e330b7: mon: create fail_mds_gid() helper; make 'ceph mds ...

2013-02-14 Thread Sage Weil
On Thu, 14 Feb 2013, Gregory Farnum wrote:
 In the tests I ran last night on this branch I saw some Valgrind
 warnings in the OSDs and Monitors, but I couldn't figure out any way
 for this series to have caused them so I assume they're latent and pop
 up occasionally in master? In any case, please keep an eye out in case
 I was wrong. :)

That is probably a sneaky MMonPaxos leak in the mon.  And the OSD isn't 
yet valgrind leak check clean.

sage


Further thoughts on fsck for CephFS

2013-02-14 Thread Gregory Farnum
Sage sent out an early draft of what we were thinking about doing for
fsck on CephFS at the beginning of the week, but it was a bit
incomplete and still very much a work in progress. I spent a good
chunk of today thinking about it more so that we can start planning
ticket-level chunks of work. The following is similar to where Sage's
email ended up, but incorporates a bit more thought about memory
scaling and is hopefully a bit more organized. :)

First, we are breaking up development and running of fsck into two
distinct phases. The first phase will consist of a forward scrub,
which simply starts with the root directory inode and follows links
forward to check that it can find everything that's linked, and that
the forward- and backward-links are consistent. (Backward links are
under development right now; see http://tracker.ceph.com/issues/3540,
or the CephFS backlog at
http://tracker.ceph.com/rb/master_backlogs/cephfs, which is only
groomed for the first several items on the list but might be of
interest.) The intention for this phase is that it can be used both as
part of a requested full-system fsck, and separately can be used to do
background scrubbing during normal operation.
I've tried to think through this forward scrub phase enough to do real
development planning over the next couple of days, and have included
my description below. Please comment if you see issues or have
questions.

The second phase we're referring to as the backward scan. This mode
is currently intended to be used as part of the fsck you would run
after somehow losing data in RADOS, and is exclusively an offline
operation — no client access to the data is permitted, etc and it
involves scanning through every object in the CephFS metadata and data
storage pools. We haven't thought this one through in quite as much
detail, but I wanted to figure out a mechanism (that scales to large
directories and hierarchies) enough to see how it might impact the
design of our forward scrub. I've got the details I came up with
below, but this is a much more complicated problem and not one we need
to start work on right way so it doesn't go into nearly as much depth.
Again though, please comment if you see any issues, have questions, or
think there's something in the backward scan that impacts the forward
scrub in a way I haven't accounted for!
Thanks,
Greg


MDS Forward Scrub

We maintain a stack of inodes to scrub. When a new scrub is requested,
the inode in question goes into this stack at a position depending on
how it's inserted.

We have a separate scrubbing thread in every MDS. This thread begins
in the scrub_node(inode) function, passing in the inode on the top of
the scrub stack.
scrub_node() starts by setting a new scrub_start_stamp and
scrub_start_version on the inode (where the scrub_start_version is the
version of the *parent* of the inode). If the node is a file:
the thread optionally spins off an async check of the backtrace (and
in the future, optionally checks other metadata we might be able to
add or pick up), then sleeps until finish_scrub(inode) is called. (If
it doesn't do the backtrace check, it calls finish_scrub() directly).
If the node is a dirfrag:
put the dirfrag's first child on the top of the stack, and call
scrub_node(child). Note that this might involve reading the dirfrag
off disk, etc.

finish_scrub(inode) is pretty simple. If the inode is a dirfrag:
It verifies that the parent's data matches the aggregate data of the
children, then does the same stuff as to a file:
1) sets last_scrubbed_stamp to scrub_start_stamp, and
last_scrubbed_version to scrub_start_version.
2) Pops the inode off of the scrub queue, and checks if the next thing
up is the inode's parent.
3) If so, calls scrub_node() on the dentry following this one in the
parent dirfrag.
3b) if there are no remaining nodes in the parent dirfrag, it checks
that all the children were scrubbed following the parent's
scrub_start_version (or modified — we don't want to scrub hierarchies
that were renamed into the tree following a scrub start), then calls
finish_scrub() on the dirfrag.

If at any point the scrub thread finishes scrubbing a node which does
not start up another one immediately (implying that another scrub got
injected into the middle of one that was already running), it looks at
the node in question. If it's a file, it calls scrub_node() on it. If
it's a dirfrag, it finds the first dentry in the dirfrag with a
last_scrubbed_version less than the dirfrag's last_scrubbed_version,
puts that dentry on the scrub_stack, and calls scrub_node() on that
dentry.

This is simple enough in concept (although functionally it will need
to be broken up quite a bit more in order to do all the locking in a
reasonably efficient fashion). To expand this to a multi-MDS system,
modify it slightly according to the following rules:
1) Only the authoritative 

Re: slow requests, hunting for new mon

2013-02-14 Thread Chris Dunlop
On 2013-02-12, Chris Dunlop ch...@onthe.net.au wrote:
 Hi,

 What are likely causes for slow requests and monclient: hunting for new
 mon messages? E.g.:

 2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
 ...
 2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
 included below; oldest blocked for > 30.383883 secs
 2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
 seconds old, received at 2013-02-12 16:27:15.508374: 
 osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
 921600~4096] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
 seconds old, received at 2013-02-12 16:27:15.508475: 
 osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
 987136~4096] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
 seconds old, received at 2013-02-12 16:27:15.508537: 
 osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
 2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
 seconds old, received at 2013-02-12 16:27:15.508573: 
 osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
 2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
 seconds old, received at 2013-02-12 16:27:15.563271: 
 osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

OK, for the sake of anyone who might come across this thread when
searching for similar issues...

http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd

...unfortunately the error message in the link above says old request
rather than slow request (old code?), so that page doesn't come up
when googling for the slow request message. The page needs
updating.

The underlying problem in our case seems to have been spikes in the
number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst
the disks were coping with steady state load, occasionally something (in
this case, activity in a vm running on rbd) would cause a spike in
activity and the disks couldn't cope. I'd initially looked at the amount
of data going to the disks and thought it was well within the disks'
capabilities; however, I hadn't considered the IOPS.
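
(For anyone wanting to check the same thing, something as simple as

# iostat -x 5

on the OSD hosts, watching the r/s, w/s and %util columns for the data and
journal disks, shows the per-disk IOPS.)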

The (partial?) solution was to move the journals onto a separate device,
halving the IOPS going to the data disk (write journal, write data) as
well as avoiding having the heads slamming back and forth between the
data and journal. We're continuing to watch the IOPS and will add more
OSDs to spread the load further if necessary.

I still don't know what the hunting messages actually indicate, but
they've also disappeared since fixing the slow request messages.

Incidentally, it strikes me that there is a significant amount of write
amplification going on when running vms with a file system such as xfs
or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs
(with journal). I.e. a single write from a vm can turn into up to 8
separate writes by the time it hits the underlying xfs filesystem. I
think this is why our ceph setup is struggling at far less load on the
same hardware compared to the drbd setup we're wanting to replace.
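
As a back-of-the-envelope sketch of where that factor of 8 can come from
(assuming each layer journals every write once before committing it):

  1 write in the guest
  x 2 for the guest filesystem (its journal + the data)
  x 2 for the OSD (ceph journal + filestore write)
  x 2 for the backing xfs (its journal + the data)
  = up to 8 writes reaching the platters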

Cheers!

Chris.



Re: slow requests, hunting for new mon

2013-02-14 Thread Sage Weil
On Fri, 15 Feb 2013, Chris Dunlop wrote:
 On 2013-02-12, Chris Dunlop ch...@onthe.net.au wrote:
  Hi,
 
  What are likely causes for slow requests and monclient: hunting for new
  mon messages? E.g.:
 
  2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
  ...
  2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
  included below; oldest blocked for > 30.383883 secs
  2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 
  30.383883 seconds old, received at 2013-02-12 16:27:15.508374: 
  osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 
  921600~4096] 2.981cf6bc) v4 currently no flag points reached
  2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 
  30.383782 seconds old, received at 2013-02-12 16:27:15.508475: 
  osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 
  987136~4096] 2.981cf6bc) v4 currently no flag points reached
  2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 
  30.383720 seconds old, received at 2013-02-12 16:27:15.508537: 
  osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 
  1036288~8192] 2.981cf6bc) v4 currently no flag points reached
  2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 
  30.383684 seconds old, received at 2013-02-12 16:27:15.508573: 
  osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 
  1454080~4096] 2.fff29a9a) v4 currently no flag points reached
  2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 
  30.328986 seconds old, received at 2013-02-12 16:27:15.563271: 
  osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 
  1482752~4096] 2.fff29a9a) v4 currently no flag points reached
 
 OK, for the sake of anyone who might come across this thread when
 searching for similar issues...
 
 http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#slow-or-unresponsive-osd
 
 ...unfortunately the error message in the link above says old request
 rather than slow request (old code?), so that page doesn't come up
 when googling for the slow request message. The page needs
 updating.

Updated, thanks!

 The underlying problem in our case seems to have been spikes in the
 number of IOPS going to the disks (e.g. watch 'iostat -x' output). Whilst
 the disks were coping with steady state load, occasionally something (in
 this case, activity in a vm running on rbd) would cause a spike in
 activity and the disks couldn't cope. I'd initially looked at the amount
 of data going to the disks and thought it was well with the disks'
 capabilities, however I hadn't considered the IOPS.
 
 The (partial?) solution was to move the journals onto a separate device,
 halving the IOPS going to the data disk (write journal, write data) as
 well as avoiding having the heads slamming back and forth between the
 data and journal. We're continuing to watch the IOPS and will add more
 OSDs to spread the load further if necessary.
 
 I still don't know what the hunting messages actually indicate, but
 they've also disappeared since fixing the slow request messages.

This usually means the monitor wasn't responding and we (the OSD or client) 
are trying to reconnect (to a random monitor).

 Incidentally, it strikes me that there is a significant amount of write
 amplification going on when running vms with a file system such as xfs
 or ext4 (with journal) on top of rbd/rados (with journal) on top of xfs
 (with journal). I.e. a single write from a vm can turn into up to 8
 separate writes by the time it hits the underlying xfs filesystem. I
 think this is why our ceph setup is struggling at far less load on the
 same hardware compared to the drbd setup we're wanting to replace.

Currently, yes.  There is always going to be some additional overhead 
because the object data is stored in a file system.  We were/are doing 
several other non-optimal things too; however, that is being improved in 
the current master branch (moving some metadata into leveldb, which does a 
better job of managing the IO pattern).  Stay tuned!

sage



Re: [ceph-users] snapshot, clone and mount a VM-Image

2013-02-14 Thread Josh Durgin

On 02/14/2013 12:53 PM, Sage Weil wrote:

Hi Jens-

On Thu, 14 Feb 2013, Jens Kristian Søgaard wrote:

Hi Sage,


block device level.  We plan to implement an incremental backup function for
the relative change between two snapshots (or a snapshot and the head).
It's O(n) in the size of the device rather than the number of files, but should be more
efficient for all but the most sparse of images.  The implementation should
be simple; the challenge is mostly around the incremental file format,
probably.
That doesn't help you now, but would be a relatively self-contained piece of
functionality for someone to contribute to RBD.  This isn't a top


I'm very interested in having an incremental backup tool for Ceph, so if it
is possible for me to do, I would like to take a shot at implementing it. It
will be a spare time project, so I cannot say how fast it will progress
though.

If you have any details on how you would like to see the implementation work,
please let me know!


Great to hear you're interested in this!  There is a feature in the
tracker open:

http://tracker.ceph.com/issues/4084

(Not that there is much information there yet!)

I think this breaks down into a few different pieces:

1) Decide what output format to use.  We want to use something that
resembles a portable, standard way of representing an incremental set of
changes to a block device (or large file).  I'm not sure what is out
there, but we should look carefully before making up our own format.

2) Expose changed objects between rados snapshots.  This is some generic
functionality we would bake into librbd that would probably work similarly
to how read_iterate() currently does (you specify a callback).  We
probably also want to provide this information directly to a user, so that
they can get a dump of (offsets, length) pairs for integration with their
own tool.  I expect this is just a core librbd method.


It'd be nice to implement it with more than one request in flight at a time
(unlike read_iterate()'s current implementation). The interface could still
be the same, though.


3) Write a dumper based on #2 that outputs in format from #1.  The
callback would (instead of printing file offsets) write the data to the
output stream with appropriate metadata indicating which part of the image
it is.  Ideally the output part would be modular, too, so that we can come
back later and implement support for new formats easily.  The output data
stream should be able to be directed at stdout or a file.

4) Write an importer for #1.  It would take as input an existing image,
assumed to be in the state of the reference snapshot, and write all the
changed bits.  Take input from stdin or a file.


I think it'd be good to have some kind of safety check here by default. 
Storing a checksum of the original snapshot with the backup and
comparing it to the image being restored onto would work, but would be
pretty slow. Any ideas for better ways to do this?


5) If necessary, extend the above so that image resize events are properly
handled.


Couldn't this be handled by storing the size of the original snapshot
in the diff, and resizing to the size of the diff when restoring? Is
there another issue you're thinking of?


Probably the trickiest bit here is #2, as it will probably involve adding
some low-level rados operations to efficiently query the snapshot state
from the client.  With this (and any of the rest), we can help figure out
how to integrate it cleanly.  My suggestion is to start with #1, though
(and make sure the rest of this all makes sense to everyone).

Thanks!
sage





Mon losing touch with OSDs

2013-02-14 Thread Chris Dunlop
G'day,

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
mons to lose touch with the osds?

I imagine a network glitch could cause it, but I can't see any issues in any
other system logs on any of the machines on the network.

Having (mostly?) resolved my previous slow requests issue
(http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
5 seconds later:

ceph-mon.b2.log:
2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 pgs: 
576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 pgs: 
576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or 
pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago.  
marking down
2013-02-14 20:26:20.780244 7fa48d4f8700  1 mon.b2@0(leader).osd e769 e769: 2 
osds: 1 up, 2 in
2013-02-14 20:26:20.837123 7fa48d4f8700  0 log [INF] : osdmap e769: 2 osds: 1 
up, 2 in
2013-02-14 20:26:20.947523 7fa48d4f8700  0 log [INF] : pgmap v2822098: 576 pgs: 
304 active+clean, 272 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 
3724 GB avail
2013-02-14 20:26:25.709341 7fa48dcf9700 -1 mon.b2@0(leader).osd e769 no osd or 
pg stats from osd.1 since 2013-02-14 20:11:21.523741, 904.185596 seconds ago.  
marking down
2013-02-14 20:26:25.822773 7fa48d4f8700  1 mon.b2@0(leader).osd e770 e770: 2 
osds: 0 up, 2 in
2013-02-14 20:26:25.863493 7fa48d4f8700  0 log [INF] : osdmap e770: 2 osds: 0 
up, 2 in
2013-02-14 20:26:25.954799 7fa48d4f8700  0 log [INF] : pgmap v2822099: 576 pgs: 
576 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:31:30.772360 7fa48dcf9700  0 log [INF] : osd.1 out (down for 
304.933403)
2013-02-14 20:31:30.893521 7fa48d4f8700  1 mon.b2@0(leader).osd e771 e771: 2 
osds: 0 up, 1 in
2013-02-14 20:31:30.933439 7fa48d4f8700  0 log [INF] : osdmap e771: 2 osds: 0 
up, 1 in
2013-02-14 20:31:31.055408 7fa48d4f8700  0 log [INF] : pgmap v2822100: 576 pgs: 
576 stale+active+clean; 407 GB data, 417 GB used, 1444 GB / 1862 GB avail
2013-02-14 20:35:05.831221 7fa48dcf9700  0 log [INF] : osd.0 out (down for 
525.033581)
2013-02-14 20:35:05.989724 7fa48d4f8700  1 mon.b2@0(leader).osd e772 e772: 2 
osds: 0 up, 0 in
2013-02-14 20:35:06.031409 7fa48d4f8700  0 log [INF] : osdmap e772: 2 osds: 0 
up, 0 in
2013-02-14 20:35:06.129046 7fa48d4f8700  0 log [INF] : pgmap v2822101: 576 pgs: 
576 stale+active+clean; 407 GB data, 0 KB used, 0 KB / 0 KB avail

The other 2 mons both have messages like this in their logs, starting at around 
20:12:

2013-02-14 20:12:26.534977 7f2092b86700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.133:6800/6466 pipe(0xade76500 sd=22 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:24.741092 7f2092d88700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.132:6800/2456 pipe(0x9f8b7180 sd=28 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:56.551908 7f2090560700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.133:6800/6466 pipe(0x9f8b6000 sd=41 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:14:24.752356 7f209035e700  0 -- 10.200.63.133:6789/0 >> 
10.200.63.132:6800/2456 pipe(0x9f8b6500 sd=42 :6789 s=0 pgs=0 cs=0 l=1).accept 
replacing existing (lossy) channel (new one lossy=1)

(10.200.63.132 is mon.b4/osd.0, 10.200.63.133 is mon.b5/osd.1)

...although Greg Farnum indicates these messages are normal:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/5989/focus=5993

Osd.0 doesn't show any signs of distress at all:

ceph-osd.0.log:
2013-02-14 20:00:10.280601 7ffceb012700  0 log [INF] : 2.7e scrub ok
2013-02-14 20:14:19.923490 7ffceb012700  0 log [INF] : 2.5b scrub ok
2013-02-14 20:14:50.571980 7ffceb012700  0 log [INF] : 2.7b scrub ok
2013-02-14 20:17:48.475129 7ffceb012700  0 log [INF] : 2.7d scrub ok
2013-02-14 20:28:22.601594 7ffceb012700  0 log [INF] : 2.91 scrub ok
2013-02-14 20:28:32.839278 7ffceb012700  0 log [INF] : 2.92 scrub ok
2013-02-14 20:28:46.992226 7ffceb012700  0 log [INF] : 2.93 scrub ok
2013-02-14 20:29:12.330668 7ffceb012700  0 log [INF] : 2.95 scrub ok

...although osd.1 started seeing problems around this time:

ceph-osd.1.log:
2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.750063 secs
2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 
seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 
xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 
seconds old, 

Re: Mon losing touch with OSDs

2013-02-14 Thread Sage Weil
Hi Chris,

On Fri, 15 Feb 2013, Chris Dunlop wrote:
 G'day,
 
 In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
 mons to lose touch with the osds?
 
 I imagine a network glitch could cause it, but I can't see any issues in any
 other system logs on any of the machines on the network.
 
 Having (mostly?) resolved my previous slow requests issue
 (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
 5 seconds later:
 
 ceph-mon.b2.log:
 2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 
 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
 2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 
 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
 2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd 
 or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds 
 ago.  marking down

There is a safety check that if the osd doesn't check in for a long period 
of time we assume it is dead.  But it seems as though that shouldn't 
happen, since osd.0 has some PGs assigned and is scrubbing away.
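
(For reference, the 900 seconds in that log line lines up with the 'mon osd 
report timeout' option, which defaults to 900 seconds.  Raising it, e.g.

[mon]
        mon osd report timeout = 1800

would only be a stopgap sketch; it shouldn't be needed on a healthy cluster.)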

Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
hopes that this happens again?  It will give us more information to go on.
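
E.g. something like this in ceph.conf on each monitor host (restart the mons, 
or inject it at runtime, to pick it up):

[mon]
        debug ms = 1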

 ...although osd.1 started seeing problems around this time:
 
 ceph-osd.1.log:
 2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
 2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 
 included below; oldest blocked for > 30.750063 secs
 2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 
 seconds old, received at 2013-02-14 20:26:20.851304: 
 osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait 
 for new map
 2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 
 seconds old, received at 2013-02-14 20:26:20.851420: 
 osd_op(client.10001.0:618473 yy.rbd [watch 1~0] 2.3854277a) v4 currently 
 wait for new map
 2013-02-14 20:26:51.601440 7fd1e6f26700  0 log [WRN] : slow request 30.749938 
 seconds old, received at 2013-02-14 20:26:20.851429: 
 osd_op(client.9998.0:39716 zz.rbd [watch 1~0] 2.71731007) v4 currently 
 wait for new map
 2013-02-14 20:26:51.601442 7fd1e6f26700  0 log [WRN] : slow request 30.749907 
 seconds old, received at 2013-02-14 20:26:20.851460: 
 osd_op(client.10007.0:59572 aa.rbd [watch 1~0] 2.320eebb8) v4 currently 
 wait for new map
 2013-02-14 20:26:51.601445 7fd1e6f26700  0 log [WRN] : slow request 30.749630 
 seconds old, received at 2013-02-14 20:26:20.851737: 
 osd_op(client.9980.0:86883 bb.rbd [watch 1~0] 2.ab9b579f) v4 currently 
 wait for new map
 
 Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened 
 in
 any of the many previous slow request instances, and the timing doesn't look
 quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but
 the osd.0 log shows no problems at all, then the mon complains about not
 having heard from osd.1 since 20:11:21, whereas the first indication of 
 trouble
 on osd.1 was the request from 20:26:20 not being processed in a timely 
 fashion.

My guess is the above was a side-effect of osd.0 being marked out.   On 
0.56.2 there is some strange peering workqueue lagginess that could 
potentially contribute as well.  I recommend moving to 0.56.3.

 Not knowing enough about how the various pieces of ceph talk to each other
 makes it difficult to distinguish cause and effect!
 
 Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
 restarting the osds ('service ceph restart osd' on each osd host).
 
 The immediate issue was resolved by restarting ceph completely on one of the
 mon/osd hosts (service ceph restart). Possibly a restart of just the mon would
 have been sufficient.

Did you notice that the osds you restarted didn't immediately mark 
themselves in?  Again, it could be explained by the peering wq issue, 
especially if there are pools in your cluster that are not getting any IO.

sage