[ceph-users] add writeback to Bluestore thanks to lvm-writecache

2019-08-13 Thread Olivier Bonvalet
Hi,

we use OSDs with data on HDD and db/wal on NVMe.
But for now, the BlueStore DB and WAL only store metadata, NOT
data. Right?

So, when we migrated from:
A) Filestore + HDD with hardware writecache + journal on SSD
to:
B) BlueStore + HDD without hardware writecache + DB/WAL on NVMe

performance on our random-write workloads dropped.

Since the default OSD setup now uses LVM, enabling LVM writecache is easy.
But is it a good idea? Have you tried it?
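
For reference, a minimal sketch of what this would look like (the VG/LV
names, the cache size and the NVMe device are examples only; it assumes
LVM >= 2.03 with dm-writecache support, and the OSD stopped first):

systemctl stop ceph-osd@123
lvcreate -n osd123-wcache -L 50G ceph-hdd /dev/nvme0n1
lvconvert --type writecache --cachevol osd123-wcache ceph-hdd/osd-block-123
systemctl start ceph-osd@123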

Thanks,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
Good point, thanks!

By creating memory pressure (playing with vm.min_free_kbytes), the memory
is freed by the kernel.

So I think I essentially need to update our monitoring rules to avoid
false positives.

Thanks, I'll keep reading your resources.
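
For reference, the knobs involved here and in Mark's reply below can be
inspected/changed like this (values are illustrative only, not
recommendations):

sysctl vm.min_free_kbytes                                  # current reclaim watermark
sysctl -w vm.min_free_kbytes=1048576                       # raise it to create memory pressure
echo 0 > /sys/kernel/mm/khugepaged/max_ptes_none           # Mark's suggestion
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # or disable THP entirely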


On Tuesday 9 April 2019 at 09:30 -0500, Mark Nelson wrote:
> My understanding is that basically the kernel is either unable or 
> uninterested (maybe due to lack of memory pressure?) in reclaiming
> the 
> memory .  It's possible you might have better behavior if you set 
> /sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or 
> maybe disable transparent huge pages entirely.
> 
> 
> Some background:
> 
> https://github.com/gperftools/gperftools/issues/1073
> 
> https://blog.nelhage.com/post/transparent-hugepages/
> 
> https://www.kernel.org/doc/Documentation/vm/transhuge.txt
> 
> 
> Mark
> 
> 
> On 4/9/19 7:31 AM, Olivier Bonvalet wrote:
> > Well, Dan seems to be right :
> > 
> > _tune_cache_size
> >  target: 4294967296
> >heap: 6514409472
> >unmapped: 2267537408
> >  mapped: 4246872064
> > old cache_size: 2845396873
> > new cache size: 2845397085
> > 
> > 
> > So we have 6GB in heap, but "only" 4GB mapped.
> > 
> > But "ceph tell osd.* heap release" should had release that ?
> > 
> > 
> > Thanks,
> > 
> > Olivier
> > 
> > 
> > Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :
> > > One of the difficulties with the osd_memory_target work is that
> > > we
> > > can't
> > > tune based on the RSS memory usage of the process. Ultimately
> > > it's up
> > > to
> > > the kernel to decide to reclaim memory and especially with
> > > transparent
> > > huge pages it's tough to judge what the kernel is going to do
> > > even
> > > if
> > > memory has been unmapped by the process.  Instead the autotuner
> > > looks
> > > at
> > > how much memory has been mapped and tries to balance the caches
> > > based
> > > on
> > > that.
> > > 
> > > 
> > > In addition to Dan's advice, you might also want to enable debug
> > > bluestore at level 5 and look for lines containing "target:" and
> > > "cache_size:".  These will tell you the current target, the
> > > mapped
> > > memory, unmapped memory, heap size, previous aggregate cache
> > > size,
> > > and
> > > new aggregate cache size.  The other line will give you a break
> > > down
> > > of
> > > how much memory was assigned to each of the bluestore caches and
> > > how
> > > much each case is using.  If there is a memory leak, the
> > > autotuner
> > > can
> > > only do so much.  At some point it will reduce the caches to fit
> > > within
> > > cache_min and leave it there.
> > > 
> > > 
> > > Mark
> > > 
> > > 
> > > On 4/8/19 5:18 AM, Dan van der Ster wrote:
> > > > Which OS are you using?
> > > > With CentOS we find that the heap is not always automatically
> > > > released. (You can check the heap freelist with `ceph tell
> > > > osd.0
> > > > heap
> > > > stats`).
> > > > As a workaround we run this hourly:
> > > > 
> > > > ceph tell mon.* heap release
> > > > ceph tell osd.* heap release
> > > > ceph tell mds.* heap release
> > > > 
> > > > -- Dan
> > > > 
> > > > On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
> > > > ceph.l...@daevel.fr> wrote:
> > > > > Hi,
> > > > > 
> > > > > on a Luminous 12.2.11 deploiement, my bluestore OSD exceed
> > > > > the
> > > > > osd_memory_target :
> > > > > 
> > > > > daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> > > > > ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
> > > > > 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
> > > > > 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
&

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
Well, Dan seems to be right:

_tune_cache_size
target: 4294967296
  heap: 6514409472
  unmapped: 2267537408
mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


On Monday 8 April 2019 at 16:09 -0500, Mark Nelson wrote:
> One of the difficulties with the osd_memory_target work is that we
> can't 
> tune based on the RSS memory usage of the process. Ultimately it's up
> to 
> the kernel to decide to reclaim memory and especially with
> transparent 
> huge pages it's tough to judge what the kernel is going to do even
> if 
> memory has been unmapped by the process.  Instead the autotuner looks
> at 
> how much memory has been mapped and tries to balance the caches based
> on 
> that.
> 
> 
> In addition to Dan's advice, you might also want to enable debug 
> bluestore at level 5 and look for lines containing "target:" and 
> "cache_size:".  These will tell you the current target, the mapped 
> memory, unmapped memory, heap size, previous aggregate cache size,
> and 
> new aggregate cache size.  The other line will give you a break down
> of 
> how much memory was assigned to each of the bluestore caches and how 
> much each case is using.  If there is a memory leak, the autotuner
> can 
> only do so much.  At some point it will reduce the caches to fit
> within 
> cache_min and leave it there.
> 
> 
> Mark
> 
> 
> On 4/8/19 5:18 AM, Dan van der Ster wrote:
> > Which OS are you using?
> > With CentOS we find that the heap is not always automatically
> > released. (You can check the heap freelist with `ceph tell osd.0
> > heap
> > stats`).
> > As a workaround we run this hourly:
> > 
> > ceph tell mon.* heap release
> > ceph tell osd.* heap release
> > ceph tell mds.* heap release
> > 
> > -- Dan
> > 
> > On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
> > ceph.l...@daevel.fr> wrote:
> > > Hi,
> > > 
> > > on a Luminous 12.2.11 deploiement, my bluestore OSD exceed the
> > > osd_memory_target :
> > > 
> > > daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> > > ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
> > > 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser
> > > ceph --setgroup ceph
> > > ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
> > > 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser
> > > ceph --setgroup ceph
> > > ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
> > > 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser
> > > ceph --setgroup ceph
> > > ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
> > > 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser
> > > ceph --setgroup ceph
> > > ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
> > > 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser
> > > ceph --setgroup ceph
> > > ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
> > > 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser
> > > ceph --setgroup ceph
> > > ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
> > > 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser
> > > ceph --setgroup ceph
> > > ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
> > > 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser
> > > ceph --setgroup ceph
> > > 
> > > daevel-ob@ssdr712h:~$ free -m
> > >totalusedfree  shared  buff/ca
> > > che   available
> > > Mem:  47771   452101643  17 9
> > > 17   43556
> > > Swap: 0   0   0
> > > 
> > > # ceph daemon osd.147 config show | grep memory_target
> > >  "osd_memory_target": "4294967296",
> > > 
> > > 
> > > And there is no recovery / backfilling, the cluster is fine :
> > > 
> > > $ ceph status
> > >   cluster:
> > > id: de035250-323d-4cf6-8c4b-cf0faf6296b1
> > > health: HEALTH_OK
> > > 
> > >   services:
> > > mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
> > > mgr: tsyne(active), standbys: olkas, tolriq, 

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
Thanks for the advice. We are using Debian 9 (stretch), with a custom
Linux 4.14 kernel.

But "heap release" didn't help.


On Monday 8 April 2019 at 12:18 +0200, Dan van der Ster wrote:
> Which OS are you using?
> With CentOS we find that the heap is not always automatically
> released. (You can check the heap freelist with `ceph tell osd.0 heap
> stats`).
> As a workaround we run this hourly:
> 
> ceph tell mon.* heap release
> ceph tell osd.* heap release
> ceph tell mds.* heap release
> 
> -- Dan
> 
> On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet 
> wrote:
> > Hi,
> > 
> > on a Luminous 12.2.11 deploiement, my bluestore OSD exceed the
> > osd_memory_target :
> > 
> > daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> > ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
> > 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph
> > --setgroup ceph
> > ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
> > 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph
> > --setgroup ceph
> > ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
> > 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph
> > --setgroup ceph
> > ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
> > 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph
> > --setgroup ceph
> > ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
> > 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph
> > --setgroup ceph
> > ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
> > 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph
> > --setgroup ceph
> > ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
> > 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph
> > --setgroup ceph
> > ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
> > 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph
> > --setgroup ceph
> > 
> > daevel-ob@ssdr712h:~$ free -m
> >   totalusedfree  shared  buff/cache
> >available
> > Mem:  47771   452101643  17 917
> >43556
> > Swap: 0   0   0
> > 
> > # ceph daemon osd.147 config show | grep memory_target
> > "osd_memory_target": "4294967296",
> > 
> > 
> > And there is no recovery / backfilling, the cluster is fine :
> > 
> >$ ceph status
> >  cluster:
> >id: de035250-323d-4cf6-8c4b-cf0faf6296b1
> >health: HEALTH_OK
> > 
> >  services:
> >mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
> >mgr: tsyne(active), standbys: olkas, tolriq, lorunde, amphel
> >osd: 120 osds: 116 up, 116 in
> > 
> >  data:
> >pools:   20 pools, 12736 pgs
> >objects: 15.29M objects, 31.1TiB
> >usage:   101TiB used, 75.3TiB / 177TiB avail
> >pgs: 12732 active+clean
> > 4 active+clean+scrubbing+deep
> > 
> >  io:
> >client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd,
> > 1.29kop/s wr
> > 
> > 
> >On an other host, in the same pool, I see also high memory usage
> > :
> > 
> >daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
> >ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21
> > 1511:07 /usr/bin/ceph-osd -f --cluster ceph --id 131 --setuser ceph
> > --setgroup ceph
> >ceph6759  7.3 11.2 6299140 5484412 ? Ssl  mars21
> > 1665:22 /usr/bin/ceph-osd -f --cluster ceph --id 132 --setuser ceph
> > --setgroup ceph
> >ceph7114  7.0 11.7 6576168 5756236 ? Ssl  mars21
> > 1612:09 /usr/bin/ceph-osd -f --cluster ceph --id 133 --setuser ceph
> > --setgroup ceph
> >ceph7467  7.4 11.1 6244668 5430512 ? Ssl  mars21
> > 1704:06 /usr/bin/ceph-osd -f --cluster ceph --id 134 --setuser ceph
> > --setgroup ceph
> >ceph7821  7.7 11.1 6309456 5469376 ? Ssl  mars21
> > 1754:35 /usr/bin/ceph-osd -f --cluster ceph --id 135 --setuser ceph
> > --setgroup ceph
> >ceph8174  6.9 11.6 6545224 5705412 ? Ssl  mars21
> > 1590:31 /usr/bin/ceph-osd -f --cluster ceph --id 136 --setuser ceph
> > --setgroup ceph
> >ceph8746  6.6 11.1 6290004 5477204 ? Ssl  mars21
> > 1511:11 /usr/bin/ceph-osd -f --cluster ceph --id 137 --setuser ceph
> > 

[ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-06 Thread Olivier Bonvalet
Hi,

On a Luminous 12.2.11 deployment, my BlueStore OSDs exceed the
osd_memory_target:

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29 1903:42 
/usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph
ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29 1443:41 
/usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph --setgroup ceph
ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29 1889:41 
/usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph --setgroup ceph
ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29 2198:47 
/usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph --setgroup ceph
ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29 1866:05 
/usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph --setgroup ceph
ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29 1634:30 
/usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph --setgroup ceph
ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29 1882:42 
/usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph --setgroup ceph
ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29 1782:52 
/usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          47771       45210        1643          17         917       43556
Swap:             0           0           0

# ceph daemon osd.147 config show | grep memory_target
"osd_memory_target": "4294967296",


And there is no recovery/backfilling; the cluster is fine:

   $ ceph status
 cluster:
   id: de035250-323d-4cf6-8c4b-cf0faf6296b1
   health: HEALTH_OK

 services:
   mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
   mgr: tsyne(active), standbys: olkas, tolriq, lorunde, amphel
   osd: 120 osds: 116 up, 116 in

 data:
   pools:   20 pools, 12736 pgs
   objects: 15.29M objects, 31.1TiB
   usage:   101TiB used, 75.3TiB / 177TiB avail
   pgs: 12732 active+clean
4 active+clean+scrubbing+deep

 io:
   client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd, 1.29kop/s wr


   On another host, in the same pool, I also see high memory usage:

   daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
   ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21 1511:07 
/usr/bin/ceph-osd -f --cluster ceph --id 131 --setuser ceph --setgroup ceph
   ceph6759  7.3 11.2 6299140 5484412 ? Ssl  mars21 1665:22 
/usr/bin/ceph-osd -f --cluster ceph --id 132 --setuser ceph --setgroup ceph
   ceph7114  7.0 11.7 6576168 5756236 ? Ssl  mars21 1612:09 
/usr/bin/ceph-osd -f --cluster ceph --id 133 --setuser ceph --setgroup ceph
   ceph7467  7.4 11.1 6244668 5430512 ? Ssl  mars21 1704:06 
/usr/bin/ceph-osd -f --cluster ceph --id 134 --setuser ceph --setgroup ceph
   ceph7821  7.7 11.1 6309456 5469376 ? Ssl  mars21 1754:35 
/usr/bin/ceph-osd -f --cluster ceph --id 135 --setuser ceph --setgroup ceph
   ceph8174  6.9 11.6 6545224 5705412 ? Ssl  mars21 1590:31 
/usr/bin/ceph-osd -f --cluster ceph --id 136 --setuser ceph --setgroup ceph
   ceph8746  6.6 11.1 6290004 5477204 ? Ssl  mars21 1511:11 
/usr/bin/ceph-osd -f --cluster ceph --id 137 --setuser ceph --setgroup ceph
   ceph9100  7.7 11.6 6552080 5713560 ? Ssl  mars21 1757:22 
/usr/bin/ceph-osd -f --cluster ceph --id 138 --setuser ceph --setgroup ceph

   But! On a similar host, in a different pool, the problem is less visible:

   daevel-ob@ssdr712i:~$ ps auxw | grep ceph-osd
   ceph3617  2.8  9.9 5660308 4847444 ? Ssl  mars29 313:05 
/usr/bin/ceph-osd -f --cluster ceph --id 151 --setuser ceph --setgroup ceph
   ceph3958  2.3  9.8 5661936 4834320 ? Ssl  mars29 256:55 
/usr/bin/ceph-osd -f --cluster ceph --id 152 --setuser ceph --setgroup ceph
   ceph4299  2.3  9.8 5620616 4807248 ? Ssl  mars29 266:26 
/usr/bin/ceph-osd -f --cluster ceph --id 153 --setuser ceph --setgroup ceph
   ceph4643  2.3  9.6 5527724 4713572 ? Ssl  mars29 262:50 
/usr/bin/ceph-osd -f --cluster ceph --id 154 --setuser ceph --setgroup ceph
   ceph5016  2.2  9.7 5597504 4783412 ? Ssl  mars29 248:37 
/usr/bin/ceph-osd -f --cluster ceph --id 155 --setuser ceph --setgroup ceph
   ceph5380  2.8  9.9 5700204 4886432 ? Ssl  mars29 321:05 
/usr/bin/ceph-osd -f --cluster ceph --id 156 --setuser ceph --setgroup ceph
   ceph5724  3.1 10.1 5767456 4953484 ? Ssl  mars29 352:55 
/usr/bin/ceph-osd -f --cluster ceph --id 157 --setuser ceph --setgroup ceph
   ceph6070  2.7  9.9 5683092 4868632 ? Ssl  mars29 309:10 
/usr/bin/ceph-osd -f --cluster ceph --id 158 --setuser ceph --setgroup ceph


   Is there a memory leak? Or should I expect that osd_

[ceph-users] Bluestore & snapshots weight

2018-10-28 Thread Olivier Bonvalet
Hi,

with Filestore, to estimate the weight of snapshots we use a simple find
script on each OSD:

nice find "$OSDROOT/$OSDDIR/current/" \
-type f -not -name '*_head_*' -not -name '*_snapdir_*' \
-printf '%P\n'

Then we aggregate by image prefix and obtain an estimate of each
snapshot's weight. We use this method because we never found this
information in the Ceph tools.

Now with BlueStore we can't use this script anymore. Is there another
way to obtain this information?

I read that we can "mount" inactive OSD with "ceph-objectstore-tool",
but I can't shutdown OSDs for this.
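
One idea, which I have not validated: if the images have the fast-diff
feature enabled, "rbd du" should report an approximate per-snapshot usage
quickly, without touching the OSD hosts (pool and image names are
placeholders):

rbd du <pool>/<image>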

Thanks for any help,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet

On Friday 21 September 2018 at 19:45 +0200, Paul Emmerich wrote:
> The cache tiering has nothing to do with the PG of the underlying
> pool
> being incomplete.
> You are just seeing these requests as stuck because it's the only
> thing trying to write to the underlying pool.

I agree. It was just to be sure that the problems on OSDs 32, 68 and 69
are related to only one "real" problem.


> What you need to fix is the PG showing incomplete.  I assume you
> already tried reducing the min_size to 4 as suggested? Or did you by
> chance always run with min_size 4 on the ec pool, which is a common
> cause for problems like this.

Yes, it has always run with min_size 4.

We use Luminous 12.2.8 here, but some OSDs (~40%) still run Luminous
12.2.7. I was hoping to "fix" this problem before continuing the
upgrade.

Pool details:

pool 37 'bkp-foo-raid6' erasure size 6 min_size 4 crush_rule 20
object_hash rjenkins pg_num 256 pgp_num 256 last_change 585715 lfor
585714/585714 flags hashpspool,backfillfull stripe_width 4096 fast_read
1 application rbd
removed_snaps [1~3]
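
For context: with k=4, m=2 the usual recommendation is min_size = k+1 = 5,
so that PGs never accept writes with zero surviving redundancy; a further
failure during such a window is a classic way to end up incomplete. Once
the PG is healthy again, raising it is a one-liner (a sketch):

ceph osd pool set bkp-foo-raid6 min_size 5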




> Can you share the output of "ceph osd pool ls detail"?
> Also, which version of Ceph are you running?
> Paul
> 
> Am Fr., 21. Sep. 2018 um 19:28 Uhr schrieb Olivier Bonvalet
> :
> > 
> > So I've totally disable cache-tiering and overlay. Now OSD 68 & 69
> > are
> > fine, no more blocked.
> > 
> > But OSD 32 is still blocked, and PG 37.9c still marked incomplete
> > with
> > :
> > 
> > "recovery_state": [
> > {
> > "name": "Started/Primary/Peering/Incomplete",
> > "enter_time": "2018-09-21 18:56:01.222970",
> > "comment": "not enough complete instances of this PG"
> > },
> > 
> > But I don't see blocked requests in OSD.32 logs, should I increase
> > one
> > of the "debug_xx" flag ?
> > 
> > 
> > Le vendredi 21 septembre 2018 à 16:51 +0200, Maks Kowalik a écrit :
> > > According to the query output you pasted shards 1 and 2 are
> > > broken.
> > > But, on the other hand EC profile (4+2) should make it possible
> > > to
> > > recover from 2 shards lost simultanously...
> > > 
> > > pt., 21 wrz 2018 o 16:29 Olivier Bonvalet 
> > > napisał(a):
> > > > Well on drive, I can find thoses parts :
> > > > 
> > > > - cs0 on OSD 29 and 30
> > > > - cs1 on OSD 18 and 19
> > > > - cs2 on OSD 13
> > > > - cs3 on OSD 66
> > > > - cs4 on OSD 0
> > > > - cs5 on OSD 75
> > > > 
> > > > And I can read thoses files too.
> > > > 
> > > > And all thoses OSD are UP and IN.
> > > > 
> > > > 
> > > > Le vendredi 21 septembre 2018 à 13:10 +, Eugen Block a
> > > > écrit :
> > > > > > > I tried to flush the cache with "rados -p cache-bkp-foo
> > > > 
> > > > cache-
> > > > > > > flush-
> > > > > > > evict-all", but it blocks on the object
> > > > > > > "rbd_data.f66c92ae8944a.000f2596".
> > > > > 
> > > > > This is the object that's stuck in the cache tier (according
> > > > > to
> > > > > your
> > > > > output in https://pastebin.com/zrwu5X0w). Can you verify if
> > > > > that
> > > > > block
> > > > > device is in use and healthy or is it corrupt?
> > > > > 
> > > > > 
> > > > > Zitat von Maks Kowalik :
> > > > > 
> > > > > > Could you, please paste the output of pg 37.9c query
> > > > > > 
> > > > > > pt., 21 wrz 2018 o 14:39 Olivier Bonvalet <
> > > > > > ceph.l...@daevel.fr>
> > > > > > napisał(a):
> > > > > > 
> > > > > > > In fact, one object (only one) seem to be blocked on the
> > > > 
> > > > cache
> > > > > > > tier
> > > > > > > (writeback).
> > > > > > > 
> > > > > > > I tried to flush the cache with "rados -p cache-bkp-foo
> > > > 
> > > > cache-
> > > > > > > flush-
> > > > > > > evict-all", but it blocks on the object
> > > > > > > "rbd_data.f66c92ae8944a.000f2

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
So I've totally disabled cache tiering and the overlay. Now OSDs 68 & 69
are fine, no longer blocked.

But OSD 32 is still blocked, and PG 37.9c is still marked incomplete
with:

"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2018-09-21 18:56:01.222970",
"comment": "not enough complete instances of this PG"
},

But I don't see blocked requests in the OSD 32 logs; should I increase
one of the "debug_xx" flags?


On Friday 21 September 2018 at 16:51 +0200, Maks Kowalik wrote:
> According to the query output you pasted shards 1 and 2 are broken.
> But, on the other hand EC profile (4+2) should make it possible to
> recover from 2 shards lost simultanously... 
> 
> pt., 21 wrz 2018 o 16:29 Olivier Bonvalet 
> napisał(a):
> > Well on drive, I can find thoses parts :
> > 
> > - cs0 on OSD 29 and 30
> > - cs1 on OSD 18 and 19
> > - cs2 on OSD 13
> > - cs3 on OSD 66
> > - cs4 on OSD 0
> > - cs5 on OSD 75
> > 
> > And I can read thoses files too.
> > 
> > And all thoses OSD are UP and IN.
> > 
> > 
> > Le vendredi 21 septembre 2018 à 13:10 +, Eugen Block a écrit :
> > > > > I tried to flush the cache with "rados -p cache-bkp-foo
> > cache-
> > > > > flush-
> > > > > evict-all", but it blocks on the object
> > > > > "rbd_data.f66c92ae8944a.000f2596".
> > > 
> > > This is the object that's stuck in the cache tier (according to
> > > your  
> > > output in https://pastebin.com/zrwu5X0w). Can you verify if that
> > > block  
> > > device is in use and healthy or is it corrupt?
> > > 
> > > 
> > > Zitat von Maks Kowalik :
> > > 
> > > > Could you, please paste the output of pg 37.9c query
> > > > 
> > > > pt., 21 wrz 2018 o 14:39 Olivier Bonvalet 
> > > > napisał(a):
> > > > 
> > > > > In fact, one object (only one) seem to be blocked on the
> > cache
> > > > > tier
> > > > > (writeback).
> > > > > 
> > > > > I tried to flush the cache with "rados -p cache-bkp-foo
> > cache-
> > > > > flush-
> > > > > evict-all", but it blocks on the object
> > > > > "rbd_data.f66c92ae8944a.000f2596".
> > > > > 
> > > > > So I reduced (a lot) the cache tier to 200MB, "rados -p
> > cache-
> > > > > bkp-foo
> > > > > ls" now show only 3 objects :
> > > > > 
> > > > > rbd_directory
> > > > > rbd_data.f66c92ae8944a.000f2596
> > > > > rbd_header.f66c92ae8944a
> > > > > 
> > > > > And "cache-flush-evict-all" still hangs.
> > > > > 
> > > > > I also switched the cache tier to "readproxy", to avoid using
> > > > > this
> > > > > cache. But, it's still blocked.
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > Le vendredi 21 septembre 2018 à 02:14 +0200, Olivier Bonvalet
> > a
> > > > > écrit :
> > > > > > Hello,
> > > > > > 
> > > > > > on a Luminous cluster, I have a PG incomplete and I can't
> > find
> > > > > > how to
> > > > > > fix that.
> > > > > > 
> > > > > > It's an EC pool (4+2) :
> > > > > > 
> > > > > > pg 37.9c is incomplete, acting [32,50,59,1,0,75]
> > (reducing
> > > > > > pool
> > > > > > bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs
> > for
> > > > > > 'incomplete')
> > > > > > 
> > > > > > Of course, we can't reduce min_size from 4.
> > > > > > 
> > > > > > And the full state : https://pastebin.com/zrwu5X0w
> > > > > > 
> > > > > > So, IO are blocked, we can't access thoses damaged data.
> > > > > > OSD blocks too :
> > > > > > osds 32,68,69 have stuck requests > 4194.3 sec
> > > > > > 
> > > > > > OSD 32 is the primary of this PG.
> > > > > > And OSD 68 and 69 are for cache tiering.
> > > > > > 
> > > > > > Any idea how can I fix that ?
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Olivier
> > > > > > 
> > > > > > 
> > > > > > ___
> > > > > > ceph-users mailing list
> > > > > > ceph-users@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > 
> > > > > 
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > 
> > > 
> > > 
> > > 
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Well, on disk I can find those parts:

- cs0 on OSD 29 and 30
- cs1 on OSD 18 and 19
- cs2 on OSD 13
- cs3 on OSD 66
- cs4 on OSD 0
- cs5 on OSD 75

And I can read those files too.

And all those OSDs are UP and IN.


On Friday 21 September 2018 at 13:10, Eugen Block wrote:
> > > I tried to flush the cache with "rados -p cache-bkp-foo cache-
> > > flush-
> > > evict-all", but it blocks on the object
> > > "rbd_data.f66c92ae8944a.000f2596".
> 
> This is the object that's stuck in the cache tier (according to
> your  
> output in https://pastebin.com/zrwu5X0w). Can you verify if that
> block  
> device is in use and healthy or is it corrupt?
> 
> 
> Zitat von Maks Kowalik :
> 
> > Could you, please paste the output of pg 37.9c query
> > 
> > pt., 21 wrz 2018 o 14:39 Olivier Bonvalet 
> > napisał(a):
> > 
> > > In fact, one object (only one) seem to be blocked on the cache
> > > tier
> > > (writeback).
> > > 
> > > I tried to flush the cache with "rados -p cache-bkp-foo cache-
> > > flush-
> > > evict-all", but it blocks on the object
> > > "rbd_data.f66c92ae8944a.000f2596".
> > > 
> > > So I reduced (a lot) the cache tier to 200MB, "rados -p cache-
> > > bkp-foo
> > > ls" now show only 3 objects :
> > > 
> > > rbd_directory
> > > rbd_data.f66c92ae8944a.000f2596
> > > rbd_header.f66c92ae8944a
> > > 
> > > And "cache-flush-evict-all" still hangs.
> > > 
> > > I also switched the cache tier to "readproxy", to avoid using
> > > this
> > > cache. But, it's still blocked.
> > > 
> > > 
> > > 
> > > 
> > > Le vendredi 21 septembre 2018 à 02:14 +0200, Olivier Bonvalet a
> > > écrit :
> > > > Hello,
> > > > 
> > > > on a Luminous cluster, I have a PG incomplete and I can't find
> > > > how to
> > > > fix that.
> > > > 
> > > > It's an EC pool (4+2) :
> > > > 
> > > > pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing
> > > > pool
> > > > bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> > > > 'incomplete')
> > > > 
> > > > Of course, we can't reduce min_size from 4.
> > > > 
> > > > And the full state : https://pastebin.com/zrwu5X0w
> > > > 
> > > > So, IO are blocked, we can't access thoses damaged data.
> > > > OSD blocks too :
> > > > osds 32,68,69 have stuck requests > 4194.3 sec
> > > > 
> > > > OSD 32 is the primary of this PG.
> > > > And OSD 68 and 69 are for cache tiering.
> > > > 
> > > > Any idea how can I fix that ?
> > > > 
> > > > Thanks,
> > > > 
> > > > Olivier
> > > > 
> > > > 
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > 
> > > 
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Yep:

pool 38 'cache-bkp-foo' replicated size 3 min_size 2 crush_rule 26
object_hash rjenkins pg_num 128 pgp_num 128 last_change 585369 lfor
68255/68255 flags hashpspool,incomplete_clones tier_of 37 cache_mode
readproxy target_bytes 209715200 hit_set
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 300s
x2 decay_rate 0 search_last_n 0 min_read_recency_for_promote 10
min_write_recency_for_promote 2 stripe_width 0

I can't totally disable cache tiering, because the OSDs are on Filestore
(so without the "overwrites" feature).

On Friday 21 September 2018 at 13:26, Eugen Block wrote:
> > I also switched the cache tier to "readproxy", to avoid using this
> > cache. But, it's still blocked.
> 
> You could change the cache mode to "none" to disable it. Could you  
> paste the output of:
> 
> ceph osd pool ls detail | grep cache-bkp-foo
> 
> 
> 
> Zitat von Olivier Bonvalet :
> 
> > In fact, one object (only one) seem to be blocked on the cache tier
> > (writeback).
> > 
> > I tried to flush the cache with "rados -p cache-bkp-foo cache-
> > flush-
> > evict-all", but it blocks on the object
> > "rbd_data.f66c92ae8944a.000f2596".
> > 
> > So I reduced (a lot) the cache tier to 200MB, "rados -p cache-bkp-
> > foo
> > ls" now show only 3 objects :
> > 
> > rbd_directory
> > rbd_data.f66c92ae8944a.000f2596
> > rbd_header.f66c92ae8944a
> > 
> > And "cache-flush-evict-all" still hangs.
> > 
> > I also switched the cache tier to "readproxy", to avoid using this
> > cache. But, it's still blocked.
> > 
> > 
> > 
> > 
> > Le vendredi 21 septembre 2018 à 02:14 +0200, Olivier Bonvalet a
> > écrit :
> > > Hello,
> > > 
> > > on a Luminous cluster, I have a PG incomplete and I can't find
> > > how to
> > > fix that.
> > > 
> > > It's an EC pool (4+2) :
> > > 
> > > pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing
> > > pool
> > > bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> > > 'incomplete')
> > > 
> > > Of course, we can't reduce min_size from 4.
> > > 
> > > And the full state : https://pastebin.com/zrwu5X0w
> > > 
> > > So, IO are blocked, we can't access thoses damaged data.
> > > OSD blocks too :
> > > osds 32,68,69 have stuck requests > 4194.3 sec
> > > 
> > > OSD 32 is the primary of this PG.
> > > And OSD 68 and 69 are for cache tiering.
> > > 
> > > Any idea how can I fix that ?
> > > 
> > > Thanks,
> > > 
> > > Olivier
> > > 
> > > 
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
In fact, one object (only one) seems to be blocked in the cache tier
(writeback).

I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-
evict-all", but it blocks on the object
"rbd_data.f66c92ae8944a.000f2596".

So I reduced the cache tier (a lot) to 200MB; "rados -p cache-bkp-foo
ls" now shows only 3 objects:

rbd_directory
rbd_data.f66c92ae8944a.000f2596
rbd_header.f66c92ae8944a

And "cache-flush-evict-all" still hangs.

I also switched the cache tier to "readproxy", to avoid using this
cache. But, it's still blocked.
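
For the record, the mode switch and the reduced cache target were done
with the usual tier commands; roughly (a sketch of what we ran):

ceph osd tier cache-mode cache-bkp-foo readproxy
ceph osd pool set cache-bkp-foo target_max_bytes 209715200    # ~200MB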




On Friday 21 September 2018 at 02:14 +0200, Olivier Bonvalet wrote:
> Hello,
> 
> on a Luminous cluster, I have a PG incomplete and I can't find how to
> fix that.
> 
> It's an EC pool (4+2) :
> 
> pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
> bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> 'incomplete')
> 
> Of course, we can't reduce min_size from 4.
> 
> And the full state : https://pastebin.com/zrwu5X0w
> 
> So, IO are blocked, we can't access thoses damaged data.
> OSD blocks too :
> osds 32,68,69 have stuck requests > 4194.3 sec
> 
> OSD 32 is the primary of this PG.
> And OSD 68 and 69 are for cache tiering.
> 
> Any idea how can I fix that ?
> 
> Thanks,
> 
> Olivier
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
OK, so it's a replica-3 pool, and OSDs 68 & 69 are on the same host.

On Friday 21 September 2018 at 11:09, Eugen Block wrote:
> > cache-tier on this pool have 26GB of data (for 5.7TB of data on the
> > EC
> > pool).
> > We tried to flush the cache tier, and restart OSD 68 & 69, without
> > any
> > success.
> 
> I meant the replication size of the pool
> 
> ceph osd pool ls detail | grep 
> 
> In the experimental state of our cluster we had a cache tier (for
> rbd  
> pool) with size 2, that can cause problems during recovery. Since
> only  
> OSDs 68 and 69 are mentioned I was wondering if your cache tier
> also  
> has size 2.
> 
> 
> Zitat von Olivier Bonvalet :
> 
> > Hi,
> > 
> > cache-tier on this pool have 26GB of data (for 5.7TB of data on the
> > EC
> > pool).
> > We tried to flush the cache tier, and restart OSD 68 & 69, without
> > any
> > success.
> > 
> > But I don't see any related data on cache-tier OSD (filestore) with
> > :
> > 
> > find /var/lib/ceph/osd/ -maxdepth 3 -name '*37.9c*'
> > 
> > 
> > I don't see any usefull information in logs. Maybe I should
> > increase
> > log level ?
> > 
> > Thanks,
> > 
> > Olivier
> > 
> > 
> > Le vendredi 21 septembre 2018 à 09:34 +, Eugen Block a écrit :
> > > Hi Olivier,
> > > 
> > > what size does the cache tier have? You could set cache-mode to
> > > forward and flush it, maybe restarting those OSDs (68, 69) helps,
> > > too.
> > > Or there could be an issue with the cache tier, what do those
> > > logs
> > > say?
> > > 
> > > Regards,
> > > Eugen
> > > 
> > > 
> > > Zitat von Olivier Bonvalet :
> > > 
> > > > Hello,
> > > > 
> > > > on a Luminous cluster, I have a PG incomplete and I can't find
> > > > how
> > > > to
> > > > fix that.
> > > > 
> > > > It's an EC pool (4+2) :
> > > > 
> > > > pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing
> > > > pool
> > > > bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> > > > 'incomplete')
> > > > 
> > > > Of course, we can't reduce min_size from 4.
> > > > 
> > > > And the full state : https://pastebin.com/zrwu5X0w
> > > > 
> > > > So, IO are blocked, we can't access thoses damaged data.
> > > > OSD blocks too :
> > > > osds 32,68,69 have stuck requests > 4194.3 sec
> > > > 
> > > > OSD 32 is the primary of this PG.
> > > > And OSD 68 and 69 are for cache tiering.
> > > > 
> > > > Any idea how can I fix that ?
> > > > 
> > > > Thanks,
> > > > 
> > > > Olivier
> > > > 
> > > > 
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > > 
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Hi,

The cache tier on this pool has 26GB of data (for 5.7TB of data on the EC
pool).
We tried to flush the cache tier and restart OSDs 68 & 69, without any
success.

But I don't see any related data on the cache-tier OSDs (Filestore) with:

find /var/lib/ceph/osd/ -maxdepth 3 -name '*37.9c*'


I don't see any useful information in the logs. Maybe I should increase
the log level?

Thanks,

Olivier


On Friday 21 September 2018 at 09:34, Eugen Block wrote:
> Hi Olivier,
> 
> what size does the cache tier have? You could set cache-mode to  
> forward and flush it, maybe restarting those OSDs (68, 69) helps,
> too.  
> Or there could be an issue with the cache tier, what do those logs
> say?
> 
> Regards,
> Eugen
> 
> 
> Zitat von Olivier Bonvalet :
> 
> > Hello,
> > 
> > on a Luminous cluster, I have a PG incomplete and I can't find how
> > to
> > fix that.
> > 
> > It's an EC pool (4+2) :
> > 
> > pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
> > bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> > 'incomplete')
> > 
> > Of course, we can't reduce min_size from 4.
> > 
> > And the full state : https://pastebin.com/zrwu5X0w
> > 
> > So, IO are blocked, we can't access thoses damaged data.
> > OSD blocks too :
> > osds 32,68,69 have stuck requests > 4194.3 sec
> > 
> > OSD 32 is the primary of this PG.
> > And OSD 68 and 69 are for cache tiering.
> > 
> > Any idea how can I fix that ?
> > 
> > Thanks,
> > 
> > Olivier
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG stuck incomplete

2018-09-20 Thread Olivier Bonvalet
Hello,

on a Luminous cluster, I have an incomplete PG and I can't find how to
fix it.

It's an EC pool (4+2):

pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
'incomplete')

Of course, we can't reduce min_size from 4.

And the full state : https://pastebin.com/zrwu5X0w

So, IO is blocked and we can't access the damaged data.
OSDs block too:
osds 32,68,69 have stuck requests > 4194.3 sec

OSD 32 is the primary of this PG.
And OSDs 68 and 69 are for cache tiering.

Any idea how I can fix that?

Thanks,

Olivier


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optane 900P device class automatically set to SSD not NVME

2018-08-13 Thread Olivier Bonvalet
On a recent Luminous cluster, with nvme*n1 devices, the class is
automatically set to "nvme" for the "Intel SSD DC P3520 Series":

~# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF 
 -1   2.15996 root default  
 -9   0.71999 room room135  
 -3   0.71999 host ceph135a 
  0  nvme 0.35999 osd.0 up  1.0 1.0 
  1  nvme 0.35999 osd.1 up  1.0 1.0 
-11   0.71999 room room209  
 -5   0.71999 host ceph209a 
  2  nvme 0.35999 osd.2 up  1.0 1.0 
  3  nvme 0.35999 osd.3 up  1.0 1.0 
-12   0.71999 room room220  
 -7   0.71999 host ceph220a 
  4  nvme 0.35999 osd.4 up  1.0 1.0 
  5  nvme 0.35999 osd.5 up  1.0 1.0 


On Sunday 12 August 2018 at 23:37 +0200, c...@elchaka.de wrote:
> 
> Am 1. August 2018 10:33:26 MESZ schrieb Jake Grimmett <
> j...@mrc-lmb.cam.ac.uk>:
> > Dear All,
> 
> Hello Jake,
> 
> > 
> > Not sure if this is a bug, but when I add Intel Optane 900P drives,
> > their device class is automatically set to SSD rather than NVME.
> > 
> 
> AFAIK ceph actually difference only between hdd and ssd. Nvme would
> be handled as same like ssd.
> 
> Hth 
> - Mehmet 
>  
> > This happens under Mimic 13.2.0 and 13.2.1
> > 
> > [root@ceph2 ~]# ceph-volume lvm prepare --bluestore --data
> > /dev/nvme0n1
> > 
> > (SNIP see http://p.ip.fi/eopR for output)
> > 
> > Check...
> > [root@ceph2 ~]# ceph osd tree | grep "osd.1 "
> >  1   ssd0.25470 osd.1   up  1.0 1.0
> > 
> > Fix is easy
> > [root@ceph2 ~]# ceph osd crush rm-device-class osd.1
> > done removing class of osd(s): 1
> > 
> > [root@ceph2 ~]# ceph osd crush set-device-class nvme osd.1
> > set osd(s) 1 to class 'nvme'
> > 
> > Check...
> > [root@ceph2 ~]# ceph osd tree | grep "osd.1 "
> >  1  nvme0.25470 osd.1   up  1.0 1.0
> > 
> > 
> > Thanks,
> > 
> > Jake
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ghost PG : "i don't have pgid xx"

2018-06-05 Thread Olivier Bonvalet
Hi,

Good point! Changing this value *and* restarting ceph-mgr fixed the
issue. Now we have to find a way to reduce the PG count.
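
For the record, on Luminous the temporary bump has to go through
injectargs and/or ceph.conf (there is no central config store yet);
roughly like this (the mgr unit name is a placeholder):

ceph tell mon.* injectargs '--mon_max_pg_per_osd 300'
# also set it in ceph.conf, then restart the active mgr:
systemctl restart ceph-mgr@<id>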

Thanks Paul !

Olivier

On Tuesday 5 June 2018 at 10:39 +0200, Paul Emmerich wrote:
> Hi,
> 
> looks like you are running into the PG overdose protection of
> Luminous (you got > 200 PGs per OSD): try to increase
> mon_max_pg_per_osd on the monitors to 300 or so to temporarily
> resolve this.
> 
> Paul
> 
> 2018-06-05 9:40 GMT+02:00 Olivier Bonvalet :
> > Some more informations : the cluster was just upgraded from Jewel
> > to
> > Luminous.
> > 
> > # ceph pg dump | egrep '(stale|creating)'
> > dumped all
> > 15.32 10947  00 0   0 
> > 45870301184  3067 3067   
> > stale+active+clean 2018-06-04 09:20:42.594317   387644'251008   
> >  437722:754803[48,31,45] 48   
> > [48,31,45] 48   213014'224196 2018-04-22
> > 02:01:09.148152   200181'219150 2018-04-14 14:40:13.116285 
> >0 
> > 19.77  4131  00 0   0 
> > 17326669824  3076 3076   
> > stale+down 2018-06-05 07:28:33.968860394478'58307   
> >  438699:736881  [NONE,20,76] 20   
> >   [NONE,20,76] 20273736'49495 2018-05-17
> > 01:05:35.523735273736'49495 2018-05-17 01:05:35.523735 
> >0 
> > 13.76 10730  00 0   0 
> > 44127133696  3011 3011   
> > stale+down 2018-06-05 07:30:27.578512   397231'457143   
> > 438813:4600135  [NONE,21,76] 21   
> >   [NONE,21,76] 21   286462'438402 2018-05-20
> > 18:06:12.443141   286462'438402 2018-05-20 18:06:12.443141 
> >0 
> > 
> > 
> > 
> > 
> > Le mardi 05 juin 2018 à 09:25 +0200, Olivier Bonvalet a écrit :
> > > Hi,
> > > 
> > > I have a cluster in "stale" state : a lots of RBD are blocked
> > since
> > > ~10
> > > hours. In the status I see PG in stale or down state, but thoses
> > PG
> > > doesn't seem to exists anymore :
> > > 
> > > root! stor00-sbg:~# ceph health detail | egrep '(stale|down)'
> > > HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1 nearfull
> > osd(s);
> > > 16 pool(s) nearfull; 4645278/103969515 objects misplaced
> > (4.468%);
> > > Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs
> > > peering, 3 pgs stale; Degraded data redundancy: 2723173/103969515
> > > objects degraded (2.619%), 387 pgs degraded, 297 pgs undersized;
> > 229
> > > slow requests are blocked > 32 sec; 4074 stuck requests are
> > blocked >
> > > 4096 sec; too many PGs per OSD (202 > max 200); mons hyp01-
> > sbg,hyp02-
> > > sbg,hyp03-sbg are using a lot of disk space
> > > PG_AVAILABILITY Reduced data availability: 643 pgs inactive, 12
> > pgs
> > > down, 2 pgs peering, 3 pgs stale
> > > pg 31.8b is down, acting [2147483647,16,36]
> > > pg 31.8e is down, acting [2147483647,29,19]
> > > pg 46.b8 is down, acting [2147483647,2147483647,13,17,47,28]
> > > 
> > > root! stor00-sbg:~# ceph pg 31.8b query
> > > Error ENOENT: i don't have pgid 31.8b
> > > 
> > > root! stor00-sbg:~# ceph pg 31.8e query
> > > Error ENOENT: i don't have pgid 31.8e
> > > 
> > > root! stor00-sbg:~# ceph pg 46.b8 query
> > > Error ENOENT: i don't have pgid 46.b8
> > > 
> > > 
> > > We just loose an HDD, and mark the corresponding OSD as "lost".
> > > 
> > > Any idea of what should I do ?
> > > 
> > > Thanks,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ghost PG : "i don't have pgid xx"

2018-06-05 Thread Olivier Bonvalet
Some more information: the cluster was just upgraded from Jewel to
Luminous.

# ceph pg dump | egrep '(stale|creating)'
dumped all
15.32 10947  00 0   0  45870301184  
3067 3067stale+active+clean 2018-06-04 
09:20:42.594317   387644'251008 437722:754803[48,31,45] 
48[48,31,45] 48   213014'224196 
2018-04-22 02:01:09.148152   200181'219150 2018-04-14 14:40:13.116285   
  0 
19.77  4131  00 0   0  17326669824  
3076 3076stale+down 2018-06-05 
07:28:33.968860394478'58307 438699:736881  [NONE,20,76] 
20  [NONE,20,76] 20273736'49495 
2018-05-17 01:05:35.523735273736'49495 2018-05-17 01:05:35.523735   
  0 
13.76 10730  00 0   0  44127133696  
3011 3011stale+down 2018-06-05 
07:30:27.578512   397231'457143438813:4600135  [NONE,21,76] 
21  [NONE,21,76] 21   286462'438402 
2018-05-20 18:06:12.443141   286462'438402 2018-05-20 18:06:12.443141       
  0 




On Tuesday 5 June 2018 at 09:25 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> I have a cluster in "stale" state : a lots of RBD are blocked since
> ~10
> hours. In the status I see PG in stale or down state, but thoses PG
> doesn't seem to exists anymore :
> 
> root! stor00-sbg:~# ceph health detail | egrep '(stale|down)'
> HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1 nearfull osd(s);
> 16 pool(s) nearfull; 4645278/103969515 objects misplaced (4.468%);
> Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs
> peering, 3 pgs stale; Degraded data redundancy: 2723173/103969515
> objects degraded (2.619%), 387 pgs degraded, 297 pgs undersized; 229
> slow requests are blocked > 32 sec; 4074 stuck requests are blocked >
> 4096 sec; too many PGs per OSD (202 > max 200); mons hyp01-sbg,hyp02-
> sbg,hyp03-sbg are using a lot of disk space
> PG_AVAILABILITY Reduced data availability: 643 pgs inactive, 12 pgs
> down, 2 pgs peering, 3 pgs stale
> pg 31.8b is down, acting [2147483647,16,36]
> pg 31.8e is down, acting [2147483647,29,19]
> pg 46.b8 is down, acting [2147483647,2147483647,13,17,47,28]
> 
> root! stor00-sbg:~# ceph pg 31.8b query
> Error ENOENT: i don't have pgid 31.8b
> 
> root! stor00-sbg:~# ceph pg 31.8e query
> Error ENOENT: i don't have pgid 31.8e
> 
> root! stor00-sbg:~# ceph pg 46.b8 query
> Error ENOENT: i don't have pgid 46.b8
> 
> 
> We just loose an HDD, and mark the corresponding OSD as "lost".
> 
> Any idea of what should I do ?
> 
> Thanks,
> 
> Olivier
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ghost PG : "i don't have pgid xx"

2018-06-05 Thread Olivier Bonvalet
Hi,

I have a cluster in "stale" state : a lots of RBD are blocked since ~10
hours. In the status I see PG in stale or down state, but thoses PG
doesn't seem to exists anymore :

root! stor00-sbg:~# ceph health detail | egrep '(stale|down)'
HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1 nearfull osd(s); 16 
pool(s) nearfull; 4645278/103969515 objects misplaced (4.468%); Reduced data 
availability: 643 pgs inactive, 12 pgs down, 2 pgs peering, 3 pgs stale; 
Degraded data redundancy: 2723173/103969515 objects degraded (2.619%), 387 pgs 
degraded, 297 pgs undersized; 229 slow requests are blocked > 32 sec; 4074 
stuck requests are blocked > 4096 sec; too many PGs per OSD (202 > max 200); 
mons hyp01-sbg,hyp02-sbg,hyp03-sbg are using a lot of disk space
PG_AVAILABILITY Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs 
peering, 3 pgs stale
pg 31.8b is down, acting [2147483647,16,36]
pg 31.8e is down, acting [2147483647,29,19]
pg 46.b8 is down, acting [2147483647,2147483647,13,17,47,28]

root! stor00-sbg:~# ceph pg 31.8b query
Error ENOENT: i don't have pgid 31.8b

root! stor00-sbg:~# ceph pg 31.8e query
Error ENOENT: i don't have pgid 31.8e

root! stor00-sbg:~# ceph pg 46.b8 query
Error ENOENT: i don't have pgid 46.b8


We just lost an HDD and marked the corresponding OSD as "lost".

Any idea what I should do?

Thanks,

Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re : general protection fault: 0000 [#1] SMP

2017-10-12 Thread Olivier Bonvalet
On Thursday 12 October 2017 at 09:12 +0200, Ilya Dryomov wrote:
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything
> similar
> in the context of krbd.
> 
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?

Hi,

yes, it's a Xen dom0 kernel: Linux 4.13.3, Xen 4.8.2, with an old
Ceph 0.94.10 (so, Hammer).

Before this error, I had this in the logs:

Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message 
88021a910200 data crc 2306836368 != exp. 2215155875
Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 
bad crc/signature
Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message 
880220b10100 data crc 842840543 != exp. 2657161714
Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 
bad crc/signature
Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message 
88025d1aa400 data crc 938978341 != exp. 4154366769
Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 
bad crc/signature
Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message 
880260300400 data crc 2988747958 != exp. 319958859
Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 
bad crc/signature
Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message 
8802208b8300 data crc 369498361 != exp. 906022772
Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 
bad crc/signature
Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message 
880236f9a000 data crc 2586662963 != exp. 2886241494
Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 
bad crc/signature
Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message 
88027e633c00 data crc 1102363051 != exp. 679243837
Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 
bad crc/signature
Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message 
88024ce0a800 data crc 2467415342 != exp. 1753860323
Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 
bad crc/signature
Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault:  
[#1] SMP


We had to switch to TCP CUBIC (instead of a badly configured TCP BBR, without
FQ) to reduce the data crc errors.
But since we still had some errors, last night we rebooted all the OSD nodes into
Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.
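
For completeness, BBR is meant to be paired with the fq qdisc; the broken
setup was roughly the second sysctl without the first (illustrative values
only):

sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# what we reverted to:
sysctl -w net.ipv4.tcp_congestion_control=cubic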

For the last 7 hours we haven't had any data crc errors from the OSDs, but we had
one from a MON. No hang/crash.

About the workload: the Xen VMs are mainly LAMP servers, with HTTP traffic handled
by nginx or Apache, PHP, and MySQL databases.

Thanks,

Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] general protection fault: 0000 [#1] SMP

2017-10-11 Thread Olivier Bonvalet
Hi,

I had a "general protection fault: " with Ceph RBD kernel client.
Not sure how to read the call, is it Ceph related ?


Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault:  
[#1] SMP
Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid 
binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev 
br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph 
xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul 
ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel 
aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper 
intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt 
fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter 
ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables 
x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor xor async_tx raid6_pq
Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 
multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core 
xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp 
usb_common pps_core mdio
Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: 
kworker/1:0 Not tainted 4.13-dae-dom0 #2
Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel 
Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 
10/12/2016
Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr 
ceph_con_workfn [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.898355] task: 8801ce434280 
task.stack: c900151bc000
Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:c900151bfac0 
EFLAGS: 00010202
Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: 8801b63df000 RBX: 
88021b41be00 RCX: 04df
Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 04df RSI: 
4450736e24806564 RDI: 8801b63df000
Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ea0005fdd8c8 R08: 
88028545d618 R09: 0010
Oct 11 16:15:11 lorunde kernel: [311418.902104] R10:  R11: 
880215815000 R12: 
Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: 8802158156c0 R14: 
 R15: 8801ce434280
Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  () 
GS:88028544() knlGS:88028544
Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS:  ES:  
CR0: 80050033
Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 55a8461cfc20 CR3: 
01809000 CR4: 00042660
Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? 
__sk_mem_raise_allocated+0x4e/0x220
Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? 
__tcp_push_pending_frames+0x28/0x90
Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? do_tcp_sendpages+0x4fc/0x590
Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? ceph_tcp_sendpage+0x5c/0xc0 
[libceph]
Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 
[libceph]
Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? ceph_con_workfn+0xd08/0x22a0 
[libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? 
ceph_osdc_start_request+0x23/0x30 [libceph]
Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? 
rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? process_one_work+0x1ad/0x340
Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? process_one_work+0x340/0x340
Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? 
kthread_create_on_node+0x40/0x40
Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
Oct 11 16:15:11 lorunde kernel: [311418.919826] Code: 43 4e 5b eb ec eb 1e 0f 
1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 
44 00 00 48 89 f8 48 89 d1  a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 
72 7e 40 38 
Oct 11 16:15:11 lorun

[ceph-users] Re : Re : Re : bad crc/signature errors

2017-10-06 Thread Olivier Bonvalet
On Thursday, 5 October 2017 at 21:52 +0200, Ilya Dryomov wrote:
> On Thu, Oct 5, 2017 at 6:05 PM, Olivier Bonvalet wrote:
> > Le jeudi 05 octobre 2017 à 17:03 +0200, Ilya Dryomov a écrit :
> > > When did you start seeing these errors?  Can you correlate that
> > > to
> > > a ceph or kernel upgrade?  If not, and if you don't see other
> > > issues,
> > > I'd write it off as faulty hardware.
> > 
> > Well... I have one hypervisor (Xen 4.6 and kernel Linux 4.1.13),
> > which
> 
> Is that 4.1.13 or 4.13.1?
> 

Linux 4.1.13. The old Debian 8, with Xen 4.6 from upstream.


> > have the problem for a long time, at least since 1 month (I haven't
> > older logs).
> > 
> > But, on others hypervisors (Xen 4.8 with Linux 4.9.x), I haven't
> > the
> > problem.
> > And it's when I upgraded thoses hypervisors to Linux 4.13.x, that
> > "bad
> > crc" errors appeared.
> > 
> > Note : if I upgraded kernels on Xen 4.8 hypervisors, it's because
> > some
> > DISCARD commands over RBD were blocking ("fstrim" works, but not
> > "lvremove" with discard enabled). After upgrading to Linux 4.13.3,
> > DISCARD works again on Xen 4.8.
> 
> Which kernel did you upgrade from to 4.13.3 exactly?
> 
> 

4.9.47 or 4.9.52, I don't have more precise data about this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re : Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
On Thursday, 5 October 2017 at 17:03 +0200, Ilya Dryomov wrote:
> When did you start seeing these errors?  Can you correlate that to
> a ceph or kernel upgrade?  If not, and if you don't see other issues,
> I'd write it off as faulty hardware.

Well... I have one hypervisor (Xen 4.6 with kernel Linux 4.1.13) which
has had the problem for a long time, at least for a month (I don't have
older logs).

But on the other hypervisors (Xen 4.8 with Linux 4.9.x), I didn't have
the problem. The "bad crc" errors appeared when I upgraded those
hypervisors to Linux 4.13.x.

Note: I upgraded the kernels on the Xen 4.8 hypervisors because some
DISCARD commands over RBD were blocking ("fstrim" worked, but not
"lvremove" with discard enabled). After upgrading to Linux 4.13.3,
DISCARD works again on Xen 4.8.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re : Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
On Thursday, 5 October 2017 at 11:10 +0200, Ilya Dryomov wrote:
> On Thu, Oct 5, 2017 at 9:03 AM, Olivier Bonvalet wrote:
> > I also see that, but on 4.9.52 and 4.13.3 kernel.
> > 
> > I also have some kernel panic, but don't know if it's related (RBD
> > are
> > mapped on Xen hosts).
> 
> Do you have that panic message?
> 
> Do you use rbd devices for something other than Xen?  If so, have you
> ever seen these errors outside of Xen?
> 
> Thanks,
> 
> Ilya
> 

No, I don't have that panic message: the hosts reboot way too quickly.
And no, I only use this cluster with Xen.

Sorry for this not very useful answer...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
On Thursday, 5 October 2017 at 11:47 +0200, Ilya Dryomov wrote:
> The stable pages bug manifests as multiple sporadic connection
> resets,
> because in that case CRCs computed by the kernel don't always match
> the
> data that gets sent out.  When the mismatch is detected on the OSD
> side, OSDs reset the connection and you'd see messages like
> 
>   libceph: osd1 1.2.3.4:6800 socket closed (con state OPEN)
>   libceph: osd2 1.2.3.4:6804 socket error on write
> 
> This is a different issue.  Josy, Adrian, Olivier, do you also see
> messages of the "libceph: read_partial_message ..." type or is it
> just
> "libceph: ... bad crc/signature" errors?
> 
> Thanks,
> 
> Ilya

I have "read_partial_message" too, for example :

Oct  5 09:00:47 lorunde kernel: [65575.969322] libceph: read_partial_message 
88027c231500 data crc 181941039 != exp. 115232978
Oct  5 09:00:47 lorunde kernel: [65575.969953] libceph: osd122 10.0.0.31:6800 
bad crc/signature
Oct  5 09:04:30 lorunde kernel: [65798.958344] libceph: read_partial_message 
880254a25c00 data crc 443114996 != exp. 2014723213
Oct  5 09:04:30 lorunde kernel: [65798.959044] libceph: osd18 10.0.0.22:6802 
bad crc/signature
Oct  5 09:14:28 lorunde kernel: [66396.788272] libceph: read_partial_message 
880238636200 data crc 1797729588 != exp. 2550563968
Oct  5 09:14:28 lorunde kernel: [66396.788984] libceph: osd43 10.0.0.9:6804 bad 
crc/signature
Oct  5 10:09:36 lorunde kernel: [69704.211672] libceph: read_partial_message 
8802712dff00 data crc 2241944833 != exp. 762990605
Oct  5 10:09:36 lorunde kernel: [69704.212422] libceph: osd103 10.0.0.28:6804 
bad crc/signature
Oct  5 10:25:41 lorunde kernel: [70669.203596] libceph: read_partial_message 
880257521400 data crc 3655331946 != exp. 2796991675
Oct  5 10:25:41 lorunde kernel: [70669.204462] libceph: osd16 10.0.0.21:6806 
bad crc/signature
Oct  5 10:25:52 lorunde kernel: [70680.255943] libceph: read_partial_message 
880245e3d600 data crc 3787567693 != exp. 725251636
Oct  5 10:25:52 lorunde kernel: [70680.257066] libceph: osd60 10.0.0.23:6800 
bad crc/signature


On the OSD side, for osd122 for example, I don't see any "reset" in the
OSD log.
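
In case it helps to narrow this down: a quick way to check whether the
kernel is enforcing stable pages on a mapped image seems to be the bdi
attribute of the rbd device (rbd0 is only an example, adjust to the
mapped device) :

  cat /sys/block/rbd0/bdi/stable_pages_required

If I understand http://tracker.ceph.com/issues/19275 correctly, a 0
there while CRCs are enabled would be consistent with the mismatches
above, while 1 means stable pages are in effect.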


Thanks,

Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re : bad crc/signature errors

2017-10-05 Thread Olivier Bonvalet
I also see that, but with 4.9.52 and 4.13.3 kernels.

I also have some kernel panics, but I don't know whether they're
related (the RBD devices are mapped on Xen hosts).

On Thursday, 5 October 2017 at 05:53, Adrian Saul wrote:
> We see the same messages and are similarly on a 4.4 KRBD version that
> is affected by this.
> 
> I have seen no impact from it so far that I know about
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > Behalf Of
> > Jason Dillaman
> > Sent: Thursday, 5 October 2017 5:45 AM
> > To: Gregory Farnum 
> > Cc: ceph-users ; Josy
> > 
> > Subject: Re: [ceph-users] bad crc/signature errors
> > 
> > Perhaps this is related to a known issue on some 4.4 and later
> > kernels [1]
> > where the stable write flag was not preserved by the kernel?
> > 
> > [1] http://tracker.ceph.com/issues/19275
> > 
> > On Wed, Oct 4, 2017 at 2:36 PM, Gregory Farnum 
> > wrote:
> > > That message indicates that the checksums of messages between
> > > your
> > > kernel client and OSD are incorrect. It could be actual physical
> > > transmission errors, but if you don't see other issues then this
> > > isn't
> > > fatal; they can recover from it.
> > > 
> > > On Wed, Oct 4, 2017 at 8:52 AM Josy 
> > 
> > wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > We have setup a cluster with 8 OSD servers (31 disks)
> > > > 
> > > > Ceph health is Ok.
> > > > --
> > > > [root@las1-1-44 ~]# ceph -s
> > > >cluster:
> > > >  id: de296604-d85c-46ab-a3af-add3367f0e6d
> > > >  health: HEALTH_OK
> > > > 
> > > >services:
> > > >  mon: 3 daemons, quorum
> > > > ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
> > > >  mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
> > > >  osd: 31 osds: 31 up, 31 in
> > > > 
> > > >data:
> > > >  pools:   4 pools, 510 pgs
> > > >  objects: 459k objects, 1800 GB
> > > >  usage:   5288 GB used, 24461 GB / 29749 GB avail
> > > >  pgs: 510 active+clean
> > > > 
> > > > 
> > > > We created a pool and mounted it as RBD in one of the client
> > > > server.
> > > > While adding data to it, we see this below error :
> > > > 
> > > > 
> > > > [939656.039750] libceph: osd20 10.255.0.9:6808 bad
> > > > crc/signature
> > > > [939656.041079] libceph: osd16 10.255.0.8:6816 bad
> > > > crc/signature
> > > > [939735.627456] libceph: osd11 10.255.0.7:6800 bad
> > > > crc/signature
> > > > [939735.628293] libceph: osd30 10.255.0.11:6804 bad
> > > > crc/signature
> > > > 
> > > > =
> > > > 
> > > > Can anyone explain what is this and if I can fix it ?
> > > > 
> > > > 
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > 
> > 
> > 
> > --
> > Jason
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> Confidentiality: This email and any attachments are confidential and
> may be subject to copyright, legal or some other professional
> privilege. They are intended solely for the attention and use of the
> named addressee(s). They may only be copied, distributed or disclosed
> with the consent of the copyright owner. If you have received this
> email by mistake or by breach of the confidentiality clause, please
> notify the sender immediately by return email and delete or destroy
> all copies of the email. Any confidentiality, privilege or copyright
> is not waived or lost because this email has been sent to you by
> mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph.com IPv6 down

2015-09-23 Thread Olivier Bonvalet
On Wednesday, 23 September 2015 at 13:41 +0200, Wido den Hollander wrote:
> Hmm, that is weird. It works for me here from the Netherlands via
> IPv6:

You're right, I checked from other providers and it works.

So, is it a problem between Free (France) and Dreamhost?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph.com IPv6 down

2015-09-23 Thread Olivier Bonvalet
Hi,

for several hours now, http://ceph.com/ has not been replying over IPv6.
It pings, and we can open a TCP socket, but nothing more:


~$ nc -w30 -v -6 ceph.com 80
Connection to ceph.com 80 port [tcp/http] succeeded!
GET / HTTP/1.0
Host: ceph.com




But a HEAD query works:

~$ nc -w30 -v -6 ceph.com 80
Connection to ceph.com 80 port [tcp/http] succeeded!
HEAD / HTTP/1.0
Host: ceph.com
HTTP/1.0 200 OK
Date: Wed, 23 Sep 2015 11:35:27 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.4.16
X-Powered-By: PHP/5.4.16
Set-Cookie: PHPSESSID=q0jf4mh9rqfk5du4kn8tcnqen1; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, 
pre-check=0
Pragma: no-cache
X-Pingback: http://ceph.com/xmlrpc.php
Link: ; rel=shortlink
Connection: close
Content-Type: text/html; charset=UTF-8



So, from my browser the website is unavailable.
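
For anyone who wants to reproduce it without typing the request by
hand, something like this should show the same behaviour (just the
curl equivalent of the nc tests above) :

  curl -6 -v -m 30 -o /dev/null http://ceph.com/
  curl -6 -v -I http://ceph.com/

The first one should hang until the 30s timeout, while the HEAD
variant should return immediately.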

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Hi,

I think I found the problem: a way too large journal.
I caught this in the logs of an OSD that had blocked requests:

OSD.15 :

2015-09-19 00:41:12.717062 7fb8a3c44700  1 journal check_for_full at 3548528640 
: JOURNAL FULL 3548528640 >= 1376255 (max_size 4294967296 start 3549904896)
2015-09-19 00:41:43.124590 7fb8a6181700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.405719 secs
2015-09-19 00:41:43.124596 7fb8a6181700  0 log [WRN] : slow request 30.405719 
seconds old, received at 2015-09-19 00:41:12.718829: 
osd_op(client.31621623.1:5392489797 rb.0.1b844d6.238e1f29.04d3 [write 
0~4096] 6.3aed306f snapc 4=[4,11096,11018] ondisk+write e847952) v4 
currently waiting for subops from 19
2015-09-19 00:41:43.124599 7fb8a6181700  0 log [WRN] : slow request 30.172735 
seconds old, received at 2015-09-19 00:41:12.951813: 
osd_op(client.31435077.1:8423014905 rb.0.1c39394.238e1f29.037a [write 
1499136~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124602 7fb8a6181700  0 log [WRN] : slow request 30.172703 
seconds old, received at 2015-09-19 00:41:12.951845: 
osd_op(client.31435077.1:8423014906 rb.0.1c39394.238e1f29.037a [write 
1523712~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124604 7fb8a6181700  0 log [WRN] : slow request 30.172576 
seconds old, received at 2015-09-19 00:41:12.951972: 
osd_op(client.31435077.1:8423014907 rb.0.1c39394.238e1f29.037a [write 
1515520~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124606 7fb8a6181700  0 log [WRN] : slow request 30.172546 
seconds old, received at 2015-09-19 00:41:12.952002: 
osd_op(client.31435077.1:8423014909 rb.0.1c39394.238e1f29.037a [write 
1531904~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28

and at same time on OSD.19 :

2015-09-19 00:41:19.549508 7f55973c0700  0 -- 192.168.42.22:6806/28596 >> 
192.168.42.16:6828/38905 pipe(0x230f sd=358 :6806 s=2 pgs=14268 cs=3 l=0 
c=0x6d9cb00).fault with nothing to send, going to standby
2015-09-19 00:41:43.246421 7f55ba277700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 30.253274 secs
2015-09-19 00:41:43.246428 7f55ba277700  0 log [WRN] : slow request 30.253274 
seconds old, received at 2015-09-19 00:41:12.993123: 
osd_op(client.31626115.1:4664205553 rb.0.1c918ad.238e1f29.2da9 [write 
3063808~16384] 6.604ba242 snapc 10aaf=[10aaf,10a31,109b3] ondisk+write e847952) 
v4 currently waiting for subops from 15
2015-09-19 00:42:13.251591 7f55ba277700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 60.258446 secs
2015-09-19 00:42:13.251596 7f55ba277700  0 log [WRN] : slow request 60.258446 
seconds old, received at 2015-09-19 00:41:12.993123: 
osd_op(client.31626115.1:4664205553 rb.0.1c918ad.238e1f29.2da9 [write 
3063808~16384] 6.604ba242 snapc 10aaf=[10aaf,10a31,109b3] ondisk+write e847952) 
v4 currently waiting for subops from 15

So the blocking seems to be the "JOURNAL FULL" event, with big numbers.
Is 3548528640 the journal size?
I just reduced filestore_max_sync_interval to 30s, and everything seems
to work fine.
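
For reference, the change is roughly this in ceph.conf on the OSD
nodes (the value is just what I'm testing, not a recommendation). The
"osd journal size" line is only what I would use if I end up
recreating smaller journals, since it is not applied to existing
ones :

  [osd]
      filestore max sync interval = 30
      # only used when a journal is (re)created
      osd journal size = 5120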

For SSD OSDs with the journal on the same device, a big journal is a
crazy thing... I suppose I broke this setup while trying to tune the
journal for the HDD pool.

At the same time, are there any tips for tuning the journal in the case
of HDD OSDs with a (potentially big) SSD journal and a hardware RAID
card that handles write-back?

Thanks for your help.

Olivier


On Friday, 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> I have a cluster with lot of blocked operations each time I try to
> move
> data (by reweighting a little an OSD).
> 
> It's a full SSD cluster, with 10GbE network.
> 
> In logs, when I have blocked OSD, on the main OSD I can see that :
> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 33.976680 secs
> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 63.981596 secs
> 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [writ

Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Olivier Bonvalet
Hi,

not sure if it's related, but there are recent changes because of a
security issue:

http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/
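
In practice I guess it means pointing sources.list at one of the named
release paths you list below, something like this (giant is only an
example, pick the release you actually track) :

  deb http://ceph.com/debian-giant/ wheezy main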




On Friday, 18 September 2015 at 08:45 -0500, Brian Kroth wrote:
> Hi all, we've had the following in our
> /etc/apt/sources.list.d/ceph.list 
> for a while based on some previous docs,
> 
> # ceph upstream stable (currently giant) release packages for wheezy:
> deb http://ceph.com/debian/ wheezy main
> 
> # ceph extras:
> deb http://ceph.com/packages/ceph-extras/debian wheezy main
> 
> but it seems like the straight "debian/" portion of that path has
> gone 
> missing recently, and now there's only debian-firefly/, debian
> -giant/, 
> debian-hammer/, etc.
> 
> Is that just an oversight, or should we be switching our sources to
> one 
> of the named releases?  I figured that the unnamed one would 
> automatically track what ceph currently considered "stable" for the 
> target distro release for me, but maybe that's not the case.
> 
> Thanks,
> Brian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 14:14 +0200, Paweł Sadowski wrote:
> It might be worth checking how many threads you have in your system
> (ps
> -eL | wc -l). By default there is a limit of 32k (sysctl -q
> kernel.pid_max). There is/was a bug in fork()
> (https://lkml.org/lkml/2015/2/3/345) reporting ENOMEM when PID limit
> is
> reached. We hit a situation when OSD trying to create new thread was
> killed and reports 'Cannot allocate memory' (12 OSD per node created
> more than 32k threads).
> 

Thanks! For now I don't see more than 5k threads on the nodes with 12
OSDs, but maybe during recovery/backfilling?
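
For anyone who wants to watch the same thing, the checks are simply,
per node :

  ps -eL | wc -l
  sysctl kernel.pid_max

and if it ever becomes a problem, I suppose the limit can be raised
with something like "sysctl -w kernel.pid_max=4194304" (the hard
maximum on 64-bit, to be adjusted of course).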
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 12:04 +0200, Jan Schermer wrote:
> > On 18 Sep 2015, at 11:28, Christian Balzer  wrote:
> > 
> > On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
> > 
> > > Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit
> > > :
> > > > In that case it can either be slow monitors (slow network, slow
> > > > disks(!!!)  or a CPU or memory problem).
> > > > But it still can also be on the OSD side in the form of either
> > > > CPU
> > > > usage or memory pressure - in my case there were lots of memory
> > > > used
> > > > for pagecache (so for all intents and purposes considered
> > > > "free") but
> > > > when peering the OSD had trouble allocating any memory from it
> > > > and it
> > > > caused lots of slow ops and peering hanging in there for a
> > > > while.
> > > > This also doesn't show as high CPU usage, only kswapd spins up
> > > > a bit
> > > > (don't be fooled by its name, it has nothing to do with swap in
> > > > this
> > > > case).
> > > 
> > > My nodes have 256GB of RAM (for 12x300GB ones) or 128GB of RAM
> > > (for
> > > 4x800GB ones), so I will try track this too. Thanks !
> > > 
> > I haven't seen this (known problem) with 64GB or 128GB nodes,
> > probably
> > because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
> > respectively.
> > 
> 
> I had this set to 6G and that doesn't help. This "buffer" is probably
> only useful for some atomic allocations that can use it, not for
> userland processes and their memory. Or maybe they get memory from
> this pool but it gets replenished immediately.
> QEMU has no problem allocating 64G on the same host, OSD struggles to
> allocate memory during startup or when PGs are added during
> rebalancing - probably because it does a lot of smaller allocations
> instead of one big.
> 

For now I dropped the caches *and* set min_free_kbytes to 1GB. I have
not triggered any rebalance yet, but I can already see a reduced
filestore.commitcycle_latency.
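
Concretely, that amounts to something like this on each node (1GB
expressed in kB, to be adjusted of course) :

  echo 1 > /proc/sys/vm/drop_caches
  sysctl -w vm.min_free_kbytes=1048576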

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 10:59 +0200, Jan Schermer wrote:
> In that case it can either be slow monitors (slow network, slow
> disks(!!!)  or a CPU or memory problem).
> But it still can also be on the OSD side in the form of either CPU
> usage or memory pressure - in my case there were lots of memory used
> for pagecache (so for all intents and purposes considered "free") but
> when peering the OSD had trouble allocating any memory from it and it
> caused lots of slow ops and peering hanging in there for a while.
> This also doesn't show as high CPU usage, only kswapd spins up a bit
> (don't be fooled by its name, it has nothing to do with swap in this
> case).

My nodes have 256GB of RAM (for the 12x300GB ones) or 128GB of RAM (for
the 4x800GB ones), so I will try to track this too. Thanks!


> echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a
> routine now and that problem is gone.
> 
> Jan
> 
> > On 18 Sep 2015, at 10:53, Olivier Bonvalet 
> > wrote:
> > 
> > mmm good point.
> > 
> > I don't see CPU or IO problem on mons, but in logs, I have this :
> > 
> > 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap
> > v86359128:
> > 6632 pgs: 77 inactive, 1 remapped, 10
> > active+remapped+wait_backfill, 25
> > peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> > active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used,
> > 58578
> > GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s;
> > 8417/15680513
> > objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> > 
> > 
> > So... it can be a peering problem. Didn't see that, thanks.
> > 
> > 
> > 
> > Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
> > > Could this be caused by monitors? In my case lagging monitors can
> > > also cause slow requests (because of slow peering). Not sure if
> > > that's expected or not, but it of course doesn't show on the OSDs
> > > as
> > > any kind of bottleneck when you try to investigate...
> > > 
> > > Jan
> > > 
> > > > On 18 Sep 2015, at 09:37, Olivier Bonvalet  > > > >
> > > > wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > sorry for missing informations. I was to avoid putting too much
> > > > inappropriate infos ;)
> > > > 
> > > > 
> > > > 
> > > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > > écrit :
> > > > > Hello,
> > > > > 
> > > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > > 
> > > > > The items below help, but be a s specific as possible, from
> > > > > OS,
> > > > > kernel
> > > > > version to Ceph version, "ceph -s", any other specific
> > > > > details
> > > > > (pool
> > > > > type,
> > > > > replica size).
> > > > > 
> > > > 
> > > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > > kernel,
> > > > and Ceph 0.80.10.
> > > > I don't have anymore ceph status right now. But I have
> > > > data to move tonight again, so I'll track that.
> > > > 
> > > > The affected pool is a standard one (no erasure coding), with
> > > > only
> > > > 2 replica (size=2).
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > > Some additionnal informations :
> > > > > > - I have 4 SSD per node.
> > > > > Type, if nothing else for anecdotal reasons.
> > > > 
> > > > I have 7 storage nodes here :
> > > > - 3 nodes which have each 12 OSD of 300GB
> > > > SSD
> > > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > > 
> > > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > > 
> > > > 
> > > > 
> > > > > > - the CPU usage is near 0
> > > > > > - IO wait is near 0 too
> > > > > Including the trouble OSD(s)?
> > > > 
> > > > Yes
> > > > 
> > > > 
> > > > > Measured how, iostat or atop?
> > > > 
> > > > iostat, htop, and confirmed with Zabbix supervisor.
> > > > 
> > > > 
> > > > 
> > > >

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Mmm, good point.

I don't see a CPU or IO problem on the mons, but in the logs I have this:

2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap v86359128:
6632 pgs: 77 inactive, 1 remapped, 10 active+remapped+wait_backfill, 25
peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used, 58578
GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s; 8417/15680513
objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering


So... it could be a peering problem. I hadn't noticed that, thanks.



On Friday, 18 September 2015 at 09:52 +0200, Jan Schermer wrote:
> Could this be caused by monitors? In my case lagging monitors can
> also cause slow requests (because of slow peering). Not sure if
> that's expected or not, but it of course doesn't show on the OSDs as
> any kind of bottleneck when you try to investigate...
> 
> Jan
> 
> > On 18 Sep 2015, at 09:37, Olivier Bonvalet 
> > wrote:
> > 
> > Hi,
> > 
> > sorry for missing informations. I was to avoid putting too much
> > inappropriate infos ;)
> > 
> > 
> > 
> > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > écrit :
> > > Hello,
> > > 
> > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > 
> > > The items below help, but be a s specific as possible, from OS,
> > > kernel
> > > version to Ceph version, "ceph -s", any other specific details
> > > (pool
> > > type,
> > > replica size).
> > > 
> > 
> > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > kernel,
> > and Ceph 0.80.10.
> > I don't have anymore ceph status right now. But I have
> > data to move tonight again, so I'll track that.
> > 
> > The affected pool is a standard one (no erasure coding), with only
> > 2 replica (size=2).
> > 
> > 
> > 
> > 
> > > > Some additionnal informations :
> > > > - I have 4 SSD per node.
> > > Type, if nothing else for anecdotal reasons.
> > 
> > I have 7 storage nodes here :
> > - 3 nodes which have each 12 OSD of 300GB
> > SSD
> > - 4 nodes which have each  4 OSD of 800GB SSD
> > 
> > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > 
> > 
> > 
> > > > - the CPU usage is near 0
> > > > - IO wait is near 0 too
> > > Including the trouble OSD(s)?
> > 
> > Yes
> > 
> > 
> > > Measured how, iostat or atop?
> > 
> > iostat, htop, and confirmed with Zabbix supervisor.
> > 
> > 
> > 
> > 
> > > > - bandwith usage is also near 0
> > > > 
> > > Yeah, all of the above are not surprising if everything is stuck
> > > waiting
> > > on some ops to finish. 
> > > 
> > > How many nodes are we talking about?
> > 
> > 
> > 7 nodes, 52 OSDs.
> > 
> > 
> > 
> > > > The whole cluster seems waiting for something... but I don't
> > > > see
> > > > what.
> > > > 
> > > Is it just one specific OSD (or a set of them) or is that all
> > > over
> > > the
> > > place?
> > 
> > A set of them. When I increase the weight of all 4 OSDs of a node,
> > I
> > frequently have blocked IO from 1 OSD of this node.
> > 
> > 
> > 
> > > Does restarting the OSD fix things?
> > 
> > Yes. For several minutes.
> > 
> > 
> > > Christian
> > > > 
> > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > > écrit :
> > > > > Hi,
> > > > > 
> > > > > I have a cluster with lot of blocked operations each time I
> > > > > try
> > > > > to
> > > > > move
> > > > > data (by reweighting a little an OSD).
> > > > > 
> > > > > It's a full SSD cluster, with 10GbE network.
> > > > > 
> > > > > In logs, when I have blocked OSD, on the main OSD I can see
> > > > > that
> > > > > :
> > > > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > > requests, 1 included below; oldest blocked for > 33.976680
> > > > > secs
> > > > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
> > > > > request
> > > > > 30.125556 seconds old, received at 2015-09

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
On Friday, 18 September 2015 at 17:04 +0900, Christian Balzer wrote:
> Hello,
> 
> On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> 
> > Hi,
> > 
> > sorry for missing informations. I was to avoid putting too much
> > inappropriate infos ;)
> > 
> Nah, everything helps, there are known problems with some versions,
> kernels, file systems, etc.
> 
> Speaking of which, what FS are you using on your OSDs?
> 

XFS.

> > 
> > 
> > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > écrit :
> > > Hello,
> > > 
> > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > 
> > > The items below help, but be a s specific as possible, from OS,
> > > kernel
> > > version to Ceph version, "ceph -s", any other specific details
> > > (pool
> > > type,
> > > replica size).
> > > 
> > 
> > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > kernel,
> > and Ceph 0.80.10.
> All my stuff is on Jessie, but at least Firefly should be stable and
> I
> haven't seen anything like your problem with it.
> And while 3.14 is a LTS kernel I wonder if something newer may be
> beneficial, but probably not.
> 

Well, I can try a 3.18.x kernel. But for that I have to restart all
the nodes, which will trigger some backfilling and probably some
blocked IO too ;)


> > I don't have anymore ceph status right now. But I have
> > data to move tonight again, so I'll track that.
> > 
> I was interested in that to see how many pools and PGs you have.

Well :

cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
 health HEALTH_OK
 monmap e21: 3 mons at 
{faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0}, 
election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh
 osdmap e847496: 88 osds: 88 up, 87 in
  pgmap v86390609: 6632 pgs, 16 pools, 18883 GB data, 5266 kobjects
68559 GB used, 59023 GB / 124 TB avail
6632 active+clean
  client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s


There are mainly 2 pools used: an "ssd" pool and an "hdd" pool. The
hdd pool uses different OSDs, on different nodes.
Since I don't often rebalance data in the hdd pool, I don't see the
problem on it yet.



> >  The affected pool is a standard one (no erasure coding), with only
> > 2
> > replica (size=2).
> > 
> Good, nothing fancy going on there then.
> 
> > 
> > 
> > 
> > > > Some additionnal informations :
> > > > - I have 4 SSD per node.
> > > Type, if nothing else for anecdotal reasons.
> > 
> > I have 7 storage nodes here :
> > - 3 nodes which have each 12 OSD of 300GB
> > SSD
> > - 4 nodes which have each  4 OSD of 800GB SSD
> > 
> > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > 
> Type as in model/maker, but helpful information.
> 

300GB models are Intel SSDSC2BB300G4 (DC S3500).
800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).




> > 
> > 
> > > > - the CPU usage is near 0
> > > > - IO wait is near 0 too
> > > Including the trouble OSD(s)?
> > 
> > Yes
> > 
> > 
> > > Measured how, iostat or atop?
> > 
> > iostat, htop, and confirmed with Zabbix supervisor.
> > 
> 
> Good. I'm sure you checked for network errors. 
> Single network or split client/cluster network?
> 

It's the first thing I checked; latency and packet loss are monitored
between each node and the mons, but maybe I missed some checks.


> > 
> > 
> > 
> > > > - bandwith usage is also near 0
> > > > 
> > > Yeah, all of the above are not surprising if everything is stuck
> > > waiting
> > > on some ops to finish. 
> > > 
> > > How many nodes are we talking about?
> > 
> > 
> > 7 nodes, 52 OSDs.
> > 
> That be below the threshold for most system tunables (there are
> various
> threads and articles on how to tune Ceph for "large" clusters).
> 
> Since this happens only when your cluster reshuffles data (and thus
> has
> more threads going) what is your ulimit setting for open files?


Wow... the default one on Debian Wheezy : 1024.
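
If that turns out to matter, I suppose the fix is either raising the
nofile limit for the ceph daemons, or letting them raise it themselves
via ceph.conf (this is just the standard option, I haven't tested it
here yet) :

  [global]
      max open files = 131072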



> > 
> > 
> > > > The whole cluster seems waiting for something... but I don't
> > > > see
> > > > what.
> > > > 
> > > Is it just one specific OSD (or a set of them) or is that all
> > > over
> > > the
> > > place?
> > 
> > A set of them. When I in

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Hi,

sorry for the missing information. I was trying to avoid including too
much irrelevant detail ;)



On Friday, 18 September 2015 at 12:30 +0900, Christian Balzer wrote:
> Hello,
> 
> On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> 
> The items below help, but be a s specific as possible, from OS,
> kernel
> version to Ceph version, "ceph -s", any other specific details (pool
> type,
> replica size).
> 

So, all nodes use Debian Wheezy, running on a vanilla 3.14.x kernel,
and Ceph 0.80.10.
I don't have a ceph status at hand right now, but I have data to move
again tonight, so I'll track that.

The affected pool is a standard one (no erasure coding), with only 2
replicas (size=2).




> > Some additionnal informations :
> > - I have 4 SSD per node.
> Type, if nothing else for anecdotal reasons.

I have 7 storage nodes here:
- 3 nodes which each have 12 OSDs on 300GB SSDs
- 4 nodes which each have 4 OSDs on 800GB SSDs

And I'm trying to replace the 12x300GB nodes with the 4x800GB nodes.



> > - the CPU usage is near 0
> > - IO wait is near 0 too
> Including the trouble OSD(s)?

Yes


> Measured how, iostat or atop?

iostat and htop, and confirmed with our Zabbix monitoring.




> > - bandwith usage is also near 0
> > 
> Yeah, all of the above are not surprising if everything is stuck
> waiting
> on some ops to finish. 
> 
> How many nodes are we talking about?


7 nodes, 52 OSDs.



> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> Is it just one specific OSD (or a set of them) or is that all over
> the
> place?

A set of them. When I increase the weight of all 4 OSDs of a node, I
frequently have blocked IO from 1 OSD of this node.



> Does restarting the OSD fix things?

Yes. For several minutes.


> Christian
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
But yes, I will try to increase OSD verbosity.
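
Probably something like this on the primary OSD of a blocked PG
(osd.15 is only an example), and back to the defaults afterwards :

  ceph tell osd.15 injectargs '--debug-osd 20 --debug-ms 1'
  # ... reproduce the blocked requests ...
  ceph tell osd.15 injectargs '--debug-osd 0/5 --debug-ms 0/5'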

On Thursday, 17 September 2015 at 20:28 -0700, GuangYang wrote:
> Which version are you using?
> 
> My guess is that the request (op) is waiting for lock (might be
> ondisk_read_lock of the object, but a debug_osd=20 should be helpful
> to tell what happened to the op).
> 
> How do you tell the IO wait is near to 0 (by top?)? 
> 
> Thanks,
> Guang
> 
> > From: ceph.l...@daevel.fr
> > To: ceph-users@lists.ceph.com
> > Date: Fri, 18 Sep 2015 02:43:49 +0200
> > Subject: Re: [ceph-users] Lot of blocked operations
> > 
> > Some additionnal informations :
> > - I have 4 SSD per node.
> > - the CPU usage is near 0
> > - IO wait is near 0 too
> > - bandwith usage is also near 0
> > 
> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
I use Ceph 0.80.10.

I see that IO wait is near 0 thanks to iostat and htop (in detailed
mode), and I rechecked with our Zabbix monitoring.
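
For iostat it's nothing more fancy than something like :

  iostat -x 5

watching the await and %util columns of the SSDs.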


On Thursday, 17 September 2015 at 20:28 -0700, GuangYang wrote:
> Which version are you using?
> 
> My guess is that the request (op) is waiting for lock (might be
> ondisk_read_lock of the object, but a debug_osd=20 should be helpful
> to tell what happened to the op).
> 
> How do you tell the IO wait is near to 0 (by top?)? 
> 
> Thanks,
> Guang
> 
> > From: ceph.l...@daevel.fr
> > To: ceph-users@lists.ceph.com
> > Date: Fri, 18 Sep 2015 02:43:49 +0200
> > Subject: Re: [ceph-users] Lot of blocked operations
> > 
> > Some additionnal informations :
> > - I have 4 SSD per node.
> > - the CPU usage is near 0
> > - IO wait is near 0 too
> > - bandwith usage is also near 0
> > 
> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for> 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread Olivier Bonvalet
Some additional information:
- I have 4 SSDs per node.
- The CPU usage is near 0.
- IO wait is near 0 too.
- Bandwidth usage is also near 0.

The whole cluster seems to be waiting for something... but I don't see
what.


On Friday, 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> I have a cluster with lot of blocked operations each time I try to
> move
> data (by reweighting a little an OSD).
> 
> It's a full SSD cluster, with 10GbE network.
> 
> In logs, when I have blocked OSD, on the main OSD I can see that :
> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 33.976680 secs
> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 63.981596 secs
> 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 
> How should I read that ? What this OSD is waiting for ?
> 
> Thanks for any help,
> 
> Olivier
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lot of blocked operations

2015-09-17 Thread Olivier Bonvalet
Hi,

I have a cluster with a lot of blocked operations each time I try to
move data (by slightly reweighting an OSD).

It's a full SSD cluster, with 10GbE network.

In the logs, when I have a blocked OSD, on the primary OSD I can see
this:
2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow requests, 1 
included below; oldest blocked for > 33.976680 secs
2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request 30.125556 
seconds old, received at 2015-09-18 01:54:46.855821: 
osd_op(client.29760717.1:18680817544 rb.0.1c16005.238e1f29.027f [write 
180224~16384] 6.c11916a4 snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) 
v4 currently reached pg
2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow requests, 1 
included below; oldest blocked for > 63.981596 secs
2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request 60.130472 
seconds old, received at 2015-09-18 01:54:46.855821: 
osd_op(client.29760717.1:18680817544 rb.0.1c16005.238e1f29.027f [write 
180224~16384] 6.c11916a4 snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) 
v4 currently reached pg

How should I read that? What is this OSD waiting for?
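
In case it's useful, I can also dump what the OSD thinks it is doing
through its admin socket, on the OSD's host (N to be replaced by the
id of the primary OSD of the blocked PG) :

  ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_ops_in_flight
  ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_historic_ops

but I'm not sure how to interpret the "currently reached pg" state
there either.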

Thanks for any help,

Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Firefly 0.80.10 ready to upgrade to?

2015-07-21 Thread Olivier Bonvalet
On Tuesday, 21 July 2015 at 07:06 -0700, Sage Weil wrote:
> On Tue, 21 Jul 2015, Olivier Bonvalet wrote:
> > Le lundi 13 juillet 2015 à 11:31 +0100, Gregory Farnum a écrit :
> > > On Mon, Jul 13, 2015 at 11:25 AM, Kostis Fardelas <
> > > dante1...@gmail.com> wrote:
> > > > Hello,
> > > > it seems that new packages for firefly have been uploaded to 
> repo.
> > > > However, I can't find any details in Ceph Release notes. There 
> is 
> > > > only
> > > > one thread in ceph-devel [1], but it is not clear what this new
> > > > version is about. Is it safe to upgrade from 0.80.9 to 0.80.10?
> > > 
> > > These packages got created and uploaded to the repository without
> > > release notes. I'm not sure why but I believe they're safe to 
> use.
> > > Hopefully Sage and our release guys can resolve that soon as 
> we've
> > > gotten several queries on the subject. :)
> > > -Greg
> > > ___
> > 
> > 
> > Hi,
> > 
> > any update on that point ? Packages were uploaded on repositories 
> one
> > month ago.
> > 
> > I would appreciate a confirmation "go!" or "NO go!" ;)
> 
> Sorry, I was sick and this dropped off my list.  I'll post the 
> release 
> notes today.
> 
> Thanks!
> sage

Great, I'll take that as a "go!".

Thanks Sage :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Firefly 0.80.10 ready to upgrade to?

2015-07-21 Thread Olivier Bonvalet
On Monday, 13 July 2015 at 11:31 +0100, Gregory Farnum wrote:
> On Mon, Jul 13, 2015 at 11:25 AM, Kostis Fardelas <
> dante1...@gmail.com> wrote:
> > Hello,
> > it seems that new packages for firefly have been uploaded to repo.
> > However, I can't find any details in Ceph Release notes. There is 
> > only
> > one thread in ceph-devel [1], but it is not clear what this new
> > version is about. Is it safe to upgrade from 0.80.9 to 0.80.10?
> 
> These packages got created and uploaded to the repository without
> release notes. I'm not sure why but I believe they're safe to use.
> Hopefully Sage and our release guys can resolve that soon as we've
> gotten several queries on the subject. :)
> -Greg
> ___


Hi,

any update on that point? The packages were uploaded to the
repositories one month ago.

I would appreciate a confirmation: "go!" or "NO go!" ;)

thanks,
Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Olivier Bonvalet
Hi,

On Monday, 23 March 2015 at 07:29 -0700, Gregory Farnum wrote:
> On Mon, Mar 23, 2015 at 6:21 AM, Olivier Bonvalet  wrote:
> > Hi,
> >
> > I'm still trying to find why there is much more write operations on
> > filestore since Emperor/Firefly than from Dumpling.
> 
> Do you have any history around this? It doesn't sound familiar,
> although I bet it's because of the WBThrottle and flushing changes.

I only have history for the block device stats and the global stats
reported by «ceph status».
When I upgraded from Dumpling to Firefly (via Emperor), write
operations increased a lot on the OSDs.
I suppose it's because of the WBThrottle too, but I can't find any
parameter that would confirm it.
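
I suppose the closest thing is to dump the filestore options from a
running OSD and compare them with the Dumpling defaults, something
like this (osd.70 being the OSD from the graphs) :

  ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | grep wbthrottle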


> >
> > So, I add monitoring of all perf counters values from OSD.
> >
> > From what I see : «filestore.ops» reports an average of 78 operations
> > per seconds. But, block device monitoring reports an average of 113
> > operations per seconds (+45%).
> > please thoses 2 graphs :
> > - https://daevel.fr/img/firefly/osd-70.filestore-ops.png
> > - https://daevel.fr/img/firefly/osd-70.sda-ops.png
> 
> That's unfortunate but perhaps not surprising — any filestore op can
> change a backing file (which requires hitting both the file and the
> inode: potentially two disk seeks), as well as adding entries to the
> leveldb instance.
> -Greg
> 

Ok thanks, so this part can be «normal».

> >
> > Do you see what can explain this difference ? (this OSD use XFS)
> >
> > Thanks,
> > Olivier
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More writes on blockdevice than on filestore ?

2015-03-23 Thread Olivier Bonvalet
Erg... I sent too fast. Bad title; please read «More writes on the
block device than on the filestore».


On Monday, 23 March 2015 at 14:21 +0100, Olivier Bonvalet wrote:
> Hi,
> 
> I'm still trying to find why there is much more write operations on
> filestore since Emperor/Firefly than from Dumpling.
> 
> So, I add monitoring of all perf counters values from OSD.
> 
> From what I see : «filestore.ops» reports an average of 78 operations
> per seconds. But, block device monitoring reports an average of 113
> operations per seconds (+45%).
> please thoses 2 graphs :
> - https://daevel.fr/img/firefly/osd-70.filestore-ops.png
> - https://daevel.fr/img/firefly/osd-70.sda-ops.png
> 
> Do you see what can explain this difference ? (this OSD use XFS)
> 
> Thanks,
> Olivier
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Olivier Bonvalet
Hi,

I'm still trying to find out why there are many more write operations
on the filestore since Emperor/Firefly than with Dumpling.

So, I added monitoring of all the perf counter values from the OSDs.

From what I see, «filestore.ops» reports an average of 78 operations
per second, but block device monitoring reports an average of 113
operations per second (+45%).
Please see those 2 graphs:
- https://daevel.fr/img/firefly/osd-70.filestore-ops.png
- https://daevel.fr/img/firefly/osd-70.sda-ops.png

Do you see what can explain this difference? (This OSD uses XFS.)
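
(For reference, the filestore counters come from the OSD admin socket,
roughly :

  ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump

and the block device side comes from the standard disk stats of sda.)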

Thanks,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Yes, good idea.

I was looking at the «WBThrottle» feature, but I'll go for logging instead.


On Wednesday, 4 March 2015 at 17:10 +0100, Alexandre DERUMIER wrote:
> >>Only writes ;) 
> 
> ok, so maybe some background operations (snap triming, scrubing...).
> 
> maybe debug_osd=20 , could give you more logs ?
> 
> 
> ----- Mail original -
> De: "Olivier Bonvalet" 
> À: "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Mercredi 4 Mars 2015 16:42:13
> Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
> 
> Only writes ;) 
> 
> 
> Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit : 
> > >>The change is only on OSD (and not on OSD journal). 
> > 
> > do you see twice iops for read and write ? 
> > 
> > if only read, maybe a read ahead bug could explain this. 
> > 
> > - Mail original - 
> > De: "Olivier Bonvalet"  
> > À: "aderumier"  
> > Cc: "ceph-users"  
> > Envoyé: Mercredi 4 Mars 2015 15:13:30 
> > Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly 
> > 
> > Ceph health is OK yes. 
> > 
> > The «firefly-upgrade-cluster-IO.png» graph is about IO stats seen by 
> > ceph : there is no change between dumpling and firefly. The change is 
> > only on OSD (and not on OSD journal). 
> > 
> > 
> > Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit : 
> > > >>The load problem is permanent : I have twice IO/s on HDD since firefly. 
> > > 
> > > Oh, permanent, that's strange. (If you don't see more traffic coming from 
> > > clients, I don't understand...) 
> > > 
> > > do you see also twice ios/ ops in "ceph -w " stats ? 
> > > 
> > > is the ceph health ok ? 
> > > 
> > > 
> > > 
> > > - Mail original - 
> > > De: "Olivier Bonvalet"  
> > > À: "aderumier"  
> > > Cc: "ceph-users"  
> > > Envoyé: Mercredi 4 Mars 2015 14:49:41 
> > > Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to 
> > > firefly 
> > > 
> > > Thanks Alexandre. 
> > > 
> > > The load problem is permanent : I have twice IO/s on HDD since firefly. 
> > > And yes, the problem hang the production at night during snap trimming. 
> > > 
> > > I suppose there is a new OSD parameter which change behavior of the 
> > > journal, or something like that. But didn't find anything about that. 
> > > 
> > > Olivier 
> > > 
> > > Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit : 
> > > > Hi, 
> > > > 
> > > > maybe this is related ?: 
> > > > 
> > > > http://tracker.ceph.com/issues/9503 
> > > > "Dumpling: removing many snapshots in a short time makes OSDs go 
> > > > berserk" 
> > > > 
> > > > http://tracker.ceph.com/issues/9487 
> > > > "dumpling: snaptrimmer causes slow requests while backfilling. 
> > > > osd_snap_trim_sleep not helping" 
> > > > 
> > > > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
> > > >  
> > > > 
> > > > 
> > > > 
> > > > I think it's already backport in dumpling, not sure it's already done 
> > > > for firefly 
> > > > 
> > > > 
> > > > Alexandre 
> > > > 
> > > > 
> > > > 
> > > > - Mail original - 
> > > > De: "Olivier Bonvalet"  
> > > > À: "ceph-users"  
> > > > Envoyé: Mercredi 4 Mars 2015 12:10:30 
> > > > Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly 
> > > > 
> > > > Hi, 
> > > > 
> > > > last saturday I upgraded my production cluster from dumpling to emperor 
> > > > (since we were successfully using it on a test cluster). 
> > > > A couple of hours later, we had falling OSD : some of them were marked 
> > > > as down by Ceph, probably because of IO starvation. I marked the 
> > > > cluster 
> > > > in «noout», start downed OSD, then let him recover. 24h later, same 
> > > > problem (near same hour). 
> > > > 
> > > > So, I choose to directly upgrade to firefly, which is maintained. 
> > > > Things are

Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Only writes ;)


On Wednesday, 4 March 2015 at 16:19 +0100, Alexandre DERUMIER wrote:
> >>The change is only on OSD (and not on OSD journal). 
> 
> do you see twice iops for read and write ?
> 
> if only read, maybe a read ahead bug could explain this. 
> 
> - Mail original -----
> De: "Olivier Bonvalet" 
> À: "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Mercredi 4 Mars 2015 15:13:30
> Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
> 
> Ceph health is OK yes. 
> 
> The «firefly-upgrade-cluster-IO.png» graph is about IO stats seen by 
> ceph : there is no change between dumpling and firefly. The change is 
> only on OSD (and not on OSD journal). 
> 
> 
> Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit : 
> > >>The load problem is permanent : I have twice IO/s on HDD since firefly. 
> > 
> > Oh, permanent, that's strange. (If you don't see more traffic coming from 
> > clients, I don't understand...) 
> > 
> > do you see also twice ios/ ops in "ceph -w " stats ? 
> > 
> > is the ceph health ok ? 
> > 
> > 
> > 
> > - Mail original - 
> > De: "Olivier Bonvalet"  
> > À: "aderumier"  
> > Cc: "ceph-users"  
> > Envoyé: Mercredi 4 Mars 2015 14:49:41 
> > Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly 
> > 
> > Thanks Alexandre. 
> > 
> > The load problem is permanent : I have twice IO/s on HDD since firefly. 
> > And yes, the problem hang the production at night during snap trimming. 
> > 
> > I suppose there is a new OSD parameter which change behavior of the 
> > journal, or something like that. But didn't find anything about that. 
> > 
> > Olivier 
> > 
> > Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit : 
> > > Hi, 
> > > 
> > > maybe this is related ?: 
> > > 
> > > http://tracker.ceph.com/issues/9503 
> > > "Dumpling: removing many snapshots in a short time makes OSDs go berserk" 
> > > 
> > > http://tracker.ceph.com/issues/9487 
> > > "dumpling: snaptrimmer causes slow requests while backfilling. 
> > > osd_snap_trim_sleep not helping" 
> > > 
> > > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
> > >  
> > > 
> > > 
> > > 
> > > I think it's already backport in dumpling, not sure it's already done for 
> > > firefly 
> > > 
> > > 
> > > Alexandre 
> > > 
> > > 
> > > 
> > > - Mail original - 
> > > De: "Olivier Bonvalet"  
> > > À: "ceph-users"  
> > > Envoyé: Mercredi 4 Mars 2015 12:10:30 
> > > Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly 
> > > 
> > > Hi, 
> > > 
> > > last saturday I upgraded my production cluster from dumpling to emperor 
> > > (since we were successfully using it on a test cluster). 
> > > A couple of hours later, we had falling OSD : some of them were marked 
> > > as down by Ceph, probably because of IO starvation. I marked the cluster 
> > > in «noout», start downed OSD, then let him recover. 24h later, same 
> > > problem (near same hour). 
> > > 
> > > So, I choose to directly upgrade to firefly, which is maintained. 
> > > Things are better, but the cluster is slower than with dumpling. 
> > > 
> > > The main problem seems that OSD have twice more write operations par 
> > > second : 
> > > https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png 
> > > https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png 
> > > 
> > > But journal doesn't change (SSD dedicated to OSD70+71+72) : 
> > > https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png 
> > > 
> > > Neither node bandwidth : 
> > > https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png 
> > > 
> > > Or whole cluster IO activity : 
> > > https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png 
> > > 
> > > Some background : 
> > > The cluster is splitted in pools with «full SSD» OSD and «HDD+SSD 
> > > journal» OSD. Only «HDD+SSD» OSD seems to be affected. 
> > > 
> > > I have 9 OSD on «HDD+SSD» node, 9 HDD and 3 SSD, and only 3 «HDD+SSD» 
> > > nodes (so a total of 27 «HDD+SSD» OSD). 
> > > 
> > > The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= 
> > > «rbd snap rm» operations). 
> > > osd_snap_trim_sleep is setup to 0.8 since monthes. 
> > > Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It 
> > > doesn't seem to really help. 
> > > 
> > > The only thing which seems to help, is to reduce osd_disk_threads from 8 
> > > to 1. 
> > > 
> > > So. Any idea about what's happening ? 
> > > 
> > > Thanks for any help, 
> > > Olivier 
> > > 
> > > ___ 
> > > ceph-users mailing list 
> > > ceph-users@lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > > 
> > 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Ceph health is OK yes.

The «firefly-upgrade-cluster-IO.png» graph shows the IO stats as seen by
ceph : there is no change between dumpling and firefly. The change is
only on the OSDs (and not on the OSD journals).


On Wednesday, March 4, 2015 at 15:05 +0100, Alexandre DERUMIER wrote:
> >>The load problem is permanent : I have twice IO/s on HDD since firefly.
> 
> Oh, permanent, that's strange. (If you don't see more traffic coming from 
> clients, I don't understand...)
> 
> do you see also twice ios/ ops in "ceph -w " stats ?
> 
> is the ceph health ok ?
> 
> 
> 
> - Mail original -
> De: "Olivier Bonvalet" 
> À: "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Mercredi 4 Mars 2015 14:49:41
> Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
> 
> Thanks Alexandre. 
> 
> The load problem is permanent : I have twice IO/s on HDD since firefly. 
> And yes, the problem hang the production at night during snap trimming. 
> 
> I suppose there is a new OSD parameter which change behavior of the 
> journal, or something like that. But didn't find anything about that. 
> 
> Olivier 
> 
> Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit : 
> > Hi, 
> > 
> > maybe this is related ?: 
> > 
> > http://tracker.ceph.com/issues/9503 
> > "Dumpling: removing many snapshots in a short time makes OSDs go berserk" 
> > 
> > http://tracker.ceph.com/issues/9487 
> > "dumpling: snaptrimmer causes slow requests while backfilling. 
> > osd_snap_trim_sleep not helping" 
> > 
> > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
> >  
> > 
> > 
> > 
> > I think it's already backport in dumpling, not sure it's already done for 
> > firefly 
> > 
> > 
> > Alexandre 
> > 
> > 
> > 
> > - Mail original - 
> > De: "Olivier Bonvalet"  
> > À: "ceph-users"  
> > Envoyé: Mercredi 4 Mars 2015 12:10:30 
> > Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly 
> > 
> > Hi, 
> > 
> > last saturday I upgraded my production cluster from dumpling to emperor 
> > (since we were successfully using it on a test cluster). 
> > A couple of hours later, we had falling OSD : some of them were marked 
> > as down by Ceph, probably because of IO starvation. I marked the cluster 
> > in «noout», start downed OSD, then let him recover. 24h later, same 
> > problem (near same hour). 
> > 
> > So, I choose to directly upgrade to firefly, which is maintained. 
> > Things are better, but the cluster is slower than with dumpling. 
> > 
> > The main problem seems that OSD have twice more write operations par 
> > second : 
> > https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png 
> > https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png 
> > 
> > But journal doesn't change (SSD dedicated to OSD70+71+72) : 
> > https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png 
> > 
> > Neither node bandwidth : 
> > https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png 
> > 
> > Or whole cluster IO activity : 
> > https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png 
> > 
> > Some background : 
> > The cluster is splitted in pools with «full SSD» OSD and «HDD+SSD 
> > journal» OSD. Only «HDD+SSD» OSD seems to be affected. 
> > 
> > I have 9 OSD on «HDD+SSD» node, 9 HDD and 3 SSD, and only 3 «HDD+SSD» 
> > nodes (so a total of 27 «HDD+SSD» OSD). 
> > 
> > The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= 
> > «rbd snap rm» operations). 
> > osd_snap_trim_sleep is setup to 0.8 since monthes. 
> > Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It 
> > doesn't seem to really help. 
> > 
> > The only thing which seems to help, is to reduce osd_disk_threads from 8 
> > to 1. 
> > 
> > So. Any idea about what's happening ? 
> > 
> > Thanks for any help, 
> > Olivier 
> > 
> > ___ 
> > ceph-users mailing list 
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Thanks Alexandre.

The load problem is permanent : I see twice the IO/s on the HDDs since firefly.
And yes, the problem hangs production at night during snap trimming.

I suppose there is a new OSD parameter which changes the behavior of the
journal, or something like that. But I didn't find anything about that.
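
One way I can think of to check that would be to dump the full runtime config of a
firefly OSD and diff it against the same dump taken on a dumpling OSD (just a
sketch : the admin socket path is an example for osd.70, and the output format
differs a bit between releases, but a plain diff should still highlight changed
defaults) :

# ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show > /tmp/osd70-firefly.conf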

Olivier

On Wednesday, March 4, 2015 at 14:44 +0100, Alexandre DERUMIER wrote:
> Hi,
> 
> maybe this is related ?:
> 
> http://tracker.ceph.com/issues/9503
> "Dumpling: removing many snapshots in a short time makes OSDs go berserk"
> 
> http://tracker.ceph.com/issues/9487
> "dumpling: snaptrimmer causes slow requests while backfilling. 
> osd_snap_trim_sleep not helping"
> 
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
> 
> 
> 
> I think it's already backport in dumpling, not sure it's already done for 
> firefly
> 
> 
> Alexandre
> 
> 
> 
> - Mail original -
> De: "Olivier Bonvalet" 
> À: "ceph-users" 
> Envoyé: Mercredi 4 Mars 2015 12:10:30
> Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly
> 
> Hi, 
> 
> last saturday I upgraded my production cluster from dumpling to emperor 
> (since we were successfully using it on a test cluster). 
> A couple of hours later, we had falling OSD : some of them were marked 
> as down by Ceph, probably because of IO starvation. I marked the cluster 
> in «noout», start downed OSD, then let him recover. 24h later, same 
> problem (near same hour). 
> 
> So, I choose to directly upgrade to firefly, which is maintained. 
> Things are better, but the cluster is slower than with dumpling. 
> 
> The main problem seems that OSD have twice more write operations par 
> second : 
> https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png 
> https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png 
> 
> But journal doesn't change (SSD dedicated to OSD70+71+72) : 
> https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png 
> 
> Neither node bandwidth : 
> https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png 
> 
> Or whole cluster IO activity : 
> https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png 
> 
> Some background : 
> The cluster is splitted in pools with «full SSD» OSD and «HDD+SSD 
> journal» OSD. Only «HDD+SSD» OSD seems to be affected. 
> 
> I have 9 OSD on «HDD+SSD» node, 9 HDD and 3 SSD, and only 3 «HDD+SSD» 
> nodes (so a total of 27 «HDD+SSD» OSD). 
> 
> The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= 
> «rbd snap rm» operations). 
> osd_snap_trim_sleep is setup to 0.8 since monthes. 
> Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It 
> doesn't seem to really help. 
> 
> The only thing which seems to help, is to reduce osd_disk_threads from 8 
> to 1. 
> 
> So. Any idea about what's happening ? 
> 
> Thanks for any help, 
> Olivier 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Perf problem after upgrade from dumpling to firefly

2015-03-04 Thread Olivier Bonvalet
Hi,

last Saturday I upgraded my production cluster from dumpling to emperor
(since we had been using it successfully on a test cluster).
A couple of hours later, we had failing OSDs : some of them were marked
as down by Ceph, probably because of IO starvation. I set the cluster
to «noout», restarted the downed OSDs, then let them recover. 24h later, same
problem (at nearly the same hour).

So I chose to upgrade directly to firefly, which is maintained.
Things are better, but the cluster is slower than with dumpling.

The main problem seems to be that OSDs have twice as many write operations per
second :
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD70+71+72) :
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Nor the node bandwidth :
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole cluster IO activity :
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background :
The cluster is split into pools with «full SSD» OSDs and «HDD+SSD
journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected.

I have 9 OSDs per «HDD+SSD» node (9 HDDs and 3 SSDs), and only 3 «HDD+SSD»
nodes (so a total of 27 «HDD+SSD» OSDs).

The IO peak between 03h00 and 09h00 corresponds to the snapshot rotation (=
«rbd snap rm» operations).
osd_snap_trim_sleep has been set to 0.8 for months.
Yesterday I tried reducing osd_pg_max_concurrent_snap_trims to 1. It
doesn't seem to really help.

The only thing which seems to help is reducing osd_disk_threads from 8
to 1.
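
For reference, this is how I am applying those settings at runtime (just a sketch ;
osd_disk_threads may need an OSD restart to be fully taken into account, and the
admin socket path is an example for osd.70) :

# ceph tell osd.\* injectargs '--osd_snap_trim_sleep 0.8'
# ceph tell osd.\* injectargs '--osd_pg_max_concurrent_snap_trims 1'
# ceph tell osd.\* injectargs '--osd_disk_threads 1'
# ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | grep -E 'osd_disk_threads|snap_trim'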

So. Any idea about what's happening ?

Thanks for any help,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-03-03 Thread Olivier Bonvalet
On Tuesday, March 3, 2015 at 16:32 -0800, Sage Weil wrote:
> On Wed, 4 Mar 2015, Olivier Bonvalet wrote:
> > Does kernel client affected by the problem ?
> 
> Nope.  The kernel client is unaffected.. the issue is in librbd.
> 
> sage
> 


Ok, thanks for the clarification.
So I have to dig !


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-03-03 Thread Olivier Bonvalet
Is the kernel client affected by the problem ?

On Tuesday, March 3, 2015 at 15:19 -0800, Sage Weil wrote:
> Hi,
> 
> This is just a heads up that we've identified a performance regression in 
> v0.80.8 from previous firefly releases.  A v0.80.9 is working it's way 
> through QA and should be out in a few days.  If you haven't upgraded yet 
> you may want to wait.
> 
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd import-diff + erasure coding

2014-07-23 Thread Olivier Bonvalet
Ok, I just found this message from Gregory Farnum :
« You can't use erasure coded pools directly with RBD. They're only
suitable for use with RGW or as the base pool for a replicated cache
pool, and you need to be very careful/specific with the configuration. I
believe this is well-documented, so check it out! :) »

So, it's not usable to back up a production cluster. I have to use a
replicated pool.
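
For the record, the setup Gregory hints at (an erasure-coded base pool behind a
small replicated cache pool) would look roughly like this ; only a sketch, untested
on my side, pool and image names are placeholders, and cache sizing / eviction
settings are left out :

# ceph osd pool create cache-backup 128
# ceph osd tier add ec-backup cache-backup
# ceph osd tier cache-mode cache-backup writeback
# ceph osd tier set-overlay ec-backup cache-backup
# rbd export-diff rbd/myimage@snap1 - | rbd import-diff - ec-backup/myimage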

On Wednesday, July 23, 2014 at 17:51 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> from my tests, I can't import snapshot from a replicated pool (in
> cluster1) to an erasure-coding pool (in cluster2).
> 
> Is it a known limitation ? A temporary one ?
> Or did I make a mistake somewhere ?
> 
> The cluster1 (aka production) is running Ceph 0.67.9), and the cluster2
> (aka backup) is running Ceph 0.80.4.
> 
> Thanks for any help.
> 
> Olivier
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd import-diff + erasure coding

2014-07-23 Thread Olivier Bonvalet
Hi,

from my tests, I can't import snapshots from a replicated pool (in
cluster1) to an erasure-coded pool (in cluster2).

Is it a known limitation ? A temporary one ?
Or did I make a mistake somewhere ?

Cluster1 (aka production) is running Ceph 0.67.9, and cluster2
(aka backup) is running Ceph 0.80.4.

Thanks for any help.

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data still in OSD directories after removing

2014-05-22 Thread Olivier Bonvalet

On Wednesday, May 21, 2014 at 18:20 -0700, Josh Durgin wrote:
> On 05/21/2014 03:03 PM, Olivier Bonvalet wrote:
> > Le mercredi 21 mai 2014 à 08:20 -0700, Sage Weil a écrit :
> >> You're certain that that is the correct prefix for the rbd image you
> >> removed?  Do you see the objects lists when you do 'rados -p rbd ls - |
> >> grep '?
> >
> > I'm pretty sure yes : since I didn't see a lot of space freed by the
> > "rbd snap purge" command, I looked at the RBD prefix before to do the
> > "rbd rm" (it's not the first time I see that problem, but previous time
> > without the RBD prefix I was not able to check).
> >
> > So :
> > - "rados -p sas3copies ls - | grep rb.0.14bfb5a.238e1f29" return nothing
> > at all
> > - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
> >   error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.0002f026: No such
> > file or directory
> > - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.
> >   error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.: No such
> > file or directory
> > - # ls -al 
> > /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9
> > -rw-r--r-- 1 root root 4194304 oct.   8  2013 
> > /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9
> >
> >
> >> If the objects really are orphaned, teh way to clean them up is via 'rados
> >> -p rbd rm '.  I'd like to get to the bottom of how they ended
> >> up that way first, though!
> >
> > I suppose the problem came from me, by doing CTRL+C while "rbd snap
> > purge $IMG".
> > "rados rm -p sas3copies rb.0.14bfb5a.238e1f29.0002f026" don't remove
> > thoses files, and just answer with a "No such file or directory".
> 
> Those files are all for snapshots, which are removed by the osds
> asynchronously in a process called 'snap trimming'. There's no
> way to directly remove them via rados.
> 
> Since you stopped 'rbd snap purge' partway through, it may
> have removed the reference to the snapshot before removing
> the snapshot itself.
> 
> You can get a list of snapshot ids for the remaining objects
> via the 'rados listsnaps' command, and use
> rados_ioctx_selfmanaged_snap_remove() (no convenient wrapper
> unfortunately) on each of those snapshot ids to be sure they are all
> scheduled for asynchronous deletion.
> 
> Josh
> 

Great : "rados listsnaps" see it :
# rados listsnaps -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
rb.0.14bfb5a.238e1f29.0002f026:
cloneid snaps   sizeoverlap
41554   35746   4194304 []

So, I have to write&compile a wrapper to
rados_ioctx_selfmanaged_snap_remove(), and find a way to obtain a list
of all "orphan" objects ?

I also try to recreate the object (rados put) then remove it (rados rm),
but snapshots still here.
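
If it can help, here is the minimal wrapper sketch I have in mind (my own untested
attempt, error handling mostly omitted ; pool name and snap id would come from the
listsnaps output above) :

/* selfmanaged_snap_rm.c : build with "gcc -o selfmanaged_snap_rm selfmanaged_snap_rm.c -lrados" */
#include <stdio.h>
#include <stdlib.h>
#include <rados/librados.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pool> <snapid>\n", argv[0]);
        return 1;
    }

    rados_t cluster;
    rados_ioctx_t io;
    rados_snap_t snapid = strtoull(argv[2], NULL, 10);
    int ret;

    /* connect using the default /etc/ceph/ceph.conf and client.admin */
    ret = rados_create(&cluster, NULL);
    if (ret < 0) { fprintf(stderr, "rados_create failed: %d\n", ret); return 1; }
    rados_conf_read_file(cluster, NULL);
    ret = rados_connect(cluster);
    if (ret < 0) { fprintf(stderr, "rados_connect failed: %d\n", ret); return 1; }

    ret = rados_ioctx_create(cluster, argv[1], &io);
    if (ret < 0) { fprintf(stderr, "rados_ioctx_create failed: %d\n", ret); goto out; }

    /* ask the OSDs to schedule this self-managed snapshot for trimming */
    ret = rados_ioctx_selfmanaged_snap_remove(io, snapid);
    fprintf(stderr, "selfmanaged_snap_remove(%llu) = %d\n",
            (unsigned long long)snapid, ret);

    rados_ioctx_destroy(io);
out:
    rados_shutdown(cluster);
    return ret < 0 ? 1 : 0;
}

which I would then call with (if I read the listsnaps columns correctly, 35746 is
the snap id) :

# gcc -o selfmanaged_snap_rm selfmanaged_snap_rm.c -lrados
# ./selfmanaged_snap_rm sas3copies 35746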

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data still in OSD directories after removing

2014-05-21 Thread Olivier Bonvalet
On Wednesday, May 21, 2014 at 08:20 -0700, Sage Weil wrote:
> 
> You should definitely not do this!  :)

Of course ;)

> 
> You're certain that that is the correct prefix for the rbd image you 
> removed?  Do you see the objects lists when you do 'rados -p rbd ls - | 
> grep '?

I'm pretty sure, yes : since I didn't see a lot of space freed by the
"rbd snap purge" command, I looked at the RBD prefix before doing the
"rbd rm" (it's not the first time I've seen that problem, but the previous times,
without the RBD prefix, I was not able to check).

So : 
- "rados -p sas3copies ls - | grep rb.0.14bfb5a.238e1f29" return nothing
at all
- # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
 error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.0002f026: No such
file or directory
- # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.
 error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.: No such
file or directory
- # ls -al 
/var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9
-rw-r--r-- 1 root root 4194304 oct.   8  2013 
/var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9


> If the objects really are orphaned, teh way to clean them up is via 'rados 
> -p rbd rm '.  I'd like to get to the bottom of how they ended 
> up that way first, though!

I suppose the problem came from me, by doing CTRL+C during "rbd snap
purge $IMG".
"rados rm -p sas3copies rb.0.14bfb5a.238e1f29.0002f026" doesn't remove
those files, and just answers with a "No such file or directory".

Thanks,
Olivier



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data still in OSD directories after removing

2014-05-21 Thread Olivier Bonvalet
Hi,

I have a lot of space wasted by this problem (about 10GB per OSD, just
for this RBD image).
If OSDs can't detect orphan files, should I manually detect them, then
remove them ?

This command can do the job, at least for this image prefix :
find /var/lib/ceph/osd/ -name 'rb.0.14bfb5a.238e1f29.*' -delete
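
Before running it with -delete, a dry run to see what would match and how much
space it represents (same prefix, just counting and summing instead of deleting) :

# find /var/lib/ceph/osd/ -name 'rb.0.14bfb5a.238e1f29.*' | wc -l
# find /var/lib/ceph/osd/ -name 'rb.0.14bfb5a.238e1f29.*' -print0 | xargs -r -0 du -ch | tail -n1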

Thanks for any advice,
Olivier

PS : not sure if this kind of problem belongs on the user or the dev mailing
list.

On Tuesday, May 20, 2014 at 11:32 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> short : I removed a 1TB RBD image, but I still see files about it on
> OSD.
> 
> 
> long :
> 1) I did : "rbd snap purge $pool/$img"
>but since it overload the cluster, I stopped it (CTRL+C)
> 2) latter, "rbd snap purge $pool/$img"
> 3) then, "rbd rm $pool/$img"
> 
> now, on the disk I can found files of this v1 RBD image (prefix was
> rb.0.14bfb5a.238e1f29) :
> 
> # find /var/lib/ceph/osd/ceph-64/ -name 'rb.0.14bfb5a.238e1f29.*'
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.00021431__snapdir_C96635C1__9
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.5622__a252_32F435C1__9
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.00021431__a252_C96635C1__9
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.5622__snapdir_32F435C1__9
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_9/rb.0.14bfb5a.238e1f29.00011e08__a172_594495C1__9
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_9/rb.0.14bfb5a.238e1f29.00011e08__snapdir_594495C1__9
> /var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_A/rb.0.14bfb5a.238e1f29.00021620__a252_779FA5C1__9
> ...
> 
> 
> So, is there a way to force OSD to detect if files are orphans, then
> remove them ?
> 
> Thanks,
> Olivier
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Data still in OSD directories after removing

2014-05-20 Thread Olivier Bonvalet
Hi,

short : I removed a 1TB RBD image, but I still see its files on the
OSDs.


long :
1) I did : "rbd snap purge $pool/$img"
   but since it overload the cluster, I stopped it (CTRL+C)
2) latter, "rbd snap purge $pool/$img"
3) then, "rbd rm $pool/$img"

now, on the disk I can found files of this v1 RBD image (prefix was
rb.0.14bfb5a.238e1f29) :

# find /var/lib/ceph/osd/ceph-64/ -name 'rb.0.14bfb5a.238e1f29.*'
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.00021431__snapdir_C96635C1__9
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.5622__a252_32F435C1__9
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.00021431__a252_C96635C1__9
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_3/rb.0.14bfb5a.238e1f29.5622__snapdir_32F435C1__9
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_9/rb.0.14bfb5a.238e1f29.00011e08__a172_594495C1__9
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_9/rb.0.14bfb5a.238e1f29.00011e08__snapdir_594495C1__9
/var/lib/ceph/osd/ceph-64/current/9.5c1_head/DIR_1/DIR_C/DIR_5/DIR_A/rb.0.14bfb5a.238e1f29.00021620__a252_779FA5C1__9
...


So, is there a way to force the OSDs to detect orphan files, then
remove them ?

Thanks,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The Ceph disk I would like to have

2014-03-25 Thread Olivier Bonvalet
Hi,

not sure it's related to ceph... you should probably look at the ownCloud
project, no ?

Or use any S3/Swift client which will know how to exchange data with a
RADOS gateway.

On Tuesday, March 25, 2014 at 16:49 +0100, Loic Dachary wrote:
> Hi,
> 
> It's not available yet but ... are we far away ? 
> 
> I would like to go to the hardware store and buy a Ceph enabled disk, plug it 
> to my internet box and use it. I would buy another for my sister and pair it 
> with mine so we share photos and movies we like. My mail would go there too, 
> encrypted because I would not want my sister to read it. My brother in law is 
> likely to buy one if he knows he can pair with us and grow our family 
> storage. 
> 
> The user interface could be crude : when buying a disk and installing it 
> somewhere, a geek is involved. I would like it to be simple and reliable. 
> When discovering it for the first time I would like to say : "that makes 
> sense". I don't expect something that would magically work. The HOWTO install 
> / fix / diagnose would have to fit on a single page. 
> 
> Cheers
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance and disk usage of snapshots

2013-09-28 Thread Olivier Bonvalet
Hi,

On Tuesday, September 24, 2013 at 18:37 +0200, Corin Langosch wrote:
> Hi there,
> 
> do snapshots have an impact on write performance? I assume on each write all 
> snapshots have to get updated (cow) so the more snapshots exist the worse 
> write 
> performance will get?
> 

Not exactly : the first time a write touches a snapshotted block, yes,
that block (4MB by default) is duplicated on disk. So if you do 1
snapshot per RBD every day, each modified block will be duplicated only
once during the day. So, it's not a big impact.

But if you do frequent snapshots, one per hour for example, and your
workload is a lot of 8KB random writes (MySQL InnoDB...), then each of
these 8KB writes will trigger a 4MB duplication on disk. Which is a big write
amplification.
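
To give an order of magnitude of the worst case (simple arithmetic, assuming every
8KB write lands in a distinct 4MB object untouched since the last snapshot) :

1000 random 8KB writes   = ~8MB of client data
1000 cloned 4MB objects  = ~4000MB written per replica
→ an amplification of 4MB / 8KB = 512x per touched object, per replica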


> Is there any way to see how much disk space a snapshot occupies? I assume 
> because of cow snapshots start with 0 real disk usage and grow over time as 
> the 
> underlying object changes?

Well, since "rados df" and "ceph df" don't report correctly space used
by snapshots, no, you can't. Or I didn't find how !

Small example : you have a 8MB RBD, and make a snapshot on it. Then you
still have 8MB of space used. Then you write 8KB on the first block,
ceph duplicate that block and now you have 12MB used on disk. But ceph
will report 8MB + 8KB used, not 12MB.


Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-11 Thread Olivier Bonvalet
I removed some garbage about hosts faude / rurkh / murmillia (they were
temporarily added because the cluster was full). So here is the "clean" CRUSH map :


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 device24
device 25 device25
device 26 device26
device 27 device27
device 28 device28
device 29 device29
device 30 device30
device 31 device31
device 32 device32
device 33 device33
device 34 device34
device 35 device35
device 36 device36
device 37 device37
device 38 device38
device 39 device39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77
device 78 osd.78

# types
type 0 osd
type 1 host
type 2 rack
type 3 net
type 4 room
type 5 datacenter
type 6 root

# buckets
host dragan {
id -17  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.70 weight 2.720
item osd.71 weight 2.720
item osd.72 weight 2.720
item osd.73 weight 2.720
item osd.74 weight 2.720
item osd.75 weight 2.720
item osd.76 weight 2.720
item osd.77 weight 2.720
item osd.78 weight 2.720
}
rack SAS15B01 {
id -40  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item dragan weight 24.480
}
net SAS188-165-15 {
id -72  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS15B01 weight 24.480
}
room SASs15 {
id -90  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS188-165-15 weight 24.480
}
datacenter SASrbx1 {
id -100 # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SASs15 weight 24.480
}
host taman {
id -16  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.49 weight 2.720
item osd.62 weight 2.720
item osd.63 weight 2.720
item osd.64 weight 2.720
item osd.65 weight 2.720
item osd.66 weight 2.720
item osd.67 weight 2.720
item osd.68 weight 2.720
item osd.69 weight 2.720
}
rack SAS31A10 {
id -15  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item taman weight 24.480
}
net SAS178-33-62 {
id -14  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS31A10 weight 24.480
}
room SASs31 {
id -13  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS178-33-62 weight 24.480
}
host kaino {
id -9   # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.40 weight 2.720
item osd.41 weight 2.720
item osd.42 weight 2.720
item osd.43 weight 2.720
item osd.44 weight 2.720
item osd.45 weight 2.720
item osd.46 weight 2.720
item osd.47 weight 2.720
item osd.48 weight 2.720
}
rack SAS34A14 {
id -10  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item kaino weight 24.480
}
net SAS5-135-135 {
id -11  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS34A14 weight 24.480
}
room SASs34 {
id -12  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS5-135-135 weight 24.480
}
datacenter SASrbx2 {
id -101 # do not change unnecessarily
# weight 48.960
alg straw

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-11 Thread Olivier Bonvalet
Very simple test on a new pool "ssdtest", with 3 replicas, full SSD
(crush rule 3) :

# rbd create ssdtest/test-mysql --size 102400
# rbd map ssdtest/test-mysql
# dd if=/dev/zero of=/dev/rbd/ssdtest/test-mysql bs=4M count=500
# ceph df | grep ssdtest
ssdtest    10    2000M    0    502

host1:# du -skc /var/lib/ceph/osd/ceph-*/*/10.* | tail -n1
3135780 total
host2:# du -skc /var/lib/ceph/osd/ceph-*/*/10.* | tail -n1
3028804 total
→ so about 6020 MB on disk, which seems correct (and a find reports 739+767
files of 4MB, so it's also good).



First snapshot :

# rbd snap create ssdtest/test-mysql@s1
# dd if=/dev/zero of=/dev/rbd/ssdtest/test-mysql bs=4M count=250
# ceph df | grep ssdtest
ssdtest    10    3000M    0    752
2 * # du -skc /var/lib/ceph/osd/ceph-*/*/10.* | tail -n1
→ 9024 MB on disk, which is correct again.



Second snapshot :

# rbd snap create ssdtest/test-mysql@s2
Here I write 4KB into each of 100 different rados blocks :
# for I in '' 1 2 3 4 5 6 7 8 9 ; do for J in 0 1 2 3 4 5 6 7 8 9 ; do OFFSET=$I$J ; dd if=/dev/zero of=/dev/rbd/ssdtest/test-mysql bs=1k seek=$((OFFSET*4096)) count=4 ; done ; done
# ceph df | grep ssdtest
ssdtest    10    3000M    0    852

Here the "USED" column of "ceph df" is wrong. And on the disk I see
10226kB used.


So, for me the problem comes from "ceph df" (and "rados df"), which don't
correctly report the space used by partially rewritten objects.

Or is it XFS-related only ?
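
One way to double-check on a single object (a sketch : the block name prefix has to
be taken from "rbd info", and the object suffix below is just a placeholder for the
first object of the image) :

# rbd info ssdtest/test-mysql | grep block_name_prefix
# rados -p ssdtest listsnaps <block_name_prefix>.000000000000

The "overlap" column should then show how much of each clone is still shared with
the head object ; everything outside the overlap is real extra space on disk that
"ceph df" doesn't seem to count.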


On Wednesday, September 11, 2013 at 11:00 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> do you need more information about that ?
> 
> thanks,
> Olivier
> 
> Le mardi 10 septembre 2013 à 11:19 -0700, Samuel Just a écrit :
> > Can you post the rest of you crush map?
> > -Sam
> > 
> > On Tue, Sep 10, 2013 at 5:52 AM, Olivier Bonvalet  
> > wrote:
> > > I also checked that all files in that PG still are on that PG :
> > >
> > > for IMG in `find . -type f -printf '%f\n' | awk -F '__' '{ print $1 }' |
> > > sort --unique` ; do echo -n "$IMG "; ceph osd map ssd3copies $IMG | grep
> > > -v 6\\.31f ; echo ; done
> > >
> > > And all objects are referenced in rados (compared with "rados --pool
> > > ssd3copies ls rados.ssd3copies.dump").
> > >
> > >
> > >
> > > Le mardi 10 septembre 2013 à 13:46 +0200, Olivier Bonvalet a écrit :
> > >> Some additionnal informations : if I look on one PG only, for example
> > >> the 6.31f. "ceph pg dump" report a size of 616GB :
> > >>
> > >> # ceph pg dump | grep ^6\\. | awk '{ SUM+=($6/1024/1024) } END { print 
> > >> SUM }'
> > >> 631717
> > >>
> > >> But on disk, on the 3 replica I have :
> > >> # du -sh  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> > >> 1,3G  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> > >>
> > >> Since I was suspected a snapshot problem, I try to count only "head
> > >> files" :
> > >> # find /var/lib/ceph/osd/ceph-50/current/6.31f_head/ -type f -name 
> > >> '*head*' -print0 | xargs -r -0 du -hc | tail -n1
> > >> 448M  total
> > >>
> > >> and the content of the directory : http://pastebin.com/u73mTvjs
> > >>
> > >>
> > >> Le mardi 10 septembre 2013 à 10:31 +0200, Olivier Bonvalet a écrit :
> > >> > Hi,
> > >> >
> > >> > I have a space problem on a production cluster, like if there is unused
> > >> > data not freed : "ceph df" and "rados df" reports 613GB of data, and
> > >> > disk usage is 2640GB (with 3 replica). It should be near 1839GB.
> > >> >
> > >> >
> > >> > I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush
> > >> > rules to put pools on SAS or on SSD.
> > >> >
> > >> > My pools :
> > >> > # ceph osd dump | grep ^pool
> > >> > pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash 
> > >> > rjenkins pg_num 576 pgp_num 576 last_change 68315 owner 0 
> > >> > crash_replay_interval 45
> > >> > pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash 
> > >> > rjenkins pg_num 576 pgp_num 576 last_change 68317 owner 0
> > >> > pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash 
> > >> > rjenkins pg_num 576 pgp_num 576 last_chan

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-11 Thread Olivier Bonvalet
Hi,

do you need more information about that ?

thanks,
Olivier

On Tuesday, September 10, 2013 at 11:19 -0700, Samuel Just wrote:
> Can you post the rest of you crush map?
> -Sam
> 
> On Tue, Sep 10, 2013 at 5:52 AM, Olivier Bonvalet  wrote:
> > I also checked that all files in that PG still are on that PG :
> >
> > for IMG in `find . -type f -printf '%f\n' | awk -F '__' '{ print $1 }' |
> > sort --unique` ; do echo -n "$IMG "; ceph osd map ssd3copies $IMG | grep
> > -v 6\\.31f ; echo ; done
> >
> > And all objects are referenced in rados (compared with "rados --pool
> > ssd3copies ls rados.ssd3copies.dump").
> >
> >
> >
> > Le mardi 10 septembre 2013 à 13:46 +0200, Olivier Bonvalet a écrit :
> >> Some additionnal informations : if I look on one PG only, for example
> >> the 6.31f. "ceph pg dump" report a size of 616GB :
> >>
> >> # ceph pg dump | grep ^6\\. | awk '{ SUM+=($6/1024/1024) } END { print SUM 
> >> }'
> >> 631717
> >>
> >> But on disk, on the 3 replica I have :
> >> # du -sh  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> >> 1,3G  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> >>
> >> Since I was suspected a snapshot problem, I try to count only "head
> >> files" :
> >> # find /var/lib/ceph/osd/ceph-50/current/6.31f_head/ -type f -name 
> >> '*head*' -print0 | xargs -r -0 du -hc | tail -n1
> >> 448M  total
> >>
> >> and the content of the directory : http://pastebin.com/u73mTvjs
> >>
> >>
> >> Le mardi 10 septembre 2013 à 10:31 +0200, Olivier Bonvalet a écrit :
> >> > Hi,
> >> >
> >> > I have a space problem on a production cluster, like if there is unused
> >> > data not freed : "ceph df" and "rados df" reports 613GB of data, and
> >> > disk usage is 2640GB (with 3 replica). It should be near 1839GB.
> >> >
> >> >
> >> > I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush
> >> > rules to put pools on SAS or on SSD.
> >> >
> >> > My pools :
> >> > # ceph osd dump | grep ^pool
> >> > pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins 
> >> > pg_num 576 pgp_num 576 last_change 68315 owner 0 crash_replay_interval 45
> >> > pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash 
> >> > rjenkins pg_num 576 pgp_num 576 last_change 68317 owner 0
> >> > pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins 
> >> > pg_num 576 pgp_num 576 last_change 68321 owner 0
> >> > pool 3 'hdd3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash 
> >> > rjenkins pg_num 200 pgp_num 200 last_change 172933 owner 0
> >> > pool 6 'ssd3copies' rep size 3 min_size 1 crush_ruleset 7 object_hash 
> >> > rjenkins pg_num 800 pgp_num 800 last_change 172929 owner 0
> >> > pool 9 'sas3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash 
> >> > rjenkins pg_num 2048 pgp_num 2048 last_change 172935 owner 0
> >> >
> >> > Only hdd3copies, sas3copies and ssd3copies are really used :
> >> > # ceph df
> >> > GLOBAL:
> >> > SIZE   AVAIL  RAW USED %RAW USED
> >> > 76498G 51849G 24648G   32.22
> >> >
> >> > POOLS:
> >> > NAME   ID USED  %USED OBJECTS
> >> > data   0  46753 0 72
> >> > metadata   1  0 0 0
> >> > rbd2  8 0 1
> >> > hdd3copies 3  2724G 3.56  5190954
> >> > ssd3copies 6  613G  0.80  347668
> >> > sas3copies 9  3692G 4.83  764394
> >> >
> >> >
> >> > My CRUSH rules was :
> >> >
> >> > rule SASperHost {
> >> > ruleset 4
> >> > type replicated
> >> > min_size 1
> >> > max_size 10
> >> > step take SASroot
> >> > step chooseleaf firstn 0 type host
> >> > step emit
> >> > }
> >> >
> >> > and :
> >> >
> >> > rule SSDperOSD {
> >> > ruleset 3
> >> >   

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet

I removed some garbage about hosts faude / rurkh / murmillia (they were
temporarily added because the cluster was full). So here is the "clean" CRUSH map :


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 device24
device 25 device25
device 26 device26
device 27 device27
device 28 device28
device 29 device29
device 30 device30
device 31 device31
device 32 device32
device 33 device33
device 34 device34
device 35 device35
device 36 device36
device 37 device37
device 38 device38
device 39 device39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77
device 78 osd.78

# types
type 0 osd
type 1 host
type 2 rack
type 3 net
type 4 room
type 5 datacenter
type 6 root

# buckets
host dragan {
id -17  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.70 weight 2.720
item osd.71 weight 2.720
item osd.72 weight 2.720
item osd.73 weight 2.720
item osd.74 weight 2.720
item osd.75 weight 2.720
item osd.76 weight 2.720
item osd.77 weight 2.720
item osd.78 weight 2.720
}
rack SAS15B01 {
id -40  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item dragan weight 24.480
}
net SAS188-165-15 {
id -72  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS15B01 weight 24.480
}
room SASs15 {
id -90  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS188-165-15 weight 24.480
}
datacenter SASrbx1 {
id -100 # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SASs15 weight 24.480
}
host taman {
id -16  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.49 weight 2.720
item osd.62 weight 2.720
item osd.63 weight 2.720
item osd.64 weight 2.720
item osd.65 weight 2.720
item osd.66 weight 2.720
item osd.67 weight 2.720
item osd.68 weight 2.720
item osd.69 weight 2.720
}
rack SAS31A10 {
id -15  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item taman weight 24.480
}
net SAS178-33-62 {
id -14  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS31A10 weight 24.480
}
room SASs31 {
id -13  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS178-33-62 weight 24.480
}
host kaino {
id -9   # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.40 weight 2.720
item osd.41 weight 2.720
item osd.42 weight 2.720
item osd.43 weight 2.720
item osd.44 weight 2.720
item osd.45 weight 2.720
item osd.46 weight 2.720
item osd.47 weight 2.720
item osd.48 weight 2.720
}
rack SAS34A14 {
id -10  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item kaino weight 24.480
}
net SAS5-135-135 {
id -11  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS34A14 weight 24.480
}
room SASs34 {
id -12  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS5-135-135 weight 24.480
}
datacenter SASrbx2 {
id -101 # do not change unnecessarily
# weight 48.960
alg stra

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
   




On Tuesday, September 10, 2013 at 21:01 +0200, Olivier Bonvalet wrote:
> I removed some garbage about hosts faude / rurkh / murmillia (they was
> temporarily added because cluster was full). So the "clean" CRUSH map :
> 
> 
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> 
> # devices
> device 0 device0
> device 1 device1
> device 2 device2
> device 3 device3
> device 4 device4
> device 5 device5
> device 6 device6
> device 7 device7
> device 8 device8
> device 9 device9
> device 10 device10
> device 11 device11
> device 12 device12
> device 13 device13
> device 14 device14
> device 15 device15
> device 16 device16
> device 17 device17
> device 18 device18
> device 19 device19
> device 20 device20
> device 21 device21
> device 22 device22
> device 23 device23
> device 24 device24
> device 25 device25
> device 26 device26
> device 27 device27
> device 28 device28
> device 29 device29
> device 30 device30
> device 31 device31
> device 32 device32
> device 33 device33
> device 34 device34
> device 35 device35
> device 36 device36
> device 37 device37
> device 38 device38
> device 39 device39
> device 40 osd.40
> device 41 osd.41
> device 42 osd.42
> device 43 osd.43
> device 44 osd.44
> device 45 osd.45
> device 46 osd.46
> device 47 osd.47
> device 48 osd.48
> device 49 osd.49
> device 50 osd.50
> device 51 osd.51
> device 52 osd.52
> device 53 osd.53
> device 54 osd.54
> device 55 osd.55
> device 56 osd.56
> device 57 osd.57
> device 58 osd.58
> device 59 osd.59
> device 60 osd.60
> device 61 osd.61
> device 62 osd.62
> device 63 osd.63
> device 64 osd.64
> device 65 osd.65
> device 66 osd.66
> device 67 osd.67
> device 68 osd.68
> device 69 osd.69
> device 70 osd.70
> device 71 osd.71
> device 72 osd.72
> device 73 osd.73
> device 74 osd.74
> device 75 osd.75
> device 76 osd.76
> device 77 osd.77
> device 78 osd.78
> 
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 net
> type 4 room
> type 5 datacenter
> type 6 root
> 
> # buckets
> host dragan {
>   id -17  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item osd.70 weight 2.720
>   item osd.71 weight 2.720
>   item osd.72 weight 2.720
>   item osd.73 weight 2.720
>   item osd.74 weight 2.720
>   item osd.75 weight 2.720
>   item osd.76 weight 2.720
>   item osd.77 weight 2.720
>   item osd.78 weight 2.720
> }
> rack SAS15B01 {
>   id -40  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item dragan weight 24.480
> }
> net SAS188-165-15 {
>   id -72  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item SAS15B01 weight 24.480
> }
> room SASs15 {
>   id -90  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item SAS188-165-15 weight 24.480
> }
> datacenter SASrbx1 {
>   id -100 # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item SASs15 weight 24.480
> }
> host taman {
>   id -16  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item osd.49 weight 2.720
>   item osd.62 weight 2.720
>   item osd.63 weight 2.720
>   item osd.64 weight 2.720
>   item osd.65 weight 2.720
>   item osd.66 weight 2.720
>   item osd.67 weight 2.720
>   item osd.68 weight 2.720
>   item osd.69 weight 2.720
> }
> rack SAS31A10 {
>   id -15  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item taman weight 24.480
> }
> net SAS178-33-62 {
>   id -14  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item SAS31A10 weight 24.480
> }
> room SASs31 {
>   id -13  # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item SAS178-33-62 weight 24.480
> }
> host kaino {
>   id -9   # do not change unnecessarily
>   # weight 24.480
>   alg straw
>   hash 0  # rjenkins1
>   item osd.40 weight 2.720
>   item osd.41 weight 2.720
>

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
On Tuesday, September 10, 2013 at 11:19 -0700, Samuel Just wrote:
> Can you post the rest of you crush map?
> -Sam
> 

Yes :

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 device24
device 25 device25
device 26 device26
device 27 device27
device 28 device28
device 29 device29
device 30 device30
device 31 device31
device 32 device32
device 33 device33
device 34 device34
device 35 device35
device 36 device36
device 37 device37
device 38 device38
device 39 device39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77
device 78 osd.78

# types
type 0 osd
type 1 host
type 2 rack
type 3 net
type 4 room
type 5 datacenter
type 6 root

# buckets
host dragan {
id -17  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.70 weight 2.720
item osd.71 weight 2.720
item osd.72 weight 2.720
item osd.73 weight 2.720
item osd.74 weight 2.720
item osd.75 weight 2.720
item osd.76 weight 2.720
item osd.77 weight 2.720
item osd.78 weight 2.720
}
rack SAS15B01 {
id -40  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item dragan weight 24.480
}
net SAS188-165-15 {
id -72  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS15B01 weight 24.480
}
room SASs15 {
id -90  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS188-165-15 weight 24.480
}
datacenter SASrbx1 {
id -100 # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SASs15 weight 24.480
}
host taman {
id -16  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.49 weight 2.720
item osd.62 weight 2.720
item osd.63 weight 2.720
item osd.64 weight 2.720
item osd.65 weight 2.720
item osd.66 weight 2.720
item osd.67 weight 2.720
item osd.68 weight 2.720
item osd.69 weight 2.720
}
rack SAS31A10 {
id -15  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item taman weight 24.480
}
net SAS178-33-62 {
id -14  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS31A10 weight 24.480
}
room SASs31 {
id -13  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS178-33-62 weight 24.480
}
host kaino {
id -9   # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item osd.40 weight 2.720
item osd.41 weight 2.720
item osd.42 weight 2.720
item osd.43 weight 2.720
item osd.44 weight 2.720
item osd.45 weight 2.720
item osd.46 weight 2.720
item osd.47 weight 2.720
item osd.48 weight 2.720
}
rack SAS34A14 {
id -10  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item kaino weight 24.480
}
net SAS5-135-135 {
id -11  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS34A14 weight 24.480
}
room SASs34 {
id -12  # do not change unnecessarily
# weight 24.480
alg straw
hash 0  # rjenkins1
item SAS5-135-135 weight 24.480
}
datacenter SASrbx2 {
id -101 # do not change unnecessarily
# weight 48.960
alg straw
hash 0  # rjenkins

Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
I also checked that all files in that PG are still mapped to that PG :

for IMG in `find . -type f -printf '%f\n' | awk -F '__' '{ print $1 }' |
sort --unique` ; do echo -n "$IMG "; ceph osd map ssd3copies $IMG | grep
-v 6\\.31f ; echo ; done

And all objects are referenced in rados (compared with "rados --pool
ssd3copies ls rados.ssd3copies.dump").



On Tuesday, September 10, 2013 at 13:46 +0200, Olivier Bonvalet wrote:
> Some additionnal informations : if I look on one PG only, for example
> the 6.31f. "ceph pg dump" report a size of 616GB :
> 
> # ceph pg dump | grep ^6\\. | awk '{ SUM+=($6/1024/1024) } END { print SUM }'
> 631717
> 
> But on disk, on the 3 replica I have :
> # du -sh  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> 1,3G  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> 
> Since I was suspected a snapshot problem, I try to count only "head
> files" :
> # find /var/lib/ceph/osd/ceph-50/current/6.31f_head/ -type f -name '*head*' 
> -print0 | xargs -r -0 du -hc | tail -n1
> 448M  total
> 
> and the content of the directory : http://pastebin.com/u73mTvjs
> 
> 
> Le mardi 10 septembre 2013 à 10:31 +0200, Olivier Bonvalet a écrit :
> > Hi,
> > 
> > I have a space problem on a production cluster, like if there is unused
> > data not freed : "ceph df" and "rados df" reports 613GB of data, and
> > disk usage is 2640GB (with 3 replica). It should be near 1839GB.
> > 
> > 
> > I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush
> > rules to put pools on SAS or on SSD.
> > 
> > My pools :
> > # ceph osd dump | grep ^pool
> > pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins 
> > pg_num 576 pgp_num 576 last_change 68315 owner 0 crash_replay_interval 45
> > pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash 
> > rjenkins pg_num 576 pgp_num 576 last_change 68317 owner 0
> > pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins 
> > pg_num 576 pgp_num 576 last_change 68321 owner 0
> > pool 3 'hdd3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash 
> > rjenkins pg_num 200 pgp_num 200 last_change 172933 owner 0
> > pool 6 'ssd3copies' rep size 3 min_size 1 crush_ruleset 7 object_hash 
> > rjenkins pg_num 800 pgp_num 800 last_change 172929 owner 0
> > pool 9 'sas3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash 
> > rjenkins pg_num 2048 pgp_num 2048 last_change 172935 owner 0
> > 
> > Only hdd3copies, sas3copies and ssd3copies are really used :
> > # ceph df
> > GLOBAL:
> > SIZE   AVAIL  RAW USED %RAW USED 
> > 76498G 51849G 24648G   32.22 
> > 
> > POOLS:
> > NAME   ID USED  %USED OBJECTS 
> > data   0  46753 0 72  
> > metadata   1  0 0 0   
> > rbd2  8 0 1   
> > hdd3copies 3  2724G 3.56  5190954 
> > ssd3copies 6  613G  0.80  347668  
> > sas3copies 9  3692G 4.83  764394  
> > 
> > 
> > My CRUSH rules was :
> > 
> > rule SASperHost {
> > ruleset 4
> > type replicated
> > min_size 1
> > max_size 10
> > step take SASroot
> > step chooseleaf firstn 0 type host
> > step emit
> > }
> > 
> > and :
> > 
> > rule SSDperOSD {
> > ruleset 3
> > type replicated
> > min_size 1
> > max_size 10
> > step take SSDroot
> > step choose firstn 0 type osd
> > step emit
> > }
> > 
> > 
> > but, since the cluster was full because of that space problem, I swith to a 
> > different rule :
> > 
> > rule SSDperOSDfirst {
> > ruleset 7
> > type replicated
> > min_size 1
> > max_size 10
> > step take SSDroot
> > step choose firstn 1 type osd
> > step emit
> > step take SASroot
> > step chooseleaf firstn -1 type net
> > step emit
> > }
> > 
> > 
> > So with that last rule, I should have only one replica on my SSD OSD, so 
> > 613GB of space used. But if I check on OSD I see 1212GB really used.
> > 
> > I also use snapshots, maybe snapshots are ignored by "ceph df" and "rados 
> > df" ?
> > 
> > Thanks for any help.
> > 
> > Olivier
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
Some additional information: if I look at one PG only, for example
6.31f, "ceph pg dump" reports a size of 616GB:

# ceph pg dump | grep ^6\\. | awk '{ SUM+=($6/1024/1024) } END { print SUM }'
631717

But on disk, on the 3 replicas I have:
# du -sh  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
1,3G  /var/lib/ceph/osd/ceph-50/current/6.31f_head/

Since I suspected a snapshot problem, I tried to count only the "head"
files:
# find /var/lib/ceph/osd/ceph-50/current/6.31f_head/ -type f -name '*head*' 
-print0 | xargs -r -0 du -hc | tail -n1
448M  total

and the content of the directory : http://pastebin.com/u73mTvjs
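
As a side note, the same "ceph pg dump" sum can be broken down per pool in one
pass. A rough sketch, assuming the same column layout as the command above:
# ceph pg dump | awk '$1 ~ /^[0-9]+\./ { p=$1; sub(/\..*$/, "", p); SUM[p] += $6/1024/1024 } END { for (p in SUM) print "pool " p ": " SUM[p] " MB" }'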


Le mardi 10 septembre 2013 à 10:31 +0200, Olivier Bonvalet a écrit :
> Hi,
> 
> I have a space problem on a production cluster, as if there were unused
> data not being freed: "ceph df" and "rados df" report 613GB of data, while
> disk usage is 2640GB (with 3 replicas). It should be near 1839GB.
> 
> 
> I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush
> rules to put pools on SAS or on SSD.
> 
> My pools :
> # ceph osd dump | grep ^pool
> pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins 
> pg_num 576 pgp_num 576 last_change 68315 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash rjenkins 
> pg_num 576 pgp_num 576 last_change 68317 owner 0
> pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins 
> pg_num 576 pgp_num 576 last_change 68321 owner 0
> pool 3 'hdd3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash 
> rjenkins pg_num 200 pgp_num 200 last_change 172933 owner 0
> pool 6 'ssd3copies' rep size 3 min_size 1 crush_ruleset 7 object_hash 
> rjenkins pg_num 800 pgp_num 800 last_change 172929 owner 0
> pool 9 'sas3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash 
> rjenkins pg_num 2048 pgp_num 2048 last_change 172935 owner 0
> 
> Only hdd3copies, sas3copies and ssd3copies are really used :
> # ceph df
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED 
> 76498G 51849G 24648G   32.22 
> 
> POOLS:
> NAME   ID USED  %USED OBJECTS 
> data   0  46753 0 72  
> metadata   1  0 0 0   
> rbd2  8 0 1   
> hdd3copies 3  2724G 3.56  5190954 
> ssd3copies 6  613G  0.80  347668  
> sas3copies 9  3692G 4.83  764394  
> 
> 
> My CRUSH rules was :
> 
> rule SASperHost {
>   ruleset 4
>   type replicated
>   min_size 1
>   max_size 10
>   step take SASroot
>   step chooseleaf firstn 0 type host
>   step emit
> }
> 
> and :
> 
> rule SSDperOSD {
>   ruleset 3
>   type replicated
>   min_size 1
>   max_size 10
>   step take SSDroot
>   step choose firstn 0 type osd
>   step emit
> }
> 
> 
> but, since the cluster was full because of that space problem, I switched to a
> different rule:
> 
> rule SSDperOSDfirst {
>   ruleset 7
>   type replicated
>   min_size 1
>   max_size 10
>   step take SSDroot
>   step choose firstn 1 type osd
>   step emit
>   step take SASroot
>   step chooseleaf firstn -1 type net
>   step emit
> }
> 
> 
> So with that last rule, I should have only one replica on my SSD OSDs, so
> 613GB of space used. But if I check on the OSDs I see 1212GB actually used.
> 
> I also use snapshots, maybe snapshots are ignored by "ceph df" and "rados df" 
> ?
> 
> Thanks for any help.
> 
> Olivier
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph space problem, garbage collector ?

2013-09-10 Thread Olivier Bonvalet
Hi,

I have a space problem on a production cluster, as if there were unused
data not being freed: "ceph df" and "rados df" report 613GB of data, while
disk usage is 2640GB (with 3 replicas). It should be near 1839GB.


I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush
rules to put pools on SAS or on SSD.

My pools :
# ceph osd dump | grep ^pool
pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 
576 pgp_num 576 last_change 68315 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash rjenkins 
pg_num 576 pgp_num 576 last_change 68317 owner 0
pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 
576 pgp_num 576 last_change 68321 owner 0
pool 3 'hdd3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash rjenkins 
pg_num 200 pgp_num 200 last_change 172933 owner 0
pool 6 'ssd3copies' rep size 3 min_size 1 crush_ruleset 7 object_hash rjenkins 
pg_num 800 pgp_num 800 last_change 172929 owner 0
pool 9 'sas3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash rjenkins 
pg_num 2048 pgp_num 2048 last_change 172935 owner 0

Only hdd3copies, sas3copies and ssd3copies are really used :
# ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED 
76498G 51849G 24648G   32.22 

POOLS:
NAME   ID USED  %USED OBJECTS 
data   0  46753 0 72  
metadata   1  0 0 0   
rbd2  8 0 1   
hdd3copies 3  2724G 3.56  5190954 
ssd3copies 6  613G  0.80  347668  
sas3copies 9  3692G 4.83  764394  


My CRUSH rules were:

rule SASperHost {
ruleset 4
type replicated
min_size 1
max_size 10
step take SASroot
step chooseleaf firstn 0 type host
step emit
}

and :

rule SSDperOSD {
ruleset 3
type replicated
min_size 1
max_size 10
step take SSDroot
step choose firstn 0 type osd
step emit
}


but, since the cluster was full because of that space problem, I switched to a
different rule:

rule SSDperOSDfirst {
ruleset 7
type replicated
min_size 1
max_size 10
step take SSDroot
step choose firstn 1 type osd
step emit
step take SASroot
step chooseleaf firstn -1 type net
step emit
}


So with that last rule, I should have only one replica on my SSD OSDs, so 613GB
of space used. But if I check on the OSDs I see 1212GB actually used.
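
One way to sanity-check what that rule actually maps, before trusting the space
numbers, is to run the ruleset through crushtool. A rough sketch; the exact test
flags can vary a bit between versions:
# ceph osd getcrushmap -o /tmp/crushmap
# crushtool -i /tmp/crushmap --test --rule 7 --num-rep 3 --show-statistics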

I also use snapshots, maybe snapshots are ignored by "ceph df" and "rados df" ?
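
"rbd diff" can also report the extents allocated up to a given snapshot, per
image, which helps to see how much the snapshots themselves hold. A sketch only;
IMAGE and SNAP are placeholders, and the awk sum is the same trick discussed
elsewhere on this list:
# rbd diff ssd3copies/IMAGE@SNAP | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'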

Thanks for any help.

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Xen - RBD io hang

2013-08-28 Thread Olivier Bonvalet

Le mercredi 28 août 2013 à 10:07 +0200, Sylvain Munaut a écrit :
> Hi,
> 
> > I use Ceph 0.61.8 and Xen 4.2.2 (Debian) in production, and can't use
> > kernel 3.10.* on dom0, which hangs very soon. But it's only visible in the
> > kernel logs of the dom0, not the domU.
> 
> Weird. I'm using 3.10.0 without issue here. What's the issue you're hitting ?
> 
> Cheers,
> 
> Sylvain
> 

Hi,

this one : http://tracker.ceph.com/issues/5760

it seems to be related to snapshots.

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real size of rbd image

2013-08-27 Thread Olivier Bonvalet
Le mardi 27 août 2013 à 13:44 -0700, Josh Durgin a écrit :
> On 08/27/2013 01:39 PM, Timofey Koolin wrote:
> > Is way to know real size of rbd image and rbd snapshots?
> > rbd ls -l write declared size of image, but I want to know real size.
> 
> You can sum the sizes of the extents reported by:
> 
>  rbd diff pool/image[@snap] [--format json]
> 
> That's the difference since the beginning of time, so it reports all
> used extents.
> 
> Josh


Very good tip, Josh! It's the fastest way I've seen.


So, using awk to sum all the extents:
rbd diff $POOL/$IMAGE | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" 
}'


Really fast, thanks.
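
The --format json output Josh mentioned can be summed the same way. A sketch,
assuming each extent entry in the JSON carries a "length" field:
rbd diff $POOL/$IMAGE --format json | python -c 'import json,sys; print("%.1f MB" % (sum(e["length"] for e in json.load(sys.stdin)) / 1024.0 / 1024))'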

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Xen - RBD io hang

2013-08-27 Thread Olivier Bonvalet
Hi,

I use Ceph 0.61.8 and Xen 4.2.2 (Debian) in production, and can't use
kernel 3.10.* on dom0, which hangs very soon. But it's only visible in the
kernel logs of the dom0, not the domU.

Anyway, you should probably re-try with kernel 3.9.11 for the dom0 (I
also use 3.10.9 in domU).

Olivier

Le mardi 27 août 2013 à 11:46 +0100, James Dingwall a écrit :
> Hi,
> 
> I am doing some experimentation with Ceph and Xen (on the same host) and 
> I'm experiencing some problems with the rbd device that I'm using as the 
> block device.  My environment is:
> 
> 2 node Ceph 0.67.2 cluster, 4x OSD (btrfs) and 1x mon
> Xen 4.3.0
> Kernel 3.10.9
> 
> The domU I'm trying to build is from the Ubuntu 13.04 desktop release.  
> When I pass through the rbd (format 1 or 2) device as 
> phy:/dev/rbd/rbd/ubuntu-test then the domU has no problems reading data 
> from it, the test I ran was:
> 
> for i in $(seq 0 1023) ; do
>  dd if=/dev/xvda of=/dev/null bs=4k count=1024 skip=$(($i * 4))
> done
> 
> However writing data causes the domU to hang while while i is still in 
> single figures but it doesn't seem consistent about the exact value.
> for i in $(seq 0 1023) ; do
>  dd if=/dev/zero of=/dev/xvda bs=4k count=1024 seek=$(($i * 4))
> done
> 
> eventually the kernel in the domU will print a hung task warning.  I 
> have tried the domU as pv and hvm (with xen_platform_pci = 1 and 0) but 
> have the same behaviour in both cases.  Once this state is triggered on 
> the rbd device then any interaction with it in dom0 will result in the 
> same hang.  I'm assuming that there is some unfavourable interaction 
> between ceph/rbd and blkback but I haven't found anything in the dom0 
> logs so I would like to know if anyone has some suggestions about where 
> to start trying to hunt this down.
> 
> Thanks,
> James
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
Le lundi 19 août 2013 à 12:27 +0200, Olivier Bonvalet a écrit :
> Hi,
> 
> I have an OSD which crashes every time I try to start it (see logs below).
> Is it a known problem? And is there a way to fix it?
> 
> root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
> 2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
> (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
> 2013-08-19 11:07:48.516363 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
> appears to work
> 2013-08-19 11:07:48.516380 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
> 'filestore fiemap' config option
> 2013-08-19 11:07:48.516514 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
> 2013-08-19 11:07:48.517087 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
> supported
> 2013-08-19 11:07:48.517389 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
> 2013-08-19 11:07:49.199483 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: 
> btrfs not detected
> 2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
> 2013-08-19 11:07:52.199908 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
> appears to work
> 2013-08-19 11:07:52.199916 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
> 'filestore fiemap' config option
> 2013-08-19 11:07:52.200058 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
> 2013-08-19 11:07:52.200886 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
> supported
> 2013-08-19 11:07:52.200919 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
> 2013-08-19 11:07:52.215850 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: 
> btrfs not detected
> 2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has 
> features 262144, adjusting msgr requires for clients
> 2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has 
> features 262144, adjusting msgr requires for osds
> 2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
> OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
> 11:08:13.579519
> osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))
> 
>  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
>  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
>  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
> PG::RecoveryCtx*, std::set, 
> std::less >, std::allocator 
> > >*)+0x3c8) [0x6f8f48]
>  3: (OSD::process_peering_events(std::list > const&, 
> ThreadPool::TPHandle&)+0x31f) [0x6f975f]
>  4: (OSD::PeeringWQ::_process(std::list > const&, 
> ThreadPool::TPHandle&)+0x14) [0x7391d4]
>  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
>  6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
>  7: (()+0x6b50) [0x7f6fe3070b50]
>  8: (clone()+0x6d) [0x7f6fe15cba7d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 
> full logs here : http://pastebin.com/RphNyLU0
> 
> 

Hi,

still the same problem with Ceph 0.61.8:

2013-08-19 23:01:54.369609 7fdd667a4780  0 osd.65 144279 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-08-19 
23:01:58.313955
osd/OSD.cc: 4847: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f736b]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, std::allocator > 
>*)+0x3c8) [0x6fa708]
 3: (OSD::process_peering_events(std::list > const&, 
Th

[ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
Hi,

I have an OSD which crashes every time I try to start it (see logs below).
Is it a known problem? And is there a way to fix it?

root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
(8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
2013-08-19 11:07:48.516363 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:48.516380 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:48.516514 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:48.517087 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:48.517389 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount found snaps <>
2013-08-19 11:07:49.199483 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
2013-08-19 11:07:52.199908 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:52.199916 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:52.200058 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:52.200886 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:52.200919 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount found snaps <>
2013-08-19 11:07:52.215850 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for clients
2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
11:08:13.579519
osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, std::allocator > 
>*)+0x3c8) [0x6f8f48]
 3: (OSD::process_peering_events(std::list > const&, 
ThreadPool::TPHandle&)+0x31f) [0x6f975f]
 4: (OSD::PeeringWQ::_process(std::list > const&, 
ThreadPool::TPHandle&)+0x14) [0x7391d4]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
 7: (()+0x6b50) [0x7f6fe3070b50]
 8: (clone()+0x6d) [0x7f6fe15cba7d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

full logs here : http://pastebin.com/RphNyLU0


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replace all monitors

2013-08-10 Thread Olivier Bonvalet
Le jeudi 08 août 2013 à 18:04 -0700, Sage Weil a écrit :
> On Fri, 9 Aug 2013, Olivier Bonvalet wrote:
> > Le jeudi 08 août 2013 à 09:43 -0700, Sage Weil a écrit :
> > > On Thu, 8 Aug 2013, Olivier Bonvalet wrote:
> > > > Hi,
> > > > 
> > > > from now I have 5 monitors which share slow SSD with several OSD
> > > > journal. As a result, each data migration operation (reweight, recovery,
> > > > etc) is very slow and the cluster is near down.
> > > > 
> > > > So I have to change that. I'm looking to replace this 5 monitors by 3
> > > > new monitors, which still share (very fast) SSD with several OSD.
> > > > I suppose it's not a good idea, since monitors should have a dedicated
> > > > storage. What do you think about that ?
> > > > Is it a better practice to have dedicated storage, but share CPU with
> > > > Xen VM ?
> > > 
> > > I think it's okay, as long as you aren't worried about the device filling
> > > up and the monitors are on different hosts.
> > 
> > Not sure I understand: by «dedicated storage», I was talking about the
> > monitors. Can I put monitors on a Xen «host» if they have dedicated
> > storage?
> 
> Yeah, Xen would work fine here, although I'm not sure it is necessary.  
> Just putting /var/lib/mon on a different storage device will probably be 
> the most important piece.  It sounds like it is storage contention, and 
> not CPU contention, that is the source of your problems.
> 
> sage
> 

Yep, the transition worked fine, thanks! The new mons are really faster,
and now I can migrate data without downtime. Good job, devs!

Thanks again.

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replace all monitors

2013-08-08 Thread Olivier Bonvalet
Le jeudi 08 août 2013 à 09:43 -0700, Sage Weil a écrit :
> On Thu, 8 Aug 2013, Olivier Bonvalet wrote:
> > Hi,
> > 
> > Right now I have 5 monitors which share a slow SSD with several OSD
> > journals. As a result, each data migration operation (reweight, recovery,
> > etc.) is very slow and the cluster is nearly down.
> > 
> > So I have to change that. I'm looking to replace these 5 monitors with 3
> > new monitors, which would still share a (very fast) SSD with several OSDs.
> > I suppose it's not a good idea, since monitors should have dedicated
> > storage. What do you think about that?
> > Is it better practice to have dedicated storage, but share CPU with
> > Xen VMs?
> 
> I think it's okay, as long as you aren't worried about the device filling
> up and the monitors are on different hosts.

Not sure I understand: by «dedicated storage», I was talking about the
monitors. Can I put monitors on a Xen «host» if they have dedicated
storage?

> 
> > Second point: I'm not sure how to do that migration without downtime.
> > I was hoping to add the 3 new monitors, then progressively remove the 5
> > old ones, but the doc [1] indicates a special procedure for an
> > unhealthy cluster, which seems to be for clusters with damaged monitors,
> > right? In my case I only have dead PGs [2] (#5226), from which I can't
> > recover, but the monitors are fine. Can I use the standard procedure?
> 
> The 'healthy' caveat in this case is about the monitor cluster; the
> special procedure is only needed if you don't have enough healthy mons to
> form a quorum.  The normal procedure should work just fine.
> 

Great, thanks !


> sage
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replace all monitors

2013-08-08 Thread Olivier Bonvalet
Hi,

Right now I have 5 monitors which share a slow SSD with several OSD
journals. As a result, each data migration operation (reweight, recovery,
etc.) is very slow and the cluster is nearly down.

So I have to change that. I'm looking to replace these 5 monitors with 3
new monitors, which would still share a (very fast) SSD with several OSDs.
I suppose it's not a good idea, since monitors should have dedicated
storage. What do you think about that?
Is it better practice to have dedicated storage, but share CPU with
Xen VMs?

Second point: I'm not sure how to do that migration without downtime.
I was hoping to add the 3 new monitors, then progressively remove the 5
old ones, but the doc [1] indicates a special procedure for an
unhealthy cluster, which seems to be for clusters with damaged monitors,
right? In my case I only have dead PGs [2] (#5226), from which I can't
recover, but the monitors are fine. Can I use the standard procedure?

Thanks,
Olivier

[1] 
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors
[2] http://tracker.ceph.com/issues/5226
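
For reference, a rough outline of the standard procedure from [1]; a sketch
only, the exact flags and paths are in the doc, and NEWID, NEWIP and OLDID are
placeholders. Add each new mon, one at a time:
# ceph auth get mon. -o /tmp/mon.keyring
# ceph mon getmap -o /tmp/monmap
# ceph-mon -i NEWID --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
# ceph mon add NEWID NEWIP:6789
then start the new ceph-mon daemon. Once the new mons have joined the quorum,
remove the old ones one by one with "ceph mon remove OLDID" and stop the old
daemon.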

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-05 Thread Olivier Bonvalet
It's Xen, yes, but no, I didn't try the RBD tap client, for two
reasons:
- it's too young to enable in production
- the Debian packages don't have the TAP driver


Le lundi 05 août 2013 à 01:43 +, James Harper a écrit :
> What VM? If Xen, have you tried the rbd tap client?
> 
> James
> 
> > -Original Message-
> > From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
> > boun...@lists.ceph.com] On Behalf Of Olivier Bonvalet
> > Sent: Monday, 5 August 2013 11:07 AM
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103
> > 
> > 
> > Hi,
> > 
> > I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
> > 3.9.11 to Linux 3.10.5, and now I have kernel panic after launching some
> > VM which use RBD kernel client.
> > 
> > 
> > In kernel logs, I have :
> > 
> > Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at
> > net/ceph/osd_client.c:2103!
> > Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  [#1] 
> > SMP
> > Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc rbd
> > libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter
> > ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT
> > xt_tcpudp iptable_filter ip_tables x_tables bridge loop coretemp
> > ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper
> > ablk_helper cryptd iTCO_wdt iTCO_vendor_support gpio_ich microcode
> > serio_raw sb_edac edac_core evdev lpc_ich i2c_i801 mfd_core wmi ac
> > ioatdma shpchp button dm_mod hid_generic usbhid hid sg sd_mod
> > crc_t10dif crc32c_intel isci megaraid_sas libsas ahci libahci ehci_pci 
> > ehci_hcd
> > libata scsi_transport_sas igb scsi_mod i2c_algo_bit ixgbe usbcore i2c_core
> > dca usb_common ptp pps_core mdio
> > Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm:
> > blkback.3.xvda Not tainted 3.10-dae-dom0 #1
> > Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: Supermicro
> > X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> > Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 ti:
> > 88003803a000 task.ti: 88003803a000
> > Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: 
> > e030:[]
> > [] ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
> > Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: e02b:88003803b9f8
> > EFLAGS: 00010212
> > Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 RBX:
> > 880033a182ec RCX: 
> > Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af RSI:
> > 8050 RDI: 880030d34888
> > Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 R08:
> > 88003803ba58 R09: 
> > Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  R11:
> >  R12: 880033ba3500
> > Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 R14:
> > 88003847aa78 R15: 88003847ab58
> > Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  7f775da8c700()
> > GS:88003f84() knlGS:
> > Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES: 
> > CR0: 80050033
> > Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 CR3:
> > 2be14000 CR4: 00042660
> > Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  DR1:
> >  DR2: 
> > Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  DR6:
> > 0ff0 DR7: 0400
> > Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
> > Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000
> > 00243847aa78  880039949b40
> > Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201
> > 880033811d98 88003803ba80 88003847aa78
> > Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380
> > 880002a38400 2000 a029584c
> > Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
> > Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ?
> > rbd_osd_req_format_write+0x71/0x7c [rbd]
> > Aug  5 02:51:22 murmillia kernel: [  289.213459]  [] ?
> > rbd_img_request_fill+0x695/0x736 [rbd]
> > Aug  5 02:51:22 murmillia kernel: [  289.213562]  [] ?
> > arch_local_irq_restore+0x7/0x8
> > Aug  5 02:51:22 murmillia kernel: [  289.213667]  

Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-04 Thread Olivier Bonvalet
Yes of course, thanks !

Le dimanche 04 août 2013 à 20:59 -0700, Sage Weil a écrit :
> Hi Olivier,
> 
> This looks like http://tracker.ceph.com/issues/5760.  We should be able to 
> look at this more closely this week.  In the meantime, you might want to 
> go back to 3.9.x.  If we have a patch that addresses the bug, would you be 
> able to test it?
> 
> Thanks!
> sage
> 
> 
> On Mon, 5 Aug 2013, Olivier Bonvalet wrote:
> > Sorry, the "dev" list is probably a better place for that one.
> > 
> > Le lundi 05 août 2013 à 03:07 +0200, Olivier Bonvalet a écrit :
> > > Hi,
> > > 
> > > I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
> > > 3.9.11 to Linux 3.10.5, and now I have kernel panic after launching some
> > > VM which use RBD kernel client. 
> > > 
> > > 
> > > In kernel logs, I have :
> > > 
> > > Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at 
> > > net/ceph/osd_client.c:2103!
> > > Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  
> > > [#1] SMP 
> > > Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc 
> > > rbd libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT 
> > > ip6table_filter ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev 
> > > ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge loop 
> > > coretemp ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul 
> > > glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support gpio_ich 
> > > microcode serio_raw sb_edac edac_core evdev lpc_ich i2c_i801 mfd_core wmi 
> > > ac ioatdma shpchp button dm_mod hid_generic usbhid hid sg sd_mod 
> > > crc_t10dif crc32c_intel isci megaraid_sas libsas ahci libahci ehci_pci 
> > > ehci_hcd libata scsi_transport_sas igb scsi_mod i2c_algo_bit ixgbe 
> > > usbcore i2c_core dca usb_common ptp pps_core mdio
> > > Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm: 
> > > blkback.3.xvda Not tainted 3.10-dae-dom0 #1
> > > Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: 
> > > Supermicro X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> > > Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 
> > > ti: 88003803a000 task.ti: 88003803a000
> > > Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: 
> > > e030:[]  [] 
> > > ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
> > > Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: 
> > > e02b:88003803b9f8  EFLAGS: 00010212
> > > Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 
> > > RBX: 880033a182ec RCX: 
> > > Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af 
> > > RSI: 8050 RDI: 880030d34888
> > > Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 
> > > R08: 88003803ba58 R09: 
> > > Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  
> > > R11:  R12: 880033ba3500
> > > Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 
> > > R14: 88003847aa78 R15: 88003847ab58
> > > Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  
> > > 7f775da8c700() GS:88003f84() knlGS:
> > > Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES: 
> > >  CR0: 80050033
> > > Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 
> > > CR3: 2be14000 CR4: 00042660
> > > Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  
> > > DR1:  DR2: 
> > > Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  
> > > DR6: 0ff0 DR7: 0400
> > > Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
> > > Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000 
> > > 00243847aa78  880039949b40
> > > Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201 
> > > 880033811d98 88003803ba80 88003847aa78
> > > Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380 
> > > 880002a38400 2000 a029584c
> > > Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
> > > Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ? 
> > > rbd_osd_req_f

Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-04 Thread Olivier Bonvalet
Sorry, the "dev" list is probably a better place for that one.

Le lundi 05 août 2013 à 03:07 +0200, Olivier Bonvalet a écrit :
> Hi,
> 
> I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
> 3.9.11 to Linux 3.10.5, and now I get a kernel panic after launching some
> VMs which use the RBD kernel client.
> 
> 
> In kernel logs, I have :
> 
> Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at 
> net/ceph/osd_client.c:2103!
> Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  [#1] 
> SMP 
> Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc rbd 
> libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter 
> ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT xt_tcpudp 
> iptable_filter ip_tables x_tables bridge loop coretemp ghash_clmulni_intel 
> aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt 
> iTCO_vendor_support gpio_ich microcode serio_raw sb_edac edac_core evdev 
> lpc_ich i2c_i801 mfd_core wmi ac ioatdma shpchp button dm_mod hid_generic 
> usbhid hid sg sd_mod crc_t10dif crc32c_intel isci megaraid_sas libsas ahci 
> libahci ehci_pci ehci_hcd libata scsi_transport_sas igb scsi_mod i2c_algo_bit 
> ixgbe usbcore i2c_core dca usb_common ptp pps_core mdio
> Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm: 
> blkback.3.xvda Not tainted 3.10-dae-dom0 #1
> Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: Supermicro 
> X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 ti: 
> 88003803a000 task.ti: 88003803a000
> Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: 
> e030:[]  [] 
> ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
> Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: e02b:88003803b9f8  
> EFLAGS: 00010212
> Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 RBX: 
> 880033a182ec RCX: 
> Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af RSI: 
> 8050 RDI: 880030d34888
> Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 R08: 
> 88003803ba58 R09: 
> Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  R11: 
>  R12: 880033ba3500
> Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 R14: 
> 88003847aa78 R15: 88003847ab58
> Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  7f775da8c700() 
> GS:88003f84() knlGS:
> Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES:  
> CR0: 80050033
> Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 CR3: 
> 2be14000 CR4: 00042660
> Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  DR1: 
>  DR2: 
> Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  DR6: 
> 0ff0 DR7: 0400
> Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
> Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000 
> 00243847aa78  880039949b40
> Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201 
> 880033811d98 88003803ba80 88003847aa78
> Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380 
> 880002a38400 2000 a029584c
> Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
> Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ? 
> rbd_osd_req_format_write+0x71/0x7c [rbd]
> Aug  5 02:51:22 murmillia kernel: [  289.213459]  [] ? 
> rbd_img_request_fill+0x695/0x736 [rbd]
> Aug  5 02:51:22 murmillia kernel: [  289.213562]  [] ? 
> arch_local_irq_restore+0x7/0x8
> Aug  5 02:51:22 murmillia kernel: [  289.213667]  [] ? 
> down_read+0x9/0x19
> Aug  5 02:51:22 murmillia kernel: [  289.213763]  [] ? 
> rbd_request_fn+0x191/0x22e [rbd]
> Aug  5 02:51:22 murmillia kernel: [  289.213864]  [] ? 
> __blk_run_queue_uncond+0x1e/0x26
> Aug  5 02:51:22 murmillia kernel: [  289.213962]  [] ? 
> blk_flush_plug_list+0x1c1/0x1e4
> Aug  5 02:51:22 murmillia kernel: [  289.214059]  [] ? 
> blk_finish_plug+0xb/0x2a
> Aug  5 02:51:22 murmillia kernel: [  289.214157]  [] ? 
> dispatch_rw_block_io+0x33e/0x3f0
> Aug  5 02:51:22 murmillia kernel: [  289.214259]  [] ? 
> find_busiest_group+0x28/0x1d4
> Aug  5 02:51:22 murmillia kernel: [  289.214357]  [] ? 
> load_balance+0xb9/0x5e1
> Aug  5 02:51:22 murmillia kernel: [  289.214454]  [] ? 
> xen_hypercall_xen_version+0xa/0x20
> Aug  5 02:51:22 murmillia kernel: [ 

[ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-04 Thread Olivier Bonvalet

Hi,

I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
3.9.11 to Linux 3.10.5, and now I get a kernel panic after launching some
VMs which use the RBD kernel client.


In kernel logs, I have :

Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at 
net/ceph/osd_client.c:2103!
Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  [#1] SMP 
Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc rbd 
libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter 
ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT xt_tcpudp 
iptable_filter ip_tables x_tables bridge loop coretemp ghash_clmulni_intel 
aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt 
iTCO_vendor_support gpio_ich microcode serio_raw sb_edac edac_core evdev 
lpc_ich i2c_i801 mfd_core wmi ac ioatdma shpchp button dm_mod hid_generic 
usbhid hid sg sd_mod crc_t10dif crc32c_intel isci megaraid_sas libsas ahci 
libahci ehci_pci ehci_hcd libata scsi_transport_sas igb scsi_mod i2c_algo_bit 
ixgbe usbcore i2c_core dca usb_common ptp pps_core mdio
Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm: 
blkback.3.xvda Not tainted 3.10-dae-dom0 #1
Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: Supermicro 
X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 ti: 
88003803a000 task.ti: 88003803a000
Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: e030:[] 
 [] ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: e02b:88003803b9f8  
EFLAGS: 00010212
Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 RBX: 
880033a182ec RCX: 
Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af RSI: 
8050 RDI: 880030d34888
Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 R08: 
88003803ba58 R09: 
Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  R11: 
 R12: 880033ba3500
Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 R14: 
88003847aa78 R15: 88003847ab58
Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  7f775da8c700() 
GS:88003f84() knlGS:
Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES:  
CR0: 80050033
Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 CR3: 
2be14000 CR4: 00042660
Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  DR1: 
 DR2: 
Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  DR6: 
0ff0 DR7: 0400
Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000 
00243847aa78  880039949b40
Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201 
880033811d98 88003803ba80 88003847aa78
Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380 
880002a38400 2000 a029584c
Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ? 
rbd_osd_req_format_write+0x71/0x7c [rbd]
Aug  5 02:51:22 murmillia kernel: [  289.213459]  [] ? 
rbd_img_request_fill+0x695/0x736 [rbd]
Aug  5 02:51:22 murmillia kernel: [  289.213562]  [] ? 
arch_local_irq_restore+0x7/0x8
Aug  5 02:51:22 murmillia kernel: [  289.213667]  [] ? 
down_read+0x9/0x19
Aug  5 02:51:22 murmillia kernel: [  289.213763]  [] ? 
rbd_request_fn+0x191/0x22e [rbd]
Aug  5 02:51:22 murmillia kernel: [  289.213864]  [] ? 
__blk_run_queue_uncond+0x1e/0x26
Aug  5 02:51:22 murmillia kernel: [  289.213962]  [] ? 
blk_flush_plug_list+0x1c1/0x1e4
Aug  5 02:51:22 murmillia kernel: [  289.214059]  [] ? 
blk_finish_plug+0xb/0x2a
Aug  5 02:51:22 murmillia kernel: [  289.214157]  [] ? 
dispatch_rw_block_io+0x33e/0x3f0
Aug  5 02:51:22 murmillia kernel: [  289.214259]  [] ? 
find_busiest_group+0x28/0x1d4
Aug  5 02:51:22 murmillia kernel: [  289.214357]  [] ? 
load_balance+0xb9/0x5e1
Aug  5 02:51:22 murmillia kernel: [  289.214454]  [] ? 
xen_hypercall_xen_version+0xa/0x20
Aug  5 02:51:22 murmillia kernel: [  289.214552]  [] ? 
__do_block_io_op+0x258/0x390
Aug  5 02:51:22 murmillia kernel: [  289.214649]  [] ? 
xen_end_context_switch+0xa/0x14
Aug  5 02:51:22 murmillia kernel: [  289.214747]  [] ? 
__switch_to+0x13e/0x3c0
Aug  5 02:51:22 murmillia kernel: [  289.214843]  [] ? 
xen_blkif_schedule+0x30d/0x418
Aug  5 02:51:22 murmillia kernel: [  289.214947]  [] ? 
finish_wait+0x60/0x60
Aug  5 02:51:22 murmillia kernel: [  289.215042]  [] ? 
xen_blkif_be_int+0x25/0x25
Aug  5 02:51:22 murmillia kernel: [  289.215138]  [] ? 
kthread+0x7d/0x85
Aug  5 02:51:22 murmillia kernel: [  289.215232]  [] ? 
__k

Re: [ceph-users] VMs freez after slow requests

2013-06-03 Thread Olivier Bonvalet

Le lundi 03 juin 2013 à 08:04 -0700, Gregory Farnum a écrit :
> On Sunday, June 2, 2013, Dominik Mostowiec wrote:
> Hi,
> I try to start postgres cluster on VMs with second disk
> mounted from
> ceph (rbd - kvm).
> I started some writes (pgbench initialisation) on 8 VMs and
> VMs freez.
> Ceph reports slow request on 1 osd. I restarted this osd to
> remove
> slows and VMs hangs permanently.
> Is this a normal situation afer cluster problems?
> 
> 
> Definitely not. Is your cluster reporting as healthy (what's "ceph -s"
> say)? Can you get anything off your hung VMs (like dmesg output)?
> -Greg
> 
> 
> -- 
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi,

I also saw that with Xen and the kernel RBD client, when the Ceph cluster
was full: after some errors the block device switched to read-only
mode, and I didn't find any way to fix that ("mount -o
remount,rw" doesn't work). I had to reboot all the VMs.
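
For what it's worth, a quick way to tell whether the block device itself was
flipped read-only, rather than just the filesystem; a small sketch:
# blockdev --getro /dev/xvda
A result of 1 means the kernel marked the device read-only, so a remount alone
cannot help.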

But since I didn't have to unmap/remap the RBD devices, I don't think it's a
Ceph/RBD problem. Probably a Xen or Linux "feature".

Olivier





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon store.db size

2013-06-02 Thread Olivier Bonvalet
Hi,

it's a Cuttlefish bug, which should be fixed in the next point release very
soon.
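
The usual workaround while waiting for the fix is to compact the monitor's
leveldb store; a sketch, double-check that your ceph version supports these:
# ceph tell mon.MONID compact
or set "mon compact on start = true" in ceph.conf and restart the mon (MONID is
a placeholder for the monitor's name).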

Olivier

Le dimanche 02 juin 2013 à 18:51 +1000, Bond, Darryl a écrit :
> Cluster has gone into HEALTH_WARN because the mon filesystem is 12%
> The cluster was upgraded to cuttlefish last week and had been running on 
> bobtail for a few months.
> 
> How big can I expect the /var/lib/ceph/mon to get, what influences it's size.
> It is at 11G now, I'm not sure how fast it has been growing though.
> 
> Darryl
> 
> The contents of this electronic message and any attachments are intended only 
> for the addressee and may contain legally privileged, personal, sensitive or 
> confidential information. If you are not the intended addressee, and have 
> received this email, any transmission, distribution, downloading, printing or 
> photocopying of the contents of this message or attachments is strictly 
> prohibited. Any legal privilege or confidentiality attached to this message 
> and attachments is not waived, lost or destroyed by reason of delivery to any 
> person other than intended addressee. If you have received this message and 
> are not the intended addressee you should notify the sender by return email 
> and destroy all copies of the message and any attachments. Unless expressly 
> attributed, the views expressed in this email do not necessarily represent 
> the views of the company.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [solved] scrub error: found clone without head

2013-05-31 Thread Olivier Bonvalet
Ok, so :
- after a second "rbd rm XXX", the image was gone
- and "rados ls" doesn't see any object from that image
- so I tried to move those files

=> scrub is now ok !

So for me it's fixed. Thanks
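
To double-check, the affected PGs can be re-scrubbed on demand and the health
re-checked afterwards; a quick sketch:
# ceph pg scrub 3.6b
# ceph health detail | grep -i inconsist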

Le vendredi 31 mai 2013 à 16:34 +0200, Olivier Bonvalet a écrit :
> Note that I still have scrub errors, but rados doesn't see those
> objects:
> 
> root! brontes:~# rados -p hdd3copies ls | grep '^rb.0.15c26.238e1f29'
> root! brontes:~# 
> 
> 
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error: found clone without head

2013-05-31 Thread Olivier Bonvalet
Note that I still have scrub errors, but rados doesn't see those
objects:

root! brontes:~# rados -p hdd3copies ls | grep '^rb.0.15c26.238e1f29'
root! brontes:~# 



Le vendredi 31 mai 2013 à 15:36 +0200, Olivier Bonvalet a écrit :
> Hi,
> 
> sorry for the late answer: trying to fix that, I tried to delete the
> image (rbd rm XXX); the "rbd rm" completed without errors, but "rbd ls"
> still displays this image.
> 
> What should I do ?
> 
> 
> Here the files for the PG 3.6b :
> 
> # find /var/lib/ceph/osd/ceph-28/current/3.6b_head/ -name 
> 'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
> -rw-r--r-- 1 root root 4194304 19 mai   22:52 
> /var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
> -rw-r--r-- 1 root root 4194304 19 mai   23:00 
> /var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
> -rw-r--r-- 1 root root 4194304 19 mai   22:59 
> /var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3
> 
> # find /var/lib/ceph/osd/ceph-23/current/3.6b_head/ -name 
> 'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
> -rw-r--r-- 1 root root 4194304 25 mars  19:18 
> /var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
> -rw-r--r-- 1 root root 4194304 25 mars  19:33 
> /var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
> -rw-r--r-- 1 root root 4194304 25 mars  19:34 
> /var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3
> 
> # find /var/lib/ceph/osd/ceph-5/current/3.6b_head/ -name 
> 'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
> -rw-r--r-- 1 root root 4194304 25 mars  19:18 
> /var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
> -rw-r--r-- 1 root root 4194304 25 mars  19:33 
> /var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
> -rw-r--r-- 1 root root 4194304 25 mars  19:34 
> /var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3
> 
> 
> As you can see, the OSDs don't contain any other data in those PGs for this RBD
> image. Should I remove them through rados?
> 
> 
> In fact I remember that some of those files were truncated (size 0), so I
> manually copied data from osd-5. It was probably a mistake to do that.
> 
> 
> Thanks,
> Olivier
> 
> Le jeudi 23 mai 2013 à 15:53 -0700, Samuel Just a écrit :
> > Can you send the filenames in the pg directories for those 4 pgs?
> > -Sam
> > 
> > On Thu, May 23, 2013 at 3:27 PM, Olivier Bonvalet  
> > wrote:
> > > No :
> > > pg 3.7c is active+clean+inconsistent, acting [24,13,39]
> > > pg 3.6b is active+clean+inconsistent, acting [28,23,5]
> > > pg 3.d is active+clean+inconsistent, acting [29,4,11]
> > > pg 3.1 is active+clean+inconsistent, acting [28,19,5]
> > >
> > > But I suppose that all PG *was* having the osd.25 as primary (on the
> > > same host), which is (disabled) buggy OSD.
> > >
> > > Question : "12d7" in object path is the snapshot id, right ? If it's the
> > > case, I haven't got any snapshot with this id for the
> > > rb.0.15c26.238e1f29 image.
> > >
> > > So, which files should I remove ?
> > >
> > > Thanks for your help.
> > >
> > >
> > > Le jeudi 23 mai 2013 à 15:17 -0700, Samuel Just a écrit :
> > >> Do all of the affected PGs share osd.28 as the primary?  I think the
> > >> only recovery is probably to manually remove the orphaned clones.
> > >> -Sam
> > >>
> > >> On Thu, May 23, 2013 at 5:00 AM, Olivier Bonvalet  
> > >> wrote:
> > >> > Not yet. I keep it for now.
> > >> >
> > >> > Le mercredi 22 mai 2013 à 15:50 -0700, Samuel Just a écrit :
> > >> >> rb.0.15c26.238e1f29
> > >> >>
> > >> >> Has that rbd volume been removed?
> > >> >> -Sam
> > >> >>
> > >> >> On Wed, May 22, 2013 at 12:18 PM, Olivier Bonvalet 
> > >> >>  wrote:
> > >> >> > 0.61-11-g3b94f03 (0.61-1.1), but the bug occured with bobtail.
> > >> >> >
> > >> >> >

Re: [ceph-users] scrub error: found clone without head

2013-05-31 Thread Olivier Bonvalet
Hi,

sorry for the late answer: trying to fix that, I tried to delete the
image (rbd rm XXX); the "rbd rm" completed without errors, but "rbd ls"
still displays this image.

What should I do ?


Here are the files for PG 3.6b:

# find /var/lib/ceph/osd/ceph-28/current/3.6b_head/ -name 
'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
-rw-r--r-- 1 root root 4194304 19 mai   22:52 
/var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
-rw-r--r-- 1 root root 4194304 19 mai   23:00 
/var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
-rw-r--r-- 1 root root 4194304 19 mai   22:59 
/var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3

# find /var/lib/ceph/osd/ceph-23/current/3.6b_head/ -name 
'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
-rw-r--r-- 1 root root 4194304 25 mars  19:18 
/var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
-rw-r--r-- 1 root root 4194304 25 mars  19:33 
/var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
-rw-r--r-- 1 root root 4194304 25 mars  19:34 
/var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3

# find /var/lib/ceph/osd/ceph-5/current/3.6b_head/ -name 'rb.0.15c26.238e1f29*' 
-print0 | xargs -r -0 ls -l
-rw-r--r-- 1 root root 4194304 25 mars  19:18 
/var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
-rw-r--r-- 1 root root 4194304 25 mars  19:33 
/var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
-rw-r--r-- 1 root root 4194304 25 mars  19:34 
/var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3


As you can see, the OSDs don't contain any other data in those PGs for this RBD
image. Should I remove them through rados?


In fact I remember that some of those files were truncated (size 0), so I
manually copied data from osd-5. It was probably a mistake to do that.


Thanks,
Olivier

Le jeudi 23 mai 2013 à 15:53 -0700, Samuel Just a écrit :
> Can you send the filenames in the pg directories for those 4 pgs?
> -Sam
> 
> On Thu, May 23, 2013 at 3:27 PM, Olivier Bonvalet  wrote:
> > No :
> > pg 3.7c is active+clean+inconsistent, acting [24,13,39]
> > pg 3.6b is active+clean+inconsistent, acting [28,23,5]
> > pg 3.d is active+clean+inconsistent, acting [29,4,11]
> > pg 3.1 is active+clean+inconsistent, acting [28,19,5]
> >
> > But I suppose that all PG *was* having the osd.25 as primary (on the
> > same host), which is (disabled) buggy OSD.
> >
> > Question : "12d7" in object path is the snapshot id, right ? If it's the
> > case, I haven't got any snapshot with this id for the
> > rb.0.15c26.238e1f29 image.
> >
> > So, which files should I remove ?
> >
> > Thanks for your help.
> >
> >
> > Le jeudi 23 mai 2013 à 15:17 -0700, Samuel Just a écrit :
> >> Do all of the affected PGs share osd.28 as the primary?  I think the
> >> only recovery is probably to manually remove the orphaned clones.
> >> -Sam
> >>
> >> On Thu, May 23, 2013 at 5:00 AM, Olivier Bonvalet  
> >> wrote:
> >> > Not yet. I keep it for now.
> >> >
> >> > Le mercredi 22 mai 2013 à 15:50 -0700, Samuel Just a écrit :
> >> >> rb.0.15c26.238e1f29
> >> >>
> >> >> Has that rbd volume been removed?
> >> >> -Sam
> >> >>
> >> >> On Wed, May 22, 2013 at 12:18 PM, Olivier Bonvalet 
> >> >>  wrote:
> >> >> > 0.61-11-g3b94f03 (0.61-1.1), but the bug occured with bobtail.
> >> >> >
> >> >> >
> >> >> > Le mercredi 22 mai 2013 à 12:00 -0700, Samuel Just a écrit :
> >> >> >> What version are you running?
> >> >> >> -Sam
> >> >> >>
> >> >> >> On Wed, May 22, 2013 at 11:25 AM, Olivier Bonvalet 
> >> >> >>  wrote:
> >> >> >> > Is it enough ?
> >> >> >> >
> >> >> >> > # tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found 
> >> >> >> > clone without head'
> >> >> >> > 2013-05-22 15:43:09.308352 7f707dd64700  0 log [INF] : 9.105 scrub 
> >> >

[ceph-users] Edge effect with multiple RBD kernel clients per host ?

2013-05-25 Thread Olivier Bonvalet
Hi,

I seem to have a bad edge effect in my setup; I don't know if it's an RBD
problem or a Xen problem.

So, I have one Ceph cluster, in which I set up 2 different storage
pools: one on SSD and one on SAS. With appropriate CRUSH rules, those
pools are completely separated; only the MONs are shared.

Then, on a Xen host A, I run "VMSSD" and "VMSAS". If I launch a big
rebalance on the "SSD pool", then "VMSSD" *and* "VMSAS" slow
down (a lot of iowait). But if I move "VMSAS" to a different Xen
host (B), then "VMSSD" is still slow, but "VMSAS" is fast
again.

The first thing I checked was the network of Xen host A, but I didn't
find any problem.

So, is there a queue shared by all RBD kernel clients running on the same
host? Or something else which could explain this edge effect?


Olivier

PS: one detail, I have about 60 RBDs mapped on Xen host A;
I don't know whether that might be the key to the problem.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error: found clone without head

2013-05-23 Thread Olivier Bonvalet
No : 
pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]

But I suppose that all these PGs *were* having osd.25 as primary (on the
same host), which is the (now disabled) buggy OSD.

Question: "12d7" in the object path is the snapshot id, right? If that's the
case, I don't have any snapshot with this id for the
rb.0.15c26.238e1f29 image.

So, which files should I remove ?

Thanks for your help.


Le jeudi 23 mai 2013 à 15:17 -0700, Samuel Just a écrit :
> Do all of the affected PGs share osd.28 as the primary?  I think the
> only recovery is probably to manually remove the orphaned clones.
> -Sam
> 
> On Thu, May 23, 2013 at 5:00 AM, Olivier Bonvalet  wrote:
> > Not yet. I keep it for now.
> >
> > Le mercredi 22 mai 2013 à 15:50 -0700, Samuel Just a écrit :
> >> rb.0.15c26.238e1f29
> >>
> >> Has that rbd volume been removed?
> >> -Sam
> >>
> >> On Wed, May 22, 2013 at 12:18 PM, Olivier Bonvalet  
> >> wrote:
> >> > 0.61-11-g3b94f03 (0.61-1.1), but the bug occured with bobtail.
> >> >
> >> >
> >> > Le mercredi 22 mai 2013 à 12:00 -0700, Samuel Just a écrit :
> >> >> What version are you running?
> >> >> -Sam
> >> >>
> >> >> On Wed, May 22, 2013 at 11:25 AM, Olivier Bonvalet 
> >> >>  wrote:
> >> >> > Is it enough ?
> >> >> >
> >> >> > # tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found clone 
> >> >> > without head'
> >> >> > 2013-05-22 15:43:09.308352 7f707dd64700  0 log [INF] : 9.105 scrub ok
> >> >> > 2013-05-22 15:44:21.054893 7f707dd64700  0 log [INF] : 9.451 scrub ok
> >> >> > 2013-05-22 15:44:52.898784 7f707cd62700  0 log [INF] : 9.784 scrub ok
> >> >> > 2013-05-22 15:47:43.148515 7f707cd62700  0 log [INF] : 9.3c3 scrub ok
> >> >> > 2013-05-22 15:47:45.717085 7f707dd64700  0 log [INF] : 9.3d0 scrub ok
> >> >> > 2013-05-22 15:52:14.573815 7f707dd64700  0 log [ERR] : scrub 3.6b 
> >> >> > ade3c16b/rb.0.15c26.238e1f29.9221/12d7//3 found clone without 
> >> >> > head
> >> >> > 2013-05-22 15:55:07.230114 7f707d563700  0 log [ERR] : scrub 3.6b 
> >> >> > 261cc0eb/rb.0.15c26.238e1f29.3671/12d7//3 found clone without 
> >> >> > head
> >> >> > 2013-05-22 15:56:56.456242 7f707d563700  0 log [ERR] : scrub 3.6b 
> >> >> > b10deaeb/rb.0.15c26.238e1f29.86a2/12d7//3 found clone without 
> >> >> > head
> >> >> > 2013-05-22 15:57:51.667085 7f707dd64700  0 log [ERR] : 3.6b scrub 3 
> >> >> > errors
> >> >> > 2013-05-22 15:57:55.241224 7f707dd64700  0 log [INF] : 9.450 scrub ok
> >> >> > 2013-05-22 15:57:59.800383 7f707cd62700  0 log [INF] : 9.465 scrub ok
> >> >> > 2013-05-22 15:59:55.024065 7f707661a700  0 -- 192.168.42.3:6803/12142 
> >> >> > >> 192.168.42.5:6828/31490 pipe(0x2a689000 sd=108 :6803 s=2 
> >> >> > pgs=200652 cs=73 l=0).fault with nothing to send, going to standby
> >> >> > 2013-05-22 16:01:45.542579 7f7022770700  0 -- 192.168.42.3:6803/12142 
> >> >> > >> 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=0 pgs=0 cs=0 
> >> >> > l=0).accept connect_seq 74 vs existing 73 state standby
> >> >> > --
> >> >> > 2013-05-22 16:29:49.544310 7f707dd64700  0 log [INF] : 9.4eb scrub ok
> >> >> > 2013-05-22 16:29:53.190233 7f707dd64700  0 log [INF] : 9.4f4 scrub ok
> >> >> > 2013-05-22 16:29:59.478736 7f707dd64700  0 log [INF] : 8.6bb scrub ok
> >> >> > 2013-05-22 16:35:12.240246 7f7022770700  0 -- 192.168.42.3:6803/12142 
> >> >> > >> 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=2 pgs=200667 
> >> >> > cs=75 l=0).fault with nothing to send, going to standby
> >> >> > 2013-05-22 16:35:19.519019 7f707d563700  0 log [INF] : 8.700 scrub ok
> >> >> > 2013-05-22 16:39:15.422532 7f707dd64700  0 log [ERR] : scrub 3.1 
> >> >> > b1869301/rb.0.15c26.238e1f29.0836/12d7//3 found clone without 
> >> >> > head
> >> >> > 2013-05-22 16:40:04.995256 7f707cd62700  0 log [ERR] : scrub 3.1 
> >> >> > bccad701/rb.0.15c26.238e1f29.9a

Re: [ceph-users] scrub error: found clone without head

2013-05-23 Thread Olivier Bonvalet
Not yet. I keep it for now.

On Wednesday, 22 May 2013 at 15:50 -0700, Samuel Just wrote:
> rb.0.15c26.238e1f29
> 
> Has that rbd volume been removed?
> -Sam
> 
> On Wed, May 22, 2013 at 12:18 PM, Olivier Bonvalet  
> wrote:
> > 0.61-11-g3b94f03 (0.61-1.1), but the bug occured with bobtail.
> >
> >
> > Le mercredi 22 mai 2013 à 12:00 -0700, Samuel Just a écrit :
> >> What version are you running?
> >> -Sam
> >>
> >> On Wed, May 22, 2013 at 11:25 AM, Olivier Bonvalet  
> >> wrote:
> >> > Is it enough ?
> >> >
> >> > # tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found clone 
> >> > without head'
> >> > 2013-05-22 15:43:09.308352 7f707dd64700  0 log [INF] : 9.105 scrub ok
> >> > 2013-05-22 15:44:21.054893 7f707dd64700  0 log [INF] : 9.451 scrub ok
> >> > 2013-05-22 15:44:52.898784 7f707cd62700  0 log [INF] : 9.784 scrub ok
> >> > 2013-05-22 15:47:43.148515 7f707cd62700  0 log [INF] : 9.3c3 scrub ok
> >> > 2013-05-22 15:47:45.717085 7f707dd64700  0 log [INF] : 9.3d0 scrub ok
> >> > 2013-05-22 15:52:14.573815 7f707dd64700  0 log [ERR] : scrub 3.6b 
> >> > ade3c16b/rb.0.15c26.238e1f29.9221/12d7//3 found clone without 
> >> > head
> >> > 2013-05-22 15:55:07.230114 7f707d563700  0 log [ERR] : scrub 3.6b 
> >> > 261cc0eb/rb.0.15c26.238e1f29.3671/12d7//3 found clone without 
> >> > head
> >> > 2013-05-22 15:56:56.456242 7f707d563700  0 log [ERR] : scrub 3.6b 
> >> > b10deaeb/rb.0.15c26.238e1f29.86a2/12d7//3 found clone without 
> >> > head
> >> > 2013-05-22 15:57:51.667085 7f707dd64700  0 log [ERR] : 3.6b scrub 3 
> >> > errors
> >> > 2013-05-22 15:57:55.241224 7f707dd64700  0 log [INF] : 9.450 scrub ok
> >> > 2013-05-22 15:57:59.800383 7f707cd62700  0 log [INF] : 9.465 scrub ok
> >> > 2013-05-22 15:59:55.024065 7f707661a700  0 -- 192.168.42.3:6803/12142 >> 
> >> > 192.168.42.5:6828/31490 pipe(0x2a689000 sd=108 :6803 s=2 pgs=200652 
> >> > cs=73 l=0).fault with nothing to send, going to standby
> >> > 2013-05-22 16:01:45.542579 7f7022770700  0 -- 192.168.42.3:6803/12142 >> 
> >> > 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=0 pgs=0 cs=0 
> >> > l=0).accept connect_seq 74 vs existing 73 state standby
> >> > --
> >> > 2013-05-22 16:29:49.544310 7f707dd64700  0 log [INF] : 9.4eb scrub ok
> >> > 2013-05-22 16:29:53.190233 7f707dd64700  0 log [INF] : 9.4f4 scrub ok
> >> > 2013-05-22 16:29:59.478736 7f707dd64700  0 log [INF] : 8.6bb scrub ok
> >> > 2013-05-22 16:35:12.240246 7f7022770700  0 -- 192.168.42.3:6803/12142 >> 
> >> > 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=2 pgs=200667 cs=75 
> >> > l=0).fault with nothing to send, going to standby
> >> > 2013-05-22 16:35:19.519019 7f707d563700  0 log [INF] : 8.700 scrub ok
> >> > 2013-05-22 16:39:15.422532 7f707dd64700  0 log [ERR] : scrub 3.1 
> >> > b1869301/rb.0.15c26.238e1f29.0836/12d7//3 found clone without 
> >> > head
> >> > 2013-05-22 16:40:04.995256 7f707cd62700  0 log [ERR] : scrub 3.1 
> >> > bccad701/rb.0.15c26.238e1f29.9a00/12d7//3 found clone without 
> >> > head
> >> > 2013-05-22 16:41:07.008717 7f707d563700  0 log [ERR] : scrub 3.1 
> >> > 8a9bec01/rb.0.15c26.238e1f29.9820/12d7//3 found clone without 
> >> > head
> >> > 2013-05-22 16:41:42.460280 7f707c561700  0 log [ERR] : 3.1 scrub 3 errors
> >> > 2013-05-22 16:46:12.385678 7f7077735700  0 -- 192.168.42.3:6803/12142 >> 
> >> > 192.168.42.5:6828/31490 pipe(0x2a689c80 sd=137 :6803 s=0 pgs=0 cs=0 
> >> > l=0).accept connect_seq 76 vs existing 75 state standby
> >> > 2013-05-22 16:58:36.079010 7f707661a700  0 -- 192.168.42.3:6803/12142 >> 
> >> > 192.168.42.3:6801/11745 pipe(0x2a689a00 sd=44 :6803 s=0 pgs=0 cs=0 
> >> > l=0).accept connect_seq 40 vs existing 39 state standby
> >> > 2013-05-22 16:58:36.798038 7f707d563700  0 log [INF] : 9.50c scrub ok
> >> > 2013-05-22 16:58:40.104159 7f707c561700  0 log [INF] : 9.526 scrub ok
> >> >
> >> >
> >> > Note : I have 8 scrub errors like that, on 4 impacted PG, and all 
> >> > impacted objects are about the same RBD image (rb.0.15c26.238e1f29).
> >> >
> >> >
> >> >
> >> > Le mercredi 22 mai 2013 à 11:01 -0700, Samuel Just

Re: [ceph-users] scrub error: found clone without head

2013-05-22 Thread Olivier Bonvalet
0.61-11-g3b94f03 (0.61-1.1), but the bug occurred with bobtail.


On Wednesday, 22 May 2013 at 12:00 -0700, Samuel Just wrote:
> What version are you running?
> -Sam
> 
> On Wed, May 22, 2013 at 11:25 AM, Olivier Bonvalet  
> wrote:
> > Is it enough ?
> >
> > # tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found clone 
> > without head'
> > 2013-05-22 15:43:09.308352 7f707dd64700  0 log [INF] : 9.105 scrub ok
> > 2013-05-22 15:44:21.054893 7f707dd64700  0 log [INF] : 9.451 scrub ok
> > 2013-05-22 15:44:52.898784 7f707cd62700  0 log [INF] : 9.784 scrub ok
> > 2013-05-22 15:47:43.148515 7f707cd62700  0 log [INF] : 9.3c3 scrub ok
> > 2013-05-22 15:47:45.717085 7f707dd64700  0 log [INF] : 9.3d0 scrub ok
> > 2013-05-22 15:52:14.573815 7f707dd64700  0 log [ERR] : scrub 3.6b 
> > ade3c16b/rb.0.15c26.238e1f29.9221/12d7//3 found clone without head
> > 2013-05-22 15:55:07.230114 7f707d563700  0 log [ERR] : scrub 3.6b 
> > 261cc0eb/rb.0.15c26.238e1f29.3671/12d7//3 found clone without head
> > 2013-05-22 15:56:56.456242 7f707d563700  0 log [ERR] : scrub 3.6b 
> > b10deaeb/rb.0.15c26.238e1f29.86a2/12d7//3 found clone without head
> > 2013-05-22 15:57:51.667085 7f707dd64700  0 log [ERR] : 3.6b scrub 3 errors
> > 2013-05-22 15:57:55.241224 7f707dd64700  0 log [INF] : 9.450 scrub ok
> > 2013-05-22 15:57:59.800383 7f707cd62700  0 log [INF] : 9.465 scrub ok
> > 2013-05-22 15:59:55.024065 7f707661a700  0 -- 192.168.42.3:6803/12142 >> 
> > 192.168.42.5:6828/31490 pipe(0x2a689000 sd=108 :6803 s=2 pgs=200652 cs=73 
> > l=0).fault with nothing to send, going to standby
> > 2013-05-22 16:01:45.542579 7f7022770700  0 -- 192.168.42.3:6803/12142 >> 
> > 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=0 pgs=0 cs=0 
> > l=0).accept connect_seq 74 vs existing 73 state standby
> > --
> > 2013-05-22 16:29:49.544310 7f707dd64700  0 log [INF] : 9.4eb scrub ok
> > 2013-05-22 16:29:53.190233 7f707dd64700  0 log [INF] : 9.4f4 scrub ok
> > 2013-05-22 16:29:59.478736 7f707dd64700  0 log [INF] : 8.6bb scrub ok
> > 2013-05-22 16:35:12.240246 7f7022770700  0 -- 192.168.42.3:6803/12142 >> 
> > 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=2 pgs=200667 cs=75 
> > l=0).fault with nothing to send, going to standby
> > 2013-05-22 16:35:19.519019 7f707d563700  0 log [INF] : 8.700 scrub ok
> > 2013-05-22 16:39:15.422532 7f707dd64700  0 log [ERR] : scrub 3.1 
> > b1869301/rb.0.15c26.238e1f29.0836/12d7//3 found clone without head
> > 2013-05-22 16:40:04.995256 7f707cd62700  0 log [ERR] : scrub 3.1 
> > bccad701/rb.0.15c26.238e1f29.9a00/12d7//3 found clone without head
> > 2013-05-22 16:41:07.008717 7f707d563700  0 log [ERR] : scrub 3.1 
> > 8a9bec01/rb.0.15c26.238e1f29.9820/12d7//3 found clone without head
> > 2013-05-22 16:41:42.460280 7f707c561700  0 log [ERR] : 3.1 scrub 3 errors
> > 2013-05-22 16:46:12.385678 7f7077735700  0 -- 192.168.42.3:6803/12142 >> 
> > 192.168.42.5:6828/31490 pipe(0x2a689c80 sd=137 :6803 s=0 pgs=0 cs=0 
> > l=0).accept connect_seq 76 vs existing 75 state standby
> > 2013-05-22 16:58:36.079010 7f707661a700  0 -- 192.168.42.3:6803/12142 >> 
> > 192.168.42.3:6801/11745 pipe(0x2a689a00 sd=44 :6803 s=0 pgs=0 cs=0 
> > l=0).accept connect_seq 40 vs existing 39 state standby
> > 2013-05-22 16:58:36.798038 7f707d563700  0 log [INF] : 9.50c scrub ok
> > 2013-05-22 16:58:40.104159 7f707c561700  0 log [INF] : 9.526 scrub ok
> >
> >
> > Note : I have 8 scrub errors like that, on 4 impacted PG, and all impacted 
> > objects are about the same RBD image (rb.0.15c26.238e1f29).
> >
> >
> >
> > Le mercredi 22 mai 2013 à 11:01 -0700, Samuel Just a écrit :
> >> Can you post your ceph.log with the period including all of these errors?
> >> -Sam
> >>
> >> On Wed, May 22, 2013 at 5:39 AM, Dzianis Kahanovich
> >>  wrote:
> >> > Olivier Bonvalet пишет:
> >> >>
> >> >> Le lundi 20 mai 2013 à 00:06 +0200, Olivier Bonvalet a écrit :
> >> >>> Le mardi 07 mai 2013 à 15:51 +0300, Dzianis Kahanovich a écrit :
> >> >>>> I have 4 scrub errors (3 PGs - "found clone without head"), on one 
> >> >>>> OSD. Not
> >> >>>> repairing. How to repair it exclude re-creating of OSD?
> >> >>>>
> >> >>>> Now it "easy" to clean+create OSD, but in theory - in case there are 
> >> >>>> multiple
> >> >>>> OSDs - it may cause data lost.
> >> >>&

Re: [ceph-users] scrub error: found clone without head

2013-05-22 Thread Olivier Bonvalet
Is that enough?

# tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found clone without 
head'
2013-05-22 15:43:09.308352 7f707dd64700  0 log [INF] : 9.105 scrub ok
2013-05-22 15:44:21.054893 7f707dd64700  0 log [INF] : 9.451 scrub ok
2013-05-22 15:44:52.898784 7f707cd62700  0 log [INF] : 9.784 scrub ok
2013-05-22 15:47:43.148515 7f707cd62700  0 log [INF] : 9.3c3 scrub ok
2013-05-22 15:47:45.717085 7f707dd64700  0 log [INF] : 9.3d0 scrub ok
2013-05-22 15:52:14.573815 7f707dd64700  0 log [ERR] : scrub 3.6b 
ade3c16b/rb.0.15c26.238e1f29.9221/12d7//3 found clone without head
2013-05-22 15:55:07.230114 7f707d563700  0 log [ERR] : scrub 3.6b 
261cc0eb/rb.0.15c26.238e1f29.3671/12d7//3 found clone without head
2013-05-22 15:56:56.456242 7f707d563700  0 log [ERR] : scrub 3.6b 
b10deaeb/rb.0.15c26.238e1f29.86a2/12d7//3 found clone without head
2013-05-22 15:57:51.667085 7f707dd64700  0 log [ERR] : 3.6b scrub 3 errors
2013-05-22 15:57:55.241224 7f707dd64700  0 log [INF] : 9.450 scrub ok
2013-05-22 15:57:59.800383 7f707cd62700  0 log [INF] : 9.465 scrub ok
2013-05-22 15:59:55.024065 7f707661a700  0 -- 192.168.42.3:6803/12142 >> 
192.168.42.5:6828/31490 pipe(0x2a689000 sd=108 :6803 s=2 pgs=200652 cs=73 
l=0).fault with nothing to send, going to standby
2013-05-22 16:01:45.542579 7f7022770700  0 -- 192.168.42.3:6803/12142 >> 
192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=0 pgs=0 cs=0 l=0).accept 
connect_seq 74 vs existing 73 state standby
--
2013-05-22 16:29:49.544310 7f707dd64700  0 log [INF] : 9.4eb scrub ok
2013-05-22 16:29:53.190233 7f707dd64700  0 log [INF] : 9.4f4 scrub ok
2013-05-22 16:29:59.478736 7f707dd64700  0 log [INF] : 8.6bb scrub ok
2013-05-22 16:35:12.240246 7f7022770700  0 -- 192.168.42.3:6803/12142 >> 
192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=2 pgs=200667 cs=75 
l=0).fault with nothing to send, going to standby
2013-05-22 16:35:19.519019 7f707d563700  0 log [INF] : 8.700 scrub ok
2013-05-22 16:39:15.422532 7f707dd64700  0 log [ERR] : scrub 3.1 
b1869301/rb.0.15c26.238e1f29.0836/12d7//3 found clone without head
2013-05-22 16:40:04.995256 7f707cd62700  0 log [ERR] : scrub 3.1 
bccad701/rb.0.15c26.238e1f29.9a00/12d7//3 found clone without head
2013-05-22 16:41:07.008717 7f707d563700  0 log [ERR] : scrub 3.1 
8a9bec01/rb.0.15c26.238e1f29.9820/12d7//3 found clone without head
2013-05-22 16:41:42.460280 7f707c561700  0 log [ERR] : 3.1 scrub 3 errors
2013-05-22 16:46:12.385678 7f7077735700  0 -- 192.168.42.3:6803/12142 >> 
192.168.42.5:6828/31490 pipe(0x2a689c80 sd=137 :6803 s=0 pgs=0 cs=0 l=0).accept 
connect_seq 76 vs existing 75 state standby
2013-05-22 16:58:36.079010 7f707661a700  0 -- 192.168.42.3:6803/12142 >> 
192.168.42.3:6801/11745 pipe(0x2a689a00 sd=44 :6803 s=0 pgs=0 cs=0 l=0).accept 
connect_seq 40 vs existing 39 state standby
2013-05-22 16:58:36.798038 7f707d563700  0 log [INF] : 9.50c scrub ok
2013-05-22 16:58:40.104159 7f707c561700  0 log [INF] : 9.526 scrub ok


Note: I have 8 scrub errors like that, across 4 impacted PGs, and all impacted
objects belong to the same RBD image (rb.0.15c26.238e1f29).



On Wednesday, 22 May 2013 at 11:01 -0700, Samuel Just wrote:
> Can you post your ceph.log with the period including all of these errors?
> -Sam
> 
> On Wed, May 22, 2013 at 5:39 AM, Dzianis Kahanovich
>  wrote:
> > Olivier Bonvalet пишет:
> >>
> >> Le lundi 20 mai 2013 à 00:06 +0200, Olivier Bonvalet a écrit :
> >>> Le mardi 07 mai 2013 à 15:51 +0300, Dzianis Kahanovich a écrit :
> >>>> I have 4 scrub errors (3 PGs - "found clone without head"), on one OSD. 
> >>>> Not
> >>>> repairing. How to repair it exclude re-creating of OSD?
> >>>>
> >>>> Now it "easy" to clean+create OSD, but in theory - in case there are 
> >>>> multiple
> >>>> OSDs - it may cause data lost.
> >>>
> >>> I have same problem : 8 objects (4 PG) with error "found clone without
> >>> head". How can I fix that ?
> >> since "pg repair" doesn't handle that kind of errors, is there a way to
> >> manually fix that ? (it's a production cluster)
> >
> > Trying to fix manually I cause assertions in trimming process (died OSD). 
> > And
> > many others troubles. So, if you want to keep cluster running, wait for
> > developers answer. IMHO.
> >
> > About manual repair attempt: see issue #4937. Also similar results - in 
> > subject
> > "Inconsistent PG's, repair ineffective".
> >
> > --
> > WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error: found clone without head

2013-05-22 Thread Olivier Bonvalet

On Monday, 20 May 2013 at 00:06 +0200, Olivier Bonvalet wrote:
> Le mardi 07 mai 2013 à 15:51 +0300, Dzianis Kahanovich a écrit :
> > I have 4 scrub errors (3 PGs - "found clone without head"), on one OSD. Not
> > repairing. How to repair it exclude re-creating of OSD?
> > 
> > Now it "easy" to clean+create OSD, but in theory - in case there are 
> > multiple
> > OSDs - it may cause data lost.
> > 
> > -- 
> > WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 
> Hi,
> 
> I have same problem : 8 objects (4 PG) with error "found clone without
> head". How can I fix that ?
> 
> thanks,
> Olivier
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi,

since "pg repair" doesn't handle that kind of errors, is there a way to
manually fix that ? (it's a production cluster)

thanks in advance,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error: found clone without head

2013-05-20 Thread Olivier Bonvalet
Great, thanks. I will follow this issue and add information if needed.

On Monday, 20 May 2013 at 17:22 +0300, Dzianis Kahanovich wrote:
> http://tracker.ceph.com/issues/4937
> 
> For me it progressed up to ceph reinstall with repair data from backup (I help
> ceph die, but it was IMHO self-provocation for force reinstall). Now (at least
> to my summer outdoors) I keep v0.62 (3 nodes) with every pool size=3 
> min_size=2
> (was - size=2 min_size=1).
> 
> But try to do nothing first and try to install latest version. And keep your
> vote to issue #4937 to force developers.
> 
> Olivier Bonvalet пишет:
> > Le mardi 07 mai 2013 à 15:51 +0300, Dzianis Kahanovich a écrit :
> >> I have 4 scrub errors (3 PGs - "found clone without head"), on one OSD. Not
> >> repairing. How to repair it exclude re-creating of OSD?
> >>
> >> Now it "easy" to clean+create OSD, but in theory - in case there are 
> >> multiple
> >> OSDs - it may cause data lost.
> >>
> >> -- 
> >> WBR, Dzianis Kahanovich AKA Denis Kaganovich, 
> >> http://mahatma.bspu.unibel.by/
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > 
> > 
> > Hi,
> > 
> > I have same problem : 8 objects (4 PG) with error "found clone without
> > head". How can I fix that ?
> > 
> > thanks,
> > Olivier
> > 
> > 
> > 
> 
> 
> -- 
> WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub error: found clone without head

2013-05-19 Thread Olivier Bonvalet
On Tuesday, 7 May 2013 at 15:51 +0300, Dzianis Kahanovich wrote:
> I have 4 scrub errors (3 PGs - "found clone without head"), on one OSD. Not
> repairing. How to repair it exclude re-creating of OSD?
> 
> Now it "easy" to clean+create OSD, but in theory - in case there are multiple
> OSDs - it may cause data lost.
> 
> -- 
> WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Hi,

I have the same problem: 8 objects (4 PGs) with the error "found clone without
head". How can I fix that?

thanks,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG down & incomplete

2013-05-19 Thread Olivier Bonvalet
From what I read, one solution could be "ceph pg force_create_pg", but
if I understand correctly it will recreate the whole PG as an empty one.

In my case I would like to only create the missing objects (empty, of
course, since the data is lost), so that I no longer have I/O blocked
"waiting for missing object".
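
For the record, the commands I am looking at (a sketch only; I am not sure
either of them really applies to "incomplete" PGs):

ceph pg 8.71d query                       # inspect one of the incomplete PGs
ceph pg 8.71d mark_unfound_lost revert    # give up on unfound objects, if the PG ever reports some
ceph pg force_create_pg 8.71d             # last resort: recreate the whole PG as an empty one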


On Friday, 17 May 2013 at 23:37 +0200, Olivier Bonvalet wrote:
> Yes, osd.10 is near full because of bad data repartition (not enought PG
> I suppose), and the difficulty to remove snapshot without overloading
> the cluster.
> 
> The problem on osd.25 was a crash during scrub... I tried to reweight
> it, and set it out, without any success. I have added some OSD too.
> 
> Logs from my emails «scrub shutdown the OSD process» (the 15th april) :
> 
> 
>  ...
> 
> 
> 
> 
> But now, when I start the osd.25, I obtain :
> 

>  ...
> 
> 
> 
> 
> 
> 
> Le vendredi 17 mai 2013 à 11:36 -0700, John Wilkins a écrit :
> > Another thing... since your osd.10 is near full, your cluster may be
> > fairly close to capacity for the purposes of rebalancing.  Have a look
> > at:
> > 
> > http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
> > http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#no-free-drive-space
> > 
> > Maybe we can get some others to look at this.  It's not clear to me
> > why the other OSD crashes after you take osd.25 out. It could be
> > capacity, but that shouldn't make it crash. Have you tried adding more
> > OSDs to increase capacity?
> > 
> > 
> > 
> > On Fri, May 17, 2013 at 11:27 AM, John Wilkins  
> > wrote:
> > > It looks like you have the "noout" flag set:
> > >
> > > "noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
> > >monmap e7: 5 mons at
> > > {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0},
> > > election epoch 2584, quorum 0,1,2,3 a,b,c,e
> > >osdmap e82502: 50 osds: 48 up, 48 in"
> > >
> > > http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing
> > >
> > > If you have down OSDs that don't get marked out, that would certainly
> > > cause problems. Have you tried restarting the failed OSDs?
> > >
> > > What do the logs look like for osd.15 and osd.25?
> > >
> > > On Fri, May 17, 2013 at 1:31 AM, Olivier Bonvalet  
> > > wrote:
> > >> Hi,
> > >>
> > >> thanks for your answer. In fact I have several different problems, which
> > >> I tried to solve separatly :
> > >>
> > >> 1) I loose 2 OSD, and some pools have only 2 replicas. So some data was
> > >> lost.
> > >> 2) One monitor refuse the Cuttlefish upgrade, so I only have 4 of 5
> > >> monitors running.
> > >> 3) I have 4 old inconsistent PG that I can't repair.
> > >>
> > >>
> > >> So the status :
> > >>
> > >>health HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck
> > >> inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors;
> > >> noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
> > >>monmap e7: 5 mons at
> > >> {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0},
> > >>  election epoch 2584, quorum 0,1,2,3 a,b,c,e
> > >>osdmap e82502: 50 osds: 48 up, 48 in
> > >> pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean
> > >> +scrubbing, 15 incomplete, 4 active+clean+inconsistent, 1 active+clean
> > >> +scrubbing+deep; 5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail;
> > >> 137KB/s rd, 1852KB/s wr, 199op/s
> > >>mdsmap e1: 0/0/1 up
> > >>
> > >>
> > >>
> > >> The tree :
> > >>
> > >> # idweight  type name   up/down reweight
> > >> -8  14.26   root SSDroot
> > >> -27 8   datacenter SSDrbx2
> > >> -26 8   room SSDs25
> > >> -25 8   net SSD188-165-12
> > >> -24 8   rack SSD25B09
> > >> -23 8   host lyll
> > >> 46  2   osd.46  
> > >> up  1
> > >> 47  2   

Re: [ceph-users] PG down & incomplete

2013-05-17 Thread Olivier Bonvalet
Yes, I set the "noout" flag to avoid the auto balancing of the osd.25,
which will crash all OSD of this host (already tried several times).
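
(For the record, the flag is just toggled with:)

ceph osd set noout      # keep down OSDs from being marked out, so no rebalancing starts
ceph osd unset noout    # back to normal once the host is fixed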

On Friday, 17 May 2013 at 11:27 -0700, John Wilkins wrote:
> It looks like you have the "noout" flag set:
> 
> "noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
>monmap e7: 5 mons at
> {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0},
> election epoch 2584, quorum 0,1,2,3 a,b,c,e
>osdmap e82502: 50 osds: 48 up, 48 in"
> 
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing
> 
> If you have down OSDs that don't get marked out, that would certainly
> cause problems. Have you tried restarting the failed OSDs?
> 
> What do the logs look like for osd.15 and osd.25?
> 
> On Fri, May 17, 2013 at 1:31 AM, Olivier Bonvalet  wrote:
> > Hi,
> >
> > thanks for your answer. In fact I have several different problems, which
> > I tried to solve separatly :
> >
> > 1) I loose 2 OSD, and some pools have only 2 replicas. So some data was
> > lost.
> > 2) One monitor refuse the Cuttlefish upgrade, so I only have 4 of 5
> > monitors running.
> > 3) I have 4 old inconsistent PG that I can't repair.
> >
> >
> > So the status :
> >
> >health HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck
> > inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors;
> > noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
> >monmap e7: 5 mons at
> > {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0},
> >  election epoch 2584, quorum 0,1,2,3 a,b,c,e
> >osdmap e82502: 50 osds: 48 up, 48 in
> > pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean
> > +scrubbing, 15 incomplete, 4 active+clean+inconsistent, 1 active+clean
> > +scrubbing+deep; 5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail;
> > 137KB/s rd, 1852KB/s wr, 199op/s
> >mdsmap e1: 0/0/1 up
> >
> >
> >
> > The tree :
> >
> > # idweight  type name   up/down reweight
> > -8  14.26   root SSDroot
> > -27 8   datacenter SSDrbx2
> > -26 8   room SSDs25
> > -25 8   net SSD188-165-12
> > -24 8   rack SSD25B09
> > -23 8   host lyll
> > 46  2   osd.46  up  
> > 1
> > 47  2   osd.47  up  
> > 1
> > 48  2   osd.48  up  
> > 1
> > 49  2   osd.49  up  
> > 1
> > -10 4.26datacenter SSDrbx3
> > -12 2   room SSDs43
> > -13 2   net SSD178-33-122
> > -16 2   rack SSD43S01
> > -17 2   host kaino
> > 42  1   osd.42  up  
> > 1
> > 43  1   osd.43  up  
> > 1
> > -22 2.26room SSDs45
> > -21 2.26net SSD5-135-138
> > -20 2.26rack SSD45F01
> > -19 2.26host taman
> > 44  1.13osd.44  up  
> > 1
> > 45  1.13osd.45  up  
> > 1
> > -9  2   datacenter SSDrbx4
> > -11 2   room SSDs52
> > -14 2   net SSD176-31-226
> > -15 2   rack SSD52B09
> > -18 2   host dragan
> > 40  1   osd.40  up  
> > 1
> > 41  1   osd.41  up  
> > 1
> > -1  33.43   root SASroot
> > -10015.9datacenter SASrbx1
> > -90 15.9room SASs15
> > -72 15.9net SAS188-165-15
> > -40 8   rack SAS15B01
> > -3  8

Re: [ceph-users] PG down & incomplete

2013-05-17 Thread Olivier Bonvalet
an since forever, current state incomplete, last
acting [19,30]
pg 8.71d is stuck unclean since forever, current state incomplete, last
acting [24,19]
pg 8.3fa is stuck unclean since forever, current state incomplete, last
acting [19,31]
pg 8.3e0 is stuck unclean since forever, current state incomplete, last
acting [31,19]
pg 8.56c is stuck unclean since forever, current state incomplete, last
acting [19,28]
pg 8.19f is stuck unclean since forever, current state incomplete, last
acting [31,19]
pg 8.792 is stuck unclean since forever, current state incomplete, last
acting [19,28]
pg 4.0 is stuck unclean since forever, current state incomplete, last
acting [28,19]
pg 8.78a is stuck unclean since forever, current state incomplete, last
acting [31,19]
pg 8.23e is stuck unclean since forever, current state incomplete, last
acting [32,13]
pg 8.2ff is stuck unclean since forever, current state incomplete, last
acting [6,19]
pg 8.5e2 is stuck unclean since forever, current state incomplete, last
acting [0,19]
pg 8.528 is stuck unclean since forever, current state incomplete, last
acting [31,19]
pg 8.20f is stuck unclean since forever, current state incomplete, last
acting [31,19]
pg 8.372 is stuck unclean since forever, current state incomplete, last
acting [19,24]
pg 8.792 is incomplete, acting [19,28]
pg 8.78a is incomplete, acting [31,19]
pg 8.71d is incomplete, acting [24,19]
pg 8.5e2 is incomplete, acting [0,19]
pg 8.56c is incomplete, acting [19,28]
pg 8.528 is incomplete, acting [31,19]
pg 8.3fa is incomplete, acting [19,31]
pg 8.3e0 is incomplete, acting [31,19]
pg 8.372 is incomplete, acting [19,24]
pg 8.2ff is incomplete, acting [6,19]
pg 8.23e is incomplete, acting [32,13]
pg 8.20f is incomplete, acting [31,19]
pg 8.19f is incomplete, acting [31,19]
pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 4.5c is incomplete, acting [19,30]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 4.0 is incomplete, acting [28,19]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]
osd.10 is near full at 85%
19 scrub errors
noout flag(s) set
mon.d (rank 4) addr 10.0.0.6:6789/0 is down (out of quorum)


Pools 4 and 8 have only 2 replicas, and pool 3 has 3 replicas but
inconsistent data.
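
(Going forward I want to raise the replica count on those pools; I assume it
is just the usual commands, with placeholders for the real pool names behind
ids 4 and 8:)

ceph osd pool set <poolname> size 3        # 3 replicas
ceph osd pool set <poolname> min_size 2    # keep serving I/O with 2 of them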

Thanks in advance.

On Friday, 17 May 2013 at 00:14 -0700, John Wilkins wrote:
> If you can follow the documentation here:
> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/  and
> http://ceph.com/docs/master/rados/troubleshooting/  to provide some
> additional information, we may be better able to help you.
> 
> For example, "ceph osd tree" would help us understand the status of
> your cluster a bit better.
> 
> On Thu, May 16, 2013 at 10:32 PM, Olivier Bonvalet  
> wrote:
> > Le mercredi 15 mai 2013 à 00:15 +0200, Olivier Bonvalet a écrit :
> >> Hi,
> >>
> >> I have some PG in state down and/or incomplete on my cluster, because I
> >> loose 2 OSD and a pool was having only 2 replicas. So of course that
> >> data is lost.
> >>
> >> My problem now is that I can't retreive a "HEALTH_OK" status : if I try
> >> to remove, read or overwrite the corresponding RBD images, near all OSD
> >> hang (well... they don't do anything and requests stay in a growing
> >> queue, until the production will be done).
> >>
> >> So, what can I do to remove that corrupts images ?
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > Up. Nobody can help me on that problem ?
> >
> > Thanks,
> >
> > Olivier
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> John Wilkins
> Senior Technical Writer
> Intank
> john.wilk...@inktank.com
> (415) 425-9599
> http://inktank.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG down & incomplete

2013-05-16 Thread Olivier Bonvalet
On Wednesday, 15 May 2013 at 00:15 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> I have some PG in state down and/or incomplete on my cluster, because I
> loose 2 OSD and a pool was having only 2 replicas. So of course that
> data is lost.
> 
> My problem now is that I can't retreive a "HEALTH_OK" status : if I try
> to remove, read or overwrite the corresponding RBD images, near all OSD
> hang (well... they don't do anything and requests stay in a growing
> queue, until the production will be done).
> 
> So, what can I do to remove that corrupts images ?
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

Bump. Can nobody help me with this problem?

Thanks,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG down & incomplete

2013-05-14 Thread Olivier Bonvalet
Hi,

I have some PGs in state down and/or incomplete on my cluster, because I
lost 2 OSDs and a pool had only 2 replicas. So of course that data is
lost.

My problem now is that I can't get back to a "HEALTH_OK" status: if I try
to remove, read or overwrite the corresponding RBD images, nearly all OSDs
hang (well... they don't do anything, and requests pile up in a growing
queue until production grinds to a halt).

So, what can I do to remove those corrupted images?
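
(To know exactly which images are affected, I map the block name prefixes back
to the images; a rough sketch, assuming the pool is simply named "rbd":)

for img in $(rbd ls rbd) ; do echo -n "$img: " ; rbd info rbd/$img | grep block_name_prefix ; done

and then match the printed prefixes against the objects the blocked requests
are waiting on.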

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-12 Thread Olivier Bonvalet
On Friday, 10 May 2013 at 19:16 +0200, Greg wrote:
> Hello folks,
> 
> I'm in the process of testing CEPH and RBD, I have set up a small 
> cluster of  hosts running each a MON and an OSD with both journal and 
> data on the same SSD (ok this is stupid but this is simple to verify the 
> disks are not the bottleneck for 1 client). All nodes are connected on a 
> 1Gb network (no dedicated network for OSDs, shame on me :).
> 
> Summary : the RBD performance is poor compared to benchmark
> 
> A 5 seconds seq read benchmark shows something like this :
> >sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >  0   0 0 0 0 0 - 0
> >  1  163923   91.958692 0.966117  0.431249
> >  2  166448   95.9602   100 0.513435   0.53849
> >  3  169074   98.6317   104 0.25631   0.55494
> >  4  119584   83.973540 1.80038   0.58712
> >  Total time run:4.165747
> > Total reads made: 95
> > Read size:4194304
> > Bandwidth (MB/sec):91.220
> >
> > Average Latency:   0.678901
> > Max latency:   1.80038
> > Min latency:   0.104719
> 
> 91MB read performance, quite good !
> 
> Now the RBD performance :
> > root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
> > 100+0 records in
> > 100+0 records out
> > 419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s
> 
> There is a 3x performance factor (same for write: ~60M benchmark, ~20M 
> dd on block device)
> 
> The network is ok, the CPU is also ok on all OSDs.
> CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some 
> patches for the SoC being used)
> 
> Can you show me the starting point for digging into this ?
> 
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

You should try increasing the read-ahead to 512 KB instead of the default
128 KB (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference
on reads with that.
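
For example, on the client (values in KB; this is just a sketch):

echo 512 > /sys/block/rbd1/queue/read_ahead_kb                             # for the device tested above
for f in /sys/block/rbd*/queue/read_ahead_kb ; do echo 512 > "$f" ; done   # or for all mapped RBDs

To make it persistent across mappings, a udev rule along the lines of
ACTION=="add", KERNEL=="rbd[0-9]*", ATTR{queue/read_ahead_kb}="512" should
work, but I have not verified that exact rule.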



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub shutdown the OSD process / data loss

2013-04-22 Thread Olivier Bonvalet
On Saturday, 20 April 2013 at 09:10 +0200, Olivier Bonvalet wrote:
> Le mercredi 17 avril 2013 à 20:52 +0200, Olivier Bonvalet a écrit :
> > What I didn't understand is why the OSD process crash, instead of
> > marking that PG "corrupted", and does that PG really "corrupted" are
> > is
> > this just an OSD bug ?
> 
> Once again, a bit more informations : by searching informations about
> one of this faulty PG (3.d), I found that :
> 
>   -592> 2013-04-20 08:31:56.838280 7f0f41d1b700  0 log [ERR] : 3.d osd.25 
> inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.4603/12d7//3 
> found  expected 12d7
>   -591> 2013-04-20 08:31:56.838284 7f0f41d1b700  0 log [ERR] : 3.d osd.4 
> inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.4603/12d7//3 
> found  expected 12d7
>   -590> 2013-04-20 08:31:56.838290 7f0f41d1b700  0 log [ERR] : 3.d osd.4: 
> soid a8620b0d/rb.0.15c26.238e1f29.4603/12d7//3 size 4194304 != known 
> size 0
>   -589> 2013-04-20 08:31:56.838292 7f0f41d1b700  0 log [ERR] : 3.d osd.11 
> inconsistent snapcolls on a8620b0d/rb.0.15c26.238e1f29.4603/12d7//3 
> found  expected 12d7
>   -588> 2013-04-20 08:31:56.838294 7f0f41d1b700  0 log [ERR] : 3.d osd.11: 
> soid a8620b0d/rb.0.15c26.238e1f29.4603/12d7//3 size 4194304 != known 
> size 0
>   -587> 2013-04-20 08:31:56.838395 7f0f41d1b700  0 log [ERR] : scrub 3.d 
> a8620b0d/rb.0.15c26.238e1f29.4603/12d7//3 on disk size (0) does not 
> match object info size (4194304)
> 
> 
> I prefered to verify, so I found that :
> 
> # md5sum 
> /var/lib/ceph/osd/ceph-*/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.4603__12d7_A8620B0D__3
> 217ac2518dfe9e1502e5bfedb8be29b8  
> /var/lib/ceph/osd/ceph-4/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.4603__12d7_A8620B0D__3
>  (4MB)
> 217ac2518dfe9e1502e5bfedb8be29b8  
> /var/lib/ceph/osd/ceph-11/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.4603__12d7_A8620B0D__3
>  (4MB)
> d41d8cd98f00b204e9800998ecf8427e  
> /var/lib/ceph/osd/ceph-25/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.4603__12d7_A8620B0D__3
>  (0B)
> 
> 
> So this object is identical on OSD 4 and 11, but is empty on OSD 25.
> Since 4 is the master, this should not be a problem, so I try a repair,
> without any success :
> ceph pg repair 3.d
> 
> 
> Is there a way to force rewrite of this replica ?
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


I don't know for certain, but I am seeing data loss on my cluster on
multiple RBD images (corrupted filesystems, a database and some empty files).

I suppose it is related.
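
If it comes down to forcing a rewrite of the bad replica, the manual approach
I have in mind is (a sketch only, at my own risk: move the zero-length copy out
of the PG directory on osd.25, restart it, then let a repair copy the object
back from the primary; the init commands depend on the distribution):

/etc/init.d/ceph stop osd.25     # on the host carrying osd.25
mv /var/lib/ceph/osd/ceph-25/current/3.d_head/DIR_D/DIR_0/DIR_B/DIR_0/rb.0.15c26.238e1f29.4603__12d7_A8620B0D__3 /root/
/etc/init.d/ceph start osd.25
ceph pg repair 3.d               # then ask for a repair of the PG again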

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

