Re: [ceph-users] librbd on opensolaris/illumos

2016-03-30 Thread Sumit Gaur
Thanks Gregory, for your clear response. Do you see any problem if a Ceph
cluster is used via a KVM zone in OpenSolaris?
I assume there is no issue installing the Ceph client in a KVM zone on
OpenSolaris.

-sumit

On Wed, Mar 30, 2016 at 2:47 AM, Gregory Farnum  wrote:

> On Mon, Mar 28, 2016 at 9:55 PM, Sumit Gaur  wrote:
> > Hello ,
> > Can anybody let me know if ceph team is working on porting of librbd on
> > openSolaris like it did for librados ?
>
> Nope, this isn't on anybody's roadmap.
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-30 Thread Ilya Dryomov
On Wed, Mar 30, 2016 at 3:03 AM, Jason Dillaman  wrote:
> Understood -- format 2 was promoted to the default image format starting with 
> Infernalis (which not all users would have played with since it isn't LTS).  
> The defaults can be overridden via the command-line when creating new images 
> or via the Ceph configuration file.
>
> I'll let Ilya provide input on which kernels support image format 2, but from 
> a quick peek on GitHub it looks like support was added around the v3.8 
> timeframe.

Layering (i.e. format 2 with default striping parameters) is supported
starting with 3.10.  We don't really support older kernels - backports
are pretty much all 3.10+, etc.
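
For completeness, the default Jason mentions can be pinned either way; a minimal
sketch (image name and size are hypothetical):

$ rbd create --image-format 2 --size 10240 testimg

or, in ceph.conf under [client]:

rbd default format = 2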

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Stuck active+undersized+degraded+inconsistent

2016-03-30 Thread Christian Balzer

Hello,

On Tue, 29 Mar 2016 18:10:33 + Calvin Morrow wrote:

> Ceph cluster with 60 OSDs, Giant 0.87.2.  One of the OSDs failed due to a
> hardware error, however after normal recovery it seems stuck with
> one active+undersized+degraded+inconsistent pg.
>
Any reason (other than inertia, which I understand very well) you're
running a non LTS version that last saw bug fixes a year ago?
You may very well be facing a bug that has long been fixed even in Firefly,
let alone Hammer.

If so, hopefully one of the devs remembering it can pipe up.
 
> I haven't been able to get repair to happen using "ceph pg repair
> 12.28a"; I can see the activity logged in the mon logs, however the
> repair doesn't actually seem to happen in any of the actual osd logs.
> 
> I tried following Sebastien's instructions for manually locating the
> inconsistent object (
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/),
> however the md5sum from the objects both match, so I'm not quite sure how
> to proceed.
> 
Rolling a dice? ^o^
Do they have similar (identical really) timestamps as well?

> Any ideas on how to return to a healthy cluster?
> 
> [root@soi-ceph2 ceph]# ceph status
> cluster 6cc00165-4956-4947-8605-53ba51acd42b
>  health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023 pgs
> stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck undersized; 1023
> pgs undersized; recovery 132091/23742762 objects degraded (0.556%);
> 7745/23742762 objects misplaced (0.033%); 1 scrub errors
>  monmap e5: 3 mons at {soi-ceph1=
> 10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
> election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
>  osdmap e41120: 60 osds: 59 up, 59 in
>   pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
> 91295 GB used, 73500 GB / 160 TB avail
> 132091/23742762 objects degraded (0.556%); 7745/23742762
> objects misplaced (0.033%)
>60341 active+clean
>   76 active+remapped
> 1022 active+undersized+degraded
>1 active+undersized+degraded+inconsistent
>   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
> 
What's confusing to me in this picture are the stuck and unclean PGs as
well as degraded objects, it seems that recovery has stopped?

Something else that suggests a bug, or at least a stuck OSD.

> [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
> pg 12.28a is stuck unclean for 126274.215835, current state
> active+undersized+degraded+inconsistent, last acting [36,52]
> pg 12.28a is stuck undersized for 3499.099747, current state
> active+undersized+degraded+inconsistent, last acting [36,52]
> pg 12.28a is stuck degraded for 3499.107051, current state
> active+undersized+degraded+inconsistent, last acting [36,52]
> pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
> 
> [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
> ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700 -1
> log_channel(default) log [ERR] : 12.28a shard 20: soid
> c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
> candidate had a read error, digest 2029411064 != known digest 2692480864
  ^^   
That's the culprit, google for it. Of course the most promising looking
answer is behind the RH pay wall.

Looks like that disk has an issue, guess you're not seeing this on osd.52,
right?
Check osd.36's SMART status.
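
Something along these lines, assuming the data disk behind osd.36 is /dev/sdX
(adjust to your layout):

# smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'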

My guess is that you may have to set min_size to 1 and recover osd.36 as
well, but don't take my word for it.
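
Roughly (the pool name is a placeholder; remember to put it back afterwards):

# ceph osd pool set <poolname> min_size 1
... wait for recovery to finish ...
# ceph osd pool set <poolname> min_size 2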

Christian

> ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700 -1
> log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
> inconsistent objects
> ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700 -1
> log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
> 
> [root@soi-ceph2 ceph]# md5sum
> /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> \fb57b1f17421377bf2c35809f395e9b9
>  
> /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> 
> [root@soi-ceph3 ceph]# md5sum
> /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> \fb57b1f17421377bf2c35809f395e9b9
>  
> /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Redirect snapshot COW to alternative pool

2016-03-30 Thread Nick Fisk
> 
> > > > I think this is where I see slow performance. If you are doing
> > > > large IO, then copying 4MB objects (assuming defaults) is maybe
> > > > only 2x times the original IO to the disk. However if you are
> > > > doing smaller IO from what I can see a single 4kb write would lead
> > > > to a 4MB object being copied to the snapshot, with 3x replication,
> > > > this could be amplification in the thousands. Is my understanding
> > > > correct here, it's
> > > certainly what I see?
> > >
> > > The first write (4K or otherwise) to a recently snapshotted object
> > > will result in CoW to a new clone of the snapshotted object.
> > > Subsequent writes to the same object will not have the same penalty.
> > > In the parent/child image case, the first write to the child would
> > > also result in a full object CoW from the parent to the child.
> >
> > The IO can sometimes can be fairly random depending on the changed
> > blocks in the backup, but yes sequential writes are much less
> > effected. At the end of each backup it merges the oldest incremental
> > into the full, which is also very random. My tests were more worst
> > case, but I like to at least know what that limit is, so it doesn’t 
> > surprise you
> late night on a Friday evening.
> > :-)
> >
> 
> Makes sense that sequential would be less affected as compared to random.
> Are you snapshotting all images in parallel or are you doing the backup in
> batches?  Note that snap removal does have some cost as the snap trimmer
> process of the OSD needs to eventually clean up the objects associated with
> the deleted snapshot.

The snapshots are done one at a time and are fairly spaced out. 
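
On the object-size point further down: the object size is fixed per image at
creation time, so testing with 1MB objects amounts to something like this
sketch (image name is hypothetical; order 20 gives 1MB objects):

$ rbd create backuptest --size 102400 --order 20 --pool rbd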

> 
> > >
> > > > >With RBD layering, you do whole-object copy-on-write from the client.
> > > > > Doing it from the client does let you put "child" images inside
> > > > >of a faster  pool,  yes. But creating new objects doesn't make
> > > > >the *old* ones slow, so why do  you think there's still the same
> problem?
> > > > >(Other than "the pool is faster"
> > > > > being perhaps too optimistic about the improvement you'd get
> > > > >under this
> > > > > workload.)
> > > >
> > > > From reading the RBD layering docs it looked like you could also
> > > > specify a different object size for the target. If there was some
> > > > way that the snapshot could have a different object size or some
> > > > sort of dirty bitmap, then this would reduce the amount of data
> > > > that would have to copied on each write.
> > >
> > > Have you tried using a different object size for your RBD image?  I
> > > think your proposal is effectively the same as just reducing the
> > > object size (with the added overhead of a OSD<->client round-trip
> > > for CoW instead of handling it within the OSD directly).  The
> > > default 4MB object size was an attempt to strike a balance between
> > > the CoW cost and the number of objects the OSDs would have to
> > > manage.
> >
> > Yeah, this is something we are most likely going to have to do. It’s a
> > lot more performant with 1MB objects when using snapshots, but that
> > causes problems in other areas (backfilling, PG splitting...) and also
> > overall large IO performance seems slightly lower. Using 6TB disks
> > means that there is going to be a ton of objects as well. I was more
> > interested if there was going to be any enhancements done around
> > something like a bitmap where the COW would be more granular, but I
> > understand that this is probably quite a unique usage scenario.
> >
> > The other option might just be to start with a larger cluster, so this
> > snapshot COW stuff is a lower percentage of total performance.
> >
> > >
> > > >
> > > > What I meant about it slowing down the pool, is due to the extra
> > > > 4MB copy writes, the max small IO you can do is dramatically
> > > > reduced, as each small IO is now a 4MB IO. By shifting the COW to
> > > > a different pool you could reduce the load on the primary pool and
> > > > effect on primary workloads. You are effectively shifting this
> > > > snapshot "tax" onto an isolated set of disks/SSD's.
> > >
> > > Except eventually all your IO will be against the new "fast" pool as
> > > enough snapshotted objects have been CoW over to the new pool?
> >
> > That potentially could be a problem, but I was hoping that 8-10 cheap
> > SSD's should easily cover the required write bandwidth (infrequent so
> > can use low DWPD drives). The incoming write bandwidth for 1 customer
> > might only be about 10MB/s, but at points where the writes were
> > random, I was seeing this saturate a 48 disk cluster once a snapshot
> > was taken. Worse case this pool might slow down, but that could be
> > acceptable risk of doing a DR test, what we don't want is to slow down
> > everyone else backups. I guess QOS could also be another option here,
> which I know has been discussed.
> 
> If this is for DR, may I shamelessly plug the forthcoming RBD mirroring
> support in Jewel [1]?  ;-) All modif

[ceph-users] Error mon create-initial

2016-03-30 Thread Mohd Zainal Abidin Rabani

Keep getting this error:

[osd04][INFO  ] monitor: mon.osd04 is currently at the state of electing
[osd04][INFO  ] Running command: ceph --cluster=ceph --admin-daemon 
/var/run/ceph/ceph-mon.osd04.asok mon_status

[ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
The other nodes osd01, osd02 and osd03 show no error. Kernel and OS are the same.


-
Regards,
Mohd Zainal Abidin Rabani
Technical Support
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections

2016-03-30 Thread Dan van der Ster
Hi Sean,

Did you check that the process isn't hitting some ulimits? cat
/proc/`pidof radosgw`/limits and compare with the num processes/num
FDs in use.
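
For example (assuming a single radosgw process on the box):

cat /proc/$(pidof radosgw)/limits | grep -i 'open files'
ls /proc/$(pidof radosgw)/fd | wc -l    # FDs currently in use
ps -o nlwp= -p $(pidof radosgw)         # thread count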

Cheers, Dan


On Tue, Mar 29, 2016 at 8:35 PM, seapasu...@uchicago.edu
 wrote:
> So an update for anyone else having this issue. It looks like radosgw either
> has a memory leak or it spools the whole object into ram or something.
>
> root@kh11-9:/etc/apt/sources.list.d# free -m
>              total       used       free     shared    buffers     cached
> Mem:         64397      63775        621          0          3         46
> -/+ buffers/cache:      63725        671
> Swap:        65499      17630      47869
>
> root@kh11-9:/etc/apt/sources.list.d# ps faux | grep -iE "USE[R]|radosg[w]"
> USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> root  269910  134 95.2 90622120 62819128 ?   Ssl  12:31  79:37
> /usr/bin/radosgw --cluster=ceph --id rgw.kh11-9 -f
>
> The odd things are 1.) the disk is fine. 2.) the rest of the server seems
> very responsive. I can ssh into the server without any problems, curl out,
> wget, etc but radosgw is stuck in the mud
>
> This is after 150-300 wget requests to public objects, 2 radosgws freeze
> like this.  The cluster is health okay as well::
>
> root@kh11-9:~# grep -iE "health" ceph_report.json
> "health": {
> "health": {
> "health_services": [
> "health": "HEALTH_OK"
> "health": "HEALTH_OK"
> "health": "HEALTH_OK"
> "health": "HEALTH_OK"
> "health": "HEALTH_OK"
> "health": "HEALTH_OK"
> "overall_status": "HEALTH_OK",
>
> Has anyone seen this behavior before?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Local SSD cache for ceph on each compute node.

2016-03-30 Thread Nick Fisk
> >>
> >> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> >>> One thing I picked up on when looking at dm-cache for doing caching
> >>> with RBD's is that it wasn't really designed to be used as a
> >>> writeback cache for new writes, as in how you would expect a
> >>> traditional writeback cache to work. It seems all the policies are
> >>> designed around the idea that writes go to cache only if the block
> >>> is already in the cache (through reads) or its hot enough to
> >>> promote. Although there did seem to be some tunables to alter this
> >>> behaviour, posts on the mailing list seemed to suggest this wasn't
> >>> how it was designed to be used. I'm not sure if this has been addressed
> since I last looked at it though.
> >>>
> >>> Depending on if you are trying to accelerate all writes, or just
> >>> your
> > "hot"
> >>> blocks, this may or may not matter. Even <1GB local caches can make
> >>> a huge difference to sync writes.
> >> Hi Nick,
> >>
> >> Some of the caching policies have changed recently as the team has
> >> looked at different workloads.
> >>
> >> Happy to introduce you to them if you want to discuss offline or post
> >> comments over on their list: device-mapper development  >> de...@redhat.com>
> >>
> >> thanks!
> >>
> >> Ric
> > Hi Ric,
> >
> > Thanks for the heads up, just from a quick flick through I can see
> > there are now separate read and write promotion thresholds, so I can
> > see just from that it would be a lot more suitable for what I
> > intended. I might try and find some time to give it another test.
> >
> > Nick
> 
> Let us know how it works out for you, I know that they are very interested in
> making sure things are useful :)

Hi Ric,

I have given it another test and unfortunately it seems it's still not giving 
the improvements that I was expecting.

Here is a rough description of my test

10GB RBD
1GB ZRAM kernel device for cache (Testing only)

0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157
0 1 writeback 2 migration_threshold 8192 mq 10 random_threshold 0
sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 4
write_promote_adjustment 0 rw -

I'm then running a directio 64kb seq write QD=1 bench with fio to the DM device.
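
Roughly along these lines (the dm device name here is just a placeholder):

fio --name=seq64k --filename=/dev/mapper/rbd-cached --rw=write --bs=64k \
    --iodepth=1 --direct=1 --ioengine=libaio --time_based --runtime=60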

What I expect to happen would be for this sequential stream of 64kb IO's to be 
coalesced into 4MB IO's and written out to the RBD at a high queue depth as 
possible/required. Effectively meaning my 64kb sequential bandwidth should 
match the limit of 4MB sequential bandwidth of my cluster. I'm more interested 
in replicating the behaviour of a write cache on a battery backed raid card, 
than a RW SSD cache, if that makes sense?

An example real life scenario would be for sitting underneath a iSCSI target, 
something like ESXi generates that IO pattern when moving VM's between 
datastores.

What I was seeing is that I get a sudden burst of speed at the start of the fio 
test, but then it quickly drops down to the speed of the underlying RBD device. 
The dirty blocks counter never seems to go too high, so I don't think that it’s 
a cache full problem. The counter is probably no more than about 40% when the 
slowdown starts and then it drops to less than 10% for the remainder of the 
test as it crawls along. It feels like it hits some sort of throttle and it 
never recovers. 

I've done similar tests with flashcache and it gets more stable performance 
over a longer period of time, but the associative hit set behaviour seems to 
cause write misses due to the sequential IO pattern, which limits overall top 
performance. 


Nick

> 
> ric
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-30 Thread German Anders
Ok, but I have kernel 3.19.0-39-generic, so the new format is supposed to
work, right? And I'm still getting issues while trying to map the RBD:

$ sudo rbd --cluster cephIB create e60host01vX --size 100G --pool rbd -c
/etc/ceph/cephIB.conf
$ sudo rbd -p rbd bench-write e60host01vX --io-size 4096 --io-threads 1
--io-total 4096 --io-pattern rand -c /etc/ceph/cephIB.conf
bench-write  io_size 4096 io_threads 1 bytes 4096 pattern random
  SEC   OPS   OPS/SEC   BYTES/SEC
elapsed: 0  ops:1  ops/sec:29.67  bytes/sec: 121536.32

$ sudo rbd --cluster cephIB map e60host01vX --pool rbd -c
/etc/ceph/cephIB.conf
rbd: sysfs write failed
rbd: map failed: (5) Input/output error

$ sudo rbd -p rbd info e60host01vX -c /etc/ceph/cephIB.conf
rbd image 'e60host01vX':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5f03238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:

Any other idea of what could be the problem here?


German

2016-03-30 5:15 GMT-03:00 Ilya Dryomov :

> On Wed, Mar 30, 2016 at 3:03 AM, Jason Dillaman 
> wrote:
> > Understood -- format 2 was promoted to the default image format starting
> with Infernalis (which not all users would have played with since it isn't
> LTS).  The defaults can be overridden via the command-line when creating
> new images or via the Ceph configuration file.
> >
> > I'll let Ilya provide input on which kernels support image format 2, but
> from a quick peek on GitHub it looks like support was added around the v3.8
> timeframe.
>
> Layering (i.e. format 2 with default striping parameters) is supported
> starting with 3.10.  We don't really support older kernels - backports
> are pretty much all 3.10+, etc.
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-30 Thread Jason Dillaman
You will first need to disable all features except for layering since krbd 
doesn't currently have support:

# rbd --cluster cephIB feature disable e60host01vX 
exclusive-lock,object-map,fast-diff,deep-flatten --pool cinder-volumes

You also might want to consider adding "rbd default features = 1" to 
"/etc/ceph/cephIB.conf" if you plan to use krbd.

-- 

Jason Dillaman 


- Original Message - 

> From: "German Anders" 
> To: "Ilya Dryomov" 
> Cc: "Jason Dillaman" , "ceph-users"
> 
> Sent: Wednesday, March 30, 2016 9:37:03 AM
> Subject: Re: [ceph-users] Scrubbing a lot

> Ok, but I've kernel 3.19.0-39-generic, so the new version is supposed to work
> right?, and I'm still getting issues while trying to map the RBD:

> $ sudo rbd --cluster cephIB create e60host01vX --size 100G --pool rbd -c
> /etc/ceph/cephIB.conf
> $ sudo rbd -p rbd bench-write e60host01vX --io-size 4096 --io-threads 1
> --io-total 4096 --io-pattern rand -c /etc/ceph/cephIB.conf
> bench-write io_size 4096 io_threads 1 bytes 4096 pattern random
> SEC OPS OPS/SEC BYTES/SEC
> elapsed: 0 ops: 1 ops/sec: 29.67 bytes/sec: 121536.32

> $ sudo rbd --cluster cephIB map e60host01vX --pool rbd -c
> /etc/ceph/cephIB.conf
> rbd: sysfs write failed
> rbd: map failed: (5) Input/output error

> $ sudo rbd -p rbd info e60host01vX -c /etc/ceph/cephIB.conf
> rbd image 'e60host01vX':
> size 102400 MB in 25600 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.5f03238e1f29
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> flags:

> Any other idea of what could be the problem here?

> German

> 2016-03-30 5:15 GMT-03:00 Ilya Dryomov < idryo...@gmail.com > :

> > On Wed, Mar 30, 2016 at 3:03 AM, Jason Dillaman < dilla...@redhat.com >
> > wrote:
> 
> > > Understood -- format 2 was promoted to the default image format starting
> > > with Infernalis (which not all users would have played with since it
> > > isn't
> > > LTS). The defaults can be overridden via the command-line when creating
> > > new images or via the Ceph configuration file.
> 
> > >
> 
> > > I'll let Ilya provide input on which kernels support image format 2, but
> > > from a quick peek on GitHub it looks like support was added around the
> > > v3.8 timeframe.
> 

> > Layering (i.e. format 2 with default striping parameters) is supported
> 
> > starting with 3.10. We don't really support older kernels - backports
> 
> > are pretty much all 3.10+, etc.
> 

> > Thanks,
> 

> > Ilya
> 
> > ___
> 
> > ceph-users mailing list
> 
> > ceph-users@lists.ceph.com
> 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph upgrade questions

2016-03-30 Thread Daniel Delin

>Note that 0.94.6 has a massive, data destroying cache-tier bug, so you
>will want to wait until .7 at least if you're using cache-tiering, or read
>up on the work-around for that bug alternatively.

This sounds interesting, is there a bug number for this? I've been playing around
with cache tiering in 0.94.6 and have run into some issues. I haven't seen any
data destruction yet, but I had problems disconnecting the cache tier from the
backing pool. Luckily it is just a test pool, so I can destroy both the tier and
the backing pool, but I would really like to get cache tiering going on my
production pool; it gave a nice performance boost when I tested it.

Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph upgrade questions

2016-03-30 Thread Christian Balzer

Hello,

On Wed, 30 Mar 2016 16:12:30 +0200 (CEST) Daniel Delin wrote:

> 
> >Note that 0.94.6 has a massive, data destroying cache-tier bug, so you
> >will want to wait until .7 at least if you're using cache-tiering, or
> >read up on the work-around for that bug alternatively.
> 
> This sounds interesting, is there a bug number for this ? 
Read the "data corruption with hammer" thread.
That thread also contains the work-around.
http://tracker.ceph.com/issues/12814 

Supposedly this isn't present in Jewel (but then again, I'm betting real
money on other easter eggs being present in there).

> Been playing
> around with cache tiering in 0.94.6 and have run in to some issues, not
> seen any data destruction yet, but I had problems disconnecting the
> cache tier from the backing pool. 
Also discussed here recently/frequently. 
See the "Can not disable rbd cache" thread.
http://tracker.ceph.com/issues/14865

>Luckily just a test pool so I can just
> destroy both tier and backing pool, but I would really like to get cache
> tiering going on my production pool, it gave a nice performance boost
> when I tested it.
> 
It can work quite well, depending on your work load (cache size vs. really
hot objects) and the cache mode chosen.

I certainly solved my overload cluster problems described in the thread
"Reducing the impact of OSD restarts (noout ain't uptosnuff)"
with a cache tier.


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Incorrect path in /etc/init/ceph-osd.conf?

2016-03-30 Thread Nick Fisk
Hi All,

I can see the path in the upstart script

https://github.com/ceph/ceph/blob/master/src/upstart/ceph-osd.conf

Checks for the file /etc/default/ceph and then runs it

But in all the instances of ceph that I have installed that location is a
directory and the actual location of the default conf file is
/etc/default/ceph/ceph

Have I spotted something incorrect or have I misunderstood the workings of
the script?

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Stuck active+undersized+degraded+inconsistent

2016-03-30 Thread Calvin Morrow
On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer  wrote:

>
> Hello,
>
> On Tue, 29 Mar 2016 18:10:33 + Calvin Morrow wrote:
>
> > Ceph cluster with 60 OSDs, Giant 0.87.2.  One of the OSDs failed due to a
> > hardware error, however after normal recovery it seems stuck with
> > one active+undersized+degraded+inconsistent pg.
> >
> Any reason (other than inertia, which I understand very well) you're
> running a non LTS version that last saw bug fixes a year ago?
> You may very well be facing a bug that has long been fixed even in Firefly,
> let alone Hammer.
>
I know we discussed Hammer several times, and I don't remember the exact
reason we held off.  Other than that, Inertia is probably the best answer I
have.

>
> If so, hopefully one of the devs remembering it can pipe up.
>
> > I haven't been able to get repair to happen using "ceph pg repair
> > 12.28a"; I can see the activity logged in the mon logs, however the
> > repair doesn't actually seem to happen in any of the actual osd logs.
> >
> > I tried following Sebastien's instructions for manually locating the
> > inconsistent object (
> > http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> ),
> > however the md5sum from the objects both match, so I'm not quite sure how
> > to proceed.
> >
> Rolling a dice? ^o^
> Do they have similar (identical really) timestamps as well?
>
Yes, timestamps are identical.

>
> > Any ideas on how to return to a healthy cluster?
> >
> > [root@soi-ceph2 ceph]# ceph status
> > cluster 6cc00165-4956-4947-8605-53ba51acd42b
> >  health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023 pgs
> > stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck undersized; 1023
> > pgs undersized; recovery 132091/23742762 objects degraded (0.556%);
> > 7745/23742762 objects misplaced (0.033%); 1 scrub errors
> >  monmap e5: 3 mons at {soi-ceph1=
> > 10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
> > election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
> >  osdmap e41120: 60 osds: 59 up, 59 in
> >   pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
> > 91295 GB used, 73500 GB / 160 TB avail
> > 132091/23742762 objects degraded (0.556%); 7745/23742762
> > objects misplaced (0.033%)
> >60341 active+clean
> >   76 active+remapped
> > 1022 active+undersized+degraded
> >1 active+undersized+degraded+inconsistent
> >   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
> >
> What's confusing to me in this picture are the stuck and unclean PGs as
> well as degraded objects, it seems that recovery has stopped?
>
Yeah ... recovery essentially halted.  I'm sure its no accident that there
are exactly 1023 (1024-1) unhealthy pgs.

>
> Something else that suggests a bug, or at least a stuck OSD.
>
> > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
> > pg 12.28a is stuck unclean for 126274.215835, current state
> > active+undersized+degraded+inconsistent, last acting [36,52]
> > pg 12.28a is stuck undersized for 3499.099747, current state
> > active+undersized+degraded+inconsistent, last acting [36,52]
> > pg 12.28a is stuck degraded for 3499.107051, current state
> > active+undersized+degraded+inconsistent, last acting [36,52]
> > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
> >
> > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
> > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700 -1
> > log_channel(default) log [ERR] : 12.28a shard 20: soid
> >
> c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
> > candidate had a read error, digest 2029411064 != known digest 2692480864
>   ^^
> That's the culprit, google for it. Of course the most promising looking
> answer is behind the RH pay wall.
>
This part is the most confusing for me.  To me, this should indicate that
there was some kind of bitrot on the disk (I'd love for ZFS to be better
supported here).  What I don't understand is that the actual object has
identical md5sums, timestamps, etc.  I don't know if this means there was
just a transient error that Ceph can't get over, or whether I'm mistakenly
looking at the wrong object.  Maybe something stored in an xattr somewhere?
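
For what it's worth, the object xattrs on each replica could be compared with
something like the following, using the same file paths as the md5sums above:

# getfattr -d -m '.*' <object path on osd.36>
# getfattr -d -m '.*' <object path on osd.52>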

>
> Looks like that disk has an issue, guess you're not seeing this on osd.52,
> right?
>
Correct.

> Check osd.36's SMART status.
>
SMART is normal, no errors, all counters seem fine.

>
> My guess is that you may have to set min_size to 1 and recover osd.35 as
> well, but don't take my word for it.
>
Thanks for the suggestion.  I'm holding out for the moment in case someone
else reads this and has an "aha" moment.  At the moment, I'm not sure if it
would be more dangerous to try and blow away the object on osd.36 and hope
for recovery (with min_size 1) or try a software upgrade on an unhealthy
cluster (yuck).

>
> Christian
>
> > ceph-os

Re: [ceph-users] Ceph stopped self repair.

2016-03-30 Thread Gregory Farnum
On Tuesday, March 29, 2016, Dan Moses  wrote:

> Any suggestions how to clean up ceph errors that don't autocorrect?  All
> these counters haven't moved in 2 hours now.
>
> HEALTH_WARN 93 pgs degraded; 93 pgs stuck degraded; 113 pgs stuck unclean;
> 93 pgs stuck undersized; 93 pgs undersized; too many PGs per OSD (472 > max
> 300); mon.0 low disk space
>
>
This looks like it could be caused by a faulty crush map. What's the full
output of ceph -s and "ceph osd tree"?


>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v10.1.0 Jewel release candidate available

2016-03-30 Thread Alfredo Deza
On Wed, Mar 30, 2016 at 11:49 AM, Xiaoxi Chen  wrote:
> I am seeing packages for precise in
> http://download.ceph.com/debian-jewel/dists/precise/, is that
> by accident or do we plan to support precise for one more LTS?

The configuration is there but there aren't any packages built for it:

http://download.ceph.com/debian-jewel/pool/main/c/ceph/

This is a side-effect of the readiness on our build system that will,
by default, allow other distros to be configured to get included
in the repo even if there aren't any actual binaries.


>
> Or, in other words, is it safe for users on precise to upgrade to jewel?
>
> 2016-03-29 0:24 GMT+08:00 Simon Leinen :
>> Sage Weil writes:
>>> The first release candidate for Jewel is now available!
>>
>> Cool!
>>
>> [...]
>>> Packages for aarch64 will also be posted shortly!
>>
>> According to the announcement, Ubuntu Xenial should now be supported
>> instead of Precise; but I don't see Xenial packages on
>> download.ceph.com.  Will those arrive, or should we get them from
>> Canonical's Xenial repo?
>> --
>> Simon.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph pg query hangs for ever

2016-03-30 Thread Wido den Hollander
Hi,

I have an issue with a Ceph cluster which I can't resolve.

Due to OSD failure a PG is incomplete, but I can't query the PG to see what I
can do to fix it.

 health HEALTH_WARN
1 pgs incomplete
1 pgs stuck inactive
1 pgs stuck unclean
98 requests are blocked > 32 sec

$ ceph pg 3.117 query

That will hang for ever.

$ ceph pg dump_stuck

pg_stat state   up  up_primary  acting  acting_primary
3.117   incomplete  [68,55,74]  68  [68,55,74]  68

The primary OSD for this PG is osd.68. If I stop the OSD the PG query works,
but it says that bringing osd 68 back online will probably help.

The 98 requests which are blocked are also on osd.68 and they all say:

They all say:
- initiated
- reached_pg
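
The per-op states can be dumped on that host via the admin socket, e.g.:

$ ceph daemon osd.68 dump_ops_in_flight
$ ceph daemon osd.68 dump_historic_ops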

The cluster is running Hammer 0.94.5 in this case.

From what I know an OSD had a failing disk and was restarted a couple of times
while the disk gave errors. This caused the PG to become incomplete.

I've set debug osd to 20, but I can't really tell what is going wrong on osd.68
which causes it to stall this long.

Any idea what to do here to get this PG up and running again?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph pg query hangs for ever

2016-03-30 Thread Mart van Santen

Hi there,

With the help of a lot of people we were able to repair the PG and
restored service. We will get back on this later with a full report for
future reference.

Regards,

Mart


On 03/30/2016 08:30 PM, Wido den Hollander wrote:
> Hi,
>
> I have an issue with a Ceph cluster which I can't resolve.
>
> Due to OSD failure a PG is incomplete, but I can't query the PG to see what I
> can do to fix it.
>
>  health HEALTH_WARN
> 1 pgs incomplete
> 1 pgs stuck inactive
> 1 pgs stuck unclean
> 98 requests are blocked > 32 sec
>
> $ ceph pg 3.117 query
>
> That will hang for ever.
>
> $ ceph pg dump_stuck
>
> pg_stat   state   up  up_primary  acting  acting_primary
> 3.117 incomplete  [68,55,74]  68  [68,55,74]  68
>
> The primary PG in this case is osd.68 . If I stop the OSD the PG query works,
> but it says that bringing osd 68 back online will probably help.
>
> The 98 requests which are blocked are also on osd.68 and they all say:
>
> They all say:
> - initiated
> - reached_pg
>
> The cluster is running Hammer 0.94.5 in this case.
>
> From what I know a OSD had a failing disk and was restarted a couple of times
> while the disk gave errors. This caused the PG to become incomplete.
>
> I've set debug osd to 20, but I can't really tell what is going wrong on 
> osd.68
> which causes it to stall this long.
>
> Any idea what to do here to get this PG up and running again?
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] chunk-based cache in ceph with erasure coded back-end storage

2016-03-30 Thread Yu Xiang
Dear List,
I have been exploring the Ceph cache tier recently, considering a replicated
cache tier in front of an erasure-coded storage tier, so chunks are stored on
the OSDs in the erasure-coded storage tier. When a file is requested for
reading, usually all of its chunks in the storage tier are copied to the cache
tier, replicated, and stored on the OSDs in the caching pool. I was wondering
whether it would be possible for only some of the chunks of the requested file
to be copied to cache, or does it have to be a complete file? For example, for
a file using a (7,4) erasure code (4 original chunks, 3 encoded chunks), a read
might copy the 4 required chunks to cache; would it be possible to copy only 2
of the 4 required chunks to cache, with the user getting the other 2 chunks
elsewhere (or, assuming the client already has 2 chunks, they only need another
2 from Ceph)? Can the cache store partial chunks of a file?


Thanks in advance for any help!


Best,
Yu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Stuck active+undersized+degraded+inconsistent

2016-03-30 Thread Christian Balzer
On Wed, 30 Mar 2016 15:50:07 + Calvin Morrow wrote:

> On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Tue, 29 Mar 2016 18:10:33 + Calvin Morrow wrote:
> >
> > > Ceph cluster with 60 OSDs, Giant 0.87.2.  One of the OSDs failed due
> > > to a hardware error, however after normal recovery it seems stuck
> > > with one active+undersized+degraded+inconsistent pg.
> > >
> > Any reason (other than inertia, which I understand very well) you're
> > running a non LTS version that last saw bug fixes a year ago?
> > You may very well be facing a bug that has long been fixed even in
> > Firefly, let alone Hammer.
> >
> I know we discussed Hammer several times, and I don't remember the exact
> reason we held off.  Other than that, Inertia is probably the best
> answer I have.
> 
Fair enough. 

I just seem to remember similar scenarios where recovery got stuck/hung
and thus would assume it was fixed in newer versions.

If you google for "ceph recovery stuck" you find another potential
solution behind the RH paywall and this:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043894.html

That would have been my next suggestion anyway, Ceph OSDs seem to take
well to the 'IT crowd' mantra of "Have you tried turning it off and on
again?". ^o^

> >
> > If so, hopefully one of the devs remembering it can pipe up.
> >
> > > I haven't been able to get repair to happen using "ceph pg repair
> > > 12.28a"; I can see the activity logged in the mon logs, however the
> > > repair doesn't actually seem to happen in any of the actual osd logs.
> > >
> > > I tried following Sebastien's instructions for manually locating the
> > > inconsistent object (
> > > http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> > ),
> > > however the md5sum from the objects both match, so I'm not quite
> > > sure how to proceed.
> > >
> > Rolling a dice? ^o^
> > Do they have similar (identical really) timestamps as well?
> >
> Yes, timestamps are identical.
> 
Unsurprisingly.

> >
> > > Any ideas on how to return to a healthy cluster?
> > >
> > > [root@soi-ceph2 ceph]# ceph status
> > > cluster 6cc00165-4956-4947-8605-53ba51acd42b
> > >  health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023
> > > pgs stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck
> > > undersized; 1023 pgs undersized; recovery 132091/23742762 objects
> > > degraded (0.556%); 7745/23742762 objects misplaced (0.033%); 1 scrub
> > > errors monmap e5: 3 mons at {soi-ceph1=
> > > 10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
> > > election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
> > >  osdmap e41120: 60 osds: 59 up, 59 in
> > >   pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728
> > > kobjects 91295 GB used, 73500 GB / 160 TB avail
> > > 132091/23742762 objects degraded (0.556%); 7745/23742762
> > > objects misplaced (0.033%)
> > >60341 active+clean
> > >   76 active+remapped
> > > 1022 active+undersized+degraded
> > >1 active+undersized+degraded+inconsistent
> > >   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
> > >
> > What's confusing to me in this picture are the stuck and unclean PGs as
> > well as degraded objects, it seems that recovery has stopped?
> >
> Yeah ... recovery essentially halted.  I'm sure its no accident that
> there are exactly 1023 (1024-1) unhealthy pgs.
> 
> >
> > Something else that suggests a bug, or at least a stuck OSD.
> >
> > > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
> > > pg 12.28a is stuck unclean for 126274.215835, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is stuck undersized for 3499.099747, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is stuck degraded for 3499.107051, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
> > >
> > > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a shard 20: soid
> > >
> > c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
> > > candidate had a read error, digest 2029411064 != known digest
> > > 2692480864
> >   ^^
> > That's the culprit, google for it. Of course the most promising looking
> > answer is behind the RH pay wall.
> >
> This part is the most confusing for me.  To me, this should indicate that
> there was some kind of bitrot on the disk (I'd love for ZFS to be better
> supported here).  What I don't understand is that the actual object has
> identical md5sums, timestamps, etc.  I don't know if this means there was
> just a transient error that Ceph can't get over, or whet

Re: [ceph-users] chunk-based cache in ceph with erasure coded back-end storage

2016-03-30 Thread huang jun
If your cache mode is writeback, the read object will be cached in the
cache tier.
You can try the readproxy mode, which will not cache the object:
the read request is sent to the primary OSD, and the primary OSD collects the
shards from the base tier (in your case, the erasure-coded pool);
you need to read at least k chunks to decode the object.
In the current code, the cache tier only stores whole objects, not the shards.
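
Switching the mode is a one-liner, assuming a cache pool named 'cachepool' is
already tiered in front of the EC pool:

$ ceph osd tier cache-mode cachepool readproxy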


2016-03-31 6:10 GMT+08:00 Yu Xiang :
> Dear List,
> I am exploring in ceph caching tier recently, considering a cache-tier
> (replicated) and a back storage-tier (erasure-coded), so chunks are stored
> in the OSDs in the erasure-coded storage tier, when a file has been
> requested to read,  usually, all chunks in the storage tier would be copied
> to the cache tier, replicated, and stored in the OSDs in caching pool, but i
> was wondering would it be possible that if only partial chunks of the
> requested file be copied to cache? or it has to be a complete file? for
> example, a file using (7,4) erasure code (4 original chunks, 3 encoded
> chunks), when read it might be 4 required chunks are copied to cache, and i
> was wondering if it's possible to copy only 2 out of 4 required chunks to
> cache, and the users getting the other 2 chunks elsewhere (or assuming the
> client already has 2 chunks, they only need another 2 from ceph)? can the
> cache store partial chunks of a file?
>
> Thanks in advance for any help!
>
> Best,
> Yu
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
thanks
huangjun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph pg query hangs for ever

2016-03-30 Thread Mart van Santen


Hello,

Well, unfortunately the problem is not really solved. Yes, we managed to
get to a good health state at some point, but when a client hits some
specific data, the OSD process crashes with the errors below. The 3 OSDs
which handle 3.117, the PG with problems, are currently down and we have
reweighted them to 0, so non-affected PGs are currently being rebuilt on
other OSDs.
If I bring the crashed OSDs back up, they crash again within a few minutes.

As I'm a bit afraid for the data in this PG, I think we want to recreate
the PG with empty data and discard the old disks. I understand I will
get data corruption on several RBDs in this case, but we will try to
solve that and rebuild the affected VMs. Does this make sense, and what
are the best next steps?

Regards,

Mart






   -34> 2016-03-31 03:07:56.932800 7f8e43829700  3 osd.55 122203
handle_osd_map epochs [122203,122203], i have 122203, src has
[120245,122203]
   -33> 2016-03-31 03:07:56.932837 7f8e43829700  1 --
[2a00:c6c0:0:122::105]:6822/11703 <== osd.45
[2a00:c6c0:0:122::103]:6800/1852 7  pg_info(1 pgs e122202:3.117) v4
 919+0+0 (3389909573 0 0) 0x528bc00 con 0x1200a840
   -32> 2016-03-31 03:07:56.932855 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932770, event: header_read, op: pg_info(1
pgs e122202:3.117)
   -31> 2016-03-31 03:07:56.932869 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932771, event: throttled, op: pg_info(1
pgs e122202:3.117)
   -30> 2016-03-31 03:07:56.932878 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932822, event: all_read, op: pg_info(1 pgs
e122202:3.117)
   -29> 2016-03-31 03:07:56.932886 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932851, event: dispatched, op: pg_info(1
pgs e122202:3.117)
   -28> 2016-03-31 03:07:56.932895 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932895, event: waiting_for_osdmap, op:
pg_info(1 pgs e122202:3.117)
   -27> 2016-03-31 03:07:56.932912 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932912, event: started, op: pg_info(1 pgs
e122202:3.117)
   -26> 2016-03-31 03:07:56.932947 7f8e43829700  5 -- op tracker -- seq:
22, time: 2016-03-31 03:07:56.932947, event: done, op: pg_info(1 pgs
e122202:3.117)
   -25> 2016-03-31 03:07:56.933022 7f8e3c01a700  1 --
[2a00:c6c0:0:122::105]:6822/11703 --> [2a00:c6c0:0:122::103]:6800/1852
-- osd_map(122203..122203 src has 121489..122203) v3 -- ?+0 0x11c7fd40
con 0x1200a840
   -24> 2016-03-31 03:07:56.933041 7f8e3c01a700  1 --
[2a00:c6c0:0:122::105]:6822/11703 --> [2a00:c6c0:0:122::103]:6800/1852
-- pg_info(1 pgs e122203:3.117) v4 -- ?+0 0x528bde0 con 0x1200a840
   -23> 2016-03-31 03:07:56.933111 7f8e3c01a700  1 --
[2a00:c6c0:0:122::105]:6822/11703 --> [2a00:c6c0:0:122::105]:6810/3568
-- osd_map(122203..122203 src has 121489..122203) v3 -- ?+0 0x12200d00
con 0x1209d4a0
   -22> 2016-03-31 03:07:56.933125 7f8e3c01a700  1 --
[2a00:c6c0:0:122::105]:6822/11703 --> [2a00:c6c0:0:122::105]:6810/3568
-- pg_info(1 pgs e122203:3.117) v4 -- ?+0 0x5288960 con 0x1209d4a0
   -21> 2016-03-31 03:07:56.933154 7f8e3c01a700  1 --
[2a00:c6c0:0:122::105]:6822/11703 -->
[2a00:c6c0:0:122::108]:6816/1002847 -- pg_info(1 pgs e122203:3.117) v4
-- ?+0 0x5288d20 con 0x101a19c0
   -20> 2016-03-31 03:07:56.933212 7f8e3c01a700  5 osd.55 pg_epoch:
122203 pg[3.117( v 122193'1898519 (108032'1895437,122193'1898519]
local-les=122202 n=2789 ec=23736 les/c 122202/122047
122062/122201/122201) [72,54,45]/[55] r=0 lpr=122201 pi=122046-122200/51
bft=45,54,72 crt=122133'1898514 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped] on activate: bft=45,54,72 from 0//0//-1
   -19> 2016-03-31 03:07:56.933232 7f8e3c01a700  5 osd.55 pg_epoch:
122203 pg[3.117( v 122193'1898519 (108032'1895437,122193'1898519]
local-les=122202 n=2789 ec=23736 les/c 122202/122047
122062/122201/122201) [72,54,45]/[55] r=0 lpr=122201 pi=122046-122200/51
bft=45,54,72 crt=122133'1898514 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped] target shard 45 from 0//0//-1
   -18> 2016-03-31 03:07:56.933244 7f8e3c01a700  5 osd.55 pg_epoch:
122203 pg[3.117( v 122193'1898519 (108032'1895437,122193'1898519]
local-les=122202 n=2789 ec=23736 les/c 122202/122047
122062/122201/122201) [72,54,45]/[55] r=0 lpr=122201 pi=122046-122200/51
bft=45,54,72 crt=122133'1898514 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped] target shard 54 from 0//0//-1
   -17> 2016-03-31 03:07:56.933255 7f8e3c01a700  5 osd.55 pg_epoch:
122203 pg[3.117( v 122193'1898519 (108032'1895437,122193'1898519]
local-les=122202 n=2789 ec=23736 les/c 122202/122047
122062/122201/122201) [72,54,45]/[55] r=0 lpr=122201 pi=122046-122200/51
bft=45,54,72 crt=122133'1898514 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped] target shard 72 from 0//0//-1
   -16> 2016-03-31 03:07:56.933283 7f8e3680f700  5 -- op tracker -- seq:
20, time: 2016-03-31 03:07:56.933283, event: reached_pg, op:
osd_op(client.776466.1:190178605 rbd_data.900a62ae8944a.0829

[ceph-users] Ceph.conf

2016-03-30 Thread zainal
Hi,

What is meant by "mon initial members" in ceph.conf? Is it the monitor nodes
that monitor all the OSD nodes, or the OSD nodes that are being monitored?
Care to explain?

Regards,

Mohd Zainal Abidin Rabani

Technical Support

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash after conversion to bluestore

2016-03-30 Thread Adrian Saul

I upgraded my lab cluster to 10.1.0 specifically to test out bluestore and see 
what latency difference it makes.

I was able to one by one zap and recreate my OSDs to bluestore and rebalance 
the cluster (the change to having new OSDs start with low weight threw me at 
first, but once  I worked that out it was fine).

I was all good until I completed the last OSD, and then one of the earlier ones 
fell over and refuses to restart.  Every attempt to start fails with this 
assertion failure:

-2> 2016-03-31 15:15:08.868588 7f931e5f0800  0  
cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
-1> 2016-03-31 15:15:08.868800 7f931e5f0800  1  
cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
 0> 2016-03-31 15:15:08.870948 7f931e5f0800 -1 osd/OSD.h: In function 
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f931e5f0800 time 2016-03-31 
15:15:08.869638
osd/OSD.h: 886: FAILED assert(ret)

 ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) 
[0x558cee37da55]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x558cedd6a6fd]
 3: (OSD::init()+0xf22) [0x558cedd1d172]
 4: (main()+0x2aab) [0x558cedc83a2b]
 5: (__libc_start_main()+0xf5) [0x7f931b506b15]
 6: (()+0x349689) [0x558cedccd689]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


I could just zap and recreate it again, but I would be curious to know how to 
fix it, or unless someone can suggest if this is a bug that needs looking at.

Cheers,
 Adrian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph.conf

2016-03-30 Thread Adrian Saul

It is the monitors that ceph clients/daemons can connect to initially to 
connect with the cluster.

Once they connect to one of the initial mons they will get a full list of all 
monitors and be able to connect to any of them to pull updated maps.
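
A typical ceph.conf fragment (hostnames and addresses below are just placeholders):

[global]
mon initial members = mon1, mon2, mon3
mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3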


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
zai...@nocser.net
Sent: Thursday, 31 March 2016 3:21 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph.conf

Hi,

What does mean by mon initial members in ceph.conf? Is it monitor node that 
monitor all osd node? Or node osd that been monitor? Care to exlain?

Regards,

Mohd Zainal Abidin Rabani
Technical Support

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xenserver or xen ceph

2016-03-30 Thread Jiri Kanicky

Hi.

There is a solution for Ceph on XenServer. With the help of my engineer
Mark, we developed a simple patch which allows you to search for and attach an
RBD image on XenServer. We create LVHD over the RBD (not RBD-per-VDI
mapping yet), so it is far from ideal, but it's a good start. The process
of creating the SR over RBD works even from XenCenter.


https://github.com/mstarikov/rbdsr

Install notes are included and it's very simple. It takes a few minutes
per XenServer.


We have been running this in our Sydney Citrix lab for some time and I
have been running it at home also. Works great. For the future, the
patch should work in the upcoming version of XenServer (Dundee) as well.
We are also trying to push native Ceph packages into the new version and
build an experimental (not official or approved yet) version of smapi which
would allow us to map an RBD per VDI. But there are no details on this yet.
Anyway, everyone is welcome to participate in improving the patch on GitHub.


Let me know if you have any questions.

Cheers,
Jiri

On 16/02/2016 15:30, Christian Balzer wrote:

On Tue, 16 Feb 2016 11:52:17 +0800 (CST) maoqi1982 wrote:


Hi lists
Is there any solution or documents that ceph as xenserver or xen backend
storage?



Not really.

There was a project to natively support Ceph (RBD) in Xenserver but that
seems to have gone nowhere.

There was also a thread last year here "RBD hard crash on kernel
3.10" (google for it) wher Shawn Edwards was working on something similar,
but that seems to have died off silently as well.

While you could of course do a NFS (some pains) or iSCSI (major pains)
head for Ceph the pains and reduced performance make it not an attractive
proposition.

Christian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections

2016-03-30 Thread seapasu...@uchicago.edu

Thanks Dan!

Thanks for this.  I didn't know /proc/procid/limits was here! Super useful!!

Here are my limits::
root@kh11-9:~# cat /proc/419990/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             515007       515007       processes
Max open files            1048576      1048576      files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       515007       515007       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

root@kh11-9:~# lsof -p 419990 | wc -l
600

root@kh11-9:~# ps -o nlwp 419990
NLWP
1251

root@kh11-9:~# ps -eo nlwp | tail -n +2 | awk '{ sum += $1 } END { print 
sum }'

1585

root@kh11-9:~# ulimit -a
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 0
file size   (blocks, -f) unlimited
pending signals (-i) 515007
max locked memory   (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files  (-n) 1048576
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) 515007
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited


root@kh11-9:~# ls /proc/419990/fd/ | wc -l
536

I think this is a systemwide config issue as even after I restart 
radosgw this issue doesn't go away entirely and seems to linger, I just
have no idea what it could be.


Prior to this behavior happening I can almost fully saturate my network 
link to near 10Gbps. After the behavior starts happening I can not even 
wget a 100mb bin file. It ends up taking hours. Small wgets complete 
though and I can curl a plain test webpage 
without any issue. Speed is greatly reduced though.


The rest of the server seems to behave fine (sans the newly discovered 
download issue)






On March 30, 2016 5:34:25 AM Dan van der Ster  wrote:


Hi Sean,

Did you check that the process isn't hitting some ulimits? cat
/proc/`pidof radosgw`/limits and compare with the num processes/num
FDs in use.

Cheers, Dan


On Tue, Mar 29, 2016 at 8:35 PM, seapasu...@uchicago.edu
 wrote:

So an update for anyone else having this issue. It looks like radosgw either
has a memory leak or it spools the whole object into ram or something.

root@kh11-9:/etc/apt/sources.list.d# free -m
             total       used       free     shared    buffers     cached
Mem:         64397      63775        621          0          3         46
-/+ buffers/cache:      63725        671
Swap:        65499      17630      47869

root@kh11-9:/etc/apt/sources.list.d# ps faux | grep -iE "USE[R]|radosg[w]"
USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root  269910  134 95.2 90622120 62819128 ?   Ssl  12:31  79:37
/usr/bin/radosgw --cluster=ceph --id rgw.kh11-9 -f

The odd things are 1.) the disk is fine. 2.) the rest of the server seems
very responsive. I can ssh into the server without any problems, curl out,
wget, etc but radosgw is stuck in the mud

This is after 150-300 wget requests to public objects, 2 radosgws freeze
like this.  The cluster is health okay as well::

root@kh11-9:~# grep -iE "health" ceph_report.json
"health": {
"health": {
"health_services": [
"health": "HEALTH_OK"
"health": "HEALTH_OK"
"health": "HEALTH_OK"
"health": "HEALTH_OK"
"health": "HEALTH_OK"
"health": "HEALTH_OK"
"overall_status": "HEALTH_OK",

Has anyone seen this behavior before?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] understand "client rmw"

2016-03-30 Thread Zhongyan Gu
Hi ceph experts,
I know rmw means read-modify-write. I just don't understand what "client
rmw" stands for. Can anybody tell me what it is and in what scenario this
kind of request is generated?

zhongyan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com