Re: [ceph-users] [RGW] SignatureDoesNotMatch using curl

2017-09-25 Thread Дмитрий Глушенок
You must use a triple "\n" after GET in stringToSign - the empty Content-MD5 and
Content-Type fields still need their newlines. See 
http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html
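
For reference, a corrected sketch of the signing part is below (untested here; note
also that the quoted script is missing spaces before the line-continuation
backslashes, which glues the -H arguments together, and that it signs
"/my-new-bucket/" while requesting "/my-new-bucket"):

#!/bin/bash
# Sketch of AWS v2 signing against RGW: the empty Content-MD5 and Content-Type
# fields account for the three "\n" after GET, and the signed resource must
# match the requested path exactly.
S3KEY="MY_KEY"
S3SECRET="MY_SECRET_KEY"
resource="/my-new-bucket/"
dateValue=$(date -Ru)
stringToSign="GET\n\n\n${dateValue}\n${resource}"
signature=$(echo -en "${stringToSign}" | openssl sha1 -hmac "${S3SECRET}" -binary | base64)

curl -v -X GET \
  -H "Host: 10.0.2.15:7480" \
  -H "Date: ${dateValue}" \
  -H "Authorization: AWS ${S3KEY}:${signature}" \
  "http://10.0.2.15:7480${resource}"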

> On 18 Sep 2017, at 12:23, junho_k...@tmax.co.kr wrote:
> 
> I’m trying to use Ceph Object Storage from the CLI.
> I used curl to make a request to the RGW the S3 way.
>  
> When I use a Python library (boto), everything works fine, but when I 
> try to make the same request using curl, I always get the error 
> “SignatureDoesNotMatch”.
> I don’t know what goes wrong.
>  
> Here is my script when I tried to make a request using curl
> ---
> #!/bin/bash
>  
> resource="/my-new-bucket/"
> dateValue=`date -Ru`
> S3KEY="MY_KEY"
> S3SECRET="MY_SECRET_KEY"
> stringToSign="GET\n\n${dateValue}\n${resource}"
> signature=`echo -en ${stringToSign} | openssl sha1 -hmac ${S3SECRET} -binary 
> | base64`
>  
> curl -X GET \
> -H "authorization: AWS ${S3KEY}:${signature}"\
> -H "date: ${dateValue}"\
> -H "host: 10.0.2.15:7480"\
> http://10.0.2.15:7480/my-new-bucket  
> --verbose
>  
> 
>  
> The result 
> 
> <?xml version="1.0" encoding="UTF-8"?><Error><Code>SignatureDoesNotMatch</Code><RequestId>tx00019-0059bf7de0-5e25-default</RequestId><HostId>5e25-default-default</HostId></Error>
> 
>  
> Ceph log in /var/log/ceph/ceph-client.rgw.node0.log is
> 
> 2017-09-18 16:51:50.922935 7fc996fa5700  1 == starting new request 
> req=0x7fc996f9f7e0 =
> 2017-09-18 16:51:50.923135 7fc996fa5700  1 == req done req=0x7fc996f9f7e0 
> op status=0 http_status=403 ==
> 2017-09-18 16:51:50.923156 7fc996fa5700  1 civetweb: 0x7fc9cc00d0c0: 
> 10.0.2.15 - - [18/Sep/2017:16:51:50 +0900] "GET /my-new-bucket HTTP/1.1" 403 
> 0 - curl/7.47.0
> 
>  
> Many Thanks
> -Juno
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI production ready?

2017-08-02 Thread Дмитрий Глушенок
Will it be a separate project? There is a third RC for Luminous without a word 
about the iSCSI gateway.

> On 17 July 2017, at 14:54, Jason Dillaman wrote:
> 
> On Sat, Jul 15, 2017 at 11:01 PM, Alvaro Soto wrote:
>> Hi guys,
>> does anyone know in which release the iSCSI interface is going to
>> be production ready, if it isn't yet?
> 
> There are several flavors of RBD iSCSI implementations that are in-use
> by the community. We are working to solidify the integration with LIO
> TCMU (via tcmu-runner) right now for Luminous [1].
> 
>> I mean without the use of a gateway, like a different endpoint connector to
>> a CEPH cluster.
> 
> I'm not sure what you mean here.
> 
>> Thanks in advance.
>> Best.
>> 
>> --
>> 
>> ATTE. Alvaro Soto Escobar
>> 
>> --
>> Great people talk about ideas,
>> average people talk about things,
>> small people talk ... about other people.
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> [1] https://github.com/ceph/ceph/pull/16182 
> 
> 
> -- 
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRC mismatch detection on read (XFS OSD)

2017-07-31 Thread Дмитрий Глушенок
You are right - missing xattrs lead to ENOENT. Corrupting the file 
without removing the xattrs leads to an I/O error without marking the PG as 
inconsistent. Created an issue: http://tracker.ceph.com/issues/20863
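
For reference, a rough sketch of that second case (the object path, pool and names
below are illustrative only - the actual filename depends on the PG and object hash):

systemctl stop ceph-osd@0
# overwrite part of the object payload in place; conv=notrunc keeps the file
# (and therefore its xattrs) intact
dd if=/dev/urandom conv=notrunc bs=4k count=1 \
   of=/var/lib/ceph/osd/ceph-0/current/16.d_head/testobject1__head_XXXXXXXX__10
systemctl start ceph-osd@0
rados -p testpool get testobject1 /tmp/obj   # fails with an I/O error, PG stays "clean"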

> On 28 July 2017, at 23:04, Gregory Farnum wrote:
> 
> 
> 
> On Fri, Jul 28, 2017 at 8:16 AM Дмитрий Глушенок <gl...@jet.msk.su> wrote:
> Hi!
> 
> Just found strange thing while testing deep-scrub on 10.2.7.
> 1. Stop OSD
> 2. Change primary copy's contents (using vi)
> 3. Start OSD
> 
> Then 'rados get' returns "No such file or directory". No error messages seen 
> in OSD log, cluster status "HEALTH_OK".
> 
> 4. ceph pg repair 
> 
> Then 'rados get' works as expected, the "corrupted" data is repaired.
> 
> One time (I was unable to reproduce this) the error was detected on-fly 
> (without OSD restart):
> 
> 2017-07-28 17:34:22.362968 7ff8bfa27700 -1 log_channel(cluster) log [ERR] : 
> 16.d full-object read crc 0x78fcc738 != expected 0x5fd86d3e on 
> 16:b36845b2:::testobject1:head
> 
> Did I miss that CRC storing/verifying started to work on XFS? If so, where 
> are they stored? In xattrs? I thought it was only implemented in BlueStore.
> 
> FileStore maintains CRC checksums opportunistically, such as when you do a 
> full-object write. So in some circumstances it can detect objects with the 
> wrong data and do repairs on its own. (And the checksum is stored in the 
> object_info, which is written down in an xattr, yes.)
> 
> I'm not certain why overwriting the file with vi made it return ENOENT, but 
> probably because it lost the xattrs storing metadata. (...though I'd expect 
> that to return an error on the primary that either prompts it to repair, or 
> else incorrectly returns that raw error to the client. Can you create a 
> ticket with exactly what steps you followed and what outcome you saw?)
> -Greg

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CRC mismatch detection on read (XFS OSD)

2017-07-28 Thread Дмитрий Глушенок
Hi!

Just found a strange thing while testing deep-scrub on 10.2.7.
1. Stop OSD
2. Change primary copy's contents (using vi)
3. Start OSD

Then 'rados get' returns "No such file or directory". No error messages seen in 
OSD log, cluster status "HEALTH_OK".

4. ceph pg repair 

Then 'rados get' works as expected, the "corrupted" data is repaired.

One time (I was unable to reproduce this) the error was detected on the fly 
(without an OSD restart):

2017-07-28 17:34:22.362968 7ff8bfa27700 -1 log_channel(cluster) log [ERR] : 
16.d full-object read crc 0x78fcc738 != expected 0x5fd86d3e on 
16:b36845b2:::testobject1:head

Did I miss that CRC storing/verifying started to work on XFS? If so, where are 
they stored? In xattrs? I thought it was only implemented in BlueStore.

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] oVirt/RHEV and Ceph

2017-07-25 Thread Дмитрий Глушенок
Cinder is used as a management gateway only, while the hypervisors (QEMU) 
communicate directly with the Ceph cluster, passing RBD volumes to VMs (without 
mapping/mounting RBD at the hypervisor level).
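
For illustration only (pool, image and user names here are made up): such a volume
ends up as a QEMU drive opened through librbd, roughly like this, with nothing
mapped or mounted on the hypervisor itself:

qemu-system-x86_64 -m 2048 -nographic \
  -drive format=raw,if=virtio,file=rbd:volumes/volume-1234:id=cinder:conf=/etc/ceph/ceph.conf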

> On 25 July 2017, at 6:18, Brady Deetz wrote:
> 
> Thanks for pointing to some documentation. I'd seen that and it is certainly 
> an option. From my understanding, with a Cinder deployment, you'd have the 
> same failure domains and similar performance characteristics to an oVirt + 
> NFS + RBD deployment. This is acceptable. But, the dream I have in my head is 
> where the RBD images are mounted and controlled on each hypervisor instead of 
> a central storage authority like Cinder. Does that exist for anything or is 
> this a fundamentally flawed idea?
> 
> On Mon, Jul 24, 2017 at 9:41 PM, Jason Dillaman wrote:
> oVirt 3.6 added Cinder/RBD integration [1] and it looks like they are
> currently working on integrating Cinder within a container to simplify
> the integration [2].
> 
> [1] 
> http://www.ovirt.org/develop/release-management/features/storage/cinder-integration/
>  
> 
> [2] 
> http://www.ovirt.org/develop/release-management/features/cinderglance-docker-integration/
>  
> 
> 
> On Mon, Jul 24, 2017 at 10:27 PM, Brady Deetz wrote:
> > Funny enough, I just had a call with Redhat where the OpenStack engineer was
> > voicing his frustration that there wasn't any movement on RBD for oVirt.
> > This is important to me because I'm building out a user-facing private cloud
> > that just isn't going to be big enough to justify OpenStack and its
> > administrative overhead. But, I already have 1.75PB (soon to be 2PB) of
> > CephFS in production. So, it puts me in a really difficult design position.
> >
> > On Mon, Jul 24, 2017 at 9:09 PM, Dino Yancey  > > wrote:
> >>
> >> I was as much as told by Redhat in a sales call that they push Gluster
> >> for oVirt/RHEV and Ceph for OpenStack, and don't have any plans to
> >> change that in the short term. (note this was about a year ago, i
> >> think - so this isn't super current information).
> >>
> >> I seem to recall the hangup was that oVirt had no orchestration
> >> capability for RBD comparable to OpenStack, and that CephFS wasn't
> >> (yet?) viable for use as a "POSIX filesystem" oVirt storage domain.
> >> Personally, I feel like Redhat is worried about competing with
> >> themselves with GlusterFS versus CephFS and is choosing to focus on
> >> Gluster as a filesystem, and Ceph as everything minus the filesystem.
> >>
> >> Which is a shame, as I'm a fan of both Ceph and oVirt and would love
> >> to use my existing RHEV infrastructure to bring Ceph into my
> >> environment.
> >>
> >>
> >> On Mon, Jul 24, 2017 at 8:39 PM, Brady Deetz  >> > wrote:
> >> > I haven't seen much talk about direct integration with oVirt. Obviously
> >> > it
> >> > kind of comes down to oVirt being interested in participating. But, is
> >> > the
> >> > only hold-up getting development time toward an integration or is there
> >> > some
> >> > kind of friction between the dev teams?
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com 
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> >> > 
> >> >
> >>
> >>
> >>
> >> --
> >> __
> >> Dino Yancey
> >> 2GNT.com Admin
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> >
> 
> 
> 
> --
> Jason
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mount CephFS with dedicated user fails: mount error 13 = Permission denied

2017-07-24 Thread Дмитрий Глушенок
Check your kernel version; prior to 4.9 the client also needed to be allowed read 
access on the root path: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014804.html
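
If that is the case here, a possible cap adjustment would be something like the
following (a sketch only - adjust the path and pools to your setup):

# grant read on the filesystem root in addition to rw on /MTY
ceph auth caps client.mtyadm \
  mon 'allow r' \
  mds 'allow r, allow rw path=/MTY' \
  osd 'allow rw pool=hdb-backup,allow rw pool=hdb-backup_metadata'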

> On 24 July 2017, at 12:36, c.mo...@web.de wrote:
> 
> Hello!
> 
> I want to mount CephFS with a dedicated user in order to avoid putting the 
> admin key on every client host.
> Therefore I created a user account
> ceph auth get-or-create client.mtyadm mon 'allow r' mds 'allow rw path=/MTY' 
> osd 'allow rw pool=hdb-backup,allow rw pool=hdb-backup_metadata' -o 
> /etc/ceph/ceph.client.mtyadm.keyring
> and wrote out the keyring
> ceph-authtool -p -n client.mtyadm ceph.client.mtyadm.keyring > 
> ceph.client.mtyadm.key
> 
> This user is now displayed in auth list:
> client.mtyadm
>key: AQBYu3VZLg66LBAAGM1jW+cvNE6BoJWfsORZKA==
>caps: [mds] allow rw path=/MTY
>caps: [mon] allow r
>caps: [osd] allow rw pool=hdb-backup,allow rw pool=hdb-backup_metadata
> 
> When I try to mount directory /MTY on the client host I get this error:
> ld2398:/etc/ceph # mount -t ceph ldcephmon1,ldcephmon2,ldcephmon2:/MTY 
> /mnt/cephfs -o name=mtyadm,secretfile=/etc/ceph/ceph.client.mtyadm.key
> mount error 13 = Permission denied
> 
> The mount works using admin though:
> ld2398:/etc/ceph # mount -t ceph ldcephmon1,ldcephmon2,ldcephmon2:/MTY 
> /mnt/cephfs -o name=admin,secretfile=/etc/ceph/ceph.client.admin.key
> ld2398:/etc/ceph # mount | grep cephfs
> 10.96.5.37,10.96.5.38,10.96.5.38:/MTY on /mnt/cephfs type ceph 
> (rw,relatime,name=admin,secret=,acl)
> 
> What is causing this mount error?
> 
> THX
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-21 Thread Дмитрий Глушенок
All three mons have the value "simple".

> On 21 July 2017, at 15:47, Ilya Dryomov wrote:
> 
> On Thu, Jul 20, 2017 at 6:35 PM, Дмитрий Глушенок <gl...@jet.msk.su> wrote:
>> Hi Ilya,
>> 
>> While trying to reproduce the issue I've found that:
>> - it is relatively easy to reproduce 5-6 minutes hangs just by killing
>> active mds process (triggering failover) while writing a lot of data.
>> Unacceptable timeout, but not the case of
>> http://tracker.ceph.com/issues/15255
>> - it is hard to reproduce the endless hang (I've spent an hour without
>> success)
>> 
>> One thing I've noticed analysing logs is that "endless hang" always was
>> accompanied with following messages:
>> 
>> Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789
>> session lost, hunting for new mon
>> Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789
>> session established
>> Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789
>> session lost, hunting for new mon
>> Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789
>> session established
>> Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789
>> session lost, hunting for new mon
>> Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789
>> session established
>> Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789
>> session lost, hunting for new mon
>> Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789
>> session established
>> Jul 20 15:33:58 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789
>> session lost, hunting for new mon
>> Jul 20 15:34:29 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789
>> session established
>> 
>> 
>> Bug http://tracker.ceph.com/issues/17664 describes such behaviour and it was
>> fixed in releases starting with v11.1.0 (I'm using 10.2.7). So, the lost
>> session somehow triggers client disconnection and fencing (as described at
>> http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs).
>> 
>> Do you still think it should be posted to
>> http://tracker.ceph.com/issues/15255 ?
> 
> Are you using async messenger?  You can check with
> 
> $ ceph daemon mon.X config get ms_type
> 
> for all MONs.
> 
> Thanks,
> 
>Ilya

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-20 Thread Дмитрий Глушенок
Hi Ilya,

While trying to reproduce the issue I've found that:
- it is relatively easy to reproduce 5-6 minute hangs just by killing the active 
MDS process (triggering failover) while writing a lot of data. An unacceptable 
timeout, but not the case of http://tracker.ceph.com/issues/15255
- it is hard to reproduce the endless hang (I've spent an hour without success)

One thing I've noticed while analysing the logs is that the "endless hang" was always 
accompanied by the following messages:
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 
session lost, hunting for new mon
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 
session established
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 
session lost, hunting for new mon
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session established
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session lost, hunting for new mon
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 
session established
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 
session lost, hunting for new mon
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session established
Jul 20 15:33:58 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session lost, hunting for new mon
Jul 20 15:34:29 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session established

Bug http://tracker.ceph.com/issues/17664 describes such behaviour and it was 
fixed in releases starting with v11.1.0 (I'm using 10.2.7). So, the lost 
session somehow triggers client disconnection and fencing (as described at 
http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs).
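
(When a client does get fenced like that, the only recovery I know of is a forced
unmount and a fresh mount - a rough sketch, with mount options depending on the setup:)

umount -f /mnt/cephfs
mount -t ceph 10.50.67.25,10.50.67.26,10.50.67.27:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret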

Do you still think it should be posted to http://tracker.ceph.com/issues/15255 ?

> On 20 July 2017, at 17:02, Ilya Dryomov wrote:
> 
> On Thu, Jul 20, 2017 at 3:23 PM, Дмитрий Глушенок  wrote:
>> Looks like I have similar issue as described in this bug:
>> http://tracker.ceph.com/issues/15255
>> Writer (dd in my case) can be restarted and then writing continues, but
>> until restart dd looks like hanged on write.
>> 
>> 20 июля 2017 г., в 16:12, Дмитрий Глушенок  написал(а):
>> 
>> Hi,
>> 
>> Repeated the test using kernel 4.12.0. OSD node crash seems to be handled
>> fine now, but MDS crash still leads to hanged writes to CephFS. Now it was
>> enough just to crash the first MDS - failover didn't happened. At the same
>> time FUSE client was running on another client - no problems with it.
> 
> Could you please post the exact steps for reproducing with 4.12 to that
> ticket?  It sounds like something that should be prioritized.
> 
> Thanks,
> 
>Ilya

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-20 Thread Дмитрий Глушенок
Looks like I have a similar issue to the one described in this bug: 
http://tracker.ceph.com/issues/15255
The writer (dd in my case) can be restarted and then writing continues, but until 
the restart dd appears to be hung on the write.

> On 20 July 2017, at 16:12, Дмитрий Глушенок wrote:
> 
> Hi,
> 
> Repeated the test using kernel 4.12.0. OSD node crash seems to be handled 
> fine now, but MDS crash still leads to hanged writes to CephFS. Now it was 
> enough just to crash the first MDS - failover didn't happened. At the same 
> time FUSE client was running on another client - no problems with it.
> 
>> On 19 July 2017, at 13:20, Дмитрий Глушенок <gl...@jet.msk.su> wrote:
>> 
>> You right. Forgot to mention that the client was using kernel 4.9.9.
>> 
>>> On 19 July 2017, at 12:36, 许雪寒 <xuxue...@360.cn> wrote:
>>> 
>>> Hi, thanks for your sharing:-)
>>> 
>>> So I guess you have not put cephfs into real production environment, and 
>>> it's still in test phase, right?
>>> 
>>> Thanks again:-)
>>> 
>>> From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
>>> Sent: 19 July 2017 17:33
>>> To: 许雪寒
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] How's cephfs going?
>>> 
>>> Hi,
>>> 
>>> I can share negative test results (on Jewel 10.2.6). All tests were 
>>> performed while actively writing to CephFS from single client (about 1300 
>>> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and 
>>> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at 
>>> all, active/standby.
>>> - Crashing one node resulted in write hangs for 17 minutes. Repeating the 
>>> test resulted in CephFS hangs forever.
>>> - Restarting active MDS resulted in successful failover to standby. Then, 
>>> after standby became active and the restarted MDS became standby the new 
>>> active was restarted. CephFS hanged for 12 minutes.
>>> 
>>> P.S. Planning to repeat the tests again on 10.2.7 or higher
>>> 
>>> On 19 July 2017, at 6:47, 许雪寒 <xuxue...@360.cn> wrote:
>>> 
>>> Is there anyone else willing to share some usage information of cephfs?
>>> Could developers tell whether cephfs is a major effort in the whole ceph 
>>> development?
>>> 
>>> From: 许雪寒
>>> Sent: 17 July 2017 11:00
>>> To: ceph-users@lists.ceph.com
>>> Subject: How's cephfs going?
>>> 
>>> Hi, everyone.
>>> 
>>> We intend to use cephfs of Jewel version, however, we don’t know its 
>>> status. Is it production ready in Jewel? Does it still have lots of bugs? 
>>> Is it a major effort of the current ceph development? And who are using 
>>> cephfs now?
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>> 
>>> --
>>> Dmitry Glushenok
>>> Jet Infosystems
>>> 
>> 
>> --
>> Дмитрий Глушенок
>> Инфосистемы Джет
>> +7-910-453-2568
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Дмитрий Глушенок
> Инфосистемы Джет
> +7-910-453-2568
> 

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-20 Thread Дмитрий Глушенок
Hi,

I repeated the test using kernel 4.12.0. An OSD node crash now seems to be handled 
fine, but an MDS crash still leads to hung writes to CephFS. This time it was enough 
just to crash the first MDS - failover didn't happen. At the same time a FUSE 
client was running on another host - no problems with it.

> On 19 July 2017, at 13:20, Дмитрий Глушенок wrote:
> 
> You right. Forgot to mention that the client was using kernel 4.9.9.
> 
>> On 19 July 2017, at 12:36, 许雪寒 <xuxue...@360.cn> wrote:
>> 
>> Hi, thanks for your sharing:-)
>> 
>> So I guess you have not put cephfs into real production environment, and 
>> it's still in test phase, right?
>> 
>> Thanks again:-)
>> 
>> From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
>> Sent: 19 July 2017 17:33
>> To: 许雪寒
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] How's cephfs going?
>> 
>> Hi,
>> 
>> I can share negative test results (on Jewel 10.2.6). All tests were 
>> performed while actively writing to CephFS from single client (about 1300 
>> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and 
>> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at 
>> all, active/standby.
>> - Crashing one node resulted in write hangs for 17 minutes. Repeating the 
>> test resulted in CephFS hangs forever.
>> - Restarting active MDS resulted in successful failover to standby. Then, 
>> after standby became active and the restarted MDS became standby the new 
>> active was restarted. CephFS hanged for 12 minutes.
>> 
>> P.S. Planning to repeat the tests again on 10.2.7 or higher
>> 
>> On 19 July 2017, at 6:47, 许雪寒 <xuxue...@360.cn> wrote:
>> 
>> Is there anyone else willing to share some usage information of cephfs?
>> Could developers tell whether cephfs is a major effort in the whole ceph 
>> development?
>> 
>> From: 许雪寒
>> Sent: 17 July 2017 11:00
>> To: ceph-users@lists.ceph.com
>> Subject: How's cephfs going?
>> 
>> Hi, everyone.
>> 
>> We intend to use cephfs of Jewel version, however, we don’t know its status. 
>> Is it production ready in Jewel? Does it still have lots of bugs? Is it a 
>> major effort of the current ceph development? And who are using cephfs now?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
> 
> --
> Дмитрий Глушенок
> Инфосистемы Джет
> +7-910-453-2568
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
Unfortunately, no. Using FUSE was ruled out due to poor performance.

> On 19 July 2017, at 13:45, Blair Bethwaite wrote:
> 
> Interesting. Any FUSE client data-points?
> 
> On 19 July 2017 at 20:21, Дмитрий Глушенок  wrote:
>> RBD (via krbd) was in action at the same time - no problems.
>> 
>> On 19 July 2017, at 12:54, Blair Bethwaite wrote:
>> 
>> It would be worthwhile repeating the first test (crashing/killing an
>> OSD host) again with just plain rados clients (e.g. rados bench)
>> and/or rbd. It's not clear whether your issue is specifically related
>> to CephFS or actually something else.
>> 
>> Cheers,
>> 
>> On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
>> 
>> Hi,
>> 
>> I can share negative test results (on Jewel 10.2.6). All tests were
>> performed while actively writing to CephFS from single client (about 1300
>> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
>> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
>> all, active/standby.
>> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
>> test resulted in CephFS hangs forever.
>> - Restarting active MDS resulted in successful failover to standby. Then,
>> after standby became active and the restarted MDS became standby the new
>> active was restarted. CephFS hanged for 12 minutes.
>> 
>> P.S. Planning to repeat the tests again on 10.2.7 or higher
>> 
>> On 19 July 2017, at 6:47, 许雪寒 wrote:
>> 
>> Is there anyone else willing to share some usage information of cephfs?
>> Could developers tell whether cephfs is a major effort in the whole ceph
>> development?
>> 
>> From: 许雪寒
>> Sent: 17 July 2017 11:00
>> To: ceph-users@lists.ceph.com
>> Subject: How's cephfs going?
>> 
>> Hi, everyone.
>> 
>> We intend to use cephfs of Jewel version, however, we don’t know its status.
>> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
>> major effort of the current ceph development? And who are using cephfs now?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> 
>> --
>> Cheers,
>> ~Blairo
>> 
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
> 
> 
> 
> -- 
> Cheers,
> ~Blairo

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
RBD (via krbd) was in action at the same time - no problems.

> On 19 July 2017, at 12:54, Blair Bethwaite wrote:
> 
> It would be worthwhile repeating the first test (crashing/killing an
> OSD host) again with just plain rados clients (e.g. rados bench)
> and/or rbd. It's not clear whether your issue is specifically related
> to CephFS or actually something else.
> 
> Cheers,
> 
> On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
>> Hi,
>> 
>> I can share negative test results (on Jewel 10.2.6). All tests were
>> performed while actively writing to CephFS from single client (about 1300
>> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
>> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
>> all, active/standby.
>> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
>> test resulted in CephFS hangs forever.
>> - Restarting active MDS resulted in successful failover to standby. Then,
>> after standby became active and the restarted MDS became standby the new
>> active was restarted. CephFS hanged for 12 minutes.
>> 
>> P.S. Planning to repeat the tests again on 10.2.7 or higher
>> 
>> On 19 July 2017, at 6:47, 许雪寒 wrote:
>> 
>> Is there anyone else willing to share some usage information of cephfs?
>> Could developers tell whether cephfs is a major effort in the whole ceph
>> development?
>> 
>> From: 许雪寒
>> Sent: 17 July 2017 11:00
>> To: ceph-users@lists.ceph.com
>> Subject: How's cephfs going?
>> 
>> Hi, everyone.
>> 
>> We intend to use cephfs of Jewel version, however, we don’t know its status.
>> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
>> major effort of the current ceph development? And who are using cephfs now?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> 
> -- 
> Cheers,
> ~Blairo

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
You're right. I forgot to mention that the client was using kernel 4.9.9.

> On 19 July 2017, at 12:36, 许雪寒 wrote:
> 
> Hi, thanks for your sharing:-)
> 
> So I guess you have not put cephfs into a real production environment, and it's 
> still in the test phase, right?
> 
> Thanks again:-)
> 
> From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
> Sent: 19 July 2017 17:33
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How's cephfs going?
> 
> Hi,
> 
> I can share negative test results (on Jewel 10.2.6). All tests were performed 
> while actively writing to CephFS from single client (about 1300 MB/sec). 
> Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and metadata, 6 
> HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at all, 
> active/standby.
> - Crashing one node resulted in write hangs for 17 minutes. Repeating the 
> test resulted in CephFS hangs forever.
> - Restarting active MDS resulted in successful failover to standby. Then, 
> after standby became active and the restarted MDS became standby the new 
> active was restarted. CephFS hanged for 12 minutes.
> 
> P.S. Planning to repeat the tests again on 10.2.7 or higher
> 
> On 19 July 2017, at 6:47, 许雪寒 wrote:
> 
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph 
> development?
> 
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
> 
> Hi, everyone.
> 
> We intend to use cephfs of Jewel version, however, we don’t know its status. 
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a 
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Dmitry Glushenok
> Jet Infosystems
> 

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
Hi,

I can share negative test results (on Jewel 10.2.6). All tests were performed 
while actively writing to CephFS from a single client (about 1300 MB/sec). The 
cluster consists of 8 nodes with 8 OSDs each (2 SSDs for journals and metadata, 
6-HDD RAID6 for data); MON/MDS are on dedicated nodes. 2 MDS in total, active/standby.
- Crashing one node resulted in write hangs for 17 minutes. Repeating the test 
resulted in CephFS hanging forever.
- Restarting the active MDS resulted in a successful failover to standby. Then, after 
the standby became active and the restarted MDS became standby, the new active was 
restarted. CephFS hung for 12 minutes.

P.S. Planning to repeat the tests again on 10.2.7 or higher

> On 19 July 2017, at 6:47, 许雪寒 wrote:
> 
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph 
> development?
> 
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
> 
> Hi, everyone.
> 
> We intend to use cephfs of Jewel version, however, we don’t know its status. 
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a 
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD returns back and recovery process

2017-06-21 Thread Дмитрий Глушенок
Hello!

It is clear what happens after an OSD goes OUT - PGs are backfilled to other OSDs, 
and PGs whose primary copies were on the lost OSD get new primary OSDs. But when 
the OSD comes back, it looks like all the data for which that OSD was holding 
primary copies is read from it and re-written to other OSDs (to the 
secondary copies). Am I right? If so, what is the reason for re-reading copies from 
the returned OSD? Wouldn't it be cheaper to just track the modified objects?

Thank you.

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librbd + rbd-nbd

2017-04-07 Thread Дмитрий Глушенок
Hi!

No experience here, but be ready to limit your RBD devices to 2 TB in size: 
http://tracker.ceph.com/issues/17219
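
For example (pool and image names are made up), it may be safer to keep nbd-backed
images at 2 TB or less until that is resolved:

rbd create --size 2097152 rbd/nbd-test   # --size is in MB by default, so this is 2 TB
rbd-nbd map rbd/nbd-test                 # exposes the image as /dev/nbdX through librbd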

> On 5 Apr 2017, at 22:15, Prashant Murthy wrote:
> 
> Hi all, 
> 
> 
> I wanted to ask if anybody is using librbd (user mode lib) with rbd-nbd 
> (kernel module) on their Ceph clients. We're currently using krbd, but that 
> doesn't support some of the features (such as rbd mirroring). So, I wanted to 
> check if anybody has experience running with nbd + librbd on their clusters 
> and can provide more details.
> 
> Prashant
> 
> -- 
> Prashant Murthy
> Sr Director, Software Engineering | Salesforce
> Mobile: 919-961-3041
> 
> 
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] centos 7.3 libvirt (2.0.0-10.el7_3.2) and openstack volume attachment w/ cephx broken

2016-12-20 Thread Дмитрий Глушенок
Hi,

I was playing with the oVirt/Cinder integration and faced the same issue. At the same 
time virsh on CentOS 7.3 was working fine with RBD images. So, as a workaround the 
following procedure can be used to permanently set the secret on the libvirt host:

# vi /tmp/secret.xml
<secret ephemeral='no' private='no'>
   <uuid>db11828c-f9e8-48cb-81dd-29fc00ec1c14</uuid>
   <usage type='ceph'>
     <name>admin</name>
   </usage>
</secret>

# virsh secret-define /tmp/secret.xml
# virsh secret-set-value db11828c-f9e8-48cb-81dd-29fc00ec1c14 <base64 client.admin key>

Googleme: the VDSM error was:

Thread-289::ERROR::2016-12-20 
12:44:39,306::vm::765::virt.vm::(_startUnderlyingVm) 
vmId=`16889d09-14b1-455b-ab2c-fcbd697b8f59`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 706, in _startUnderlyingVm
self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1996, in _run
self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, 
in wrapper
ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in wrapper
return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in createXML
if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Secret not found: no secret with matching uuid 
'db11828c-f9e8-48cb-81dd-29fc00ec1c14'
Thread-289::INFO::2016-12-20 12:44:39,309::vm::1308::virt.vm::(setDownStatus) 
vmId=`16889d09-14b1-455b-ab2c-fcbd697b8f59`::Changed state to Down: Secret not 
found: no secret with matching uuid 'db11828c-f9e8-48cb-81dd-29fc00ec1c14' 
(code=1)

> On 20 Dec 2016, at 3:06, Mike Lowe wrote:
> 
> Not that I’ve found, it’s a little hard to search for.  I believe it’s 
> related to this libvirt mailing list thread 
> https://www.redhat.com/archives/libvir-list/2016-October/msg00396.html 
> 
> You’ll find this in the libvirt qemu log for the instance 'No secret with id 
> 'scsi0-0-0-1-secret0’’ and this in the nova-compute log 'libvirtError: 
> internal error: unable to execute QEMU command '__com.redhat_drive_add': 
> Device 'drive-scsi0-0-0-1' could not be initialized’.  I was able to yum 
> downgrade twice to get to something from the 1.2 series.
> 
> 
>> On Dec 19, 2016, at 6:40 PM, Jason Dillaman > > wrote:
>> 
>> Do you happen to know if there is an existing bugzilla ticket against
>> this issue?
>> 
>> On Mon, Dec 19, 2016 at 3:46 PM, Mike Lowe > > wrote:
>>> It looks like the libvirt (2.0.0-10.el7_3.2) that ships with centos 7.3 is 
>>> broken out of the box when it comes to hot plugging new virtio-scsi devices 
>>> backed by rbd and cephx auth.  If you use openstack, cephx auth, and 
>>> centos, I’d caution against the upgrade to centos 7.3 right now.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> -- 
>> Jason
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Дмитрий Глушенок
Hi,

The assumptions are:
- the OSD is nearly full
- the HDD vendor is not hiding a real LSE (latent sector error) rate like 1 in 10^18 
behind "not more than 1 unrecoverable error in 10^15 bits read"

In case of a disk (OSD) failure Ceph has to read a full copy of the disk from other 
nodes (to restore redundancy). The more you read, the higher the chance that an LSE 
will happen on one of the nodes Ceph is reading from (all of them have the same LSE 
rate). In case of an LSE Ceph cannot recover the data because the error is 
unrecoverable and there is no other place to read the data from (in contrast 
to size=3, where the third copy can be used to recover from the error).

Here is a good paper about the influence of LSEs on RAID5: 
http://www.snia.org/sites/default/orig/sdc_archives/2010_presentations/tuesday/JasonResch_%20Solving-Data-Loss.pdf
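
As a back-of-envelope check (assuming 8 TB = 8x10^12 bytes and a bit error rate of
1e-15; the exact figure depends on the error model used):

# probability of at least one LSE while reading one full 8 TB replica
echo '1 - e(-(8 * 10^12 * 8) / 10^15)' | bc -l
# ~0.062, i.e. roughly 6% - the same ballpark as the 5.8% quoted below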

> On 7 Dec 2016, at 15:07, Wolfgang Link wrote:
> 
> Hi
> 
> I'm very interested in this calculation.
> What assumptions have you made?
> Network speed, how full the OSDs are, etc.?
> 
> Thanks
> 
> Wolfgang
> 
> On 12/07/2016 11:16 AM, Дмитрий Глушенок wrote:
>> Hi,
>> 
>> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on
>> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8
>> TB disk. So, every 18th recovery will probably fail. Similarly to RAID6
>> (two parity disks) size=3 mitigates the problem.
>> By the way - why it is a common opinion that using RAID (RAID6) with
>> Ceph (size=2) is bad idea? It is cheaper than size=3, all hardware disk
>> errors are handled by RAID (instead of OS/Ceph), decreases OSD count,
>> adds some battery-backed cache and increases performance of single OSD.
>> 
>>> On 7 Dec 2016, at 11:08, Wido den Hollander <w...@42on.com> wrote:
>>> 
>>> Hi,
>>> 
>>> As a Ceph consultant I get numerous calls throughout the year to help
>>> people with getting their broken Ceph clusters back online.
>>> 
>>> The causes of downtime vary vastly, but one of the biggest causes is
>>> that people use replication 2x. size = 2, min_size = 1.
>>> 
>>> In 2016 the amount of cases I have where data was lost due to these
>>> settings grew exponentially.
>>> 
>>> Usually a disk failed, recovery kicks in and while recovery is
>>> happening a second disk fails. Causing PGs to become incomplete.
>>> 
>>> There have been to many times where I had to use xfs_repair on broken
>>> disks and use ceph-objectstore-tool to export/import PGs.
>>> 
>>> I really don't like these cases, mainly because they can be prevented
>>> easily by using size = 3 and min_size = 2 for all pools.
>>> 
>>> With size = 2 you go into the danger zone as soon as a single
>>> disk/daemon fails. With size = 3 you always have two additional copies
>>> left thus keeping your data safe(r).
>>> 
>>> If you are running CephFS, at least consider running the 'metadata'
>>> pool with size = 3 to keep the MDS happy.
>>> 
>>> Please, let this be a big warning to everybody who is running with
>>> size = 2. The downtime and problems caused by missing objects/replicas
>>> are usually big and it takes days to recover from those. But very
>>> often data is lost and/or corrupted which causes even more problems.
>>> 
>>> I can't stress this enough. Running with size = 2 in production is a
>>> SERIOUS hazard and should not be done imho.
>>> 
>>> To anyone out there running with size = 2, please reconsider this!
>>> 
>>> Thanks,
>>> 
>>> Wido
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Дмитрий Глушенок
RAID10 will also suffer from LSEs on big disks, won't it?

> On 7 Dec 2016, at 13:35, Christian Balzer wrote:
> 
> 
> 
> Hello,
> 
> On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:
> 
>> Hi,
>> 
>> Let me add a little math to your warning: with LSE rate of 1 in 10^15 on 
>> modern 8 TB disks there is 5,8% chance to hit LSE during recovery of 8 TB 
>> disk. So, every 18th recovery will probably fail. Similarly to RAID6 (two 
>> parity disks) size=3 mitigates the problem.
> 
> Indeed.
> That math changes significantly of course if you have very reliable,
> endurable, well monitored and fast SSDs of not too big a size.
> Something that will recover in less than hour.
> 
> So people with SSD pools might have an acceptable risk.
> 
> That being said, I'd prefer size 3 for my SSD pool as well, alas both cost
> and the increased latency stopped me for this time.
> Next round I'll upgrade my HW requirements and budget.
> 
>> By the way - why it is a common opinion that using RAID (RAID6) with Ceph 
>> (size=2) is bad idea? It is cheaper than size=3, all hardware disk errors 
>> are handled by RAID (instead of OS/Ceph), decreases OSD count, adds some 
>> battery-backed cache and increases performance of single OSD.
>> 
> 
> I did run something like that and if your IOPS needs are low enough it
> works well (the larger HW cache the better).
> But once you exceed the combined speed of HW cache coalescing, it degrades
> badly, something that's usually triggered by very mixed R/W ops and/or
> deep scrubs.
> It also depends on your cluster size, if you have dozens of OSDs based on
> such a design, it will work a lot better than with a few.
> 
> I changed it to RAID10s with 4 HDDs each since I needed the speed (IOPS)
> and didn't require all the space.
> 
> Christian
> 
>>> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
>>> 
>>> Hi,
>>> 
>>> As a Ceph consultant I get numerous calls throughout the year to help 
>>> people with getting their broken Ceph clusters back online.
>>> 
>>> The causes of downtime vary vastly, but one of the biggest causes is that 
>>> people use replication 2x. size = 2, min_size = 1.
>>> 
>>> In 2016 the amount of cases I have where data was lost due to these 
>>> settings grew exponentially.
>>> 
>>> Usually a disk failed, recovery kicks in and while recovery is happening a 
>>> second disk fails. Causing PGs to become incomplete.
>>> 
>>> There have been to many times where I had to use xfs_repair on broken disks 
>>> and use ceph-objectstore-tool to export/import PGs.
>>> 
>>> I really don't like these cases, mainly because they can be prevented 
>>> easily by using size = 3 and min_size = 2 for all pools.
>>> 
>>> With size = 2 you go into the danger zone as soon as a single disk/daemon 
>>> fails. With size = 3 you always have two additional copies left thus 
>>> keeping your data safe(r).
>>> 
>>> If you are running CephFS, at least consider running the 'metadata' pool 
>>> with size = 3 to keep the MDS happy.
>>> 
>>> Please, let this be a big warning to everybody who is running with size = 
>>> 2. The downtime and problems caused by missing objects/replicas are usually 
>>> big and it takes days to recover from those. But very often data is lost 
>>> and/or corrupted which causes even more problems.
>>> 
>>> I can't stress this enough. Running with size = 2 in production is a 
>>> SERIOUS hazard and should not be done imho.
>>> 
>>> To anyone out there running with size = 2, please reconsider this!
>>> 
>>> Thanks,
>>> 
>>> Wido
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
> 
> 
> -- 
> Christian Balzer           Network/Systems Engineer
> ch...@gol.com              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replication: A BIG warning

2016-12-07 Thread Дмитрий Глушенок
Hi,

Let me add a little math to your warning: with an LSE rate of 1 in 10^15, on modern 
8 TB disks there is a 5.8% chance of hitting an LSE during recovery of an 8 TB disk. 
So, every 18th recovery will probably fail. Similarly to RAID6 (two parity disks), 
size=3 mitigates the problem.
By the way - why is it a common opinion that using RAID (RAID6) with Ceph 
(size=2) is a bad idea? It is cheaper than size=3, all hardware disk errors are 
handled by the RAID (instead of the OS/Ceph), it decreases the OSD count, adds some 
battery-backed cache and increases the performance of a single OSD.

> On 7 Dec 2016, at 11:08, Wido den Hollander wrote:
> 
> Hi,
> 
> As a Ceph consultant I get numerous calls throughout the year to help people 
> with getting their broken Ceph clusters back online.
> 
> The causes of downtime vary vastly, but one of the biggest causes is that 
> people use replication 2x. size = 2, min_size = 1.
> 
> In 2016 the amount of cases I have where data was lost due to these settings 
> grew exponentially.
> 
> Usually a disk failed, recovery kicks in and while recovery is happening a 
> second disk fails. Causing PGs to become incomplete.
> 
> There have been to many times where I had to use xfs_repair on broken disks 
> and use ceph-objectstore-tool to export/import PGs.
> 
> I really don't like these cases, mainly because they can be prevented easily 
> by using size = 3 and min_size = 2 for all pools.
> 
> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> fails. With size = 3 you always have two additional copies left thus keeping 
> your data safe(r).
> 
> If you are running CephFS, at least consider running the 'metadata' pool with 
> size = 3 to keep the MDS happy.
> 
> Please, let this be a big warning to everybody who is running with size = 2. 
> The downtime and problems caused by missing objects/replicas are usually big 
> and it takes days to recover from those. But very often data is lost and/or 
> corrupted which causes even more problems.
> 
> I can't stress this enough. Running with size = 2 in production is a SERIOUS 
> hazard and should not be done imho.
> 
> To anyone out there running with size = 2, please reconsider this!
> 
> Thanks,
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Math behind : : OSD count vs OSD process vs OSD ports

2015-11-18 Thread Дмитрий Глушенок
Hi Vickey,

> On 18 Nov 2015, at 11:36, Vickey Singh wrote:
> 
> Can anyone please help me understand this.
> 
> Thank You
> 
> 
> On Mon, Nov 16, 2015 at 5:55 PM, Vickey Singh wrote:
> Hello Community
> 
> Need your help in understanding this.
> 
> I have the below node, which is hosting 60 physical disks, running 1 OSD per 
> disk so total 60 Ceph OSD daemons
> 
> [root@node01 ~]# service ceph status | grep -i osd | grep -i running | wc -l
> 60
> [root@node01 ~]#
> 
> However if i check OSD processes it shows that there are 120 OSD process are 
> running.
> 
> [root@node01 ~]# ps -ef | grep -i osd | grep -v grep | wc -l
> 120
> [root@node01 ~]#
> 
> Question 1: why are there 120 processes? I think it should be 60 (because of the 
> 60 OSDs on the system).
> My guess: is this because of multithreading?

No, it is due to the way the OSD processes were launched. Try the following (the 
quotes are necessary):

$ bash -c "sleep 1; sleep 999" &
$ ps -ef | grep sleep | grep -v grep

You will see that the bash process waits for the "sleep 999" process to finish. OSD 
processes are launched similarly.

> 
> Now if i check the number of ports used by OSD its comming out to be 240
> 
> [root@node01 ~]# netstat -plunt | grep -i osd | wc -l
> 240
> [root@node01 ~]#
> 
> Question 2 : Now why its 240 ports ? It should be 60 ( because of 60 OSD on 
> the system)

It is because each OSD uses four ports: 
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/#osd-ip-tables

> 
> If i grep a specific OSD port , its shows 2 ports are occupied by OSD process 
> 260519
> 
> [root@node01 ~]# netstat -plunt | grep -i osd | grep -i 6819
> tcp0  0 10.101.50.1:6819 
> 0.0.0.0:*   LISTEN  260519/ceph-osd
> tcp0  0 10.102.50.1:6819 
> 0.0.0.0:*   LISTEN  260519/ceph-osd
> [root@node01 ~]#
> 
> Question 3 : Now based on the scenario 2 it should be 4 ( so 60 OSD x 4 ports 
> = 240 ports in total)
> 
> I have two public and cluster network configured in ceph.conf , is all these 
> because of two different networks ?

If you grep for 260519 (the PID) instead of 6819 (the port) you should see four 
listening ports (two for the cluster network and two for the public network).
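
For example, with the PID from the listing above:

# all listening sockets belonging to that single ceph-osd process
netstat -plunt | grep -w 260519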

> 
> I would really appreciate if some knowledgeable person share his 
> understanding with me.
> 
> Thank you in advance.
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados bench leaves objects in tiered pool

2015-11-03 Thread Дмитрий Глушенок
Hi,

Thanks Gregory and Robert, now it is a bit clearer.

After cache-flush-evict-all almost all objects were deleted, but 101 remained 
in the cache pool. Also 1 PG changed its state to inconsistent with HEALTH_ERR. 
"ceph pg repair" changed the object count to 100, but at least ceph became healthy.

Now it looks like:
POOLS:
NAME  ID USED  %USED MAX AVAIL OBJECTS 
rbd-cache 36 23185 0  157G 100 
rbd   37 0 0  279G   0 
# rados -p rbd-cache ls -all
# rados -p rbd ls -all
# 

Is there any way to find what the objects are?

"ceph pg ls-by-pool rbd-cache" gives me pgs of the objects. Looking into these 
pgs gives me nothing I can understand :)

# ceph pg ls-by-pool rbd-cache | head -4
pg_stat objects mip degrmispunf bytes   log disklog state   
state_stamp v   reportedup   up_primary   acting  
acting_primary  last_scrub  scrub_stamp last_deep_scrub deep_scrub_stamp
36.01   0   0   0   0   83  926 926 
active+clean2015-11-03 22:06:39.193371  798'926   798:640 [4,0,3] 4 
  [4,0,3] 4   798'926 2015-11-03 22:06:39.193321  798'926 
2015-11-03 22:06:39.193321
36.11   0   0   0   0   193 854 854 
active+clean2015-11-03 18:28:51.190819  798'854   798:515 [1,4,3] 1 
  [1,4,3] 1   796'628 2015-11-03 18:28:51.190749  0'0 
2015-11-02 18:28:42.546224
36.21   0   0   0   0   198 869 869 
active+clean2015-11-03 18:28:44.556048  798'869   798:554 [2,0,1] 2 
  [2,0,1] 2   796'650 2015-11-03 18:28:44.555980  0'0 
2015-11-02 18:28:42.546226
#

# find /var/lib/ceph/osd/ceph-0/current/36.0_head/
/var/lib/ceph/osd/ceph-0/current/36.0_head/
/var/lib/ceph/osd/ceph-0/current/36.0_head/__head___24
/var/lib/ceph/osd/ceph-0/current/36.0_head/hit\uset\u36.0\uarchive\u2015-11-03 
11:12:37.962360\u2015-11-03 21:28:58.149662__head__.ceph-internal_24
# find /var/lib/ceph/osd/ceph-0/current/36.2_head/
/var/lib/ceph/osd/ceph-0/current/36.2_head/
/var/lib/ceph/osd/ceph-0/current/36.2_head/__head_0002__24
/var/lib/ceph/osd/ceph-0/current/36.2_head/hit\uset\u36.2\uarchive\u2015-11-02 
19:50:00.788736\u2015-11-03 21:29:02.460568__head_0002_.ceph-internal_24
#

# ls -l 
/var/lib/ceph/osd/ceph-0/current/36.0_head/hit\\uset\\u36.0\\uarchive\\u2015-11-03\
 11\:12\:37.962360\\u2015-11-03\ 
21\:28\:58.149662__head__.ceph-internal_24 
-rw-r--r--. 1 root root 83 Nov  3 21:28 
/var/lib/ceph/osd/ceph-0/current/36.0_head/hit\uset\u36.0\uarchive\u2015-11-03 
11:12:37.962360\u2015-11-03 21:28:58.149662__head__.ceph-internal_24
# 
# ls -l 
/var/lib/ceph/osd/ceph-0/current/36.2_head/hit\\uset\\u36.2\\uarchive\\u2015-11-02\
 19\:50\:00.788736\\u2015-11-03\ 
21\:29\:02.460568__head_0002_.ceph-internal_24 
-rw-r--r--. 1 root root 198 Nov  3 21:29 
/var/lib/ceph/osd/ceph-0/current/36.2_head/hit\uset\u36.2\uarchive\u2015-11-02 
19:50:00.788736\u2015-11-03 21:29:02.460568__head_0002_.ceph-internal_24
#

--
Dmitry Glushenok
Jet Infosystems


> On 3 Nov 2015, at 20:11, Robert LeBlanc wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Try:
> 
> rados -p {cachepool} cache-flush-evict-all
> 
> and see if the objects clean up.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Nov 3, 2015 at 8:02 AM, Gregory Farnum  wrote:
>> When you have a caching pool in writeback mode, updates to objects
>> (including deletes) are handled by writeback rather than writethrough.
>> Since there's no other activity against these pools, there is nothing
>> prompting the cache pool to flush updates out to the backing pool, so
>> the backing pool hasn't deleted its objects because nothing's told it
>> to. You'll find that the cache pool has deleted the data for its
>> objects, but it's keeping around a small "whiteout" and the object
>> info metadata.
>> The "rados ls" you're using has never played nicely with cache tiering
>> and probably never will. :( Listings are expensive operations and
>> modifying them to do more than the simple info scan would be fairly
>> expensive in terms of computation and IO.
>> 
>> I think there are some caching commands you can send to flush updates
>> which would cause the objects to be entirely deleted, but I don't have
>> them off-hand. You can probably search the mailing list archives or
>> the docs for tiering commands. :)
>> -Greg
>> 
>> On Tue, Nov 3, 2015 at 12:4

[ceph-users] rados bench leaves objects in tiered pool

2015-11-03 Thread Дмитрий Глушенок
Hi,

While benchmarking a tiered pool using rados bench I noticed that objects 
are not removed after the test.

The test was performed using "rados -p rbd bench 3600 write". The pool is not used 
by anything else.

Just before end of test:
POOLS:
NAME  ID USED   %USED MAX AVAIL OBJECTS
rbd-cache 36 33110M  3.41  114G8366
rbd   37 43472M  4.47  237G   10858

Some time later (a few hundred writes were flushed, rados automatic cleanup 
finished):
POOLS:
NAME  ID USED   %USED MAX AVAIL OBJECTS
rbd-cache 36  22998 0  157G   16342
rbd   37 46050M  4.74  234G   11503

# rados -p rbd-cache ls | wc -l
16242
# rados -p rbd ls | wc -l
11503
#

# rados -p rbd cleanup
error during cleanup: -2
error 2: (2) No such file or directory
#

# rados -p rbd cleanup --run-name "" --prefix prefix ""
 Warning: using slow linear search
 Removed 0 objects
#

# rados -p rbd ls | head -5
benchmark_data_dropbox01.tzk_7641_object10901
benchmark_data_dropbox01.tzk_7641_object9645
benchmark_data_dropbox01.tzk_7641_object10389
benchmark_data_dropbox01.tzk_7641_object10090
benchmark_data_dropbox01.tzk_7641_object11204
#

#  rados -p rbd-cache ls | head -5
benchmark_data_dropbox01.tzk_7641_object10901
benchmark_data_dropbox01.tzk_7641_object9645
benchmark_data_dropbox01.tzk_7641_object10389
benchmark_data_dropbox01.tzk_7641_object5391
benchmark_data_dropbox01.tzk_7641_object10090
#

So, it looks like the objects are still in place (in both pools?). But it is 
not possible to remove them:

# rados -p rbd rm benchmark_data_dropbox01.tzk_7641_object10901
error removing rbd>benchmark_data_dropbox01.tzk_7641_object10901: (2) No such 
file or directory
#

# ceph health
HEALTH_OK
#


Can somebody explain this behavior? And is it possible to clean up the benchmark 
data without recreating the pools?


ceph version 0.94.5

# ceph osd dump | grep rbd
pool 36 'rbd-cache' replicated size 3 min_size 1 crush_ruleset 1 object_hash 
rjenkins pg_num 100 pgp_num 100 last_change 755 flags 
hashpspool,incomplete_clones tier_of 37 cache_mode writeback target_bytes 
107374182400 hit_set bloom{false_positive_probability: 0.05, target_size: 0, 
seed: 0} 3600s x1 stripe_width 0
pool 37 'rbd' erasure size 5 min_size 3 crush_ruleset 2 object_hash rjenkins 
pg_num 100 pgp_num 100 last_change 745 lfor 745 flags hashpspool tiers 36 
read_tier 36 write_tier 36 stripe_width 4128
#

# ceph osd pool get rbd-cache hit_set_type
hit_set_type: bloom
# ceph osd pool get rbd-cache hit_set_period
hit_set_period: 3600
# ceph osd pool get rbd-cache hit_set_count
hit_set_count: 1
# ceph osd pool get rbd-cache target_max_objects
target_max_objects: 0
# ceph osd pool get rbd-cache target_max_bytes
target_max_bytes: 107374182400
# ceph osd pool get rbd-cache cache_target_dirty_ratio
cache_target_dirty_ratio: 0.1
# ceph osd pool get rbd-cache cache_target_full_ratio
cache_target_full_ratio: 0.2
#

Crush map:
root cache_tier {   
id -7   # do not change unnecessarily
# weight 0.450
alg straw   
hash 0  # rjenkins1
item osd.0 weight 0.090
item osd.1 weight 0.090
item osd.2 weight 0.090
item osd.3 weight 0.090
item osd.4 weight 0.090
}
root store_tier {   
id -8   # do not change unnecessarily
# weight 0.450
alg straw   
hash 0  # rjenkins1
item osd.5 weight 0.090
item osd.6 weight 0.090
item osd.7 weight 0.090
item osd.8 weight 0.090
item osd.9 weight 0.090
}
rule cache {
ruleset 1
type replicated
min_size 0
max_size 5
step take cache_tier
step chooseleaf firstn 0 type osd
step emit
}
rule store {
ruleset 2
type erasure
min_size 0
max_size 5
step take store_tier
step chooseleaf firstn 0 type osd
step emit
}

Thanks

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com