Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Christian Balzer

Hello,

On Thu, 3 Mar 2016 07:41:09 + Adrian Saul wrote:

> Hi Ceph-users,
> 
> TL;DR - I can't seem to pin down why an unloaded system with flash based
> OSD journals has higher than desired write latencies for RBD devices.
> Any ideas?
> 
> 
>   I am developing a storage system based on Ceph and an SCST+pacemaker
> cluster.   Our initial testing showed promising results even with mixed
> available hardware and we proceeded to order a more designed platform
> for developing into production.   The hardware is:
> 
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients
> using RBD - they present iSCSI to other systems). 3x 2RU OSD SSD servers
> (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo SSDs each 3x 4RU
> OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
>
Samsung EVO...
Which exact model? I presume this is not a DC one.

If you had put your journals on those, you would already be pulling your
hair out due to abysmal performance.

Also, with the EVOs I'd be worried about endurance.

>  As part of the research and planning we opted to put a pair of Intel
> PC3700DC 400G NVME cards in each OSD server.  These are configured
> mirrored and setup as the journals for the OSD disks, the aim being to
> improve write latencies.  All the machines have 128G RAM and dual
> E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of
> switches.   All machines are running Centos 7, with the frontends using
> the 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
> 
> On the ceph side each disk in the OSD servers are setup as an individual
> OSD, with a 12G journal created on the flash mirror.   I setup the SSD
> servers into one root, and the SATA servers into another and created
> pools using hosts as fault boundaries, with the pools set for 2
> copies.   
Risky. If you have very reliable and well-monitored SSDs you can get away
with 2 (I do so), but with HDDs, given the combination of their reliability
and recovery time, it's asking for trouble.
I realize that this is a testbed, but if your production has a replication
of 3 you will be disappointed by the additional latency.
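
For reference, if and when you do go to 3 copies it is just a pool-level change (the
pool name here is only an example):

ceph osd pool set rbd-ssd size 3
ceph osd pool set rbd-ssd min_size 2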

> I created the pools with the pg_num and pgp_num set to 32x the
> number of OSDs in the pool.   On the frontends we create RBD devices and
> present them as iSCSI LUNs using SCST to clients - in this test case a
> Solaris host.
> 
> The problem I have is that even with a lightly loaded system the service
> times for the LUNs for writes is just not getting down to where we want
> it, and they are not very stable - with 5 LUNs doing around 200 32K IOPS
> consistently the service times sit at around 3-4ms, but regularly (every
> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5
> minutes.  

This smells like garbage collection on your SSDs, especially since it
matches, time-wise, what you saw on them below.

>I fully expected we would have some latencies due to the
> distributed and networked nature of Ceph, but in this instance I just
> cannot find where these latencies are coming from, especially with the
> SSD based pool and having flash based journaling.
> 
> - The RBD devices show relatively low service times, but high queue
> times.  These are in line with what Solaris sees so I don't think
> SCST/iSCSI is adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine
> with any bursts
> - The SSDs do show similar latency variations with writes - bursting up
> to 12ms or more whenever there is high write workloads.
This.

Have you tried the HDD-based pool, and did you see similar spikes at
consistent intervals?

Or, alternatively, have you tried configuring 2 of your NVMes as OSDs?

As for monitoring, I like atop for instant feedback.
For more in-depth analysis (and for when you're not watching), collectd
with graphite serve me well.
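
If you want to catch those 20-30 second spikes in the act, something as simple as the
following will do (a rough sketch; assumes sysstat is installed, sdb is one of the
OSD data SSDs, and osd.0 lives on that host):

iostat -x 1 /dev/sdb                 # watch w_await and %util for the periodic spikes
ceph daemon osd.0 perf dump | python -m json.tool | grep -A 3 op_w_latency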
 
> - I have tried applying what tuning I can to the SSD block devices (noop
> scheduler etc) - no difference
> - I have removed any sort of smarts around IO grouping in SCST - no
> major impact
> - I have tried tuning up filesystore  queue and wbthrottle values but
> could not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait
> and I can do benchmarks up over 1GB/s in some tests.  Write throughput
> can also be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI
> client block sizes (i.e 32K, 128K instead of 4M) but it seemed to make
> things worse.  I would have thought better alignment would reduce
> latency but is that offset by the extra overhead in object work?
> 
> What I am looking for is what other areas do I need to look or
> diagnostics do I need to work this out?  We would really like to use
> ceph across a mixed workload that includes some DB systems that are
> fairly latency sensitive, but as it stands its hard to be confident in
> the performance when a fairly quiet unloaded system seems to struggle,
> even with all t

[ceph-users] ceph mon failed to restart

2016-03-03 Thread M Ranga Swami Reddy
I have tried to restart one of the ceph mons and got the below error:
==

2016-03-03 08:16:00.120355 7f43b067d7c0 -1 obtain_monmap unable to find a monmap

2016-03-03 08:16:00.120374 7f43b067d7c0 -1 unable to obtain a monmap:
(2) No such file or directory

2016-03-03 08:16:00.124437 7f43b067d7c0 -1 mon.node-11@-1(probing) e0
not in monmap and have been in a quorum before; must have been removed

2016-03-03 08:16:00.124715 7f43b067d7c0 -1 mon.node-11@-1(probing) e0
commit suicide!

2016-03-03 08:16:00.124727 7f43b067d7c0 -1 failed to initialize

failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i node-11 --pid-file
/var/run/ceph/mon.node-11.pid -c /etc/ceph/ceph.conf --cluster ceph '

Starting ceph-create-keys on node-11...
===

Any hint appreciated...

Thanks
Swami


Re: [ceph-users] Restore properties to default?

2016-03-03 Thread Max A. Krasilnikov
Hello!

On Thu, Mar 03, 2016 at 09:53:22AM +1000, lindsay.mathieson wrote:

> Ok, reduced my recovery I/O with

> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'


> Now I can put it back to the default values explicitly (10, 15), but is 
> there a way to tell ceph to just restore the default args?

As an option:
ceph --show-config -c /dev/null |grep osd_max_backfills
...
ceph tell osd.* injectargs '--osd_max_backfills='
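
For example, to pull the built-in default and inject it back in one go (a sketch,
assuming the "name = value" output format shown above):

default=$(ceph --show-config -c /dev/null | awk '/^osd_max_backfills/ {print $3}')
ceph tell osd.* injectargs "--osd_max_backfills=${default}"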

-- 
WBR, Max A. Krasilnikov


Re: [ceph-users] Fwd: Help: pool not responding

2016-03-03 Thread Mario Giammarco
I have tried "force create". It says "creating" but at the end problem
persists.
I have restarted ceph as usual.
I am evaluating ceph and I am shocked because it semeed a very robust
filesystem and now for a glitch I have an entire pool blocked and there is
no simple procedure to force a recovery.

2016-03-02 18:31 GMT+01:00 Oliver Dzombic :

> Hi,
>
> I could also not find any delete, only a create.
>
> I found this here, its basically your situation:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032412.html
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> On 02.03.2016 at 18:28, Mario Giammarco wrote:
> > Thanks for the info, even if it is bad info.
> > Anyway I am reading docs again and I do not see a way to delete PGs.
> > How can I remove them?
> > Thanks,
> > Mario
> >
> > 2016-03-02 17:59 GMT+01:00 Oliver Dzombic:
> >
> > Hi,
> >
> > as I see your situation, somehow these 4 PGs got lost.
> >
> > They will not recover, because they are incomplete. So there is no data
> > from which they could be recovered.
> >
> > So all that is left is to delete these PGs.
> >
> > Since all 3 OSDs are in and up, it does not seem like you can somehow
> > access these lost PGs.
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de 
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt )
> > Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1 
> > UST ID: DE274086107
> >
> >
> > On 02.03.2016 at 17:45, Mario Giammarco wrote:
> > >
> > >
> > > Here it is:
> > >
> > >  cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
> > >  health HEALTH_WARN
> > > 4 pgs incomplete
> > > 4 pgs stuck inactive
> > > 4 pgs stuck unclean
> > > 1 requests are blocked > 32 sec
> > >  monmap e8: 3 mons at
> > > {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
> > > election epoch 840, quorum 0,1,2 0,1,2
> > >  osdmap e2405: 3 osds: 3 up, 3 in
> > >   pgmap v5904430: 288 pgs, 4 pools, 391 GB data, 100 kobjects
> > > 1090 GB used, 4481 GB / 5571 GB avail
> > >  284 active+clean
> > >4 incomplete
> > >   client io 4008 B/s rd, 446 kB/s wr, 23 op/s
> > >
> > >
> > > 2016-03-02 9:31 GMT+01:00 Shinobu Kinjo:
> > >
> > > Is "ceph -s" still showing you same output?
> > >
> > > > cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
> > > >  health HEALTH_WARN
> > > > 4 pgs incomplete
> > > > 4 pgs stuck inactive
> > > > 4 pgs stuck unclean
> > > >  monmap e8: 3 mons at
> > > > {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
> > > > election epoch 832, quorum 0,1,2 0,1,2
> > > >  osdmap e2400: 3 osds: 3 up, 3 in
> > > >   pgmap v5883297: 288 pgs, 4 pools, 391 GB data, 100
> > kobjects
> > > > 1090 GB used, 4481 GB / 5571 GB avail
> > > >  284 active+clean
> > > >4 incomplete
> > >
> > > Cheers,
> > > S
> > >
> > > - Original Message -
> > > From: "Mario Giammarco"  > 
> > > >>
> > > To: "Lionel Bouton"  > 
> > >  > >>
> > > Cc: "Shinobu Kinjo"  >   > >>,
> > > ceph-users@lists.ceph.com 
> >  >>
> > > Sent: Wednesday, March 2, 2016 4:27:15 PM
> > > 

Re: [ceph-users] XFS and nobarriers on Intel SSD

2016-03-03 Thread Maxime Guyot
Hello,

It looks like this thread is one of the main Google hits on this issue, so let 
me bring an update. I experienced the same symptoms with Intel S3610 and 
LSI 2208.

The logs reported “task abort!” messages on a daily basis since November:
Write(10): 2a 00 0e 92 88 90 00 00 10 00
scsi target6:0:1: handle(0x000a), sas_address(0x443322110100), phy(1)
scsi target6:0:1: enclosure_logical_id(0x500304801c84e000), slot(2)
sd 6:0:1:0: task abort: SUCCESS scmd(8805b30fa200)
sd 6:0:1:0: attempting task abort! scmd(8807ef9e9800)
sd 6:0:1:0: [sdf] CDB:

OSD would go down from time to time with:
XFS (sdf3): xfs_log_force: error 5 returned.
lost page write due to I/O error on sdf3


I was able to reproduce the “task abort!” messages with "rados -p data bench 30 write 
-b 1048576". The OSDs going down and the XFS errors, on the other hand, were harder to 
reproduce systematically.
To solve the problem I followed Christian's recommendation to update the S3610 
SSDs' firmware from G2010110 to G2010140 using the isdct utility. It was easy to 
convert the RPM package released by Intel into a .deb package using "alien". Then it 
was just a matter of "isdct show -intelssd" and "isdct load -intelssd 0".
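
For anyone else hitting this, the sequence was roughly the following (package file
names are only an example, use whatever Intel currently ships):

alien --to-deb isdct-<version>.x86_64.rpm   # convert Intel's RPM into a .deb
dpkg -i isdct_<version>_amd64.deb
isdct show -intelssd                        # list drives, note index and firmware version
isdct load -intelssd 0                      # load the new firmware onto drive index 0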

The cluster has been running on the latest firmware for a week now and I can't 
reproduce the problem, so it looks like the issue is solved.

Thank you Christian for the info!

Regards

Maxime Guyot
System Engineer



> Hello,
>
> On Tue, 8 Sep 2015 13:40:36 +1200 Richard Bade wrote:
>
> > Hi Christian,
> > Thanks for the info. I'm just wondering, have you updated your S3610's
> > with the new firmware that was released on 21/08 as referred to in the
> > thread?
> I did so earlier today, see below.
>
> >We thought we weren't seeing the issue on the intel controller
> > also to start with, but after further investigation it turned out we
> > were, but it was reported as a different log item such as this:
> > ata5.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x6 frozen
> > ata5.00: failed command: READ FPDMA QUEUED
> > ata5.00: cmd 60/10:a0:18:ca:ca/00:00:32:00:00/40 tag 20 ncq 8192 in
> >   res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata5.00: status: { DRDY }
> > ata5.00: failed command: READ FPDMA QUEUED
> > ata5.00: cmd 60/40:a8:48:ca:ca/00:00:32:00:00/40 tag 21 ncq 32768 in
> >  res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata5.00: status: { DRDY }
> > ata5: hard resetting link
> > ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> > ata5.00: configured for UDMA/133
> > ata5.00: device reported invalid CHS sector 0
> > ata5.00: device reported invalid CHS sector 0
> > ata5: EH complete
> > ata5.00: Enabling discard_zeroes_data
> >
> Didn't see any of these, but admittedly I tested this with fewer SSDs on
> the onboard controller and with fio/bonnie++, which do not trigger that
> behavior as easily.
>
> > I believe this to be the same thing as the LSI3008 which gives these log
> > messages:
> > sd 0:0:6:0: attempting task abort! scmd(8804cac00600)
> > sd 0:0:6:0: [sdg] CDB:
> > Read(10): 28 00 1c e7 76 a0 00 01 30 00
> > scsi target0:0:6: handle(0x000f), sas_address(0x443322110600), phy(6)
> > scsi target0:0:6: enclosure_logical_id(0x50030480), slot(6)
> > sd 0:0:6:0: task abort: SUCCESS scmd(8804cac00600)
> > sd 0:0:6:0: attempting task abort! scmd(8804cac03780)
> >
> Yup, I know that message all too well.
>
> > I appreciate your info with regards to nobarries. I assume by "alleviate
> > it, but didn't fix" you mean the number of occurrences is reduced?
> >
> Indeed. But first a word about the setup where I'm seeing this.
> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
> with the Samsung DC SSDs, one with the Intel S3610.
> 2 of these chassis to be precise:
> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm
>
> Of course latest firmware and I tried this with any kernel from Debian
> 3.16 to stock 4.1.6.
>
> With nobarrier I managed to trigger the error only once yesterday on the
> DRBD replication target, not the machine that actually has the FS mounted.
> Usually I'd be able to trigger it quite a bit more often during those tests.
>
> So this morning I updated the firmware of all S3610s on one node and
> removed the nobarrier flag. It took a lot of punishment, but eventually
> this happened:
> ---
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task 
> abort! scmd(880fdc85b680)
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB: Write(10) 
> 2a 00 0e 9a fb b8 00 00 08 00
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1: 
> handle(0x000a), sas_address(0x443322110100), phy(1)
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1: 
> enclosure

[ceph-users] Details of project

2016-03-03 Thread Nishant karn
I wanted to know more about two of the projects that are on your ideas page;
they are listed below:

1. RADOS PROXY
2. RBD DIFF CHECKSUMS

Please guide me through this.
What should I do to get selected for these projects?
I have a working knowledge of C/C++ and really want to join the open source
community through GSoC. Please help me and tell me more about these two
projects.

Thanks Nishant.


Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-03-03 Thread Ritter Sławomir
Hi,

I think this is a really serious problem - again:  

- we silently lost S3/RGW objects in clusters 

Moreover, our situation looks very similar to the one described in the unresolved bug 
#13764 (Hammer) and in the fixed #8269 (Dumpling).

Regards,

SR



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Dominik Mostowiec
Sent: Friday, February 26, 2016 3:33 PM
To: ceph-us...@ceph.com; ceph-devel
Subject: Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by 
slow requests

Hi,
Maybe this is the cause of another bug?
http://tracker.ceph.com/issues/13764
The situation is very similar...

--
Regards
Dominik

2016-02-25 16:17 GMT+01:00 Ritter Sławomir :
> Hi,
>
>
>
> We have two CEPH clusters running on Dumpling 0.67.11 and some of our
> "multipart objects" are incomplete. It seems that some slow requests could
> cause corruption of the related S3 objects. Moreover, GETs for those objects
> work without any error messages. There are only HTTP 200s in the logs, as well
> as no information about problems from popular client tools/libs.
>
>
>
> The situation looks very similar to the one described in bug #8269, but we are
> using the fixed 0.67.11 version:  http://tracker.ceph.com/issues/8269
>
>
>
> Regards,
>
>
>
> Sławomir Ritter
>
>
>
>
>
>
>
> EXAMPLE#1
>
>
>
> slow_request
>
> 
>
> 2016-02-23 13:49:58.818640 osd.260 10.176.67.27:6800/688083 2119 : [WRN] 4
> slow requests, 4 included below; oldest blocked for > 30.727096 secs
>
> 2016-02-23 13:49:58.818673 osd.260 10.176.67.27:6800/688083 2120 : [WRN]
> slow request 30.727096 seconds old, received at 2016-02-23 13:49:28.091460:
> osd_op(c
>
> lient.47792965.0:185007087
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
> [writef
>
> ull 0~524288] 10.ce729ebe e107594) v4 currently waiting for subops from
> [469,9]
>
>
>
>
>
> HTTP_500 in apache.log
>
> ==
>
> 127.0.0.1 - - [23/Feb/2016:13:49:27 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=56
> HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
> 127.0.0.1 - - [23/Feb/2016:13:49:28 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=57
> HTTP/1.0" 500 751 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
> 127.0.0.1 - - [23/Feb/2016:13:49:58 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=57
> HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
> 127.0.0.1 - - [23/Feb/2016:13:49:59 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=58
> HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
>
>
>
>
> Empty RADOS object (real size = 0 bytes), list generated based on the MANIFEST
>
> ==
>
> found
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.56_2
> 2097152   ok  2097152   10.7acc9476 (10.1476) [278,142,436]
> [278,142,436]
>
> found
> default.14654.445__multipart_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57
> 0 diff4194304   10.4f5be025 (10.25)   [57,310,428]
> [57,310,428]
>
> found
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_1
> 4194304   ok  4194304   10.81191602 (10.1602) [441,109,420]
> [441,109,420]
>
> found
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
> 2097152   ok  2097152   10.ce729ebe (10.1ebe) [260,469,9]
> [260,469,9]
>
>
>
>
>
> "Silent" GETs
>
> =
>
> # object size from headers
>
> $ s3 -u head
> video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
> Content-Type: binary/octet-stream
>
> Content-Length: 641775701
>
> Server: nginx
>
>
>
> # but GETs only 637581397 (641775701 - missing 4194304 = 637581397)
>
> $ s3 -u get
> video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv >
> /tmp/test
>
> $  ls -al /tmp/test
>
> -rw-r--r-- 1 root root 637581397 Feb 23 17:05 /tmp/test
>
>
>
> # no error in logs
>
> 127.0.0.1 - - [23/Feb/2016:17:05:00 +0100] "GET
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
> HTTP/1.0" 200 637581711 "-" "Mozilla/4.0 (Compatible; s3; libs3 2.0; Linux
> x86_64)"
>
>
>
> # wget - retry for missing part, but there is no missing part, so it GETs
> head/tail of the file again..

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread RDS
A couple of suggestions:
1)   # of PGs per OSD should be 100-200
2)   When dealing with SSD or Flash, performance of these devices hinges on how 
you partition them and how you tune Linux:
a)   if using partitions, did you align the partitions on a 4k 
boundary? I start at sector 2048 using either fdisk or sfdisk
b)   There are quite a few Linux settings that benefit SSD/Flash, among them: 
the deadline I/O scheduler (only when using the deadline-associated settings), 
upping the queue depth to 512 or 1024, setting rq_affinity=2 if the OS allows it, 
and setting read-ahead if doing a majority of reads
3)   mount options: noatime, delaylog, inode64, noquota, etc.

I have written some papers/blogs on this subject if you are interested in 
seeing them.
Rick
> On Mar 3, 2016, at 2:41 AM, Adrian Saul  wrote:
> 
> Hi Ceph-users,
> 
> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
> journals has higher than desired write latencies for RBD devices.  Any ideas?
> 
> 
>  I am developing a storage system based on Ceph and an SCST+pacemaker 
> cluster.   Our initial testing showed promising results even with mixed 
> available hardware and we proceeded to order a more designed platform for 
> developing into production.   The hardware is:
> 
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using 
> RBD - they present iSCSI to other systems).
> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
> SSDs each
> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
> 
> As part of the research and planning we opted to put a pair of Intel PC3700DC 
> 400G NVME cards in each OSD server.  These are configured mirrored and setup 
> as the journals for the OSD disks, the aim being to improve write latencies.  
> All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 
> 10G NICs back to a common pair of switches.   All machines are running Centos 
> 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD 
> kernel module.
> 
> On the ceph side each disk in the OSD servers are setup as an individual OSD, 
> with a 12G journal created on the flash mirror.   I setup the SSD servers 
> into one root, and the SATA servers into another and created pools using 
> hosts as fault boundaries, with the pools set for 2 copies.   I created the 
> pools with the pg_num and pgp_num set to 32x the number of OSDs in the pool.  
>  On the frontends we create RBD devices and present them as iSCSI LUNs using 
> SCST to clients - in this test case a Solaris host.
> 
> The problem I have is that even with a lightly loaded system the service 
> times for the LUNs for writes is just not getting down to where we want it, 
> and they are not very stable - with 5 LUNs doing around 200 32K IOPS 
> consistently the service times sit at around 3-4ms, but regularly (every 
> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5 
> minutes.  I fully expected we would have some latencies due to the 
> distributed and networked nature of Ceph, but in this instance I just cannot 
> find where these latencies are coming from, especially with the SSD based 
> pool and having flash based journaling.
> 
> - The RBD devices show relatively low service times, but high queue times.  
> These are in line with what Solaris sees so I don't think SCST/iSCSI is 
> adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine with 
> any bursts
> - The SSDs do show similar latency variations with writes - bursting up to 
> 12ms or more whenever there is high write workloads.
> - I have tried applying what tuning I can to the SSD block devices (noop 
> scheduler etc) - no difference
> - I have removed any sort of smarts around IO grouping in SCST - no major 
> impact
> - I have tried tuning up filesystore  queue and wbthrottle values but could 
> not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait and 
> I can do benchmarks up over 1GB/s in some tests.  Write throughput can also 
> be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI client 
> block sizes (i.e 32K, 128K instead of 4M) but it seemed to make things worse. 
>  I would have thought better alignment would reduce latency but is that 
> offset by the extra overhead in object work?
> 
> What I am looking for is what other areas do I need to look or diagnostics do 
> I need to work this out?  We would really like to use ceph across a mixed 
> workload that includes some DB systems that are fairly latency sensitive, but 
> as it stands its hard to be confident in the performance when a fairly quiet 
> unloaded system seems to struggle, even with all this hardware behind it.   I 
> get the impression that the SSD write latencies might be coming into play as 
> they are similar to the numbers I see, but really for writes I would expect 
> them t

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Jan Schermer
I think the latency comes from journal flushing

Try tuning

filestore min sync interval = .1
filestore max sync interval = 5

and also
/proc/sys/vm/dirty_bytes (I suggest 512MB)
/proc/sys/vm/dirty_background_bytes (I suggest 256MB)

See if that helps
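
For reference, those values can be tried without restarting anything (these are my
suggested values from above, not the defaults):

ceph tell osd.* injectargs '--filestore_min_sync_interval 0.1 --filestore_max_sync_interval 5'
echo $((512*1024*1024)) > /proc/sys/vm/dirty_bytes
echo $((256*1024*1024)) > /proc/sys/vm/dirty_background_bytes   # persist via sysctl.conf if it helps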

It would be useful to see the job you are running to know what exactly it does. 
I'm afraid your latency is not really that bad; it will scale horizontally 
(with the number of clients) rather than vertically (higher IOPS for single 
blocking writes), and there's not much that can be done about that.


> On 03 Mar 2016, at 14:33, RDS  wrote:
> 
> A couple of suggestions:
> 1)   # of pgs per OSD should be 100-200
> 2)  When dealing with SSD or Flash, performance of these devices hinge on how 
> you partition them and how you tune linux:
>   a)   if using partitions, did you align the partitions on a 4k 
> boundary? I start at sector 2048 using either fdisk or sfdisk

On SSDs you should align at an 8MB boundary (usually the erase block is quite 
large, though it doesn't matter that much), and the write block size is 
actually something like 128k.
Sector 2048 aligns at 1MB, which is completely fine.
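
If you want to double-check where an existing partition starts, this is enough (device
and partition names are just an example):

parted /dev/nvme0n1 unit s print          # start sector * 512 should be a multiple of 1MiB
cat /sys/block/nvme0n1/nvme0n1p1/start    # the same start sector straight from sysfs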

>   b)   There are quite a few Linux settings that benefit SSD/Flash and 
> they are: Deadline io scheduler only when using the deadline associated 
> settings, up  QDepth to 512 or 1024, set rq_affinity=2 if OS allows it, 
> setting read ahead if doing majority of reads, and other

Those don't matter that much; higher queue depths mean larger throughput but at 
the expense of latency. The defaults are usually fine.

> 3)   mount options:  noatime, delaylog,inode64,noquota, etc…

defaults work fine (noatime is a relic, relatime is what filesystems use by 
default nowadays)

> 
> I have written some papers/blogs on this subject if you are interested in 
> seeing them.
> Rick
>> On Mar 3, 2016, at 2:41 AM, Adrian Saul  
>> wrote:
>> 
>> Hi Ceph-users,
>> 
>> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
>> journals has higher than desired write latencies for RBD devices.  Any ideas?
>> 
>> 
>> I am developing a storage system based on Ceph and an SCST+pacemaker 
>> cluster.   Our initial testing showed promising results even with mixed 
>> available hardware and we proceeded to order a more designed platform for 
>> developing into production.   The hardware is:
>> 
>> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using 
>> RBD - they present iSCSI to other systems).
>> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
>> SSDs each
>> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
>> 
>> As part of the research and planning we opted to put a pair of Intel 
>> PC3700DC 400G NVME cards in each OSD server.  These are configured mirrored 
>> and setup as the journals for the OSD disks, the aim being to improve write 
>> latencies.  All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 
>> 4 aggregated 10G NICs back to a common pair of switches.   All machines are 
>> running Centos 7, with the frontends using the 4.4.1 elrepo-ml kernel to get 
>> a later RBD kernel module.
>> 
>> On the ceph side each disk in the OSD servers are setup as an individual 
>> OSD, with a 12G journal created on the flash mirror.   I setup the SSD 
>> servers into one root, and the SATA servers into another and created pools 
>> using hosts as fault boundaries, with the pools set for 2 copies.   I 
>> created the pools with the pg_num and pgp_num set to 32x the number of OSDs 
>> in the pool.   On the frontends we create RBD devices and present them as 
>> iSCSI LUNs using SCST to clients - in this test case a Solaris host.
>> 
>> The problem I have is that even with a lightly loaded system the service 
>> times for the LUNs for writes is just not getting down to where we want it, 
>> and they are not very stable - with 5 LUNs doing around 200 32K IOPS 
>> consistently the service times sit at around 3-4ms, but regularly (every 
>> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5 
>> minutes.  I fully expected we would have some latencies due to the 
>> distributed and networked nature of Ceph, but in this instance I just cannot 
>> find where these latencies are coming from, especially with the SSD based 
>> pool and having flash based journaling.
>> 
>> - The RBD devices show relatively low service times, but high queue times.  
>> These are in line with what Solaris sees so I don't think SCST/iSCSI is 
>> adding much latency.
>> - The journals are reporting 0.02ms service times, and seem to cope fine 
>> with any bursts
>> - The SSDs do show similar latency variations with writes - bursting up to 
>> 12ms or more whenever there is high write workloads.
>> - I have tried applying what tuning I can to the SSD block devices (noop 
>> scheduler etc) - no difference
>> - I have removed any sort of smarts around IO grouping in SCST - no major 
>> impact
>> - I have trie

[ceph-users] ceph upgrade and the impact to rbd clients

2016-03-03 Thread Xu (Simon) Chen
Hi all,

I am running Ceph as the Cinder backend of my OpenStack deployment. I am
curious: if I upgrade Ceph (say from an older version of firefly to a
newer version of firefly, or from firefly to hammer), what do I need
to do with my VMs, which continue to run with librbd of the previous
version?

I have done one major Ceph upgrade before, and ended up having to
reboot all VMs, because they seemingly all lost disk access. I wonder
if I did something wrong there, or if this is just the way it is.

Any insight on this matter?

Thanks!
-Simon


[ceph-users] OSDs go down with infernalis

2016-03-03 Thread Yoann Moulin
Hello,

I'm (almost) a new user of Ceph (a couple of months). At my university, we started to
do some tests with Ceph a couple of months ago.

We have 2 clusters. Each cluster has 100 OSDs on 10 servers.

Each server has this setup:

CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Memory : 128GB of Memory
OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
Journal Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
OSD Disk : 10 x HGST ultrastar-7k6000 6TB
Network : 1 x 10Gb/s
OS : Ubuntu 14.04
Ceph version : infernalis 9.2.0

One cluster gives access to some users through an S3 gateway (the service is still in
beta). We call this cluster "ceph-beta".

The other cluster is for our internal needs, to learn more about Ceph. We call this
cluster "ceph-test". (Those servers will be integrated into the ceph-beta
cluster when we need more space.)

We have deployed both clusters with the ceph-ansible playbook [1].

Journals are raw partitions on the SSDs (400GB Intel S3300 DC) with no RAID, 5
journal partitions on each SSD.

OSD disks are formatted with XFS.

1. https://github.com/ceph/ceph-ansible

We have an issue: some OSDs go down and don't start. It seems to be related to
the fsid of the journal partition:

> -1> 2016-03-03 14:09:05.422515 7f31118d0940 -1 journal FileJournal::open: 
> ondisk fsid ---- doesn't match expected 
> eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal

Attached are the full logs of one of the dead OSDs.

We had this issue with 2 OSDs on the ceph-beta cluster and fixed it by removing,
zapping and re-adding them.

Now we have the same issue on the ceph-test cluster, but on 18 OSDs.
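
Before zapping them this time, it may be worth checking whether each journal symlink
still points at the partition the OSD expects (a rough sketch, assuming the OSDs were
prepared by ceph-disk; device and partition numbers are examples):

ls -l /var/lib/ceph/osd/ceph-2/journal      # which partition does the symlink point at?
cat /var/lib/ceph/osd/ceph-2/journal_uuid   # partition uuid recorded when the OSD was created
sgdisk -i 1 /dev/sdX                        # compare against the partition's unique GUID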

Here are the current stats of this cluster:

> root@icadmin004:~# ceph -s
> cluster 4fb4773c-0873-44ad-a65f-269f01bfcff8
>  health HEALTH_WARN
> 1024 pgs incomplete
> 1024 pgs stuck inactive
> 1024 pgs stuck unclean
>  monmap e1: 3 mons at 
> {iccluster003=10.90.37.4:6789/0,iccluster014=10.90.37.15:6789/0,iccluster022=10.90.37.23:6789/0}
> election epoch 62, quorum 0,1,2 
> iccluster003,iccluster014,iccluster022
>  osdmap e242: 100 osds: 82 up, 82 in
> flags sortbitwise
>   pgmap v469212: 2304 pgs, 10 pools, 2206 bytes data, 181 objects
> 4812 MB used, 447 TB / 447 TB avail
> 1280 active+clean
> 1024 creating+incomplete

We installed this cluster at the beginning of February and have hardly used it at
all since, except at the beginning to troubleshoot an issue with ceph-ansible. We
did not push any data nor create any pools. What could explain this behaviour?

Thanks for your help

Best regards,

-- 
Yoann Moulin
EPFL IC-IT
2016-03-03 14:09:00.433074 7efd1a9d5940  0 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 4446
2016-03-03 14:09:01.315583 7efd1a9d5940  0 filestore(/var/lib/ceph/osd/ceph-2) backend xfs (magic 0x58465342)
2016-03-03 14:09:01.338328 7efd1a9d5940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-03-03 14:09:01.338335 7efd1a9d5940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2016-03-03 14:09:01.338362 7efd1a9d5940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: splice is supported
2016-03-03 14:09:01.341468 7efd1a9d5940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-03-03 14:09:01.341517 7efd1a9d5940  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: extsize is supported and your kernel >= 3.5
2016-03-03 14:09:01.411145 7efd1a9d5940  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-03-03 14:09:01.692400 7efd1a9d5940 -1 journal FileJournal::open: ondisk fsid ---- doesn't match expected eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal
2016-03-03 14:09:01.694251 7efd1a9d5940 -1 os/FileJournal.h: In function 'virtual FileJournal::~FileJournal()' thread 7efd1a9d5940 time 2016-03-03 14:09:01.692413
os/FileJournal.h: 406: FAILED assert(fd == -1)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7efd1a4cbf2b]
 2: (()+0x2c2f80) [0x7efd19ed1f80]
 3: (FileJournal::~FileJournal()+0x67e) [0x7efd1a1b476e]
 4: (JournalingObjectStore::journal_replay(unsigned long)+0xbfa) [0x7efd1a1c353a]
 5: (FileStore::mount()+0x3b42) [0x7efd1a198a62]
 6: (OSD::init()+0x26d) [0x7efd19f51a5d]
 7: (main()+0x2954) [0x7efd19ed7474]
 8: (__libc_start_main()+0xf5) [0x7efd16d59ec5]
 9: (()+0x2f82b7) [0x7efd19f072b7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -24> 2016-03-03 14:09:00.430918 7ef

Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults - Hire a consultant

2016-03-03 Thread Philip S. Hempel

On 03/02/2016 01:40 PM, Philip S. Hempel wrote:

Hello everyone,
I am trying to repair a cluster that has 74 PGs that are down. I have
seen that the PGs in question currently have 0 data on the OSDs.
I have exported data from OSDs that were pulled when the client
thought the disks were bad.

I am using the recovery method describe in "Incomplete PGs - OH MY!"

I also followed what was stated on ceph-users: that you should mark the OSD
out as well and set the weight correctly for the disk instead of 0.

I have done this running ceph 0.94.5 (the original ceph cluster was at
0.80), and this is running on a Proxmox server at version 3.4.

I have imported some of the data into a temp OSD; the PGs will import,
but some of them cause the OSD to segfault, like this:

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

  1: /usr/bin/ceph-osd() [0xbf03dc]

  2: (()+0xf0a0) [0x7fe288b640a0]

  3: (gsignal()+0x35) [0x7fe2874cc125]

  4: (abort()+0x180) [0x7fe2874cf3a0]

  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe287d2489d]

  6: (()+0x63996) [0x7fe287d22996]

  7: (()+0x639c3) [0x7fe287d229c3]

  8: (()+0x63bee) [0x7fe287d22bee]

  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x220) [0xcddda0]
  10: /usr/bin/ceph-osd() [0x7f578f]

  11: (pg_interval_t::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*, std::ostream*)+0x2ba) [0x8c399a]

  12: (OSD::build_past_intervals_parallel()+0xbe1) [0x7d2261]

  13: (OSD::load_pgs()+0x2d8a) [0x7e977a]

  14: (OSD::init()+0xdac) [0x7ebb2c]

  15: (main()+0x253e) [0x78dd6e]

  16: (__libc_start_main()+0xfd) [0x7fe2874b8ead]

  17: /usr/bin/ceph-osd() [0x793de9]

  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Now none of the data imported is considered by Ceph as possible data for
the PG's that are incomplete, so this is one of the main problems I am
trying to rectify.

After the upgrade of Ceph I set the tunables to optimal.
I have tried repairs on all incomplete PGs.
I have tried scrub and deep-scrub on the incomplete PGs.

What I would hope to accomplish is to get complete PGs out of the PG data
I do have.

Thanks

Please let me know what other data I could give to help determine a fix
for this.


I am looking to hire a consultant to support us with this.

Please respond so we can get an agreement made and I can pass more 
details on to you.


Thanks again.



Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults - Hire a consultant

2016-03-03 Thread Richard Arends

On 03/03/2016 06:12 PM, Philip Hempel wrote:

Philip,

I forgot to CC the list, now I did...

To export the data I used ceph-objectstore-tool with the export command.


I am trying to repair a cluster that has 74 pgs that are down, I have 
seen that the pgs in question are presently with 0 data on the OSD.



Can you show the output of a query of one PG? "ceph pg <pgid> query"

I have exported data from OSD's that were pulled when the
client had thought the disk were bad.


How did you export the data ?



Okay, from the pg query output I get the following:

$ egrep 'peer_info|\"peer\"|\"num_bytes\"' ceph-query-3.50f
"num_bytes": 0,
"peer_info": [
"peer": "5",
"num_bytes": 0,
"peer": "9",
"num_bytes": 0,
"peer": "10",
"num_bytes": 0,
"peer": "17",
"num_bytes": 0,
"peer": "19",
"num_bytes": 0,
"peer": "23",
"num_bytes": 897560576,
"peer": "24",
"num_bytes": 0,
"peer": "39",
"num_bytes": 0,
"peer": "43",
"num_bytes": 0,
"peer": "44",
"num_bytes": 0,
"peer": "48",
"num_bytes": 0,


So osd.23 looks like the only one that has data, and the acting OSDs do not 
have any.  On which OSD did you import the data? Since osd.34 and osd.43 
are acting for this PG, maybe you can try to import the data (also) on 
those OSDs. Before importing, set the cluster to noout and stop the OSD 
daemon where you will import the data. I think you also must remove the 
PG info already present on the disk, for example "rm -r 
/var/lib/ceph/osd/ceph-34/current/3.50f_head", but check (and double 
check) that you only remove that directory when it is empty.
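
A rough sketch of that sequence on hammer (PG id, OSD number and paths are taken from
this thread; adjust to your setup and keep copies of everything you remove):

ceph osd set noout
service ceph stop osd.34                     # or however your init system stops the OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
    --journal-path /var/lib/ceph/osd/ceph-34/journal \
    --op remove --pgid 3.50f                 # drop the empty local copy of the PG, if present
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
    --journal-path /var/lib/ceph/osd/ceph-34/journal \
    --op import --file /path/to/3.50f.export
service ceph start osd.34
ceph osd unset noout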





--
Regards,

Richard.



Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Richard Arends

On 03/03/2016 06:40 PM, Philip Hempel wrote:


osd 45. But that import causes a segfault on the osd


Did that OSD already have info (files) for that PG?

---
Regards,

Richard.




Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Philip S. Hempel

On 03/03/2016 12:44 PM, Richard Arends wrote:

On 03/03/2016 06:40 PM, Philip Hempel wrote:


osd 45. But that import causes a segfault on the osd


Did that OSD already have info (files) for that PG?

---
Regards,

Richard.


No, this was a new import.



Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Richard Arends

On 03/03/2016 06:56 PM, Philip Hempel wrote:
I did the import after using the objectstore tool to remove the PG, and that 
OSD (34) segfaults now.


Segfault output is not my cup of tea, but is that exactly the same 
segfault as you posted earlier?



--
Regards,

Richard.



Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Nick Fisk
You can also dump the historic ops from the OSD admin socket. It will give a 
brief overview of each step and how long each one is taking.

But generally what you are seeing is not unusual. Currently best case for a RBD 
on a replicated pool will be somewhere between 200-500 iops. The Ceph code is a 
lot more complex than a 30cm SAS cable.

CPU speed (i.e. GHz, not cores) is a large factor in write latency. You may find 
that you can improve performance by setting the max c-state to 1 and enabling 
idle=poll, which stops the cores entering power-saving states. I found that on 
systems with a large number of cores, unless you drive the whole box really 
hard, a lot of the cores clock themselves down, which hurts latency.

Also disable all logging in your ceph.conf; this can have quite a big effect as 
well.
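
A rough sketch of both of those (the debug lines are one common "everything off" set,
not the shipped defaults):

ceph daemon osd.0 dump_historic_ops | less   # per-op event timestamps for the slowest recent ops

# in ceph.conf under [global], or injected at runtime with 'ceph tell osd.* injectargs ...':
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0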


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 03 March 2016 14:38
> To: RDS 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph RBD latencies
> 
> I think the latency comes from journal flushing
> 
> Try tuning
> 
> filestore min sync interval = .1
> filestore max sync interval = 5
> 
> and also
> /proc/sys/vm/dirty_bytes (I suggest 512MB)
> /proc/sys/vm/dirty_background_bytes (I suggest 256MB)
> 
> See if that helps
> 
> It would be useful to see the job you are running to know what exactly it
> does, I'm afraid your latency is not really that bad, it will scale 
> horizontally
> (with number of clients) rather than vertically (higher IOPS for single 
> blocking
> writes) and there's not much that can be done about that.
> 
> 
> > On 03 Mar 2016, at 14:33, RDS  wrote:
> >
> > A couple of suggestions:
> > 1)   # of pgs per OSD should be 100-200
> > 2)  When dealing with SSD or Flash, performance of these devices hinge on
> how you partition them and how you tune linux:
> > a)   if using partitions, did you align the partitions on a 4k 
> > boundary? I
> start at sector 2048 using either fdisk or sfdisk
> 
> On SSD you should align at 8MB boundary (usually the erase block is quite
> large, though it doesn't matter that much), and the write block size is 
> actually
> something like 128k
> 2048 aligns at 1MB which is completely fine
> 
> > b)   There are quite a few Linux settings that benefit SSD/Flash and
> they are: Deadline io scheduler only when using the deadline associated
> settings, up  QDepth to 512 or 1024, set rq_affinity=2 if OS allows it, 
> setting
> read ahead if doing majority of reads, and other
> 
> those don't matter that much, higher queue depths mean larger throughput
> but at the expense of latency, the default are usually fine
> 
> > 3)   mount options:  noatime, delaylog,inode64,noquota, etc…
> 
> defaults work fine (noatime is a relic, relatime is what filesystems use by
> default nowadays)
> 
> >
> > I have written some papers/blogs on this subject if you are interested in
> seeing them.
> > Rick
> >> On Mar 3, 2016, at 2:41 AM, Adrian Saul
>  wrote:
> >>
> >> Hi Ceph-users,
> >>
> >> TL;DR - I can't seem to pin down why an unloaded system with flash based
> OSD journals has higher than desired write latencies for RBD devices.  Any
> ideas?
> >>
> >>
> >> I am developing a storage system based on Ceph and an SCST+pacemaker
> cluster.   Our initial testing showed promising results even with mixed
> available hardware and we proceeded to order a more designed platform for
> developing into production.   The hardware is:
> >>
> >> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients
> using RBD - they present iSCSI to other systems).
> >> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB
> >> Samsung Evo SSDs each 3x 4RU OSD SATA servers (36 bay) - currently
> >> with 6 8TB Seagate each
> >>
> >> As part of the research and planning we opted to put a pair of Intel
> PC3700DC 400G NVME cards in each OSD server.  These are configured
> mirrored and setup as the journals for the OSD disks, the aim being to
> improve write latencies.  All the machines have 128G RAM and dual E5-
> 2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of
> switches.   All machines are running Centos 7, with the frontends using the
> 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
> >>
> >> On the ceph side each disk in the OSD servers are setup as an individual
> OSD, with a 12G journal created on the flash mirror.   I setup the SSD servers
> into one root, and the SATA servers into another and created pools using
> hosts as fault boundaries, with the pools set for 2 copies.   I created the 
> pools
> with the pg_num and pgp_num set to 32x the number of OSDs in the pool.
> On the frontends we create RBD devices and present them as iSCSI LUNs
> using SCST to clients - in this test case a Solaris host.
> >>
> >> The problem I have is that even with a lightly loaded system the service
> times for the LUNs for writes is just not getting down to where we want it,

Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Richard Arends

On 03/03/2016 07:21 PM, Philip S. Hempel wrote:

Philip,

Sorry, I can't help you with the segfault. What I would do is set debug 
options in ceph.conf and start the OSD; maybe that extra debug info will 
give you something you can work with.





On 03/03/2016 01:15 PM, Richard Arends wrote:

On 03/03/2016 06:56 PM, Philip Hempel wrote:
I did the import after using the objectool to remove the pg and that 
osd (34) segfaults now.


Segfault output is not my cup of tea, but is that exact the same 
segfault as you posted earlier?




This is the full segfault

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xbf03dc]
 2: (()+0xf0a0) [0x7f5d35e5f0a0]
 3: (gsignal()+0x35) [0x7f5d347c9165]
 4: (abort()+0x180) [0x7f5d347cc3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5d3501f89d]
 6: (()+0x63996) [0x7f5d3501d996]
 7: (()+0x639c3) [0x7f5d3501d9c3]
 8: (()+0x63bee) [0x7f5d3501dbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x220) [0xcddda0]

 10: /usr/bin/ceph-osd() [0x7f578f]
 11: (pg_interval_t::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*, std::ostream*)+0x2ba) [0x8c399a]

 12: (OSD::build_past_intervals_parallel()+0xbe1) [0x7d2261]
 13: (OSD::load_pgs()+0x2d8a) [0x7e977a]
 14: (OSD::init()+0xdac) [0x7ebb2c]
 15: (main()+0x253e) [0x78dd6e]
 16: (__libc_start_main()+0xfd) [0x7f5d347b5ead]
 17: /usr/bin/ceph-osd() [0x793de9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.34.log
--- end dump of recent events ---






--
Regards,

Richard.



Re: [ceph-users] Upgrade from Hammer LTS to Infernalis or wait for Jewel LTS?

2016-03-03 Thread Oliver Dzombic
Hi,

I was unable to find any timetable of EOLs for the different versions.

Can you please tell me where your information comes from (EOL/release
dates, LTS status)?

The Wiki » Planning section @tracker.ceph.com did not really help.
The Roadmap did not help either.


Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 02.03.2016 at 10:32, Mihai Gheorghe wrote:
> Hi,
> 
> I've got two questions!
> 
> First. We are currently running Hammer in production and are thinking
> of upgrading to Infernalis. Should we upgrade now or wait for the next
> LTS, Jewel? On the Ceph releases page I can see Hammer's EOL is estimated for
> November 2016 while Infernalis' is June 2016.
> If I follow the upgrade procedure there should not be any problems, right?
> 
> Second. When Jewel LTS is released, does anybody know if we can
> upgrade straight from Hammer, or do we first need to upgrade to Infernalis
> and then Jewel? If the latter is the case, I see no reason not to upgrade
> to Infernalis now and wait for the Jewel release to upgrade again. This way
> we can take advantage of the new features in Infernalis.
> 
> Also what is the correct order of upgrading? Mons first then OSDs?
> 
> Any input on the matter would be greatly appreciated.
> 
> Thank you.
> 
> 
> 
> 


Re: [ceph-users] CEPH FS - all_squash option equivalent

2016-03-03 Thread Gregory Farnum
On Wed, Mar 2, 2016 at 11:22 PM, Fred Rolland  wrote:
> Thanks for your reply.
>
> Server :
> [root@ceph-1 ~]# rpm -qa | grep ceph
> ceph-mon-0.94.1-13.el7cp.x86_64

That would be a Hammer release. Nothing there for doing anything with
permission checks at all.
-Greg

> ceph-radosgw-0.94.1-13.el7cp.x86_64
> ceph-0.94.1-13.el7cp.x86_64
> ceph-osd-0.94.1-13.el7cp.x86_64
> ceph-deploy-1.5.25-1.el7cp.noarch
> ceph-common-0.94.1-13.el7cp.x86_64
> [root@ceph-1 ~]# uname -a
> Linux ceph-1.qa.lab.tlv.redhat.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> Client:
> [root@RHEL7 ~]# rpm -qa | grep ceph
> ceph-fuse-0.94.6-0.el7.x86_64
> python-cephfs-0.94.6-0.el7.x86_64
> libcephfs1-0.94.6-0.el7.x86_64
> ceph-common-0.94.6-0.el7.x86_64
> ceph-0.94.6-0.el7.x86_64
>
> [root@RHEL7 ~]# uname -a
> Linux RHEL7.1Server 3.10.0-229.26.1.el7.x86_64 #1 SMP Fri Dec 11 16:53:27
> EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>
>
> [root@RHEL7 ~]# su - sanlock -s /bin/bash
> Last login: Wed Mar  2 14:06:34 IST 2016 on pts/0
> -bash-4.2$ whoami
> sanlock
> -bash-4.2$ touch /rhev/data-center/mnt/ceph-1.qa.lab\:6789\:_/test
> touch: cannot touch ‘/rhev/data-center/mnt/ceph-1.qa.lab:6789:_/test’:
> Permission denied
>
>
> [root@RHEL7 ~]# su - vdsm -s /bin/bash
> Last login: Wed Mar  2 12:19:11 IST 2016 on pts/1
> -bash-4.2$ touch /rhev/data-center/mnt/ceph-1.qa.lab\:6789\:_/test
> -bash-4.2$ rm /rhev/data-center/mnt/ceph-1.qa.lab\:6789\:_/test
> -bash-4.2$
>
> Permissions of directory :
> ll
> total 0
> drwxr-xr-x 1 vdsm kvm 0 Mar  2 14:08 
>
>
>
> On Wed, Mar 2, 2016 at 6:25 PM, Gregory Farnum  wrote:
>>
>> On Wed, Mar 2, 2016 at 4:21 AM, Fred Rolland  wrote:
>> > Hi,
>> >
>> > I am trying to use CEPH FS in oVirt (RHEV).
>> > The mount is created OK, however, the hypervisor need access to the
>> > mount
>> > from different users (eg: vdsm, sanlock)
>> > It seems that Sanlock user is having permissions issues.
>> >
>> > When using NFS, configuring the export as all_squash and defining
>> > anonuid/anongid will solve this problem [1].
>> >
>> > Is there a possibility to configure in Ceph FS an equivalent to NFS
>> > all_squash/anonuid/anongid ?
>>
>> What version of Ceph are you running? Newer versions have added a
>> security model and include *some* UID squashing features, but prior to
>> Infernalis, CephFS didn't do any security checking at all (it was all
>> client-side in the standard VFS).
>> -Greg
>
>


Re: [ceph-users] CEPH FS - all_squash option equivalent

2016-03-03 Thread Fred Rolland
Can you share a link describing the UID squashing feature?
On Mar 3, 2016 9:02 PM, "Gregory Farnum"  wrote:

> On Wed, Mar 2, 2016 at 11:22 PM, Fred Rolland  wrote:
> > Thanks for your reply.
> >
> > Server :
> > [root@ceph-1 ~]# rpm -qa | grep ceph
> > ceph-mon-0.94.1-13.el7cp.x86_64
>
> That would be a Hammer release. Nothing there for doing anything with
> permission checks at all.
> -Greg
>
> > ceph-radosgw-0.94.1-13.el7cp.x86_64
> > ceph-0.94.1-13.el7cp.x86_64
> > ceph-osd-0.94.1-13.el7cp.x86_64
> > ceph-deploy-1.5.25-1.el7cp.noarch
> > ceph-common-0.94.1-13.el7cp.x86_64
> > [root@ceph-1 ~]# uname -a
> > Linux ceph-1.qa.lab.tlv.redhat.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct
> 29
> > 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Client:
> > [root@RHEL7 ~]# rpm -qa | grep ceph
> > ceph-fuse-0.94.6-0.el7.x86_64
> > python-cephfs-0.94.6-0.el7.x86_64
> > libcephfs1-0.94.6-0.el7.x86_64
> > ceph-common-0.94.6-0.el7.x86_64
> > ceph-0.94.6-0.el7.x86_64
> >
> > [root@RHEL7 ~]# uname -a
> > Linux RHEL7.1Server 3.10.0-229.26.1.el7.x86_64 #1 SMP Fri Dec 11 16:53:27
> > EST 2015 x86_64 x86_64 x86_64 GNU/Linux
> >
> >
> > [root@RHEL7 ~]# su - sanlock -s /bin/bash
> > Last login: Wed Mar  2 14:06:34 IST 2016 on pts/0
> > -bash-4.2$ whoami
> > sanlock
> > -bash-4.2$ touch /rhev/data-center/mnt/ceph-1.qa.lab\:6789\:_/test
> > touch: cannot touch
> ‘/rhev/data-center/mnt/ceph-1.qa.lab:6789:_/test’:
> > Permission denied
> >
> >
> > [root@RHEL7 ~]# su - vdsm -s /bin/bash
> > Last login: Wed Mar  2 12:19:11 IST 2016 on pts/1
> > -bash-4.2$ touch /rhev/data-center/mnt/ceph-1.qa.lab\:6789\:_/test
> > -bash-4.2$ rm /rhev/data-center/mnt/ceph-1.qa.lab\:6789\:_/test
> > -bash-4.2$
> >
> > Permissions of directory :
> > ll
> > total 0
> > drwxr-xr-x 1 vdsm kvm 0 Mar  2 14:08 
> >
> >
> >
> > On Wed, Mar 2, 2016 at 6:25 PM, Gregory Farnum 
> wrote:
> >>
> >> On Wed, Mar 2, 2016 at 4:21 AM, Fred Rolland 
> wrote:
> >> > Hi,
> >> >
> >> > I am trying to use CEPH FS in oVirt (RHEV).
> >> > The mount is created OK, however, the hypervisor need access to the
> >> > mount
> >> > from different users (eg: vdsm, sanlock)
> >> > It seems that Sanlock user is having permissions issues.
> >> >
> >> > When using NFS, configuring the export as all_squash and defining
> >> > anonuid/anongid will solve this problem [1].
> >> >
> >> > Is there a possibility to configure in Ceph FS an equivalent to NFS
> >> > all_squash/anonuid/anongid ?
> >>
> >> What version of Ceph are you running? Newer versions have added a
> >> security model and include *some* UID squashing features, but prior to
> >> Infernalis, CephFS didn't do any security checking at all (it was all
> >> client-side in the standard VFS).
> >> -Greg
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH FS - all_squash option equivalent

2016-03-03 Thread Lincoln Bryant
Also very interested in this if there are any docs available!

--Lincoln

> On Mar 3, 2016, at 1:04 PM, Fred Rolland  wrote:
> 
> Can you share a link describing the UID squashing feature?
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH FS - all_squash option equivalent

2016-03-03 Thread Gregory Farnum
On Thu, Mar 3, 2016 at 11:05 AM, Lincoln Bryant  wrote:
> Also very interested in this if there are any docs available!
>
> --Lincoln
>
>> On Mar 3, 2016, at 1:04 PM, Fred Rolland  wrote:
>>
>> Can you share a link describing the UID squashing feature?

You know what, we'd discussed adding this but I think in the end we
didn't. Sorry to get your hopes up, guys!
-Greg
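In the absence of any server-side squashing in these releases, the usual client-side workaround is to make the mounted directory group-writable for every local user that needs it. A rough sketch, assuming it is acceptable for oVirt to have the sanlock user added to the kvm group that already owns the directory:

usermod -a -G kvm sanlock
chmod 0775 '/rhev/data-center/mnt/ceph-1.qa.lab:6789:_'
# after a fresh login, the earlier test should succeed:
su - sanlock -s /bin/bash -c "touch '/rhev/data-center/mnt/ceph-1.qa.lab:6789:_/test'"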

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Richard Arends

On 03/03/2016 08:32 PM, Philip S. Hempel wrote:



On 03/03/2016 01:49 PM, Richard Arends wrote:

On 03/03/2016 07:21 PM, Philip S. Hempel wrote:

Philip,

Sorry, can't help you with the segfault. What i would do, is set 
debug options in ceph.conf and start the OSD, maybe that extra debug 
info will give something you can work with.
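A minimal sketch of the kind of debug settings meant here (the section name and values are only illustrative, and osd.34 is taken from the log below):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd.34]
debug osd = 20
debug filestore = 20
debug ms = 1
EOF
service ceph start osd.34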

/osd/OSDMap.h: 502: FAILED assert(exists(osd))

I wonder if this means what I think: that it sees this pg and states 
it exists in the map already, but does not handle this data correctly 
and dumps?


Do you have more info before and after this message?


--
Regards,

Richard.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Richard Arends

On 03/03/2016 09:04 PM, Philip S. Hempel wrote:



On 03/03/2016 03:00 PM, Richard Arends wrote:


Do you have more info before and after this message?


There are about 40 lines above like this; these are the last few lines:

-8> 2016-03-03 14:47:54.244421 7f5b57c01840  5 osd.34 pg_epoch: 
89826 pg[4.7ef(unlocked)] enter Initial
-7> 2016-03-03 14:47:54.267101 7f5b57c01840  5 osd.34 pg_epoch: 
89826 pg[4.7ef( v 87924'82792 (1972'78912,87924'82792] local-les=89825 
n=880 ec=140 les/c 89825/8
9825 89824/89824/71089) [15,34] r=1 lpr=0 pi=69217-89823/29 
crt=87924'82780 lcod 0'0 inactive NOTIFY] exit Initial 0.022679 0 
0.00
-6> 2016-03-03 14:47:54.267121 7f5b57c01840  5 osd.34 pg_epoch: 
89826 pg[4.7ef( v 87924'82792 (1972'78912,87924'82792] local-les=89825 
n=880 ec=140 les/c 89825/8
9825 89824/89824/71089) [15,34] r=1 lpr=0 pi=69217-89823/29 
crt=87924'82780 lcod 0'0 inactive NOTIFY] enter Reset
-5> 2016-03-03 14:47:54.267330 7f5b57c01840  5 osd.34 pg_epoch: 
89826 pg[4.7fd(unlocked)] enter Initial
-4> 2016-03-03 14:47:54.289062 7f5b57c01840  5 osd.34 pg_epoch: 
89826 pg[4.7fd( v 87924'6735 (507'2785,87924'6735] local-les=89825 
n=950 ec=140 les/c 89825/89826
 89824/89824/89824) [34,15] r=0 lpr=0 crt=87924'6729 lcod 0'0 mlcod 
0'0 inactive] exit Initial 0.021732 0 0.00
-3> 2016-03-03 14:47:54.289081 7f5b57c01840  5 osd.34 pg_epoch: 
89826 pg[4.7fd( v 87924'6735 (507'2785,87924'6735] local-les=89825 
n=950 ec=140 les/c 89825/89826 89824/89824/89824) [34,15] r=0 lpr=0 
crt=87924'6729 lcod 0'0 mlcod 0'0 inactive] enter Reset
-2> 2016-03-03 14:47:54.289092 7f5b57c01840  0 osd.34 89832 
load_pgs opened 1499 pgs
-1> 2016-03-03 14:47:54.289458 7f5b57c01840  1 osd.34 89832 
build_past_intervals_parallel over 35445-89832
 0> 2016-03-03 14:47:57.368806 7f5b57c01840 -1 ./osd/OSDMap.h: In 
function 'const epoch_t& OSDMap::get_up_from(int) const' thread 
7f5b57c01840 time 2016-03-03 14:47:57.366226

./osd/OSDMap.h: 502: FAILED assert(exists(osd))


Don't know exactly what goes wrong, but to me it looks like this has 
nothing to do with the PG, but it crashes on information in/from the OSD 
map. This is beyond my knowledge of Ceph, hopefully a Ceph developer can 
help you with this.
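If someone does want to dig further, the map the OSD is reading can be pulled from the monitors and inspected offline; a rough sketch, using the epoch 89832 that appears in the log above:

ceph osd getmap 89832 -o /tmp/osdmap.89832
osdmaptool --print /tmp/osdmap.89832 | less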



--
Regards,

Richard.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help: pool not responding

2016-03-03 Thread Dimitar Boichev
I see a lot of people (including myself) ending up with PGs that are stuck in 
“creating” state when you force create them.

How did you restart ceph ?
Mine were created fine after I restarted the monitor nodes after a minor 
version upgrade.
Did you do it monitors first, osds second, etc etc …..

Regards.


On Mar 3, 2016, at 13:13, Mario Giammarco <mgiamma...@gmail.com> wrote:

I have tried "force create". It says "creating" but at the end problem persists.
I have restarted ceph as usual.
I am evaluating ceph and I am shocked because it semeed a very robust 
filesystem and now for a glitch I have an entire pool blocked and there is no 
simple procedure to force a recovery.
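For reference, the commands usually involved when chasing incomplete or stuck PGs look roughly like this (the pg id 4.2 is only a placeholder):

ceph health detail
ceph pg dump_stuck inactive
ceph pg 4.2 query                # shows why the pg is incomplete and which OSDs it is waiting for
ceph pg force_create_pg 4.2      # last resort: recreates the pg empty, any data in it is lost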

2016-03-02 18:31 GMT+01:00 Oliver Dzombic <i...@ip-interactive.de>:
Hi,

i could also not find any delete, but a create.

I found this here, its basically your situation:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032412.html

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 02.03.2016 um 18:28 schrieb Mario Giammarco:
> Thanks for the info even if it is bad info.
> Anyway I am reading docs again and I do not see a way to delete PGs.
> How can I remove them?
> Thanks,
> Mario
>
> 2016-03-02 17:59 GMT+01:00 Oliver Dzombic <i...@ip-interactive.de>:
>
> Hi,
>
> as i see your situation, somehow this 4 pg's got lost.
>
> They will not recover, because they are incomplete. So there is no data
> from which it could be recovered.
>
> So all what is left is to delete this pg's.
>
> Since all 3 osd's are in and up, it does not seem like you can somehow
> access this lost pg's.
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1 
> 
> UST ID: DE274086107
>
>
> Am 02.03.2016 um 17:45 schrieb Mario Giammarco:
> >
> >
> > Here it is:
> >
> >  cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
> >  health HEALTH_WARN
> > 4 pgs incomplete
> > 4 pgs stuck inactive
> > 4 pgs stuck unclean
> > 1 requests are blocked > 32 sec
> >  monmap e8: 3 mons at
> > {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
> > election epoch 840, quorum 0,1,2 0,1,2
> >  osdmap e2405: 3 osds: 3 up, 3 in
> >   pgmap v5904430: 288 pgs, 4 pools, 391 GB data, 100 kobjects
> > 1090 GB used, 4481 GB / 5571 GB avail
> >  284 active+clean
> >4 incomplete
> >   client io 4008 B/s rd, 446 kB/s wr, 23 op/s
> >
> >
> > 2016-03-02 9:31 GMT+01:00 Shinobu Kinjo <ski...@redhat.com>:
> > Is "ceph -s" still showing you same output?
> >
> > > cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
> > >  health HEALTH_WARN
> > > 4 pgs incomplete
> > > 4 pgs stuck inactive
> > > 4 pgs stuck unclean
> > >  monmap e8: 3 mons at
> > > {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
> > > election epoch 832, quorum 0,1,2 0,1,2
> > >  osdmap e2400: 3 osds: 3 up, 3 in
> > >   pgmap v5883297: 288 pgs, 4 pools, 391 GB data, 100
> kobjects
> > > 1090 GB used, 4481 GB / 5571 GB avail
> > >  284 active+clean
> > >4 incomplete
> >
> > Cheers,
> > S
> >
> > - Original Message -
> > From: "Mario Giammarco" 
> mailto:mgiamma...@gmail.com>
> 

[ceph-users] R: Help: pool not responding

2016-03-03 Thread Mario Giammarco
Uses init script to restart

From: Dimitar Boichev
Sent: Thursday, 3 March 2016 21:44
To: Mario Giammarco
Cc: Oliver Dzombic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Help: pool not responding


Re: [ceph-users] Help: pool not responding

2016-03-03 Thread Dimitar Boichev
But the whole cluster or what ?

Regards.

Dimitar Boichev
SysAdmin Team Lead
AXSMarine Sofia
Phone: +359 889 22 55 42
Skype: dimitar.boichev.axsmarine
E-mail: dimitar.boic...@axsmarine.com

On Mar 3, 2016, at 22:47, Mario Giammarco <mgiamma...@gmail.com> wrote:

Uses init script to restart

From: Dimitar Boichev
Sent: Thursday, 3 March 2016 21:44
To: Mario Giammarco
Cc: Oliver Dzombic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Help: pool not responding



Re: [ceph-users] PG's stuck inactive, stuck unclean, incomplete, imports cause osd segfaults

2016-03-03 Thread Philip S. Hempel

On 03/03/2016 03:00 PM, Richard Arends wrote:

On 03/03/2016 08:32 PM, Philip S. Hempel wrote:



On 03/03/2016 01:49 PM, Richard Arends wrote:

On 03/03/2016 07:21 PM, Philip S. Hempel wrote:

Philip,

Sorry, can't help you with the segfault. What i would do, is set 
debug options in ceph.conf and start the OSD, maybe that extra debug 
info will give something you can work with.

/osd/OSDMap.h: 502: FAILED assert(exists(osd))

I wonder if this means what I think: that it sees this pg and states 
it exists in the map already, but does not handle this data correctly 
and dumps?


Do you have more info before and after this message?




Thanks, appreciate the help.
That is where I have gotten as well, so if we have a developer out there 
that can help please let me know.

There is budget to pay someone for the help.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Hammer upgrade]: procedure for upgrade

2016-03-03 Thread ceph
Hi,

As the docs say: mon, then osd, then rgw.
Restart each daemon after upgrading the code.

Works fine
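A rough sketch of that order on an EL7 sysvinit setup (package manager, daemon ids and hostnames are only placeholders):

ceph osd set noout                                   # optional, avoids rebalancing during OSD restarts
# on each monitor node:
yum update ceph && service ceph restart mon.$(hostname -s)
# then on each OSD node, one node at a time:
yum update ceph && service ceph restart osd.0        # repeat per OSD id on that node
# finally on the gateway:
yum update ceph-radosgw && service ceph-radosgw restart
ceph osd unset noout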

On 03/03/2016 22:11, Andrea Annoè wrote:
> Hi to all,
> Our Ceph architecture has:
> 1 RGW
> 3 MON
> 4 OSD
> 
> Has someone tested a procedure for upgrading a Ceph architecture with RGW, MON and OSD?
> Which component should I upgrade first?
> Except for RGW, will all services stay up while the upgrade is applied?
> 
> Thanks in advance for sharing your experience.
> 
> Best regards
> Andrea
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Hammer upgrade]: procedure for upgrade

2016-03-03 Thread Andrea Annoè
Hi to all,
Our Ceph architecture has:
1 RGW
3 MON
4 OSD

Has someone tested a procedure for upgrading a Ceph architecture with RGW, MON and OSD?
Which component should I upgrade first?
Except for RGW, will all services stay up while the upgrade is applied?

Thanks in advance for sharing your experience.

Best regards
Andrea

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Adrian Saul

> Samsung EVO...
> Which exact model, I presume this is not a DC one?
>
> If you had put your journals on those, you would already be pulling your hairs
> out due to abysmal performance.
>
> Also with Evo ones, I'd be worried about endurance.

No,  I am using the P3700DCs for journals.  The Samsungs are the 850 2TB 
(MZ-75E2T0BW).  Chosen primarily on price.  We already built a system using the 
1TB models with Solaris+ZFS and I have little faith in them.  Certainly their 
write performance is erratic and not ideal.  We have other vendor options which 
are what they call "Enterprise Value" SSDs, but still 4x the price.   I would 
prefer a higher grade drive but unfortunately cost is being driven from above 
me.

> > On the ceph side each disk in the OSD servers are setup as an individual
> > OSD, with a 12G journal created on the flash mirror.   I setup the SSD
> > servers into one root, and the SATA servers into another and created
> > pools using hosts as fault boundaries, with the pools set for 2
> > copies.
> Risky. If you have very reliable and well monitored SSDs you can get away
> with 2 (I do so), but with HDDs and the combination of their reliability and
> recovery time it's asking for trouble.
> I realize that this is testbed, but if your production has a replication of 3 
> you
> will be disappointed by the additional latency.

Again, cost - the end goal is that we build metro-based dual-site pools which 
will be 2+2 replication.  I am aware of the risks, but presenting numbers 
based on buying 4x the disk we are able to use already gets questioned hard.

> This smells like garbage collection on your SSDs, especially since it matches
> time wise what you saw on them below.

I concur.  I am just not sure why that impacts the client, when from the 
client's perspective the journal should hide this.  If the journal were 
struggling to keep up and had to flush constantly then perhaps, but at the 
current steady-state IO rate I am testing with I don't think the journal 
should be that saturated.
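One way to see where the spikes actually surface is to watch the OSD-level latencies while the test runs; a quick sketch (device names are placeholders):

ceph osd perf                            # per-OSD fs_commit_latency / fs_apply_latency
iostat -x 1 /dev/nvme0n1 /dev/sdb        # journal device vs. data device on an OSD node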

> Have you tried the HDD based pool and did you see similar, consistent
> interval, spikes?

To be honest I have been focusing on the SSD numbers but that would be a good 
comparison.

> Or alternatively, configured 2 of your NVMEs as OSDs?

That was what I was thinking of doing - move the NVMEs to the frontends, make 
them OSDs and configure them as a read-forward cache tier for the other pools, 
and just have the SSDs and SATA journal by default on a first partition.
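For reference, wiring that up is only a handful of commands; a sketch with placeholder pool names (a CRUSH rule restricting the cache pool to the NVMe OSDs is assumed to exist already):

ceph osd pool create nvme-cache 128 128
ceph osd pool set nvme-cache hit_set_type bloom
ceph osd tier add ssd-pool nvme-cache
ceph osd tier cache-mode nvme-cache readforward
ceph osd tier set-overlay ssd-pool nvme-cache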

> No, not really. The journal can only buffer so much.
> There are several threads about this in the archives.
>
> You could tune it but that will only go so far if your backing storage can't 
> keep
> up.
>
> Regards,
>
> Christian


Agreed - Thanks for your help.
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-03-03 Thread Yehuda Sadeh-Weinraub
On Thu, Feb 25, 2016 at 7:17 AM, Ritter Sławomir
 wrote:
> Hi,
>
>
>
> We have two CEPH clusters running on Dumpling 0.67.11 and some of our
> "multipart objects" are incompleted. It seems that some slow requests could
> cause corruption of related S3 objects. Moveover GETs for that objects are
> working without any error messages. There are only HTTP 200 in logs as well
> as no information about problems from popular client tools/libs.
>
>
>
> The situation looks very similiar to described in bug #8269, but we are
> using fixed 0.67.11 version:  http://tracker.ceph.com/issues/8269
>
>
>
> Regards,
>
>
>
> Sławomir Ritter
>
>
>
>
>
>
>
> EXAMPLE#1
>
>
>
> slow_request
>
> 
>
> 2016-02-23 13:49:58.818640 osd.260 10.176.67.27:6800/688083 2119 : [WRN] 4
> slow requests, 4 included below; oldest blocked for > 30.727096 secs
>
> 2016-02-23 13:49:58.818673 osd.260 10.176.67.27:6800/688083 2120 : [WRN] slow request 30.727096 seconds old, received at 2016-02-23 13:49:28.091460: osd_op(client.47792965.0:185007087 default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2 [writefull 0~524288] 10.ce729ebe e107594) v4 currently waiting for subops from [469,9]
>

Did these requests ever finish?

>
>
>
>
> HTTP_500 in apache.log
>
> ==
>
> 127.0.0.1 - - [23/Feb/2016:13:49:27 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=56
> HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
> 127.0.0.1 - - [23/Feb/2016:13:49:28 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=57
> HTTP/1.0" 500 751 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
> 127.0.0.1 - - [23/Feb/2016:13:49:58 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=57
> HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
> 127.0.0.1 - - [23/Feb/2016:13:49:59 +0100] "PUT
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z&partNumber=58
> HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3
> Linux/3.13.0-39-generic(syncworker)"
>
>
>
>
>
> Empty RADOS object (real size = 0 bytes), list generated basis on MANIFEST
>
> ==
>
> found
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.56_2
> 2097152   ok  2097152   10.7acc9476 (10.1476) [278,142,436]
> [278,142,436]
>
> found
> default.14654.445__multipart_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57
> 0 diff4194304   10.4f5be025 (10.25)   [57,310,428]
> [57,310,428]
>
> found
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_1
> 4194304   ok  4194304   10.81191602 (10.1602) [441,109,420]
> [441,109,420]
>
> found
> default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
> 2097152   ok  2097152   10.ce729ebe (10.1ebe) [260,469,9]
> [260,469,9]
>
>
>
>
>
> "Silent" GETs
>
> =
>
> # object size from headers
>
> $ s3 -u head
> video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
> Content-Type: binary/octet-stream
>
> Content-Length: 641775701
>
> Server: nginx
>
>
>
> # but GETs only 637581397 (641775701 - missing 4194304 = 637581397)
>
> $ s3 -u get
> video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv >
> /tmp/test
>
> $  ls -al /tmp/test
>
> -rw-r--r-- 1 root root 637581397 Feb 23 17:05 /tmp/test
>
>
>
> # no error in logs
>
> 127.0.0.1 - - [23/Feb/2016:17:05:00 +0100] "GET
> /video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
> HTTP/1.0" 200 637581711 "-" "Mozilla/4.0 (Compatible; s3; libs3 2.0; Linux
> x86_64)"
>
>
>
> # wget - retry for missing part, but there is no missing part, so it GETs
> head/tail of the file again
>
> $ wget
> http://127.0.0.1:88/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
>
> --2016-02-23 17:10:11--
> http://127.0.0.1:88/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
>
> Connecting to 127.0.0.1:88... connected.
>
> HTTP request sent, awaiting response... 200 OK
>
> Length: 641775701 (612M) [binary/octet-stream]
>
> Saving to: `c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv'
>
>
>
> 99%
> [==>
> ] 637,581,397 63.9M/s   in 9.5s
>
>
>
> 2016-02-2

Re: [ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-03-03 Thread Robin H. Johnson
On Thu, Mar 03, 2016 at 01:55:13PM +0100, Ritter Sławomir wrote:
> Hi,
> 
> I think this is a really serious problem - again:  
> 
> - we silently lost S3/RGW objects in clusters 
> 
> Moreover, our situation looks very similar to the one described in
> the unfixed bug #13764 (Hammer) and in the fixed #8269 (Dumpling).
FYI fix in #8269 _is_ present in Hammer:
commit bd8e026f88b rgw: don't allow multiple writers to same multiobject part
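For anyone checking whether an existing object was affected, comparing the advertised Content-Length with what actually comes back is a quick test; a rough sketch against the endpoint used in the report above (the object name is a placeholder):

url=http://127.0.0.1:88/video-shbc/OBJECT
expected=$(curl -sI "$url" | awk 'tolower($1)=="content-length:" {print $2}' | tr -d '\r')
curl -s -o /tmp/obj "$url"
actual=$(stat -c %s /tmp/obj)
[ "$expected" = "$actual" ] || echo "truncated: expected $expected, got $actual"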

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] abort slow requests ?

2016-03-03 Thread Ben Hines
I have a few bad objects in ceph which are 'stuck on peering'.  The clients
hit them and they build up and eventually stop all traffic to the OSD.   I
can open up traffic by resetting the OSD (aborting those requests)
temporarily.

Is there a way to tell ceph to cancel/abort these 'slow requests' once they
reach a certain age, rather than letting them build up and block everything?

-Ben
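Not an abort mechanism, but for inspecting what the blocked requests are actually waiting on, the OSD admin socket can be queried on the node hosting that OSD; a minimal sketch with a placeholder osd id:

ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops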
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are crashing during PG replication

2016-03-03 Thread Alexander Gubanov
I decided to stop using the SSD cache pool and create just 2 pools: the 1st pool
only of SSDs for fast storage, the 2nd only of HDDs for slow storage.
As for this file, honestly, I don't know why it is created. As I said, I
flush the journal of the fallen OSD, remove this file and then start the osd
daemon:

ceph-osd --flush-journal osd.3
rm -rf /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.
238e1f29.0728__head_813E90A3__3
service ceph start osd.3

But if I turn the cache pool off, the file isn't created:

ceph osd tier cache-mode ${cache_pool} forward
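For completeness, the usual sequence for retiring a cache tier entirely is roughly the following (pool names are placeholders):

ceph osd tier cache-mode cachepool forward
rados -p cachepool cache-flush-evict-all
ceph osd tier remove-overlay basepool
ceph osd tier remove basepool cachepool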
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are crashing during PG replication

2016-03-03 Thread Shinobu Kinjo
Thank you for your explanation.

> Every time 2 of 18 OSDs are crashing. I think it's happening when run PG 
> replication because crashing only 2 OSDs and every time they're are the same.

First you said that 2 OSDs crashed every time. From the log you pasted,
it makes sense to do something about osd.3.

> rm -rf
> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.0728__head_813E90A3__3

What makes me confused now is this.
Was osd.4 also crashed like osd.3?

>-1> 2016-02-24 04:51:45.904673 7fd995026700  5 -- op tracker -- , seq: 
> 19231, time: 2016-02-24 04:51:45.904673, event: started, request: 
> osd_op(osd.13.12097:806247 rb.0.218d6.238e1f29.00010db3 [copy-get max 
> 8388608] 3.94c2bed2 ack+read+ignore_cache+ignore_overlay+map_snap_clone 
> e13252) v4

And the crash seems to happen during this process; what I really want to
know is what this message implies.
Did you check osd.13?

Anyhow your cluster is now fine...no?
That's good news.

Cheers,
Shinobu

On Fri, Mar 4, 2016 at 11:05 AM, Alexander Gubanov  wrote:
> I decided to refuse use of ssd cache pool and create just 2 pool. 1st pool
> only of ssd for fast storage 2nd only of hdd for slow storage.
> What about this file, honestly, I don't know why it is created. As I say I
> flush the journal for fallen OSD and remove this file and then I start osd
> damon:
>
> ceph-osd --flush-journal osd.3
> rm -rf
> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.0728__head_813E90A3__3
> service ceph start osd.3
>
> But if I turn the cache pool off  the file isn't created:
>
> ceph osd tier cache-mode ${cahec_pool} forward
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Email:
shin...@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: List of SSDs

2016-03-03 Thread Christian Balzer

Hello,

On Mon, 29 Feb 2016 15:00:08 -0800 Heath Albritton wrote:

> > Did you just do these tests or did you also do the "suitable for Ceph"
> > song and dance, as in sync write speed?
> 
> These were done with libaio, so async.  I can do a sync test if that
> helps.  My goal for testing wasn't specifically suitability with ceph,
> but overall suitability in my environment, much of which uses async
> IO.
> 
Fair enough. 
Sync tests would be nice, if nothing else to confirm that the Samsung DC
level SSDs are suitable and how they compare in that respect to the Intels.
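For reference, the usual sync write test looks roughly like this (the device path is a placeholder and the run overwrites whatever is on it):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test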

> 
> >> SM863 Pro (default over-provisioning) ~7k IOPS per thread (4 threads,
> >> QD32) Intel S3710 ~10k IOPS per thread
> >> 845DC Pro ~12k IOPS per thread
> >> SM863 (28% over-provisioning) ~18k IOPS per thread
> >>
> > Very interesting.
> > To qualify your values up there, could you provide us with the exact
> > models, well size of the SSD will do.
> 
> SM863 was 960GB, I've many of these and the 1.92TB models deployed
> 845DC Pro, 800GB
> S3710, 800GB
> 
Thanks, pretty much an oranges with oranges comparison then. ^o^

> > Also did you test with a S3700 (I find the 3710s to be a slight
> > regression in some ways)?
> > And for kicks, did you try over-provisioning with an Intel SSD to see
> > the effects there?
> 
> These tests were performed mid-2015.  I requested an S3700, but at
> that point, I could only get the S3710.  I didn't test the Intel with
> increased over-provisioning.  I suspect it wouldn't have performed
> much better as it was already over-provisioned by 28% or thereabouts.
> 
Yeah, my curiosity was mostly if there is similar ratio at work here
(might have made more sense for testing purposes to REDUCE the
overprovisioning of the Intel) and where the point of diminishing returns
is.

> It's easy to guess at these sort of things.  The total capacity of
> flash is in some power of two and the advertised capacity is some
> power of ten.  Manufacturer's use the difference to buy themselves
> some space for garbage collection.  So, a terabyte worth of flash is
> 1099511627776 bytes.  800GB is 8e+11 bytes with the difference of
> about 299GB, which is the space they've set aside for GC.
> 
Ayup, that I was quite aware of.

> Again, if there's some tests you'd like to see done, let me know.
> It's relatively easy for me to get samples and the tests are a benefit
> to me as much as any other.
>
Well, see above, diminishing returns and all.
 
> 
> >> I'm seeing the S3710s at ~$1.20/GB and the SM863 around $.63/GB.  As
> >> such, I'm buying quite a lot of the latter.
> >
> > I assume those numbers are before over-provisioning the SM863, still
> > quite a difference indeed.
> 
> Yes, that's correct.  Here's some current pricing:  Newegg has the
> SM863 960GB at $565 or ~$.59/GB raw.  With 28% OP, that yields around
> 800GB and around $.71/GB
> 
If I'm reading the (well hidden and only in the PDF) full specs of the 
960GB 863 correctly it has an endurance of about 3 DWPD, so the comparable
Intel model would be the 3610s.
At least when it comes to endurance.
Would be interesting to see those two in comparison. ^.^


> >> I've not had them deployed
> >> for very long, so I can't attest to anything beyond my synthetic
> >> benchmarks.  I'm using the LSI 3008 based HBA as well and I've had to
> >> use updated firmware and kernel module for it.  I haven't checked the
> >> kernel that comes with EL7.2, but 7.1 still had problems with the
> >> included driver.
> >>
> > Now THIS is really interesting.
> > As you may know several people on this ML including me have issues with
> > LSI 3008s and SSDs, including Samsung ones.
> >
> > Can you provide all the details here, as in:
> > IT or IR mode (IT I presume)
> > Firmware version
> > Kernel driver version
> 
> When initially deployed about a year ago, I had problems with SSDs and
> spinning disks.  Not sure about any problems specific to Samsung SSDs,
> but I've been on the upgrade train.
> 
> I think the stock kernel module is 4.x something or other and LSA, now
> Avago has released P9 through P12 in the past year.  When I first
> started using them, I was on the P9 firmware and kernel module, which
> I built from the sources they supply.  At this point most of my infra
> is on the P10 version.  I've not tested the later versions.
> 
> Everything is IT mode where possible.
> 
Yes, at least until kernel 4.1 the module was the 4.0 version.
And I had no luck at all getting the newer versions into a generic kernel
or Debian.
And when I deployed the machines in question P8 was the latest FW from
Supermicro.

Kernel 4.4 does have the 9.x module, so I guess that's a way forward at
least on the kernel side of things (which I think is the more likely
culprit).

Thanks,

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing

Re: [ceph-users] Fwd: List of SSDs

2016-03-03 Thread Shinobu Kinjo
Comparing with these SSDs,

 S3710s
 S3610s
 SM863
 845DC Pro

which one is more reasonable in terms of performance, cost or whatever?
S3710s does not sound reasonable to me.

> And I had no luck at all getting the newer versions into a generic kernel
> or Debian.

So it's not always better to use a newer version. Is my understanding right?
If I don't understand that properly, point it out to me. I'm pretty
serious about that.

Cheers,
Shinobu


On Fri, Mar 4, 2016 at 3:17 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Mon, 29 Feb 2016 15:00:08 -0800 Heath Albritton wrote:
>
>> > Did you just do these tests or did you also do the "suitable for Ceph"
>> > song and dance, as in sync write speed?
>>
>> These were done with libaio, so async.  I can do a sync test if that
>> helps.  My goal for testing wasn't specifically suitability with ceph,
>> but overall suitability in my environment, much of which uses async
>> IO.
>>
> Fair enough.
> Sync tests would be nice, if nothing else to confirm that the Samsung DC
> level SSDs are suitable and how they compare in that respect to the Intels.
>
>>
>> >> SM863 Pro (default over-provisioning) ~7k IOPS per thread (4 threads,
>> >> QD32) Intel S3710 ~10k IOPS per thread
>> >> 845DC Pro ~12k IOPS per thread
>> >> SM863 (28% over-provisioning) ~18k IOPS per thread
>> >>
>> > Very interesting.
>> > To qualify your values up there, could you provide us with the exact
>> > models, well size of the SSD will do.
>>
>> SM863 was 960GB, I've many of these and the 1.92TB models deployed
>> 845DC Pro, 800GB
>> S3710, 800GB
>>
> Thanks, pretty much an oranges with oranges comparison then. ^o^
>
>> > Also did you test with a S3700 (I find the 3710s to be a slight
>> > regression in some ways)?
>> > And for kicks, did you try over-provisioning with an Intel SSD to see
>> > the effects there?
>>
>> These tests were performed mid-2015.  I requested an S3700, but at
>> that point, I could only get the S3710.  I didn't test the Intel with
>> increased over-provisioning.  I suspect it wouldn't have performed
>> much better as it was already over-provisioned by 28% or thereabouts.
>>
> Yeah, my curiosity was mostly if there is similar ratio at work here
> (might have made more sense for testing purposes to REDUCE the
> overprovisioning of the Intel) and where the point of diminishing returns
> is.
>
>> It's easy to guess at these sort of things.  The total capacity of
>> flash is in some power of two and the advertised capacity is some
>> power of ten.  Manufacturer's use the difference to buy themselves
>> some space for garbage collection.  So, a terabyte worth of flash is
>> 1099511627776 bytes.  800GB is 8e+11 bytes with the difference of
>> about 299GB, which is the space they've set aside for GC.
>>
> Ayup, that I was quite aware of.
>
>> Again, if there's some tests you'd like to see done, let me know.
>> It's relatively easy for me to get samples and the tests are a benefit
>> to me as much as any other.
>>
> Well, see above, diminishing returns and all.
>
>>
>> >> I'm seeing the S3710s at ~$1.20/GB and the SM863 around $.63/GB.  As
>> >> such, I'm buying quite a lot of the latter.
>> >
>> > I assume those numbers are before over-provisioning the SM863, still
>> > quite a difference indeed.
>>
>> Yes, that's correct.  Here's some current pricing:  Newegg has the
>> SM863 960GB at $565 or ~$.59/GB raw.  With 28% OP, that yields around
>> 800GB and around $.71/GB
>>
> If I'm reading the (well hidden and only in the PDF) full specs of the
> 960GB 863 correctly it has an endurance of about 3 DWPD, so the comparable
> Intel model would be the 3610s.
> At least when it comes to endurance.
> Would be interesting to see those two in comparison. ^.^
>
>
>> >> I've not had them deployed
>> >> for very long, so I can't attest to anything beyond my synthetic
>> >> benchmarks.  I'm using the LSI 3008 based HBA as well and I've had to
>> >> use updated firmware and kernel module for it.  I haven't checked the
>> >> kernel that comes with EL7.2, but 7.1 still had problems with the
>> >> included driver.
>> >>
>> > Now THIS is really interesting.
>> > As you may know several people on this ML including me have issues with
>> > LSI 3008s and SSDs, including Samsung ones.
>> >
>> > Can you provide all the details here, as in:
>> > IT or IR mode (IT I presume)
>> > Firmware version
>> > Kernel driver version
>>
>> When initially deployed about a year ago, I had problems with SSDs and
>> spinning disks.  Not sure about any problems specific to Samsung SSDs,
>> but I've been on the upgrade train.
>>
>> I think the stock kernel module is 4.x something or other and LSA, now
>> Avago has released P9 through P12 in the past year.  When I first
>> started using them, I was on the P9 firmware and kernel module, which
>> I built from the sources they supply.  At this point most of my infra
>> is on the P10 version.  I've not tested the later versions.
>>
>> Everything is IT mode where possible.
>>
> Yes, at le