[ceph-users] ceph usage for very small objects

2019-12-26 Thread Adrian Nicolae

Hi all,

I have a ceph cluster with 4+2 EC used as a secondary storage system for 
offloading big files from another storage system. Even though most of the 
files are big (at least 50MB), we also have some small objects - less 
than 4MB each. The current storage usage is 358TB of raw data for 237TB 
of 'usable' data, i.e. usable is only about 66% of raw.


I was wondering if I can get more storage efficiency by getting rid of 
all the small files and moving them to other storage systems.  My 
understanding is that every file is split into stripe_unit chunks and 
then mapped onto ceph objects which have a size of 4MB each. So if 
I have a file with a size of 1MB, the file will be split into 4 x 
256KB chunks, another 2 x 256KB chunks will be added as overhead, and every 
chunk will be mapped onto a ceph object of 4MB size. This means a 1MB 
file would be stored as 6 ceph objects, i.e. the storage usage would be 
24MB. Not sure if my understanding is correct though...
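
As a rough sanity check on that arithmetic, here is a minimal sketch of the 
nominal k+m expansion plus per-shard allocation rounding; the 64KB 
min_alloc_size is an assumption for illustration only (bluestore default-ish), 
not a value taken from this cluster:

    # worst-case maths for a small object in a k+m EC pool (illustrative only)
    k=4; m=2; alloc_kb=64                        # assumed bluestore min_alloc_size
    obj_kb=1024                                  # 1MB logical object
    shard_kb=$(( (obj_kb + k - 1) / k ))         # data per shard before padding
    padded_kb=$(( (shard_kb + alloc_kb - 1) / alloc_kb * alloc_kb ))
    raw_kb=$(( padded_kb * (k + m) ))            # raw usage across all k+m shards
    echo "${obj_kb}KB object -> ${raw_kb}KB raw"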


 Do you have any suggestions on this topic? Is it really worth it to 
move the small files off ceph? If yes, what is the minimum file size 
which I can safely store in ceph without losing too much storage?


Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Second radosgw install

2019-02-16 Thread Adrian Nicolae

Hi all,

I know that it seems like a stupid question, but I have some concerns 
about this; maybe someone can clarify things for me.


I read in the official docs that, when I create an rgw server with 
'ceph-deploy rgw create', the rgw scripts will automatically create the 
rgw system pools. I'm not sure what happens to the existing system 
pools if I already have a working rgw server...
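
One way to put my mind at ease would be to snapshot the pool list before and 
after; a minimal sketch (the grep pattern assumes the default rgw pool names, 
yours may differ):

    # record the rgw system pools before adding the second gateway
    rados lspools | grep rgw > pools-before.txt
    # ... run 'ceph-deploy rgw create <host>' for the new gateway ...
    rados lspools | grep rgw > pools-after.txt
    diff pools-before.txt pools-after.txt    # no output means nothing was added or changed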


Thanks.


On 2/15/2019 6:35 PM, Adrian Nicolae wrote:

Hi,

I want to add a second radosgw, on another server, to my existing ceph 
cluster (mimic). Should I create it like the first one, with 
'ceph-deploy rgw create'?


I don't want to mess with the existing rgw system pools.

Thanks.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Second radosgw install

2019-02-15 Thread Adrian Nicolae

Hi,

I want to add a second radosgw, on another server, to my existing ceph 
cluster (mimic). Should I create it like the first one, with 
'ceph-deploy rgw create'?


I don't want to mess with the existing rgw system pools.

Thanks.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.8 Luminous released

2018-09-05 Thread Adrian Saul
Can I confirm if this bluestore compression assert issue is resolved in 12.2.8?

https://tracker.ceph.com/issues/23540

I notice that it has a backport listed against 12.2.8, but there is no 
mention of that issue or backport in the release notes.


> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Abhishek Lekshmanan
> Sent: Wednesday, 5 September 2018 2:30 AM
> To: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-
> maintain...@ceph.com; ceph-annou...@ceph.com
> Subject: v12.2.8 Luminous released
>
>
> We're glad to announce the next point release in the Luminous v12.2.X stable
> release series. This release contains a range of bugfixes and stability
> improvements across all the components of ceph. For detailed release notes
> with links to tracker issues and pull requests, refer to the blog post at
> http://ceph.com/releases/v12-2-8-released/
>
> Upgrade Notes from previous luminous releases
> -
>
> When upgrading from v12.2.5 or v12.2.6 please note that upgrade caveats
> from
> 12.2.5 will apply to any _newer_ luminous version including 12.2.8. Please
> read the notes at https://ceph.com/releases/12-2-7-luminous-
> released/#upgrading-from-v12-2-6
>
> For the cluster that installed the broken 12.2.6 release, 12.2.7 fixed the
> regression and introduced a workaround option `osd distrust data digest =
> true`, but 12.2.7 clusters still generated health warnings like ::
>
>   [ERR] 11.288 shard 207: soid
>   11:1155c332:::rbd_data.207dce238e1f29.0527:head
> data_digest
>   0xc8997a5b != data_digest 0x2ca15853
>
>
> 12.2.8 improves the deep scrub code to automatically repair these
> inconsistencies. Once the entire cluster has been upgraded and then fully
> deep scrubbed, and all such inconsistencies are resolved; it will be safe to
> disable the `osd distrust data digest = true` workaround option.
>
> Changelog
> -
> * bluestore: set correctly shard for existed Collection (issue#24761, 
> pr#22860,
> Jianpeng Ma)
> * build/ops: Boost system library is no longer required to compile and link
> example librados program (issue#25054, pr#23202, Nathan Cutler)
> * build/ops: Bring back diff -y for non-FreeBSD (issue#24396, issue#21664,
> pr#22848, Sage Weil, David Zafman)
> * build/ops: install-deps.sh fails on newest openSUSE Leap (issue#25064,
> pr#23179, Kyr Shatskyy)
> * build/ops: Mimic build fails with -DWITH_RADOSGW=0 (issue#24437,
> pr#22864, Dan Mick)
> * build/ops: order rbdmap.service before remote-fs-pre.target
> (issue#24713, pr#22844, Ilya Dryomov)
> * build/ops: rpm: silence osd block chown (issue#25152, pr#23313, Dan van
> der Ster)
> * cephfs-journal-tool: Fix purging when importing an zero-length journal
> (issue#24239, pr#22980, yupeng chen, zhongyan gu)
> * cephfs: MDSMonitor: uncommitted state exposed to clients/mdss
> (issue#23768, pr#23013, Patrick Donnelly)
> * ceph-fuse mount failed because no mds (issue#22205, pr#22895, liyan)
> * ceph-volume add a __release__ string, to help version-conditional calls
> (issue#25170, pr#23331, Alfredo Deza)
> * ceph-volume: adds test for `ceph-volume lvm list /dev/sda` (issue#24784,
> issue#24957, pr#23350, Andrew Schoen)
> * ceph-volume: do not use stdin in luminous (issue#25173, issue#23260,
> pr#23367, Alfredo Deza)
> * ceph-volume enable the ceph-osd during lvm activation (issue#24152,
> pr#23394, Dan van der Ster, Alfredo Deza)
> * ceph-volume expand on the LVM API to create multiple LVs at different
> sizes (issue#24020, pr#23395, Alfredo Deza)
> * ceph-volume lvm.activate conditional mon-config on prime-osd-dir
> (issue#25216, pr#23397, Alfredo Deza)
> * ceph-volume lvm.batch remove non-existent sys_api property
> (issue#34310, pr#23811, Alfredo Deza)
> * ceph-volume lvm.listing only include devices if they exist (issue#24952,
> pr#23150, Alfredo Deza)
> * ceph-volume: process.call with stdin in Python 3 fix (issue#24993, pr#23238,
> Alfredo Deza)
> * ceph-volume: PVolumes.get() should return one PV when using name or
> uuid (issue#24784, pr#23329, Andrew Schoen)
> * ceph-volume: refuse to zap mapper devices (issue#24504, pr#23374,
> Andrew Schoen)
> * ceph-volume: tests.functional inherit SSH_ARGS from ansible (issue#34311,
> pr#23813, Alfredo Deza)
> * ceph-volume tests/functional run lvm list after OSD provisioning
> (issue#24961, pr#23147, Alfredo Deza)
> * ceph-volume: unmount lvs correctly before zapping (issue#24796,
> pr#23128, Andrew Schoen)
> * ceph-volume: update batch documentation to explain filestore strategies
> (issue#34309, pr#23825, Alfredo Deza)
> * change default filestore_merge_threshold to -10 (issue#24686, pr#22814,
> Douglas Fuller)
> * client: add inst to asok status output (issue#24724, pr#23107, Patrick
> Donnelly)
> * client: fixup parallel calls to ceph_ll_lookup_inode() in NFS FASL
> (issue#22683, pr#23012, huanwen ren)
> * client: increase verbosity level 

Re: [ceph-users] SSDs for data drives

2018-07-12 Thread Adrian Saul

We started our cluster with consumer (Samsung EVO) disks and the write 
performance was pitiful, they had periodic spikes in latency (average of 8ms, 
but much higher spikes) and just did not perform anywhere near where we were 
expecting.

When replaced with SM863 based devices the difference was night and day.  The 
DC grade disks held a nearly constant low latency (consistently sub-ms), no 
spiking, and performance was massively better.  For a period I ran both disks 
in the cluster and was able to graph them side by side with the same workload.  
This was not even a moderately loaded cluster, so I am glad we discovered this 
before we went full scale.

So while you certainly can do cheap and cheerful and let the data availability 
be handled by Ceph, don’t expect the performance to keep up.
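
If anyone wants to reproduce the comparison, the gap shows up clearly in a 
simple single-threaded sync write test; a minimal sketch with fio (the device 
path is just an example, and this will destroy data on that disk):

    # 4k sync writes at queue depth 1 - roughly the journal/WAL write pattern
    fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based --group_reporting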



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Satish 
Patel
Sent: Wednesday, 11 July 2018 10:50 PM
To: Paul Emmerich 
Cc: ceph-users 
Subject: Re: [ceph-users] SSDs for data drives

Prices go way up if I am picking Samsung SM863a for all data drives.

We have many servers running on consumer grade SSD drives and we never noticed 
any performance issues or any faults so far (but we never used ceph before).

I thought that was the whole point of ceph: to provide high availability if a 
drive goes down, and also parallel reads from multiple OSD nodes.

Sent from my iPhone

On Jul 11, 2018, at 6:57 AM, Paul Emmerich <paul.emmer...@croit.io> wrote:
Hi,

we've no long-term data for the SM variant.
Performance is fine as far as we can tell, but the main difference between 
these two models should be endurance.


Also, I forgot to mention that my experiences are only for the 1, 2, and 4 TB 
variants. Smaller SSDs are often proportionally slower (especially below 500GB).

Paul

Robert Stanford <rstanford8...@gmail.com>:
Paul -

 That's extremely helpful, thanks.  I do have another cluster that uses Samsung 
SM863a just for journal (spinning disks for data).  Do you happen to have an 
opinion on those as well?

On Wed, Jul 11, 2018 at 4:03 AM, Paul Emmerich <paul.emmer...@croit.io> wrote:
PM/SM863a are usually great disks and should be the default go-to option, they 
outperform
even the more expensive PM1633 in our experience.
(But that really doesn't matter if it's for the full OSD and not as dedicated 
WAL/journal)

We got a cluster with a few hundred SanDisk Ultra II (discontinued, I believe) 
that was built on a budget.
Not the best disk but great value. They have been running for ~3 years now 
with very few failures and
okayish overall performance.

We also got a few clusters with a few hundred SanDisk Extreme Pro, but we are 
not yet sure about their
long-term durability as they are only ~9 months old (average of ~1000 write 
IOPS on each disk over that time).
Some of them report only 50-60% lifetime left.

For NVMe, the Intel NVMe 750 is still a great disk.

Be careful to get these exact models. Seemingly similar disks might be just 
completely bad; for
example, the Samsung PM961 is just unusable for Ceph in our experience.

Paul

2018-07-11 10:14 GMT+02:00 Wido den Hollander <w...@42on.com>:


On 07/11/2018 10:10 AM, Robert Stanford wrote:
>
>  In a recent thread the Samsung SM863a was recommended as a journal
> SSD.  Are there any recommendations for data SSDs, for people who want
> to use just SSDs in a new Ceph cluster?
>

Depends on what you are looking for, SATA, SAS3 or NVMe?

I have very good experiences with these drives running with BlueStore in
them in SuperMicro machines:

- SATA: Samsung PM863a
- SATA: Intel S4500
- SAS: Samsung PM1633
- NVMe: Samsung PM963

Running WAL+DB+DATA with BlueStore on the same drives.

Wido

>  Thank you
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Different write pools for RGW objects

2018-07-09 Thread Adrian Nicolae

Hi,

I was wondering if I can have different destination pools for the S3 
objects uploaded to Ceph via RGW, based on the object's size.


For example :

- smaller S3 objects (let's say smaller than 1MB) should go to a 
replicated pool


- medium and big objects should go to an EC pool

Is there any way to do that? I couldn't find such a configuration option 
in the Crush Rules docs or the RGW docs.


Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-06 Thread Adrian
Update to this.

The affected pg didn't list any inconsistent objects:

[root@admin-ceph1-qh2 ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
   pg 6.20 is active+clean+inconsistent, acting [114,26,44]
[root@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20
--format=json-pretty
{
   "epoch": 210034,
   "inconsistents": []
}

Although pg query showed the primary info.stats.stat_sum.num_bytes differed
from the peers

A pg repair on 6.20 seems to have resolved the issue for now but the
info.stats.stat_sum.num_bytes still differs so presumably will become
inconsistent again next time it scrubs.
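
For reference, this was roughly the sequence; the jq filter is just a sketch of 
where that counter lives in the pg query output, not something run verbatim:

    ceph pg repair 6.20
    ceph pg deep-scrub 6.20        # re-scrub to confirm the stats now agree
    # compare the primary's byte count with each peer's copy of the stats
    ceph pg 6.20 query | jq '.info.stats.stat_sum.num_bytes, .peer_info[].stats.stat_sum.num_bytes'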

Adrian.

On Tue, Jun 5, 2018 at 12:09 PM, Adrian  wrote:

> Hi Cephers,
>
> We recently upgraded one of our clusters from hammer to jewel and then to
> luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
> deep-scrubs we have an inconsistent pg with a log message we've not seen
> before:
>
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 6.20 is active+clean+inconsistent, acting [114,26,44]
>
>
> Ceph log shows
>
> 2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : 
> cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 
> 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
> 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
> 2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : 
> cluster [ERR] 6.20 scrub 1 errors
> 2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 
> : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
> 2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 
> : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent 
> (PG_DAMAGED)
> 2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 
> : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg 
> inconsistent
>
> There are no EC pools - looks like it may be the same as
> https://tracker.ceph.com/issues/22656 although as in #7 this is not a
> cache pool.
>
> Wondering if this is ok to issue a pg repair on 6.20 or if there's
> something else we should be looking at first ?
>
> Thanks in advance,
> Adrian.
>
> ---
> Adrian : aussie...@gmail.com
> If violence doesn't solve your problem, you're not using enough of it.
>



-- 
---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-04 Thread Adrian
Hi Cephers,

We recently upgraded one of our clusters from hammer to jewel and then to
luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
deep-scrubs we have an inconsistent pg with a log message we've not seen
before:

HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 6.20 is active+clean+inconsistent, acting [114,26,44]


Ceph log shows

2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395
: cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87
clones, 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive,
0/0 whiteouts, 25952454144/25952462336 bytes, 0/0 hit_set_archive
bytes.
2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396
: cluster [ERR] 6.20 scrub 1 errors
2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0
41298 : cluster [ERR] Health check failed: 1 scrub errors
(OSD_SCRUB_ERRORS)
2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0
41299 : cluster [ERR] Health check failed: Possible data damage: 1 pg
inconsistent (PG_DAMAGED)
2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0
41345 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data
damage: 1 pg inconsistent

There are no EC pools - looks like it may be the same as
https://tracker.ceph.com/issues/22656 although as in #7 this is not a cache
pool.

Wondering if this is ok to issue a pg repair on 6.20 or if there's
something else we should be looking at first ?

Thanks in advance,
Adrian.

---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi site with cephfs

2018-05-21 Thread Adrian Saul

You have the same performance problem then regardless of what platform you 
choose to present it on.  If you want cross-site consistency with a single 
consistent view, you need to replicate writes synchronously between sites, 
which will induce a performance hit for writes.  Any other snapshot/async 
setup, while improving write performance, leaves you with that time-window gap 
should you lose a site.

If you are not particularly latency sensitive on writes (i.e. these are just 
small documents being written and left behind) then the write latency penalty 
is probably not that big an issue for the easier access a stretched CephFS 
filesystem would give you.  If your clients can access cephfs natively that 
might be cleaner than using NFS over the top, although it means having clients 
get full access to the ceph public network – otherwise my previously mentioned 
NFS export with automount would probably work for you.
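
For reference, "stretched" here just means the CephFS data and metadata pools 
use a CRUSH rule that forces copies onto both sites; a rough sketch, assuming a 
crush hierarchy with a datacenter bucket type and a pool size of 4 (2 copies 
per site):

    rule stretch_site_rule {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
    }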


From: Up Safe [mailto:upands...@gmail.com]
Sent: Tuesday, 22 May 2018 12:33 AM
To: David Turner <drakonst...@gmail.com>
Cc: Adrian Saul <adrian.s...@tpgtelecom.com.au>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] multi site with cephfs

I'll explain.
Right now we have 2 sites (racks) with several dozen servers at each
accessing a NAS (let's call it a NAS, although it's an IBM v7000 Unified that 
serves the files via NFS).

The biggest problem is that it works active-passive, i.e. we always access one 
of the storage systems for read/write
and the other one is replicated once every few hours, so it's more for backup 
needs.
In this setup, once the power goes down at our main site we're stuck with 
slightly outdated files (several hours old)
and we need to remount all of the servers and whatnot.
The multi site ceph was supposed to solve this problem for us. This way we 
would have only local mounts, i.e.
each server would only access the filesystem that is in the same site. And if 
one of the sites goes down - no pain.
The files are rather small, pdfs and xml of 50-300KB mostly.
The total size is about 25 TB right now.

We're a low budget company, so your advice about developing is not going to 
happen as we have no such skills or resources for this.
Plus, I want to make this transparent for the devs and everyone - just an 
infrastructure replacement that will buy me all of the ceph benefits and
allow the company to survive power outages or storage crashes.


On Mon, May 21, 2018 at 5:12 PM, David Turner <drakonst...@gmail.com> wrote:
Not a lot of people use object storage multi-site.  I doubt anyone is using 
this like you are.  In theory it would work, but even if somebody has this 
setup running, it's almost impossible to tell if it would work for your needs 
and use case.  You really should try it out for yourself to see if it works to 
your needs.  And if you feel so inclined, report back here with how it worked.

If you're asking for advice, why do you need a networked posix filesystem?  
Unless you are using proprietary software with this requirement, it's generally 
lazy coding that requires a mounted filesystem like this and you should aim 
towards using object storage instead without any sort of NFS layer.  It's a 
little more work for the developers, but is drastically simpler to support and 
manage.

On Mon, May 21, 2018 at 10:06 AM Up Safe <upands...@gmail.com> wrote:
guys,
please tell me if I'm in the right direction.
If ceph object storage can be set up in multi site configuration,
and I add ganesha (which to my understanding is an "adapter"
that serves s3 objects via nfs to clients) -
won't this work as active-active?


Thanks

On Mon, May 21, 2018 at 11:48 AM, Up Safe <upands...@gmail.com> wrote:
OK, thanks.
But it seems to me that having pool replicas spread over sites is a bit too 
risky performance-wise.
How about ganesha? Will it work with cephfs and a multi site setup?
I was previously reading about rgw with ganesha and it was full of limitations.
With cephfs there is only one limitation, and it's one I can live with.
Will it work?

On Mon, May 21, 2018 at 10:57 AM, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:

We run CephFS in a limited fashion in a stretched cluster of about 40km with 
redundant 10G fibre between sites – link latency is in the order of 1-2ms.  
Performance is reasonable for our usage but is noticeably slower than 
comparable local ceph based RBD shares.

Essentially we just setup the ceph pools behind cephFS to have replicas on each 
site.  To export it we are simply using Linux kernel NFS and it gets exported 
from 4 hosts that act as CephFS clients.  Those 4 hosts are then set up in a 
DNS record that resolves to all 4 IPs, and we then use automount to do 
automatic mounting and host failover on the NFS clients.  Automount takes care 
of find

Re: [ceph-users] multi site with cephfs

2018-05-21 Thread Adrian Saul

We run CephFS in a limited fashion in a stretched cluster of about 40km with 
redundant 10G fibre between sites – link latency is in the order of 1-2ms.  
Performance is reasonable for our usage but is noticeably slower than 
comparable local ceph based RBD shares.

Essentially we just set up the ceph pools behind cephFS to have replicas on each 
site.  To export it we are simply using Linux kernel NFS and it gets exported 
from 4 hosts that act as CephFS clients.  Those 4 hosts are then set up in a 
DNS record that resolves to all 4 IPs, and we then use automount to do 
automatic mounting and host failover on the NFS clients.  Automount takes care 
of finding the quickest available NFS server.
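
A minimal sketch of that export and automount arrangement (host names, paths 
and options are made up for illustration):

    # on each of the 4 CephFS client hosts, /etc/exports
    /mnt/cephfs   *(rw,sync,no_subtree_check)

    # on the NFS clients, /etc/auto.master
    /shared   /etc/auto.cephfs

    # /etc/auto.cephfs - autofs picks the closest responsive server from the list
    data   -fstype=nfs,rw   nfs1,nfs2,nfs3,nfs4:/mnt/cephfs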

I stress this is a limited setup that we use for some fairly light duty, but we 
are looking to move things like user home directories onto this.  YMMV.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Up Safe
Sent: Monday, 21 May 2018 5:36 PM
To: David Turner 
Cc: ceph-users 
Subject: Re: [ceph-users] multi site with cephfs

Hi,
can you be a bit more specific?
I need to understand whether this is doable at all.
Other options would be using ganesha, but I understand it's very limited on NFS;
or start looking at gluster.

Basically, I need the multi site option, i.e. active-active read-write.

Thanks

On Wed, May 16, 2018 at 5:50 PM, David Turner wrote:
Object storage multi-site is very specific to using object storage.  It uses 
the RGW APIs to sync s3 uploads between each site.  For CephFS you might be 
able to do a sync of the rados pools, but I don't think that's actually a thing 
yet.  RBD mirror is also a layer on top of things to sync between sites.  
Basically I think you need to do something on top of the filesystem, as opposed 
to within Ceph, to sync it between sites.

On Wed, May 16, 2018 at 9:51 AM Up Safe wrote:
But this is not the question here.
The question is whether I can configure multi site for CephFS.
Will I be able to do so by following the guide to set up the multi site for 
object storage?

Thanks

On Wed, May 16, 2018, 16:45 John Hearns wrote:
The answer given at the seminar yesterday was that a practical limit was around 
60km.
I don't think 100km is that much longer.  I defer to the experts here.






On 16 May 2018 at 15:24, Up Safe wrote:
Hi,

About a 100 km.
I have a 2-4ms latency between them.

Leon

On Wed, May 16, 2018, 16:13 John Hearns wrote:
Leon,
I was at a Lenovo/SuSE seminar yesterday and asked a similar question regarding 
separated sites.
How far apart are these two geographical locations?   It does matter.

On 16 May 2018 at 15:07, Up Safe wrote:
Hi,
I'm trying to build a multi site setup.
But the only guides I've found on the net were about building it with object 
storage or rbd.
What I need is cephfs.
I.e. I need to have 2 synced file storages at 2 geographical locations.
Is this possible?
Also, if I understand correctly - cephfs is just a component on top of the 
object storage.
Following this logic - it should be possible, right?
Or am I totally off here?
Thanks,
Leon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel to luminous upgrade, chooseleaf_vary_r and chooseleaf_stable

2018-05-15 Thread Adrian
Thanks Dan,

After talking it through we've decided to adopt your approach too and leave
the tunables till after the upgrade.

Regards,
Adrian.

On Mon, May 14, 2018 at 5:14 PM, Dan van der Ster <d...@vanderster.com>
wrote:

> Hi Adrian,
>
> Is there a strict reason why you *must* upgrade the tunables?
>
> It is normally OK to run with old (e.g. hammer) tunables on a luminous
> cluster. The crush placement won't be state of the art, but that's not
> a huge problem.
>
> We have a lot of data in a jewel cluster with hammer tunables. We'll
> upgrade that to luminous soon, but don't plan to set chooseleaf_stable
> until there's less disruptive procedure, e.g.  [1].
>
> Cheers, Dan
>
> [1] One idea I had to make this much less disruptive would be to
> script something that uses upmap's to lock all PGs into their current
> placement, then set chooseleaf_stable, then gradually remove the
> upmap's. There are some details to work out, and it requires all
> clients to be running luminous, but I think something like this could
> help...
>
>
>
>
> On Mon, May 14, 2018 at 9:01 AM, Adrian <aussie...@gmail.com> wrote:
> > Hi all,
> >
> > We recently upgraded our old ceph cluster to jewel (5xmon, 21xstorage
> hosts
> > with 9x6tb filestore osds and 3xssd's with 3 journals on each) - mostly
> used
> > for openstack compute/cinder.
> >
> > In order to get there we had to go with chooseleaf_vary_r = 4 in order to
> > minimize client impact and save time. We now need to get to luminous (on
> a
> > deadline and time is limited).
> >
> > Current tunables are:
> >   {
> >   "choose_local_tries": 0,
> >   "choose_local_fallback_tries": 0,
> >   "choose_total_tries": 50,
> >   "chooseleaf_descend_once": 1,
> >   "chooseleaf_vary_r": 4,
> >   "chooseleaf_stable": 0,
> >   "straw_calc_version": 1,
> >   "allowed_bucket_algs": 22,
> >   "profile": "unknown",
> >   "optimal_tunables": 0,
> >   "legacy_tunables": 0,
> >   "minimum_required_version": "firefly",
> >   "require_feature_tunables": 1,
> >   "require_feature_tunables2": 1,
> >   "has_v2_rules": 0,
> >   "require_feature_tunables3": 1,
> >   "has_v3_rules": 0,
> >   "has_v4_buckets": 0,
> >   "require_feature_tunables5": 0,
> >   "has_v5_rules": 0
> >   }
> >
> > Setting chooseleaf_stable to 1, the crush compare tool says:
> >Replacing the crushmap specified with --origin with the crushmap
> >   specified with --destination will move 8774 PGs (59.08417508417509% of
> the
> > total)
> >   from one item to another.
> >
> > Current tunings we have in ceph.conf are:
> >   #THROTTLING CEPH
> >   osd_max_backfills = 1
> >   osd_recovery_max_active = 1
> >   osd_recovery_op_priority = 1
> >   osd_client_op_priority = 63
> >
> >   #PERFORMANCE TUNING
> >   osd_op_threads = 6
> >   filestore_op_threads = 10
> >   filestore_max_sync_interval = 30
> >
> > I was wondering if anyone has any advice as to anything else we can do
> > balancing client impact and speed of recovery or war stories of other
> things
> > to consider.
> >
> > I'm also wondering about the interplay between chooseleaf_vary_r and
> > chooseleaf_stable.
> > Are we better with
> > 1) sticking with choosleaf_vary_r = 4, setting chooseleaf_stable =1,
> > upgrading and then setting chooseleaf_vary_r incrementally to 1 when more
> > time is available
> > or
> > 2) setting chooseleaf_vary_r incrementally first, then chooseleaf_stable
> and
> > finally upgrade
> >
> > All this bearing in mind we'd like to keep the time it takes us to get to
> > luminous as short as possible ;-) (guestimating a 59% rebalance to take
> many
> > days)
> >
> > Any advice/thoughts gratefully received.
> >
> > Regards,
> > Adrian.
> >
> > --
> > ---
> > Adrian : aussie...@gmail.com
> > If violence doesn't solve your problem, you're not using enough of it.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] no rebalance when changing chooseleaf_vary_r tunable

2018-04-04 Thread Adrian
Hi Gregory

We were planning on going to chooseleaf_vary_r=4 so we could upgrade to
jewel now and schedule the change to 1 at a more suitable time since we
were expecting a large rebalancing of objects (should have mentioned that).

Good to know that there's a valid reason we didn't see any rebalance
though; it had me worried, so thanks for the info.

Regards,
Adrian.

On Thu, Apr 5, 2018 at 9:16 AM, Gregory Farnum <gfar...@redhat.com> wrote:

> http://docs.ceph.com/docs/master/rados/operations/crush-
> map/#firefly-crush-tunables3
>
> "The optimal value (in terms of computational cost and correctness) is 1."
>
> I think you're just finding that the production cluster, with a much
> larger number of buckets, didn't ever run in to the situation
> chooseleaf_vary_r is meant to resolve, so it didn't change any
> mappings by turning it on.
> -Greg
>
> On Wed, Apr 4, 2018 at 3:49 PM, Adrian <aussie...@gmail.com> wrote:
> > Hi all,
> >
> > Was wondering if someone could enlighten me...
> >
> > I've recently been upgrading a small test cluster's tunables from bobtail to
> > firefly prior to doing the same with an old production cluster.
> >
> > OS is rhel 7.4, kernel in test is all 3.10.0-693.el7.x86_64, in prod
> admin
> > box
> > is 3.10.0-693.el7.x86_64 all mons and osds are 4.4.76-1.el7.elrepo.x86_64
> >
> > Ceph version is 0.94.10-0.el7, both were installed with ceph-deploy
> 5.37-0
> >
> > Production system was originally redhat ceph but was then changed to
> > ceph-community edition (all prior to me being here) and has 189 osds
> > on 21 hosts with 5 mons
> >
> > I changed chooseleaf_vary_r from 0 in test incrementally from 5 to 1,
> > each change saw a larger rebalance than the last
> >
> >|---+--+---|
> >| chooseleaf_vary_r | degraded | misplaced |
> >|---+--+---|
> >| 5 |   0% |0.187% |
> >| 4 |   1.913% |2.918% |
> >| 3 |   6.965% |   18.904% |
> >| 2 |  14.303% |   32.380% |
> >| 1 |  20.657% |   48.310% |
> >|---+--+---|
> >
> > As the change to 5 was so minimal we decided to jump from 0 to 4 in prod
> >
> > I performed the exact same steps on the production cluster and changed
> > chooseleaf_vary_r to 4 however nothing happened, no rebalancing at all.
> >
> > Update was done with
> >
> >ceph osd getcrushmap -o crushmap-bobtail
> >crushtool -i crushmap-bobtail --set-chooseleaf-vary-r 4 -o
> > crushmap-firefly
> >ceph osd setcrushmap -i crushmap-firefly
> >
> > I also decompiled and diff'ed the maps on occasion to confirm changes, I'm
> > relatively new to ceph, better safe than sorry :-)
> >
> >
> > tunables in prod prior to any change were
> > {
> > "choose_local_tries": 0,
> > "choose_local_fallback_tries": 0,
> > "choose_total_tries": 50,
> > "chooseleaf_descend_once": 1,
> > "chooseleaf_vary_r": 0,
> > "straw_calc_version": 0,
> > "allowed_bucket_algs": 22,
> > "profile": "bobtail",
> > "optimal_tunables": 0,
> > "legacy_tunables": 0,
> > "require_feature_tunables": 1,
> > "require_feature_tunables2": 1,
> > "require_feature_tunables3": 0,
> > "has_v2_rules": 0,
> > "has_v3_rules": 0,
> > "has_v4_buckets": 0
> > }
> >
> > tunables in prod now show
> > {
> > "choose_local_tries": 0,
> > "choose_local_fallback_tries": 0,
> > "choose_total_tries": 50,
> > "chooseleaf_descend_once": 1,
> > "chooseleaf_vary_r": 4,
> > "straw_calc_version": 0,
> > "allowed_bucket_algs": 22,
> > "profile": "unknown",
> > "optimal_tunables": 0,
> > "legacy_tunables": 0,
> > "require_feature_tunables": 1,
> > "require_feature_tunables2": 1,
> > "require_feature_tunables3": 1,
> > "has_v2_rules": 0,
> > "has_v3_rules": 0,
> > "has_v4_buckets": 0
> > }
> >
> > for ref in test they are now
> > {
> > "choose_local_tries": 0,
>

[ceph-users] no rebalance when changing chooseleaf_vary_r tunable

2018-04-04 Thread Adrian
Hi all,

Was wondering if someone could enlighten me...

I've recently been upgrading a small test cluster's tunables from bobtail to
firefly prior to doing the same with an old production cluster.

OS is RHEL 7.4; the kernel in test is 3.10.0-693.el7.x86_64 everywhere; in prod
the admin box is 3.10.0-693.el7.x86_64 and all mons and osds are
4.4.76-1.el7.elrepo.x86_64.

Ceph version is 0.94.10-0.el7, both were installed with ceph-deploy 5.37-0

Production system was originally redhat ceph but was then changed to
ceph-community edition (all prior to me being here) and has 189 osds
on 21 hosts with 5 mons

In test I changed chooseleaf_vary_r from 0, stepping it incrementally from 5 down to 1;
each change saw a larger rebalance than the last:

   |---+--+---|
   | chooseleaf_vary_r | degraded | misplaced |
   |---+--+---|
   | 5 |   0% |0.187% |
   | 4 |   1.913% |2.918% |
   | 3 |   6.965% |   18.904% |
   | 2 |  14.303% |   32.380% |
   | 1 |  20.657% |   48.310% |
   |---+--+---|

As the change to 5 was so minimal we decided to jump from 0 to 4 in prod

I performed the exact same steps on the production cluster and changed
chooseleaf_vary_r to 4; however, nothing happened - no rebalancing at all.

Update was done with

   ceph osd getcrushmap -o crushmap-bobtail
   crushtool -i crushmap-bobtail --set-chooseleaf-vary-r 4 -o
crushmap-firefly
   ceph osd setcrushmap -i crushmap-firefly

I also decompiled and diff'ed the maps on occasion to confirm changes; I'm
relatively new to ceph, better safe than sorry :-)
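
One thing I did not do beforehand, and which in hindsight would have answered 
this, is to test the new crush map offline against the current osdmap. A rough 
sketch of what I mean (assuming your osdmaptool supports --test-map-pgs-dump; 
this only touches local files, not the cluster):

    ceph osd getmap -o osdmap.bin
    osdmaptool osdmap.bin --test-map-pgs-dump > mappings-before.txt
    osdmaptool osdmap.bin --import-crush crushmap-firefly       # rewrites the local file only
    osdmaptool osdmap.bin --test-map-pgs-dump > mappings-after.txt
    diff mappings-before.txt mappings-after.txt | grep -c '^>'  # rough count of PGs that would move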


tunables in prod prior to any change were
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 0,
"straw_calc_version": 0,
"allowed_bucket_algs": 22,
"profile": "bobtail",
"optimal_tunables": 0,
"legacy_tunables": 0,
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"require_feature_tunables3": 0,
"has_v2_rules": 0,
"has_v3_rules": 0,
"has_v4_buckets": 0
}

tunables in prod now show
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 4,
"straw_calc_version": 0,
"allowed_bucket_algs": 22,
"profile": "unknown",
"optimal_tunables": 0,
"legacy_tunables": 0,
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"require_feature_tunables3": 1,
"has_v2_rules": 0,
"has_v3_rules": 0,
"has_v4_buckets": 0
}

for ref in test they are now
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"straw_calc_version": 0,
"allowed_bucket_algs": 22,
"profile": "firefly",
    "optimal_tunables": 1,
"legacy_tunables": 0,
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"require_feature_tunables3": 1,
"has_v2_rules": 0,
"has_v3_rules": 0,
"has_v4_buckets": 0
}

I'm worried that no rebalancing occurred - does anyone have any idea why?

The goal here is to get ready to upgrade to jewel - does anyone see any issues
with the above info?

Thanks in advance,
Adrian.

-- 
---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-04 Thread Adrian Saul

We are using Ceph+RBD+NFS under pacemaker for VMware.  We are doing iSCSI using 
SCST but have not used it against VMware, just Solaris and Hyper-V.

It generally works and performs well enough – the biggest issues are the 
clustering for iSCSI ALUA support and NFS failover, most of which we have 
developed in house – we still have not quite got that right yet.



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daniel 
K
Sent: Saturday, 3 March 2018 1:03 AM
To: Joshua Chen 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph iSCSI is a prank?

There's been quite a few VMWare/Ceph threads on the mailing list in the past.

One setup I've been toying with is a linux guest running on the vmware host on 
local storage, with the guest mounting a ceph RBD with a filesystem on it, then 
exporting that via NFS to the VMWare host as a datastore.

Exporting CephFS via NFS to Vmware is another option.

I'm not sure how well shared storage will work with either of these 
configurations, but they work fairly well for single-host deployments.
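
The gist of that first option, sketched out (the pool/image names, size and the 
export network are made up, and krbd plus an NFS server are assumed on the guest):

    # on the linux guest: map an RBD image, put a filesystem on it, export it
    rbd create vmware-pool/datastore1 --size 2048G
    rbd map vmware-pool/datastore1                  # returns e.g. /dev/rbd0
    mkfs.xfs /dev/rbd0
    mkdir -p /export/datastore1 && mount /dev/rbd0 /export/datastore1
    echo '/export/datastore1 10.0.0.0/24(rw,sync,no_root_squash)' >> /etc/exports
    exportfs -ra
    # then add the guest's IP and /export/datastore1 as an NFS datastore in vSphere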

There are also quite a few products that do support iscsi on ceph. Suse 
Enterprise Storage is a commercial one, PetaSAN is an open-source option.


On Fri, Mar 2, 2018 at 2:24 AM, Joshua Chen wrote:
Dear all,
  I wonder how we could support VM systems with ceph storage (block device)? My 
colleagues are waiting for my answer for vmware (vSphere 5) and I myself use 
oVirt (RHEV). The default protocol is iSCSI.
  I know that openstack/cinder works well with ceph, and proxmox (just heard) 
too. But currently we are using vmware and ovirt.


Your wise suggestion is appreciated

Cheers
Joshua


On Thu, Mar 1, 2018 at 3:16 AM, Mark Schouten wrote:
Does Xen still not support RBD? Ceph has been around for years now!
Kind regards,

--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl


From: Massimiliano Cuttini
To: "ceph-users@lists.ceph.com"
Sent: 28-2-2018 13:53
Subject: [ceph-users] Ceph iSCSI is a prank?

I was building ceph in order to use it with iSCSI.
But I just saw from the docs that it needs:

CentOS 7.5
(which is not available yet, it's still at 7.4)
https://wiki.centos.org/Download

Kernel 4.17
(which is not available yet, it is still at 4.15.7)
https://www.kernel.org/

So I guess there is no official support and this is just a bad prank.

Ceph has been ready to be used with S3 for many years.
But it needs the kernel of the next century to work with such an old technology 
like iSCSI.
So sad.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thick provisioning

2017-10-18 Thread Adrian Saul

I concur - at the moment we need to manually sum the RBD images to look at how 
much we have "provisioned" vs what ceph df shows.  In our case we had a rapid 
run of provisioning new LUNs, but it took a while before usage started to catch 
up with what was provisioned as data was migrated in.  Ceph df would show, say, 
only 20% of a pool used, but the actual RBD allocation was nearer 80+%.

I am not sure if it's workable, but a pool-level metric that tracks the total 
allocation of RBD images would be useful.  I imagine it 
gets tricky with snapshots/clones though.
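
For what it's worth, the manual summing we do is roughly this (a sketch only; 
'rbd' stands in for the pool name, and the jq path assumes the json output of 
rbd info):

    # provisioned vs actual usage per image, plus a total, for one pool
    rbd du -p rbd
    # or just sum the provisioned sizes
    for img in $(rbd ls -p rbd); do
        rbd info -p rbd --format json "$img" | jq -r '.size'
    done | awk '{s+=$1} END {printf "provisioned: %.1f TB\n", s/1099511627776}'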


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> si...@turka.nl
> Sent: Thursday, 19 October 2017 6:41 AM
> To: Samuel Soulard 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Thick provisioning
>
> Hi all,
>
> Thanks for the replies.
>
> The main reason why I was looking for the thin/thick provisioning setting is
> that I want to be sure that provisioned space should not exceed the cluster
> capacity.
>
> With thin provisioning there is a risk that more space is provisioned than the
> cluster capacity. When you monitor closely the real usage, this should not be
> a problem; but from experience when there is no hard limit, overprovisioning
> will happen at some point.
>
> Sinan
>
> > I can only speak for some environments, but sometimes, you would want
> > to make sure that a cluster cannot fill up until you can add more capacity.
> >
> > Some organizations are unable to purchase new capacity rapidly and
> > making sure you cannot exceed your current capacity, then you can't
> > run into problems.
> >
> > It may also come from an understanding that thick provisioning will
> > provide more performance initially like virtual machines environment.
> >
> > Having said all of this, isn't there a way to make sure the cluster
> > can accommodate the size of all RBD images that are created. And
> > ensure they have the space available? Some service availability might
> > depend on making sure the storage can provide the necessary capacity.
> >
> > I'm assuming that this is all from an understanding that it is more
> > costly to run such type of environments, however, you can also
> > guarantee that you will never fill up unexpectedly your cluster.
> >
> > Sam
> >
> > On Oct 18, 2017 02:20, "Wido den Hollander"  wrote:
> >
> >
> >> Op 17 oktober 2017 om 19:38 schreef Jason Dillaman
> >> :
> >>
> >>
> >> There is no existing option to thick provision images within RBD.
> >> When an image is created or cloned, the only actions that occur are
> >> some small metadata updates to describe the image. This allows image
> >> creation to be a quick, constant time operation regardless of the
> >> image size. To thick provision the entire image would require writing
> >> data to the entire image and ensuring discard support is disabled to
> >> prevent the OS from releasing space back (and thus re-sparsifying the
> >> image).
> >>
> >
> > Indeed. It makes me wonder why anybody would want it. It will:
> >
> > - Impact recovery performance
> > - Impact scrubbing performance
> > - Utilize more space then needed
> >
> > Why would you want to do this Sinan?
> >
> > Wido
> >
> >> On Mon, Oct 16, 2017 at 10:49 AM,   wrote:
> >> > Hi,
> >> >
> >> > I have deployed a Ceph cluster (Jewel). By default all block
> >> > devices
> > that
> >> > are created are thin provisioned.
> >> >
> >> > Is it possible to change this setting? I would like to have that
> >> > all created block devices are thick provisioned.
> >> >
> >> > In front of the Ceph cluster, I am running Openstack.
> >> >
> >> > Thanks!
> >> >
> >> > Sinan
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Adrian Saul

It's a fair point – in our case we are based on CentOS so self-support only 
anyway (the business does not like paying support costs).  At the time we evaluated 
LIO, SCST and STGT, with a directive to use ALUA support instead of IP 
failover.  In the end we went with SCST as it had more mature ALUA support at 
the time, and was easier to integrate into pacemaker to support the ALUA 
failover; it also seemed to perform fairly well.

However given the road we have gone down and the issues we are facing as we 
scale up and load up the storage, having a vendor support channel would be a 
relief.


From: Samuel Soulard [mailto:samuel.soul...@gmail.com]
Sent: Thursday, 12 October 2017 11:20 AM
To: Adrian Saul <adrian.s...@tpgtelecom.com.au>
Cc: Zhu Lingshan <ls...@suse.com>; dilla...@redhat.com; ceph-users 
<ceph-us...@ceph.com>
Subject: RE: [ceph-users] Ceph-ISCSI

Yes I looked at this solution, and it seems interesting.  However, one point 
that often sticks with business requirements is commercial support.

With Redhat or Suse, you have support provided with the solution.  I'm not 
sure what support channel SCST offers.

Sam

On Oct 11, 2017 20:05, "Adrian Saul" <adrian.s...@tpgtelecom.com.au> wrote:

As an aside, SCST  iSCSI will support ALUA and does PGRs through the use of 
DLM.  We have been using that with Solaris and Hyper-V initiators for RBD 
backed storage but still have some ongoing issues with ALUA (probably our 
current config, we need to lab later recommendations).



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard <samuel.soul...@gmail.com>
> Cc: ceph-users <ceph-us...@ceph.com>; Zhu Lingshan <ls...@suse.com>
> Subject: Re: [ceph-users] Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
> <samuel.soul...@gmail.com> wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> > PGRs (I believe they are files on disk), this would work no?  Using an
> > 2 ISCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a
> > port on another ISCSI gateway?  I believe LIO with multiple target
> > portal IP on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> > ISCSI gateway available through 2 target portal IP (for data path
> > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > failover to the standby node with the PGR data that is available on share
> stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > <jdill...@redhat.com> wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >> <samuel.soul...@gmail.com> wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> >> > is, RBD block device mounted on the ISCSI gateway and published
> >> > through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.
> >> > This would theoretically provide support for PGRs since LIO does
> >> > support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node
> >> > throughput, but this would achieve high availability required by
> >> > some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >>

Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Adrian Saul

As an aside, SCST  iSCSI will support ALUA and does PGRs through the use of 
DLM.  We have been using that with Solaris and Hyper-V initiators for RBD 
backed storage but still have some ongoing issues with ALUA (probably our 
current config, we need to lab later recommendations).



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 12 October 2017 5:04 AM
> To: Samuel Soulard 
> Cc: ceph-users ; Zhu Lingshan 
> Subject: Re: [ceph-users] Ceph-ISCSI
>
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote:
> > Hmmm, If you failover the identity of the LIO configuration including
> > PGRs (I believe they are files on disk), this would work no?  Using an
> > 2 ISCSI gateways which have shared storage to store the LIO
> > configuration and PGR data.
>
> Are you referring to the Active Persist Through Power Loss (APTPL) support
> in LIO where it writes the PR metadata to "/var/target/pr/aptpl_"? I
> suppose that would work for a Pacemaker failover if you had a shared file
> system mounted between all your gateways *and* the initiator requests
> APTPL mode(?).
>
> > Also, you said another "fails over to another port", do you mean a
> > port on another ISCSI gateway?  I believe LIO with multiple target
> > portal IP on the same node for path redundancy works with PGRs.
>
> Yes, I was referring to the case with multiple active iSCSI gateways which
> doesn't currently distribute PGRs to all gateways in the group.
>
> > In my scenario, if my assumptions are correct, you would only have 1
> > ISCSI gateway available through 2 target portal IP (for data path
> > redundancy).  If this first ISCSI gateway fails, both target portal IP
> > failover to the standby node with the PGR data that is available on share
> stored.
> >
> >
> > Sam
> >
> > On Wed, Oct 11, 2017 at 12:52 PM, Jason Dillaman 
> > wrote:
> >>
> >> On Wed, Oct 11, 2017 at 12:31 PM, Samuel Soulard
> >>  wrote:
> >> > Hi to all,
> >> >
> >> > What if you're using an ISCSI gateway based on LIO and KRBD (that
> >> > is, RBD block device mounted on the ISCSI gateway and published
> >> > through LIO).
> >> > The
> >> > LIO target portal (virtual IP) would failover to another node.
> >> > This would theoretically provide support for PGRs since LIO does
> >> > support SPC-3.
> >> > Granted it is not distributed and limited to 1 single node
> >> > throughput, but this would achieve high availability required by
> >> > some environment.
> >>
> >> Yes, LIO technically supports PGR but it's not distributed to other
> >> nodes. If you have a pacemaker-initiated target failover to another
> >> node, the PGR state would be lost / missing after migration (unless I
> >> am missing something like a resource agent that attempts to preserve
> >> the PGRs). For initiator-initiated failover (e.g. a target is alive
> >> but the initiator cannot reach it), after it fails over to another
> >> port the PGR data won't be available.
> >>
> >> > Of course, multiple target portal would be awesome since available
> >> > throughput would be able to scale linearly, but since this isn't
> >> > here right now, this would provide at least an alternative.
> >>
> >> It would definitely be great to go active/active but there are
> >> concerns of data-corrupting edge conditions when using MPIO since it
> >> relies on client-side failure timers that are not coordinated with
> >> the target.
> >>
> >> For example, if an initiator writes to sector X down path A and there
> >> is delay to the path A target (i.e. the target and initiator timeout
> >> timers are not in-sync), and MPIO fails over to path B, quickly
> >> performs the write to sector X and performs second write to sector X,
> >> there is a possibility that eventually path A will unblock and
> >> overwrite the new value in sector 1 with the old value. The safe way
> >> to handle that would require setting the initiator-side IO timeouts
> >> to such high values as to cause higher-level subsystems to mark the
> >> MPIO path as failed should a failure actually occur.
> >>
> >> The iSCSI MCS protocol would address these concerns since in theory
> >> path B could discover that the retried IO was actually a retry, but
> >> alas it's not available in the Linux Open-iSCSI nor ESX iSCSI
> >> initiators.
> >>
> >> > On Wed, Oct 11, 2017 at 12:26 PM, David Disseldorp 
> >> > wrote:
> >> >>
> >> >> Hi Jason,
> >> >>
> >> >> Thanks for the detailed write-up...
> >> >>
> >> >> On Wed, 11 Oct 2017 08:57:46 -0400, Jason Dillaman wrote:
> >> >>
> >> >> > On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
> >> >> > 
> >> >> > wrote:
> >> >> >
> >> >> > > As far as I am able to understand there are 2 ways of setting
> >> >> > > iscsi for ceph
> >> >> > >
> >> >> > > 1- using kernel (lrbd) only 

Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Adrian Saul

We see the same messages and are similarly on a 4.4 KRBD version that is 
affected by this.

I have seen no impact from it so far that I know about.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 5 October 2017 5:45 AM
> To: Gregory Farnum 
> Cc: ceph-users ; Josy
> 
> Subject: Re: [ceph-users] bad crc/signature errors
>
> Perhaps this is related to a known issue on some 4.4 and later kernels [1]
> where the stable write flag was not preserved by the kernel?
>
> [1] http://tracker.ceph.com/issues/19275
>
> On Wed, Oct 4, 2017 at 2:36 PM, Gregory Farnum 
> wrote:
> > That message indicates that the checksums of messages between your
> > kernel client and OSD are incorrect. It could be actual physical
> > transmission errors, but if you don't see other issues then this isn't
> > fatal; they can recover from it.
> >
> > On Wed, Oct 4, 2017 at 8:52 AM Josy 
> wrote:
> >>
> >> Hi,
> >>
> >> We have setup a cluster with 8 OSD servers (31 disks)
> >>
> >> Ceph health is Ok.
> >> --
> >> [root@las1-1-44 ~]# ceph -s
> >>cluster:
> >>  id: de296604-d85c-46ab-a3af-add3367f0e6d
> >>  health: HEALTH_OK
> >>
> >>services:
> >>  mon: 3 daemons, quorum
> >> ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
> >>  mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
> >>  osd: 31 osds: 31 up, 31 in
> >>
> >>data:
> >>  pools:   4 pools, 510 pgs
> >>  objects: 459k objects, 1800 GB
> >>  usage:   5288 GB used, 24461 GB / 29749 GB avail
> >>  pgs: 510 active+clean
> >> 
> >>
> >> We created a pool and mounted it as RBD in one of the client server.
> >> While adding data to it, we see this below error :
> >>
> >> 
> >> [939656.039750] libceph: osd20 10.255.0.9:6808 bad crc/signature
> >> [939656.041079] libceph: osd16 10.255.0.8:6816 bad crc/signature
> >> [939735.627456] libceph: osd11 10.255.0.7:6800 bad crc/signature
> >> [939735.628293] libceph: osd30 10.255.0.11:6804 bad crc/signature
> >>
> >> =
> >>
> >> Can anyone explain what is this and if I can fix it ?
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd create returns duplicate ID's

2017-09-29 Thread Adrian Saul

Do you mean that after you delete and remove the crush and auth entries for the 
OSD, when you go to create another OSD later it will re-use the previous OSD ID 
that you have destroyed in the past?

Because I have seen that behaviour as well -  but only for previously allocated 
OSD IDs that have been osd rm/crush rm/auth del.
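
For what it's worth, the removal sequence we use is roughly the below (the OSD id is a placeholder and this is a sketch from memory rather than our actual puppet code):

    # drain and remove an OSD (example id 12)
    ceph osd out 12                      # let the data drain off it first
    systemctl stop ceph-osd@12           # stop the daemon once backfill completes
    ceph osd crush remove osd.12         # remove it from the crush map
    ceph auth del osd.12                 # remove its cephx key
    ceph osd rm 12                       # remove the OSD entry itself

As far as I understand it, "ceph osd create" simply hands back the lowest unused ID, which is why fully removed IDs get re-used - although it should not normally hand the same ID to two callers, which is what makes your case odd.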




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Luis Periquito
> Sent: Friday, 29 September 2017 6:01 PM
> To: Ceph Users 
> Subject: [ceph-users] osd create returns duplicate ID's
>
> Hi all,
>
> I use puppet to deploy and manage my clusters.
>
> Recently, as I have been doing a removal of old hardware and adding of new
> I've noticed that sometimes the "ceph osd create" is returning repeated IDs.
> Usually it's on the same server, but yesterday I saw it in different servers.
>
> I was expecting the OSD ID's to be unique, and when they come on the same
> server puppet starts spewing errors - which is desirable - but when it's in
> different servers it broke those OSDs in Ceph. As they hadn't backfill any 
> full
> PGs I just wiped, removed and started anew.
>
> As for the process itself: The OSDs are marked out and removed from crush,
> when empty they are auth del and osd rm. After building the server puppet
> will osd create, and use the generated ID for crush move and mkfs.
>
> Unfortunately I haven't been able to reproduce in isolation, and being a
> production cluster logging is tuned way down.
>
> This has happened in several different clusters, but they are all running
> 10.2.7.
>
> Any ideas?
>
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-22 Thread Adrian Saul

Thanks for bringing this to attention, Wido - it's of interest to us as we are 
currently looking to migrate mail platforms onto Ceph using NFS, but this seems 
far more practical.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: Thursday, 21 September 2017 6:40 PM
> To: ceph-us...@ceph.com
> Subject: [ceph-users] librmb: Mail storage on RADOS with Dovecot
>
> Hi,
>
> A tracker issue has been out there for a while:
> http://tracker.ceph.com/issues/12430
>
> Storing e-mail in RADOS with Dovecot, the IMAP/POP3/LDA server with a
> huge marketshare.
>
> It took a while, but last year Deutsche Telekom took on the heavy work and
> started a project to develop librmb: LibRadosMailBox
>
> Together with Deutsche Telekom and Tallence GmbH (DE) this project came
> to life.
>
> First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-
> plugin
>
> I am not going to repeat everything which is on Github, put a short summary:
>
> - CephFS is used for storing Mailbox Indexes
> - E-Mails are stored directly as RADOS objects
> - It's a Dovecot plugin
>
> We would like everybody to test librmb and report back issues on Github so
> that further development can be done.
>
> It's not finalized yet, but all the help is welcome to make librmb the best
> solution for storing your e-mails on Ceph with Dovecot.
>
> Danny Al-Gaaf has written a small blogpost about it and a presentation:
>
> - https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/
> - http://blog.bisect.de/2017/09/ceph-meetup-berlin-followup-librmb.html
>
> To get a idea of the scale: 4,7PB of RAW storage over 1.200 OSDs is the final
> goal (last slide in presentation). That will provide roughly 1,2PB of usable
> storage capacity for storing e-mail, a lot of e-mail.
>
> To see this project finally go into the Open Source world excites me a lot :-)
>
> A very, very big thanks to Deutsche Telekom for funding this awesome
> project!
>
> A big thanks as well to Tallence as they did an awesome job in developing
> librmb in such a short time.
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Adrian Saul
> I understand what you mean and it's indeed dangerous, but see:
> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>
> Looking at the systemd docs it's difficult though:
> https://www.freedesktop.org/software/systemd/man/systemd.service.ht
> ml
>
> If the OSD crashes due to another bug you do want it to restart.
>
> But for systemd it's not possible to see if the crash was due to a disk I/O-
> error or a bug in the OSD itself or maybe the OOM-killer or something.

Perhaps use something like RestartPreventExitStatus and define a specific exit 
code for the OSD to exit with when it is exiting due to an I/O error.
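
A rough sketch of what that could look like on the systemd side, assuming the OSD were changed to exit with a dedicated code on I/O errors (the 66 below is entirely made up):

    # hypothetical drop-in; only useful if ceph-osd actually exits with code 66 on EIO
    mkdir -p /etc/systemd/system/ceph-osd@.service.d
    cat > /etc/systemd/system/ceph-osd@.service.d/restart.conf <<'EOF'
    [Service]
    RestartPreventExitStatus=66
    EOF
    systemctl daemon-reload

That way a crash from an ordinary bug still restarts, but a deliberate exit on disk error stays down.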

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release cadence

2017-09-07 Thread Adrian Saul
> * Drop the odd releases, and aim for a ~9 month cadence. This splits the
> difference between the current even/odd pattern we've been doing.
>
>   + eliminate the confusing odd releases with dubious value
>   + waiting for the next release isn't quite as bad
>   - required upgrades every 9 months instead of ever 12 months

As a user, this is probably closest to the ideal, although a production deployment 
might slip out of the LTS window within 18 months, given that once deployed they 
tend to stay static.

From a testing perspective it would be good to know you could deploy the 
"early access" version of a release and test with that, rather than having to 
switch releases to productionise when that release is blessed.

Also, and this might be harder to achieve, but could krbd support for new 
releases be more aligned with kernel versions?  Or at the least a definitive 
map of what kernels and backports support which release.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring a rbd map rbd connection

2017-08-25 Thread Adrian Saul
If you are monitoring to ensure that it is mounted and active, a simple 
check_disk on the mountpoint should work.  If the mount is not present, or the 
filesystem is non-responsive then this should pick it up. A second check to 
perhaps test you can actually write files to the file system would not go 
astray either.

Other than that I don't think there is much point checking anything else like 
rbd mapped output.
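
Something along these lines is all I would do: a standard check_disk on the mountpoint for presence and free space (e.g. "check_disk -w 10% -c 5% -p /mnt/rbdmount"), plus a tiny write-test plugin. The path and timeout below are just examples:

    #!/bin/bash
    # crude write-liveness check for an rbd-backed mountpoint
    MNT=/mnt/rbdmount
    TESTFILE=$MNT/.icinga_write_test.$$
    if timeout 10 touch "$TESTFILE" && timeout 10 rm -f "$TESTFILE"; then
        echo "OK - $MNT is writable"
        exit 0
    else
        echo "CRITICAL - write test failed on $MNT"
        exit 2
    fi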


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Hauke Homburg
> Sent: Friday, 25 August 2017 1:35 PM
> To: ceph-users 
> Subject: [ceph-users] Monitoring a rbd map rbd connection
>
> Hallo,
>
> Ich want to monitor the mapped Connection between a rbd map rbdimage
> an a /dev/rbd device.
>
> This i want to do with icinga.
>
> Has anyone a Idea how i can do this?
>
> My first Idea is to touch and remove a File in the mount point. I am not sure
> that this is the the only thing i have to do
>
>
> Thanks for Help
>
> Hauke
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ruleset vs replica count

2017-08-25 Thread Adrian Saul

Yes - ams5-ssd would have 2 replicas and ams6-ssd would have 1 (with a pool size of 3, "firstn -2" means 3 - 2 = 1).

Although for this ruleset the min_size should be set to at least 2, or more 
practically 3 or 4.
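
If you want to sanity-check where the replicas land, crushtool can simulate the rule for you (the rule number, file name and sample range below are just examples):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings --min-x 0 --max-x 9

Each output line lists the OSDs chosen for a sample input, so you can confirm two copies land under ams5-ssd and one under ams6-ssd.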


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sinan 
Polat
Sent: Friday, 25 August 2017 3:02 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Ruleset vs replica count

Hi,

In a Multi Datacenter Cluster I have the following rulesets:
--
rule ams5_ssd {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ams5-ssd
    step chooseleaf firstn 2 type host
    step emit
    step take ams6-ssd
    step chooseleaf firstn -2 type host
    step emit
}
rule ams6_ssd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take ams6-ssd
    step chooseleaf firstn 2 type host
    step emit
    step take ams5-ssd
    step chooseleaf firstn -2 type host
    step emit
}
--

The replication size is set to 3.

When for example ruleset 1 is used, how is the replication being done? Does it 
store 2 replica's in ams5-ssd and store 1 replica in ams6-ssd? Or does it store 
3 replicas in ams5-ssd and 3 replicas in ams6-ssd?

Thanks!

Sinan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-20 Thread Adrian Saul
> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage - MZ-
> 75E4T0B/AM | Samsung

The performance difference between these and the SM or PM863 range is night and 
day.  I would not use these for anything you care about with performance, 
particularly IOPS or latency.
Their write latency is highly variable and even at best is still 5x higher than 
what the SM863 range does.  When we compared them we could not get them below 
6ms and they frequently spiked to much higher values (25-30ms).  With the 
SM863s they were a constant sub 1ms and didn't fluctuate.  I believe it was the 
garbage collection on the Evos that causes the issue.  Here was the difference 
in average latencies from a pool made of half Evo and half SM863:

Write latency - Evo 7.64ms - SM863 0.55ms
Read Latency - Evo 2.56ms - SM863  0.16ms

Add to that Christian's remarks on the write endurance, and they are only good 
for desktops that won't exercise them that much.  You are far better off investing 
in DC/enterprise grade devices.
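
If you want to see the difference yourself before buying, the usual single-job sync write test shows it straight away (the device name is an example, and writing raw to it destroys whatever is on the disk):

    # sync 4k writes at queue depth 1 - roughly the journal workload
    fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

The Evo-class drives fall apart on this test, while the DC-grade drives stay flat.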




>
> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
>  wrote:
> > Yes, Its in production and used the pg count as per the pg calcuator @
> ceph.com.
> >
> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
> >> Which ssds are used? Are they in production? If so how is your PG Count?
> >>
> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> >> :
> >>>
> >>> Hello,
> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate
> >>> pool for each.
> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500
> >>> MB/s and SSD's OSD show around 280MB/s.
> >>>
> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40%
> >>> high as compared with HDD's OSD bench.
> >>>
> >>> Did I miss anything here? Any hint is appreciated.
> >>>
> >>> Thanks
> >>> Swami
> >>> 
> >>>
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Adrian Saul
> I'd be interested in details of this small versus large bit.

The point of smaller shares is simply to distribute the workload over more RBDs so 
the RBD device doesn't become the bottleneck. The size itself doesn't particularly 
matter; the idea is just to distribute VMs across many shares rather than a few 
large datastores.

We originally started with 10TB shares, just because we had the space - but we 
found performance was running out before capacity did.  It has become apparent 
that the limitation is at the RBD level, particularly with writes.  So under 
heavy usage, say during VMware snapshot backups, VMs get impacted by higher 
latency to the point that some become unresponsive for short periods.  The ceph 
cluster itself has plenty of performance available and handles periods of far 
higher workload, but individual RBD devices just seem to hit the wall.

For example, one of our shares will sit there all day happily doing 3-400 IOPS 
read at very low latencies.  During the backup period we get heavier writes as 
snapshots are created and cleaned up.   That increased write activity pushes 
the RBD to 100% busy and read latencies go up from 1-2ms to 20-30ms, even 
though the number of reads doesn't change that much.  The devices can handle 
more, though - I can see periods of up to 1800 IOPS read and 800 write.

There is probably more tuning that can be applied at the XFS/NFS level, but for 
the moment that’s the direction we are taking - creating more shares.
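
For reference, each new share is nothing more exotic than the below (names and sizes are examples; in practice ours is wrapped in pacemaker resources for failover):

    rbd create nfs_share_01 --size 1048576          # 1TB, size given in MB
    DEV=$(rbd map nfs_share_01)                     # krbd map, prints e.g. /dev/rbd0
    mkfs.xfs "$DEV"
    mkdir -p /export/nfs_share_01
    mount "$DEV" /export/nfs_share_01
    echo "/export/nfs_share_01 *(rw,sync,no_root_squash)" >> /etc/exports
    exportfs -ra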

>
> Would you say that the IOPS starvation is more an issue of the large
> filesystem than the underlying Ceph/RBD?

As above - I think it's more to do with an IOPS limitation at the RBD device 
level - likely due to sync write latency limiting the number of effective IOs.  
That might be XFS as well, but I have not had the chance to dial that in more.

> With a cache-tier in place I'd expect all hot FS objects (inodes, etc) to be
> there and thus be as fast as it gets from a Ceph perspective.

Yeah - the cache tier takes a fair bit of the heat and improves the response 
considerably for the SATA environments - it makes a significant difference.  
The SSD-only pool images behave in a similar way but operate to a much higher 
performance level before they start showing issues.

> OTOH lots of competing accesses to same journal, inodes would be a
> limitation inherent to the FS.

It's likely there is tuning available to improve the XFS performance, but the 
stats of the RBD device show the latencies going up; there might be more impact 
further up the stack, but the underlying device is where the change in 
performance shows.

>
> Christian
>
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Osama Hasebou
> > Sent: Wednesday, 16 August 2017 10:34 PM
> > To: n...@fisk.me.uk
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?
> >
> > Hi Nick,
> >
> > Thanks for replying! If Ceph is combined with Openstack then, does that
> mean that actually when openstack writes are happening, it is not fully sync'd
> (as in written to disks) before it starts receiving more data, so acting as 
> async
> ? In that scenario there is a chance for data loss if things go bad, i.e power
> outage or something like that ?
> >
> > As for the slow operations, reading is quite fine when I compare it to a SAN
> storage system connected to VMware. It is writing data, small chunks or big
> ones, that suffer when trying to use the sync option with FIO for
> benchmarking.
> >
> > In that case, I wonder, is no one using CEPH with VMware in a production
> environment ?
> >
> > Cheers.
> >
> > Regards,
> > Ossi
> >
> >
> >
> > Hi Osama,
> >
> > This is a known problem with many software defined storage stacks, but
> potentially slightly worse with Ceph due to extra overheads. Sync writes
> have to wait until all copies of the data are written to disk by the OSD and
> acknowledged back to the client. The extra network hops for replication and
> NFS gateways add significant latency which impacts the time it takes to carry
> out small writes. The Ceph code also takes time to process each IO request.
> >
> > What particular operations are you finding slow? Storage vmotions are just
> bad, and I don’t think there is much that can be done about them as they are
> split into lots of 64kb IO’s.
> >
> > One thing you can try is to force the CPU’s on your OSD nodes to run at C1
> cstate and force their minimum frequency to 100%. This can have quite a
> large impact on latency. Also you don’t specify your network, but 10G is a
> must.
> >
> > Nick
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Osama Hasebou
> > Sent: 14 August 2017 12:27
> > To: ceph-users
> > >
> > Subject: [ceph-users] VMware + Ceph using NFS sync/async ?
> >
> > Hi Everyone,
> >
> > We started testing the idea of 

Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Adrian Saul

We are using Ceph via NFS for VMware – we have SSD tiers in front of SATA and 
some direct SSD pools.  The datastores are just XFS file systems on RBD 
managed by a pacemaker cluster for failover.

Lessons so far are that large datastores quickly run out of IOPS and compete 
for performance – you are better off with many smaller RBDs (say 1TB) to spread 
out workloads.  Also tuning up NFS threads seems to help.
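
On the NFS threads, all I mean is bumping the nfsd thread count well above the default 8 (64 below is just an example; on RHEL/CentOS the persistent setting is RPCNFSDCOUNT in /etc/sysconfig/nfs):

    rpc.nfsd 64                    # set the running thread count
    cat /proc/fs/nfsd/threads      # confirm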


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou
Sent: Wednesday, 16 August 2017 10:34 PM
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?

Hi Nick,

Thanks for replying! If Ceph is combined with Openstack then, does that mean 
that actually when openstack writes are happening, it is not fully sync'd (as 
in written to disks) before it starts receiving more data, so acting as async ? 
In that scenario there is a chance for data loss if things go bad, i.e power 
outage or something like that ?

As for the slow operations, reading is quite fine when I compare it to a SAN 
storage system connected to VMware. It is writing data, small chunks or big 
ones, that suffer when trying to use the sync option with FIO for benchmarking.

In that case, I wonder, is no one using CEPH with VMware in a production 
environment ?

Cheers.

Regards,
Ossi



Hi Osama,

This is a known problem with many software defined storage stacks, but 
potentially slightly worse with Ceph due to extra overheads. Sync writes have 
to wait until all copies of the data are written to disk by the OSD and 
acknowledged back to the client. The extra network hops for replication and NFS 
gateways add significant latency which impacts the time it takes to carry out 
small writes. The Ceph code also takes time to process each IO request.

What particular operations are you finding slow? Storage vmotions are just bad, 
and I don’t think there is much that can be done about them as they are split 
into lots of 64kb IO’s.

One thing you can try is to force the CPU’s on your OSD nodes to run at C1 
cstate and force their minimum frequency to 100%. This can have quite a large 
impact on latency. Also you don’t specify your network, but 10G is a must.

Nick


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou
Sent: 14 August 2017 12:27
To: ceph-users >
Subject: [ceph-users] VMware + Ceph using NFS sync/async ?

Hi Everyone,

We started testing the idea of using Ceph storage with VMware, the idea was to 
provide Ceph storage through open stack to VMware, by creating a virtual 
machine coming from Ceph + Openstack , which acts as an NFS gateway, then mount 
that storage on top of VMware cluster.

When mounting the NFS exports using the sync option, we noticed a huge 
degradation in performance which makes it very slow to use it in production, 
the async option makes it much better but then there is the risk of it being 
risky that in case a failure shall happen, some data might be lost in that 
Scenario.

Now I understand that some people in the ceph community are using Ceph with 
VMware using NFS gateways, so if you can kindly shed some light on your 
experience, and if you do use it in production purpose, that would be great and 
how did you mitigate the sync/async options and keep write performance.


Thanks you!!!

Regards,
Ossi


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Iscsi configuration

2017-08-08 Thread Adrian Saul
Hi Sam,
  We use SCST for iSCSI with Ceph, and a pacemaker cluster to orchestrate the 
management of active/passive presentation using ALUA through SCST device groups. 
 In our case we ended up writing our own pacemaker resources to support our 
particular model and preferences, but I believe there are a few resources out 
there for setting this up that you could make use of.

For us it consists of resources for the RBD devices, the iSCSI targets, the 
device groups and hostgroups for presentation.  The resources are cloned across 
all the cluster nodes, except for the device group resources which are 
master/slave, with the master becoming the active ALUA member and the others 
becoming standby or non-optimised.

The iSCSI clients see the ALUA presentation and manage it with their own 
multipathing stacks.

There may be ways to do it with LIO now, but at the time I looked at the ALUA 
support in SCST was a lot better.

HTH.

Cheers,
 Adrian



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel 
Soulard
Sent: Wednesday, 9 August 2017 6:45 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Iscsi configuration

Hi all,

Platform : Centos 7 Luminous 12.1.2

First time here but, are there any guides or guidelines out there on how to 
configure ISCSI gateways in HA so that if one gateway fails, IO can continue on 
the passive node?

What I've done so far
-ISCSI node with Ceph client map rbd on boot
-Rbd has exclusive-lock feature enabled and layering
-Targetd service dependent on rbdmap.service
-rbd exported through LUN ISCSI
-Windows ISCSI imitator can map the lun and format / write to it (awesome)

Now I have no idea where to start to have an active /passive scenario for luns 
exported with LIO.  Any ideas?

Also the web dashboard seem to hint that it can get stats for various clients 
made on ISCSI gateways, I'm not sure where it pulls that information. Is 
Luminous now shipping a ISCSI daemon of some sort?

Thanks all!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does ceph pg scrub error affect all of I/O in ceph cluster?

2017-08-03 Thread Adrian Saul

Depends on the error case – usually you will see blocked IO messages as well if 
there is a condition causing OSDs to be unresponsive.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ???
Sent: Friday, 4 August 2017 1:34 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Does ceph pg scrub error affect all of I/O in ceph 
cluster?

Hi cephers,

I experienced ceph status into HEALTH_ERR because of pg scrub error.

I thought all I/O is blocked when the status of ceph is Error.

However, ceph could operate normally even thought ceph is in error status.

There are two pools in the ceph cluster which include separate nodes (volumes-1, volumes-2).

The OSD device which has problem is in volumes-1 pool.

I noticed that volumes-2 pool has no problem with operation.

My question is whether all I/O requests are blocked when the ceph status goes into 
error, or does it depend on the error case?

Thank you!
John Haan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs per OSD guidance

2017-07-19 Thread Adrian Saul

Anyone able to offer any advice on this?

Cheers,
 Adrian


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Friday, 14 July 2017 6:05 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] PGs per OSD guidance
>
> Hi All,
>I have been reviewing the sizing of our PGs with a view to some
> intermittent performance issues.  When we have scrubs running, even when
> only a few are, we can sometimes get severe impacts on the performance of
> RBD images, enough to start causing VMs to appear stalled or unresponsive.
> When some of these scrubs are running I can see very high latency on some
> disks which I suspect is what is impacting the performance.  We currently
> have around 70 PGs per SATA OSD, and 140 PGs per SSD OSD.   These
> numbers are probably not really reflective as most of the data is in only 
> really
> half of the pools, so some PGs would be fairly heavy while others are
> practically empty.   From what I have read we should be able to go
> significantly higher though.We are running 10.2.1 if that matters in this
> context.
>
>  My question is if we increase the numbers of PGs, is that likely to help
> reduce the scrub impact or spread it wider?  For example, does the mere act
> of scrubbing one PG mean the underlying disk is going to be hammered and
> so we will impact more PGs with that load, or would having more PGs mean
> the time to scrub the PG should be reduced and so the impact will be more
> disbursed?
>
> I am also curious from a performance stand of view are we better off with
> more PGs to reduce PG lock contention etc?
>
> Cheers,
>  Adrian
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs per OSD guidance

2017-07-14 Thread Adrian Saul
Hi All,
   I have been reviewing the sizing of our PGs with a view to some intermittent 
performance issues.  When we have scrubs running, even when only a few are, we 
can sometimes get severe impacts on the performance of RBD images, enough to 
start causing VMs to appear stalled or unresponsive.  When some of these 
scrubs are running I can see very high latency on some disks which I suspect is 
what is impacting the performance.  We currently have around 70 PGs per SATA 
OSD, and 140 PGs per SSD OSD.   These numbers are probably not really 
reflective as most of the data is in only really half of the pools, so some PGs 
would be fairly heavy while others are practically empty.   From what I have 
read we should be able to go significantly higher though.  We are running 
10.2.1 if that matters in this context.
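
For reference, the per-OSD figures above come from "ceph osd df tree" - on our version the PGS column is the count of PG copies each OSD holds, and averaging it per device class gives roughly the 70/140 numbers quoted:

    ceph osd df tree       # PGS column = PG copies per OSD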

 My question is if we increase the numbers of PGs, is that likely to help 
reduce the scrub impact or spread it wider?  For example, does the mere act of 
scrubbing one PG mean the underlying disk is going to be hammered and so we 
will impact more PGs with that load, or would having more PGs mean the time to 
scrub the PG should be reduced and so the impact will be more disbursed?

I am also curious, from a performance standpoint, whether we are better off with 
more PGs to reduce PG lock contention etc?

Cheers,
 Adrian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deep scrub distribution

2017-07-05 Thread Adrian Saul

During a recent snafu with a production cluster I disabled scrubbing and deep 
scrubbing in order to reduce load on the cluster while things backfilled and 
settled down.  The PTSD caused by the incident meant I was not keen to 
re-enable it until I was confident we had fixed the root cause of the issues 
(driver issues with a new NIC type introduced with new hardware that did not 
show up until production load hit them).   My cluster is using Jewel 10.2.1, 
and is a mix of SSD and SATA over 20 hosts, 352 OSDs in total.

Fast forward a few weeks and I was ready to re-enable it.  From some reading I 
was concerned the cluster might kick off excessive scrubbing once I unset the 
flags, so I tried increasing the deep scrub interval from 7 days to 60 days - 
with most of the last deep scrubs dating from over a month ago, I was hoping 
that would distribute them over the next 30 days.  Having unset the flag and 
carefully watched the cluster, it seems to have just run a steady catch-up 
without significant impact.  What I am noticing though is that the scrubbing 
seems to just run through the full set of PGs: it did some 2280 PGs last night 
over 6 hours, and so far today another 4000 odd in 12 hours.  With 13408 PGs, I 
am guessing that all this will stop some time early tomorrow.

ceph-glb-fec-01[/var/log]$ sudo ceph pg dump|awk '{print $20}'|grep 
2017|sort|uniq -c
dumped all in format plain
  5 2017-05-23
 18 2017-05-24
 33 2017-05-25
 52 2017-05-26
 89 2017-05-27
114 2017-05-28
144 2017-05-29
172 2017-05-30
256 2017-05-31
191 2017-06-01
230 2017-06-02
369 2017-06-03
606 2017-06-04
680 2017-06-05
919 2017-06-06
   1261 2017-06-07
   1876 2017-06-08
 15 2017-06-09
   2280 2017-07-05
   4098 2017-07-06

My concern is whether I am now set up to have all 13408 PGs deep scrub again in 
60 days' time, serially compressed into about 3 days.  I would much rather they 
distribute over that period.

Will the OSDs do this distribution themselves now they have caught up, or do I 
need to say create a script that will trigger batches of PGs to deep scrub over 
time to push out the distribution again?
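
If it comes to scripting it, my rough plan is a periodic job along these lines - take a batch of the PGs with the oldest deep scrub stamp and kick them off manually (the awk column numbers match the pg dump format on our version, so adjust for yours, and the batch size is arbitrary):

    #!/bin/bash
    # deep scrub the 200 PGs with the oldest deep-scrub timestamp
    BATCH=200
    ceph pg dump 2>/dev/null \
        | awk '/^[0-9]+\./ {print $1, $20, $21}' \
        | sort -k2,3 \
        | head -n $BATCH \
        | while read pgid stamp time; do
              ceph pg deep-scrub "$pgid"
          done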





Adrian Saul | Infrastructure Projects Team Lead
IT
T 02 9009 9041 | M +61 402 075 760
30 Ross St, Glebe NSW 2037
adrian.s...@tpgtelecom.com.au<mailto:adrian.s...@tpgtelecom.com.au> | 
www.tpg.com.au<http://www.tpg.com.au/>

TPG Telecom (ASX: TPM)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + CEPH Integration

2017-06-18 Thread Adrian Saul
> Hi Alex,
>
> Have you experienced any problems with timeouts in the monitor action in
> pacemaker? Although largely stable, every now and again in our cluster the
> FS and Exportfs resources timeout in pacemaker. There's no mention of any
> slow requests or any peering..etc from the ceph logs so it's a bit of a 
> mystery.

Yes - we have that in our setup, which is very similar.  Usually I find it 
related to RBD device latency due to scrubbing or similar, but even when tuning 
some of that down we still get it randomly.

The most annoying part is that once it comes up, having to use "resource 
cleanup" to try and remove the failed resource usually has more impact than the 
actual error.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] design guidance

2017-06-06 Thread Adrian Saul
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > and
> > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > and saw much worse performance with the first cluster, so it seems
> > this may be the better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of Ceph,
> though other people here have made significant efforts.

In our tests our best results were with SCST - also because it provided proper 
ALUA support at the time.  I ended up developing my own pacemaker cluster 
resources to manage the SCST orchestration and ALUA failover.  In our model we 
have  a pacemaker cluster in front being an RBD client presenting LUNs/NFS out 
to VMware (NFS), Solaris and Hyper-V (iSCSI).  We are using CephFS over NFS but 
performance has been poor, even using it just for VMware templates.  We are on 
an earlier version of Jewel, so it's possible some later versions may improve 
CephFS for that, but I have not had time to test it.

We have been running a small production/POC for over 18 months on that setup, 
and gone live into a much larger setup in the last 6 months based on that 
model.  It's not without its issues, but most of that is a lack of test 
resources to be able to shake out some of the client compatibility and failover 
shortfalls we have.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Adrian Saul
In my case I am using SCST, so that is what my experience is based on.  For our 
VMware we are using NFS, but for Hyper-V and Solaris we are using iSCSI.

There is actually some work being done on a userland SCST, which could be 
interesting for building an SCST-librbd integration that bypasses the need for 
krbd.



From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday, 6 April 2017 5:43 PM
To: Adrian Saul; 'Brady Deetz'; 'ceph-users'
Subject: RE: [ceph-users] rbd iscsi gateway question

I assume Brady is referring to the death spiral LIO gets into with some 
initiators, including vmware, if an IO takes longer than about 10s. I haven’t 
heard of anything, and can’t see any changes, so I would assume this issue 
still remains.

I would look at either SCST or NFS for now.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian 
Saul
Sent: 06 April 2017 05:32
To: Brady Deetz <bde...@gmail.com>; ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] rbd iscsi gateway question


I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

For the most part though, once that condition clears things keep working, so 
its not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:

-  Failed OSDs (dead disks) – no issues
-  Cluster rebalancing – ok if throttled back to keep service times down
-  Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot
-  RBD Snapshot deletion – disk latency through roof, cluster 
unresponsive for minutes at a time, won’t do again.



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-05 Thread Adrian Saul

I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

For the most part though, once that condition clears things keep working, so 
its not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:


-  Failed OSDs (dead disks) – no issues

-  Cluster rebalancing – ok if throttled back to keep service times down

-  Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot

-  RBD Snapshot deletion – disk latency through roof, cluster 
unresponsive for minutes at a time, won’t do again.
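
For the rebalancing case, the throttling I mean is just winding the backfill/recovery knobs right down while it churns, e.g. (these are the values we happened to use, not a recommendation):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'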



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Adrian Saul

The problem is not so much ceph, but the fact that sync workloads tend to mean 
you have an effective queue depth of 1: the IO from the application is 
serialised, because it waits for the last write to complete before issuing the 
next one.
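
A quick way to see it is to compare fio at a deep queue depth against sync writes at queue depth 1 (file path and size are examples; the second run is roughly what the InnoDB redo log does on commit):

    # async at a deep queue depth - what a naive benchmark usually measures
    fio --name=async --filename=/mnt/vol/fio.test --size=1G --rw=randwrite --bs=16k --iodepth=32
    # sync writes at queue depth 1 - closer to MySQL's behaviour
    fio --name=sync --filename=/mnt/vol/fio.test --size=1G --rw=randwrite --bs=16k --iodepth=1 --sync=1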


From: Matteo Dacrema [mailto:mdacr...@enter.eu]
Sent: Wednesday, 8 March 2017 10:36 AM
To: Adrian Saul
Cc: ceph-users
Subject: Re: [ceph-users] MySQL and ceph volumes

Thank you Adrian!

I’ve forgot this option and I can reproduce the problem.

Now, what could be the problem on ceph side with O_DSYNC writes?

Regards
Matteo




On 8 Mar 2017, at 00:25, Adrian Saul 
<adrian.s...@tpgtelecom.com.au<mailto:adrian.s...@tpgtelecom.com.au>> wrote:


Possibly MySQL is doing sync writes, where as your FIO could be doing buffered 
writes.

Try enabling the sync option on fio and compare results.



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Matteo Dacrema
Sent: Wednesday, 8 March 2017 7:52 AM
To: ceph-users
Subject: [ceph-users] MySQL and ceph volumes

Hi All,

I have a galera cluster running on openstack with data on ceph volumes
capped at 1500 iops for read and write ( 3000 total ).
I can’t understand why with fio I can reach 1500 iops without IOwait and
MySQL can reach only 150 iops both read or writes showing 30% of IOwait.

I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
can’t reproduce the problem.

Anyone can tell me where I’m wrong?

Thank you
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Adrian Saul

Possibly MySQL is doing sync writes, whereas your fio could be doing buffered 
writes.

Try enabling the sync option on fio and compare results.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Matteo Dacrema
> Sent: Wednesday, 8 March 2017 7:52 AM
> To: ceph-users
> Subject: [ceph-users] MySQL and ceph volumes
>
> Hi All,
>
> I have a galera cluster running on openstack with data on ceph volumes
> capped at 1500 iops for read and write ( 3000 total ).
> I can’t understand why with fio I can reach 1500 iops without IOwait and
> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>
> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
> can’t reproduce the problem.
>
> Anyone can tell me where I’m wrong?
>
> Thank you
> Regards
> Matteo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Adrian Saul

I would concur, having spent a lot of time on ZFS on Solaris.

A ZIL will reduce the fragmentation problem a lot (because it is not doing intent 
logging into the filesystem itself, which fragments the block allocations) and 
write response will be a lot better.  I would use different devices for L2ARC 
and ZIL - the ZIL needs to be small and fast for writes (and mirrored - we have 
used some HGST 16G devices which are designed as ZILs - pricey, but highly 
recommended) - the L2ARC just needs to be faster for reads than your data disks; 
most SSDs would be fine for this.

A 14-disk RAIDZ2 is also going to be very poor for writes, especially with SATA 
- you are effectively only getting one disk's worth of IOPS for writes as each 
write needs to hit all disks.  Without a separate ZIL you are also losing write 
IOPS to ZIL and metadata operations.
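As a rough sketch of what I mean - pool and device names here are placeholders 
for whatever you actually have:

    # mirrored SLOG (ZIL) devices - small, fast, power-loss protected
    zpool add tank log mirror /dev/disk/by-id/nvme-zil-a /dev/disk/by-id/nvme-zil-b

    # L2ARC cache device - any reasonable SSD
    zpool add tank cache /dev/disk/by-id/ssd-l2arc-a

That keeps the intent log off the RAIDZ2 vdev and gives reads somewhere faster 
to come from.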



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Patrick Donnelly
> Sent: Wednesday, 11 January 2017 5:24 PM
> To: Kevin Olbrich
> Cc: Ceph Users
> Subject: Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph
> for RBD + OpenStack
>
> Hello Kevin,
>
> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
> > 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe
> > journal,
>
> Is the "journal" used as a ZIL?
>
> > We experienced a lot of io blocks (X requests blocked > 32 sec) when a
> > lot of data is changed in cloned RBDs (disk imported via OpenStack
> > Glance, cloned during instance creation by Cinder).
> > If the disk was cloned some months ago and large software updates are
> > applied (a lot of small files) combined with a lot of syncs, we often
> > had a node hit suicide timeout.
> > Most likely this is a problem with op thread count, as it is easy to
> > block threads with RAIDZ2 (RAID6) if many small operations are written
> > to disk (again, COW is not optimal here).
> > When recovery took place (0.020% degraded) the cluster performance was
> > very bad - remote service VMs (Windows) were unusable. Recovery itself
> > was using
> > 70 - 200 mb/s which was okay.
>
> I would think having an SSD ZIL here would make a very large difference.
> Probably a ZIL may have a much larger performance impact than an L2ARC
> device. [You may even partition it and have both but I'm not sure if that's
> normally recommended.]
>
> Thanks for your writeup!
>
> --
> Patrick Donnelly
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When Zero isn't 0 (Crush weight mysteries)

2016-12-20 Thread Adrian Saul

I found the other day that even though I had zero-weighted OSDs, there was still 
weight in the containing buckets, which triggered some rebalancing.

Maybe it is something similar - weight was added to the bucket even though the 
OSD underneath was 0.
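A quick way to check is to look at the bucket weights directly rather than just 
the OSD column, e.g.:

    ceph osd tree                          # compare the WEIGHT column on the host buckets
    ceph osd getcrushmap -o /tmp/cm
    crushtool -d /tmp/cm -o /tmp/cm.txt    # inspect the item weights inside each host bucket

(What I saw was the host bucket carrying weight even though the OSDs under it 
were at 0.)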


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: Wednesday, 21 December 2016 12:39 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] When Zero isn't 0 (Crush weight mysteries)
>
>
> Hello,
>
> I just (manually) added 1 OSD each to my 2 cache-tier nodes.
> The plan was/is to actually do the data-migration at the least busiest day in
> Japan, New Years (the actual holiday is January 2nd this year).
>
> So I was going to have everything up and in but at weight 0 initially.
>
> Alas at the "ceph osd crush add osd.x0 0 host=ceph-0x" steps Ceph happily
> started to juggle a few PGs (about 7 total) around, despite of course no
> weight in the cluster changing at all.
> No harm done (this is the fast and not too busy cache-tier after all), but 
> very
> much unexpected.
>
> So which part of the CRUSH algorithm goes around and pulls weights out of
> thin air?
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush rule check

2016-12-12 Thread Adrian Saul
> One thing to check though. The number of DCs is a fixed number right? You
> will always have two DCs with X hosts.

I am keeping it open in case we add other sites for some reason, but likely to 
remain  at 2.

>
> In that case:
>
>   step choose firstn 2 type datacenter
>   step chooseleaf firstn -2 type host
>
> First, take 2 of the type 'datacenter' and then find the remaining hosts. But
> since you will always use size = 4 you might even try:
>
> rule sydney-ssd {
> ruleset 6
> type replicated
> min_size 4
> max_size 4
> step take ssd-sydney
> step choose firstn 2 type datacenter
> step chooseleaf firstn 2 type host
> step emit
> }
>
> This way the ruleset will only work for size = 4.
>
> Wido
>
>
> > thanks,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Wido den Hollander [mailto:w...@42on.com]
> > > Sent: Monday, 12 December 2016 7:07 PM
> > > To: ceph-users@lists.ceph.com; Adrian Saul
> > > Subject: Re: [ceph-users] Crush rule check
> > >
> > >
> > > > Op 10 december 2016 om 12:45 schreef Adrian Saul
> > > <adrian.s...@tpgtelecom.com.au>:
> > > >
> > > >
> > > >
> > > > Hi Ceph-users,
> > > >   I just want to double check a new crush ruleset I am creating -
> > > > the intent
> > > here is that over 2 DCs, it will select one DC, and place two copies
> > > on separate hosts in that DC.  The pools created on this will use size 4  
> > > and
> min-size 2.
> > > >
> > > >  I just want to check I have crafted this correctly.
> > > >
> > >
> > > I suggest that you test your ruleset with crushtool like this:
> > >
> > > $ crushtool -i crushmap.new --test --rule 6 --num-rep 4
> > > --show-utilization $ crushtool -i crushmap.new --test --rule 6
> > > --num-rep 4 --show-mappings
> > >
> > > You can now manually verify if the placement goes as intended.
> > >
> > > Wido
> > >
> > > > rule sydney-ssd {
> > > > ruleset 6
> > > > type replicated
> > > > min_size 2
> > > > max_size 10
> > > > step take ssd-sydney
> > > > step choose firstn -2 type datacenter
> > > > step chooseleaf firstn 2 type host
> > > > step emit
> > > > }
> > > >
> > > > Cheers,
> > > >  Adrian
> > > >
> > > >
> > > >
> > > > Confidentiality: This email and any attachments are confidential
> > > > and may be
> > > subject to copyright, legal or some other professional privilege.
> > > They are intended solely for the attention and use of the named
> > > addressee(s). They may only be copied, distributed or disclosed with
> > > the consent of the copyright owner. If you have received this email
> > > by mistake or by breach of the confidentiality clause, please notify
> > > the sender immediately by return email and delete or destroy all
> > > copies of the email. Any confidentiality, privilege or copyright is
> > > not waived or lost because this email has been sent to you by mistake.
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > Confidentiality: This email and any attachments are confidential and may be
> subject to copyright, legal or some other professional privilege. They are
> intended solely for the attention and use of the named addressee(s). They
> may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been sent
> to you by mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush rule check

2016-12-12 Thread Adrian Saul

Thanks Wido.

I had found the show-utilization test, but had not seen show-mappings - that 
confirmed it for me.
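For anyone else following along, the full round trip I used was roughly this 
(file names are arbitrary):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt to add the new rule, then recompile and test
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-mappings
    crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-utilization
    ceph osd setcrushmap -i crushmap.new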

thanks,
 Adrian


> -Original Message-
> From: Wido den Hollander [mailto:w...@42on.com]
> Sent: Monday, 12 December 2016 7:07 PM
> To: ceph-users@lists.ceph.com; Adrian Saul
> Subject: Re: [ceph-users] Crush rule check
>
>
> > Op 10 december 2016 om 12:45 schreef Adrian Saul
> <adrian.s...@tpgtelecom.com.au>:
> >
> >
> >
> > Hi Ceph-users,
> >   I just want to double check a new crush ruleset I am creating - the intent
> here is that over 2 DCs, it will select one DC, and place two copies on 
> separate
> hosts in that DC.  The pools created on this will use size 4  and min-size 2.
> >
> >  I just want to check I have crafted this correctly.
> >
>
> I suggest that you test your ruleset with crushtool like this:
>
> $ crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-utilization $
> crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-mappings
>
> You can now manually verify if the placement goes as intended.
>
> Wido
>
> > rule sydney-ssd {
> > ruleset 6
> > type replicated
> > min_size 2
> > max_size 10
> > step take ssd-sydney
> > step choose firstn -2 type datacenter
> > step chooseleaf firstn 2 type host
> > step emit
> > }
> >
> > Cheers,
> >  Adrian
> >
> >
> >
> > Confidentiality: This email and any attachments are confidential and may be
> subject to copyright, legal or some other professional privilege. They are
> intended solely for the attention and use of the named addressee(s). They
> may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been sent
> to you by mistake.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crush rule check

2016-12-10 Thread Adrian Saul

Hi Ceph-users,
  I just want to double-check a new crush ruleset I am creating - the intent 
here is that across the 2 DCs it will select each DC and place two copies on 
separate hosts within that DC.  The pools created on this will use size 4 and 
min_size 2.

 I just want to check I have crafted this correctly.

rule sydney-ssd {
ruleset 6
type replicated
min_size 2
max_size 10
step take ssd-sydney
step choose firstn -2 type datacenter
step chooseleaf firstn 2 type host
step emit
}

Cheers,
 Adrian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Re: osd set noin ignored for old OSD ids

2016-11-23 Thread Adrian Saul

Thanks - that is more in line with what I was looking for: being able to 
suppress backfills/rebalancing until a host's full set of OSDs are up and 
ready.
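For reference, what I am trying is along these lines (only lightly tested on my 
side, and the setting may need a mon restart to fully take effect):

    ceph tell mon.* injectargs '--mon-osd-auto-mark-new-in=false'

    # and in ceph.conf on the monitor hosts so it survives restarts
    [mon]
        mon osd auto mark new in = false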


> -Original Message-
> From: Will.Boege [mailto:will.bo...@target.com]
> Sent: Thursday, 24 November 2016 2:17 PM
> To: Gregory Farnum
> Cc: Adrian Saul; ceph-users@lists.ceph.com
> Subject: Re: [EXTERNAL] Re: [ceph-users] osd set noin ignored for old OSD
> ids
>
> From my experience noin doesn't stop new OSDs from being marked in. noin
> only works on OSDs already in the crushmap. To accomplish the behavior you
> want I've injected "mon osd auto mark new in = false" into MONs. This also
> seems to set their OSD weight to 0 when they are created.
>
> > On Nov 23, 2016, at 1:47 PM, Gregory Farnum <gfar...@redhat.com>
> wrote:
> >
> > On Tue, Nov 22, 2016 at 7:56 PM, Adrian Saul
> > <adrian.s...@tpgtelecom.com.au> wrote:
> >>
> >> Hi ,
> >> As part of migration between hardware I have been building new OSDs
> and cleaning up old ones  (osd rm osd.x, osd crush rm osd.x, auth del osd.x).
> To try and prevent rebalancing kicking in until all the new OSDs are created
> on a host I use "ceph osd set noin", however what I have seen is that if the
> new OSD that is created uses a new unique ID, then the flag is honoured and
> the OSD remains out until I bring it in.  However if the OSD re-uses a 
> previous
> OSD id then it will go straight to in and start backfilling.  I have to 
> manually out
> the OSD to stop it (or set nobackfill,norebalance).
> >>
> >> Am I doing something wrong in this process or is there something about
> "noin" that is ignored for previously existing OSDs that have been removed
> from both the OSD map and crush map?
> >
> > There are a lot of different pieces of an OSD ID that need to get
> > deleted for it to be truly gone; my guess is you've missed some of
> > those. The noin flag doesn't prevent unlinked-but-up CRUSH entries
> > from getting placed back into the tree, etc.
> >
> > We may also have a bug though, so if you can demonstrate that the ID
> > doesn't exist in the CRUSH and OSD dumps then please create a ticket
> > at tracker.ceph.com!
> > -Greg
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd set noin ignored for old OSD ids

2016-11-22 Thread Adrian Saul

Hi,
 As part of a migration between hardware I have been building new OSDs and 
cleaning up old ones (osd rm osd.x, osd crush rm osd.x, auth del osd.x).  To 
try to prevent rebalancing from kicking in until all the new OSDs are created on 
a host I use "ceph osd set noin".  However, what I have seen is that if the new 
OSD that is created uses a new unique ID, the flag is honoured and the OSD 
remains out until I bring it in; if the OSD re-uses a previous OSD id it goes 
straight to in and starts backfilling.  I have to manually out the OSD to stop 
it (or set nobackfill,norebalance).
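For now my workaround looks roughly like this (sketch only):

    ceph osd set nobackfill
    ceph osd set norebalance
    # ... rebuild all the OSDs on the host ...
    ceph osd unset nobackfill
    ceph osd unset norebalance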

Am I doing something wrong in this process or is there something about "noin" 
that is ignored for previously existing OSDs that have been removed from both 
the OSD map and crush map?

Cheers,
 Adrian




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph outage - monitoring options

2016-11-21 Thread Adrian Saul
  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787811 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 60.181151 seconds old, received at 2016-11-21 16:42:42.606462: 
osd_op(osd.22.51681:2352097 1.3140a3e7 rbd_data.159a26238e1f29.00018502 
[list-snaps] snapc 0=[] 
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e69295) currently waiting for rw locks
2016-11-21 16:43:42.787813 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787875 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 60.181123 seconds old, received at 2016-11-21 16:42:42.606490: 
osd_op(osd.22.51681:2352098 1.3140a3e7 rbd_data.159a26238e1f29.00018502 
[copy-get max 8388608] snapc 0=[] 
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e69295) currently waiting for rw locks
2016-11-21 16:43:42.787877 7f138a653700  0 log_channel(cluster) do_log log to 
syslog

I am assuming there is something underlying here that broke network 
connectivity; however, at the time I was able to connect to the machine fine, 
able to restart the services and have the monitors see that activity, and also 
ping through the replication and public networks.  Granted, it's probably not 
Ceph's fault that connectivity was broken. However:

Why, when the OSDs were restarted, did they go straight back into that failed 
state, still blocked, even though they came up and in?
Why did other OSDs not mark them as bad or unresponsive at that point, given 
many were waiting on blocked operations?  Is the fact that they could connect 
enough, or are there other conditions that would stop them being marked out by 
other OSDs?
Is there any recommended configuration that would help detect the case where an 
OSD is unresponsive and move it out of the osdmap more quickly to keep the 
cluster operational (timeout tuning, flags to be set, etc.)?
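The sort of knobs I have been looking at are below - the values are guesses on 
my part rather than recommendations, so treat them purely as a starting point:

    [osd]
        osd heartbeat grace = 10                 # fail unresponsive peers faster (we are on the default 20)
    [mon]
        mon osd adjust heartbeat grace = false   # stop the grace being stretched by laggy history
        mon osd down out interval = 120          # mark a down OSD out sooner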

Any help appreciated.  It's a little scary kicking into production and having 
an outage where I can't explain why Ceph's redundancy didn't kick in.

Cheers,
 Adrian





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Adrian Saul
I am also seeing if reducing the filestore queue ops limit from 500 to 250 
helps.  On my graphs I can see the filestore ops queue goes from 1 or 2 to 500 
for the period of the load.  I am looking to see if throttling it down helps 
spread out the load.  The normal ops load is not enough to trouble the current 
limit.



Sent from my SAMSUNG Galaxy S7 on the Telstra Mobile Network


 Original message 
From: Nick Fisk <n...@fisk.me.uk>
Date: 23/09/2016 7:26 PM (GMT+10:00)
To: Adrian Saul <adrian.s...@tpgtelecom.com.au>, ceph-users@lists.ceph.com
Subject: RE: Snap delete performance impact

Looking back through my graphs from when this happened to me, I can see that 
the queue on the disks was up as high as 30 during the period when the snapshot 
was removed.  That would explain the high latencies - the disk is literally 
having fits trying to jump all over the place.

I need to test with the higher osd_snap_trim_sleep to see if that helps. What 
I'm interested in finding out is why so much disk activity is required for 
deleting an object. It feels to me that the process is async, in that Ceph will 
quite happily flood the Filestore with delete requests without any feedback to 
the higher layers.


> -Original Message-
> From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au]
> Sent: 23 September 2016 10:04
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
>
>
> I did some observation today - with the reduced filestore_op_threads it seems 
> to ride out the storm better, not ideal but better.
>
> The main issue is that for the 10 minutes from the moment the rbd snap rm 
> command is issued, the SATA systems in my configuration
> load up massively on disk IO and I think this is what is rolling on to all 
> other issues (OSDs unresponsive, queue backlogs). The disks all
> go 100% busy - the average SATA write latency goes from 14ms to 250ms.  I was 
> observing disks doing 400, 700 and higher service
> times.  After those few minutes it tapers down and goes back to normal.
>
> There are all ST6000VN0001 disks - anyone aware of anything that might 
> explain this sort of behaviour?  It seems odd that even if the
> disks were hit with high write traffic (average of 50 write IOPS going up to 
> 270-300 during this activity) that the service times would
> blow out that much.
>
> Cheers,
>  Adrian
>
>
>
>
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Thursday, 22 September 2016 7:15 PM
> > To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > I tried 2 this afternoon and saw the same results.  Essentially the
> > disks appear to go to 100% busy doing very small but high numbers of IO and 
> > incur massive
> > service times (300-400ms).   During that period I get blocked request errors
> > continually.
> >
> > I suspect part of that might be the SATA servers had
> > filestore_op_threads set too high and hammering the disks with too
> > much concurrent work.  As they have inherited a setting targeted for
> > SSDs, so I have wound that back to defaults on those machines see if it 
> > makes a difference.
> >
> > But I suspect going by the disk activity there is a lot of very small
> > FS metadata updates going on and that is what is killing it.
> >
> > Cheers,
> >  Adrian
> >
> >
> > > -Original Message-
> > > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > > Sent: Thursday, 22 September 2016 7:06 PM
> > > To: Adrian Saul; ceph-users@lists.ceph.com
> > > Subject: RE: Snap delete performance impact
> > >
> > > Hi Adrian,
> > >
> > > I have also hit this recently and have since increased the
> > > osd_snap_trim_sleep to try and stop this from happening again.
> > > However, I haven't had an opportunity to actually try and break it
> > > again yet, but your mail seems to suggest it might not be the silver
> > > bullet I
> > was looking for.
> > >
> > > I'm wondering if the problem is not with the removal of the
> > > snapshot, but actually down to the amount of object deletes that
> > > happen, as I see similar results when doing fstrim's or deleting
> > > RBD's. Either way I agree that a settable throttle to allow it to
> > > process more slowly would be a
> > good addition.
> > > Have you tried that value set to higher than 1, maybe 10?
> > >
> > > Nick
> > >
> > > > -Original Message-
> > > > From

Re: [ceph-users] Snap delete performance impact

2016-09-23 Thread Adrian Saul

I did some observation today - with the reduced filestore_op_threads it seems 
to ride out the storm better, not ideal but better.

The main issue is that for the 10 minutes from the moment the rbd snap rm 
command is issued, the SATA systems in my configuration load up massively on 
disk IO and I think this is what is rolling on to all other issues (OSDs 
unresponsive, queue backlogs). The disks all go 100% busy - the average SATA 
write latency goes from 14ms to 250ms.  I was observing disks doing 400, 700 
and higher service times.  After those few minutes it tapers down and goes back 
to normal.

These are all ST6000VN0001 disks - is anyone aware of anything that might explain 
this sort of behaviour?  It seems odd that even if the disks were hit with high 
write traffic (average of 50 write IOPS going up to 270-300 during this 
activity) that the service times would blow out that much.
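For the record, those numbers come from watching the OSD data disks during the 
trim with something like this (device names are whatever your OSD data disks 
are):

    iostat -xmt 1 /dev/sd[c-j]    # watch w/s, await and %util while the snap trim runs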

Cheers,
 Adrian






> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Thursday, 22 September 2016 7:15 PM
> To: n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Snap delete performance impact
>
>
> I tried 2 this afternoon and saw the same results.  Essentially the disks 
> appear
> to go to 100% busy doing very small but high numbers of IO and incur massive
> service times (300-400ms).   During that period I get blocked request errors
> continually.
>
> I suspect part of that might be the SATA servers had filestore_op_threads
> set too high and hammering the disks with too much concurrent work.  As
> they have inherited a setting targeted for SSDs, so I have wound that back to
> defaults on those machines see if it makes a difference.
>
> But I suspect going by the disk activity there is a lot of very small FS 
> metadata
> updates going on and that is what is killing it.
>
> Cheers,
>  Adrian
>
>
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Thursday, 22 September 2016 7:06 PM
> > To: Adrian Saul; ceph-users@lists.ceph.com
> > Subject: RE: Snap delete performance impact
> >
> > Hi Adrian,
> >
> > I have also hit this recently and have since increased the
> > osd_snap_trim_sleep to try and stop this from happening again.
> > However, I haven't had an opportunity to actually try and break it
> > again yet, but your mail seems to suggest it might not be the silver bullet 
> > I
> was looking for.
> >
> > I'm wondering if the problem is not with the removal of the snapshot,
> > but actually down to the amount of object deletes that happen, as I
> > see similar results when doing fstrim's or deleting RBD's. Either way
> > I agree that a settable throttle to allow it to process more slowly would 
> > be a
> good addition.
> > Have you tried that value set to higher than 1, maybe 10?
> >
> > Nick
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Adrian Saul
> > > Sent: 22 September 2016 05:19
> > > To: 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
> > > Subject: Re: [ceph-users] Snap delete performance impact
> > >
> > >
> > > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > > seems to have tempered some of the issues but its still bad
> > enough
> > > that NFS storage off RBD volumes become unavailable for over 3
> minutes.
> > >
> > > It seems that the activity which the snapshot deletes are actioned
> > > triggers massive disk load for around 30 minutes.  The logs
> > show
> > > OSDs marking each other out, OSDs complaining they are wrongly
> > > marked out and blocked requests errors for around 10 minutes at the
> > > start of this
> > activity.
> > >
> > > Is there any way to throttle snapshot deletes to make them much more
> > > of a background activity?  It really should not make the
> > entire
> > > platform unusable for 10 minutes.
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Adrian Saul
> > > > Sent: Wednesday, 6 July 2016 3:41 PM
> > > > To: 'ceph-users@lists.ceph.com'
> > > > Subject: [ceph-users] Snap delete performance impact
> > > >
> > > >
> > > > I recently started a process of using rbd snapshots to setup a
> > > > backup regime for a few file systems contained in RBD ima

Re: [ceph-users] Snap delete performance impact

2016-09-22 Thread Adrian Saul

I tried 2 this afternoon and saw the same results.  Essentially the disks 
appear to go 100% busy doing very small IOs in high numbers and incur massive 
service times (300-400ms).  During that period I get blocked request errors 
continually.

I suspect part of that might be that the SATA servers had filestore_op_threads 
set too high, hammering the disks with too much concurrent work.  They had 
inherited a setting targeted at SSDs, so I have wound that back to defaults on 
those machines to see if it makes a difference.

But going by the disk activity I suspect there are a lot of very small FS 
metadata updates going on, and that is what is killing it.

Cheers,
 Adrian


> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Thursday, 22 September 2016 7:06 PM
> To: Adrian Saul; ceph-users@lists.ceph.com
> Subject: RE: Snap delete performance impact
>
> Hi Adrian,
>
> I have also hit this recently and have since increased the
> osd_snap_trim_sleep to try and stop this from happening again. However, I
> haven't had an opportunity to actually try and break it again yet, but your
> mail seems to suggest it might not be the silver bullet I was looking for.
>
> I'm wondering if the problem is not with the removal of the snapshot, but
> actually down to the amount of object deletes that happen, as I see similar
> results when doing fstrim's or deleting RBD's. Either way I agree that a
> settable throttle to allow it to process more slowly would be a good addition.
> Have you tried that value set to higher than 1, maybe 10?
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: 22 September 2016 05:19
> > To: 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] Snap delete performance impact
> >
> >
> > Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it
> > seems to have tempered some of the issues but its still bad
> enough
> > that NFS storage off RBD volumes become unavailable for over 3 minutes.
> >
> > It seems that the activity which the snapshot deletes are actioned
> > triggers massive disk load for around 30 minutes.  The logs
> show
> > OSDs marking each other out, OSDs complaining they are wrongly marked
> > out and blocked requests errors for around 10 minutes at the start of this
> activity.
> >
> > Is there any way to throttle snapshot deletes to make them much more
> > of a background activity?  It really should not make the
> entire
> > platform unusable for 10 minutes.
> >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Adrian Saul
> > > Sent: Wednesday, 6 July 2016 3:41 PM
> > > To: 'ceph-users@lists.ceph.com'
> > > Subject: [ceph-users] Snap delete performance impact
> > >
> > >
> > > I recently started a process of using rbd snapshots to setup a
> > > backup regime for a few file systems contained in RBD images.  While
> > > this generally works well at the time of the snapshots there is a
> > > massive increase in latency (10ms to multiple seconds of rbd device
> > > latency) across the entire cluster.  This has flow on effects for
> > > some cluster timeouts as well as general performance hits to applications.
> > >
> > > In research I have found some references to osd_snap_trim_sleep being
> the
> > > way to throttle this activity but no real guidance on values for it.   I 
> > > also
> see
> > > some other osd_snap_trim tunables  (priority and cost).
> > >
> > > Is there any recommendations around setting these for a Jewel cluster?
> > >
> > > cheers,
> > >  Adrian
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > Confidentiality: This email and any attachments are confidential and
> > may be subject to copyright, legal or some other professional
> > privilege. They are intended solely for the attention and use of the
> > named addressee(s). They may only be copied, distributed or disclosed
> > with the consent of the copyright owner. If you have received this email by
> mistake or by breach of the confidentiality clause, please notify the sender
> immediately by return email and delete or destroy all copies of the email.
> Any confidentiality, privilege or copyright is not waived or lost because this

Re: [ceph-users] Snap delete performance impact

2016-09-21 Thread Adrian Saul

Any guidance on this?  I have osd_snap_trim_sleep set to 1 and it seems to have 
tempered some of the issues, but it's still bad enough that NFS storage off RBD 
volumes becomes unavailable for over 3 minutes.

It seems that the activity by which the snapshot deletes are actioned triggers 
massive disk load for around 30 minutes.  The logs show OSDs marking each other 
out, OSDs complaining they are wrongly marked out, and blocked request errors 
for around 10 minutes at the start of this activity.

Is there any way to throttle snapshot deletes to make them much more of a 
background activity?  It really should not make the entire platform unusable 
for 10 minutes.
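For reference, this is how I am applying the setting at the moment (runtime 
injection, with the same value in ceph.conf so it survives restarts - the value 
itself is still a guess):

    ceph tell osd.* injectargs '--osd_snap_trim_sleep 1'

    [osd]
        osd snap trim sleep = 1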



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Wednesday, 6 July 2016 3:41 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] Snap delete performance impact
>
>
> I recently started a process of using rbd snapshots to setup a backup regime
> for a few file systems contained in RBD images.  While this generally works
> well at the time of the snapshots there is a massive increase in latency (10ms
> to multiple seconds of rbd device latency) across the entire cluster.  This 
> has
> flow on effects for some cluster timeouts as well as general performance hits
> to applications.
>
> In research I have found some references to osd_snap_trim_sleep being the
> way to throttle this activity but no real guidance on values for it.   I also 
> see
> some other osd_snap_trim tunables  (priority and cost).
>
> Is there any recommendations around setting these for a Jewel cluster?
>
> cheers,
>  Adrian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-14 Thread Adrian Saul
> But shouldn't freezing the fs and doing a snapshot constitute a "clean
> unmount" hence no need to recover on the next mount (of the snapshot) -
> Ilya?

It's what I thought as well, but XFS seems to want to replay the log on mount 
regardless, and it needs to write to the device to do so.  This was the only way I 
found to mount it without converting the snapshot to a clone (which I couldn't 
do with the image options enabled anyway).

I have this script snapshotting, mounting and backing up multiple file systems 
on my cluster with no issue.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-14 Thread Adrian Saul

I found I could ignore the XFS issues and just mount it with the appropriate 
options (below from my backup scripts):

#
# Mount with nouuid (conflicting XFS) and norecovery (ro snapshot)
#
if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; then
echo "FAILED: Unable to mount snapshot $DATESTAMP of $FS - 
cleaning up"
rbd unmap $SNAPDEV
rbd snap rm ${RBDPATH}@${DATESTAMP}
exit 3;
fi
echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"

Without clones, it's not possible to do this without norecovery.
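The surrounding steps in the script are roughly as below (same variables as in 
the snippet above, with the error handling trimmed out):

    fsfreeze -f ${FS}                          # quiesce the filesystem before the snap
    rbd snap create ${RBDPATH}@${DATESTAMP}
    fsfreeze -u ${FS}                          # unfreeze as soon as the snap exists
    SNAPDEV=$(rbd map ${RBDPATH}@${DATESTAMP})
    mount -o ro,nouuid,norecovery ${SNAPDEV} /backup${FS}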



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Ilya Dryomov
> Sent: Wednesday, 14 September 2016 1:51 AM
> To: Nikolay Borisov
> Cc: ceph-users; SiteGround Operations
> Subject: Re: [ceph-users] Consistency problems when taking RBD snapshot
>
> On Tue, Sep 13, 2016 at 4:11 PM, Nikolay Borisov  wrote:
> >
> >
> > On 09/13/2016 04:30 PM, Ilya Dryomov wrote:
> > [SNIP]
> >>
> >> Hmm, it could be about whether it is able to do journal replay on
> >> mount.  When you mount a snapshot, you get a read-only block device;
> >> when you mount a clone image, you get a read-write block device.
> >>
> >> Let's try this again, suppose image is foo and snapshot is snap:
> >>
> >> # fsfreeze -f /mnt
> >>
> >> # rbd snap create foo@snap
> >> # rbd map foo@snap
> >> /dev/rbd0
> >> # file -s /dev/rbd0
> >> # fsck.ext4 -n /dev/rbd0
> >> # mount /dev/rbd0 /foo
> >> # umount /foo
> >> 
> >> # file -s /dev/rbd0
> >> # fsck.ext4 -n /dev/rbd0
> >>
> >> # rbd clone foo@snap bar
> >> $ rbd map bar
> >> /dev/rbd1
> >> # file -s /dev/rbd1
> >> # fsck.ext4 -n /dev/rbd1
> >> # mount /dev/rbd1 /bar
> >> # umount /bar
> >> 
> >> # file -s /dev/rbd1
> >> # fsck.ext4 -n /dev/rbd1
> >>
> >> Could you please provide the output for the above?
> >
> > Here you go : http://paste.ubuntu.com/23173721/
>
> OK, so that explains it: the frozen filesystem is "needs journal recovery", so
> mounting it off of read-only block device leads to errors.
>
> root@alxc13:~# fsfreeze -f /var/lxc/c11579 root@alxc13:~# rbd snap create
> rbd/c11579@snap_test root@alxc13:~# rbd map c11579@snap_test
> /dev/rbd151
> root@alxc13:~# fsfreeze -u /var/lxc/c11579 root@alxc13:~# file -s
> /dev/rbd151
> /dev/rbd151: Linux rev 1.0 ext4 filesystem data (needs journal
> recovery) (extents) (large files) (huge files)
>
> Now, to isolate the problem, the easiest would probably be to try to
> reproduce it with loop devices.  Can you try dding one of these images to a
> file, make sure that the filesystem is clean, losetup + mount, freeze, make a
> "snapshot" with cp and losetup -r + mount?
>
> Try sticking file -s before unfreeze and also compare md5sums:
>
> root@alxc13:~# fsfreeze -f /var/lxc/c11579  device> root@alxc13:~# rbd snap create rbd/c11579@snap_test
> root@alxc13:~# rbd map c11579@snap_test  device>  root@alxc13:~# file -s /dev/rbd151
> root@alxc13:~# fsfreeze -u /var/lxc/c11579  device>  root@alxc13:~# file -s /dev/rbd151
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-17 Thread Adrian Saul

I have SELinux disabled and the RPM post-upgrade scripts still run restorecon 
on /var/lib/ceph regardless.

In my case I chose to kill the restorecon processes to save outage time - it 
didn't affect completion of the package upgrade.
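For anyone needing to do the same, it was essentially this (use at your own 
risk - it only skips the relabel, the package upgrade itself still completes):

    pgrep -af restorecon          # confirm it is the ceph relabel that is running
    pkill -f 'restorecon.*ceph'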


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mykola 
Dvornik
Sent: Friday, 15 July 2016 6:54 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

I would also advise people to mind SELinux if it is enabled on the OSD nodes.
The re-labeling is done as part of the upgrade, and it is a rather 
time-consuming process.


-Original Message-
From: Mart van Santen
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel
Date: Fri, 15 Jul 2016 10:48:40 +0200


Hi Wido,

Thank you, we are currently in the same process so this information is very 
useful. Can you share why you upgraded from Hammer directly to Jewel - is there 
a reason to skip Infernalis? I wonder why you didn't do a 
hammer->infernalis->jewel upgrade, as that seems the logical path to me.

(we did indeed saw the same errors "Failed to encode map eXXX with expected 
crc" when upgrading to the latest hammer)


Regards,

Mart






On 07/15/2016 03:08 AM, 席智勇 wrote:
good job, thank you for sharing, Wido~
it's very useful~

2016-07-14 14:33 GMT+08:00 Wido den Hollander:

To add, the RGWs upgraded just fine as well.

No regions in use here (yet!), so that upgraded as it should.

Wido

> On 13 July 2016 at 16:56, Wido den Hollander wrote:
>
>
> Hello,
>
> The last 3 days I worked at a customer with a 1800 OSD cluster which had to 
> be upgraded from Hammer 0.94.5 to Jewel 10.2.2
>
> The cluster in this case is 99% RGW, but also some RBD.
>
> I wanted to share some of the things we encountered during this upgrade.
>
> All 180 nodes are running CentOS 7.1 on a IPv6-only network.
>
> ** Hammer Upgrade **
> At first we upgraded from 0.94.5 to 0.94.7, this went well except for the 
> fact that the monitors got spammed with these kind of messages:
>
>   "Failed to encode map eXXX with expected crc"
>
> Some searching on the list brought me to:
>
>   ceph tell osd.* injectargs -- --clog_to_monitors=false
>
>  This reduced the load on the 5 monitors and made recovery succeed smoothly.
>
>  ** Monitors to Jewel **
>  The next step was to upgrade the monitors from Hammer to Jewel.
>
>  Using Salt we upgraded the packages and afterwards it was simple:
>
>killall ceph-mon
>chown -R ceph:ceph /var/lib/ceph
>chown -R ceph:ceph /var/log/ceph
>
> Now, a systemd quirck. 'systemctl start ceph.target' does not work, I had to 
> manually enabled the monitor and start it:
>
>   systemctl enable ceph-mon@srv-zmb04-05.service
>   systemctl start ceph-mon@srv-zmb04-05.service
>
> Afterwards the monitors were running just fine.
>
> ** OSDs to Jewel **
> To upgrade the OSDs to Jewel we initially used Salt to update the packages on 
> all systems to 10.2.2, we then used a Shell script which we ran on one node 
> at a time.
>
> The failure domain here is 'rack', so we executed this in one rack, then the 
> next one, etc, etc.
>
> Script can be found on Github: 
> https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
>
> Be aware that the chown can take a long, long, very long time!
>
> We ran into the issue that some OSDs crashed after start. But after trying 
> again they would start.
>
>   "void FileStore::init_temp_collections()"
>
> I reported this in the tracker as I'm not sure what is happening here: 
> http://tracker.ceph.com/issues/16672
>
> ** New OSDs with Jewel **
> We also had some new nodes which we wanted to add to the Jewel cluster.
>
> Using Salt and ceph-disk we ran into a partprobe issue in combination with 
> ceph-disk. There was already a Pull Request for the fix, but that was not 
> included in Jewel 10.2.2.
>
> We manually applied the PR and it fixed our issues: 
> https://github.com/ceph/ceph/pull/9330
>
> Hope this helps other people with their upgrades to Jewel!
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Adrian Saul

I would suggest caution with "filestore_odsync_write" - it's fine on good SSDs, 
but on poor SSDs or spinning disks it will kill performance.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Friday, 15 July 2016 3:12 AM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Try increasing the following to say 10

osd_op_num_shards = 10
filestore_fd_cache_size = 128

I hope the following was introduced after I told you, so it seems it shouldn't 
be the cause (?)

filestore_odsync_write = true

Also, comment out the following.

filestore_wbthrottle_enable = false



From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Thursday, July 14, 2016 10:05 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Something in this section is causing the 0 IOPS issue. I have not been able to 
nail it down yet. (I did comment out the filestore_max_inline_xattr_size 
entries, and the problem still exists.)
If I take out the whole [osd] section, I was able to get rid of IOPS staying at 
0 for long periods of time. Performance is still not where I would expect.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff ? I forgot the calculation but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0 and stayed there for 
90 seconds, then started again, and within seconds went back to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong - I missed that you are running with 12 OSDs 

[ceph-users] Snap delete performance impact

2016-07-05 Thread Adrian Saul

I recently started a process of using rbd snapshots to set up a backup regime 
for a few file systems contained in RBD images.  While this generally works 
well, at the time of the snapshots there is a massive increase in latency (10ms 
to multiple seconds of rbd device latency) across the entire cluster.  This has 
flow-on effects for some cluster timeouts as well as general performance hits 
to applications.

In research I have found some references to osd_snap_trim_sleep being the way 
to throttle this activity but no real guidance on values for it.   I also see 
some other osd_snap_trim tunables  (priority and cost).
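For what it's worth, the current values can be checked on a running OSD via the 
admin socket, e.g.:

    ceph daemon osd.0 config show | grep snap_trim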

Are there any recommendations around setting these for a Jewel cluster?

cheers,
 Adrian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD out/down detection

2016-06-19 Thread Adrian Saul
Hi All,
 We have a Jewel (10.2.1) cluster on CentOS 7 - I am using an elrepo 4.4.1 
kernel on all machines - and we have an issue where some of the machines hang.  
I am not sure if it's hardware or OS, but essentially the host, including the 
console, is unresponsive and can only be recovered with a hardware reset.  
Unfortunately nothing useful is logged, so I am still trying to figure out what 
is causing this.  But the result for Ceph is that if an OSD host goes down like 
this, we have run into an issue where only some of its OSDs are marked down.
In the instance on the weekend, the host had 8 OSDs and only 5 got marked as 
down - this led to the kRBD devices jamming up trying to send IO to 
non-responsive OSDs that stayed marked up.

The machine went into a slow death - lots of reports of slow or blocked 
requests:

2016-06-19 09:37:49.070810 osd.36 10.145.2.15:6802/31359 65 : cluster [WRN] 2 
slow requests, 2 included below; oldest blocked for > 30.297258 secs
2016-06-19 09:37:54.071542 osd.36 10.145.2.15:6802/31359 82 : cluster [WRN] 112 
slow requests, 5 included below; oldest blocked for > 35.297988 secs
2016-06-19 09:37:54.071737 osd.6 10.145.2.15:6801/21836 221 : cluster [WRN] 253 
slow requests, 5 included below; oldest blocked for > 35.325155 secs
2016-06-19 09:37:59.072570 osd.6 10.145.2.15:6801/21836 251 : cluster [WRN] 262 
slow requests, 5 included below; oldest blocked for > 40.325986 secs

And then when the monitors did report them down the OSDs disputed that:

2016-06-19 09:38:35.821716 mon.0 10.145.2.13:6789/0 244970 : cluster [INF] 
osd.6 10.145.2.15:6801/21836 failed (2 reporters from different host after 
20.000365 >= grace 20.00)
2016-06-19 09:38:36.950556 mon.0 10.145.2.13:6789/0 244978 : cluster [INF] 
osd.22 10.145.2.15:6806/21826 failed (2 reporters from different host after 
21.613336 >= grace 20.00)
2016-06-19 09:38:36.951133 mon.0 10.145.2.13:6789/0 244980 : cluster [INF] 
osd.31 10.145.2.15:6812/21838 failed (2 reporters from different host after 
21.613781 >= grace 20.836511)
2016-06-19 09:38:36.951636 mon.0 10.145.2.13:6789/0 244982 : cluster [INF] 
osd.36 10.145.2.15:6802/31359 failed (2 reporters from different host after 
21.614259 >= grace 20.00)

2016-06-19 09:38:37.156088 osd.36 10.145.2.15:6802/31359 346 : cluster [WRN] 
map e28730 wrongly marked me down
2016-06-19 09:38:36.002076 osd.6 10.145.2.15:6801/21836 473 : cluster [WRN] map 
e28729 wrongly marked me down
2016-06-19 09:38:37.046885 osd.22 10.145.2.15:6806/21826 374 : cluster [WRN] 
map e28730 wrongly marked me down
2016-06-19 09:38:37.050635 osd.31 10.145.2.15:6812/21838 351 : cluster [WRN] 
map e28730 wrongly marked me down

But shortly after

2016-06-19 09:43:39.940985 mon.0 10.145.2.13:6789/0 245305 : cluster [INF] 
osd.6 out (down for 303.951251)
2016-06-19 09:43:39.941061 mon.0 10.145.2.13:6789/0 245306 : cluster [INF] 
osd.22 out (down for 302.908528)
2016-06-19 09:43:39.941099 mon.0 10.145.2.13:6789/0 245307 : cluster [INF] 
osd.31 out (down for 302.908527)
2016-06-19 09:43:39.941152 mon.0 10.145.2.13:6789/0 245308 : cluster [INF] 
osd.36 out (down for 302.908527)

2016-06-19 10:09:10.648924 mon.0 10.145.2.13:6789/0 247076 : cluster [INF] 
osd.23 10.145.2.15:6814/21852 failed (2 reporters from different host after 
20.000378 >= grace 20.00)
2016-06-19 10:09:10.887220 osd.23 10.145.2.15:6814/21852 176 : cluster [WRN] 
map e28848 wrongly marked me down
2016-06-19 10:14:15.160513 mon.0 10.145.2.13:6789/0 247422 : cluster [INF] 
osd.23 out (down for 304.288018)

By the time the issue was eventually escalated and I was able to do something 
about it, I manually marked the remaining host OSDs down (which seemed to unclog 
RBD):

2016-06-19 15:25:06.171395 mon.0 10.145.2.13:6789/0 267212 : cluster [INF] 
osd.7 10.145.2.15:6808/21837 failed (2 reporters from different host after 
22.000367 >= grace 20.00)
2016-06-19 15:25:06.171905 mon.0 10.145.2.13:6789/0 267214 : cluster [INF] 
osd.24 10.145.2.15:6800/21813 failed (2 reporters from different host after 
22.000748 >= grace 20.710981)
2016-06-19 15:25:06.172426 mon.0 10.145.2.13:6789/0 267216 : cluster [INF] 
osd.37 10.145.2.15:6810/31936 failed (2 reporters from different host after 
22.001167 >= grace 20.00)

The question I have is why these 3 OSDs, despite not being responsive for over 
5 hours, stayed marked up in the cluster.  The CRUSH map for all pools uses 
hosts as fault boundaries, so I would have expected OSDs on other hosts to 
notice them as unresponsive and report them.  Nothing was logged by the failed 
host's OSDs in the hour prior to the failure, and the OSDs on the other hosts 
seem to have noticed the rest timing out, yet appeared to be actively attempting 
backfills to the 3 that stayed up.

Any ideas on how I can improve detection of this condition?
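
I was considering something along these lines - purely a sketch, the values are 
untested guesses on my part and I'd want to understand the flapping risk before 
applying them:

    [mon]
    # accept a down report from a single host sooner (default is 2 reporters)
    mon osd min down reporters = 1
    # mark dead OSDs out faster than the ~300s we see now
    mon osd down out interval = 120

    [osd]
    # fail heartbeats faster than the default 20s grace
    osd heartbeat grace = 10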

Cheers,
 Adrian



Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-06 Thread Adrian Saul
Centos 7 - the upgrade was done simply with "yum update -y ceph" on each node 
one by one, so the package order would have been determined by yum.




From: Jason Dillaman <jdill...@redhat.com>
Sent: Monday, June 6, 2016 10:42 PM
To: Adrian Saul
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

What OS are you using?  It actually sounds like the plugins were
updated, the Infernalis OSD was reset, and then the Jewel OSD was
installed.

On Sun, Jun 5, 2016 at 10:42 PM, Adrian Saul
<adrian.s...@tpgtelecom.com.au> wrote:
>
> Thanks Jason.
>
> I don’t have anything specified explicitly for osd class dir.   I suspect it 
> might be related to the OSDs being restarted during the package upgrade 
> process before all libraries are upgraded.
>
>
>> -Original Message-
>> From: Jason Dillaman [mailto:jdill...@redhat.com]
>> Sent: Monday, 6 June 2016 12:37 PM
>> To: Adrian Saul
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>>
>> Odd -- sounds like you might have Jewel and Infernalis class objects and
>> OSDs intermixed. I would double-check your installation and see if your
>> configuration has any overload for "osd class dir".
>>
>> On Sun, Jun 5, 2016 at 10:28 PM, Adrian Saul
>> <adrian.s...@tpgtelecom.com.au> wrote:
>> >
>> > I have traced it back to an OSD giving this error:
>> >
>> > 2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open
>> > got (5) Input/output error
>> > 2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open
>> > class /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed):
>> > /usr/lib64/rados-classes/libcls_rbd.so: undefined symbol:
>> > _ZN4ceph6buffer4list8iteratorC1EPS1_j
>> >
>> > Trying to figure out why that is the case.
>> >
>> >
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of Adrian Saul
>> >> Sent: Monday, 6 June 2016 11:11 AM
>> >> To: dilla...@redhat.com
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>> >>
>> >>
>> >> No - it throws a usage error - if I add a file argument after it works:
>> >>
>> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get
>> >> rbd_id.hypervtst-
>> >> lun04 /tmp/crap
>> >> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
>> >>
>> >> stat works:
>> >>
>> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat
>> >> rbd_id.hypervtst-
>> >> lun04
>> >> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00,
>> >> size 18
>> >>
>> >>
>> >> I can do a rados ls:
>> >>
>> >> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
>> >> rbd_id.cloud2sql-lun01
>> >> rbd_id.glbcluster3-vm17
>> >> rbd_id.holder   <<<  a create that said it failed while I was debugging 
>> >> this
>> >> rbd_id.pvtcloud-nfs01
>> >> rbd_id.hypervtst-lun05
>> >> rbd_id.test02
>> >> rbd_id.cloud2sql-lun02
>> >> rbd_id.fiotest2
>> >> rbd_id.radmast02-lun04
>> >> rbd_id.hypervtst-lun04
>> >> rbd_id.cloud2fs-lun00
>> >> rbd_id.radmast02-lun03
>> >> rbd_id.hypervtst-lun00
>> >> rbd_id.cloud2sql-lun00
>> >> rbd_id.radmast02-lun02
>> >>
>> >>
>> >> > -Original Message-
>> >> > From: Jason Dillaman [mailto:jdill...@redhat.com]
>> >> > Sent: Monday, 6 June 2016 11:00 AM
>> >> > To: Adrian Saul
>> >> > Cc: ceph-users@lists.ceph.com
>> >> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>> >> >
>> >> > Are you able to successfully run the following command successfully?
>> >> >
>> >> > rados -p glebe-sata get rbd_id.hypervtst-lun04
>> >> >
>> >> >
>> >> >
>> >> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
>> >> > <adrian.s...@tpgtelecom.com.au> wrote:
>> >> > >
>> >> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
>> >

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

Thanks Jason.

I don’t have anything specified explicitly for osd class dir.   I suspect it 
might be related to the OSDs being restarted during the package upgrade process 
before all libraries are upgraded.


> -Original Message-
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: Monday, 6 June 2016 12:37 PM
> To: Adrian Saul
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
> Odd -- sounds like you might have Jewel and Infernalis class objects and
> OSDs intermixed. I would double-check your installation and see if your
> configuration has any overload for "osd class dir".
>
> On Sun, Jun 5, 2016 at 10:28 PM, Adrian Saul
> <adrian.s...@tpgtelecom.com.au> wrote:
> >
> > I have traced it back to an OSD giving this error:
> >
> > 2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open
> > got (5) Input/output error
> > 2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open
> > class /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed):
> > /usr/lib64/rados-classes/libcls_rbd.so: undefined symbol:
> > _ZN4ceph6buffer4list8iteratorC1EPS1_j
> >
> > Trying to figure out why that is the case.
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Adrian Saul
> >> Sent: Monday, 6 June 2016 11:11 AM
> >> To: dilla...@redhat.com
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >>
> >>
> >> No - it throws a usage error - if I add a file argument after it works:
> >>
> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get
> >> rbd_id.hypervtst-
> >> lun04 /tmp/crap
> >> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
> >>
> >> stat works:
> >>
> >> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat
> >> rbd_id.hypervtst-
> >> lun04
> >> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00,
> >> size 18
> >>
> >>
> >> I can do a rados ls:
> >>
> >> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> >> rbd_id.cloud2sql-lun01
> >> rbd_id.glbcluster3-vm17
> >> rbd_id.holder   <<<  a create that said it failed while I was debugging 
> >> this
> >> rbd_id.pvtcloud-nfs01
> >> rbd_id.hypervtst-lun05
> >> rbd_id.test02
> >> rbd_id.cloud2sql-lun02
> >> rbd_id.fiotest2
> >> rbd_id.radmast02-lun04
> >> rbd_id.hypervtst-lun04
> >> rbd_id.cloud2fs-lun00
> >> rbd_id.radmast02-lun03
> >> rbd_id.hypervtst-lun00
> >> rbd_id.cloud2sql-lun00
> >> rbd_id.radmast02-lun02
> >>
> >>
> >> > -Original Message-
> >> > From: Jason Dillaman [mailto:jdill...@redhat.com]
> >> > Sent: Monday, 6 June 2016 11:00 AM
> >> > To: Adrian Saul
> >> > Cc: ceph-users@lists.ceph.com
> >> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >> >
> >> > Are you able to successfully run the following command successfully?
> >> >
> >> > rados -p glebe-sata get rbd_id.hypervtst-lun04
> >> >
> >> >
> >> >
> >> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> >> > <adrian.s...@tpgtelecom.com.au> wrote:
> >> > >
> >> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> >> > > While
> >> > the upgrade went through smoothly (aside from a time wasting
> >> > restorecon /var/lib/ceph in the selinux package upgrade) and the
> >> > services continued running without interruption.  However this
> >> > morning when I went to create some new RBD images I am unable to do
> >> > much at all
> >> with RBD.
> >> > >
> >> > > Just about any rbd command fails with an I/O error.   I can run
> >> > showmapped but that is about it - anything like an ls, info or
> >> > status fails.  This applies to all my pools.
> >> > >
> >> > > I can see no errors in any log files that appear to suggest an
> >> > > issue.  I  have
> >> > also tried the commands on other cluster members that have not done
> >> > anything with RBD before (I was wondering if perhaps the kernel rbd
> >> > was pinning the old l

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

I couldn't find anything wrong with the packages and everything seemed 
installed ok.

Once I restarted the OSDs the directory issue went away but the error started 
moving to other rbd output, and the same class open error occurred on other 
OSDs.  I have gone through and bounced all the OSDs and that seems to have 
cleared the issue.

I am guessing that the restart of the OSDs during the package upgrade is 
occurring before all library packages have been upgraded, so they start with the 
wrong versions loaded, and when these class libraries are dynamically opened 
later they fail.
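
What I'll do on the remaining nodes is hold the daemon restarts until the whole 
package set is in place - a rough sketch, package and unit names are from memory 
for our CentOS 7 hosts:

    # upgrade everything first...
    yum update -y ceph ceph-osd ceph-common librados2 librbd1
    # ...then bounce the OSDs once, and check they rejoin before the next host
    systemctl restart ceph-osd.target
    ceph -s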



> -Original Message-
> From: Adrian Saul
> Sent: Monday, 6 June 2016 12:29 PM
> To: Adrian Saul; dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
>
> I have traced it back to an OSD giving this error:
>
> 2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open got
> (5) Input/output error
> 2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open class
> /usr/lib64/rados-classes/libcls_rbd.so (dlopen failed): /usr/lib64/rados-
> classes/libcls_rbd.so: undefined symbol:
> _ZN4ceph6buffer4list8iteratorC1EPS1_j
>
> Trying to figure out why that is the case.
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: Monday, 6 June 2016 11:11 AM
> > To: dilla...@redhat.com
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >
> >
> > No - it throws a usage error - if I add a file argument after it works:
> >
> > [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-
> > lun04 /tmp/crap
> > [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
> >
> > stat works:
> >
> > [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat
> > rbd_id.hypervtst-
> > lun04
> > glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00,
> > size 18
> >
> >
> > I can do a rados ls:
> >
> > [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> > rbd_id.cloud2sql-lun01
> > rbd_id.glbcluster3-vm17
> > rbd_id.holder   <<<  a create that said it failed while I was debugging this
> > rbd_id.pvtcloud-nfs01
> > rbd_id.hypervtst-lun05
> > rbd_id.test02
> > rbd_id.cloud2sql-lun02
> > rbd_id.fiotest2
> > rbd_id.radmast02-lun04
> > rbd_id.hypervtst-lun04
> > rbd_id.cloud2fs-lun00
> > rbd_id.radmast02-lun03
> > rbd_id.hypervtst-lun00
> > rbd_id.cloud2sql-lun00
> > rbd_id.radmast02-lun02
> >
> >
> > > -Original Message-
> > > From: Jason Dillaman [mailto:jdill...@redhat.com]
> > > Sent: Monday, 6 June 2016 11:00 AM
> > > To: Adrian Saul
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> > >
> > > Are you able to successfully run the following command successfully?
> > >
> > > rados -p glebe-sata get rbd_id.hypervtst-lun04
> > >
> > >
> > >
> > > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> > > <adrian.s...@tpgtelecom.com.au> wrote:
> > > >
> > > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> > > > While
> > > the upgrade went through smoothly (aside from a time wasting
> > > restorecon /var/lib/ceph in the selinux package upgrade) and the
> > > services continued running without interruption.  However this
> > > morning when I went to create some new RBD images I am unable to do
> > > much at all
> > with RBD.
> > > >
> > > > Just about any rbd command fails with an I/O error.   I can run
> > > showmapped but that is about it - anything like an ls, info or
> > > status fails.  This applies to all my pools.
> > > >
> > > > I can see no errors in any log files that appear to suggest an
> > > > issue.  I  have
> > > also tried the commands on other cluster members that have not done
> > > anything with RBD before (I was wondering if perhaps the kernel rbd
> > > was pinning the old library version open or something) but the same
> > > error
> > occurs.
> > > >
> > > > Where can I start trying to resolve this?
> > > >
> > > > Cheers,
> > > >  Adrian
> > > >
> > > >
> > > > [root@ceph-glb-fec-01 ceph]# rbd ls gle

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

I have traced it back to an OSD giving this error:

2016-06-06 12:18:14.315573 7fd714679700 -1 osd.20 23623 class rbd open got (5) 
Input/output error
2016-06-06 12:19:49.835227 7fd714679700  0 _load_class could not open class 
/usr/lib64/rados-classes/libcls_rbd.so (dlopen failed): 
/usr/lib64/rados-classes/libcls_rbd.so: undefined symbol: 
_ZN4ceph6buffer4list8iteratorC1EPS1_j

Trying to figure out why that is the case.
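
A couple of checks I'm doing to see whether the running OSD still has the old 
librados mapped - illustrative only, substitute the actual OSD pid:

    # does the on-disk class plugin resolve against the installed librados?
    ldd /usr/lib64/rados-classes/libcls_rbd.so
    # what librados does the running OSD process actually have mapped?
    grep librados /proc/<osd-pid>/maps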


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Monday, 6 June 2016 11:11 AM
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
>
> No - it throws a usage error - if I add a file argument after it works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-
> lun04 /tmp/crap
> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
>
> stat works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat rbd_id.hypervtst-
> lun04
> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00, size 18
>
>
> I can do a rados ls:
>
> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> rbd_id.cloud2sql-lun01
> rbd_id.glbcluster3-vm17
> rbd_id.holder   <<<  a create that said it failed while I was debugging this
> rbd_id.pvtcloud-nfs01
> rbd_id.hypervtst-lun05
> rbd_id.test02
> rbd_id.cloud2sql-lun02
> rbd_id.fiotest2
> rbd_id.radmast02-lun04
> rbd_id.hypervtst-lun04
> rbd_id.cloud2fs-lun00
> rbd_id.radmast02-lun03
> rbd_id.hypervtst-lun00
> rbd_id.cloud2sql-lun00
> rbd_id.radmast02-lun02
>
>
> > -Original Message-
> > From: Jason Dillaman [mailto:jdill...@redhat.com]
> > Sent: Monday, 6 June 2016 11:00 AM
> > To: Adrian Saul
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >
> > Are you able to successfully run the following command successfully?
> >
> > rados -p glebe-sata get rbd_id.hypervtst-lun04
> >
> >
> >
> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> > <adrian.s...@tpgtelecom.com.au> wrote:
> > >
> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> > > While
> > the upgrade went through smoothly (aside from a time wasting
> > restorecon /var/lib/ceph in the selinux package upgrade) and the
> > services continued running without interruption.  However this morning
> > when I went to create some new RBD images I am unable to do much at all
> with RBD.
> > >
> > > Just about any rbd command fails with an I/O error.   I can run
> > showmapped but that is about it - anything like an ls, info or status
> > fails.  This applies to all my pools.
> > >
> > > I can see no errors in any log files that appear to suggest an
> > > issue.  I  have
> > also tried the commands on other cluster members that have not done
> > anything with RBD before (I was wondering if perhaps the kernel rbd
> > was pinning the old library version open or something) but the same error
> occurs.
> > >
> > > Where can I start trying to resolve this?
> > >
> > > Cheers,
> > >  Adrian
> > >
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-02 ~]# rbd showmapped
> > > id pool   image snap device
> > > 0  glebe-sata test02-/dev/rbd0
> > > 1  glebe-ssd  zfstest   -/dev/rbd1
> > > 10 glebe-sata hypervtst-lun00   -/dev/rbd10
> > > 11 glebe-sata hypervtst-lun02   -/dev/rbd11
> > > 12 glebe-sata hypervtst-lun03   -/dev/rbd12
> > > 13 glebe-ssd  nspprd01_lun00-/dev/rbd13
> > > 14 glebe-sata cirrux-nfs01  -/dev/rbd14
> > > 15 glebe-sata hypervtst-lun04   -/dev/rbd15

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

Seems like my rbd_directory is empty for some reason:

[root@ceph-glb-fec-02 ceph]# rados get -p glebe-sata rbd_directory /tmp/dir
[root@ceph-glb-fec-02 ceph]# strings /tmp/dir
[root@ceph-glb-fec-02 ceph]# ls -la /tmp/dir
-rw-r--r--. 1 root root 0 Jun  6 11:12 /tmp/dir

[root@ceph-glb-fec-02 ceph]# rados stat -p glebe-sata rbd_directory
glebe-sata/rbd_directory mtime 2016-06-06 10:18:28.00, size 0



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Monday, 6 June 2016 11:11 AM
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
>
> No - it throws a usage error - if I add a file argument after it works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-
> lun04 /tmp/crap
> [root@ceph-glb-fec-02 ceph]# cat /tmp/crap 109eb01f5f89de
>
> stat works:
>
> [root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat rbd_id.hypervtst-
> lun04
> glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00, size 18
>
>
> I can do a rados ls:
>
> [root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
> rbd_id.cloud2sql-lun01
> rbd_id.glbcluster3-vm17
> rbd_id.holder   <<<  a create that said it failed while I was debugging this
> rbd_id.pvtcloud-nfs01
> rbd_id.hypervtst-lun05
> rbd_id.test02
> rbd_id.cloud2sql-lun02
> rbd_id.fiotest2
> rbd_id.radmast02-lun04
> rbd_id.hypervtst-lun04
> rbd_id.cloud2fs-lun00
> rbd_id.radmast02-lun03
> rbd_id.hypervtst-lun00
> rbd_id.cloud2sql-lun00
> rbd_id.radmast02-lun02
>
>
> > -Original Message-
> > From: Jason Dillaman [mailto:jdill...@redhat.com]
> > Sent: Monday, 6 June 2016 11:00 AM
> > To: Adrian Saul
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
> >
> > Are you able to successfully run the following command successfully?
> >
> > rados -p glebe-sata get rbd_id.hypervtst-lun04
> >
> >
> >
> > On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> > <adrian.s...@tpgtelecom.com.au> wrote:
> > >
> > > I upgraded my Infernalis semi-production cluster to Jewel on Friday.
> > > While
> > the upgrade went through smoothly (aside from a time wasting
> > restorecon /var/lib/ceph in the selinux package upgrade) and the
> > services continued running without interruption.  However this morning
> > when I went to create some new RBD images I am unable to do much at all
> with RBD.
> > >
> > > Just about any rbd command fails with an I/O error.   I can run
> > showmapped but that is about it - anything like an ls, info or status
> > fails.  This applies to all my pools.
> > >
> > > I can see no errors in any log files that appear to suggest an
> > > issue.  I  have
> > also tried the commands on other cluster members that have not done
> > anything with RBD before (I was wondering if perhaps the kernel rbd
> > was pinning the old library version open or something) but the same error
> occurs.
> > >
> > > Where can I start trying to resolve this?
> > >
> > > Cheers,
> > >  Adrian
> > >
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
> > > rbd: list: (5) Input/output error
> > > 2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing
> > > image in directory: (5) Input/output error
> > > 2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2
> > > images: (5) Input/output error
> > >
> > > [root@ceph-glb-fec-02 ~]# rbd showmapped
> > > id pool   image snap device
> > > 0  glebe-sata test02-/dev/rbd0
> > > 1  glebe-ssd  zfstest   -/dev/rbd1
> > > 10 glebe-sata hypervtst-lun00   -/dev/rbd10
> > > 11 glebe-sata hypervtst-lun02   -/dev/rbd11
> > > 12 glebe-sata hypervtst-lun03   -/dev/rbd12
> > > 13 glebe-ssd  nspprd01_lun00-/dev/rbd13
> > > 14 glebe-sata cirrux-nfs01  -/dev/rbd14
> > > 15 glebe-sata hypervtst-lun04   -/dev/rbd15
> > > 16 glebe-sat

Re: [ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

No - it throws a usage error - if I add a file argument, it works:

[root@ceph-glb-fec-02 ceph]# rados -p glebe-sata get rbd_id.hypervtst-lun04 
/tmp/crap
[root@ceph-glb-fec-02 ceph]# cat /tmp/crap
109eb01f5f89de

stat works:

[root@ceph-glb-fec-02 ceph]# rados -p glebe-sata stat rbd_id.hypervtst-lun04
glebe-sata/rbd_id.hypervtst-lun04 mtime 2016-06-06 10:55:08.00, size 18


I can do a rados ls:

[root@ceph-glb-fec-02 ceph]# rados ls -p glebe-sata|grep rbd_id
rbd_id.cloud2sql-lun01
rbd_id.glbcluster3-vm17
rbd_id.holder   <<<  a create that said it failed while I was debugging this
rbd_id.pvtcloud-nfs01
rbd_id.hypervtst-lun05
rbd_id.test02
rbd_id.cloud2sql-lun02
rbd_id.fiotest2
rbd_id.radmast02-lun04
rbd_id.hypervtst-lun04
rbd_id.cloud2fs-lun00
rbd_id.radmast02-lun03
rbd_id.hypervtst-lun00
rbd_id.cloud2sql-lun00
rbd_id.radmast02-lun02


> -Original Message-
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: Monday, 6 June 2016 11:00 AM
> To: Adrian Saul
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Jewel upgrade - rbd errors after upgrade
>
> Are you able to successfully run the following command successfully?
>
> rados -p glebe-sata get rbd_id.hypervtst-lun04
>
>
>
> On Sun, Jun 5, 2016 at 8:49 PM, Adrian Saul
> <adrian.s...@tpgtelecom.com.au> wrote:
> >
> > I upgraded my Infernalis semi-production cluster to Jewel on Friday.  While
> the upgrade went through smoothly (aside from a time wasting restorecon
> /var/lib/ceph in the selinux package upgrade) and the services continued
> running without interruption.  However this morning when I went to create
> some new RBD images I am unable to do much at all with RBD.
> >
> > Just about any rbd command fails with an I/O error.   I can run
> showmapped but that is about it - anything like an ls, info or status fails.  
> This
> applies to all my pools.
> >
> > I can see no errors in any log files that appear to suggest an issue.  I  
> > have
> also tried the commands on other cluster members that have not done
> anything with RBD before (I was wondering if perhaps the kernel rbd was
> pinning the old library version open or something) but the same error occurs.
> >
> > Where can I start trying to resolve this?
> >
> > Cheers,
> >  Adrian
> >
> >
> > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
> > rbd: list: (5) Input/output error
> > 2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing image
> > in directory: (5) Input/output error
> > 2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2
> > images: (5) Input/output error
> >
> > [root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
> > rbd: list: (5) Input/output error
> > 2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing image
> > in directory: (5) Input/output error
> > 2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2
> > images: (5) Input/output error
> >
> > [root@ceph-glb-fec-02 ~]# rbd showmapped
> > id pool   image snap device
> > 0  glebe-sata test02-/dev/rbd0
> > 1  glebe-ssd  zfstest   -/dev/rbd1
> > 10 glebe-sata hypervtst-lun00   -/dev/rbd10
> > 11 glebe-sata hypervtst-lun02   -/dev/rbd11
> > 12 glebe-sata hypervtst-lun03   -/dev/rbd12
> > 13 glebe-ssd  nspprd01_lun00-/dev/rbd13
> > 14 glebe-sata cirrux-nfs01  -/dev/rbd14
> > 15 glebe-sata hypervtst-lun04   -/dev/rbd15
> > 16 glebe-sata hypervtst-lun05   -/dev/rbd16
> > 17 glebe-sata pvtcloud-nfs01-/dev/rbd17
> > 18 glebe-sata cloud2sql-lun00   -/dev/rbd18
> > 19 glebe-sata cloud2sql-lun01   -/dev/rbd19
> > 2  glebe-sata radmast02-lun00   -/dev/rbd2
> > 20 glebe-sata cloud2sql-lun02   -/dev/rbd20
> > 21 glebe-sata cloud2fs-lun00-/dev/rbd21
> > 22 glebe-sata cloud2fs-lun01-/dev/rbd22
> > 3  glebe-sata radmast02-lun01   -/dev/rbd3
> > 4  glebe-sata radmast02-lun02   -/dev/rbd4
> > 5  glebe-sata radmast02-lun03   -/dev/rbd5
> > 6  glebe-sata radmast02-lun04   -/dev/rbd6
> > 7  glebe-ssd  sybase_iquser02_lun00 -/dev/rbd7
> > 8  glebe-ssd  sybase_iquser03_lun00 -/dev/rbd8
> > 9  glebe-ssd  sybase_iquser04_lun00 -/dev/rbd9
> >
> > [root@ceph-glb-fec-02 ~]# rbd status glebe-sata/hypervtst-lun04
> > 2016-06-06 10:47:30.221453 7fc0030dc700 -1 librbd::image::OpenRequest:
> > failed to retrieve image id: (5) Input/output error
> > 2016-06-06 10:47:30.221556 7fc0028db700 -1 librbd::ImageState: failed

[ceph-users] Jewel upgrade - rbd errors after upgrade

2016-06-05 Thread Adrian Saul

I upgraded my Infernalis semi-production cluster to Jewel on Friday.  The 
upgrade went through smoothly (aside from a time-wasting restorecon of 
/var/lib/ceph in the selinux package upgrade) and the services continued 
running without interruption.  However, this morning when I went to create some 
new RBD images I found I am unable to do much at all with RBD.

Just about any rbd command fails with an I/O error.   I can run showmapped but 
that is about it - anything like an ls, info or status fails.  This applies to 
all my pools.

I can see no errors in any log files that appear to suggest an issue.  I  have 
also tried the commands on other cluster members that have not done anything 
with RBD before (I was wondering if perhaps the kernel rbd was pinning the old 
library version open or something) but the same error occurs.

Where can I start trying to resolve this?

Cheers,
 Adrian


[root@ceph-glb-fec-01 ceph]# rbd ls glebe-sata
rbd: list: (5) Input/output error
2016-06-06 10:41:31.792720 7f53c06a2d80 -1 librbd: error listing image in 
directory: (5) Input/output error
2016-06-06 10:41:31.792749 7f53c06a2d80 -1 librbd: error listing v2 images: (5) 
Input/output error

[root@ceph-glb-fec-01 ceph]# rbd ls glebe-ssd
rbd: list: (5) Input/output error
2016-06-06 10:41:33.956648 7f90de663d80 -1 librbd: error listing image in 
directory: (5) Input/output error
2016-06-06 10:41:33.956672 7f90de663d80 -1 librbd: error listing v2 images: (5) 
Input/output error

[root@ceph-glb-fec-02 ~]# rbd showmapped
id pool   image snap device
0  glebe-sata test02-/dev/rbd0
1  glebe-ssd  zfstest   -/dev/rbd1
10 glebe-sata hypervtst-lun00   -/dev/rbd10
11 glebe-sata hypervtst-lun02   -/dev/rbd11
12 glebe-sata hypervtst-lun03   -/dev/rbd12
13 glebe-ssd  nspprd01_lun00-/dev/rbd13
14 glebe-sata cirrux-nfs01  -/dev/rbd14
15 glebe-sata hypervtst-lun04   -/dev/rbd15
16 glebe-sata hypervtst-lun05   -/dev/rbd16
17 glebe-sata pvtcloud-nfs01-/dev/rbd17
18 glebe-sata cloud2sql-lun00   -/dev/rbd18
19 glebe-sata cloud2sql-lun01   -/dev/rbd19
2  glebe-sata radmast02-lun00   -/dev/rbd2
20 glebe-sata cloud2sql-lun02   -/dev/rbd20
21 glebe-sata cloud2fs-lun00-/dev/rbd21
22 glebe-sata cloud2fs-lun01-/dev/rbd22
3  glebe-sata radmast02-lun01   -/dev/rbd3
4  glebe-sata radmast02-lun02   -/dev/rbd4
5  glebe-sata radmast02-lun03   -/dev/rbd5
6  glebe-sata radmast02-lun04   -/dev/rbd6
7  glebe-ssd  sybase_iquser02_lun00 -/dev/rbd7
8  glebe-ssd  sybase_iquser03_lun00 -/dev/rbd8
9  glebe-ssd  sybase_iquser04_lun00 -/dev/rbd9

[root@ceph-glb-fec-02 ~]# rbd status glebe-sata/hypervtst-lun04
2016-06-06 10:47:30.221453 7fc0030dc700 -1 librbd::image::OpenRequest: failed 
to retrieve image id: (5) Input/output error
2016-06-06 10:47:30.221556 7fc0028db700 -1 librbd::ImageState: failed to open 
image: (5) Input/output error
rbd: error opening image hypervtst-lun04: (5) Input/output error
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2 networks vs 2 NICs

2016-06-04 Thread Adrian Sevcenco

On 06/04/2016 06:12 PM, Nick Fisk wrote:

Yes, this is fine. I currently use 2 bonded 10G nics which have the
untagged vlan as the public network and a tagged vlan as the cluster
network.

Thanks for the info!


However, when I build my next cluster I will probably forgo the
separate cluster network and just run them over the same IP, as after
running the cluster, I don't see any benefit from separate networks
when taking into account the extra complexity. Something to
consider.

Well, I was thinking of confining the internal chatter to a specific VLAN.
OTOH all the nodes will stay on the same switch, so I am not sure if it's
worth it.

Thank you!
Adrian




-Original Message- From: ceph-users
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian
Sevcenco Sent: 04 June 2016 16:11 To: ceph-users@lists.ceph.com
Subject: [ceph-users] 2 networks vs 2 NICs

Hi! I seen in discussion and in documentation that "networks" is
used interchangeable with "NIC" (which also is a different thing
than interface) .. So, my question is :for an OSD server with 24
OSDs with a single 40 GB NIC would be ok to have a public network
on the main interface and a vlan (virtual) interface for the
cluster network?

Thank you! Adrian








smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 2 networks vs 2 NICs

2016-06-04 Thread Adrian Sevcenco

Hi! I have seen in discussions and in the documentation that "network" is used
interchangeably with "NIC" (which is also a different thing from an interface).
So, my question is: for an OSD server with 24 OSDs and a single 40 Gb NIC,
would it be ok to have the public network on the main interface and a VLAN
(virtual) interface for the cluster network?
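
Something like this is what I have in mind - just a sketch, the subnets and
interface names are made up for illustration:

    # /etc/ceph/ceph.conf
    [global]
    public network  = 192.168.10.0/24    # untagged interface, e.g. enp1s0
    cluster network = 192.168.20.0/24    # tagged VLAN interface, e.g. enp1s0.20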

Thank you!
Adrian



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best Network Switches for Redundancy

2016-06-01 Thread Adrian Saul

> > For two links it should be quite good - it seemed to balance across
> > that quite well, but with 4 links it seemed to really prefer 2 in my case.
> >
> Just for the record, did you also change the LACP policies on the switches?
>
> From what I gather, having fancy pants L3+4 hashing on the Linux side will not
> fix imbalances by itself, the switches need to be configured likewise.

Yes - I was changing policies on both sides in the same way, but it seemed that 
the way the OSDs selected their service ports just happened to hash consistently 
to the same links.  There just wasn't enough variation in the combinations of 
L3+L4 or even L2 hash output to utilise more of the links (the even numbered 
ports and consistent IP pairs just kept returning the same link output for the 
hash algorithm).  Some of the more simplistic round robin methods might have got 
better results, but I didn't want to rely on those for future scalability.

In a larger scale deployment with more clients or a wider pool of OSDs that 
would probably not be the case as there would be greater distribution of hash 
inputs.  Just something to be aware of when you look to do LACP with more than 
2 links.



>
> Christian
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of David Riedl
> > > Sent: Thursday, 2 June 2016 2:12 AM
> > > To: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] Best Network Switches for Redundancy
> > >
> > >
> > > > 4. As Ceph has lots of connections on lots of IP's and port's,
> > > > LACP or the Linux ALB mode should work really well to balance
> connections.
> > > Linux ALB Mode looks promising. Does that work with two switches?
> > > Each server has 4 ports which are 'splitted' and connected to each switch.
> > >  _
> > >/ _[switch]
> > >   / /  ||
> > > [server] ||
> > >  \ \_ ||
> > >   \__[switch]
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > Confidentiality: This email and any attachments are confidential and
> > may be subject to copyright, legal or some other professional privilege.
> > They are intended solely for the attention and use of the named
> > addressee(s). They may only be copied, distributed or disclosed with
> > the consent of the copyright owner. If you have received this email by
> > mistake or by breach of the confidentiality clause, please notify the
> > sender immediately by return email and delete or destroy all copies of
> > the email. Any confidentiality, privilege or copyright is not waived
> > or lost because this email has been sent to you by mistake.
> > ___ ceph-users
> mailing
> > list ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best Network Switches for Redundancy

2016-06-01 Thread Adrian Saul

I am currently running our Ceph POC environment using dual Nexus 9372TX 10G-T 
switches; each OSD host has two connections to each switch, formed into a single 
4-link VPC (MC-LAG) which is bonded under LACP on the host side.

What I have noticed is that the various hashing policies for LACP do not 
guarantee you will make full use of all the links.  I tried various policies 
and from what I could see the normal L3+L4 IP and port hashing generally worked 
as well as anything else, but if you have lots of similar connections it 
doesn't seem to hash across all the links - say, 2 will be heavily used while 
not much is hashed onto the other links.  This might have just been because it 
was a fairly small pool of IPs and fairly similar port numbers that happened to 
keep hashing to the same links (I ended up going to the point of tcpdumping 
traffic and scripting a calculation of which link it should use; it just 
happened to be that consistent).

For two links it should be quite good - it seemed to balance across that quite 
well, but with 4 links it seemed to really prefer 2 in my case.
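
For reference, the bond configuration I ended up with was roughly this - from 
memory, interface names and values are illustrative:

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    BONDING_OPTS="mode=802.3ad xmit_hash_policy=layer3+4 miimon=100 lacp_rate=fast"

    # confirm which policy the kernel is actually using
    grep -i "hash policy" /proc/net/bonding/bond0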


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> David Riedl
> Sent: Thursday, 2 June 2016 2:12 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Best Network Switches for Redundancy
>
>
> > 4. As Ceph has lots of connections on lots of IP's and port's, LACP or
> > the Linux ALB mode should work really well to balance connections.
> Linux ALB Mode looks promising. Does that work with two switches? Each
> server has 4 ports which are 'splitted' and connected to each switch.
>  _
>/ _[switch]
>   / /  ||
> [server] ||
>  \ \_ ||
>   \__[switch]
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems

2016-06-01 Thread Adrian Saul

Also if for political reasons you need a “vendor” solution – ask Dell about 
their DSS 7000 servers – 90 8TB  disks and two compute nodes in 4RU would go a 
long way to making up a multi-PB Ceph solution.

Supermicro also do similar solutions with 36, 60 and 90 disk models in 4RU.

Cisco has C3260s which are about 60 disks depending on config.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jack 
Makenz
Sent: Monday, 30 May 2016 3:56 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Fwd: [Ceph-community] Wasting the Storage capacity when 
using Ceph based On high-end storage systems


Forwarded conversation
Subject: Wasting the Storage capacity when using Ceph based On high-end storage 
systems


From: Jack Makenz >
Date: Sun, May 29, 2016 at 6:52 PM
To: ceph-commun...@lists.ceph.com

Hello All,
There is a serious problem with ceph that may waste storage capacity when using 
high-end storage systems (Hitachi, IBM, EMC, HP, ...) as the back-end for OSD 
hosts.

Imagine in a real cloud we need n petabytes of storage capacity, an amount that 
commodity hardware or the OSD servers' local hard disks can't provide; thus we 
have to use storage systems as the back-end for the OSD hosts (to run the OSD 
daemons on).

But because almost all of these storage systems (regardless of brand) use RAID 
technology, and ceph also replicates at least two copies of each object, a 
large amount of storage capacity is wasted.

So is there any solution to this problem/misunderstanding?

Regards
Jack Makenz

--
From: Nate Curry >
Date: Mon, May 30, 2016 at 5:50 AM
To: Jack Makenz >
Cc: Unknown 
>


I think the purpose of ceph is to get away from having to rely on high end 
storage systems and to provide the capacity to utilize multiple less expensive 
servers as the storage system.

That being said, you should still be able to use the high end storage systems 
with or without RAID enabled.  You could do away with RAID altogether and let 
Ceph handle the redundancy, or you can have LUNs assigned to hosts and put into 
use as OSDs.  You could make it work either way, but to get the most out of your 
storage with Ceph I think a non-RAID configuration would be best.

Nate Curry
___
Ceph-community mailing list
ceph-commun...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com

--
From: Doug Dressler >
Date: Mon, May 30, 2016 at 6:02 AM
To: Nate Curry >
Cc: Jack Makenz >, Unknown 
>

For non-technical reasons I had to run ceph initially using SAN disks.

Lesson learned:

Make sure deduplication is disabled on the SAN :-)



--
From: Jack Makenz >
Date: Mon, May 30, 2016 at 9:05 AM
To: Nate Curry >, 
ceph-commun...@lists.ceph.com

Thanks Nate,
But as I mentioned before, providing petabytes of storage capacity on commodity 
hardware or enterprise servers is almost impossible. Of course it is possible by 
installing hundreds of servers with 3 terabyte hard disks, but that solution 
wastes data centre raised-floor space, power and also money :)


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] seqwrite gets good performance but random rw gets worse

2016-05-25 Thread Adrian Saul

Sync will always be lower - it causes each write to wait for the previous 
writes to complete before issuing more, so it effectively throttles writes to a 
queue depth of 1.
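
If you want to see the same effect outside sysbench, a quick way is to compare 
a synced queue-depth-1 run against a deeper queue - a sketch only, sizes and 
file names are illustrative:

    # every write synced, queue depth 1 - comparable to file-fsync-freq=1
    fio --name=syncqd1 --rw=randwrite --bs=16k --size=1g --ioengine=libaio --direct=1 --sync=1 --iodepth=1

    # no per-write sync, deeper queue - closer to file-fsync-freq=0
    fio --name=qd32 --rw=randwrite --bs=16k --size=1g --ioengine=libaio --direct=1 --iodepth=32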



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ken 
Peng
Sent: Wednesday, 25 May 2016 6:36 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] seqwrite gets good performance but random rw gets 
worse

Hi again,
When run with file-fsync-freq=1 (fsync after every write) versus 
file-fsync-freq=0 (sysbench never calls fsync), the results differ hugely 
(382.94Kb/sec versus 25.921Mb/sec).
What do you make of it? Thanks.

file-fsync-freq=1,
# sysbench --test=fileio --file-total-size=5G --file-test-mode=rndrw 
--init-rng=on --max-time=300 --max-requests=0 --file-fsync-freq=1 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  4309 Read, 2873 Write, 367707 Other = 374889 Total
Read 67.328Mb  Written 44.891Mb  Total transferred 112.22Mb  (382.94Kb/sec)
   23.93 Requests/sec executed

Test execution summary:
total time:  300.0782s
total number of events:  7182
total time taken by event execution: 2.3207
per-request statistics:
 min:  0.01ms
 avg:  0.32ms
 max: 80.17ms
 approx.  95 percentile:   1.48ms

Threads fairness:
events (avg/stddev):   7182./0.00
execution time (avg/stddev):   2.3207/0.00


file-fsync-freq=0,

# sysbench --test=fileio --file-total-size=5G --file-test-mode=rndrw 
--init-rng=on --max-time=300 --max-requests=0 --file-fsync-freq=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  298613 Read, 199075 Write, 0 Other = 497688 Total
Read 4.5565Gb  Written 3.0376Gb  Total transferred 7.5941Gb  (25.921Mb/sec)
 1658.93 Requests/sec executed

Test execution summary:
total time:  300.0049s
total number of events:  497688
total time taken by event execution: 299.7026
per-request statistics:
 min:  0.00ms
 avg:  0.60ms
 max:   2211.13ms
 approx.  95 percentile:   1.21ms

Threads fairness:
events (avg/stddev):   497688./0.00
execution time (avg/stddev):   299.7026/0.00

2016-05-25 15:01 GMT+08:00 Ken Peng >:
Hello,
We have a cluster with 20+ hosts and 200+ OSDs, each 4T SATA disk for an OSD, 
no SSD cache.
OS is Ubuntu 16.04 LTS, ceph version 10.2.0
Both data network and cluster network are 10Gbps.
We run ceph as block storage service only (rbd client within VM).
For testing within a VM with sysbench tool, we see that the seqwrite has a 
relatively good performance, it can reach 170.37Mb/sec, but random read/write 
always gets bad result, it can be only 474.63Kb/sec (shown as below).

Can you help give the idea why the random IO is so worse? Thanks.
This is what sysbench outputs,

# sysbench --test=fileio --file-total-size=5G prepare
sysbench 0.4.12:  multi-threaded system evaluation benchmark

128 files, 40960Kb each, 5120Mb total
Creating files for the test...


# sysbench --test=fileio --file-total-size=5G --file-test-mode=seqwr 
--init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.

Operations performed:  0 Read, 327680 Write, 128 Other = 327808 Total
Read 0b  Written 5Gb  Total transferred 5Gb  

Re: [ceph-users] seqwrite gets good performance but random rw gets worse

2016-05-25 Thread Adrian Saul

Are you using image-format 2 RBD images?

We found a major performance hit using format 2 images under 10.2.0 today in 
some testing.  When we switched to using format 1 images we literally got 10x 
the random write IOPS (roughly 1600 IOPS up to ten times that for the same test).
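
If you want to reproduce the comparison, it was essentially this - a sketch, 
pool and image names are made up:

    # format 1 (old layout)
    rbd create --image-format 1 --size 102400 testpool/fmt1test
    # format 2 (the default in Jewel)
    rbd create --image-format 2 --size 102400 testpool/fmt2test
    rbd info testpool/fmt1test    # confirm "format: 1" before benchmarking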



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ken 
Peng
Sent: Wednesday, 25 May 2016 5:02 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] seqwrite gets good performance but random rw gets worse

Hello,
We have a cluster with 20+ hosts and 200+ OSDs, each 4T SATA disk for an OSD, 
no SSD cache.
OS is Ubuntu 16.04 LTS, ceph version 10.2.0
Both data network and cluster network are 10Gbps.
We run ceph as block storage service only (rbd client within VM).
For testing within a VM with sysbench tool, we see that the seqwrite has a 
relatively good performance, it can reach 170.37Mb/sec, but random read/write 
always gets bad result, it can be only 474.63Kb/sec (shown as below).

Can you help give the idea why the random IO is so worse? Thanks.
This is what sysbench outputs,

# sysbench --test=fileio --file-total-size=5G prepare
sysbench 0.4.12:  multi-threaded system evaluation benchmark

128 files, 40960Kb each, 5120Mb total
Creating files for the test...


# sysbench --test=fileio --file-total-size=5G --file-test-mode=seqwr 
--init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.

Operations performed:  0 Read, 327680 Write, 128 Other = 327808 Total
Read 0b  Written 5Gb  Total transferred 5Gb  (170.37Mb/sec)
10903.42 Requests/sec executed

Test execution summary:
total time:  30.0530s
total number of events:  327680
total time taken by event execution: 28.5936
per-request statistics:
 min:  0.01ms
 avg:  0.09ms
 max:192.84ms
 approx.  95 percentile:   0.03ms

Threads fairness:
events (avg/stddev):   327680./0.00
execution time (avg/stddev):   28.5936/0.00



# sysbench --test=fileio --file-total-size=5G --file-test-mode=rndrw 
--init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.


Extra file open flags: 0
128 files, 40Mb each
5Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!

Time limit exceeded, exiting...
Done.

Operations performed:  5340 Read, 3560 Write, 11269 Other = 20169 Total
Read 83.438Mb  Written 55.625Mb  Total transferred 139.06Mb  (474.63Kb/sec)
   29.66 Requests/sec executed

Test execution summary:
total time:  300.0216s
total number of events:  8900
total time taken by event execution: 6.4774
per-request statistics:
 min:  0.01ms
 avg:  0.73ms
 max: 90.18ms
 approx.  95 percentile:   1.60ms

Threads fairness:
events (avg/stddev):   8900./0.00
execution time (avg/stddev):   6.4774/0.00
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD removal issue

2016-05-23 Thread Adrian Saul

Thanks - all sorted.
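
For the record, the manual cleanup was roughly along these lines (per the 
cephnotes article Nick linked) - treat it as a sketch from my case; I'd 
double-check the omap key names in rbd_directory with listomapkeys before 
removing anything:

    rados -p glebe-sata listomapkeys rbd_directory
    rados -p glebe-sata rmomapkey rbd_directory name_oemprd01db_lun00
    rados -p glebe-sata rmomapkey rbd_directory id_8d4ca65a5db37
    rados -p glebe-sata rm rbd_header.8d4ca65a5db37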


> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Monday, 23 May 2016 6:58 PM
> To: Adrian Saul; ceph-users@lists.ceph.com
> Subject: RE: RBD removal issue
>
> See here:
>
> http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Adrian Saul
> > Sent: 23 May 2016 09:37
> > To: 'ceph-users@lists.ceph.com' <ceph-users@lists.ceph.com>
> > Subject: [ceph-users] RBD removal issue
> >
> >
> > A while back I attempted to create an RBD volume manually - intending
> > it
> to
> > be an exact size of another LUN around 100G.  The command line instead
> > took this to be the default MB argument for size and so I ended up
> > with a
> > 102400 TB volume.  Deletion was painfully slow (I never used the
> > volume,
> it
> > just seemed to spin on CPU for ages going through all the objects it
> thought it
> > had) and the rbd rm command was interrupted a few times, but even
> > after running for two months it still wont complete.
> >
> > I still have the volume listed even though it appears to be otherwise
> > gone from the RADOS view.  From what I can see there is only the
> > rbd_header object remaining - can I just remove that directly or am I
> > risking
> corrupting
> > something else by not removing it using rbd rm?
> >
> > Cheers,
> >  Adrian
> >
> >
> > [root@ceph-glb-fec-01 ~]# rbd info glebe-sata/oemprd01db_lun00 rbd
> > image 'oemprd01db_lun00':
> > size 102400 TB in 26843545600 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.8d4ca65a5db37
> > format: 2
> > features: layering
> > flags:
> > [root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep
> > rbd_data.8d4ca65a5db37
> > [root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep 8d4ca65a5db37
> > rbd_header.8d4ca65a5db37
> >
> > Confidentiality: This email and any attachments are confidential and
> > may
> be
> > subject to copyright, legal or some other professional privilege. They
> > are intended solely for the attention and use of the named
> > addressee(s). They may only be copied, distributed or disclosed with
> > the consent of the copyright owner. If you have received this email by
> > mistake or by breach
> of
> > the confidentiality clause, please notify the sender immediately by
> > return email and delete or destroy all copies of the email. Any
> > confidentiality, privilege or copyright is not waived or lost because
> > this email has been
> sent
> > to you by mistake.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD removal issue

2016-05-23 Thread Adrian Saul

A while back I attempted to create an RBD volume manually - intending it to be 
an exact size of another LUN, around 100G.  The command line instead took this 
to be the default MB argument for size and so I ended up with a 102400 TB 
volume.  Deletion was painfully slow (I never used the volume, it just seemed 
to spin on CPU for ages going through all the objects it thought it had) and 
the rbd rm command was interrupted a few times, but even after running for two 
months it still won't complete.

I still have the volume listed even though it appears to be otherwise gone from 
the RADOS view.  From what I can see there is only the rbd_header object 
remaining - can I just remove that directly or am I risking corrupting 
something else by not removing it using rbd rm?

Cheers,
 Adrian


[root@ceph-glb-fec-01 ~]# rbd info glebe-sata/oemprd01db_lun00
rbd image 'oemprd01db_lun00':
size 102400 TB in 26843545600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8d4ca65a5db37
format: 2
features: layering
flags:
[root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep rbd_data.8d4ca65a5db37
[root@ceph-glb-fec-01 ~]# rados ls -p glebe-sata|grep 8d4ca65a5db37
rbd_header.8d4ca65a5db37
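
As a side note, the numbers above are consistent with the size having been given in bytes but parsed as the default MB unit - a rough sanity check in Python (a reconstruction, not the original command):

    MB = 1024 * 1024
    intended_bytes = 100 * 1024**3           # the ~100G LUN that was intended
    resulting_bytes = intended_bytes * MB    # the same number interpreted as megabytes
    print(resulting_bytes / 1024.0**4)       # ~102400.0 TB, matching 'rbd info'
    print(resulting_bytes // (4 * MB))       # 26843545600 objects at 4MB each

If the data objects really are gone, it may be worth checking what bookkeeping "rbd ls" still relies on with "rados -p glebe-sata listomapkeys rbd_directory" before touching rbd_header - a suggestion only, not a tested cleanup procedure.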

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVRAM cards as OSD journals

2016-05-22 Thread Adrian Saul

I am using Intel P3700DC 400G cards in a similar configuration (two per host) - 
perhaps you could look at cards of that capacity to meet your needs.

I would suggest that having such small journals would mean you will be constantly 
blocking on journal flushes, which will impact write performance and latency; 
you would be better off with larger journals to accommodate the expected 
throughput you are after.

Also for redundancy I would suggest more than a single journal - if you lose 
the journal you will need to rebuild all the OSDs on the host, which will be a 
significant performance impact and, depending on your replication level, opens up 
the risk of data loss should another OSD fail for whatever reason.




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Saturday, 21 May 2016 1:53 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] NVRAM cards as OSD journals

Hi,

I am contemplating using an NVRAM card for OSD journals in place of SSD drives 
in our ceph cluster.

Configuration:

* 4 Ceph servers

* Each server has 24 OSDs (each OSD is a 1TB SAS drive)

* 1 PCIe NVRAM card of 16GB capacity per ceph server

* Both Client & cluster network is 10Gbps

As per ceph documents:
The expected throughput number should include the expected disk throughput 
(i.e., sustained data transfer rate), and network throughput. For example, a 
7200 RPM disk will likely have approximately 100 MB/s. Taking the min() of the 
disk and network throughput should provide a reasonable expected throughput. 
Some users just start off with a 10GB journal size. For example:
osd journal size = 10000
Given that I have a single 16GB card per server that has to be carved among all 
24 OSDs, I will have to configure each OSD journal to be much smaller, around 
600MB, i.e., 16GB/24 drives.  This value is much smaller than the 10GB/OSD journal 
that is generally used.  So, I am wondering if this configuration and journal 
size is valid.  Is there a performance benefit of having a journal that is this 
small?  Also, do I have to reduce the default "filestore max sync interval" from 
5 seconds to a smaller value, say 2 seconds, to match the smaller journal size?

Have people used NVRAM cards in the Ceph clusters as journals?  What is their 
experience?

Any thoughts?
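
As a rough sketch of the sizing rule from the Ceph docs ("osd journal size = 2 * expected throughput * filestore max sync interval"), using the figures from this mail - the throughput numbers below are assumptions, not measurements:

    # per-OSD journal sizing, throughput in MB/s
    disk_throughput = 100                    # ~7200 RPM SAS drive
    network_throughput = 1250                # 10Gbps client link shared by 24 OSDs
    filestore_max_sync_interval = 5          # seconds (the default)

    expected = min(disk_throughput, network_throughput / 24.0)
    journal_mb = 2 * expected * filestore_max_sync_interval
    print(journal_mb)                        # ~520 MB, in the same ballpark as 16GB/24

With a shorter sync interval (e.g. 2 seconds) the result drops further, which is presumably why the "filestore max sync interval" question comes up.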



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fibre channel as ceph storage interconnect

2016-04-22 Thread Adrian Saul
> from the responses I've gotten, it looks like there's no viable option to use
> fibre channel as an interconnect between the nodes of the cluster.
> Would it be worth while development effort to establish a block protocol
> between the nodes so that something like fibre channel could be used to
> communicate internally?  Unless I'm waaay wrong (And I'm seldom *that*
> wrong), it would not be worth the effort.  I won't even feature request it.
> Looks like I'll have to look into infiniband or CE, and possibly migrate away
> from Fibre Channel, even though it kinda just works, and therefore I really
> like it :(

I would think even conceptually it would be a mess -  FC as a peer to peer 
network fabric might be useful (in many ways I like it a lot better than 
Ethernet), but you would have to develop an entire transport protocol over it 
(the normal SCSI model would be useless) for Ceph and then write that in to 
replace any of the network code in the existing Ceph code base.

A lot of work for something that is probably easier done swapping your FC HBAs 
for 10G NICs or IB HBAs.

>
> On Thu, Apr 21, 2016 at 11:06 PM, Schlacta, Christ 
> wrote:
> > My primary motivations are:
> > Most of my systems that I want to use with ceph already have fibre
> > Chantel cards and infrastructure, and more infrastructure is
> > incredibly cheap compared to infiniband or {1,4}0gbe cards and
> > infrastructure Most of my systems are expansion slot constrained, and
> > I'd be forced to pick one or the other anyway.
> >
> > On Thu, Apr 21, 2016 at 9:28 PM, Paul Evans  wrote:
> >> In today’s world, OSDs communicate via IP and only IP*. Some
> >> FiberChannel switches and HBAs  support IP-over-FC, but it’s about
> >> 0.02% of the FC deployments.
> >> Therefore, one could technically use FC, but it does’t appear to
> >> offer enough benefit to OSD operations to justify the unique architecture.
> >>
> >> What is your motivation to leverage FC behind OSDs?
> >>
> >> -Paul
> >>
> >> *Ceph on native Infiniband may be available some day, but it seems
> >> impractical with the current releases. IP-over-IB is also known to work.
> >>
> >>
> >> On Apr 21, 2016, at 8:12 PM, Schlacta, Christ 
> wrote:
> >>
> >> Is it possible?  Can I use fibre channel to interconnect my ceph OSDs?
> >> Intuition tells me it should be possible, yet experience (Mostly with
> >> fibre channel) tells me no.  I don't know enough about how ceph works
> >> to know for sure.  All my googling returns results about using ceph
> >> as a BACKEND for exporting fibre channel LUNs, which is, sadly, not
> >> what I'm looking for at the moment.
> >>
> >>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fibre channel as ceph storage interconnect

2016-04-21 Thread Adrian Saul

I could only see it being done using FCIP as the OSD processes use IP to 
communicate.

I guess it would depend on why you are looking to use something like FC instead 
of Ethernet or IB.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Schlacta, Christ
> Sent: Friday, 22 April 2016 1:12 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] fibre channel as ceph storage interconnect
>
> Is it possible?  Can I use fibre channel to interconnect my ceph OSDs?
>  Intuition tells me it should be possible, yet experience (Mostly with fibre
> channel) tells me no.  I don't know enough about how ceph works to know
> for sure.  All my googling returns results about using ceph as a BACKEND for
> exporting fibre channel LUNs, which is, sadly, not what I'm looking for at the
> moment.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Adrian Saul

At this stage the RGW component is down the line - pretty much just a concept 
while we build out the RBD side first.

What I wanted to get out of EC was distributing the data across multiple DCs 
such that we were not simply replicating data - which would give us much better 
storage efficiency and redundancy.  Some of what I had read in the past was 
around using EC to spread data over multiple DCs to be able to sustain loss of 
multiple sites.  Most of this was implied fairly clearly in the documentation 
under "CHEAP MULTIDATACENTER STORAGE":

http://docs.ceph.com/docs/hammer/dev/erasure-coded-pool/

Although I note that section appears to have disappeared in the later 
documentation versions.

It seems a little disheartening that much of this promise and capability for 
Ceph appears to be just not there in practice.






> -Original Message-
> From: Maxime Guyot [mailto:maxime.gu...@elits.com]
> Sent: Tuesday, 12 April 2016 5:49 PM
> To: Adrian Saul; Christian Balzer; 'ceph-users@lists.ceph.com'
> Subject: Re: [ceph-users] Mon placement over wide area
>
> Hi Adrian,
>
> Looking at the documentation RadosGW has multi region support with the
> “federated gateways”
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical
> locales, configuring Ceph Object Gateway regions and metadata
> synchronization agents enables the service to maintain a global namespace,
> even though Ceph Object Gateway instances run in different geographic
> locales and potentially on different Ceph Storage Clusters.”
>
> Maybe that could do the trick for your multi metro EC pools?
>
> Disclaimer: I haven't tested the federated gateways RadosGW.
>
> Best Regards
>
> Maxime Guyot
> System Engineer
>
>
>
>
>
>
>
>
>
> On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul"  boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au>
> wrote:
>
> >Hello again Christian :)
> >
> >
> >> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> >> > will be distributed over every major capital in Australia.The config
> >> > will be dual sites in each city that will be coupled as HA pairs - 12
> >> > sites in total.   The vast majority of CRUSH rules will place data
> >> > either locally to the individual site, or replicated to the other HA
> >> > site in that city.   However there are future use cases where I think we
> >> > could use EC to distribute data wider or have some replication that
> >> > puts small data sets across multiple cities.
> >> This will very, very, VERY much depend on the data (use case) in question.
> >
> >The EC use case would be using RGW and to act as an archival backup
> >store
> >
> >> > The concern I have is around the placement of mons.  In the current
> >> > design there would be two monitors in each site, running separate to
> the
> >> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> >> > will also be a "tiebreaker" mon placed on a separate host which
> >> > will house some management infrastructure for the whole platform.
> >> >
> >> Yes, that's the preferable way, might want to up this to 5 mons so
> >> you can loose one while doing maintenance on another one.
> >> But if that would be a coupled, national cluster you're looking both
> >> at significant MON traffic, interesting "split-brain" scenarios and
> >> latencies as well (MONs get chosen randomly by clients AFAIK).
> >
> >In the case I am setting up it would be 2 per site plus the extra so 25 - 
> >but I
> am fearing that would make the mon syncing become to heavy.  Once we
> build up to multiple sites though we can maybe reduce to one per site to
> reduce the workload on keeping the mons in sync.
> >
> >> > Obviously a concern is latency - the east coast to west coast
> >> > latency is around 50ms, and on the east coast it is 12ms between
> >> > Sydney and the other two sites, and 24ms Melbourne to Brisbane.
> >> In any situation other than "write speed doesn't matter at all"
> >> combined with "large writes, not small ones" and "read-mostly" you're
> >> going to be in severe pain.
> >
> >For data yes, but the main case for that would be backup data where it
> would be large writes, read rarely and as long as streaming performance
> keeps up latency wont matter.   My concern with the latency would be how
> that impacts the monitors having

Re: [ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul
Hello again Christian :)


> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> > will be distributed over every major capital in Australia.The config
> > will be dual sites in each city that will be coupled as HA pairs - 12
> > sites in total.   The vast majority of CRUSH rules will place data
> > either locally to the individual site, or replicated to the other HA
> > site in that city.   However there are future use cases where I think we
> > could use EC to distribute data wider or have some replication that puts
> > small data sets across multiple cities.
> This will very, very, VERY much depend on the data (use case) in question.

The EC use case would be using RGW to act as an archival backup store

> > The concern I have is around the placement of mons.  In the current
> > design there would be two monitors in each site, running separate to the
> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> > will also be a "tiebreaker" mon placed on a separate host which will
> > house some management infrastructure for the whole platform.
> >
> Yes, that's the preferable way, might want to up this to 5 mons so you can
> loose one while doing maintenance on another one.
> But if that would be a coupled, national cluster you're looking both at
> significant MON traffic, interesting "split-brain" scenarios and latencies as
> well (MONs get chosen randomly by clients AFAIK).

In the case I am setting up it would be 2 per site plus the extra, so 25 - but I 
am fearing that would make the mon syncing become too heavy.  Once we build up 
to multiple sites though we can maybe reduce to one per site to reduce the 
workload on keeping the mons in sync.

> > Obviously a concern is latency - the east coast to west coast latency
> > is around 50ms, and on the east coast it is 12ms between Sydney and
> > the other two sites, and 24ms Melbourne to Brisbane.
> In any situation other than "write speed doesn't matter at all" combined with
> "large writes, not small ones" and "read-mostly" you're going to be in severe
> pain.

For data yes, but the main case for that would be backup data where it would be 
large writes, read rarely, and as long as streaming performance keeps up, latency 
won't matter.   My concern with the latency would be how that impacts the 
monitors having to keep in sync and how that would impact client operations, 
especially with the rate of change that would occur with the predominant RBD 
use in most sites.

> > Most of the data
> > traffic will remain local but if we create a single national cluster
> > then how much of an impact will it be having all the mons needing to
> > keep in sync, as well as monitor and communicate with all OSDs (in the
> > end goal design there will be some 2300+ OSDs).
> >
> Significant.
> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
> run/setup and sharing the experience with us. ^.^

Someone has to be the canary right :)

> > The other options I  am considering:
> > - split into east and west coast clusters, most of the cross city need
> > is in the east coast, any data moves between clusters can be done with
> > snap replication
> > - city based clusters (tightest latency) but loose the multi-DC EC
> > option, do cross city replication using snapshots
> >
> The later, I seem to remember that there was work in progress to do this
> (snapshot replication) in an automated fashion.
>
> > Just want to get a feel for what I need to consider when we start
> > building at this scale.
> >
> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
> the only well known/supported way to do geo-replication with Ceph is via
> RGW.

iSCSI is working fairly well.  We have decided not to use Ceph for the latency-
sensitive workloads, so while we are still working to keep that low, we won't be 
putting the heavier IOP or latency-sensitive workloads onto it until we get a 
better feel for how it behaves at scale and can be sure of the performance.

As above - for the most part we are going to be having local 
site pools (replicate at application level), a few metro replicated pools and a 
couple of very small multi-metro replicated pools, with the geo-redundant EC 
stuff a future plan.  It would just be a shame to lock the design into a setup 
that won't let us do some of these wider options down the track.

Thanks.

Adrian


[ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul

We are close to being given approval to deploy a 3.5PB Ceph cluster that will 
be distributed over every major capital in Australia.The config will be 
dual sites in each city that will be coupled as HA pairs - 12 sites in total.   
The vast majority of CRUSH rules will place data either locally to the 
individual site, or replicated to the other HA site in that city.   However 
there are future use cases where I think we could use EC to distribute data 
wider or have some replication that puts small data sets across multiple 
cities.   All of this will be tied together with a dedicated private IP network.
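
For the "replicated to the other HA site in that city" case, a CRUSH rule along these lines should do it - a sketch only, assuming datacenter buckets named syd-dc1 and syd-dc2 exist in the map (the bucket names here are made up):

    rule syd_metro_pair {
            ruleset 1
            type replicated
            min_size 2
            max_size 4
            step take syd-dc1
            step chooseleaf firstn 2 type host
            step emit
            step take syd-dc2
            step chooseleaf firstn 2 type host
            step emit
    }

With pool size 4 this keeps two copies in each of the paired sites; pools that should stay purely local would simply take a single site bucket instead.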

The concern I have is around the placement of mons.  In the current design 
there would be two monitors in each site, running separate to the OSDs as part 
of some hosts acting as RBD to iSCSI/NFS gateways.   There will also be a 
"tiebreaker" mon placed on a separate host which will house some management 
infrastructure for the whole platform.

Obviously a concern is latency - the east coast to west coast latency is around 
50ms, and on the east coast it is 12ms between Sydney and the other two sites, 
and 24ms Melbourne to Brisbane.  Most of the data traffic will remain local but 
if we create a single national cluster then how much of an impact will it be 
having all the mons needing to keep in sync, as well as monitor and communicate 
with all OSDs (in the end goal design there will be some 2300+ OSDs).

The other options I  am considering:
- split into east and west coast clusters, most of the cross city need is in 
the east coast, any data moves between clusters can be done with snap 
replication
- city based clusters (tightest latency) but lose the multi-DC EC option, do 
cross city replication using snapshots

Just want to get a feel for what I need to consider when we start building at 
this scale.

Cheers,
 Adrian






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph.conf

2016-03-30 Thread Adrian Saul

It is the list of monitors that Ceph clients/daemons can connect to initially 
in order to join the cluster.

Once they connect to one of the initial mons they will get a full list of all 
monitors and be able to connect to any of them to pull updated maps.
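
In ceph.conf it is just a list of monitor names, with mon host giving their addresses, along these lines (hostnames and addresses here are only placeholders):

    [global]
    mon initial members = node1, node2, node3
    mon host = 10.10.49.1, 10.10.49.2, 10.10.49.3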


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
zai...@nocser.net
Sent: Thursday, 31 March 2016 3:21 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph.conf

Hi,

What is meant by mon initial members in ceph.conf? Is it the monitor nodes that 
monitor all the OSD nodes? Or the OSD nodes that are being monitored? Care to explain?

Regards,

Mohd Zainal Abidin Rabani
Technical Support

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash after conversion to bluestore

2016-03-30 Thread Adrian Saul

I upgraded my lab cluster to 10.1.0 specifically to test out bluestore and see 
what latency difference it makes.

I was able to one by one zap and recreate my OSDs to bluestore and rebalance 
the cluster (the change to having new OSDs start with low weight threw me at 
first, but once  I worked that out it was fine).

I was all good until I completed the last OSD, and then one of the earlier ones 
fell over and refuses to restart.  Every attempt to start fails with this 
assertion failure:

-2> 2016-03-31 15:15:08.868588 7f931e5f0800  0  
cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
-1> 2016-03-31 15:15:08.868800 7f931e5f0800  1  
cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
 0> 2016-03-31 15:15:08.870948 7f931e5f0800 -1 osd/OSD.h: In function 
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f931e5f0800 time 2016-03-31 
15:15:08.869638
osd/OSD.h: 886: FAILED assert(ret)

 ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) 
[0x558cee37da55]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x558cedd6a6fd]
 3: (OSD::init()+0xf22) [0x558cedd1d172]
 4: (main()+0x2aab) [0x558cedc83a2b]
 5: (__libc_start_main()+0xf5) [0x7f931b506b15]
 6: (()+0x349689) [0x558cedccd689]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


I could just zap and recreate it again, but I would be curious to know how to 
fix it, or whether someone can suggest if this is a bug that needs looking at.

Cheers,
 Adrian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD latencies

2016-03-06 Thread Adrian Saul
 is flushing things
> from RAM to the backing storage on configurable time limits and once these
> are exceeded and/or you run out RAM (pagecache), you are limited to what
> your backing storage can sustain.
>
> Now in real life, you would want a cluster and especially OSDs that are 
> lightly
> to medium loaded on average and in that case a spike won't result in a
> significant rise of latency.
>
> > > Have you tried the HDD based pool and did you see similar,
> > > consistent interval, spikes?
> >
> > To be honest I have been focusing on the SSD numbers but that would be
> > a good comparison.
> >
> > > Or alternatively, configured 2 of your NVMEs as OSDs?
> >
> > That was what I was thinking of doing - move the NVMEs to the
> > frontends, make them OSDs and configure them as a read-forward cache
> > tier for the other pools, and just have the SSDs and SATA journal by
> > default on a first partition.
> >
> Madness lies down that path, also not what I meant.
> For quick testing, leave the NVMEs right where they are, destroy your SSD
> pool and create one with the 2 NVMEs per node as individual OSDs.
> Test against that.

Have that in place now - was also a fun exercise in ceph management to 
dynamically reconfigure and rebuild 12 OSDs and then put the flash OSDs into 
their own crush root.  That is in play now but the numbers are really not what 
I expected.  I am going to work on it some more before I call anything.

>
> A read forward cache tier is exactly the opposite of what you want, you want
> your writes to be fast and hit the fastest game in town (your NVMEs
> preferably) and thus want writeback mode.
> Infernalis, or even better waiting for Jewel will help to keep the cache as 
> hot
> and unpolluted as possible with working recency configurations for
> promotions.
> But if anyhow possible, keep your base pools sufficiently fast as well, so 
> they
> can serve cache misses (promotions) or cache flushes adequately.
> Keep in mind that a promotion or flush will (on average for RDB objects)
> result in 4MB reads and writes.

My understanding was read forward would take the writes, but send the reads on 
to the backend.  Probably not a lot to save those reads but I wanted to ensure 
I was not hiding real read performance with a flash pool that was larger than 
the workload I am dealing with.
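
For reference, the tiering setup being discussed is only a handful of commands - a sketch assuming a base pool called glebe-sata and a cache pool called flash-cache (the cache pool name is invented here):

    ceph osd tier add glebe-sata flash-cache
    ceph osd tier cache-mode flash-cache writeback    # or readforward, as discussed
    ceph osd tier set-overlay glebe-sata flash-cache
    ceph osd pool set flash-cache hit_set_type bloom
    ceph osd pool set flash-cache target_max_bytes 1099511627776   # cap the cache, adjust to suit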


>
> In your case the SSDs are totally unsuitable to hold journals and will both
> perform miserably and wear out even faster.
> And HDDs really benefit from SSD journals, especially when it comes to IOPS.

For now I have left the SATAs with their journals on flash - we are 
already pricing based on that config anyway.

> I also recall your NVMEs being in a RAID1, presumably so that a failure won't
> take out all your OSDs.
> While understandable, it is also quite wasteful.
> For starters you need to be able to sustain a node loss, so "half" a node 
> loss if
> a NVME fails must be within the the capability of your cluster.
> This is why most people suggest starting with about 10 storage nodes for
> production clusters, of course budget permitting (none of mine is that size
> yet).
>
> By using the NVMEs individually, you improve performance and lower their
> write usage.
> Specifically, those 400GB P3700 can write about 1000MB/s, which is half your
> network speed and will only saturate about 10 of your 36 HDDs.
> And with Intel P3700s, you really don't have to worry about endurance to
> boot.

Thanks.  The consideration was I didn't want to lose 36 or 18 OSDs due to a 
journal failure, so if we lost a card we could do a controlled replacement 
without totally rebuilding the OSDs (as they are PCI-e it's a host outage anyway).

We could maybe look to see if we can put 3 cards in and do 12 per journal, and 
just take the hit should we lose a single journal.

I would prefer to do a more distributed host setup, however the compute cost 
pushes the overall pricing up significantly even if we use smaller, lower-spec 
hosts, hence why I am going for more of a higher density model.

Very much appreciate your insight and advice.

Cheers,
 Adrian



>
>
> Regards,
>
> Christian
> > > No, not really. The journal can only buffer so much.
> > > There are several threads about this in the archives.
> > >
> > > You could tune it but that will only go so far if your backing
> > > storage can't keep up.
> > >
> > > Regards,
> > >
> > > Christian
> >
> >
> > Agreed - Thanks for your help.

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Adrian Saul

> Samsung EVO...
> Which exact model, I presume this is not a DC one?
>
> If you had put your journals on those, you would already be pulling your hairs
> out due to abysmal performance.
>
> Also with Evo ones, I'd be worried about endurance.

No,  I am using the P3700DCs for journals.  The Samsungs are the 850 2TB 
(MZ-75E2T0BW).  Chosen primarily on price.  We already built a system using the 
1TB models with Solaris+ZFS and I have little faith in them.  Certainly their 
write performance is erratic and not ideal.  We have other vendor options which 
are what they call "Enterprise Value" SSDs, but still 4x the price.   I would 
prefer a higher grade drive but unfortunately cost is being driven from above 
me.

> > On the ceph side each disk in the OSD servers are setup as an individual
> > OSD, with a 12G journal created on the flash mirror.   I setup the SSD
> > servers into one root, and the SATA servers into another and created
> > pools using hosts as fault boundaries, with the pools set for 2
> > copies.
> Risky. If you have very reliable and well monitored SSDs you can get away
> with 2 (I do so), but with HDDs and the combination of their reliability and
> recovery time it's asking for trouble.
> I realize that this is testbed, but if your production has a replication of 3 
> you
> will be disappointed by the additional latency.

Again, cost - the end goal will be to build metro-based dual-site pools which 
will be 2+2 replication.  I am aware of the risks, but even presenting 
numbers based on buying 4x the disk we are able to use gets questioned hard.

> This smells like garbage collection on your SSDs, especially since it matches
> time wise what you saw on them below.

I concur.   I am just not sure why that impacts back to the client when from 
the client perspective the journal should hide this.   If the journal is 
struggling to keep up and has to flush constantly then perhaps, but  on the 
current steady state IO rate I am testing with I don't think the journal should 
be that saturated.

> Have you tried the HDD based pool and did you see similar, consistent
> interval, spikes?

To be honest I have been focusing on the SSD numbers but that would be a good 
comparison.

> Or alternatively, configured 2 of your NVMEs as OSDs?

That was what I was thinking of doing - move the NVMEs to the frontends, make 
them OSDs and configure them as a read-forward cache tier for the other pools, 
and just have the SSDs and SATA journal by default on a first partition.

> No, not really. The journal can only buffer so much.
> There are several threads about this in the archives.
>
> You could tune it but that will only go so far if your backing storage can't 
> keep
> up.
>
> Regards,
>
> Christian


Agreed - Thanks for your help.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] failure of public network kills connectivity

2016-01-05 Thread Adrian Imboden

Hi

I recently set up a small ceph cluster at home for testing and private 
purposes.
It works really great, but I have a problem that may come from my 
small-size configuration.


All nodes are running Ubuntu 14.04 and ceph infernalis 9.2.0.

I have two networks as recommended:
cluster network: 10.10.128.0/24
public network: 10.10.48.0/22

All ip addresses are configured statically.

The behavior that I see here is that when I let the rados benchmark run 
(e.g. "rados bench -p data 300 write"),
both the public and the cluster network are being used to transmit the data 
(about 50%/50%).


When I disconnect the cluster from the public network, the connection 
between the OSDs is lost, while the monitors keep seeing each other:
HEALTH_WARN 129 pgs degraded; 127 pgs stale; 129 pgs undersized; 
recovery 1885/7316 objects degraded (25.765%); 6/8 in osds are down



What I expect is that only the cluster network is being used when a 
ceph node itself reads or writes data.
Furthermore, I expected that a failure of the public network does not 
affect the connectivity of the nodes themselves.


What do I not yet understand, or what am I configuring the wrong way?

I plan to run kvm on these same nodes beside the storage cluster as it 
is only a small setup. That's the reason why I am a little bit concerned 
about this behaviour.


This is how it is setup:
|- node1 (cluster: 10.10.128.1, public: 10.10.49.1)
|  |- osd
|  |- osd
|  |- mon
|
|- node2 (cluster: 10.10.128.2, public: 10.10.49.2)
|  |- osd
|  |- osd
|  |- mon
|
|- node3 (cluster: 10.10.128.3, public: 10.10.49.3)
|  |- osd
|  |- osd
|  |- mon
|
|- node4 (cluster: 10.10.128.4, public: 10.10.49.4)
|  |- osd
|  |- osd
|

This is my ceph config:

[global]
auth supported = cephx

fsid = 64599def-5741-4bda-8ce5-31a85af884bb
mon initial members = node1 node3 node2 node4
mon host = 10.10.128.1 10.10.128.3 10.10.128.2 10.10.128.4
public network = 10.10.48.0/22
cluster network = 10.10.128.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[mon.node1]
host = node1
mon data = /var/lib/ceph/mon/ceph-node1/

[mon.node3]
host = node3
mon data = /var/lib/ceph/mon/ceph-node3/

[mon.node2]
host = node2
mon data = /var/lib/ceph/mon/ceph-node2/

[mon.node4]
host = node4
mon data = /var/lib/ceph/mon/ceph-node4/
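
A quick way to check what each OSD has registered - illustrative only, assuming a standard deployment:

    ceph osd dump | grep '^osd\.'

Each osd line shows the public address followed by the cluster address. Client and monitor traffic goes over the public network, while only OSD-to-OSD replication and heartbeats use the cluster network - which would explain both the roughly 50%/50% split seen during rados bench (client writes arrive on the public side, replicas go over the cluster side) and the OSDs being reported down once the public network is disconnected.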


Thank you very much

Greetings
Adrian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD load simulator

2015-03-10 Thread Adrian Sevcenco
Hi! Is it possible somehow to have a kind of OSD benchmark for CPU?
It would be very useful to measure the actual capability of a server with
a number of OSDs, PGs and so on .. 
The reason for the request is that the rule of 1 GHz per OSD might not really 
hold water 
(for reasons like AMD vs Intel, Atom, Xeon D and upcoming ARMs 
(like Cavium's ThunderX, X-Gene, AMD's opteron ARM) etc...)

Thank you!
Adrian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH hardware recommendations and cluster design questions

2015-03-05 Thread Adrian Sevcenco
Thank you all for all the good advice and much needed documentation.
I have a lot to digest :)

Adrian

On 03/04/2015 08:17 PM, Stephen Mercier wrote:
 To expand upon this, the very nature and existence of Ceph is to replace
 RAID. The FS itself replicates data and handles the HA functionality
 that you're looking for. If you're going to build a single server with
 all those disks, backed by a ZFS RAID setup, you're going to be much
 better suited with an iSCSI setup. The idea of ceph is that it takes the
 place of all the ZFS bells and whistles. A CEPH cluster that only has
 one OSD backed by that huge ZFS setup becomes just a wire-protocol to
 speak to the server. The magic in ceph comes from the replication and
 distribution of the data across many OSDs, hopefully living in many
 hosts. My own setup for instance uses 96 OSDs that are spread across 4
 hosts (I know I know guys - CPU is a big deal with SSDs so 24 per host
 is a tall order - didn't know that when we built it - been working ok so
 far) that is then distributed between 2 cabinets on 2 separate
 cooling/power/data zones in our datacenter. My CRUSH map is currently
 setup for 3 copies of all data, and laid out so that at least one copy
 is located in each cabinet, and then the cab that gets the 2 copies also
 makes sure that each copy is on a different host. No RAID needed because
 ceph makes sure that I have a safe amount of copies of the data, in a
 distribution layout that allows us to sleep at night. In my opinion,
 ceph is much more pleasant, powerful, and versatile to deal with than
 both hardware RAID and ZFS (Both of which we have instances of deployed
 as well from previous iterations of infrastructure deployments). Now,
 you could always create small little zRAID clusters using ZFS, and then
 give an OSD to each of those, if you wanted even an additional layer of
 safety. Heck, you could even have hardware RAID behind the zRAID, for
 even another layer. Where YOU need to make the decision is the trade-off
 between HA functionality/peace of mind, performance, and
 useability/maintainability.
 
 Would me happy to answer any questions you still have...
 
 Cheers,
 -- 
 Stephen Mercier
 Senior Systems Architect
 Attainia, Inc.
 Phone: 866-288-2464 ext. 727
 Email: stephen.merc...@attainia.com mailto:stephen.merc...@attainia.com
 Web: www.attainia.com http://www.attainia.com
 
 Capital equipment lifecycle planning  budgeting solutions for healthcare
 
 
 
 
 
 
 On Mar 4, 2015, at 10:42 AM, Alexandre DERUMIER wrote:
 
 Hi for hardware, inktank have good guides here:

 http://www.inktank.com/resource/inktank-hardware-selection-guide/
 http://www.inktank.com/resource/inktank-hardware-configuration-guide/

 ceph works well with multiple osd daemon (1 osd by disk),
 so you should not use raid.

 (xfs is the recommended fs for osd daemons).

 you don't need disk spare too, juste enough disk space to handle a
 disk failure.
 (datas are replicated-rebalanced on other disks/osd in case of disk
 failure)


 - Mail original -
 De: Adrian Sevcenco adrian.sevce...@cern.ch
 À: ceph-users ceph-users@lists.ceph.com
 Envoyé: Mercredi 4 Mars 2015 18:30:31
 Objet: [ceph-users] CEPH hardware recommendations and cluster
 designquestions

 Hi! I seen the documentation
 http://ceph.com/docs/master/start/hardware-recommendations/ but those
 minimum requirements without some recommendations don't tell me much ...

 So, from what i seen for mon and mds any cheap 6 core 16+ gb ram amd
 would do ... what puzzles me is that per daemon construct ...
 Why would i need/require to have multiple daemons? with separate servers
 (3 mon + 1 mds - i understood that this is the requirement) i imagine
 that each will run a single type of daemon.. did i miss something?
 (beside that maybe is a relation between daemons and block devices and
 for each block device should be a daemon?)

 for mon and mds : would help the clients if these are on 10 GbE?

 for osd : i plan to use a 36 disk server as osd server (ZFS RAIDZ3 all
 disks + 2 ssds mirror for ZIL and L2ARC) - that would give me ~ 132 TB
 how much ram i would really need? (128 gb would be way to much i think)
 (that RAIDZ3 for 36 disks is just a thought - i have also choices like:
 2 X 18 RAIDZ2 ; 34 disks RAIDZ3 + 2 hot spare)

 Regarding journal and scrubbing : by using ZFS i would think that i can
 safely not use the CEPH ones ... is this ok?

 Do you have some other advises and recommendations for me? (the
 read:writes ratios will be 10:1)

 Thank you!!
 Adrian




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH hardware recommendations and cluster design questions

2015-03-04 Thread Adrian Sevcenco
Hi! I've seen the documentation
http://ceph.com/docs/master/start/hardware-recommendations/ but those
minimum requirements without some recommendations don't tell me much ...

So, from what I've seen, for mon and mds any cheap 6 core 16+ GB RAM AMD
would do ... what puzzles me is that per-daemon construct ...
Why would I need/require to have multiple daemons? With separate servers
(3 mon + 1 mds - I understood that this is the requirement) I imagine
that each will run a single type of daemon.. did I miss something?
(besides that, maybe there is a relation between daemons and block devices and
for each block device there should be a daemon?)

for mon and mds: would it help the clients if these are on 10 GbE?

for osd: I plan to use a 36 disk server as an osd server (ZFS RAIDZ3 all
disks + 2 SSDs mirrored for ZIL and L2ARC) - that would give me ~ 132 TB.
How much RAM would I really need? (128 GB would be way too much I think)
(that RAIDZ3 for 36 disks is just a thought - I also have choices like:
2 X 18 RAIDZ2 ; 34 disks RAIDZ3 + 2 hot spares)

Regarding journal and scrubbing: by using ZFS I would think that I can
safely not use the Ceph ones ... is this ok?

Do you have any other advice and recommendations for me? (the
read:write ratio will be 10:1)

Thank you!!
Adrian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-14 Thread Adrian Banasiak
Thank you, that should do the trick.


2014-05-14 6:41 GMT+02:00 Kai Zhang log1...@yeah.net:

 Hi Adrian,

 You may be interested in rados -p poo_name df --format json, although
 it's pool oriented, you could probably add the values together :)

 Regards,
 Kai

 在 2014-05-13 08:33:11,Adrian Banasiak adr...@banasiak.it 写道:

 Thanks for sugestion with admin daemon but it looks like single osd
 oriented. I have used perf dump on mon socket and it output some
 interesting data in case of monitoring whole cluster:
 { cluster: { num_mon: 4,
   num_mon_quorum: 4,
   num_osd: 29,
   num_osd_up: 29,
   num_osd_in: 29,
   osd_epoch: 1872,
   osd_kb: 20218112516,
   osd_kb_used: 5022202696,
   osd_kb_avail: 15195909820,
   num_pool: 4,
   num_pg: 3500,
   num_pg_active_clean: 3500,
   num_pg_active: 3500,
   num_pg_peering: 0,
   num_object: 400746,
   num_object_degraded: 0,
   num_object_unfound: 0,
   num_bytes: 1678788329609,
   num_mds_up: 0,
   num_mds_in: 0,
   num_mds_failed: 0,
   mds_epoch: 1},

 Unfortunately cluster wide IO statistics are still missing.


 2014-05-13 17:17 GMT+02:00 Haomai Wang haomaiw...@gmail.com:

 Not sure your demand.

 I use ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump to
 get the monitor infos. And the result can be parsed by simplejson
 easily via python.

 On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak adr...@banasiak.it
 wrote:
  Hi, i am working with test Ceph cluster and now I want to implement
 Zabbix
  monitoring with items such as:
 
  - whoe cluster IO (for example ceph -s - recovery io 143 MB/s, 35
  objects/s)
  - pg statistics
 
  I would like to create single script in python to retrive values using
 rados
  python module, but there are only few informations in documentation
 about
  module usage. I've created single function which calculates all pools
  current read/write statistics but i cant find out how to add recovery IO
  usage and pg statistics:
 
  read = 0
  write = 0
  for pool in conn.list_pools():
  io = conn.open_ioctx(pool)
  stats[pool] = io.get_stats()
  read+=int(stats[pool]['num_rd'])
  write+=int(stats[pool]['num_wr'])
 
  Could someone share his knowledge about rados module for retriving ceph
  statistics?
 
  BTW Ceph is awesome!
 
  --
  Best regards, Adrian Banasiak
  email: adr...@banasiak.it
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 



 --
 Best Regards,

 Wheat




 --
 Pozdrawiam, Adrian Banasiak
 email: adr...@banasiak.it




-- 
Pozdrawiam, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Adrian Banasiak
Hi, I am working with a test Ceph cluster and now I want to implement Zabbix
monitoring with items such as:

- whole cluster IO (for example ceph -s - recovery io 143 MB/s, 35
objects/s)
- pg statistics

I would like to create a single script in python to retrieve values using
the rados python module, but there is only a little information in the documentation
about module usage. I've created a single function which calculates all pools'
current read/write statistics but I can't find out how to add recovery IO
usage and pg statistics:

stats = {}
read = 0
write = 0
for pool in conn.list_pools():
    io = conn.open_ioctx(pool)
    stats[pool] = io.get_stats()
    read += int(stats[pool]['num_rd'])
    write += int(stats[pool]['num_wr'])

Could someone share his knowledge about the rados module for retrieving ceph
statistics?

BTW Ceph is awesome!
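
Building on the snippet above, a self-contained sketch - assuming a readable /etc/ceph/ceph.conf, the client.admin keyring, and a python-rados recent enough to expose mon_command (none of this is from the original post):

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # cluster-wide usage counters
    print(cluster.get_cluster_stats())       # kb, kb_used, kb_avail, num_objects

    # pg states plus client/recovery IO rates, the same data as 'ceph status'
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'status', 'format': 'json'}), b'')
    if ret == 0:
        print(json.loads(outbuf)['pgmap'])

    cluster.shutdown()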

-- 
Best regards, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Adrian Banasiak
Thanks for the suggestion with the admin daemon but it looks single-OSD
oriented. I have used perf dump on the mon socket and it outputs some
interesting data for monitoring the whole cluster:
{ "cluster": { "num_mon": 4,
  "num_mon_quorum": 4,
  "num_osd": 29,
  "num_osd_up": 29,
  "num_osd_in": 29,
  "osd_epoch": 1872,
  "osd_kb": 20218112516,
  "osd_kb_used": 5022202696,
  "osd_kb_avail": 15195909820,
  "num_pool": 4,
  "num_pg": 3500,
  "num_pg_active_clean": 3500,
  "num_pg_active": 3500,
  "num_pg_peering": 0,
  "num_object": 400746,
  "num_object_degraded": 0,
  "num_object_unfound": 0,
  "num_bytes": 1678788329609,
  "num_mds_up": 0,
  "num_mds_in": 0,
  "num_mds_failed": 0,
  "mds_epoch": 1},

Unfortunately cluster wide IO statistics are still missing.
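
For the per-daemon route suggested below, the admin socket output is plain JSON, so the cluster section can be pulled out in one go - socket path and mon id here are illustrative:

    ceph --admin-daemon /var/run/ceph/ceph-mon.node1.asok perf dump | \
        python -c 'import json,sys; print(json.load(sys.stdin)["cluster"])'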


2014-05-13 17:17 GMT+02:00 Haomai Wang haomaiw...@gmail.com:

 Not sure your demand.

 I use ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump to
 get the monitor infos. And the result can be parsed by simplejson
 easily via python.

 On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak adr...@banasiak.it
 wrote:
  Hi, i am working with test Ceph cluster and now I want to implement
 Zabbix
  monitoring with items such as:
 
  - whoe cluster IO (for example ceph -s - recovery io 143 MB/s, 35
  objects/s)
  - pg statistics
 
  I would like to create single script in python to retrive values using
 rados
  python module, but there are only few informations in documentation about
  module usage. I've created single function which calculates all pools
  current read/write statistics but i cant find out how to add recovery IO
  usage and pg statistics:
 
  read = 0
  write = 0
  for pool in conn.list_pools():
  io = conn.open_ioctx(pool)
  stats[pool] = io.get_stats()
  read+=int(stats[pool]['num_rd'])
  write+=int(stats[pool]['num_wr'])
 
  Could someone share his knowledge about rados module for retriving ceph
  statistics?
 
  BTW Ceph is awesome!
 
  --
  Best regards, Adrian Banasiak
  email: adr...@banasiak.it
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 



 --
 Best Regards,

 Wheat




-- 
Pozdrawiam, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com