Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Andrew Thrift
We have seen similar poor performance with Intel S3700 and S3710 on LSI
SAS3008 with CFQ on 3.13, 3.18 and 3.19 kernels.

Switching to noop fixed the problems for us.
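
For anyone who wants to try the same change, a minimal sketch of checking and
switching the scheduler at runtime (sdb is just an example device; use an
elevator= boot parameter or a udev rule to make it persistent):

# show the current scheduler (the active one is shown in brackets)
cat /sys/block/sdb/queue/scheduler
# switch this device to noop at runtime
echo noop > /sys/block/sdb/queue/scheduler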



On Fri, Jul 10, 2015 at 4:30 AM, Alexandre DERUMIER 
wrote:

> >>That’s very strange. Is nothing else using the disks?
> no. only the fio benchmark.
>
> >>The difference between noop and cfq should be (and in my experience is)
> marginal for such a benchmark.
> maybe a bug in cfq (kernel 3.16 debian jessie) ? also, deadline scheduler
> give me same perf than noop.
>
>
> - Mail original -
> De: "Jan Schermer" 
> À: "aderumier" 
> Cc: "Somnath Roy" , "ceph-users" <
> ceph-users@lists.ceph.com>
> Envoyé: Jeudi 9 Juillet 2015 18:20:51
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>
> That’s very strange. Is nothing else using the disks?
> The difference between noop and cfq should be (and in my experience is)
> marginal for such a benchmark.
>
> Jan
>
>
> > On 09 Jul 2015, at 18:11, Alexandre DERUMIER 
> wrote:
> >
> > Hi again,
> >
> > I totally forgot to check the io scheduler from my last tests, this was
> with cfq.
> >
> > with noop scheduler, I have a huge difference
> >
> > cfq:
> >
> > - sequential syncronous 4k write iodepth=1 : 60 iops
> > - sequential syncronous 4k write iodepth=32 : 2000 iops
> >
> >
> > noop:
> >
> > - sequential syncronous 4k write iodepth=1 : 7866 iops
> > - sequential syncronous 4k write iodepth=32 : 34303 iops
> >
> >
> > - Mail original -
> > De: "Somnath Roy" 
> > À: "Jan Schermer" , "aderumier" 
> > Cc: "ceph-users" 
> > Envoyé: Jeudi 9 Juillet 2015 17:46:41
> > Objet: RE: [ceph-users] Investigating my 100 IOPS limit
> >
> > I am not sure how increasing iodepth for sync write is giving you better
> result..sync fio engine supposed to be always using iodepth =1.
> > BTW, I faced similar issues sometimes back,..By running the following
> fio job file, I was getting very dismal performance on my SSD on top of
> XFS..
> >
> > [random-write]
> > directory=/mnt/fio_test
> > rw=randwrite
> > bs=16k
> > direct=1
> > sync=1
> > time_based
> > runtime=1m
> > size=700G
> > group_reporting
> >
> > Result :
> > 
> > IOPS = 420
> >
> > lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01%
> > lat (msec) : 2=20.05%, 4=46.64%, 10=8.68%
> >
> >
> > Turned out that is a SSD FW problem...Some SSDs tend to misbehave in
> this pattern (even directly with block device, without any XFS) because
> they don't handle O_DIRECT|O_SYNC writes well..I am sure you will find some
> reference by digging into ceph mail list. That's why not all SSDs behave
> well with Ceph journal..
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Jan Schermer
> > Sent: Thursday, July 09, 2015 8:24 AM
> > To: Alexandre DERUMIER
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Investigating my 100 IOPS limit
> >
> > Those are very strange numbers. Is the “60” figure right?
> >
> > Can you paste the full fio command and output?
> > Thanks
> >
> > Jan
> >
> >> On 09 Jul 2015, at 15:58, Alexandre DERUMIER 
> wrote:
> >>
> >> I just tried on an intel s3700, on top of xfs
> >>
> >> fio , with
> >> - sequential syncronous 4k write iodepth=1 : 60 iops
> >> - sequential syncronous 4k write iodepth=32 : 2000 iops
> >> - random syncronous 4k write, iodepth=1 : 8000iops
> >> - random syncronous 4k write iodepth=32 : 18000 iops
> >>
> >>
> >>
> >> - Mail original -
> >> De: "aderumier" 
> >> À: "Jan Schermer" 
> >> Cc: "ceph-users" 
> >> Envoyé: Jeudi 9 Juillet 2015 15:50:35
> >> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
> >>
>  Any ideas where to look? I was hoping blktrace would show what
>  exactly is going on, but it just shows a synchronous write -> (10ms)
>  -> completed
> >>
> >> which size is the write in this case ? 4K ? or more ?
> >>
> >>
> >> - Mail original -
> >> De: "Jan Schermer" 
> >> À: "aderumier" 
> >> Cc: "ceph-users" 
> >> Envoyé: Jeudi 9 Juillet 2015 15:29:15
> >> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
> >>
> >> I tried everything: —write-barrier, —sync —fsync, —fdatasync I never
> >> get the same 10ms latency. Must be something the filesystem journal/log
> does that is special.
> >>
> >> Any ideas where to look? I was hoping blktrace would show what exactly
> >> is going on, but it just shows a synchronous write -> (10ms) ->
> >> completed
> >>
> >> Jan
> >>
> >>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER 
> wrote:
> >>>
> > I have 12K IOPS in this test on the block device itself. But only
> > 100 filesystem transactions (=IOPS) on filesystem on the same
> > device because the “flush” (=FUA?) operation takes 10ms to finish.
> > I just can’t replicate the >>same “flush” operation with fio on the
> > block device, unfortunately, so I have no idea what is causing that
> > :/
> >>>
> >>> AFAIK, with fio on block device with --sync=1, is doing flush after each write.

[ceph-users] Nova with Ceph generate error

2015-07-09 Thread Mario Codeniera
Hi,

It is my first time here. I am having an issue with my OpenStack configuration,
which works perfectly for Cinder and Glance, based on the Kilo release on CentOS
7. I based my setup on the rbd-openstack manual.


If I enable rbd in nova.conf it generates an error like the following in the
dashboard, while the logs don't show any errors:

Internal Server Error (HTTP 500) (Request-ID:
> req-231347dd-f14c-4f97-8a1d-851a149b037c)
> Code
> 500
> Details
> File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 343,
> in decorated_function return function(self, context, *args, **kwargs) File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2737, in
> terminate_instance do_terminate_instance(instance, bdms) File
> "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 445,
> in inner return f(*args, **kwargs) File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2735, in
> do_terminate_instance self._set_instance_error_state(context, instance)
> File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in
> __exit__ six.reraise(self.type_, self.value, self.tb) File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2725, in
> do_terminate_instance self._delete_instance(context, instance, bdms,
> quotas) File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 149, in
> inner rv = f(*args, **kwargs) File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2694, in
> _delete_instance quotas.rollback() File
> "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in
> __exit__ six.reraise(self.type_, self.value, self.tb) File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2664, in
> _delete_instance self._shutdown_instance(context, instance, bdms) File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2604, in
> _shutdown_instance self.volume_api.detach(context, bdm.volume_id) File
> "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 214, in
> wrapper res = method(self, ctx, volume_id, *args, **kwargs) File
> "/usr/lib/python2.7/site-packages/nova/volume/cinder.py", line 365, in
> detach cinderclient(context).volumes.detach(volume_id) File
> "/usr/lib/python2.7/site-packages/cinderclient/v2/volumes.py", line 334, in
> detach return self._action('os-detach', volume) File
> "/usr/lib/python2.7/site-packages/cinderclient/v2/volumes.py", line 311, in
> _action return self.api.client.post(url, body=body) File
> "/usr/lib/python2.7/site-packages/cinderclient/client.py", line 91, in post
> return self._cs_request(url, 'POST', **kwargs) File
> "/usr/lib/python2.7/site-packages/cinderclient/client.py", line 85, in
> _cs_request return self.request(url, method, **kwargs) File
> "/usr/lib/python2.7/site-packages/cinderclient/client.py", line 80, in
> request return super(SessionClient, self).request(*args, **kwargs) File
> "/usr/lib/python2.7/site-packages/keystoneclient/adapter.py", line 206, in
> request resp = super(LegacyJsonAdapter, self).request(*args, **kwargs) File
> "/usr/lib/python2.7/site-packages/keystoneclient/adapter.py", line 95, in
> request return self.session.request(url, method, **kwargs) File
> "/usr/lib/python2.7/site-packages/keystoneclient/utils.py", line 318, in
> inner return func(*args, **kwargs) File
> "/usr/lib/python2.7/site-packages/keystoneclient/session.py", line 397, in
> request raise exceptions.from_response(resp, method, url)
> Created
> 10 Jul 2015, 4:40 a.m.
>


Again, if I disable it, everything works, but the error is generated on the compute
node. I also observe that the hypervisor of the compute nodes isn't displayed;
maybe that is related.
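
For reference, "enabling rbd in nova.conf" here means the [libvirt] settings from
the rbd-openstack guide, roughly like the following sketch (the pool name, user
and secret UUID are assumptions and must match your own setup):

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <uuid of the libvirt secret>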

It was working on Juno before, but there was unexpected rework because the
network infrastructure was changed. When I reran the script I found lots of
package conflicts (I had previously used qemu-img-rhev and qemu-kvm-rhev from
oVirt), but the new hammer Ceph repository seems to solve that issue.

Hope someone can enlighten me.

Thanks,
Mario
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds0: Client failing to respond to cache pressure

2015-07-09 Thread 谷枫
hi,

I use CephFS in a production environment with 7 OSDs, 1 MDS and 3 MONs now.

So far so good, but I have a problem with it today.

The ceph status reports this:

cluster ad3421a43-9fd4-4b7a-92ba-09asde3b1a228
 health HEALTH_WARN
mds0: Client 34271 failing to respond to cache pressure
mds0: Client 74175 failing to respond to cache pressure
mds0: Client 74181 failing to respond to cache pressure
mds0: Client 34247 failing to respond to cache pressure
mds0: Client 64162 failing to respond to cache pressure
mds0: Client 136744 failing to respond to cache pressure
 monmap e2: 3 mons at
{node01=10.3.1.2:6789/0,node02=10.3.1.3:6789/0,node03=10.3.1.4:6789/0}
election epoch 186, quorum 0,1,2 node01,node02,node03
 mdsmap e46: 1/1/1 up {0=tree01=up:active}
 osdmap e717: 7 osds: 7 up, 7 in
  pgmap v995836: 264 pgs, 3 pools, 51544 MB data, 118 kobjects
138 GB used, 1364 GB / 1502 GB avail
 264 active+clean
  client io 1018 B/s rd, 1273 B/s wr, 0 op/s


I added two OSDs with version 0.94.2 yesterday; the other, older OSDs are on 0.94.1.

So the question is does this matter?

What does the warning mean, and how can I solve this problem? Thanks!
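
For reference, a minimal sketch of how this is usually checked: the warning
indicates clients are holding more cached inodes/caps than the MDS wants them to
keep. The daemon name tree01 is taken from the dump below; 400000 is only an
example value (the default mds_cache_size is 100000):

# current cache limit on the active MDS
ceph daemon mds.tree01 config show | grep mds_cache_size
# raise it at runtime if the clients' working set is simply larger than the default
ceph daemon mds.tree01 config set mds_cache_size 400000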

This is my cluster config dump for the MDS:

"name": "mds.tree01",
"debug_mds": "1\/5",
"debug_mds_balancer": "1\/5",
"debug_mds_locker": "1\/5",
"debug_mds_log": "1\/5",
"debug_mds_log_expire": "1\/5",
"debug_mds_migrator": "1\/5",
"admin_socket": "\/var\/run\/ceph\/ceph-mds.tree01.asok",
"log_file": "\/var\/log\/ceph\/ceph-mds.tree01.log",
"keyring": "\/var\/lib\/ceph\/mds\/ceph-tree01\/keyring",
"mon_max_mdsmap_epochs": "500",
"mon_mds_force_trim_to": "0",
"mon_debug_dump_location": "\/var\/log\/ceph\/ceph-mds.tree01.tdump",
"client_use_random_mds": "false",
"mds_data": "\/var\/lib\/ceph\/mds\/ceph-tree01",
"mds_max_file_size": "1099511627776",
"mds_cache_size": "10",
"mds_cache_mid": "0.7",
"mds_max_file_recover": "32",
"mds_mem_max": "1048576",
"mds_dir_max_commit_size": "10",
"mds_decay_halflife": "5",
"mds_beacon_interval": "4",
"mds_beacon_grace": "15",
"mds_enforce_unique_name": "true",
"mds_blacklist_interval": "1440",
"mds_session_timeout": "120",
"mds_revoke_cap_timeout": "60",
"mds_recall_state_timeout": "60",
"mds_freeze_tree_timeout": "30",
"mds_session_autoclose": "600",
"mds_health_summarize_threshold": "10",
"mds_reconnect_timeout": "45",
"mds_tick_interval": "5",
"mds_dirstat_min_interval": "1",
"mds_scatter_nudge_interval": "5",
"mds_client_prealloc_inos": "1000",
"mds_early_reply": "true",
"mds_default_dir_hash": "2",
"mds_log": "true",
"mds_log_skip_corrupt_events": "false",
"mds_log_max_events": "-1",
"mds_log_events_per_segment": "1024",
"mds_log_segment_size": "0",
"mds_log_max_segments": "30",
"mds_log_max_expiring": "20",
"mds_bal_sample_interval": "3",
"mds_bal_replicate_threshold": "8000",
"mds_bal_unreplicate_threshold": "0",
"mds_bal_frag": "false",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"mds_bal_merge_size": "50",
"mds_bal_merge_rd": "1000",
"mds_bal_merge_wr": "1000",
"mds_bal_interval": "10",
"mds_bal_fragment_interval": "5",
"mds_bal_idle_threshold": "0",
"mds_bal_max": "-1",
"mds_bal_max_until": "-1",
"mds_bal_mode": "0",
"mds_bal_min_rebalance": "0.1",
"mds_bal_min_start": "0.2",
"mds_bal_need_min": "0.8",
"mds_bal_need_max": "1.2",
"mds_bal_midchunk": "0.3",
"mds_bal_minchunk": "0.001",
"mds_bal_target_removal_min": "5",
"mds_bal_target_removal_max": "10",
"mds_replay_interval": "1",
"mds_shutdown_check": "0",
"mds_thrash_exports": "0",
"mds_thrash_fragments": "0",
"mds_dump_cache_on_map": "false",
"mds_dump_cache_after_rejoin": "false",
"mds_verify_scatter": "false",
"mds_debug_scatterstat": "false",
"mds_debug_frag": "false",
"mds_debug_auth_pins": "false",
"mds_debug_subtrees": "false",
"mds_kill_mdstable_at": "0",
"mds_kill_export_at": "0",
"mds_kill_import_at": "0",
"mds_kill_link_at": "0",
"mds_kill_rename_at": "0",
"mds_kill_openc_at": "0",
"mds_kill_journal_at": "0",
"mds_kill_journal_expire_at": "0",
"mds_kill_journal_replay_at": "0",
"mds_journal_format": "1",
"mds_kill_create_at": "0",
"mds_inject_traceless_reply_probability": "0",
"mds_wipe_sessions": "false",
"mds_wipe_ino_prealloc": "false",
"mds_skip_ino": "0",
"max_mds": "1",
"mds_standby_for_name": "",
"mds_standby_for_rank": "-1",
"mds_standby_replay": "false",
"mds_enable_op_tracker": "true",
"mds_op_history_size": "20",
"mds_op_history_duration": "600",
"mds_op_complain

Re: [ceph-users] How to prefer faster disks in same pool

2015-07-09 Thread Robert LeBlanc
You could also create two roots and two rules and have the primary osd be
the 10k drives so that the 7.2k are used primarily for writes. I believe
that recipe is on the CRUSH page in the documentation.
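
A sketch of what such a rule could look like, assuming the two roots are named
fast (10k) and slow (7.2k); the first chooseleaf step picks the primary from the
fast root and the second fills the remaining replicas from the slow root:

rule fast-primary {
        ruleset 5
        type replicated
        min_size 1
        max_size 10
        step take fast
        step chooseleaf firstn 1 type host
        step emit
        step take slow
        step chooseleaf firstn -1 type host
        step emit
}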

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Jul 9, 2015 10:03 PM, "Alexandre DERUMIER"  wrote:

> Hi,
>
> you need to create 2 crushmaps, 1 for 10k && 1 for 7.2k disks.
>
> then create 2 pools, 1pool with crushmap1 and 1 pool with crushmap2.
>
> see :
>
> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>
> - Mail original -
> De: "Christoph Adomeit" 
> À: "ceph-users" 
> Envoyé: Vendredi 10 Juillet 2015 02:13:04
> Objet: [ceph-users] How to prefer faster disks in same pool
>
> Hi Guys,
>
> I have a ceph pool that is mixed with 10k rpm disks and 7.2 k rpm disks.
>
> There are 85 osds and 10 of them are 10k
> Size is not an issue, the pool is filled only 20%
>
> I want to somehow prefer the 10 k rpm disks so that they get more i/o
>
> What is the most intelligent wy to prefer the faster disks ?
> Just give them another weight or are there other methods ?
>
> Thanks
> Christoph
>
>
> --
> Christoph Adomeit
> GATWORKS GmbH
> Reststrauch 191
> 41199 Moenchengladbach
> Sitz: Moenchengladbach
> Amtsgericht Moenchengladbach, HRB 6303
> Geschaeftsfuehrer:
> Christoph Adomeit, Hans Wilhelm Terstappen
>
> christoph.adom...@gatworks.de Internetloesungen vom Feinsten
> Fon. +49 2166 9149-32 Fax. +49 2166 9149-10
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to prefer faster disks in same pool

2015-07-09 Thread Alexandre DERUMIER
Hi,

you need to create 2 crush rules (with separate roots), one for the 10k and one for the 7.2k disks.

then create 2 pools, one pool using rule 1 and one pool using rule 2.

see :
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
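
once both rules are in the crushmap, pointing the pools at them is just this (a
sketch; the ruleset ids and pg counts are placeholders, check the ids with
"ceph osd crush rule dump"):

ceph osd pool create pool-10k 1024 1024
ceph osd pool set pool-10k crush_ruleset 1
ceph osd pool create pool-7k2 1024 1024
ceph osd pool set pool-7k2 crush_ruleset 2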

- Mail original -
De: "Christoph Adomeit" 
À: "ceph-users" 
Envoyé: Vendredi 10 Juillet 2015 02:13:04
Objet: [ceph-users] How to prefer faster disks in same pool

Hi Guys, 

I have a ceph pool that is mixed with 10k rpm disks and 7.2 k rpm disks. 

There are 85 osds and 10 of them are 10k 
Size is not an issue, the pool is filled only 20% 

I want to somehow prefer the 10 k rpm disks so that they get more i/o 

What is the most intelligent wy to prefer the faster disks ? 
Just give them another weight or are there other methods ? 

Thanks 
Christoph 


-- 
Christoph Adomeit 
GATWORKS GmbH 
Reststrauch 191 
41199 Moenchengladbach 
Sitz: Moenchengladbach 
Amtsgericht Moenchengladbach, HRB 6303 
Geschaeftsfuehrer: 
Christoph Adomeit, Hans Wilhelm Terstappen 

christoph.adom...@gatworks.de Internetloesungen vom Feinsten 
Fon. +49 2166 9149-32 Fax. +49 2166 9149-10 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to prefer faster disks in same pool

2015-07-09 Thread Christoph Adomeit
Hi Guys,

I have a ceph pool that is mixed with 10k rpm disks and 7.2 k rpm disks.

There are 85 osds and 10 of them are 10k
Size is not an issue, the pool is filled only 20%

I want to somehow prefer the 10 k rpm disks so that they get more i/o

What is the most intelligent way to prefer the faster disks ?
Just give them another weight or are there other methods ?

Thanks
  Christoph


-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] External XFS Filesystem Journal on OSD

2015-07-09 Thread David Burley
On Thu, Jul 9, 2015 at 5:42 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Thanks for sharing this info. I've been toying with doing this very
> thing... How did you measure the performance? I'm specifically looking at
> reducing the IO load on my spinners and it seems the xfs journaling process
> is eating a lot of my IO. My queues on my OSD drives frequently get into
> the 500 ballpark which makes for sad VMs.
>
>
ceph tell osd.N bench, and also via some mixed-IO fio runs on the OSD partition
while the OSD it hosted was offline.
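
A sketch of the kind of commands that can be used for this (osd.0 and the data
directory are examples; the exact fio parameters here are assumptions, and the
fio run should only happen while that OSD is stopped):

# per-OSD write benchmark through the daemon itself: 1 GB in 4 MB writes
ceph tell osd.0 bench 1073741824 4194304
# mixed-IO fio run directly against the offline OSD's data partition
fio --name=mixed --directory=/var/lib/ceph/osd/ceph-0 --size=4G \
    --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio --iodepth=16 \
    --direct=1 --time_based --runtime=60 --group_reporting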

-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Quentin Hartman
So, I was running with size=2 until we had a network interface on an OSD node go
faulty and start corrupting data. Because ceph couldn't tell which copy was
right, it caused all sorts of trouble. I might have been able to recover more
gracefully had I caught the problem sooner and been able to identify the root
cause right away, but as it was, we ended up labeling every VM in the cluster
suspect, destroying the whole thing, and restoring from backups. I didn't manage
to find the root of the problem until I was rebuilding the cluster and noticed
one node "felt weird" when I was ssh'd into it. It was painful.

We are currently running "important" VMs from a ceph pool with size=3, and more
disposable ones from a size=2 pool, and that seems to be a reasonable tradeoff
so far, giving us a bit more IO headroom than we would have running 3 for
everything, but still having safety where we need it.
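
Setting that up is just a per-pool knob (pool names here are examples):

# 3 copies for the pool backing the important VMs
ceph osd pool set vms-important size 3
ceph osd pool set vms-important min_size 2
# 2 copies for the disposable ones
ceph osd pool set vms-scratch size 2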

QH

On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke <
goetz.reini...@filmakademie.de> wrote:

> Hi Warren,
>
> thanks for that feedback. regarding the 2 or 3 copies we had a lot of
> internal discussions and lots of pros and cons on 2 and 3 :) … and finally
> decided to give 2 copies in the first - now called evaluation cluster - a
> chance to prove.
>
> I bet in 2016 we will see, if that was a good decision or bad and data los
> is in that scenario ok. We evaluate. :)
>
> Regarding one P3700 for 12 SATA disks I do get it right, that if that
> P3700 fails all 12 OSDs are lost… ? So that looks like a bigger risk to me
> from my current knowledge. Or are the P3700 so much more reliable than the
> eg. S3500 or S3700?
>
> Or is the suggestion with the P3700 if we go in the direction of 20+ nodes
> and till than stay without SSDs for journaling.
>
> I really appreciate your thoughts and feedback and I’m aware of the fact
> that building a ceph cluster is some sort of knowing the specs,
> configuration option, math, experience, modification and feedback from best
> practices real world clusters. Finally all clusters are unique in some way
> and what works for one will not work for an other.
>
> Thanks for feedback, 100 kowtows . Götz
>
>
>
> > Am 09.07.2015 um 16:58 schrieb Wang, Warren <
> warren_w...@cable.comcast.com>:
> >
> > You'll take a noticeable hit on write latency. Whether or not it's
> tolerable will be up to you and the workload you have to capture. Large
> file operations are throughput efficient without an SSD journal, as long as
> you have enough spindles.
> >
> > About the Intel P3700, you will only need 1 to keep up with 12 SATA
> drives. The 400 GB is probably okay if you keep the journal sizes small,
> but the 800 is probably safer if you plan on leaving these in production
> for a few years. Depends on the turnover of data on the servers.
> >
> > The dual disk failure comment is pointing out that you are more exposed
> for data loss with 2 copies. You do need to understand that there is a
> possibility for 2 drives to fail either simultaneously, or one before the
> cluster is repaired. As usual, this is going to be a decision you need to
> decide if it's acceptable or not. We have many clusters, and some are 2,
> and others are 3. If your data resides nowhere else, then 3 copies is the
> safe thing to do. That's getting harder and harder to justify though, when
> the price of other storage solutions using erasure coding continues to
> plummet.
> >
> > Warren
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Götz Reinicke - IT Koordinator
> > Sent: Thursday, July 09, 2015 4:47 AM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Real world benefit from SSD Journals for a
> more read than write cluster
> >
> > Hi Christian,
> > Am 09.07.15 um 09:36 schrieb Christian Balzer:
> >>
> >> Hello,
> >>
> >> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
> >>
> >>> Hi again,
> >>>
> >>> time is passing, so is my budget :-/ and I have to recheck the
> >>> options for a "starter" cluster. An expansion next year for may be an
> >>> openstack installation or more performance if the demands rise is
> >>> possible. The "starter" could always be used as test or slow dark
> archive.
> >>>
> >>> At the beginning I was at 16SATA OSDs with 4 SSDs for journal per
> >>> node, but now I'm looking for 12 SATA OSDs without SSD journal. Less
> >>> performance, less capacity I know. But thats ok!
> >>>
> >> Leave the space to upgrade these nodes with SSDs in the future.
> >> If your cluster grows large enough (more than 20 nodes) even a single
> >> P3700 might do the trick and will need only a PCIe slot.
> >
> > If I get you right, the 12Disk is not a bad idea, if there would be the
> need of SSD Journal I can add the PCIe P3700.
> >
> > In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.
> >
> > God or bad idea?
> >
> >>
> >>> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
> >>>
> >> Danger, Will Robinson.

Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Götz Reinicke
Hi Warren,

thanks for that feedback. Regarding the 2 or 3 copies, we had a lot of internal
discussions and lots of pros and cons on 2 and 3 :) … and finally decided to
give 2 copies in the first - now called evaluation - cluster a chance to prove itself.

I bet in 2016 we will see if that was a good or bad decision; data loss is in
that scenario OK. We evaluate. :)

Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700
fails, all 12 OSDs are lost…? That looks like a bigger risk to me from my
current knowledge. Or are the P3700 so much more reliable than e.g. the S3500 or
S3700?

Or is the suggestion to add the P3700 only if we go in the direction of 20+ nodes,
and until then stay without SSDs for journaling?

I really appreciate your thoughts and feedback, and I'm aware of the fact that
building a ceph cluster is some sort of mix of knowing the specs, configuration
options, math, experience, modification and feedback from best-practice
real-world clusters. Finally, all clusters are unique in some way and what works
for one will not work for another.

Thanks for the feedback, 100 kowtows. Götz


 
> Am 09.07.2015 um 16:58 schrieb Wang, Warren :
> 
> You'll take a noticeable hit on write latency. Whether or not it's tolerable 
> will be up to you and the workload you have to capture. Large file operations 
> are throughput efficient without an SSD journal, as long as you have enough 
> spindles.
> 
> About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. 
> The 400 GB is probably okay if you keep the journal sizes small, but the 800 
> is probably safer if you plan on leaving these in production for a few years. 
> Depends on the turnover of data on the servers.
> 
> The dual disk failure comment is pointing out that you are more exposed for 
> data loss with 2 copies. You do need to understand that there is a 
> possibility for 2 drives to fail either simultaneously, or one before the 
> cluster is repaired. As usual, this is going to be a decision you need to 
> decide if it's acceptable or not. We have many clusters, and some are 2, and 
> others are 3. If your data resides nowhere else, then 3 copies is the safe 
> thing to do. That's getting harder and harder to justify though, when the 
> price of other storage solutions using erasure coding continues to plummet.
> 
> Warren
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz 
> Reinicke - IT Koordinator
> Sent: Thursday, July 09, 2015 4:47 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more 
> read than write cluster
> 
> Hi Christian,
> Am 09.07.15 um 09:36 schrieb Christian Balzer:
>> 
>> Hello,
>> 
>> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
>> 
>>> Hi again,
>>> 
>>> time is passing, so is my budget :-/ and I have to recheck the 
>>> options for a "starter" cluster. An expansion next year for may be an 
>>> openstack installation or more performance if the demands rise is 
>>> possible. The "starter" could always be used as test or slow dark archive.
>>> 
>>> At the beginning I was at 16SATA OSDs with 4 SSDs for journal per 
>>> node, but now I'm looking for 12 SATA OSDs without SSD journal. Less 
>>> performance, less capacity I know. But thats ok!
>>> 
>> Leave the space to upgrade these nodes with SSDs in the future.
>> If your cluster grows large enough (more than 20 nodes) even a single
>> P3700 might do the trick and will need only a PCIe slot.
> 
> If I get you right, the 12Disk is not a bad idea, if there would be the need 
> of SSD Journal I can add the PCIe P3700.
> 
> In the 12 OSD Setup I should get 2 P3700 one per 6 OSDs.
> 
> God or bad idea?
> 
>> 
>>> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
>>> 
>> Danger, Will Robinson.
>> This is essentially a RAID5 and you're plain asking for a double disk 
>> failure to happen.
> 
> May be I do not understand that. size = 2 I think is more sort of raid1 ... ? 
> And why am I asking for for a double disk failure?
> 
> To less nodes, OSDs or because of the size = 2.
> 
>> 
>> See this recent thread:
>> "calculating maximum number of disk and node failure that can be 
>> handled by cluster with out data loss"
>> for some discussion and python script which you will need to modify 
>> for
>> 2 disk replication.
>> 
>> With a RAID5 failure calculator you're at 1 data loss event per 3.5 
>> years...
>> 
> 
> Thanks for that thread, but I dont get the point out of it for me.
> 
> I see that calculating the reliability is some sort of complex math ...
> 
>>> The workload I expect is more writes of may be some GB of Office 
>>> files per day and some TB of larger video Files from a few users per week.
>>> 
>>> At the end of this year we calculate to have +- 60 to 80 TB of lager 
>>> videofiles in that cluster, which are accessed from time to time.
>>> 
>>> Any suggestion on the drop of ssd journa

Re: [ceph-users] External XFS Filesystem Journal on OSD

2015-07-09 Thread Quentin Hartman
Thanks for sharing this info. I've been toying with doing this very
thing... How did you measure the performance? I'm specifically looking at
reducing the IO load on my spinners and it seems the xfs journaling process
is eating a lot of my IO. My queues on my OSD drives frequently get into
the 500 ballpark which makes for sad VMs.
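
For reference, per-device queue length and latency can be watched with iostat
from sysstat (sdb here is an example OSD data disk; avgqu-sz is the queue length
and await the per-IO latency in ms):

# extended device statistics for one disk, refreshed every 5 seconds
iostat -x sdb 5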

QH

On Thu, Jul 9, 2015 at 12:05 PM, David Burley 
wrote:

> Converted a few of our OSD's (spinners) over to a config where the OSD
> journal and XFS journal both live on an NVMe drive (Intel P3700). The XFS
> journal might have provided some very minimal performance gains (3%,
> maybe). Given the low gains, we're going to reject this as something to dig
> into deeper and stick with the simpler configuration of just using the NVMe
> drives for OSD journaling and leave the XFS journals on the partition.
>
> --David
>
> On Thu, Jun 4, 2015 at 2:23 PM, Lars Marowsky-Bree  wrote:
>
>> On 2015-06-04T12:42:42, David Burley  wrote:
>>
>> > Are there any safety/consistency or other reasons we wouldn't want to
>> try
>> > using an external XFS log device for our OSDs? I realize if that device
>> > fails the filesystem is pretty much lost, but beyond that?
>>
>> I think with the XFS journal on the same SSD as ceph's OSD journal, that
>> could be a quite interesting setup. Please share performance numbers!
>>
>> I've been meaning to benchmark bcache in front of the OSD backend,
>> especially for SMRs, but haven't gotten around to it yet.
>>
>>
>> Regards,
>> Lars
>>
>> --
>> Architect Storage/HA
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu,
>> Graham Norton, HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> David Burley
> NOC Manager, Sr. Systems Programmer/Analyst
> Slashdot Media
>
> e: da...@slashdotmedia.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor questions

2015-07-09 Thread Quentin Hartman
I have my mons sharing the ceph network, and while I currently do not run
mds or rgw, I have run those on my mon hosts in the past with no
perceptible ill effects.

On Thu, Jul 9, 2015 at 3:20 PM, Nate Curry  wrote:

> I have a question in regards to monitor nodes and network layout.  Its my
> understanding that there should be two networks; a ceph only network for
> comms between the various ceph nodes, and a separate storage network where
> other systems will interface with the ceph nodes.  Are the monitor nodes
> supposed to straddle both the ceph only network and the storage network or
> just in the ceph network?
>
> Another question is can I run multiple things on the monitor nodes?  Like
> the RADOS GW and the MDS?
>
>
> Thanks,
>
> *Nate Curry*
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor questions

2015-07-09 Thread Nate Curry
I have a question in regards to monitor nodes and network layout.  It's my
understanding that there should be two networks; a ceph only network for
comms between the various ceph nodes, and a separate storage network where
other systems will interface with the ceph nodes.  Are the monitor nodes
supposed to straddle both the ceph only network and the storage network or
just in the ceph network?

Another question is can I run multiple things on the monitor nodes?  Like
the RADOS GW and the MDS?


Thanks,

*Nate Curry*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace OSD disk without removing the osd from crush

2015-07-09 Thread Stefan Priebe


Am 09.07.2015 um 19:35 schrieb Wido den Hollander:

On 07/09/2015 09:15 AM, Stefan Priebe - Profihost AG wrote:


Am 08.07.2015 um 23:33 schrieb Somnath Roy:

Yes, I am able to reproduce that too..Not sure if this is a bug or change.


That's odd. Can someone from inktank comment?




Not from Inktank, but here we go.

When you add an OSD again it has to match the OSD's UUID as in the OSDMap.

So when running mkfs for the OSD run it like this:

$ ceph-osd -i  --mkfs --mkjournal --keyring /path/to/keyring
--osd-uuid 

You can find the UUID in the OSDMap:

$ ceph osd dump|grep osd\.

At the end you'll find the UUID of that OSD.

Without a matching UUID the OSD will refuse to start.

The OSD stores its UUID in the datadir in the 'fsid' file.


Great!!

Thanks!

Stefan



Wido


Thanks & Regards
Somnath

-Original Message-
From: Stefan Priebe [mailto:s.pri...@profihost.ag]
Sent: Wednesday, July 08, 2015 1:09 PM
To: Somnath Roy; ceph-users
Subject: Re: [ceph-users] replace OSD disk without removing the osd from crush

Hi,
Am 08.07.2015 um 22:03 schrieb Somnath Roy:

Run 'ceph osd set noout' before replacing


sure but that didn't worked since firefly for me.

I did:
# set noout
# ceph stop osd.5
# removed disk
# inserted new disk
# format disk and mount disk
# start mkjournal mkkey mkkfs
# remove old osd auth key add new key

I can start the osd but it never comes up.

It only works for me if i completely remove the osd and create a new one:
ceph osd crush remove osd.5
ceph auth del osd.5
ceph osd rm osd.5

ceph osd create
...

Stefan


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stefan 
Priebe
Sent: Wednesday, July 08, 2015 12:58 PM
To: ceph-users
Subject: [ceph-users] replace OSD disk without removing the osd from crush

Hi,

is there any way to replace an osd disk without removing the osd from crush, 
auth, ...

Just recreate the same OSD?

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] External XFS Filesystem Journal on OSD

2015-07-09 Thread David Burley
Converted a few of our OSD's (spinners) over to a config where the OSD
journal and XFS journal both live on an NVMe drive (Intel P3700). The XFS
journal might have provided some very minimal performance gains (3%,
maybe). Given the low gains, we're going to reject this as something to dig
into deeper and stick with the simpler configuration of just using the NVMe
drives for OSD journaling and leave the XFS journals on the partition.

--David

On Thu, Jun 4, 2015 at 2:23 PM, Lars Marowsky-Bree  wrote:

> On 2015-06-04T12:42:42, David Burley  wrote:
>
> > Are there any safety/consistency or other reasons we wouldn't want to try
> > using an external XFS log device for our OSDs? I realize if that device
> > fails the filesystem is pretty much lost, but beyond that?
>
> I think with the XFS journal on the same SSD as ceph's OSD journal, that
> could be a quite interesting setup. Please share performance numbers!
>
> I've been meaning to benchmark bcache in front of the OSD backend,
> especially for SMRs, but haven't gotten around to it yet.
>
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu,
> Graham Norton, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace OSD disk without removing the osd from crush

2015-07-09 Thread Wido den Hollander
On 07/09/2015 09:15 AM, Stefan Priebe - Profihost AG wrote:
> 
> Am 08.07.2015 um 23:33 schrieb Somnath Roy:
>> Yes, I am able to reproduce that too..Not sure if this is a bug or change.
> 
> That's odd. Can someone from inktank comment?
> 
> 

Not from Inktank, but here we go.

When you add an OSD again it has to match the OSD's UUID as in the OSDMap.

So when running mkfs for the OSD run it like this:

$ ceph-osd -i  --mkfs --mkjournal --keyring /path/to/keyring
--osd-uuid 

You can find the UUID in the OSDMap:

$ ceph osd dump|grep osd\.

At the end you'll find the UUID of that OSD.

Without a matching UUID the OSD will refuse to start.

The OSD stores its UUID in the datadir in the 'fsid' file.
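
Put together, replacing the disk behind e.g. osd.5 while keeping its id then
looks roughly like this sketch (paths assume the default /var/lib/ceph layout;
the UUID placeholder comes from the osd dump above):

# note the UUID at the end of the osd.5 line
ceph osd dump | grep osd.5
# ... partition, mkfs and mount the new disk on /var/lib/ceph/osd/ceph-5 ...
ceph-osd -i 5 --mkfs --mkkey --mkjournal --osd-uuid <uuid-from-osd-dump>
ceph auth del osd.5
ceph auth add osd.5 mon 'allow profile osd' osd 'allow *' -i /var/lib/ceph/osd/ceph-5/keyring
start ceph-osd id=5    # upstart; or: service ceph start osd.5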

Wido

>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: Stefan Priebe [mailto:s.pri...@profihost.ag] 
>> Sent: Wednesday, July 08, 2015 1:09 PM
>> To: Somnath Roy; ceph-users
>> Subject: Re: [ceph-users] replace OSD disk without removing the osd from 
>> crush
>>
>> Hi,
>> Am 08.07.2015 um 22:03 schrieb Somnath Roy:
>>> Run 'ceph osd set noout' before replacing
>>
>> sure but that didn't worked since firefly for me.
>>
>> I did:
>> # set noout
>> # ceph stop osd.5
>> # removed disk
>> # inserted new disk
>> # format disk and mount disk
>> # start mkjournal mkkey mkkfs
>> # remove old osd auth key add new key
>>
>> I can start the osd but i never comes up.
>>
>> It only works for me if i completely remove the osd and create a new one:
>> ceph osd crush remove osd.5
>> ceph auth del osd.5
>> ceph osd rm osd.5
>>
>> ceph osd create
>> ...
>>
>> Stefan
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Stefan Priebe
>>> Sent: Wednesday, July 08, 2015 12:58 PM
>>> To: ceph-users
>>> Subject: [ceph-users] replace OSD disk without removing the osd from crush
>>>
>>> Hi,
>>>
>>> is there any way to replace an osd disk without removing the osd from 
>>> crush, auth, ...
>>>
>>> Just recreate the same OSD?
>>>
>>> Stefan
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> 
>>>
>>> PLEASE NOTE: The information contained in this electronic mail message is 
>>> intended only for the use of the designated recipient(s) named above. If 
>>> the reader of this message is not the intended recipient, you are hereby 
>>> notified that you have received this message in error and that any review, 
>>> dissemination, distribution, or copying of this message is strictly 
>>> prohibited. If you have received this communication in error, please notify 
>>> the sender by telephone or e-mail (as shown above) immediately and destroy 
>>> any and all copies of this message in your possession (whether hard copies 
>>> or electronically stored copies).
>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Alexandre DERUMIER
>>That’s very strange. Is nothing else using the disks?
no. only the fio benchmark.

>>The difference between noop and cfq should be (and in my experience is) 
>>marginal for such a benchmark.
maybe a bug in cfq (kernel 3.16, Debian Jessie)? Also, the deadline scheduler
gives me the same performance as noop.
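
For completeness, a sketch of the kind of fio invocation behind these numbers
(the file path is an example; libaio with sync=1 is an assumption, it is one way
to combine iodepth=32 with synchronous writes):

# iodepth=1 case; use --iodepth=32 for the second run
fio --name=seq-sync-4k --filename=/mnt/ssd/fio.dat --size=4G \
    --rw=write --bs=4k --ioengine=libaio --iodepth=1 \
    --direct=1 --sync=1 --time_based --runtime=60 --group_reporting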


- Mail original -
De: "Jan Schermer" 
À: "aderumier" 
Cc: "Somnath Roy" , "ceph-users" 

Envoyé: Jeudi 9 Juillet 2015 18:20:51
Objet: Re: [ceph-users] Investigating my 100 IOPS limit

That’s very strange. Is nothing else using the disks? 
The difference between noop and cfq should be (and in my experience is) 
marginal for such a benchmark. 

Jan 


> On 09 Jul 2015, at 18:11, Alexandre DERUMIER  wrote: 
> 
> Hi again, 
> 
> I totally forgot to check the io scheduler from my last tests, this was with 
> cfq. 
> 
> with noop scheduler, I have a huge difference 
> 
> cfq: 
> 
> - sequential syncronous 4k write iodepth=1 : 60 iops 
> - sequential syncronous 4k write iodepth=32 : 2000 iops 
> 
> 
> noop: 
> 
> - sequential syncronous 4k write iodepth=1 : 7866 iops 
> - sequential syncronous 4k write iodepth=32 : 34303 iops 
> 
> 
> - Mail original - 
> De: "Somnath Roy"  
> À: "Jan Schermer" , "aderumier"  
> Cc: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 17:46:41 
> Objet: RE: [ceph-users] Investigating my 100 IOPS limit 
> 
> I am not sure how increasing iodepth for sync write is giving you better 
> result..sync fio engine supposed to be always using iodepth =1. 
> BTW, I faced similar issues sometimes back,..By running the following fio job 
> file, I was getting very dismal performance on my SSD on top of XFS.. 
> 
> [random-write] 
> directory=/mnt/fio_test 
> rw=randwrite 
> bs=16k 
> direct=1 
> sync=1 
> time_based 
> runtime=1m 
> size=700G 
> group_reporting 
> 
> Result : 
>  
> IOPS = 420 
> 
> lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01% 
> lat (msec) : 2=20.05%, 4=46.64%, 10=8.68% 
> 
> 
> Turned out that is a SSD FW problem...Some SSDs tend to misbehave in this 
> pattern (even directly with block device, without any XFS) because they don't 
> handle O_DIRECT|O_SYNC writes well..I am sure you will find some reference by 
> digging into ceph mail list. That's why not all SSDs behave well with Ceph 
> journal.. 
> 
> Thanks & Regards 
> Somnath 
> 
> -Original Message- 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer 
> Sent: Thursday, July 09, 2015 8:24 AM 
> To: Alexandre DERUMIER 
> Cc: ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
> Those are very strange numbers. Is the “60” figure right? 
> 
> Can you paste the full fio command and output? 
> Thanks 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:58, Alexandre DERUMIER  wrote: 
>> 
>> I just tried on an intel s3700, on top of xfs 
>> 
>> fio , with 
>> - sequential syncronous 4k write iodepth=1 : 60 iops 
>> - sequential syncronous 4k write iodepth=32 : 2000 iops 
>> - random syncronous 4k write, iodepth=1 : 8000iops 
>> - random syncronous 4k write iodepth=32 : 18000 iops 
>> 
>> 
>> 
>> - Mail original - 
>> De: "aderumier"  
>> À: "Jan Schermer"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 15:50:35 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
 Any ideas where to look? I was hoping blktrace would show what 
 exactly is going on, but it just shows a synchronous write -> (10ms) 
 -> completed 
>> 
>> which size is the write in this case ? 4K ? or more ? 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "aderumier"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 15:29:15 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> I tried everything: —write-barrier, —sync —fsync, —fdatasync I never 
>> get the same 10ms latency. Must be something the filesystem journal/log does 
>> that is special. 
>> 
>> Any ideas where to look? I was hoping blktrace would show what exactly 
>> is going on, but it just shows a synchronous write -> (10ms) -> 
>> completed 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
>>> 
> I have 12K IOPS in this test on the block device itself. But only 
> 100 filesystem transactions (=IOPS) on filesystem on the same 
> device because the “flush” (=FUA?) operation takes 10ms to finish. 
> I just can’t replicate the >>same “flush” operation with fio on the 
> block device, unfortunately, so I have no idea what is causing that 
> :/ 
>>> 
>>> AFAIK, with fio on block device with --sync=1, is doing flush after each 
>>> write. 
>>> 
>>> I'm not sure with fio on a filesystem, but filesystem should do a fsync 
>>> after file write. 
>>> 
>>> 
>>> - Mail original - 
>>> De: "Jan Schermer"  
>>> À: "aderumier"  
>>> Cc: "ceph-users"  
>>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>>> 
>>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>>> and higher have it for sure. 

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Jan Schermer
That’s very strange. Is nothing else using the disks?
The difference between noop and cfq should be (and in my experience is) 
marginal for such a benchmark.

Jan


> On 09 Jul 2015, at 18:11, Alexandre DERUMIER  wrote:
> 
> Hi again,
> 
> I totally forgot to check the io scheduler from my last tests, this was with 
> cfq.
> 
> with noop scheduler, I have a huge difference
> 
> cfq:
> 
> - sequential syncronous 4k write iodepth=1 : 60 iops 
> - sequential syncronous 4k write iodepth=32 : 2000 iops 
> 
> 
> noop:
> 
> - sequential syncronous 4k write iodepth=1 : 7866 iops 
> - sequential syncronous 4k write iodepth=32 : 34303 iops 
> 
> 
> - Mail original -
> De: "Somnath Roy" 
> À: "Jan Schermer" , "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 17:46:41
> Objet: RE: [ceph-users] Investigating my 100 IOPS limit
> 
> I am not sure how increasing iodepth for sync write is giving you better 
> result..sync fio engine supposed to be always using iodepth =1. 
> BTW, I faced similar issues sometimes back,..By running the following fio job 
> file, I was getting very dismal performance on my SSD on top of XFS.. 
> 
> [random-write] 
> directory=/mnt/fio_test 
> rw=randwrite 
> bs=16k 
> direct=1 
> sync=1 
> time_based 
> runtime=1m 
> size=700G 
> group_reporting 
> 
> Result : 
>  
> IOPS = 420 
> 
> lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01% 
> lat (msec) : 2=20.05%, 4=46.64%, 10=8.68% 
> 
> 
> Turned out that is a SSD FW problem...Some SSDs tend to misbehave in this 
> pattern (even directly with block device, without any XFS) because they don't 
> handle O_DIRECT|O_SYNC writes well..I am sure you will find some reference by 
> digging into ceph mail list. That's why not all SSDs behave well with Ceph 
> journal.. 
> 
> Thanks & Regards 
> Somnath 
> 
> -Original Message- 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer 
> Sent: Thursday, July 09, 2015 8:24 AM 
> To: Alexandre DERUMIER 
> Cc: ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
> Those are very strange numbers. Is the “60” figure right? 
> 
> Can you paste the full fio command and output? 
> Thanks 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:58, Alexandre DERUMIER  wrote: 
>> 
>> I just tried on an intel s3700, on top of xfs 
>> 
>> fio , with 
>> - sequential syncronous 4k write iodepth=1 : 60 iops 
>> - sequential syncronous 4k write iodepth=32 : 2000 iops 
>> - random syncronous 4k write, iodepth=1 : 8000iops 
>> - random syncronous 4k write iodepth=32 : 18000 iops 
>> 
>> 
>> 
>> - Mail original - 
>> De: "aderumier"  
>> À: "Jan Schermer"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 15:50:35 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
 Any ideas where to look? I was hoping blktrace would show what 
 exactly is going on, but it just shows a synchronous write -> (10ms) 
 -> completed 
>> 
>> which size is the write in this case ? 4K ? or more ? 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "aderumier"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 15:29:15 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> I tried everything: —write-barrier, —sync —fsync, —fdatasync I never 
>> get the same 10ms latency. Must be something the filesystem journal/log does 
>> that is special. 
>> 
>> Any ideas where to look? I was hoping blktrace would show what exactly 
>> is going on, but it just shows a synchronous write -> (10ms) -> 
>> completed 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
>>> 
> I have 12K IOPS in this test on the block device itself. But only 
> 100 filesystem transactions (=IOPS) on filesystem on the same 
> device because the “flush” (=FUA?) operation takes 10ms to finish. 
> I just can’t replicate the >>same “flush” operation with fio on the 
> block device, unfortunately, so I have no idea what is causing that 
> :/ 
>>> 
>>> AFAIK, with fio on block device with --sync=1, is doing flush after each 
>>> write. 
>>> 
>>> I'm not sure with fio on a filesystem, but filesystem should do a fsync 
>>> after file write. 
>>> 
>>> 
>>> - Mail original - 
>>> De: "Jan Schermer"  
>>> À: "aderumier"  
>>> Cc: "ceph-users"  
>>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>>> 
>>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>>> and higher have it for sure. 
>>> 
>>> I have 12K IOPS in this test on the block device itself. But only 100 
>>> filesystem transactions (=IOPS) on filesystem on the same device 
>>> because the “flush” (=FUA?) operation takes 10ms to finish. I just 
>>> can’t replicate the same “flush” operation with fio on the block 
>>> device, unfortunately, so I have no idea what is causing that :/ 
>>> 
>>> Jan 
>>> 
 On 09 Jul 2015, at 14:08, Alexand

[ceph-users] Ceph Read Performance Issues

2015-07-09 Thread Garg, Pankaj
Hi,
I'm experiencing READ performance issues in my Cluster. I have 3 x86 servers 
each with 2 SSDs and 9 OSDs. SSDs are being used for Journaling.
I seem to get erratic READ performance numbers when using Rados Bench read test.

I ran a test with just a single x86 server, with 2 SSDs, and 9 OSDS. Pool had 
replication factor of 3.

Write  Bandwidth : 527 MB/Sec (rados bench write with default options)

Read Bandwidth : Run 1 : 201 MB/Sec (rados bench read with same pool as write).

Read Bandwidth : Run 2 : 381 MB/Sec

Read Bandwidth : Run 3 : 482 MB/Sec

In Run 2 and Run 3 : I start off with 1100 MB/Sec (basically maxing out my 10G 
Link), but by the time the 60 second read test finishes, the bandwidth drops to 
400 MB/Sec range.

Any ideas as to what might be going wrong? Overall my writes are faster than my
reads, which doesn't seem correct.
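
A sketch of a repeatable way to run this (pool name and thread count are
examples; the cache-drop step is a suggestion to rule out page-cache effects,
not part of the original runs):

# write for 60s and keep the objects so they can be read back
rados bench -p bench 60 write --no-cleanup -t 16
# on each OSD node: drop the page cache so reads actually hit the disks
sync; echo 3 > /proc/sys/vm/drop_caches
# sequential read benchmark against the objects written above
rados bench -p bench 60 seq -t 16
# remove the benchmark objects afterwards
rados -p bench cleanup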


Thanks
Pankaj




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Alexandre DERUMIER
Hi again,

I totally forgot to check the io scheduler from my last tests, this was with 
cfq.

with noop scheduler, I have a huge difference

cfq:

- sequential synchronous 4k write iodepth=1 : 60 iops 
- sequential synchronous 4k write iodepth=32 : 2000 iops 


noop:

- sequential synchronous 4k write iodepth=1 : 7866 iops 
- sequential synchronous 4k write iodepth=32 : 34303 iops 


- Mail original -
De: "Somnath Roy" 
À: "Jan Schermer" , "aderumier" 
Cc: "ceph-users" 
Envoyé: Jeudi 9 Juillet 2015 17:46:41
Objet: RE: [ceph-users] Investigating my 100 IOPS limit

I am not sure how increasing iodepth for sync write is giving you better 
result..sync fio engine supposed to be always using iodepth =1. 
BTW, I faced similar issues sometimes back,..By running the following fio job 
file, I was getting very dismal performance on my SSD on top of XFS.. 

[random-write] 
directory=/mnt/fio_test 
rw=randwrite 
bs=16k 
direct=1 
sync=1 
time_based 
runtime=1m 
size=700G 
group_reporting 

Result : 
 
IOPS = 420 

lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01% 
lat (msec) : 2=20.05%, 4=46.64%, 10=8.68% 


Turned out that is a SSD FW problem...Some SSDs tend to misbehave in this 
pattern (even directly with block device, without any XFS) because they don't 
handle O_DIRECT|O_SYNC writes well..I am sure you will find some reference by 
digging into ceph mail list. That's why not all SSDs behave well with Ceph 
journal.. 

Thanks & Regards 
Somnath 

-Original Message- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer 
Sent: Thursday, July 09, 2015 8:24 AM 
To: Alexandre DERUMIER 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Investigating my 100 IOPS limit 

Those are very strange numbers. Is the “60” figure right? 

Can you paste the full fio command and output? 
Thanks 

Jan 

> On 09 Jul 2015, at 15:58, Alexandre DERUMIER  wrote: 
> 
> I just tried on an intel s3700, on top of xfs 
> 
> fio , with 
> - sequential syncronous 4k write iodepth=1 : 60 iops 
> - sequential syncronous 4k write iodepth=32 : 2000 iops 
> - random syncronous 4k write, iodepth=1 : 8000iops 
> - random syncronous 4k write iodepth=32 : 18000 iops 
> 
> 
> 
> - Mail original - 
> De: "aderumier"  
> À: "Jan Schermer"  
> Cc: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 15:50:35 
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
>>> Any ideas where to look? I was hoping blktrace would show what 
>>> exactly is going on, but it just shows a synchronous write -> (10ms) 
>>> -> completed 
> 
> which size is the write in this case ? 4K ? or more ? 
> 
> 
> - Mail original - 
> De: "Jan Schermer"  
> À: "aderumier"  
> Cc: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 15:29:15 
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
> I tried everything: —write-barrier, —sync —fsync, —fdatasync I never 
> get the same 10ms latency. Must be something the filesystem journal/log does 
> that is special. 
> 
> Any ideas where to look? I was hoping blktrace would show what exactly 
> is going on, but it just shows a synchronous write -> (10ms) -> 
> completed 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
>> 
 I have 12K IOPS in this test on the block device itself. But only 
 100 filesystem transactions (=IOPS) on filesystem on the same 
 device because the “flush” (=FUA?) operation takes 10ms to finish. 
 I just can’t replicate the >>same “flush” operation with fio on the 
 block device, unfortunately, so I have no idea what is causing that 
 :/ 
>> 
>> AFAIK, with fio on block device with --sync=1, is doing flush after each 
>> write. 
>> 
>> I'm not sure with fio on a filesystem, but filesystem should do a fsync 
>> after file write. 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "aderumier"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>> and higher have it for sure. 
>> 
>> I have 12K IOPS in this test on the block device itself. But only 100 
>> filesystem transactions (=IOPS) on filesystem on the same device 
>> because the “flush” (=FUA?) operation takes 10ms to finish. I just 
>> can’t replicate the same “flush” operation with fio on the block 
>> device, unfortunately, so I have no idea what is causing that :/ 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote: 
>>> 
>>> Hi, 
>>> I have already see bad performance with Crucial m550 ssd, 400 iops 
>>> syncronous write. 
>>> 
>>> Not sure what model of ssd do you have ? 
>>> 
>>> see this: 
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your 
>>> -ssd-is-suitable-as-a-journal-device/ 
>>> 
>>> what is your result of disk directly with 
>>> 
>>> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync 
>>>

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Somnath Roy
I am not sure how increasing iodepth for a sync write is giving you a better 
result... the sync fio engine is supposed to always use iodepth=1.
BTW, I faced similar issues some time back. Running the following fio job 
file, I was getting very dismal performance on my SSD on top of XFS:

[random-write]
directory=/mnt/fio_test
rw=randwrite
bs=16k
direct=1
sync=1
time_based
runtime=1m
size=700G
group_reporting

Result :

IOPS = 420

lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01%
   lat (msec) : 2=20.05%, 4=46.64%, 10=8.68%
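
For reference, the same job can also be expressed as a single fio command line (a 
sketch of the equivalent invocation; the directory and size are whatever matches 
your environment):

fio --name=random-write --directory=/mnt/fio_test --rw=randwrite --bs=16k \
    --direct=1 --sync=1 --time_based --runtime=1m --size=700G --group_reporting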


Turned out that it is an SSD firmware problem... Some SSDs tend to misbehave in this 
pattern (even directly on the block device, without any XFS) because they don't 
handle O_DIRECT|O_SYNC writes well. I am sure you will find some references by 
digging into the ceph mailing list. That's why not all SSDs behave well as a Ceph 
journal.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer
Sent: Thursday, July 09, 2015 8:24 AM
To: Alexandre DERUMIER
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Investigating my 100 IOPS limit

Those are very strange numbers. Is the “60” figure right?

Can you paste the full fio command and output?
Thanks

Jan

> On 09 Jul 2015, at 15:58, Alexandre DERUMIER  wrote:
>
> I just tried on an intel s3700, on top of xfs
>
> fio , with
> - sequential syncronous 4k write iodepth=1  : 60 iops
> - sequential syncronous 4k write iodepth=32 : 2000 iops
> - random syncronous 4k write, iodepth=1 : 8000iops
> - random syncronous 4k write iodepth=32 : 18000 iops
>
>
>
> - Mail original -
> De: "aderumier" 
> À: "Jan Schermer" 
> Cc: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 15:50:35
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>
>>> Any ideas where to look? I was hoping blktrace would show what
>>> exactly is going on, but it just shows a synchronous write -> (10ms)
>>> -> completed
>
> which size is the write in this case ? 4K ? or more ?
>
>
> - Mail original -
> De: "Jan Schermer" 
> À: "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 15:29:15
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>
> I tried everything: —write-barrier, —sync —fsync, —fdatasync I never
> get the same 10ms latency. Must be something the filesystem journal/log does 
> that is special.
>
> Any ideas where to look? I was hoping blktrace would show what exactly
> is going on, but it just shows a synchronous write -> (10ms) ->
> completed
>
> Jan
>
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote:
>>
 I have 12K IOPS in this test on the block device itself. But only
 100 filesystem transactions (=IOPS) on filesystem on the same
 device because the “flush” (=FUA?) operation takes 10ms to finish.
 I just can’t replicate the >>same “flush” operation with fio on the
 block device, unfortunately, so I have no idea what is causing that
 :/
>>
>> AFAIK, with fio on block device with --sync=1, is doing flush after each 
>> write.
>>
>> I'm not sure with fio on a filesystem, but filesystem should do a fsync 
>> after file write.
>>
>>
>> - Mail original -
>> De: "Jan Schermer" 
>> À: "aderumier" 
>> Cc: "ceph-users" 
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>> and higher have it for sure.
>>
>> I have 12K IOPS in this test on the block device itself. But only 100
>> filesystem transactions (=IOPS) on filesystem on the same device
>> because the “flush” (=FUA?) operation takes 10ms to finish. I just
>> can’t replicate the same “flush” operation with fio on the block
>> device, unfortunately, so I have no idea what is causing that :/
>>
>> Jan
>>
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote:
>>>
>>> Hi,
>>> I have already see bad performance with Crucial m550 ssd, 400 iops 
>>> syncronous write.
>>>
>>> Not sure what model of ssd do you have ?
>>>
>>> see this:
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
>>> -ssd-is-suitable-as-a-journal-device/
>>>
>>> what is your result of disk directly with
>>>
>>> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync
>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
>>> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>>> --name=journal-test
>>>
>>> ?
>>>
>>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
>>> mode, and don't have any problem.
>>>
>>>
>>> also about centos 2.6.32, I'm not sure FUA support has been
>>> backported by redhat (since true FUA support is since 2.6.37), so maybe 
>>> it's the old barrier code.
>>>
>>>
>>> - Mail original -
>>> De: "Jan Schermer" 
>>> À: "ceph-users" 
>>> Envoyé: Jeudi 9 Juillet 2015 12:32:04
>>> Objet: [ceph-users] Investigating my 100 IOPS limit
>>>
>>> I hope this would be 

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Jan Schermer
Those are very strange numbers. Is the “60” figure right?

Can you paste the full fio command and output?
Thanks

Jan

> On 09 Jul 2015, at 15:58, Alexandre DERUMIER  wrote:
> 
> I just tried on an intel s3700, on top of xfs
> 
> fio , with 
> - sequential syncronous 4k write iodepth=1  : 60 iops
> - sequential syncronous 4k write iodepth=32 : 2000 iops
> - random syncronous 4k write, iodepth=1 : 8000iops
> - random syncronous 4k write iodepth=32 : 18000 iops
> 
> 
> 
> - Mail original -
> De: "aderumier" 
> À: "Jan Schermer" 
> Cc: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 15:50:35
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
> 
>>> Any ideas where to look? I was hoping blktrace would show what exactly is 
>>> going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> which size is the write in this case ? 4K ? or more ? 
> 
> 
> - Mail original - 
> De: "Jan Schermer"  
> À: "aderumier"  
> Cc: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 15:29:15 
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
> I tried everything: —write-barrier, —sync —fsync, —fdatasync 
> I never get the same 10ms latency. Must be something the filesystem 
> journal/log does that is special. 
> 
> Any ideas where to look? I was hoping blktrace would show what exactly is 
> going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
>> 
 I have 12K IOPS in this test on the block device itself. But only 100 
 filesystem transactions (=IOPS) on filesystem on the same device because 
 the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
 the >>same “flush” operation with fio on the block device, unfortunately, 
 so I have no idea what is causing that :/ 
>> 
>> AFAIK, with fio on block device with --sync=1, is doing flush after each 
>> write. 
>> 
>> I'm not sure with fio on a filesystem, but filesystem should do a fsync 
>> after file write. 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "aderumier"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>> and higher have it for sure. 
>> 
>> I have 12K IOPS in this test on the block device itself. But only 100 
>> filesystem transactions (=IOPS) on filesystem on the same device because the 
>> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
>> same “flush” operation with fio on the block device, unfortunately, so I 
>> have no idea what is causing that :/ 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote: 
>>> 
>>> Hi, 
>>> I have already see bad performance with Crucial m550 ssd, 400 iops 
>>> syncronous write. 
>>> 
>>> Not sure what model of ssd do you have ? 
>>> 
>>> see this: 
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>  
>>> 
>>> what is your result of disk directly with 
>>> 
>>> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync 
>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
>>> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>>> 
>>> ? 
>>> 
>>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
>>> mode, and don't have any problem. 
>>> 
>>> 
>>> also about centos 2.6.32, I'm not sure FUA support has been backported by 
>>> redhat (since true FUA support is since 2.6.37), 
>>> so maybe it's the old barrier code. 
>>> 
>>> 
>>> - Mail original - 
>>> De: "Jan Schermer"  
>>> À: "ceph-users"  
>>> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
>>> Objet: [ceph-users] Investigating my 100 IOPS limit 
>>> 
>>> I hope this would be interesting for some, it nearly cost me my sanity. 
>>> 
>>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
>>> with the LSI controllers and some drives. 
>>> It almost drove me crazy as I could replicate the problem with ease but 
>>> when I wanted to show it to someone it was often gone. Sometimes it 
>>> required fio to write for some time for the problem to manifest again, 
>>> required seemingly conflicting settings to come up… 
>>> 
>>> Well, turns out the problem is fio calling fallocate() when creating the 
>>> file to use for this test, which doesn’t really allocate the blocks, it 
>>> just “reserves” them. 
>>> When fio writes to those blocks, the filesystem journal becomes the 
>>> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>>> 
>>> If, however, I create the file with dd or such, those writes do _not_ end 
>>> in the journal, and the result is 10K synchronous 4K IOPS on the same 
>>> drive. 
>>> If, for example, I run fio with a 1M block size, it would still do 100* 
>>> IOPS and when I then run a 4K bl

Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Wang, Warren
You'll take a noticeable hit on write latency. Whether or not it's tolerable 
will be up to you and the workload you have to capture. Large file operations 
are throughput efficient without an SSD journal, as long as you have enough 
spindles.

About the Intel P3700: you will only need one to keep up with 12 SATA drives. The 
400 GB model is probably okay if you keep the journal sizes small, but the 800 GB is 
probably safer if you plan on leaving these in production for a few years. 
It depends on the turnover of data on the servers.
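
As a rough sketch of what "keeping the journal sizes small" can look like in 
ceph.conf (the 5 GB figure is only an assumption, not a sizing recommendation), 
twelve journals of that size use a small fraction of a 400 GB P3700:

[osd]
# journal size is specified in MB; 12 x 5 GB of journals = 60 GB of the device
osd journal size = 5120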

The dual-disk-failure comment is pointing out that you are more exposed to 
data loss with 2 copies. You do need to understand that there is a possibility 
of 2 drives failing either simultaneously, or of a second one failing before the 
cluster has finished repairing the first. As usual, you will have to decide whether 
that is acceptable or not. We have many clusters; some run 2 copies and others 3. If 
your data resides nowhere else, then 3 copies is the safe thing to do. That's 
getting harder and harder to justify, though, when the price of other storage 
solutions using erasure coding continues to plummet.
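
If you later decide to move a pool from 2 to 3 copies, it is a one-line change per 
pool (the pool name "rbd" below is just a placeholder), at the cost of the extra 
capacity and the rebalancing traffic:

ceph osd pool set rbd size 3      # adds a third replica; triggers a rebalance
ceph osd pool get rbd size        # verify the new replication count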

Warren

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz 
Reinicke - IT Koordinator
Sent: Thursday, July 09, 2015 4:47 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more read 
than write cluster

Hi Christian,
Am 09.07.15 um 09:36 schrieb Christian Balzer:
> 
> Hello,
> 
> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
> 
>> Hi again,
>>
>> time is passing, so is my budget :-/ and I have to recheck the 
>> options for a "starter" cluster. An expansion next year, maybe for an 
>> OpenStack installation or for more performance if demands rise, is 
>> possible. The "starter" could always be used as a test or slow dark archive.
>>
>> At the beginning I was at 16 SATA OSDs with 4 SSDs for journals per 
>> node, but now I'm looking at 12 SATA OSDs without SSD journals. Less 
>> performance, less capacity, I know. But that's ok!
>>
> Leave the space to upgrade these nodes with SSDs in the future.
> If your cluster grows large enough (more than 20 nodes) even a single
> P3700 might do the trick and will need only a PCIe slot.

If I get you right, the 12-disk setup is not a bad idea; if there turns out to be a 
need for an SSD journal, I can add the PCIe P3700.

In the 12-OSD setup I should get 2 P3700s, one per 6 OSDs.

Good or bad idea?

> 
>> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
>>
> Danger, Will Robinson.
> This is essentially a RAID5 and you're plain asking for a double disk 
> failure to happen.

Maybe I do not understand that. size = 2 I think is more like RAID1 ... ? 
And why am I asking for a double disk failure?

Too few nodes, too few OSDs, or because of size = 2?

> 
> See this recent thread:
> "calculating maximum number of disk and node failure that can be 
> handled by cluster with out data loss"
> for some discussion and python script which you will need to modify 
> for
> 2 disk replication.
> 
> With a RAID5 failure calculator you're at 1 data loss event per 3.5 
> years...
> 

Thanks for that thread, but I don't get the point of it for my case.

I see that calculating the reliability is some sort of complex math ...

>> The workload I expect is mostly writes of maybe some GB of Office 
>> files per day and some TB of larger video files from a few users per week.
>>
>> At the end of this year we expect to have +- 60 to 80 TB of larger 
>> video files in that cluster, which are accessed from time to time.
>>
>> Any suggestion on the drop of ssd journals?
>>
> You will miss them when the cluster does write, be it from clients or 
> when re-balancing a lost OSD.

I can imagine that I might miss the SSD journal, but if I can add the
P3700 later I feel comfortable with it for now, for budget and evaluation reasons.

Thanks for your helpful input and feedback. /Götz

--
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium 
für Wissenschaft, Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread David Burley
If you can accept the failure domain, we find a 12:1 ratio of SATA spinners
to a 400GB P3700 reasonable. Benchmarks can saturate it, but it is
entirely bored by our real-world workload and only 30-50% utilized during
backfills. I am sure one could go even further than 12:1 if they wanted to,
but we haven't tested that.
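
(The utilization figure above is the sort of thing you can watch during a backfill 
with iostat; a sketch, and the NVMe device name is an assumption for this setup:)

iostat -x 1 /dev/nvme0n1     # %util is reported in the last column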

On Thu, Jul 9, 2015 at 4:47 AM, Götz Reinicke - IT Koordinator <
goetz.reini...@filmakademie.de> wrote:

> Hi Christian,
> Am 09.07.15 um 09:36 schrieb Christian Balzer:
> >
> > Hello,
> >
> > On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
> >
> >> Hi again,
> >>
> >> time is passing, so is my budget :-/ and I have to recheck the options
> >> for a "starter" cluster. An expansion next year for may be an openstack
> >> installation or more performance if the demands rise is possible. The
> >> "starter" could always be used as test or slow dark archive.
> >>
> >> At the beginning I was at 16SATA OSDs with 4 SSDs for journal per node,
> >> but now I'm looking for 12 SATA OSDs without SSD journal. Less
> >> performance, less capacity I know. But thats ok!
> >>
> > Leave the space to upgrade these nodes with SSDs in the future.
> > If your cluster grows large enough (more than 20 nodes) even a single
> > P3700 might do the trick and will need only a PCIe slot.
>
> If I get you right, the 12-disk setup is not a bad idea; if there turns out
> to be a need for an SSD journal, I can add the PCIe P3700.
>
> In the 12-OSD setup I should get 2 P3700s, one per 6 OSDs.
>
> Good or bad idea?
>
> >
> >> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
> >>
> > Danger, Will Robinson.
> > This is essentially a RAID5 and you're plain asking for a double disk
> > failure to happen.
>
> Maybe I do not understand that. size = 2 I think is more like RAID1
> ... ? And why am I asking for a double disk failure?
>
> Too few nodes, too few OSDs, or because of size = 2?
>
> >
> > See this recent thread:
> > "calculating maximum number of disk and node failure that can be handled
> > by cluster with out data loss"
> > for some discussion and python script which you will need to modify for
> > 2 disk replication.
> >
> > With a RAID5 failure calculator you're at 1 data loss event per 3.5
> > years...
> >
>
> Thanks for that thread, but I dont get the point out of it for me.
>
> I see that calculating the reliability is some sort of complex math ...
>
> >> The workload I expect is more writes of may be some GB of Office files
> >> per day and some TB of larger video Files from a few users per week.
> >>
> >> At the end of this year we calculate to have +- 60 to 80 TB of lager
> >> videofiles in that cluster, which are accessed from time to time.
> >>
> >> Any suggestion on the drop of ssd journals?
> >>
> > You will miss them when the cluster does write, be it from clients or
> when
> > re-balancing a lost OSD.
>
> I can imagine, that I might miss the SSD Journal, but if I can add the
> P3700 later I feel comfy with it for now. Budget and evaluation related.
>
> Thanks for your helpful input and feedback. /Götz
>
> --
> Götz Reinicke
> IT-Koordinator
>
> Tel. +49 7141 969 82420
> E-Mail goetz.reini...@filmakademie.de
>
> Filmakademie Baden-Württemberg GmbH
> Akademiehof 10
> 71638 Ludwigsburg
> www.filmakademie.de
>
> Eintragung Amtsgericht Stuttgart HRB 205016
>
> Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
> Staatssekretär im Ministerium für Wissenschaft,
> Forschung und Kunst Baden-Württemberg
>
> Geschäftsführer: Prof. Thomas Schadt
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Tony Harris
Sounds to me like you've put yourself at too much risk. *If* I'm reading
your message right about your configuration, you have multiple hosts
accessing OSDs that are stored on a single shared box. If that single
shared box (a single point of failure for multiple nodes) goes down, it's
possible for multiple replicas to disappear at the same time, which could
halt the operation of your cluster if the masters and the replicas both
sit on OSDs within that single shared storage system...
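
A quick way to check how replicas actually map onto the enclosures is to look at 
the CRUSH hierarchy and rules (a sketch; bucket and rule names depend on your 
crush map):

ceph osd tree                 # shows the chassis/host/osd hierarchy
ceph osd crush rule dump      # shows which failure domain each rule uses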

On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar <
mallikarjuna.bira...@gmail.com> wrote:

> Hi all,
>
> Setup details:
> Two storage enclosures each connected to 4 OSD nodes (Shared storage).
> Failure domain is Chassis (enclosure) level. Replication count is 2.
> Each host is allotted 4 drives.
>
> I have active client IO running on cluster. (Random write profile with
> 4M block size & 64 Queue depth).
>
> One of enclosure had power loss. So all OSD's from hosts that are
> connected to this enclosure went down as expected.
>
> But client IO got paused. After some time the enclosure & the hosts connected
> to it came back up,
> and all OSDs on those hosts came up.
>
> Until then, the cluster was not serving IO. Only once all hosts & OSDs
> pertaining to that enclosure came up did client IO resume.
>
>
> Can anybody help me understand why the cluster did not serve IO during the
> enclosure failure? Or is it a bug?
>
> -Thanks & regards,
> Mallikarjun Biradar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Jan Schermer
This is a blktrace of the problem, but this is the first time I’ve used it. It 
begins when “swapper” (probably because it is a dirty page and thus gets 
flushed?) shows up.

+8 should mean 8 sectors = 4KiB?
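
For anyone wanting to reproduce a trace like the one below: device 8,160 should 
correspond to /dev/sdk here (that mapping is an inference from the major/minor 
numbers), and a live trace can be captured with blktrace piped into blkparse. The 
single-letter actions are Q=queued, G=get request, I=inserted, D=issued (dispatched 
to the driver) and C=completed.

blktrace -d /dev/sdk -o - | blkparse -i -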
 

 8,160  155925 2.692651182 1436712  Q FWS [fio]
  8,160  155926 2.692652285 1436712  G FWS [fio]
  8,160  155927 2.692652923 1436712  I FWS [fio]
  8,160  155928 2.692842413  6770  C  WS 16760024 + 8 [0]
  8,160  155929 2.692881511  6770  C  WS 0 [0]
  8,160  155930 2.693029408 0  C  WS 4023248 + 8 [0]
  8,160  155931 2.693066725  6770  C  WS 0 [0]
  8,160  155932 2.693266146 0  C  WS 16328824 + 8 [0]
  8,160  155933 2.693297033 1436712  Q FWS [fio]
  8,160  155934 2.693297628 1436712  G FWS [fio]
  8,160  155935 2.693298042 1436712  I FWS [fio]
  8,160  155936 2.693478425 0  C  WS 11772464 + 8 [0]
  8,160  155937 2.693484440 1436712  Q FWS [fio]
  8,160  155938 2.693484919 1436712  G FWS [fio]
  8,160  155939 2.693485180 1436712  I FWS [fio]
  8,160  155940 2.693669076 0  C  WS 19637160 + 8 [0]
  8,160  155941 2.693675679 1436712  Q FWS [fio]
  8,160  155942 2.693676180 1436712  G FWS [fio]
  8,160  155943 2.693676706 1436712  I FWS [fio]
  8,160  155944 2.693898452 0  C  WS 12364112 + 8 [0]
  8,160  155945 2.693963594 0  C  WS 0 [0]
  8,160  155946 2.693968646 1436712  U   N [fio] 0
  8,160  155947 2.693990010 1436712  Q  WS 1864456 + 8 [fio]
  8,160  155948 2.693990846 1436712  G  WS 1864456 + 8 [fio]
  8,160  155949 2.693991543 1436712  I  WS 1864456 + 8 [fio]
  8,160  155950 2.693992127 1436712  D  WS 1864456 + 8 [fio]
  8,160  155951 2.693995244 1436712  U   N [fio] 1
  8,160  155952 2.694233777 1436712  Q FWFS 468960767 + 16 [fio]
  8,160  155953 2.694234504 1436712  G FWFS 468960767 + 16 [fio]
  8,160  155954 2.694235144 1436712  I FWFS 468960767 + 16 [fio]
  8,160  155955 2.706561235  8989  C WFS 468960767 + 16 [0]
  8,160  155956 2.706562115  8989  C WFS 468960767 [0]
  8,160  155957 2.725928434 1436712  Q FWFS 468960804 + 18 [fio]
  8,160  155958 2.725929908 1436712  G FWFS 468960804 + 18 [fio]
  8,160  155959 2.725930980 1436712  I FWFS 468960804 + 18 [fio]
  8,160  7   12 2.706817448 0  C  WS 17954496 + 8 [0]
  8,160  7   13 2.706892420 0  D WFS 468960783 + 10 [swapper]
  8,160  7   14 2.715675674 0  C  WS 3242200 + 8 [0]
  8,160  7   15 2.715771745 0  D WFS 468960793 + 11 [swapper]
  8,160  7   16 2.736072549 0  C  WS 2058840 + 8 [0]
  8,160  7   17 2.736167908 0  D WFS 468960822 + 10 [swapper]
  8,160  7   18 2.746284722 0  C  WS 20142728 + 8 [0]
  8,160  7   19 2.746370687 0  D WFS 468960832 + 11 [swapper]

Jan

> On 09 Jul 2015, at 15:50, Alexandre DERUMIER  wrote:
> 
>>> Any ideas where to look? I was hoping blktrace would show what exactly is 
>>> going on, but it just shows a synchronous write -> (10ms) -> completed
> 
> which size is the write in this case ? 4K ? or more ?
> 
> 
> - Mail original -
> De: "Jan Schermer" 
> À: "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 15:29:15
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
> 
> I tried everything: —write-barrier, —sync —fsync, —fdatasync 
> I never get the same 10ms latency. Must be something the filesystem 
> journal/log does that is special. 
> 
> Any ideas where to look? I was hoping blktrace would show what exactly is 
> going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
>> 
 I have 12K IOPS in this test on the block device itself. But only 100 
 filesystem transactions (=IOPS) on filesystem on the same device because 
 the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
 the >>same “flush” operation with fio on the block device, unfortunately, 
 so I have no idea what is causing that :/ 
>> 
>> AFAIK, with fio on block device with --sync=1, is doing flush after each 
>> write. 
>> 
>> I'm not sure with fio on a filesystem, but filesystem should do a fsync 
>> after file write. 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "aderumier"  
>> Cc: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>> and higher have it for sure. 
>> 
>> I have 12K IOPS in this test on the block device itself. But only 100 
>> filesystem transactions (=IOPS) on filesystem on the same device because the 
>> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
>> same “flush” operation with fio on t

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Alexandre DERUMIER
I just tried on an Intel S3700, on top of XFS.

fio, with 
- sequential synchronous 4k write, iodepth=1  : 60 iops
- sequential synchronous 4k write, iodepth=32 : 2000 iops
- random synchronous 4k write, iodepth=1 : 8000 iops
- random synchronous 4k write, iodepth=32 : 18000 iops



- Mail original -
De: "aderumier" 
À: "Jan Schermer" 
Cc: "ceph-users" 
Envoyé: Jeudi 9 Juillet 2015 15:50:35
Objet: Re: [ceph-users] Investigating my 100 IOPS limit

>>Any ideas where to look? I was hoping blktrace would show what exactly is 
>>going on, but it just shows a synchronous write -> (10ms) -> completed 

which size is the write in this case ? 4K ? or more ? 


- Mail original - 
De: "Jan Schermer"  
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Jeudi 9 Juillet 2015 15:29:15 
Objet: Re: [ceph-users] Investigating my 100 IOPS limit 

I tried everything: —write-barrier, —sync —fsync, —fdatasync 
I never get the same 10ms latency. Must be something the filesystem journal/log 
does that is special. 

Any ideas where to look? I was hoping blktrace would show what exactly is going 
on, but it just shows a synchronous write -> (10ms) -> completed 

Jan 

> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
> 
>>> I have 12K IOPS in this test on the block device itself. But only 100 
>>> filesystem transactions (=IOPS) on filesystem on the same device because 
>>> the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
>>> the >>same “flush” operation with fio on the block device, unfortunately, 
>>> so I have no idea what is causing that :/ 
> 
> AFAIK, with fio on block device with --sync=1, is doing flush after each 
> write. 
> 
> I'm not sure with fio on a filesystem, but filesystem should do a fsync after 
> file write. 
> 
> 
> - Mail original - 
> De: "Jan Schermer"  
> À: "aderumier"  
> Cc: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and 
> higher have it for sure. 
> 
> I have 12K IOPS in this test on the block device itself. But only 100 
> filesystem transactions (=IOPS) on filesystem on the same device because the 
> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
> same “flush” operation with fio on the block device, unfortunately, so I have 
> no idea what is causing that :/ 
> 
> Jan 
> 
>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote: 
>> 
>> Hi, 
>> I have already see bad performance with Crucial m550 ssd, 400 iops 
>> syncronous write. 
>> 
>> Not sure what model of ssd do you have ? 
>> 
>> see this: 
>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>  
>> 
>> what is your result of disk directly with 
>> 
>> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync 
>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
>> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>> 
>> ? 
>> 
>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
>> mode, and don't have any problem. 
>> 
>> 
>> also about centos 2.6.32, I'm not sure FUA support has been backported by 
>> redhat (since true FUA support is since 2.6.37), 
>> so maybe it's the old barrier code. 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
>> Objet: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> I hope this would be interesting for some, it nearly cost me my sanity. 
>> 
>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
>> with the LSI controllers and some drives. 
>> It almost drove me crazy as I could replicate the problem with ease but when 
>> I wanted to show it to someone it was often gone. Sometimes it required fio 
>> to write for some time for the problem to manifest again, required seemingly 
>> conflicting settings to come up… 
>> 
>> Well, turns out the problem is fio calling fallocate() when creating the 
>> file to use for this test, which doesn’t really allocate the blocks, it just 
>> “reserves” them. 
>> When fio writes to those blocks, the filesystem journal becomes the 
>> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>> 
>> If, however, I create the file with dd or such, those writes do _not_ end in 
>> the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
>> If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
>> and when I then run a 4K block size test without deleting the file, it would 
>> run at a 10K IOPS pace until it hits the first unwritten blocks - then it 
>> slows to a crawl again. 
>> 
>> The same issue is present with XFS and ext3/ext4 (with default mount 
>> options), and no matter how I create the filesystem or mount it can I avoid 
>> this problem. The only way to 

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Alexandre DERUMIER
>>Any ideas where to look? I was hoping blktrace would show what exactly is 
>>going on, but it just shows a synchronous write -> (10ms) -> completed

What size is the write in this case? 4K, or more?


- Mail original -
De: "Jan Schermer" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Jeudi 9 Juillet 2015 15:29:15
Objet: Re: [ceph-users] Investigating my 100 IOPS limit

I tried everything: —write-barrier, —sync —fsync, —fdatasync 
I never get the same 10ms latency. Must be something the filesystem journal/log 
does that is special. 

Any ideas where to look? I was hoping blktrace would show what exactly is going 
on, but it just shows a synchronous write -> (10ms) -> completed 

Jan 

> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote: 
> 
>>> I have 12K IOPS in this test on the block device itself. But only 100 
>>> filesystem transactions (=IOPS) on filesystem on the same device because 
>>> the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
>>> the >>same “flush” operation with fio on the block device, unfortunately, 
>>> so I have no idea what is causing that :/ 
> 
> AFAIK, with fio on block device with --sync=1, is doing flush after each 
> write. 
> 
> I'm not sure with fio on a filesystem, but filesystem should do a fsync after 
> file write. 
> 
> 
> - Mail original - 
> De: "Jan Schermer"  
> À: "aderumier"  
> Cc: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 14:43:46 
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit 
> 
> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and 
> higher have it for sure. 
> 
> I have 12K IOPS in this test on the block device itself. But only 100 
> filesystem transactions (=IOPS) on filesystem on the same device because the 
> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
> same “flush” operation with fio on the block device, unfortunately, so I have 
> no idea what is causing that :/ 
> 
> Jan 
> 
>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote: 
>> 
>> Hi, 
>> I have already see bad performance with Crucial m550 ssd, 400 iops 
>> syncronous write. 
>> 
>> Not sure what model of ssd do you have ? 
>> 
>> see this: 
>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>  
>> 
>> what is your result of disk directly with 
>> 
>> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync 
>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
>> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>> 
>> ? 
>> 
>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
>> mode, and don't have any problem. 
>> 
>> 
>> also about centos 2.6.32, I'm not sure FUA support has been backported by 
>> redhat (since true FUA support is since 2.6.37), 
>> so maybe it's the old barrier code. 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
>> Objet: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> I hope this would be interesting for some, it nearly cost me my sanity. 
>> 
>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
>> with the LSI controllers and some drives. 
>> It almost drove me crazy as I could replicate the problem with ease but when 
>> I wanted to show it to someone it was often gone. Sometimes it required fio 
>> to write for some time for the problem to manifest again, required seemingly 
>> conflicting settings to come up… 
>> 
>> Well, turns out the problem is fio calling fallocate() when creating the 
>> file to use for this test, which doesn’t really allocate the blocks, it just 
>> “reserves” them. 
>> When fio writes to those blocks, the filesystem journal becomes the 
>> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>> 
>> If, however, I create the file with dd or such, those writes do _not_ end in 
>> the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
>> If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
>> and when I then run a 4K block size test without deleting the file, it would 
>> run at a 10K IOPS pace until it hits the first unwritten blocks - then it 
>> slows to a crawl again. 
>> 
>> The same issue is present with XFS and ext3/ext4 (with default mount 
>> options), and no matter how I create the filesystem or mount it can I avoid 
>> this problem. The only way to avoid this problem is to mount ext4 with -o 
>> journal_async_commit, which should be safe, but... 
>> 
>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
>> Kingston SSDs in this case (interestingly, this issue does not seem to occur 
>> on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” 
>> support for the drives (AFAIK they don’t support it so the controller must 
>> somehow flush the cache, which is wh

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Jan Schermer
I tried everything: --write-barrier, --sync, --fsync, --fdatasync.
I never get the same 10ms latency. It must be something special that the filesystem 
journal/log does.

Any ideas where to look? I was hoping blktrace would show what exactly is going 
on, but it just shows a synchronous write -> (10ms) -> completed

Jan

> On 09 Jul 2015, at 15:26, Alexandre DERUMIER  wrote:
> 
>>> I have 12K IOPS in this test on the block device itself. But only 100 
>>> filesystem transactions (=IOPS) on filesystem on the same device because 
>>> the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
>>> the >>same “flush” operation with fio on the block device, unfortunately, 
>>> so I have no idea what is causing that :/ 
> 
> AFAIK, with fio on block device with --sync=1, is doing flush after each 
> write.
> 
> I'm not sure with fio on a filesystem, but filesystem should do a fsync after 
> file write.
> 
> 
> - Mail original -
> De: "Jan Schermer" 
> À: "aderumier" 
> Cc: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 14:43:46
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
> 
> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and 
> higher have it for sure. 
> 
> I have 12K IOPS in this test on the block device itself. But only 100 
> filesystem transactions (=IOPS) on filesystem on the same device because the 
> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
> same “flush” operation with fio on the block device, unfortunately, so I have 
> no idea what is causing that :/ 
> 
> Jan 
> 
>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote: 
>> 
>> Hi, 
>> I have already see bad performance with Crucial m550 ssd, 400 iops 
>> syncronous write. 
>> 
>> Not sure what model of ssd do you have ? 
>> 
>> see this: 
>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>  
>> 
>> what is your result of disk directly with 
>> 
>> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync 
>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
>> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>> 
>> ? 
>> 
>> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
>> mode, and don't have any problem. 
>> 
>> 
>> also about centos 2.6.32, I'm not sure FUA support has been backported by 
>> redhat (since true FUA support is since 2.6.37), 
>> so maybe it's the old barrier code. 
>> 
>> 
>> - Mail original - 
>> De: "Jan Schermer"  
>> À: "ceph-users"  
>> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
>> Objet: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> I hope this would be interesting for some, it nearly cost me my sanity. 
>> 
>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
>> with the LSI controllers and some drives. 
>> It almost drove me crazy as I could replicate the problem with ease but when 
>> I wanted to show it to someone it was often gone. Sometimes it required fio 
>> to write for some time for the problem to manifest again, required seemingly 
>> conflicting settings to come up… 
>> 
>> Well, turns out the problem is fio calling fallocate() when creating the 
>> file to use for this test, which doesn’t really allocate the blocks, it just 
>> “reserves” them. 
>> When fio writes to those blocks, the filesystem journal becomes the 
>> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>> 
>> If, however, I create the file with dd or such, those writes do _not_ end in 
>> the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
>> If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
>> and when I then run a 4K block size test without deleting the file, it would 
>> run at a 10K IOPS pace until it hits the first unwritten blocks - then it 
>> slows to a crawl again. 
>> 
>> The same issue is present with XFS and ext3/ext4 (with default mount 
>> options), and no matter how I create the filesystem or mount it can I avoid 
>> this problem. The only way to avoid this problem is to mount ext4 with -o 
>> journal_async_commit, which should be safe, but... 
>> 
>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
>> Kingston SSDs in this case (interestingly, this issue does not seem to occur 
>> on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” 
>> support for the drives (AFAIK they don’t support it so the controller must 
>> somehow flush the cache, which is what introduces a huge latency hit). 
>> I can’t replicate this problem on the block device itself, only on a file on 
>> filesystem, so it might as well be a kernel/driver bug. I have a blktrace 
>> showing the difference between the “good” and “bad” writes, but I don’t know 
>> what the driver/controller does - I only see the write on the log device 
>> finishing after a long 10ms. 
>> 
>> Co

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Alexandre DERUMIER
>>I have 12K IOPS in this test on the block device itself. But only 100 
>>filesystem transactions (=IOPS) on filesystem on the same device because the 
>>“flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
same “flush” operation with fio on the block device, unfortunately, so I 
>>have no idea what is causing that :/ 

AFAIK, fio on a block device with --sync=1 is doing a flush after each write.

I'm not sure about fio on a filesystem, but the filesystem should do an fsync after 
each file write.
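
A small sketch of the distinction (device and file paths are placeholders): 
--sync=1 makes fio open the target with O_SYNC, while --fsync=1 makes it issue an 
explicit fsync() after every write; comparing the two on the same device can show 
whether the penalty comes from the sync open flag or from the flush itself.

# O_SYNC on every write
fio --name=osync-test --filename=/dev/sdk --rw=write --bs=4k --iodepth=1 \
    --direct=1 --sync=1 --runtime=60 --time_based
# explicit fsync() after each write instead
fio --name=fsync-test --filename=/mnt/something/testfile.fio --rw=write --bs=4k \
    --iodepth=1 --fsync=1 --runtime=60 --time_based --size=1000M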


- Mail original -
De: "Jan Schermer" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Jeudi 9 Juillet 2015 14:43:46
Objet: Re: [ceph-users] Investigating my 100 IOPS limit

The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and 
higher have it for sure. 

I have 12K IOPS in this test on the block device itself. But only 100 
filesystem transactions (=IOPS) on filesystem on the same device because the 
“flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the same 
“flush” operation with fio on the block device, unfortunately, so I have no 
idea what is causing that :/ 

Jan 

> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote: 
> 
> Hi, 
> I have already see bad performance with Crucial m550 ssd, 400 iops syncronous 
> write. 
> 
> Not sure what model of ssd do you have ? 
> 
> see this: 
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>  
> 
> what is your result of disk directly with 
> 
> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync 
> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
> 
> ? 
> 
> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
> mode, and don't have any problem. 
> 
> 
> also about centos 2.6.32, I'm not sure FUA support has been backported by 
> redhat (since true FUA support is since 2.6.37), 
> so maybe it's the old barrier code. 
> 
> 
> - Mail original - 
> De: "Jan Schermer"  
> À: "ceph-users"  
> Envoyé: Jeudi 9 Juillet 2015 12:32:04 
> Objet: [ceph-users] Investigating my 100 IOPS limit 
> 
> I hope this would be interesting for some, it nearly cost me my sanity. 
> 
> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
> with the LSI controllers and some drives. 
> It almost drove me crazy as I could replicate the problem with ease but when 
> I wanted to show it to someone it was often gone. Sometimes it required fio 
> to write for some time for the problem to manifest again, required seemingly 
> conflicting settings to come up… 
> 
> Well, turns out the problem is fio calling fallocate() when creating the file 
> to use for this test, which doesn’t really allocate the blocks, it just 
> “reserves” them. 
> When fio writes to those blocks, the filesystem journal becomes the 
> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
> 
> If, however, I create the file with dd or such, those writes do _not_ end in 
> the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
> If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
> and when I then run a 4K block size test without deleting the file, it would 
> run at a 10K IOPS pace until it hits the first unwritten blocks - then it 
> slows to a crawl again. 
> 
> The same issue is present with XFS and ext3/ext4 (with default mount 
> options), and no matter how I create the filesystem or mount it can I avoid 
> this problem. The only way to avoid this problem is to mount ext4 with -o 
> journal_async_commit, which should be safe, but... 
> 
> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
> Kingston SSDs in this case (interestingly, this issue does not seem to occur 
> on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” 
> support for the drives (AFAIK they don’t support it so the controller must 
> somehow flush the cache, which is what introduces a huge latency hit). 
> I can’t replicate this problem on the block device itself, only on a file on 
> filesystem, so it might as well be a kernel/driver bug. I have a blktrace 
> showing the difference between the “good” and “bad” writes, but I don’t know 
> what the driver/controller does - I only see the write on the log device 
> finishing after a long 10ms. 
> 
> Could someone tell me how CEPH creates the filesystem objects? I suppose it 
> does fallocate() as well, right? Any way to force it to write them out 
> completely and not use it to get around this issue I have? 
> 
> How to replicate: 
> 
> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k 
> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test 
> --size=1000M --ioengine=libaio 
> 
> 
> * It is in fact 98 IOPS. Exactly. Not more, not less :-) 
> 
> Jan 
> 

Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Jan Schermer
The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and 
higher have it for sure.

I have 12K IOPS in this test on the block device itself, but only 100 
filesystem transactions (=IOPS) on a filesystem on the same device, because the 
“flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the same 
“flush” operation with fio on the block device, unfortunately, so I have no 
idea what is causing that :/
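
One hedged way to see what the kernel believes about the disk's write cache (the 
sysfs attribute below is the standard sd/SCSI one and is assumed to be present on 
this kernel):

grep . /sys/class/scsi_disk/*/cache_type     # "write back" vs "write through"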

Jan

> On 09 Jul 2015, at 14:08, Alexandre DERUMIER  wrote:
> 
> Hi,
> I have already see bad performance with Crucial m550 ssd, 400 iops syncronous 
> write.
> 
> Not sure what model of ssd do you have ?
> 
> see this:
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> what is your result of disk directly with
> 
> #dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync
> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
> 
> ?
> 
> I'm using lsi 3008 controllers with intel ssd (3500,3610,3700), passthrough 
> mode, and don't have any problem.
> 
> 
> also about centos 2.6.32, I'm not sure FUA support has been backported by 
> redhat (since true FUA support is since 2.6.37),
> so maybe it's the old barrier code.
> 
> 
> - Mail original -
> De: "Jan Schermer" 
> À: "ceph-users" 
> Envoyé: Jeudi 9 Juillet 2015 12:32:04
> Objet: [ceph-users] Investigating my 100 IOPS limit
> 
> I hope this would be interesting for some, it nearly cost me my sanity. 
> 
> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
> with the LSI controllers and some drives. 
> It almost drove me crazy as I could replicate the problem with ease but when 
> I wanted to show it to someone it was often gone. Sometimes it required fio 
> to write for some time for the problem to manifest again, required seemingly 
> conflicting settings to come up… 
> 
> Well, turns out the problem is fio calling fallocate() when creating the file 
> to use for this test, which doesn’t really allocate the blocks, it just 
> “reserves” them. 
> When fio writes to those blocks, the filesystem journal becomes the 
> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
> 
> If, however, I create the file with dd or such, those writes do _not_ end in 
> the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
> If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
> and when I then run a 4K block size test without deleting the file, it would 
> run at a 10K IOPS pace until it hits the first unwritten blocks - then it 
> slows to a crawl again. 
> 
> The same issue is present with XFS and ext3/ext4 (with default mount 
> options), and no matter how I create the filesystem or mount it can I avoid 
> this problem. The only way to avoid this problem is to mount ext4 with -o 
> journal_async_commit, which should be safe, but... 
> 
> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
> Kingston SSDs in this case (interestingly, this issue does not seem to occur 
> on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” 
> support for the drives (AFAIK they don’t support it so the controller must 
> somehow flush the cache, which is what introduces a huge latency hit). 
> I can’t replicate this problem on the block device itself, only on a file on 
> filesystem, so it might as well be a kernel/driver bug. I have a blktrace 
> showing the difference between the “good” and “bad” writes, but I don’t know 
> what the driver/controller does - I only see the write on the log device 
> finishing after a long 10ms. 
> 
> Could someone tell me how CEPH creates the filesystem objects? I suppose it 
> does fallocate() as well, right? Any way to force it to write them out 
> completely and not use it to get around this issue I have? 
> 
> How to replicate: 
> 
> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k 
> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test 
> --size=1000M --ioengine=libaio 
> 
> 
> * It is in fact 98 IOPS. Exactly. Not more, not less :-) 
> 
> Jan 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Alexandre DERUMIER
Hi,
I have already seen bad performance with a Crucial M550 SSD: 400 iops synchronous 
write.

Not sure what model of SSD you have?

see this:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

What result do you get directly on the disk with

#dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync
#fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
--iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

?

I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in passthrough 
mode, and don't have any problem.


Also, about CentOS 2.6.32: I'm not sure FUA support has been backported by 
Red Hat (true FUA support only arrived in 2.6.37),
so maybe it's the old barrier code.


- Mail original -
De: "Jan Schermer" 
À: "ceph-users" 
Envoyé: Jeudi 9 Juillet 2015 12:32:04
Objet: [ceph-users] Investigating my 100 IOPS limit

I hope this would be interesting for some, it nearly cost me my sanity. 

Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
with the LSI controllers and some drives. 
It almost drove me crazy as I could replicate the problem with ease but when I 
wanted to show it to someone it was often gone. Sometimes it required fio to 
write for some time for the problem to manifest again, required seemingly 
conflicting settings to come up… 

Well, turns out the problem is fio calling fallocate() when creating the file 
to use for this test, which doesn’t really allocate the blocks, it just 
“reserves” them. 
When fio writes to those blocks, the filesystem journal becomes the bottleneck 
(100 IOPS* limit can be seen there with 100% utilization). 

If, however, I create the file with dd or such, those writes do _not_ end in 
the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
and when I then run a 4K block size test without deleting the file, it would 
run at a 10K IOPS pace until it hits the first unwritten blocks - then it slows 
to a crawl again. 

The same issue is present with XFS and ext3/ext4 (with default mount options), 
and no matter how I create the filesystem or mount it, I cannot avoid this problem. 
The only way to avoid it is to mount ext4 with -o 
journal_async_commit, which should be safe, but... 

I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
Kingston SSDs in this case (interestingly, this issue does not seem to occur on 
Samsung SSDs!). I think it has something to do with LSI faking a “FUA” support 
for the drives (AFAIK they don’t support it so the controller must somehow 
flush the cache, which is what introduces a huge latency hit). 
I can’t replicate this problem on the block device itself, only on a file on 
filesystem, so it might as well be a kernel/driver bug. I have a blktrace 
showing the difference between the “good” and “bad” writes, but I don’t know 
what the driver/controller does - I only see the write on the log device 
finishing after a long 10ms. 

Could someone tell me how CEPH creates the filesystem objects? I suppose it 
does fallocate() as well, right? Any way to force it to write them out 
completely and not use it to get around this issue I have? 

How to replicate: 

fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k 
--numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test 
--size=1000M --ioengine=libaio 


* It is in fact 98 IOPS. Exactly. Not more, not less :-) 

Jan 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Gregory Farnum
Your first point of troubleshooting is pretty much always to look at
"ceph -s" and see what it says. In this case it's probably telling you
that some PGs are down, and then you can look at why (but perhaps it's
something else).
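
A short sketch of that first look (the pool name "rbd" is only a placeholder): 
besides ceph -s, the health detail and the pool's min_size usually explain why IO 
pauses when a whole failure domain goes dark.

ceph -s                                # overall cluster and PG state
ceph health detail | grep -i -E 'down|peering|undersized|incomplete'
ceph osd pool get rbd min_size         # PGs stop serving IO below min_size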
-Greg

On Thu, Jul 9, 2015 at 12:22 PM, Mallikarjun Biradar
 wrote:
> Yeah. All OSD's down and monitors still up..
>
> On Thu, Jul 9, 2015 at 4:51 PM, Jan Schermer  wrote:
>> And are the OSDs getting marked down during the outage?
>> Are all the MONs still up?
>>
>> Jan
>>
>>> On 09 Jul 2015, at 13:20, Mallikarjun Biradar 
>>>  wrote:
>>>
>>> I have size=2 & min_size=1 and IO is paused till all hosts com back.
>>>
>>> On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer  wrote:
 What is the min_size setting for the pool? If you have size=2 and 
 min_size=2, then all your data is safe when one replica is down, but the 
 IO is paused. If you want to continue IO you need to set min_size=1.
 But be aware that a single failure after that causes you to lose all the 
 data, you’d have to revert to the other replica if it comes up and works - 
 no idea how that works in ceph but will likely be a PITA to do.

 Jan

> On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
>  wrote:
>
> Hi all,
>
> Setup details:
> Two storage enclosures each connected to 4 OSD nodes (Shared storage).
> Failure domain is Chassis (enclosure) level. Replication count is 2.
> Each host has allotted with 4 drives.
>
> I have active client IO running on cluster. (Random write profile with
> 4M block size & 64 Queue depth).
>
> One of enclosure had power loss. So all OSD's from hosts that are
> connected to this enclosure went down as expected.
>
> But client IO got paused. After some time enclosure & hosts connected
> to it came up.
> And all OSD's on that hosts came up.
>
> Till this time, cluster was not serving IO. Once all hosts & OSD's
> pertaining to that enclosure came up, client IO resumed.
>
>
> Can anybody help me why cluster not serving IO during enclosure
> failure. OR its a bug?
>
> -Thanks & regards,
> Mallikarjun Biradar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Mallikarjun Biradar
Yeah. All OSD's down and monitors still up..

On Thu, Jul 9, 2015 at 4:51 PM, Jan Schermer  wrote:
> And are the OSDs getting marked down during the outage?
> Are all the MONs still up?
>
> Jan
>
>> On 09 Jul 2015, at 13:20, Mallikarjun Biradar 
>>  wrote:
>>
>> I have size=2 & min_size=1 and IO is paused till all hosts com back.
>>
>> On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer  wrote:
>>> What is the min_size setting for the pool? If you have size=2 and 
>>> min_size=2, then all your data is safe when one replica is down, but the IO 
>>> is paused. If you want to continue IO you need to set min_size=1.
>>> But be aware that a single failure after that causes you to lose all the 
>>> data, you’d have to revert to the other replica if it comes up and works - 
>>> no idea how that works in ceph but will likely be a PITA to do.
>>>
>>> Jan
>>>
 On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
  wrote:

 Hi all,

 Setup details:
 Two storage enclosures each connected to 4 OSD nodes (Shared storage).
 Failure domain is Chassis (enclosure) level. Replication count is 2.
 Each host has allotted with 4 drives.

 I have active client IO running on cluster. (Random write profile with
 4M block size & 64 Queue depth).

 One of enclosure had power loss. So all OSD's from hosts that are
 connected to this enclosure went down as expected.

 But client IO got paused. After some time enclosure & hosts connected
 to it came up.
 And all OSD's on that hosts came up.

 Till this time, cluster was not serving IO. Once all hosts & OSD's
 pertaining to that enclosure came up, client IO resumed.


 Can anybody help me why cluster not serving IO during enclosure
 failure. OR its a bug?

 -Thanks & regards,
 Mallikarjun Biradar
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Jan Schermer
And are the OSDs getting marked down during the outage?
Are all the MONs still up?

Jan

> On 09 Jul 2015, at 13:20, Mallikarjun Biradar 
>  wrote:
> 
> I have size=2 & min_size=1 and IO is paused till all hosts com back.
> 
> On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer  wrote:
>> What is the min_size setting for the pool? If you have size=2 and 
>> min_size=2, then all your data is safe when one replica is down, but the IO 
>> is paused. If you want to continue IO you need to set min_size=1.
>> But be aware that a single failure after that causes you to lose all the 
>> data, you’d have to revert to the other replica if it comes up and works - 
>> no idea how that works in ceph but will likely be a PITA to do.
>> 
>> Jan
>> 
>>> On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
>>>  wrote:
>>> 
>>> Hi all,
>>> 
>>> Setup details:
>>> Two storage enclosures each connected to 4 OSD nodes (Shared storage).
>>> Failure domain is Chassis (enclosure) level. Replication count is 2.
>>> Each host has allotted with 4 drives.
>>> 
>>> I have active client IO running on cluster. (Random write profile with
>>> 4M block size & 64 Queue depth).
>>> 
>>> One of enclosure had power loss. So all OSD's from hosts that are
>>> connected to this enclosure went down as expected.
>>> 
>>> But client IO got paused. After some time enclosure & hosts connected
>>> to it came up.
>>> And all OSD's on that hosts came up.
>>> 
>>> Till this time, cluster was not serving IO. Once all hosts & OSD's
>>> pertaining to that enclosure came up, client IO resumed.
>>> 
>>> 
>>> Can anybody help me why cluster not serving IO during enclosure
>>> failure. OR its a bug?
>>> 
>>> -Thanks & regards,
>>> Mallikarjun Biradar
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Mallikarjun Biradar
I have size=2 & min_size=1, and IO is paused till all hosts come back.

On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer  wrote:
> What is the min_size setting for the pool? If you have size=2 and min_size=2, 
> then all your data is safe when one replica is down, but the IO is paused. If 
> you want to continue IO you need to set min_size=1.
> But be aware that a single failure after that causes you to lose all the 
> data, you’d have to revert to the other replica if it comes up and works - no 
> idea how that works in ceph but will likely be a PITA to do.
>
> Jan
>
>> On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
>>  wrote:
>>
>> Hi all,
>>
>> Setup details:
>> Two storage enclosures each connected to 4 OSD nodes (Shared storage).
>> Failure domain is Chassis (enclosure) level. Replication count is 2.
>> Each host has allotted with 4 drives.
>>
>> I have active client IO running on cluster. (Random write profile with
>> 4M block size & 64 Queue depth).
>>
>> One of enclosure had power loss. So all OSD's from hosts that are
>> connected to this enclosure went down as expected.
>>
>> But client IO got paused. After some time enclosure & hosts connected
>> to it came up.
>> And all OSD's on that hosts came up.
>>
>> Till this time, cluster was not serving IO. Once all hosts & OSD's
>> pertaining to that enclosure came up, client IO resumed.
>>
>>
>> Can anybody help me why cluster not serving IO during enclosure
>> failure. OR its a bug?
>>
>> -Thanks & regards,
>> Mallikarjun Biradar
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Jan Schermer
What is the min_size setting for the pool? If you have size=2 and min_size=2, 
then all your data is safe when one replica is down, but the IO is paused. If you 
want to continue IO you need to set min_size=1.
But be aware that after that, a single further failure causes you to lose all the 
data; you'd have to revert to the other replica if it comes up and works. No idea 
how that works in Ceph, but it will likely be a PITA to do.
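
For reference, this is a per-pool setting; checking and changing it looks like
this (the pool name here is just an example):

  ceph osd pool get rbd min_size    # show the current value
  ceph osd pool set rbd min_size 1  # allow IO to continue with a single replica up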

Jan

> On 09 Jul 2015, at 12:42, Mallikarjun Biradar 
>  wrote:
> 
> Hi all,
> 
> Setup details:
> Two storage enclosures each connected to 4 OSD nodes (Shared storage).
> Failure domain is Chassis (enclosure) level. Replication count is 2.
> Each host has allotted with 4 drives.
> 
> I have active client IO running on cluster. (Random write profile with
> 4M block size & 64 Queue depth).
> 
> One of enclosure had power loss. So all OSD's from hosts that are
> connected to this enclosure went down as expected.
> 
> But client IO got paused. After some time enclosure & hosts connected
> to it came up.
> And all OSD's on that hosts came up.
> 
> Till this time, cluster was not serving IO. Once all hosts & OSD's
> pertaining to that enclosure came up, client IO resumed.
> 
> 
> Can anybody help me why cluster not serving IO during enclosure
> failure. OR its a bug?
> 
> -Thanks & regards,
> Mallikarjun Biradar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-09 Thread Mallikarjun Biradar
Hi all,

Setup details:
Two storage enclosures each connected to 4 OSD nodes (Shared storage).
Failure domain is Chassis (enclosure) level. Replication count is 2.
Each host is allotted 4 drives.

I have active client IO running on the cluster (random-write profile with
4M block size & 64 queue depth).

One of the enclosures had a power loss, so all OSDs on the hosts connected
to this enclosure went down, as expected.

But client IO got paused. After some time the enclosure & the hosts connected
to it came back up, and all OSDs on those hosts came up.

Until then, the cluster was not serving IO. Once all hosts & OSDs belonging
to that enclosure came up, client IO resumed.

Can anybody help me understand why the cluster was not serving IO during the
enclosure failure? Or is it a bug?

-Thanks & regards,
Mallikarjun Biradar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Investigating my 100 IOPS limit

2015-07-09 Thread Jan Schermer
I hope this would be interesting for some, it nearly cost me my sanity.

Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
with the LSI controllers and some drives.
It almost drove me crazy: I could replicate the problem with ease, but when I 
wanted to show it to someone it was often gone. Sometimes it required fio to 
write for some time before the problem manifested again, or required seemingly 
conflicting settings to come up…

Well, it turns out the problem is fio calling fallocate() when creating the file 
to use for the test, which doesn't really allocate the blocks; it just 
"reserves" them.
When fio then writes to those blocks, the filesystem journal becomes the 
bottleneck (the 100 IOPS* limit can be seen there, with 100% utilization).

If, however, I create the file with dd or similar, those writes do _not_ end up 
in the journal, and the result is 10K synchronous 4K IOPS on the same drive.
If, for example, I run fio with a 1M block size, it still does 100* IOPS, and 
when I then run a 4K block size test without deleting the file, it runs at a 
10K IOPS pace until it hits the first unwritten blocks; then it slows to a 
crawl again.
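
A minimal sketch of the dd workaround, with illustrative path and size (fio
should then reuse the already-written file as long as the size matches):

  # fully write out the file first so no extents are left "unwritten"
  dd if=/dev/zero of=/mnt/something/testfile.fio bs=1M count=1000 oflag=direct conv=fsync

Recent fio versions also have a fallocate=none option, which should avoid the
preallocation in the first place.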

The same issue is present with XFS and ext3/ext4 (with default mount options), 
and no matter how I create or mount the filesystem, I cannot avoid this problem. 
The only way to avoid it is to mount ext4 with -o journal_async_commit, which 
should be safe, but...
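
i.e. something along these lines (device and mountpoint are placeholders,
assuming the kernel accepts the option in combination with the default data
mode):

  mount -t ext4 -o noatime,journal_async_commit /dev/sdX1 /mnt/something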

I am working on top of a CentOS 6.5 install (2.6.32 kernel), with LSI HBAs and 
Kingston SSDs in this case (interestingly, this issue does not seem to occur on 
Samsung SSDs!). I think it has something to do with the LSI controller faking 
"FUA" support for the drives (AFAIK they don't support it, so the controller must 
somehow flush the cache, which is what introduces the huge latency hit).
I can't replicate this problem on the block device itself, only on a file on a 
filesystem, so it might as well be a kernel/driver bug. I have a blktrace 
showing the difference between the "good" and "bad" writes, but I don't know 
what the driver/controller does; I only see the write on the log device 
finishing after a long 10ms.

Could someone tell me how Ceph creates the filesystem objects? I suppose it 
does fallocate() as well, right? Is there any way to force it to write them out 
completely, instead of using fallocate(), to get around this issue I have?

How to replicate:

fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k 
--numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test 
--size=1000M --ioengine=libaio
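
To check whether the test file still contains unwritten (fallocated) extents,
something like this works (filefrag comes with e2fsprogs):

  filefrag -v /mnt/something/testfile.fio
  # extents still flagged "unwritten" are the ones that hit the slow journal path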


* It is in fact 98 IOPS. Exactly. Not more, not less :-)

Jan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fuse mount in fstab

2015-07-09 Thread Kenneth Waegeman

Hmm, it looks like a version issue..

I am testing with these versions on centos7:
 ~]# mount -V
mount from util-linux 2.23.2 (libmount 2.23.0: selinux, debug, assert)
 ~]# ceph-fuse -v
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

This does not work.


On my fedora box, with these versions from repo:
# mount -V
mount from util-linux 2.24.2 (libmount 2.24.0: selinux, debug, assert)
# ceph-fuse -v
ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)

This works.


Which versions are you running?
And does anyone know from which versions, or which version 
combinations, this works?
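
(As a side note, mounting directly with ceph-fuse, bypassing mount(8) and the
fstab parsing entirely, can help narrow this down; the monitor address below is
a placeholder:)

  ceph-fuse --id cephfs -m mon1:6789 /mnt/ceph
  # if this works, the problem is in how mount(8) handles the fstab "device"
  # field, not in ceph-fuse itself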


Thanks a lot!
K

On 07/09/2015 11:53 AM, Thomas Lemarchand wrote:

Hello Kenneth,

I have a working ceph fuse in fstab. Only difference I see it that I
don't use "conf", your configuration file is at the default path
anyway.

I tried it with and without conf, but it always complains about id


id=recette-files-rw,client_mountpoint=/recette-files/files
  /mnt/wimi/ceph-files  fuse.ceph noatime,_netdev 0 0



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fuse mount in fstab

2015-07-09 Thread Thomas Lemarchand
Hello Kenneth,

I have a working ceph-fuse mount in fstab. The only difference I see is that I
don't use "conf"; your configuration file is at the default path
anyway.

id=recette-files-rw,client_mountpoint=/recette-files/files  
 /mnt/wimi/ceph-files  fuse.ceph noatime,_netdev 0 0


-- 
Thomas Lemarchand
Cloud Solutions SAS - Responsable des systèmes d'information


On jeu., 2015-07-09 at 11:45 +0200, Kenneth Waegeman wrote:
> Hi all,
> 
> we are trying to mount ceph-fuse in fstab, following this: 
> http://ceph.com/docs/master/cephfs/fstab/
> 
> When we add this:
> 
> id=cephfs,conf=/etc/ceph/ceph.conf  /mnt/ceph   fuse.ceph 
> defaults0 0
> 
> to fstab, we get an error message running mount:
> 
> mount: can't find id=cephfs,conf=/etc/ceph/ceph.conf
> 
> same happens when only using id=cephfs
> 
> I've found an old thread also mentioning this, but without solution..
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014
> -January/037049.html)
> 
> Thanks!
> Kenneth
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] fuse mount in fstab

2015-07-09 Thread Kenneth Waegeman

Hi all,

we are trying to mount ceph-fuse in fstab, following this: 
http://ceph.com/docs/master/cephfs/fstab/


When we add this:

id=cephfs,conf=/etc/ceph/ceph.conf  /mnt/ceph   fuse.ceph 
defaults0 0


to fstab, we get an error message running mount:

mount: can't find id=cephfs,conf=/etc/ceph/ceph.conf

same happens when only using id=cephfs

I've found an old thread also mentioning this, but without solution..
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/037049.html)

Thanks!
Kenneth
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Götz Reinicke - IT Koordinator
Hi Christian,
Am 09.07.15 um 09:36 schrieb Christian Balzer:
> 
> Hello,
> 
> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
> 
>> Hi again,
>>
>> time is passing, so is my budget :-/ and I have to recheck the options
>> for a "starter" cluster. An expansion next year for may be an openstack
>> installation or more performance if the demands rise is possible. The
>> "starter" could always be used as test or slow dark archive.
>>
>> At the beginning I was at 16SATA OSDs with 4 SSDs for journal per node,
>> but now I'm looking for 12 SATA OSDs without SSD journal. Less
>> performance, less capacity I know. But thats ok!
>>
> Leave the space to upgrade these nodes with SSDs in the future.
> If your cluster grows large enough (more than 20 nodes) even a single
> P3700 might do the trick and will need only a PCIe slot.

If I get you right, the 12-disk setup is not a bad idea, and if there turns out
to be a need for SSD journals I can add the PCIe P3700.

In the 12-OSD setup I should get 2 P3700s, one per 6 OSDs.

Good or bad idea?

> 
>> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
>>
> Danger, Will Robinson.
> This is essentially a RAID5 and you're plain asking for a double disk
> failure to happen.

Maybe I do not understand that. size = 2 I think is more like RAID1 ... ? And
why am I asking for a double disk failure?

Too few nodes or OSDs, or because of size = 2?

> 
> See this recent thread:
> "calculating maximum number of disk and node failure that can be handled
> by cluster with out data loss"
> for some discussion and python script which you will need to modify for
> 2 disk replication.
> 
> With a RAID5 failure calculator you're at 1 data loss event per 3.5
> years...
> 

Thanks for that thread, but I don't get the point of it for my case.

I see that calculating the reliability is some sort of complex math ...

>> The workload I expect is more writes of may be some GB of Office files
>> per day and some TB of larger video Files from a few users per week.
>>
>> At the end of this year we calculate to have +- 60 to 80 TB of lager
>> videofiles in that cluster, which are accessed from time to time.
>>
>> Any suggestion on the drop of ssd journals?
>>
> You will miss them when the cluster does write, be it from clients or when
> re-balancing a lost OSD.

I can imagine that I might miss the SSD journal, but if I can add the
P3700 later I feel comfortable with it for now. Budget and evaluation related.
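
A rough sketch of moving an existing OSD's journal to the NVMe later, assuming
the classic filestore layout where the journal is a symlink under
/var/lib/ceph/osd/ceph-<id>/ (osd.5, the partition and the service commands are
placeholders, depending on the init system):

  ceph osd set noout
  service ceph stop osd.5                 # stop just this OSD
  ceph-osd -i 5 --flush-journal           # write out whatever is still in the old journal
  ln -sf /dev/nvme0n1p1 /var/lib/ceph/osd/ceph-5/journal   # point the OSD at the new journal partition
  ceph-osd -i 5 --mkjournal               # initialize the new journal
  service ceph start osd.5
  ceph osd unset noout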

Thanks for your helpful input and feedback. /Götz

-- 
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
Staatssekretär im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "ERROR: rgw_obj_remove(): cls_cxx_remove returned -2" on OSDs since Hammer upgrade

2015-07-09 Thread Sylvain Munaut
Hi,


Since I upgraded to Hammer last weekend, I see errors such as

7eff5322d700  0  cls/rgw/cls_rgw.cc:1947: ERROR:
rgw_obj_remove(): cls_cxx_remove returned -2

in the logs.

What's going on?


Can this be related to the unexplained write activity I see on my OSDs?


Cheers,


   Sylvain
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-09 Thread Christian Balzer

Hello,

On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:

> Hi again,
> 
> time is passing, so is my budget :-/ and I have to recheck the options
> for a "starter" cluster. An expansion next year for may be an openstack
> installation or more performance if the demands rise is possible. The
> "starter" could always be used as test or slow dark archive.
> 
> At the beginning I was at 16SATA OSDs with 4 SSDs for journal per node,
> but now I'm looking for 12 SATA OSDs without SSD journal. Less
> performance, less capacity I know. But thats ok!
> 
Leave the space to upgrade these nodes with SSDs in the future.
If your cluster grows large enough (more than 20 nodes) even a single
P3700 might do the trick and will need only a PCIe slot.

> There should be 6 may be with the 12 OSDs 8 Nodes with a repl. of 2.
> 
Danger, Will Robinson.
This is essentially a RAID5 and you're plain asking for a double disk
failure to happen.

See this recent thread:
"calculating maximum number of disk and node failure that can be handled
by cluster with out data loss"
for some discussion and python script which you will need to modify for
2 disk replication.

With a RAID5 failure calculator you're at 1 data loss event per 3.5
years...

> The workload I expect is more writes of may be some GB of Office files
> per day and some TB of larger video Files from a few users per week.
> 
> At the end of this year we calculate to have +- 60 to 80 TB of lager
> videofiles in that cluster, which are accessed from time to time.
> 
> Any suggestion on the drop of ssd journals?
> 
You will miss them when the cluster does write, be it from clients or when
re-balancing a lost OSD.

Christian
>   Thanks as always for your feedback . Götz
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot map rbd image with striping!

2015-07-09 Thread Ilya Dryomov
On Wed, Jul 8, 2015 at 11:02 PM, Hadi Montakhabi  wrote:
> Thank you!
> Is striping supported while using CephFS?

Yes.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replace OSD disk without removing the osd from crush

2015-07-09 Thread Stefan Priebe - Profihost AG

Am 08.07.2015 um 23:33 schrieb Somnath Roy:
> Yes, I am able to reproduce that too..Not sure if this is a bug or change.

That's odd. Can someone from Inktank comment?


> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Stefan Priebe [mailto:s.pri...@profihost.ag] 
> Sent: Wednesday, July 08, 2015 1:09 PM
> To: Somnath Roy; ceph-users
> Subject: Re: [ceph-users] replace OSD disk without removing the osd from crush
> 
> Hi,
> Am 08.07.2015 um 22:03 schrieb Somnath Roy:
>> Run 'ceph osd set noout' before replacing
> 
> sure but that didn't worked since firefly for me.
> 
> I did:
> # set noout
> # ceph stop osd.5
> # removed disk
> # inserted new disk
> # format disk and mount disk
> # start mkjournal mkkey mkkfs
> # remove old osd auth key add new key
> 
> I can start the osd but i never comes up.
> 
> It only works for me if i completely remove the osd and create a new one:
> ceph osd crush remove osd.5
> ceph auth del osd.5
> ceph osd rm osd.5
> 
> ceph osd create
> ...
> 
> Stefan
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Stefan Priebe
>> Sent: Wednesday, July 08, 2015 12:58 PM
>> To: ceph-users
>> Subject: [ceph-users] replace OSD disk without removing the osd from crush
>>
>> Hi,
>>
>> is there any way to replace an osd disk without removing the osd from crush, 
>> auth, ...
>>
>> Just recreate the same OSD?
>>
>> Stefan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> 
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com