Re: [ceph-users] rebalancing taking very long time

2015-09-03 Thread Bob Ababurko
I found a place to paste my output for `ceph daemon osd.xx config show` for
all my OSDs:

https://www.zerobin.net/?743bbbdea41874f4#FNk5EjsfRxvkX1JuTp52fQ4CXW6VOIEB0Lj0Icnyr4Q=

If you want it in a gzip'd txt file, you can download here:

https://mega.nz/#!oY5QAByC!JEWhHRms0WwbYbwG4o4RdTUWtFwFjUDLWhtNtEDhBkA

It honestly looks to me like the disks are maxing out on IOPS: a good
portion of them hit 100% utilization according to dstat whenever there is
rebalancing or client I/O.  I'm running this to look at my disk
stats:

dstat -cd --disk-util -D sda,sdb,sdc,sdd,sde,sdf,sdg,sdh --disk-tps

I don't have any client load on my cluster at this point to show good
output, but with just '11 active+clean+scrubbing+deep' running, I am
seeing 70-80% disk utilization for each OSD according to dstat.
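
A rough way to cross-check dstat's view from the Ceph side (a hedged sketch;
'ceph osd perf' should be available on Hammer, and the device list below is
just the one from my dstat command above):

ceph osd perf        # per-OSD commit/apply latency in ms; saturated spindles stand out
iostat -dxm 5 sda sdb sdc sdd sde sdf sdg sdh    # %util and await per device, 5s samples
ceph osd tree        # to map OSD ids back to hosts when comparing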




On Thu, Sep 3, 2015 at 2:34 AM, Jan Schermer  wrote:

> Can you post the output of
>
> ceph daemon osd.xx config show? (probably as an attachment).
>
> There are several things that I've seen cause it
> 1) too many PGs but too few degraded objects makes it seem "slow" (if
> you just have 2 degraded objects but restarted a host with 10K PGs, it will
> probably have to scan all the PGs)
> 2) sometimes the process gets stuck when a toofull condition occurs
> 3) sometimes the process gets stuck for no apparent reason - restarting
> the currently backfilling/recovering OSDs fixes it
> setting osd_recovery_threads sometimes fixes both 2) and 3), but usually
> not
> 4) setting recovery_delay_start to anything > 0 makes recovery slow (even
> 0.001 makes it much slower than simple 0). On the other hand we had to
> set it high as a default because of slow ops when restarting OSDs, which
> was partially fixed by this.
>
> Can you see any bottleneck in the system? CPU spinning, disks reading? I
> don't think this is the issue, just make sure it's not something more
> obvious...
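
For reference, a hedged sketch of inspecting and nudging the recovery/backfill
throttles mentioned above (the values are only illustrative, and injected
values do not survive a daemon restart, so mirror anything you keep into
ceph.conf):

# what the OSDs are currently running with
ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_delay_start|osd_recovery_op_priority|osd_recovery_threads'

# push new values to all running OSDs
ceph tell osd.\* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 5 --osd-recovery-delay-start 0'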
>
> Jan
>
>
> On 02 Sep 2015, at 22:34, Bob Ababurko  wrote:
>
> When I lose a disk or replace an OSD in my POC ceph cluster, it takes a
> very long time to rebalance.  I should note that my cluster is slightly
> unique in that I am using cephfs (shouldn't matter?) and it currently
> contains about 310 million objects.
>
> The last time I replaced a disk/OSD was 2.5 days ago and it is still
> rebalancing.  This is on a cluster with no client load.
>
> The configuration is 5 hosts with 6 x 1TB 7200rpm SATA OSDs & 1 850 Pro
> SSD which contains the journals for said OSDs.  That means 30 OSDs in
> total.  The system disk is on its own disk.  I'm also using a backend network
> with a single Gb NIC.  The rebalancing rate (objects/s) seems to be very slow
> when it is close to finishing...say <1% objects misplaced.
>
> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
> with no load on the cluster.  Are my expectations off?
>
> I'm not sure if my pg_num/pgp_num needs to be changed, or if the rebalance
> time is dependent on the number of objects in the pool.  These are thoughts
> I've had but am not certain are relevant here.
>
> $ sudo ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> $ sudo ceph -s
> [sudo] password for bababurko:
> cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>  health HEALTH_WARN
> 5 pgs backfilling
> 5 pgs stuck unclean
> recovery 3046506/676638611 objects misplaced (0.450%)
>  monmap e1: 3 mons at {cephmon01=
> 10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0
> }
> election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>  mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>  osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>   pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
> 18319 GB used, 9612 GB / 27931 GB avail
> 3046506/676638611 objects misplaced (0.450%)
> 2095 active+clean
>   12 active+clean+scrubbing+deep
>5 active+remapped+backfilling
> recovery io 2294 kB/s, 147 objects/s
>
> $ sudo rados df
> pool name KB  objects   clones degraded
>  unfound   rdrd KB   wrwr KB
> cephfs_data   676756996233574670200
> 0  21368341676984208   7052266742
> cephfs_metadata42738  105843700
> 0 16130199  30718800215295996938   3811963908
> rbd0000
> 00000
>   total use

[ceph-users] rebalancing taking very long time

2015-09-02 Thread Bob Ababurko
When I lose a disk or replace an OSD in my POC ceph cluster, it takes a very
long time to rebalance.  I should note that my cluster is slightly unique
in that I am using cephfs (shouldn't matter?) and it currently contains
about 310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing.  This is on a cluster with no client load.

The configuration is 5 hosts with 6 x 1TB 7200rpm SATA OSDs & 1 850 Pro
SSD which contains the journals for said OSDs.  That means 30 OSDs in
total.  The system disk is on its own disk.  I'm also using a backend network
with a single Gb NIC.  The rebalancing rate (objects/s) seems to be very slow
when it is close to finishing...say <1% objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster.  Are my expectations off?

I'm not sure if my pg_num/pgp_num needs to be changed, or if the rebalance time
is dependent on the number of objects in the pool.  These are thoughts I've
had but am not certain are relevant here.
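
For reference, the usual (hedged) starting point for pg_num is roughly
(number of OSDs x 100) / replica count, rounded up to a power of two; with 30
OSDs and an assumed pool size of 3 that lands on 1024, which matches the pools
below, so the PG count alone probably isn't the problem:

osds=30; replicas=3
echo $(( osds * 100 / replicas ))      # ~1000 -> round up to 1024
ceph osd pool get cephfs_data size     # confirm the actual replica count
ceph osd pool get cephfs_data pg_num
ceph osd pool get cephfs_data pgp_num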

$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
[sudo] password for bababurko:
cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
 health HEALTH_WARN
5 pgs backfilling
5 pgs stuck unclean
recovery 3046506/676638611 objects misplaced (0.450%)
 monmap e1: 3 mons at {cephmon01=
10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0
}
election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
 mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
 osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
  pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
18319 GB used, 9612 GB / 27931 GB avail
3046506/676638611 objects misplaced (0.450%)
2095 active+clean
  12 active+clean+scrubbing+deep
   5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name KB  objects   clones degraded
 unfound   rdrd KB   wrwr KB
cephfs_data   676756996233574670200
  0  21368341676984208   7052266742
cephfs_metadata42738  105843700
  0 16130199  30718800215295996938   3811963908
rbd0000
  00000
  total used 19209068780336805139
  total avail10079469460
  total space29288538240

$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024


thanks,
Bob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds server(s) crashed

2015-08-12 Thread Bob Ababurko
On Wed, Aug 12, 2015 at 7:21 PM, Yan, Zheng  wrote:

> On Thu, Aug 13, 2015 at 7:05 AM, Bob Ababurko  wrote:
> >
> > If I am using a more recent client (kernel or ceph-fuse), should I still
> > be worried about the MDSs crashing?  I have added RAM to my MDS hosts, and
> > it's my understanding this will also help mitigate any issues, in addition
> > to setting mds_bal_frag = true.  Not having used cephfs before, do I always
> > need to worry about my MDS servers crashing all the time, and thus the need
> > for setting mds_reconnect_timeout to 0?  This is not ideal for us, nor is
> > the idea of clients not being able to access their mounts after an MDS
> > recovery.
> >
>
> It's unlikely this issue will happen again, but I can't guarantee there
> won't be other issues.
>
> no need to set mds_reconnect_timeout to 0.
>

ok, Good to know.


>
>
> > I am actually looking for the most stable way to implement cephfs at this
> > point.  My cephfs cluster contains millions of small files, so many inodes,
> > if that needs to be taken into account.  Perhaps I should only be using one
> > MDS node for stability at this point?  Is this the best way forward to get
> > a handle on stability?  I'm also curious whether I should set my mds cache
> > size to a number greater than the number of files I have in the cephfs
> > cluster?  If you can give some key points to configure cephfs for the best
> > stability and, if possible, availability...this would be helpful to me.
>
> One active MDS is the most stable setup. Adding a few standby MDS
> should not hurt stability.
>
> You can't set mds cache size to a number greater than the number of files
> in the fs; it would require lots of memory.
>


I'm not sure what amount of RAM you consider to be 'lots' but I would
really like to understand a bit more about this.  Perhaps a rule of thumb?
Is there an advantage to more RAM & a large mds cache size?  We plan on
putting close to a billion small files in this pool via cephfs, so what
should we be considering when sizing our MDS hosts, or changing in the MDS
config?  Basically, what should we or should we not be doing when we have a
cluster with this many files?  Thanks!
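
For what it's worth, a very rough back-of-envelope (an assumption, not an
official rule of thumb): the CInode allocation visible in the backtrace later
in this thread is ~2.4 KB, and the real per-inode cache cost is higher once
dentries and dirfrags are counted, so figure something like 2-4 KB per cached
inode:

mds_cache_size=10000000      # inodes to keep cached (example value)
bytes_per_inode=4096         # pessimistic assumption, see above
echo "$(( mds_cache_size * bytes_per_inode / 1024**3 )) GB of RAM, plus headroom"

# the limit itself goes in ceph.conf on the MDS hosts, e.g.
#   [mds]
#   mds cache size = 10000000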


> Yan, Zheng
>
> >
> > thanks again for the help.
> >
> > thanks,
> > Bob
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds server(s) crashed

2015-08-12 Thread Bob Ababurko
If I am using a more recent client (kernel or ceph-fuse), should I still be
worried about the MDSs crashing?  I have added RAM to my MDS hosts, and it's
my understanding this will also help mitigate any issues, in addition to
setting mds_bal_frag = true.  Not having used cephfs before, do I always
need to worry about my MDS servers crashing all the time, and thus the need for
setting mds_reconnect_timeout to 0?  This is not ideal for us, nor is the
idea of clients not being able to access their mounts after an MDS recovery.

I am actually looking for the most stable way to implement cephfs at this
point.  My cephfs cluster contains millions of small files, so many inodes,
if that needs to be taken into account.  Perhaps I should only be using one
MDS node for stability at this point?  Is this the best way forward to get
a handle on stability?  I'm also curious whether I should set my mds cache
size to a number greater than the number of files I have in the cephfs
cluster?  If you can give some key points to configure cephfs for the best
stability and, if possible, availability...this would be helpful to me.

thanks again for the help.

thanks,
Bob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Bob Ababurko
John,

This seems to have worked.  I rebooted my client and restarted ceph on the
MDS hosts after giving them more RAM.  I restarted the rsyncs that were
running on the client after remounting the cephfs filesystem, and things seem
to be working.  I can access the files, so that is a relief.

What is risky about enabling mds_bal_frag on a cluster that already has data,
and will there be any performance degradation if it is enabled?
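
For anyone searching later, a hedged sketch of how mds_bal_frag is typically
turned on (I'm assuming the running daemon accepts it over the admin socket;
verify with config show either way, and the mds name is just an example):

# persistent, in ceph.conf on the MDS hosts:
#   [mds]
#   mds bal frag = true

ceph daemon mds.cephmds01 config set mds_bal_frag true
ceph daemon mds.cephmds01 config show | grep mds_bal_frag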

Thanks again for the help.

On Tue, Aug 11, 2015 at 2:25 PM, John Spray  wrote:

> On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko  wrote:
> > Here is the backtrace from the core dump.
> >
> > (gdb) bt
> > #0  0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
> > #1  0x0087065d in reraise_fatal (signum=6) at
> > global/signal_handler.cc:59
> > #2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
> > #3  
> > #4  0x7f71f40235d7 in raise () from /lib64/libc.so.6
> > #5  0x7f71f4024cc8 in abort () from /lib64/libc.so.6
> > #6  0x7f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() ()
> from
> > /lib64/libstdc++.so.6
> > #7  0x7f71f4925926 in ?? () from /lib64/libstdc++.so.6
> > #8  0x7f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
> > #9  0x7f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
> > #10 0x0077d0fc in operator new (num_bytes=2408) at
> mds/CInode.h:120
> > Python Exception  list index out of range:
> > #11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with
> > 65536 elements, want_dn="", r=) at mds/CDir.cc:1700
> > #12 0x007d7d44 in complete (r=0, this=0x502b000) at
> > include/Context.h:65
> > #13 MDSIOContextBase::complete (this=0x502b000, r=0) at
> mds/MDSContext.cc:59
> > #14 0x00894818 in Finisher::finisher_thread_entry
> (this=0x5108698)
> > at common/Finisher.cc:59
> > #15 0x7f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
> > #16 0x7f71f40e41ad in clone () from /lib64/libc.so.6
>
> If we believe the line numbers here, then it's a malloc failure.  Are
> you running out of memory?
>
> The MDS is loading a bunch of these 64k file directories (presumably a
> characteristic of your workload), and ending up with an unusually
> large number of inodes in cache (this is all happening during the
> "rejoin" phase so no trimming of the cache is done and we merrily
> exceed the default mds_cache_size limit of 100k inodes).
>
> The thing triggering the load of the dirs is clients replaying
> requests that refer to inodes by inode number, and the MDS's procedure
> for handling that involves fully loading the relevant dirs.  That
> might be something we can improve; it doesn't seem obviously necessary
> to load all the dentries in a dirfrag during this phase.
>
> Anyway, you can hopefully recover from this state by forcibly
> unmounting your clients.  Since you're using the kernel client it may
> be easiest to hard reset the client boxes.  When you next restart your
> MDS, the clients won't be present, so the MDS will be able to make it
> all the way up without trying to load a bunch of directory fragments.
> If you've got some more RAM for the MDS box that wouldn't hurt either.
>
> One of the less well tested (but relevant here) features we have is
> directory fragmentation, where large dirs like these are internally
> split up (partly to avoid memory management issues like this).  It
> might be a risky business on a system that you've already got real
> data on, but once your MDS is back up and running you can try enabling
> the mds_bal_frag setting.
>
> This is not a use case we have particularly strong coverage of in our
> automated tests, so thanks for your experimentation and persistence.
>
> John
>
> >
> > I have also gotten a log file w/ debug mds = 20.  It was 1.2GB, so I
> > bzip2'd it w/ max compression and got it down to 75MB.  I wasn't sure
> where
> > to upload it so if there is a better place to put it, please let me know.
> >
> > https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8
> >
> > thanks,
> > Bob
> >
> >
> > On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng  wrote:
> >>
> >> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko  wrote:
> >> > I had a dual mds server configuration and have been copying data via
> >> > cephfs
> >> > kernel module to my cluster for the past 3 weeks and just had a MDS
> >> > crash
> >> > halting all IO.  Leading up to the crash, I ran a test dd that
> increased
> >> > the
> >> > throughpu

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Bob Ababurko
Yes, this was a package install and ceph-debuginfo was used and hopefully
the output of the backtrace is useful.

I thought it was interesting that you mentioned reproducing it with an ls,
because aside from me doing a large dd before this issue surfaced, your
post made me recall that around the same time I also ran ls a few times to
drill down and eventually list the files that are located two
subdirectories down.  I also recall that for a moment I found it strange
that I got results back so quickly, because our netapp takes forever to do
this...it was so quick that, in retrospect, the list of files may not have
been complete.  I regret not following up on that thought.



On Tue, Aug 11, 2015 at 1:52 AM, John Spray  wrote:

> On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko  wrote:
> > I had a dual mds server configuration and have been copying data via
> cephfs
> > kernel module to my cluster for the past 3 weeks and just had a MDS crash
> > halting all IO.  Leading up to the crash, I ran a test dd that increased
> the
> > throughput by about 2x and stopped it but about 10 minutes later, the MDS
> > server crashed and did not fail over to the standby properly. I have been
> > using an active/standby mds configuration but neither of the mds servers
> > will stay running at this point; they crash shortly after I start them.
> >
> > [bababurko@cephmon01 ~]$ sudo ceph -s
> > cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> >  health HEALTH_WARN
> > mds cluster is degraded
> > mds cephmds02 is laggy
> > noscrub,nodeep-scrub flag(s) set
> >  monmap e1: 3 mons at
> > {cephmon01=
> 10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0
> }
> > election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> >  mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
> >  osdmap e324: 30 osds: 30 up, 30 in
> > flags noscrub,nodeep-scrub
> >   pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
> > 14051 GB used, 13880 GB / 27931 GB avail
> > 2112 active+clean
> >
> >
> > I am not sure what information is relevant so I will try to cover what I
> > think is relevant based on posts I have read through:
> >
> > Cluster:
> > running ceph-0.94.1 on CentOS 7.1
> > [root@mdstest02 bababurko]$ uname -r
> > 3.10.0-229.el7.x86_64
> >
> > Here is my ceph-mds log with 'debug objector = 10' :
> >
> >
> https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>
> Ouch!  Unfortunately all we can tell from this is that we're hitting
> an assertion somewhere while loading a directory fragment from disk.
>
> As Zheng says, you'll need to drill a bit deeper.  If you were
> installing from packages you may find ceph-debuginfo useful.  In
> addition to getting us a clearer stack trace with debug symbols,
> please also crank "debug mds" up to 20 (this is massively verbose so
> hopefully it doesn't take too long to reproduce the issue).
>
> Hopefully this is fairly straightforward to reproduce.  If it's
> something fundamentally malformed on disk then just doing a recursive
> ls on the filesystem would trigger it, at least.
>
> Cheers,
> John
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Bob Ababurko
Here is the backtrace from the core dump.

(gdb) bt
#0  0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
#1  0x0087065d in reraise_fatal (signum=6) at
global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  
#4  0x7f71f40235d7 in raise () from /lib64/libc.so.6
#5  0x7f71f4024cc8 in abort () from /lib64/libc.so.6
#6  0x7f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from
/lib64/libstdc++.so.6
#7  0x7f71f4925926 in ?? () from /lib64/libstdc++.so.6
#8  0x7f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x7f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x0077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
Python Exception  list index out of range:
#11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with
65536 elements, want_dn="", r=) at mds/CDir.cc:1700
#12 0x007d7d44 in complete (r=0, this=0x502b000) at
include/Context.h:65
#13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
#14 0x00894818 in Finisher::finisher_thread_entry (this=0x5108698)
at common/Finisher.cc:59
#15 0x7f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
#16 0x7f71f40e41ad in clone () from /lib64/libc.so.6

I have also gotten a log file w/ debug mds = 20.  It was 1.2GB, so I
bzip2'd it w/ max compression and got it down to 75MB.  I wasn't sure where
to upload it so if there is a better place to put it, please let me know.

https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8

thanks,
Bob


On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng  wrote:

> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko  wrote:
> > I had a dual mds server configuration and have been copying data via
> cephfs
> > kernel module to my cluster for the past 3 weeks and just had a MDS crash
> > halting all IO.  Leading up to the crash, I ran a test dd that increased
> the
> > throughput by about 2x and stopped it but about 10 minutes later, the MDS
> > server crashed and did not fail over to the standby properly. I have been
> > using an active/standby mds configuration but neither of the mds servers
> > will stay running at this point; they crash shortly after I start them.
> >
> > [bababurko@cephmon01 ~]$ sudo ceph -s
> > cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> >  health HEALTH_WARN
> > mds cluster is degraded
> > mds cephmds02 is laggy
> > noscrub,nodeep-scrub flag(s) set
> >  monmap e1: 3 mons at
> > {cephmon01=
> 10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0
> }
> > election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> >  mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
> >  osdmap e324: 30 osds: 30 up, 30 in
> > flags noscrub,nodeep-scrub
> >   pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
> > 14051 GB used, 13880 GB / 27931 GB avail
> > 2112 active+clean
> >
> >
> > I am not sure what information is relevant so I will try to cover what I
> > think is relevant based on posts I have read through:
> >
> > Cluster:
> > running ceph-0.94.1 on CentOS 7.1
> > [root@mdstest02 bababurko]$ uname -r
> > 3.10.0-229.el7.x86_64
> >
> > Here is my ceph-mds log with 'debug objector = 10' :
> >
> >
> https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>
>
> Could you use gdb to check where the crash happened? (gdb
> /usr/local/bin/ceph-mds /core.x; maybe you need to re-compile the mds
> with debuginfo.)
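
A hedged sketch of those gdb steps (package name and paths are the usual
CentOS ones; adjust to where your core file actually landed):

yum install ceph-debuginfo             # must match the installed ceph version
gdb /usr/bin/ceph-mds /path/to/core
(gdb) bt                               # backtrace of the crashing thread
(gdb) thread apply all bt              # all threads, if the first isn't conclusive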
>
> Yan, Zheng
>
> >
> > cat /sys/kernel/debug/ceph/*/mdsc output:
> >
> >
> https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
> >
> > ceph.conf :
> >
> >
> https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
> >
> > I have copied almost 5TB of small files to this cluster, which has taken
> > the better part of three weeks, so I am really hoping that there is a way
> > to recover from this.  This is our POC cluster.
> >
> > I'm sure I have missed something relevant, as I'm just getting my mind
> > back after nearly losing it, so feel free to ask for anything to assist.
> >
> > Any help would be greatly appreciated.
> >
> > thanks,
> > Bob
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-10 Thread Bob Ababurko
Thanks John.  I'll give that a try as soon as I fix an issue with my MDS
servers that cropped up today.

On Mon, Aug 10, 2015 at 2:58 AM, John Spray  wrote:

> On Fri, Aug 7, 2015 at 1:36 AM, Bob Ababurko  wrote:
> > @John,
> >
> > Can you clarify which values would suggest that my metadata pool is too
> > slow?   I have added a link that includes values for the "op_active" &
> > "handle_client_request"gathered in a crude fashion but should
> hopefully
> > give enough data to paint a picture of what is happening.
> >
> > http://pastebin.com/5zAG8VXT
>
> Dividing by the first 20s of the second sample period, you're seeing
> ~750 client metadata operations handled per second, which is kind of a
> baseline level of performance (a little better than what I get running
> a ceph cluster locally on my workstation).  That's probably
> corresponding to roughly the same number of file creates per second --
> your workload is very much a small file one, where "files per second"
> is a much more meaningful measure than IOPS or MB/s.
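
A hedged sketch of the sort of sampling behind those numbers, i.e. per-second
deltas of handle_client_request taken off the admin socket (the mds name and
the 1s interval are just examples):

prev=""
while sleep 1; do
  cur=$(ceph daemon mds.cephmds02 perf dump \
        | grep -o '"handle_client_request": *[0-9]*' | grep -o '[0-9]\+$')
  [ -n "$prev" ] && echo "$(date +%T)  $(( cur - prev )) client requests/s"
  prev=$cur
done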
>
> It does look like the kind of pattern where you've got a large clutch
> of several thousand metadata pool rados ops coming out every few
> seconds, then draining out over a few seconds.  Your metadata pool
> isn't pathologically slow (it's completing at least hundreds of ops
> per second), but it is noticeable that during some periods where
> op_active is draining, handle_client_request is not incrementing --
> i.e. client metadata ops are stalling while the MDS waits for its
> RADOS operations to complete.
>
> I can't say a massive amount beyond that, other than what you'd
> already figured out -- it would be worth trying to put some faster
> storage in for your metadata pool.
>
> John
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds server(s) crashed

2015-08-10 Thread Bob Ababurko
I had a dual mds server configuration and have been copying data via the cephfs
kernel module to my cluster for the past 3 weeks, and just had an MDS crash
halting all IO.  Leading up to the crash, I ran a test dd that increased
the throughput by about 2x and stopped it, but about 10 minutes later the
MDS server crashed and did not fail over to the standby properly. I have been
using an active/standby mds configuration but neither of the mds servers
will stay running at this point; they crash shortly after I start them.

[bababurko@cephmon01 ~]$ sudo ceph -s
cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
 health HEALTH_WARN
mds cluster is degraded
mds cephmds02 is laggy
noscrub,nodeep-scrub flag(s) set
 monmap e1: 3 mons at {cephmon01=
10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0
}
election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
 mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
 osdmap e324: 30 osds: 30 up, 30 in
flags noscrub,nodeep-scrub
  pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
14051 GB used, 13880 GB / 27931 GB avail
2112 active+clean


I am not sure what information is relevant so I will try to cover what I
think is relevant based on posts I have read through:

Cluster:
running ceph-0.94.1 on CentOS 7.1
[root@mdstest02 bababurko]$ uname -r
3.10.0-229.el7.x86_64

Here is my ceph-mds log with 'debug objector = 10' :

https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=

cat /sys/kernel/debug/ceph/*/mdsc output:

https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=

ceph.conf :

https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=

I have copied almost 5TB of small files to this cluster, which has taken the
better part of three weeks, so I am really hoping that there is a way to
recover from this.  This is our POC cluster.

I'm sure I have missed something relevant, as I'm just getting my mind back
after nearly losing it, so feel free to ask for anything to assist.

Any help would be greatly appreciated.

thanks,
Bob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-06 Thread Bob Ababurko
@John,

Can you clarify which values would suggest that my metadata pool is too
slow?   I have added a link that includes values for the "op_active"
& "handle_client_request"gathered in a crude fashion but should
hopefully give enough data to paint a picture of what is happening.

http://pastebin.com/5zAG8VXT

thanks in advance,
Bob

On Thu, Aug 6, 2015 at 1:24 AM, Bob Ababurko  wrote:

> I should have probably condensed my findings over the course of the day
> into one post, but I guess that's just not how I'm built.
>
> Another data point: I ran `ceph daemon mds.cephmds02 perf dump` in a
> while loop w/ a 1 second sleep, grepping out the stats John mentioned, and
> at times (~every 10-15 seconds) I see some large objecter.op_active
> values.  After the high values hit, there are 5-10 seconds of zero values.
>
> "handle_client_request": 5785438,
> "op_active": 2375,
> "handle_client_request": 5785438,
> "op_active": 2444,
> "handle_client_request": 5785438,
> "op_active": 2239,
> "handle_client_request": 5785438,
> "op_active": 1648,
> "handle_client_request": 5785438,
> "op_active": 1121,
> "handle_client_request": 5785438,
> "op_active": 709,
> "handle_client_request": 5785438,
> "op_active": 235,
> "handle_client_request": 5785572,
> "op_active": 0,
>...
>
> Should I be concerned about these "op_active" values?  I see that in my
> narrow slice of output, "handle_client_request" does not increment.  What
> is happening there?
>
> thanks,
> Bob
>
> On Wed, Aug 5, 2015 at 11:43 PM, Bob Ababurko  wrote:
>
>> I found a way to get the stats you mentioned: 
>> mds_server.handle_client_request
>> & objecter.op_active.  I can see these values when I run:
>>
>> ceph daemon mds. perf dump
>>
>> I recently restarted the mds server so my stats reset but I still have
>> something to share:
>>
>> "mds_server.handle_client_request": 4406055
>> "objecter.op_active": 0
>>
>> Should I assume that op_active might be operations in writes or reads
>> that are queued?  I haven't been able to find anything describing what
>> these stats actually mean so if anyone knows where to find them, please
>> advise.
>>
>> On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko  wrote:
>>
>>> I have installed diamond(built by ksingh found at
>>> https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and
>>> I am not seeing the mds_server.handle_client_request OR objecter.op_active
>>> metrics being sent to graphite.  Mind you, this is not the graphite that is
>>> part of the calamari install but our own internal graphite cluster.
>>> Perhaps that is the reason?  I could not get calamari working correctly on
>>> hammerhead/centos7.1 so I put it on pause for now to concentrate on the
>>> cluster itself.
>>>
>>> Ultimately, I need to find a way to get a hold of these metrics to
>>> determine the health of my MDS so I can justify moving forward on a SSD
>>> based cephfs metadata pool.
>>>
>>> On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko  wrote:
>>>
>>>> Hi John,
>>>>
>>>> You are correct in that my expectations may be incongruent with what is
>>>> possible with ceph(fs).  I'm currently copying many small files(images)
>>>> from a netapp to the cluster...~35k sized files to be exact and the number
>>>> of objects/files copied thus far is fairly significant(below in bold):
>>>>
>>>> [bababurko@cephmon01 ceph]$ sudo rados df
>>>> pool name KB  objects   clones degraded
>>>>  unfound   rdrd KB   wrwr KB
>>>> cephfs_data   3289284749*163993660*00
>>>>   000328097038   3369847354
>>>> cephfs_metadata   133364   52436300
>>>>   0  3600023   5264453980 9564   1361554516
>>>> rbd0000
>>>>   00000
>>>>   total used  9297615196164518023
>>>>   total avail19990923044
>>>>   total space292885382

Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-06 Thread Bob Ababurko
I should have probably condensed my findings over the course of the day into
one post, but I guess that's just not how I'm built.

Another data point: I ran `ceph daemon mds.cephmds02 perf dump` in a
while loop w/ a 1 second sleep, grepping out the stats John mentioned, and
at times (~every 10-15 seconds) I see some large objecter.op_active
values.  After the high values hit, there are 5-10 seconds of zero values.

"handle_client_request": 5785438,
"op_active": 2375,
"handle_client_request": 5785438,
"op_active": 2444,
"handle_client_request": 5785438,
"op_active": 2239,
"handle_client_request": 5785438,
"op_active": 1648,
"handle_client_request": 5785438,
"op_active": 1121,
"handle_client_request": 5785438,
"op_active": 709,
"handle_client_request": 5785438,
"op_active": 235,
"handle_client_request": 5785572,
"op_active": 0,
   ...

Should I be concerned about these "op_active" values?  I see that in my
narrow slice of output, "handle_client_request" does not increment.  What
is happening there?

thanks,
Bob

On Wed, Aug 5, 2015 at 11:43 PM, Bob Ababurko  wrote:

> I found a way to get the stats you mentioned: mds_server.handle_client_request
> & objecter.op_active.  I can see these values when I run:
>
> ceph daemon mds. perf dump
>
> I recently restarted the mds server so my stats reset but I still have
> something to share:
>
> "mds_server.handle_client_request": 4406055
> "objecter.op_active": 0
>
> Should I assume that op_active might be operations in writes or reads that
> are queued?  I haven't been able to find anything describing what these
> stats actually mean so if anyone knows where to find them, please advise.
>
> On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko  wrote:
>
>> I have installed diamond(built by ksingh found at
>> https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and I
>> am not seeing the mds_server.handle_client_request OR objecter.op_active
>> metrics being sent to graphite.  Mind you, this is not the graphite that is
>> part of the calamari install but our own internal graphite cluster.
>> Perhaps that is the reason?  I could not get calamari working correctly on
>> hammerhead/centos7.1 so I put it on pause for now to concentrate on the
>> cluster itself.
>>
>> Ultimately, I need to find a way to get a hold of these metrics to
>> determine the health of my MDS so I can justify moving forward on a SSD
>> based cephfs metadata pool.
>>
>> On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko  wrote:
>>
>>> Hi John,
>>>
>>> You are correct in that my expectations may be incongruent with what is
>>> possible with ceph(fs).  I'm currently copying many small files(images)
>>> from a netapp to the cluster...~35k sized files to be exact and the number
>>> of objects/files copied thus far is fairly significant(below in bold):
>>>
>>> [bababurko@cephmon01 ceph]$ sudo rados df
>>> pool name KB  objects   clones degraded
>>>  unfound   rdrd KB   wrwr KB
>>> cephfs_data   3289284749*163993660*00
>>> 000328097038   3369847354
>>> cephfs_metadata   133364   52436300
>>>   0  3600023   5264453980 9564   1361554516
>>> rbd0000
>>>   00000
>>>   total used  9297615196164518023
>>>   total avail19990923044
>>>   total space29288538240
>>>
>>> Yes, that looks like ~164 million objects copied to the cluster.  I
>>> would assume this will potentially be a burden to the MDS but I have yet to
>>> confirm with the ceph daemontool mds..  I cannot seem to run it on the
>>> mds host as it doesn't seem to know about that command:
>>>
>>> [bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
>>> no valid command found; 10 closest matches:
>>> osd lost  {--yes-i-really-mean-it}
>>> osd create {}
>>> osd primary-temp  
>>> osd primary-affinity  
>>> osd reweight  
>>> osd pg-temp  { [...]}
>>> osd in  [...]
>>> osd rm  [...]
>>> osd down  [...]
>>> osd out  [...]
>>&g

Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-05 Thread Bob Ababurko
I found a way to get the stats you mentioned: mds_server.handle_client_request
& objecter.op_active.  I can see these values when I run:

ceph daemon mds. perf dump

I recently restarted the mds server so my stats reset but I still have
something to share:

"mds_server.handle_client_request": 4406055
"objecter.op_active": 0

Should I assume that op_active might be write or read operations that are
queued?  I haven't been able to find anything describing what these
stats actually mean, so if anyone knows where they are documented, please advise.
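
In case it helps, a hedged pointer: the admin socket can at least enumerate
the counters and their types, even if prose descriptions are thin on this
version, and objecter.op_active is (as far as I understand it) a gauge of
RADOS operations the MDS currently has in flight:

ceph daemon mds.cephmds02 perf schema | less
ceph daemon mds.cephmds02 perf dump | grep -A 15 '"objecter"'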

On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko  wrote:

> I have installed diamond(built by ksingh found at
> https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and I
> am not seeing the mds_server.handle_client_request OR objecter.op_active
> metrics being sent to graphite.  Mind you, this is not the graphite that is
> part of the calamari install but our own internal graphite cluster.
> Perhaps that is the reason?  I could not get calamari working correctly on
> hammerhead/centos7.1 so I put it on pause for now to concentrate on the
> cluster itself.
>
> Ultimately, I need to find a way to get a hold of these metrics to
> determine the health of my MDS so I can justify moving forward on a SSD
> based cephfs metadata pool.
>
> On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko  wrote:
>
>> Hi John,
>>
>> You are correct in that my expectations may be incongruent with what is
>> possible with ceph(fs).  I'm currently copying many small files(images)
>> from a netapp to the cluster...~35k sized files to be exact and the number
>> of objects/files copied thus far is fairly significant(below in bold):
>>
>> [bababurko@cephmon01 ceph]$ sudo rados df
>> pool name KB  objects   clones degraded
>>  unfound   rdrd KB   wrwr KB
>> cephfs_data   3289284749*163993660*00
>> 000328097038   3369847354
>> cephfs_metadata   133364   52436300
>> 0  3600023   5264453980 9564   1361554516
>> rbd0000
>> 00000
>>   total used  9297615196164518023
>>   total avail19990923044
>>   total space29288538240
>>
>> Yes, that looks like ~164 million objects copied to the cluster.  I would
>> assume this will potentially be a burden to the MDS but I have yet to
>> confirm with the ceph daemontool mds..  I cannot seem to run it on the
>> mds host as it doesn't seem to know about that command:
>>
>> [bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
>> no valid command found; 10 closest matches:
>> osd lost  {--yes-i-really-mean-it}
>> osd create {}
>> osd primary-temp  
>> osd primary-affinity  
>> osd reweight  
>> osd pg-temp  { [...]}
>> osd in  [...]
>> osd rm  [...]
>> osd down  [...]
>> osd out  [...]
>> Error EINVAL: invalid command
>>
>> This fails in a similar manner on all the hosts in the cluster.  I'm very
>> green w/ ceph and i'm probably missing something obvious.  Is there
>> something I need to install to get access to the 'ceph daemonperf' command
>> in hammerhead?
>>
>> thanks,
>> Bob
>>
>> On Wed, Aug 5, 2015 at 2:43 AM, John Spray  wrote:
>>
>>> On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko  wrote:
>>> > My writes are not going as I would expect wrt to IOPS(50-1000 IOPs) &
>>> write
>>> > throughput( ~25MB/s max).  I'm interested in understanding what it
>>> takes to
>>> > create a SSD pool that I can then migrate the current Cephfs_metadata
>>> pool
>>> > to.  I suspect that the spinning disk metadata pool is a bottleneck
>>> and I
>>> > want to try to get the max performance out of this cluster to prove
>>> that we
>>> > would build out a larger version.  One caveat is that I have copied
>>> about 4
>>> > TB of data to the cluster via cephfs and dont want to lose the data so
>>> I
>>> > obviously need to keep the metadata intact.
>>>
>>> I'm a bit suspicious of this: your IOPS expectations sort of imply
>>> doing big files, but you're then suggesting that metadata is the
>>> bottleneck (i.e. small file workload).
>>>
>>> There are lots of statistics that come out of the MDS, you may be
>>> particular interested in mds_server.

Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-05 Thread Bob Ababurko
I have installed diamond(built by ksingh found at
https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and I am
not seeing the mds_server.handle_client_request OR objecter.op_active
metrics being sent to graphite.  Mind you, this is not the graphite that is
part of the calamari install but our own internal graphite cluster.
Perhaps that is the reason?  I could not get calamari working correctly on
hammerhead/centos7.1 so I put it on pause for now to concentrate on the
cluster itself.

Ultimately, I need to find a way to get a hold of these metrics to
determine the health of my MDS so I can justify moving forward on a SSD
based cephfs metadata pool.

On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko  wrote:

> Hi John,
>
> You are correct in that my expectations may be incongruent with what is
> possible with ceph(fs).  I'm currently copying many small files(images)
> from a netapp to the cluster...~35k sized files to be exact and the number
> of objects/files copied thus far is fairly significant(below in bold):
>
> [bababurko@cephmon01 ceph]$ sudo rados df
> pool name KB  objects   clones degraded
>  unfound   rdrd KB   wrwr KB
> cephfs_data   3289284749*163993660*00
>   000328097038   3369847354
> cephfs_metadata   133364   52436300
> 0  3600023   5264453980 9564   1361554516
> rbd0000
> 00000
>   total used  9297615196164518023
>   total avail19990923044
>   total space29288538240
>
> Yes, that looks like ~164 million objects copied to the cluster.  I would
> assume this will potentially be a burden to the MDS but I have yet to
> confirm with the ceph daemontool mds..  I cannot seem to run it on the
> mds host as it doesn't seem to know about that command:
>
> [bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
> no valid command found; 10 closest matches:
> osd lost  {--yes-i-really-mean-it}
> osd create {}
> osd primary-temp  
> osd primary-affinity  
> osd reweight  
> osd pg-temp  { [...]}
> osd in  [...]
> osd rm  [...]
> osd down  [...]
> osd out  [...]
> Error EINVAL: invalid command
>
> This fails in a similar manner on all the hosts in the cluster.  I'm very
> green w/ ceph and i'm probably missing something obvious.  Is there
> something I need to install to get access to the 'ceph daemonperf' command
> in hammerhead?
>
> thanks,
> Bob
>
> On Wed, Aug 5, 2015 at 2:43 AM, John Spray  wrote:
>
>> On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko  wrote:
>> > My writes are not going as I would expect wrt to IOPS(50-1000 IOPs) &
>> write
>> > throughput( ~25MB/s max).  I'm interested in understanding what it
>> takes to
>> > create a SSD pool that I can then migrate the current Cephfs_metadata
>> pool
>> > to.  I suspect that the spinning disk metadata pool is a bottleneck and
>> I
>> > want to try to get the max performance out of this cluster to prove
>> that we
>> > would build out a larger version.  One caveat is that I have copied
>> about 4
>> > TB of data to the cluster via cephfs and dont want to lose the data so I
>> > obviously need to keep the metadata intact.
>>
>> I'm a bit suspicious of this: your IOPS expectations sort of imply
>> doing big files, but you're then suggesting that metadata is the
>> bottleneck (i.e. small file workload).
>>
>> There are lots of statistics that come out of the MDS; you may be
>> particularly interested in mds_server.handle_client_request and
>> objecter.op_active, to work out if there really are lots of RADOS
>> operations getting backed up on the MDS (which would be the symptom of
>> a too-slow metadata pool).  "ceph daemonperf mds." may be some
>> help if you don't already have graphite or similar set up.
>>
>> > If anyone has done this OR understands how this can be done, I would
>> > appreciate the advice.
>>
>> You could potentially do this in a two-phase process where you
>> initially set a crush rule that includes both SSDs and spinners, and
>> then finally set a crush rule that just points to SSDs.  Obviously
>> that'll do lots of data movement, but your metadata is probably a fair
>> bit smaller than your data so that might be acceptable.
>>
>> John
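
A hedged sketch of the end state John describes, using Hammer-era syntax
(bucket and rule names are examples, and it assumes an "ssd" root containing
the SSD OSDs already exists in the CRUSH map):

ceph osd crush rule create-simple ssd-rule ssd host
ceph osd crush rule dump ssd-rule                      # note its "ruleset" id
ceph osd pool set cephfs_metadata crush_ruleset <id>   # pool contents stay intact, the data just moves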
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-05 Thread Bob Ababurko
Hi John,

You are correct in that my expectations may be incongruent with what is
possible with ceph(fs).  I'm currently copying many small files (images)
from a netapp to the cluster...~35k sized files to be exact, and the number
of objects/files copied thus far is fairly significant (below in bold):

[bababurko@cephmon01 ceph]$ sudo rados df
pool name KB  objects   clones degraded
 unfound   rdrd KB   wrwr KB
cephfs_data   3289284749*163993660*00
000328097038   3369847354
cephfs_metadata   133364   52436300
  0  3600023   5264453980 9564   1361554516
rbd0000
  00000
  total used  9297615196164518023
  total avail19990923044
  total space29288538240

Yes, that looks like ~164 million objects copied to the cluster.  I would
assume this will potentially be a burden to the MDS but I have yet to
confirm with the ceph daemontool mds..  I cannot seem to run it on the
mds host as it doesn't seem to know about that command:

[bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
no valid command found; 10 closest matches:
osd lost  {--yes-i-really-mean-it}
osd create {}
osd primary-temp  
osd primary-affinity  
osd reweight  
osd pg-temp  { [...]}
osd in  [...]
osd rm  [...]
osd down  [...]
osd out  [...]
Error EINVAL: invalid command

This fails in a similar manner on all the hosts in the cluster.  I'm very
green w/ ceph and I'm probably missing something obvious.  Is there
something I need to install to get access to the 'ceph daemonperf' command
in hammerhead?
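
One hedged workaround needing only the admin socket on the MDS host (the
socket path below is the stock default; adjust if yours differs):

watch -n1 "ceph daemon mds.cephmds01 perf dump | egrep 'handle_client_request|op_active'"

# or hit the socket directly:
ceph --admin-daemon /var/run/ceph/ceph-mds.cephmds01.asok perf dump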

thanks,
Bob

On Wed, Aug 5, 2015 at 2:43 AM, John Spray  wrote:

> On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko  wrote:
> > My writes are not going as I would expect wrt to IOPS(50-1000 IOPs) &
> write
> > throughput( ~25MB/s max).  I'm interested in understanding what it takes
> to
> > create a SSD pool that I can then migrate the current Cephfs_metadata
> pool
> > to.  I suspect that the spinning disk metadata pool is a bottleneck and I
> > want to try to get the max performance out of this cluster to prove that
> we
> > would build out a larger version.  One caveat is that I have copied
> about 4
> > TB of data to the cluster via cephfs and dont want to lose the data so I
> > obviously need to keep the metadata intact.
>
> I'm a bit suspicious of this: your IOPS expectations sort of imply
> doing big files, but you're then suggesting that metadata is the
> bottleneck (i.e. small file workload).
>
> There are lots of statistics that come out of the MDS; you may be
> particularly interested in mds_server.handle_client_request and
> objecter.op_active, to work out if there really are lots of RADOS
> operations getting backed up on the MDS (which would be the symptom of
> a too-slow metadata pool).  "ceph daemonperf mds." may be some
> help if you don't already have graphite or similar set up.
>
> > If anyone has done this OR understands how this can be done, I would
> > appreciate the advice.
>
> You could potentially do this in a two-phase process where you
> initially set a crush rule that includes both SSDs and spinners, and
> then finally set a crush rule that just points to SSDs.  Obviously
> that'll do lots of data movement, but your metadata is probably a fair
> bit smaller than your data so that might be acceptable.
>
> John
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-04 Thread Bob Ababurko
I will dig into the network and determine if we have any issues.  One thing
to note is that our MTU is 1500 and will not be changed for this test...simply
put, I am not going to be able to get these changes implemented in our
current network.  I don't expect a huge increase in performance from moving
to jumbo frames; I suspect it's not necessarily worth it for a POC, and it's
not the reason my cluster performance is sucking so badly at this particular
moment.

One other thing I wanted to get clarity on was your rbd perf (dd) tests.  I
was under the impression that rbd devices are striped across all of the
OSDs, whereas when writing via objects and files, the object would be
getting written to a single disk.  If my understanding is correct, a dd would
yield significantly better results (throughput/IOPS) for an rbd vs. a file or
object.  Please let me know if I am missing something.
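
For what it's worth, cephfs files are also striped across RADOS objects (4 MB
by default) that CRUSH spreads over the OSDs, so a single large file write
isn't confined to one disk.  A quick, hedged way to separate raw RADOS
throughput from the cephfs/MDS path is a short rados bench (pool name from
this cluster; the write bench removes its own objects at the end by default):

rados bench -p cephfs_data 60 write -t 16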

thank you.


On Tue, Aug 4, 2015 at 2:53 PM, Shane Gibson 
wrote:

>
> Bob,
>
> Those numbers would seem to indicate some other problem.  One of the
> biggest culprits of poor performance like that is the network.  In the last
> few months, there have been several reported performance issues that have
> turned out to be network-related.  Not all, but most.
> Your best bet is to check each host's interface statistics for errors.
> Make sure you have a match on the MTU size (jumbo frames settings on the
> host and on your switches).  Check your switches for network errors.  Try
> extended-size ping checks between nodes; ensure you set the packet size
> close to your max MTU size and check that you're getting good performance
> from *all nodes* to every other node.  Last, try a network performance test
> to each of the OSD nodes and see if one of them is acting up.
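
A hedged sketch of those checks (hostnames and the NIC name are placeholders,
and iperf3 is assumed to be installed; with a 1500 MTU the largest
unfragmented ICMP payload is 1472 bytes):

ping -M do -s 1472 -c 20 cephosd02     # DF set, payload sized to the MTU
ip -s link show dev eth0               # per-interface error/drop counters
iperf3 -s                              # on one node
iperf3 -c cephosd02 -t 30              # from each of the others in turn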
>
> If you are backing your journal on SSD - you DEFINITELY should be getting
> vastly better performance than that.  I have a cluster with 6 OSD nodes w/
> 10x 4TB OSDs - using 2 7200 rpm disks as the journal (12 disks total).  NO
> SSDs in that configuration.  I can push the cluster to about 650 MByte/sec
> via network RBD 'dd' tests, and get about 2500 IOPS.  NOTE - this is an all
> spinning HDD cluster w/ 7200 rpm disks!
>
> ~~shane
>
> On 8/4/15, 2:36 PM, "ceph-users on behalf of Bob Ababurko" <
> ceph-users-boun...@lists.ceph.com on behalf of b...@ababurko.net> wrote:
>
> I have my first ceph cluster up and running and am currently testing
> cephfs for file access.  It turns out, I am not getting excellent write
> performance on my cluster via cephfs(kernel driver) and would like to try
> to explore moving my cephfs_metadata pool to SSD.
>
> To quickly describe the cluster:
>
> all nodes run Centos 7.1 w/ ceph-0.94.1(hammerhead)
> [bababurko@cephosd01 ~]$ uname -r
> 3.10.0-229.el7.x86_64
> [bababurko@cephosd01 ~]$ cat /etc/redhat-release
> CentOS Linux release 7.1.1503 (Core)
>
> 6 OSD nodes w/ 5 x 1TB(7200 rpm/dont have model handy) sata & 1 TB SSD(850
> pro) which includes a journal(5GB) for each of the 5 OSD's, so there is
> much space on the SSD left to create a partition for a SSD pool...at least
> 900GB per SSD.  Also noteworthy is that these disks are behind a raid
> controller(LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2)
> with each disk configured as raid 0.
> 3 MON nodes
> 1 MDS node
>
> My writes are not going as I would expect wrt to IOPS(50-1000 IOPs) &
> write throughput( ~25MB/s max).  I'm interested in understanding what it
> takes to create a SSD pool that I can then migrate the current
> Cephfs_metadata pool to.  I suspect that the spinning disk metadata pool is
> a bottleneck and I want to try to get the max performance out of this
> cluster to prove that we would build out a larger version.  One caveat is
> that I have copied about 4 TB of data to the cluster via cephfs and dont
> want to lose the data so I obviously need to keep the metadata intact.
>
> If anyone has done this OR understands how this can be done, I would
> appreciate the advice.
>
> thanks in advance,
> Bob
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] migrating cephfs metadata pool from spinning disk to SSD.

2015-08-04 Thread Bob Ababurko
I have my first ceph cluster up and running and am currently testing cephfs
for file access.  It turns out I am not getting great write
performance on my cluster via cephfs (kernel driver), and I would like to
explore moving my cephfs_metadata pool to SSD.

To quickly describe the cluster:

all nodes run Centos 7.1 w/ ceph-0.94.1(hammerhead)
[bababurko@cephosd01 ~]$ uname -r
3.10.0-229.el7.x86_64
[bababurko@cephosd01 ~]$ cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)

6 OSD nodes w/ 5 x 1TB (7200 rpm, don't have the model handy) SATA & 1 TB SSD
(850 Pro) which includes a journal (5GB) for each of the 5 OSDs, so there is
much space left on the SSD to create a partition for an SSD pool...at least
900GB per SSD.  Also noteworthy is that these disks are behind a RAID
controller (LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2)
with each disk configured as RAID 0.
3 MON nodes
1 MDS node

My writes are not going as I would expect wrt IOPS (50-1000 IOPS) & write
throughput (~25MB/s max).  I'm interested in understanding what it takes to
create an SSD pool that I can then migrate the current cephfs_metadata pool
to.  I suspect that the spinning-disk metadata pool is a bottleneck, and I
want to try to get the max performance out of this cluster to prove that we
should build out a larger version.  One caveat is that I have copied about 4
TB of data to the cluster via cephfs and don't want to lose the data, so I
obviously need to keep the metadata intact.

If anyone has done this OR understands how this can be done, I would
appreciate the advice.

thanks in advance,
Bob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com