Re: [ceph-users] un-even data filled on OSDs

2016-06-10 Thread Max A. Krasilnikov
Hello! 

On Fri, Jun 10, 2016 at 07:38:10AM +0530, swamireddy wrote:

> Blair - Thanks for the details. I used to set the low priority for
> recovery during the rebalance/recovery activity.
> Even though I set the recovery_priority as 5 (instead of 1) and
> client-op_priority set as 63, some of my customers complained that
> their VMs are not reachable for a few mins/secs during the reblancing
> task. Not sure, these low priority configurations are doing the job as
> its.

This holds true at least up to Hammer. I have no way to test it on a Jewel
setup because of company policy (part of my cluster is already on Jewel, but I
cannot continue the upgrade due to a direct directive).
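
For reference, a sketch of how such priorities can be injected at runtime (the
values are only illustrative, not a recommendation for your workload):

ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'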

> Thanks
> Swami

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD randwrite performance

2016-05-25 Thread Max A. Krasilnikov
Hello! 

On Wed, May 25, 2016 at 11:45:29AM +0900, chibi wrote:


> Hello,

> On Tue, 24 May 2016 21:20:49 +0300 Max A. Krasilnikov wrote:

>> Hello!
>> 
>> I have cluster with 5 SSD drives as OSD backed by SSD journals, one per
>> osd. One osd per node.
>> 
> More details will help identify other potential bottlenecks, such as:
> CPU/RAM
> Kernel, OS version.

For now I have 3 nodes, each running an OpenStack controller + ceph mon + 8 OSDs
(one of them on SSD). Everything runs Ubuntu 14.04 + Hammer from ubuntu-cloud,
and I am now moving to Ubuntu 14.04 + Ceph Jewel from the Ceph site.
E5-2620 v2 (12 cores)
32 GB RAM
Linux 4.2.0, moving to 4.4 from Xenial.

>> Data drives is Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G,
>> journal partition is 24GB, data partition is 790GB. OSD nodes connected
>> by 2x10Gbps linux bonding for data/cluster network.
>>
> As Oliver wrote, these SSDs are totally unsuited for usage with Ceph,
> especially regarding to journals. 
> But also in general, since they're neither handling IOPS in a consistent,
> predictable manner.
> And they're not durable (endurance, TBW) enough either.

Yep, I understand. But on the second cluster, with ScaleIO, the same drives do much better :(

> When using SSDs or NVMes, use DC level ones exclusively, Intel is the more
> tested one in these parts, but the Samsung DC level ones ought to be fine,
> too.

I hope my employer will provide me with those, but for now I have to do the
best I can with the current hardware :(

>  
>> When doing random write with 4k blocks with direct=1, buffered=0,
>> iodepth=32..1024, ioengine=libaio from nova qemu virthost I can get no
>> more than 9kiops. Randread is about 13-15 kiops.
>> 
>> Trouble is that randwrite not depends on iodepth. read, write can be up
>> to 140kiops, randread up to 15 kiops. randwrite is always 2-9 kiops.
>> 
> Aside from the limitations of your SSDs, there are other factors, like CPU
> utilization.
> And very importantly also network latency, but that's for single threaded
> IOPS mostly.

>> Ceph cluster is mixed of jewel and hammer, upgrading now to jewel. On
>> Hammer I got the same results.
>> 
> Mixed is a very bad state for a cluster to be in.

> Jewel has lots of improvements in that area, but w/o decent hardware you
> may not see them.

My cluster is being upgraded right now, 2 OSDs per night :), one node per week,
replacing the old 850 EVOs with new ones along the way.

>> All journals can do up to 32kiops with the same config for fio.
>> 
>> I am confused because EMC ScaleIO can do much more iops what is boring
>> my boss :)
>> 
> There are lot of discussion and slides on how to improve/maximize IOPS
> with Ceph, go search for them.

> Fast CPUs, jmalloc, pinning, configuration, NVMes for journals, etc.

I have seen a lot of them. I will try CPU pinning; I have never used it before.
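
For the pinning part, a rough, untested sketch of what I plan to try (the core
list and the way of picking the OSD pid are just examples):

# check the current CPU affinity of a running OSD daemon
taskset -p <pid-of-ceph-osd>
# pin it to a fixed set of cores
taskset -cp 0-5 <pid-of-ceph-osd>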

> Christian
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD randwrite performance

2016-05-24 Thread Max A. Krasilnikov
Hello!

I have a cluster with 5 SSD drives as OSDs, backed by SSD journals, one journal per
OSD and one OSD per node.

Data drives are Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G; the journal
partition is 24GB and the data partition is 790GB. OSD nodes are connected by 2x10Gbps
Linux bonding for the data/cluster network.

When doing random writes with 4k blocks (direct=1, buffered=0, iodepth=32..1024,
ioengine=libaio) from a Nova qemu virtualization host I get no more than 9 kiops.
Random read is about 13-15 kiops.

The trouble is that randwrite does not depend on iodepth. Sequential read and write
can reach up to 140 kiops and randread up to 15 kiops, but randwrite is always 2-9 kiops.
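
For completeness, a minimal fio job file along the lines of the test described
above (the target device /dev/vdb inside the guest and the runtime are
assumptions; the iodepth is just one value from the range I tested, and writing
to a raw device destroys its data):

[global]
ioengine=libaio
direct=1
buffered=0
bs=4k
iodepth=64
runtime=120
time_based

[randwrite]
rw=randwrite
filename=/dev/vdb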

The Ceph cluster is a mix of Jewel and Hammer and is being upgraded to Jewel now. On
pure Hammer I got the same results.

All journals can do up to 32kiops with the same config for fio.

I am confused because EMC ScaleIO can do many more iops, which is bothering my boss
:)

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] using jemalloc in trusty

2016-05-24 Thread Max A. Krasilnikov
Hello! 

On Mon, May 23, 2016 at 02:34:37PM +, Somnath.Roy wrote:

> You need to build ceph code base to use jemalloc for OSDs..LD_PRELOAD won't 
> work..

Is that true for Xenial too, or only for Trusty? I would rather not rebuild Jewel on
the Xenial hosts...

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel ubuntu release is half cooked

2016-05-23 Thread Max A. Krasilnikov
Hello! 

On Mon, May 23, 2016 at 11:26:38AM +0100, andrei wrote:

> 1. Ceph journals - After performing the upgrade the ceph-osd processes are 
> not starting. I've followed the instructions and chowned /var/lib/ceph (also 
> see point 2 below). The issue relates to the journal partitions, which are 
> not chowned due to the symlinks. Thus, the ceph user had no read/write access 
> to the journal partitions. IMHO, this should be addressed at the 
> documentation layer unless it can be easily and reliably dealt with by the 
> installation script. 

I ran into the same problem and had to chown the journal partitions. I also run
14.04, upgrading Ceph from Hammer (ubuntu-cloud archive) to Jewel (Ceph site).
On another node, running Ubuntu 16.04 with Ceph from the Ubuntu repo, no such
issues were seen.

I prefer to upgrade Ceph first and upgrade the systems and OpenStack services
later.
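
For reference, roughly what I had to run on the 14.04 nodes (OSD id 13 is just
an example from my own layout):

chown -R ceph:ceph /var/lib/ceph
# the journal symlink points at a device node, so chown its target as well
chown ceph:ceph $(readlink -f /var/lib/ceph/osd/ceph-13/journal)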

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.7 Hammer released

2016-05-17 Thread Max A. Krasilnikov
Hello! 

On Tue, May 17, 2016 at 10:04:41AM +0200, dan wrote:

> Hi Sage et al,

> I'm updating our pre-prod cluster from 0.94.6 to 0.94.7 and after
> upgrading the ceph-mon's I'm getting loads of warnings like:

> 2016-05-17 10:01:29.314785 osd.76 [WRN] failed to encode map e103116
> with expected crc

> I've seen that error is whitelisted in the qa-suite:
> https://github.com/ceph/ceph-qa-suite/pull/602/files

> Is it really harmless? (This is the first time I've seen such a warning).

I see the same warning when running some Jewel OSDs in a Hammer cluster (I am doing a
step-by-step, per-node upgrade). No problems so far, just the warning in the logs.

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can I attach a volume to 2 servers

2016-05-02 Thread Max A. Krasilnikov
Hello!

On Mon, May 02, 2016 at 11:25:11AM -0400, forsaks.30 wrote:

> Hi Edward

> thanks for your explanation!

> Yes you are right.

> I just came across sebastien han's post, using nfs on top of rbd(
> http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/)

> I will try this method.

Why not use CephFS? I prefer it, since I already have a Ceph cluster for RBD anyway.
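
A minimal sketch of mounting CephFS with the kernel client, assuming an MDS and
the filesystem already exist (the monitor address and credentials are
placeholders):

mount -t ceph 10.0.66.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret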

> On Mon, May 2, 2016 at 11:14 AM, Edward Huyer <erh...@rit.edu> wrote:

>> Mapping a single RBD on multiple servers isn’t going to do what you want
>> unless you’re putting some kind of clustered filesystem on it.  Exporting
>> the filesystem via an NFS server will generally be simpler.

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replace Journal

2016-04-22 Thread Max A. Krasilnikov
Hello!

On Fri, Apr 22, 2016 at 09:30:15AM +0200, martin.wilderoth wrote:

> >> I have a ceph cluster and I will change my journal devices to new SSD's.
> >>
> >> In some instructions of doing this they refer to a journal file (link to
> >> UUID of journal )
> >>
> >> In my OSD folder this journal don’t exist.
> >>
>> If your cluster is "years old" and not created with ceph-disk, then yes,
>> that's not surprising.
>> Mind, I created a recent one of mine manually and still used that scheme:
>> --- ls -la /var/lib/ceph/osd/ceph-12/
>> total 80
>> drwxr-xr-x   4 root root  4096 Mar  1 14:44 .
>> drwxr-xr-x   8 root root  4096 Sep 10  2015 ..
>> -rw-r--r--   1 root root37 Sep 10  2015 ceph_fsid
>> drwxr-xr-x 320 root root 24576 Mar  2 20:24 current
>> -rw-r--r--   1 root root37 Sep 10  2015 fsid
>> lrwxrwxrwx   1 root root44 Sep 10  2015 journal ->
>> /dev/disk/by-id/wwn-0x55cd2e404b77573c-part5
>> -rw---   1 root root57 Sep 10  2015 keyring
>> ---
>>
>> Ceph isn't magical, so if that link isn't there, you probably have
>> something like this in your ceph.conf, preferably with UUID instead of thet
>> possibly changing device name:
>> ---
>> [osd.0]
>> host = ceph-01
>> osd journal = /dev/sdc3
>> ---


> Yes that is my setup, Would that mean i could either create symlink journal
> -> /dev/disk/..
> remove the osd journal in ceph.conf.

> or change my ceph.conf with osd journal = /dev/

> And the recommended way is actually to use journal symlink ?

I'm using symlinks to /dev/disk/by-partlabel/.
That saves me from any trouble when replacing journal SSDs.
The same goes for mounts: I use LABEL= in fstab because device names can change when
replacing hardware in storage nodes.

Of course, one can set up proper udev rules instead, but I'm too lazy for that
exercise :)
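
For illustration, how the symlink and the fstab entry look in my setup (the OSD
id, labels and the ext4 options are just my example):

ln -s /dev/disk/by-partlabel/j-13 /var/lib/ceph/osd/ceph-13/journal
# /etc/fstab
LABEL=osd-13  /var/lib/ceph/osd/ceph-13  ext4  noatime  0  0

If the journal moves to a new device, something like "ceph-osd -i 13
--flush-journal" on the old one and "ceph-osd -i 13 --mkjournal" on the new one
is needed before starting the OSD again (check the docs for your release).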

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Max A. Krasilnikov
Hello!

On Tue, Apr 12, 2016 at 07:48:58AM +, Maxime.Guyot wrote:

> Hi Adrian,

> Looking at the documentation RadosGW has multi region support with the 
> “federated gateways” 
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical locales, 
> configuring Ceph Object Gateway regions and metadata synchronization agents 
> enables the service to maintain a global namespace, even though Ceph Object 
> Gateway instances run in different geographic locales and potentially on 
> different Ceph Storage Clusters.”

> Maybe that could do the trick for your multi metro EC pools?

> Disclaimer: I haven't tested the federated gateways RadosGW.

As far as I can see in the docs, Jewel should be able to perform per-image async mirroring:

There is new support for mirroring (asynchronous replication) of RBD images
across clusters. This is implemented as a per-RBD image journal that can be
streamed across a WAN to another site, and a new rbd-mirror daemon that performs
the cross-cluster replication.

© http://docs.ceph.com/docs/master/release-notes/

I will test it in a month or two, later this year :)
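
If I read the Jewel documentation correctly, the setup looks roughly like this
(untested on my side, the names are placeholders, and the rbd-mirror daemon has
to run on the peer cluster):

rbd mirror pool enable <pool> image
rbd feature enable <pool>/<image> exclusive-lock journaling
rbd mirror image enable <pool>/<image>
rbd mirror pool peer add <pool> client.<user>@<remote-cluster>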

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Max A. Krasilnikov
Hello!

On Mon, Apr 11, 2016 at 05:39:37PM -0400, sage wrote:

> Hi,

> ext4 has never been recommended, but we did test it.  After Jewel is out, 
> we would like explicitly recommend *against* ext4 and stop testing it.

1. Does filestore_xattr_use_omap fix the issues with ext4? That is, can I keep using
ext4 for a cluster with RBD and CephFS with this option set to true?
2. I agree with Christian: it would be better to warn against, rather than drop,
support for the legacy fs until the old hardware is out of service, i.e. 4-5 years.
3. Also, if BlueStore turns out to be as good as promised, people will prefer it over
FileStore anyway, so deprecating the fs would be less painful.

I'm not that big a Ceph user, but I have limitations similar to Christian's, and
changing the fs would cost me 24 nights right now :(
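
For reference, the option from point 1 as I would set it (a ceph.conf sketch
only; whether it actually addresses the ext4 limits is exactly my question):

[osd]
filestore xattr use omap = true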

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] recorded data digest != on disk

2016-03-22 Thread Max A. Krasilnikov
Hello!

I have a 3-node cluster running ceph version 0.94.6
(e832001feaf8c176593e0325c8298e3f16dfb403)
on Ubuntu 14.04. When scrubbing I get this error:

-9> 2016-03-21 17:36:09.047029 7f253a4f6700  5 -- op tracker -- seq: 48045, 
time: 2016-03-21 17:36:09.046984, event: all_read, op: osd_sub_op(unknown.0.0:0 
5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-8> 2016-03-21 17:36:09.047035 7f253a4f6700  5 -- op tracker -- seq: 48045, 
time: 0.00, event: dispatched, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 
[scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-7> 2016-03-21 17:36:09.047066 7f254411b700  5 -- op tracker -- seq: 48045, 
time: 2016-03-21 17:36:09.047066, event: reached_pg, op: 
osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] 
snapc=0=[])
-6> 2016-03-21 17:36:09.047086 7f254411b700  5 -- op tracker -- seq: 48045, 
time: 2016-03-21 17:36:09.047086, event: started, op: osd_sub_op(unknown.0.0:0 
5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-5> 2016-03-21 17:36:09.047127 7f254411b700  5 -- op tracker -- seq: 48045, 
time: 2016-03-21 17:36:09.047127, event: done, op: osd_sub_op(unknown.0.0:0 
5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-4> 2016-03-21 17:36:09.047173 7f253f912700  2 osd.13 pg_epoch: 23286 
pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 
ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 
crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 
active+clean+scrubbing+deep+repair] scrub_compare_maps   osd.13 has 10 items
-3> 2016-03-21 17:36:09.047377 7f253f912700  2 osd.13 pg_epoch: 23286 
pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 
ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 
crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 
active+clean+scrubbing+deep+repair] scrub_compare_maps replica 21 has 10 items
-2> 2016-03-21 17:36:09.047983 7f253f912700  2 osd.13 pg_epoch: 23286 
pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 
ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 
crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 
active+clean+scrubbing+deep+repair] 5.ca recorded data digest 0xb284fef9 != on 
disk 0x43d61c5d on 6134ccca/rb
d_data.86280c78aaf7da.000e0bb5/17//5

-1> 2016-03-21 17:36:09.048201 7f253f912700 -1 log_channel(cluster) log 
[ERR] : 5.ca recorded data digest 0xb284fef9 != on disk 0x43d61c5d on 
6134ccca/rbd_data.86280c78aaf7da.000e0bb5/17//5
 0> 2016-03-21 17:36:09.050672 7f253f912700 -1 osd/osd_types.cc: In 
function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 
7f253f912700 time 2016-03-21 17:36:09.048341
osd/osd_types.cc: 4103: FAILED assert(clone_size.count(clone))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0x5606c23633db]
 2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x5606c1fd4666]
 3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair, std::less, std::allocator<std::pair > > > const&)+0xa1c) 
[0x5606c20b3c6c]
 4: (PG::scrub_compare_maps()+0xec9) [0x5606c2020d49]
 5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x5606c20264be]
 6: (PG::scrub(ThreadPool::TPHandle&)+0x1f4) [0x5606c2027d44]
 7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x5606c1f0c379]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x5606c2353fc6]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x5606c2355070]
 10: (()+0x8182) [0x7f256168e182]
 11: (clone()+0x6d) [0x7f255fbf947d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

Is there any way to recalculate the data digest?
I removed the OSD holding the failing PG; the data was recovered, but the error now
occurs on another OSD. I suspect I no longer have a consistent copy of the data.
What can I do to recover?

The pool size is 2 (not great, I know, but I have no way to increase it within the
next 2 months).
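
For reference, the obvious commands I am looking at (pg 5.ca is from the log
above; as far as I understand, with pool size 2 a repair simply copies from the
other replica, which may itself be wrong):

ceph health detail | grep -i inconsistent
ceph pg deep-scrub 5.ca
ceph pg repair 5.ca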

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests with rbd

2016-03-04 Thread Max A. Krasilnikov
Hello!

On Fri, Mar 04, 2016 at 01:33:24PM +0100, honza801 wrote:

> hi,

> i have rbd0 mapped to client, xfs formatted. i'm putting a lot of data on it.
> following messages appear in logs and 'ceph -s' output

> osd.255 [WRN] 1 slow requests, 1 included below; oldest blocked for >
> 51.726881 secs
> osd.255 [WRN] slow request 51.726881 seconds old, received at
> 2016-03-04 12:22:23.549737: osd_op(client.14296.1:389333
> rbd_data.37d230c8153.000d1cc8 [set-alloc-hint object_size
> 4194304 write_size 4194304,writefull 0~4194304] 2.fc8c5908
> ondisk+write e7523) currently waiting for subops from 120,239

> it causes slow downs on writes. iostat, load, dmesg on osds shows nothing odd.

> could anyone give me a hint?

I spent a lot of time on this kind of trouble, caused by "overtuning" the Linux
TCP/IP stack via sysctl. If your disks are not overloaded and your network is not
overloaded, take a look at the network configuration, including sysctl.

BTW, the default sysctl settings are quite good :) Things can be made better, but the
defaults are stable enough.
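
For a quick check, the things I would look at first (the interface name is an
example):

sysctl net.ipv4.tcp_moderate_rcvbuf net.core.netdev_max_backlog
ethtool -S eth0 | grep -iE 'drop|disc|err'
netstat -s | grep -iE 'retrans|prune|collaps'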

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restore properties to default?

2016-03-03 Thread Max A. Krasilnikov
Hello!

On Thu, Mar 03, 2016 at 09:53:22AM +1000, lindsay.mathieson wrote:

> Ok, reduced my recovery I/O with

> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'


> Now I can put it back to the default values explicity (10, 15), but is 
> there a way to tell ceph to just restore the default args?

As an option:
ceph --show-config -c /dev/null |grep osd_max_backfills
...
ceph tell osd.* injectargs '--osd_max_backfills='
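
For example, if --show-config reports the stock defaults you mention (10 and
15), putting them back would be:

ceph tell osd.* injectargs '--osd_max_backfills=10'
ceph tell osd.* injectargs '--osd_recovery_max_active=15'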

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer OSD crash during deep scrub

2016-02-18 Thread Max A. Krasilnikov
Hello!

On Wed, Feb 17, 2016 at 11:14:09AM +0200, pseudo wrote:

> Hello!

> Now I'm going to check OSD filesystem. But I have neither strange logs in 
> syslog, nor SMART reports about this drive.

The filesystem check did not find any problems. Removing the OSD and scrubbing the
problematic PG on another pair of OSDs results in a crash of the new primary OSD.

It looks like I have bad PG data, but it is a design flaw if an OSD crashes because
of inconsistent data on input...
I have no idea how to find the problematic object in the PG. If I could find it, I
would repair it by hand.

Any ideas?

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not umount ceph osd partition

2016-02-04 Thread Max A. Krasilnikov
Hello!

On Thu, Feb 04, 2016 at 11:10:06AM +0100, yoann.moulin wrote:

> Hello,

>>>> I am using 0.94.5. When I try to umount partition and fsck it I have issue:
>>>> root@storage003:~# stop ceph-osd id=13
>>>> ceph-osd stop/waiting
>>>> root@storage003:~# umount /var/lib/ceph/osd/ceph-13
>>>> root@storage003:~# fsck -yf /dev/sdf
>>>> fsck from util-linux 2.20.1
>>>> e2fsck 1.42.9 (4-Feb-2014)
>>>> /dev/sdf is in use.
>>>> e2fsck: Cannot continue, aborting.
>>>>
>>>> There is no /var/lib/ceph/osd/ceph-13 in /proc mounts. But no ability to 
>>>> check
>>>> fs.
>>>> I can mount -o remount,rw, but I would like to umount device for 
>>>> maintenance
>>>> and, maybe, replace it.
>>>>
>>>> Why I can't umount?
>> 
>>> is "lsof -n | grep /dev/sdf" give something ?
>> 
>> Nothing.
>> 
>>> and are you sure /dev/sdf is the disk for osd 13 ?
>> 
>> Absolutelly. I have even tried fsck -yf /dev/disk/by-label/osd-13. No luck.
>> 
>> Disk is mounted using LABEL in fstab, journal is symlink to
>> /dev/disk/by-partlabel/j-13.

> I think it's more linux related.

Maybe. But I only see it on the Ceph boxes :(

> could you try to look with lsof if something hold the device by the
> label or uuid instead of /dev/sdf ?

> you can try to delete the device from the scsi bus with something like :

> echo 1 > /sys/block/<device>/delete

> be careful, it is like removing the disk physically, if a process holds
> the device, you might expect that process gonna switch into kernel
> status "D+" . You won't be able to kill that process even by kill -9. To
> stop it, you will have to reboot the server.

> you can give a look here how to manipulate scsi bus:

> http://fibrevillage.com/storage/279-hot-add-remove-rescan-of-scsi-devices-on-linux

> you can install the package "scsitools" that provide rescan-scsi-bus.sh
> to rescan you scsi bus to get back your disk removed.

> http://manpages.ubuntu.com/manpages/precise/man8/rescan-scsi-bus.8.html

> hope that can help you

Thanks a lot! I will try partx -u (it has sometimes helped me in the past to re-read
partitions from a disk when gdisk was not able to update the kernel's partition list)
and software removal/re-insertion of the drive.
If some process falls into uninterruptible sleep, I will reboot the node. It will be
rebooted anyway if this does not help.
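
For reference, the exact commands I intend to try, in this order (/dev/sdf is
the disk from this thread; the sysfs delete really does drop the device, so use
it with care):

partx -u /dev/sdf
echo 1 > /sys/block/sdf/device/delete
# after maintenance, bring the disk back with the script from scsitools
rescan-scsi-bus.sh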

If I find out something, I will post it here. I think it can affect other Ceph users.

-- 
WBR, Max A. Krasilnikov
ColoCall Data Center
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] can not umount ceph osd partition

2016-02-03 Thread Max A. Krasilnikov
Hello!

I am using 0.94.5. When I try to umount a partition and fsck it, I run into this issue:
root@storage003:~# stop ceph-osd id=13
ceph-osd stop/waiting
root@storage003:~# umount /var/lib/ceph/osd/ceph-13
root@storage003:~# fsck -yf /dev/sdf
fsck from util-linux 2.20.1
e2fsck 1.42.9 (4-Feb-2014)
/dev/sdf is in use.
e2fsck: Cannot continue, aborting.

There is no /var/lib/ceph/osd/ceph-13 in /proc/mounts, yet there is no way to check
the fs.
I can mount -o remount,rw, but I would like to umount the device for maintenance
and, maybe, replace it.

Why can't I umount it?

-- 
WBR, Max A. Krasilnikov
ColoCall Data Center
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not umount ceph osd partition

2016-02-03 Thread Max A. Krasilnikov
Hello!

On Wed, Feb 03, 2016 at 04:59:30PM +0100, yoann.moulin wrote:

> Hello,

>> I am using 0.94.5. When I try to umount partition and fsck it I have issue:
>> root@storage003:~# stop ceph-osd id=13
>> ceph-osd stop/waiting
>> root@storage003:~# umount /var/lib/ceph/osd/ceph-13
>> root@storage003:~# fsck -yf /dev/sdf
>> fsck from util-linux 2.20.1
>> e2fsck 1.42.9 (4-Feb-2014)
>> /dev/sdf is in use.
>> e2fsck: Cannot continue, aborting.
>> 
>> There is no /var/lib/ceph/osd/ceph-13 in /proc mounts. But no ability to 
>> check
>> fs.
>> I can mount -o remount,rw, but I would like to umount device for maintenance
>> and, maybe, replace it.
>> 
>> Why I can't umount?

> is "lsof -n | grep /dev/sdf" give something ?

Nothing.

> and are you sure /dev/sdf is the disk for osd 13 ?

Absolutely. I have even tried fsck -yf /dev/disk/by-label/osd-13. No luck.

The disk is mounted using LABEL in fstab; the journal is a symlink to
/dev/disk/by-partlabel/j-13.

-- 
WBR, Max A. Krasilnikov
ColoCall Data Center
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd_recovery_delay_start ignored in Hammer?

2016-01-16 Thread Max A. Krasilnikov
Hello!

In my cluster running Hammer 0.94.5-0ubuntu0.15.04.1~cloud0, an OSD starts recovery
immediately when it boots. I have changed osd_recovery_delay_start to 60 seconds,
but this setting is ignored during OSD bootup.

root@storage001:~# ceph -n osd.9 --show-config |grep osd_recovery_delay_start
osd_recovery_delay_start = 60

I would like to delay recovery because it increases the load on the cluster, leading
to slow requests on start. One to two minutes after the OSD starts up, the slow
requests disappear and everything is fine.
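
As a workaround I am considering gating recovery by hand around OSD restarts,
roughly like this (osd.9 is just my example):

ceph osd set norecover
ceph osd set nobackfill
start ceph-osd id=9
# wait for the OSD to boot and peer, then
ceph osd unset nobackfill
ceph osd unset norecover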

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write throughput drops to zero

2015-11-05 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 30, 2015 at 09:30:40PM +, moloney wrote:

> Hi,

> I recently got my first Ceph cluster up and running and have been doing some 
> stress tests. I quickly found that during sequential write benchmarks the 
> throughput would often drop to zero. Initially I saw this inside QEMU virtual 
> machines, but I can also reproduce the issue with "rados bench" within 5-10 
> minutes of sustained writes.  If left alone the writes will eventually start 
> going again, but it takes quite a while (at least a couple minutes). If I 
> stop and restart the benchmark the write throughput will immediately be where 
> it is supposed to be.

> I have convinced myself it is not a network hardware issue.  I can load up 
> the network with a bunch of parallel iperf benchmarks and it keeps chugging 
> along happily. When the issue occurs with Ceph I don't see any indications of 
> network issues (e.g. dropped packets).  Adding additional network load during 
> the rados bench (using iperf) doesn't seem to trigger the issue any faster or 
> more often.

> I have also convinced myself it isn't an issue with a journal getting full or 
> an OSD being too busy.  The amount of data being written before the problem 
> occurs is much larger than the total journal capacity. Watching the load on 
> the OSD servers with top/iostat I don't seen anything being overloaded, 
> rather I see the load everywhere drop to essentially zero when the writes 
> stall. Before the writes stall the load is well distributed with no visible 
> hot spots. The OSDs and hosts that report slow requests are random, so I 
> don't think it is a failing disk or server.  I don't see anything interesting 
> going on in the logs so far (I am just about to do some tests with Ceph's 
> debug logging cranked up).

> The cluster specs are:

> OS: Ubuntu 14.04 with 3.16 kernel
> Ceph: 9.1.0
> OSD Filesystem: XFS
> Replication: 3X
> Two racks with IPoIB network
> 10Gbps Ethernet between racks
> 8 OSD servers with:
>   * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
>   * 128GB RAM
>   * 12 6TB Seagate drives (connected to LSI 2208 chip in JBOD mode)
>   * Two 400GB Intel P3600 NVMe drives (OS on RAID1 partition, 6 partitions 
> for OSD journals each)
>   * Mellanox ConnectX-3 NIC (for both Infiniband and 10Gbps Ethernet)
> 3 Mons collocated on OSD servers

> Any advice is greatly appreciated. I am planning to try this with Hammer too.

I had the same trouble with Hammer, Ubuntu 14.04 and a 3.19 kernel on Supermicro
X9DRL-3F/iF with Intel 82599ES NICs, bonded into one link to 2 different Cisco
Nexus 5020 switches. It was finally fixed by dropping the MTU from jumbo (9000) back
to 1500. It worked with 9000 and the following sysctls for a while, but after several
weeks the trouble came back and I had to drop the MTU down again:

net.ipv4.tcp_rmem= 1024000 8738000 1677721600
net.ipv4.tcp_wmem= 1024000 8738000 1677721600
net.ipv4.tcp_mem= 1024000 8738000 1677721600
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 15
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
vm.swappiness = 1
net.ipv4.tcp_moderate_rcvbuf = 0

All 

> Thanks,
> Brendan

> _______
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
WBR, Max A. Krasilnikov
ColoCall Data Center
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Potential OSD deadlock?

2015-10-16 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the 
> (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

> What about some intermediate MTU like 8000 - does that work?
> Oh and if there's any bonding/trunking involved, beware that you need to set 
> the same MTU and offloads on all interfaces on certains kernels - flags like 
> MTU/offloads should propagate between the master/slave interfaces but in 
> reality it's not the case and they get reset even if you unplug/replug the 
> ethernet cable.

I'm sorry for the long delay in answering, but I have fixed the problem with jumbo
frames via sysctl:
#
net.ipv4.tcp_moderate_rcvbuf = 0
#
net.ipv4.tcp_rmem= 1024000 8738000 1677721600
net.ipv4.tcp_wmem= 1024000 8738000 1677721600
net.ipv4.tcp_mem= 1024000 8738000 1677721600
net.core.rmem_max=1677721600
net.core.rmem_default=167772160
net.core.wmem_max=1677721600
net.core.wmem_default=167772160

And now I can load my cluster without any slow requests. The essential setting
is net.ipv4.tcp_moderate_rcvbuf = 0; all the others are just additional tuning.
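
For anyone who wants to try just that one knob, a minimal sketch (the sysctl.d
file name is only an example):

sysctl -w net.ipv4.tcp_moderate_rcvbuf=0
echo 'net.ipv4.tcp_moderate_rcvbuf = 0' >> /etc/sysctl.d/99-ceph-net.conf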

> Jan

>> On 09 Oct 2015, at 13:21, Max A. Krasilnikov <pse...@colocall.net> wrote:
>> 
>> Hello!
>> 
>> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
>> 
>>> Are there any errors on the NICs? (ethtool -s ethX)
>> 
>> No errors. Neither on nodes, nor on switches.
>> 
>>> Also take a look at the switch and look for flow control statistics - do 
>>> you have flow control enabled or disabled?
>> 
>> flow control disabled everywhere.
>> 
>>> We had to disable flow control as it would pause all IO on the port 
>>> whenever any path got congested which you don't want to happen with a 
>>> cluster like Ceph. It's better to let the frame drop/retransmit in this 
>>> case (and you should size it so it doesn't happen in any case).
>>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
>>> put my money on that...
>> 
>> I tried to completely disable all offloads and setting mtu back to 9000 
>> after.
>> No luck.
>> I am speaking with my NOC about MTU in 10G network. If I have update, I will
>> write here. I can hardly beleave that it is ceph side, but nothing is
>> impossible.
>> 
>>> Jan
>> 
>> 
>>>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pse...@colocall.net> wrote:
>>>> 
>>>> Hello!
>>>> 
>>>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>>>> 
>>>>> -BEGIN PGP SIGNED MESSAGE-
>>>>> Hash: SHA256
>>>> 
>>>>> Sage,
>>>> 
>>>>> After trying to bisect this issue (all test moved the bisect towards
>>>>> Infernalis) and eventually testing the Infernalis branch again, it
>>>>> looks like the problem still exists although it is handled a tad
>>>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>>>> week and then try and dive into the code to see if I can expose any
>>>>> thing.
>>>> 
>>>>> If I can do anything to provide you with information, please let me know.
>>>> 
>>>> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
>>>> network
>>>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux 
>>>> bounding
>>>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
>>>> 82599ES
>>>> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on 
>>>> Nexus 5020
>>>> switch with Jumbo frames enabled i have performance drop and slow 
>>>> requests. When
>>>> setting 1500 on nodes and not touching Nexus all problems are fixed.
>>>> 
>>>> I have rebooted all my ceph services when changing MTU and changing things 
>>>> to
>>>> 9000 and 1500 several times in order to be sure. It is reproducable in my
>>>> environment.
>>>> 
>>>>> Thanks,
>>>>> -BEGIN PGP SIGNATURE-
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>> 
>>>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello!

On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256

> Sage,

> After trying to bisect this issue (all test moved the bisect towards
> Infernalis) and eventually testing the Infernalis branch again, it
> looks like the problem still exists although it is handled a tad
> better in Infernalis. I'm going to test against Firefly/Giant next
> week and then try and dive into the code to see if I can expose any
> thing.

> If I can do anything to provide you with information, please let me know.

I have fixed my troubles by setting the MTU back to 1500 from 9000 on the 2x10G
network between nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux bonding
driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 82599ES
adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216 on the Nexus
5020 switches with jumbo frames enabled I get a performance drop and slow requests.
When setting 1500 on the nodes and not touching the Nexus, all problems are gone.

I rebooted all my Ceph services when changing the MTU, and switched between 9000 and
1500 several times in order to be sure. It is reproducible in my environment.

> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com

> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
> BCFo
> =GJL4
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We forgot to upload the ceph.log yesterday. It is there now.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> I upped the debug on about everything and ran the test for about 40
>>> minutes. I took OSD.19 on ceph1 doen and then brought it back in.
>>> There was at least one op on osd.19 that was blocked for over 1,000
>>> seconds. Hopefully this will have something that will cast a light on
>>> what is going on.
>>>
>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>> the test to verify the results from the dev cluster. This cluster
>>> matches the hardware of our production cluster but is not yet in
>>> production so we can safely wipe it to downgrade back to Hammer.
>>>
>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>
>>> Let me know what else we can do to help.
>>>
>>> Thanks,
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>> EDrG
>>> =BZVw
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 On my second test (a much longer one), it took nearly an hour, but a
 few messages have popped up over a 20 window. Still far less than I
 have been seeing.
 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


 On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I'll 

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:

> Are there any errors on the NICs? (ethtool -s ethX)

No errors. Neither on nodes, nor on switches.

> Also take a look at the switch and look for flow control statistics - do you 
> have flow control enabled or disabled?

flow control disabled everywhere.

> We had to disable flow control as it would pause all IO on the port whenever 
> any path got congested which you don't want to happen with a cluster like 
> Ceph. It's better to let the frame drop/retransmit in this case (and you 
> should size it so it doesn't happen in any case).
> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
> put my money on that...

I tried completely disabling all offloads and then setting the MTU back to 9000.
No luck.
I am talking to my NOC about the MTU in the 10G network. If I have an update, I will
post it here. I can hardly believe it is on the Ceph side, but nothing is
impossible.

> Jan


>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pse...@colocall.net> wrote:
>> 
>> Hello!
>> 
>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>> 
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>> 
>>> Sage,
>> 
>>> After trying to bisect this issue (all test moved the bisect towards
>>> Infernalis) and eventually testing the Infernalis branch again, it
>>> looks like the problem still exists although it is handled a tad
>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>> week and then try and dive into the code to see if I can expose any
>>> thing.
>> 
>>> If I can do anything to provide you with information, please let me know.
>> 
>> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
>> network
>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux bounding
>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
>> 82599ES
>> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on Nexus 
>> 5020
>> switch with Jumbo frames enabled i have performance drop and slow requests. 
>> When
>> setting 1500 on nodes and not touching Nexus all problems are fixed.
>> 
>> I have rebooted all my ceph services when changing MTU and changing things to
>> 9000 and 1500 several times in order to be sure. It is reproducable in my
>> environment.
>> 
>>> Thanks,
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>> 
>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>>> BCFo
>>> =GJL4
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> 
>> 
>>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>>> -BEGIN PGP SIGNED MESSAGE-
>>>> Hash: SHA256
>>>> 
>>>> We forgot to upload the ceph.log yesterday. It is there now.
>>>> - 
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> 
>>>> 
>>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>>>> -BEGIN PGP SIGNED MESSAGE-
>>>>> Hash: SHA256
>>>>> 
>>>>> I upped the debug on about everything and ran the test for about 40
>>>>> minutes. I took OSD.19 on ceph1 doen and then brought it back in.
>>>>> There was at least one op on osd.19 that was blocked for over 1,000
>>>>> seconds. Hopefully this will have something that will cast a light on
>>>>> what is going on.
>>>>> 
>>>>>

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the 
> (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

No other layers, only 2x Nexus 5020 with virtual port-channels. Everything else I
will check on Monday.

> What about some intermediate MTU like 8000 - does that work?

Not tested. I will.

> Oh and if there's any bonding/trunking involved, beware that you need to set 
> the same MTU and offloads on all interfaces on certains kernels - flags like 
> MTU/offloads should propagate between the master/slave interfaces but in 
> reality it's not the case and they get reset even if you unplug/replug the 
> ethernet cable.

Yes, I understand that :) I set the parameters on both interfaces and verified them
with "ip link".

> Jan

>> On 09 Oct 2015, at 13:21, Max A. Krasilnikov <pse...@colocall.net> wrote:
>> 
>> Hello!
>> 
>> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
>> 
>>> Are there any errors on the NICs? (ethtool -s ethX)
>> 
>> No errors. Neither on nodes, nor on switches.
>> 
>>> Also take a look at the switch and look for flow control statistics - do 
>>> you have flow control enabled or disabled?
>> 
>> flow control disabled everywhere.
>> 
>>> We had to disable flow control as it would pause all IO on the port 
>>> whenever any path got congested which you don't want to happen with a 
>>> cluster like Ceph. It's better to let the frame drop/retransmit in this 
>>> case (and you should size it so it doesn't happen in any case).
>>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't 
>>> put my money on that...
>> 
>> I tried to completely disable all offloads and setting mtu back to 9000 
>> after.
>> No luck.
>> I am speaking with my NOC about MTU in 10G network. If I have update, I will
>> write here. I can hardly beleave that it is ceph side, but nothing is
>> impossible.
>> 
>>> Jan
>> 
>> 
>>>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pse...@colocall.net> wrote:
>>>> 
>>>> Hello!
>>>> 
>>>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>>>> 
>>>>> -BEGIN PGP SIGNED MESSAGE-
>>>>> Hash: SHA256
>>>> 
>>>>> Sage,
>>>> 
>>>>> After trying to bisect this issue (all test moved the bisect towards
>>>>> Infernalis) and eventually testing the Infernalis branch again, it
>>>>> looks like the problem still exists although it is handled a tad
>>>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>>>> week and then try and dive into the code to see if I can expose any
>>>>> thing.
>>>> 
>>>>> If I can do anything to provide you with information, please let me know.
>>>> 
>>>> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G 
>>>> network
>>>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux 
>>>> bounding
>>>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 
>>>> 82599ES
>>>> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on 
>>>> Nexus 5020
>>>> switch with Jumbo frames enabled i have performance drop and slow 
>>>> requests. When
>>>> setting 1500 on nodes and not touching Nexus all problems are fixed.
>>>> 
>>>> I have rebooted all my ceph services when changing MTU and changing things 
>>>> to
>>>> 9000 and 1500 several times in order to be sure. It is reproducable in my
>>>> environment.
>>>> 
>>>>> Thanks,
>>>>> -BEGIN PGP SIGNATURE-
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>> 
>>>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>>>> jjyy5w

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Max A. Krasilnikov
Hello!

On Mon, Oct 05, 2015 at 09:35:26PM -0600, robert wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256

> With some off-list help, we have adjusted
> osd_client_message_cap=1. This seems to have helped a bit and we
> have seen some OSDs have a value up to 4,000 for client messages. But
> it does not solve the problem with the blocked I/O.

> One thing that I have noticed is that almost exactly 30 seconds elapse
> between an OSD boots and the first blocked I/O message. I don't know
> if the OSD doesn't have time to get it's brain right about a PG before
> it starts servicing it or what exactly.

I have problems like yours in my cluster. All of them can be fixed by restarting
certain OSDs, but I cannot keep restarting all my OSDs again and again.
The problem occurs when a client is writing to an RBD volume or when a volume is
recovering. A typical message is (this one appeared during recovery):

[WRN] slow request 30.929654 seconds old, received at 2015-10-06 
13:00:41.412329: osd_op(client.1068613.0:192715 
rbd_data.dc7650539e6a.0820 [set-alloc-hint object_size 4194304 
write_size 4194304,write 3371008~4096] 5.d66fd55d snapc c=[c] 
ack+ondisk+write+known_if_redirected e4009) currently waiting for subops from 51

Restarting osd.51 in such a scenario fixes the problem.

There are no slow requests under low IO; they only appear when I do something like
uploading an image.
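
When it happens, dumping the blocked ops on the reported OSD gives some detail
(run on the node hosting that OSD; osd.51 is just the example from my log
above):

ceph daemon osd.51 dump_ops_in_flight
ceph daemon osd.51 dump_historic_ops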

Some time ago I had too many OSDs that were created but never used. Back then, when
going down for a restart, OSDs did not inform the mon about it. Removing the unused
OSD entries fixed that issue, but when doing "ceph osd crush dump" I can still see
them. Maybe that is the root of the problem? I tried to do
getcrushmap/edit/setcrushmap, but the entries stay in place.
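
For reference, the kind of removal sequence I mean (<id> being the number of an
unused OSD):

ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>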

Maybe my experience will help you find the answer. I hope it will fix my problems
too :)

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-21 Thread Max A. Krasilnikov
Hello!

On Sat, Sep 19, 2015 at 07:03:35AM +0200, martin wrote:

> Thanks all for the suggestions.

> Our storage nodes have plenty of RAM and their only purpose is to host the
> OSD daemons, so we will not create a swap partition on provisioning.

As an option, you can create a swap file on demand; it is easy to deploy.
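
A minimal swap-file sketch, in case it is ever needed (the size is just an
example):

fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab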

> For the OS disk we will then use a software raid 1 to handle eventually
> disk failures. For provisioning the hosts we use kickstart and then Ansible
> to install an prepare the hosts to be ready to for ceph-deploy.

I don't think RAID1 is really needed with Ceph, given its ability to keep copies of
the data distributed across hosts. Think of Ceph itself as a RAID1 over hosts, which
is more reliable: even a server crash will not destroy your data, provided your
setup is correct :)

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help! Ceph Manual Depolyment

2015-09-18 Thread Max A. Krasilnikov
Hello!

On Thu, Sep 17, 2015 at 11:59:47PM +0800, wikison wrote:


> Is there any detailed manual deployment document? I downloaded the source and 
> built ceph, then installed ceph on 7 computers. I used three as monitors and 
> four as OSD. I followed the official document on ceph.com. But it didn't work 
> and it seemed to be out-dated. Could anybody help me?

This works for me:
http://docs.ceph.com/docs/master/install/manual-deployment/
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/
http://www.sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server/
http://docs.ceph.com/docs/master/cephfs/createfs/

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange rbd hung with non-standard crush location

2015-09-18 Thread Max A. Krasilnikov
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.10]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.11]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.12]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.20]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.30]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.31]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.32]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.40]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.50]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.51]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.52]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

My pools:

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 92 flags hashpspool stripe_width 0
pool 4 'openstack-img' replicated size 2 min_size 1 crush_ruleset 3 object_hash 
rjenkins pg_num 512 pgp_num 512 last_change 187 flags hashpspool stripe_width 0
pool 5 'openstack-hdd' replicated size 2 min_size 1 crush_ruleset 3 object_hash 
rjenkins pg_num 512 pgp_num 512 last_change 114 flags hashpspool stripe_width 0
pool 6 'openstack-ssd' replicated size 2 min_size 1 crush_ruleset 4 object_hash 
rjenkins pg_num 512 pgp_num 512 last_change 118 flags hashpspool stripe_width 0
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 3 
object_hash rjenkins pg_num 64 pgp_num 64 last_change 141 flags hashpspool 
stripe_width 0
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 3 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 145 flags hashpspool 
crash_replay_interval 45 stripe_width 0

The first one was added by the Ceph setup and is not used by me. I have only changed
its ruleset to 3.

So why do I need a "default" root with OSDs in it? And why is this not described in
the docs? Or maybe I have simply misunderstood it?
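
For context, the kind of commands used to build a separate hierarchy next to
the default root (the bucket name hdd is an example; storage001 is one of my
hosts):

ceph osd crush add-bucket hdd root
ceph osd crush move storage001 root=hdd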

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommended way of leveraging multiple disks by Ceph

2015-09-16 Thread Max A. Krasilnikov
Hello!

On Tue, Sep 15, 2015 at 04:16:47PM +, fangzhe.chang wrote:

> Hi,

> I'd like to run Ceph on a few machines, each of which has multiple disks. The 
> disks are heterogeneous: some are rotational disks of larger capacities while 
> others are smaller solid state disks. What are the recommended ways of 
> running ceph osd-es on them?

> Two of the approaches can be:

> 1)  Deploy an osd instance on each hard disk. For instance, if a machine 
> has six hard disks, there will be six osd instances running on it. In this 
> case, does Ceph's replication algorithm recognize that these osd-es are on 
> the same machine therefore try to avoid placing replicas on disks/osd-es of a 
> same machine?

When adding an OSD, or at any time later, you can set a crush location for it. PG
placement is based on your crush rules and crush locations. In the general case,
data will be written to different hosts.

I have a config with multiple disks on 3 nodes, some of them HDDs plus 1 SSD per
node. Each drive serves 1 OSD.
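
A sketch of pinning an OSD to a location, either once via the CLI or
persistently in ceph.conf (the weight, root and host names are only examples
from my layout):

ceph osd crush create-or-move osd.40 1.0 root=ssd host=storage003-ssd

[osd.40]
osd crush location = root=ssd host=storage003-ssd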

> 2)  Create a logical volume spanning multiple hard disks of a machine and 
> run a single copy of osd per machine.

It is more reliable to have several OSDs, one per drive. When losing a drive, you
will not lose all the data on that host.

> If you have previous experiences, benchmarking results, or know a pointer to 
> the corresponding documentation, please share with me and other users. Thanks 
> a lot.

I preferred this fine article:
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD performance slowly degrades :-(

2015-08-12 Thread Max A. Krasilnikov
Hello!

On Wed, Aug 12, 2015 at 02:30:59PM +, pieter.koorts wrote:

> Hi Irek,

> Thanks for the link. I have removed the SSDs for now and performance is up
> to 30MB/s on a benchmark now. To be honest, I knew the Samsung SSDs weren't
> great but did not expect them to be worse than just plain hard disks.

I had the same trouble with Samsung 840 EVO 1TB drives: 15 of the 16 disks were
terribly slow (about 3000 iops and up to 200 MBps per drive). All of the drives were
replaced with 850 EVO 250 GB and the problem was fixed.
My SSDs had the latest firmware and were brand new at the time of the test.
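
For reference, the usual way to compare drives for journal-style (sync) writes
is a single-threaded O_DSYNC test along these lines (the path is a scratch file
on the SSD and is only an example):

dd if=/dev/zero of=/mnt/ssd-test/journal-test bs=4k count=100000 oflag=direct,dsync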

> Pieter

> Something that's been bugging me for a while is I am trying to diagnose
> iowait time within KVM guests. Guests doing reads or writes tend do about 50%
> to 90% iowait but the host itself is only doing about 1% to 2% iowait. So the
> result is the guests are extremely slow.

> I currently run 3x hosts each with a single SSD and single HDD OSD in
> cache-teir writeback mode. Although the SSD (Samsung 850 EVO 120GB) is not a
> great one it should at least perform reasonably compared to a hard disk and
> doing some direct SSD tests I get approximately 100MB/s write and 200MB/s
> read on each SSD.

> When I run rados bench though, the benchmark starts with a not great but okay
> speed and as the benchmark progresses it just gets slower and slower till
> it's worse than a USB hard drive. The SSD cache pool is 120GB in size (360GB
> RAW) and in use at about 90GB. I have tried tuning the XFS mount options as
> well but it has had little effect.

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enclosure power failure pausing client IO till all connected hosts up

2015-07-27 Thread Max A. Krasilnikov
Hello!

On Tue, Jul 07, 2015 at 02:21:56PM +0530, mallikarjuna.biradar wrote:

> Hi all,

> Setup details:
> Two storage enclosures each connected to 4 OSD nodes (Shared storage).
> Failure domain is Chassis (enclosure) level. Replication count is 2.
> Each host has allotted with 4 drives.

> I have active client IO running on cluster. (Random write profile with
> 4M block size & 64 Queue depth).

> One of enclosure had power loss. So all OSD's from hosts that are
> connected to this enclosure went down as expected.

> But client IO got paused. After some time enclosure & hosts connected
> to it came up.
> And all OSD's on that hosts came up.

> Till this time, cluster was not serving IO. Once all hosts & OSD's
> pertaining to that enclosure came up, client IO resumed.


> Can anybody help me why cluster not serving IO during enclosure
> failure. OR its a bug?

With a replication factor of 2 you need 3+ failure domains (nodes/chassis) in order
to keep serving clients through a failure, assuming your chooseleaf type is > 0
(i.e. not osd-level).
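
Things I would double-check, for reference (the pool name is an example; the
crush rule line shown is the shape that makes the enclosure the failure
domain):

ceph osd pool get rbd min_size
# in the crush rule used by the pool:
step chooseleaf firstn 0 type chassis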

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com