[ceph-users] rbd mount unmap network outage

2017-11-29 Thread Hauke Homburg
Hello,

I am currently working on an NFS HA cluster to export RBD images via NFS.
To test the failover I tried the following:

https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

I set the RBD image to exclusive lock and the OSD and MON timeouts to 20
seconds.

On one NFS server I mapped the RBD image with rbd map. After mapping I
blocked the TCP ports with iptables to simulate a network outage (ports
tcp 6789 and 6800:7300).

I can see with rbd status that the watchers on the cluster itself are
gone after the timeout.

The NFS server logs "encountered watch error -110".

The NFS server's libceph then tries to connect to another MON.

When all this happens I cannot unmap the image.

The Ceph cluster is 10.2.10 on CentOS 7, the NFS server is Debian 9. The
Pacemaker RA is ceph-resource-agents 10.2.10.

My intention is to unmap the image when the network outage happens, because
of the failover, and because I don't want the RBD image mapped on two
servers at once after the outage is resolved, to prevent data damage.
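
For reference, the simulation on the NFS server looks roughly like this (a
sketch of the steps described above; pool/image names and the exact iptables
rules are placeholders):

  rbd map <pool>/<image>
  # simulate the network outage by blocking MON and OSD traffic
  iptables -A OUTPUT -p tcp --dport 6789 -j DROP
  iptables -A OUTPUT -p tcp --dport 6800:7300 -j DROP
  # after the OSD/MON timeouts have expired, this hangs instead of returning:
  rbd unmap /dev/rbd0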

Thanks for your help

Hauke



-- 
www.w3-creative.de

www.westchat.de

https://friendica.westchat.de/profile/hauke

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume lvm for bluestore for newer disk

2017-11-29 Thread nokia ceph
Hello,

I'm following
http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#ceph-volume-lvm-prepare-bluestore
to create new OSDs.

I took the latest branch from https://shaman.ceph.com/repos/ceph/luminous/

# ceph -v
ceph version 12.2.1-851-g6d9f216

First, I formatted the device:

#sgdisk -Z /dev/sdv
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.


I get the error below while creating the bluestore OSDs:

# ceph-volume lvm prepare --bluestore  --data /dev/sdv
Running command: sudo vgcreate --force --yes
ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121 # use uuidgen to create an ID,
use this for all ceph nodes in your cluster /dev/sdv
 stderr: Name contains invalid character, valid set includes:
[a-zA-Z0-9.-_+].
  New volume group name "ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121 # use
uuidgen to create an ID, use this for all ceph nodes in your cluster" is
invalid.
  Run `vgcreate --help' for more information.
-->  RuntimeError: command returned non-zero exit status: 3

# grep fsid /etc/ceph/ceph.conf
fsid = b2f1b9b9-eecc-4c17-8b92-cfa60b31c121


My question

1. We have 68 disks per server, so will all 68 disks share the same volume
group "ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121"?
2. Why did ceph-volume fail to create a VG with this name? Even when I
tried to create it manually, vgcreate asked for a physical volume as an argument:
#vgcreate --force --yes "ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121"
  No command with matching syntax recognised.  Run 'vgcreate --help' for
more information.
  Correct command syntax is:
  vgcreate VG_new PV ...
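
The failed vgcreate above suggests the trailing comment from the sample
ceph.conf ("# use uuidgen to create an ID ...") ended up in the VG name, so it
may be worth checking that the fsid line carries no inline comment. For the
manual test, vgcreate simply needs the physical volume as an argument - a
sketch, assuming /dev/sdv is the intended data device:

  vgcreate --force --yes ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121 /dev/sdv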

Please let me know your comments.

Thanks
Jayaram
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Brad Hubbard
# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | grep ceph-osd

To find the actual thread that is using 100% CPU.

# for x in `seq 1 5`; do gdb -batch -p [PID] -ex "thr appl all bt";
echo; done > /tmp/osd.stack.dump

Then look at the stacks for the thread that was using all the CPU and
see what it was doing at the time.

Note that you may need to install debuginfo for ceph to see meaningful
stack traces. How you go about this depends on the distro you are
using.
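
For example (a sketch only; exact package names differ per distro and release):

  # RHEL/CentOS
  debuginfo-install ceph-osd
  # Debian/Ubuntu
  apt-get install ceph-osd-dbg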


On Thu, Nov 30, 2017 at 8:48 AM, Denes Dolhay  wrote:
> Hello,
>
> You might consider checking the iowait (during the problem), and the dmesg
> (after it recovered). Maybe an issue with the given sata/sas/nvme port?
>
>
> Regards,
>
> Denes
>
>
> On 11/29/2017 06:24 PM, Matthew Vernon wrote:
>>
>> Hi,
>>
>> We have a 3,060 OSD ceph cluster (running Jewel
>> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
>> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
>> host), and having ops blocking on it for some time. It will then behave
>> for a bit, and then go back to doing this.
>>
>> It's always the same OSD, and we've tried replacing the underlying disk.
>>
>> The logs have lots of entries of the form
>>
>> 2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
>>
>> I've had a brief poke through the collectd metrics for this osd (and
>> comparing them with other OSDs on the same host) but other than showing
>> spikes in latency for that OSD (iostat et al show no issues with the
>> underlying disk) there's nothing obviously explanatory.
>>
>> I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
>> is what googling for the above message suggests), but that just said
>> "unchangeable", and didn't seem to make any difference.
>>
>> Any ideas? Other metrics to consider? ...
>>
>> Thanks,
>>
>> Matthew
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Denes Dolhay

Hello,

You might consider checking the iowait (during the problem), and the 
dmesg (after it recovered). Maybe an issue with the given SATA/SAS/NVMe 
port?
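
For example, something along these lines:

  iostat -x 1           # watch %iowait and per-device await while the problem occurs
  dmesg -T | tail -50   # look for link resets or I/O errors after recovery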



Regards,

Denes


On 11/29/2017 06:24 PM, Matthew Vernon wrote:

Hi,

We have a 3,060 OSD ceph cluster (running Jewel
10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
host), and having ops blocking on it for some time. It will then behave
for a bit, and then go back to doing this.

It's always the same OSD, and we've tried replacing the underlying disk.

The logs have lots of entries of the form

2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15

I've had a brief poke through the collectd metrics for this osd (and
comparing them with other OSDs on the same host) but other than showing
spikes in latency for that OSD (iostat et al show no issues with the
underlying disk) there's nothing obviously explanatory.

I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
is what googling for the above message suggests), but that just said
"unchangeable", and didn't seem to make any difference.

Any ideas? Other metrics to consider? ...

Thanks,

Matthew




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Memory leak in OSDs running 12.2.1 beyond the buffer_anon mempool leak

2017-11-29 Thread Subhachandra Chandra
Hello,

   We are trying out Ceph on a small cluster and are observing memory
leakage in the OSD processes. The leak seems to be in addition to the known
leak related to the "buffer_anon" pool and is high enough for the processes
to run against their memory limits in a few hours.

The following table gives a snapshot of increase in memory being used by
one of the OSD processes over an hour (t+63 indicates 63 minutes after the
first snapshot). Full mempool dumps and output of top are at the bottom.
Over an hour the OSDs went from RSS in the range 469-704MB to 735-844MB.
The containers are restricted to 1 GB of memory, which causes them to restart
after a few hours.

                                   t+00     t+11     t+63
VmRSS KB                         683980   706980   786324
buffer_anon KB (dump_mempools)     5803    10457    32308
total KB (dump_mempools)         437369   444945   458688



Our setup is as follows:
* 3 nodes each with 30 OSDs for a total of 90 OSDs.
* Running Luminous (12.2.1)  official docker images on top of CoreOS
* The OSDs use Bluestore with all the db.* partitions on the same drive
* The nodes have 32GB of RAM and 8 cores. The test cluster nodes do have
less than the recommended amount of RAM per OSD to constrain them and find
problems
* The cluster currently has 501 PGs/OSD (Again higher than recommended for
testing)
* The pools are setup for RGW usage with replication_factor of 3 on all the
pools (2752 PGs) except default.rgw.buckets.data (4096 PGs) which is setup
with 6+3 erasure coding.
* The clients use the python rados library to push 128MB files directly
into the default.rgw.buckets.data pool. There are 3 clients running in
parallel on VMs and are pushing  about 350-400MB/s in aggregate.

The conf file with non-default settings looks like
[global]
mon_max_pg_per_osd = 750
mon_osd_down_out_interval = 900
mon_pg_warn_max_per_osd = 600
osd_crush_chooseleaf_type = 0
osd_map_message_max = 10
osd_max_pg_per_osd_hard_ratio = 1.2

[mon]
mon_max_pgmap_epochs = 100
mon_min_osdmap_epochs = 100

[osd]
bluestore_cache_kv_ratio = .95
bluestore_cache_kv_max = 67108864
bluestore_cache_meta_ratio = .05
bluestore_cache_size = 268435456
osd_map_cache_size = 50
osd_map_max_advance = 25
osd_map_share_max_epochs = 25
osd_max_object_size = 1073741824
osd_max_write_size = 256
osd_pg_epoch_persisted_max_stale = 25
osd_pool_erasure_code_stripe_unit = 4194304
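
For reference, the mempool figures above come from the OSD admin socket; a
sketch of the commands involved (the heap ones assume the OSDs are linked
against tcmalloc):

  ceph daemon osd.0 dump_mempools   # per-mempool usage, including buffer_anon
  ceph daemon osd.0 heap stats      # tcmalloc view of allocated vs. held-but-free memory
  ceph daemon osd.0 heap release    # ask tcmalloc to return free memory to the OS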


top - 20:46:18 up 1 day,  2:21,  2 users,  load average: 3.13, 1.78, 1.25
Tasks: 567 total,   1 running, 566 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.7 us,  5.9 sy,  0.0 ni, 73.5 id, 11.1 wa,  0.3 hi,  2.5 si,
0.0 st
KiB Mem:  32981660 total, 24427392 used,  8554268 free,   351396 buffers
KiB Swap:0 total,0 used,0 free.  2803348 cached Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 376901 64045 20   0 150 704896  29056 S  14.4  2.1   1:26.72
ceph-osd
 370611 64045 20   0 1527432 698080  29476 S   2.0  2.1   1:29.03
ceph-osd
 396886 64045 20   0 1486584 696480  29060 S   2.0  2.1   1:22.93
ceph-osd
 382254 64045 20   0 1516968 690196  28984 S   3.0  2.1   1:27.15
ceph-osd
 359523 64045 20   0 1516888 686728  29332 S   3.5  2.1   1:28.67
ceph-osd
 366478 64045 20   0 1560912 683980  29076 S   1.5  2.1   1:28.59
ceph-osd
 382255 64045 20   0 1493116 669616  29276 S   1.5  2.0   1:29.46
ceph-osd
 360152 64045 20   0 1529896 28  29268 S   0.5  2.0   1:27.96
ceph-osd
 372155 64045 20   0 1523640 662492  29416 S  17.4  2.0   1:29.79
ceph-osd
 358800 64045 20   0 1513640 662224  29184 S  13.9  2.0   1:29.80
ceph-osd
 360142 64045 20   0 1517992 661868  29328 S   0.5  2.0   1:31.69
ceph-osd
 398310 64045 20   0 1504552 658216  28796 S   1.0  2.0   1:20.62
ceph-osd
 368705 64045 20   0 1505544 657776  29292 S   1.0  2.0   1:27.32
ceph-osd
 386044 64045 20   0 1501488 655960  29536 S   3.0  2.0   1:24.87
ceph-osd
 386940 64045 20   0 1503056 652552  29152 S   4.5  2.0   1:28.22
ceph-osd
 386050 64045 20   0 1489996 650628  28800 S   1.0  2.0   1:28.46
ceph-osd
 402086 64045 20   0 1504528 646672  29344 S   2.5  2.0   1:26.96
ceph-osd
 400590 64045 20   0 1487424 642288  29348 S   3.5  1.9   1:21.55
ceph-osd
 387860 64045 20   0 1504520 641296  29316 S   4.0  1.9   1:19.98
ceph-osd
 392900 64045 20   0 1493492 637572  29156 S   1.5  1.9   1:26.86
ceph-osd
 375314 64045 20   0 1520448 629272  29412 S   1.0  1.9   1:32.04
ceph-osd
 372038 64045 20   0 1497992 627176  29300 S   1.0  1.9   1:30.36
ceph-osd
 385149 64045 20   0 1514284 624428  28960 S   0.5  1.9   1:28.56
ceph-osd
 382236 64045 20   0 1512248 616256  29568 S   2.0  1.9   1:24.03
ceph-osd
 374703 64045 20   0 1511740 571404  29628 S   2.5  1.7   1:27.88
ceph-osd
 367873 64045 20   0 1394740 564488  29012 S   2.5  1.7   1:31.64
ceph-osd
 360104 64045 20   0 1373880 532880  29132 S   2.5  1.6   1:32.11
ceph-osd
 376002 64045 20   0 1391576 516256  29132 S   0.5  1.6   

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-29 Thread Zoltan Arnold Nagy

On 2017-11-27 14:02, German Anders wrote:

4x 2U servers:
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
so I assume you are using IPoIB as the cluster network for the 
replication...



1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
(active/active) with 3 vlans

... and the 10GbE network for the front-end network?

At 4k writes your network latency will be very high (see the flame 
graphs at the Intel NVMe presentation from the Boston OpenStack Summit - 
not sure if there is a newer deck that somebody could link ;)) and the 
time will be spent in the kernel. You could give RDMAMessenger a try but 
it's not stable at the current LTS release.


If I were you I'd be looking at 100GbE - we've recently pulled in a 
bunch of 100GbE links and it's been wonderful to see 100+GB/s going over 
the network for just storage.


Some people suggested mounting multiple RBD volumes - but unless I'm 
mistaken, unless you're using very recent qemu/libvirt combinations with the 
proper libvirt disk settings, all IO will still be single-threaded 
towards librbd, so that won't give any speedup.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "failed to open ino"

2017-11-29 Thread David C
On Tue, Nov 28, 2017 at 1:50 PM, Jens-U. Mozdzen  wrote:

> Hi David,
>
> Zitat von David C :
>
>> On 27 Nov 2017 1:06 p.m., "Jens-U. Mozdzen"  wrote:
>>
>> Hi David,
>>
>> Zitat von David C :
>>
>> Hi Jens
>>
>>>
>>> We also see these messages quite frequently, mainly the "replicating
>>> dir...". Only seen "failed to open ino" a few times so didn't do any real
>>> investigation. Our set up is very similar to yours, 12.2.1,
>>> active/standby
>>> MDS and exporting cephfs through KNFS (hoping to replace with Ganesha
>>> soon).
>>>
>>>
>> been there, done that - using Ganesha more than doubled the run-time of
>> our
>> jobs, while with knfsd, the run-time is about the same for CephFS-based
>> and
>> "local disk"-based files. But YMMV, so if you see speeds with Ganesha that
>> are similar to knfsd, please report back with details...
>>
>>
>> I'd be interested to know if you tested Ganesha over a cephfs kernel mount
>> (ie using the VFS fsal) or if you used the Ceph fsal. Also the server and
>> client versions you tested.
>>
>
> I had tested Ganesha only via the Ceph FSAL. Our Ceph nodes (including the
> one used as a Ganesha server) are running 
> ceph-12.2.1+git.1507910930.aea79b8b7a
> on OpenSUSE 42.3, SUSE's kernel 4.4.76-1-default (which has a number of
> back-ports in it), Ganesha is at version nfs-ganesha-2.5.2.0+git.150427
> 5777.a9d23b98f.
>
> The NFS clients are a broad mix of current and older systems.
>
> Prior to Luminous, Ganesha writes were terrible due to a bug with fsync
>> calls in the mds code. The fix went into the mds and client code. If
>> you're
>> doing Ganesha over the top of the kernel mount you'll need a pretty recent
>> kernel to see the write improvements.
>>
>
> As we were testing the Ceph FSAL, this should not be the cause.
>
> From my limited Ganesha testing so far, reads are better when exporting the
>> kernel mount, writes are much better with the Ceph fsal. But that's
>> expected for me as I'm using the CentOS kernel. I was hoping the
>> aforementioned fix would make it into the rhel 7.4 kernel but doesn't look
>> like it has.
>>
>
> When exporting the kernel-mounted CephFS via kernel nfsd, we see similar
> speeds to serving the same set of files from a local bcache'd RAID1 array
> on SAS disks. This is for a mix of reads and writes, mostly small files
> (compile jobs, some packaging).
>

I'm surprised your knfs writes are that good on a 4.4 kernel (assuming your
exports aren't async). At least when I tested with the mainline 4.4 kernel
it was still super slow for me. It's only in 4.12 or 4.13 where they
improve. It sounds like Suse have potentially backported some good stuff!



>
> From what I can see, it would have to be A/A/P, since MDS demands at least
>> one stand-by.
>>
>>
>> That's news to me.
>>
>
> From http://docs.ceph.com/docs/master/cephfs/multimds/ :
>
> "Each CephFS filesystem has a max_mds setting, which controls how many
> ranks will be created. The actual number of ranks in the filesystem will
> only be increased if a spare daemon is available to take on the new rank.
> For example, if there is only one MDS daemon running, and max_mds is set to
> two, no second rank will be created."
>
> Might well be I was mis-reading this... I had first read it to mean that a
> spare daemon needs to be available *while running* A/A, but the example
> sounds like the spare is required when *switching to* A/A.
>

Yep I think you're right. Further down that page it states: "Even with
multiple active MDS daemons, a highly available system still requires
standby daemons to take over if any of the servers running an active daemon
fail."

I assumed if an active MDS failed, the surviving MDS(s) would just pick up
the workload. The question is, would losing an MDS in a cluster with no
standbys stop all metadata IO or would it just be a health warning? I need
to do some playing around with this at some point.



> Is it possible you still had standby config in your ceph.conf?
>>
>
> Not sure what you're asking for, is this related to active/active or to
> our Ganesha tests? We have not yet tried to switch to A/A, so our config
> actually contains standby parameters.
>

It was in relation to A/A, but query answered above.

>
> Regards,
> Jens
>
> --
> Jens-U. Mozdzen voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15   mobile  : +49-179-4 98 21 98
> D-22423 Hamburg e-mail  : jmozd...@nde.ag
>
> Vorsitzende des Aufsichtsrates: Angelika Torlée-Mozdzen
>   Sitz und Registergericht: Hamburg, HRB 90934
>   Vorstand: Jens-U. Mozdzen
>USt-IdNr. DE 814 013 983
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Jean-Charles Lopez
Hi Matthew,

anything special happening on the NIC side that could cause a problem? Packet 
drops? Incorrect jumbo frame settings causing fragmentation?

Have you checked the cstate settings on the box?

Have you disabled energy saving settings differently from the other boxes?

Any unexpected wait time on some devices on the box?

Have you compared your kernel parameters on this box compared to the other 
boxes?
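
For example, roughly (interface name and CPU paths are placeholders):

  ethtool -S eth0 | grep -iE 'drop|err'                  # NIC-level drops/errors
  ip -s link show eth0                                   # interface statistics
  cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name   # C-states offered to the kernel
  sysctl -a > /tmp/sysctl.$(hostname)                    # collect on each box and diff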

Just in case
JC

> On Nov 29, 2017, at 09:24, Matthew Vernon  wrote:
> 
> Hi,
> 
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It will then behave
> for a bit, and then go back to doing this.
> 
> It's always the same OSD, and we've tried replacing the underlying disk.
> 
> The logs have lots of entries of the form
> 
> 2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
> 
> I've had a brief poke through the collectd metrics for this osd (and
> comparing them with other OSDs on the same host) but other than showing
> spikes in latency for that OSD (iostat et al show no issues with the
> underlying disk) there's nothing obviously explanatory.
> 
> I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
> is what googling for the above message suggests), but that just said
> "unchangeable", and didn't seem to make any difference.
> 
> Any ideas? Other metrics to consider? ...
> 
> Thanks,
> 
> Matthew
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: costly MDS cache misses?

2017-11-29 Thread Jens-U. Mozdzen

Hi *,

while tracking down a different performance issue with CephFS  
(creating tar balls from CephFS-based directories takes multiple times  
as long as when backing up the same data from local disks, i.e. 56  
hours instead of 7), we had a look at CephFS performance related to  
the size of the MDS process.


Our Ceph cluster (Luminous 12.2.1) is using file-based OSDs, CephFS  
data is on SAS HDDs, meta data is on SAS SSDs.


It came to mind that MDS memory consumption might cause the delays  
with "tar". But while the results below don't confirm this (they actually  
confirm that MDS memory size does not affect CephFS read speed when  
the cache is sufficiently warm), they do show an almost 30%  
performance drop if the cache is filled with the wrong entries.


After a fresh process start, our MDS takes about 450 MB of virtual memory,  
with 56 MB resident. I then start a tar run over 36 GB of small files (which I had  
also run a few minutes before the MDS restart, to warm up disk caches):


--- cut here ---
   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
  1233 ceph   20   0  446584  56000  15908 S 3.960 0.085  0:01.08 ceph-mds


server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 17:38:21 CET 2017
38245529600
Wed Nov 29 17:44:27 CET 2017
server01:~ #

   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
  1233 ceph   20   0  485760 109156  16148 S 0.331 0.166  0:10.76 ceph-mds

--- cut here ---

As you can see, there's only small growth in MDS virtual size.

The job took 366 seconds, that's an average of about 100 MB/s.

I repeat that job a few minutes later, to get numbers with a  
previously active MDS (the MDS cache should be warmed up now):


--- cut here ---
   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
  1233 ceph   20   0  494976 118404  16148 S 2.961 0.180  0:16.21 ceph-mds


server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 17:53:09 CET 2017
38245529600
Wed Nov 29 17:58:53 CET 2017
server01:~ #

   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
  1233 ceph   20   0  508288 131368  16148 S 1.980 0.200  0:25.45 ceph-mds

--- cut here ---

The job took 344 seconds, that's an average of about 106 MB/s. With  
only a single run per situation, these numbers aren't more than a rough  
estimate, of course.


At 18:00:00, a file-based incremental backup job kicks in, which reads  
through most of the files on the CephFS, but only backing up those  
that were changed since the last run. This has nothing to do with our  
"tar" and is running on a different node, where CephFS is  
kernel-mounted as well. That backup job makes the MDS cache grow  
drastically, you can see MDS at more than 8 GB now.


We then start another tar job (or rather two, to account for MDS  
caching), as before:


--- cut here ---
   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
  1233 ceph   20   0 8644776 7.750g  16184 S 0.990 12.39  6:45.24 ceph-mds


server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 18:13:20 CET 2017
38245529600
Wed Nov 29 18:21:50 CET 2017
server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 18:22:52 CET 2017
38245529600
Wed Nov 29 18:28:28 CET 2017
server01:~ #

   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
  1233 ceph   20   0 8761512 7.642g  16184 S 3.300 12.22  7:03.52 ceph-mds

--- cut here ---

The second run is even a bit quicker than the "warmed-up" run with the  
only partially filled cache (336 seconds, that's 108,5 MB/s).


But the run against the filled-up MDS cache, where most (if not all)  
entries are no match for our tar lookups, took 510 seconds - that's 71,5  
MB/s, instead of the roughly 100 MB/s when the cache was empty.


This is by far no precise benchmark test, indeed. But it at least  
seems to be an indicator that MDS cache misses are costly. (During the  
tests, only small amounts of changes in CephFS were likely -  
especially compared to the amount of reads and file lookups for their  
metadata.)


Regards,
Jens

PS: Why so much memory for MDS in the first place? Because during  
those (hourly) incremental backup runs, we got a large number of MDS  
warnings about insufficient cache pressure responses from clients.  
Increasing the MDS cache size did help to avoid these.
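
For reference, such an increase looks roughly like this in ceph.conf on the
MDS node (a sketch; in Luminous the memory-based limit is
mds_cache_memory_limit, while older releases use the inode-count based
mds_cache_size - the value has to fit the available RAM):

  [mds]
  mds_cache_memory_limit = 8589934592   # e.g. 8 GB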


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs Hadoop Plugin and CEPH integration

2017-11-29 Thread Orit Wasserman
On Wed, Nov 29, 2017 at 6:52 PM, Aristeu Gil Alves Jr
 wrote:
>> > Does s3 or swifta (for hadoop or spark) have integrated data-layout APIs
>> > for
>> > local processing data as have cephfs hadoop plugin?
>> >
>> With s3 and swift you won't have data locality as it was designed for
>> public cloud.
>> We recommend disable locality based scheduling in Hadoop when running
>> with those connectors.
>> There is on going work on to optimize those connectors to work with
>> object storage.
>> Hadoop community works on the s3a connector.
>> There is also https://github.com/SparkTC/stocator which is a swift
>> based connector IBM wrote  for their cloud.
>
>
>
> Assuming this cases, how would be a mapreduce process without data locality?
> How the processors get the data? Still there's the need to split the data,
> no?
The s3/swift storage splits the data.

> Doesn't it severely impact the performance of big files (not just the
> network)?
>
There is a Facebook research paper showing locality is not as good as
expected; if I remember correctly it was around 30%.
The users that use s3/swift with Hadoop are already using object
storage (for other purposes) or have a very, very big data set that fits
object storage better.

> --
> Aristeu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs Hadoop Plugin and CEPH integration

2017-11-29 Thread Orit Wasserman
On Wed, Nov 29, 2017 at 6:54 PM, Gregory Farnum  wrote:
> On Wed, Nov 29, 2017 at 8:52 AM Aristeu Gil Alves Jr 
> wrote:
>>>
>>> > Does s3 or swifta (for hadoop or spark) have integrated data-layout
>>> > APIs for
>>> > local processing data as have cephfs hadoop plugin?
>>> >
>>> With s3 and swift you won't have data locality as it was designed for
>>> public cloud.
>>> We recommend disable locality based scheduling in Hadoop when running
>>> with those connectors.
>>> There is on going work on to optimize those connectors to work with
>>> object storage.
>>> Hadoop community works on the s3a connector.
>>> There is also https://github.com/SparkTC/stocator which is a swift
>>> based connector IBM wrote  for their cloud.
>>
>>
>>
>> Assuming this cases, how would be a mapreduce process without data
>> locality?
>> How the processors get the data? Still there's the need to split the data,
>> no?
>> Doesn't it severely impact the performance of big files (not just the
>> network)?
>>
>
> Given that you already have your data in CephFS (and have been using it
> successfully for two years!), I'd try using its Hadoop plugin and seeing if
> it suits your needs. Trying a less-supported plugin is a lot easier than
> rolling out a new storage stack! :)

completely agree :)

> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Matthew Vernon
Hi,

We have a 3,060 OSD ceph cluster (running Jewel
10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
host), and having ops blocking on it for some time. It will then behave
for a bit, and then go back to doing this.

It's always the same OSD, and we've tried replacing the underlying disk.

The logs have lots of entries of the form

2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15

I've had a brief poke through the collectd metrics for this osd (and
comparing them with other OSDs on the same host) but other than showing
spikes in latency for that OSD (iostat et al show no issues with the
underlying disk) there's nothing obviously explanatory.

I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
is what googling for the above message suggests), but that just said
"unchangeable", and didn't seem to make any difference.

Any ideas? Other metrics to consider? ...

Thanks,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange error on link() for nfs over cephfs

2017-11-29 Thread Patrick Donnelly
On Wed, Nov 29, 2017 at 3:44 AM, Jens-U. Mozdzen  wrote:
> Hi *,
>
> we recently have switched to using CephFS (with Luminous 12.2.1). On one
> node, we're kernel-mounting the CephFS (kernel 4.4.75, openSUSE version) and
> export it via kernel nfsd. As we're transitioning right now, a number of
> machines still auto-mount users home directories from that nfsd.

You need to try a newer kernel as there have been many fixes since 4.4
which probably have not been backported to your distribution's kernel.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs Hadoop Plugin and CEPH integration

2017-11-29 Thread Gregory Farnum
On Wed, Nov 29, 2017 at 8:52 AM Aristeu Gil Alves Jr 
wrote:

> > Does s3 or swifta (for hadoop or spark) have integrated data-layout APIs
>> for
>> > local processing data as have cephfs hadoop plugin?
>> >
>> With s3 and swift you won't have data locality as it was designed for
>> public cloud.
>> We recommend disable locality based scheduling in Hadoop when running
>> with those connectors.
>> There is on going work on to optimize those connectors to work with
>> object storage.
>> Hadoop community works on the s3a connector.
>> There is also https://github.com/SparkTC/stocator which is a swift
>> based connector IBM wrote  for their cloud.
>>
>
>
> Assuming this cases, how would be a mapreduce process without data
> locality?
> How the processors get the data? Still there's the need to split the data,
> no?
> Doesn't it severely impact the performance of big files (not just the
> network)?
>
>
Given that you already have your data in CephFS (and have been using it
successfully for two years!), I'd try using its Hadoop plugin and seeing if
it suits your needs. Trying a less-supported plugin is a lot easier than
rolling out a new storage stack! :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs Hadoop Plugin and CEPH integration

2017-11-29 Thread Aristeu Gil Alves Jr
>
> > Does s3 or swifta (for hadoop or spark) have integrated data-layout APIs
> for
> > local processing data as have cephfs hadoop plugin?
> >
> With s3 and swift you won't have data locality as it was designed for
> public cloud.
> We recommend disable locality based scheduling in Hadoop when running
> with those connectors.
> There is on going work on to optimize those connectors to work with
> object storage.
> Hadoop community works on the s3a connector.
> There is also https://github.com/SparkTC/stocator which is a swift
> based connector IBM wrote  for their cloud.
>


Assuming this cases, how would be a mapreduce process without data
locality?
How the processors get the data? Still there's the need to split the data,
no?
Doesn't it severely impact the performance of big files (not just the
network)?

--
Aristeu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs Hadoop Plugin and CEPH integration

2017-11-29 Thread Orit Wasserman
Hi,

On Wed, Nov 29, 2017 at 5:32 PM, Aristeu Gil Alves Jr
 wrote:
> Orit,
>
> As I mentioned, I have cephfs in production for almost two years.
> Can I use this installed filesystem or I need to start from scratch? If the
> first is true, is there any tutorial that you recommend on adding s3 on an
> installed base, or to ceph in general?

Radosgw is the service that provides S3- and Swift-compatible object storage:
http://docs.ceph.com/docs/master/radosgw/
You can use your existing Ceph cluster (monitors and OSDs) but will
need to add the radosgw daemon.
It will have its own separate pools.
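
For example, with ceph-deploy adding a gateway is roughly (a sketch; the node
name is a placeholder):

  ceph-deploy install --rgw gw-node1
  ceph-deploy rgw create gw-node1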

> Does s3 or swifta (for hadoop or spark) have integrated data-layout APIs for
> local processing data as have cephfs hadoop plugin?
>
With s3 and swift you won't have data locality, as they were designed for
public cloud.
We recommend disabling locality-based scheduling in Hadoop when running
with those connectors.
There is ongoing work to optimize those connectors to work with
object storage.
The Hadoop community works on the s3a connector.
There is also https://github.com/SparkTC/stocator which is a Swift-based
connector IBM wrote for their cloud.

> Sorry for my lack of knowledge on the matter. As I was exclusively a CephFS
> user, I didn't touch RGW yet. Gonna learn everything now. Any hint is going
> to be welcome.
>

CephFS is great and depending on your dataset and workload it may be
the right storage for you :)

Cheers,
Orit

>
> Thanks and regards,
> Aristeu
>
> 2017-11-29 4:19 GMT-02:00 Orit Wasserman :
>>
>> On Tue, Nov 28, 2017 at 7:26 PM, Aristeu Gil Alves Jr
>>  wrote:
>> > Greg and Donny,
>> >
>> > Thanks for the answers. It helped a lot!
>> >
>> > I just watched the swifta presentation and it looks quite good!
>> >
>>
>> I would highly recommend using s3a and not swifta as it is much more
>> mature and is more used.
>>
>> Cheers,
>> Orit
>>
>> > Due the lack of updates/development, and the fact that we can choose
>> > spark
>> > also, I think maybe swift/swifta with ceph is a good strategy too.
>> > I need to study it more, tho.
>> >
>> > Can I get the same results (performance and integrated data-layout APIs)
>> > with it?
>> >
>> > Is there a migration cases/tutorials from a cephfs to a swift with ceph
>> > scenario that you could suggest?
>> >
>> > Best regards,
>> > --
>> > Aristeu
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs Hadoop Plugin and CEPH integration

2017-11-29 Thread Aristeu Gil Alves Jr
Orit,

As I mentioned, I have cephfs in production for almost two years.
Can I use this installed filesystem or I need to start from scratch? If the
first is true, is there any tutorial that you recommend on adding s3 on an
installed base, or to ceph in general?
Does s3 or swifta (for hadoop or spark) have integrated data-layout APIs
for local processing data as have cephfs hadoop plugin?

Sorry for my lack of knowledge on the matter. As I was exclusively a CephFS
user, I didn't touch RGW yet. Gonna learn everything now. Any hint is going
to be welcome.


Thanks and regards,
Aristeu

2017-11-29 4:19 GMT-02:00 Orit Wasserman :

> On Tue, Nov 28, 2017 at 7:26 PM, Aristeu Gil Alves Jr
>  wrote:
> > Greg and Donny,
> >
> > Thanks for the answers. It helped a lot!
> >
> > I just watched the swifta presentation and it looks quite good!
> >
>
> I would highly recommend using s3a and not swifta as it is much more
> mature and is more used.
>
> Cheers,
> Orit
>
> > Due the lack of updates/development, and the fact that we can choose
> spark
> > also, I think maybe swift/swifta with ceph is a good strategy too.
> > I need to study it more, tho.
> >
> > Can I get the same results (performance and integrated data-layout APIs)
> > with it?
> >
> > Is there a migration cases/tutorials from a cephfs to a swift with ceph
> > scenario that you could suggest?
> >
> > Best regards,
> > --
> > Aristeu
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-29 Thread Jason Dillaman
We experienced this problem in the past on older (pre-Jewel) releases
where a PG split that affected the RBD header object would result in
the watch getting lost by librados. Any chance you know if the
affected RBD header objects were involved in a PG split? Can you
generate a gcore dump of one of the affected VMs and ceph-post-file it
for analysis?
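
For example, roughly (PID and output path are placeholders):

  gcore -o /tmp/qemu-vm.core <pid-of-qemu-process>
  ceph-post-file /tmp/qemu-vm.core.<pid>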

As for the VM going R/O, that is the expected behavior when a client
breaks the exclusive lock held by a (dead) client.

On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
> Hi,
>
> On a OpenStack environment I encountered a VM which went into R/O mode after 
> a RBD snapshot was created.
>
> Digging into this I found 10s (out of thousands) RBD images which DO have a 
> running VM, but do NOT have a watcher on the RBD image.
>
> For example:
>
> $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
>
> 'Watchers: none'
>
> The VM is however running since September 5th 2017 with Jewel 10.2.7 on the 
> client.
>
> In the meantime the cluster was already upgraded to 10.2.10
>
> Looking further I also found a Compute node with 10.2.10 installed which also 
> has RBD images without watchers.
>
> Restarting or live migrating the VM to a different host resolves this issue.
>
> The internet is full of posts where RBD images still have Watchers when 
> people don't expect them, but in this case I'm expecting a watcher which 
> isn't there.
>
> The main problem right now is that creating a snapshot potentially puts a VM 
> in Read-Only state because of the lack of notification.
>
> Has anybody seen this as well?
>
> Thanks,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-29 Thread Logan Kuhn
We've seen this.  Our environment isn't identical though, we use oVirt and 
connect to ceph (11.2.1) via cinder (9.2.1), but it's so rare that we've never 
had any luck pinpointing it, and we have a lot fewer VMs, <300.

Regards,
Logan

- On Nov 29, 2017, at 7:48 AM, Wido den Hollander w...@42on.com wrote:

| Hi,
| 
| On a OpenStack environment I encountered a VM which went into R/O mode after a
| RBD snapshot was created.
| 
| Digging into this I found 10s (out of thousands) RBD images which DO have a
| running VM, but do NOT have a watcher on the RBD image.
| 
| For example:
| 
| $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
| 
| 'Watchers: none'
| 
| The VM is however running since September 5th 2017 with Jewel 10.2.7 on the
| client.
| 
| In the meantime the cluster was already upgraded to 10.2.10
| 
| Looking further I also found a Compute node with 10.2.10 installed which also
| has RBD images without watchers.
| 
| Restarting or live migrating the VM to a different host resolves this issue.
| 
| The internet is full of posts where RBD images still have Watchers when people
| don't expect them, but in this case I'm expecting a watcher which isn't there.
| 
| The main problem right now is that creating a snapshot potentially puts a VM 
in
| Read-Only state because of the lack of notification.
| 
| Has anybody seen this as well?
| 
| Thanks,
| 
| Wido
| ___
| ceph-users mailing list
| ceph-users@lists.ceph.com
| http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-29 Thread Wido den Hollander
Hi,

On a OpenStack environment I encountered a VM which went into R/O mode after a 
RBD snapshot was created.

Digging into this I found 10s (out of thousands) RBD images which DO have a 
running VM, but do NOT have a watcher on the RBD image.

For example:

$ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086

'Watchers: none'

The VM is however running since September 5th 2017 with Jewel 10.2.7 on the 
client.

In the meantime the cluster was already upgraded to 10.2.10

Looking further I also found a Compute node with 10.2.10 installed which also 
has RBD images without watchers.

Restarting or live migrating the VM to a different host resolves this issue.

The internet is full of posts where RBD images still have Watchers when people 
don't expect them, but in this case I'm expecting a watcher which isn't there.

The main problem right now is that creating a snapshot potentially puts a VM in 
Read-Only state because of the lack of notification.

Has anybody seen this as well?

Thanks,

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Mounting a second Ceph file system

2017-11-29 Thread Daniel Baumann
On 11/29/17 00:06, Nigel Williams wrote:
> Are their opinions on how stable multiple filesystems per single Ceph
> cluster is in practice?

we're using a single cephfs in production since february, and switched
to three cephfs in september - without any problem so far (running 12.2.1).

workload is backend for smb, hpc number crunching, and running generic
linux containers on it.

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Mounting a second Ceph file system

2017-11-29 Thread Yan, Zheng
On Wed, Nov 29, 2017 at 7:06 AM, Nigel Williams
 wrote:
> On 29 November 2017 at 01:51, Daniel Baumann  wrote:
>> On 11/28/17 15:09, Geoffrey Rhodes wrote:
>>> I'd like to run more than one Ceph file system in the same cluster.
>
> Are their opinions on how stable multiple filesystems per single Ceph
> cluster is in practice? is anyone using it actively with a stressful
> load?
>

It should be very stable, we haven't seen a bug in this area for a long time.

> I see the docs still place it under Experimental:
>
> http://docs.ceph.com/docs/master/cephfs/experimental-features/#multiple-filesystems-within-a-ceph-cluster
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk is now deprecated

2017-11-29 Thread Yoann Moulin
Le 27/11/2017 à 14:36, Alfredo Deza a écrit :
> For the upcoming Luminous release (12.2.2), ceph-disk will be
> officially in 'deprecated' mode (bug fixes only). A large banner with
> deprecation information has been added, which will try to raise
> awareness.
> 
> We are strongly suggesting using ceph-volume for new (and old) OSD
> deployments. The only current exceptions to this are encrypted OSDs
> and FreeBSD systems
> 
> Encryption support is planned and will be coming soon to ceph-volume.
> 
> A few items to consider:
> 
> * ceph-disk is expected to be fully removed by the Mimic release
> * Existing OSDs are supported by ceph-volume. They can be "taken over" [0]
> * ceph-ansible already fully supports ceph-volume and will soon default to it
> * ceph-deploy support is planned and should be fully implemented soon
> 
> 
> [0] http://docs.ceph.com/docs/master/ceph-volume/simple/
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Is it possible to update the "add-or-rm-osds" documentation to also include the 
process with ceph-volume? That would help adoption.

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/

This page should be updated as well with the ceph-volume commands.

http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/

Documentation (at least for master, maybe for luminous) should keep both 
options (ceph-disk and ceph-volume) but with a warning message to
encourage people to use ceph-volume instead of ceph-disk.

I agree with comments here that say changing the status of ceph-disk to 
deprecated in a minor release is not what I expect from a stable storage
system, but I also understand the necessity to move forward with ceph-volume 
(and bluestore). I think keeping ceph-disk in mimic is necessary,
even though there is no update, just for compatibility with old scripts.
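
For reference, the "taken over" path from [0] looks roughly like this for an
existing ceph-disk OSD (a sketch; device, OSD id and fsid are placeholders and
the flags may differ between releases):

  ceph-volume simple scan /dev/sdb1        # writes the OSD metadata to /etc/ceph/osd/<id>-<fsid>.json
  ceph-volume simple activate <id> <fsid>  # enables systemd units so the OSD no longer relies on ceph-disk/udev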

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Transparent huge pages

2017-11-29 Thread German Anders
Is it possible that in Ubuntu, with kernel version 4.12.14 at least, this
parameter already defaults to [madvise]?
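
A quick way to check what the running kernel actually uses (the active value
is shown in brackets):

  cat /sys/kernel/mm/transparent_hugepage/enabled
  # e.g.: always [madvise] never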



*German*

2017-11-28 12:07 GMT-03:00 Nigel Williams :

> Given that memory is a key resource for Ceph, this advice about switching
> Transparent Huge Pages kernel setting to madvise would be worth testing to
> see if THP is helping or hindering.
>
> Article:
> https://blog.nelhage.com/post/transparent-hugepages/
>
> Discussion:
> https://news.ycombinator.com/item?id=15795337
>
>
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] strange error on link() for nfs over cephfs

2017-11-29 Thread Jens-U. Mozdzen

Hi *,

we have recently switched to using CephFS (with Luminous 12.2.1). On  
one node, we're kernel-mounting the CephFS (kernel 4.4.75, openSUSE  
version) and export it via the kernel nfsd. As we're transitioning right  
now, a number of machines still auto-mount users' home directories from  
that nfsd.


A strange error, which was not present when the same nfsd exported  
local-disk-based file systems, has recently surfaced. The  
problem is most visible to the user when doing an ssh-keygen operation  
to remove old keys from their "known_hosts", but it seems likely that  
this error will occur in other constellations, too.


The error report from "ssh_keygen" is:

--- cut here ---
user@host:~> ssh-keygen -R somehost -f /home/user/.ssh/known_hosts
# Host somehost found: line 232
link /home/user/.ssh/known_hosts to /home/user/.ssh/known_hosts.old:  
Not a directory

user@host:~>
--- cut here ---

This error persists... until the user lists the contents of the  
directory containing the "known_hosts" file (~/.ssh). Once that is  
done (i.e. "ls -l ~/.ssh"), ssh_keygen works as expected.


We've strace'd ssh_keygen and see the following steps (and more, of course):

- the original known_hosts file is opened successfully
- a temp file is created in .ssh (successfully)
- a previous backup copy (known_hosts.old) is unlink()ed (not  
successful, since not present)

- a link() from known_hosts to known_hosts.old is tried - ENOTDIR

--- cut here ---
[...]
unlink("/home/user/.ssh/known_hosts.old") = -1 ENOENT (No such file or  
directory)
link("/home/user/.ssh/known_hosts", "/home/user/.ssh/known_hosts.old")  
= -1 ENOTDIR (Not a directory)

--- cut here ---

Once the directory was listed, the link() call works nicely:

--- cut here ---
unlink("/home/user/.ssh/known_hosts.old") = -1 ENOENT (No such file or  
directory)

link("/home/user/.ssh/known_hosts", "/home/user/.ssh/known_hosts.old") = 0
rename("/home/user/.ssh/known_hosts.5trpXBpIgB",  
"/home/user/.ssh/known_hosts") = 0

--- cut here ---

When link() returns an error, the rename is not called, leaving the  
user with one temporary file per attempt in .ssh - they never get  
renamed.
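
The failing call should also be reproducible from a plain shell, roughly (a
sketch on an NFS client whose home directory lives on the CephFS-backed
export):

  cd ~/.ssh
  rm -f known_hosts.old
  ln known_hosts known_hosts.old   # fails with "Not a directory" until the directory has been listed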


This does sound like a bug to me, has anybody else stumbled across  
similar symptoms as well?


Regards,
Jens


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-29 Thread Maged Mokhtar
Hi German, 

I would personally prefer to use rados bench / fio, which are more common,
to benchmark the cluster first, and then later do MySQL-specific tests using
sysbench. Another thing is to run the client test simultaneously on more
than one machine and aggregate/add the performance numbers of each; the
limitation can be caused by client-side resources, which could be
stressed differently based on the different storage backends you tried. 
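
For example, roughly (pool and image names are placeholders; the rbd engine
requires fio to be built with rbd support):

  rados bench -p testpool 60 write -t 16 -b 4096
  fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin --pool=testpool \
      --rbdname=testimage --rw=randwrite --bs=4k --iodepth=32 --direct=1 \
      --runtime=60 --time_based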

Maged 

On 2017-11-28 21:20, German Anders wrote:

> Don't know if there's any statistics available really, but Im running some 
> sysbench tests with mysql before the changes and the idea is to run those 
> tests again after the 'tuning' and see if numbers get better in any way, also 
> I'm gathering numbers from some collectd and statsd collectors running on the 
> osd nodes so, I hope to get some info about that :) 
> 
> GERMAN 
> 2017-11-28 16:12 GMT-03:00 Marc Roos :
> 
>> I was wondering if there are any statistics available that show the
>> performance increase of doing such things?
>> 
>> -Original Message-
>> From: German Anders [mailto:gand...@despegar.com]
>> Sent: dinsdag 28 november 2017 19:34
>> To: Luis Periquito
>> Cc: ceph-users
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>> 
>> Thanks a lot Luis, I agree with you regarding the CPUs, but
>> unfortunately those were the best CPU model that we can afford :S
>> 
>> For the NUMA part, I manage to pinned the OSDs by changing the
>> /usr/lib/systemd/system/ceph-osd@.service file and adding the
>> CPUAffinity list to it. But, this is for ALL the OSDs to specific nodes
>> or specific CPU list. But I can't find the way to specify a list for
>> only a specific number of OSDs.
>> 
>> Also, I notice that the NVMe disks are all on the same node (since I'm
>> using half of the shelf - so the other half will be pinned to the other
>> node), so the lanes of the NVMe disks are all on the same CPU (in this
>> case 0). Also, I find that the IB adapter that is mapped to the OSD
>> network (osd replication) is pinned to CPU 1, so this will cross the QPI
>> path.
>> 
>> And for the memory, from the other email, we are already using the
>> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of
>> 134217728
>> 
>> In this case I can pinned all the actual OSDs to CPU 0, but in the near
>> future when I add more nvme disks to the OSD nodes, I'll definitely need
>> to pinned the other half OSDs to CPU 1, someone already did this?
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-28 6:36 GMT-03:00 Luis Periquito :
>> 
>> There are a few things I don't like about your machines... If you
>> want latency/IOPS (as you seemingly do) you really want the highest
>> frequency CPUs, even over number of cores. These are not too bad, but
>> not great either.
>> 
>> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA
>> nodes? Ideally OSD is pinned to same NUMA node the NVMe device is
>> connected to. Each NVMe device will be running on PCIe lanes generated
>> by one of the CPUs...
>> 
>> What versions of TCMalloc (or jemalloc) are you running? Have you
>> tuned them to have a bigger cache?
>> 
>> These are from what I've learned using filestore - I've yet to run
>> full tests on bluestore - but they should still apply...
>> 
>> On Mon, Nov 27, 2017 at 5:10 PM, German Anders
>>  wrote:
>> 
>> Hi Nick,
>> 
>> yeah, we are using the same nvme disk with an additional
>> partition to use as journal/wal. We double check the c-state and it was
>> not configure to use c1, so we change that on all the osd nodes and mon
>> nodes and we're going to make some new tests, and see how it goes. I'll
>> get back as soon as get got those tests running.
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-27 12:16 GMT-03:00 Nick Fisk :
>> 
>> From: ceph-users
>> [mailto:ceph-users-boun...@lists.ceph.com
>>  ] On Behalf Of German Anders
>> Sent: 27 November 2017 14:44
>> To: Maged Mokhtar 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance
>> tuning
>> 
>> Hi Maged,
>> 
>> Thanks a lot for the response. We try with different
>> number of threads and we're getting almost the same kind of difference
>> between the storage types. Going to try with different rbd stripe size,
>> object size values and see if we get more competitive numbers. Will get
>> back with more tests and param changes to see if we get better :)
>> 
>> Just to echo a couple of comments. Ceph will always
>> struggle to match the performance of a traditional array for mainly 2
>> reasons.
>> 
>> 1.  You are replacing some sort of dual ported SAS or
>> internally RDMA connected device with a network for Ceph replication
>> traffic. This will instantly have a large impact on write latency
>> 2.  Ceph locks at the PG