Re: [ceph-users] EXT: ceph-lvm - a tool to deploy OSDs from LVM volumes

2017-06-16 Thread Warren Wang - ISD
I would prefer that this is something more generic, to possibly support other 
backends one day, like ceph-volume. Creating one tool per backend seems silly.

Also, ceph-lvm seems to imply that ceph itself has something to do with lvm, 
which it really doesn’t. This is simply to deal with the underlying disk. If 
there’s resistance to something more generic like ceph-volume, then it should 
at least be called something like ceph-disk-lvm.

2 cents from one of the LVM for Ceph users,
Warren Wang
Walmart ✻

On 6/16/17, 10:25 AM, "ceph-users on behalf of Alfredo Deza" 
 wrote:

Hello,

At the last CDM [0] we talked about `ceph-lvm` and the ability to
deploy OSDs from logical volumes. We have now an initial draft for the
documentation [1] and would like some feedback.

The important features for this new tool are:

* parting ways with udev (new approach will rely on LVM functionality
for discovery)
* compatibility/migration for existing LVM volumes deployed as directories
* dmcache support
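
For context, a rough sketch of the kind of LVM layout such a tool would
consume (these are plain LVM commands, not ceph-lvm itself; device names
and sizes are only examples, and dm-cache syntax varies by LVM version):

# one whole disk backing one OSD-sized logical volume
pvcreate /dev/sdb
vgcreate ceph-osd-vg /dev/sdb
lvcreate -n osd-data -l 100%FREE ceph-osd-vg

# optional dm-cache: add a fast partition to the VG and attach it as a cache
vgextend ceph-osd-vg /dev/nvme0n1p1
lvcreate --type cache-pool -L 100G -n osd-cache ceph-osd-vg /dev/nvme0n1p1
lvconvert --type cache --cachepool ceph-osd-vg/osd-cache ceph-osd-vg/osd-data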

By documenting the API and workflows first we are making sure that
those look fine before starting on actual development.

It would be great to get some feedback, especially if you are currently
using LVM with ceph (or planning to!).

Please note that the documentation is not complete and is missing
content on some parts.

[0] http://tracker.ceph.com/projects/ceph/wiki/CDM_06-JUN-2017
[1] http://docs.ceph.com/ceph-lvm/




Re: [ceph-users] EXT: Re: Intel power tuning - 30% throughput performance increase

2017-05-08 Thread Warren Wang - ISD
We also noticed a tremendous gain in latency by limiting C-states with
processor.max_cstate=1 intel_idle.max_cstate=0. We went from over 1ms of
latency for 4KB writes to well under that (roughly 0.7ms, from memory). I will
note that we did not have as much of a problem on Intel v3 processors, but on
v4 processors our low-queue-depth, single-threaded write performance dropped
tremendously. I don't recall the exact figure, but it was much worse than just
a 30% loss compared to a v3 processor with default C-states. We only saw a
small bump in power usage as well.

Bumping the CPU frequency up offered a small performance improvement as well.
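
For anyone wanting to try the same thing, a minimal sketch of pinning
C-states via kernel boot parameters (Debian/Ubuntu-style GRUB shown; file
locations and commands are assumptions for your distro):

# /etc/default/grub  (append to your existing GRUB_CMDLINE_LINUX options)
GRUB_CMDLINE_LINUX="processor.max_cstate=1 intel_idle.max_cstate=0"

# regenerate the bootloader config and reboot
update-grub          # RHEL/CentOS: grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

# verify the parameters took effect
cat /proc/cmdline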

Warren Wang
Walmart ✻

On 5/3/17, 3:43 AM, "ceph-users on behalf of Dan van der Ster" 
 wrote:

Hi Blair,

We use cpu_dma_latency=1, because it was in the latency-performance profile.
And indeed by setting cpu_dma_latency=0 on one of our OSD servers,
powertop now shows the package as 100% in turbo mode.

So I suppose we'll pay for this performance boost in energy.
But more importantly, can the CPU survive being in turbo 100% of the time?

-- Dan



On Wed, May 3, 2017 at 9:13 AM, Blair Bethwaite
 wrote:
> Hi all,
>
> We recently noticed that despite having BIOS power profiles set to
> performance on our RHEL7 Dell R720 Ceph OSD nodes, that CPU frequencies
> never seemed to be getting into the top of the range, and in fact spent a
> lot of time in low C-states despite that BIOS option supposedly disabling
> C-states.
>
> After some investigation this C-state issue seems to be relatively common,
> apparently the BIOS setting is more of a config option that the OS can
> choose to ignore. You can check this by examining
> /sys/module/intel_idle/parameters/max_cstate - if this is >1 and you *think*
> C-states are disabled then your system is messing with you.
>
> Because the contemporary Intel power management driver
> (https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt) now
> limits the proliferation of OS level CPU power profiles/governors, the only
> way to force top frequencies is to either set kernel boot command line
> options or use the /dev/cpu_dma_latency, aka pmqos, interface.
>
> We did the latter using the pmqos_static.py, which was previously part of
> the RHEL6 tuned latency-performance profile, but seems to have been dropped
> in RHEL7 (don't yet know why), and in any case the default tuned profile is
> throughput-performance (which does not change cpu_dma_latency). You can find
> the pmqos-static.py script here
> https://github.com/NetSys/NetBricks/blob/master/scripts/tuning/pmqos-static.py.
>
> After setting `./pmqos-static.py cpu_dma_latency=0` across our OSD nodes we
> saw a conservative 30% increase in backfill and recovery throughput - now
> when our main RBD pool of 900+ OSDs is backfilling we expect to see ~22GB/s,
> previously that was ~15GB/s.
>
> We have just got around to opening a case with Red Hat regarding this as at
> minimum Ceph should probably be actively using the pmqos interface and tuned
> should be setting this with recommendations for the latency-performance
> profile in the RHCS install guide. We have done no characterisation of it on
> Ubuntu yet, however anecdotally it looks like it has similar issues on the
> same hardware.
>
> Merry xmas.
>
> Cheers,
> Blair
>




Re: [ceph-users] Migrate OSD Journal to SSD

2016-12-02 Thread Warren Wang - ISD
I’ve actually had to migrate every single journal in many clusters from one 
(horrible) SSD model to a better SSD. It went smoothly. You’ll also need to 
update your /var/lib/ceph/osd/ceph-*/journal_uuid file. 

Honestly, the only challenging part was mapping and automating the back and 
forth conversion from /dev/sd* to the uuid for the corresponding osd.  I would 
share the script, but it was at my previous employer.
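
The general shape of it is easy enough to reconstruct, though; a hedged
sketch (OSD id, device names and partition numbers below are placeholders):

OSD=12
NEWJ=/dev/sdb2                                  # new journal partition

ceph osd set noout
systemctl stop ceph-osd@$OSD                    # or: service ceph stop osd.$OSD
ceph-osd -i $OSD --flush-journal

# record the new partition's UUID, then repoint the symlink via a stable path
PUUID=$(blkid -o value -s PARTUUID $NEWJ)
echo $PUUID > /var/lib/ceph/osd/ceph-$OSD/journal_uuid
ln -sf /dev/disk/by-partuuid/$PUUID /var/lib/ceph/osd/ceph-$OSD/journal

ceph-osd -i $OSD --mkjournal
systemctl start ceph-osd@$OSD
ceph osd unset noout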

Warren Wang
Walmart ✻

On 12/1/16, 7:26 PM, "ceph-users on behalf of Christian Balzer" 
 wrote:

On Thu, 1 Dec 2016 18:06:38 -0600 Reed Dier wrote:

> Apologies if this has been asked dozens of times before, but most answers
> are from pre-Jewel days, and want to double check that the methodology still
> holds.
> 
It does.

> Currently have 16 OSD's across 8 machines with on-disk journals, created
> using ceph-deploy.
> 
> These machines have NVMe storage (Intel P3600 series) for the system
> volume, and am thinking about carving out a partition for SSD journals for
> the OSD's. The drives don't make tons of use of the local storage, so should
> have plenty of io overhead to support the OSD journaling, as well as the
> P3600 should have the endurance to handle the added write wear.
>
Slight disconnect there, money for a NVMe (which size?) and on disk
journals? ^_-
 
> From what I've read, you need a partition per OSD journal, so with the
> probability of a third (and final) OSD being added to each node, I should
> create 3 partitions, each ~8GB in size (is this a good value? 8TB OSD's, is
> the journal size based on size of data or number of objects, or something
> else?).
> 
Journal size is unrelated to the OSD per se, with default parameters and
HDDs for OSDs a size of 10GB would be more than adequate, the default of
5GB would do as well.
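
(For reference, that corresponds to something like the following in
ceph.conf, value in MB, set before the journal is created:)

[osd]
osd journal size = 10240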

> So:
> {create partitions}
> set noout
> service ceph stop osd.$i
> ceph-osd -i osd.$i —flush-journal
> rm -f rm -f /var/lib/ceph/osd//journal
Typo and there should be no need for -f. ^_^

> ln -s  /var/lib/ceph/osd//journal /dev/
Even though in your case with a single(?) NVMe there is little chance for
confusion, ALWAYS reference devices by their UUID or similar; I prefer
the ID:
---
lrwxrwxrwx   1 root root   44 May 21  2015 journal ->
/dev/disk/by-id/wwn-0x55cd2e404b73d570-part4
---


> ceph-osd -i osd.$i -mkjournal
> service ceph start osd.$i
> ceph osd unset noout
> 
> Does this logic appear to hold up?
> 
Yup.

Christian

> Appreciate the help.
> 
> Thanks,
> 
> Reed

-- 
Christian Balzer                Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/





Re: [ceph-users] Adding second interface to storage network - issue

2016-12-01 Thread Warren Wang - ISD
Jumbo frames on the cluster network have been used by quite a few operators
without any problems. Admittedly, I've not run it that way in a year now, but
we plan on switching back to jumbo frames for the cluster network.

I do agree that jumbo frames on the public network could result in poor
behavior from clients if you're not careful.
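
If you do go jumbo, a quick sanity check is worth it before trusting the
path (interface name and peer address below are placeholders):

ip link set dev bond0 mtu 9000
ip link show bond0 | grep mtu

# verify end-to-end with don't-fragment pings
# (8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header)
ping -M do -s 8972 -c 3 10.0.40.12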

Warren Wang
Walmart ✻

From: ceph-users  on behalf of John Petrini 

Date: Wednesday, November 30, 2016 at 1:09 PM
To: Mike Jacobacci 
Cc: ceph-users 
Subject: Re: [ceph-users] Adding second interface to storage network - issue

Yes, that should work. Though I'd be wary of increasing the MTU to 9000, as
this could introduce other issues. Jumbo frames don't provide a very significant
performance increase so I wouldn't recommend it unless you have a very good 
reason to make the change. If you do want to go down that path I'd suggest 
getting LACP configured on all of the nodes before upping the MTU and even then 
make sure you understand the requirement of a larger MTU size before 
introducing it on your network.



On Wed, Nov 30, 2016 at 1:01 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
Hi John,

Thanks that makes sense... So I take it if I use the same IP for the bond, I 
shouldn't run into the issues I ran into last night?

Cheers,
Mike

On Wed, Nov 30, 2016 at 9:55 AM, John Petrini <jpetr...@coredial.com> wrote:
For redundancy I would suggest bonding the interfaces using LACP that way both 
ports are combined under the same interface with the same IP. They will both 
send and receive traffic and if one link goes down the other continues to work. 
The ports will need to be configured for LACP on the switch as well.
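
As a rough sketch of what that looks like with ifupdown on Ubuntu (interface
names and the address are placeholders; your switch must be configured to
match):

auto bond0
iface bond0 inet static
    address 10.0.40.11
    netmask 255.255.255.0
    bond-slaves enp3s0f0 enp3s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1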



On Wed, Nov 30, 2016 at 12:15 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
I ran into an interesting issue last night when I tried to add a second storage 
interface.  The original 10gb storage interface on the OSD node was only set at 
1500 MTU, so the plan was to bump it to 9000 and configure the second interface 
the same way with a diff IP and reboot. Once I did that, for some reason the 
original interface showed active but would not respond to ping from the other 
OSD nodes, the second interface I added came up and was reachable.  So even 
though the node could still communicate to the others on the second interface, 
PG's would start remapping and would get stuck at about 300 (of 1024).  I 
resolved the issue by changing the config back on the original interface and 
disabling the second.  After a Reboot, PG's recovered very quickly.

It seemed that the remapping would only go partially because the first node 
could reach the others, but they couldn't reach the original interface and 
didn't use the newly added second. So for my questions:

Is there a proper way to add an additional 

Re: [ceph-users] osd crash - disk hangs

2016-12-01 Thread Warren Wang - ISD
You’ll need to upgrade your kernel. It’s a terrible div by zero bug that occurs 
while trying to calculate load. You can still use "top -b -n1" instead of ps,
but ultimately the kernel update fixed it for us. You can’t kill procs that are 
in uninterruptible wait.

Here’s the Ubuntu version: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1568729
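
If you just need to see what is wedged in the meantime, a hedged sketch that
avoids ps by reading /proc directly (field 3 of /proc/PID/stat is the task
state, "D" meaning uninterruptible sleep); no guarantees on an already
broken kernel:

for s in /proc/[0-9]*/stat; do
    awk '$3 == "D" {print "pid", $1, $2}' "$s" 2>/dev/null
done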

Warren Wang
Walmart ✻

From: ceph-users  on behalf of VELARTIS 
Philipp Dürhammer 
Date: Thursday, December 1, 2016 at 7:19 AM
To: "'ceph-users@lists.ceph.com'" 
Subject: [ceph-users] osd crash - disk hangs

Hello!

Tonight I had an OSD crash. See the dump below. Also, this OSD is still
mounted. What's the cause? A bug? What should I do next? I can't run lsof or
ps ax because it hangs.

Thank You!

Dec  1 00:31:30 ceph2 kernel: [17314369.493029] divide error:  [#1] SMP
Dec  1 00:31:30 ceph2 kernel: [17314369.493062] Modules linked in: act_police 
cls_basic sch_ingress sch_htb vhost_net vhost macvtap macvlan 8021q garp mrp 
veth nfsv3 softdog ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables xt_mac ipt_REJECT nf_reject_ipv4 xt_NFLOG 
nfnetlink_log xt_physdev nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_tcpudp 
xt_addrtype xt_multiport xt_conntrack xt_set xt_mark ip_set_hash_net ip_set 
nfnetlink iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd 
grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs libcrc32c 
ipmi_ssif mxm_wmi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd snd_pcm snd_timer snd soundcore pcspkr 
input_leds sb_edac shpchp edac_core mei_me ioatdma mei lpc_ich i2c_i801 ipmi_si 
8250_fintek wmi ipmi_msghandler mac_hid nf_conntrack_ftp nf_conntrack autofs4 
ses enclosure hid_generic usbmouse usbkbd usbhid hid ixgbe(O) vxlan 
ip6_udp_tunnel megaraid_sas udp_tunnel isci ahci libahci libsas igb(O) 
scsi_transport_sas dca ptp pps_core fjes
Dec  1 00:31:30 ceph2 kernel: [17314369.493708] CPU: 1 PID: 17291 Comm: 
ceph-osd Tainted: G   O4.4.8-1-pve #1
Dec  1 00:31:30 ceph2 kernel: [17314369.493754] Hardware name: Thomas-Krenn.AG 
X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Dec  1 00:31:30 ceph2 kernel: [17314369.493799] task: 881f6ff05280 ti: 
880037c4c000 task.ti: 880037c4c000
Dec  1 00:31:30 ceph2 kernel: [17314369.493843] RIP: 0010:[]  
[] task_numa_find_cpu+0x23d/0x710
Dec  1 00:31:30 ceph2 kernel: [17314369.493893] RSP: :880037c4fbd8  
EFLAGS: 00010257
Dec  1 00:31:30 ceph2 kernel: [17314369.493919] RAX:  RBX: 
880037c4fc80 RCX: 
Dec  1 00:31:30 ceph2 kernel: [17314369.493962] RDX:  RSI: 
88103fa4 RDI: 881033f50c00
Dec  1 00:31:30 ceph2 kernel: [17314369.494006] RBP: 880037c4fc48 R08: 
000202046ea8 R09: 036b
Dec  1 00:31:30 ceph2 kernel: [17314369.494049] R10: 007c R11: 
0540 R12: 88064fbd
Dec  1 00:31:30 ceph2 kernel: [17314369.494093] R13: 0250 R14: 
0540 R15: 0009
Dec  1 00:31:30 ceph2 kernel: [17314369.494136] FS:  7ff17dd6c700() 
GS:88103fa4() knlGS:
Dec  1 00:31:30 ceph2 kernel: [17314369.494182] CS:  0010 DS:  ES:  
CR0: 80050033
Dec  1 00:31:30 ceph2 kernel: [17314369.494209] CR2: 7ff17dd6aff8 CR3: 
001025e4b000 CR4: 001426e0
Dec  1 00:31:30 ceph2 kernel: [17314369.494252] Stack:
Dec  1 00:31:30 ceph2 kernel: [17314369.494273]  880037c4fbe8 
81038219 003f 00017180
Dec  1 00:31:30 ceph2 kernel: [17314369.494323]  881f6ff05280 
00017180 0251 ffe7
Dec  1 00:31:30 ceph2 kernel: [17314369.494374]  0251 
881f6ff05280 880037c4fc80 00cb
Dec  1 00:31:30 ceph2 kernel: [17314369.494424] Call Trace:
Dec  1 00:31:30 ceph2 kernel: [17314369.494449]  [] ? 
sched_clock+0x9/0x10
Dec  1 00:31:30 ceph2 kernel: [17314369.494476]  [] 
task_numa_migrate+0x4e6/0xa00
Dec  1 00:31:30 ceph2 kernel: [17314369.494506]  [] ? 
copy_to_iter+0x7c/0x260
Dec  1 00:31:30 ceph2 kernel: [17314369.494534]  [] 
numa_migrate_preferred+0x79/0x80
Dec  1 00:31:30 ceph2 kernel: [17314369.494563]  [] 
task_numa_fault+0x848/0xd10
Dec  1 00:31:30 ceph2 kernel: [17314369.494591]  [] ? 
should_numa_migrate_memory+0x59/0x130
Dec  1 00:31:30 ceph2 kernel: [17314369.494623]  [] 
handle_mm_fault+0xc64/0x1a20
Dec  1 00:31:30 ceph2 kernel: [17314369.494654]  [] ? 
SYSC_recvfrom+0x144/0x160
Dec  1 00:31:30 ceph2 kernel: [17314369.494684]  [] 
__do_page_fault+0x19d/0x410
Dec  1 00:31:30 ceph2 kernel: [17314369.494713]  [] ? 
exit_to_usermode_loop+0xb0/0xd0
Dec  1 00:31:30 ceph2 kernel: [17314369.494742]  [] 
do_page_fault+0x22/0x30
Dec  1 00:31:30 ceph2 kernel: [17314369.494771]  [] 
page_fau

Re: [ceph-users] osd down detection broken in jewel?

2016-11-30 Thread Warren Wang - ISD
FYI - Setting min down reports to 10 is somewhat risky. Unless you have a 
really large cluster, I would advise turning that down to 5 or lower. In a past 
life, we used to run that number higher on super dense nodes, but we found that 
it would result in some instances where legitimately down OSDs did not have 
enough peers to exceed the min down reporters.
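
(For reference, the relevant ceph.conf knob, with a value along those lines:)

[global]
mon osd min down reporters = 5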

Warren Wang
Walmart ✻


From: ceph-users  on behalf of John Petrini 

Date: Wednesday, November 30, 2016 at 9:24 AM
To: Manuel Lausch 
Cc: Ceph Users 
Subject: Re: [ceph-users] osd down detection broken in jewel?

It's right there in your config.

mon osd report timeout = 900

See: http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
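
To confirm what the monitors are actually running with, and to change it
without a restart, something along these lines should work (admin socket
and injectargs as in Jewel; the mon id and example value are assumptions):

ceph daemon mon.$(hostname -s) config get mon_osd_report_timeout
ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reporters

# runtime change, e.g.:
ceph tell mon.* injectargs '--mon_osd_report_timeout 300'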



On Wed, Nov 30, 2016 at 6:39 AM, Manuel Lausch <manuel.lau...@1und1.de> wrote:
Hi,

In a test with ceph jewel we tested how long the cluster needs to detect and 
mark down OSDs after they are killed (with kill -9). The result -> 900 seconds.

In Hammer this took about 20 - 30 seconds.

In the log file from the leader monitor there are a lot of messages like
2016-11-30 11:32:20.966567 7f158f5ab700  0 log_channel(cluster) log [DBG] : 
osd.7 10.78.43.141:8120/106673 reported failed 
by osd.272 10.78.43.145:8106/117053
A deeper look at this shows that a lot of OSDs reported this exactly once. In
Hammer, the OSDs reported a down OSD a few more times.

Finally there is the following, and the OSD is marked down.
2016-11-30 11:36:22.633253 7f158fdac700  0 log_channel(cluster) log [INF] : 
osd.7 marked down after no pg stats for 900.982893seconds

In my ceph.conf I have the following lines in the global section
mon osd min down reporters = 10
mon osd min down reports = 3
mon osd report timeout = 900

It seems the parameter "mon osd min down reports" was removed in Jewel but the
documentation was not updated ->
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/


Can someone tell me how Ceph Jewel detects down OSDs and marks them down in an
appropriate time?


The Cluster:
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
24 hosts á 60 OSDs -> 1440 OSDs
2 pool with replication factor 4
65536 PGs
5 Mons

--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: 
www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Frank Einhellinger, Thomas Ludwig, Jan Oetjen


Member of United Internet






Re: [ceph-users] OSDs going down during radosbench benchmark

2016-09-12 Thread Warren Wang - ISD
Hi Tom, a few things you can check into. Some of these depend on how many
OSDs you're trying to run on a single chassis.

# up pid_max, otherwise you may run out of the ability to spawn new threads
kernel.pid_max=4194303

# up available memory for sudden bursts, like during benchmarking
vm.min_free_kbytes = <value>

In ceph.conf:

max_open_files = <32K or more>

# make sure you have enough ephemeral port range for the number of OSDs
ms bind port min = 6800
ms bind port max = 9000

You may need to up your network tuning as well, but it's less likely to
cause these sorts of problems. Watch your netstat -s for clues.
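
To make the sysctl side of that persistent across reboots, a small sketch
(file name and the min_free_kbytes value are examples; 4194304 KB = 4 GB):

cat > /etc/sysctl.d/90-ceph-tuning.conf <<'EOF'
kernel.pid_max = 4194303
vm.min_free_kbytes = 4194304
EOF
sysctl --system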

Warren Wang



On 9/12/16, 12:44 PM, "ceph-users on behalf of Deneau, Tom"
 wrote:

>Trying to understand why some OSDs (6 out of 21) went down in my cluster
>while running a CBT radosbench benchmark.  From the logs below, is this a
>networking problem between systems, or is it some kind of FileStore
>problem.
>
>Looking at one crashed OSD log, I see the following crash error:
>
>2016-09-09 21:30:29.757792 7efc6f5f1700 -1 FileStore: sync_entry timed
>out after 600 seconds.
> ceph version 10.2.1-13.el7cp (f15ca93643fee5f7d32e62c3e8a7016c1fc1e6f4)
>
>just before that I see things like:
>
>2016-09-09 21:18:07.391760 7efc755fd700 -1 osd.12 165 heartbeat_check: no
>reply from osd.6 since back 2016-09-09 21:17:47.261601 front 2016-09-09
>21:17:47.261601 (cutoff 2016-09-09 21:17:47.391758)
>
>and also
>
>2016-09-09 19:03:45.788327 7efc53905700  0 -- 10.0.1.2:6826/58682 >>
>10.0.1.1:6832/19713 pipe(0x7efc8bfbc800 sd=65 :52000 s=1 pgs=12 cs=1 l=0\
> c=0x7efc8bef5b00).connect got RESETSESSION
>
>and many warnings for slow requests.
>
>
>All the other osds that died seem to have died with:
>
>2016-09-09 19:11:01.663262 7f2157e65700 -1 common/HeartbeatMap.cc: In
>function 'bool ceph::HeartbeatMap::_check(const
>ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f2157e65700 time
>2016-09-09 19:11:01.660671
>common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
>
>-- Tom Deneau, AMD
>
>
>
>
>



Re: [ceph-users] rgw meta pool

2016-09-09 Thread Warren Wang - ISD
A little extra context here. Currently the metadata pool looks like it is
on track to exceed the number of objects in the data pool, over time. In a
brand new cluster, we're already up to almost 2 million in each pool.

NAME                      ID  USED   %USED  MAX AVAIL  OBJECTS
default.rgw.buckets.data  17  3092G  0.86   345T       2013585
default.rgw.meta          25  743M   0      172T       1975937

We're concerned this will be unmanageable over time.
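
For tracking that ratio over time, something as simple as this does the job
(standard CLI, pool names as above):

ceph df detail | egrep 'NAME|default.rgw.meta|default.rgw.buckets.data'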

Warren Wang


On 9/9/16, 10:54 AM, "ceph-users on behalf of Pavan Rallabhandi"
 wrote:

>Any help on this is much appreciated; I am considering fixing this, given
>it's a confirmed issue, unless I am missing something obvious.
>
>Thanks,
>-Pavan.
>
>On 9/8/16, 5:04 PM, "ceph-users on behalf of Pavan Rallabhandi"
>prallabha...@walmartlabs.com> wrote:
>
>Trying it one more time on the users list.
>
>In our clusters running Jewel 10.2.2, I see default.rgw.meta pool
>running into large number of objects, potentially to the same range of
>objects contained in the data pool.
>
>I understand that the immutable metadata entries are now stored in
>this heap pool, but I couldn't reason out why the metadata objects are
>left in this pool even after the actual bucket/object/user deletions.
>
>The put_entry() promptly seems to be storing the same in the heap
>pool 
>https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880,
>but I do not see them to be reaped ever. Are they left there for some
>reason?
>
>Thanks,
>-Pavan.
>
>



Re: [ceph-users] radosgw ignores rgw_frontends? (10.2.2)

2016-08-05 Thread Warren Wang - ISD
It works for us. Here's what ours looks like:

rgw frontends = civetweb port=80 num_threads=50

From netstat:
tcp0  0 0.0.0.0:80  0.0.0.0:*   LISTEN
 4010203/radosgw

Warren Wang



On 7/28/16, 7:20 AM, "ceph-users on behalf of Zoltan Arnold Nagy"

wrote:

>Hi,
>
>I just did a test deployment using ceph-deploy rgw create 
>after which I've added
>
>[client.rgw.c11n1]
>rgw_frontends = "civetweb port=80"
>
>to the config.
>
>Using show-config I can see that it's there:
>
>root@c11n1:~# ceph --id rgw.c11n1 --show-config | grep civet
>debug_civetweb = 1/10
>rgw_frontends = civetweb port=80
>root@c11n1:~#
>
>However, radosgw ignores it:
>
>root@c11n1:~# netstat -anlp | grep radosgw
>tcp0  0 IP:48514   IP:6800ESTABLISHED
>29879/radosgw
>tcp0  0 IP:47484   IP:6789 ESTABLISHED
>29879/radosgw
>unix  2  [ ACC ] STREAM LISTENING 720517   29879/radosgw
> /var/run/ceph/ceph-client.rgw.c11n1.asok
>root@c11n1:~#
>
>I've removed the key under /var/lib/ceph and copied it under /etc/ceph,
>then added the keyring configuration after, which is read and used by
>radosgw.
>
>Any ideas how I could debug this further?
>Is there a debug option that shows me which configuration file the
>settings are being read from?
>
>I've been launching it for debugging purposes like this:
>usr/bin/radosgw --cluster=ceph -c /etc/ceph/ceph.conf --id rgw.c11n1 -d
>--setuser ceph --setgroup ceph --debug_rgw='20/20' --debug_client='20/20'
>--debug_civetweb='20/20' --debug_asok='20/20' --debug_auth='20/20'
>--debug-rgw=20/20
>
>Thanks,
>Zoltan



Re: [ceph-users] Bad performance when two fio write to the same image

2016-08-04 Thread Warren Wang - ISD
Wow, thanks. I think that's the tidbit of info I needed to explain why
increasing numjobs doesn't (anymore) scale performance as expected.
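
For reference, a hedged sketch of the rbd commands for the exclusive-lock
handling Jason describes below (pool/image names are placeholders; features
that depend on the lock, e.g. object-map/fast-diff, must be disabled first):

# create a new image meant to be written by multiple concurrent clients
rbd create --size 10240 --image-shared rbd/shared-img

# or drop the feature on an existing image
rbd feature disable rbd/test-img exclusive-lock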

Warren Wang



On 8/4/16, 7:49 AM, "ceph-users on behalf of Jason Dillaman"
 wrote:

>With exclusive-lock, only a single client can have write access to the
>image at a time. Therefore, if you are using multiple fio processes
>against the same image, they will be passing the lock back and forth
>between each other and you can expect bad performance.
>
>If you have a use-case where you really need to share the same image
>between multiple concurrent clients, you will need to disable the
>exclusive-lock feature (this can be done with the RBD cli on existing
>images or by passing "--image-shared" when creating new images).
>
>On Thu, Aug 4, 2016 at 5:52 AM, Alexandre DERUMIER 
>wrote:
>> Hi,
>>
>> I think this is because of the exclusive-lock feature, enabled by default
>> since Jewel on RBD images
>>
>>
>> - Mail original -
>> De: "Zhiyuan Wang" 
>> À: "ceph-users" 
>> Envoyé: Jeudi 4 Août 2016 11:37:04
>> Objet: [ceph-users] Bad performance when two fio write to the same image
>>
>>
>>
>> Hi Guys
>>
>> I am testing the performance of Jewel (10.2.2) with FIO, but found the
>>performance would drop dramatically when two process write to the same
>>image.
>>
>> My environment:
>>
>> 1. Server:
>>
>> One mon and four OSDs running on the same server.
>>
>> Intel P3700 400GB SSD which have 4 partitions, and each for one osd
>>journal (journal size is 10GB)
>>
>> Inter P3700 400GB SSD which have 4 partitions, and each format to XFS
>>for one osd data (each data is 90GB)
>>
>> 10GB network
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2660 (it is not the bottleneck)
>>
>> Memory: 256GB (it is not the bottleneck)
>>
>> 2. Client
>>
>> 10GB network
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2660 (it is not the bottleneck)
>>
>> Memory: 256GB (it is not the bottleneck)
>>
>> 3. Ceph
>>
>> Default configuration except using the async messenger (have tried the
>> simple messenger, got nearly the same result)
>>
>> 10GB image with 256 pg num
>>
>> Test Case
>>
>> 1. One Fio process: bs 4KB; iodepth 256; direct 1; ioengine rbd;
>>randwrite
>>
>> The performance is nearly 60MB/s and IOPS is nearly 15K
>>
>> Four osd are nearly the same busy
>>
>> 2. Two Fio process: bs 4KB; iodepth 256; direct 1; ioengine rbd;
>>randwrite (write to the same image)
>>
>> The performance is nearly 4MB/s each, and IOPS is nearly 1.5K each
>>Terrible
>>
>> And I found that only one osd is busy, the other three are much more
>>idle on CPU
>>
>> And I also run FIO on two clients, the same result
>>
>> 3. Two Fio process: bs 4KB; iodepth 256; direct 1; ioengine rbd
>>randwrite (one to image1, one to image2)
>>
>> The performance is nearly 35MB/s each and IOPS is nearly 8.5K each
>>Reasonable
>>
>> Four osd are nearly the same busy
>>
>>
>>
>>
>>
>> Could someone help to explain the reason of TEST 2
>>
>>
>>
>> Thanks
>>
>>
>
>
>
>-- 
>Jason



Re: [ceph-users] I use fio with randwrite io to ceph image , it's run 2000 IOPS in the first time , and run 6000 IOPS in second time

2016-08-03 Thread Warren Wang - ISD
It's probably rbd cache taking effect. If you know all your clients are
well behaved, you could set "rbd cache writethrough until flush" to false
instead of the default true, but understand the ramifications. You could
also just do it during benchmarking.
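
For reference, the client-side setting looks something like this (only do
this if every client is known to issue proper flushes):

[client]
rbd cache = true
rbd cache writethrough until flush = false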

Warren Wang



From:  ceph-users  on behalf of
"m13913886...@yahoo.com" 
Reply-To:  "m13913886...@yahoo.com" 
Date:  Monday, August 1, 2016 at 11:30 PM
To:  Ceph-users 
Subject:  [ceph-users] I use fio with randwrite io to ceph image , it's
run 2000 IOPS in the first time , and run 6000 IOPS in second time



In version 10.2.2, fio initially runs at 2000 IOPS; then I interrupt fio,
run it again, and it runs at 6000 IOPS.

But in version 0.94, fio always runs at 6000 IOPS, with or without
repeated fio runs.


What is the difference between these two versions that explains this?


My config is as follows:

I have three nodes, and two OSDs per node; a total of six OSDs.
All OSDs are SSD disks.


Here is my ceph.conf of osd:

[osd]

osd mkfs type=xfs
osd data = /data/$name
osd_journal_size = 8
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 1048576

journal max write bytes = 1073714824
journal max write entries = 1
journal queue max ops = 5
journal queue max bytes = 1048576

osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4




Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Warren Wang - ISD
I've run the Mellanox 40 gig card, the ConnectX-3 Pro, but that's old now.
Back when I ran it, the drivers were kind of a pain to deal with in
Ubuntu, primarily during PXE. It should be better now though.

If you have the network to support it, 25GbE is quite a bit cheaper per
port, and won't be so hard to drive. 40GbE is very hard to fill. I
personally probably would not do 40 again.

Warren Wang



On 7/13/16, 9:10 AM, "ceph-users on behalf of Götz Reinicke - IT
Koordinator"  wrote:

>Am 13.07.16 um 14:59 schrieb Joe Landman:
>>
>>
>> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
>>> 40Gbps can be used as 4*10Gbps
>>>
>>> I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
>>> ports", but extented to "usage of more than a single 10Gbps port, eg
>>> 20Gbps etc too"
>>>
>>> Is there people here that are using more than 10G on an ceph server ?
>>
>> We have built, and are building Ceph units for some of our customers
>> with dual 100Gb links.  The storage box was one of our all flash
>> Unison units for OSDs.  Similarly, we have several customers actively
>> using multiple 40GbE links on our 60 bay Unison spinning rust disk
>> (SRD) box.
>>
>Now we get closer. Can you tell me which 40G Nic you use?
>
>/götz
>



Re: [ceph-users] Quick short survey which SSDs

2016-07-12 Thread Warren Wang - ISD
Our testing so far shows that it's a pretty good drive. We use it for the
actual backing OSD, but the journal is on NVMe. The raw results indicate
that it's a reasonable journal too, if you need to colocate, but you'll
exhaust write performance pretty quickly depending on your workload. We
also have them in large numbers. So far, so good.

Warren Wang





On 7/8/16, 1:37 PM, "ceph-users on behalf of Carlos M. Perez"
 wrote:

>I posted a bunch of the more recent numbers in the specs.  Had some down
>time and had a bunch of SSD's lying around and just curious if any were
>hidden gems... Interestingly, the Intel drives seem to not require the
>write cache off, while other drives had to be "forced" off using the
>hdparm -W0 /dev/sdx to make sure it was off.
>
>The machine we tested on is a Dell C2100 Dual x5560, 96GB ram, LSI2008 IT
>mode controller
>
>intel Dc S3700 200GB
>Model Number:   INTEL SSDSC2BA200G3L
>Firmware Revision:  5DV10265
>
>1 - io=4131.2MB, bw=70504KB/s, iops=17626, runt= 60001msec
>5 - io=9529.1MB, bw=162627KB/s, iops=40656, runt= 60001msec
>10 - io=7130.5MB, bw=121684KB/s, iops=30421, runt= 60004msec
>
>Samsung SM863
>Model Number:   SAMSUNG MZ7KM240HAGR-0E005
>Firmware Revision:  GXM1003Q
>
>1 - io=2753.1MB, bw=47001KB/s, iops=11750, runt= 6msec
>5 - io=6248.8MB, bw=106643KB/s, iops=26660, runt= 60001msec
>10 - io=8084.1MB, bw=137981KB/s, iops=34495, runt= 60001msec
>
>We decided to go with Intel model.  The Samsung was impressive on the
>higher end with multiple threads, but figured for most of our nodes with
>4-6 OSD's the intel were a bit more proven and had better "light-medium"
>load numbers.  
>
>Carlos M. Perez
>CMP Consulting Services
>305-669-1515
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>Dan van der Ster
>Sent: Tuesday, July 5, 2016 4:23 AM
>To: Christian Balzer 
>Cc: ceph-users 
>Subject: Re: [ceph-users] Quick short survey which SSDs
>
>On Tue, Jul 5, 2016 at 10:04 AM, Dan van der Ster 
>wrote:
>> On Tue, Jul 5, 2016 at 9:53 AM, Christian Balzer  wrote:
 Unfamiliar: Samsung SM863

>>> You might want to read the thread here:
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007871.html
>>>
>>> And google "ceph SM863".
>>>
>>> However I'm still waiting for somebody to confirm that these perform
>>> (as one would expect from DC level SSDs) at full speed with sync
>>> writes, which is the only important factor for journals.
>>
>> Tell me the fio options you're interested in and I'll run it right now.
>
>Using the options from Sebastien's blog I get:
>
>1 job: write: io=5863.3MB, bw=100065KB/s, iops=25016, runt= 60001msec
>5 jobs: write: io=11967MB, bw=204230KB/s, iops=51057, runt= 60001msec
>10 jobs: write: io=13760MB, bw=234829KB/s, iops=58707, runt= 60001msec
>
>Drive is model MZ7KM240 with firmware GXM1003Q.
>
>--
>Dan
>
>
>[1] fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k
>--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>--name=journal-test



Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-24 Thread Warren Wang - ISD
Oops, that reminds me, do you have min_free_kbytes set to something
reasonable like at least 2-4GB?
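
Something along these lines, for example (value is in KB; 4194304 = 4 GB):

sysctl -w vm.min_free_kbytes=4194304
echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf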

Warren Wang



On 6/24/16, 10:23 AM, "Wade Holler"  wrote:

>On the vm.vfs_cace_pressure = 1 :   We had this initially and I still
>think it is the best choice for most configs.  However with our large
>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>hitting an issue where our write response time would double; then a
>drop of caches would return response time to normal.  I don't claim to
>totally understand this and I only have speculation at the moment.
>Again thanks for this suggestion, I do think it is best for boxes that
>don't have very large memory.
>
>@ Christian - reformatting to btrfs or ext4 is an option in my test
>cluster.  I thought about that but needed to sort xfs first. (thats
>what production will run right now) You all have helped me do that and
>thank you again.  I will circle back and test btrfs under the same
>conditions.  I suspect that it will behave similarly but it's only a
>day and half's work or so to test.
>
>Best Regards,
>Wade
>
>
>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy 
>wrote:
>> Oops , typo , 128 GB :-)...
>>
>> -Original Message-
>> From: Christian Balzer [mailto:ch...@gol.com]
>> Sent: Thursday, June 23, 2016 5:08 PM
>> To: ceph-users@lists.ceph.com
>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>Development
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>of objects in pool
>>
>>
>> Hello,
>>
>> On Thu, 23 Jun 2016 22:24:59 + Somnath Roy wrote:
>>
>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>> *pin* inode/dentries in memory. We are using that for long now (with
>>> 128 TB node memory) and it seems helping specially for the random
>>> write workload and saving xattrs read in between.
>>>
>> 128TB node memory, really?
>> Can I have some of those, too? ^o^
>> And here I was thinking that Wade's 660GB machines were on the
>>excessive side.
>>
>> There's something to be said (and optimized) when your storage nodes
>>have the same or more RAM as your compute nodes...
>>
>> As for Warren, well spotted.
>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
>>fireworks if your memory is really needed elsewhere, while keeping
>>things in memory normally.
>>
>> Christian
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>> To: Wade Holler; Blair Bethwaite
>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>> of objects in pool
>>>
>>> vm.vfs_cache_pressure = 100
>>>
>>> Go the other direction on that. You'll want to keep it low to help
>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>
>>>
>>> Warren Wang
>>>
>>>
>>>
>>>
>>> On 6/22/16, 9:41 PM, "Wade Holler"  wrote:
>>>
>>> >Blairo,
>>> >
>>> >We'll speak in pre-replication numbers, replication for this pool is
>>>3.
>>> >
>>> >23.3 Million Objects / OSD
>>> >pg_num 2048
>>> >16 OSDs / Server
>>> >3 Servers
>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>>> >vm.vfs_cache_pressure = 100
>>> >
>>> >Workload is native librados with python.  ALL 4k objects.
>>> >
>>> >Best Regards,
>>> >Wade
>>> >
>>> >
>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>> > wrote:
>>> >> Wade, good to know.
>>> >>
>>> >> For the record, what does this work out to roughly per OSD? And how
>>> >> much RAM and how many PGs per OSD do you have?
>>> >>
>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>> >> RBD) it's better to increase default object size somewhat before
>>> >> pushing the split/merge up a lot...
>>> >>
>>> >> Cheers,
>>> >>
>>> >> On 23 June 2016 at 11:26, Wade Holler  wrote:
>>> >>> Bas

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-23 Thread Warren Wang - ISD
vm.vfs_cache_pressure = 100

Go the other direction on that. You'll want to keep it low to help keep
inode/dentry info in memory. We use 10, and haven't had a problem.
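
For example, applied live and persisted (standard sysctl, nothing
Ceph-specific):

sysctl -w vm.vfs_cache_pressure=10
echo 'vm.vfs_cache_pressure = 10' >> /etc/sysctl.conf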


Warren Wang




On 6/22/16, 9:41 PM, "Wade Holler"  wrote:

>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server
>vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python.  ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler  wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling).  Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer 
>>>wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
>>>>
>>>>> Sorry, late to the party here. I agree, up the merge and split
>>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>>> One of those things you just have to find out as an operator since
>>>>>it's
>>>>> not well documented :(
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>>
>>>>> We have over 200 million objects in this cluster, and it's still
>>>>>doing
>>>>> over 15000 write IOPS all day long with 302 spinning drives + SATA
>>>>>SSD
>>>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>>>> should also help.
>>>>>
>>>> Indeed.
>>>>
>>>> Since it was asked in that bug report and also my first suspicion, it
>>>> would probably be good time to clarify that it isn't the splits that
>>>>cause
>>>> the performance degradation, but the resulting inflation of dir
>>>>entries
>>>> and exhaustion of SLAB and thus having to go to disk for things that
>>>> normally would be in memory.
>>>>
>>>> Looking at Blair's graph from yesterday pretty much makes that clear,
>>>>a
>>>> purely split caused degradation should have relented much quicker.
>>>>
>>>>
>>>>> Keep in mind that if you change the values, it won't take effect
>>>>> immediately. It only merges them back if the directory is under the
>>>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>>>
>>>> If it's a read a plain scrub might do the trick.
>>>>
>>>> Christian
>>>>> Warren
>>>>>
>>>>>
>>>>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Wade Holler <wade.hol...@gmail.com>
>>>>> Date: Monday, June 20, 2016 at 2:48 PM
>>>>> To: Blair Bethwaite <blair.bethwa...@gmail.com>, Wido den Hollander <w...@42on.com>
>>>>> Cc: Ceph Development <ceph-de...@vger.kernel.org>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
>>>>>
>>>>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>>>>> tes

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-20 Thread Warren Wang - ISD
Sorry, late to the party here. I agree, up the merge and split thresholds. 
We're as high as 50/12. I chimed in on an RH ticket here. One of those things 
you just have to find out as an operator since it's not well documented :(

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

We have over 200 million objects in this cluster, and it's still doing over 
15000 write IOPS all day long with 302 spinning drives + SATA SSD journals. 
Having enough memory and dropping your vfs_cache_pressure should also help.

Keep in mind that if you change the values, it won't take effect immediately. 
It only merges them back if the directory is under the calculated threshold and 
a write occurs (maybe a read, I forget).
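
For reference, a sketch of the corresponding ceph.conf settings (assuming 50
is the merge threshold and 12 the split multiple; OSDs need a restart to
pick up the change):

[osd]
filestore merge threshold = 50
filestore split multiple = 12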

Warren


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Wade Holler <wade.hol...@gmail.com>
Date: Monday, June 20, 2016 at 2:48 PM
To: Blair Bethwaite <blair.bethwa...@gmail.com>, Wido den Hollander <w...@42on.com>
Cc: Ceph Development <ceph-de...@vger.kernel.org>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

Thanks everyone for your replies.  I sincerely appreciate it. We are testing 
with different pg_num and filestore_split_multiple settings.  Early indications 
are  well not great. Regardless it is nice to understand the symptoms 
better so we try to design around it.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).

Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

Seemed to cause lots of slow requests :-/. We corrected it about
12:30, then still took a while to settle.

--
Cheers,
~Blairo



[ceph-users] CFQ changes affect Ceph priority?

2016-02-05 Thread Warren Wang - ISD
Not sure how many folks use the CFQ scheduler to use Ceph IO priority, but 
there’s a CFQ change that probably needs to be evaluated for Ceph purposes.

http://lkml.iu.edu/hypermail/linux/kernel/1602.0/00820.html

This might be a better question for the dev list.
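
For context, the Ceph-side knobs that depend on CFQ look something like this
(disk name and values are examples):

# the OSD data disks must actually be on CFQ for the ioprio options to matter
cat /sys/block/sdb/queue/scheduler
echo cfq > /sys/block/sdb/queue/scheduler

# ceph.conf: lower the priority of the OSD disk thread (scrubbing etc.)
[osd]
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7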

Warren Wang



Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
I get where you are coming from, Jan, but for a test this small, I still
think checking network latency first for a single op is a good idea.

Given that the cluster is not being stressed, CPUs may be running slow. It
may also benefit the test to turn CPU governors to performance for all
cores.
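
Something like the following (sysfs path assumes the usual cpufreq layout;
cpupower may not be installed everywhere):

# pin all cores to the performance governor for the duration of the test
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# or, where available:
cpupower frequency-set -g performance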

Warren Wang




On 12/14/15, 5:07 PM, "Jan Schermer"  wrote:

>Even with 10G ethernet, the bottleneck is not the network, nor the drives
>(assuming they are datacenter-class). The bottleneck is the software.
>The only way to improve that is to either increase CPU speed (more GHz
>per core) or to simplify the datapath IO has to take before it is
>considered durable.
>Stuff like RDMA will help only if there so zero-copy between the (RBD)
>client and the drive, or if the write is acknowledged when in the remote
>buffers of replicas (but it still has to come from client directly or
>RDMA becomes a bit pointless, IMHO).
>
>Databases do sync writes for a reason, O_DIRECT doesn't actually make
>strong guarantees on ordering or buffering, though in practice the race
>condition is negligible.
>
>Your 600 IOPS are pretty good actually.
>
>Jan
>
>
>> On 14 Dec 2015, at 22:58, Warren Wang - ISD 
>>wrote:
>> 
>> Whoops, I misread Nikola's original email, sorry!
>> 
>> If all your SSDs are all performing at that level for sync IO, then I
>> agree that it's down to other things, like network latency and PG locking.
>> Sequential 4K writes with 1 thread and 1 qd is probably the worst
>> performance you'll see. Is there a router between your VM and the Ceph
>> cluster, or one between Ceph nodes for the cluster network?
>> 
>> Are you using dsync at the VM level to simulate what a database or other
>> app would do? If you can switch to directIO, you'll likely get far better
>> performance.
>> 
>> Warren Wang
>> 
>> 
>> 
>> 
>> On 12/14/15, 12:03 PM, "Mark Nelson"  wrote:
>> 
>>> 
>>> 
>>> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>>>> Hello,
>>>> 
>>>> i'm doing some measuring on test (3 nodes) cluster and see strange
>>>> performance
>>>> drop for sync writes..
>>>> 
>>>> I'm using SSD for both journalling and OSD. It should be suitable for
>>>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>>> 
>>>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>>> --group_reporting --name=journal-test)
>>>> 
>>>> On top of this cluster, I have running KVM guest (using qemu librbd
>>>> backend).
>>>> Overall performance seems to be quite good, but the problem is when I
>>>> try
>>>> to measure sync IO performance inside the guest.. I'm getting only
>>>> about 600IOPS,
>>>> which I think is quite poor.
>>>> 
>>>> The problem is, I don't see any bottlenect, OSD daemons don't seem to
>>>> be hanging on
>>>> IO, neither hogging CPU, qemu process is also not somehow too much
>>>> loaded..
>>>> 
>>>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>>> disabled,
>>>> 
>>>> my question is, what results I can expect for synchronous writes? I
>>>> understand
>>>> there will always be some performance drop, but 600IOPS on top of
>>>> storage which
>>>> can give as much as 16K IOPS seems to little..
>>> 
>>> So basically what this comes down to is latency.  Since you get 16K
>>>IOPS
>>> for O_DSYNC writes on the SSD, there's a good chance that it has a
>>> super-capacitor on board and can basically acknowledge a write as
>>> complete as soon as it hits the on-board cache rather than when it's
>>> written to flash.  Figure that for 16K O_DSYNC IOPs means that each IO
>>> is completing in around 0.06ms on average.  That's very fast!  At 600
>>> IOPs for O_DSYNC writes on your guest, you're looking at about 1.6ms
>>>per
>>> IO on average.
>>> 
>>> So how do we account for the difference?  Let's start out by looking at
>>> a quick example of network latency (This is between two random machines
>>> in one of our labs at Red Hat):
>>> 
>>>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>>>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>>>>

Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
Whoops, I misread Nikola¹s original email, sorry!

If all your SSDs are all performing at that level for sync IO, then I
agree that it's down to other things, like network latency and PG locking.
Sequential 4K writes with 1 thread and 1 qd are probably the worst
performance you'll see. Is there a router between your VM and the Ceph
cluster, or one between Ceph nodes for the cluster network?

Are you using dsync at the VM level to simulate what a database or other
app would do? If you can switch to directIO, you'll likely get far better
performance. 
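
As a rough illustration of that comparison (not from the original message; the
device path and runtimes are placeholders, and the test overwrites the target
device):

  # O_DIRECT only: bypasses the guest page cache, but writes may be acknowledged
  # before they are durable
  fio --name=direct-only --filename=/dev/vdb --direct=1 --sync=0 --rw=write \
      --bs=4k --numjobs=1 --iodepth=1 --runtime=30 --time_based

  # O_DIRECT + O_DSYNC: every write must be durable before it is acknowledged,
  # which is what exposes the full replication and flush latency
  fio --name=direct-dsync --filename=/dev/vdb --direct=1 --sync=1 --rw=write \
      --bs=4k --numjobs=1 --iodepth=1 --runtime=30 --time_based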

Warren Wang




On 12/14/15, 12:03 PM, "Mark Nelson"  wrote:

>
>
>On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>> Hello,
>>
>> i'm doing some measuring on test (3 nodes) cluster and see strange
>>performance
>> drop for sync writes..
>>
>> I'm using SSD for both journalling and OSD. It should be suitable for
>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>
>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>--group_reporting --name=journal-test)
>>
>> On top of this cluster, I have running KVM guest (using qemu librbd
>>backend).
>> Overall performance seems to be quite good, but the problem is when I
>>try
>> to measure sync IO performance inside the guest.. I'm getting only
>>about 600IOPS,
>> which I think is quite poor.
>>
>> The problem is, I don't see any bottleneck, OSD daemons don't seem to
>>be hanging on
>> IO, neither hogging CPU, qemu process is also not somehow too much
>>loaded..
>>
>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>disabled,
>>
>> my question is, what results I can expect for synchronous writes? I
>>understand
>> there will always be some performance drop, but 600IOPS on top of
>>storage which
>>can give as much as 16K IOPS seems too little..
>
>So basically what this comes down to is latency.  Since you get 16K IOPS
>for O_DSYNC writes on the SSD, there's a good chance that it has a
>super-capacitor on board and can basically acknowledge a write as
>complete as soon as it hits the on-board cache rather than when it's
>written to flash.  Figure that 16K O_DSYNC IOPS means that each IO
>is completing in around 0.06ms on average.  That's very fast!  At 600
>IOPs for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>IO on average.
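
As an aside (not part of the quoted message), that conversion is simply
per-IO latency in ms = 1000 / IOPS at queue depth 1:

  awk 'BEGIN { printf "16000 IOPS -> %.2f ms per IO;  600 IOPS -> %.2f ms per IO\n",
               1000/16000, 1000/600 }'
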
>
>So how do we account for the difference?  Let's start out by looking at
>a quick example of network latency (This is between two random machines
>in one of our labs at Red Hat):
>
>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>> 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
>> 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
>> 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>
>now consider that when you do a write in ceph, you write to the primary
>OSD which then writes out to the replica OSDs.  Every replica IO has to
>complete before the primary will send the acknowledgment to the client
>(ie you have to add the latency of the worst of the replica writes!).
>In your case, the network latency alone is likely dramatically
>increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to
>process crush mappings, look up directory and inode metadata on the
>filesystem where objects are stored (assuming it's not cached), and
>other processing time, and the 1.6ms latency for the guest writes starts
>to make sense.
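
A back-of-envelope budget along those lines (the numbers below are illustrative
assumptions, not measurements from this thread):

  # qd=1 replicated O_DSYNC write: client<->primary RTT + worst replica RTT
  # + SSD flush + OSD/filesystem processing
  awk 'BEGIN { rtt=0.25; ssd=0.06; osd=1.0; lat=2*rtt+ssd+osd;
               printf "~%.2f ms per IO -> ~%d IOPS at queue depth 1\n", lat, 1000/lat }'

Even with everything else perfect, the two serialized network round trips alone
put a hard ceiling on queue-depth-1 sync IOPS.
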
>
>Can we improve things?  Likely yes.  There's various areas in the code
>where we can trim latency away, implement alternate OSD backends, and
>potentially use alternate network technology like RDMA to reduce network
>latency.  The thing to remember is that when you are talking about
>O_DSYNC writes, even very small increases in latency can have dramatic
>effects on performance.  Every fraction of a millisecond has huge
>ramifications.
>
>>
>> Has anyone done similar measuring?
>>
>> thanks a lot in advance!
>>
>> BR
>>
>> nik
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
Which SSD are you using? Dsync flag will dramatically slow down most SSDs.
You've got to be very careful about the SSD you pick.
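
A common way to vet a candidate journal SSD for this (a sketch only; /dev/sdX is
a placeholder and the test destroys data on that device):

  for jobs in 1 2 4 8; do
    fio --name=dsync-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
        --bs=4k --numjobs=$jobs --iodepth=1 --runtime=30 --time_based --group_reporting
  done
  # consumer drives without power-loss-protected caches typically stay in the
  # hundreds of IOPS here; suitable journal SSDs reach tens of thousands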

Warren Wang




On 12/14/15, 5:49 AM, "Nikola Ciprich"  wrote:

>Hello,
>
>i'm doing some measuring on test (3 nodes) cluster and see strange
>performance
>drop for sync writes..
>
>I'm using SSD for both journalling and OSD. It should be suitable for
>journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>
>(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>--group_reporting --name=journal-test)
>
>On top of this cluster, I have running KVM guest (using qemu librbd
>backend).
>Overall performance seems to be quite good, but the problem is when I try
>to measure sync IO performance inside the guest.. I'm getting only about
>600IOPS,
>which I think is quite poor.
>
>The problem is, I don't see any bottleneck, OSD daemons don't seem to be
>hanging on
>IO, neither hogging CPU, qemu process is also not somehow too much
>loaded..
>
>I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>disabled,
>
>my question is, what results I can expect for synchronous writes? I
>understand
>there will always be some performance drop, but 600IOPS on top of storage
>which
>can give as much as 16K IOPS seems too little..
>
>Has anyone done similar measuring?
>
>thanks a lot in advance!
>
>BR
>
>nik
>
>
>-- 
>-
>Ing. Nikola CIPRICH
>LinuxBox.cz, s.r.o.
>28.rijna 168, 709 00 Ostrava
>
>tel.:   +420 591 166 214
>fax:+420 596 621 273
>mobil:  +420 777 093 799
>www.linuxbox.cz
>
>mobil servis: +420 737 238 656
>email servis: ser...@linuxbox.cz
>-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Sizing

2015-12-03 Thread Warren Wang - ISD
I would be a lot more conservative in terms of what a spinning drive can
do. The Mirantis presentation has pretty high expectations out of a
spinning drive, as they're somewhat ignoring latency (until the last few
slides). Look at the max latencies for anything above 1 QD on a spinning
drive.

If you factor in a latency requirement, the capability of the drives falls
dramatically. You might be able to offset this by using NVMe or something
as a cache layer between the journal and the OSD, using bcache, LVM cache,
etc. In much of the performance testing that we've done, the average isn't
too bad, but 90th percentile numbers tend to be quite bad. Part of it is
probably from locking PGs during a flush, and the other part is just the
nature of spinning drives.

I'd try to get a handle on expected workloads before picking the gear, but
if you have to pick before that, SSD if you have the budget :) You can
offset it a little by using erasure coding for the RGW portion, or using
spinning drives for that.

I think picking gear for Ceph is tougher than running an actual cluster :)
Best of luck. I think you're still starting with better, and more info
than some of us did years ago.

Warren Wang




From:  Sam Huracan 
Date:  Thursday, December 3, 2015 at 4:01 AM
To:  Srinivasula Maram 
Cc:  Nick Fisk , "ceph-us...@ceph.com"

Subject:  Re: [ceph-users] Ceph Sizing


I'm following this presentation of Mirantis team:
http://www.slideshare.net/mirantis/ceph-talk-vancouver-20

They calculate CEPH IOPS = Disk IOPS * HDD Quantity * 0.88 (4-8k random
read proportion)


And  VM IOPS = CEPH IOPS / VM Quantity

But if I use replication of 3, would the VM IOPS be divided by 3?
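
A hedged sketch of that arithmetic (the disk count and per-disk IOPS below are
placeholders, not figures from this thread; it assumes reads are served by the
primary OSD while every write lands on each replica, and it ignores journal
overhead):

  awk 'BEGIN { disk_iops=150; hdds=140; eff=0.88; repl=3; vms=700;
               cluster=disk_iops*hdds*eff;
               printf "cluster read IOPS ~ %d\n", cluster;
               printf "per-VM read IOPS  ~ %.1f\n", cluster/vms;
               printf "per-VM write IOPS ~ %.1f\n", cluster/(repl*vms) }'

So on this model reads are not divided by the replica count, but sustained
writes roughly are.
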


2015-12-03 7:09 GMT+07:00 Sam Huracan :

IO size is 4 KB, and I need a minimum, cost-optimized sizing.
I intend use SuperMicro Devices
http://www.supermicro.com/solutions/storage_Ceph.cfm


What do you think?


2015-12-02 23:17 GMT+07:00 Srinivasula Maram
:

One more factor we need to consider here is IO size(block size) to get
required IOPS, based on this we can calculate the bandwidth and design the
solution.

Thanks
Srinivas

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Nick Fisk
Sent: Wednesday, December 02, 2015 9:28 PM
To: 'Sam Huracan'; ceph-us...@ceph.com
Subject: Re: [ceph-users] Ceph Sizing

You've left out an important factor: cost. Otherwise I would just say
buy enough SSD to cover the capacity.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Sam Huracan
> Sent: 02 December 2015 15:46
> To: ceph-us...@ceph.com
> Subject: [ceph-users] Ceph Sizing
>
> Hi,
> I'm building a storage structure for OpenStack cloud System, input:
> - 700 VM
> - 150 IOPS per VM
> - 20 Storage per VM (boot volume)
> - Some VM run database (SQL or MySQL)
>
> I want to ask for a sizing plan for Ceph to satisfy the IOPS requirement;
> I list some factors considered:
> - Amount of OSD (SAS Disk)
> - Amount of Journal (SSD)
> - Amount of OSD Servers
> - Amount of MON Server
> - Network
> - Replica ( default is 3)
>
> I will divide to 3 pool with 3 Disk types: SSD, SAS 15k and SAS 10k
> Should I use all 3 disk types in one server or build dedicated servers
> for every pool? Example: 3 15k servers for Pool-1, 3 10k Servers for
>Pool-2.
>
> Could you help me with a formula to calculate the minimum devices needed
> for the above input?
>
> Thanks and regards.








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com












___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to hammer, crush tuneables issue

2015-11-24 Thread Warren Wang - ISD
You upgraded (and restarted as appropriate) all the clients first, right?
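
For the immediate client-IO pain during the rebalance, a commonly used
mitigation (not part of the original reply; check the option names against your
release) is to throttle recovery and backfill, or to step the tunables back
until a maintenance window:

  # runtime throttles; the values are illustrative
  ceph tell 'osd.*' injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
  # revert to the previous profile if the data movement has to stop now
  ceph osd crush tunables legacy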

Warren Wang





On 11/24/15, 10:52 AM, "Joe Ryner"  wrote:

>Hi,
>
>Last night I upgraded my cluster from Centos 6.5 -> Centos 7.1 and in the
>process upgraded from Emperor -> Firefly -> Hammer
>
>When I finished I changed the crush tunables from
>ceph osd crush tunables legacy -> ceph osd crush tunables optimal
>
>I knew this would cause data movement.  But the IO for my clients is
>unacceptable.  Can anyone please tell me what the best settings are for my
>configuration.  I have 2 Dell R720 Servers and 2 Dell R730 servers.  I
>have 36 1TB SATA SSD Drives in my cluster.  The servers have 128 GB of
>RAM.
>
>Below is some detail that might help.  According to my calculations the
>rebalance will take over a day.
>
>I would greatly appreciate some help on this.
>
>Thank you,
>
>Joe
>
>-
>BEGIN --
>NODE: gold.sys.cu.cait.org
>CMD : free -m 
>              total        used        free      shared  buff/cache   available
>Mem:         128726       71316        2077          20       55332       56767
>Swap:             0           0           0
>END   --
>BEGIN --
>NODE: gallo.sys.cu.cait.org
>CMD : free -m 
>              total        used        free      shared  buff/cache   available
>Mem:         128726       79489         462          36       48774       48547
>Swap:          8191           0        8191
>END   --
>BEGIN --
>NODE: hamms.sys.cu.cait.org
>CMD : free -m 
>              total        used        free      shared  buff/cache   available
>Mem:         128536       69412         659          19       58464       58342
>Swap:         16383           0       16383
>END   --
>BEGIN --
>NODE: helm.sys.cu.cait.org
>CMD : free -m 
>              total        used        free      shared  buff/cache   available
>Mem:         128536       66216        8799          26       53520       61603
>Swap:         16383        1739       14644
>END   --
>
>ceph osd tree
>ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 36.0 root default
> -3 36.0 rack curack
>-10  9.0 host helm
>  0  1.0 osd.0   up  1.0  1.0
>  1  1.0 osd.1   up  1.0  1.0
>  2  1.0 osd.2   up  1.0  1.0
>  3  1.0 osd.3   up  1.0  1.0
>  4  1.0 osd.4   up  1.0  1.0
>  5  1.0 osd.5   up  1.0  1.0
>  6  1.0 osd.6   up  1.0  1.0
>  7  1.0 osd.7   up  1.0  1.0
>  8  1.0 osd.8   up  1.0  1.0
> -7  9.0 host gold
> 16  1.0 osd.16  up  1.0  1.0
> 17  1.0 osd.17  up  1.0  1.0
> 18  1.0 osd.18  up  1.0  1.0
> 19  1.0 osd.19  up  1.0  1.0
> 20  1.0 osd.20  up  1.0  1.0
> 21  1.0 osd.21  up  1.0  1.0
>  9  1.0 osd.9   up  1.0  1.0
> 10  1.0 osd.10  up  1.0  1.0
> 34  1.0 osd.34  up  1.0  1.0
> -8  9.0 host gallo
> 22  1.0 osd.22  up  1.0  1.0
> 23  1.0 osd.23  up  1.0  1.0
> 24  1.0 osd.24  up  1.0  1.0
> 25  1.0 osd.25  up  1.0  1.0
> 26  1.0 osd.26  up  1.0  1.0
> 27  1.0 osd.27  up  1.0  1.0
> 11  1.0 osd.11  up  1.0  1.0
> 12  1.0 osd.12  up  1.0  1.0
> 35  1.0 osd.35  up  1.0  1.0
> -9  9.0 host hamms
> 13  1.0 osd.13  up  1.0  1.0
> 14  1.0 osd.14  up  1.0  1.0
> 15  1.0 osd.15  up  1.0  1.0
> 28  1.0 osd.28  up  1.0  1.0
> 29  1.0 osd.29  up  1.0  1.0
> 30  1.0 osd.30  up  1.0  1.0
> 31  1.0 osd.31  up  1.0  1.0
> 32  1.0   

Re: [ceph-users] Advised Ceph release

2015-11-18 Thread Warren Wang - ISD
If it’s your first prod cluster, and you have no hard requirements for 
Infernalis features, I would say stick with Hammer.

Warren

From: Bogdan SOLGA mailto:bogdan.so...@gmail.com>>
Date: Wednesday, November 18, 2015 at 1:58 PM
To: ceph-users mailto:ceph-users@lists.ceph.com>>
Cc: Calin Fatu mailto:calin_f...@yahoo.com>>
Subject: [ceph-users] Advised Ceph release

Hello, everyone!

We have recently set up a Ceph cluster running the Hammer release (v0.94.5), 
and we would like to know what is the advised release for preparing a 
production-ready cluster - the LTS version (Hammer) or the latest stable 
version (Infernalis)?

The cluster works properly (so far), and we're still not sure whether we should 
upgrade to Infernalis or not.

Thank you!

Regards,
Bogdan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-18 Thread Warren Wang - ISD
What were you using for iodepth and numjobs? If you’re getting an average of 
2ms per operation, and you’re single threaded, I’d expect about 500 IOPS / 
thread, until you hit the limit of your QEMU setup, which may be a single IO 
thread. That’s also what I think Mike is alluding to.
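
For example, a quick in-guest check along those lines (the device path and
numbers are placeholders; run it against a scratch volume):

  # if the parallel run scales while the qd=1 run stays around 500-600 IOPS, the
  # ceiling is per-thread latency / queue depth rather than the pool itself
  fio --name=qd1  --filename=/dev/vdb --direct=1 --rw=randwrite --bs=4k \
      --iodepth=1  --numjobs=1 --runtime=60 --time_based --group_reporting
  fio --name=qd32 --filename=/dev/vdb --direct=1 --rw=randwrite --bs=4k \
      --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting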

Warren

From: Sean Redmond mailto:sean.redmo...@gmail.com>>
Date: Wednesday, November 18, 2015 at 6:39 AM
To: "ceph-us...@ceph.com" 
mailto:ceph-us...@ceph.com>>
Subject: [ceph-users] All SSD Pool - Odd Performance

Hi,

I have a performance question for anyone running an SSD only pool. Let me 
detail the setup first.

12 X Dell PowerEdge R630 ( 2 X 2620v3 64Gb RAM)
8 X intel DC 3710 800GB
Dual port Solarflare 10GB/s NIC (one front and one back)
Ceph 0.94.5
Ubuntu 14.04 (3.13.0-68-generic)

The above is in one pool that is used for QEMU guests. A 4k fio test on the SSD 
directly yields around 55k IOPS, while the same test inside a QEMU guest seems 
to hit a limit around 4k IOPS. If I deploy multiple guests, they can all reach 
4k IOPS simultaneously.

I don't see any evidence of a bottleneck on the OSD hosts. Is this limit inside 
the guest expected, or am I just not looking deep enough yet?

Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and upgrading OS version

2015-10-21 Thread Warren Wang - ISD
Depending on how busy your cluster is, I’d nuke and pave node by node. You can 
slow the data movement off the old box, and also slow it on the way back in 
with weighting. My own personal preference, if you have performance overhead to 
spare.
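
A sketch of the weighting approach (the OSD ids and step size are examples, not
from this thread):

  # step the OSDs on the host being rebuilt down gradually so backfill does not
  # swamp client IO
  for id in 0 1 2 3; do
    ceph osd crush reweight osd.$id 0.5
  done
  # after the rebuilt node rejoins, raise the weights back toward their original
  # values in the same small steps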

Warren

From: Andrei Mikhailovsky mailto:and...@arhont.com>>
Date: Tuesday, October 20, 2015 at 3:05 PM
To: "ceph-us...@ceph.com" 
mailto:ceph-us...@ceph.com>>
Subject: [ceph-users] ceph and upgrading OS version

Hello everyone

I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am 
wondering if you have a recommended process of upgrading the OS version without 
causing any issues to the ceph cluster?

Many thanks

Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck unclean on a new pool despite the pool size reconfiguration

2015-10-02 Thread Warren Wang - ISD
You probably don’t want hashpspool automatically set, since your clients may 
still not understand that crush map feature. You can try to unset it for that 
pool and see what happens, or create a new pool without hashpspool enabled from 
the start.  Just a guess.
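
A hedged example of checking and clearing the flag on the existing pool (the
pool name comes from the thread below; whether older clients then peer cleanly
depends on your client and OSD versions):

  ceph osd dump | grep 'pool.*bench2'        # the pool line shows "flags hashpspool" if set
  ceph osd pool set bench2 hashpspool false  # clear it on the existing pool
  # new pools can be created without the flag via osd_pool_default_flag_hashpspool = false
  # in ceph.conf, if the release supports that option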

Warren

From: Giuseppe Civitella 
mailto:giuseppe.civite...@gmail.com>>
Date: Friday, October 2, 2015 at 10:05 AM
To: ceph-users mailto:ceph-us...@ceph.com>>
Subject: [ceph-users] pgs stuck unclean on a new pool despite the pool size 
reconfiguration

Hi all,
I have a Firefly cluster which has been upgraded from Emperor.
It has 2 OSD hosts and 3 monitors.
The cluster has default values for the size and min_size of the pools.
Once upgraded to Firefly, I created a new pool called bench2:
ceph osd pool create bench2 128 128
and set its sizes:
ceph osd pool set bench2 size 2
ceph osd pool set bench2 min_size 1

this is the state of the pools:
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 stripe_width 0
pool 3 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 384 pgp_num 384 last_change 2568 stripe_width 0
removed_snaps [1~75]
pool 4 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 384 pgp_num 384 last_change 1895 stripe_width 0
pool 8 'bench2' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 2580 flags hashpspool stripe_width 0

Despite this I still get a warning about 128 pgs stuck unclean.
The "ceph health detail" output shows me the stuck PGs, so I take one to get the 
involved OSDs:

pg 8.38 is stuck unclean since forever, current state active, last acting [22,7]

if I restart the OSD with id 22, the PG 8.38 gets an active+clean state.

This is incorrect behavior, AFAIK. The cluster should pick up the new size and 
min_size values without any manual intervention. So my question is: any idea why 
this happens and how to restore the default behavior? Do I need to restart all 
of the OSDs to restore a healthy state?

thanks a lot
Giuseppe

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph, SSD, and NVMe

2015-10-02 Thread Warren Wang - ISD
Since you didn't hear much from the successful crowd, I'll chime in. At my
previous employer, we ran some pretty large clusters (over 1PB)
successfully on Hammer. Some were upgraded from Firefly, and by no means
do I consider myself to be a developer. We totaled over 15 production
clusters. I'm not saying there weren't some rocky times, but they were
generally not directly due to Ceph code, but things ancillary to it, like
kernel bugs, customers driving traffic, hardware selection/failures, or
minor config issues. We never lost a cluster, though we did lose access to
them on occasion.

It does require you to stay up to date on what's going on with the
community, but I don't think that it's too different from OpenStack in
that regard. If support is a concern, there's always the Red Hat option,
or purchase a Ceph appliance like the Sandisk Infiniflash, which comes
with solid support from folks like Somnath.

FWIW, Hammer's write performance isn't awful. My coworker borrowed some
compute nodes, and ran a pretty large scale test with 400 SSDs across 50
nodes, and the results were pretty encouraging.

Warren

On 10/1/15, 10:01 PM, "J David"  wrote:

>This is all very helpful feedback, thanks so much.
>
>Also it sounds like you guys have done a lot of work on this, so
>thanks for that as well!
>
>Is Hammer generally considered stable enough for production in an
>RBD-only environment?  The perception around here is that the number
>of people who report lost data or inoperable clusters due to bugs in
>Hammer on this list is troubling enough to cause hesitation.  There's
>a specific term for overweighting the probability of catastrophic
>negative outcomes, and maybe that's what's happening.  People tend not
>to post to the list "Hey we have a cluster, it's running great!"
>instead waiting until things are not great, so the list paints an
>artificially depressing picture of stability.  But when we ask around
>quietly to other places we know running Ceph in production, which is
>admittedly a very small sample, they're all also still running
>Firefly.
>
>Admittedly, it doesn't help that "On my recommendation, we performed a
>non-reversible upgrade on the production cluster which, despite our
>best testing efforts, wrecked things causing us to lose 4 hours of
>data and requiring 2 days of downtime while we rebuilt the cluster and
>restored the backups" is pretty much guaranteed to be followed by,
>"You're fired."
>
>So, do medium-sized IT organizations (i.e. those without the resources
>to have a Ceph developer on staff) run Hammer-based deployments in
>production successfully?
>
>Please understand this is not meant to be sarcastic or critical of the
>project in any way.  Ceph is amazing, and we love it.  Some features
>of Ceph, like CephFS, have been considered not-production-quality for
>a long time, and that is to be expected.  These things are incredibly
>complex and take time to get right.  So organizations in our position
>just don't use that stuff.  As a relative outsider for whom the Ceph
>source code is effectively a foreign language, it's just *really* hard
>to tell if Hammer in general is in that same "still baking" category.
>
>Thanks!
>
>
>On Wed, Sep 30, 2015 at 3:33 PM, Somnath Roy 
>wrote:
>> David,
>> You should move to Hammer to get all the benefits of performance. It's
>>all added to Giant and migrated to the present hammer LTS release.
>> FYI, the focus so far has been on read performance improvement, and what we saw
>>in our environment with 6Gb SAS SSDs is that we are able to saturate the
>>drives bandwidth-wise from 64K blocks onwards. But with smaller blocks like 4K
>>we are not able to saturate the SAS SSD drives yet.
>> But, considering Ceph's scale out nature you can get some very good
>>numbers out of a cluster. For example, with 8 SAS SSD drives (in a JBOF)
>>and having 2 heads in front (So, a 2 node Ceph cluster) we are able to
>>hit ~300K Random read iops while 8 SSD aggregated performance would be
>>~400K. Not too bad. At this point we are saturating host cpus.
>> We have seen almost linear scaling if you add similar setups i.e adding
>>say ~3 of the above setup, you could hit ~900K RR iops. So, I would say
>>it is definitely there in terms read iops and more improvement are
>>coming.
>> But, the write path is very poor compared to read, and that's where the
>>problem is. Because, in the mainstream, no workload is 100% RR (IMO).
>>So, even if you have say 90-10 read/write, the performance numbers would
>>be ~6-7x slower.
>> So, it is very much dependent on your workload/application access
>>pattern and obviously the cost you are willing to spend.
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>Of Mark Nelson
>> Sent: Wednesday, September 30, 2015 12:04 PM
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph, SSD, and NVMe
>>
>> On 09/30/2015 09:34 AM, J David wrote:
>>> Because we have a good thing going,