Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread Udo Lembke
Hi,
if you add an SSD with a short lifetime on more than one server, you
can run into real trouble (data loss)!
Even if all other SSDs are enterprise grade.
Ceph mixes all data into PGs, which are spread over many disks - if one disk
fails, no problem, but if the next two fail shortly after that due to the high I/O
(recovery) you will have data loss.
But if you have only one node with consumer SSDs, the whole node can go
down without trouble...

I've tried consumer SSDs as journal a long time ago - it was a bad idea!
But these SSDs are cheap - buy one and do the I/O test.
If you monitor the lifetime (wear), it's perhaps possible for your setup.
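A sketch of how I would watch the wear (assuming smartmontools is installed;
the exact SMART attribute names differ per vendor/model):

smartctl -a /dev/sdX | egrep -i 'wear|percent.*used|total.*written'

If the wear counter climbs too fast for your write load, the disk is not worth it.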

Udo


On 19.12.19 at 20:20, Sinan Polat wrote:
> Hi all,
>
> Thanks for the replies. I am not worried about their lifetime. We will be 
> adding only 1 SSD disk per physical server. All SSD’s are enterprise drives. 
> If the added consumer grade disk will fail, no problem.
>
> I am more curious regarding their I/O performance. I do not want to have a 50% drop 
> in performance.
>
> So anyone any experience with 860 EVO or Crucial MX500 in a Ceph setup?
>
> Thanks!
>
>> On 19 Dec 2019 at 19:18, Mark Nelson  wrote the 
>> following:
>>
>> The way I try to look at this is:
>>
>>
>> 1) How much more do the enterprise grade drives cost?
>>
>> 2) What are the benefits? (Faster performance, longer life, etc)
>>
>> 3) How much does it cost to deal with downtime, diagnose issues, and replace 
>> malfunctioning hardware?
>>
>>
>> My personal take is that enterprise drives are usually worth it. There may 
>> be consumer grade drives that may be worth considering in very specific 
>> scenarios if they still have power loss protection and high write 
>> durability.  Even when I was in academia years ago with very limited 
>> budgets, we got burned with consumer grade SSDs to the point where we had to 
>> replace them all.  You have to be very careful and know exactly what you are 
>> buying.
>>
>>
>> Mark
>>
>>
>>> On 12/19/19 12:04 PM, jes...@krogh.cc wrote:
>>> I dont think “usually” is good enough in a production setup.
>>>
>>>
>>>
>>> Sent from myMail for iOS
>>>
>>>
>>> Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов 
>>> :
>>>
>>>Usually it doesn't, it only harms performance and probably SSD
>>>lifetime
>>>too
>>>
>>>> I would not be running ceph on ssds without powerloss protection. I
>>>> delivers a potential data loss scenario
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-16 Thread Udo Lembke
Hi,

On 16.07.2017 15:04, Phil Schwarz wrote:
> ...
> Same result, the OSD is known by the node, but not by the cluster.
> ...
Firewall? Or a mismatch in /etc/hosts or DNS??
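A few quick checks I would run from the new node (host names are only placeholders):

getent hosts pve-node1   # does name resolution match ceph.conf / the monmap?
ceph -s                  # does the node reach the mons at all?
iptables -L -n           # anything blocking port 6789 (mon) or 6800-7300 (OSDs)?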

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-15 Thread Udo Lembke
Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:
> Hi,
> ...
>
> While investigating, i wondered about my config :
> Question relative to /etc/hosts file :
> Should i use private_replication_LAN Ip or public ones ?
Use the private_replication_LAN IPs!! And the pve-cluster should use another network
(separate NICs) if possible.
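As a sketch (the subnets are only examples), the matching split in ceph.conf would be:

[global]
 public network  = 192.168.0.0/24   # clients and mons
 cluster network = 10.10.10.0/24    # OSD replication traffic

and /etc/hosts should resolve the node names the same way on every host.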

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re-weight Entire Cluster?

2017-05-29 Thread Udo Lembke
Hi Mike,

On 30.05.2017 01:49, Mike Cave wrote:
>
> Greetings All,
>
>  
>
> I recently started working with our ceph cluster here and have been
> reading about weighting.
>
>  
>
> It appears the current best practice is to weight each OSD according
> to it’s size (3.64 for 4TB drive, 7.45 for 8TB drive, etc).
>
>  
>
> As it turns out, it was not configured this way at all; all of the
> OSDs are weighted at 1.
>
>  
>
> So my questions are:
>
>  
>
> Can we re-weight the entire cluster to 3.64 and then re-weight the 8TB
> drives afterwards at a slow rate which won’t impact performance?
>
> If we do an entire re-weight will we have any issues?
>
I would set osd_max_backfills + osd_recovery_max_active to 1 (with
injectargs) before starting the reweight, to minimize the impact on running
clients.
After setting all OSDs to 3.64 you can raise the weight of the 8TB drives one by
one.
Depending on your cluster/OSDs, it's perhaps a good idea to lower the
primary affinity of the 8TB drives during the reweight?! Otherwise you get
more reads from the (slower) 8TB drives.
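A sketch of the commands (the OSD numbers are examples; primary affinity may need
mon_osd_allow_primary_affinity enabled, depending on your version):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
ceph osd crush reweight osd.0 3.64     # repeat per 4TB OSD, one after the other
ceph osd primary-affinity osd.3 0.5    # example: fewer primary reads from an 8TB OSD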


> Would it be better to just reweight the 8TB drives to 2 gradually?
>
I would go for 3.64 - then you have the right settings if you init
further OSDs with ceph-deploy.

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to think a two different disk's technologies architecture

2017-03-23 Thread Udo Lembke
Hi,
ceph speeds up with more nodes and more OSDs - so go for 6 nodes with
mixed SSD+SATA.

Udo

On 23.03.2017 18:55, Alejandro Comisario wrote:
> Hi everyone!
> I have to install a ceph cluster (6 nodes) with two "flavors" of
> disks, 3 servers with SSD and 3 servers with SATA.
>
> Y will purchase 24 disks servers (the ones with sata with NVE SSD for
> the SATA journal)
> Processors will be 2 x E5-2620v4 with HT, and ram will be 20GB for the
> OS, and 1.3GB of ram per storage TB.
>
> The servers will have 2 x 10Gb bonding for public network and 2 x 10Gb
> for cluster network.
> My doubts resides, ar want to ask the community about experiences and
> pains and gains of choosing between.
>
> Option 1
> 3 x servers just for SSD
> 3 x servers jsut for SATA
>
> Option 2
> 6 x servers with 12 SSD and 12 SATA each
>
> Regarding crushmap configuration and rules everything is clear to make
> sure that two pools (poolSSD and poolSATA) uses the right disks.
>
> But, what about performance, maintenance, architecture scalability, etc ?
>
> thank you very much !
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-11 Thread Udo Lembke
Hi,

thanks for the useful infos.


On 11.03.2017 12:21, cephmailingl...@mosibi.nl wrote:
>
> Hello list,
>
> A week ago we upgraded our Ceph clusters from Hammer to Jewel and with
> this email we want to share our experiences.
>
> ...
>
>
> e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph
> ... the 'find' in step e found so much files that xargs (the shell)
> could not handle it (too many arguments). At that time we decided to
> keep the permissions on root in the upgrade phase.
>
>
Perhaps a "find /var/lib/ceph/ ! -uid 64045 -exec chown
ceph:ceph {} +" would do a better job?!

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Testing a node by fio - strange results to me

2017-01-22 Thread Udo Lembke
Hi,

I don't use MDS, but I think it's the same as with RBD - the read
data is cached on the OSD nodes.

The 4MB chunks of the 3G file fit completely into the page cache, the 320G file does not.
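To take the OSD page cache out of the equation, one could drop the caches on all
OSD nodes (and the client) between the runs, roughly:

sync; echo 3 > /proc/sys/vm/drop_caches   # as root, on every OSD node and the client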


Udo


On 18.01.2017 07:50, Ahmed Khuraidah wrote:
> Hello community,
>
> I need your help to understand a little bit more about current MDS
> architecture. 
> I have created one node CephFS deployment and tried to test it by fio.
> I have used two file sizes of 3G and 320G. My question is why I have
> around 1k+ IOps when perform random reading from 3G file into
> comparison to expected ~100 IOps from 320G. Could somebody clarify
> where is read buffer/caching performs here and how to control it?
>
> A little bit about setup - Ubuntu 14.04 server that consists Jewel
> based: one MON, one MDS (default parameters, except mds_log = false)
> and OSD using SATA drive (XFS) for placing data and SSD drive for
> journaling. No RAID controller and no pool tiering used
>
> Thanks
>  
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Udo Lembke
Hi Sam,

the web frontend of an external ceph-dash was interrupted until the node
was up again. The reboot took approx. 5 min.

But the ceph -w output showed IO again much sooner. I will look tomorrow
at the output again and create a ticket.


Thanks


Udo


On 12.01.2017 20:02, Samuel Just wrote:
> How long did it take for the cluster to recover?
> -Sam
>
> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
>> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>>> Hi all,
>>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>>> ceph-cluster. All nodes are mons and have two OSDs.
>>> During reboot of one node, ceph stucks longer than normaly and I look in the
>>> "ceph -w" output to find the reason.
>>>
>>> This is not the reason, but I'm wonder why "osd marked itself down" will not
>>> recognised by the mons:
>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>>> quorum 0,2
>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
>>> 0,2
>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
>>> wr, 15 op/s
>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
>>> wr, 12 op/s
>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>>> rd, 135 kB/s wr, 15 op/s
>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
>>> rd, 189 kB/s wr, 7 op/s
>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
>>> reporters from different host after 21.222945 >= grace 20.388836)
>>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
>>> reporters from different host after 21.222970 >= grace 20.388836)
>>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>>
>>> Why trust the mon not the osd? In this case the osdmap will be right app. 26
>>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>>
>>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> That's not what anybody intended to have happen. It's possible the
>> simultaneous loss of a monitor and the OSDs is triggering a case
>> that's not behaving correctly. Can you create a ticket at
>> tracker.ceph.com with your logs and what steps you took and symptoms
>> observed?
>> -Greg
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Udo Lembke
Hi,
but I assume you are also measuring cache in this scenario - the OSD nodes have
cached the writes in the file buffer
(due to this the latency should be very small).
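So before the read benchmark I would pre-fill the images and drop the caches
everywhere, along these lines:

dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0   # fill the images so reads hit real objects
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
sync; echo 3 > /proc/sys/vm/drop_caches         # as root, on the client and every OSD node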

Udo

On 12.12.2016 03:00, V Plus wrote:
> Thanks Somnath!
> As you recommended, I executed:
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
>
> Then the output results look more reasonable!
> Could you tell me why??
>
> Btw, the purpose of my run is to test the performance of rbd in ceph.
> Does my case mean that before every test, I have to "initialize" all
> the images???
>
> Great thanks!!
>
> On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy  > wrote:
>
> Fill up the image with big write (say 1M) first before reading and
> you should see sane throughput.
>
>  
>
> Thanks & Regards
>
> Somnath
>
> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *V Plus
> *Sent:* Sunday, December 11, 2016 5:44 PM
> *To:* ceph-users@lists.ceph.com 
> *Subject:* [ceph-users] Ceph performance is too good (impossible..)...
>
>  
>
> Hi Guys,
>
> we have a ceph cluster with 6 machines (6 OSD per host). 
>
> 1. I created 2 images in Ceph, and map them to another host A
> (*/outside /*the Ceph cluster). On host A, I
> got *//dev/rbd0/* and*/ /dev/rbd1/*.
>
> 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio
> job descriptions can be found below)
>
> */"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output
> b.txt  & wait"/*
>
> 3. After the test, in a.txt, we got */bw=1162.7MB/s/*, in b.txt,
> we get */bw=3579.6MB/s/*.
>
> The results do NOT make sense because there is only one NIC on
> host A, and its limit is 10 Gbps (1.25GB/s).
>
>  
>
> I suspect it is because of the cache setting.
>
> But I am sure that in file *//etc/ceph/ceph.conf/* on host A,I
> already added:
>
> */[client]/*
>
> */rbd cache = false/*
>
>  
>
> Could anyone give me a hint what is missing? why
>
> Thank you very much.
>
>  
>
> *fioA.job:*
>
> /[A]/
>
> /direct=1/
>
> /group_reporting=1/
>
> /unified_rw_reporting=1/
>
> /size=100%/
>
> /time_based=1/
>
> /filename=/dev/rbd0/
>
> /rw=read/
>
> /bs=4MB/
>
> /numjobs=16/
>
> /ramp_time=10/
>
> /runtime=20/
>
>  
>
> *fioB.job:*
>
> /[B]/
>
> /direct=1/
>
> /group_reporting=1/
>
> /unified_rw_reporting=1/
>
> /size=100%/
>
> /time_based=1/
>
> /filename=/dev/rbd1/
>
> /rw=read/
>
> /bs=4MB/
>
> /numjobs=16/
>
> /ramp_time=10/
>
> /runtime=20/
>
>  
>
> /Thanks.../
>
> PLEASE NOTE: The information contained in this electronic mail
> message is intended only for the use of the designated
> recipient(s) named above. If the reader of this message is not the
> intended recipient, you are hereby notified that you have received
> this message in error and that any review, dissemination,
> distribution, or copying of this message is strictly prohibited.
> If you have received this communication in error, please notify
> the sender by telephone or e-mail (as shown above) immediately and
> destroy any and all copies of this message in your possession
> (whether hard copies or electronically stored copies).
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Udo Lembke
Hi,

unfortunately there are no Debian Jessie packages...


I don't know why a recompile takes such a long time for ceph... I think
such an important fix should hit the repos faster.


Udo


On 09.12.2016 18:54, Francois Lafont wrote:
> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
>
>> Sounds great.  May I asked what procedure you did to upgrade?
> Of course. ;)
>
> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
> (I think this link was pointed by Greg Farnum or Sage Weil in a
> previous message).
>
> Personally I use Ubuntu Trusty, so for me in the page above leads me
> to use this line in my "sources.list":
>
> deb 
> http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/
>  trusty main
>
> And after that "apt-get update && apt-get upgrade" etc.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help needed ! cluster unstable after upgrade from Hammer to Jewel

2016-11-16 Thread Udo Lembke
Hi,


On 16.11.2016 19:01, Vincent Godin wrote:
> Hello,
>
> We now have a full cluster (Mon, OSD & Clients) in jewel 10.2.2
> (initial was hammer 0.94.5) but we have still some big problems on our
> production environment :
>
>   * some ceph filesystem are not mounted at startup and we have to
> mount them with the "/bin/sh -c 'flock /var/lock/ceph-disk
> /usr/sbin/ceph-disk --verbose --log-stdout trigger --syn /dev/vdX1'"
>
vdX1?? This sounds like you are running ceph inside a virtualized system?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Udo Lembke
Hi again,

and you can change the value with something like this:

ceph tell osd.* injectargs '--mon_osd_full_ratio 0.96'

Udo

On 01.11.2016 21:16, Udo Lembke wrote:
> Hi Marcus,
>
> for a fast help you can perhaps increase the mon_osd_full_ratio?
>
> What values do you have?
> Please post the output of (on host ceph1, because osd.0.asok)
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
> full_ratio
>
> after that it would be helpfull to use on all hosts 2 OSDs...
>
>
> Udo
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Udo Lembke
Hi Marcus,

for quick help you can perhaps increase the mon_osd_full_ratio?

What values do you have?
Please post the output of (on host ceph1, because osd.0.asok)

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
full_ratio

after that it would be helpful to use 2 OSDs on all hosts...


Udo


On 01.11.2016 20:14, Marcus Müller wrote:
> Hi all,
>
> i have a big problem and i really hope someone can help me!
>
> We are running a ceph cluster since a year now. Version is: 0.94.7
> (Hammer)
> Here is some info:
>
> Our osd map is:
>
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 26.67998 root default 
> -2  3.64000 host ceph1   
>  0  3.64000 osd.0   up  1.0  1.0 
> -3  3.5 host ceph2   
>  1  3.5 osd.1   up  1.0  1.0 
> -4  3.64000 host ceph3   
>  2  3.64000 osd.2   up  1.0  1.0 
> -5 15.89998 host ceph4   
>  3  4.0 osd.3   up  1.0  1.0 
>  4  3.5 osd.4   up  1.0  1.0 
>  5  3.2 osd.5   up  1.0  1.0 
>  6  5.0 osd.6   up  1.0  1.0 
>
> ceph df:
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED 
> 40972G 26821G   14151G 34.54 
> POOLS:
> NAMEID USED  %USED MAX AVAIL OBJECTS 
> blocks  7  4490G 10.96 1237G 7037004 
> commits 8   473M 0 1237G  802353 
> fs  9  9666M  0.02 1237G 7863422 
>
> ceph osd df:
>
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
>  0 3.64000  1.0  3724G  3128G   595G 84.01 2.43 
>  1 3.5  1.0  3724G  3237G   487G 86.92 2.52 
>  2 3.64000  1.0  3724G  3180G   543G 85.41 2.47 
>  3 4.0  1.0  7450G  1616G  5833G 21.70 0.63 
>  4 3.5  1.0  7450G  1246G  6203G 16.74 0.48 
>  5 3.2  1.0  7450G  1181G  6268G 15.86 0.46 
>  6 5.0  1.0  7450G   560G  6889G  7.52 0.22 
>   TOTAL 40972G 14151G 26820G 34.54  
> MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53
>
>
> Our current cluster state is: 
>
>  health HEALTH_WARN
> 63 pgs backfill
> 8 pgs backfill_toofull
> 9 pgs backfilling
> 11 pgs degraded
> 1 pgs recovering
> 10 pgs recovery_wait
> 11 pgs stuck degraded
> 89 pgs stuck unclean
> recovery 8237/52179437 objects degraded (0.016%)
> recovery 9620295/52179437 objects misplaced (18.437%)
> 2 near full osd(s)
> noout,noscrub,nodeep-scrub flag(s) set
>  monmap e8: 4 mons at
> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}
> election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
>  osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
> flags noout,noscrub,nodeep-scrub
>   pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
> 14152 GB used, 26820 GB / 40972 GB avail
> 8237/52179437 objects degraded (0.016%)
> 9620295/52179437 objects misplaced (18.437%)
>  231 active+clean
>   61 active+remapped+wait_backfill
>9 active+remapped+backfilling
>6 active+recovery_wait+degraded+remapped
>6 active+remapped+backfill_toofull
>4 active+recovery_wait+degraded
>2 active+remapped+wait_backfill+backfill_toofull
>1 active+recovering+degraded
> recovery io 11754 kB/s, 35 objects/s
>   client io 1748 kB/s rd, 249 kB/s wr, 44 op/s
>
>
> My main problems are: 
>
> - As you can see from the osd tree, we have three separate hosts with
> only one osd each. Another one has four osds. Ceph allows me not to
> get data back from these three nodes with only one HDD, which are all
> near full. I tried to set the weight of the osds in the bigger node
> higher but this just does not work. So i added a new osd yesterday
> which made things not better, as you can see now. What do i have to do
> to just become these three nodes empty again and put more data on the
> other node with the four HDDs.
>
> - I added the „ceph4“ node later, this resulted in a strange ip change
> as you can see in the mon list. The public network and the cluster
> network were swapped or not assigned right. See ceph.conf
>
> [global]
> fsid = xxx
> mon_initial_members = ceph1
> mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> 

Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread Udo Lembke
Hi Vincent,

On 12.07.2016 15:03, Vincent Godin wrote:
> Hello.
>
> I've been testing Intel 3500 as journal store for few HDD-based OSD. I
> stumble on issues with multiple partitions (>4) and UDEV (sda5, sda6,etc
> sometime do not appear after partition creation). And I'm thinking that
> partition is not that useful for OSD management, because linux do no
> allow partition rereading with it contains used volumes.
>
> So my question: How you store many journals on SSD? My initial thoughts:
>
> 1)  filesystem with filebased journals
> 2) LVM with volumes
1+2 have a performance impact.
I use a trick with partition labels for the journal:
[osd]
osd_journal = /dev/disk/by-partlabel/journal-$id

Due to this I'm independent of the Linux device naming.
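For example (a sketch; disk, partition number and size are placeholders), such a
labelled journal partition can be created with sgdisk:

sgdisk --new=5:0:+10G --change-name=5:journal-5 /dev/sdX   # 10G GPT partition #5, labelled journal-5
partprobe /dev/sdX
ls -l /dev/disk/by-partlabel/                              # the label should show up here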


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Udo Lembke
Hi Albert,
to free unused space you must enable trim (or run fstrim) in the VM -
and everything in the storage chain must support this.
The normal virtio driver (virtio-blk) doesn't support trim, but if you use scsi disks
with the virtio-scsi driver you can use it.
It works well but needs some time for huge filesystems.
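As a sketch (pool/VM names are examples; on Proxmox the disk must be attached as a
scsi device so virtio-scsi is used, with discard enabled):

scsi0: ceph_pool:vm-100-disk-1,discard=on,cache=writeback,size=100G

and inside the guest afterwards:

fstrim -v /mountpoint   # or mount the filesystem with the 'discard' option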

Udo

On 19.05.2016 19:58, Albert Archer wrote:
> Hello All.
> I am newbie in ceph. and i use jewel release for testing purpose. it
> seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
> I create some RBD images (rbd create  ) and map to some ubuntu
> host . 
> I can read and write data to my volume , but when i delete some content
> from volume (e,g some huge files,...), populated capacity of cluster
> does not free and None of objects were clean.
> what is the problem ???
>
> Regards 
> Albert
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-25 Thread Udo Lembke

Hi Mike,

On 21.04.2016 at 15:20, Mike Miller wrote:

Hi Udo,

thanks, just to make sure, further increased the readahead:

$ sudo blockdev --getra /dev/rbd0
1048576

$ cat /sys/block/rbd0/queue/read_ahead_kb
524288

No difference here. First one is sectors (512 bytes), second one KB.

Oops, sorry! My fault. Sectors vs. KB - that makes sense...


The second read (after drop cache) is somewhat faster (10%-20%) but 
not much.
That's very strange! Looks like there is room for tuning. Do your OSD nodes have 
enough RAM? Are they very busy?


If I do single-threaded reading on a test VM I get the following results (very 
small test cluster - 2 nodes with a 10Gb NIC and one node with a 1Gb NIC):

support@upgrade-test:~/fio$ dd if=fiojo.0.0 of=/dev/null bs=1M
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 62.0267 s, 69.2 MB/s

### as root "echo 3 > /proc/sys/vm/drop_caches" and the same on the VM-host

support@upgrade-test:~/fio$ dd if=fiojo.0.0 of=/dev/null bs=1M
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 30.0987 s, 143 MB/s

# this is due to the cached data on the OSD nodes
# with cleared caches on all nodes (VM, VM host, OSD nodes)
# I get the same value as on the first run:

support@upgrade-test:~/fio$ dd if=fiojo.0.0 of=/dev/null bs=1M
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 61.8995 s, 69.4 MB/s

I don't know why this should not be the same with krbd.


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-21 Thread Udo Lembke

Hi Mike,

On 21.04.2016 at 09:07, Mike Miller wrote:

Hi Nick and Udo,

thanks, very helpful, I tweaked some of the config parameters along 
the line Udo suggests, but still only some 80 MB/s or so.
this means you have reached factor 3 (which is roughly the value I 
see with a single thread on RBD too). Better than nothing.




Kernel 4.3.4 running on the client machine and comfortable readahead 
configured


$ sudo blockdev --getra /dev/rbd0
262144

Still not more than about 80-90 MB/s.

there are two places where read-ahead can be set.
Take a look here (and change it with echo):
cat /sys/block/rbd0/queue/read_ahead_kb

Perhaps there are slight differences?
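The two knobs side by side (the values are just an example - both set 128 MB here):

blockdev --setra 262144 /dev/rbd0                  # in 512-byte sectors
echo 131072 > /sys/block/rbd0/queue/read_ahead_kb  # in KB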



For writing the parallelization is amazing and I see very impressive 
speeds, but why is reading performance so much behind? Why is it not 
parallelized the same way writing is? Is this something coming up in 
the jewel release? Or is it planned further down the road?
If you read a big file and clear your cache ("echo 3 > 
/proc/sys/vm/drop_caches") on the client, is the second read very fast? 
I assume yes.
In this case the read data is in the cache on the OSD nodes... so there 
must be room for tuning (and I'm very interested in improvements).


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Howto reduce the impact from cephx with small IO

2016-04-21 Thread Udo Lembke

Hi Mark,
thanks for the links.

If I search for wip-auth I find nothing on docs.ceph.com... does this mean 
that wip-auth didn't find its way into the ceph code base?!


But I'm wondering about the RHEL7 entry at the link 
http://www.spinics.net/lists/ceph-devel/msg22416.html

Unfortunately there are no values for RHEL7 with auth...
But is it known on which side (or by how many percent) the bottleneck for 
cephx is (client, mon, osd)? My clients (qemu on proxmox-ve) are not 
changeable, but my OSDs could also run on RHEL7/CentOS if this brings a 
performance boost. The mons are currently running on the proxmox-ve hosts.


Udo


On 20.04.2016 at 19:13, Mark Nelson wrote:

Hi Udo,

There was quite a bit of discussion and some partial improvements to 
cephx performance about a year ago.  You can see some of the 
discussion here:


http://www.spinics.net/lists/ceph-devel/msg3.html

and in particular these tests:

http://www.spinics.net/lists/ceph-devel/msg22416.html

Mark

On 04/20/2016 11:50 AM, Udo Lembke wrote:

Hi,
on an small test-system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I
compare with and without cephx.

I use fio for that inside an VM on an host, outside the 3 ceph-nodes,
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G
--direct=1 --name=fiojob_4k
All test are run three times (after clearing caches) and I take the
average (but the values are very close together).

cephx or not don't matter for an big blocksize of 4M - but for 4k!

If I disable cephx I got:
7040kB/s bandwith
1759IOPS
564µS clat

The same config, but with cephx I see this values:
4265 kB/s bandwith
1066 IOPS
933µS clat

This shows, that the performance drop by 40% with cephx!!

To disable cephx is no alternative, because any system which have access
to the ceph-network can read/write all data...

ceph.conf without cephx:
[global]
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
  cephx_sign_messages = false
  cephx_require_signatures = false
  #
  cluster network =...

ceph.conf with cephx:
[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  #
  cluster network =...

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Howto reduce the impact from cephx with small IO

2016-04-20 Thread Udo Lembke

Hi,
on a small test system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I 
compared with and without cephx.


I use fio for that inside a VM on a host outside the 3 ceph nodes, 
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G 
--direct=1 --name=fiojob_4k
All tests are run three times (after clearing caches) and I take the 
average (the values are very close together).


cephx or not doesn't matter for a big blocksize of 4M - but it does for 4k!

If I disable cephx I get:
7040 kB/s bandwidth
1759 IOPS
564 µs clat

The same config, but with cephx I see these values:
4265 kB/s bandwidth
1066 IOPS
933 µs clat

This shows that the performance drops by 40% with cephx!!

Disabling cephx is not an option, because any system which has access 
to the ceph network could then read/write all data...


ceph.conf without cephx:
[global]
 auth_cluster_required = none
 auth_service_required = none
 auth_client_required = none
 cephx_sign_messages = false
 cephx_require_signatures = false
 #
 cluster network =...

ceph.conf with cephx:
[global]
 auth client required = cephx
 auth cluster required = cephx
 auth service required = cephx
 #
 cluster network =...
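A middle ground I could still test (my assumption: keep authentication, but skip the
per-message signatures, which trades some integrity checking for speed):

[global]
 auth client required = cephx
 auth cluster required = cephx
 auth service required = cephx
 cephx sign messages = false
 cephx require signatures = false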

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-20 Thread Udo Lembke
Hi Mike,
I don't have experience with RBD mounts, but I see the same effect with RBD.

You can do some tuning to get better results (disable debug and so on).

As a hint, some values from a ceph.conf:
[osd]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 filestore_op_threads = 4
 osd max backfills = 1
 osd mount options xfs =
"rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
 osd mkfs options xfs = "-f -i size=2048"
 osd recovery max active = 1
 osd_disk_thread_ioprio_class = idle
 osd_disk_thread_ioprio_priority = 7
 osd_disk_threads = 1
 osd_enable_op_tracker = false
 osd_op_num_shards = 10
 osd_op_num_threads_per_shard = 1
 osd_op_threads = 4
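Most of the debug settings can also be tried at runtime without restarting the OSDs,
e.g. (a sketch for a few of them):

ceph tell osd.* injectargs '--debug-ms 0/0 --debug-osd 0/0 --debug-filestore 0/0'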

Udo

On 19.04.2016 11:21, Mike Miller wrote:
> Hi,
>
> RBD mount
> ceph v0.94.5
> 6 OSD with 9 HDD each
> 10 GBit/s public and private networks
> 3 MON nodes 1Gbit/s network
>
> A rbd mounted with btrfs filesystem format performs really badly when
> reading. Tried readahead in all combinations but that does not help in
> any way.
>
> Write rates are very good in excess of 600 MB/s up to 1200 MB/s,
> average 800 MB/s
> Read rates on the same mounted rbd are about 10-30 MB/s !?
>
> Of course, both writes and reads are from a single client machine with
> a single write/read command. So I am looking at single threaded
> performance.
> Actually, I was hoping to see at least 200-300 MB/s when reading, but
> I am seeing 10% of that at best.
>
> Thanks for your help.
>
> Mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Udo Lembke

Hi Sage,
we run ext4 only, on our 8-node cluster with 110 OSDs, and are quite happy 
with ext4.

We started with xfs but the latency was much higher compared to ext4...

But we use RBD only, with "short" filenames like 
rbd_data.335986e2ae8944a.000761e1.
If we can switch from Jewel to K* and change the filestore of each OSD to 
BlueStore during the update, it will be OK for us.

I hope we will then get better performance with BlueStore??
Will BlueStore be production ready during the Jewel lifetime, so that we 
can switch to BlueStore before the next big upgrade?



Udo

On 11.04.2016 at 23:39, Sage Weil wrote:

Hi,

ext4 has never been recommended, but we did test it.  After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.

Why:

Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStores filename
handling.  (There is a limit in the amount of xattr data ext4 can store in
the inode, which causes problems in LFNIndex.)

We *could* invest a ton of time rewriting this to fix, but it only affects
ext4, which we never recommended, and we plan to deprecate FileStore once
BlueStore is stable anyway, so it seems like a waste of time that would be
better spent elsewhere.

Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on BlueStore.

The long file name handling is problematic anytime someone is storing
rados objects with long names.  The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS.  Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.

How:

To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len).  The OSD will complain that ext4
cannot store such an object and refuse to start.  A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully.  They would be taking a risk, though, because we would like
to stop testing on ext4.

Is this reasonable?  If there significant ext4 users that are unwilling to
recreate their OSDs, now would be the time to speak up.

Thanks!
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-02-25 Thread Udo Lembke
Hi,

On 24.02.2016 at 17:27, Alfredo Deza wrote:
> On Wed, Feb 24, 2016 at 4:31 AM, Dan van der Ster  wrote:
>> Thanks Sage, looking forward to some scrub randomization.
>>
>> Were binaries built for el6? http://download.ceph.com/rpm-hammer/el6/x86_64/
> 
> We are no longer building binaries for el6. Just for Centos 7, Ubuntu
> Trusty, and Debian Jessie.
> 
this means that our proxmox-ve 3.4 servers, which run Debian wheezy, cannot 
be updated from ceph 0.94.5 to 0.94.6!
The OSD nodes run wheezy too - they can be upgraded. But the mons must be 
upgraded as well (first).

I can understand that newer versions are not supplied for an older OS, but 
stopping between minor.5 and minor.6 really makes no
sense to me.

Of course I can update to proxmox-ve 4.x, which is jessie based, but in that 
case I have trouble with DRBD...


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Udo Lembke
Hi,
I have done the test again in a cleaner way.

Same pool, same VM, different hosts (qemu 2.4 + qemu 2.2) but same hardware.
But only one run!

The biggest difference is due to the cache settings:

qemu2.4 cache=writethrough  iops=3823 bw=15294KB/s
qemu2.4 cache=writeback  iops=8837 bw=35348KB/s
qemu2.2 cache=writethrough  iops=2996 bw=11988KB/s
qemu2.2 cache=writeback  iops=7980 bw=31921KB/s

iothread doesn't change anything, because only one disk is used.

Test:
fio --time_based --name=benchmark --size=4G --filename=test.bin
--ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1
--verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
--group_reporting


Udo

On 22.11.2015 23:59, Udo Lembke wrote:
> Hi Zoltan,
> you are right ( but this was two running systems...).
>
> I see also an big failure: "--filename=/mnt/test.bin" (use simply
> copy/paste without to much thinking :-( )
> The root filesystem is not on ceph (on both servers).
> So my measurements are not valid!!
>
> I would do the measurements clean tomorow.
>
>
> Udo
>
>
> On 22.11.2015 14:29, Zoltan Arnold Nagy wrote:
>> It would have been more interesting if you had tweaked only one
>> option as now we can’t be sure which changed had what impact… :-)
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Udo Lembke
Hi Zoltan,
you are right (but these were two running systems...).

I also see a big mistake: "--filename=/mnt/test.bin" (simple
copy/paste without too much thinking :-( )
The root filesystem is not on ceph (on both servers).
So my measurements are not valid!!

I will redo the measurements cleanly tomorrow.


Udo


On 22.11.2015 14:29, Zoltan Arnold Nagy wrote:
> It would have been more interesting if you had tweaked only one option
> as now we can’t be sure which changed had what impact… :-)
>
>> On 22 Nov 2015, at 04:29, Udo Lembke <ulem...@polarzone.de
>> <mailto:ulem...@polarzone.de>> wrote:
>>
>> Hi Sean,
>> Haomai is right, that qemu can have a huge performance differences.
>>
>> I have done two test to the same ceph-cluster (different pools, but
>> this should not do any differences).
>> One test with proxmox ve 4 (qemu 2.4, iothread for device, and
>> cache=writeback) gives 14856 iops
>> Same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives
>> 5070 iops only.
>>
>> Here the results in long:
>> ### proxmox ve 3.x ###
>> kvm --version
>> QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard
>>
>> VM:
>> virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G
>>
>> root@fileserver:/daten/support/test# fio --time_based
>> --name=benchmark --size=4G --filename=/mnt/test.bin --ioengine=libaio
>> --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0
>> --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
>> --group_reporting
>> fio: time_based requires a runtime/timeout setting
>> benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=libaio, iodepth=128
>> ...
>> fio-2.1.11
>> Starting 4 processes
>> benchmark: Laying out IO file(s) (1 file(s) / 4096MB)
>> Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s]
>> [0/10.6K/0 iops] [eta 00m:00s]
>> benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47
>> 2015
>>   write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec
>> slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26
>> clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17
>>  lat (msec): min=1, max=2755, avg=100.76, stdev=147.54
>> clat percentiles (msec):
>>  |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19], 20.00th=[  
>> 28],
>>  | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51], 60.00th=[  
>> 63],
>>  | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[ 
>> 367],
>>  | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[
>> 1713],
>>  | 99.99th=[ 2573]
>> bw (KB  /s): min=4, max=30726, per=26.90%, avg=5456.84,
>> stdev=3014.45
>> lat (usec) : 750=0.01%, 1000=0.01%
>> lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%
>> lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55%
>> lat (msec) : 2000=0.29%, >=2000=0.03%
>>   cpu  : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30
>>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>> >=64=100.0%
>>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >=64=0.0%
>>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >=64=0.1%
>>  issued: total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
>>  latency   : target=0, window=0, percentile=100.00%, depth=128
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s,
>> mint=827178msec, maxt=827178msec
>>
>> Disk stats (read/write):
>> dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824,
>> in_queue=105927128, util=100.00%, aggrios=1/4469640,
>> aggrmerge=0/14788, aggrticks=64/103711096, aggrin_queue=104165356,
>> aggrutil=100.00%
>>   vda: ios=1/4469640, merge=0/14788, ticks=64/103711096,
>> in_queue=104165356, util=100.00%
>>
>> ##
>>
>> ### proxmox ve 4.x ###
>> kvm --version
>> QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c)
>> 2003-2008 Fabrice Bellard
>>
>> grep ceph /etc/pve/qemu-server/102.conf
>> virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G
>>
>> root@fileserver-test:/daten/tv01/test# fio --time_based
>> --name=benchmark --size=4G --filename=/mnt/test.bin --ioengine=libaio
>> --randrepeat=0 --iodepth=128 --direct=1 --invalidate

Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-21 Thread Udo Lembke
Hi Sean,
Haomai is right that qemu can make a huge performance difference.

I have done two tests against the same ceph cluster (different pools, but this
should not make any difference).
One test with proxmox ve 4 (qemu 2.4, iothread for the device, and
cache=writeback) gives 14856 iops.
The same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives
only 5070 iops.

Here the results in long:
### proxmox ve 3.x ###
kvm --version
QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard

VM:
virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G

root@fileserver:/daten/support/test# fio --time_based --name=benchmark
--size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0
--iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0
--numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
fio: time_based requires a runtime/timeout setting
benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
...
fio-2.1.11
Starting 4 processes
benchmark: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s]
[0/10.6K/0 iops] [eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47 2015
  write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec
slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26
clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17
 lat (msec): min=1, max=2755, avg=100.76, stdev=147.54
clat percentiles (msec):
 |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19], 20.00th=[   28],
 | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51], 60.00th=[   63],
 | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[  367],
 | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[ 1713],
 | 99.99th=[ 2573]
bw (KB  /s): min=4, max=30726, per=26.90%, avg=5456.84,
stdev=3014.45
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%
lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55%
lat (msec) : 2000=0.29%, >=2000=0.03%
  cpu  : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
 issued: total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s,
mint=827178msec, maxt=827178msec

Disk stats (read/write):
dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824,
in_queue=105927128, util=100.00%, aggrios=1/4469640, aggrmerge=0/14788,
aggrticks=64/103711096, aggrin_queue=104165356, aggrutil=100.00%
  vda: ios=1/4469640, merge=0/14788, ticks=64/103711096,
in_queue=104165356, util=100.00%

##

### proxmox ve 4.x ###
kvm --version
QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c)
2003-2008 Fabrice Bellard

grep ceph /etc/pve/qemu-server/102.conf
virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G

root@fileserver-test:/daten/tv01/test# fio --time_based --name=benchmark
--size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0
--iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0
--numjobs=4 --rw=randwrite --blocksize=4k --group_reporting  
fio: time_based requires a runtime/timeout
setting 
 

benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128 
... 
   

fio-2.1.11
Starting 4 processes
Jobs: 4 (f=4): [w(4)] [99.6% done] [0KB/56148KB/0KB /s] [0/14.4K/0 iops]
[eta 00m:01s]
benchmark: (groupid=0, jobs=4): err= 0: pid=26131: Sun Nov 22 03:51:04 2015
  write: io=0B, bw=59425KB/s, iops=14856, runt=282327msec
slat (usec): min=6, max=216925, avg=261.78, stdev=1802.78
clat (msec): min=1, max=330, avg=34.04, stdev=27.78
 lat (msec): min=1, max=330, avg=34.30, stdev=27.87
clat percentiles (msec):
 |  1.00th=[   10],  5.00th=[   13], 10.00th=[   14], 20.00th=[   16],
 | 30.00th=[   18], 40.00th=[   19], 50.00th=[   21], 60.00th=[   24],
 | 70.00th=[   33], 80.00th=[   62], 90.00th=[   81], 95.00th=[   87],
 | 99.00th=[   95], 99.50th=[  100], 99.90th=[  269], 99.95th=[  277],
 | 99.99th=[  297]
bw (KB  /s): min=3, max=42216, per=25.10%, avg=14917.03,
stdev=2990.50
lat (msec) : 2=0.01%, 4=0.01%, 10=1.13%, 20=45.52%, 50=28.23%
  

Re: [ceph-users] two or three replicas?

2015-11-03 Thread Udo Lembke
Hi,
for production (with enough OSDs) three replicas is the right choice.
The chance of data loss if two OSDs fail at the same time is too high.

And if this happens, most of your data is lost, because the data is
spread over many OSDs...

And yes - two replicas are faster for writes.
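A sketch of the pool settings (the pool name is just an example):

ceph osd pool set rbd size 3       # three replicas
ceph osd pool set rbd min_size 2   # still serve IO with one replica missing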


Udo


On 02.11.2015 11:10, Wah Peng wrote:
> Hello,
>
> for production application (for example, openstack's block storage),
> is it better to setup data to be stored with two replicas, or three
> replicas? is two replicas with better performance and lower cost?
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network performance

2015-10-22 Thread Udo Lembke
Hi Jonas,
you can create a bond over multiple NICs (which modes are possible depends on
your switch) to use one IP address but
more than one NIC.
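A minimal sketch for Debian/Proxmox in /etc/network/interfaces (interface names and
the LACP mode are assumptions - your switch must support the chosen mode):

auto bond0
iface bond0 inet static
    address 172.16.3.1
    netmask 255.255.255.0
    bond-slaves eth2 eth3
    bond-mode 802.3ad
    bond-miimon 100

Then point "cluster network" / "public network" in ceph.conf at the bond's subnet.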

Udo

On 21.10.2015 10:23, Jonas Björklund wrote:
> Hello,
> 
> In the configuration I have read about "cluster network" and "cluster addr".
> Is it possible to make the OSDs to listens to multiple IP addresses?
> I want to use several network interfaces to increase performance.
> 
> I hav tried
> 
> [global]
> cluster network = 172.16.3.0/24,172.16.4.0/24
> 
> [osd.0]
> public addr = 0.0.0.0
> #public addr = 172.16.3.1
> #public addr = 172.16.4.1
> 
> But I cant get them to listen to both 172.16.3.1 and 172.16.4.1 at the same 
> time.
> 
> Any ideas?
> 
> /Jonas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Udo Lembke
Hi,
have you changed the ownership as described in Sage's mail about
"v9.1.0 Infernalis release candidate released"?

  #. Fix the ownership::

   chown -R ceph:ceph /var/lib/ceph

or set ceph.conf to use root instead?
  When upgrading, administrators have two options:

   #. Add the following line to ``ceph.conf`` on all hosts::

setuser match path = /var/lib/ceph/$type/$cluster-$id

  This will make the Ceph daemons run as root (i.e., not drop
  privileges and switch to user ceph) if the daemon's data
  directory is still owned by root.  Newly deployed daemons will
  be created with data owned by user ceph and will run with
  reduced privileges, but upgraded daemons will continue to run as
  root.



Udo

On 20.10.2015 14:59, German Anders wrote:
> trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the
> following error msg while trying to restart the mon daemons:
>
> 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features:
> (1) Operation not permitted
> 2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
> 2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features:
> (1) Operation not permitted
>
>
> any ideas?
>
> $ ceph -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>
>
> Thanks in advance,
>
> Cheers,
>
> **
>
> *German*
>
> 2015-10-19 18:07 GMT-03:00 Sage Weil  >:
>
> This Hammer point fixes several important bugs in Hammer, as well as
> fixing interoperability issues that are required before an upgrade to
> Infernalis. That is, all users of earlier version of Hammer or any
> version of Firefly will first need to upgrade to hammer v0.94.4 or
> later before upgrading to Infernalis (or future releases).
>
> All v0.94.x Hammer users are strongly encouraged to upgrade.
>
> Changes
> ---
>
> * build/ops: ceph.spec.in : 50-rbd.rules
> conditional is wrong (#12166, Nathan Cutler)
> * build/ops: ceph.spec.in : ceph-common needs
> python-argparse on older distros, but doesn't require it (#12034,
> Nathan Cutler)
> * build/ops: ceph.spec.in : radosgw requires
> apache for SUSE only -- makes no sense (#12358, Nathan Cutler)
> * build/ops: ceph.spec.in : rpm: cephfs_java
> not fully conditionalized (#11991, Nathan Cutler)
> * build/ops: ceph.spec.in : rpm: not possible
> to turn off Java (#11992, Owen Synge)
> * build/ops: ceph.spec.in : running fdupes
> unnecessarily (#12301, Nathan Cutler)
> * build/ops: ceph.spec.in : snappy-devel for
> all supported distros (#12361, Nathan Cutler)
> * build/ops: ceph.spec.in : SUSE/openSUSE
> builds need libbz2-devel (#11629, Nathan Cutler)
> * build/ops: ceph.spec.in : useless
> %py_requires breaks SLE11-SP3 build (#12351, Nathan Cutler)
> * build/ops: error in ext_mime_map_init() when /etc/mime.types is
> missing (#11864, Ken Dreyer)
> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5
> in 30s) (#11798, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#10927, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#11140, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#11686, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#12407, Sage Weil)
> * cli: ceph: cli throws exception on unrecognized errno (#11354,
> Kefu Chai)
> * cli: ceph tell: broken error message / misleading hinting
> (#11101, Kefu Chai)
> * common: arm: all programs that link to librados2 hang forever on
> startup (#12505, Boris Ranto)
> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
> * common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer
> objects (#13070, Sage Weil)
> * common: do not insert emtpy ptr when rebuild emtpy bufferlist
> (#12775, Xinze Chi)
> * common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
> * common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
> * 

Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Udo Lembke
Hi Christian,

On 07.10.2015 09:04, Christian Balzer wrote:
> 
> ...
> 
> My main suspect for the excessive slowness are actually the Toshiba DT
> type drives used. 
> We only found out after deployment that these can go into a zombie mode
> (20% of their usual performance for ~8 hours if not permanently until power
> cycled) after a week of uptime.
> Again, the HW cache is likely masking this for the steady state, but
> asking a sick DT drive to seek (for reads) is just asking for trouble.
> 
> ...
does this mean you can reboot your OSD nodes one after the other, and then your 
cluster should be fast enough for approx.
one week to bring the additional node in?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-25 Thread Udo Lembke
Hi,
you can use this sources-list

cat /etc/apt/sources.list.d/ceph.list
deb http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/v0.94.3 jessie main

Udo

On 25.09.2015 15:10, Jogi Hofmüller wrote:
> Hi,
>
> Am 2015-09-11 um 13:20 schrieb Florent B:
>
>> Jessie repository will be available on next Hammer release ;)
> An how should I continue installing ceph meanwhile?  ceph-deploy new ...
> overwrites the /etc/apt/sources.list.d/ceph.list and hence throws an
> error :(
>
> Any hint appreciated.
>
> Cheers,
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-07 Thread Udo Lembke
Hi Vickey,
I had the same rados bench output after changing the motherboard of the
monitor node with the lowest IP...
Due to the new mainboard, I assume the hardware clock was wrong during
startup. Ceph health showed no errors, but none of the VMs were able to do IO
(very high load on the VMs - but no traffic).
I stopped that mon, but this didn't change anything. I had to restart all the
other mons to get IO again. After that I started the first mon again
(with the right time now) and everything worked fine again...

Another possibility:
Do you use journals on SSDs? Perhaps the SSDs are stalled by garbage
collection and can't keep up with the writes?
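Two quick checks for both suspicions (clock and journal SSDs) on the affected node - just a sketch, assuming NTP is in use and /dev/sdX stands for a journal SSD:

ceph health detail | grep -i clock     # the mons report clock skew explicitly
ntpq -p                                # verify the node really syncs its time
iostat -x 1 /dev/sdX                   # watch the journal SSD for long write latencies (GC stalls)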


Udo

On 07.09.2015 16:36, Vickey Singh wrote:
> Dear Experts
>
> Can someone please help me , why my cluster is not able write data.
>
> See the below output  cur MB/S  is 0  and Avg MB/s is decreasing.
>
>
> Ceph Hammer  0.94.2
> CentOS 6 (3.10.69-1)
>
> The Ceph status says OPS are blocked , i have tried checking , what
> all i know 
>
> - System resources ( CPU , net, disk , memory )-- All normal 
> - 10G network for public and cluster network  -- no saturation 
> - Add disks are physically healthy 
> - No messages in /var/log/messages OR dmesg
> - Tried restarting OSD which are blocking operation , but no luck
> - Tried writing through RBD  and Rados bench , both are giving same
> problemm
>
> Please help me to fix this problem.
>
> #  rados bench -p rbd 60 write
>  Maintaining 16 concurrent writes of 4194304 bytes for up to 60
> seconds or 0 objects
>  Object prefix: benchmark_data_stor1_1791844
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>  0   0 0 0 0 0 - 0
>  1  16   125   109   435.873   436  0.022076 0.0697864
>  2  16   139   123   245.94856  0.246578 0.0674407
>  3  16   139   123   163.969 0 - 0.0674407
>  4  16   139   123   122.978 0 - 0.0674407
>  5  16   139   12398.383 0 - 0.0674407
>  6  16   139   123   81.9865 0 - 0.0674407
>  7  16   139   123   70.2747 0 - 0.0674407
>  8  16   139   123   61.4903 0 - 0.0674407
>  9  16   139   123   54.6582 0 - 0.0674407
> 10  16   139   123   49.1924 0 - 0.0674407
> 11  16   139   123   44.7201 0 - 0.0674407
> 12  16   139   123   40.9934 0 - 0.0674407
> 13  16   139   123   37.8401 0 - 0.0674407
> 14  16   139   123   35.1373 0 - 0.0674407
> 15  16   139   123   32.7949 0 - 0.0674407
> 16  16   139   123   30.7451 0 - 0.0674407
> 17  16   139   123   28.9364 0 - 0.0674407
> 18  16   139   123   27.3289 0 - 0.0674407
> 19  16   139   123   25.8905 0 - 0.0674407
> 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat:
> 0.0674407
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> 20  16   139   12324.596 0 - 0.0674407
> 21  16   139   123   23.4247 0 - 0.0674407
> 22  16   139   123 22.36 0 - 0.0674407
> 23  16   139   123   21.3878 0 - 0.0674407
> 24  16   139   123   20.4966 0 - 0.0674407
> 25  16   139   123   19.6768 0 - 0.0674407
> 26  16   139   123 18.92 0 - 0.0674407
> 27  16   139   123   18.2192 0 - 0.0674407
> 28  16   139   123   17.5686 0 - 0.0674407
> 29  16   139   123   16.9628 0 - 0.0674407
> 30  16   139   123   16.3973 0 - 0.0674407
> 31  16   139   123   15.8684 0 - 0.0674407
> 32  16   139   123   15.3725 0 - 0.0674407
> 33  16   139   123   14.9067 0 - 0.0674407
> 34  16   139   123   14.4683 0 - 0.0674407
> 35  16   139   123   14.0549 0 - 0.0674407
> 36  16   139   123   13.6645 0 - 0.0674407
> 37  16   139   123   13.2952 0 - 0.0674407
> 38  16   139   123   12.9453 0 - 0.0674407
> 39  16   139   123   12.6134 0 - 0.0674407
> 2015-09-07 15:55:12.697124min lat: 0.022076 max lat: 0.46117 avg lat:
> 0.0674407
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>   

Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Udo Lembke
Hi Christian,
for my setup, option "b" takes too long - too much data movement and stress on all
nodes.
I simply (with replica 3) set "noout", reinstalled one node (with a new
filesystem on the OSDs, but leaving them in the
crushmap) and started all OSDs again (on Friday night) - the rebuild took less than
one day (11*4TB, 1*8TB).
This also stresses the other nodes, but less than weighting the OSDs to zero.
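The sequence is roughly this (a sketch, not the exact commands from back then):

ceph osd set noout      # stopped OSDs stay "in", so their PGs are not remapped to other nodes
# reinstall the node, recreate the OSD filesystems, start the OSDs again
ceph osd unset noout    # once the OSDs are up and backfilling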

Udo

On 31.08.2015 06:07, Christian Balzer wrote:
> 
> Hello,
> 
> I'm about to add another storage node to small firefly cluster here and
> refurbish 2 existing nodes (more RAM, different OSD disks).
> 
> Insert rant about not going to start using ceph-deploy as I would have to
> set the cluster to no-in since "prepare" also activates things due to the
> udev magic...
> 
> This cluster is quite at the limits of its IOPS capacity (the HW was
> requested ages ago, but the mills here grind slowly and not particular
> fine either), so the plan is to:
> 
> a) phase in the new node (lets call it C), one OSD at a time (in the dead
> of night)
> b) empty out old node A (weight 0), one OSD at a time. When
> done, refurbish and bring it back in, like above.
> c) repeat with 2nd old node B.
> 
> Looking at this it's obvious where the big optimization in this procedure
> would be, having the ability to "freeze" the OSDs on node B.
> That is making them ineligible for any new PGs while preserving their
> current status. 
> So that data moves from A to C (which is significantly faster than A or B)
> and then back to A when it is refurbished, avoiding any heavy lifting by B.
> 
> Does that sound like something other people might find useful as well and
> is it feasible w/o upsetting the CRUSH applecart?
> 
> Christian
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Udo Lembke
Hi,
some time ago I switched all OSDs from XFS to ext4 (step by step).
I had no issues while the OSD formats were mixed (the process took some weeks).

And yes, for me ext4 also performs better (especially the latencies).

Udo

On 07.08.2015 13:31, Межов Игорь Александрович wrote:
 Hi!
 
 We do some performance tests on our small Hammer install:
  - Debian Jessie;
  - Ceph Hammer 0.94.2 self-built from sources (tcmalloc)
  - 1xE5-2670 + 128Gb RAM
  - 2 nodes shared with mons, system and mon DB are on separate SAS mirror;
  - 16 OSD on each node, SAS 10k;
  - 2 Intel DC S3700 200Gb SSD for journalling 
  - 10Gbit interconnect, shared public and cluster metwork, MTU9100
  - 10Gbit client host, fio 2.2.7 compiled with RBD engine
 
 We benchmark 4k random read performance on 500G RBD volume with fio-rbd 
 and got different results. When we use XFS 
 (noatime,attr2,inode64,allocsize=4096k,
 noquota) on OSD disks, we can get ~7k sustained iops. After recreating the 
 same OSDs
 with EXT4 fs (noatime,data=ordered) we can achieve ~9.5k iops in the same 
 benchmark.
 
 So there are some questions to community:
  1. Is really EXT4 perform better under typical RBD load (we Ceph to host VM 
 images)?
  2. Is it safe to intermix OSDs with different backingstore filesystems at 
 one cluster 
 (we use ceph-deploy to create and manage OSDs)?
  3. Is it safe to move our production cluster (Firefly 0.80.7) from XFS to 
 ext4 by
 removing XFS osds one-by-one and later add the same disk drives as Ext4 OSDs
 (of course, I know about huge data-movement that will take place during this 
 process)?
 
 Thanks!
 
 Megov Igor
 CIO, Yuterra
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the samecluster

2015-08-07 Thread Udo Lembke
Hi Jan,
thanks for the hint.

I changed the mount option from noatime to relatime and will remount all
OSDs during the weekend.
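For reference, the changed line in ceph.conf then looks roughly like this (based on the mount options quoted further down, just with relatime instead of noatime):

[osd]
osd mount options ext4 = user_xattr,rw,relatime,nodiratime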

Udo

On 07.08.2015 16:37, Jan Schermer wrote:
 ext4 does support external journal, and it is _FAST_

 btw I'm not sure noatime is the right option nowadays for two reasons
 1) the default is relatime which has minimal impact on performance
 2) AFAIK some ceph features actually use atime (cache tiering was it?) or at 
 least so I gathered from some bugs I saw

 Jan

 On 07 Aug 2015, at 16:30, Udo Lembke ulem...@polarzone.de wrote:

 Hi,
 I use the ext4-parameters like Christian Balzer wrote in one posting:
 osd mount options ext4 = user_xattr,rw,noatime,nodiratime
 osd_mkfs_options_ext4 = -J size=1024 -E 
 lazy_itable_init=0,lazy_journal_init=0

 The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
 support an different journal-device, like
 xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!

 Udo

 On 07.08.2015 16:13, Burkhard Linke wrote:
 Hi,


 On 08/07/2015 04:04 PM, Udo Lembke wrote:
 Hi,
 some time ago I switched all OSDs from XFS to ext4 (step by step).
 I had no issues during mixed osd-format (the process takes some weeks).

 And yes, for me ext4 performs also better (esp. the latencies).
 Just out of curiosity:

 Do you use a ext4 setup as described in the documentation? Did you try to 
 use external ext4 journals on SSD?

 Regards,
 Burkhard
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping old distros: el6, precise 12.04, debian wheezy?

2015-07-30 Thread Udo Lembke
Hi,
dropping Debian wheezy seems quite fast - until now there aren't even packages
for jessie?!
Dropping squeeze I understand, but wheezy at this time?


Udo


On 30.07.2015 15:54, Sage Weil wrote:
 As time marches on it becomes increasingly difficult to maintain proper 
 builds and packages for older distros.  For example, as we make the 
 systemd transition, maintaining the kludgey sysvinit and udev support for 
 centos6/rhel6 is a pain in the butt and eats up time and energy to 
 maintain and test that we could be spending doing more useful work.

 Dropping them would mean:

  - Ongoing development on master (and future versions like infernalis and 
 jewel) would not be tested on these distros.

  - We would stop building upstream release packages on ceph.com for new 
 releases.

  - We would probably continue building hammer and firefly packages for 
 future bugfix point releases.

  - The downstream distros would probably continue to package them, but the 
 burden would be on them.  For example, if Ubuntu wanted to ship Jewel on 
 precise 12.04, they could, but they'd probably need to futz with the 
 packaging and/or build environment to make it work.

 So... given that, I'd like to gauge user interest in these old distros.  
 Specifically,

  CentOS6 / RHEL6
  Ubuntu precise 12.04
  Debian wheezy

 Would anyone miss them?

 In particular, dropping these three would mean we could drop sysvinit 
 entirely and focus on systemd (and continue maintaining the existing 
 upstart files for just a bit longer).  That would be a relief.  (The 
 sysvinit files wouldn't go away in the source tree, but we wouldn't worry 
 about packaging and testing them properly.)

 Thanks!
 sage
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Udo Lembke
Hi,

On 28.07.2015 12:02, Shneur Zalman Mattern wrote:
 Hi!

 And so, in your math
 I need to build size = osd, 30 replicas for my cluster of 120TB - to get my 
 demans 
30 replicas is the wrong math! Fewer replicas = more speed (because of
less writing);
more replicas = less speed.
For data safety a replica count of 3 is recommended.
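On an existing pool the replica count can be adjusted like this (a sketch; <poolname> is a placeholder):

ceph osd pool set <poolname> size 3
ceph osd pool set <poolname> min_size 2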


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] different omap format in one cluster (.sst + .ldb) - new installed OSD-node don't start any OSD

2015-07-23 Thread Udo Lembke
Hi,
I use ceph 0.94 from the wheezy repo (deb http://eu.ceph.com/debian-hammer wheezy
main) inside jessie.
0.94.1 is installable without trouble, but an upgrade to 0.94.2 doesn't work
correctly:
dpkg -l | grep ceph
ii  ceph   0.94.1-1~bpo70+1  amd64  
  distributed storage and file system
ii  ceph-common0.94.2-1~bpo70+1  amd64  
  common utilities to mount and interact
with a ceph storage cluster
ii  ceph-fs-common 0.94.2-1~bpo70+1  amd64  
  common utilities to mount and interact
with a ceph file system
ii  ceph-fuse  0.94.2-1~bpo70+1  amd64  
  FUSE-based client for the Ceph
distributed file system
ii  ceph-mds   0.94.2-1~bpo70+1  amd64  
  metadata server for the ceph
distributed file system
ii  libcephfs1 0.94.2-1~bpo70+1  amd64  
  Ceph distributed file system client
library
ii  python-cephfs  0.94.2-1~bpo70+1  amd64  
  Python libraries for the Ceph
libcephfs library

This is the reason why I switched back to wheezy (and a clean 0.94.2), but then
all OSDs on that node failed to start.
Switching back to the jessie system disk didn't solve this problem, because only
3 OSDs started again...


My conclusion is: if one of my (partly broken) jessie OSD nodes dies now (e.g. a
failed system SSD), I need less than an
hour for a new system (wheezy), around two hours to reinitialize all OSDs (format
anew, install ceph) and around two days
to refill the whole node.

Udo

On 23.07.2015 13:21, Haomai Wang wrote:
 Do you use upstream ceph version previously? Or do you shutdown
 running ceph-osd when upgrading osd?
 
 How many osds meet this problems?
 
 This assert failure means that osd detects a upgraded pg meta object
 but failed to read(or lack of 1 key) meta keys from object.
 
 On Thu, Jul 23, 2015 at 7:03 PM, Udo Lembke ulem...@polarzone.de wrote:
 On 21.07.2015 12:06, Udo Lembke wrote:
 Hi all,
 ...

 Normaly I would say, if one OSD-Node die, I simply reinstall the OS and 
 ceph and I'm back again... but this looks bad
 for me.
 Unfortunality the system also don't start 9 OSDs as I switched back to the 
 old system-disk... (only three of the big
 OSDs are running well)

 What is the best solution for that? Empty one node (crush weight 0), fresh 
 reinstall OS/ceph, reinitialise all OSDs?
 This will take a long long time, because we use 173TB in this cluster...



 Hi,
 answer myself if anybody has similiar issues and find the posting.

 Empty the whole nodes takes too long.
 I used the puppet wheezy system and have to recreate all OSDs (in this case 
 I need to empty the first blocks of the
 journal before create the OSD again).


 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] different omap format in one cluster (.sst + .ldb) - new installed OSD-node don't start any OSD

2015-07-23 Thread Udo Lembke
On 21.07.2015 12:06, Udo Lembke wrote:
 Hi all,
 ...
 
 Normaly I would say, if one OSD-Node die, I simply reinstall the OS and ceph 
 and I'm back again... but this looks bad
 for me.
 Unfortunality the system also don't start 9 OSDs as I switched back to the 
 old system-disk... (only three of the big
 OSDs are running well)
 
 What is the best solution for that? Empty one node (crush weight 0), fresh 
 reinstall OS/ceph, reinitialise all OSDs?
 This will take a long long time, because we use 173TB in this cluster...
 
 

Hi,
Answering myself in case anybody has similar issues and finds this posting.

Emptying the whole node takes too long.
I used the puppet wheezy system and had to recreate all OSDs (in this case I
needed to empty the first blocks of the
journal before creating the OSD again).
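The journal wipe before re-creating an OSD looks roughly like this (only a sketch - hostname, data disk and journal partition are placeholders and must match the OSD being recreated):

dd if=/dev/zero of=/dev/sdX2 bs=1M count=100 oflag=direct   # clear the old journal header
ceph-deploy osd create ceph-01:sdb:/dev/sdX2                # then recreate the OSD as usual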


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] different omap format in one cluster (.sst + .ldb) - new installed OSD-node don't start any OSD

2015-07-21 Thread Udo Lembke
Hi all,
we had a ceph cluster with 7 OSD nodes (Debian Jessie (because of the patched
tcmalloc) with ceph 0.94) which we are expanding with
one further node.
For this node we use puppet with Debian 7.8, because ceph 0.92.2 doesn't
install on Jessie (the upgrade to 0.94.1 worked on the
other nodes, but 0.94.2 doesn't look clean because the package ceph is still at
0.94.1).
The ceph.conf is the same cluster-wide and the OSDs on all nodes were initialized
with ceph-deploy (with only a few exceptions).
All OSDs use ext4, switched from xfs while the cluster ran ceph 0.80.7;
"filestore xattr use omap = true" is set
in ceph.conf.

I'm wondering why the omap format is different across the nodes.
The new wheezy node uses .sst files:
ls -lsa /var/lib/ceph/osd/ceph-92/current/omap/
...
2084 -rw-r--r--   1 root root 2131113 Jul 20 17:45 98.sst
2084 -rw-r--r--   1 root root 2131913 Jul 20 17:45 99.sst
2084 -rw-r--r--   1 root root 2130623 Jul 20 17:45 000111.sst
...

Whereas the jessie nodes use .ldb files:
ls -lsa /var/lib/ceph/osd/ceph-1/current/omap/
...

2084 -rw-r--r--   1 root root 2130468 Jul 20 22:33 80.ldb
2084 -rw-r--r--   1 root root 2130827 Jul 20 22:33 81.ldb
2084 -rw-r--r--   1 root root 2130171 Jul 20 22:33 88.ldb
...

On some OSDs I found old .sst files which came out of wheezy/ceph 0.87 times:
ls -lsa /var/lib/ceph/osd/ceph-23/current/omap/*.sst
2096 -rw-r--r-- 1 root root 2142558 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016722.sst
2092 -rw-r--r-- 1 root root 2141968 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016723.sst
2092 -rw-r--r-- 1 root root 2141679 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016724.sst
2096 -rw-r--r-- 1 root root 2142376 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016725.sst
2096 -rw-r--r-- 1 root root 2142227 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016726.sst
2092 -rw-r--r-- 1 root root 2141369 Apr 20 21:23 
/var/lib/ceph/osd/ceph-23/current/omap/019470.sst
But many more .ldb files:
ls -lsa /var/lib/ceph/osd/ceph-23/current/omap/*.ldb | wc -l
128

The config shows leveldb as the omap backend for the OSDs on both nodes (old and
new with .sst files):
ceph --admin-daemon /var/run/ceph/ceph-osd.92.asok config show | grep -i omap
filestore_omap_backend: leveldb,
filestore_debug_omap_check: false,
filestore_omap_header_cache_size: 1024,


Normally I would not care about that, but I tried to switch the first OSD node
to a clean puppet install and saw that
no OSDs started. The error message looks a little bit like
http://tracker.ceph.com/issues/11429 but this should not
happen, because the puppet install has ceph 0.94.2.

Error message during start:
cat ceph-osd.0.log
2015-07-20 16:51:29.435081 7fb47b126840  0 ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3), process
ceph-osd, pid 9803
2015-07-20 16:51:29.457776 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) 
backend generic (magic 0xef53)
2015-07-20 16:51:29.460470 7fb47b126840  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP
ioctl is supported and appears to work
2015-07-20 16:51:29.460479 7fb47b126840  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-07-20 16:51:29.485120 7fb47b126840  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-07-20 16:51:29.572670 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) 
limited size xattrs
2015-07-20 16:51:29.889599 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) 
mount: enabling WRITEAHEAD journal mode:
checkpoint is not enabled
2015-07-20 16:51:31.517179 7fb47b126840  0 cls cls/hello/cls_hello.cc:271: 
loading cls_hello
2015-07-20 16:51:31.552366 7fb47b126840  0 osd.0 151644 crush map has features 
2303210029056, adjusting msgr requires
for clients
2015-07-20 16:51:31.552375 7fb47b126840  0 osd.0 151644 crush map has features 
2578087936000 was 8705, adjusting msgr
requires for mons
2015-07-20 16:51:31.552382 7fb47b126840  0 osd.0 151644 crush map has features 
2578087936000, adjusting msgr requires
for osds
2015-07-20 16:51:31.552394 7fb47b126840  0 osd.0 151644 load_pgs
2015-07-20 16:51:42.682678 7fb47b126840 -1 osd/PG.cc: In function 'static 
epoch_t PG::peek_map_epoch(ObjectStore*,
spg_t, ceph::bufferlist*)' thread 7fb47b126840 time 2015-07-20 16:51:42.680036
osd/PG.cc: 2825: FAILED assert(values.size() == 2)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) 
[0xcdb572]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0x7b2) 
[0x908742]
 3: (OSD::load_pgs()+0x734) [0x7e9064]
 4: (OSD::init()+0xdac) [0x7ed8fc]
 5: (main()+0x253e) [0x79069e]
 6: (__libc_start_main()+0xfd) [0x7fb47898fead]
 7: /usr/bin/ceph-osd() [0x7966b9]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
...

Normaly I would 

Re: [ceph-users] He8 drives

2015-07-13 Thread Udo Lembke
Hi,
I have just expanded our ceph cluster (7 nodes) with one 8TB HGST per node
(changing from 4TB to 8TB; the other 11 disks per node are 4TB HGST).
But I have set the primary affinity to 0 for the 8TB disks... so in this
case my performance values are not 8TB-disk related.
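Setting it is a one-liner per OSD (the osd id is just an example; depending on the release the mons may first need "mon osd allow primary affinity = true"):

ceph osd primary-affinity osd.84 0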

Udo

On 08.07.2015 02:28, Blair Bethwaite wrote:
 Hi folks,

 Does anyone have any experience with the newish HGST He8 8TB Helium
 filled HDDs? Storagereview looked at them here:
 http://www.storagereview.com/hgst_ultrastar_helium_he8_8tb_enterprise_hard_drive_review.
 I'm torn as to the lower read performance shown there than e.g. the
 He6 or Seagate 6TB, but thing is, I think we probably have enough
 aggregate IOPs with ~170 drives. Has anyone tried these in a Ceph
 cluster yet?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Udo Lembke
Hi,

On 01.05.2015 10:30, Piotr Wachowicz wrote:
 Is there any way to confirm (beforehand) that using SSDs for journals
 will help?
yes, an SSD journal helps a lot for write speed (if you use the right SSDs),
and in my experience it also helped (but not too much) for
read performance.


 We're seeing very disappointing Ceph performance. We have 10GigE
 interconnect (as a shared public/internal network).
What kind of CPU do you use for the OSD hosts?


 We're wondering whether it makes sense to buy SSDs and put journals on
 them. But we're looking for a way to verify that this will actually
 help BEFORE we splash cash on SSDs.
I can recommend the Intel DC S3700 SSD for journaling! In the beginning
I started with various much cheaper models, but that was the wrong
decision.
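A simple way to sort out unsuitable models before buying many of them is to test O_DSYNC writes, which is what the OSD journal does - a rough sketch, destructive for the target device, with /dev/sdX as a placeholder:

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

Good journal SSDs (like the DC S3700) sustain this at a high rate, while many cheap consumer models collapse to a few hundred IOPS.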

 The problem is that the way we have things configured now, with
 journals on spinning HDDs (shared with OSDs as the backend storage),
 apart from slow read/write performance to Ceph I already mention,
 we're also seeing fairly low disk utilization on OSDs. 

 This low disk utilization suggests that journals are not really used
 to their max, which begs for the questions whether buying SSDs for
 journals will help.

 This kind of suggests that the bottleneck is NOT the disk. But,m yeah,
 we cannot really confirm that.

 Our typical data access use case is a lot of small random read/writes.
 We're doing a lot of rsyncing (entire regular linux filesystems) from
 one VM to another.

 We're using Ceph for OpenStack storage (kvm). Enabling RBD cache
 didn't really help all that much.
The read speed can be optimized with a bigger read-ahead cache inside
the VM, like:
echo 4096 > /sys/block/vda/queue/read_ahead_kb

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer release data and a Design question

2015-03-27 Thread Udo Lembke
Hi,

On 26.03.2015 11:18, 10 minus wrote:
 Hi ,
 
 I 'm just starting on small Ceph implementation and wanted to know the 
 release date for Hammer.
 Will it coincide with relase of Openstack.
 
 My Conf:  (using 10G and Jumboframes on Centos 7 / RHEL7 )
 
 3x Mons (VMs) :
 CPU - 2
 Memory - 4G
 Storage - 20 GB
 
 4x OSDs :
 CPU - Haswell Xeon
 Memory - 8 GB
 Sata - 3x 2TB (3 OSD per node)
 SSD - 2x 480 GB ( Journaling and if possible tiering)
 
 
 This is a test environment to see how all the components play . If all goes 
 well
 then we plan to increase the OSDs to 24 per node and RAM to 32 GB and a dual 
 Socket Haswell Xeons
32GB for 24 OSDs is much too little!! I have 32GB for 12 OSDs - that's OK, but
64GB would be better.
The CPU depends on your model (cores, dual socket?).

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-26 Thread Udo Lembke
Hi Don,
after a lot of trouble due to an unfinished setcrushmap, I was able to remove the
new EC pool.
I loaded the old crushmap and edited it again. After including a "step set_choose_tries
100" in the crushmap, the EC pool creation with
ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile
worked without trouble.
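For anyone who finds this later, the crushmap round trip is the standard crushtool workflow:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# add "step set_choose_tries 100" to the EC rule in crush.txt
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new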

Due to defective PGs from this test, I removed the cache tier from the old EC pool,
which gave me the next bit of trouble - but that
is another story!


Thanks again

Udo

On 25.03.2015 20:37, Don Doerner wrote:
 More info please: how did you create your EC pool?  It's hard to imagine that 
 you could have specified enough PGs to make it impossible to form PGs out of 
 84 OSDs (I'm assuming your SSDs are in a separate root) but I have to ask...
 
 -don-
 
 

 -Original Message-
 From: Udo Lembke [mailto:ulem...@polarzone.de] 
 Sent: 25 March, 2015 08:54
 To: Don Doerner; ceph-us...@ceph.com
 Subject: Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 
 active+undersized+degraded
 
 Hi Don,
 thanks for the info!
 
 looks that choose_tries set to 200 do the trick.
 
 But the setcrushmap takes a long long time (alarming, but the client have 
 still IO)... hope it's finished soon ;-)
 
 
 Udo
 
 On 25.03.2015 16:00, Don Doerner wrote:
 Assuming you've calculated the number of PGs reasonably, see here 
 https://urldefense.proofpoint.com/v1/url?u=http://tracker.ceph.com/issues/10350k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=Uyb56Qt%2BKVFbsV03VYVYpn8wSfEZJBXMjOz%2BQX5j0fY%3D%0As=b2547ec4aefa0f1b25d47bc813cab344a24c22c2464d4ff2cb199be0ef9b15cf
  and here 
 https://urldefense.proofpoint.com/v1/url?u=http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/%23crush-gives-up-too-soonhttp://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=Uyb56Qt%2BKVFbsV03VYVYpn8wSfEZJBXMjOz%2BQX5j0fY%3D%0As=09d9aeb34481797e2d8f24938980db3697f26d94e92ff4c72714651181329de9.
 I'm guessing these will address your issue.  That weird number means that no 
 OSD was found/assigned to the PG.

  

 -don-
 
 --
 The information contained in this transmission may be confidential. Any 
 disclosure, copying, or further distribution of confidential information is 
 not permitted unless such privilege is explicitly granted in writing by 
 Quantum. Quantum reserves the right to have electronic communications, 
 including email and attachments, sent across its networks filtered through 
 anti virus and spam software programs and retain such messages in order to 
 comply with applicable data security and retention requirements. Quantum is 
 not responsible for the proper and complete transmission of the substance of 
 this communication or for any delay in its receipt.
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Udo Lembke
Hi all,
due to a very silly approach, I removed the cache tier of a filled EC pool.

After recreating the pool and connecting it with the EC pool I don't see any content.
How can I see the rbd_data and the other objects through the new SSD cache tier?

I think that I must recreate the rbd_directory (and fill it with setomapval), but
I don't see anything yet!

$ rados ls -p ecarchiv | more
rbd_data.2e47de674b0dc51.00390074
rbd_data.2e47de674b0dc51.0020b64f
rbd_data.2fbb1952ae8944a.0016184c
rbd_data.2cfc7ce74b0dc51.00363527
rbd_data.2cfc7ce74b0dc51.0004c35f
rbd_data.2fbb1952ae8944a.0008db43
rbd_data.2cfc7ce74b0dc51.0015895a
rbd_data.31229f0238e1f29.000135eb
...

$ rados ls -p ssd-archiv
 nothing 

generation of the cache tier:
$ rados mkpool ssd-archiv
$ ceph osd pool set ssd-archiv crush_ruleset 5
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd pool set ssd-archiv hit_set_type bloom
$ ceph osd pool set ssd-archiv hit_set_count 1
$ ceph osd pool set ssd-archiv hit_set_period 3600
$ ceph osd pool set ssd-archiv target_max_bytes 500


rule ssd {
ruleset 5
type replicated
min_size 1
max_size 10
step take ssd
step choose firstn 0 type osd
step emit
}


Is there any magic (or which command did I miss?) to see the existing data
through the cache tier?


regards - and hoping for answers

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Udo Lembke
Hi Greg,

On 26.03.2015 18:46, Gregory Farnum wrote:
 I don't know why you're mucking about manually with the rbd directory;
 the rbd tool and rados handle cache pools correctly as far as I know.
that's because I deleted the cache tier pool, so the objects like
rbd_header.2cfc7ce74b0dc51 and rbd_directory are gone.
All of the VM disk data is still in the EC pool (rbd_data.2cfc7ce74b0dc51.*).

I can't see or recreate the VM disk, because rados setomapval doesn't like
binary data and the rbd tool can't (re)create an rbd image with a given
prefix (like 2cfc7ce74b0dc51).

The only way I see at the moment is to create new rbd disks and copy
all blocks with rados get -> file -> rados put.
The problem is the time it takes (days to weeks for 3 * 16TB)...
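A very rough sketch of such a copy loop - it only moves the data objects; the target pool and NEWPREFIX (the block_name_prefix of a freshly created image) are placeholders, and the image metadata on the target side still has to be taken care of:

for obj in $(rados -p ecarchiv ls | grep '^rbd_data.2cfc7ce74b0dc51'); do
    rados -p ecarchiv get "$obj" /tmp/block
    rados -p rbd put "rbd_data.NEWPREFIX.${obj##*.}" /tmp/block
done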

Udo

 -Greg

 On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi Greg,
 ok!

 It's looks like, that my problem is more setomapval-related...

 I must o something like
 rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51

 but rados setomapval don't use the hexvalues - instead of this I got
 rados -p ssd-archiv listomapvals rbd_directory
 name_vm-409-disk-2
 value: (35 bytes) :
  : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
 0020 : 63 35 31: c51


 hmm, strange. With  rados -p ssd-archiv getomapval rbd_directory 
 name_vm-409-disk-2 name_vm-409-disk-2
 I got the binary inside the file name_vm-409-disk-2, but reverse do an
 rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
 name_vm-409-disk-2
 fill the variable with name_vm-409-disk-2 and not with the content of the 
 file...

 Are there other tools for the rbd_directory?

 regards

 Udo

 On 26.03.2015 15:03, Gregory Farnum wrote:
 You shouldn't rely on rados ls when working with cache pools. It
 doesn't behave properly and is a silly operation to run against a pool
 of any size even when it does. :)

 More specifically, rados ls is invoking the pgls operation. Normal
 read/write ops will go query the backing store for objects if they're
 not in the cache tier. pgls is different — it just tells you what
 objects are present in the PG on that OSD right now. So any objects
 which aren't in cache won't show up when listing on the cache pool.
 -Greg

 On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi all,
 due an very silly approach, I removed the cache tier of an filled EC pool.

 After recreate the pool and connect with the EC pool I don't see any 
 content.
 How can I see the rbd_data and other files through the new ssd cache tier?

 I think, that I must recreate the rbd_directory (and fill with 
 setomapval), but I don't see anything yet!

 $ rados ls -p ecarchiv | more
 rbd_data.2e47de674b0dc51.00390074
 rbd_data.2e47de674b0dc51.0020b64f
 rbd_data.2fbb1952ae8944a.0016184c
 rbd_data.2cfc7ce74b0dc51.00363527
 rbd_data.2cfc7ce74b0dc51.0004c35f
 rbd_data.2fbb1952ae8944a.0008db43
 rbd_data.2cfc7ce74b0dc51.0015895a
 rbd_data.31229f0238e1f29.000135eb
 ...

 $ rados ls -p ssd-archiv
  nothing 

 generation of the cache tier:
 $ rados mkpool ssd-archiv
 $ ceph osd pool set ssd-archiv crush_ruleset 5
 $ ceph osd tier add ecarchiv ssd-archiv
 $ ceph osd tier cache-mode ssd-archiv writeback
 $ ceph osd pool set ssd-archiv hit_set_type bloom
 $ ceph osd pool set ssd-archiv hit_set_count 1
 $ ceph osd pool set ssd-archiv hit_set_period 3600
 $ ceph osd pool set ssd-archiv target_max_bytes 500


 rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
 }


 Are there any magic (or which command I missed?) to see the excisting 
 data throug the cache tier?


 regards - and hoping for answers

 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Udo Lembke
Hi Greg,
ok!

It looks like my problem is more setomapval-related...

I must do something like
rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2
\0x0f\0x00\0x00\0x002cfc7ce74b0dc51

but rados setomapval doesn't interpret the hex values - instead I got
rados -p ssd-archiv listomapvals rbd_directory
name_vm-409-disk-2
value: (35 bytes) :
 : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
0020 : 63 35 31: c51


hmm, strange. With "rados -p ssd-archiv getomapval rbd_directory
name_vm-409-disk-2 name_vm-409-disk-2"
I got the binary value into the file name_vm-409-disk-2, but the reverse,
"rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2
name_vm-409-disk-2",
fills the value with the string "name_vm-409-disk-2" and not with the content of the
file...

Are there other tools for the rbd_directory?

regards

Udo

On 26.03.2015 15:03, Gregory Farnum wrote:
 You shouldn't rely on rados ls when working with cache pools. It
 doesn't behave properly and is a silly operation to run against a pool
 of any size even when it does. :)
 
 More specifically, rados ls is invoking the pgls operation. Normal
 read/write ops will go query the backing store for objects if they're
 not in the cache tier. pgls is different — it just tells you what
 objects are present in the PG on that OSD right now. So any objects
 which aren't in cache won't show up when listing on the cache pool.
 -Greg
 
 On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi all,
 due an very silly approach, I removed the cache tier of an filled EC pool.

 After recreate the pool and connect with the EC pool I don't see any content.
 How can I see the rbd_data and other files through the new ssd cache tier?

 I think, that I must recreate the rbd_directory (and fill with setomapval), 
 but I don't see anything yet!

 $ rados ls -p ecarchiv | more
 rbd_data.2e47de674b0dc51.00390074
 rbd_data.2e47de674b0dc51.0020b64f
 rbd_data.2fbb1952ae8944a.0016184c
 rbd_data.2cfc7ce74b0dc51.00363527
 rbd_data.2cfc7ce74b0dc51.0004c35f
 rbd_data.2fbb1952ae8944a.0008db43
 rbd_data.2cfc7ce74b0dc51.0015895a
 rbd_data.31229f0238e1f29.000135eb
 ...

 $ rados ls -p ssd-archiv
  nothing 

 generation of the cache tier:
 $ rados mkpool ssd-archiv
 $ ceph osd pool set ssd-archiv crush_ruleset 5
 $ ceph osd tier add ecarchiv ssd-archiv
 $ ceph osd tier cache-mode ssd-archiv writeback
 $ ceph osd pool set ssd-archiv hit_set_type bloom
 $ ceph osd pool set ssd-archiv hit_set_count 1
 $ ceph osd pool set ssd-archiv hit_set_period 3600
 $ ceph osd pool set ssd-archiv target_max_bytes 500


 rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
 }


 Are there any magic (or which command I missed?) to see the excisting data 
 throug the cache tier?


 regards - and hoping for answers

 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Udo Lembke
Hi,
due to two more hosts (now 7 storage nodes) I want to create a new
EC pool, and I get a strange effect:

ceph@admin:~$ ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
pgs stuck undersized; 2 pgs undersized
pg 22.3e5 is stuck unclean since forever, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck unclean since forever, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is stuck undersized for 406.614447, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck undersized for 406.616563, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is stuck degraded for 406.614566, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck degraded for 406.616679, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is active+undersized+degraded, acting
[76,15,82,11,57,29,2147483647]
pg 22.240 is active+undersized+degraded, acting
[38,85,17,74,2147483647,10,58]

But I only have 91 OSDs (84 SATA + 7 SSDs), not 2147483647!
Where the heck did the 2147483647 come from?

I ran the following commands:
ceph osd erasure-code-profile set 7hostprofile k=5 m=2
ruleset-failure-domain=host
ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile

my version:
ceph -v
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)


I found an issue in my crushmap - one SSD was listed twice in the map:
host ceph-061-ssd {
id -16  # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
}
root ssd {
id -13  # do not change unnecessarily
# weight 0.780
alg straw
hash 0  # rjenkins1
item ceph-01-ssd weight 0.170
item ceph-02-ssd weight 0.170
item ceph-03-ssd weight 0.000
item ceph-04-ssd weight 0.170
item ceph-05-ssd weight 0.170
item ceph-06-ssd weight 0.050
item ceph-07-ssd weight 0.050
item ceph-061-ssd weight 0.000
}

The host ceph-061-ssd doesn't exist and osd.61 is the SSD from ceph-03-ssd,
but after fixing the crushmap the issue with the osd 2147483647 still exists.

Any idea how to fix that?

regards

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Udo Lembke
Hi Gregory,
thanks for the answer!

I have looked at which storage node is missing, and it's a different one for each PG:
pg 22.240 is stuck undersized for 24437.862139, current state 
active+undersized+degraded, last acting
[38,85,17,74,2147483647,10,58]
pg 22.240 is stuck undersized for 24437.862139, current state 
active+undersized+degraded, last acting
[ceph-04,ceph-07,ceph-02,ceph-06,2147483647,ceph-01,ceph-05]
ceph-03 is missing

pg 22.3e5 is stuck undersized for 24437.860025, current state 
active+undersized+degraded, last acting
[76,15,82,11,57,29,2147483647]
pg 22.3e5 is stuck undersized for 24437.860025, current state 
active+undersized+degraded, last acting
[ceph-06,ceph-ceph-02,ceph-07,ceph-01,ceph-05,ceph-03,2147483647]
ceph-04 is missing

Perhaps I hit a PGs-per-OSD maximum?!

I checked with the script from
http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

pool :  17  18  19  9   10  20  21  13  22  
23  16  | SUM

...
host ceph-03:
osd.24  0   12  2   2   4   76  16  5   74  
0   66  | 257
osd.25  0   17  3   4   4   89  16  4   82  
0   60  | 279
osd.26  0   20  2   5   3   71  12  5   81  
0   61  | 260
osd.27  0   18  2   4   3   73  21  3   76  
0   61  | 261
osd.28  0   14  2   9   4   73  23  9   94  
0   64  | 292
osd.29  0   19  3   3   4   54  25  4   89  
0   62  | 263
osd.30  0   22  2   6   3   80  15  6   92  
0   47  | 273
osd.31  0   25  4   2   3   87  20  3   76  
0   62  | 282
osd.32  0   13  4   2   2   64  14  1   82  
0   69  | 251
osd.33  0   12  2   5   5   89  25  7   83  
0   68  | 296
osd.34  0   28  0   8   5   81  18  3   99  
0   65  | 307
osd.35  0   17  3   2   4   74  21  3   95  
0   58  | 277
host ceph-04:
osd.36  0   13  1   9   6   72  17  5   93  
0   56  | 272
osd.37  0   21  2   5   6   83  20  4   78  
0   71  | 290
osd.38  0   17  3   2   5   64  22  7   76  
0   57  | 253
osd.39  0   21  3   7   6   79  27  4   80  
0   68  | 295
osd.40  0   15  1   5   7   71  17  6   93  
0   74  | 289
osd.41  0   16  5   5   6   76  18  6   95  
0   70  | 297
osd.42  0   13  0   6   1   71  25  4   83  
0   56  | 259
osd.43  0   20  2   2   6   81  23  4   89  
0   59  | 286
osd.44  0   21  2   5   6   77  9   5   76  
0   52  | 253
osd.45  0   11  4   8   3   76  24  6   82  
0   49  | 263
osd.46  0   17  2   5   6   57  15  4   84  
0   62  | 252
osd.47  0   19  3   2   3   84  19  5   94  
0   48  | 277
...

SUM :   768 1536192 384 384 61441536384 7168
24  5120|


Pool 22 is the new ec7archiv.

But on ceph-04 there aren't any OSDs with more than 300 PGs...

Udo

On 25.03.2015 14:52, Gregory Farnum wrote:
 On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi,
 due to two more hosts (now 7 storage nodes) I want to create an new
 ec-pool and get an strange effect:

 ceph@admin:~$ ceph health detail
 HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
 pgs stuck undersized; 2 pgs undersized
 
 This is the big clue: you have two undersized PGs!
 
 pg 22.3e5 is stuck unclean since forever, current state
 active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
 
 2147483647 is the largest number you can represent in a signed 32-bit
 integer. There's an output error of some kind which is fixed
 elsewhere; this should be -1.
 
 So for whatever reason (in general it's hard on CRUSH trying to select
 N entries out of N choices), CRUSH hasn't been able to map an OSD to
 this slot for you. You'll want to figure out why that is and fix it.
 -Greg
 
 pg 22.240 is stuck unclean since forever, current state
 active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
 pg

Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Udo Lembke
Hi Don,
thanks for the info!

it looks like choose_tries set to 200 does the trick.

But the setcrushmap takes a long, long time (alarming, but the clients still have
IO)... hope it's finished soon ;-)


Udo

On 25.03.2015 16:00, Don Doerner wrote:
 Assuming you've calculated the number of PGs reasonably, see here 
 http://tracker.ceph.com/issues/10350 and here
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soonhttp://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/.
  
 I’m guessing these will address your issue.  That weird number means that no 
 OSD was found/assigned to the PG.
 
  
 
 -don-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] won leader election with quorum during osd setcrushmap

2015-03-25 Thread Udo Lembke
Hi,
due to PG trouble with an EC pool I modified the crushmap (adding "step set_choose_tries
200") from

rule ec7archiv {
ruleset 6
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default
step chooseleaf indep 0 type host
step emit
}

to

rule ec7archiv {
ruleset 6
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step set_choose_tries 200
step take default
step chooseleaf indep 0 type host
step emit
}

ceph osd setcrushmap has been running for an hour now, and ceph -w gives the following output:

2015-03-25 17:20:18.163295 mon.0 [INF] mdsmap e766: 1/1/1 up {0=b=up:active}, 1 
up:standby
2015-03-25 17:20:18.163370 mon.0 [INF] osdmap e130004: 91 osds: 91 up, 91 in
2015-03-25 17:20:28.525445 mon.0 [INF] from='client.? 172.20.2.1:0/1007537' 
entity='client.admin' cmd=[{prefix: osd
setcrushmap}]: dispatch
2015-03-25 17:20:28.525580 mon.0 [INF] mon.0 calling new monitor election
2015-03-25 17:20:28.526263 mon.0 [INF] mon.0@0 won leader election with quorum 
0,1,2


Fortunately the clients still have access to the cluster (kvm)!!

How long does such a setcrushmap take?? Normally it's done in a few seconds.
Does the setcrushmap have a chance to finish?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Udo Lembke
Hi Tony,
sounds like a good idea!

Udo
On 09.03.2015 21:55, Tony Harris wrote:
 I know I'm not even close to this type of a problem yet with my small
 cluster (both test and production clusters) - but it would be great if
 something like that could appear in the cluster HEALTHWARN, if Ceph
 could determine the amount of used processes and compare them against
 the current limit then throw a health warning if it gets within say 10
 or 15% of the max value.  That would be a really quick indicator for
 anyone who frequently checks the health status (like through a web
 portal) as they may see it more quickly then during their regular log
 check interval.  Just a thought.

 -Tony
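For reference, a quick manual check on an OSD host looks like this (just a sketch; both the kernel-wide and the per-user limits are interesting):

cat /proc/sys/kernel/pid_max     # kernel-wide limit on processes/threads
ulimit -u                        # per-user process limit in the ceph user's shell
ps -eLf | wc -l                  # rough count of threads currently in use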


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] too few pgs in cache tier

2015-02-27 Thread Udo Lembke
Hi all,
we use an EC pool with a small cache tier in front of it for our
archive data (4 * 16TB VM disks).

The EC pool has k=3;m=2 because we started with 5 nodes, and we want to
migrate to a new EC pool with k=5;m=2. Therefore we migrated one VM disk
(16TB) from the ceph cluster to an FC RAID with the proxmox-ve "move disk"
interface.

The move finished, but while removing the ceph VM image the warnings
"'ssd-archiv' at/near target max" and "pool ssd-archiv has too few pgs" occurred.

Some hours later only the second warning remained.

ceph health detail
HEALTH_WARN pool ssd-archiv has too few pgs
pool ssd-archiv objects per pg (51196) is more than 14.7709 times
cluster average (3466)

info about the image, which was deleted:
rbd image 'vm-409-disk-1':
size 16384 GB in 4194304 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.2b8fda574b0dc51
format: 2
features: layering

I think we hit http://tracker.ceph.com/issues/8103,
but normally a single read should not put the data into the cache tier, should it??
Does deleting count as a second read??

Our ceph version: 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)


Regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power failure recovery woes

2015-02-17 Thread Udo Lembke
Hi Jeff,
is the osd /var/lib/ceph/osd/ceph-2 mounted?

If not, does it help if you mount the OSD and start it with
service ceph start osd.2
??
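Something like this (a sketch - the device name is a placeholder):

mount | grep ceph-2 || mount /dev/sdX1 /var/lib/ceph/osd/ceph-2
service ceph start osd.2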

Udo

On 17.02.2015 09:54, Jeff wrote:
 Hi,
 
 We had a nasty power failure yesterday and even with UPS's our small (5
 node, 12 OSD) cluster is having problems recovering.
 
 We are running ceph 0.87
 
 3 of our OSD's are down consistently (others stop and are restartable,
 but our cluster is so slow that almost everything we do times out).
 
 We are seeing errors like this on the OSD's that never run:
 
 ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1)
 Operation not permitted
 
 We are seeing errors like these of the OSD's that run some of the time:
 
 osd/PGLog.cc: 844: FAILED assert(last_e.version.version 
 e.version.version)
 common/HeartbeatMap.cc: 79: FAILED assert(0 == hit suicide timeout)
 
 Does anyone have any suggestions on how to recover our cluster?
 
 Thanks!
   Jeff
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placement Groups fail on fresh Ceph cluster installation with all OSDs up and in

2015-02-10 Thread Udo Lembke
Hi,
you will get further trouble, because your weights are not correct.

You need a weight >= 0.01 for each OSD. This means your OSDs must be 10GB
or greater!


Udo

On 10.02.2015 12:22, B L wrote:
 Hi Vickie,
 
 My OSD tree looks like this:
 
 ceph@ceph-node3:/home/ubuntu$ ceph osd tree
 # idweighttype nameup/downreweight
 -10root default
 -20host ceph-node1
 00osd.0up1
 10osd.1up1
 -30host ceph-node3
 20osd.2up1
 30osd.3up1
 -40host ceph-node2
 40osd.4up1
 50osd.5up1
 
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placement Groups fail on fresh Ceph cluster installation with all OSDs up and in

2015-02-10 Thread Udo Lembke
Hi,
use:
ceph osd crush set 0 0.01 pool=default host=ceph-node1
ceph osd crush set 1 0.01 pool=default host=ceph-node1
ceph osd crush set 2 0.01 pool=default host=ceph-node3
ceph osd crush set 3 0.01 pool=default host=ceph-node3
ceph osd crush set 4 0.01 pool=default host=ceph-node2
ceph osd crush set 5 0.01 pool=default host=ceph-node2

Udo
On 10.02.2015 15:01, B L wrote:
 Thanks Vikhyat,
 
 As suggested .. 
 
 ceph@ceph-node1:/home/ubuntu$ ceph osd crush reweight 0.0095 osd.0
 
 Invalid command:  osd.0 doesn't represent a float
 osd crush reweight name float[0.0-] :  change name's weight to
 weight in crush map
 Error EINVAL: invalid command
 
 What do you think
 
 
 On Feb 10, 2015, at 3:18 PM, Vikhyat Umrao vum...@redhat.com
 mailto:vum...@redhat.com wrote:

 sudo ceph osd crush reweight 0.0095 osd.0 to osd.5
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] erasure code : number of chunks for a small cluster ?

2015-02-06 Thread Udo Lembke
On 06.02.2015 09:06, Hector Martin wrote:
 On 02/02/15 03:38, Udo Lembke wrote:
 With 3 hosts only you can't survive an full node failure, because for
 that you need
 host = k + m.
 
 Sure you can. k=2, m=1 with the failure domain set to host will survive
 a full host failure.
 

Hi,
Alexandre has the requirement of surviving 2 failed disks or one full node failure.
This is the reason why I wrote that this is not possible...

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi Dan,
I mean qemu-kvm, i.e. librbd.
But how can I tell kvm to flush the buffer?

Udo

On 05.02.2015 07:59, Dan Mick wrote:
 On 02/04/2015 10:44 PM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 Udo
 Do you mean the kernel rbd or librbd?  The latter responds to flush
 requests from the hypervisor.  The former...I'm not sure it has a
 separate cache.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi Josh,
thanks for the info.

detach/reattach should be fine for me, because it's only for
performance testing.

#2468 would be fine of course.

Udo

On 05.02.2015 08:02, Josh Durgin wrote:
 On 02/05/2015 07:44 AM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 librbd exposes it as rbd_invalidate_cache(), and qemu uses it
 internally, but I don't think you can trigger that via any user-facing
 qemu commands.

 Exposing it through the admin socket would be pretty simple though:

 http://tracker.ceph.com/issues/2468

 You can also just detach and reattach the device to flush the rbd cache.

 Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi all,
is there any command to flush the rbd cache, like the
echo 3 > /proc/sys/vm/drop_caches for the OS cache?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-04 Thread Udo Lembke
Hi Marco,

On 04.02.2015 10:20, Colombo Marco wrote:
...
 We choosen the 6TB of disk, because we need a lot of storage in a small 
 amount of server and we prefer server with not too much disks.
 However we plan to use max 80% of a 6TB Disk
 

80% is too much! You will run into trouble.
Ceph doesn't distribute the data equally. Sometimes I see a
difference of 20% in usage between OSDs.

I recommend 60-70% as the maximum.
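To keep an eye on this (the second command only exists in newer releases; the ratios shown are the defaults):

ceph df        # cluster / pool level usage
ceph osd df    # per-OSD utilisation

# the relevant thresholds in ceph.conf:
# mon osd nearfull ratio = 0.85   (health warning)
# mon osd full ratio = 0.95       (writes are blocked)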

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] erasure code : number of chunks for a small cluster ?

2015-02-01 Thread Udo Lembke
Hi Alexandre,

nice to meet you here ;-)

With only 3 hosts you can't survive a full node failure, because for
that you need
hosts >= k + m.
And k=1, m=2 doesn't make any sense.

I started with 5 hosts and use k=3, m=2. In this case two HDDs can fail or
one host can be down for maintenance.
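The corresponding setup is just two commands (profile/pool names and the pg count are only examples and must fit your cluster):

ceph osd erasure-code-profile set ec32profile k=3 m=2 ruleset-failure-domain=host
ceph osd pool create ecpool 1024 1024 erasure ec32profile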

Udo

PS: you also can't change k+m on a pool later...

On 01.02.2015 18:15, Alexandre DERUMIER wrote:
 Hi,

 I'm currently trying to understand how to setup correctly a pool with erasure 
 code


 https://ceph.com/docs/v0.80/dev/osd_internals/erasure_coding/developer_notes/


 My cluster is 3 nodes with 6 osd for each node (18 osd total).

 I want to be able to survive of 2 disk failures, but also a full node failure.

 What is the best setup for this ? Does I need M=2 or M=6 ?




 Also, how to determinate the best chunk number ?

 for example,
 K = 4 , M=2
 K = 8 , M=2
 K = 16 , M=2

 you can loose which each config 2 osd, but the more data chunks you have, the 
 less space is used by coding chunks right ?
 Does the number of chunk have performance impact ? (read/write ?)

 Regards,

 Alexandre




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD capacity variance ?

2015-02-01 Thread Udo Lembke
Hi Howard,
I assume the 160 + 250 MB is a typo (GB?).
Ceph OSDs must be at least 10GB to get a weight of 0.01.

Udo

On 31.01.2015 23:39, Howard Thomson wrote:
 Hi All,

 I am developing a custom disk storage backend for the Bacula backup
 system, and am in the process of setting up a trial Ceph system,
 intending to use a direct interface to RADOS.

 I have a variety of 1Tb, 250Mb and 160Mb disk drives that I would like
 to use, but it is not [as yet] obvious as to whether having differences
 in capacity at different OSDs matters.

 Can anyone comment, or point me in the right direction on
 docs.ceph.com ?

 Thanks,

 Howard


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Udo Lembke
Hi Xu,

On 01.02.2015 21:39, Xu (Simon) Chen wrote:
 RBD doesn't work extremely well when ceph is recovering - it is common
 to see hundreds or a few thousands of blocked requests (30s to
 finish). This translates high IO wait inside of VMs, and many
 applications don't deal with this well.
this sounds like you don't have settings like
osd max backfills = 1
osd recovery max active = 1
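These can also be injected at runtime without restarting the OSDs:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'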


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD caching on 4K reads???

2015-01-30 Thread Udo Lembke
Hi Bruce,
hmm, that sounds like the rbd cache to me.
Can you check whether the cache is really disabled in the running config with

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep cache

Udo

On 30.01.2015 21:51, Bruce McFarland wrote:

 I have a cluster and have created a rbd device - /dev/rbd1. It shows
 up as expected with 'rbd --image test info' and rbd showmapped. I have
 been looking at cluster performance with the usual Linux block device
 tools – fio and vdbench. When I look at writes and large block
 sequential reads I’m seeing what I’d expect with performance limited
 by either my cluster interconnect bandwidth or the backend device
 throughput speeds – 1 GE frontend and cluster network and 7200rpm SATA
 OSDs with 1 SSD/osd for journal. Everything looks good EXCEPT 4K
 random reads. There is caching occurring somewhere in my system that I
 haven’t been able to detect and suppress - yet.

  

 I’ve set ‘rbd_cache=false’ in the [client] section of ceph.conf on the
 client, monitor, and storage nodes. I’ve flushed the system caches on
 the client and storage nodes before test run ie vm.drop_caches=3 and
 set the huge pages to the maximum available to consume free system
 memory so that it can’t be used for system cache . I’ve also disabled
 read-ahead on all of the HDD/OSDs.

  

 When I run a 4k randon read workload on the client the most I could
 expect would be ~100iops/osd x number of osd’s – I’m seeing an order
 of magnitude greater than that AND running IOSTAT on the storage nodes
 show no read activity on the OSD disks.

  

 Any ideas on what I’ve overlooked? There appears to be some read-ahead
 caching that I’ve missed.

  

 Thanks,

 Bruce



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD caching on 4K reads???

2015-01-30 Thread Udo Lembke
Hi Bruce,
you can also look on the mon, like
ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok config show | grep cache

(I guess you have a number instead of the .b.)

Udo
On 30.01.2015 22:02, Bruce McFarland wrote:

 The ceph daemon isn’t running on the client with the rbd device so I
 can’t verify if it’s disabled at the librbd level on the client. If
 you mean on the storage nodes I’ve had some issues dumping the config.
 Does the rbd caching occur on the storage nodes, client, or both?

  

  

 *From:*Udo Lembke [mailto:ulem...@polarzone.de]
 *Sent:* Friday, January 30, 2015 1:00 PM
 *To:* Bruce McFarland; ceph-us...@ceph.com
 *Cc:* Prashanth Nednoor
 *Subject:* Re: [ceph-users] RBD caching on 4K reads???

  

 Hi Bruce,
 hmm, sounds for me like the rbd cache.
 Can you look, if the cache is realy disabled in the running config with

 ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep cache

 Udo

 On 30.01.2015 21:51, Bruce McFarland wrote:

 I have a cluster and have created a rbd device - /dev/rbd1. It
 shows up as expected with ‘rbd –image test info’ and rbd
 showmapped. I have been looking at cluster performance with the
 usual Linux block device tools – fio and vdbench. When I look at
 writes and large block sequential reads I’m seeing what I’d expect
 with performance limited by either my cluster interconnect
 bandwidth or the backend device throughput speeds – 1 GE frontend
 and cluster network and 7200rpm SATA OSDs with 1 SSD/osd for
 journal. Everything looks good EXCEPT 4K random reads. There is
 caching occurring somewhere in my system that I haven’t been able
 to detect and suppress - yet.

  

 I’ve set ‘rbd_cache=false’ in the [client] section of ceph.conf on
 the client, monitor, and storage nodes. I’ve flushed the system
 caches on the client and storage nodes before test run ie
 vm.drop_caches=3 and set the huge pages to the maximum available
 to consume free system memory so that it can’t be used for system
 cache . I’ve also disabled read-ahead on all of the HDD/OSDs.

  

 When I run a 4k randon read workload on the client the most I
 could expect would be ~100iops/osd x number of osd’s – I’m seeing
 an order of magnitude greater than that AND running IOSTAT on the
 storage nodes show no read activity on the OSD disks.

  

 Any ideas on what I’ve overlooked? There appears to be some
 read-ahead caching that I’ve missed.

  

 Thanks,

 Bruce




 ___

 ceph-users mailing list

 ceph-users@lists.ceph.com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

  


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sizing SSD's for ceph

2015-01-29 Thread Udo Lembke
Hi,

On 29.01.2015 07:53, Christian Balzer wrote:
 On Thu, 29 Jan 2015 01:30:41 + Ramakrishna Nishtala (rnishtal) wrote:

 * Per my understanding once writes are complete to journal then
 it is read again from the journal before writing to data disk. Does this
 mean, we have to do, not just sync/async writes but also reads
 ( random/seq ? ) in order to correctly size them?

 You might want to read this thread:
 https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg12952.html
 
 Assuming this didn't change (and just looking at my journal SSDs and OSD
 HDDs with atop I don't think so) your writes go to the HDDs pretty much in
 parallel.
 
 In either case, an SSD that can _write_ fast enough to satisfy your needs
 will definitely have no problems reading fast enough. 
 

Since the data is still in the cache (RAM), there are only marginal reads
from the journal SSD!

iostat from an journal ssd:

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdc 304,45 0,16 82750,46  29544 15518960008

I would say, if you see many more reads, you have too little memory.


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read-performance inside the vm

2015-01-27 Thread Udo Lembke
Hi Patrik,

On 27.01.2015 14:06, Patrik Plank wrote:
 

 ...
 I am really happy, these values above are enough for my little amount of
 vms. Inside the vms I get now for write 80mb/s and read 130mb/s, with
 write-cache enabled.
 
 But there is one little problem.
 
 Are there some tuning parameters for small files?
 
 For 4kb to 50kb files the cluster is very slow.
 

do you use a higher read-ahead inside the VM?
Like echo 4096 > /sys/block/vda/queue/read_ahead_kb
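
If that helps, you can make it persistent with a udev rule - a rough
sketch, the device match and file name are only examples:

# /etc/udev/rules.d/80-read-ahead.rules
ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/read_ahead_kb}="4096"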

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Better way to use osd's of different size

2015-01-16 Thread Udo Lembke
Hi Megov,
you should weight the OSDs so the weight represents the size (like a
weight of 3.68 for a 4TB HDD).
ceph-deploy does this automatically.

Nevertheless, even with the correct weight the disks are not filled in
an equal distribution. For that purpose you can use reweight on single
OSDs, or automatically with ceph osd reweight-by-utilization.
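
For example (OSD id and weights are only placeholders):

ceph osd crush reweight osd.12 1.82      # permanent crush weight for a ~2TB disk
ceph osd reweight-by-utilization 110     # temporary reweight of overfull OSDs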

Udo

On 14.01.2015 16:36, Межов Игорь Александрович wrote:

 Hi!


 We have a small production ceph cluster, based on firefly release.


 It was built using hardware we already have in our site so it is not
 new & shiny,

 but works quite good. It was started in 2014.09 as a proof of
 concept from 4 hosts

 with 3 x 1tb osd's each: 1U dual socket Intel 54XX and 55XX platforms on
 1 gbit network.


 Now it contains 4x12 osd nodes on shared 10Gbit network. We use it as
 a backstore

 for running VMs under qemu+rbd.


 During migration we temporarily use 1U nodes with 2tb osds and already
 face some

 problems with uneven distribution. I know, that the best practice is
 to use osds of same

 capacity, but it is impossible sometimes.


 Now we have 24-28 spare 2tb drives and want to increase capacity on
 the same boxes.

 What is the more right way to do it:

 - replace 12x1tb drives with 12x2tb drives, so we will have 2 nodes
 full of 2tb drives and

 other nodes remains in 12x1tb confifg

 - or replace 1tb to 2tb drives in more unify way, so every node will
 have 6x1tb + 6x2tb drives?


 I feel that the second way will give more smooth distribution among
 the nodes, and

 outage of one node may give lesser impact on cluster. Am I right and
 what you can

 advice me in such a situation?




 Megov Igor
 yuterra.ru, CIO
 me...@yuterra.ru


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Part 2: ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)

2015-01-14 Thread Udo Lembke
Hi again,
sorry for not keeping this threaded, but my last email didn't come back
from the mailing list (I often miss some posts!).

Just after sending the last mail, another SSD failed for the first time - in
this case a cheap one, but with the same error:

root@ceph-04:/var/log/ceph# more ceph-osd.62.log
2015-01-13 16:40:55.712967 7fb29cfd3700  0 log [INF] : 17.2 scrub ok
2015-01-13 17:54:35.548361 7fb29dfd5700  0 log [INF] : 17.3 scrub ok
2015-01-13 17:54:38.007014 7fb29dfd5700  0 log [INF] : 17.5 scrub ok
2015-01-13 17:54:41.215558 7fb29d7d4700  0 log [INF] : 17.f scrub ok
2015-01-13 17:54:42.277585 7fb29dfd5700  0 log [INF] : 17.a scrub ok
2015-01-13 17:54:48.961582 7fb29d7d4700  0 log [INF] : 17.6 scrub ok
2015-01-13 20:15:08.749597 7fb292337700  0 -- 192.168.3.14:6824/9185 
192.168.3.15:6824/11735 pipe(0x107d9680 sd=307 :6824 s=2 pgs=2 cs=1
l=0 c=0x124a09a0).fault, initiating reconnect
2015-01-13 20:15:08.750803 7fb296dbe700  0 -- 192.168.3.14:0/9185 
192.168.3.15:6825/11735 pipe(0xd011180 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x
8d19760).fault
2015-01-13 20:15:08.750804 7fb292b3f700  0 -- 192.168.3.14:0/9185 
172.20.2.15:6837/11735 pipe(0x1210f900 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x
beae840).fault
2015-01-13 20:15:08.751056 7fb291d31700  0 -- 192.168.3.14:6824/9185 
192.168.3.15:6824/11735 pipe(0x107d9680 sd=29 :6824 s=1 pgs=2 cs=2 l
=0 c=0x124a09a0).fault
2015-01-13 20:15:27.035342 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:07.035339)
2015-01-13 20:15:28.036773 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.036769)
2015-01-13 20:15:28.945179 7fb29b7d0700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.945178)
2015-01-13 20:15:29.037016 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:09.037014)
2015-01-13 20:15:30.037204 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.037202)
2015-01-13 20:15:30.645491 7fb29b7d0700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.645483)
2015-01-13 20:15:31.037326 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:11.037323)
2015-01-13 20:15:32.037442 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:12.037439)
2015-01-13 20:15:33.037641 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:13.037637)
2015-01-13 20:15:34.037843 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:14.037839)
2015-01-13 21:39:35.241153 7fb29dfd5700  0 log [INF] : 17.d scrub ok
2015-01-13 21:39:39.293113 7fb29a7ce700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bo
ol)' thread 7fb29a7ce700 time 2015-01-13 21:39:39.279799
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid
>= scrubber.end)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int,
bool)+0x1320) [0x9296b0]
 2:
(ReplicatedPG::try_flush_mark_clean(boost::shared_ptrReplicatedPG::FlushOp)+0x5f6)
[0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296)
[0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9) [0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fb2b94ceb50]
 8: (clone()+0x6d) [0x7fb2b80dc7bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
  -127 2015-01-10 19:39:41.861724 7fb2b9faa780  5 asok(0x28e4230)
register_command perfcounters_dump hook 0x28d4010
  -126 2015-01-10 19:39:41.861749 7fb2b9faa780  5 asok(0x28e4230)
register_command 1 hook 0x28d4010
  -125 2015-01-10 19:39:41.861753 7fb2b9faa780  5 asok(0x28e4230)
register_command perf dump hook 0x28d4010
  -124 2015-01-10 19:39:41.861756 7fb2b9faa780  5 asok(0x28e4230)
register_command perfcounters_schema hook 0x28d4010
  -123 2015-01-10 19:39:41.861759 7fb2b9faa780  5 asok(0x28e4230)
register_command 2 hook 0x28d4010
  -122 2015-01-10 19:39:41.861762 7fb2b9faa780  

[ceph-users] ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)

2015-01-13 Thread Udo Lembke
Hi,
since last Thursday we have had an SSD pool (cache tier) in front of an
EC pool and have been filling the pools with data via rsync (approx. 50MB/s).
The SSD pool has three disks, and one of them (a DC S3700) has failed four
times since then.
I simply start the OSD again, the pool was rebuilt, and it works again
for some hours up to some days.

I switched the ceph node and the SAS adapter, but this didn't solve the
issue.
There weren't any messages in syslog/messages and an fsck ran without
trouble, so I guess the problem is not OS-related.

I found this issue http://tracker.ceph.com/issues/8747 but my
ceph version is newer (debian: ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3)),
and it looks like I can reproduce this issue within 1-3 days.

The OSD is ext4-formatted. All other OSDs (62) run without trouble.

# more ceph-osd.61.log
2015-01-13 16:29:26.494458 7fedf9a3d700  0 log [INF] : 17.0 scrub ok
2015-01-13 17:29:03.988530 7fedf823a700  0 log [INF] : 17.16 scrub ok
2015-01-13 17:30:31.901032 7fedf8a3b700  0 log [INF] : 17.18 scrub ok
2015-01-13 17:31:58.983736 7fedf823a700  0 log [INF] : 17.9 scrub ok
2015-01-13 17:32:30.780308 7fedf9a3d700  0 log [INF] : 17.c scrub ok
2015-01-13 17:32:33.311433 7fedf8a3b700  0 log [INF] : 17.11 scrub ok
2015-01-13 17:37:22.237214 7fedf9a3d700  0 log [INF] : 17.7 scrub ok
2015-01-13 20:15:07.874376 7fedf6236700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bo
ol)' thread 7fedf6236700 time 2015-01-13 20:15:07.853440
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid
>= scrubber.end)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int,
bool)+0x1320) [0x9296b0]
 2:
(ReplicatedPG::try_flush_mark_clean(boost::shared_ptrReplicatedPG::FlushOp)+0x5f6)
[0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296)
[0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9) [0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fee152f6b50]
 8: (clone()+0x6d) [0x7fee13f047bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
   -70 2015-01-11 19:54:47.962164 7fee15dd4780  5 asok(0x2f56230)
register_command perfcounters_dump hook 0x2f44010
   -69 2015-01-11 19:54:47.962190 7fee15dd4780  5 asok(0x2f56230)
register_command 1 hook 0x2f44010
   -68 2015-01-11 19:54:47.962195 7fee15dd4780  5 asok(0x2f56230)
register_command perf dump hook 0x2f44010
   -67 2015-01-11 19:54:47.962201 7fee15dd4780  5 asok(0x2f56230)
register_command perfcounters_schema hook 0x2f44010
   -66 2015-01-11 19:54:47.962203 7fee15dd4780  5 asok(0x2f56230)
register_command 2 hook 0x2f44010
   -65 2015-01-11 19:54:47.962207 7fee15dd4780  5 asok(0x2f56230)
register_command perf schema hook 0x2f44010
   -64 2015-01-11 19:54:47.962209 7fee15dd4780  5 asok(0x2f56230)
register_command config show hook 0x2f44010
   -63 2015-01-11 19:54:47.962214 7fee15dd4780  5 asok(0x2f56230)
register_command config set hook 0x2f44010
   -62 2015-01-11 19:54:47.962219 7fee15dd4780  5 asok(0x2f56230)
register_command config get hook 0x2f44010
   -61 2015-01-11 19:54:47.962223 7fee15dd4780  5 asok(0x2f56230)
register_command log flush hook 0x2f44010
   -60 2015-01-11 19:54:47.962226 7fee15dd4780  5 asok(0x2f56230)
register_command log dump hook 0x2f44010
   -59 2015-01-11 19:54:47.962229 7fee15dd4780  5 asok(0x2f56230)
register_command log reopen hook 0x2f44010
   -58 2015-01-11 19:54:47.965000 7fee15dd4780  0 ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 117
35
   -57 2015-01-11 19:54:47.967362 7fee15dd4780  1 finished
global_init_daemonize
   -56 2015-01-11 19:54:47.971666 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is suppo
rted and appears to work
   -55 2015-01-11 19:54:47.971682 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is disab
led via 'filestore fiemap' config option
   -54 2015-01-11 19:54:47.973281 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
syscall(SYS_syncfs, f
d) fully supported
   -53 2015-01-11 19:54:47.975393 7fee15dd4780  0
filestore(/var/lib/ceph/osd/ceph-61) limited size xattrs
   -52 2015-01-11 19:54:48.013905 7fee15dd4780  0
filestore(/var/lib/ceph/osd/ceph-61) mount: enabling WRITEAHEAD journal
mode: checkpoint
is not enabled
   -51 2015-01-11 19:54:49.245360 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is suppo
rted and appears to work
   -50 2015-01-11 19:54:49.245370 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is disab
led via 'filestore fiemap' config option
   -49 2015-01-11 19:54:49.247017 7fee15dd4780  0

Re: [ceph-users] backfill_toofull, but OSDs not full

2015-01-09 Thread Udo Lembke
Hi,
I had a similar effect two weeks ago - 1 PG in backfill_toofull, and after
reweighting and deleting there was enough free space, but the rebuild
process stopped after a while.

After stopping and starting ceph on the second node, the rebuild process ran
without trouble and the backfill_toofull was gone.

This happened with firefly.

Udo

On 09.01.2015 21:29, c3 wrote:
 In this case the root cause was half denied reservations.

 http://tracker.ceph.com/issues/9626

 This stopped backfills since, those listed as backfilling were
 actually half denied and doing nothing. The toofull status is not
 checked until a free backfill slot happens, so everything was just stuck.

 Interestingly, the toofull was created by other backfills which were
 not stoppped.
 http://tracker.ceph.com/issues/9594

 Quite the log jam to clear.


 Quoting Craig Lewis cle...@centraldesktop.com:

 What was the osd_backfill_full_ratio?  That's the config that controls
 backfill_toofull.  By default, it's 85%.  The mon_osd_*_ratio affect the
 ceph status.

 I've noticed that it takes a while for backfilling to restart after
 changing osd_backfill_full_ratio.  Backfilling usually restarts for
 me in
 10-15 minutes.  Some PGs will stay in that state until the cluster is
 nearly done recoverying.

 I've only seen backfill_toofull happen after the OSD exceeds the
 ratio (so
 it's reactive, not proactive).  Mine usually happen when I'm
 rebalancing a
 nearfull cluster, and an OSD backfills itself toofull.




 On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:

 Hi,

 I am wondering how a PG gets marked backfill_toofull.

 I reweighted several OSDs using ceph osd crush reweight. As
 expected, PG
 began moving around (backfilling).

 Some PGs got marked +backfilling (~10), some +wait_backfill (~100).

 But some are marked +backfill_toofull. My OSDs are between 25% and 72%
 full.

 Looking at ceph pg dump, I can find the backfill_toofull PGs and
 verified
 the OSDs involved are less than 72% full.

 Do backfill reservations include a size? Are these OSDs projected to be
 toofull, once the current backfilling complete? Some of the
 backfill_toofull and backfilling point to the same OSDs.

 I did adjust the full ratios, but that did not change the
 backfill_toofull
 status.
 ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
 ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Improving Performance with more OSD's?

2015-01-04 Thread Udo Lembke
Hi Lindsay,

On 05.01.2015 06:52, Lindsay Mathieson wrote:
 ...
 So two OSD Nodes had:
 - Samsung 840 EVO SSD for Op. Sys.
 - Intel 530 SSD for Journals (10GB Per OSD)
 - 3TB WD Red
 - 1 TB WD Blue
 - 1 TB WD Blue
 - Each disk weighted at 1.0
 - Primary affinity of the WD Red (slow) set to 0
the weight should reflect the size of the filesystem. With a weight of 1 for
all disks you run into trouble when your cluster fills up, because the
1TB disks are full before the 3TB disks!

You should have something like 0.9 for the 1TB and 2.82 for the 3TB
disks ( df -k | grep osd | awk '{print $2/(1024^3) }'  ).

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.90 released

2014-12-23 Thread Udo Lembke
Hi Sage,

On 23.12.2014 15:39, Sage Weil wrote:
...
 
 You can't reduce the PG count without creating new (smaller) pools 
 and migrating data. 
does this also work with the 'metadata' pool, or is this pool essential
for ceph?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any Good Ceph Web Interfaces?

2014-12-23 Thread Udo Lembke
Hi,
for monitoring only I use the Ceph Dashboard
https://github.com/Crapworks/ceph-dash/

For me it's a nice tool for a good overview - for administration I use
the CLI.
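
If you want to try it, it is basically a small Python/Flask app that you
clone and start on a host with a working ceph config/keyring - a rough
sketch, please check the project's README for the exact steps and port:

git clone https://github.com/Crapworks/ceph-dash.git
cd ceph-dash
./ceph-dash.py     # then point a browser at http://<host>:5000/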


Udo

On 23.12.2014 01:11, Tony wrote:
 Please don't mention calamari :-)

 The best web interface for ceph that actually works with RHEL6.6 

 Preferable something in repo and controls and monitors all other ceph
 osd, mon, etc.


 Take everything and live for the moment.




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see which crush tunables are active in a ceph-cluster?

2014-12-20 Thread Udo Lembke
Hi,
for information for other cepher...

I switched from "unknown" crush tunables to firefly and it took 6 hours
(30.853% degradation) to finish on our production cluster (5 nodes, 60
OSDs, 10GbE, 20% data used:  pgmap v35678572: 3904 pgs, 4 pools, 21947
GB data, 5489 kobjects).
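
For anyone wondering, the switch itself is just the following (it triggers
a large rebalance!), and you can check what is active afterwards:

ceph osd crush tunables firefly
ceph osd crush show-tunables -f json-pretty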

Should a chooseleaf_vary_r = 1 (from 0) take roughly the same time
to finish??


Regards

Udo

On 04.12.2014 14:09, Udo Lembke wrote:
 Hi,
 to answer myself.

 With ceph osd crush show-tunables I see a little bit more, but I don't
 know how far away from the firefly tunables the production cluster is.

 New testcluster with profile optimal:
 ceph osd crush show-tunables
 { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 1,
   profile: firefly,
   optimal_tunables: 1,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 1}

 the production cluster:
  ceph osd crush show-tunables
 { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 0,
   profile: unknown,
   optimal_tunables: 0,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 0}

 Does this look like argonaut or bobtail?

 And how should I proceed to update?
 Does it make sense to first go to profile bobtail and then to firefly?


 Regards

 Udo

 On 01.12.2014 17:39, Udo Lembke wrote:
 Hi all,
 http://ceph.com/docs/master/rados/operations/crush-map/#crush-tunables
 described how to set the tunables to legacy, argonaut, bobtail, firefly
 or optimal.

 But how can I see, which profile is active in an ceph-cluster?

 With ceph osd getcrushmap I don't get really much info
 (only tunable choose_local_tries 0
 tunable choose_local_fallback_tries 0
 tunable choose_total_tries 50)


 Udo

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see which crush tunables are active in a ceph-cluster?

2014-12-20 Thread Udo Lembke
Hi Craig,
right! I had also posted one mail in that thread.

My question was whether the whole step to chooseleaf_vary_r = 1 takes the
same amount of time as setting the tunables to firefly.

The funny thing: I just decompiled the crushmap to start with
chooseleaf_vary_r = 4 and saw that after the upgrade tonight
chooseleaf_vary_r is already at 1!
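
(For reference, the decompile/edit/recompile cycle is basically:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the "tunable chooseleaf_vary_r ..." line, then:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new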

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
...

ceph osd crush show-tunables -f json-pretty

{ choose_local_tries: 0,
  choose_local_fallback_tries: 0,
  choose_total_tries: 50,
  chooseleaf_descend_once: 1,
  profile: firefly,
  optimal_tunables: 1,
  legacy_tunables: 0,
  require_feature_tunables: 1,
  require_feature_tunables2: 1}


Udo

On 20.12.2014 17:53, Craig Lewis wrote:
 There was a tunables discussion on the ML a few months ago, with a lot
 of good suggestions.  Sage gave some suggestions on rolling out (and
 rolling back) chooseleaf_vary_r changes.  That reminds me... I
 intended to try those changes over the holidays...


 Found it; the subject was ceph osd crush tunables optimal AND add new
 OSD at the same time.


 On Sat, Dec 20, 2014 at 3:26 AM, Udo Lembke ulem...@polarzone.de wrote:

 Hi,
 for information for other cepher...

 I switched from unknown crush tunables to firefly and it's takes 6
 hour
 (30.853% degration) to finisched on our production-cluster (5
 Nodes, 60
 OSDs, 10GBE, 20% data used:  pgmap v35678572: 3904 pgs, 4 pools, 21947
 GB data, 5489 kobjects).

 Should an chooseleaf_vary_r 1 (from 0) take round about the same
 time
 to finished??


 Regards

 Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-18 Thread Udo Lembke
Hi Mark,

On 18.12.2014 07:15, Mark Kirkwood wrote:

 While you can't do much about the endurance lifetime being a bit low,
 you could possibly improve performance using a journal *file* that is
 located on the 840's (you'll need to symlink it - disclaimer - have
 not tried this myself, but will experiment if you are interested).
 Slightly different open() options are used in this case and these
 cheaper consumer SSD seem to work better with them.
I had the symlink-file method before (with different SSDs), but the
performance was much better after changing to partitions.
I first tried some different consumer SSDs with the journal as a file and
ended up now with DC S3700s and partitions.
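
Moving a journal from a file to a partition is roughly the following
(stop the OSD first; OSD id and partition are only examples):

ceph-osd -i 0 --flush-journal
ln -sf /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
# then start the OSD again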

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Any tuning of LVM-Storage inside an VM related to ceph?

2014-12-18 Thread Udo Lembke
Hi all,
I have some fileservers with insufficient read speed.
Enabling read-ahead inside the VM improves the read speed, but it looks
like this has a drawback during LVM operations like pvmove.

For test purposes I am moving the LVM storage inside a VM from vdb to vdc1.
It takes days, because it's 3TB of data.
After enabling read-ahead (echo 4096 >
/sys/block/vdb/queue/read_ahead_kb; echo 4096 >
/sys/block/vdc/queue/read_ahead_kb) the move speed drops noticeably!

Are there any tunings to improve the speed of LVM on rbd storage?
Perhaps, if using partitions, align the partitions to 4MB?

Any hints?
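
To be concrete about the alignment idea, I have something like this in
mind (untested sketch, device names are only examples):

parted -s /dev/vdc mklabel gpt
parted -s /dev/vdc mkpart primary 4MiB 100%   # partition starts at 4MiB
pvcreate --dataalignment 4M /dev/vdc1         # align LVM data to 4MB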


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Udo Lembke
Hi Lindsay,
have you tried the different cache options (no cache, writethrough,
...) that Proxmox offers for the drive?


Udo

On 18.12.2014 05:52, Lindsay Mathieson wrote:
 I've been experimenting with CephFS for running KVM images (proxmox).

 cephfs fuse version - 0.87

 cephfs kernel module - kernel version 3.10


 Part of my testing involves running a Windows 7 VM up and running
 CrystalDiskMark to check the I/O in the VM. Its surprisingly good with
 both the fuse and the kernel driver, seq reads  writes are actually
 faster than the underlying disk, so I presume the FS is aggressively
 caching.

 With the fuse driver I have no problems.

 With the kernel driver, the benchmark runs fine, but when I reboot the
 VM the drive is corrupted and unreadable, every time. Rolling back to
 a snapshot fixes the disk. This does not happen unless I run the
 benchmark, which I presume is writing a lot of data.

 No problems with the same test for Ceph rbd, or NFS.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-17 Thread Udo Lembke
Hi Mikaël,


 I have EVOs too, what do you mean by not playing well with D_SYNC?
 Is there something I can test on my side to compare results with you,
 as I have mine flashed?
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
describes how to test the SSD performance for a journal SSD (your SSD will
be overwritten!!).
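
The core of that test is direct, synchronous 4k writes, essentially
something like this (it destroys data on the device - sdX is a placeholder!):

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync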

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple issues :( Ubuntu 14.04, latest Ceph

2014-12-15 Thread Udo Lembke
Hi Benjamin,
On 15.12.2014 03:31, Benjamin wrote:
 Hey there,

 I've set up a small VirtualBox cluster of Ceph VMs. I have one
 ceph-admin0 node, and three ceph0,ceph1,ceph2 nodes for a total of 4.

 I've been following this
 guide: http://ceph.com/docs/master/start/quick-ceph-deploy/ to the letter.

 At the end of the guide, it calls for you to run ceph health... this
 is what happens when I do.

 HEALTH_ERR 64 pgs stale; 64 pgs stuck stale; 2 full osd(s); 2/2 in
 osds are down
hmm, why do you have only two OSDs with three nodes?

Can you post the output of the following commands?
ceph health detail
ceph osd tree
rados df
ceph osd pool get data size
ceph osd pool get rbd size
df -h # on all OSD-nodes

/etc/init.d/ceph start osd.0  # on node with osd.0
/etc/init.d/ceph start osd.1  # on node with osd.1


Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple issues :( Ubuntu 14.04, latest Ceph

2014-12-15 Thread Udo Lembke
Hi,
see here:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15546.html

Udo

On 16.12.2014 05:39, Benjamin wrote:
 I increased the OSDs to 10.5GB each and now I have a different issue...

 cephy@ceph-admin0:~/ceph-cluster$ echo {Test-data} > testfile.txt
 cephy@ceph-admin0:~/ceph-cluster$ rados put test-object-1 testfile.txt
 --pool=data
 error opening pool data: (2) No such file or directory
 cephy@ceph-admin0:~/ceph-cluster$ ceph osd lspools
 0 rbd,

 Here's ceph -w:
 cephy@ceph-admin0:~/ceph-cluster$ ceph -w
 cluster b3e15af-SNIP
  health HEALTH_WARN mon.ceph0 low disk space; mon.ceph1 low disk
 space; mon.ceph2 low disk space; clock skew detected on mon.ceph0,
 mon.ceph1, mon.ceph2
  monmap e3: 4 mons at
 {ceph-admin0=10.0.1.10:6789/0,ceph0=10.0.1.11:6789/0,ceph1=10.0.1.12:6789/0,ceph2=10.0.1.13:6789/0},
 election epoch 10, quorum 0,1,2,3 ceph-admin0,ceph0,ceph1,ceph2
  osdmap e17: 3 osds: 3 up, 3 in
   pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
 19781 MB used, 7050 MB / 28339 MB avail
   64 active+clean

 Any other commands to run that would be helpful? Is it safe to simply
 manually create the data and metadata pools myself?

 On Mon, Dec 15, 2014 at 5:07 PM, Benjamin zor...@gmail.com wrote:

 Aha, excellent suggestion! I'll try that as soon as I get back,
 thank you.
 - B

 On Dec 15, 2014 5:06 PM, Craig Lewis cle...@centraldesktop.com wrote:


 On Sun, Dec 14, 2014 at 6:31 PM, Benjamin zor...@gmail.com wrote:

 The machines each have Ubuntu 14.04 64-bit, with 1GB of
 RAM and 8GB of disk. They have between 10% and 30% disk
 utilization but common between all of them is that they
 *have free disk space* meaning I have no idea what the
 heck is causing Ceph to complain.


 Each OSD is 8GB?  You need to make them at least 10 GB.

 Ceph weights each disk as it's size in TiB, and it truncates
 to two decimal places.  So your 8 GiB disks have a weight of
 0.00.  Bump it up to 10 GiB, and it'll get a weight of 0.01.

 You should have 3 OSDs, one for each of ceph0,ceph1,ceph2.

 If that doesn't fix the problem, go ahead and post the things
 Udo mentioned.



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] For all LSI SAS9201-16i users - don't upgrade to firmware P20

2014-12-11 Thread Udo Lembke
Hi all,
I have upgraded two LSI SAS9201-16i HBAs to the latest firmware P20.00.00
and after that I got the following syslog messages:

Dec  9 18:11:31 ceph-03 kernel: [  484.602834] mpt2sas0: log_info(0x3108): 
originator(PL), code(0x08), sub_code(0x)
Dec  9 18:12:15 ceph-03 kernel: [  528.310174] mpt2sas0: log_info(0x3108): 
originator(PL), code(0x08), sub_code(0x)
Dec  9 18:15:25 ceph-03 kernel: [  718.782477] mpt2sas0: log_info(0x3108): 
originator(PL), code(0x08), sub_code(0x)

The next night one OSD went down (mounted read-only, and I had to repair the
filesystem with fsck) and then two other OSDs followed.

Then I changed the card, and after some tries I was able to downgrade* the
cards to P17, which runs stable.


Udo


* downgraded on a fourth computer booted into DOS, with sas2flsh -o -e 6...
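
The full downgrade sequence was roughly the following (the firmware/BIOS
file names depend on the board and the P17 package you download):

sas2flsh -o -e 6                                 # erase the flash - do not reboot now!
sas2flsh -o -f 9201-16i_P17.bin -b mptsas2.rom   # flash P17 firmware + BIOS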
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Old OSDs on new host, treated as new?

2014-12-05 Thread Udo Lembke
Hi,
perhaps a stupid question, but why do you change the hostname?

I haven't tried it, but I guess if you boot the node with a new hostname,
the old hostname stays in the crush map, but without any OSDs - because
they are now under the new host.
I don't know (I guess not) whether the degradation level also stays at 5%
if you delete the empty host from the crush map.
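
(Removing the now-empty host bucket afterwards would be something like:)

ceph osd crush remove <old-hostname>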

I would simply use the same host config on a rebuilt host.

Udo

On 03.12.2014 05:06, Indra Pramana wrote:
 Dear all,

 We have a Ceph cluster with several nodes, each node contains 4-6
 OSDs. We are running the OS off USB drive to maximise the use of the
 drive bays for the OSDs and so far everything is running fine.

 Occasionally, the OS running on the USB drive would fail, and we would
 normally replace the drive with a pre-configured similar OS and Ceph
 running, so when the new OS boots up, it will automatically detect all
 the OSDs and start them. It works fine without any issues.

 However, the issue is in recovery. When one node goes down, all the
 OSDs would be down and recovery will start to move the pg replicas on
 the affected OSDs to other available OSDs, and cause the Ceph to be
 degraded, say 5%, which is expected. However, when we boot up the
 failed node with a new OS, and bring back the OSDs up, more PGs are
 being scheduled for backfilling and instead of reducing, the
 degradation level will shoot up again to, for example, 10%, and in
 some occasion, it goes up to 19%.

 We had experience when one node is down, it will degraded to 5% and
 recovery will start, but when we manage to bring back up the node
 (still the same OS), the degradation level will reduce to below 1% and
 eventually recovery will be completed faster.

 Why doesn't the same behaviour apply in the above situation? The OSD
 numbers are the same when the node boots up, the crush map weight
 values are also the same. Only the hostname is different.

 Any advice / suggestions?

 Looking forward to your reply, thank you.

 Cheers.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to see which crush tunables are active in a ceph-cluster?

2014-12-01 Thread Udo Lembke
Hi all,
http://ceph.com/docs/master/rados/operations/crush-map/#crush-tunables
described how to set the tunables to legacy, argonaut, bobtail, firefly
or optimal.

But how can I see, which profile is active in an ceph-cluster?

With ceph osd getcrushmap I don't get really much info
(only tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50)


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-06 Thread Udo Lembke
Hi,
from one host to five OSD-hosts.

NIC Intel 82599EB; jumbo-frames; single Switch IBM G8124 (blade network).

rtt min/avg/max/mdev = 0.075/0.114/0.231/0.037 ms
rtt min/avg/max/mdev = 0.088/0.164/0.739/0.072 ms
rtt min/avg/max/mdev = 0.081/0.141/0.229/0.030 ms
rtt min/avg/max/mdev = 0.083/0.115/0.183/0.030 ms
rtt min/avg/max/mdev = 0.087/0.144/0.190/0.028 ms


Udo

On 06.11.2014 14:18, Wido den Hollander wrote:
 Hello,
 
 While working at a customer I've ran into a 10GbE latency which seems
 high to me.
 
 I have access to a couple of Ceph cluster and I ran a simple ping test:
 
 $ ping -s 8192 -c 100 -n ip
 
 Two results I got:
 
 rtt min/avg/max/mdev = 0.080/0.131/0.235/0.039 ms
 rtt min/avg/max/mdev = 0.128/0.168/0.226/0.023 ms
 
 Both these environment are running with Intel 82599ES 10Gbit cards in
 LACP. One with Extreme Networks switches, the other with Arista.
 
 Now, on a environment with Cisco Nexus 3000 and Nexus 7000 switches I'm
 seeing:
 
 rtt min/avg/max/mdev = 0.160/0.244/0.298/0.029 ms
 
 As you can see, the Cisco Nexus network has high latency compared to the
 other setup.
 
 You would say the switches are to blame, but we also tried with a direct
 TwinAx connection, but that didn't help.
 
 This setup also uses the Intel 82599ES cards, so the cards don't seem to
 be the problem.
 
 The MTU is set to 9000 on all these networks and cards.
 
 I was wondering, others with a Ceph cluster running on 10GbE, could you
 perform a simple network latency test like this? I'd like to compare the
 results.
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-06 Thread Udo Lembke
Hi,
no special optimizations on the host.
In this case the pings are from a Proxmox VE host to the ceph OSDs (ubuntu
+ debian).

The pings from one osd to the others are comparable.

Udo

On 06.11.2014 15:00, Irek Fasikhov wrote:
 Hi,Udo.
 Good value :)

 Did you do any additional optimization on the host?
 Thanks.

 Thu Nov 06 2014 at 16:57:36, Udo Lembke ulem...@polarzone.de:

 Hi,
 from one host to five OSD-hosts.

 NIC Intel 82599EB; jumbo-frames; single Switch IBM G8124 (blade
 network).

 rtt min/avg/max/mdev = 0.075/0.114/0.231/0.037 ms
 rtt min/avg/max/mdev = 0.088/0.164/0.739/0.072 ms
 rtt min/avg/max/mdev = 0.081/0.141/0.229/0.030 ms
 rtt min/avg/max/mdev = 0.083/0.115/0.183/0.030 ms
 rtt min/avg/max/mdev = 0.087/0.144/0.190/0.028 ms


 Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about activate OSD

2014-10-31 Thread Udo Lembke
Hi German,
if I'm right, the journal creation on /dev/sdc1 failed (perhaps because
you passed only /dev/sdc instead of /dev/sdc1 in the prepare step?).

Do you have partitions on sdc?
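
One way to get back to a clean state is to zap both disks so no stale
journal is left, and then run prepare again - an untested sketch based on
your own command line:

ceph-deploy disk zap ceph-bkp-osd01:sdf ceph-bkp-osd01:sdc
ceph-deploy --overwrite-conf disk prepare --fs-type btrfs ceph-bkp-osd01:sdf:/dev/sdc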


Udo

On 31.10.2014 22:02, German Anders wrote:
 Hi all,
   I'm having some issues while trying to activate a new osd in a
 new cluster, the prepare command run fine, but then the activate
 command failed:

 ceph@cephbkdeploy01:~/desp-bkp-cluster$ ceph-deploy --overwrite-conf
 disk prepare --fs-type btrfs ceph-bkp-osd01:sdf:/dev/sdc
 [ceph_deploy.cli][INFO  ] Invoked (1.4.0): /usr/bin/ceph-deploy
 --overwrite-conf disk prepare --fs-type btrfs ceph-bkp-osd01:sdf:/dev/sdc
 [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
 ceph-bkp-osd01:/dev/sdf:/dev/sdc
 [ceph-bkp-osd01][DEBUG ] connected to host: ceph-bkp-osd01
 [ceph-bkp-osd01][DEBUG ] detect platform information from remote host
 [ceph-bkp-osd01][DEBUG ] detect machine type
 [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
 [ceph_deploy.osd][DEBUG ] Deploying osd to ceph-bkp-osd01
 [ceph-bkp-osd01][DEBUG ] write cluster configuration to
 /etc/ceph/{cluster}.conf
 [ceph-bkp-osd01][INFO  ] Running command: sudo udevadm trigger
 --subsystem-match=block --action=add
 [ceph_deploy.osd][DEBUG ] Preparing host ceph-bkp-osd01 disk /dev/sdf
 journal /dev/sdc activate False
 [ceph-bkp-osd01][INFO  ] Running command: sudo ceph-disk-prepare
 --fs-type btrfs --cluster ceph -- /dev/sdf /dev/sdc
 [ceph-bkp-osd01][WARNIN] libust[13609/13609]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] libust[13627/13627]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] WARNING:ceph-disk:OSD will not be
 hot-swappable if journal is not the same device as the osd data
 [ceph-bkp-osd01][WARNIN] Turning ON incompat feature 'extref':
 increased hardlink limit per file to 65536
 [ceph-bkp-osd01][DEBUG ] Creating new GPT entries.
 [ceph-bkp-osd01][DEBUG ] The operation has completed successfully.
 [ceph-bkp-osd01][DEBUG ] Creating new GPT entries.
 [ceph-bkp-osd01][DEBUG ] The operation has completed successfully.
 [ceph-bkp-osd01][DEBUG ]
 [ceph-bkp-osd01][DEBUG ] WARNING! - Btrfs v3.12 IS EXPERIMENTAL
 [ceph-bkp-osd01][DEBUG ] WARNING! - see http://btrfs.wiki.kernel.org
 before using
 [ceph-bkp-osd01][DEBUG ]
 [ceph-bkp-osd01][DEBUG ] fs created label (null) on /dev/sdf1
 [ceph-bkp-osd01][DEBUG ] nodesize 32768 leafsize 32768 sectorsize
 4096 size 2.73TiB
 [ceph-bkp-osd01][DEBUG ] Btrfs v3.12
 [ceph-bkp-osd01][DEBUG ] The operation has completed successfully.
 [ceph_deploy.osd][DEBUG ] Host ceph-bkp-osd01 is now ready for osd use.
 ceph@cephbkdeploy01:~/desp-bkp-cluster$
 ceph@cephbkdeploy01:~/desp-bkp-cluster$ ceph-deploy --overwrite-conf
 disk activate --fs-type btrfs ceph-bkp-osd01:sdf1:/dev/sdc1
 [ceph_deploy.cli][INFO  ] Invoked (1.4.0): /usr/bin/ceph-deploy
 --overwrite-conf disk activate --fs-type btrfs
 ceph-bkp-osd01:sdf1:/dev/sdc1
 [ceph_deploy.osd][DEBUG ] Activating cluster ceph disks
 ceph-bkp-osd01:/dev/sdf1:/dev/sdc1
 [ceph-bkp-osd01][DEBUG ] connected to host: ceph-bkp-osd01
 [ceph-bkp-osd01][DEBUG ] detect platform information from remote host
 [ceph-bkp-osd01][DEBUG ] detect machine type
 [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
 [ceph_deploy.osd][DEBUG ] activating host ceph-bkp-osd01 disk /dev/sdf1
 [ceph_deploy.osd][DEBUG ] will use init type: upstart
 [ceph-bkp-osd01][INFO  ] Running command: sudo ceph-disk-activate
 --mark-init upstart --mount /dev/sdf1
 [ceph-bkp-osd01][WARNIN] libust[14025/14025]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] libust[14028/14028]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] got monmap epoch 1
 [ceph-bkp-osd01][WARNIN] libust[14059/14059]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936163 7ffb41d32900 -1
 journal FileJournal::_open: disabling aio for non-block journal.  Use
 journal_force_aio to force use of aio anyway
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936221 7ffb41d32900 -1
 journal check: ondisk fsid ----
 doesn't match expected 6a26ef1f-6ece-4383-8304-7a8d064ef2b4, invalid
 (someone else's?) journal
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936275 7ffb41d32900 -1
 filestore(/var/lib/ceph/tmp/mnt.vt_waK) mkjournal error creating
 journal on /var/lib/ceph/tmp/mnt.vt_waK/journal: (22) Invalid argument
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936310 7ffb41d32900 -1
 OSD::mkfs: ObjectStore::mkfs failed with 

Re: [ceph-users] Replacing a disk: Best practices?

2014-10-16 Thread Udo Lembke
On 15.10.2014 22:08, Iban Cabrillo wrote:
 HI Cephers,

  I have another question related to this issue: what would be the
 procedure to recover from a server failure (a whole server, for example
 due to motherboard trouble, with no damage to the disks)?

 Regards, I
 
Hi,
- change the server board.
- perhaps adapt /etc/udev/rules.d/70-persistent-net.rules (to get the
same device names (eth0/1...) for your network).
- boot and wait for the resync.

To avoid too much traffic I set noout if a whole server is lost.
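
For reference, that is simply:

ceph osd set noout      # while the server is down
ceph osd unset noout    # once it is back and resynced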


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [PG] Slow request *** seconds old,v4 currently waiting for pg to exist locally

2014-09-25 Thread Udo Lembke
Hi,
it looks like some OSDs are down?!

What is the output of ceph osd tree?

Udo

On 25.09.2014 04:29, Aegeaner wrote:
 The cluster healthy state is WARN:
 
  health HEALTH_WARN 118 pgs degraded; 8 pgs down; 59 pgs
 incomplete; 28 pgs peering; 292 pgs stale; 87 pgs stuck inactive;
 292 pgs stuck stale; 205 pgs stuck unclean; 22 requests are blocked
  > 32 sec; recovery 12474/46357 objects degraded (26.909%)
  monmap e3: 3 mons at
 
 {CVM-0-mon01=172.18.117.146:6789/0,CVM-0-mon02=172.18.117.152:6789/0,CVM-0-mon03=172.18.117.153:6789/0},
 election epoch 24, quorum 0,1,2 CVM-0-mon01,CVM-0-mon02,CVM-0-mon03
  osdmap e421: 9 osds: 9 up, 9 in
   pgmap v2261: 292 pgs, 4 pools, 91532 MB data, 23178 objects
 330 MB used, 3363 GB / 3363 GB avail
 12474/46357 objects degraded (26.909%)
   20 stale+peering
   87 stale+active+clean
8 stale+down+peering
   59 stale+incomplete
  118 stale+active+degraded
 
 
 What does these errors mean? Can these PGs be recovered?
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [PG] Slow request *** seconds old,v4 currently waiting for pg to exist locally

2014-09-25 Thread Udo Lembke
Hi again,
sorry - disregard my previous post... see

osdmap e421: 9 osds: 9 up, 9 in

which shows that all your 9 OSDs are up!

Do you have trouble with your journal/filesystem?
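
To dig further you could also query one of the stuck PGs directly, for
example:

ceph health detail | grep -E 'incomplete|down'
ceph pg <pgid> query    # use one of the stuck PG ids from ceph health detail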

Udo

On 25.09.2014 08:01, Udo Lembke wrote:
 Hi,
 it looks like some OSDs are down?!
 
 What is the output of ceph osd tree?
 
 Udo
 
 On 25.09.2014 04:29, Aegeaner wrote:
 The cluster healthy state is WARN:

  health HEALTH_WARN 118 pgs degraded; 8 pgs down; 59 pgs
 incomplete; 28 pgs peering; 292 pgs stale; 87 pgs stuck inactive;
 292 pgs stuck stale; 205 pgs stuck unclean; 22 requests are blocked
  > 32 sec; recovery 12474/46357 objects degraded (26.909%)
  monmap e3: 3 mons at
 
 {CVM-0-mon01=172.18.117.146:6789/0,CVM-0-mon02=172.18.117.152:6789/0,CVM-0-mon03=172.18.117.153:6789/0},
 election epoch 24, quorum 0,1,2 CVM-0-mon01,CVM-0-mon02,CVM-0-mon03
  osdmap e421: 9 osds: 9 up, 9 in
   pgmap v2261: 292 pgs, 4 pools, 91532 MB data, 23178 objects
 330 MB used, 3363 GB / 3363 GB avail
 12474/46357 objects degraded (26.909%)
   20 stale+peering
   87 stale+active+clean
8 stale+down+peering
   59 stale+incomplete
  118 stale+active+degraded


 What does these errors mean? Can these PGs be recovered?


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Newbie Ceph Design Questions

2014-09-22 Thread Udo Lembke
Hi Christian,

On 22.09.2014 05:36, Christian Balzer wrote:
 Hello,

 On Sun, 21 Sep 2014 21:00:48 +0200 Udo Lembke wrote:

 Hi Christian,

 On 21.09.2014 07:18, Christian Balzer wrote:
 ...
 Personally I found ext4 to be faster than XFS in nearly all use cases
 and the lack of full, real kernel integration of ZFS is something that
 doesn't appeal to me either.
 a little bit OT... what kind of ext4-mount options do you use?
 I have an 5-node cluster with xfs (60 osds), and perhaps the performance
 with ext4 would be better?!
 Hard to tell w/o testing your particular load, I/O patterns.

 When benchmarking directly with single disks or RAIDs it is fairly
 straightforward to see:
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028540.html

 Also note that the actual question has never been answered by the Ceph
 team, which is a shame as I venture that it would make things faster.
do you run your cluster without filestore_xattr_use_omap = true, or
with it (to be on the safe side), due to the missing answer??
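
(I.e. the question is whether you keep something like this in ceph.conf
for the ext4-backed OSDs:)

[osd]
filestore xattr use omap = true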

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Newbie Ceph Design Questions

2014-09-21 Thread Udo Lembke
Hi Christian,

On 21.09.2014 07:18, Christian Balzer wrote:
 ...
 Personally I found ext4 to be faster than XFS in nearly all use cases and
 the lack of full, real kernel integration of ZFS is something that doesn't
 appeal to me either.
a little bit OT... what kind of ext4 mount options do you use?
I have a 5-node cluster with XFS (60 OSDs), and perhaps the performance
with ext4 would be better?!
For XFS I use osd_mount_options_xfs =
rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M

regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] kvm guest with rbd-disks are unaccesible after app. 3h afterwards one OSD node fails

2014-09-01 Thread Udo Lembke
Hi list,
on the weekend one of five OSD nodes failed (hung with a kernel panic).
The cluster degraded (12 of 60 OSDs down), but our monitoring host sets
the noout flag in this case.

But around three hours later the KVM guests which use storage on the
ceph cluster (and do writes) became unaccessible. After restarting the
failed ceph node the ceph cluster was healthy again, but the VMs needed
to be restarted to work again.

In the ceph.conf I had defined osd_pool_default_min_size = 1,
therefore I don't understand why this happens.
Which parameter must be changed/set so that the KVM clients keep
working on the unhealthy cluster?

Ceph version is 0.72.2 - pool replication 2.


Thanks for a hint.

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-26 Thread Udo Lembke
Hi,
I don't see an improvement with tcp_window_scaling=0 with my configuration.
Rather the other way around: the iperf performance is much lower:

root@ceph-03:~# iperf -c 172.20.2.14

Client connecting to 172.20.2.14, TCP port 5001
TCP window size: 96.1 KByte (default)

[  3] local 172.20.2.13 port 50429 connected with 172.20.2.14 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  2.94 GBytes  2.52 Gbits/sec
root@ceph-03:~# sysctl -w net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_window_scaling = 1
root@ceph-03:~# iperf -c 172.20.2.14

Client connecting to 172.20.2.14, TCP port 5001
TCP window size:  192 KByte (default)

[  3] local 172.20.2.13 port 50431 connected with 172.20.2.14 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  11.4 GBytes  9.77 Gbits/sec

My kernels are 3.11 and 3.14, and the VM host has a patched RHEL kernel
2.6.32 - the iperf behaviour is the same across all kernels.

I switched back to net.ipv4.tcp_window_scaling=1.


Udo

On 24.07.2014 22:15, Jean-Tiare LE BIGOT wrote:
 What is your kernel version? On kernel >= 3.11 sysctl -w
 net.ipv4.tcp_window_scaling=0 seems to improve the situation a lot.
 It also helped a lot to mitigate processes going (and sticking) in 'D'
 state.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-24 Thread Udo Lembke
Hi Steve,
I'm also looking for improvements of single-thread-reads.

A little bit higher values (twice?) should be possible with your config.
I have 5 nodes with 60 4TB HDDs and got the following:
rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
Total time run:60.066934
Total reads made: 863
Read size:4194304
Bandwidth (MB/sec):57.469
Average Latency:   0.0695964
Max latency:   0.434677
Min latency:   0.016444

In my case I had some OSDs (XFS) with high fragmentation (20%).
Changing the mount options and defragmenting helped slightly.
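
Checking and defragmenting is basically (device/mountpoint are examples):

xfs_db -c frag -r /dev/sdd1          # report the fragmentation factor
xfs_fsr /var/lib/ceph/osd/ceph-12    # defragment the mounted filesystem
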
Performance changes:
[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
osd_op_threads = 4
osd_disk_threads = 4


But I expect much more speed for an single thread...

Udo

On 23.07.2014 22:13, Steve Anthony wrote:
 Ah, ok. That makes sense. With one concurrent operation I see numbers
 more in line with the read speeds I'm seeing from the filesystems on the
 rbd images.

 # rados -p bench bench 300 seq --no-cleanup -t 1
 Total time run:300.114589
 Total reads made: 2795
 Read size:4194304
 Bandwidth (MB/sec):37.252

 Average Latency:   0.10737
 Max latency:   0.968115
 Min latency:   0.039754

 # rados -p bench bench 300 rand --no-cleanup -t 1
 Total time run:300.164208
 Total reads made: 2996
 Read size:4194304
 Bandwidth (MB/sec):39.925

 Average Latency:   0.100183
 Max latency:   1.04772
 Min latency:   0.039584

 I really wish I could find my data on read speeds from a couple weeks
 ago. It's possible that they've always been in this range, but I
 remember one of my test users saturating his 1GbE link over NFS reading
 copying from the rbd client to his workstation. Of course, it's also
 possible that the data set he was using was cached in RAM when he was
 testing, masking the lower rbd speeds.

 It just seems counterintuitive to me that read speeds would be so much
 slower than writes at the filesystem layer in practice. With images in
 the 10-100TB range, reading data at 20-60MB/s isn't going to be
 pleasant. Can you suggest any tunables or other approaches to
 investigate to improve these speeds, or are they in line with what you'd
 expect? Thanks for your help!

 -Steve



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-24 Thread Udo Lembke
Hi again,
Forgot to say - I'm still on 0.72.2!

Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-14 Thread Udo Lembke
Hi,
which values are actually changed by ceph osd crush tunables optimal?

Is it perhaps possible to change some parameters on the weekends before
the upgrade runs, to gain more time?
(Depends on whether the parameters are available in 0.72...)

The warning says it can take days... we have a cluster with 5 storage
nodes and 12 4TB OSD disks each (60 OSDs), replica 2. The cluster is 60%
filled.
Network connection is 10Gb.
Does tunables optimal take one, two or more days in such a configuration?

Udo

On 14.07.2014 18:18, Sage Weil wrote:
 I've added some additional notes/warnings to the upgrade and release 
 notes:

  https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451

 If there is somewhere else where you think a warning flag would be useful, 
 let me know!

 Generally speaking, we want to be able to cope with huge data rebalances 
 without interrupting service.  It's an ongoing process of improving the 
 recovery vs client prioritization, though, and removing sources of 
 overhead related to rebalancing... and it's clearly not perfect yet. :/

 sage




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Generic Tuning parameters?

2014-06-28 Thread Udo Lembke
Hi Erich,
I'm also searching for improvements.
You should use the right mount options to prevent fragmentation (for XFS).

[osd]
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
osd_op_threads = 4
osd_disk_threads = 4

With 45 OSDs per node you need a powerful system... AFAIK 12 OSDs/node is
recommended.



You should think about what happens if one node dies... I use a
monitoring script which does a "ceph osd set noout" if more than N OSDs
are down.
Then I must decide whether it's faster to get the failed node back, or to
do a rebuild (normally the first choice).
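
A minimal sketch of that script (threshold, locking and error handling
are up to you):

#!/bin/sh
# count OSDs reported as down and set noout above a threshold
DOWN=$(ceph osd tree | grep -c ' down ')
if [ "$DOWN" -ge 3 ]; then
    ceph osd set noout
fi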

Udo

On 27.06.2014 20:00, Erich Weiler wrote:
 Hi Folks,

 We're going to spin up a ceph cluster with the following general specs:

 * Six 10Gb/s connected servers, each with 45 4TB disks in a JBOD

 * Each disk is an OSD, so 45 OSDs per server

 * So 45*6 = 270 OSDs total

 * Three separate, dedicated monitor nodes

 The files stored on this storage cluster will be large file, each file
 will be several GB in size at the minimum, with some files being over
 100GB.

 Generically, are there any tuning parameters out there that would be
 good to drop in for this hardware profile and file size?

 We plan on growing this filesystem as we go, to 10 servers, then 15,
 then 20, etc.

 Thanks a bunch for any hints!!

 cheers,
 erich
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve performance of ceph objcect storage cluster

2014-06-26 Thread Udo Lembke
Hi,

On 25.06.2014 16:48, Aronesty, Erik wrote:
 I'm assuming you're testing the speed of cephfs (the file system) and not 
 ceph object storage.

For my part I mean object storage (VM disks via rbd).

Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

