Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-19 Thread Udo Lembke
Hi,
if you add an SSD with a short lifetime on more than one server, you
can run into real trouble (data loss)!
Even if all other SSDs are enterprise grade.
Ceph mixes all data into PGs, which are spread over many disks - if one disk
fails, no problem, but if the next two fail shortly after that due to the high I/O
(recovery) you will have data loss.
But if you have only one node with consumer SSDs, the whole node can go
down without trouble...

I've tried consumer SSDs as journal a long time ago - it was a bad idea!
But these SSDs are cheap - buy one and do the I/O test.
If you monitor the lifetime (wear), it's perhaps possible for your setup.
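A sketch of how I would watch the wear (assuming smartmontools is installed;
the exact SMART attribute names differ per vendor/model):

smartctl -a /dev/sdX | egrep -i 'wear|percent.*used|total.*written'

If the wear counter climbs too fast for your write load, the disk is not worth it.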

Udo


On 19.12.19 at 20:20, Sinan Polat wrote:
> Hi all,
>
> Thanks for the replies. I am not worried about their lifetime. We will be 
> adding only 1 SSD disk per physical server. All SSD’s are enterprise drives. 
> If the added consumer grade disk will fail, no problem.
>
> I am more curious regarding their I/O performance. I do not want to have a 50% drop 
> in performance.
>
> So anyone any experience with 860 EVO or Crucial MX500 in a Ceph setup?
>
> Thanks!
>
>> On 19 Dec 2019 at 19:18, Mark Nelson  wrote the 
>> following:
>>
>> The way I try to look at this is:
>>
>>
>> 1) How much more do the enterprise grade drives cost?
>>
>> 2) What are the benefits? (Faster performance, longer life, etc)
>>
>> 3) How much does it cost to deal with downtime, diagnose issues, and replace 
>> malfunctioning hardware?
>>
>>
>> My personal take is that enterprise drives are usually worth it. There may 
>> be consumer grade drives that may be worth considering in very specific 
>> scenarios if they still have power loss protection and high write 
>> durability.  Even when I was in academia years ago with very limited 
>> budgets, we got burned with consumer grade SSDs to the point where we had to 
>> replace them all.  You have to be very careful and know exactly what you are 
>> buying.
>>
>>
>> Mark
>>
>>
>>> On 12/19/19 12:04 PM, jes...@krogh.cc wrote:
>>> I dont think “usually” is good enough in a production setup.
>>>
>>>
>>>
>>> Sent from myMail for iOS
>>>
>>>
>>> Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов 
>>> :
>>>
>>>Usually it doesn't, it only harms performance and probably SSD
>>>lifetime
>>>too
>>>
>>>> I would not be running ceph on ssds without powerloss protection. I
>>>> delivers a potential data loss scenario
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-16 Thread Udo Lembke
Hi,

On 16.07.2017 15:04, Phil Schwarz wrote:
> ...
> Same result, the OSD is known by the node, but not by the cluster.
> ...
Firewall? Or a mismatch in /etc/hosts or DNS??
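A few quick checks I would run from the new node (host names are only placeholders):

getent hosts pve-node1   # does name resolution match ceph.conf / the monmap?
ceph -s                  # does the node reach the mons at all?
iptables -L -n           # anything blocking port 6789 (mon) or 6800-7300 (OSDs)?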

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-15 Thread Udo Lembke
Hi,

On 15.07.2017 16:01, Phil Schwarz wrote:
> Hi,
> ...
>
> While investigating, i wondered about my config :
> Question relative to /etc/hosts file :
> Should i use private_replication_LAN Ip or public ones ?
Use the private_replication_LAN IPs!! And the pve-cluster should use another network
(separate NICs) if possible.
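As a sketch (the subnets are only examples), the matching split in ceph.conf would be:

[global]
 public network  = 192.168.0.0/24   # clients and mons
 cluster network = 10.10.10.0/24    # OSD replication traffic

and /etc/hosts should resolve the node names the same way on every host.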

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re-weight Entire Cluster?

2017-05-29 Thread Udo Lembke
Hi Mike,

On 30.05.2017 01:49, Mike Cave wrote:
>
> Greetings All,
>
>  
>
> I recently started working with our ceph cluster here and have been
> reading about weighting.
>
>  
>
> It appears the current best practice is to weight each OSD according
> to it’s size (3.64 for 4TB drive, 7.45 for 8TB drive, etc).
>
>  
>
> As it turns out, it was not configured this way at all; all of the
> OSDs are weighted at 1.
>
>  
>
> So my questions are:
>
>  
>
> Can we re-weight the entire cluster to 3.64 and then re-weight the 8TB
> drives afterwards at a slow rate which won’t impact performance?
>
> If we do an entire re-weight will we have any issues?
>
I would set osd_max_backfills + osd_recovery_max_active to 1 (with
injectargs) before starting the reweight, to minimize the impact on running
clients.
After setting all OSDs to 3.64 you can raise the weight of the 8TB drives one by
one.
Depending on your cluster/OSDs, it's perhaps a good idea to lower the
primary affinity of the 8TB drives during the reweight?! Otherwise you get
more reads from the (slower) 8TB drives.
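A sketch of the commands (the OSD numbers are examples; primary affinity may need
mon_osd_allow_primary_affinity enabled, depending on your version):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
ceph osd crush reweight osd.0 3.64     # repeat per 4TB OSD, one after the other
ceph osd primary-affinity osd.3 0.5    # example: fewer primary reads from an 8TB OSD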


> Would it be better to just reweight the 8TB drives to 2 gradually?
>
I would go for 3.64 - then you have the right settings if you init
further OSDs with ceph-deploy.

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to think a two different disk's technologies architecture

2017-03-23 Thread Udo Lembke
Hi,
ceph speeds up with more nodes and more OSDs - so go for 6 nodes with
mixed SSD+SATA.

Udo

On 23.03.2017 18:55, Alejandro Comisario wrote:
> Hi everyone!
> I have to install a ceph cluster (6 nodes) with two "flavors" of
> disks, 3 servers with SSD and 3 servers with SATA.
>
> Y will purchase 24 disks servers (the ones with sata with NVE SSD for
> the SATA journal)
> Processors will be 2 x E5-2620v4 with HT, and ram will be 20GB for the
> OS, and 1.3GB of ram per storage TB.
>
> The servers will have 2 x 10Gb bonding for public network and 2 x 10Gb
> for cluster network.
> My doubts resides, ar want to ask the community about experiences and
> pains and gains of choosing between.
>
> Option 1
> 3 x servers just for SSD
> 3 x servers jsut for SATA
>
> Option 2
> 6 x servers with 12 SSD and 12 SATA each
>
> Regarding crushmap configuration and rules everything is clear to make
> sure that two pools (poolSSD and poolSATA) uses the right disks.
>
> But, what about performance, maintenance, architecture scalability, etc ?
>
> thank you very much !
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-11 Thread Udo Lembke
Hi,

thanks for the useful infos.


On 11.03.2017 12:21, cephmailingl...@mosibi.nl wrote:
>
> Hello list,
>
> A week ago we upgraded our Ceph clusters from Hammer to Jewel and with
> this email we want to share our experiences.
>
> ...
>
>
> e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph
> ... the 'find' in step e found so much files that xargs (the shell)
> could not handle it (too many arguments). At that time we decided to
> keep the permissions on root in the upgrade phase.
>
>
Perhaps a "find /var/lib/ceph/ ! -uid 64045 -exec chown
ceph:ceph {} +" would do a better job?!

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Testing a node by fio - strange results to me

2017-01-22 Thread Udo Lembke
Hi,

I don't use MDS, but I think it's the same as with RBD - the read
data is cached on the OSD nodes.

The 4MB chunks of the 3G file fit completely into the page cache, the 320G file does not.
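To take the OSD page cache out of the equation, one could drop the caches on all
OSD nodes (and the client) between the runs, roughly:

sync; echo 3 > /proc/sys/vm/drop_caches   # as root, on every OSD node and the client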


Udo


On 18.01.2017 07:50, Ahmed Khuraidah wrote:
> Hello community,
>
> I need your help to understand a little bit more about current MDS
> architecture. 
> I have created one node CephFS deployment and tried to test it by fio.
> I have used two file sizes of 3G and 320G. My question is why I have
> around 1k+ IOps when perform random reading from 3G file into
> comparison to expected ~100 IOps from 320G. Could somebody clarify
> where is read buffer/caching performs here and how to control it?
>
> A little bit about setup - Ubuntu 14.04 server that consists Jewel
> based: one MON, one MDS (default parameters, except mds_log = false)
> and OSD using SATA drive (XFS) for placing data and SSD drive for
> journaling. No RAID controller and no pool tiering used
>
> Thanks
>  
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Udo Lembke
Hi Sam,

the web frontend of an external ceph-dash was interrupted until the node
was up again. The reboot took approx. 5 min.

But the ceph -w output showed IO again much sooner. I will look tomorrow
at the output again and create a ticket.


Thanks


Udo


On 12.01.2017 20:02, Samuel Just wrote:
> How long did it take for the cluster to recover?
> -Sam
>
> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
>> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>>> Hi all,
>>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>>> ceph-cluster. All nodes are mons and have two OSDs.
>>> During reboot of one node, ceph stucks longer than normaly and I look in the
>>> "ceph -w" output to find the reason.
>>>
>>> This is not the reason, but I'm wonder why "osd marked itself down" will not
>>> recognised by the mons:
>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>>> quorum 0,2
>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
>>> 0,2
>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
>>> wr, 15 op/s
>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
>>> wr, 12 op/s
>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>>> rd, 135 kB/s wr, 15 op/s
>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
>>> rd, 189 kB/s wr, 7 op/s
>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
>>> reporters from different host after 21.222945 >= grace 20.388836)
>>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
>>> reporters from different host after 21.222970 >= grace 20.388836)
>>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>>
>>> Why trust the mon not the osd? In this case the osdmap will be right app. 26
>>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>>
>>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> That's not what anybody intended to have happen. It's possible the
>> simultaneous loss of a monitor and the OSDs is triggering a case
>> that's not behaving correctly. Can you create a ticket at
>> tracker.ceph.com with your logs and what steps you took and symptoms
>> observed?
>> -Greg
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Udo Lembke
Hi,
but I assume you are also measuring cache in this scenario - the OSD nodes have
cached the writes in the file buffer
(due to this the latency should be very small).
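So before the read benchmark I would pre-fill the images and drop the caches
everywhere, along these lines:

dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0   # fill the images so reads hit real objects
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
sync; echo 3 > /proc/sys/vm/drop_caches         # as root, on the client and every OSD node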

Udo

On 12.12.2016 03:00, V Plus wrote:
> Thanks Somnath!
> As you recommended, I executed:
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
>
> Then the output results look more reasonable!
> Could you tell me why??
>
> Btw, the purpose of my run is to test the performance of rbd in ceph.
> Does my case mean that before every test, I have to "initialize" all
> the images???
>
> Great thanks!!
>
> On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy  > wrote:
>
> Fill up the image with big write (say 1M) first before reading and
> you should see sane throughput.
>
>  
>
> Thanks & Regards
>
> Somnath
>
> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *V Plus
> *Sent:* Sunday, December 11, 2016 5:44 PM
> *To:* ceph-users@lists.ceph.com 
> *Subject:* [ceph-users] Ceph performance is too good (impossible..)...
>
>  
>
> Hi Guys,
>
> we have a ceph cluster with 6 machines (6 OSD per host). 
>
> 1. I created 2 images in Ceph, and map them to another host A
> (*/outside /*the Ceph cluster). On host A, I
> got *//dev/rbd0/* and*/ /dev/rbd1/*.
>
> 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio
> job descriptions can be found below)
>
> */"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output
> b.txt  & wait"/*
>
> 3. After the test, in a.txt, we got */bw=1162.7MB/s/*, in b.txt,
> we get */bw=3579.6MB/s/*.
>
> The results do NOT make sense because there is only one NIC on
> host A, and its limit is 10 Gbps (1.25GB/s).
>
>  
>
> I suspect it is because of the cache setting.
>
> But I am sure that in file *//etc/ceph/ceph.conf/* on host A,I
> already added:
>
> */[client]/*
>
> */rbd cache = false/*
>
>  
>
> Could anyone give me a hint what is missing? why
>
> Thank you very much.
>
>  
>
> *fioA.job:*
>
> /[A]/
>
> /direct=1/
>
> /group_reporting=1/
>
> /unified_rw_reporting=1/
>
> /size=100%/
>
> /time_based=1/
>
> /filename=/dev/rbd0/
>
> /rw=read/
>
> /bs=4MB/
>
> /numjobs=16/
>
> /ramp_time=10/
>
> /runtime=20/
>
>  
>
> *fioB.job:*
>
> /[B]/
>
> /direct=1/
>
> /group_reporting=1/
>
> /unified_rw_reporting=1/
>
> /size=100%/
>
> /time_based=1/
>
> /filename=/dev/rbd1/
>
> /rw=read/
>
> /bs=4MB/
>
> /numjobs=16/
>
> /ramp_time=10/
>
> /runtime=20/
>
>  
>
> /Thanks.../
>
> PLEASE NOTE: The information contained in this electronic mail
> message is intended only for the use of the designated
> recipient(s) named above. If the reader of this message is not the
> intended recipient, you are hereby notified that you have received
> this message in error and that any review, dissemination,
> distribution, or copying of this message is strictly prohibited.
> If you have received this communication in error, please notify
> the sender by telephone or e-mail (as shown above) immediately and
> destroy any and all copies of this message in your possession
> (whether hard copies or electronically stored copies).
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Udo Lembke
Hi,

unfortunately there are no Debian Jessie packages...


I don't know why a recompile takes such a long time for ceph... I think
such an important fix should hit the repos faster.


Udo


On 09.12.2016 18:54, Francois Lafont wrote:
> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
>
>> Sounds great.  May I asked what procedure you did to upgrade?
> Of course. ;)
>
> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
> (I think this link was pointed by Greg Farnum or Sage Weil in a
> previous message).
>
> Personally I use Ubuntu Trusty, so for me in the page above leads me
> to use this line in my "sources.list":
>
> deb 
> http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/
>  trusty main
>
> And after that "apt-get update && apt-get upgrade" etc.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help needed ! cluster unstable after upgrade from Hammer to Jewel

2016-11-16 Thread Udo Lembke
Hi,


On 16.11.2016 19:01, Vincent Godin wrote:
> Hello,
>
> We now have a full cluster (Mon, OSD & Clients) in jewel 10.2.2
> (initial was hammer 0.94.5) but we have still some big problems on our
> production environment :
>
>   * some ceph filesystem are not mounted at startup and we have to
> mount them with the "/bin/sh -c 'flock /var/lock/ceph-disk
> /usr/sbin/ceph-disk --verbose --log-stdout trigger --syn /dev/vdX1'"
>
vdX1?? This sounds like you are running ceph inside a virtualized system?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Udo Lembke
Hi again,

and you can change the value with something like this:

ceph tell osd.* injectargs '--mon_osd_full_ratio 0.96'

Udo

On 01.11.2016 21:16, Udo Lembke wrote:
> Hi Marcus,
>
> for a fast help you can perhaps increase the mon_osd_full_ratio?
>
> What values do you have?
> Please post the output of (on host ceph1, because osd.0.asok)
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
> full_ratio
>
> after that it would be helpfull to use on all hosts 2 OSDs...
>
>
> Udo
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Udo Lembke
Hi Marcus,

for quick help you can perhaps increase the mon_osd_full_ratio?

What values do you have?
Please post the output of (on host ceph1, because osd.0.asok)

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
full_ratio

after that it would be helpful to use 2 OSDs on all hosts...


Udo


On 01.11.2016 20:14, Marcus Müller wrote:
> Hi all,
>
> i have a big problem and i really hope someone can help me!
>
> We are running a ceph cluster since a year now. Version is: 0.94.7
> (Hammer)
> Here is some info:
>
> Our osd map is:
>
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 26.67998 root default 
> -2  3.64000 host ceph1   
>  0  3.64000 osd.0   up  1.0  1.0 
> -3  3.5 host ceph2   
>  1  3.5 osd.1   up  1.0  1.0 
> -4  3.64000 host ceph3   
>  2  3.64000 osd.2   up  1.0  1.0 
> -5 15.89998 host ceph4   
>  3  4.0 osd.3   up  1.0  1.0 
>  4  3.5 osd.4   up  1.0  1.0 
>  5  3.2 osd.5   up  1.0  1.0 
>  6  5.0 osd.6   up  1.0  1.0 
>
> ceph df:
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED 
> 40972G 26821G   14151G 34.54 
> POOLS:
> NAMEID USED  %USED MAX AVAIL OBJECTS 
> blocks  7  4490G 10.96 1237G 7037004 
> commits 8   473M 0 1237G  802353 
> fs  9  9666M  0.02 1237G 7863422 
>
> ceph osd df:
>
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
>  0 3.64000  1.0  3724G  3128G   595G 84.01 2.43 
>  1 3.5  1.0  3724G  3237G   487G 86.92 2.52 
>  2 3.64000  1.0  3724G  3180G   543G 85.41 2.47 
>  3 4.0  1.0  7450G  1616G  5833G 21.70 0.63 
>  4 3.5  1.0  7450G  1246G  6203G 16.74 0.48 
>  5 3.2  1.0  7450G  1181G  6268G 15.86 0.46 
>  6 5.0  1.0  7450G   560G  6889G  7.52 0.22 
>   TOTAL 40972G 14151G 26820G 34.54  
> MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53
>
>
> Our current cluster state is: 
>
>  health HEALTH_WARN
> 63 pgs backfill
> 8 pgs backfill_toofull
> 9 pgs backfilling
> 11 pgs degraded
> 1 pgs recovering
> 10 pgs recovery_wait
> 11 pgs stuck degraded
> 89 pgs stuck unclean
> recovery 8237/52179437 objects degraded (0.016%)
> recovery 9620295/52179437 objects misplaced (18.437%)
> 2 near full osd(s)
> noout,noscrub,nodeep-scrub flag(s) set
>  monmap e8: 4 mons at
> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}
> election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
>  osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
> flags noout,noscrub,nodeep-scrub
>   pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
> 14152 GB used, 26820 GB / 40972 GB avail
> 8237/52179437 objects degraded (0.016%)
> 9620295/52179437 objects misplaced (18.437%)
>  231 active+clean
>   61 active+remapped+wait_backfill
>9 active+remapped+backfilling
>6 active+recovery_wait+degraded+remapped
>6 active+remapped+backfill_toofull
>4 active+recovery_wait+degraded
>2 active+remapped+wait_backfill+backfill_toofull
>1 active+recovering+degraded
> recovery io 11754 kB/s, 35 objects/s
>   client io 1748 kB/s rd, 249 kB/s wr, 44 op/s
>
>
> My main problems are: 
>
> - As you can see from the osd tree, we have three separate hosts with
> only one osd each. Another one has four osds. Ceph allows me not to
> get data back from these three nodes with only one HDD, which are all
> near full. I tried to set the weight of the osds in the bigger node
> higher but this just does not work. So i added a new osd yesterday
> which made things not better, as you can see now. What do i have to do
> to just become these three nodes empty again and put more data on the
> other node with the four HDDs.
>
> - I added the „ceph4“ node later, this resulted in a strange ip change
> as you can see in the mon list. The public network and the cluster
> network were swapped or not assigned right. See ceph.conf
>
> [global]
> fsid = xxx
> mon_initial_members = ceph1
> mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> 

Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread Udo Lembke
Hi Vincent,

On 12.07.2016 15:03, Vincent Godin wrote:
> Hello.
>
> I've been testing Intel 3500 as journal store for few HDD-based OSD. I
> stumble on issues with multiple partitions (>4) and UDEV (sda5, sda6,etc
> sometime do not appear after partition creation). And I'm thinking that
> partition is not that useful for OSD management, because linux do no
> allow partition rereading with it contains used volumes.
>
> So my question: How you store many journals on SSD? My initial thoughts:
>
> 1)  filesystem with filebased journals
> 2) LVM with volumes
1+2 have a performance impact.
I use a trick with partition labels for the journal:
[osd]
osd_journal = /dev/disk/by-partlabel/journal-$id

Due to this I'm independent of the Linux device naming.
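For example (a sketch; disk, partition number and size are placeholders), such a
labelled journal partition can be created with sgdisk:

sgdisk --new=5:0:+10G --change-name=5:journal-5 /dev/sdX   # 10G GPT partition #5, labelled journal-5
partprobe /dev/sdX
ls -l /dev/disk/by-partlabel/                              # the label should show up here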


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph storage capacity does not free when deleting contents from RBD volumes

2016-05-19 Thread Udo Lembke
Hi Albert,
to free unused space you must enable trim (or run fstrim) in the VM -
and everything in the storage chain must support this.
The normal virtio driver (virtio-blk) doesn't support trim, but if you use scsi disks
with the virtio-scsi driver you can use it.
It works well but needs some time for huge filesystems.
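As a sketch (pool/VM names are examples; on Proxmox the disk must be attached as a
scsi device so virtio-scsi is used, with discard enabled):

scsi0: ceph_pool:vm-100-disk-1,discard=on,cache=writeback,size=100G

and inside the guest afterwards:

fstrim -v /mountpoint   # or mount the filesystem with the 'discard' option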

Udo

On 19.05.2016 19:58, Albert Archer wrote:
> Hello All.
> I am newbie in ceph. and i use jewel release for testing purpose. it
> seems every thing is OK, HEALTH_OK , all of OSDs are in UP and IN state.
> I create some RBD images (rbd create  ) and map to some ubuntu
> host . 
> I can read and write data to my volume , but when i delete some content
> from volume (e,g some huge files,...), populated capacity of cluster
> does not free and None of objects were clean.
> what is the problem ???
>
> Regards 
> Albert
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-25 Thread Udo Lembke

Hi Mike,

On 21.04.2016 at 15:20, Mike Miller wrote:

Hi Udo,

thanks, just to make sure, further increased the readahead:

$ sudo blockdev --getra /dev/rbd0
1048576

$ cat /sys/block/rbd0/queue/read_ahead_kb
524288

No difference here. First one is sectors (512 bytes), second one KB.

Oops, sorry! My fault. Sectors vs. KB - that makes sense...


The second read (after drop cache) is somewhat faster (10%-20%) but 
not much.
That's very strange! Looks like there is room for tuning. Do your OSD nodes have 
enough RAM? Are they very busy?


If I do single-threaded reading on a test VM I get the following results (very 
small test cluster - 2 nodes with a 10Gb NIC and one node with a 1Gb NIC):

support@upgrade-test:~/fio$ dd if=fiojo.0.0 of=/dev/null bs=1M
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 62.0267 s, 69.2 MB/s

### as root "echo 3 > /proc/sys/vm/drop_caches" and the same on the VM-host

support@upgrade-test:~/fio$ dd if=fiojo.0.0 of=/dev/null bs=1M
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 30.0987 s, 143 MB/s

# this is due to the cached data on the OSD nodes
# with cleared caches on all nodes (VM, VM host, OSD nodes)
# I get the same value as on the first run:

support@upgrade-test:~/fio$ dd if=fiojo.0.0 of=/dev/null bs=1M
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 61.8995 s, 69.4 MB/s

I don't know why this should not be the same with krbd.


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-21 Thread Udo Lembke

Hi Mike,

On 21.04.2016 at 09:07, Mike Miller wrote:

Hi Nick and Udo,

thanks, very helpful, I tweaked some of the config parameters along 
the line Udo suggests, but still only some 80 MB/s or so.
this means you have reached factor 3 (which is roughly the value I 
see with a single thread on RBD too). Better than nothing.




Kernel 4.3.4 running on the client machine and comfortable readahead 
configured


$ sudo blockdev --getra /dev/rbd0
262144

Still not more than about 80-90 MB/s.

there are two places where read-ahead can be set.
Take a look here (and change it with echo):
cat /sys/block/rbd0/queue/read_ahead_kb

Perhaps there are slight differences?
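The two knobs side by side (the values are just an example - both set 128 MB here):

blockdev --setra 262144 /dev/rbd0                  # in 512-byte sectors
echo 131072 > /sys/block/rbd0/queue/read_ahead_kb  # in KB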



For writing the parallelization is amazing and I see very impressive 
speeds, but why is reading performance so much behind? Why is it not 
parallelized the same way writing is? Is this something coming up in 
the jewel release? Or is it planned further down the road?
If you read a big file and clear your cache ("echo 3 > 
/proc/sys/vm/drop_caches") on the client, is the second read very fast? 
I assume yes.
In this case the read data is in the cache on the OSD nodes... so there 
must be room for tuning (and I'm very interested in improvements).


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Howto reduce the impact from cephx with small IO

2016-04-21 Thread Udo Lembke

Hi Mark,
thanks for the links.

If I search for wip-auth I find nothing on docs.ceph.com... does this mean 
that wip-auth didn't find its way into the ceph code base?!


But I'm wondering about the RHEL7 entry at the link 
http://www.spinics.net/lists/ceph-devel/msg22416.html

Unfortunately there are no values for RHEL7 with auth...
But is it known on which side (or by how many percent) the bottleneck for 
cephx is (client, mon, osd)? My clients (qemu on proxmox-ve) are not 
changeable, but my OSDs could also run on RHEL7/CentOS if this brings a 
performance boost. The mons are currently running on the proxmox-ve hosts.


Udo


On 20.04.2016 at 19:13, Mark Nelson wrote:

Hi Udo,

There was quite a bit of discussion and some partial improvements to 
cephx performance about a year ago.  You can see some of the 
discussion here:


http://www.spinics.net/lists/ceph-devel/msg3.html

and in particular these tests:

http://www.spinics.net/lists/ceph-devel/msg22416.html

Mark

On 04/20/2016 11:50 AM, Udo Lembke wrote:

Hi,
on an small test-system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I
compare with and without cephx.

I use fio for that inside an VM on an host, outside the 3 ceph-nodes,
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G
--direct=1 --name=fiojob_4k
All test are run three times (after clearing caches) and I take the
average (but the values are very close together).

cephx or not don't matter for an big blocksize of 4M - but for 4k!

If I disable cephx I got:
7040kB/s bandwith
1759IOPS
564µS clat

The same config, but with cephx I see this values:
4265 kB/s bandwith
1066 IOPS
933µS clat

This shows, that the performance drop by 40% with cephx!!

To disable cephx is no alternative, because any system which have access
to the ceph-network can read/write all data...

ceph.conf without cephx:
[global]
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
  cephx_sign_messages = false
  cephx_require_signatures = false
  #
  cluster network =...

ceph.conf with cephx:
[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  #
  cluster network =...

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Howto reduce the impact from cephx with small IO

2016-04-20 Thread Udo Lembke

Hi,
on a small test system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I 
compared with and without cephx.


I use fio for that inside a VM on a host outside the 3 ceph nodes, 
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G 
--direct=1 --name=fiojob_4k
All tests are run three times (after clearing caches) and I take the 
average (the values are very close together).


cephx or not doesn't matter for a big blocksize of 4M - but it does for 4k!

If I disable cephx I get:
7040 kB/s bandwidth
1759 IOPS
564 µs clat

The same config, but with cephx I see these values:
4265 kB/s bandwidth
1066 IOPS
933 µs clat

This shows that the performance drops by 40% with cephx!!

Disabling cephx is not an option, because any system which has access 
to the ceph network could then read/write all data...


ceph.conf without cephx:
[global]
 auth_cluster_required = none
 auth_service_required = none
 auth_client_required = none
 cephx_sign_messages = false
 cephx_require_signatures = false
 #
 cluster network =...

ceph.conf with cephx:
[global]
 auth client required = cephx
 auth cluster required = cephx
 auth service required = cephx
 #
 cluster network =...
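A middle ground I could still test (my assumption: keep authentication, but skip the
per-message signatures, which trades some integrity checking for speed):

[global]
 auth client required = cephx
 auth cluster required = cephx
 auth service required = cephx
 cephx sign messages = false
 cephx require signatures = false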

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-20 Thread Udo Lembke
Hi Mike,
I don't have experience with RBD mounts, but I see the same effect with RBD.

You can do some tuning to get better results (disable debug and so on).

As a hint, some values from a ceph.conf:
[osd]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 filestore_op_threads = 4
 osd max backfills = 1
 osd mount options xfs =
"rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
 osd mkfs options xfs = "-f -i size=2048"
 osd recovery max active = 1
 osd_disk_thread_ioprio_class = idle
 osd_disk_thread_ioprio_priority = 7
 osd_disk_threads = 1
 osd_enable_op_tracker = false
 osd_op_num_shards = 10
 osd_op_num_threads_per_shard = 1
 osd_op_threads = 4
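Most of the debug settings can also be tried at runtime without restarting the OSDs,
e.g. (a sketch for a few of them):

ceph tell osd.* injectargs '--debug-ms 0/0 --debug-osd 0/0 --debug-filestore 0/0'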

Udo

On 19.04.2016 11:21, Mike Miller wrote:
> Hi,
>
> RBD mount
> ceph v0.94.5
> 6 OSD with 9 HDD each
> 10 GBit/s public and private networks
> 3 MON nodes 1Gbit/s network
>
> A rbd mounted with btrfs filesystem format performs really badly when
> reading. Tried readahead in all combinations but that does not help in
> any way.
>
> Write rates are very good in excess of 600 MB/s up to 1200 MB/s,
> average 800 MB/s
> Read rates on the same mounted rbd are about 10-30 MB/s !?
>
> Of course, both writes and reads are from a single client machine with
> a single write/read command. So I am looking at single threaded
> performance.
> Actually, I was hoping to see at least 200-300 MB/s when reading, but
> I am seeing 10% of that at best.
>
> Thanks for your help.
>
> Mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Udo Lembke

Hi Sage,
we run ext4 only, on our 8-node cluster with 110 OSDs, and are quite happy 
with ext4.

We started with xfs but the latency was much higher compared to ext4...

But we use RBD only, with "short" filenames like 
rbd_data.335986e2ae8944a.000761e1.
If we can switch from Jewel to K* and change the filestore of each OSD to 
BlueStore during the update, it will be OK for us.

I hope we will then get better performance with BlueStore??
Will BlueStore be production ready during the Jewel lifetime, so that we 
can switch to BlueStore before the next big upgrade?



Udo

On 11.04.2016 at 23:39, Sage Weil wrote:

Hi,

ext4 has never been recommended, but we did test it.  After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.

Why:

Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStores filename
handling.  (There is a limit in the amount of xattr data ext4 can store in
the inode, which causes problems in LFNIndex.)

We *could* invest a ton of time rewriting this to fix, but it only affects
ext4, which we never recommended, and we plan to deprecate FileStore once
BlueStore is stable anyway, so it seems like a waste of time that would be
better spent elsewhere.

Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on BlueStore.

The long file name handling is problematic anytime someone is storing
rados objects with long names.  The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS.  Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.

How:

To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len).  The OSD will complain that ext4
cannot store such an object and refuse to start.  A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully.  They would be taking a risk, though, because we would like
to stop testing on ext4.

Is this reasonable?  If there significant ext4 users that are unwilling to
recreate their OSDs, now would be the time to speak up.

Thanks!
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-02-25 Thread Udo Lembke
Hi,

On 24.02.2016 at 17:27, Alfredo Deza wrote:
> On Wed, Feb 24, 2016 at 4:31 AM, Dan van der Ster  wrote:
>> Thanks Sage, looking forward to some scrub randomization.
>>
>> Were binaries built for el6? http://download.ceph.com/rpm-hammer/el6/x86_64/
> 
> We are no longer building binaries for el6. Just for Centos 7, Ubuntu
> Trusty, and Debian Jessie.
> 
this means that our proxmox-ve 3.4 servers, which run Debian wheezy, cannot 
be updated from ceph 0.94.5 to 0.94.6!
The OSD nodes run wheezy too - they can be upgraded. But the mons must be 
upgraded as well (first).

I can understand that newer versions are not supplied for an older OS, but 
stopping between minor.5 and minor.6 really makes no
sense to me.

Of course I can update to proxmox-ve 4.x, which is jessie based, but in that 
case I have trouble with DRBD...


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Udo Lembke
Hi,
I have done the test again in a cleaner way.

Same pool, same VM, different hosts (qemu 2.4 + qemu 2.2) but same hardware.
But only one run!

The biggest difference is due to the cache settings:

qemu2.4 cache=writethrough  iops=3823 bw=15294KB/s
qemu2.4 cache=writeback  iops=8837 bw=35348KB/s
qemu2.2 cache=writethrough  iops=2996 bw=11988KB/s
qemu2.2 cache=writeback  iops=7980 bw=31921KB/s

iothread doesn't change anything, because only one disk is used.

Test:
fio --time_based --name=benchmark --size=4G --filename=test.bin
--ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1
--verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
--group_reporting


Udo

On 22.11.2015 23:59, Udo Lembke wrote:
> Hi Zoltan,
> you are right ( but this was two running systems...).
>
> I see also an big failure: "--filename=/mnt/test.bin" (use simply
> copy/paste without to much thinking :-( )
> The root filesystem is not on ceph (on both servers).
> So my measurements are not valid!!
>
> I would do the measurements clean tomorow.
>
>
> Udo
>
>
> On 22.11.2015 14:29, Zoltan Arnold Nagy wrote:
>> It would have been more interesting if you had tweaked only one
>> option as now we can’t be sure which changed had what impact… :-)
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Udo Lembke
Hi Zoltan,
you are right (but these were two running systems...).

I also see a big mistake: "--filename=/mnt/test.bin" (simple
copy/paste without too much thinking :-( )
The root filesystem is not on ceph (on both servers).
So my measurements are not valid!!

I will redo the measurements cleanly tomorrow.


Udo


On 22.11.2015 14:29, Zoltan Arnold Nagy wrote:
> It would have been more interesting if you had tweaked only one option
> as now we can’t be sure which changed had what impact… :-)
>
>> On 22 Nov 2015, at 04:29, Udo Lembke <ulem...@polarzone.de
>> <mailto:ulem...@polarzone.de>> wrote:
>>
>> Hi Sean,
>> Haomai is right, that qemu can have a huge performance differences.
>>
>> I have done two test to the same ceph-cluster (different pools, but
>> this should not do any differences).
>> One test with proxmox ve 4 (qemu 2.4, iothread for device, and
>> cache=writeback) gives 14856 iops
>> Same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives
>> 5070 iops only.
>>
>> Here the results in long:
>> ### proxmox ve 3.x ###
>> kvm --version
>> QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard
>>
>> VM:
>> virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G
>>
>> root@fileserver:/daten/support/test# fio --time_based
>> --name=benchmark --size=4G --filename=/mnt/test.bin --ioengine=libaio
>> --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0
>> --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
>> --group_reporting
>> fio: time_based requires a runtime/timeout setting
>> benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=libaio, iodepth=128
>> ...
>> fio-2.1.11
>> Starting 4 processes
>> benchmark: Laying out IO file(s) (1 file(s) / 4096MB)
>> Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s]
>> [0/10.6K/0 iops] [eta 00m:00s]
>> benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47
>> 2015
>>   write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec
>> slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26
>> clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17
>>  lat (msec): min=1, max=2755, avg=100.76, stdev=147.54
>> clat percentiles (msec):
>>  |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19], 20.00th=[  
>> 28],
>>  | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51], 60.00th=[  
>> 63],
>>  | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[ 
>> 367],
>>  | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[
>> 1713],
>>  | 99.99th=[ 2573]
>> bw (KB  /s): min=4, max=30726, per=26.90%, avg=5456.84,
>> stdev=3014.45
>> lat (usec) : 750=0.01%, 1000=0.01%
>> lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%
>> lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55%
>> lat (msec) : 2000=0.29%, >=2000=0.03%
>>   cpu  : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30
>>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>> >=64=100.0%
>>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >=64=0.0%
>>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >=64=0.1%
>>  issued: total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
>>  latency   : target=0, window=0, percentile=100.00%, depth=128
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s,
>> mint=827178msec, maxt=827178msec
>>
>> Disk stats (read/write):
>> dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824,
>> in_queue=105927128, util=100.00%, aggrios=1/4469640,
>> aggrmerge=0/14788, aggrticks=64/103711096, aggrin_queue=104165356,
>> aggrutil=100.00%
>>   vda: ios=1/4469640, merge=0/14788, ticks=64/103711096,
>> in_queue=104165356, util=100.00%
>>
>> ##
>>
>> ### proxmox ve 4.x ###
>> kvm --version
>> QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c)
>> 2003-2008 Fabrice Bellard
>>
>> grep ceph /etc/pve/qemu-server/102.conf
>> virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G
>>
>> root@fileserver-test:/daten/tv01/test# fio --time_based
>> --name=benchmark --size=4G --filename=/mnt/test.bin --ioengine=libaio
>> --randrepeat=0 --iodepth=128 --direct=1 --invalidate

Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-21 Thread Udo Lembke
Hi Sean,
Haomai is right that qemu can make a huge performance difference.

I have done two tests against the same ceph cluster (different pools, but this
should not make any difference).
One test with proxmox ve 4 (qemu 2.4, iothread for the device, and
cache=writeback) gives 14856 iops.
The same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives
only 5070 iops.

Here the results in long:
### proxmox ve 3.x ###
kvm --version
QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard

VM:
virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G

root@fileserver:/daten/support/test# fio --time_based --name=benchmark
--size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0
--iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0
--numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
fio: time_based requires a runtime/timeout setting
benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
...
fio-2.1.11
Starting 4 processes
benchmark: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s]
[0/10.6K/0 iops] [eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47 2015
  write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec
slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26
clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17
 lat (msec): min=1, max=2755, avg=100.76, stdev=147.54
clat percentiles (msec):
 |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19], 20.00th=[   28],
 | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51], 60.00th=[   63],
 | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[  367],
 | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[ 1713],
 | 99.99th=[ 2573]
bw (KB  /s): min=4, max=30726, per=26.90%, avg=5456.84,
stdev=3014.45
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%
lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55%
lat (msec) : 2000=0.29%, >=2000=0.03%
  cpu  : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
 issued: total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s,
mint=827178msec, maxt=827178msec

Disk stats (read/write):
dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824,
in_queue=105927128, util=100.00%, aggrios=1/4469640, aggrmerge=0/14788,
aggrticks=64/103711096, aggrin_queue=104165356, aggrutil=100.00%
  vda: ios=1/4469640, merge=0/14788, ticks=64/103711096,
in_queue=104165356, util=100.00%

##

### proxmox ve 4.x ###
kvm --version
QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c)
2003-2008 Fabrice Bellard

grep ceph /etc/pve/qemu-server/102.conf
virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G

root@fileserver-test:/daten/tv01/test# fio --time_based --name=benchmark
--size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0
--iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0
--numjobs=4 --rw=randwrite --blocksize=4k --group_reporting  
fio: time_based requires a runtime/timeout
setting 
 

benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128 
... 
   

fio-2.1.11
Starting 4 processes
Jobs: 4 (f=4): [w(4)] [99.6% done] [0KB/56148KB/0KB /s] [0/14.4K/0 iops]
[eta 00m:01s]
benchmark: (groupid=0, jobs=4): err= 0: pid=26131: Sun Nov 22 03:51:04 2015
  write: io=0B, bw=59425KB/s, iops=14856, runt=282327msec
slat (usec): min=6, max=216925, avg=261.78, stdev=1802.78
clat (msec): min=1, max=330, avg=34.04, stdev=27.78
 lat (msec): min=1, max=330, avg=34.30, stdev=27.87
clat percentiles (msec):
 |  1.00th=[   10],  5.00th=[   13], 10.00th=[   14], 20.00th=[   16],
 | 30.00th=[   18], 40.00th=[   19], 50.00th=[   21], 60.00th=[   24],
 | 70.00th=[   33], 80.00th=[   62], 90.00th=[   81], 95.00th=[   87],
 | 99.00th=[   95], 99.50th=[  100], 99.90th=[  269], 99.95th=[  277],
 | 99.99th=[  297]
bw (KB  /s): min=3, max=42216, per=25.10%, avg=14917.03,
stdev=2990.50
lat (msec) : 2=0.01%, 4=0.01%, 10=1.13%, 20=45.52%, 50=28.23%
  

Re: [ceph-users] two or three replicas?

2015-11-03 Thread Udo Lembke
Hi,
for production (with enough OSDs) three replicas is the right choice.
The chance of data loss if two OSDs fail at the same time is too high.

And if this happens, most of your data is lost, because the data is
spread over many OSDs...

And yes - two replicas are faster for writes.
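A sketch of the pool settings (the pool name is just an example):

ceph osd pool set rbd size 3       # three replicas
ceph osd pool set rbd min_size 2   # still serve IO with one replica missing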


Udo


On 02.11.2015 11:10, Wah Peng wrote:
> Hello,
>
> for production application (for example, openstack's block storage),
> is it better to setup data to be stored with two replicas, or three
> replicas? is two replicas with better performance and lower cost?
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network performance

2015-10-22 Thread Udo Lembke
Hi Jonas,
you can create a bond over multiple NICs (which modes are possible depends on
your switch) to use one IP address but
more than one NIC.
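A minimal sketch for Debian/Proxmox in /etc/network/interfaces (interface names and
the LACP mode are assumptions - your switch must support the chosen mode):

auto bond0
iface bond0 inet static
    address 172.16.3.1
    netmask 255.255.255.0
    bond-slaves eth2 eth3
    bond-mode 802.3ad
    bond-miimon 100

Then point "cluster network" / "public network" in ceph.conf at the bond's subnet.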

Udo

On 21.10.2015 10:23, Jonas Björklund wrote:
> Hello,
> 
> In the configuration I have read about "cluster network" and "cluster addr".
> Is it possible to make the OSDs to listens to multiple IP addresses?
> I want to use several network interfaces to increase performance.
> 
> I hav tried
> 
> [global]
> cluster network = 172.16.3.0/24,172.16.4.0/24
> 
> [osd.0]
> public addr = 0.0.0.0
> #public addr = 172.16.3.1
> #public addr = 172.16.4.1
> 
> But I cant get them to listen to both 172.16.3.1 and 172.16.4.1 at the same 
> time.
> 
> Any ideas?
> 
> /Jonas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread Udo Lembke
Hi,
have you changed the ownership as described in Sage's mail about
"v9.1.0 Infernalis release candidate released"?

  #. Fix the ownership::

   chown -R ceph:ceph /var/lib/ceph

or set ceph.conf to use root instead?
  When upgrading, administrators have two options:

   #. Add the following line to ``ceph.conf`` on all hosts::

setuser match path = /var/lib/ceph/$type/$cluster-$id

  This will make the Ceph daemons run as root (i.e., not drop
  privileges and switch to user ceph) if the daemon's data
  directory is still owned by root.  Newly deployed daemons will
  be created with data owned by user ceph and will run with
  reduced privileges, but upgraded daemons will continue to run as
  root.



Udo

On 20.10.2015 14:59, German Anders wrote:
> trying to upgrade from hammer 0.94.3 to 0.94.4 I'm getting the
> following error msg while trying to restart the mon daemons:
>
> 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features:
> (1) Operation not permitted
> 2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
> 2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features:
> (1) Operation not permitted
>
>
> any ideas?
>
> $ ceph -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>
>
> Thanks in advance,
>
> Cheers,
>
> **
>
> *German*
>
> 2015-10-19 18:07 GMT-03:00 Sage Weil  >:
>
> This Hammer point fixes several important bugs in Hammer, as well as
> fixing interoperability issues that are required before an upgrade to
> Infernalis. That is, all users of earlier version of Hammer or any
> version of Firefly will first need to upgrade to hammer v0.94.4 or
> later before upgrading to Infernalis (or future releases).
>
> All v0.94.x Hammer users are strongly encouraged to upgrade.
>
> Changes
> ---
>
> * build/ops: ceph.spec.in : 50-rbd.rules
> conditional is wrong (#12166, Nathan Cutler)
> * build/ops: ceph.spec.in : ceph-common needs
> python-argparse on older distros, but doesn't require it (#12034,
> Nathan Cutler)
> * build/ops: ceph.spec.in : radosgw requires
> apache for SUSE only -- makes no sense (#12358, Nathan Cutler)
> * build/ops: ceph.spec.in : rpm: cephfs_java
> not fully conditionalized (#11991, Nathan Cutler)
> * build/ops: ceph.spec.in : rpm: not possible
> to turn off Java (#11992, Owen Synge)
> * build/ops: ceph.spec.in : running fdupes
> unnecessarily (#12301, Nathan Cutler)
> * build/ops: ceph.spec.in : snappy-devel for
> all supported distros (#12361, Nathan Cutler)
> * build/ops: ceph.spec.in : SUSE/openSUSE
> builds need libbz2-devel (#11629, Nathan Cutler)
> * build/ops: ceph.spec.in : useless
> %py_requires breaks SLE11-SP3 build (#12351, Nathan Cutler)
> * build/ops: error in ext_mime_map_init() when /etc/mime.types is
> missing (#11864, Ken Dreyer)
> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5
> in 30s) (#11798, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#10927, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#11140, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#11686, Sage Weil)
> * build/ops: With root as default user, unable to have multiple
> RGW instances running (#12407, Sage Weil)
> * cli: ceph: cli throws exception on unrecognized errno (#11354,
> Kefu Chai)
> * cli: ceph tell: broken error message / misleading hinting
> (#11101, Kefu Chai)
> * common: arm: all programs that link to librados2 hang forever on
> startup (#12505, Boris Ranto)
> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
> * common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer
> objects (#13070, Sage Weil)
> * common: do not insert emtpy ptr when rebuild emtpy bufferlist
> (#12775, Xinze Chi)
> * common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
> * common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
> * 

Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Udo Lembke
Hi Christian,

On 07.10.2015 09:04, Christian Balzer wrote:
> 
> ...
> 
> My main suspect for the excessive slowness are actually the Toshiba DT
> type drives used. 
> We only found out after deployment that these can go into a zombie mode
> (20% of their usual performance for ~8 hours if not permanently until power
> cycled) after a week of uptime.
> Again, the HW cache is likely masking this for the steady state, but
> asking a sick DT drive to seek (for reads) is just asking for trouble.
> 
> ...
does this mean you can reboot your OSD nodes one after the other, and then your 
cluster should be fast enough for approx.
one week to bring the additional node in?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [sepia] debian jessie repository ?

2015-09-25 Thread Udo Lembke
Hi,
you can use this sources-list

cat /etc/apt/sources.list.d/ceph.list
deb http://gitbuilder.ceph.com/ceph-deb-jessie-x86_64-basic/ref/v0.94.3 jessie main

Udo

On 25.09.2015 15:10, Jogi Hofmüller wrote:
> Hi,
>
> Am 2015-09-11 um 13:20 schrieb Florent B:
>
>> Jessie repository will be available on next Hammer release ;)
> An how should I continue installing ceph meanwhile?  ceph-deploy new ...
> overwrites the /etc/apt/sources.list.d/ceph.list and hence throws an
> error :(
>
> Any hint appreciated.
>
> Cheers,
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-07 Thread Udo Lembke
Hi Vickey,
I had the same rados bench output after changing the motherboard of the
monitor node with the lowest IP...
Due to the new mainboard, I assume the hardware clock was wrong during
startup. Ceph health showed no errors, but none of the VMs were able to do IO
(very high load on the VMs - but no traffic).
I stopped that mon, but this didn't change anything. I had to restart all the
other mons to get IO again. After that I started the first mon again
(with the right time now) and everything worked fine again...

Another possibility:
Do you use journals on SSDs? Perhaps the SSDs are stalled by garbage
collection and can't keep up with the writes?
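Two quick checks for both suspicions (clock and journal SSDs) on the affected node - just a sketch, assuming NTP is in use and /dev/sdX stands for a journal SSD:

ceph health detail | grep -i clock     # the mons report clock skew explicitly
ntpq -p                                # verify the node really syncs its time
iostat -x 1 /dev/sdX                   # watch the journal SSD for long write latencies (GC stalls)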


Udo

On 07.09.2015 16:36, Vickey Singh wrote:
> Dear Experts
>
> Can someone please help me , why my cluster is not able write data.
>
> See the below output  cur MB/S  is 0  and Avg MB/s is decreasing.
>
>
> Ceph Hammer  0.94.2
> CentOS 6 (3.10.69-1)
>
> The Ceph status says OPS are blocked , i have tried checking , what
> all i know 
>
> - System resources ( CPU , net, disk , memory )-- All normal 
> - 10G network for public and cluster network  -- no saturation 
> - Add disks are physically healthy 
> - No messages in /var/log/messages OR dmesg
> - Tried restarting OSD which are blocking operation , but no luck
> - Tried writing through RBD  and Rados bench , both are giving same
> problemm
>
> Please help me to fix this problem.
>
> #  rados bench -p rbd 60 write
>  Maintaining 16 concurrent writes of 4194304 bytes for up to 60
> seconds or 0 objects
>  Object prefix: benchmark_data_stor1_1791844
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>  0   0 0 0 0 0 - 0
>  1  16   125   109   435.873   436  0.022076 0.0697864
>  2  16   139   123   245.94856  0.246578 0.0674407
>  3  16   139   123   163.969 0 - 0.0674407
>  4  16   139   123   122.978 0 - 0.0674407
>  5  16   139   12398.383 0 - 0.0674407
>  6  16   139   123   81.9865 0 - 0.0674407
>  7  16   139   123   70.2747 0 - 0.0674407
>  8  16   139   123   61.4903 0 - 0.0674407
>  9  16   139   123   54.6582 0 - 0.0674407
> 10  16   139   123   49.1924 0 - 0.0674407
> 11  16   139   123   44.7201 0 - 0.0674407
> 12  16   139   123   40.9934 0 - 0.0674407
> 13  16   139   123   37.8401 0 - 0.0674407
> 14  16   139   123   35.1373 0 - 0.0674407
> 15  16   139   123   32.7949 0 - 0.0674407
> 16  16   139   123   30.7451 0 - 0.0674407
> 17  16   139   123   28.9364 0 - 0.0674407
> 18  16   139   123   27.3289 0 - 0.0674407
> 19  16   139   123   25.8905 0 - 0.0674407
> 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat:
> 0.0674407
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> 20  16   139   12324.596 0 - 0.0674407
> 21  16   139   123   23.4247 0 - 0.0674407
> 22  16   139   123 22.36 0 - 0.0674407
> 23  16   139   123   21.3878 0 - 0.0674407
> 24  16   139   123   20.4966 0 - 0.0674407
> 25  16   139   123   19.6768 0 - 0.0674407
> 26  16   139   123 18.92 0 - 0.0674407
> 27  16   139   123   18.2192 0 - 0.0674407
> 28  16   139   123   17.5686 0 - 0.0674407
> 29  16   139   123   16.9628 0 - 0.0674407
> 30  16   139   123   16.3973 0 - 0.0674407
> 31  16   139   123   15.8684 0 - 0.0674407
> 32  16   139   123   15.3725 0 - 0.0674407
> 33  16   139   123   14.9067 0 - 0.0674407
> 34  16   139   123   14.4683 0 - 0.0674407
> 35  16   139   123   14.0549 0 - 0.0674407
> 36  16   139   123   13.6645 0 - 0.0674407
> 37  16   139   123   13.2952 0 - 0.0674407
> 38  16   139   123   12.9453 0 - 0.0674407
> 39  16   139   123   12.6134 0 - 0.0674407
> 2015-09-07 15:55:12.697124min lat: 0.022076 max lat: 0.46117 avg lat:
> 0.0674407
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>   

Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Udo Lembke
Hi Christian,
for my setup, option "b" takes too long - too much data movement and stress on all
nodes.
I simply (with replica 3) set "noout", reinstalled one node (with a new
filesystem on the OSDs, but leaving them in the
crushmap) and started all OSDs again (on Friday night) - the rebuild took less than
one day (11*4TB, 1*8TB).
This also stresses the other nodes, but less than weighting the OSDs to zero.
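The sequence is roughly this (a sketch, not the exact commands from back then):

ceph osd set noout      # stopped OSDs stay "in", so their PGs are not remapped to other nodes
# reinstall the node, recreate the OSD filesystems, start the OSDs again
ceph osd unset noout    # once the OSDs are up and backfilling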

Udo

On 31.08.2015 06:07, Christian Balzer wrote:
> 
> Hello,
> 
> I'm about to add another storage node to small firefly cluster here and
> refurbish 2 existing nodes (more RAM, different OSD disks).
> 
> Insert rant about not going to start using ceph-deploy as I would have to
> set the cluster to no-in since "prepare" also activates things due to the
> udev magic...
> 
> This cluster is quite at the limits of its IOPS capacity (the HW was
> requested ages ago, but the mills here grind slowly and not particular
> fine either), so the plan is to:
> 
> a) phase in the new node (lets call it C), one OSD at a time (in the dead
> of night)
> b) empty out old node A (weight 0), one OSD at a time. When
> done, refurbish and bring it back in, like above.
> c) repeat with 2nd old node B.
> 
> Looking at this it's obvious where the big optimization in this procedure
> would be, having the ability to "freeze" the OSDs on node B.
> That is making them ineligible for any new PGs while preserving their
> current status. 
> So that data moves from A to C (which is significantly faster than A or B)
> and then back to A when it is refurbished, avoiding any heavy lifting by B.
> 
> Does that sound like something other people might find useful as well and
> is it feasible w/o upsetting the CRUSH applecart?
> 
> Christian
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Udo Lembke
Hi,
some time ago I switched all OSDs from XFS to ext4 (step by step).
I had no issues while the OSD formats were mixed (the process took some weeks).

And yes, for me ext4 also performs better (especially the latencies).

Udo

On 07.08.2015 13:31, Межов Игорь Александрович wrote:
 Hi!
 
 We do some performance tests on our small Hammer install:
  - Debian Jessie;
  - Ceph Hammer 0.94.2 self-built from sources (tcmalloc)
  - 1xE5-2670 + 128Gb RAM
  - 2 nodes shared with mons, system and mon DB are on separate SAS mirror;
  - 16 OSD on each node, SAS 10k;
  - 2 Intel DC S3700 200Gb SSD for journalling 
  - 10Gbit interconnect, shared public and cluster metwork, MTU9100
  - 10Gbit client host, fio 2.2.7 compiled with RBD engine
 
 We benchmark 4k random read performance on 500G RBD volume with fio-rbd 
 and got different results. When we use XFS 
 (noatime,attr2,inode64,allocsize=4096k,
 noquota) on OSD disks, we can get ~7k sustained iops. After recreating the 
 same OSDs
 with EXT4 fs (noatime,data=ordered) we can achieve ~9.5k iops in the same 
 benchmark.
 
 So there are some questions to community:
  1. Is really EXT4 perform better under typical RBD load (we Ceph to host VM 
 images)?
  2. Is it safe to intermix OSDs with different backingstore filesystems at 
 one cluster 
 (we use ceph-deploy to create and manage OSDs)?
  3. Is it safe to move our production cluster (Firefly 0.80.7) from XFS to 
 ext4 by
 removing XFS osds one-by-one and later add the same disk drives as Ext4 OSDs
 (of course, I know about huge data-movement that will take place during this 
 process)?
 
 Thanks!
 
 Megov Igor
 CIO, Yuterra
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the samecluster

2015-08-07 Thread Udo Lembke
Hi Jan,
thanks for the hint.

I changed the mount option from noatime to relatime and will remount all
OSDs during the weekend.
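For reference, the changed line in ceph.conf then looks roughly like this (based on the mount options quoted further down, just with relatime instead of noatime):

[osd]
osd mount options ext4 = user_xattr,rw,relatime,nodiratime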

Udo

On 07.08.2015 16:37, Jan Schermer wrote:
 ext4 does support external journal, and it is _FAST_

 btw I'm not sure noatime is the right option nowadays for two reasons
 1) the default is relatime which has minimal impact on performance
 2) AFAIK some ceph features actually use atime (cache tiering was it?) or at 
 least so I gathered from some bugs I saw

 Jan

 On 07 Aug 2015, at 16:30, Udo Lembke ulem...@polarzone.de wrote:

 Hi,
 I use the ext4-parameters like Christian Balzer wrote in one posting:
 osd mount options ext4 = user_xattr,rw,noatime,nodiratime
 osd_mkfs_options_ext4 = -J size=1024 -E 
 lazy_itable_init=0,lazy_journal_init=0

 The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
 support an different journal-device, like
 xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!

 Udo

 On 07.08.2015 16:13, Burkhard Linke wrote:
 Hi,


 On 08/07/2015 04:04 PM, Udo Lembke wrote:
 Hi,
 some time ago I switched all OSDs from XFS to ext4 (step by step).
 I had no issues during mixed osd-format (the process takes some weeks).

 And yes, for me ext4 performs also better (esp. the latencies).
 Just out of curiosity:

 Do you use a ext4 setup as described in the documentation? Did you try to 
 use external ext4 journals on SSD?

 Regards,
 Burkhard
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping old distros: el6, precise 12.04, debian wheezy?

2015-07-30 Thread Udo Lembke
Hi,
dropping Debian wheezy seems quite fast - until now there aren't even packages
for jessie?!
Dropping squeeze I understand, but wheezy at this time?


Udo


On 30.07.2015 15:54, Sage Weil wrote:
 As time marches on it becomes increasingly difficult to maintain proper 
 builds and packages for older distros.  For example, as we make the 
 systemd transition, maintaining the kludgey sysvinit and udev support for 
 centos6/rhel6 is a pain in the butt and eats up time and energy to 
 maintain and test that we could be spending doing more useful work.

 Dropping them would mean:

  - Ongoing development on master (and future versions like infernalis and 
 jewel) would not be tested on these distros.

  - We would stop building upstream release packages on ceph.com for new 
 releases.

  - We would probably continue building hammer and firefly packages for 
 future bugfix point releases.

  - The downstream distros would probably continue to package them, but the 
 burden would be on them.  For example, if Ubuntu wanted to ship Jewel on 
 precise 12.04, they could, but they'd probably need to futz with the 
 packaging and/or build environment to make it work.

 So... given that, I'd like to gauge user interest in these old distros.  
 Specifically,

  CentOS6 / RHEL6
  Ubuntu precise 12.04
  Debian wheezy

 Would anyone miss them?

 In particular, dropping these three would mean we could drop sysvinit 
 entirely and focus on systemd (and continue maintaining the existing 
 upstart files for just a bit longer).  That would be a relief.  (The 
 sysvinit files wouldn't go away in the source tree, but we wouldn't worry 
 about packaging and testing them properly.)

 Thanks!
 sage
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Udo Lembke
Hi,

On 28.07.2015 12:02, Shneur Zalman Mattern wrote:
 Hi!

 And so, in your math
 I need to build size = osd, 30 replicas for my cluster of 120TB - to get my 
 demans 
30 replicas is the wrong math! Fewer replicas = more speed (because of
less writing);
more replicas = less speed.
For data safety a replica count of 3 is recommended.
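On an existing pool the replica count can be adjusted like this (a sketch; <poolname> is a placeholder):

ceph osd pool set <poolname> size 3
ceph osd pool set <poolname> min_size 2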


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] different omap format in one cluster (.sst + .ldb) - new installed OSD-node don't start any OSD

2015-07-23 Thread Udo Lembke
Hi,
I use ceph 0.94 from the wheezy repo (deb http://eu.ceph.com/debian-hammer wheezy
main) inside jessie.
0.94.1 is installable without trouble, but an upgrade to 0.94.2 doesn't work
correctly:
dpkg -l | grep ceph
ii  ceph   0.94.1-1~bpo70+1  amd64  
  distributed storage and file system
ii  ceph-common0.94.2-1~bpo70+1  amd64  
  common utilities to mount and interact
with a ceph storage cluster
ii  ceph-fs-common 0.94.2-1~bpo70+1  amd64  
  common utilities to mount and interact
with a ceph file system
ii  ceph-fuse  0.94.2-1~bpo70+1  amd64  
  FUSE-based client for the Ceph
distributed file system
ii  ceph-mds   0.94.2-1~bpo70+1  amd64  
  metadata server for the ceph
distributed file system
ii  libcephfs1 0.94.2-1~bpo70+1  amd64  
  Ceph distributed file system client
library
ii  python-cephfs  0.94.2-1~bpo70+1  amd64  
  Python libraries for the Ceph
libcephfs library

This is the reason why I switched back to wheezy (and a clean 0.94.2), but then
all OSDs on that node failed to start.
Switching back to the jessie system disk didn't solve this problem, because only
3 OSDs started again...


My conclusion is: if one of my (partly broken) jessie OSD nodes dies now (e.g. a
failed system SSD), I need less than an
hour for a new system (wheezy), around two hours to reinitialize all OSDs (format
anew, install ceph) and around two days
to refill the whole node.

Udo

On 23.07.2015 13:21, Haomai Wang wrote:
 Do you use upstream ceph version previously? Or do you shutdown
 running ceph-osd when upgrading osd?
 
 How many osds meet this problems?
 
 This assert failure means that osd detects a upgraded pg meta object
 but failed to read(or lack of 1 key) meta keys from object.
 
 On Thu, Jul 23, 2015 at 7:03 PM, Udo Lembke ulem...@polarzone.de wrote:
 On 21.07.2015 12:06, Udo Lembke wrote:
 Hi all,
 ...

 Normaly I would say, if one OSD-Node die, I simply reinstall the OS and 
 ceph and I'm back again... but this looks bad
 for me.
 Unfortunality the system also don't start 9 OSDs as I switched back to the 
 old system-disk... (only three of the big
 OSDs are running well)

 What is the best solution for that? Empty one node (crush weight 0), fresh 
 reinstall OS/ceph, reinitialise all OSDs?
 This will take a long long time, because we use 173TB in this cluster...



 Hi,
 answer myself if anybody has similiar issues and find the posting.

 Empty the whole nodes takes too long.
 I used the puppet wheezy system and have to recreate all OSDs (in this case 
 I need to empty the first blocks of the
 journal before create the OSD again).


 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] different omap format in one cluster (.sst + .ldb) - new installed OSD-node don't start any OSD

2015-07-23 Thread Udo Lembke
On 21.07.2015 12:06, Udo Lembke wrote:
 Hi all,
 ...
 
 Normaly I would say, if one OSD-Node die, I simply reinstall the OS and ceph 
 and I'm back again... but this looks bad
 for me.
 Unfortunality the system also don't start 9 OSDs as I switched back to the 
 old system-disk... (only three of the big
 OSDs are running well)
 
 What is the best solution for that? Empty one node (crush weight 0), fresh 
 reinstall OS/ceph, reinitialise all OSDs?
 This will take a long long time, because we use 173TB in this cluster...
 
 

Hi,
Answering myself in case anybody has similar issues and finds this posting.

Emptying the whole node takes too long.
I used the puppet wheezy system and had to recreate all OSDs (in this case I
needed to empty the first blocks of the
journal before creating the OSD again).
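The journal wipe before re-creating an OSD looks roughly like this (only a sketch - hostname, data disk and journal partition are placeholders and must match the OSD being recreated):

dd if=/dev/zero of=/dev/sdX2 bs=1M count=100 oflag=direct   # clear the old journal header
ceph-deploy osd create ceph-01:sdb:/dev/sdX2                # then recreate the OSD as usual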


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] different omap format in one cluster (.sst + .ldb) - new installed OSD-node don't start any OSD

2015-07-21 Thread Udo Lembke
Hi all,
we had a ceph cluster with 7 OSD nodes (Debian Jessie (because of the patched
tcmalloc) with ceph 0.94) which we are expanding with
one further node.
For this node we use puppet with Debian 7.8, because ceph 0.92.2 doesn't
install on Jessie (the upgrade to 0.94.1 worked on the
other nodes, but 0.94.2 doesn't look clean because the package ceph is still at
0.94.1).
The ceph.conf is the same cluster-wide and the OSDs on all nodes were initialized
with ceph-deploy (with only a few exceptions).
All OSDs use ext4, switched from xfs while the cluster ran ceph 0.80.7;
"filestore xattr use omap = true" is set
in ceph.conf.

I'm wondering why the omap format is different across the nodes.
The new wheezy node uses .sst files:
ls -lsa /var/lib/ceph/osd/ceph-92/current/omap/
...
2084 -rw-r--r--   1 root root 2131113 Jul 20 17:45 98.sst
2084 -rw-r--r--   1 root root 2131913 Jul 20 17:45 99.sst
2084 -rw-r--r--   1 root root 2130623 Jul 20 17:45 000111.sst
...

Whereas the jessie nodes use .ldb files:
ls -lsa /var/lib/ceph/osd/ceph-1/current/omap/
...

2084 -rw-r--r--   1 root root 2130468 Jul 20 22:33 80.ldb
2084 -rw-r--r--   1 root root 2130827 Jul 20 22:33 81.ldb
2084 -rw-r--r--   1 root root 2130171 Jul 20 22:33 88.ldb
...

On some OSDs I found old .sst files which came out of wheezy/ceph 0.87 times:
ls -lsa /var/lib/ceph/osd/ceph-23/current/omap/*.sst
2096 -rw-r--r-- 1 root root 2142558 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016722.sst
2092 -rw-r--r-- 1 root root 2141968 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016723.sst
2092 -rw-r--r-- 1 root root 2141679 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016724.sst
2096 -rw-r--r-- 1 root root 2142376 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016725.sst
2096 -rw-r--r-- 1 root root 2142227 Apr  3 15:59 
/var/lib/ceph/osd/ceph-23/current/omap/016726.sst
2092 -rw-r--r-- 1 root root 2141369 Apr 20 21:23 
/var/lib/ceph/osd/ceph-23/current/omap/019470.sst
But many more .ldb files:
ls -lsa /var/lib/ceph/osd/ceph-23/current/omap/*.ldb | wc -l
128

The config shows leveldb as the omap backend for the OSDs on both nodes (old and
new with .sst files):
ceph --admin-daemon /var/run/ceph/ceph-osd.92.asok config show | grep -i omap
filestore_omap_backend: leveldb,
filestore_debug_omap_check: false,
filestore_omap_header_cache_size: 1024,


Normally I would not care about that, but I tried to switch the first OSD node
to a clean puppet install and saw that
no OSDs started. The error message looks a little bit like
http://tracker.ceph.com/issues/11429 but this should not
happen, because the puppet install has ceph 0.94.2.

Error message during start:
cat ceph-osd.0.log
2015-07-20 16:51:29.435081 7fb47b126840  0 ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3), process
ceph-osd, pid 9803
2015-07-20 16:51:29.457776 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) 
backend generic (magic 0xef53)
2015-07-20 16:51:29.460470 7fb47b126840  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP
ioctl is supported and appears to work
2015-07-20 16:51:29.460479 7fb47b126840  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-07-20 16:51:29.485120 7fb47b126840  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-07-20 16:51:29.572670 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) 
limited size xattrs
2015-07-20 16:51:29.889599 7fb47b126840  0 filestore(/var/lib/ceph/osd/ceph-0) 
mount: enabling WRITEAHEAD journal mode:
checkpoint is not enabled
2015-07-20 16:51:31.517179 7fb47b126840  0 cls cls/hello/cls_hello.cc:271: 
loading cls_hello
2015-07-20 16:51:31.552366 7fb47b126840  0 osd.0 151644 crush map has features 
2303210029056, adjusting msgr requires
for clients
2015-07-20 16:51:31.552375 7fb47b126840  0 osd.0 151644 crush map has features 
2578087936000 was 8705, adjusting msgr
requires for mons
2015-07-20 16:51:31.552382 7fb47b126840  0 osd.0 151644 crush map has features 
2578087936000, adjusting msgr requires
for osds
2015-07-20 16:51:31.552394 7fb47b126840  0 osd.0 151644 load_pgs
2015-07-20 16:51:42.682678 7fb47b126840 -1 osd/PG.cc: In function 'static 
epoch_t PG::peek_map_epoch(ObjectStore*,
spg_t, ceph::bufferlist*)' thread 7fb47b126840 time 2015-07-20 16:51:42.680036
osd/PG.cc: 2825: FAILED assert(values.size() == 2)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) 
[0xcdb572]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0x7b2) 
[0x908742]
 3: (OSD::load_pgs()+0x734) [0x7e9064]
 4: (OSD::init()+0xdac) [0x7ed8fc]
 5: (main()+0x253e) [0x79069e]
 6: (__libc_start_main()+0xfd) [0x7fb47898fead]
 7: /usr/bin/ceph-osd() [0x7966b9]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
...

Normaly I would 

Re: [ceph-users] He8 drives

2015-07-13 Thread Udo Lembke
Hi,
I have just expanded our ceph cluster (7 nodes) with one 8TB HGST per node
(changing from 4TB to 8TB; the other 11 disks per node are 4TB HGST).
But I have set the primary affinity to 0 for the 8TB disks... so in this
case my performance values are not 8TB-disk related.
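Setting it is a one-liner per OSD (the osd id is just an example; depending on the release the mons may first need "mon osd allow primary affinity = true"):

ceph osd primary-affinity osd.84 0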

Udo

On 08.07.2015 02:28, Blair Bethwaite wrote:
 Hi folks,

 Does anyone have any experience with the newish HGST He8 8TB Helium
 filled HDDs? Storagereview looked at them here:
 http://www.storagereview.com/hgst_ultrastar_helium_he8_8tb_enterprise_hard_drive_review.
 I'm torn as to the lower read performance shown there than e.g. the
 He6 or Seagate 6TB, but thing is, I think we probably have enough
 aggregate IOPs with ~170 drives. Has anyone tried these in a Ceph
 cluster yet?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Udo Lembke
Hi,

On 01.05.2015 10:30, Piotr Wachowicz wrote:
 Is there any way to confirm (beforehand) that using SSDs for journals
 will help?
yes, an SSD journal helps a lot for write speed (if you use the right SSDs),
and in my experience it also helped (but not too much) for
read performance.


 We're seeing very disappointing Ceph performance. We have 10GigE
 interconnect (as a shared public/internal network).
What kind of CPU do you use for the OSD hosts?


 We're wondering whether it makes sense to buy SSDs and put journals on
 them. But we're looking for a way to verify that this will actually
 help BEFORE we splash cash on SSDs.
I can recommend the Intel DC S3700 SSD for journaling! In the beginning
I started with various much cheaper models, but that was the wrong
decision.
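A simple way to sort out unsuitable models before buying many of them is to test O_DSYNC writes, which is what the OSD journal does - a rough sketch, destructive for the target device, with /dev/sdX as a placeholder:

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

Good journal SSDs (like the DC S3700) sustain this at a high rate, while many cheap consumer models collapse to a few hundred IOPS.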

 The problem is that the way we have things configured now, with
 journals on spinning HDDs (shared with OSDs as the backend storage),
 apart from slow read/write performance to Ceph I already mention,
 we're also seeing fairly low disk utilization on OSDs. 

 This low disk utilization suggests that journals are not really used
 to their max, which begs for the questions whether buying SSDs for
 journals will help.

 This kind of suggests that the bottleneck is NOT the disk. But,m yeah,
 we cannot really confirm that.

 Our typical data access use case is a lot of small random read/writes.
 We're doing a lot of rsyncing (entire regular linux filesystems) from
 one VM to another.

 We're using Ceph for OpenStack storage (kvm). Enabling RBD cache
 didn't really help all that much.
The read speed can be optimized with a bigger read-ahead cache inside
the VM, like:
echo 4096 > /sys/block/vda/queue/read_ahead_kb

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer release data and a Design question

2015-03-27 Thread Udo Lembke
Hi,

On 26.03.2015 11:18, 10 minus wrote:
 Hi ,
 
 I 'm just starting on small Ceph implementation and wanted to know the 
 release date for Hammer.
 Will it coincide with relase of Openstack.
 
 My Conf:  (using 10G and Jumboframes on Centos 7 / RHEL7 )
 
 3x Mons (VMs) :
 CPU - 2
 Memory - 4G
 Storage - 20 GB
 
 4x OSDs :
 CPU - Haswell Xeon
 Memory - 8 GB
 Sata - 3x 2TB (3 OSD per node)
 SSD - 2x 480 GB ( Journaling and if possible tiering)
 
 
 This is a test environment to see how all the components play . If all goes 
 well
 then we plan to increase the OSDs to 24 per node and RAM to 32 GB and a dual 
 Socket Haswell Xeons
32GB for 24 OSDs is much too little!! I have 32GB for 12 OSDs - that's OK, but
64GB would be better.
The CPU depends on your model (cores, dual socket?).

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-26 Thread Udo Lembke
Hi Don,
after a lot of trouble due to an unfinished setcrushmap, I was able to remove the
new EC pool.
I loaded the old crushmap and edited it again. After including a "step set_choose_tries
100" in the crushmap, the EC pool creation with
ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile
worked without trouble.
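For anyone who finds this later, the crushmap round trip is the standard crushtool workflow:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# add "step set_choose_tries 100" to the EC rule in crush.txt
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new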

Due to defective PGs from this test, I removed the cache tier from the old EC pool,
which gave me the next bit of trouble - but that
is another story!


Thanks again

Udo

On 25.03.2015 20:37, Don Doerner wrote:
 More info please: how did you create your EC pool?  It's hard to imagine that 
 you could have specified enough PGs to make it impossible to form PGs out of 
 84 OSDs (I'm assuming your SSDs are in a separate root) but I have to ask...
 
 -don-
 
 

 -Original Message-
 From: Udo Lembke [mailto:ulem...@polarzone.de] 
 Sent: 25 March, 2015 08:54
 To: Don Doerner; ceph-us...@ceph.com
 Subject: Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 
 active+undersized+degraded
 
 Hi Don,
 thanks for the info!
 
 looks that choose_tries set to 200 do the trick.
 
 But the setcrushmap takes a long long time (alarming, but the client have 
 still IO)... hope it's finished soon ;-)
 
 
 Udo
 
 On 25.03.2015 16:00, Don Doerner wrote:
 Assuming you've calculated the number of PGs reasonably, see here 
 https://urldefense.proofpoint.com/v1/url?u=http://tracker.ceph.com/issues/10350k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=Uyb56Qt%2BKVFbsV03VYVYpn8wSfEZJBXMjOz%2BQX5j0fY%3D%0As=b2547ec4aefa0f1b25d47bc813cab344a24c22c2464d4ff2cb199be0ef9b15cf
  and here 
 https://urldefense.proofpoint.com/v1/url?u=http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/%23crush-gives-up-too-soonhttp://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=Uyb56Qt%2BKVFbsV03VYVYpn8wSfEZJBXMjOz%2BQX5j0fY%3D%0As=09d9aeb34481797e2d8f24938980db3697f26d94e92ff4c72714651181329de9.
 I'm guessing these will address your issue.  That weird number means that no 
 OSD was found/assigned to the PG.

  

 -don-
 
 --
 The information contained in this transmission may be confidential. Any 
 disclosure, copying, or further distribution of confidential information is 
 not permitted unless such privilege is explicitly granted in writing by 
 Quantum. Quantum reserves the right to have electronic communications, 
 including email and attachments, sent across its networks filtered through 
 anti virus and spam software programs and retain such messages in order to 
 comply with applicable data security and retention requirements. Quantum is 
 not responsible for the proper and complete transmission of the substance of 
 this communication or for any delay in its receipt.
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Udo Lembke
Hi all,
due to a very silly approach, I removed the cache tier of a filled EC pool.

After recreating the pool and connecting it with the EC pool I don't see any content.
How can I see the rbd_data and the other objects through the new SSD cache tier?

I think that I must recreate the rbd_directory (and fill it with setomapval), but
I don't see anything yet!

$ rados ls -p ecarchiv | more
rbd_data.2e47de674b0dc51.00390074
rbd_data.2e47de674b0dc51.0020b64f
rbd_data.2fbb1952ae8944a.0016184c
rbd_data.2cfc7ce74b0dc51.00363527
rbd_data.2cfc7ce74b0dc51.0004c35f
rbd_data.2fbb1952ae8944a.0008db43
rbd_data.2cfc7ce74b0dc51.0015895a
rbd_data.31229f0238e1f29.000135eb
...

$ rados ls -p ssd-archiv
 nothing 

generation of the cache tier:
$ rados mkpool ssd-archiv
$ ceph osd pool set ssd-archiv crush_ruleset 5
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd pool set ssd-archiv hit_set_type bloom
$ ceph osd pool set ssd-archiv hit_set_count 1
$ ceph osd pool set ssd-archiv hit_set_period 3600
$ ceph osd pool set ssd-archiv target_max_bytes 500


rule ssd {
ruleset 5
type replicated
min_size 1
max_size 10
step take ssd
step choose firstn 0 type osd
step emit
}


Is there any magic (or which command did I miss?) to see the existing data
through the cache tier?


regards - and hoping for answers

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Udo Lembke
Hi Greg,

On 26.03.2015 18:46, Gregory Farnum wrote:
 I don't know why you're mucking about manually with the rbd directory;
 the rbd tool and rados handle cache pools correctly as far as I know.
that's because I deleted the cache tier pool, so the objects like
rbd_header.2cfc7ce74b0dc51 and rbd_directory are gone.
All of the VM disk data is still in the EC pool (rbd_data.2cfc7ce74b0dc51.*).

I can't see or recreate the VM disk, because rados setomapval doesn't like
binary data and the rbd tool can't (re)create an rbd image with a given
prefix (like 2cfc7ce74b0dc51).

The only way I see at the moment is to create new rbd disks and copy
all blocks with rados get -> file -> rados put.
The problem is the time it takes (days to weeks for 3 * 16TB)...
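A very rough sketch of such a copy loop - it only moves the data objects; the target pool and NEWPREFIX (the block_name_prefix of a freshly created image) are placeholders, and the image metadata on the target side still has to be taken care of:

for obj in $(rados -p ecarchiv ls | grep '^rbd_data.2cfc7ce74b0dc51'); do
    rados -p ecarchiv get "$obj" /tmp/block
    rados -p rbd put "rbd_data.NEWPREFIX.${obj##*.}" /tmp/block
done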

Udo

 -Greg

 On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi Greg,
 ok!

 It's looks like, that my problem is more setomapval-related...

 I must o something like
 rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51

 but rados setomapval don't use the hexvalues - instead of this I got
 rados -p ssd-archiv listomapvals rbd_directory
 name_vm-409-disk-2
 value: (35 bytes) :
  : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
 0020 : 63 35 31: c51


 hmm, strange. With  rados -p ssd-archiv getomapval rbd_directory 
 name_vm-409-disk-2 name_vm-409-disk-2
 I got the binary inside the file name_vm-409-disk-2, but reverse do an
 rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
 name_vm-409-disk-2
 fill the variable with name_vm-409-disk-2 and not with the content of the 
 file...

 Are there other tools for the rbd_directory?

 regards

 Udo

 On 26.03.2015 15:03, Gregory Farnum wrote:
 You shouldn't rely on rados ls when working with cache pools. It
 doesn't behave properly and is a silly operation to run against a pool
 of any size even when it does. :)

 More specifically, rados ls is invoking the pgls operation. Normal
 read/write ops will go query the backing store for objects if they're
 not in the cache tier. pgls is different — it just tells you what
 objects are present in the PG on that OSD right now. So any objects
 which aren't in cache won't show up when listing on the cache pool.
 -Greg

 On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi all,
 due an very silly approach, I removed the cache tier of an filled EC pool.

 After recreate the pool and connect with the EC pool I don't see any 
 content.
 How can I see the rbd_data and other files through the new ssd cache tier?

 I think, that I must recreate the rbd_directory (and fill with 
 setomapval), but I don't see anything yet!

 $ rados ls -p ecarchiv | more
 rbd_data.2e47de674b0dc51.00390074
 rbd_data.2e47de674b0dc51.0020b64f
 rbd_data.2fbb1952ae8944a.0016184c
 rbd_data.2cfc7ce74b0dc51.00363527
 rbd_data.2cfc7ce74b0dc51.0004c35f
 rbd_data.2fbb1952ae8944a.0008db43
 rbd_data.2cfc7ce74b0dc51.0015895a
 rbd_data.31229f0238e1f29.000135eb
 ...

 $ rados ls -p ssd-archiv
  nothing 

 generation of the cache tier:
 $ rados mkpool ssd-archiv
 $ ceph osd pool set ssd-archiv crush_ruleset 5
 $ ceph osd tier add ecarchiv ssd-archiv
 $ ceph osd tier cache-mode ssd-archiv writeback
 $ ceph osd pool set ssd-archiv hit_set_type bloom
 $ ceph osd pool set ssd-archiv hit_set_count 1
 $ ceph osd pool set ssd-archiv hit_set_period 3600
 $ ceph osd pool set ssd-archiv target_max_bytes 500


 rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
 }


 Are there any magic (or which command I missed?) to see the excisting 
 data throug the cache tier?


 regards - and hoping for answers

 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Udo Lembke
Hi Greg,
ok!

It looks like my problem is more setomapval-related...

I must do something like
rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2
\0x0f\0x00\0x00\0x002cfc7ce74b0dc51

but rados setomapval doesn't interpret the hex values - instead I got
rados -p ssd-archiv listomapvals rbd_directory
name_vm-409-disk-2
value: (35 bytes) :
 : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
0020 : 63 35 31: c51


hmm, strange. With "rados -p ssd-archiv getomapval rbd_directory
name_vm-409-disk-2 name_vm-409-disk-2"
I got the binary value into the file name_vm-409-disk-2, but the reverse,
"rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2
name_vm-409-disk-2",
fills the value with the string "name_vm-409-disk-2" and not with the content of the
file...

Are there other tools for the rbd_directory?

regards

Udo

On 26.03.2015 15:03, Gregory Farnum wrote:
 You shouldn't rely on rados ls when working with cache pools. It
 doesn't behave properly and is a silly operation to run against a pool
 of any size even when it does. :)
 
 More specifically, rados ls is invoking the pgls operation. Normal
 read/write ops will go query the backing store for objects if they're
 not in the cache tier. pgls is different — it just tells you what
 objects are present in the PG on that OSD right now. So any objects
 which aren't in cache won't show up when listing on the cache pool.
 -Greg
 
 On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi all,
 due an very silly approach, I removed the cache tier of an filled EC pool.

 After recreate the pool and connect with the EC pool I don't see any content.
 How can I see the rbd_data and other files through the new ssd cache tier?

 I think, that I must recreate the rbd_directory (and fill with setomapval), 
 but I don't see anything yet!

 $ rados ls -p ecarchiv | more
 rbd_data.2e47de674b0dc51.00390074
 rbd_data.2e47de674b0dc51.0020b64f
 rbd_data.2fbb1952ae8944a.0016184c
 rbd_data.2cfc7ce74b0dc51.00363527
 rbd_data.2cfc7ce74b0dc51.0004c35f
 rbd_data.2fbb1952ae8944a.0008db43
 rbd_data.2cfc7ce74b0dc51.0015895a
 rbd_data.31229f0238e1f29.000135eb
 ...

 $ rados ls -p ssd-archiv
  nothing 

 generation of the cache tier:
 $ rados mkpool ssd-archiv
 $ ceph osd pool set ssd-archiv crush_ruleset 5
 $ ceph osd tier add ecarchiv ssd-archiv
 $ ceph osd tier cache-mode ssd-archiv writeback
 $ ceph osd pool set ssd-archiv hit_set_type bloom
 $ ceph osd pool set ssd-archiv hit_set_count 1
 $ ceph osd pool set ssd-archiv hit_set_period 3600
 $ ceph osd pool set ssd-archiv target_max_bytes 500


 rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
 }


 Are there any magic (or which command I missed?) to see the excisting data 
 throug the cache tier?


 regards - and hoping for answers

 Udo
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Udo Lembke
Hi,
due to two more hosts (now 7 storage nodes) I want to create a new
EC pool, and I get a strange effect:

ceph@admin:~$ ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
pgs stuck undersized; 2 pgs undersized
pg 22.3e5 is stuck unclean since forever, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck unclean since forever, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is stuck undersized for 406.614447, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck undersized for 406.616563, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is stuck degraded for 406.614566, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck degraded for 406.616679, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is active+undersized+degraded, acting
[76,15,82,11,57,29,2147483647]
pg 22.240 is active+undersized+degraded, acting
[38,85,17,74,2147483647,10,58]

But I only have 91 OSDs (84 SATA + 7 SSDs), not 2147483647!
Where the heck did the 2147483647 come from?

I ran the following commands:
ceph osd erasure-code-profile set 7hostprofile k=5 m=2
ruleset-failure-domain=host
ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile

my version:
ceph -v
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)


I found an issue in my crushmap - one SSD was listed twice in the map:
host ceph-061-ssd {
id -16  # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
}
root ssd {
id -13  # do not change unnecessarily
# weight 0.780
alg straw
hash 0  # rjenkins1
item ceph-01-ssd weight 0.170
item ceph-02-ssd weight 0.170
item ceph-03-ssd weight 0.000
item ceph-04-ssd weight 0.170
item ceph-05-ssd weight 0.170
item ceph-06-ssd weight 0.050
item ceph-07-ssd weight 0.050
item ceph-061-ssd weight 0.000
}

The host ceph-061-ssd doesn't exist and osd.61 is the SSD from ceph-03-ssd,
but after fixing the crushmap the issue with the osd 2147483647 still exists.

Any idea how to fix that?

regards

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Udo Lembke
Hi Gregory,
thanks for the answer!

I have looked at which storage node is missing, and it's a different one for each PG:
pg 22.240 is stuck undersized for 24437.862139, current state 
active+undersized+degraded, last acting
[38,85,17,74,2147483647,10,58]
pg 22.240 is stuck undersized for 24437.862139, current state 
active+undersized+degraded, last acting
[ceph-04,ceph-07,ceph-02,ceph-06,2147483647,ceph-01,ceph-05]
ceph-03 is missing

pg 22.3e5 is stuck undersized for 24437.860025, current state 
active+undersized+degraded, last acting
[76,15,82,11,57,29,2147483647]
pg 22.3e5 is stuck undersized for 24437.860025, current state 
active+undersized+degraded, last acting
[ceph-06,ceph-ceph-02,ceph-07,ceph-01,ceph-05,ceph-03,2147483647]
ceph-04 is missing

Perhaps I hit a PGs-per-OSD maximum?!

I checked with the script from
http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

pool :  17  18  19  9   10  20  21  13  22  
23  16  | SUM

...
host ceph-03:
osd.24  0   12  2   2   4   76  16  5   74  
0   66  | 257
osd.25  0   17  3   4   4   89  16  4   82  
0   60  | 279
osd.26  0   20  2   5   3   71  12  5   81  
0   61  | 260
osd.27  0   18  2   4   3   73  21  3   76  
0   61  | 261
osd.28  0   14  2   9   4   73  23  9   94  
0   64  | 292
osd.29  0   19  3   3   4   54  25  4   89  
0   62  | 263
osd.30  0   22  2   6   3   80  15  6   92  
0   47  | 273
osd.31  0   25  4   2   3   87  20  3   76  
0   62  | 282
osd.32  0   13  4   2   2   64  14  1   82  
0   69  | 251
osd.33  0   12  2   5   5   89  25  7   83  
0   68  | 296
osd.34  0   28  0   8   5   81  18  3   99  
0   65  | 307
osd.35  0   17  3   2   4   74  21  3   95  
0   58  | 277
host ceph-04:
osd.36  0   13  1   9   6   72  17  5   93  
0   56  | 272
osd.37  0   21  2   5   6   83  20  4   78  
0   71  | 290
osd.38  0   17  3   2   5   64  22  7   76  
0   57  | 253
osd.39  0   21  3   7   6   79  27  4   80  
0   68  | 295
osd.40  0   15  1   5   7   71  17  6   93  
0   74  | 289
osd.41  0   16  5   5   6   76  18  6   95  
0   70  | 297
osd.42  0   13  0   6   1   71  25  4   83  
0   56  | 259
osd.43  0   20  2   2   6   81  23  4   89  
0   59  | 286
osd.44  0   21  2   5   6   77  9   5   76  
0   52  | 253
osd.45  0   11  4   8   3   76  24  6   82  
0   49  | 263
osd.46  0   17  2   5   6   57  15  4   84  
0   62  | 252
osd.47  0   19  3   2   3   84  19  5   94  
0   48  | 277
...

SUM :   768 1536192 384 384 61441536384 7168
24  5120|


Pool 22 is the new ec7archiv.

But on ceph-04 there aren't any OSDs with more than 300 PGs...

Udo

On 25.03.2015 14:52, Gregory Farnum wrote:
 On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke ulem...@polarzone.de wrote:
 Hi,
 due to two more hosts (now 7 storage nodes) I want to create an new
 ec-pool and get an strange effect:

 ceph@admin:~$ ceph health detail
 HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
 pgs stuck undersized; 2 pgs undersized
 
 This is the big clue: you have two undersized PGs!
 
 pg 22.3e5 is stuck unclean since forever, current state
 active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
 
 2147483647 is the largest number you can represent in a signed 32-bit
 integer. There's an output error of some kind which is fixed
 elsewhere; this should be -1.
 
 So for whatever reason (in general it's hard on CRUSH trying to select
 N entries out of N choices), CRUSH hasn't been able to map an OSD to
 this slot for you. You'll want to figure out why that is and fix it.
 -Greg
 
 pg 22.240 is stuck unclean since forever, current state
 active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
 pg

Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

2015-03-25 Thread Udo Lembke
Hi Don,
thanks for the info!

it looks like choose_tries set to 200 does the trick.

But the setcrushmap takes a long, long time (alarming, but the clients still have
IO)... hope it's finished soon ;-)


Udo

On 25.03.2015 16:00, Don Doerner wrote:
 Assuming you've calculated the number of PGs reasonably, see here 
 http://tracker.ceph.com/issues/10350 and here
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soonhttp://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/.
  
 I’m guessing these will address your issue.  That weird number means that no 
 OSD was found/assigned to the PG.
 
  
 
 -don-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] won leader election with quorum during osd setcrushmap

2015-03-25 Thread Udo Lembke
Hi,
due to PG trouble with an EC pool I modified the crushmap (adding "step set_choose_tries
200") from

rule ec7archiv {
ruleset 6
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take default
step chooseleaf indep 0 type host
step emit
}

to

rule ec7archiv {
ruleset 6
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step set_choose_tries 200
step take default
step chooseleaf indep 0 type host
step emit
}

ceph osd setcrushmap has been running for an hour now, and ceph -w gives the following output:

2015-03-25 17:20:18.163295 mon.0 [INF] mdsmap e766: 1/1/1 up {0=b=up:active}, 1 
up:standby
2015-03-25 17:20:18.163370 mon.0 [INF] osdmap e130004: 91 osds: 91 up, 91 in
2015-03-25 17:20:28.525445 mon.0 [INF] from='client.? 172.20.2.1:0/1007537' 
entity='client.admin' cmd=[{prefix: osd
setcrushmap}]: dispatch
2015-03-25 17:20:28.525580 mon.0 [INF] mon.0 calling new monitor election
2015-03-25 17:20:28.526263 mon.0 [INF] mon.0@0 won leader election with quorum 
0,1,2


Fortunately the clients still have access to the cluster (kvm)!!

How long does such a setcrushmap take?? Normally it's done in a few seconds.
Does the setcrushmap have a chance to finish?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Udo Lembke
Hi Tony,
sounds like a good idea!

Udo
On 09.03.2015 21:55, Tony Harris wrote:
 I know I'm not even close to this type of a problem yet with my small
 cluster (both test and production clusters) - but it would be great if
 something like that could appear in the cluster HEALTHWARN, if Ceph
 could determine the amount of used processes and compare them against
 the current limit then throw a health warning if it gets within say 10
 or 15% of the max value.  That would be a really quick indicator for
 anyone who frequently checks the health status (like through a web
 portal) as they may see it more quickly then during their regular log
 check interval.  Just a thought.

 -Tony
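For reference, a quick manual check on an OSD host looks like this (just a sketch; both the kernel-wide and the per-user limits are interesting):

cat /proc/sys/kernel/pid_max     # kernel-wide limit on processes/threads
ulimit -u                        # per-user process limit in the ceph user's shell
ps -eLf | wc -l                  # rough count of threads currently in use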


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] too few pgs in cache tier

2015-02-27 Thread Udo Lembke
Hi all,
we use an EC pool with a small cache tier in front of it for our
archive data (4 * 16TB VM disks).

The EC pool has k=3;m=2 because we started with 5 nodes, and we want to
migrate to a new EC pool with k=5;m=2. Therefore we migrated one VM disk
(16TB) from the ceph cluster to an FC RAID with the proxmox-ve "move disk"
interface.

The move finished, but while removing the ceph VM image the warnings
"'ssd-archiv' at/near target max" and "pool ssd-archiv has too few pgs" occurred.

Some hours later only the second warning remained.

ceph health detail
HEALTH_WARN pool ssd-archiv has too few pgs
pool ssd-archiv objects per pg (51196) is more than 14.7709 times
cluster average (3466)

info about the image, which was deleted:
rbd image 'vm-409-disk-1':
size 16384 GB in 4194304 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.2b8fda574b0dc51
format: 2
features: layering

I think we hit http://tracker.ceph.com/issues/8103,
but normally a single read should not put the data into the cache tier, should it??
Does deleting count as a second read??

Our ceph version: 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)


Regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power failure recovery woes

2015-02-17 Thread Udo Lembke
Hi Jeff,
is the osd /var/lib/ceph/osd/ceph-2 mounted?

If not, does it help if you mount the OSD and start it with
service ceph start osd.2
??
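Something like this (a sketch - the device name is a placeholder):

mount | grep ceph-2 || mount /dev/sdX1 /var/lib/ceph/osd/ceph-2
service ceph start osd.2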

Udo

On 17.02.2015 09:54, Jeff wrote:
 Hi,
 
 We had a nasty power failure yesterday and even with UPS's our small (5
 node, 12 OSD) cluster is having problems recovering.
 
 We are running ceph 0.87
 
 3 of our OSD's are down consistently (others stop and are restartable,
 but our cluster is so slow that almost everything we do times out).
 
 We are seeing errors like this on the OSD's that never run:
 
 ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1)
 Operation not permitted
 
 We are seeing errors like these of the OSD's that run some of the time:
 
 osd/PGLog.cc: 844: FAILED assert(last_e.version.version 
 e.version.version)
 common/HeartbeatMap.cc: 79: FAILED assert(0 == hit suicide timeout)
 
 Does anyone have any suggestions on how to recover our cluster?
 
 Thanks!
   Jeff
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placement Groups fail on fresh Ceph cluster installation with all OSDs up and in

2015-02-10 Thread Udo Lembke
Hi,
you will get further trouble, because your weights are not correct.

You need a weight >= 0.01 for each OSD. This means your OSDs must be 10GB
or greater!


Udo

On 10.02.2015 12:22, B L wrote:
 Hi Vickie,
 
 My OSD tree looks like this:
 
 ceph@ceph-node3:/home/ubuntu$ ceph osd tree
 # idweighttype nameup/downreweight
 -10root default
 -20host ceph-node1
 00osd.0up1
 10osd.1up1
 -30host ceph-node3
 20osd.2up1
 30osd.3up1
 -40host ceph-node2
 40osd.4up1
 50osd.5up1
 
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placement Groups fail on fresh Ceph cluster installation with all OSDs up and in

2015-02-10 Thread Udo Lembke
Hi,
use:
ceph osd crush set 0 0.01 pool=default host=ceph-node1
ceph osd crush set 1 0.01 pool=default host=ceph-node1
ceph osd crush set 2 0.01 pool=default host=ceph-node3
ceph osd crush set 3 0.01 pool=default host=ceph-node3
ceph osd crush set 4 0.01 pool=default host=ceph-node2
ceph osd crush set 5 0.01 pool=default host=ceph-node2

Udo
On 10.02.2015 15:01, B L wrote:
 Thanks Vikhyat,
 
 As suggested .. 
 
 ceph@ceph-node1:/home/ubuntu$ ceph osd crush reweight 0.0095 osd.0
 
 Invalid command:  osd.0 doesn't represent a float
 osd crush reweight name float[0.0-] :  change name's weight to
 weight in crush map
 Error EINVAL: invalid command
 
 What do you think
 
 
 On Feb 10, 2015, at 3:18 PM, Vikhyat Umrao vum...@redhat.com
 mailto:vum...@redhat.com wrote:

 sudo ceph osd crush reweight 0.0095 osd.0 to osd.5
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] erasure code : number of chunks for a small cluster ?

2015-02-06 Thread Udo Lembke
On 06.02.2015 09:06, Hector Martin wrote:
 On 02/02/15 03:38, Udo Lembke wrote:
 With 3 hosts only you can't survive an full node failure, because for
 that you need
 host = k + m.
 
 Sure you can. k=2, m=1 with the failure domain set to host will survive
 a full host failure.
 

Hi,
Alexandre has the requirement of surviving 2 failed disks or one full node failure.
This is the reason why I wrote that this is not possible...

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi Dan,
I mean qemu-kvm, i.e. librbd.
But how can I tell kvm to flush the buffer?

Udo

On 05.02.2015 07:59, Dan Mick wrote:
 On 02/04/2015 10:44 PM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 Udo
 Do you mean the kernel rbd or librbd?  The latter responds to flush
 requests from the hypervisor.  The former...I'm not sure it has a
 separate cache.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi Josh,
thanks for the info.

detach/reattach should be fine for me, because it's only for
performance testing.

#2468 would be fine of course.

Udo

On 05.02.2015 08:02, Josh Durgin wrote:
 On 02/05/2015 07:44 AM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 librbd exposes it as rbd_invalidate_cache(), and qemu uses it
 internally, but I don't think you can trigger that via any user-facing
 qemu commands.

 Exposing it through the admin socket would be pretty simple though:

 http://tracker.ceph.com/issues/2468

 You can also just detach and reattach the device to flush the rbd cache.

 Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi all,
is there any command to flush the rbd cache, like the
echo 3 > /proc/sys/vm/drop_caches for the OS cache?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-04 Thread Udo Lembke
Hi Marco,

On 04.02.2015 10:20, Colombo Marco wrote:
...
 We choosen the 6TB of disk, because we need a lot of storage in a small 
 amount of server and we prefer server with not too much disks.
 However we plan to use max 80% of a 6TB Disk
 

80% is too much! You will run into trouble.
Ceph doesn't distribute the data equally. Sometimes I see a
difference of 20% in usage between OSDs.

I recommend 60-70% as the maximum.
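To keep an eye on this (the second command only exists in newer releases; the ratios shown are the defaults):

ceph df        # cluster / pool level usage
ceph osd df    # per-OSD utilisation

# the relevant thresholds in ceph.conf:
# mon osd nearfull ratio = 0.85   (health warning)
# mon osd full ratio = 0.95       (writes are blocked)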

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] erasure code : number of chunks for a small cluster ?

2015-02-01 Thread Udo Lembke
Hi Alexandre,

nice to meet you here ;-)

With only 3 hosts you can't survive a full node failure, because for
that you need
hosts >= k + m.
And k=1, m=2 doesn't make any sense.

I started with 5 hosts and use k=3, m=2. In this case two HDDs can fail or
one host can be down for maintenance.
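The corresponding setup is just two commands (profile/pool names and the pg count are only examples and must fit your cluster):

ceph osd erasure-code-profile set ec32profile k=3 m=2 ruleset-failure-domain=host
ceph osd pool create ecpool 1024 1024 erasure ec32profile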

Udo

PS: you also can't change k+m on a pool later...

On 01.02.2015 18:15, Alexandre DERUMIER wrote:
 Hi,

 I'm currently trying to understand how to setup correctly a pool with erasure 
 code


 https://ceph.com/docs/v0.80/dev/osd_internals/erasure_coding/developer_notes/


 My cluster is 3 nodes with 6 osd for each node (18 osd total).

 I want to be able to survive of 2 disk failures, but also a full node failure.

 What is the best setup for this ? Does I need M=2 or M=6 ?




 Also, how to determinate the best chunk number ?

 for example,
 K = 4 , M=2
 K = 8 , M=2
 K = 16 , M=2

 you can loose which each config 2 osd, but the more data chunks you have, the 
 less space is used by coding chunks right ?
 Does the number of chunk have performance impact ? (read/write ?)

 Regards,

 Alexandre




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD capacity variance ?

2015-02-01 Thread Udo Lembke
Hi Howard,
I assume the 160 + 250 MB is a typo (GB?).
Ceph OSDs must be at least 10GB to get a weight of 0.01.

Udo

On 31.01.2015 23:39, Howard Thomson wrote:
 Hi All,

 I am developing a custom disk storage backend for the Bacula backup
 system, and am in the process of setting up a trial Ceph system,
 intending to use a direct interface to RADOS.

 I have a variety of 1Tb, 250Mb and 160Mb disk drives that I would like
 to use, but it is not [as yet] obvious as to whether having differences
 in capacity at different OSDs matters.

 Can anyone comment, or point me in the right direction on
 docs.ceph.com ?

 Thanks,

 Howard


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] estimate the impact of changing pg_num

2015-02-01 Thread Udo Lembke
Hi Xu,

On 01.02.2015 21:39, Xu (Simon) Chen wrote:
 RBD doesn't work extremely well when ceph is recovering - it is common
 to see hundreds or a few thousands of blocked requests (30s to
 finish). This translates high IO wait inside of VMs, and many
 applications don't deal with this well.
this sounds like you don't have settings like
osd max backfills = 1
osd recovery max active = 1
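These can also be injected at runtime without restarting the OSDs:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'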


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD caching on 4K reads???

2015-01-30 Thread Udo Lembke
Hi Bruce,
hmm, that sounds like the rbd cache to me.
Can you check whether the cache is really disabled in the running config with

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep cache

Udo

On 30.01.2015 21:51, Bruce McFarland wrote:

 I have a cluster and have created a rbd device - /dev/rbd1. It shows
 up as expected with 'rbd --image test info' and rbd showmapped. I have
 been looking at cluster performance with the usual Linux block device
 tools – fio and vdbench. When I look at writes and large block
 sequential reads I’m seeing what I’d expect with performance limited
 by either my cluster interconnect bandwidth or the backend device
 throughput speeds – 1 GE frontend and cluster network and 7200rpm SATA
 OSDs with 1 SSD/osd for journal. Everything looks good EXCEPT 4K
 random reads. There is caching occurring somewhere in my system that I
 haven’t been able to detect and suppress - yet.

  

 I’ve set ‘rbd_cache=false’ in the [client] section of ceph.conf on the
 client, monitor, and storage nodes. I’ve flushed the system caches on
 the client and storage nodes before test run ie vm.drop_caches=3 and
 set the huge pages to the maximum available to consume free system
 memory so that it can’t be used for system cache . I’ve also disabled
 read-ahead on all of the HDD/OSDs.

  

 When I run a 4k randon read workload on the client the most I could
 expect would be ~100iops/osd x number of osd’s – I’m seeing an order
 of magnitude greater than that AND running IOSTAT on the storage nodes
 show no read activity on the OSD disks.

  

 Any ideas on what I’ve overlooked? There appears to be some read-ahead
 caching that I’ve missed.

  

 Thanks,

 Bruce



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD caching on 4K reads???

2015-01-30 Thread Udo Lembke
Hi Bruce,
you can also look on the mon, like
ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok config show | grep cache

(I guess you have a number instead of the .b.)

Udo
On 30.01.2015 22:02, Bruce McFarland wrote:

 The ceph daemon isn’t running on the client with the rbd device so I
 can’t verify if it’s disabled at the librbd level on the client. If
 you mean on the storage nodes I’ve had some issues dumping the config.
 Does the rbd caching occur on the storage nodes, client, or both?

  

  

 *From:*Udo Lembke [mailto:ulem...@polarzone.de]
 *Sent:* Friday, January 30, 2015 1:00 PM
 *To:* Bruce McFarland; ceph-us...@ceph.com
 *Cc:* Prashanth Nednoor
 *Subject:* Re: [ceph-users] RBD caching on 4K reads???

  

 Hi Bruce,
 hmm, sounds for me like the rbd cache.
 Can you look, if the cache is realy disabled in the running config with

 ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep cache

 Udo

 On 30.01.2015 21:51, Bruce McFarland wrote:

 I have a cluster and have created a rbd device - /dev/rbd1. It
 shows up as expected with ‘rbd –image test info’ and rbd
 showmapped. I have been looking at cluster performance with the
 usual Linux block device tools – fio and vdbench. When I look at
 writes and large block sequential reads I’m seeing what I’d expect
 with performance limited by either my cluster interconnect
 bandwidth or the backend device throughput speeds – 1 GE frontend
 and cluster network and 7200rpm SATA OSDs with 1 SSD/osd for
 journal. Everything looks good EXCEPT 4K random reads. There is
 caching occurring somewhere in my system that I haven’t been able
 to detect and suppress - yet.

  

 I’ve set ‘rbd_cache=false’ in the [client] section of ceph.conf on
 the client, monitor, and storage nodes. I’ve flushed the system
 caches on the client and storage nodes before test run ie
 vm.drop_caches=3 and set the huge pages to the maximum available
 to consume free system memory so that it can’t be used for system
 cache . I’ve also disabled read-ahead on all of the HDD/OSDs.

  

 When I run a 4k randon read workload on the client the most I
 could expect would be ~100iops/osd x number of osd’s – I’m seeing
 an order of magnitude greater than that AND running IOSTAT on the
 storage nodes show no read activity on the OSD disks.

  

 Any ideas on what I’ve overlooked? There appears to be some
 read-ahead caching that I’ve missed.

  

 Thanks,

 Bruce




 ___

 ceph-users mailing list

 ceph-users@lists.ceph.com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

  


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sizing SSD's for ceph

2015-01-29 Thread Udo Lembke
Hi,

On 29.01.2015 07:53, Christian Balzer wrote:
 On Thu, 29 Jan 2015 01:30:41 + Ramakrishna Nishtala (rnishtal) wrote:

 * Per my understanding once writes are complete to journal then
 it is read again from the journal before writing to data disk. Does this
 mean, we have to do, not just sync/async writes but also reads
 ( random/seq ? ) in order to correctly size them?

 You might want to read this thread:
 https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg12952.html
 
 Assuming this didn't change (and just looking at my journal SSDs and OSD
 HDDs with atop I don't think so) your writes go to the HDDs pretty much in
 parallel.
 
 In either case, an SSD that can _write_ fast enough to satisfy your needs
 will definitely have no problems reading fast enough. 
 

Since the data is still in the cache (RAM), there are only marginal reads
from the journal SSD!

iostat from an journal ssd:

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdc 304,45 0,16 82750,46  29544 15518960008

I would say, if you see many more reads, you have too little memory.


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read-performance inside the vm

2015-01-27 Thread Udo Lembke
Hi Patrik,

On 27.01.2015 14:06, Patrik Plank wrote:
 

 ...
 I am really happy, these values above are enough for my little amount of
 vms. Inside the vms I get now for write 80mb/s and read 130mb/s, with
 write-cache enabled.
 
 But there is one little problem.
 
 Are there some tuning parameters for small files?
 
 For 4kb to 50kb files the cluster is very slow.
 

do you use a higher read-ahead inside the VM?
Like echo 4096 > /sys/block/vda/queue/read_ahead_kb
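
If that helps, you can make it persistent with a udev rule - a rough
sketch, the device match and file name are only examples:

# /etc/udev/rules.d/80-read-ahead.rules
ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/read_ahead_kb}="4096"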

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Better way to use osd's of different size

2015-01-16 Thread Udo Lembke
Hi Megov,
you should weight the OSDs so the weight represents the size (like a
weight of 3.68 for a 4TB HDD).
ceph-deploy does this automatically.

Nevertheless, even with the correct weight the disks are not filled in
an equal distribution. For that purpose you can use reweight on single
OSDs, or automatically with ceph osd reweight-by-utilization.
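
For example (OSD id and weights are only placeholders):

ceph osd crush reweight osd.12 1.82      # permanent crush weight for a ~2TB disk
ceph osd reweight-by-utilization 110     # temporary reweight of overfull OSDs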

Udo

On 14.01.2015 16:36, Межов Игорь Александрович wrote:

 Hi!


 We have a small production ceph cluster, based on firefly release.


 It was built using hardware we already have in our site so it is not
 new & shiny,

 but works quite good. It was started in 2014.09 as a proof of
 concept from 4 hosts

 with 3 x 1tb osd's each: 1U dual socket Intel 54XX and 55XX platforms on
 1 gbit network.


 Now it contains 4x12 osd nodes on shared 10Gbit network. We use it as
 a backstore

 for running VMs under qemu+rbd.


 During migration we temporarily use 1U nodes with 2tb osds and already
 face some

 problems with uneven distribution. I know, that the best practice is
 to use osds of same

 capacity, but it is impossible sometimes.


 Now we have 24-28 spare 2tb drives and want to increase capacity on
 the same boxes.

 What is the more right way to do it:

 - replace 12x1tb drives with 12x2tb drives, so we will have 2 nodes
 full of 2tb drives and

 other nodes remains in 12x1tb confifg

 - or replace 1tb to 2tb drives in more unify way, so every node will
 have 6x1tb + 6x2tb drives?


 I feel that the second way will give more smooth distribution among
 the nodes, and

 outage of one node may give lesser impact on cluster. Am I right and
 what you can

 advice me in such a situation?




 Megov Igor
 yuterra.ru, CIO
 me...@yuterra.ru


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Part 2: ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)

2015-01-14 Thread Udo Lembke
Hi again,
sorry for not keeping this threaded, but my last email didn't come back
from the mailing list (I often miss some posts!).

Just after sending the last mail, another SSD failed for the first time - in
this case a cheap one, but with the same error:

root@ceph-04:/var/log/ceph# more ceph-osd.62.log
2015-01-13 16:40:55.712967 7fb29cfd3700  0 log [INF] : 17.2 scrub ok
2015-01-13 17:54:35.548361 7fb29dfd5700  0 log [INF] : 17.3 scrub ok
2015-01-13 17:54:38.007014 7fb29dfd5700  0 log [INF] : 17.5 scrub ok
2015-01-13 17:54:41.215558 7fb29d7d4700  0 log [INF] : 17.f scrub ok
2015-01-13 17:54:42.277585 7fb29dfd5700  0 log [INF] : 17.a scrub ok
2015-01-13 17:54:48.961582 7fb29d7d4700  0 log [INF] : 17.6 scrub ok
2015-01-13 20:15:08.749597 7fb292337700  0 -- 192.168.3.14:6824/9185 
192.168.3.15:6824/11735 pipe(0x107d9680 sd=307 :6824 s=2 pgs=2 cs=1
l=0 c=0x124a09a0).fault, initiating reconnect
2015-01-13 20:15:08.750803 7fb296dbe700  0 -- 192.168.3.14:0/9185 
192.168.3.15:6825/11735 pipe(0xd011180 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x
8d19760).fault
2015-01-13 20:15:08.750804 7fb292b3f700  0 -- 192.168.3.14:0/9185 
172.20.2.15:6837/11735 pipe(0x1210f900 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x
beae840).fault
2015-01-13 20:15:08.751056 7fb291d31700  0 -- 192.168.3.14:6824/9185 
192.168.3.15:6824/11735 pipe(0x107d9680 sd=29 :6824 s=1 pgs=2 cs=2 l
=0 c=0x124a09a0).fault
2015-01-13 20:15:27.035342 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:07.035339)
2015-01-13 20:15:28.036773 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.036769)
2015-01-13 20:15:28.945179 7fb29b7d0700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.945178)
2015-01-13 20:15:29.037016 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:09.037014)
2015-01-13 20:15:30.037204 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.037202)
2015-01-13 20:15:30.645491 7fb29b7d0700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.645483)
2015-01-13 20:15:31.037326 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:11.037323)
2015-01-13 20:15:32.037442 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:12.037439)
2015-01-13 20:15:33.037641 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:13.037637)
2015-01-13 20:15:34.037843 7fb2b3edd700 -1 osd.62 116422
heartbeat_check: no reply from osd.61 since back 2015-01-13
20:15:06.843259 front 2
015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:14.037839)
2015-01-13 21:39:35.241153 7fb29dfd5700  0 log [INF] : 17.d scrub ok
2015-01-13 21:39:39.293113 7fb29a7ce700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bo
ol)' thread 7fb29a7ce700 time 2015-01-13 21:39:39.279799
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid
>= scrubber.end)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int,
bool)+0x1320) [0x9296b0]
 2:
(ReplicatedPG::try_flush_mark_clean(boost::shared_ptrReplicatedPG::FlushOp)+0x5f6)
[0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296)
[0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9) [0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fb2b94ceb50]
 8: (clone()+0x6d) [0x7fb2b80dc7bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
  -127 2015-01-10 19:39:41.861724 7fb2b9faa780  5 asok(0x28e4230)
register_command perfcounters_dump hook 0x28d4010
  -126 2015-01-10 19:39:41.861749 7fb2b9faa780  5 asok(0x28e4230)
register_command 1 hook 0x28d4010
  -125 2015-01-10 19:39:41.861753 7fb2b9faa780  5 asok(0x28e4230)
register_command perf dump hook 0x28d4010
  -124 2015-01-10 19:39:41.861756 7fb2b9faa780  5 asok(0x28e4230)
register_command perfcounters_schema hook 0x28d4010
  -123 2015-01-10 19:39:41.861759 7fb2b9faa780  5 asok(0x28e4230)
register_command 2 hook 0x28d4010
  -122 2015-01-10 19:39:41.861762 7fb2b9faa780  

[ceph-users] ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)

2015-01-13 Thread Udo Lembke
Hi,
since last Thursday we have had an SSD pool (cache tier) in front of an
EC pool and have been filling the pools with data via rsync (approx. 50MB/s).
The SSD pool has three disks, and one of them (a DC S3700) has failed four
times since then.
I simply start the OSD again, the pool was rebuilt, and it works again
for some hours up to some days.

I switched the ceph node and the SAS adapter, but this didn't solve the
issue.
There weren't any messages in syslog/messages and an fsck ran without
trouble, so I guess the problem is not OS-related.

I found this issue http://tracker.ceph.com/issues/8747 but my
ceph version is newer (debian: ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3)),
and it looks like I can reproduce this issue within 1-3 days.

The OSD is ext4-formatted. All other OSDs (62) run without trouble.

# more ceph-osd.61.log
2015-01-13 16:29:26.494458 7fedf9a3d700  0 log [INF] : 17.0 scrub ok
2015-01-13 17:29:03.988530 7fedf823a700  0 log [INF] : 17.16 scrub ok
2015-01-13 17:30:31.901032 7fedf8a3b700  0 log [INF] : 17.18 scrub ok
2015-01-13 17:31:58.983736 7fedf823a700  0 log [INF] : 17.9 scrub ok
2015-01-13 17:32:30.780308 7fedf9a3d700  0 log [INF] : 17.c scrub ok
2015-01-13 17:32:33.311433 7fedf8a3b700  0 log [INF] : 17.11 scrub ok
2015-01-13 17:37:22.237214 7fedf9a3d700  0 log [INF] : 17.7 scrub ok
2015-01-13 20:15:07.874376 7fedf6236700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bo
ol)' thread 7fedf6236700 time 2015-01-13 20:15:07.853440
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid
>= scrubber.end)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int,
bool)+0x1320) [0x9296b0]
 2:
(ReplicatedPG::try_flush_mark_clean(boost::shared_ptrReplicatedPG::FlushOp)+0x5f6)
[0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296)
[0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9) [0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fee152f6b50]
 8: (clone()+0x6d) [0x7fee13f047bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
   -70 2015-01-11 19:54:47.962164 7fee15dd4780  5 asok(0x2f56230)
register_command perfcounters_dump hook 0x2f44010
   -69 2015-01-11 19:54:47.962190 7fee15dd4780  5 asok(0x2f56230)
register_command 1 hook 0x2f44010
   -68 2015-01-11 19:54:47.962195 7fee15dd4780  5 asok(0x2f56230)
register_command perf dump hook 0x2f44010
   -67 2015-01-11 19:54:47.962201 7fee15dd4780  5 asok(0x2f56230)
register_command perfcounters_schema hook 0x2f44010
   -66 2015-01-11 19:54:47.962203 7fee15dd4780  5 asok(0x2f56230)
register_command 2 hook 0x2f44010
   -65 2015-01-11 19:54:47.962207 7fee15dd4780  5 asok(0x2f56230)
register_command perf schema hook 0x2f44010
   -64 2015-01-11 19:54:47.962209 7fee15dd4780  5 asok(0x2f56230)
register_command config show hook 0x2f44010
   -63 2015-01-11 19:54:47.962214 7fee15dd4780  5 asok(0x2f56230)
register_command config set hook 0x2f44010
   -62 2015-01-11 19:54:47.962219 7fee15dd4780  5 asok(0x2f56230)
register_command config get hook 0x2f44010
   -61 2015-01-11 19:54:47.962223 7fee15dd4780  5 asok(0x2f56230)
register_command log flush hook 0x2f44010
   -60 2015-01-11 19:54:47.962226 7fee15dd4780  5 asok(0x2f56230)
register_command log dump hook 0x2f44010
   -59 2015-01-11 19:54:47.962229 7fee15dd4780  5 asok(0x2f56230)
register_command log reopen hook 0x2f44010
   -58 2015-01-11 19:54:47.965000 7fee15dd4780  0 ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 117
35
   -57 2015-01-11 19:54:47.967362 7fee15dd4780  1 finished
global_init_daemonize
   -56 2015-01-11 19:54:47.971666 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is suppo
rted and appears to work
   -55 2015-01-11 19:54:47.971682 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is disab
led via 'filestore fiemap' config option
   -54 2015-01-11 19:54:47.973281 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
syscall(SYS_syncfs, f
d) fully supported
   -53 2015-01-11 19:54:47.975393 7fee15dd4780  0
filestore(/var/lib/ceph/osd/ceph-61) limited size xattrs
   -52 2015-01-11 19:54:48.013905 7fee15dd4780  0
filestore(/var/lib/ceph/osd/ceph-61) mount: enabling WRITEAHEAD journal
mode: checkpoint
is not enabled
   -51 2015-01-11 19:54:49.245360 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is suppo
rted and appears to work
   -50 2015-01-11 19:54:49.245370 7fee15dd4780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features:
FIEMAP ioctl is disab
led via 'filestore fiemap' config option
   -49 2015-01-11 19:54:49.247017 7fee15dd4780  0

Re: [ceph-users] backfill_toofull, but OSDs not full

2015-01-09 Thread Udo Lembke
Hi,
I had a similar effect two weeks ago - 1 PG in backfill_toofull, and after
reweighting and deleting there was enough free space, but the rebuild
process stopped after a while.

After stopping and starting ceph on the second node, the rebuild process ran
without trouble and the backfill_toofull was gone.

This happened with firefly.

Udo

On 09.01.2015 21:29, c3 wrote:
 In this case the root cause was half denied reservations.

 http://tracker.ceph.com/issues/9626

 This stopped backfills since, those listed as backfilling were
 actually half denied and doing nothing. The toofull status is not
 checked until a free backfill slot happens, so everything was just stuck.

 Interestingly, the toofull was created by other backfills which were
 not stoppped.
 http://tracker.ceph.com/issues/9594

 Quite the log jam to clear.


 Quoting Craig Lewis cle...@centraldesktop.com:

 What was the osd_backfill_full_ratio?  That's the config that controls
 backfill_toofull.  By default, it's 85%.  The mon_osd_*_ratio affect the
 ceph status.

 I've noticed that it takes a while for backfilling to restart after
 changing osd_backfill_full_ratio.  Backfilling usually restarts for
 me in
 10-15 minutes.  Some PGs will stay in that state until the cluster is
 nearly done recoverying.

 I've only seen backfill_toofull happen after the OSD exceeds the
 ratio (so
 it's reactive, not proactive).  Mine usually happen when I'm
 rebalancing a
 nearfull cluster, and an OSD backfills itself toofull.




 On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:

 Hi,

 I am wondering how a PG gets marked backfill_toofull.

 I reweighted several OSDs using ceph osd crush reweight. As
 expected, PG
 began moving around (backfilling).

 Some PGs got marked +backfilling (~10), some +wait_backfill (~100).

 But some are marked +backfill_toofull. My OSDs are between 25% and 72%
 full.

 Looking at ceph pg dump, I can find the backfill_toofull PGs and
 verified
 the OSDs involved are less than 72% full.

 Do backfill reservations include a size? Are these OSDs projected to be
 toofull, once the current backfilling complete? Some of the
 backfill_toofull and backfilling point to the same OSDs.

 I did adjust the full ratios, but that did not change the
 backfill_toofull
 status.
 ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
 ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Improving Performance with more OSD's?

2015-01-04 Thread Udo Lembke
Hi Lindsay,

On 05.01.2015 06:52, Lindsay Mathieson wrote:
 ...
 So two OSD Nodes had:
 - Samsung 840 EVO SSD for Op. Sys.
 - Intel 530 SSD for Journals (10GB Per OSD)
 - 3TB WD Red
 - 1 TB WD Blue
 - 1 TB WD Blue
 - Each disk weighted at 1.0
 - Primary affinity of the WD Red (slow) set to 0
the weight should reflect the size of the filesystem. With a weight of 1 for
all disks you run into trouble when your cluster fills up, because the
1TB disks are full before the 3TB disks!

You should have something like 0.9 for the 1TB and 2.82 for the 3TB
disks ( df -k | grep osd | awk '{print $2/(1024^3) }'  ).

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.90 released

2014-12-23 Thread Udo Lembke
Hi Sage,

On 23.12.2014 15:39, Sage Weil wrote:
...
 
 You can't reduce the PG count without creating new (smaller) pools 
 and migrating data. 
does this also work with the 'metadata' pool, or is this pool essential
for ceph?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any Good Ceph Web Interfaces?

2014-12-23 Thread Udo Lembke
Hi,
for monitoring only I use the Ceph Dashboard
https://github.com/Crapworks/ceph-dash/

For me it's a nice tool for a good overview - for administration I use
the CLI.
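
If you want to try it, it is basically a small Python/Flask app that you
clone and start on a host with a working ceph config/keyring - a rough
sketch, please check the project's README for the exact steps and port:

git clone https://github.com/Crapworks/ceph-dash.git
cd ceph-dash
./ceph-dash.py     # then point a browser at http://<host>:5000/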


Udo

On 23.12.2014 01:11, Tony wrote:
 Please don't mention calamari :-)

 The best web interface for ceph that actually works with RHEL6.6 

 Preferable something in repo and controls and monitors all other ceph
 osd, mon, etc.


 Take everything and live for the moment.




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see which crush tunables are active in a ceph-cluster?

2014-12-20 Thread Udo Lembke
Hi,
for information for other cepher...

I switched from "unknown" crush tunables to firefly and it took 6 hours
(30.853% degradation) to finish on our production cluster (5 nodes, 60
OSDs, 10GbE, 20% data used:  pgmap v35678572: 3904 pgs, 4 pools, 21947
GB data, 5489 kobjects).
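
For anyone wondering, the switch itself is just the following (it triggers
a large rebalance!), and you can check what is active afterwards:

ceph osd crush tunables firefly
ceph osd crush show-tunables -f json-pretty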

Should a chooseleaf_vary_r = 1 (from 0) take roughly the same time
to finish??


Regards

Udo

On 04.12.2014 14:09, Udo Lembke wrote:
 Hi,
 to answer myself.

 With ceph osd crush show-tunables I see a little bit more, but I don't
 know how far away from the firefly tunables the production cluster is.

 New testcluster with profile optimal:
 ceph osd crush show-tunables
 { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 1,
   profile: firefly,
   optimal_tunables: 1,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 1}

 the production cluster:
  ceph osd crush show-tunables
 { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 0,
   profile: unknown,
   optimal_tunables: 0,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 0}

 Does this look like argonaut or bobtail?

 And how should I proceed to update?
 Does it make sense to first go to profile bobtail and then to firefly?


 Regards

 Udo

 On 01.12.2014 17:39, Udo Lembke wrote:
 Hi all,
 http://ceph.com/docs/master/rados/operations/crush-map/#crush-tunables
 described how to set the tunables to legacy, argonaut, bobtail, firefly
 or optimal.

 But how can I see, which profile is active in an ceph-cluster?

 With ceph osd getcrushmap I don't get really much info
 (only tunable choose_local_tries 0
 tunable choose_local_fallback_tries 0
 tunable choose_total_tries 50)


 Udo

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see which crush tunables are active in a ceph-cluster?

2014-12-20 Thread Udo Lembke
Hi Craig,
right! I had also posted one mail in that thread.

My question was whether the whole step to chooseleaf_vary_r = 1 takes the
same amount of time as setting the tunables to firefly.

The funny thing: I just decompiled the crushmap to start with
chooseleaf_vary_r = 4 and saw that after the upgrade tonight
chooseleaf_vary_r is already at 1!
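
(For reference, the decompile/edit/recompile cycle is basically:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the "tunable chooseleaf_vary_r ..." line, then:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new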

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
...

ceph osd crush show-tunables -f json-pretty

{ choose_local_tries: 0,
  choose_local_fallback_tries: 0,
  choose_total_tries: 50,
  chooseleaf_descend_once: 1,
  profile: firefly,
  optimal_tunables: 1,
  legacy_tunables: 0,
  require_feature_tunables: 1,
  require_feature_tunables2: 1}


Udo

On 20.12.2014 17:53, Craig Lewis wrote:
 There was a tunables discussion on the ML a few months ago, with a lot
 of good suggestions.  Sage gave some suggestions on rolling out (and
 rolling back) chooseleaf_vary_r changes.  That reminds me... I
 intended to try those changes over the holidays...


 Found it; the subject was ceph osd crush tunables optimal AND add new
 OSD at the same time.


 On Sat, Dec 20, 2014 at 3:26 AM, Udo Lembke ulem...@polarzone.de wrote:

 Hi,
 for information for other cepher...

 I switched from unknown crush tunables to firefly and it's takes 6
 hour
 (30.853% degration) to finisched on our production-cluster (5
 Nodes, 60
 OSDs, 10GBE, 20% data used:  pgmap v35678572: 3904 pgs, 4 pools, 21947
 GB data, 5489 kobjects).

 Should an chooseleaf_vary_r 1 (from 0) take round about the same
 time
 to finished??


 Regards

 Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-18 Thread Udo Lembke
Hi Mark,

On 18.12.2014 07:15, Mark Kirkwood wrote:

 While you can't do much about the endurance lifetime being a bit low,
 you could possibly improve performance using a journal *file* that is
 located on the 840's (you'll need to symlink it - disclaimer - have
 not tried this myself, but will experiment if you are interested).
 Slightly different open() options are used in this case and these
 cheaper consumer SSD seem to work better with them.
I had the symlink-file method before (with different SSDs), but the
performance was much better after changing to partitions.
I first tried some different consumer SSDs with the journal as a file and
ended up now with DC S3700s and partitions.
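
Moving a journal from a file to a partition is roughly the following
(stop the OSD first; OSD id and partition are only examples):

ceph-osd -i 0 --flush-journal
ln -sf /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
# then start the OSD again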

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Any tuning of LVM-Storage inside an VM related to ceph?

2014-12-18 Thread Udo Lembke
Hi all,
I have some fileservers with insufficient read speed.
Enabling read-ahead inside the VM improves the read speed, but it looks
like this has a drawback during LVM operations like pvmove.

For test purposes I am moving the LVM storage inside a VM from vdb to vdc1.
It takes days, because it's 3TB of data.
After enabling read-ahead (echo 4096 >
/sys/block/vdb/queue/read_ahead_kb; echo 4096 >
/sys/block/vdc/queue/read_ahead_kb) the move speed drops noticeably!

Are there any tunings to improve the speed of LVM on rbd storage?
Perhaps, if using partitions, align the partitions to 4MB?

Any hints?
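
To be concrete about the alignment idea, I have something like this in
mind (untested sketch, device names are only examples):

parted -s /dev/vdc mklabel gpt
parted -s /dev/vdc mkpart primary 4MiB 100%   # partition starts at 4MiB
pvcreate --dataalignment 4M /dev/vdc1         # align LVM data to 4MB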


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Udo Lembke
Hi Lindsay,
have you tried the different cache options (no cache, writethrough,
...) that Proxmox offers for the drive?


Udo

On 18.12.2014 05:52, Lindsay Mathieson wrote:
 I've been experimenting with CephFS for running KVM images (proxmox).

 cephfs fuse version - 0.87

 cephfs kernel module - kernel version 3.10


 Part of my testing involves running a Windows 7 VM up and running
 CrystalDiskMark to check the I/O in the VM. Its surprisingly good with
 both the fuse and the kernel driver, seq reads  writes are actually
 faster than the underlying disk, so I presume the FS is aggressively
 caching.

 With the fuse driver I have no problems.

 With the kernel driver, the benchmark runs fine, but when I reboot the
 VM the drive is corrupted and unreadable, every time. Rolling back to
 a snapshot fixes the disk. This does not happen unless I run the
 benchmark, which I presume is writing a lot of data.

 No problems with the same test for Ceph rbd, or NFS.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-17 Thread Udo Lembke
Hi Mikaël,


 I have EVOs too, what do you mean by not playing well with D_SYNC?
 Is there something I can test on my side to compare results with you,
 as I have mine flashed?
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
describes how to test the SSD performance for a journal SSD (your SSD will
be overwritten!!).
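
The core of that test is direct, synchronous 4k writes, essentially
something like this (it destroys data on the device - sdX is a placeholder!):

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync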

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple issues :( Ubuntu 14.04, latest Ceph

2014-12-15 Thread Udo Lembke
Hi Benjamin,
On 15.12.2014 03:31, Benjamin wrote:
 Hey there,

 I've set up a small VirtualBox cluster of Ceph VMs. I have one
 ceph-admin0 node, and three ceph0,ceph1,ceph2 nodes for a total of 4.

 I've been following this
 guide: http://ceph.com/docs/master/start/quick-ceph-deploy/ to the letter.

 At the end of the guide, it calls for you to run ceph health... this
 is what happens when I do.

 HEALTH_ERR 64 pgs stale; 64 pgs stuck stale; 2 full osd(s); 2/2 in
 osds are down
hmm, why do you have only two OSDs with three nodes?

Can you post the output of the following commands?
ceph health detail
ceph osd tree
rados df
ceph osd pool get data size
ceph osd pool get rbd size
df -h # on all OSD-nodes

/etc/init.d/ceph start osd.0  # on node with osd.0
/etc/init.d/ceph start osd.1  # on node with osd.1


Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple issues :( Ubuntu 14.04, latest Ceph

2014-12-15 Thread Udo Lembke
Hi,
see here:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15546.html

Udo

On 16.12.2014 05:39, Benjamin wrote:
 I increased the OSDs to 10.5GB each and now I have a different issue...

 cephy@ceph-admin0:~/ceph-cluster$ echo {Test-data} > testfile.txt
 cephy@ceph-admin0:~/ceph-cluster$ rados put test-object-1 testfile.txt
 --pool=data
 error opening pool data: (2) No such file or directory
 cephy@ceph-admin0:~/ceph-cluster$ ceph osd lspools
 0 rbd,

 Here's ceph -w:
 cephy@ceph-admin0:~/ceph-cluster$ ceph -w
 cluster b3e15af-SNIP
  health HEALTH_WARN mon.ceph0 low disk space; mon.ceph1 low disk
 space; mon.ceph2 low disk space; clock skew detected on mon.ceph0,
 mon.ceph1, mon.ceph2
  monmap e3: 4 mons at
 {ceph-admin0=10.0.1.10:6789/0,ceph0=10.0.1.11:6789/0,ceph1=10.0.1.12:6789/0,ceph2=10.0.1.13:6789/0},
 election epoch 10, quorum 0,1,2,3 ceph-admin0,ceph0,ceph1,ceph2
  osdmap e17: 3 osds: 3 up, 3 in
   pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
 19781 MB used, 7050 MB / 28339 MB avail
   64 active+clean

 Any other commands to run that would be helpful? Is it safe to simply
 manually create the data and metadata pools myself?

 On Mon, Dec 15, 2014 at 5:07 PM, Benjamin zor...@gmail.com wrote:

 Aha, excellent suggestion! I'll try that as soon as I get back,
 thank you.
 - B

 On Dec 15, 2014 5:06 PM, Craig Lewis cle...@centraldesktop.com wrote:


 On Sun, Dec 14, 2014 at 6:31 PM, Benjamin zor...@gmail.com wrote:

 The machines each have Ubuntu 14.04 64-bit, with 1GB of
 RAM and 8GB of disk. They have between 10% and 30% disk
 utilization but common between all of them is that they
 *have free disk space* meaning I have no idea what the
 heck is causing Ceph to complain.


 Each OSD is 8GB?  You need to make them at least 10 GB.

 Ceph weights each disk as it's size in TiB, and it truncates
 to two decimal places.  So your 8 GiB disks have a weight of
 0.00.  Bump it up to 10 GiB, and it'll get a weight of 0.01.

 You should have 3 OSDs, one for each of ceph0,ceph1,ceph2.

 If that doesn't fix the problem, go ahead and post the things
 Udo mentioned.



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] For all LSI SAS9201-16i users - don't upgrade to firmware P20

2014-12-11 Thread Udo Lembke
Hi all,
I have upgraded two LSI SAS9201-16i HBAs to the latest firmware P20.00.00
and after that I got the following syslog messages:

Dec  9 18:11:31 ceph-03 kernel: [  484.602834] mpt2sas0: log_info(0x3108): 
originator(PL), code(0x08), sub_code(0x)
Dec  9 18:12:15 ceph-03 kernel: [  528.310174] mpt2sas0: log_info(0x3108): 
originator(PL), code(0x08), sub_code(0x)
Dec  9 18:15:25 ceph-03 kernel: [  718.782477] mpt2sas0: log_info(0x3108): 
originator(PL), code(0x08), sub_code(0x)

The next night one OSD went down (mounted read-only, and I had to repair the
filesystem with fsck) and then two other OSDs followed.

Then I changed the card, and after some tries I was able to downgrade* the
cards to P17, which runs stable.


Udo


* downgraded on a fourth computer booted into DOS, with sas2flsh -o -e 6...
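
The full downgrade sequence was roughly the following (the firmware/BIOS
file names depend on the board and the P17 package you download):

sas2flsh -o -e 6                                 # erase the flash - do not reboot now!
sas2flsh -o -f 9201-16i_P17.bin -b mptsas2.rom   # flash P17 firmware + BIOS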
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Old OSDs on new host, treated as new?

2014-12-05 Thread Udo Lembke
Hi,
perhaps a stupid question, but why do you change the hostname?

I haven't tried it, but I guess if you boot the node with a new hostname,
the old hostname stays in the crush map, but without any OSDs - because
they are now under the new host.
I don't know (I guess not) whether the degradation level also stays at 5%
if you delete the empty host from the crush map.
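
(Removing the now-empty host bucket afterwards would be something like:)

ceph osd crush remove <old-hostname>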

I would simply use the same host config on a rebuilt host.

Udo

On 03.12.2014 05:06, Indra Pramana wrote:
 Dear all,

 We have a Ceph cluster with several nodes, each node contains 4-6
 OSDs. We are running the OS off USB drive to maximise the use of the
 drive bays for the OSDs and so far everything is running fine.

 Occasionally, the OS running on the USB drive would fail, and we would
 normally replace the drive with a pre-configured similar OS and Ceph
 running, so when the new OS boots up, it will automatically detect all
 the OSDs and start them. It works fine without any issues.

 However, the issue is in recovery. When one node goes down, all the
 OSDs would be down and recovery will start to move the pg replicas on
 the affected OSDs to other available OSDs, and cause the Ceph to be
 degraded, say 5%, which is expected. However, when we boot up the
 failed node with a new OS, and bring back the OSDs up, more PGs are
 being scheduled for backfilling and instead of reducing, the
 degradation level will shoot up again to, for example, 10%, and in
 some occasion, it goes up to 19%.

 We had experience when one node is down, it will degraded to 5% and
 recovery will start, but when we manage to bring back up the node
 (still the same OS), the degradation level will reduce to below 1% and
 eventually recovery will be completed faster.

 Why doesn't the same behaviour apply in the above situation? The OSD
 numbers are the same when the node boots up, the crush map weight
 values are also the same. Only the hostname is different.

 Any advice / suggestions?

 Looking forward to your reply, thank you.

 Cheers.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to see which crush tunables are active in a ceph-cluster?

2014-12-01 Thread Udo Lembke
Hi all,
http://ceph.com/docs/master/rados/operations/crush-map/#crush-tunables
described how to set the tunables to legacy, argonaut, bobtail, firefly
or optimal.

But how can I see, which profile is active in an ceph-cluster?

With ceph osd getcrushmap I don't get really much info
(only tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50)


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-06 Thread Udo Lembke
Hi,
from one host to five OSD-hosts.

NIC Intel 82599EB; jumbo-frames; single Switch IBM G8124 (blade network).

rtt min/avg/max/mdev = 0.075/0.114/0.231/0.037 ms
rtt min/avg/max/mdev = 0.088/0.164/0.739/0.072 ms
rtt min/avg/max/mdev = 0.081/0.141/0.229/0.030 ms
rtt min/avg/max/mdev = 0.083/0.115/0.183/0.030 ms
rtt min/avg/max/mdev = 0.087/0.144/0.190/0.028 ms


Udo

On 06.11.2014 14:18, Wido den Hollander wrote:
 Hello,
 
 While working at a customer I've ran into a 10GbE latency which seems
 high to me.
 
 I have access to a couple of Ceph cluster and I ran a simple ping test:
 
 $ ping -s 8192 -c 100 -n ip
 
 Two results I got:
 
 rtt min/avg/max/mdev = 0.080/0.131/0.235/0.039 ms
 rtt min/avg/max/mdev = 0.128/0.168/0.226/0.023 ms
 
 Both these environment are running with Intel 82599ES 10Gbit cards in
 LACP. One with Extreme Networks switches, the other with Arista.
 
 Now, on a environment with Cisco Nexus 3000 and Nexus 7000 switches I'm
 seeing:
 
 rtt min/avg/max/mdev = 0.160/0.244/0.298/0.029 ms
 
 As you can see, the Cisco Nexus network has high latency compared to the
 other setup.
 
 You would say the switches are to blame, but we also tried with a direct
 TwinAx connection, but that didn't help.
 
 This setup also uses the Intel 82599ES cards, so the cards don't seem to
 be the problem.
 
 The MTU is set to 9000 on all these networks and cards.
 
 I was wondering, others with a Ceph cluster running on 10GbE, could you
 perform a simple network latency test like this? I'd like to compare the
 results.
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Typical 10GbE latency

2014-11-06 Thread Udo Lembke
Hi,
no special optimizations on the host.
In this case the pings are from a Proxmox VE host to the ceph OSDs (ubuntu
+ debian).

The pings from one osd to the others are comparable.

Udo

On 06.11.2014 15:00, Irek Fasikhov wrote:
 Hi,Udo.
 Good value :)

 Did you do any additional optimization on the host?
 Thanks.

 Thu Nov 06 2014 at 16:57:36, Udo Lembke ulem...@polarzone.de:

 Hi,
 from one host to five OSD-hosts.

 NIC Intel 82599EB; jumbo-frames; single Switch IBM G8124 (blade
 network).

 rtt min/avg/max/mdev = 0.075/0.114/0.231/0.037 ms
 rtt min/avg/max/mdev = 0.088/0.164/0.739/0.072 ms
 rtt min/avg/max/mdev = 0.081/0.141/0.229/0.030 ms
 rtt min/avg/max/mdev = 0.083/0.115/0.183/0.030 ms
 rtt min/avg/max/mdev = 0.087/0.144/0.190/0.028 ms


 Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about activate OSD

2014-10-31 Thread Udo Lembke
Hi German,
if I'm right, the journal creation on /dev/sdc1 failed (perhaps because
you passed only /dev/sdc instead of /dev/sdc1 in the prepare step?).

Do you have partitions on sdc?
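
One way to get back to a clean state is to zap both disks so no stale
journal is left, and then run prepare again - an untested sketch based on
your own command line:

ceph-deploy disk zap ceph-bkp-osd01:sdf ceph-bkp-osd01:sdc
ceph-deploy --overwrite-conf disk prepare --fs-type btrfs ceph-bkp-osd01:sdf:/dev/sdc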


Udo

On 31.10.2014 22:02, German Anders wrote:
 Hi all,
   I'm having some issues while trying to activate a new osd in a
 new cluster, the prepare command run fine, but then the activate
 command failed:

 ceph@cephbkdeploy01:~/desp-bkp-cluster$ ceph-deploy --overwrite-conf
 disk prepare --fs-type btrfs ceph-bkp-osd01:sdf:/dev/sdc
 [ceph_deploy.cli][INFO  ] Invoked (1.4.0): /usr/bin/ceph-deploy
 --overwrite-conf disk prepare --fs-type btrfs ceph-bkp-osd01:sdf:/dev/sdc
 [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
 ceph-bkp-osd01:/dev/sdf:/dev/sdc
 [ceph-bkp-osd01][DEBUG ] connected to host: ceph-bkp-osd01
 [ceph-bkp-osd01][DEBUG ] detect platform information from remote host
 [ceph-bkp-osd01][DEBUG ] detect machine type
 [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
 [ceph_deploy.osd][DEBUG ] Deploying osd to ceph-bkp-osd01
 [ceph-bkp-osd01][DEBUG ] write cluster configuration to
 /etc/ceph/{cluster}.conf
 [ceph-bkp-osd01][INFO  ] Running command: sudo udevadm trigger
 --subsystem-match=block --action=add
 [ceph_deploy.osd][DEBUG ] Preparing host ceph-bkp-osd01 disk /dev/sdf
 journal /dev/sdc activate False
 [ceph-bkp-osd01][INFO  ] Running command: sudo ceph-disk-prepare
 --fs-type btrfs --cluster ceph -- /dev/sdf /dev/sdc
 [ceph-bkp-osd01][WARNIN] libust[13609/13609]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] libust[13627/13627]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] WARNING:ceph-disk:OSD will not be
 hot-swappable if journal is not the same device as the osd data
 [ceph-bkp-osd01][WARNIN] Turning ON incompat feature 'extref':
 increased hardlink limit per file to 65536
 [ceph-bkp-osd01][DEBUG ] Creating new GPT entries.
 [ceph-bkp-osd01][DEBUG ] The operation has completed successfully.
 [ceph-bkp-osd01][DEBUG ] Creating new GPT entries.
 [ceph-bkp-osd01][DEBUG ] The operation has completed successfully.
 [ceph-bkp-osd01][DEBUG ]
 [ceph-bkp-osd01][DEBUG ] WARNING! - Btrfs v3.12 IS EXPERIMENTAL
 [ceph-bkp-osd01][DEBUG ] WARNING! - see http://btrfs.wiki.kernel.org
 before using
 [ceph-bkp-osd01][DEBUG ]
 [ceph-bkp-osd01][DEBUG ] fs created label (null) on /dev/sdf1
 [ceph-bkp-osd01][DEBUG ] nodesize 32768 leafsize 32768 sectorsize
 4096 size 2.73TiB
 [ceph-bkp-osd01][DEBUG ] Btrfs v3.12
 [ceph-bkp-osd01][DEBUG ] The operation has completed successfully.
 [ceph_deploy.osd][DEBUG ] Host ceph-bkp-osd01 is now ready for osd use.
 ceph@cephbkdeploy01:~/desp-bkp-cluster$
 ceph@cephbkdeploy01:~/desp-bkp-cluster$ ceph-deploy --overwrite-conf
 disk activate --fs-type btrfs ceph-bkp-osd01:sdf1:/dev/sdc1
 [ceph_deploy.cli][INFO  ] Invoked (1.4.0): /usr/bin/ceph-deploy
 --overwrite-conf disk activate --fs-type btrfs
 ceph-bkp-osd01:sdf1:/dev/sdc1
 [ceph_deploy.osd][DEBUG ] Activating cluster ceph disks
 ceph-bkp-osd01:/dev/sdf1:/dev/sdc1
 [ceph-bkp-osd01][DEBUG ] connected to host: ceph-bkp-osd01
 [ceph-bkp-osd01][DEBUG ] detect platform information from remote host
 [ceph-bkp-osd01][DEBUG ] detect machine type
 [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
 [ceph_deploy.osd][DEBUG ] activating host ceph-bkp-osd01 disk /dev/sdf1
 [ceph_deploy.osd][DEBUG ] will use init type: upstart
 [ceph-bkp-osd01][INFO  ] Running command: sudo ceph-disk-activate
 --mark-init upstart --mount /dev/sdf1
 [ceph-bkp-osd01][WARNIN] libust[14025/14025]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] libust[14028/14028]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] got monmap epoch 1
 [ceph-bkp-osd01][WARNIN] libust[14059/14059]: Warning: HOME
 environment variable not set. Disabling LTTng-UST per-user tracing.
 (in setup_local_apps() at lttng-ust-comm.c:305)
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936163 7ffb41d32900 -1
 journal FileJournal::_open: disabling aio for non-block journal.  Use
 journal_force_aio to force use of aio anyway
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936221 7ffb41d32900 -1
 journal check: ondisk fsid ----
 doesn't match expected 6a26ef1f-6ece-4383-8304-7a8d064ef2b4, invalid
 (someone else's?) journal
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936275 7ffb41d32900 -1
 filestore(/var/lib/ceph/tmp/mnt.vt_waK) mkjournal error creating
 journal on /var/lib/ceph/tmp/mnt.vt_waK/journal: (22) Invalid argument
 [ceph-bkp-osd01][WARNIN] 2014-10-31 17:00:10.936310 7ffb41d32900 -1
 OSD::mkfs: ObjectStore::mkfs failed with 

Re: [ceph-users] Replacing a disk: Best practices?

2014-10-16 Thread Udo Lembke
On 15.10.2014 22:08, Iban Cabrillo wrote:
 HI Cephers,

  I have another question related to this issue: what would be the
 procedure to recover from a server failure (a whole server, for example
 due to motherboard trouble, with no damage to the disks)?

 Regards, I
 
Hi,
- change the server board.
- perhaps adapt /etc/udev/rules.d/70-persistent-net.rules (to get the
same device names (eth0/1...) for your network).
- boot and wait for the resync.

To avoid too much traffic I set noout if a whole server is lost.
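
For reference, that is simply:

ceph osd set noout      # while the server is down
ceph osd unset noout    # once it is back and resynced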


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [PG] Slow request *** seconds old,v4 currently waiting for pg to exist locally

2014-09-25 Thread Udo Lembke
Hi,
it looks like some OSDs are down?!

What is the output of ceph osd tree?

Udo

On 25.09.2014 04:29, Aegeaner wrote:
 The cluster healthy state is WARN:
 
  health HEALTH_WARN 118 pgs degraded; 8 pgs down; 59 pgs
 incomplete; 28 pgs peering; 292 pgs stale; 87 pgs stuck inactive;
 292 pgs stuck stale; 205 pgs stuck unclean; 22 requests are blocked
  > 32 sec; recovery 12474/46357 objects degraded (26.909%)
  monmap e3: 3 mons at
 
 {CVM-0-mon01=172.18.117.146:6789/0,CVM-0-mon02=172.18.117.152:6789/0,CVM-0-mon03=172.18.117.153:6789/0},
 election epoch 24, quorum 0,1,2 CVM-0-mon01,CVM-0-mon02,CVM-0-mon03
  osdmap e421: 9 osds: 9 up, 9 in
   pgmap v2261: 292 pgs, 4 pools, 91532 MB data, 23178 objects
 330 MB used, 3363 GB / 3363 GB avail
 12474/46357 objects degraded (26.909%)
   20 stale+peering
   87 stale+active+clean
8 stale+down+peering
   59 stale+incomplete
  118 stale+active+degraded
 
 
 What does these errors mean? Can these PGs be recovered?
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [PG] Slow request *** seconds old,v4 currently waiting for pg to exist locally

2014-09-25 Thread Udo Lembke
Hi again,
sorry - disregard my previous post... see

osdmap e421: 9 osds: 9 up, 9 in

which shows that all your 9 OSDs are up!

Do you have trouble with your journal/filesystem?
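
To dig further you could also query one of the stuck PGs directly, for
example:

ceph health detail | grep -E 'incomplete|down'
ceph pg <pgid> query    # use one of the stuck PG ids from ceph health detail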

Udo

On 25.09.2014 08:01, Udo Lembke wrote:
 Hi,
 it looks like some OSDs are down?!
 
 What is the output of ceph osd tree?
 
 Udo
 
 On 25.09.2014 04:29, Aegeaner wrote:
 The cluster healthy state is WARN:

  health HEALTH_WARN 118 pgs degraded; 8 pgs down; 59 pgs
 incomplete; 28 pgs peering; 292 pgs stale; 87 pgs stuck inactive;
 292 pgs stuck stale; 205 pgs stuck unclean; 22 requests are blocked
  > 32 sec; recovery 12474/46357 objects degraded (26.909%)
  monmap e3: 3 mons at
 
 {CVM-0-mon01=172.18.117.146:6789/0,CVM-0-mon02=172.18.117.152:6789/0,CVM-0-mon03=172.18.117.153:6789/0},
 election epoch 24, quorum 0,1,2 CVM-0-mon01,CVM-0-mon02,CVM-0-mon03
  osdmap e421: 9 osds: 9 up, 9 in
   pgmap v2261: 292 pgs, 4 pools, 91532 MB data, 23178 objects
 330 MB used, 3363 GB / 3363 GB avail
 12474/46357 objects degraded (26.909%)
   20 stale+peering
   87 stale+active+clean
8 stale+down+peering
   59 stale+incomplete
  118 stale+active+degraded


 What does these errors mean? Can these PGs be recovered?


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Newbie Ceph Design Questions

2014-09-22 Thread Udo Lembke
Hi Christian,

On 22.09.2014 05:36, Christian Balzer wrote:
 Hello,

 On Sun, 21 Sep 2014 21:00:48 +0200 Udo Lembke wrote:

 Hi Christian,

 On 21.09.2014 07:18, Christian Balzer wrote:
 ...
 Personally I found ext4 to be faster than XFS in nearly all use cases
 and the lack of full, real kernel integration of ZFS is something that
 doesn't appeal to me either.
 a little bit OT... what kind of ext4-mount options do you use?
 I have an 5-node cluster with xfs (60 osds), and perhaps the performance
 with ext4 would be better?!
 Hard to tell w/o testing your particular load, I/O patterns.

 When benchmarking directly with single disks or RAIDs it is fairly
 straightforward to see:
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028540.html

 Also note that the actual question has never been answered by the Ceph
 team, which is a shame as I venture that it would make things faster.
do you run your cluster without filestore_xattr_use_omap = true, or
with it (to be on the safe side), due to the missing answer??
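
(I.e. the question is whether you keep something like this in ceph.conf
for the ext4-backed OSDs:)

[osd]
filestore xattr use omap = true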

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Newbie Ceph Design Questions

2014-09-21 Thread Udo Lembke
Hi Christian,

On 21.09.2014 07:18, Christian Balzer wrote:
 ...
 Personally I found ext4 to be faster than XFS in nearly all use cases and
 the lack of full, real kernel integration of ZFS is something that doesn't
 appeal to me either.
a little bit OT... what kind of ext4 mount options do you use?
I have a 5-node cluster with XFS (60 OSDs), and perhaps the performance
with ext4 would be better?!
For XFS I use osd_mount_options_xfs =
rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M

regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] kvm guest with rbd-disks are unaccesible after app. 3h afterwards one OSD node fails

2014-09-01 Thread Udo Lembke
Hi list,
on the weekend one of five OSD nodes failed (hung with a kernel panic).
The cluster degraded (12 of 60 OSDs down), but our monitoring host sets
the noout flag in this case.

But around three hours later the KVM guests which use storage on the
ceph cluster (and do writes) became unaccessible. After restarting the
failed ceph node the ceph cluster was healthy again, but the VMs needed
to be restarted to work again.

In the ceph.conf I had defined osd_pool_default_min_size = 1,
therefore I don't understand why this happens.
Which parameter must be changed/set so that the KVM clients keep
working on the unhealthy cluster?

Ceph version is 0.72.2 - pool replication 2.


Thanks for a hint.

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-26 Thread Udo Lembke
Hi,
I don't see an improvement with tcp_window_scaling=0 with my configuration.
Rather the other way around: the iperf performance is much lower:

root@ceph-03:~# iperf -c 172.20.2.14

Client connecting to 172.20.2.14, TCP port 5001
TCP window size: 96.1 KByte (default)

[  3] local 172.20.2.13 port 50429 connected with 172.20.2.14 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  2.94 GBytes  2.52 Gbits/sec
root@ceph-03:~# sysctl -w net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_window_scaling = 1
root@ceph-03:~# iperf -c 172.20.2.14

Client connecting to 172.20.2.14, TCP port 5001
TCP window size:  192 KByte (default)

[  3] local 172.20.2.13 port 50431 connected with 172.20.2.14 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  11.4 GBytes  9.77 Gbits/sec

My kernels are 3.11 and 3.14, and the VM host has a patched RHEL kernel
2.6.32 - the iperf behaviour is the same across all kernels.

I switched back to net.ipv4.tcp_window_scaling=1.


Udo

On 24.07.2014 22:15, Jean-Tiare LE BIGOT wrote:
 What is your kernel version? On kernel >= 3.11 sysctl -w
 net.ipv4.tcp_window_scaling=0 seems to improve the situation a lot.
 It also helped a lot to mitigate processes going (and sticking) in 'D'
 state.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-24 Thread Udo Lembke
Hi Steve,
I'm also looking for improvements of single-thread-reads.

A little bit higher values (twice?) should be possible with your config.
I have 5 nodes with 60 4TB HDDs and got the following:
rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
Total time run:60.066934
Total reads made: 863
Read size:4194304
Bandwidth (MB/sec):57.469
Average Latency:   0.0695964
Max latency:   0.434677
Min latency:   0.016444

In my case I had some OSDs (XFS) with high fragmentation (20%).
Changing the mount options and defragmenting helped slightly.
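
Checking and defragmenting is basically (device/mountpoint are examples):

xfs_db -c frag -r /dev/sdd1          # report the fragmentation factor
xfs_fsr /var/lib/ceph/osd/ceph-12    # defragment the mounted filesystem
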
Performance changes:
[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
osd_op_threads = 4
osd_disk_threads = 4


But I expect much more speed for an single thread...

Udo

On 23.07.2014 22:13, Steve Anthony wrote:
 Ah, ok. That makes sense. With one concurrent operation I see numbers
 more in line with the read speeds I'm seeing from the filesystems on the
 rbd images.

 # rados -p bench bench 300 seq --no-cleanup -t 1
 Total time run:300.114589
 Total reads made: 2795
 Read size:4194304
 Bandwidth (MB/sec):37.252

 Average Latency:   0.10737
 Max latency:   0.968115
 Min latency:   0.039754

 # rados -p bench bench 300 rand --no-cleanup -t 1
 Total time run:300.164208
 Total reads made: 2996
 Read size:4194304
 Bandwidth (MB/sec):39.925

 Average Latency:   0.100183
 Max latency:   1.04772
 Min latency:   0.039584

 I really wish I could find my data on read speeds from a couple weeks
 ago. It's possible that they've always been in this range, but I
 remember one of my test users saturating his 1GbE link over NFS reading
 copying from the rbd client to his workstation. Of course, it's also
 possible that the data set he was using was cached in RAM when he was
 testing, masking the lower rbd speeds.

 It just seems counterintuitive to me that read speeds would be so much
 slower than writes at the filesystem layer in practice. With images in
 the 10-100TB range, reading data at 20-60MB/s isn't going to be
 pleasant. Can you suggest any tunables or other approaches to
 investigate to improve these speeds, or are they in line with what you'd
 expect? Thanks for your help!

 -Steve



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-24 Thread Udo Lembke
Hi again,
Forgot to say - I'm still on 0.72.2!

Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

2014-07-14 Thread Udo Lembke
Hi,
which values are actually changed by ceph osd crush tunables optimal?

Is it perhaps possible to change some parameters on the weekends before
the upgrade runs, to gain more time?
(Depends on whether the parameters are available in 0.72...)

The warning says it can take days... we have a cluster with 5 storage
nodes and 12 4TB OSD disks each (60 OSDs), replica 2. The cluster is 60%
filled.
Network connection is 10Gb.
Does tunables optimal take one, two or more days in such a configuration?

Udo

On 14.07.2014 18:18, Sage Weil wrote:
 I've added some additional notes/warnings to the upgrade and release 
 notes:

  https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451

 If there is somewhere else where you think a warning flag would be useful, 
 let me know!

 Generally speaking, we want to be able to cope with huge data rebalances 
 without interrupting service.  It's an ongoing process of improving the 
 recovery vs client prioritization, though, and removing sources of 
 overhead related to rebalancing... and it's clearly not perfect yet. :/

 sage




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Generic Tuning parameters?

2014-06-28 Thread Udo Lembke
Hi Erich,
I'm also searching for improvements.
You should use the right mount options to prevent fragmentation (for XFS).

[osd]
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
osd_op_threads = 4
osd_disk_threads = 4

With 45 OSDs per node you need a powerful system... AFAIK 12 OSDs/node is
recommended.



You should think about what happens if one node dies... I use a
monitoring script which does a "ceph osd set noout" if more than N OSDs
are down.
Then I must decide whether it's faster to get the failed node back, or to
do a rebuild (normally the first choice).
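
A minimal sketch of that script (threshold, locking and error handling
are up to you):

#!/bin/sh
# count OSDs reported as down and set noout above a threshold
DOWN=$(ceph osd tree | grep -c ' down ')
if [ "$DOWN" -ge 3 ]; then
    ceph osd set noout
fi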

Udo

On 27.06.2014 20:00, Erich Weiler wrote:
 Hi Folks,

 We're going to spin up a ceph cluster with the following general specs:

 * Six 10Gb/s connected servers, each with 45 4TB disks in a JBOD

 * Each disk is an OSD, so 45 OSDs per server

 * So 45*6 = 270 OSDs total

 * Three separate, dedicated monitor nodes

 The files stored on this storage cluster will be large file, each file
 will be several GB in size at the minimum, with some files being over
 100GB.

 Generically, are there any tuning parameters out there that would be
 good to drop in for this hardware profile and file size?

 We plan on growing this filesystem as we go, to 10 servers, then 15,
 then 20, etc.

 Thanks a bunch for any hints!!

 cheers,
 erich
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve performance of ceph objcect storage cluster

2014-06-26 Thread Udo Lembke
Hi,

On 25.06.2014 16:48, Aronesty, Erik wrote:
 I'm assuming you're testing the speed of cephfs (the file system) and not 
 ceph object storage.

For my part I mean object storage (VM disks via rbd).

Udo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

