Re: [ceph-users] What a maximum theoretical and practical capacity in ceph cluster?

2014-10-27 Thread Christian Balzer
On Mon, 27 Oct 2014 19:30:23 +0400 Mike wrote:

> Hello,
> My company is planning to build a big Ceph cluster for archiving and
> storing data.
> By the customer's requirements - 70% of capacity is SATA, 30% SSD.
> On the first day data is stored on SSD storage, on the next day it is moved
> to SATA storage.
> 
Lots of data movement. Is the design to store data on SSDs for the first
day done to assure fast writes from the clients?
Knowing the reason for this requirement would really help to find a
potentially more appropriate solution.

> For now we have decided to use a SuperMicro SKU with 72 bays for HDD = 22
> SSD + 50 SATA drives.
I suppose you're talking about these:
http://www.supermicro.com.tw/products/system/4U/6047/SSG-6047R-E1R72L2K.cfm

Which is the worst thing that ever came out from Supermicro, IMNSHO.

Have you actually read the documentation and/or talked to a Supermicro
representative?

Firstly and most importantly, if you have to replace a failed disk the
other one on the same dual disk tray will also get disconnected from the
system. That's why they require you to run RAID all the time so pulling a
tray doesn't destroy your data. 
But even then, the other affected RAID will of course have to rebuild
itself once you re-insert the tray. 
And you can substitute RAID with OSD, doubling the impact of a failed disk
on your cluster.
The fact that they make you buy the complete system with IT mode
controllers also means that if you would want to do something like RAID6,
you'd be forced to do it in software.

Secondly, CPU requirements.
A purely HDD based OSD (journal on the same HDD) requires about 1GHz of
CPU power. So to make sure the CPU isn't your bottleneck, you'd need about
3000USD worth of CPUs (2x 10core 2.6GHz) but that's ignoring your SSDs.
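
As a quick back-of-the-envelope check (a rough sketch, assuming ~1GHz per
HDD based OSD and ignoring the SSDs entirely):

  hdd_osds=50; ghz_per_osd=1
  echo "needed: $((hdd_osds * ghz_per_osd)) GHz vs 2x 10core at 2.6GHz = 52 GHz"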

To get even remotely close to utilizing the potential speed of SSDs you don't
want more than 10-12 SSD based OSDs per node, and you want to give that server
the highest total CPU GHz you can afford.

Look at the "anti-cephalod question" thread in the ML archives for a
discussion of dense servers and all the recent threads about SSD
performance.

Lastly, even just the 50 HDD based OSDs will saturate a 10GbE link, never
mind the 22 SSDs. Building a Ceph cluster is a careful balancing act
between storage, network speeds and CPU requirements while also taking
density and budget into consideration.

> Our racks can hold 10 of these servers, and with 50 such racks in the Ceph
> cluster = 36000 OSDs,
Others have already pointed out that this number can have undesirable
effects, but see more below.

> With 4TB SATA drives, replica = 2 and nearfull ratio = 0.8 we have 40
> petabytes of useful capacity.
> 
A replica of 2 with a purely SSD based pool can work, if you constantly
monitor those SSDs for wear level and replace them early before they fail.
Deploying those SSDs staggered would be a good idea, to avoid having them
all need replacement at the same time. A sufficiently fast network to
replicate the data in a very short period is also a must.
But with your deployment goal of 11000(!) SSDs all in the same pool the
statistics are stacked against you. I'm sure somebody more versed than me
in these matters can run the exact numbers (which SSDs are you planning to
use?), but I'd be terrified.
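
(By wear level I mean the SMART wear indicator, e.g. something along the lines
of

  smartctl -A /dev/sdX | grep -i -e wearout -e wear_leveling

fed into whatever monitoring you already run; the attribute name differs per
vendor.)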

And with 25000 HDDs a replication factor of 2 is GUARANTEED to make you
lose data, probably a lot earlier in the life of your cluster than you
think. You'll be replacing several disks per day on average.

If there is no real need for SSDs, build your cluster with a simple 4U 24
drive server, put a fast RAID card (I like ARECA) in it and create two 11
disk RAID6 sets with 2 global spares, thus 2 OSDs.
Add an NVMe like the DC P3700 400GB for journals and OS; this will limit
one node to 1GB/s writes, which in turn is a nice match for a 10GbE
network.
The combination of RAID card and NVMe (or 2 fast SSDs) will make this a
pretty snappy/speedy beast and as a bonus you'll likely never have to deal
with a failed OSD, just easily replaced failed disks in a RAID.
This will also drive your OSD count for HDDs from 25000 to about 2800+. 
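
A quick sanity check of that figure, using the numbers from above (100PB of
raw capacity to replace, and one 11 disk RAID6 OSD holding 9 data disks of
4TB each):

  echo $(( 25000 * 4 / (9 * 4) ))   # ~2777 OSDs, i.e. the 2800+ above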

If you need more dense storage, look at something like this
http://www.45drives.com/ (there are other, similar products). 
With this particular case I'd again put RAID controllers and a (fast,
2GB/s) NVMe (or 2 slower ones) in it, for four 10 disk RAID6 sets with 5
spares.
Given the speed of the storage you will want a 2x10GbE bonded or Infiniband
link.

And if you really need SSD backed pools, but don't want to risk data
loss, get a 2U case with 24 2.5" hotswap bays and run 2 RAID5s (2x 12port
RAID cards). Add some fast CPUs (but you can get away with much less than
what you would need with 24 distinct OSDs) and you're gold. 
This will nicely reduce your SSD OSD count from 11000 to something in the
1000+ range AND allow for a low risk deployment with a replica size of 2.
And while not giving you as much performance as individual OSDs, it will
definitely be faster than your original design.

Re: [ceph-users] All SSD storage and journals

2014-10-27 Thread Christian Balzer
On Mon, 27 Oct 2014 15:13:30 +0100 Sebastien Han wrote:

> There were also some investigations around F2FS
> (https://www.kernel.org/doc/Documentation/filesystems/f2fs.txt); the
> last time I tried to install an OSD dir under f2fs it failed. I tried to
> run the OSD on f2fs, however ceph-osd mkfs got stuck on an xattr test:
> 
> fremovexattr(10, "user.test@5848273")   = 0
> 
> Maybe someone from the core dev has an update on this?
> 
That looks interesting, but wouldn't Ceph also need to be told that this
can be used journal-less like BTRFS?

Along those lines, I would love to hear the current status of ZFS with
Ceph. From where I'm standing BTRFS just isn't getting there, while ZFS
would give us checksummed storage, the possibility to forgo journals and
also compression.
The last bit being of interest to me, as I'm trying to position Ceph
against SolidFire here.

Christian

> > On 24 Oct 2014, at 07:58, Christian Balzer  wrote:
> > 
> > 
> > Hello,
> > 
> > as others have reported in the past and now having tested things here
> > myself, there really is no point in having journals for SSD backed
> > OSDs on other SSDs.
> > 
> > It is a zero sum game, because:
> > a) using that journal SSD as another OSD with integrated journal will
> > yield the same overall result performance wise, if all SSDs are the
> > same. In addition its capacity will be made available for actual
> > storage. b) if the journal SSD is faster than the OSD SSDs it tends to
> > be priced accordingly. For example the DC P3700 400GB is about twice
> > as fast (write) and expensive as the DC S3700 400GB.
> > 
> > Things _may_ be different if one doesn't look at bandwidth but IOPS
> > (though certainly not in the near future in regard to Ceph actually
> > getting SSDs busy), but even there the difference is negligible when
> > for example comparing the Intel S and P models in write performance.
> > Reads are another thing, but nobody cares about those in journals. ^o^
> > 
> > Obvious things that come to mind in this context would be the ability
> > to disable journals (difficult, I know, not touching BTRFS, thank you)
> > and probably K/V store in the future.
> > 
> > Regards,
> > 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> Cheers.
>  
> Sébastien Han 
> Cloud Architect 
> 
> "Always give 100%. Unless you're giving blood."
> 
> Phone: +33 (0)1 49 70 99 72 
> Mail: sebastien@enovance.com 
> Address : 11 bis, rue Roquépine - 75008 Paris
> Web : www.enovance.com - Twitter : @enovance 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-27 Thread Sage Weil
On Tue, 28 Oct 2014, Chen, Xiaoxi wrote:
> 
> Hi Chris,
> 
>  I am not an expert on LIO, but from your results it seems RBD/Ceph
> works well (RBD on local system, no iSCSI) and LIO works well (ramdisk (no
> RBD) -> LIO target), and if you change LIO to use another interface (file,
> loopback) to talk to RBD, it also works well.
> 
>   So it seems the issue is in the LIO RBD driver? Maybe it needs some
> tuning, or it is just not optimized enough yet.

My guess is that when you use the loopback driver it stops blocking on 
flush/sync?

Copying Mike and ceph-devel..

sage


> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christopher Spearman
> Sent: Tuesday, October 28, 2014 5:24 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Poor RBD performance as LIO iSCSI target
> 
>  
> 
> I've noticed a pretty steep performance degradation when using RBDs with
> LIO. I've tried a multitude of configurations to see if there are any
> changes in performance and I've only found a few that work (sort of).
> 
> Details about the systems being used:
> 
>  - All network hardware for data is 10gbe, there is some management on 1gbe,
> but I can assure that it isn't being used (perf & bwm-ng shows this)
>  - Ceph version 0.80.5
>  - 20GB RBD (for our test, prod will be much larger, the size doesn't seem
> to matter tho)
>  - LIO version 4.1.0, RisingTide
>  - Initiator is another linux system (However I've used ESXi as well with no
> difference)
>  - We have 8 OSD nodes, each with 8 2TB OSDs, 64 OSDs total
>    * 4 nodes are in one rack 4 in another, crush maps have been configured
> with this as well
>    * All OSD nodes are running Centos 6.5
>  - 2 Gateway nodes on HP Proliant blades (but I've only been using one for
> testing, however the problem does exist on both)
>    * All gateway nodes are running Centos 7
> 
> I've tested a multitude of things, mainly to see where the issue lies.
> 
>  - The performance of the RBD as a target using LIO
>  - The performance of the RBD itself (no iSCSI or LIO)
>  - LIO performance by using a ramdisk as a target (no RBD involved)
>  - Setting the RBD up with LVM, then using a logical volume from that as a
> target with LIO
>  - Setting the RBD up in RAID0 & RAID1 (single disk, using mdadm), then
> using that volume as a target with LIO
>  - Mounting the RBD as ext4, then using a disk image and fileio as a target
>  - Mounting the RBD as ext4, then using a disk image as a loop device and
> blockio as a target
>  - Setting the RBD up as a loop device, then setting that up as a target
> with LIO
> 
>  - What tested with bad performance (Reads ~25-50MB/s - Writes ~25-50MB/s)
>    * RBD setup as target using LIO
>    * RBD -> LVM -> LIO target
>    * RBD -> RAID0/1 -> LIO target
>  - What tested with good performance (Reads ~700-800MB/s - Writes
> ~400-700MB/s)
>    * RBD on local system, no iSCSI
>    * Ramdisk (No RBD) -> LIO target
>    * RBD -> Mounted ext4 -> disk image -> LIO fileio target
>    * RBD -> Mounted ext4 -> disk image -> loop device -> LIO blockio target
>    * RBD -> loop device -> LIO target
> 
> I'm just curious if anybody else has experienced these issues or has any
> idea what's going on or has any suggestions on fixing this. I know using
> loop devices sounds like a solution, but we hit a brick wall with the fact
> loop devices are single threaded. The intent is to use this with VMWare ESXi
> with the 2 gateways setup as a path to the target block devices. I'm not
> opposed to using something somewhat kludgy, provided we can still use
> multipath iSCSI within VMWare
> 
> Thanks for any help anyone can provide!
> 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-27 Thread Chen, Xiaoxi
Hi Chris,
 I am not an expert on LIO, but from your results it seems RBD/Ceph works
well (RBD on local system, no iSCSI) and LIO works well (ramdisk (no RBD) -> LIO
target), and if you change LIO to use another interface (file, loopback) to
talk to RBD, it also works well.
  So it seems the issue is in the LIO RBD driver? Maybe it needs some
tuning, or it is just not optimized enough yet.

Xiaoxi


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christopher Spearman
Sent: Tuesday, October 28, 2014 5:24 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Poor RBD performance as LIO iSCSI target

I've noticed a pretty steep performance degradation when using RBDs with LIO. 
I've tried a multitude of configurations to see if there are any changes in 
performance and I've only found a few that work (sort of).

Details about the systems being used:

 - All network hardware for data is 10gbe, there is some management on 1gbe, 
but I can assure that it isn't being used (perf & bwm-ng shows this)
 - Ceph version 0.80.5
 - 20GB RBD (for our test, prod will be much larger, the size doesn't seem to 
matter tho)
 - LIO version 4.1.0, RisingTide
 - Initiator is another linux system (However I've used ESXi as well with no 
difference)
 - We have 8 OSD nodes, each with 8 2TB OSDs, 64 OSDs total
   * 4 nodes are in one rack 4 in another, crush maps have been configured with 
this as well
   * All OSD nodes are running Centos 6.5
 - 2 Gateway nodes on HP Proliant blades (but I've only been using one for 
testing, however the problem does exist on both)
   * All gateway nodes are running Centos 7

I've tested a multitude of things, mainly to see where the issue lies.

 - The performance of the RBD as a target using LIO
 - The performance of the RBD itself (no iSCSI or LIO)
 - LIO performance by using a ramdisk as a target (no RBD involved)
 - Setting the RBD up with LVM, then using a logical volume from that as a 
target with LIO
 - Setting the RBD up in RAID0 & RAID1 (single disk, using mdadm), then using 
that volume as a target with LIO
 - Mounting the RBD as ext4, then using a disk image and fileio as a target
 - Mounting the RBD as ext4, then using a disk image as a loop device and 
blockio as a target
 - Setting the RBD up as a loop device, then setting that up as a target with 
LIO

 - What tested with bad performance (Reads ~25-50MB/s - Writes ~25-50MB/s)
   * RBD setup as target using LIO
   * RBD -> LVM -> LIO target
   * RBD -> RAID0/1 -> LIO target
 - What tested with good performance (Reads ~700-800MB/s - Writes ~400-700MB/s)
   * RBD on local system, no iSCSI
   * Ramdisk (No RBD) -> LIO target
   * RBD -> Mounted ext4 -> disk image -> LIO fileio target
   * RBD -> Mounted ext4 -> disk image -> loop device -> LIO blockio target
   * RBD -> loop device -> LIO target

I'm just curious if anybody else has experienced these issues or has any idea 
what's going on or has any suggestions on fixing this. I know using loop 
devices sounds like a solution, but we hit a brick wall with the fact loop 
devices are single threaded. The intent is to use this with VMWare ESXi with 
the 2 gateways setup as a path to the target block devices. I'm not opposed to 
using something somewhat kludgy, provided we can still use multipath iSCSI 
within VMWare

Thanks for any help anyone can provide!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-27 Thread Christopher Spearman
Hi Nick,

Thanks for the response, I'm glad to hear you've got something that
provides reasonable performance, that brings some hope to my situation.

I am using the kernel RBD client.

Using a different OS for the gateway/iSCSI nodes was going to be my next
step. Especially now, seeing that you have a working implementation using
Ubuntu.

I'll roll out an Ubuntu VM, upgrade the kernel like you did, then report
back here with my findings.

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-27 Thread Nick Fisk
Hi Chris,

I'm doing something very similar to you, though only at a very early stage of
testing, but I don't seem to see the same problem that you are experiencing.

My setup is as follows:-

1x HP DL360 Server Running ESX 5.1
8x 10K SAS 146GB drives each configured as a RAID 0 and with a separate ESX 
Datastore on them
4 Virtual Ceph VM's with 2 OSD's each residing on above datastores
Ceph VM's running Ubuntu 14.04 with manually upgraded kernel to 3.16
2x Replicated pool over all OSD's with a test 50GB RBD
The first 3 VM's also function as a Pacemaker cluster which exports a HA LIO 
iSCSI target

ESX server then remounts this iSCSI target as a new datastore
Windows 2008 VM running IOmeter sits on this datastore

I realise this is a bit of a mess, but it has allowed me to do some basic 
testing and understand the theory behind gluing Ceph, LIO and pacemaker 
together. The next stage will be to try and source some more production like 
hardware to do a more lifelike test.

But back to the original point, from this limited hardware I am seeing 
200-400MB/s 256kb reads in IOMeter, which is probably about right for the 
number of disks.

I'm afraid I can't offer much advice at this stage as I'm still fairly early on 
in understanding Ceph/LIO, but maybe the fact that my performance is not
limited suggests that, in some combination, what we are both trying to achieve is
possible. I assume you are using the kernel RBD client? Only other things I can 
think to try from basic fault finding experience, would be trying the same 
Kernel version and/or Ubuntu in case there are some minor differences which 
cause this problem. Of course it maybe that the nested nature of my setup is 
masking this problem.

If you need any configs from my test cluster to compare or any specific tests 
runs, please let me know.

Nick


Nick Fisk
Technical Support Engineer

System Professional Ltd
tel: 01825 83
mob: 07711377522
fax: 01825 830001
mail: nick.f...@sys-pro.co.uk
web: www.sys-pro.co.uk

IT SUPPORT SERVICES | VIRTUALISATION | STORAGE | BACKUP AND DR | IT CONSULTING

Registered Office:
Wilderness Barns, Wilderness Lane, Hadlow Down, East Sussex, TN22 4HU
Registered in England and Wales.
Company Number: 04754200


Confidentiality: This e-mail and its attachments are intended for the above 
named only and may be confidential. If they have come to you in error you must 
take no action based on them, nor must you copy or show them to anyone; please 
reply to this e-mail and highlight the error.

Security Warning: Please note that this e-mail has been created in the 
knowledge that Internet e-mail is not a 100% secure communications medium. We 
advise that you understand and observe this lack of security when e-mailing us.

Viruses: Although we have taken steps to ensure that this e-mail and 
attachments are free from any virus, we advise that in keeping with good 
computing practice the recipient should ensure they are actually virus free. 
Any views expressed in this e-mail message are those of the individual and not 
necessarily those of the company or any of its subsidiaries.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Poor RBD performance as LIO iSCSI target

2014-10-27 Thread Christopher Spearman
I've noticed a pretty steep performance degradation when using RBDs with
LIO. I've tried a multitude of configurations to see if there are any
changes in performance and I've only found a few that work (sort of).

Details about the systems being used:

 - All network hardware for data is 10gbe, there is some management on
1gbe, but I can assure that it isn't being used (perf & bwm-ng shows this)
 - Ceph version 0.80.5
 - 20GB RBD (for our test, prod will be much larger, the size doesn't seem
to matter tho)
 - LIO version 4.1.0, RisingTide
 - Initiator is another linux system (However I've used ESXi as well with
no difference)
 - We have 8 OSD nodes, each with 8 2TB OSDs, 64 OSDs total
   * 4 nodes are in one rack 4 in another, crush maps have been configured
with this as well
   * All OSD nodes are running Centos 6.5
 - 2 Gateway nodes on HP Proliant blades (but I've only been using one for
testing, however the problem does exist on both)
   * All gateway nodes are running Centos 7

I've tested a multitude of things, mainly to see where the issue lies.

 - The performance of the RBD as a target using LIO
 - The performance of the RBD itself (no iSCSI or LIO)
 - LIO performance by using a ramdisk as a target (no RBD involved)
 - Setting the RBD up with LVM, then using a logical volume from that as a
target with LIO
 - Setting the RBD up in RAID0 & RAID1 (single disk, using mdadm), then
using that volume as a target with LIO
 - Mounting the RBD as ext4, then using a disk image and fileio as a target
 - Mounting the RBD as ext4, then using a disk image as a loop device and
blockio as a target
 - Setting the RBD up as a loop device, then setting that up as a target
with LIO

 - What tested with bad performance (Reads ~25-50MB/s - Writes ~25-50MB/s)
   * RBD setup as target using LIO
   * RBD -> LVM -> LIO target
   * RBD -> RAID0/1 -> LIO target
 - What tested with good performance (Reads ~700-800MB/s - Writes
~400-700MB/s)
   * RBD on local system, no iSCSI
   * Ramdisk (No RBD) -> LIO target
   * RBD -> Mounted ext4 -> disk image -> LIO fileio target
   * RBD -> Mounted ext4 -> disk image -> loop device -> LIO blockio target
   * RBD -> loop device -> LIO target
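
For reference, this is roughly how the two extremes above were set up; a
sketch from memory assuming targetcli is driving LIO, with device paths,
image name and IQN as placeholders:

  # slow case: export the mapped RBD directly as a block backstore
  targetcli /backstores/block create name=rbd0 dev=/dev/rbd0

  # fast case: ext4 on the RBD, then a disk image exported via fileio
  mkfs.ext4 /dev/rbd0 && mount /dev/rbd0 /mnt/rbd
  targetcli /backstores/fileio create name=img0 file_or_dev=/mnt/rbd/img0.img size=19G

  # either backstore is then attached to the iSCSI target as a LUN
  targetcli /iscsi create iqn.2014-10.com.example:rbd-gw
  targetcli /iscsi/iqn.2014-10.com.example:rbd-gw/tpg1/luns create /backstores/block/rbd0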

I'm just curious if anybody else has experienced these issues or has any
idea what's going on or has any suggestions on fixing this. I know using
loop devices sounds like a solution, but we hit a brick wall with the fact
loop devices are single threaded. The intent is to use this with VMWare
ESXi with the 2 gateways setup as a path to the target block devices. I'm
not opposed to using something somewhat kludgy, provided we can still use
multipath iSCSI within VMWare

Thanks for any help anyone can provide!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Steve Anthony
Oh, hey look at that. I must have screwed something up before. I thought
it was strange that it didn't work.

Works now, thanks!

-Steve

On 10/27/2014 03:20 PM, Scott Laird wrote:
> Double-check that you did it right.  Does 'ls -lL
> /var/lib/ceph/osd/ceph-33/journal' resolve to a block-special device?
>
> On Mon Oct 27 2014 at 12:12:20 PM Steve Anthony  > wrote:
>
> Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using
> /dev/disk/by-id for future ODSs.
>
> I tried stopping an existing OSD on another node (which is working
> - osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal
> to point to the same partition using /dev/disk/by-id, and starting
> the OSD again, but it fails to start with:
>
> 2014-10-27 11:03:31.607060 7fa65018e780 -1
> filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal
> /var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory
> 2014-10-27 11:03:31.617262 7fa65018e780 -1  ** ERROR: error
> converting store /var/lib/ceph/osd/ceph-33: (2) No such file or
> directory
>
> The journal symlink exists and points to the same partition as
> before when it was /dev/sde1. Can I not change these existing
> symlinks manually to point to the same partition using
> /dev/disk/by-id?
>
>
> -Steve
>
>
> On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote:
> > * /dev/disk/by-id
> >
> > by-path will change if you connect it to different controller, or
> > replace your controller with other model, or put it in different pci
> > slot
> >
> > On Sat, 25 Oct 2014 17:20:58 +, Scott Laird
>  
> > wrote:
> >
> >> You'd be best off using /dev/disk/by-path/ or similar links;
> that way they
> >> follow the disks if they're renamed again.
> >>
> >> On Fri, Oct 24, 2014, 9:40 PM Steve Anthony 
>  wrote:
> >>
> >>> Hello,
> >>>
> >>> I was having problems with a node in my cluster (Ceph
> v0.80.7/Debian
> >>> Wheezy/Kernel 3.12), so I rebooted it and the disks were
> relabled when
> >>> it came back up. Now all the symlinks to the journals are
> broken. The
> >>> SSDs are now sda, sdb, and sdc but the journals were sdc, sdd,
> and sde:
> >>>
> >>> root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
> >>> lrwxrwxrwx 1 root root 9 Oct 20 16:47
> /var/lib/ceph/osd/ceph-150/journal
> >>> -> /dev/sde1
> >>> lrwxrwxrwx 1 root root 9 Oct 20 16:53
> /var/lib/ceph/osd/ceph-157/journal
> >>> -> /dev/sdd1
> >>> lrwxrwxrwx 1 root root 9 Oct 21 08:31
> /var/lib/ceph/osd/ceph-164/journal
> >>> -> /dev/sdc1
> >>> lrwxrwxrwx 1 root root 9 Oct 21 16:33
> /var/lib/ceph/osd/ceph-171/journal
> >>> -> /dev/sde2
> >>> lrwxrwxrwx 1 root root 9 Oct 22 10:50
> /var/lib/ceph/osd/ceph-178/journal
> >>> -> /dev/sdc2
> >>> lrwxrwxrwx 1 root root 9 Oct 22 15:48
> /var/lib/ceph/osd/ceph-184/journal
> >>> -> /dev/sdd2
> >>> lrwxrwxrwx 1 root root 9 Oct 23 10:46
> /var/lib/ceph/osd/ceph-191/journal
> >>> -> /dev/sde3
> >>> lrwxrwxrwx 1 root root 9 Oct 23 15:22
> /var/lib/ceph/osd/ceph-195/journal
> >>> -> /dev/sdc3
> >>> lrwxrwxrwx 1 root root 9 Oct 23 16:59
> /var/lib/ceph/osd/ceph-201/journal
> >>> -> /dev/sdd3
> >>> lrwxrwxrwx 1 root root 9 Oct 24 21:32
> /var/lib/ceph/osd/ceph-214/journal
> >>> -> /dev/sde4
> >>> lrwxrwxrwx 1 root root 9 Oct 24 21:33
> /var/lib/ceph/osd/ceph-215/journal
> >>> -> /dev/sdd4
> >>>
> >>> Any way to fix this without just removing all the OSDs and
> re-adding
> >>> them? I thought about recreating the symlinks to point at the
> new SSD
> >>> labels, but I figured I'd check here first. Thanks!
> >>>
> >>> -Steve
> >>>
> >>> --
> >>> Steve Anthony
> >>> LTS HPC Support Specialist
> >>> Lehigh University
> >>> sma...@lehigh.edu 
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com 
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >
> >
> >
>
> -- 
> Steve Anthony
> LTS HPC Support Specialist
> Lehigh University
> sma...@lehigh.edu 
>

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Scott Laird
Double-check that you did it right.  Does 'ls -lL
/var/lib/ceph/osd/ceph-33/journal' resolve to a block-special device?
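
For what it's worth, a minimal sketch of the whole operation (the by-id name
below is a placeholder for the real SSD partition):

  service ceph stop osd.33
  ln -sfn /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part1 /var/lib/ceph/osd/ceph-33/journal
  ls -lL /var/lib/ceph/osd/ceph-33/journal   # should show a block special device ('b...')
  service ceph start osd.33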

On Mon Oct 27 2014 at 12:12:20 PM Steve Anthony  wrote:

>  Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using
> /dev/disk/by-id for future ODSs.
>
> I tried stopping an existing OSD on another node (which is working -
> osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal to point
> to the same partition using /dev/disk/by-id, and starting the OSD again,
> but it fails to start with:
>
> 2014-10-27 11:03:31.607060 7fa65018e780 -1
> filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal
> /var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory
> 2014-10-27 11:03:31.617262 7fa65018e780 -1  ** ERROR: error converting
> store /var/lib/ceph/osd/ceph-33: (2) No such file or directory
>
> The journal symlink exists and points to the same partition as before when
> it was /dev/sde1. Can I not change these existing symlinks manually to
> point to the same partition using /dev/disk/by-id?
>
>
> -Steve
>
>
> On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote:
> > * /dev/disk/by-id
> >
> > by-path will change if you connect it to different controller, or
> > replace your controller with other model, or put it in different pci
> > slot
> >
> > On Sat, 25 Oct 2014 17:20:58 +, Scott Laird 
> 
> > wrote:
> >
> >> You'd be best off using /dev/disk/by-path/ or similar links; that way
> they
> >> follow the disks if they're renamed again.
> >>
> >> On Fri, Oct 24, 2014, 9:40 PM Steve Anthony 
>  wrote:
> >>
> >>> Hello,
> >>>
> >>> I was having problems with a node in my cluster (Ceph v0.80.7/Debian
> >>> Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when
> >>> it came back up. Now all the symlinks to the journals are broken. The
> >>> SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:
> >>>
> >>> root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
> >>> lrwxrwxrwx 1 root root 9 Oct 20 16:47
> /var/lib/ceph/osd/ceph-150/journal
> >>> -> /dev/sde1
> >>> lrwxrwxrwx 1 root root 9 Oct 20 16:53
> /var/lib/ceph/osd/ceph-157/journal
> >>> -> /dev/sdd1
> >>> lrwxrwxrwx 1 root root 9 Oct 21 08:31
> /var/lib/ceph/osd/ceph-164/journal
> >>> -> /dev/sdc1
> >>> lrwxrwxrwx 1 root root 9 Oct 21 16:33
> /var/lib/ceph/osd/ceph-171/journal
> >>> -> /dev/sde2
> >>> lrwxrwxrwx 1 root root 9 Oct 22 10:50
> /var/lib/ceph/osd/ceph-178/journal
> >>> -> /dev/sdc2
> >>> lrwxrwxrwx 1 root root 9 Oct 22 15:48
> /var/lib/ceph/osd/ceph-184/journal
> >>> -> /dev/sdd2
> >>> lrwxrwxrwx 1 root root 9 Oct 23 10:46
> /var/lib/ceph/osd/ceph-191/journal
> >>> -> /dev/sde3
> >>> lrwxrwxrwx 1 root root 9 Oct 23 15:22
> /var/lib/ceph/osd/ceph-195/journal
> >>> -> /dev/sdc3
> >>> lrwxrwxrwx 1 root root 9 Oct 23 16:59
> /var/lib/ceph/osd/ceph-201/journal
> >>> -> /dev/sdd3
> >>> lrwxrwxrwx 1 root root 9 Oct 24 21:32
> /var/lib/ceph/osd/ceph-214/journal
> >>> -> /dev/sde4
> >>> lrwxrwxrwx 1 root root 9 Oct 24 21:33
> /var/lib/ceph/osd/ceph-215/journal
> >>> -> /dev/sdd4
> >>>
> >>> Any way to fix this without just removing all the OSDs and re-adding
> >>> them? I thought about recreating the symlinks to point at the new SSD
> >>> labels, but I figured I'd check here first. Thanks!
> >>>
> >>> -Steve
> >>>
> >>> --
> >>> Steve Anthony
> >>> LTS HPC Support Specialist
> >>> Lehigh University
> >>> sma...@lehigh.edu
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >
> >
> >
>
> --
> Steve Anthony
> LTS HPC Support Specialist
> Lehigh University
> sma...@lehigh.edu
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Steve Anthony
Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using
/dev/disk/by-id for future ODSs.

I tried stopping an existing OSD on another node (which is working -
osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal to
point to the same partition using /dev/disk/by-id, and starting the OSD
again, but it fails to start with:

2014-10-27 11:03:31.607060 7fa65018e780 -1
filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal
/var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory
2014-10-27 11:03:31.617262 7fa65018e780 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-33: (2) No such file or directory

The journal symlink exists and points to the same partition as before
when it was /dev/sde1. Can I not change these existing symlinks manually
to point to the same partition using /dev/disk/by-id?

-Steve

On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote:
> * /dev/disk/by-id
>
> by-path will change if you connect it to different controller, or
> replace your controller with other model, or put it in different pci
> slot
>
> On Sat, 25 Oct 2014 17:20:58 +, Scott Laird 
> wrote:
>
>> You'd be best off using /dev/disk/by-path/ or similar links; that way
they
>> follow the disks if they're renamed again.
>>
>> On Fri, Oct 24, 2014, 9:40 PM Steve Anthony  wrote:
>>
>>> Hello,
>>>
>>> I was having problems with a node in my cluster (Ceph v0.80.7/Debian
>>> Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when
>>> it came back up. Now all the symlinks to the journals are broken. The
>>> SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:
>>>
>>> root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
>>> lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal
>>> -> /dev/sde1
>>> lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal
>>> -> /dev/sdd1
>>> lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal
>>> -> /dev/sdc1
>>> lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal
>>> -> /dev/sde2
>>> lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal
>>> -> /dev/sdc2
>>> lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal
>>> -> /dev/sdd2
>>> lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal
>>> -> /dev/sde3
>>> lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal
>>> -> /dev/sdc3
>>> lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal
>>> -> /dev/sdd3
>>> lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal
>>> -> /dev/sde4
>>> lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal
>>> -> /dev/sdd4
>>>
>>> Any way to fix this without just removing all the OSDs and re-adding
>>> them? I thought about recreating the symlinks to point at the new SSD
>>> labels, but I figured I'd check here first. Thanks!
>>>
>>> -Steve
>>>
>>> --
>>> Steve Anthony
>>> LTS HPC Support Specialist
>>> Lehigh University
>>> sma...@lehigh.edu
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
>
>

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't start osd- one osd alway be down.

2014-10-27 Thread Craig Lewis
My experience is that once you hit this bug, those PGs are gone.  I tried
marking the primary OSD OUT, which caused this problem to move to the new
primary OSD.  Luckily for me, my affected PGs were using replication state
in the secondary cluster.  I ended up deleting the whole pool and
recreating it.

Which pools are 7 and 23?  It's possible that it's something that easy to
replace.
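
(If you need to check, the pool ids map back to names with e.g.

  ceph osd lspools                # prints "<id> <name>," pairs
  ceph osd dump | grep '^pool'    # same, plus size/pg_num/flags per pool

so you can see what actually lives in them.)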



On Fri, Oct 24, 2014 at 9:26 PM, Ta Ba Tuan  wrote:

>  Hi Craig, Thanks for replying.
> When I started that OSD, the Ceph log from "ceph -w" warned that pgs 7.9d8,
> 23.596, 23.9c6 and 23.63 can't recover, as in the pasted log.
>
> Those pgs are in "active+degraded" state.
> #ceph pg map 7.9d8
> osdmap e102808 pg 7.9d8 (7.9d8) -> up [93,49] acting [93,49]  (When I start
> osd.21, pg 7.9d8 and the three remaining pgs change to state
> "active+recovering"). osd.21 is still down after the following logs:
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can we deploy multi-rgw on one ceph cluster?

2014-10-27 Thread Craig Lewis
On Sun, Oct 26, 2014 at 9:08 AM, yuelongguang  wrote:

> hi,
> 1. Does one radosgw daemon *correspond* to one zone? Is the ratio 1:1?
>

Not necessarily.  You need at least one radosgw daemon per zone, but you
can have more.  I have two small clusters.  The primary has 5 nodes, and
the secondary has 4 nodes.  Every node in the clusters run an apache and
radosgw.

It's possible (and confusing) to run multiple radosgw daemons on a single
node for different clusters.  You can either use Apache VHosts, or have
CivetWeb listening on different ports.  I won't recommend this though, as
it introduces a common failure mode to both zones.
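
If you do go that route anyway, it's just a matter of giving each instance
its own section and port in ceph.conf, something like this (zone names and
ports are made up):

  [client.radosgw.zone-a]
      rgw zone = zone-a
      rgw frontends = civetweb port=7480

  [client.radosgw.zone-b]
      rgw zone = zone-b
      rgw frontends = civetweb port=7481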




> 2. It seems that we can deploy any number of rgw in a single ceph
> cluster; those rgw can work separately or cooperate by using radosgw-agent
> to sync data and metadata, am I right?
>

You can deploy as many zones as you want in a single cluster.  Each zone
needs a set of pools and a radosgw daemon.  They can be completely
independant, or have a master-slave replication setup using radosgw-agent.

Keep in mind that radosgw-agent is not bi-directional replication, and the
secondary zone is read-only.



> 3. Do you know how to set up load balancing for rgws? Is nginx a good
> choice, and how do I make nginx work with rgw?
>

Any Load Balancer should work, since the protocol is just HTTP/HTTPS.  Some
people on the list had issues with nginx.  Search the list archive for
radosgw and tengine.

I'm using HAProxy, and it's working for me.  I have a slight issue in my
secondary cluster, with locking during replication.  I believe I need to
enable some kind of stickiness, but I haven't gotten around to
investigating.  In the mean time, I've configured that cluster with a
single node in the active backend, and the other nodes in a
backup backend.  It's not a setup that can work for everybody, but it meets
my needs until I fix the real issue.
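
Roughly what that looks like in haproxy.cfg (hostnames and addresses are
placeholders; 'balance source' would be one simple way to get the stickiness
mentioned above instead of the single-active-node workaround):

  frontend rgw_http
      bind *:80
      mode http
      default_backend rgw_nodes

  backend rgw_nodes
      mode http
      balance roundrobin
      server rgw1 10.0.0.11:80 check
      server rgw2 10.0.0.12:80 check backup
      server rgw3 10.0.0.13:80 check backup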
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] get/put files with radosgw once MDS crash

2014-10-27 Thread Craig Lewis
I don't imagine this will ever be a feature.  CephFS and RadosGW have
fundamentally different goals and use cases.  While I can think up a way to
map from one to the other and back, it would be a very limited and
frustrating experience.

If you're having problems with MDS stability, you're better off fixing that
than trying to access the data via RadosGW.




On Sun, Oct 26, 2014 at 5:31 PM, 廖建锋  wrote:

>  Does Ceph have a schedule for this?
>
>
>  *From:* Craig Lewis 
> *Date:* 2014-10-25 05:35
> *To:* 廖建锋 
> *CC:* ceph-users 
> *Subject:* Re: [ceph-users] get/put files with radosgw once MDS crash
>   No, MDS and RadosGW store their data in different pools.  There's no
> way for them to access the other's data.
>
>  All of the data is stored in RADOS, and can be accessed via the rados
> CLI.  It's not easy, and you'd probably have to spend a lot of time reading
> the source code to do it.
>
>
> On Fri, Oct 24, 2014 at 1:49 AM, 廖建锋  wrote:
>
>>  Dear cepher,
>>  Today I use MDS to put/get files from the Ceph storage cluster, as
>> it is very easy to use for each side of a company.
>> But ceph mds is not very stable, so my question:
>> is it possible to get the file names and contents from the OSDs with
>> radosgw once MDS crashes, and how?
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What a maximum theoretical and practical capacity in ceph cluster?

2014-10-27 Thread Wido den Hollander
On 10/27/2014 05:32 PM, Dan van der Ster wrote:
> Hi,
> 
> October 27 2014 5:07 PM, "Wido den Hollander"  wrote: 
>> On 10/27/2014 04:30 PM, Mike wrote:
>>
>>> Hello,
>>> My company is planning to build a big Ceph cluster for archiving and
>>> storing data.
>>> By the customer's requirements - 70% of capacity is SATA, 30% SSD.
>>> On the first day data is stored on SSD storage, on the next day it is moved to SATA storage.
>>
>> How are you planning on moving this data? Do you expect Ceph to do this?
>>
>> What kind of access to Ceph are you planning on using? RBD? Raw RADOS?
>> The RADOS Gateway (S3/Swift)?
>>
>>> For now we have decided to use a SuperMicro SKU with 72 bays for HDD = 22
>>> SSD + 50 SATA drives.
>>
>> Those are some serious machines. It will require a LOT of CPU power in
>> those machines to run 72 OSDs. Probably 4 CPUs per machine.
>>
>>> Our racks can hold 10 of these servers, and with 50 such racks in the Ceph
>>> cluster = 36000 OSDs,
>>
>> 36.000 OSDs shouldn't really be the problem, but you are thinking really
>> big scale here.
>>
> 
> AFAIK, the OSDs should scale, since they only peer with ~100 others 
> regardless of the cluster size. I wonder about the mon's though -- 36,000 
> OSDs will send a lot of pg_stats updates so the mon's will have some work to 
> keep up. But the main issue I foresee is on the clients: don't be surprised 
> when you see that each client needs close to 100k threads when connected to 
> this cluster. A hypervisor with 10 VMs running would approach 1 million 
> threads -- I have no idea if that will present any problems. There were 
> discussions about limiting the number of client threads, but I don't know if 
> there was any progress on that yet.
> 

True about the mons. 3 monitors will not cut it here. You need 9 MONs at
least I think, on dedicated resources.

> Anyway, it would be good to know if there are any current installations even 
> close to this size (even in test). We are in the early days of planning a 10k 
> OSD test, but haven't exceeded ~1200 yet.
> 
> Cheers, Dan
> 
> 
>>> With 4TB SATA drives, replica = 2 and nearfull ratio = 0.8 we have 40
>>> petabytes of useful capacity.
>>>
>>> Is it too big, or a normal use case for Ceph?
>>
>> No, it's not too big for Ceph. This is what it was designed for. But a
>> setup like this shouldn't be taken lightly.
>>
>> Think about the network connectivity required to connect all these
>> machines and other decisions to be made.
>>
>> ___ 
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Mariusz Gronczewski
* /dev/disk/by-id

by-path will change if you connect it to different controller, or
replace your controller with other model, or put it in different pci
slot

On Sat, 25 Oct 2014 17:20:58 +, Scott Laird 
wrote:

> You'd be best off using /dev/disk/by-path/ or similar links; that way they
> follow the disks if they're renamed again.
> 
> On Fri, Oct 24, 2014, 9:40 PM Steve Anthony  wrote:
> 
> > Hello,
> >
> > I was having problems with a node in my cluster (Ceph v0.80.7/Debian
> > Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when
> > it came back up. Now all the symlinks to the journals are broken. The
> > SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:
> >
> > root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
> > lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal
> > -> /dev/sde1
> > lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal
> > -> /dev/sdd1
> > lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal
> > -> /dev/sdc1
> > lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal
> > -> /dev/sde2
> > lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal
> > -> /dev/sdc2
> > lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal
> > -> /dev/sdd2
> > lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal
> > -> /dev/sde3
> > lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal
> > -> /dev/sdc3
> > lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal
> > -> /dev/sdd3
> > lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal
> > -> /dev/sde4
> > lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal
> > -> /dev/sdd4
> >
> > Any way to fix this without just removing all the OSDs and re-adding
> > them? I thought about recreating the symlinks to point at the new SSD
> > labels, but I figured I'd check here first. Thanks!
> >
> > -Steve
> >
> > --
> > Steve Anthony
> > LTS HPC Support Specialist
> > Lehigh University
> > sma...@lehigh.edu
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >



-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com



signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What a maximum theoretical and practical capacity in ceph cluster?

2014-10-27 Thread Dan van der Ster
Hi,

October 27 2014 5:07 PM, "Wido den Hollander"  wrote: 
> On 10/27/2014 04:30 PM, Mike wrote:
> 
>> Hello,
>> My company is planning to build a big Ceph cluster for archiving and
>> storing data.
>> By the customer's requirements - 70% of capacity is SATA, 30% SSD.
>> On the first day data is stored on SSD storage, on the next day it is moved to SATA storage.
> 
> How are you planning on moving this data? Do you expect Ceph to do this?
> 
> What kind of access to Ceph are you planning on using? RBD? Raw RADOS?
> The RADOS Gateway (S3/Swift)?
> 
>> For now we have decided to use a SuperMicro SKU with 72 bays for HDD = 22
>> SSD + 50 SATA drives.
> 
> Those are some serious machines. It will require a LOT of CPU power in
> those machines to run 72 OSDs. Probably 4 CPUs per machine.
> 
>> Our racks can hold 10 of these servers, and with 50 such racks in the Ceph
>> cluster = 36000 OSDs,
> 
> 36.000 OSDs shouldn't really be the problem, but you are thinking really
> big scale here.
> 

AFAIK, the OSDs should scale, since they only peer with ~100 others regardless 
of the cluster size. I wonder about the mon's though -- 36,000 OSDs will send a 
lot of pg_stats updates so the mon's will have some work to keep up. But the 
main issue I foresee is on the clients: don't be surprised when you see that 
each client needs close to 100k threads when connected to this cluster. A 
hypervisor with 10 VMs running would approach 1 million threads -- I have no 
idea if that will present any problems. There were discussions about limiting 
the number of client threads, but I don't know if there was any progress on 
that yet.

Anyway, it would be good to know if there are any current installations even 
close to this size (even in test). We are in the early days of planning a 10k 
OSD test, but haven't exceeded ~1200 yet.

Cheers, Dan


>> With 4TB SATA drives, replica = 2 and nearfull ratio = 0.8 we have 40
>> petabytes of useful capacity.
>> 
>> Is it too big, or a normal use case for Ceph?
> 
> No, it's not too big for Ceph. This is what it was designed for. But a
> setup like this shouldn't be taken lightly.
> 
> Think about the network connectivity required to connect all these
> machines and other decisions to be made.
> 
> ___ 
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What a maximum theoretical and practical capacity in ceph cluster?

2014-10-27 Thread Wido den Hollander
On 10/27/2014 04:30 PM, Mike wrote:
> Hello,
> My company is planning to build a big Ceph cluster for archiving and
> storing data.
> By the customer's requirements - 70% of capacity is SATA, 30% SSD.
> On the first day data is stored on SSD storage, on the next day it is moved to SATA storage.
> 

How are you planning on moving this data? Do you expect Ceph to do this?

What kind of access to Ceph are you planning on using? RBD? Raw RADOS?
The RADOS Gateway (S3/Swift)?

> For now we have decided to use a SuperMicro SKU with 72 bays for HDD = 22
> SSD + 50 SATA drives.

Those are some serious machines. It will require a LOT of CPU power in
those machines to run 72 OSDs. Probably 4 CPUs per machine.

> Our racks can hold 10 of these servers, and with 50 such racks in the Ceph
> cluster = 36000 OSDs,

36.000 OSDs shouldn't really be the problem, but you are thinking really
big scale here.

> With 4TB SATA drives, replica = 2 and nearfull ratio = 0.8 we have 40
> petabytes of useful capacity.
> 
> Is it too big, or a normal use case for Ceph?
> 

No, it's not too big for Ceph. This is what it was designed for. But a
setup like this shouldn't be taken lightly.

Think about the network connectivity required to connect all these
machines and other decisions to be made.

___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD getting unmapped every time when server reboot

2014-10-27 Thread Laurent Barbe

Hi,
Which version of the script do you use? The one from the ceph-common package?
If this is not the case, you should verify that it contains this line
somewhere at the beginning:

# chkconfig: 2345 20 80

and make a `chkconfig --add rbdmap`

To avoid the "log_*: command not found" errors, you need to install the
redhat-lsb-core package.
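
On CentOS that boils down to something like this (assuming the stock
/etc/init.d/rbdmap shipped with ceph-common):

  yum install redhat-lsb-core     # provides the log_* helpers the script calls
  head -5 /etc/init.d/rbdmap      # make sure the '# chkconfig: 2345 20 80' line is there
  chkconfig --add rbdmap
  chkconfig rbdmap on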


Laurent

On 26/10/2014 12:13, Vickey Singh wrote:

Hi Chris

Yes, I have checked this message and I am sure that the secret file is
present in the correct location. Any other suggestions are welcome.

Hi Sebastien

Can you suggest something here.

Thanks
vicky


On Sun, Oct 26, 2014 at 1:23 AM, Christopher Armstrong
<ch...@opdemand.com> wrote:

unable to read secretfile: No such file or directory

Looks like it's trying to mount, but your secretfile is gone.

*Chris Armstrong
*Head of Services
OpDemand / Deis.io

GitHub: https://github.com/deis/deis -- Docs: http://docs.deis.io/


On Sat, Oct 25, 2014 at 2:07 PM, Vickey Singh
<vickey.singh22...@gmail.com> wrote:

Hello Cephers, I need your advice and tips here.

*Problem statement: Ceph RBD gets unmapped each time I
reboot my server. After every reboot I need to manually
map it and mount it.*

*Setup:*

Ceph Firefly 0.80.1
CentOS 6.5, Kernel: 3.15.0-1

I have tried doing as mentioned in the blog, but it looks like
this does not work with CentOS:

http://ceph.com/planet/mapunmap-rbd-device-on-bootshutdown/



# /etc/init.d/rbdmap start
/etc/init.d/rbdmap: line 26: log_daemon_msg: command not found
/etc/init.d/rbdmap: line 42: log_progress_msg: command not found
/etc/init.d/rbdmap: line 47: echo: write error: Invalid argument
/etc/init.d/rbdmap: line 52: log_end_msg: command not found
/etc/init.d/rbdmap: line 56: log_action_begin_msg: command not found
unable to read secretfile: No such file or directory
error reading secret file
failed to parse ceph_options
Thread::try_create(): pthread_create failed with error
13common/Thread.cc: In function 'void Thread::create(size_t)'
thread 7fb8ec4ed760 time 2014-10-26 00:01:10.180440
common/Thread.cc: 110: FAILED assert(ret == 0)
  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
  1: (Thread::create(unsigned long)+0x8a) [0x6ba82a]
  2: (CephContext::CephContext(unsigned int)+0xba) [0x60ef7a]
  3: (common_preinit(CephInitParameters const&,
code_environment_t, int)+0x45) [0x6e8305]
  4: (global_pre_init(std::vector >*, std::vector >&, unsigned int,
code_environment_t, int)+0xaf) [0x5ee21f]
  5: (global_init(std::vector >*, std::vector
 >&, unsigned int, code_environment_t, int)+0x2f) [0x5eed6f]
  6: (main()+0x7f) [0x5289af]
  7: (__libc_start_main()+0xfd) [0x3efa41ed1d]
  8: ceph-fuse() [0x5287c9]
  NOTE: a copy of the executable, or `objdump -rdS `
is needed to interpret this.
terminate called after throwing an instance of
'ceph::FailedAssertion'
/etc/init.d/rbdmap: line 58: log_action_end_msg: command not found
#


# cat /etc/ceph/rbdmap
rbd/rbd-disk1 id=admin,secret=AQAinItT8Ip9AhAAS93FrXLrrnVp8/sQhjvTIg==
#


Many Thanks in Advance
Vicky

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What a maximum theoretical and practical capacity in ceph cluster?

2014-10-27 Thread Mike
Hello,
My company is planning to build a big Ceph cluster for archiving and
storing data.
By the customer's requirements - 70% of capacity is SATA, 30% SSD.
On the first day data is stored on SSD storage, on the next day it is moved to SATA storage.

For now we have decided to use a SuperMicro SKU with 72 bays for HDD = 22
SSD + 50 SATA drives.
Our racks can hold 10 of these servers, and with 50 such racks in the Ceph
cluster = 36000 OSDs.
With 4TB SATA drives, replica = 2 and nearfull ratio = 0.8 we have 40
petabytes of useful capacity.

Is it too big, or a normal use case for Ceph?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Change port of Mon

2014-10-27 Thread Wido den Hollander
On 10/27/2014 03:50 PM, Daniel Takatori Ohara wrote:
> Hello Wido,
> 
> Thanks for the answer. I am new to Ceph, and I have a problem.
> 
> I have 2 clusters, but when I execute the command "df" on the clients, I see
> only one directory. With the command "mount", I see both clusters.
> 

It could be that they are mounted on the same location?

> I think it is possibly a conflict with the ports, or a conflict with the
> cluster name, because the clusters have the same name.
>

The cluster name isn't a problem, the UUID of a cluster is leading.

It's kind of hard to diagnose, but I doubt it is the port number of the
monitor which is causing the problem.

> Sorry for my english.
> 
> 
> Att.
> 
> ---
> Daniel Takatori Ohara.
> System Administrator - Lab. of Bioinformatics
> Molecular Oncology Center
> Instituto Sírio-Libanês de Ensino e Pesquisa
> Hospital Sírio-Libanês
> Phone: +55 11 3155-0200 (extension 1927)
> R: Cel. Nicolau dos Santos, 69
> São Paulo-SP. 01308-060
> http://www.bioinfo.mochsl.org.br
> 
> 
> On Mon, Oct 27, 2014 at 12:45 PM, Wido den Hollander  wrote:
> 
>> On 10/27/2014 03:42 PM, Daniel Takatori Ohara wrote:
>>> Hello,
>>>
>>> Can anyone help me? How can I modify the port of the mon?
>>>
>>
>> The default port is 6789. Why would you want to change it?
>>
>> It is possible by changing the monmap, but I'm just trying to understand
>> the reasoning behind it.
>>
>>> And how can i modify the cluster name?
>>>
>>> Thanks,
>>>
>>> Att.
>>>
>>> ---
>>> Daniel Takatori Ohara.
>>> System Administrator - Lab. of Bioinformatics
>>> Molecular Oncology Center
>>> Instituto Sírio-Libanês de Ensino e Pesquisa
>>> Hospital Sírio-Libanês
>>> Phone: +55 11 3155-0200 (extension 1927)
>>> R: Cel. Nicolau dos Santos, 69
>>> São Paulo-SP. 01308-060
>>> http://www.bioinfo.mochsl.org.br
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Change port of Mon

2014-10-27 Thread Wido den Hollander
On 10/27/2014 03:42 PM, Daniel Takatori Ohara wrote:
> Hello,
> 
> Can anyone help me? How can I modify the port of the mon?
> 

The default port is 6789. Why would you want to change it?

It is possible by changing the monmap, but I'm just trying to understand
the reasoning behind it.
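
For the record, changing a monitor's port means editing the monmap, roughly
like this (mon name "a" and the address are placeholders; do it with that mon
stopped and update mon host in ceph.conf on all nodes afterwards):

  ceph mon getmap -o /tmp/monmap             # grab the current monmap
  service ceph stop mon.a
  monmaptool --print /tmp/monmap
  monmaptool --rm a /tmp/monmap              # drop the old address
  monmaptool --add a 192.168.0.10:6788 /tmp/monmap
  ceph-mon -i a --inject-monmap /tmp/monmap
  service ceph start mon.a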

> And how can i modify the cluster name?
> 
> Thanks,
> 
> Att.
> 
> ---
> Daniel Takatori Ohara.
> System Administrator - Lab. of Bioinformatics
> Molecular Oncology Center
> Instituto Sírio-Libanês de Ensino e Pesquisa
> Hospital Sírio-Libanês
> Phone: +55 11 3155-0200 (extension 1927)
> R: Cel. Nicolau dos Santos, 69
> São Paulo-SP. 01308-060
> http://www.bioinfo.mochsl.org.br
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Change port of Mon

2014-10-27 Thread Daniel Takatori Ohara
Hello,

Can anyone help me? How can I modify the port of the mon?

And how can i modify the cluster name?

Thanks,

Att.

---
Daniel Takatori Ohara.
System Administrator - Lab. of Bioinformatics
Molecular Oncology Center
Instituto Sírio-Libanês de Ensino e Pesquisa
Hospital Sírio-Libanês
Phone: +55 11 3155-0200 (extension 1927)
R: Cel. Nicolau dos Santos, 69
São Paulo-SP. 01308-060
http://www.bioinfo.mochsl.org.br
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and hadoop

2014-10-27 Thread John Spray
Hi Matan,

Hadoop on CephFS is part of the regular test suites run on CephFS, so
it should work at least to some extent.   Any testing/feedback on this
will be appreciated.

As far as I know, the article you link is the best available documentation.
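
From memory, the wiring in core-site.xml on that page looks roughly like this
(monitor address and paths are placeholders; check the linked page for the
authoritative property list):

  <property>
    <name>fs.default.name</name>
    <value>ceph://192.168.0.10:6789/</value>
  </property>
  <property>
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
  <property>
    <name>ceph.conf.file</name>
    <value>/etc/ceph/ceph.conf</value>
  </property>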

Cheers,
John


On Fri, Oct 24, 2014 at 8:30 PM, Matan Safriel  wrote:
> Hi,
>
> Given HDFS is far from ideal for small files, I am examining the possibility
> of using Hadoop on top of Ceph. I found mainly one online resource about it,
> https://ceph.com/docs/v0.79/cephfs/hadoop/. I am wondering whether there is
> any reference implementation or blog post you are aware of about Hadoop on
> top of Ceph. Likewise, happy to have any pointers about why _not_ to attempt
> just that.
>
> Thanks!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD storage and journals

2014-10-27 Thread Sebastien Han
There were also some investigations around F2FS
(https://www.kernel.org/doc/Documentation/filesystems/f2fs.txt); the last time
I tried to install an OSD dir under f2fs it failed.
I tried to run the OSD on f2fs, however ceph-osd mkfs got stuck on an xattr test:

fremovexattr(10, "user.test@5848273")   = 0

Maybe someone from the core dev has an update on this?

> On 24 Oct 2014, at 07:58, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> as others have reported in the past and now having tested things here
> myself, there really is no point in having journals for SSD backed OSDs on
> other SSDs.
> 
> It is a zero sum game, because:
> a) using that journal SSD as another OSD with integrated journal will
> yield the same overall result performance wise, if all SSDs are the same.
> In addition its capacity will be made available for actual storage.
> b) if the journal SSD is faster than the OSD SSDs it tends to be priced
> accordingly. For example the DC P3700 400GB is about twice as fast (write)
> and expensive as the DC S3700 400GB.
> 
> Things _may_ be different if one doesn't look at bandwidth but IOPS (though
> certainly not in the near future in regard to Ceph actually getting SSDs
> busy), but even there the difference is negligible when for example
> comparing the Intel S and P models in write performance.
> Reads are another thing, but nobody cares about those in journals. ^o^
> 
> Obvious things that come to mind in this context would be the ability to
> disable journals (difficult, I know, not touching BTRFS, thank you) and
> probably K/V store in the future.
> 
> Regards,
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Cheers.
 
Sébastien Han 
Cloud Architect 

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72 
Mail: sebastien@enovance.com 
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com