Re: [ceph-users] Recommended way of leveraging multiple disks by Ceph

2015-09-16 Thread Max A. Krasilnikov
Hello!

On Tue, Sep 15, 2015 at 04:16:47PM +, fangzhe.chang wrote:

> Hi,

> I'd like to run Ceph on a few machines, each of which has multiple disks. The 
> disks are heterogeneous: some are rotational disks of larger capacities while 
> others are smaller solid state disks. What are the recommended ways of 
> running ceph osd-es on them?

> Two of the approaches can be:

> 1)  Deploy an osd instance on each hard disk. For instance, if a machine 
> has six hard disks, there will be six osd instances running on it. In this 
> case, does Ceph's replication algorithm recognize that these osd-es are on 
> the same machine therefore try to avoid placing replicas on disks/osd-es of a 
> same machine?

When adding an OSD (or at any later point) you can set its CRUSH location. PG
placement is based on your CRUSH rules and those locations, so in the general
case replicas are written to different hosts.
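For example, a minimal sketch (host, root and weight values are placeholders for your own topology):

  # per OSD in ceph.conf, picked up when the OSD starts:
  [osd.0]
      osd crush location = root=default host=node1

  # or set/move it once from the CLI:
  ceph osd crush create-or-move osd.0 2.72 root=default host=node1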

I have a config with multiple disks on 3 nodes, some of them HDDs plus 1 SSD per
node. Each drive serves 1 OSD.

> 2)  Create a logical volume spanning multiple hard disks of a machine and 
> run a single copy of osd per machine.

It is more reliable to have several OSDs, one per drive. When losing a drive,
you will not lose all the data on the host.

> If you have previous experiences, benchmarking results, or know a pointer to 
> the corresponding documentation, please share with me and other users. Thanks 
> a lot.

I found this fine article helpful:
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
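The gist of that article is to give SSDs and HDDs separate CRUSH roots and point each pool at the right one with its own rule. A rough sketch of the idea (bucket and rule names below are made up; it assumes each node's SSD OSDs are grouped under a virtual host such as node1-ssd):

  root ssd {
          id -20          # do not change unnecessarily
          alg straw
          hash 0          # rjenkins1
          item node1-ssd weight 1.000
          item node2-ssd weight 1.000
          item node3-ssd weight 1.000
  }

  rule ssd {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 0 type host
          step emit
  }

and then point the fast pool at it with something like "ceph osd pool set <ssd-pool> crush_ruleset 1".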

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploy osd with btrfs not success.

2015-09-16 Thread Ilya Dryomov
On Wed, Sep 16, 2015 at 12:57 PM, Vickie ch  wrote:
> Hi cephers,
> Have anyone ever created osd with btrfs in Hammer 0.94.3 ? I can create
> btrfs partition successfully.  But once use "ceph-deploy" then always get
> error like below. Another question there is no parameter " -f " with mkfs.
> Any suggestion is appreciated.
> -
> [osd3][DEBUG ] The operation has completed successfully.
> [osd3][WARNIN] DEBUG:ceph-disk:Calling partprobe on created device /dev/sda
> [osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
> [osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
> [osd3][WARNIN] DEBUG:ceph-disk:Creating btrfs fs on /dev/sda1
> [osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t btrfs -m single
> -l 32768 -n 32768 -- /dev/sda1
> [osd3][WARNIN] /dev/sda1 appears to contain an existing filesystem (xfs).
> [osd3][WARNIN] Error: Use the -f option to force overwrite.
> [osd3][WARNIN] ceph-disk: Error: Command '['/sbin/mkfs', '-t', 'btrfs',
> '-m', 'single', '-l', '32768', '-n', '32768', '--', '/dev/sda1']' returned
> non-zero exit status 1
> [osd3][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare
> --zap-disk --cluster ceph --fs-type btrfs -- /dev/sda
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
> 

ceph-deploy not using -f is probably a safety measure.  You can nuke the
xfs superblock on /dev/sda1 with wipefs(8).
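Something like the following should do it (double-check the device name first, and make sure the partition is not mounted):

  wipefs -a /dev/sda1    # remove the stale filesystem signature(s) from the partition

then re-run the ceph-deploy / ceph-disk prepare step.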

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson
The move-journal, resize-partition, grow-filesystem approach would
work nicely if the spare capacity were at the end of the disk.

Unfortunately, the gdisk (0.8.1) end-of-disk location bug caused the
journal to be placed at the 800GB mark, leaving the largest remaining
partition at the end of the disk.  I'm assuming the gdisk bug was
caused by overflowing a 32-bit int while calculating the -1000M offset
from the end of the disk.  When it computed the end of the disk for the
journal placement on disks >2TB, it dropped the 2TB part of the size and
was left only with the 800GB value, putting the journal there.  After
gdisk created the journal at the 800GB mark (splitting the disk),
ceph-disk-prepare told gdisk to take the largest remaining partition for
data, using the 2TB partition at the end.

Here's an example of the buggy partitioning:

crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5859442654
Partitions will be aligned on 2048-sector boundaries
Total free space is 1562425343 sectors (745.0 GiB)

Number  Start (sector)    End (sector)   Size         Code  Name
   1        1564475392      5859442654   2.0 TiB            ceph data
   2        1562425344      1564475358   1001.0 MiB         ceph journal

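As a quick sanity check of the 32-bit theory against the listing above, the
journal's end sector is exactly the last usable sector minus 2^32:

  $ echo $(( 5859442654 - 2**32 ))
  1564475358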


I assume I could still follow a disk-level relocation of data using dd
and shift all my content forward in the disk and then grow the file
system to the end, but this would take a significant amount of time,
more than a quick restart of the OSD. 

This leaves me the option of setting noout and hoping for the best (no
other failures) during my somewhat lengthy dd data movement or taking my
osd down and letting the cluster begin repairing the redundancy.

If I follow the second option of normal osd loss repair, my disk
repartition step would be fast and I could bring the OSD back up rather
quickly.  Does taking an OSD out of service, erasing it and bringing the
same OSD back into service present any undue stress to the cluster?  

I'd prefer to use the second option if I can because I'm likely to
repeat this in the near future in order to add encryption to these disks.

~jpr

On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> Le 16/09/2015 01:21, John-Paul Robinson a écrit :
>> Hi,
>>
>> I'm working to correct a partitioning error from when our cluster was
>> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
>> partitions for our OSDs, instead of the 2.8TB actually available on
>> disk, a 29% space hit.  (The error was due to a gdisk bug that
>> mis-computed the end of the disk during the ceph-disk-prepare and placed
>> the journal at the 2TB mark instead of the true end of the disk at
>> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>>
>> I'd like to fix this problem by taking my existing 2TB OSDs offline one
>> at a time, repartitioning them and then bringing them back into the
>> cluster.  Unfortunately I can't just grow the partitions, so the
>> repartition will be destructive.
> Hum, why should it be? If the journal is at the 2TB mark, you should be
> able to:
> - stop the OSD,
> - flush the journal (ceph-osd -i <osd-id> --flush-journal),
> - unmount the data filesystem (might be superfluous but the kernel seems
> to cache the partition layout when a partition is active),
> - remove the journal partition,
> - extend the data partition,
> - place the journal partition at the end of the drive (in fact you
> probably want to write a precomputed partition layout in one go).
> - mount the data filesystem, resize it online,
> - ceph-osd -i <osd-id> --mkjournal (assuming your setup can find the
> partition again automatically without reconfiguration)
> - start the OSD
>
> If you script this you should not have to use noout: the OSD should come
> back in a matter of seconds and the impact on the storage network minimal.
>
> Note that the start of the disk is where you get the best sequential
> reads/writes. Given that most data accesses are random and all journal
> accesses are sequential I put the journal at the start of the disk when
> data and journal are sharing the same platters.
>
> Best regards,
>
> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploy osd with btrfs not success.

2015-09-16 Thread Vickie ch
Hi cephers,
Has anyone ever created an OSD with btrfs in Hammer 0.94.3?  I can create
the btrfs partition successfully, but once I use "ceph-deploy" I always get
an error like the one below.  Another question: there is no " -f " parameter
passed to mkfs.  Any suggestion is appreciated.
​-
[osd3][DEBUG ] The operation has completed successfully.
[osd3][WARNIN] DEBUG:ceph-disk:Calling partprobe on created device /dev/sda
[osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
[osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[osd3][WARNIN] DEBUG:ceph-disk:Creating btrfs fs on /dev/sda1
[osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t btrfs -m
single -l 32768 -n 32768 -- /dev/sda1
[osd3][WARNIN] /dev/sda1 appears to contain an existing filesystem (xfs).
[osd3][WARNIN] Error: Use the -f option to force overwrite.
[osd3][WARNIN] ceph-disk: Error: Command '['/sbin/mkfs', '-t', 'btrfs',
'-m', 'single', '-l', '32768', '-n', '32768', '--', '/dev/sda1']' returned
non-zero exit status 1
[osd3][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare
--zap-disk --cluster ceph --fs-type btrfs -- /dev/sda
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
​​


Best wishes,
Mika
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploy osd with btrfs not success.

2015-09-16 Thread Simon Hallam
This may help: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004295.html

Cheers,

Simon

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vickie 
ch
Sent: 16 September 2015 10:58
To: ceph-users
Subject: [ceph-users] Deploy osd with btrfs not success.

Hi cephers,
Have anyone ever created osd with btrfs in Hammer 0.94.3 ? I can create 
btrfs partition successfully.  But once use "ceph-deploy" then always get error 
like below. Another question there is no parameter " -f " with mkfs. Any 
suggestion is appreciated.
​-
[osd3][DEBUG ] The operation has completed successfully.
[osd3][WARNIN] DEBUG:ceph-disk:Calling partprobe on created device /dev/sda
[osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
[osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[osd3][WARNIN] DEBUG:ceph-disk:Creating btrfs fs on /dev/sda1
[osd3][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t btrfs -m single -l 
32768 -n 32768 -- /dev/sda1
[osd3][WARNIN] /dev/sda1 appears to contain an existing filesystem (xfs).
[osd3][WARNIN] Error: Use the -f option to force overwrite.
[osd3][WARNIN] ceph-disk: Error: Command '['/sbin/mkfs', '-t', 'btrfs', '-m', 
'single', '-l', '32768', '-n', '32768', '--', '/dev/sda1']' returned non-zero 
exit status 1
[osd3][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare 
--zap-disk --cluster ceph --fs-type btrfs -- /dev/sda
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
​​


Best wishes,
Mika




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson (Campus)
So I just realized I had described the partition error incorrectly in my 
initial post. The journal was placed at the 800GB mark leaving the 2TB data 
partition at the end of the disk. (See my follow-up to Lionel for details.) 

I'm working to correct that so I have a single large partition the size of the 
disk, save for the journal.

Sorry for any confusion. 

~jpr



> On Sep 15, 2015, at 6:21 PM, John-Paul Robinson  wrote:
> 
> I'm working to correct a partitioning error from when our cluster was
> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> partitions for our OSDs, instead of the 2.8TB actually available on
> disk, a 29% space hit.  (The error was due to a gdisk bug that
> mis-computed the end of the disk during the ceph-disk-prepare and placed
> the journal at the 2TB mark instead of the true end of the disk at
> 2.8TB. I've updated gdisk to a newer release that works correctly.)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread Christian Balzer

Hello,

On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:

> The move  journal, partition resize, grow file system approach would
> work nicely if the spare capacity were at the end of the disk.
>
That shouldn't matter, you can "safely" lose your journal in controlled
circumstances.

This would also be an ideal time to put your journals on SSDs. ^o^

Roughly (you do have a test cluster, don't you? Or at least try this with
just one OSD):

1. set noout just to be sure.
2. stop the OSD
3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
--help)
4. clobber your partitions in a way that leaves you with an intact data
partition, grow that and the FS in it as desired.
5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
6. start the OSD and rejoice. 
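A rough shell sketch of the above, assuming osd.N has its data partition on
/dev/sdX1, mounted at /var/lib/ceph/osd/ceph-N with an XFS filesystem that can
grow in place (commands are illustrative; adapt the service invocations to
your init system):

  ceph osd set noout
  service ceph stop osd.N
  ceph-osd -i N --flush-journal
  umount /var/lib/ceph/osd/ceph-N
  # rewrite the partition table: keep the data partition's start sector, extend
  # its end, recreate the journal partition, then re-read it (e.g. partprobe /dev/sdX)
  mount /dev/sdX1 /var/lib/ceph/osd/ceph-N
  xfs_growfs /var/lib/ceph/osd/ceph-N
  ceph-osd -i N --mkjournal
  service ceph start osd.N
  ceph osd unset noout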
 
More below.

> Unfortunately, the gdisk (0.8.1) end of disk location bug caused the
> journal placement to be at the 800GB mark, leaving the largest remaining
> partition at the end of the disk.   I'm assuming the gdisk bug was
> caused by overflowing a 32bit int during the -1000M offset from end of
> disk calculation.  When it computed the end of disk for the journal
> placement on disks >2TB it dropped the 2TB part of the size and was left
> only with the 800GB value, putting the journal there.  After gdisk
> created the journal at the 800GB mark (splitting the disk),
> ceph-disk-prepare told gdisk to take the largest remaining partition for
> data, using the 2TB partition at the end.
> 
> Here's an example of the buggy partitioning:
> 
> crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
> GPT fdisk (gdisk) version 0.8.8
> 
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
> 
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 5859442654
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 1562425343 sectors (745.0 GiB)
> 
> Number  Start (sector)End (sector)  Size   Code  Name
>1  1564475392  5859442654   2.0 TiB   ceph data
>2  1562425344  1564475358   1001.0 MiB    ceph journal
> 
> 
> 
> I assume I could still follow a disk-level relocation of data using dd
> and shift all my content forward in the disk and then grow the file
> system to the end, but this would take a significant amount of time,
> more than a quick restart of the OSD. 
> 
> This leaves me the option of setting noout and hoping for the best (no
> other failures) during my somewhat lengthy dd data movement or taking my
> osd down and letting the cluster begin repairing the redundancy.
> 
> If I follow the second option of normal osd loss repair, my disk
> repartition step would be fast and I could bring the OSD back up rather
> quickly.  Does taking an OSD out of service, erasing it and bringing the
> same OSD back into service present any undue stress to the cluster?  
> 
Undue is such a nicely ambiguous word.
Recovering/Backfilling an OSD will stress your cluster, especially
considering that you're not using SSDs and a positively ancient version of
Ceph. 

Make sure to set all appropriate recovery/backfill options to their
minimum.
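For example, the usual knobs can be turned down at runtime (the values here
are just the common minimums):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'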

OTOH your cluster should be able to handle losses of OSDs w/o melting down
and given the presumed age of your cluster you must have had OSD failures
before. 
How did it fare then?

I have one cluster where losing an OSD would be just background noise,
while another one would be seriously impacted by such a loss (working on
correcting that).

Regards,

Christian
> I'd prefer to use the second option if I can because I'm likely to
> repeat this in the near future in order to add encryption to these disks.
> 
> ~jpr
> 
> On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> > Le 16/09/2015 01:21, John-Paul Robinson a écrit :
> >> Hi,
> >>
> >> I'm working to correct a partitioning error from when our cluster was
> >> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> >> partitions for our OSDs, instead of the 2.8TB actually available on
> >> disk, a 29% space hit.  (The error was due to a gdisk bug that
> >> mis-computed the end of the disk during the ceph-disk-prepare and
> >> placed the journal at the 2TB mark instead of the true end of the
> >> disk at 2.8TB. I've updated gdisk to a newer release that works
> >> correctly.)
> >>
> >> I'd like to fix this problem by taking my existing 2TB OSDs offline
> >> one at a time, repartitioning them and then bringing them back into
> >> the cluster.  Unfortunately I can't just grow the partitions, so the
> >> repartition will be destructive.
> > Hum, why should it be? If the journal is at the 2TB mark, you should be
> > able to:
> > - stop the OSD,
> 

[ceph-users] ceph osd won't boot, resource shortage?

2015-09-16 Thread Peter Sabaini
Hi all,

I'm having trouble adding OSDs to a storage node; I've got about
28 OSDs running, but adding more fails.

Typical log excerpt:

2015-09-16 13:55:58.083797 7f3e7b821800  1 journal _open
/var/lib/ceph/osd/ceph-28/journal fd 20: 21474836480 bytes, block
size 4096 bytes, directio = 1, aio = 1
2015-09-16 13:55:58.090709 7f3e7b821800 -1 journal
FileJournal::_open: unable to setup io_context (61) No data available
2015-09-16 13:55:58.090825 7f3e74a96700 -1 journal io_submit to
0~4096 got (22) Invalid argument
2015-09-16 13:55:58.091061 7f3e7b821800  1 journal close
/var/lib/ceph/osd/ceph-28/journal
2015-09-16 13:55:58.091993 7f3e74a96700 -1 os/FileJournal.cc: In
function 'int FileJournal::write_aio_bl(off64_t&,
ceph::bufferlist&, uint64_t)' thread 7f3e74a96700 time 2015-09-16 13:55:58.090842
os/FileJournal.cc: 1337: FAILED assert(0 == "io_submit got
unexpected error")

More complete: http://pastebin.ubuntu.com/12427041/

If, however, I stop one of the running OSDs, starting the original
OSD works fine. I'm guessing I'm running out of resources
somewhere, but where?

Some poss. relevant sysctl values:

vm.max_map_count=524288
kernel.pid_max=2097152
kernel.threads-max=2097152
fs.aio-max-nr = 65536
fs.aio-nr = 129024
fs.dentry-state = 75710 49996   45  0   0   0
fs.file-max = 26244198
fs.file-nr = 13504  0   26244198
fs.inode-nr = 60706 202
fs.nr_open = 1048576

I've also set max open files = 1048576 in ceph.conf

The OSDs are setup with dedicated journal disks - 3 OSDs share one
journal device.
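One kernel-wide resource worth ruling out here, given the io_context error and
the fs.aio-* values above, is the AIO context limit: each journal opened with
aio=1 sets up an io context, and those count against fs.aio-max-nr, so a node
with many OSDs can exhaust it. A possible check and (illustrative) bump:

  sysctl fs.aio-nr fs.aio-max-nr
  sysctl -w fs.aio-max-nr=1048576    # persist in /etc/sysctl.conf if this turns out to help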

Any advice on what I'm missing, or where I should dig deeper?

Thanks,
peter.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploy osd with btrfs not success.

2015-09-16 Thread Ilya Dryomov
On Wed, Sep 16, 2015 at 2:06 PM, darko  wrote:
> Sorry if this was asked already. Is there an "optimal" file system one
> should use for ceph?

See 
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/#filesystems.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-16 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I was out of the office for a few days. We have some more hosts to
add. I'll send some logs for examination.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 11, 2015 at 12:45 AM, GuangYang  wrote:
> If we are talking about requests being blocked 60+ seconds, those tunings 
> might not help (they help a lot for average latency during 
> recovering/backfilling).
>
> It would be interesting to see the logs for those blocked requests at OSD 
> side (they have level 0), pattern to search might be "slow requests \d+ 
> seconds old".
>
> I had a problem where, for a recovery candidate object, all updates to that
> object would get stuck until it was recovered, which might take an extremely long
> time if there are a large number of PGs and objects to recover. But I think that is
> resolved by Sam's change to allow writes to degraded objects in Hammer.
>
> 
>> Date: Thu, 10 Sep 2015 14:56:12 -0600
>> From: rob...@leblancnet.us
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Hammer reduce recovery impact
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We are trying to add some additional OSDs to our cluster, but the
>> impact of the backfilling has been very disruptive to client I/O and
>> we have been trying to figure out how to reduce the impact. We have
>> seen some client I/O blocked for more than 60 seconds. There has been
>> CPU and RAM head room on the OSD nodes, network has been fine, disks
>> have been busy, but not terrible.
>>
>> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
>> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
>> S51G-1UL.
>>
>> Clients are QEMU VMs.
>>
>> [ulhglive-root@ceph5 current]# ceph --version
>> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>>
>> Some nodes are 0.94.3
>>
>> [ulhglive-root@ceph5 current]# ceph status
>> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>> health HEALTH_WARN
>> 3 pgs backfill
>> 1 pgs backfilling
>> 4 pgs stuck unclean
>> recovery 2382/33044847 objects degraded (0.007%)
>> recovery 50872/33044847 objects misplaced (0.154%)
>> noscrub,nodeep-scrub flag(s) set
>> monmap e2: 3 mons at
>> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
>> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
>> flags noscrub,nodeep-scrub
>> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
>> 128 TB used, 322 TB / 450 TB avail
>> 2382/33044847 objects degraded (0.007%)
>> 50872/33044847 objects misplaced (0.154%)
>> 2300 active+clean
>> 3 active+remapped+wait_backfill
>> 1 active+remapped+backfilling
>> recovery io 70401 kB/s, 16 objects/s
>> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>>
>> Each pool is size 4 with min_size 2.
>>
>> One problem we have is that the requirements of the cluster changed
>> after setting up our pools, so our PGs are really out of wack. Our
>> most active pool has only 256 PGs and each PG is about 120 GB in size.
>> We are trying to clear out a pool that has way too many PGs so that we
>> can split the PGs in that pool. I think these large PGs are part of our
>> issues.
>>
>> Things I've tried:
>>
>> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
>> the max latency sometimes up to 3000 ms down to a max of 500-700 ms.
>> it has also reduced the huge swings in latency, but has also reduced
>> throughput somewhat.
>> * Changed the scheduler from deadline to CFQ. I'm not sure if the
>> OSD process gives the recovery threads a different disk priority or if
>> changing the scheduler without restarting the OSD allows the OSD to
>> use disk priorities.
>> * Reduced the number of osd_max_backfills from 2 to 1.
>> * Tried setting noin to give the new OSDs time to get the PG map and
>> peer before starting the backfill. This caused more problems than
>> solved as we had blocked I/O (over 200 seconds) until we set the new
>> OSDs to in.
>>
>> Even adding one OSD disk into the cluster is causing these slow I/O
>> messages. We still have 5 more disks to add from this server and four
>> more servers to add.
>>
>> In addition to trying to minimize these impacts, would it be better to
>> split the PGs then add the rest of the servers, or add the servers
>> then do the PG split. I'm thinking splitting first would be better,
>> but I'd like to get other opinions.
>>
>> No spindle stays at high utilization for long and the await drops
>> below 20 ms usually within 10 seconds so I/O should be serviced
>> "pretty quick". My next guess is that the journals are getting full
>> and blocking while waiting for flushes, but I'm not exactly sure how
>> to identify that. We are using the defaults for the journal except for
>> size (10G). We'd like to have journals large to handle bursts, but if
>> they are getting filled with backfill 

[ceph-users] cant get cluster to become healthy. "stale+undersized+degraded+peered"

2015-09-16 Thread Stefan Eriksson
I have a completely new cluster for testing; it's three servers, which
are all monitors and OSD hosts, and they each have one disk.

The issue is ceph status shows: 64 stale+undersized+degraded+peered

health:

 health HEALTH_WARN
clock skew detected on mon.ceph01-osd03
64 pgs degraded
64 pgs stale
64 pgs stuck degraded
64 pgs stuck inactive
64 pgs stuck stale
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
too few PGs per OSD (21 < min 30)
Monitor clock skew detected
 monmap e1: 3 mons at 
{ceph01-osd01=192.1.41.51:6789/0,ceph01-osd02=192.1.41.52:6789/0,ceph01-osd03=192.1.41.53:6789/0}
election epoch 82, quorum 0,1,2 
ceph01-osd01,ceph01-osd02,ceph01-osd03

 osdmap e36: 3 osds: 3 up, 3 in
  pgmap v85: 64 pgs, 1 pools, 0 bytes data, 0 objects
101352 kB used, 8365 GB / 8365 GB avail
  64 stale+undersized+degraded+peered


ceph osd tree shows:
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.15996 root default
-2 2.71999 host ceph01-osd01
 0 2.71999 osd.0  up  1.0  1.0
-3 2.71999 host ceph01-osd02
 1 2.71999 osd.1  up  1.0  1.0
-4 2.71999 host ceph01-osd03
 2 2.71999 osd.2  up  1.0  1.0





Here is my crushmap:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph01-osd01 {
id -2   # do not change unnecessarily
# weight 2.720
alg straw
hash 0  # rjenkins1
item osd.0 weight 2.720
}
host ceph01-osd02 {
id -3   # do not change unnecessarily
# weight 2.720
alg straw
hash 0  # rjenkins1
item osd.1 weight 2.720
}
host ceph01-osd03 {
id -4   # do not change unnecessarily
# weight 2.720
alg straw
hash 0  # rjenkins1
item osd.2 weight 2.720
}
root default {
id -1   # do not change unnecessarily
# weight 8.160
alg straw
hash 0  # rjenkins1
item ceph01-osd01 weight 2.720
item ceph01-osd02 weight 2.720
item ceph01-osd03 weight 2.720
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

And the ceph.conf which is shared among all nodes:

ceph.conf
[global]
fsid = b9043917-5f65-98d5-8624-ee12ff32a5ea
public_network = 192.1.41.0/24
cluster_network = 192.168.0.0/24
mon_initial_members = ceph01-osd01, ceph01-osd02, ceph01-osd03
mon_host = 192.1.41.51,192.1.41.52,192.1.41.53
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default pg num = 512
osd pool default pgp num = 512

The logs don't say much; the only active log entry which adds something is:

mon.ceph01-osd01@0(leader).data_health(82) update_stats avail 88% total 
9990 MB, used 1170 MB, avail 8819 MB
mon.ceph01-osd02@1(peon).data_health(82) update_stats avail 88% total 
9990 MB, used 1171 MB, avail 8818 MB
mon.ceph01-osd03@2(peon).data_health(82) update_stats avail 88% total 
9990 MB, used 1172 MB, avail 8817 MB


Does anyone have any thoughts on what might be wrong? Or is there other
info I can provide to ease the search for what it might be?


Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] C example of using libradosstriper?

2015-09-16 Thread Paul Mansfield

Hello,
I'm using the C interface to libradosstriper and am looking for examples
of how to use it.


Please can someone point me to any useful code snippets? All I've found
so far is the source code :-(

Thanks very much
Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

My understanding of growing file systems is the same as yours: they can
only grow at the end, not the beginning. In addition to that, having
partition 2 physically before partition 1 just cries out to be fixed, but
that is just aesthetic.

Because the weights of the drives will be different, there will be
some additional data movement (probably minimized if you are using
straw2). Setting noout will prevent Ceph from shuffling data around
while you are making the changes. When you bring the OSD back in, it
should receive only the PGs that were on it before minimizing the data
movement in the cluster. But because you are adding 800 GB, it will
want to take a few more PGs and so some shuffling in the cluster is
inevitable.

I don't know how well it would work, but you could bring all the
reformatted OSDs in at their current weight and then, when you have
them all re-done, edit the crush map to set the weights right; ideally
the ratio would be the same so no (or very little) data movement would
occur. Due to an error in the straw algorithm, there is still the
potential for large amounts of data movement with small weight changes.
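For example, per OSD from the CLI (weight values are illustrative):

  ceph osd crush reweight osd.12 2.00    # keep the old weight while the OSDs are being redone
  ceph osd crush reweight osd.12 2.72    # raise to the real size once all of them are back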

As to your question about adding the disk before the rebalance is
completed, it will be fine to do so. Ceph will complete the PGs that
are currently being relocated, but compute new locations based on the
new disk. This may result in a PG that just finished moving to be
relocated again. The cluster will still perform and not lose data.

About saving OSD IDs: I only know that if you don't have gaps in your
OSD numbering (from OSDs that were retired and not replaced), then
removing an OSD and recreating it will give it the same number, since
the lowest available number is that of the OSD being replaced. I don't
know whether saving off the files before wiping the OSD will keep its
identity.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV+aE4CRDmVDuy+mK58QAA7hEQAIluPpdYtvhpkIJiWabb
jWBkjOk3W6Am9aosQm88IF3biOMVGBQN2Xs9PgDW2lMz4aU1Vh6rpACCRFt0
Xn46pLanS4lPF/nYClUhu34z5LzNOZv84YEhwbc9KOUHIUs0Ijv7AlkyOn3S
bn1fbx7YUVbliqj6171jvEZKYndYdVe/nLeGVQu+DAkFyycSe+cj4fSnXtgr
xkRd6EDLiXBf8YuqX1sLjwDrtVYoNiPh4R7q1XA1zOkemuMlqwCwxCCJAxuq
5mKMg3DbJfPelSeOV6GXrMJt7GGTj8qUDzBGhvfhPBu1/XtfgRQar6VTi3gG
tdE0S+i8u5Ir9ze8aGvcl7ocmJXtcDa4LIyKmspz1vhPHCgG451W/vCu4mPV
lhym50/+arLSePxoZiQLwazfCx2T3XxcGBOK2KJ13rMVnt4HXsnfnG1x4T9U
0yIolZhPJDY30kyNXAEkivXnShfT9iOsIEFgb3LwhMJNR3uVVgOzQOL5CGlj
NDj5ZebzqsowfflwRxhQIWTo+F2zLXMt5gv5Xqq8UeLuEsx81I9wJh0+DwYM
ISHOHtE/COhlaRiyEk1q3ZzZe56baW5W3KnjNuYmF13jpMfS2ctoAEAUvGxS
d4frVCFJYXZ+5d8b7dYTU5mbqKe59yEPq3yjAOIZPL9PWn1jHfgjylvOMyMw
hihd
=GGct
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 16, 2015 at 10:12 AM, John-Paul Robinson  wrote:
> Christian,
>
> Thanks for the feedback.
>
> I guess I'm wondering about step 4 "clobber partition, leaving data in
> tact and grow partition and the file system as needed".
>
> My understanding of xfs_growfs is that the free space must be at the end
> of the existing file system.  In this case the existing partition starts
> around the 800GB mark on the disk and extends to the end of the
> disk.  My goal is to add the first 800GB on the disk to that partition
> so it can become a single data partition.
>
> Note that my volumes are not LVM based so I can't extend the volume by
> incorporating the free space at the start of the disk.
>
> Am I misunderstanding something about file system grow commands?
>
> Regarding your comments, on impact to the cluster of a downed OSD.  I
> have lost OSDs and the impact is minimal (acceptable).
>
> My concern is around taking an OSD down, having the cluster initiate
> recovery and then bringing that same OSD back into the cluster in an
> empty state.  Are the placement groups that originally had data on this
> OSD already remapped by this point (even if they aren't fully recovered)
> so that bringing the empty replacement OSD on-line simply causes a
> different set of placement groups to be mapped onto it to achieve the
> rebalance?
>
> Thanks,
>
> ~jpr
>
> On 09/16/2015 08:37 AM, Christian Balzer wrote:
>> Hello,
>>
>> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>>
>>> > The move  journal, partition resize, grow file system approach would
>>> > work nicely if the spare capacity were at the end of the disk.
>>> >
>> That shouldn't matter, you can "safely" loose your journal in controlled
>> circumstances.
>>
>> This would also be an ideal time to put your journals on SSDs. ^o^
>>
>> Roughly (you do have a test cluster, do you? Or at least try this with
>> just one OSD):
>>
>> 1. set noout just to be sure.
>> 2. stop the OSD
>> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
>> --help)
>> 4. clobber your partitions in a way that leaves you with an intact 

Re: [ceph-users] Receiving "failed to parse date for auth header"

2015-09-16 Thread Ramon Marco Navarro
That worked. Thank you!

On Fri, Sep 4, 2015 at 11:31 PM Ilya Dryomov  wrote:

> On Fri, Sep 4, 2015 at 12:42 PM, Ramon Marco Navarro
>  wrote:
> > Good day everyone!
> >
> > I'm having a problem using aws-java-sdk to connect to Ceph using
> radosgw. I
> > am reading a " NOTICE: failed to parse date for auth header" message in
> the
> > logs. HTTP_DATE is "Fri, 04 Sep 2015 09:25:33 +00:00", which is I think a
> > valid rfc 1123 date...
>
> Completely unfamiliar with rgw, but try "... +0000" (i.e. no colon)?
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson
Christian,

Thanks for the feedback.

I guess I'm wondering about step 4 "clobber partition, leaving data in
tact and grow partition and the file system as needed".

My understanding of xfs_growfs is that the free space must be at the end
of the existing file system.  In this case the existing partition starts
around the 800GB mark on the disk and extends to the end of the
disk.  My goal is to add the first 800GB on the disk to that partition
so it can become a single data partition.

Note that my volumes are not LVM based so I can't extend the volume by
incorporating the free space at the start of the disk.

Am I misunderstanding something about file system grow commands?

Regarding your comments, on impact to the cluster of a downed OSD.  I
have lost OSDs and the impact is minimal (acceptable).

My concern is around taking an OSD down, having the cluster initiate
recovery and then bringing that same OSD back into the cluster in an
empty state.  Are the placement groups that originally had data on this
OSD already remapped by this point (even if they aren't fully recovered)
so that bringing the empty replacement OSD on-line simply causes a
different set of placement groups to be mapped onto it to achieve the
rebalance?

Thanks,

~jpr

On 09/16/2015 08:37 AM, Christian Balzer wrote:
> Hello,
>
> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>
>> > The move  journal, partition resize, grow file system approach would
>> > work nicely if the spare capacity were at the end of the disk.
>> >
> That shouldn't matter, you can "safely" loose your journal in controlled
> circumstances.
>
> This would also be an ideal time to put your journals on SSDs. ^o^
>
> Roughly (you do have a test cluster, do you? Or at least try this with
> just one OSD):
>
> 1. set noout just to be sure.
> 2. stop the OSD
> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
> --help)
> 4. clobber your partitions in a way that leaves you with an intact data
> partition, grow that and the FS in it as desired.
> 5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
> 6. start the OSD and rejoice. 
>  
> More below.
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] benefit of using stripingv2

2015-09-16 Thread Corin Langosch
Hi guys,

afaik rbd always splits the image into chunks of size 2^order (2^22 = 4MB by
default). What's the benefit of specifying the feature flag "STRIPINGV2"? I
couldn't find any documentation about it except
http://ceph.com/docs/master/man/8/rbd/#striping which doesn't explain the
benefits (or I just don't get it). Better docs in this area would be great,
so I created an issue for that:
http://tracker.ceph.com/issues/13123
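For what it's worth: with plain images every 4MB chunk is a single RADOS
object, so a burst of small sequential writes tends to land on one object (and
one primary OSD) at a time, while fancy striping spreads consecutive stripe
units across a set of objects. A hedged example of creating such an image
(sizes and counts below are arbitrary):

  rbd create testimg --size 10240 --image-format 2 --order 22 --stripe-unit 1048576 --stripe-count 8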

I also noticed the rbd client (0.94.3) ignores the striping feature on image
creation, so I created issue http://tracker.ceph.com/issues/13122 for that. Is
it really a bug, or does it mean stripingv2 is going away and should not be
used?

Cheers
Corin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados bench seq throttling

2015-09-16 Thread Deneau, Tom


> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: Monday, September 14, 2015 5:32 PM
> To: Deneau, Tom
> Cc: ceph-users
> Subject: Re: [ceph-users] rados bench seq throttling
> 
> On Thu, Sep 10, 2015 at 1:02 PM, Deneau, Tom  wrote:
> > Running 9.0.3 rados bench on a 9.0.3 cluster...
> > In the following experiments this cluster is only 2 osd nodes, 6 osds
> > each and a separate mon node (and a separate client running rados
> bench).
> >
> > I have two pools populated with 4M objects.  The pools are replicated
> > x2 with identical parameters.  The objects appear to be spread evenly
> across the 12 osds.
> >
> > In all cases I drop caches on all nodes before doing a rados bench seq
> test.
> > In all cases I run rados bench seq for identical times (30 seconds)
> > and in that time we do not run out of objects to read from the pool.
> >
> > I am seeing significant bandwidth differences between the following:
> >
> >* running a single instance of rados bench reading from one pool with
> 32 threads
> >  (bandwidth approx 300)
> >
> >* running two instances rados bench each reading from one of the two
> pools
> >  with 16 threads per instance (combined bandwidth approx. 450)
> >
> > I have already increased the following:
> >   objecter_inflight_op_bytes = 10485760
> >   objecter_inflight_ops = 8192
> >   ms_dispatch_throttle_bytes = 1048576000  #didn't seem to have any
> > effect
> >
> > The disks and network are not reaching anywhere near 100% utilization
> >
> > What is the best way to diagnose what is throttling things in the one-
> instance case?
> 
> Pretty sure the rados bench main threads are just running into their
> limits. There's some work that Piotr (I think?) has been doing to make it
> more efficient if you want to browse the PRs, but I don't think they're
> even in a dev release yet.
> -Greg

Some further experiments with numbers of rados-bench clients:
  * All of the following are reading 4M-sized objects with dropped caches as
    described above.
  * When we run multiple clients, they are run on different pools but from the
    same separate client node, which is not anywhere near CPU- or network-limited.
  * threads is the total across all clients, as is BW.

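For reference, each client instance in these runs amounts to something like the
following (pool name and thread count vary per case; the write pass is only
there to populate the pool before the seq reads):

  rados -p pool1 bench 120 write -t 32 --no-cleanup   # populate the pool once
  echo 3 | sudo tee /proc/sys/vm/drop_caches          # on all nodes, before each read run
  rados -p pool1 bench 30 seq -t 32
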
Case 1: two node cluster, 3 osds on each node
total     BW      BW      BW
threads   1 cli   2 cli   4 cli
-------   -----   -----   -----
    4     174     185     194
    8     214     273     301
   16     198     309     399
   32     226     309     409
   64     246     341     421


Case 2: one node cluster, 6 osds on one node.
total     BW      BW      BW
threads   1 cli   2 cli   4 cli
-------   -----   -----   -----
    4     339     262     236
    8     465     426     383
   16     467     433     353
   32     470     432     339
   64     471     429     345

So, from the above data, having multiple clients definitely helps
in the 2-node case (Case 1) but hurts in the single-node case.
Still interested in any tools that would help analyze this more deeply...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] C example of using libradosstriper?

2015-09-16 Thread 张冬卯

Hi,

src/tools/rados.c has some striper rados snippets,

and I have this little project using striper rados;
see: https://github.com/thesues/striprados

I hope it helps.

Dongmao Zhang
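A bare-bones sketch of the C calls involved, assuming the declarations in
radosstriper/libradosstriper.h (error handling omitted; check the header for
the exact signatures):

  #include <rados/librados.h>
  #include <radosstriper/libradosstriper.h>
  #include <string.h>

  int main(void)
  {
      rados_t cluster;
      rados_ioctx_t ioctx;
      rados_striper_t striper;
      const char buf[] = "hello striper";

      rados_create(&cluster, NULL);            /* NULL -> connect as client.admin */
      rados_conf_read_file(cluster, NULL);     /* default ceph.conf search path   */
      rados_connect(cluster);
      rados_ioctx_create(cluster, "mypool", &ioctx);

      rados_striper_create(ioctx, &striper);   /* striper layered on the ioctx    */
      rados_striper_write(striper, "myobject", buf, strlen(buf), 0);
      rados_striper_destroy(striper);

      rados_ioctx_destroy(ioctx);
      rados_shutdown(cluster);
      return 0;
  }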

On 2015-09-17 01:05, Paul Mansfield wrote:
> Hello,
> I'm using the C interface librados striper and am looking for examples
> on how to use it.
>
>
> Please can someone point me to any useful code snippets? All I've found
> so far is the source code :-(
>
> Thanks very much
> Paul
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cant get cluster to become healthy. "stale+undersized+degraded+peered"

2015-09-16 Thread Jonas Björklund


On Wed, 16 Sep 2015, Stefan Eriksson wrote:

I have a completely new cluster for testing and its three servers which all 
are monitors and hosts for OSD, they each have one disk.

The issue is ceph status shows: 64 stale+undersized+degraded+peered

health:

health HEALTH_WARN
   clock skew detected on mon.ceph01-osd03
   64 pgs degraded
   64 pgs stale
   64 pgs stuck degraded
   64 pgs stuck inactive
   64 pgs stuck stale
   64 pgs stuck unclean
   64 pgs stuck undersized
   64 pgs undersized
   too few PGs per OSD (21 < min 30)
   Monitor clock skew detected
monmap e1: 3 mons at 
{ceph01-osd01=192.1.41.51:6789/0,ceph01-osd02=192.1.41.52:6789/0,ceph01-osd03=192.1.41.53:6789/0}
   election epoch 82, quorum 0,1,2 
ceph01-osd01,ceph01-osd02,ceph01-osd03

osdmap e36: 3 osds: 3 up, 3 in
 pgmap v85: 64 pgs, 1 pools, 0 bytes data, 0 objects
   101352 kB used, 8365 GB / 8365 GB avail
 64 stale+undersized+degraded+peered


To start, you can add more PGs and set up NTPd on your servers.
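Concretely, something along these lines (assuming the single pool is the
default rbd pool; the PG count is only an example, pick one that suits 3 OSDs):

  ceph osd pool set rbd pg_num 128
  ceph osd pool set rbd pgp_num 128

and run ntpd on all three monitor hosts so the clock-skew warning clears.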

/Jonas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com