Re: [ceph-users] Improving Performance with more OSD's?

2015-01-04 Thread Udo Lembke
Hi Lindsay,

On 05.01.2015 06:52, Lindsay Mathieson wrote:
> ...
> So two OSD Nodes had:
> - Samsung 840 EVO SSD for Op. Sys.
> - Intel 530 SSD for Journals (10GB Per OSD)
> - 3TB WD Red
> - 1 TB WD Blue
> - 1 TB WD Blue
> - Each disk weighted at 1.0
> - Primary affinity of the WD Red (slow) set to 0
The weight should reflect the size of the filesystem. With a weight of 1 for all
disks you will run into trouble as the cluster fills up, because the 1TB disks
will be full before the 3TB disk!

You should have something like 0.9 for the 1TB disks and 2.82 for the 3TB
disk ( "df -k | grep osd | awk '{print $2/(1024^3) }' " ).

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Improving Performance with more OSD's?

2015-01-04 Thread Lindsay Mathieson
Well I upgraded my cluster over the weekend :)
To each node I added:
- Intel SSD 530 for journals
- 2 * 1TB WD Blue

So two OSD Nodes had:
- Samsung 840 EVO SSD for Op. Sys.
- Intel 530 SSD for Journals (10GB Per OSD)
- 3TB WD Red
- 1 TB WD Blue
- 1 TB WD Blue
- Each disk weighted at 1.0
- Primary affinity of the WD Red (slow) set to 0
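
(For reference, the primary-affinity setting above can be applied from the CLI;
a sketch assuming the WD Red is osd.0, which is a placeholder. Depending on the
release, the monitors may have to allow it first via
"mon osd allow primary affinity = true".)

    ceph osd primary-affinity osd.0 0     # never pick osd.0 as primary
    ceph osd primary-affinity osd.0 1.0   # restore the default later if needed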

It took about 8 hours for 1TB of data to rebalance over the OSDs.

Very pleased with results so far.

rados benchmark:
- Write bandwidth has increased from 49 MB/s to 140 MB/s
- Reads have stayed roughly the same at 500 MB/s
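
(For reference, numbers of this kind can be reproduced with rados bench; the
pool name and duration below are placeholders, and --no-cleanup keeps the
benchmark objects around so the read test has something to read:)

    rados -p rbd bench 60 write --no-cleanup   # 60s sequential-write benchmark
    rados -p rbd bench 60 seq                  # read back the benchmark objects
    rados -p rbd cleanup                       # remove the benchmark objects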

VM Benchmarks:
- They have actually stayed much the same, but there is more "depth" - multiple VMs
share the bandwidth nicely.

Users are finding their VMs *much* less laggy.

Thanks for all the help and suggestions.

Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Worthwhile setting up Cache tier with small leftover SSD partions?

2015-01-04 Thread Lindsay Mathieson
On 5 January 2015 at 13:02, Christian Balzer  wrote:

> On Fri, 02 Jan 2015 06:38:49 +1000 Lindsay Mathieson wrote:
>
>
> If you research the ML archives you will find that cache tiering currently
> isn't just fraught with peril (there are bugs) but most importantly isn't
> really that fast.
>


Yah, I had wondered about that. Also, it seems to involve a lot of manual
tinkering with the crush map, which I really want to avoid.



>
>
> Also given your setup, you should be able to saturate your network now, so
> probably negating the need for super fast storage to some extent.
>


Agreed - now that I have it installed and configured, performance all round has
vastly improved - users are already commenting that their VMs are much more
responsive.

Pretty sure that we are now at a stage where I can just leave it alone :)

Though the boss now wants to migrate the big ass vSphere server to Proxmox
(KVM), so I could use it as a third OSD server...


Thanks for the help, much appreciated,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to remove mds from cluster

2015-01-04 Thread Lindsay Mathieson
Did you remove the mds.0 entry from ceph.conf?
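
(For reference, the kind of ceph.conf entry meant is a section like the sketch
below - the name matches the hostname in this thread and is only an example.
Removing it and stopping the daemon keeps the old MDS from coming back after it
has been removed from the map.)

    [mds.ceph06-vm]
        host = ceph06-vm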

On 5 January 2015 at 14:13, debian Only  wrote:

> I have tried 'ceph mds newfs 1 0 --yes-i-really-mean-it' but it did not fix
> the problem
>
> 2014-12-30 17:42 GMT+07:00 Lindsay Mathieson 
> :
>
>>  On Tue, 30 Dec 2014 03:11:25 PM debian Only wrote:
>>
>> > ceph 0.87 , Debian 7.5,   anyone can help ?
>>
>> >
>>
>> > 2014-12-29 20:03 GMT+07:00 debian Only :
>>
>> > i want to move mds from one host to another.
>>
>> >
>>
>> > how to do it ?
>>
>> >
>>
>> > what did i do as below, but ceph health not ok, mds was not removed :
>>
>> >
>>
>> > root@ceph06-vm:~# ceph mds rm 0 mds.ceph06-vm
>>
>> > mds gid 0 dne
>>
>> >
>>
>> > root@ceph06-vm:~# ceph health detail
>>
>> > HEALTH_WARN mds ceph06-vm is laggy
>>
>> > mds.ceph06-vm at 192.168.123.248:6800/4350 is laggy/unresponsive
>>
>>
>>
>> I removed an mds using this guide:
>>
>>
>>
>>
>> http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/
>>
>>
>>
>> and ran into your problem, which is also mentioned there.
>>
>>
>>
>> resolved it using the guide suggestion:
>>
>>
>>
>> $ ceph mds newfs metadata data --yes-i-really-mean-it
>>
>>
>>
>> --
>>
>> Lindsay
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to remove mds from cluster

2015-01-04 Thread debian Only
I have tried 'ceph mds newfs 1 0 --yes-i-really-mean-it' but it did not fix
the problem.

2014-12-30 17:42 GMT+07:00 Lindsay Mathieson :

>  On Tue, 30 Dec 2014 03:11:25 PM debian Only wrote:
>
> > ceph 0.87 , Debian 7.5,   anyone can help ?
>
> >
>
> > 2014-12-29 20:03 GMT+07:00 debian Only :
>
> > i want to move mds from one host to another.
>
> >
>
> > how to do it ?
>
> >
>
> > what did i do as below, but ceph health not ok, mds was not removed :
>
> >
>
> > root@ceph06-vm:~# ceph mds rm 0 mds.ceph06-vm
>
> > mds gid 0 dne
>
> >
>
> > root@ceph06-vm:~# ceph health detail
>
> > HEALTH_WARN mds ceph06-vm is laggy
>
> > mds.ceph06-vm at 192.168.123.248:6800/4350 is laggy/unresponsive
>
>
>
> I removed an mds using this guide:
>
>
>
>
> http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/
>
>
>
> and ran into your problem, which is also mentioned there.
>
>
>
> resolved it using the guide suggestion:
>
>
>
> $ ceph mds newfs metadata data --yes-i-really-mean-it
>
>
>
> --
>
> Lindsay
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Worthwhile setting up Cache tier with small leftover SSD partions?

2015-01-04 Thread Christian Balzer
On Fri, 02 Jan 2015 06:38:49 +1000 Lindsay Mathieson wrote:

> Expanding my tiny ceph setup from 2 OSD's to six, and two extra SSD's
> for journals (IBM 530 120GB)
> 
> Yah, I know the 5300's would be much better 
> 
S5700's really, if you look at the prices and especially TBW/$.

> Assuming I use 10GB per OSD for journal and 5GB spare to improve the SSD
> lifetime, that leaves 85GB spare per SSD.
> 
> 
> Is it worthwhile setting up a 2 *85GB OSD Cache Tier (Replica 2)? Usage
> is for approx 15 Active VM's, used mainly for development and light
> database work.
> 

If you research the ML archives you will find that cache tiering currently
isn't just fraught with peril (there are bugs) but most importantly isn't
really that fast.

You're likely better off making this a dedicated pool for DB storage, etc.
Note that getting even a fraction of the performance of those SSDs will
require quite a lot of CPU power.
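
For what it's worth, a dedicated SSD-backed pool can be built with the CLI
rather than by hand-editing the decompiled crush map; a rough sketch under the
syntax of that era (bucket, rule, pool names and the OSD id are made up, and
newer releases call the pool setting crush_rule instead of crush_ruleset):

    ceph osd crush add-bucket ssd root                   # separate root for SSD OSDs
    ceph osd crush add-bucket node1-ssd host
    ceph osd crush move node1-ssd root=ssd
    ceph osd crush create-or-move osd.6 0.08 root=ssd host=node1-ssd
    ceph osd crush rule create-simple ssd-rule ssd host  # replicate across hosts under 'ssd'
    ceph osd pool create ssd-pool 128 128
    ceph osd pool set ssd-pool crush_ruleset <rule-id>   # id from 'ceph osd crush rule dump'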

Also given your setup, you should be able to saturate your network now, so
probably negating the need for super fast storage to some extent.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Chen, Xiaoxi
You could use 'rbd info <volume>' to see the block_name_prefix. Object names
have the form <block_name_prefix>.<object_index>, so for example
rb.0.ff53.3d1b58ba.e6ad is an object (the hex suffix is the object index) of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

 $ rbd info huge
rbd image 'huge':
 size 1024 TB in 268435456 objects
 order 22 (4096 kB objects)
 block_name_prefix: rb.0.8a14.2ae8944a
 format: 1
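
So to see (or count) the objects belonging to one particular image, the prefix
reported by 'rbd info' can simply be grepped out of the pool listing; a sketch
with the pool name as a placeholder:

    rados -p <pool> ls | grep '^rb.0.8a14.2ae8944a' | wc -l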

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Edwin 
Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd 
command line tools understand?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:
>
>
> On Sunday, January 4, 2015, Dyweni - Ceph-Users 
> <6exbab4fy...@dyweni.com > wrote:
>
> Hi,
>
> If it's the only thing in your pool, you could try deleting the
> pool instead.
>
> I found that to be faster in my testing; I had created 500TB when
> I meant to create 500GB.
>
> Note for the Devs: it would be nice if rbd create/resize would
> accept sizes with units (i.e. MB GB TB PB, etc).
>
>
>
>
> On 2015-01-04 08:45, Edwin Peer wrote:
>
> Hi there,
>
> I did something stupid while growing an rbd image. I accidentally
> mistook the units of the resize command for bytes instead of
> megabytes
> and grew an rbd image to 650PB instead of 650GB. This all happened
> instantaneously enough, but trying to rectify the mistake is
> not going
> nearly as well.
>
> 
> ganymede ~ # rbd resize --size 665600 --allow-shrink
> client-disk-img0/vol-x318644f-0
> Resizing image: 1% complete...
> 
>
> It took a couple days before it started showing 1% complete
> and has
> been stuck on 1% for a couple more. At this rate, I should be
> able to
> shrink the image back to the intended size in about 2016.
>
> Any ideas?
>
> Regards,
> Edwin Peer
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> You can just delete the rbd header. See Sebastien's excellent blog:
>
> http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your
> -ceph-cluster/
>
> Jake
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Edwin Peer

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd 
command line tools understand?


Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:



On Sunday, January 4, 2015, Dyweni - Ceph-Users 
<6exbab4fy...@dyweni.com > wrote:


Hi,

If it's the only thing in your pool, you could try deleting the
pool instead.

I found that to be faster in my testing; I had created 500TB when
I meant to create 500GB.

Note for the Devs: it would be nice if rbd create/resize would
accept sizes with units (i.e. MB GB TB PB, etc).




On 2015-01-04 08:45, Edwin Peer wrote:

Hi there,

I did something stupid while growing an rbd image. I accidentally
mistook the units of the resize command for bytes instead of
megabytes
and grew an rbd image to 650PB instead of 650GB. This all happened
instantaneously enough, but trying to rectify the mistake is
not going
nearly as well.


ganymede ~ # rbd resize --size 665600 --allow-shrink
client-disk-img0/vol-x318644f-0
Resizing image: 1% complete...


It took a couple days before it started showing 1% complete
and has
been stuck on 1% for a couple more. At this rate, I should be
able to
shrink the image back to the intended size in about 2016.

Any ideas?

Regards,
Edwin Peer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


You can just delete the rbd header. See Sebastien's excellent blog:

http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/

Jake


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Andrey Korolyov
On Sun, Jan 4, 2015 at 10:43 PM, Edwin Peer  wrote:
> Thanks Jake, however, I need to shrink the image, not delete it as it
> contains a live customer image. Is it possible to manually edit the RBD
> header to make the necessary adjustment?
>

Technically speaking, yes - the size information is contained in the omap
attributes of the header, but I'd recommend you test this approach somewhere
else first. Even if it works, it will leave a lot of stray files in the
filestore. If you are running a VM on top of it with recent qemu, it's easy to
tell the emulator the desired block geometry (size), then launch a drive-mirror
job and swap the backing image at a convenient point (during a power cycle, for
example). Although the second option may work, I am not absolutely sure that
the drive-mirror job will respect the geometry override, so better check this
too.
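
A rough sketch of the kind of drive-mirror call meant, issued through libvirt's
QMP passthrough (domain name, device alias and target are placeholders;
"mode": "existing" assumes a correctly sized target image was created
beforehand):

    virsh qemu-monitor-command myvm --pretty '{
      "execute": "drive-mirror",
      "arguments": { "device": "drive-virtio-disk0",
                     "target": "rbd:client-disk-img0/vol-new",
                     "format": "raw",
                     "sync": "full",
                     "mode": "existing" } }'
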
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Data recovery after RBD I/O error

2015-01-04 Thread Jérôme Poulin
Happy holiday everyone,

TL;DR: Hardware corruption is really bad; if btrfs-restore works, the kernel
Btrfs driver can be made to work too!

I'm cross-posting this message since the root cause of this problem is the
Ceph RBD device; however, my main concern is data loss from a BTRFS
filesystem hosted on that device.

I'm running a file server which is a staging area for rsync backups of many
folders and also a snapshot store, which lets me recover older files and
folders much faster, while our backup is still exported to an EXT4 filesystem
using rdiff-backup.

The server is running Debian Wheezy with kernel 3.16 and I already had
corruption on this volume before. I had to copy the whole device, and since we
now had a working Ceph cluster, I copied the volume using «btrfs send» to
another BTRFS hosted on an RBD device. The corruption was not causing any
issue for reading; however, when writing, the volume would switch to read-only
once in a while.

On the first day of the new year, I woke up to the monitoring telling me the
FS on the server had switched to read-only. I took a look at dmesg and saw
some I/O errors from the RBD device. I was unable to unmount it but had full
access to the data, so I wanted to reboot to see if the glitch would clear now
that the I/O errors were gone. After the reboot, the BTRFS would not mount
anymore.


After trying the usual (read-only mount, recovery mount, btrfsck --repair on a
snapshot), only btrfs-restore worked. btrfs-restore could restore everything,
but my data was in snapshots, the regex option was not working correctly, and
it didn't restore file attributes (normal/extended) even with -x. I used
btrfs-tools 3.18.

This is what I was getting:
[   31.582823] parent transid verify failed on 308470693888 wanted
91730 found 90755
[   31.584738] parent transid verify failed on 308470693888 wanted
91730 found 90755
[   31.584743] BTRFS: Failed to read block groups: -5

After looking at the code a bit, I made the change below to get BTRFS recovery
working and rsync my stuff. I also tried to use btrfs send by forcing it to
use a read/write snapshot, since the whole volume is read-only anyway, but
that failed with oopses.

Patch for recovery
---
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0229c37..aed4062 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2798,7 +2798,8 @@ retry_root_backup:
ret = btrfs_read_block_groups(extent_root);
if (ret) {
printk(KERN_ERR "BTRFS: Failed to read block groups:
%d\n", ret);
-   goto fail_sysfs;
+   if (!btrfs_test_opt(tree_root, RECOVERY))
+   goto fail_sysfs;
}
fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
---
Also: http://pastebin.com/YPY3eMMX
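
With that patch applied, the degraded filesystem can be mounted read-only with
the recovery option and the data copied off; a sketch (device and paths are
placeholders):

    mount -t btrfs -o ro,recovery /dev/rbd0 /mnt/restore
    rsync -aHAX /mnt/restore/ /backup/restore/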


Trace when forcing BTRFS send on my R/O volume with R/W subvolume:
[ cut here ]
WARNING: CPU: 3 PID: 27883 at fs/btrfs/send.c:5533
btrfs_ioctl_send+0x8c9/0xfa0 [btrfs]()
Modules linked in: btrfs(O) ufs qnx4 hfsplus hfs minix ntfs vfat msdos
fat jfs xfs reiserfs vhost_net vhost macvtap macvlan tun
ip6table_filter ip6_tabl
es ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT cbc
rbd libceph xt_CHECKSUM iptable_mangle libcrc32c xt_tcpudp ip
table_filter ip_tables x_tables parport_pc ppdev lp parport ib_iser
rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss
oid_registry n
fs_acl nfs lockd fscache sunrpc bridge fuse ipmi_devintf 8021q garp
stp mrp llc loop iTCO_wdt iTCO_vendor_support ttm drm_kms_helper
pcspkr drm evdev lpc_ich i2c_algo_bit i2c_core mfd_core i7core_edac
processor edac_core button coretemp tpm_tis tpm dcdbas kvm_intel
acpi_power_meter ipmi_si thermal_sys ipmi_msghandler kvm ext4 crc16
mbcache jbd2 dm_mod raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx xor ra
Jan  2 18:55:43 CASRV0104 kernel: id6_pq raid1 md_mod sg sd_mod
crc_t10dif crct10dif_common mvsas libsas ehci_pci ehci_hcd bnx2
crc32c_intel libata scsi_transport_sas scsi_mod usbcore usb_common
[last
unloaded: btrfs]
CPU: 3 PID: 27883 Comm: btrfs Tainted: G   O
3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Hardware name: Dell Inc. PowerEdge R310/05XKKK, BIOS 1.5.2 10/15/2010
  a0a52557 81541f8f 
 8106cecc 8800ba625a00 8803152da000 7fffa69f7ab0
 880312f2d1e0 8800ba625a00 a0a419c9 
Call Trace:
 [] ? dump_stack+0x41/0x51
 [] ? warn_slowpath_common+0x8c/0xc0
 [] ? btrfs_ioctl_send+0x8c9/0xfa0 [btrfs]
 [] ? __alloc_pages_nodemask+0x165/0xbb0
 [] ? dput+0x31/0x1a0
 [] ? cache_alloc_refill+0x92/0x2e0
 [] ? btrfs_ioctl+0x1a50/0x2890 [btrfs]
 [] ? alloc_pid+0x1e8/0x4d0
 [] ? set_task_cpu+0x82/0x1d0
 [] ? cpumask_next_and+0x30/0x40
 [] ? select

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Edwin Peer
Thanks Jake, however, I need to shrink the image, not delete it as it 
contains a live customer image. Is it possible to manually edit the RBD 
header to make the necessary adjustment?


Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:



On Sunday, January 4, 2015, Dyweni - Ceph-Users 
<6exbab4fy...@dyweni.com > wrote:


Hi,

If it's the only thing in your pool, you could try deleting the
pool instead.

I found that to be faster in my testing; I had created 500TB when
I meant to create 500GB.

Note for the Devs: it would be nice if rbd create/resize would
accept sizes with units (i.e. MB GB TB PB, etc).




On 2015-01-04 08:45, Edwin Peer wrote:

Hi there,

I did something stupid while growing an rbd image. I accidentally
mistook the units of the resize command for bytes instead of
megabytes
and grew an rbd image to 650PB instead of 650GB. This all happened
instantaneously enough, but trying to rectify the mistake is
not going
nearly as well.


ganymede ~ # rbd resize --size 665600 --allow-shrink
client-disk-img0/vol-x318644f-0
Resizing image: 1% complete...


It took a couple days before it started showing 1% complete
and has
been stuck on 1% for a couple more. At this rate, I should be
able to
shrink the image back to the intended size in about 2016.

Any ideas?

Regards,
Edwin Peer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


You can just delete the rbd header. See Sebastien's excellent blog:

http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/

Jake


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Jake Young
On Sunday, January 4, 2015, Dyweni - Ceph-Users <6exbab4fy...@dyweni.com>
wrote:

> Hi,
>
> If it's the only thing in your pool, you could try deleting the pool
> instead.
>
> I found that to be faster in my testing; I had created 500TB when I meant
> to create 500GB.
>
> Note for the Devs: it would be nice if rbd create/resize would accept sizes
> with units (i.e. MB GB TB PB, etc).
>
>
>
>
> On 2015-01-04 08:45, Edwin Peer wrote:
>
>> Hi there,
>>
>> I did something stupid while growing an rbd image. I accidentally
>> mistook the units of the resize command for bytes instead of megabytes
>> and grew an rbd image to 650PB instead of 650GB. This all happened
>> instantaneously enough, but trying to rectify the mistake is not going
>> nearly as well.
>>
>> 
>> ganymede ~ # rbd resize --size 665600 --allow-shrink
>> client-disk-img0/vol-x318644f-0
>> Resizing image: 1% complete...
>> 
>>
>> It took a couple days before it started showing 1% complete and has
>> been stuck on 1% for a couple more. At this rate, I should be able to
>> shrink the image back to the intended size in about 2016.
>>
>> Any ideas?
>>
>> Regards,
>> Edwin Peer
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

You can just delete the rbd header. See Sebastien's excellent blog:

http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/
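
For a format 1 image the header is a small object named '<image>.rbd' in the
same pool, and the data objects all share the block_name_prefix; a sketch of
inspecting (and, per the suggestion above, deleting) it directly - placeholders
throughout, and obviously destructive, so not something to run against a live
image:

    rados -p <pool> stat <image>.rbd                  # the format 1 header object
    rados -p <pool> ls | grep '^rb.0.xxxx.yyyyyyyy'   # that image's data objects
    rados -p <pool> rm <image>.rbd                    # "delete the rbd header"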

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2015-01-04 Thread hemant burman
Hello,

Is there an answer to why this is happening? I am facing the same issue: I
have the non-system user replicated to the slave zone but am still getting
403, and the same thing happens when I replicate from the master zone of the
master region to the master zone of the secondary region.
I am using Swift, and have created a non-system user for it.
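
For what it's worth, one way to confirm what actually reached the secondary
zone is to query it with radosgw-admin pointed at that zone's gateway instance
(the instance name and uid below are placeholders):

    radosgw-admin metadata list user -n client.radosgw.<secondary-instance>
    radosgw-admin user info --uid=<user> -n client.radosgw.<secondary-instance>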

-Hemant

On Tue, Nov 25, 2014 at 12:37 AM, Craig Lewis 
wrote:

> I'm really not sure.  I'm using the S3 interface rather than the Swift
> interface.  Once my non-system user replicated, I was able to access
> everything in the secondary cluster just fine.
>
> Hopefully somebody else with Swift experience will chime in.
>
>
>
> On Sat, Nov 22, 2014 at 12:47 AM, Vinod H I  wrote:
>
>> Thanks for the clarification.
>> Now I have done exactly as you suggested.
>> "us-east" is the master zone and "us-west" is the secondary zone.
>> Each zone has two system users "us-east" and "us-west".
>> These system users have same access/secret keys in both zones.
>> I have checked the pools to confirm that the non-system swift user which
>> i created("east-user:swift") in the primary has been replicated to the
>> secondary zone.
>> The buckets which are created in primary by the swift user are also there
>> in the pools of the secondary zone.
>> But when i try to authenticate this swift user in secondary zone, it says
>> access denied.
>>
>> Here are the relevant logs from the secondary zone, when i try to
>> authenticate the swift user.
>>
>> 2014-11-22 14:19:14.239976 7f73ecff9700  2
>> RGWDataChangesLog::ChangesRenewThread: start
>> 2014-11-22 14:19:14.243454 7f73fe236780 20 get_obj_state: rctx=0x2316ce0
>> obj=.us.rgw.root:region_info.us state=0x2319048 s->prefetch_data=0
>> 2014-11-22 14:19:14.243454 7f73fe236780 10 cache get: name=.us.rgw.root+
>> region_info.us : miss
>> 2014-11-22 14:19:14.252263 7f73fe236780 10 cache put: name=.us.rgw.root+
>> region_info.us
>> 2014-11-22 14:19:14.252283 7f73fe236780 10 adding .us.rgw.root+
>> region_info.us to cache LRU end
>> 2014-11-22 14:19:14.252310 7f73fe236780 20 get_obj_state: s->obj_tag was
>> set empty
>> 2014-11-22 14:19:14.252336 7f73fe236780 10 cache get: name=.us.rgw.root+
>> region_info.us : type miss (requested=1, cached=6)
>> 2014-11-22 14:19:14.252376 7f73fe236780 20 get_obj_state: rctx=0x2316ce0
>> obj=.us.rgw.root:region_info.us state=0x2319958 s->prefetch_data=0
>> 2014-11-22 14:19:14.252386 7f73fe236780 10 cache get: name=.us.rgw.root+
>> region_info.us : hit
>> 2014-11-22 14:19:14.252391 7f73fe236780 20 get_obj_state: s->obj_tag was
>> set empty
>> 2014-11-22 14:19:14.252404 7f73fe236780 20 get_obj_state: rctx=0x2316ce0
>> obj=.us.rgw.root:region_info.us state=0x2319958 s->prefetch_data=0
>> 2014-11-22 14:19:14.252409 7f73fe236780 20 state for obj=.us.rgw.root:
>> region_info.us is not atomic, not appending atomic test
>> 2014-11-22 14:19:14.252412 7f73fe236780 20 rados->read obj-ofs=0
>> read_ofs=0 read_len=524288
>> 2014-11-22 14:19:14.264611 7f73fe236780 20 rados->read r=0 bl.length=266
>> 2014-11-22 14:19:14.264650 7f73fe236780 10 cache put: name=.us.rgw.root+
>> region_info.us
>> 2014-11-22 14:19:14.264653 7f73fe236780 10 moving .us.rgw.root+
>> region_info.us to cache LRU end
>> 2014-11-22 14:19:14.264766 7f73fe236780 20 get_obj_state: rctx=0x2319860
>> obj=.us-west.rgw.root:zone_info.us-west state=0x2313b98 s->prefetch_data=0
>> 2014-11-22 14:19:14.264779 7f73fe236780 10 cache get:
>> name=.us-west.rgw.root+zone_info.us-west : miss
>> 2014-11-22 14:19:14.276114 7f73fe236780 10 cache put:
>> name=.us-west.rgw.root+zone_info.us-west
>> 2014-11-22 14:19:14.276131 7f73fe236780 10 adding
>> .us-west.rgw.root+zone_info.us-west to cache LRU end
>> 2014-11-22 14:19:14.276142 7f73fe236780 20 get_obj_state: s->obj_tag was
>> set empty
>> 2014-11-22 14:19:14.276161 7f73fe236780 10 cache get:
>> name=.us-west.rgw.root+zone_info.us-west : type miss (requested=1, cached=6)
>> 2014-11-22 14:19:14.276203 7f73fe236780 20 get_obj_state: rctx=0x2314660
>> obj=.us-west.rgw.root:zone_info.us-west state=0x2313b98 s->prefetch_data=0
>> 2014-11-22 14:19:14.276212 7f73fe236780 10 cache get:
>> name=.us-west.rgw.root+zone_info.us-west : hit
>> 2014-11-22 14:19:14.276218 7f73fe236780 20 get_obj_state: s->obj_tag was
>> set empty
>> 2014-11-22 14:19:14.276229 7f73fe236780 20 get_obj_state: rctx=0x2314660
>> obj=.us-west.rgw.root:zone_info.us-west state=0x2313b98 s->prefetch_data=0
>> 2014-11-22 14:19:14.276235 7f73fe236780 20 state for
>> obj=.us-west.rgw.root:zone_info.us-west is not atomic, not appending atomic
>> test
>> 2014-11-22 14:19:14.276238 7f73fe236780 20 rados->read obj-ofs=0
>> read_ofs=0 read_len=524288
>> 2014-11-22 14:19:14.290757 7f73fe236780 20 rados->read r=0 bl.length=997
>> 2014-11-22 14:19:14.290797 7f73fe236780 10 cache put:
>> name=.us-west.rgw.root+zone_info.us-west
>> 2014-11-22 14:19:14.290803 7f73fe236780 10 moving
>> .us-west.rgw.root+zone_info.us-west to cache LRU end
>> 2014-11-22 14:19:14.290857 7f73fe236

Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Lionel Bouton
On 01/04/15 16:25, Jiri Kanicky wrote:
> Hi.
>
> I have been experiencing the same issues on both nodes over the past 2
> days (never both nodes at the same time). It seems the issue occurs
> after some time when copying a large number of files to CephFS on my
> client node (I don't use RBD yet).
>
> These are new HP servers and the memory does not seem to have any
> issues in mem test. I use SSD for the OS and normal drives for the OSDs. I
> think that the issue is not related to the drives, as it would be too much
> of a coincidence to have 6 drives with bad blocks on both nodes.

The kernel can't allocate enough memory for btrfs, see this:

Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020

and this:

Jan  4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device sdb1)
in create_pending_snapshot:1334: errno=-12 Out of memory

OSDs need a lot of memory: 1GB during normal operation and probably
around 2GB during resynchronisations (at least my monitoring very rarely
detects them going past this limit). So you probably had a short spike of
memory usage (some of which can't be moved to swap: kernel memory and
mlocked memory).

Even if you don't use Btrfs, if you want to avoid headaches when
replacing / repairing / ... OSDs, you probably want to put at least 4GB in
your servers instead of 2GB.



I didn't realize there were BTRFS configuration options until now; there are:
filestore btrfs snap
filestore btrfs clone range
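
They go in the [osd] section of ceph.conf; a sketch of the syntax only,
assuming the usual defaults of true for both (whether to turn them off is
exactly the question below):

    [osd]
        filestore btrfs snap = false          # disable snapshot-based commits
        filestore btrfs clone range = true    # default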

I believed that the single write for both the journal and store updates
in BTRFS depended on snapshots, but "clone range" may hint that
this is supported independently.

Could anyone familiar with Ceph internals elaborate on what the
consequences of (de)activating the two configuration options above are
(expected performance gains? Additional Ceph features?).

Best regards,

Lionel Bouton


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi.

Do you know how to tell that the option "filestore btrfs snap = false" 
is set?


Thx Jiri

On 5/01/2015 02:25, Jiri Kanicky wrote:

Hi.

I have been experiencing the same issues on both nodes over the past 2
days (never both nodes at the same time). It seems the issue occurs
after some time when copying a large number of files to CephFS on my
client node (I don't use RBD yet).


These are new HP servers and the memory does not seem to have any
issues in mem test. I use SSD for the OS and normal drives for the OSDs. I
think that the issue is not related to the drives, as it would be too much
of a coincidence to have 6 drives with bad blocks on both nodes.


I will also disable the snapshots and report back after a few days.

Thx Jiri


On 5/01/2015 01:33, Dyweni - Ceph-Users wrote:



On 2015-01-04 08:21, Jiri Kanicky wrote:


More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html 



Linux 3.14.1 is affected by serious Btrfs regression(s) that were 
fixed in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file 
system if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that 
snapshotting can be

disabled with "filestore btrfs snap = false".


I am wondering if this can be the problem.




Very interesting... I think I was just hit with that over night. :)

Yes, I would definitely recommend turning off snapshots.  I'm going 
to do that myself now.


Have you tested the memory in your server lately?  Memtest86+ on the 
ram, and badblocks on the SSD swap partition?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi.

I have been experiencing the same issues on both nodes over the past 2 days
(never both nodes at the same time). It seems the issue occurs after
some time when copying a large number of files to CephFS on my client
node (I don't use RBD yet).


These are new HP servers and the memory does not seem to have any issues
in mem test. I use SSD for the OS and normal drives for the OSDs. I think that
the issue is not related to the drives, as it would be too much of a
coincidence to have 6 drives with bad blocks on both nodes.


I will also disable the snapshots and report back after a few days.

Thx Jiri


On 5/01/2015 01:33, Dyweni - Ceph-Users wrote:



On 2015-01-04 08:21, Jiri Kanicky wrote:


More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html 



Linux 3.14.1 is affected by serious Btrfs regression(s) that were 
fixed in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file 
system if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting 
can be

disabled with "filestore btrfs snap = false".


I am wondering if this can be the problem.




Very interesting... I think I was just hit with that over night. :)

Yes, I would definitely recommend turning off snapshots.  I'm going to 
do that myself now.


Have you tested the memory in your server lately?  Memtest86+ on the 
ram, and badblocks on the SSD swap partition?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Dyweni - Ceph-Users

Hi,

If it's the only thing in your pool, you could try deleting the pool
instead.


I found that to be faster in my testing; I had created 500TB when I
meant to create 500GB.


Note for the Devs: it would be nice if rbd create/resize would accept
sizes with units (i.e. MB GB TB PB, etc).





On 2015-01-04 08:45, Edwin Peer wrote:

Hi there,

I did something stupid while growing an rbd image. I accidentally
mistook the units of the resize command for bytes instead of megabytes
and grew an rbd image to 650PB instead of 650GB. This all happened
instantaneously enough, but trying to rectify the mistake is not going
nearly as well.


ganymede ~ # rbd resize --size 665600 --allow-shrink
client-disk-img0/vol-x318644f-0
Resizing image: 1% complete...


It took a couple days before it started showing 1% complete and has
been stuck on 1% for a couple more. At this rate, I should be able to
shrink the image back to the intended size in about 2016.

Any ideas?

Regards,
Edwin Peer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Edwin Peer

Hi there,

I did something stupid while growing an rbd image. I accidentally 
mistook the units of the resize command for bytes instead of megabytes 
and grew an rbd image to 650PB instead of 650GB. This all happened 
instantaneously enough, but trying to rectify the mistake is not going 
nearly as well.



ganymede ~ # rbd resize --size 665600 --allow-shrink 
client-disk-img0/vol-x318644f-0

Resizing image: 1% complete...


It took a couple days before it started showing 1% complete and has been 
stuck on 1% for a couple more. At this rate, I should be able to shrink 
the image back to the intended size in about 2016.
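
For reference, the arithmetic behind the blow-up, assuming the byte count was
passed to --size (which rbd interprets as megabytes):

    650 GB intended          = 650 * 1024 MB           = 665600 MB   (the value used above)
    650 GiB written in bytes = 650 * 2^30               = 697932185600
    697932185600 taken as MB = 650 * 2^30 * 2^20 bytes  = 650 * 2^50 bytes = 650 PiB

which matches the ~650PB the image ended up at.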


Any ideas?

Regards,
Edwin Peer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Dyweni - Ceph-Users



On 2015-01-04 08:21, Jiri Kanicky wrote:


More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed 
in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file system 
if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting 
can be

disabled with "filestore btrfs snap = false".


I am wondering if this can be the problem.




Very interesting... I think I was just hit with that over night.  :)

Yes, I would definitely recommend turning off snapshots.  I'm going to 
do that myself now.


Have you tested the memory in your server lately?  Memtest86+ on the 
ram, and badblocks on the SSD swap partition?
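
A sketch of the badblocks part (the device is a placeholder; -sv is the
non-destructive read-only test, so the swap signature survives, and Memtest86+
itself has to be run from the boot menu):

    swapoff /dev/sda2
    badblocks -sv /dev/sda2
    swapon /dev/sda2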




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi.

Correction: my SWAP is 3GB on an SSD disk. I don't use the nodes for client
stuff.


Thx Jiri

On 5/01/2015 01:21, Jiri Kanicky wrote:

Hi,

Here is my memory output. I use HP Microservers with 2GB RAM. Swap is 
500MB on SSD disk.


cephadmin@ceph1:~$ free
             total       used       free     shared    buffers     cached
Mem:       1885720    1817860      67860          0         32     694552
-/+ buffers/cache:    1123276     762444
Swap:      3859452     633492    3225960

More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

Linux 3.14.1 is affected by serious Btrfs regression(s) that were 
fixed in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file 
system if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting 
can be

disabled with "filestore btrfs snap = false".


I am wondering if this can be the problem.

-Jiri


On 5/01/2015 01:17, Dyweni - BTRFS wrote:

Hi,

BTRFS crashed because the system ran out of memory...

I see these entries in your logs:


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020



Jan  4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device
sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory



Jan  4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device
sdb1) in cleanup_transaction:1577: errno=-12 Out of memory



How much memory do you have in this node?  Where you using Ceph
(as a client) on this node?  Do you have swap configured on this
node?









On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this 
state:


cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs
stuck undersized; 631 pgs undersized; recovery 397226/915548 objects
degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub
errors
 monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 30, quorum 0,1 ceph1,ceph2
 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548
objects misplaced (7.867%)
  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is a pure BTRFS issue or is there any setting I
forgot to use?

Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm:
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian
3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP
ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704] 
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707] 811519ed
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710] 
0001 880075de0c08 0096
Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [] ?
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [] ?
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [] ?
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [] ?
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [] ?
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [] ?
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [] ?
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [] ?
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762] [] ?
btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535774] [] ?
btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan  4 17:11

Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi,

Here is my memory output. I use HP Microservers with 2GB RAM. Swap is 
500MB on SSD disk.


cephadmin@ceph1:~$ free
             total       used       free     shared    buffers     cached
Mem:       1885720    1817860      67860          0         32     694552
-/+ buffers/cache:    1123276     762444
Swap:      3859452     633492    3225960

More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in
later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file system if
OSDs are using snapshots (which is the default). Due to kernel bugs related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting can be
disabled with "filestore btrfs snap = false".


I am wondering if this can be the problem.

-Jiri


On 5/01/2015 01:17, Dyweni - BTRFS wrote:

Hi,

BTRFS crashed because the system ran out of memory...

I see these entries in your logs:


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020



Jan  4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device
sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory



Jan  4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device
sdb1) in cleanup_transaction:1577: errno=-12 Out of memory



How much memory do you have in this node?  Where you using Ceph
(as a client) on this node?  Do you have swap configured on this
node?









On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this 
state:


cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs
stuck undersized; 631 pgs undersized; recovery 397226/915548 objects
degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub
errors
 monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 30, quorum 0,1 ceph1,ceph2
 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548
objects misplaced (7.867%)
  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is a pure BTRFS issue or is there any setting I
forgot to use?

Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm:
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian
3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP
ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704]  
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707]  811519ed
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710]  
0001 880075de0c08 0096
Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [] ?
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [] ?
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [] ?
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [] ?
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [] ?
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [] ?
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [] ?
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [] ?
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762] [] ?
btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535774] [] ?
btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535785] [] ?
btrfs_get_fs_root+0xd6/0x230 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535796] [] ?
create_pendin

[ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this state:

cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs 
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs stuck 
undersized; 631 pgs undersized; recovery 397226/915548 objects degraded 
(43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub errors
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
30, quorum 0,1 ceph1,ceph2

 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548 
objects misplaced (7.867%)

  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is a pure BTRFS issue or is there any setting I 
forgot to use?


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page 
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm: 
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP ProLiant 
MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events 
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704]   
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707]  811519ed 
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710]   
0001 880075de0c08 0096

Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [] ? 
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [] ? 
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [] ? 
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [] ? 
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [] ? 
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [] ? 
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [] ? 
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [] ? 
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762] [] ? 
btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535774] [] ? 
btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535785] [] ? 
btrfs_get_fs_root+0xd6/0x230 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535796] [] ? 
create_pending_snapshot+0x793/0xa00 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535807] [] ? 
create_pending_snapshots+0x89/0xa0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535818] [] ? 
btrfs_commit_transaction+0x35a/0xa10 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535824] [] ? 
mod_timer+0x10e/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535834] [] ? 
do_async_commit+0x2a/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535839] [] ? 
process_one_work+0x15c/0x450
Jan  4 17:11:06 ceph1 kernel: [756636.535843] [] ? 
worker_thread+0x112/0x540
Jan  4 17:11:06 ceph1 kernel: [756636.535847] [] ? 
create_and_start_worker+0x60/0x60
Jan  4 17:11:06 ceph1 kernel: [756636.535851] [] ? 
kthread+0xc1/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535854] [] ? 
flush_kthread_worker+0xb0/0xb0
Jan  4 17:11:06 ceph1 kernel: [756636.535858] [] ? 
ret_from_fork+0x7c/0xb0
Jan  4 17:11:06 ceph1 kernel: [756636.535861] [] ? 
flush_kthread_worker+0xb0/0xb0

Jan  4 17:11:06 ceph1 kernel: [756636.535863] Mem-Info:
Jan  4 17:11:06 ceph1 kernel: [756636.535865] Node 0 DMA per-cpu:
Jan  4 17:11:06 ceph1 kernel: [756636.535867] CPU0: hi:0, 
btch:   1 usd:   0
Jan  4 17:11:06 ceph1 kernel: [756636.535869] CPU1: hi:0, 
btch:   1 usd:   0

Jan  4 17:11:06 ceph1 kernel: [756636.535870] Node 0 DMA32 per-cpu:
Jan  4 17:11:06 ceph1 kernel: [756636.535872] CPU0: hi:  186, btch:  
31 usd: 216
Jan  4 17:11:06 ceph1 kernel: [756636.535874] CPU1: hi:  186, btch:  
31 usd: 150
Jan  4 17:11:06 ceph1 kernel: [756636.535879] active_anon:156968 
inactive_anon:52877 is

Re: [ceph-users] redundancy with 2 nodes

2015-01-04 Thread Chen, Xiaoxi
Did you shut down the node with 2 MONs?

I think it might be impossible to have redundancy with only 2 nodes; the Paxos
quorum is the reason:

Say you have N (N=2K+1) monitors. You will always have one node (let's name it
node A) holding a majority of the MONs (>= K+1) and another node (node B)
holding a minority (<= K).
To form a quorum you need at least K+1 live MONs, so if node B goes down
everything is good. But if node A goes down, you can never form a majority
with <= K monitors.
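
A concrete instance of the argument above:

    2 nodes, 3 MONs (K=1): node A holds 2, node B holds 1
        node B down -> 2 of 3 MONs alive -> quorum (>= K+1 = 2) -> cluster keeps running
        node A down -> 1 of 3 MONs alive -> no quorum -> cluster blocks
    3 nodes, 1 MON each:   any single node down -> 2 of 3 alive -> quorum holds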

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jiri 
Kanicky
Sent: Thursday, January 1, 2015 12:50 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] redundancy with 2 nodes

Hi,

I noticed this message after shutting down the other node. You might be right 
that I need 3 monitors.
2015-01-01 15:47:35.990260 7f22858dd700  0 monclient: hunting for new mon

But what is quite unexpected is that you cannot even run "ceph status" on the
running node to find out the state of the cluster.

Thx Jiri

On 1/01/2015 15:46, Jiri Kanicky wrote:
Hi,

I have:
- 2 monitors, one on each node
- 4 OSDs, two on each node
- 2 MDS, one on each node

Yes, all pools are set with size=2 and min_size=1

cephadmin@ceph1:~$ ceph osd dump
epoch 88
fsid bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
created 2014-12-27 23:38:00.455097
modified 2014-12-30 20:45:51.343217
flags
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 86 flags hashpspool stripe_width 0
pool 1 'media' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 60 flags hashpspool stripe_width 0
pool 2 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 63 flags hashpspool stripe_width 0
pool 3 'cephfs_test' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 71 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 4 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 69 flags hashpspool stripe_width 0
max_osd 4
osd.0 up   in  weight 1 up_from 55 up_thru 86 down_at 51 last_clean_interval [39,50) 192.168.30.21:6800/17319 10.1.1.21:6800/17319 10.1.1.21:6801/17319 192.168.30.21:6801/17319 exists,up 4f3172e1-adb8-4ca3-94af-6f0b8fcce35a
osd.1 up   in  weight 1 up_from 57 up_thru 86 down_at 53 last_clean_interval [41,52) 192.168.30.21:6803/17684 10.1.1.21:6802/17684 10.1.1.21:6804/17684 192.168.30.21:6805/17684 exists,up 1790347a-94fa-4b81-b429-1e7c7f11d3fd
osd.2 up   in  weight 1 up_from 79 up_thru 86 down_at 74 last_clean_interval [13,73) 192.168.30.22:6801/3178 10.1.1.22:6800/3178 10.1.1.22:6801/3178 192.168.30.22:6802/3178 exists,up 5520835f-c411-4750-974b-34e9aea2585d
osd.3 up   in  weight 1 up_from 81 up_thru 86 down_at 72 last_clean_interval [20,71) 192.168.30.22:6804/3414 10.1.1.22:6802/3414 10.1.1.22:6803/3414 192.168.30.22:6805/3414 exists,up 25e62059-6392-4a69-99c9-214ae335004

Thx Jiri
On 1/01/2015 15:21, Lindsay Mathieson wrote:

On Thu, 1 Jan 2015 02:59:05 PM Jiri Kanicky wrote:

I would expect that if I shut down one node, the system will keep

running. But when I tested it, I cannot even execute "ceph status"

command on the running node.

2 osd Nodes, 3 Mon nodes here, works perfectly for me.



How many monitors do you have?

Maybe you need a third monitor only node for quorum?





I set "osd_pool_default_size = 2" (min_size=1) on all pools, so I

thought that each copy will reside on each node. Which means that if 1

node goes down the second one will be still operational.



does:

ceph osd pool get {pool name} size

  return 2



ceph osd pool get {pool name} min_size

  return 1








___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com