Re: [ceph-users] How ceph client abort IO

2015-10-21 Thread Jason Dillaman
> On Tue, 20 Oct 2015, Jason Dillaman wrote:
> > There is no such interface currently on the librados / OSD side to abort
> > IO operations.  Can you provide some background on your use-case for
> > aborting in-flight IOs?
> 
> The internal Objecter has a cancel interface, but it can't yank back
> buffers, and it's not exposed to librados.
> 
> But... if you're using librados or librbd then I think we're making a copy
> of the buffer anyway so you can reuse it as soon as the IO is submitted.
> Unless you're using the C++ librados API and submitting a bufferlist?
> 
> sage
> 
> 

Yes, copies are used for both the C/C++ APIs -- but my pending scatter/gather 
patch for librbd removes all copying, with the exception of cache integration 
under the C API.

--

Jason


[ceph-users] Increasing pg and pgs

2015-10-21 Thread Paras pradhan
Hi,

When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 < min
20)"

I have 40 OSDs and tried to increase pg_num to 2000 with the following
command. The output mentions creating 1936 new PGs, but I am not sure whether
anything is actually happening. Is there a way to check the progress? More than
48 hours have passed and I still see the health warning.

--

root@node-30:~# ceph osd pool set rbd pg_num 2000

Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs on
~40 OSDs exceeds per-OSD max of 32)

--


Thanks in advance

Paras.


Re: [ceph-users] Increasing pg and pgs

2015-10-21 Thread Michael Hackett
Hello Paras,

This is a limit that was added pre-Firefly to prevent users from knocking IO 
off clusters for several seconds when PGs are split in existing pools. This 
limit does not come into effect when creating new pools, though.

If you instead limit the number to

# ceph osd pool set rbd pg_num 1280

this should go fine, as it stays at the 32-PGs-per-OSD limit for the existing 
pool.

This limit applies when expanding PGs on an existing pool because splits are a 
little more expensive for the OSD, and have to happen synchronously instead of 
asynchronously.

I believe Greg covered this in a previous email thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041399.html
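
If it helps, a rough sketch of stepping up from there (pool name taken from your 
command; the intermediate values are only examples, and each step should stay 
within the 32-new-PGs-per-OSD limit):

# ceph osd pool get rbd pg_num        <--- check the current value first
# ceph osd pool set rbd pg_num 1280
# ceph osd pool set rbd pgp_num 1280
# ceph osd pool set rbd pg_num 2048   <--- only after the previous step has settled
# ceph osd pool set rbd pgp_num 2048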

Thanks,

- Original Message -
From: "Paras pradhan" 
To: ceph-users@lists.ceph.com
Sent: Wednesday, October 21, 2015 12:31:57 PM
Subject: [ceph-users] Increasing pg and pgs

Hi, 

When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 < min 20)" 

I have 40osds and tried to increase the pg to 2000 with the following command. 
It says creating 1936 but not sure if it is working or not. Is there a way to 
check the progress? It has passed more than 48hrs and I still see the health 
warning. 

-- 


root@node-30:~# ceph osd pool set rbd pg_num 2000 

Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs on ~40 
OSDs exceeds per-OSD max of 32) 

-- 




Thanks in advance 

Paras. 




-- 
Michael Hackett 
Software Maintenance Engineer CEPH Storage 
Phone: 1-978-399-2196 
Westford, MA 



Re: [ceph-users] Increasing pg and pgs

2015-10-21 Thread Michael Hackett
Hello Paras,

Your pgp-num should mirror your pg-num on a pool. pgp-num is what the cluster 
will use for actual object placement purposes.
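
For example (using the rbd pool from this thread; a rough sketch, not copied 
from your cluster):

# ceph osd pool get rbd pg_num
# ceph osd pool get rbd pgp_num
# ceph osd pool set rbd pgp_num 1280  <--- set it to the same value as pg_num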

- Original Message -
From: "Paras pradhan" 
To: "Michael Hackett" 
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, October 21, 2015 1:39:11 PM
Subject: Re: [ceph-users] Increasing pg and pgs

Thanks Michael for the clarification. I should set pg_num and pgp_num on
all the pools, am I right? I am asking because setting pg_num on just one
pool already brought the status back to HEALTH_OK.


-Paras.

On Wed, Oct 21, 2015 at 12:21 PM, Michael Hackett 
wrote:

> Hello Paras,
>
> This is a limit that was added pre-firefly to prevent users from knocking
> IO off clusters for several seconds when PG's are being split in existing
> pools. This limit is not called into effect when creating new pools though.
>
> If you try and limit the number to
>
> # ceph osd pool set rbd pg_num 1280
>
> This should go fine as this will be at the 32 PG per OSD limit in the
> existing pool.
>
> This limit is set when expanding PG's on an existing pool because splits
> are a little more expensive for the OSD, and have to happen synchronously
> instead of asynchronously.
>
> I believe Greg covered this in a previous email thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041399.html
>
> Thanks,
>
> - Original Message -
> From: "Paras pradhan" 
> To: ceph-users@lists.ceph.com
> Sent: Wednesday, October 21, 2015 12:31:57 PM
> Subject: [ceph-users] Increasing pg and pgs
>
> Hi,
>
> When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 < min
> 20)"
>
> I have 40osds and tried to increase the pg to 2000 with the following
> command. It says creating 1936 but not sure if it is working or not. Is
> there a way to check the progress? It has passed more than 48hrs and I
> still see the health warning.
>
> --
>
>
> root@node-30:~# ceph osd pool set rbd pg_num 2000
>
> Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs on
> ~40 OSDs exceeds per-OSD max of 32)
>
> --
>
>
>
>
> Thanks in advance
>
> Paras.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Michael Hackett
> Software Maintenance Engineer CEPH Storage
> Phone: 1-978-399-2196
> Westford, MA
>
>

-- 
Michael Hackett 
Software Maintenance Engineer CEPH Storage 
Phone: 1-978-399-2196 
Westford, MA 

Hello 


Re: [ceph-users] Increasing pg and pgs

2015-10-21 Thread Michael Hackett
One thing I forgot to note, Paras: if you are increasing the PG count on a pool 
by a large number, you will want to increase the pgp_num value slowly and allow 
the cluster to rebalance the data instead of setting pgp_num to immediately 
reflect pg_num. This gives you greater control over how much data is rebalancing 
in the cluster at any one time.

So, for example, if pg_num is to be set to 2048 on a pool whose current PG count 
is 512, you could step up as follows:

ceph osd pool set data pgp_num 1024   <--- increase the hashing buckets gradually
Wait for the cluster to finish rebalancing

ceph osd pool set data pgp_num 2048   <--- increase the hashing buckets gradually
Wait for the cluster to finish rebalancing
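
A minimal way to check that each step has settled before the next pgp_num bump 
(just a sketch; any of these views is enough):

# ceph health
# ceph -s | grep -E 'degraded|misplaced|backfill'
# watch -n 30 ceph -s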

Thank you,

- Original Message -
From: "Paras pradhan" 
To: "Michael Hackett" 
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, October 21, 2015 1:53:13 PM
Subject: Re: [ceph-users] Increasing pg and pgs

Thanks!

On Wed, Oct 21, 2015 at 12:52 PM, Michael Hackett 
wrote:

> Hello Paras,
>
> You pgp-num should mirror your pg-num on a pool. pgp-num is what the
> cluster will use for actual object placement purposes.
>
> - Original Message -
> From: "Paras pradhan" 
> To: "Michael Hackett" 
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, October 21, 2015 1:39:11 PM
> Subject: Re: [ceph-users] Increasing pg and pgs
>
> Thanks Michael for the clarification. I should set the pg and pgp_num to
> all the pools . Am i right? . I am asking beacuse setting the pg to just
> only one pool already set the status to HEALTH OK.
>
>
> -Paras.
>
> On Wed, Oct 21, 2015 at 12:21 PM, Michael Hackett 
> wrote:
>
> > Hello Paras,
> >
> > This is a limit that was added pre-firefly to prevent users from knocking
> > IO off clusters for several seconds when PG's are being split in existing
> > pools. This limit is not called into effect when creating new pools
> though.
> >
> > If you try and limit the number to
> >
> > # ceph osd pool set rbd pg_num 1280
> >
> > This should go fine as this will be at the 32 PG per OSD limit in the
> > existing pool.
> >
> > This limit is set when expanding PG's on an existing pool because splits
> > are a little more expensive for the OSD, and have to happen synchronously
> > instead of asynchronously.
> >
> > I believe Greg covered this in a previous email thread:
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041399.html
> >
> > Thanks,
> >
> > - Original Message -
> > From: "Paras pradhan" 
> > To: ceph-users@lists.ceph.com
> > Sent: Wednesday, October 21, 2015 12:31:57 PM
> > Subject: [ceph-users] Increasing pg and pgs
> >
> > Hi,
> >
> > When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 < min
> > 20)"
> >
> > I have 40osds and tried to increase the pg to 2000 with the following
> > command. It says creating 1936 but not sure if it is working or not. Is
> > there a way to check the progress? It has passed more than 48hrs and I
> > still see the health warning.
> >
> > --
> >
> >
> > root@node-30:~# ceph osd pool set rbd pg_num 2000
> >
> > Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs on
> > ~40 OSDs exceeds per-OSD max of 32)
> >
> > --
> >
> >
> >
> >
> > Thanks in advance
> >
> > Paras.
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > --
> > Michael Hackett
> > Software Maintenance Engineer CEPH Storage
> > Phone: 1-978-399-2196
> > Westford, MA
> >
> >
>
> --
> Michael Hackett
> Software Maintenance Engineer CEPH Storage
> Phone: 1-978-399-2196
> Westford, MA
>
> Hello
>

-- 
Michael Hackett 
Software Maintenance Engineer CEPH Storage 
Phone: 1-978-399-2196 
Westford, MA 



[ceph-users] librbd regression with Hammer v0.94.4 -- use caution!

2015-10-21 Thread Sage Weil
There is a regression in librbd in v0.94.4 that can cause VMs to crash.  
For now, please refrain from upgrading hypervisor nodes or other librbd 
users to v0.94.4.

http://tracker.ceph.com/issues/13559

The problem does not affect server-side daemons (ceph-mon, ceph-osd, 
etc.).

Jason's identified the bug and has a fix prepared, but it'll probably take 
a few days before we have v0.94.5 out.


https://github.com/ceph/ceph/commit/4692c330bd992a06b97b5b8975ab71952b22477a
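
For hypervisor hosts on Debian/Ubuntu, one way to make sure nothing pulls in 
v0.94.4 in the meantime is to hold the client packages until v0.94.5 is out 
(package names are the usual Debian ones; use yum versionlock or similar on 
RPM-based systems):

# apt-mark hold librbd1 librados2
# ... and once v0.94.5 is released:
# apt-mark unhold librbd1 librados2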

Thanks!
sage


Re: [ceph-users] Increasing pg and pgs

2015-10-21 Thread Paras pradhan
Thanks Michael for the clarification. I should set pg_num and pgp_num on
all the pools, am I right? I am asking because setting pg_num on just one
pool already brought the status back to HEALTH_OK.


-Paras.

On Wed, Oct 21, 2015 at 12:21 PM, Michael Hackett 
wrote:

> Hello Paras,
>
> This is a limit that was added pre-firefly to prevent users from knocking
> IO off clusters for several seconds when PG's are being split in existing
> pools. This limit is not called into effect when creating new pools though.
>
> If you try and limit the number to
>
> # ceph osd pool set rbd pg_num 1280
>
> This should go fine as this will be at the 32 PG per OSD limit in the
> existing pool.
>
> This limit is set when expanding PG's on an existing pool because splits
> are a little more expensive for the OSD, and have to happen synchronously
> instead of asynchronously.
>
> I believe Greg covered this in a previous email thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041399.html
>
> Thanks,
>
> - Original Message -
> From: "Paras pradhan" 
> To: ceph-users@lists.ceph.com
> Sent: Wednesday, October 21, 2015 12:31:57 PM
> Subject: [ceph-users] Increasing pg and pgs
>
> Hi,
>
> When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 < min
> 20)"
>
> I have 40osds and tried to increase the pg to 2000 with the following
> command. It says creating 1936 but not sure if it is working or not. Is
> there a way to check the progress? It has passed more than 48hrs and I
> still see the health warning.
>
> --
>
>
> root@node-30:~# ceph osd pool set rbd pg_num 2000
>
> Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs on
> ~40 OSDs exceeds per-OSD max of 32)
>
> --
>
>
>
>
> Thanks in advance
>
> Paras.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Michael Hackett
> Software Maintenance Engineer CEPH Storage
> Phone: 1-978-399-2196
> Westford, MA
>
>


Re: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4

2015-10-21 Thread Jan Schermer
If I'm reading it correctly his cmdline says cache=none for the rbd device, so 
there should be no writeback caching:

file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none

If that's actually overridden by a ceph.conf setting then that is another bug I 
guess :-)

Jan



> On 21 Oct 2015, at 19:46, Jason Dillaman  wrote:
> 
> There is an edge case with cloned image writeback caching that occurs after 
> an attempt to read a non-existent clone RADOS object, followed by a write to 
> said object, followed by another read.  This second read will cause the 
> cached write to be flushed to the OSD while the appropriate locks are not 
> being held.  This issue is being tracked via an upstream tracker ticket [1].
> 
> This issue effects librbd clients using v0.94.4 and v9.x.  Disabling the 
> cache or switching to write-through caching (rbd_cache_max_dirty = 0) should 
> avoid the issue until it is fixed in the next Ceph release.
> 
> [1] http://tracker.ceph.com/issues/13559
> 
> -- 
> 
> Jason Dillaman 
> 
> 
> - Original Message - 
> 
>> From: "Andrei Mikhailovsky" 
>> To: ceph-us...@ceph.com
>> Sent: Wednesday, October 21, 2015 8:17:39 AM
>> Subject: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4
> 
>> Hello guys,
> 
>> I've upgraded to the latest Hammer release and I've just noticed a massive
>> issue after the upgrade (((
> 
>> I am using ceph for virtual machine rbd storage over cloudstack. I am having
>> issues with starting virtual routers. The libvirt error message is:
> 
>> cat r-1407-VM.log
>> 2015-10-21 11:04:59.262+: starting up
>> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
>> QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name r-1407-VM -S -machine
>> pc-i440fx-trusty,accel=kvm,usb=off -m 256 -realtime mlock=off -smp
>> 1,sockets=1,cores=1,threads=1 -uuid 815d2860-cc7f-475d-bf63-02814c720fe4
>> -no-user-config -nodefaults -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/r-1407-VM.monitor,server,nowait
>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
>> -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
>> virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive
>> file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none
>> -device
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2
>> -drive
>> file=/usr/share/cloudstack-common/vms/systemvm.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none
>> -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1
>> -netdev tap,fd=54,id=hostnet0,vhost=on,vhostfd=55 -device
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:2e:f7:00:18,bus=pci.0,addr=0x3,rombar=0,romfile=
>> -netdev tap,fd=56,id=hostnet1,vhost=on,vhostfd=57 -device
>> virtio-net-pci,netdev=hostnet1,id=net1,mac=0e:00:a9:fe:01:42,bus=pci.0,addr=0x4,rombar=0,romfile=
>> -netdev tap,fd=58,id=hostnet2,vhost=on,vhostfd=59 -device
>> virtio-net-pci,netdev=hostnet2,id=net2,mac=06:0c:b6:00:02:13,bus=pci.0,addr=0x5,rombar=0,romfile=
>> -chardev pty,id=charserial0 -device
>> isa-serial,chardev=charserial0,id=serial0 -chardev
>> socket,id=charchannel0,path=/var/lib/libvirt/qemu/r-1407-VM.agent,server,nowait
>> -device
>> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=r-1407-VM.vport
>> -device usb-tablet,id=input0 -vnc 192.168.169.2:10,password -device
>> cirrus-vga,id=video0,bus=pci.0,addr=0x2
>> Domain id=42 is tainted: high-privileges
>> libust[20136/20136]: Warning: HOME environment variable not set. Disabling
>> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
>> char device redirected to /dev/pts/13 (label charserial0)
>> librbd/LibrbdWriteback.cc: In function 'virtual ceph_tid_t
>> librbd::LibrbdWriteback::write(const object_t&, const object_locator_t&,
>> uint64_t, uint64_t, const SnapContext&, const bufferlist&, utime_t,
>> uint64_t, __u32, Context*)' thread 7ffa6b7fe700 time 2015-10-21
>> 12:05:07.901876
>> librbd/LibrbdWriteback.cc: 160: FAILED assert(m_ictx->owner_lock.is_locked())
>> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>> 1: (()+0x17258b) [0x7ffa92ef758b]
>> 2: (()+0xa9573) [0x7ffa92e2e573]
>> 3: (()+0x3a90ca) [0x7ffa9312e0ca]
>> 4: (()+0x3b583d) [0x7ffa9313a83d]
>> 5: (()+0x7212c) [0x7ffa92df712c]
>> 6: (()+0x9590f) [0x7ffa92e1a90f]
>> 7: (()+0x969a3) [0x7ffa92e1b9a3]
>> 8: (()+0x4782a) [0x7ffa92dcc82a]
>> 9: (()+0x56599) [0x7ffa92ddb599]
>> 10: (()+0x7284e) [0x7ffa92df784e]
>> 11: (()+0x162b7e) [0x7ffa92ee7b7e]
>> 

Re: [ceph-users] pg incomplete state

2015-10-21 Thread Gregory Farnum
On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson  wrote:
> Hi folks
>
> I've been rebuilding drives in my cluster to add space.  This has gone
> well so far.
>
> After the last batch of rebuilds, I'm left with one placement group in
> an incomplete state.
>
> [sudo] password for jpr:
> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
> pg 3.ea is stuck inactive since forever, current state incomplete, last
> acting [30,11]
> pg 3.ea is stuck unclean since forever, current state incomplete, last
> acting [30,11]
> pg 3.ea is incomplete, acting [30,11]
>
> I've restarted both OSD a few times but it hasn't cleared the error.
>
> On the primary I see errors in the log related to slow requests:
>
> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
> 3 included below; oldest blocked for > 31.922487 secs
> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
> 3.e4bd50ea) v4 currently reached pg
>
> Note's online suggest this is an issue with the journal and that it may
> be possible to export and rebuild thepg.  I don't have firefly.
>
> https://ceph.com/community/incomplete-pgs-oh-my/
>
> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
> but missing entirely on osd.30 (the primary).
>
> on osd.33 (primary):
>
> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
> 0   /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>
> on osd.11 (secondary):
>
> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
> 63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>
> This makes some sense since, my disk drive rebuilding activity
> reformatted the primary osd.30.  It also gives me some hope that my data
> is not lost.
>
> I understand incomplete means problem with journal, but is there a way
> to dig deeper into this or possible to get the secondary's data to take
> over?

If you're running an older version of Ceph (Firefly or earlier,
maybe?), "incomplete" can also mean "not enough replicas". It looks
like that's what you're hitting here, if osd.11 is not reporting any
issues. If so, simply setting the min_size on this pool to 1 until the
backfilling is done should let you get going.
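
A minimal sketch of that, assuming the pool behind pg 3.ea is a size-2 pool 
(look up its name with "ceph osd lspools" if needed):

# ceph osd pool get <poolname> min_size
# ceph osd pool set <poolname> min_size 1
# ... and once the pg is active+clean again:
# ceph osd pool set <poolname> min_size 2
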
-Greg


Re: [ceph-users] ceph and upgrading OS version

2015-10-21 Thread Warren Wang - ISD
Depending on how busy your cluster is, I’d nuke and pave node by node. You can 
slow the data movement off the old box, and also slow it on the way back in 
with weighting. My own personal preference, if you have performance overhead to 
spare.
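
A hedged sketch of the weighting approach (the osd id, current weight and step 
size here are made up; go as slowly as your latency budget requires):

# ceph osd crush reweight osd.12 2.0   <--- step down from e.g. 2.73, wait for rebalance
# ceph osd crush reweight osd.12 1.0
# ceph osd crush reweight osd.12 0     <--- node is now empty and safe to rebuild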

Warren

From: Andrei Mikhailovsky >
Date: Tuesday, October 20, 2015 at 3:05 PM
To: "ceph-us...@ceph.com" 
>
Subject: [ceph-users] ceph and upgrading OS version

Hello everyone

I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am 
wondering if you have a recommended process of upgrading the OS version without 
causing any issues to the ceph cluster?

Many thanks

Andrei

This email and any files transmitted with it are confidential and intended 
solely for the individual or entity to whom they are addressed. If you have 
received this email in error destroy it immediately. *** Walmart Confidential 
***


Re: [ceph-users] ceph-hammer and debian jessie - missing files on repository

2015-10-21 Thread Alfredo Deza
We did have some issues a few days ago where the Jessie packages didn't make it.

This shouldn't be a problem anymore; would you mind trying again? I just
managed to install on Debian Jessie without problems:


  Debian GNU/Linux 8.2 (jessie)   built 2015-10-04

vagrant@vagrant:~$ ceph --version
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)

vagrant@vagrant:~$ apt-cache policy ceph
ceph:
  Installed: 0.94.4-1~bpo80+1
  Candidate: 0.94.4-1~bpo80+1
  Version table:
 *** 0.94.4-1~bpo80+1 0
999 http://download.ceph.com/debian-hammer/ jessie/main amd64 Packages
100 /var/lib/dpkg/status
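
If it still fails on your end, a quick re-check would be something like (repo 
line unchanged from the install docs):

# apt-get update
# apt-cache policy ceph-common
# apt-get install ceph ceph-common    <--- the ~bpo80+1 .debs should now download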

On Tue, Oct 20, 2015 at 6:33 AM, Björn Lässig  wrote:
> Hi,
>
> Thanks guys for supporting the latest debian stable release with latest ceph
> stable!
>
> as version 0.94.4. has been released, i tried to upgrade my debian/jessie
> cluster with hammer/wheezy packages to hammer/jessie.
>
> Unfortunately the download.ceph.com/debian-hammer debian repository is in
> some strange state. (even, if you ignore, that the ipv6 connection is very
> flaky from different sites in europe)
>
> eg:
>
> in
> http://download.ceph.com/debian-hammer/dists/jessie/main/binary-amd64/Packages
> a package called ceph-common is referenced
>
> - /
> Package: ceph-common
> Version: 0.94.4-1~bpo80+1
> Architecture: amd64
> Filename: pool/main/c/ceph/ceph-common_0.94.4-1~bpo80+1_amd64.deb
> --\
>
> but this package file does not exists. All files whith version ~bpo80 do not
> exist. (yet?)
>
>
> could you check this please?
>
> Thanks in advance
> Björn Lässig
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4

2015-10-21 Thread Jason Dillaman
There is an edge case with cloned image writeback caching that occurs after an 
attempt to read a non-existent clone RADOS object, followed by a write to said 
object, followed by another read.  This second read will cause the cached write 
to be flushed to the OSD while the appropriate locks are not being held.  This 
issue is being tracked via an upstream tracker ticket [1].

This issue affects librbd clients using v0.94.4 and v9.x.  Disabling the cache 
or switching to write-through caching (rbd_cache_max_dirty = 0) should avoid 
the issue until it is fixed in the next Ceph release.

[1] http://tracker.ceph.com/issues/13559
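
In ceph.conf terms the client-side workaround would look roughly like this 
(either line on its own is enough; the section name assumes your clients read 
the generic [client] section):

[client]
    rbd cache = false
    # -- or, to keep the cache but force write-through:
    # rbd cache max dirty = 0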

-- 

Jason Dillaman 


- Original Message - 

> From: "Andrei Mikhailovsky" 
> To: ceph-us...@ceph.com
> Sent: Wednesday, October 21, 2015 8:17:39 AM
> Subject: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4

> Hello guys,

> I've upgraded to the latest Hammer release and I've just noticed a massive
> issue after the upgrade (((

> I am using ceph for virtual machine rbd storage over cloudstack. I am having
> issues with starting virtual routers. The libvirt error message is:

> cat r-1407-VM.log
> 2015-10-21 11:04:59.262+: starting up
> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
> QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name r-1407-VM -S -machine
> pc-i440fx-trusty,accel=kvm,usb=off -m 256 -realtime mlock=off -smp
> 1,sockets=1,cores=1,threads=1 -uuid 815d2860-cc7f-475d-bf63-02814c720fe4
> -no-user-config -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/r-1407-VM.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
> -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
> virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive
> file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2
> -drive
> file=/usr/share/cloudstack-common/vms/systemvm.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none
> -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1
> -netdev tap,fd=54,id=hostnet0,vhost=on,vhostfd=55 -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:2e:f7:00:18,bus=pci.0,addr=0x3,rombar=0,romfile=
> -netdev tap,fd=56,id=hostnet1,vhost=on,vhostfd=57 -device
> virtio-net-pci,netdev=hostnet1,id=net1,mac=0e:00:a9:fe:01:42,bus=pci.0,addr=0x4,rombar=0,romfile=
> -netdev tap,fd=58,id=hostnet2,vhost=on,vhostfd=59 -device
> virtio-net-pci,netdev=hostnet2,id=net2,mac=06:0c:b6:00:02:13,bus=pci.0,addr=0x5,rombar=0,romfile=
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0 -chardev
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/r-1407-VM.agent,server,nowait
> -device
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=r-1407-VM.vport
> -device usb-tablet,id=input0 -vnc 192.168.169.2:10,password -device
> cirrus-vga,id=video0,bus=pci.0,addr=0x2
> Domain id=42 is tainted: high-privileges
> libust[20136/20136]: Warning: HOME environment variable not set. Disabling
> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
> char device redirected to /dev/pts/13 (label charserial0)
> librbd/LibrbdWriteback.cc: In function 'virtual ceph_tid_t
> librbd::LibrbdWriteback::write(const object_t&, const object_locator_t&,
> uint64_t, uint64_t, const SnapContext&, const bufferlist&, utime_t,
> uint64_t, __u32, Context*)' thread 7ffa6b7fe700 time 2015-10-21
> 12:05:07.901876
> librbd/LibrbdWriteback.cc: 160: FAILED assert(m_ictx->owner_lock.is_locked())
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
> 1: (()+0x17258b) [0x7ffa92ef758b]
> 2: (()+0xa9573) [0x7ffa92e2e573]
> 3: (()+0x3a90ca) [0x7ffa9312e0ca]
> 4: (()+0x3b583d) [0x7ffa9313a83d]
> 5: (()+0x7212c) [0x7ffa92df712c]
> 6: (()+0x9590f) [0x7ffa92e1a90f]
> 7: (()+0x969a3) [0x7ffa92e1b9a3]
> 8: (()+0x4782a) [0x7ffa92dcc82a]
> 9: (()+0x56599) [0x7ffa92ddb599]
> 10: (()+0x7284e) [0x7ffa92df784e]
> 11: (()+0x162b7e) [0x7ffa92ee7b7e]
> 12: (()+0x163c10) [0x7ffa92ee8c10]
> 13: (()+0x8182) [0x7ffa8ec49182]
> 14: (clone()+0x6d) [0x7ffa8e97647d]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> 2015-10-21 11:05:08.091+: shutting down

> From what I can see, there seem to be an issue with locking
> (librbd/LibrbdWriteback.cc: 160: FAILED
> assert(m_ictx->owner_lock.is_locked())). However, the r-1407-VM virtual
> router is a new router and has not been created or ran before. So, I don't
> see why there is an issue with locking.

> Could someone please help 

Re: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4

2015-10-21 Thread Jason Dillaman
> If I'm reading it correctly his cmdline says cache=none for the rbd device,
> so there should be no writeback caching:
> 
> file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none
> 
> If that's actually overriden by ceph.conf setting then that is another bug I
> guess :-)
> 
> Jan

There was a bug in qemu which resulted in ceph.conf config properties 
overwriting your qemu command-line properties [1].  If you have "rbd cache = 
true" in your ceph.conf, it would override "cache=none" in your qemu 
command-line.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg03078.html
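
A quick way to check whether that is what is happening on a given hypervisor is 
to look at what the client side actually resolves (the admin socket line only 
works if an 'admin socket' path is configured for the qemu client; the path 
below is just a placeholder):

# grep -iE 'rbd[ _]cache' /etc/ceph/ceph.conf
# ceph --admin-daemon /var/run/ceph/<client>.asok config show | grep rbd_cache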

-- 

Jason Dillaman 


Re: [ceph-users] Increasing pg and pgs

2015-10-21 Thread Paras pradhan
Michael, yes, I did wait for the rebalance to complete.

Thanks
Paras.

On Wed, Oct 21, 2015 at 1:02 PM, Michael Hackett 
wrote:

> One thing I forgot to note Paras, If you are increasing the PG count on a
> pool by a large number you will want to increase the PGP value slowly and
> allow the cluster to rebalance the data instead of just setting the pgp-num
> to immediately reflect the pg-num. This will give you greater control over
> how much data is rebalancing in the cluster.
>
> So for example if your pg-num is to be set to 2048 on a pool which has a
> current PG count set to 512, you could step up as follows:
>
> ceph osd pool set data pgp_num 1024 <--- Increase
> the hashing buckets gradually
> Wait for cluster to finish rebalancing
>
> ceph osd pool set data pgp_num 2048 <--- Increase
> the hashing buckets gradually
> Wait for cluster to finish rebalancing
>
> Thank you,
>
> - Original Message -
> From: "Paras pradhan" 
> To: "Michael Hackett" 
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, October 21, 2015 1:53:13 PM
> Subject: Re: [ceph-users] Increasing pg and pgs
>
> Thanks!
>
> On Wed, Oct 21, 2015 at 12:52 PM, Michael Hackett 
> wrote:
>
> > Hello Paras,
> >
> > You pgp-num should mirror your pg-num on a pool. pgp-num is what the
> > cluster will use for actual object placement purposes.
> >
> > - Original Message -
> > From: "Paras pradhan" 
> > To: "Michael Hackett" 
> > Cc: ceph-users@lists.ceph.com
> > Sent: Wednesday, October 21, 2015 1:39:11 PM
> > Subject: Re: [ceph-users] Increasing pg and pgs
> >
> > Thanks Michael for the clarification. I should set the pg and pgp_num to
> > all the pools . Am i right? . I am asking beacuse setting the pg to just
> > only one pool already set the status to HEALTH OK.
> >
> >
> > -Paras.
> >
> > On Wed, Oct 21, 2015 at 12:21 PM, Michael Hackett 
> > wrote:
> >
> > > Hello Paras,
> > >
> > > This is a limit that was added pre-firefly to prevent users from
> knocking
> > > IO off clusters for several seconds when PG's are being split in
> existing
> > > pools. This limit is not called into effect when creating new pools
> > though.
> > >
> > > If you try and limit the number to
> > >
> > > # ceph osd pool set rbd pg_num 1280
> > >
> > > This should go fine as this will be at the 32 PG per OSD limit in the
> > > existing pool.
> > >
> > > This limit is set when expanding PG's on an existing pool because
> splits
> > > are a little more expensive for the OSD, and have to happen
> synchronously
> > > instead of asynchronously.
> > >
> > > I believe Greg covered this in a previous email thread:
> > >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041399.html
> > >
> > > Thanks,
> > >
> > > - Original Message -
> > > From: "Paras pradhan" 
> > > To: ceph-users@lists.ceph.com
> > > Sent: Wednesday, October 21, 2015 12:31:57 PM
> > > Subject: [ceph-users] Increasing pg and pgs
> > >
> > > Hi,
> > >
> > > When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 <
> min
> > > 20)"
> > >
> > > I have 40osds and tried to increase the pg to 2000 with the following
> > > command. It says creating 1936 but not sure if it is working or not. Is
> > > there a way to check the progress? It has passed more than 48hrs and I
> > > still see the health warning.
> > >
> > > --
> > >
> > >
> > > root@node-30:~# ceph osd pool set rbd pg_num 2000
> > >
> > > Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs
> on
> > > ~40 OSDs exceeds per-OSD max of 32)
> > >
> > > --
> > >
> > >
> > >
> > >
> > > Thanks in advance
> > >
> > > Paras.
> > >
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > --
> > > Michael Hackett
> > > Software Maintenance Engineer CEPH Storage
> > > Phone: 1-978-399-2196
> > > Westford, MA
> > >
> > >
> >
> > --
> > Michael Hackett
> > Software Maintenance Engineer CEPH Storage
> > Phone: 1-978-399-2196
> > Westford, MA
> >
> > Hello
> >
>
> --
> Michael Hackett
> Software Maintenance Engineer CEPH Storage
> Phone: 1-978-399-2196
> Westford, MA
>
>


Re: [ceph-users] pg incomplete state

2015-10-21 Thread John-Paul Robinson
Greg,

Thanks for the insight.  I suspect things are somewhat sane given that I
did erase the primary (osd.30) and the secondary (osd.11) still contains
pg data.

If I may, could you clarify the process of backfill a little?

I understand the min_size allows I/O on the object to resume while there
are only that many replicas (ie. 1 once changed) and this would let
things move forward.

I would expect, however, that some backfill would already be on-going
for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
happening.  The pg 3.ea directory is just as empty today as it was
yesterday.

Will changing the min_size actually trigger backfill to begin for an
object if it has stalled or never got started?

An alternative idea I had was to take osd.30 back out of the cluster so
that pg 3.ea [30,11] would get mapped to some other osd to maintain
replication.  This seems a bit heavy handed though, given that only this
one pg is affected.

Thanks for any follow up.

~jpr 


On 10/21/2015 01:21 PM, Gregory Farnum wrote:
> On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson  wrote:
>> Hi folks
>>
>> I've been rebuilding drives in my cluster to add space.  This has gone
>> well so far.
>>
>> After the last batch of rebuilds, I'm left with one placement group in
>> an incomplete state.
>>
>> [sudo] password for jpr:
>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
>> pg 3.ea is stuck inactive since forever, current state incomplete, last
>> acting [30,11]
>> pg 3.ea is stuck unclean since forever, current state incomplete, last
>> acting [30,11]
>> pg 3.ea is incomplete, acting [30,11]
>>
>> I've restarted both OSD a few times but it hasn't cleared the error.
>>
>> On the primary I see errors in the log related to slow requests:
>>
>> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
>> 3 included below; oldest blocked for > 31.922487 secs
>> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
>> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
>> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
>> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
>> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
>> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
>> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
>> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
>> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
>> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
>> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
>> 3.e4bd50ea) v4 currently reached pg
>>
>> Note's online suggest this is an issue with the journal and that it may
>> be possible to export and rebuild thepg.  I don't have firefly.
>>
>> https://ceph.com/community/incomplete-pgs-oh-my/
>>
>> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
>> but missing entirely on osd.30 (the primary).
>>
>> on osd.33 (primary):
>>
>> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
>> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>> 0   /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>
>> on osd.11 (secondary):
>>
>> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
>> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>> 63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>
>> This makes some sense since, my disk drive rebuilding activity
>> reformatted the primary osd.30.  It also gives me some hope that my data
>> is not lost.
>>
>> I understand incomplete means problem with journal, but is there a way
>> to dig deeper into this or possible to get the secondary's data to take
>> over?
> If you're running an older version of Ceph (Firefly or earlier,
> maybe?), "incomplete" can also mean "not enough replicas". It looks
> like that's what you're hitting here, if osd.11 is not reporting any
> issues. If so, simply setting the min_size on this pool to 1 until the
> backfilling is done should let you get going.
> -Greg



Re: [ceph-users] pg incomplete state

2015-10-21 Thread Gregory Farnum
I don't remember the exact timeline, but min_size is designed to
prevent data loss from under-replicated objects (ie, if you only have
1 copy out of 3 and you lose that copy, you're in trouble, so maybe
you don't want it to go active). Unfortunately it could also prevent
the OSDs from replicating/backfilling the data to new OSDs in the case
where you only had one copy left — that's fixed now, but wasn't
initially. And in that case it reported the PG as incomplete (in later
versions, PGs in this state get reported as undersized).

So if you drop the min_size to 1, it will allow new writes to the PG
(which might not be great), but it will also let the OSD go into the
backfilling state. (At least, assuming the number of replicas is the
only problem.). Based on your description of the problem I think this
is the state you're in, and decreasing min_size is the solution.
*shrug*
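
If you want to confirm what the PG is actually waiting on before and after the 
min_size change, a quick sketch (pg id from your earlier output):

# ceph health detail | grep 3.ea
# ceph pg 3.ea query     <--- look at the recovery_state section
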
You could also try and do something like extracting the PG from osd.11
and copying it to osd.30, but that's quite tricky without the modern
objectstore tool stuff, and I don't know if any of that works on
dumpling (which it sounds like you're on — incidentally, you probably
want to upgrade from that).
-Greg

On Wed, Oct 21, 2015 at 12:55 PM, John-Paul Robinson  wrote:
> Greg,
>
> Thanks for the insight.  I suspect things are somewhat sane given that I
> did erase the primary (osd.30) and the secondary (osd.11) still contains
> pg data.
>
> If I may, could you clarify the process of backfill a little?
>
> I understand the min_size allows I/O on the object to resume while there
> are only that many replicas (ie. 1 once changed) and this would let
> things move forward.
>
> I would expect, however, that some backfill would already be on-going
> for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
> happening.  The pg 3.ea directory is just as empty today as it was
> yesterday.
>
> Will changing the min_size actually trigger backfill to begin for an
> object if has stalled or never got started?
>
> An alternative idea I had was to take osd.30 back out of the cluster so
> that pg 3.ae [30,11] would get mapped to some other osd to maintain
> replication.  This seems a bit heavy handed though, given that only this
> one pg is affected.
>
> Thanks for any follow up.
>
> ~jpr
>
>
> On 10/21/2015 01:21 PM, Gregory Farnum wrote:
>> On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson  wrote:
>>> Hi folks
>>>
>>> I've been rebuilding drives in my cluster to add space.  This has gone
>>> well so far.
>>>
>>> After the last batch of rebuilds, I'm left with one placement group in
>>> an incomplete state.
>>>
>>> [sudo] password for jpr:
>>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
>>> pg 3.ea is stuck inactive since forever, current state incomplete, last
>>> acting [30,11]
>>> pg 3.ea is stuck unclean since forever, current state incomplete, last
>>> acting [30,11]
>>> pg 3.ea is incomplete, acting [30,11]
>>>
>>> I've restarted both OSD a few times but it hasn't cleared the error.
>>>
>>> On the primary I see errors in the log related to slow requests:
>>>
>>> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
>>> 3 included below; oldest blocked for > 31.922487 secs
>>> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
>>> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
>>> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
>>> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
>>> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
>>> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
>>> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
>>> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
>>> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
>>> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
>>> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
>>> 3.e4bd50ea) v4 currently reached pg
>>>
>>> Note's online suggest this is an issue with the journal and that it may
>>> be possible to export and rebuild thepg.  I don't have firefly.
>>>
>>> https://ceph.com/community/incomplete-pgs-oh-my/
>>>
>>> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
>>> but missing entirely on osd.30 (the primary).
>>>
>>> on osd.33 (primary):
>>>
>>> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
>>> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>> 0   /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>>
>>> on osd.11 (secondary):
>>>
>>> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
>>> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>> 63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>>
>>> This makes some sense since, my disk drive rebuilding activity
>>> reformatted the primary osd.30.  It also gives me some hope that my data
>>> is not lost.
>>>
>>> I understand incomplete means problem with 

Re: [ceph-users] pg incomplete state

2015-10-21 Thread John-Paul Robinson
Yes.  That's the intention.  I was fixing the OSD sizes to get the
cluster back to HEALTH_OK before the upgrades (instead of having multiple
OSDs near full).

Thanks again for all the insight.  Very helpful.

~jpr

On 10/21/2015 03:01 PM, Gregory Farnum wrote:
>  (which it sounds like you're on — incidentally, you probably
> want to upgrade from that).



Re: [ceph-users] ceph-fuse and its memory usage

2015-10-21 Thread Gregory Farnum
On Tue, Oct 13, 2015 at 10:09 PM, Goncalo Borges
 wrote:
> Hi all...
>
> Thank you for the feedback, and I am sorry for my delay in replying.
>
> 1./ Just to recall the problem, I was testing cephfs using fio in two
> ceph-fuse clients:
>
> - Client A is in the same data center as all OSDs connected at 1 GbE
> - Client B is in a different data center (in another city) also connected at
> 1 GbE. However, I've seen that the connection is problematic, and sometimes,
> the network performance is well bellow the theoretical 1 Gbps limit.
> - Client A has 24 GB RAM + 98 GB of SWAP and client B has 48 GB of RAM + 98
> GB of SWAP
>
> and I was seeing that Client B was giving much better fio throughput
> because it was hitting the cache much more than Client A.
>
> --- * ---
>
> 2./ I was suspecting that Client B was hitting the cache because it had bad
> connectivity to the Ceph Cluster. I actually tried to sort that out and I
> was able to nail down a problem in a bad switch. However, after that, I
> still see the same behaviour which I can reproduce in a systematic way.
>
> --- * ---
>
> 3./ In a new round of tests in Client B, I've applied the following
> procedure:
>
> 3.1/ This is the network statistics right before starting my fio test:
>
> * Printing network statistics:
> * /sys/class/net/eth0/statistics/collisions: 0
> * /sys/class/net/eth0/statistics/multicast: 453650
> * /sys/class/net/eth0/statistics/rx_bytes: 437704562785
> * /sys/class/net/eth0/statistics/rx_compressed: 0
> * /sys/class/net/eth0/statistics/rx_crc_errors: 0
> * /sys/class/net/eth0/statistics/rx_dropped: 0
> * /sys/class/net/eth0/statistics/rx_errors: 0
> * /sys/class/net/eth0/statistics/rx_fifo_errors: 0
> * /sys/class/net/eth0/statistics/rx_frame_errors: 0
> * /sys/class/net/eth0/statistics/rx_length_errors: 0
> * /sys/class/net/eth0/statistics/rx_missed_errors: 0
> * /sys/class/net/eth0/statistics/rx_over_errors: 0
> * /sys/class/net/eth0/statistics/rx_packets: 387690140
> * /sys/class/net/eth0/statistics/tx_aborted_errors: 0
> * /sys/class/net/eth0/statistics/tx_bytes: 149206610455
> * /sys/class/net/eth0/statistics/tx_carrier_errors: 0
> * /sys/class/net/eth0/statistics/tx_compressed: 0
> * /sys/class/net/eth0/statistics/tx_dropped: 0
> * /sys/class/net/eth0/statistics/tx_errors: 0
> * /sys/class/net/eth0/statistics/tx_fifo_errors: 0
> * /sys/class/net/eth0/statistics/tx_heartbeat_errors: 0
> * /sys/class/net/eth0/statistics/tx_packets: 241698327
> * /sys/class/net/eth0/statistics/tx_window_errors: 0
>
> 3.2/ I've then launch my fio test. Please note that I am dropping caches
> before starting the test (sync; echo 3 > /proc/sys/vm/drop_caches). My
> current fio test has nothing fancy. Here are the options:
>
> # cat fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.in
> [fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036]
> ioengine=libaio
> iodepth=64
> rw=write
> bs=512K
> direct=1
> size=8192m
> numjobs=128
> filename=fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.data

Oh right, so you're only using 8GB of data to write over (and you're
hitting it a bunch of times). So if not for the direct IO flag this
would sort of make sense.

But with that, I'm very confused. There can be some annoying little
pieces of making direct IO get passed correctly through all the FUSE
interfaces, but I *thought* we were going through the hoops and making
things work. Perhaps I am incorrect. Zheng, do you know anything about
this?
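
One crude way to cross-check whether O_DIRECT is really bypassing the cache on 
the ceph-fuse mount, independent of fio (the path reuses your /cephfs/sydney 
directory; sizes are arbitrary):

# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/zero of=/cephfs/sydney/ddtest bs=512K count=2048 oflag=direct
# ... then compare the reported throughput with the tx_bytes delta in
# /sys/class/net/eth0/statistics taken before and after the run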

>
> I am no sure if it matters, but the layout of my dir is the following:
>
> # getfattr -n ceph.dir.layout /cephfs/sydney
> getfattr: Removing leading '/' from absolute path names
> # file: cephfs/sydney
> ceph.dir.layout="stripe_unit=524288 stripe_count=8 object_size=4194304
> pool=cephfs_dt"
>
> 3.3/ fio produced the following result for the aggregated bandwidth. If
> I translate that number to Gbps, I get almost 3 Gbps which is impossible.
>
> # grep aggrb
> fio128write_ioenginelibaio_iodepth64_direct1_bs512K_20151013041036.out
>   WRITE: io=1024.0GB, aggrb=403101KB/s, minb=3149KB/s, maxb=3154KB/s,
> mint=2659304msec, maxt=2663699msec
>
> 3.4 This is the network statistics immediately after the test
>
> * Printing network statistics:
> * /sys/class/net/eth0/statistics/collisions: 0
> * /sys/class/net/eth0/statistics/multicast: 454539
> * /sys/class/net/eth0/statistics/rx_bytes: 440300506875
> * /sys/class/net/eth0/statistics/rx_compressed: 0
> * /sys/class/net/eth0/statistics/rx_crc_errors: 0
> * /sys/class/net/eth0/statistics/rx_dropped: 0
> * /sys/class/net/eth0/statistics/rx_errors: 0
> * /sys/class/net/eth0/statistics/rx_fifo_errors: 0
> * /sys/class/net/eth0/statistics/rx_frame_errors: 0
> * /sys/class/net/eth0/statistics/rx_length_errors: 0
> * /sys/class/net/eth0/statistics/rx_missed_errors: 0
> * /sys/class/net/eth0/statistics/rx_over_errors: 0
> * /sys/class/net/eth0/statistics/rx_packets: 423468075
> * 

[ceph-users] Preparing Ceph for CBT, disk labels by-id

2015-10-21 Thread Artie Ziff
My inquiry may be a fundamental Linux question, and/or it may just need some
basic Ceph guidance.

According to the CBT ReadMe -- https://github.com/ceph/cbt

Currently CBT looks for specific partition labels in
/dev/disk/by-partlabel for the Ceph OSD data and journal partitions.
... on each OSD host, partitions should be specified with the following GPT labels:
osd-device-<N>-data
osd-device-<N>-journal


Does this mean that a disk formatted with fdisk in MBR/DOS format
style should be changed to GPT?

I've been taking some advice from peers directing me toward fdisk.
What is the recommended disk prep tool and partition format (GPT/MBR)?
Or should I be using ceph-disk exclusively (and be done with it)?

Also on the CBT ReadMe is a script that Users are encouraged to
inspect: mkpartmagna.sh
https://github.com/ceph/cbt/blob/master/tools/mkpartmagna.sh

The core task is iterating over items in directory /dev/disk/by-id.

==>  However, my /dev/disk/by-id is not populated with items.  <==

I realize this is a Linux thing... however I am not familiar with it.
When I google the topic, it appears to be called persistent block device naming.

Does the parted command create the necessary labels that CBT requires?
Is there an extra step required to make the labels appear in /dev/disk/by-id?

Are the Ceph udev rules related to this disk by-id naming?

And finally, are the Ceph udev rules a requirement for a proper
installation of Ceph?

And if you read this far... bonus question. :)

In this parted command,

parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data $sp% $ep%

What purpose do the variables $sp and $ep serve here? Or what may have been the
author's intent?

BTW, although cross-posted, I tried to set a reply-to for the CBT list
only. We'll see how it goes. Thanks in advance.
-az


Re: [ceph-users] CephFS file to rados object mapping

2015-10-21 Thread Gregory Farnum
On Wed, Oct 14, 2015 at 7:20 PM, Francois Lafont  wrote:
> Hi,
>
> On 14/10/2015 06:45, Gregory Farnum wrote:
>
>>> Ok, however during my tests I had been careful to replace the correct
>>> file by a bad file with *exactly* the same size (the content of the
>>> file was just a little string and I have changed it by a string with
>>> exactly the same size). I had been careful to undo the mtime update
>>> too (I had restore the mtime of the file before the change). Despite
>>> this, the "repair" command worked well. Tested twice: 1. with the change
>>> on the primary OSD and 2. on the secondary OSD. And I was surprised
>>> because I though the test 1. (in primary OSD) will fail.
>>
>> Hm. I'm a little confused by that, actually. Exactly what was the path
>> to the files you changed, and do you have before-and-after comparisons
>> on the content and metadata?
>
> I didn't remember exactly the process I have made so I have just retried
> today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu
> Trusty) and I have ceph Hammer (version 0.94.3). I have mounted cephfs on
> /mnt on one of the nodes.
>
> ~# cat /mnt/file.txt # yes it's a little file. ;)
> 123456
>
> ~# ls -i /mnt/file.txt
> 1099511627776 /mnt/file.txt
>
> ~# printf "%x\n" 1099511627776
> 100
>
> ~# rados -p data ls - | grep 100
> 100.
>
> I have the name of the object mapped to my "file.txt".
>
> ~# ceph osd map data 100.
> osdmap e76 pool 'data' (3) object '100.' -> pg 3.f0b56f30 
> (3.30) -> up ([1,2], p1) acting ([1,2], p1)
>
> So my object is in the primary OSD OSD-1 and in the secondary OSD OSD-2.
> So I open a terminal in the node which hosts the primary OSD OSD-1 and
> then:
>
> ~# cat 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
> 123456
>
> ~# ll 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
> -rw-r--r-- 1 root root 7 Oct 15 03:46 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
>
> Now, I change the content with this script called "change_content.sh" to
> preserve the mtime after the change:
>
> -
> #!/bin/sh
>
> f="$1"
> f_tmp="${f}.tmp"
> content="$2"
> cp --preserve=all "$f" "$f_tmp"
> echo "$content" >"$f"
> touch -r "$f_tmp" "$f" # to restore the mtime after the change
> rm "$f_tmp"
> -
>
> So, let's go, I replace the content by a new content with exactly
> the same size (ie "ABCDEF" in this example):
>
> ~# ./change_content.sh 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
>  ABCDEF
>
> ~# cat 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
> ABCDEF
>
> ~# ll 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
> -rw-r--r-- 1 root root 7 Oct 15 03:46 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
>
> Now, the secondary OSD contains the good version of the object and
> the primary a bad version. Now, I launch a "ceph pg repair":
>
> ~# ceph pg repair 3.30
> instructing pg 3.30 on osd.1 to repair
>
> # I'm in the primary OSD and the file below has been repaired correctly.
> ~# cat 
> /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
> 123456
>
> As you can see, the repair command has worked well.
> Maybe my little is too trivial?

Hmm, maybe David has some idea.

>
>>> Greg, if I understand you well, I shouldn't have too much confidence in
>>> the "ceph pg repair" command, is it correct?
>>>
>>> But, if yes, what is the good way to repair a PG?
>>
>> Usually what we recommend is for those with 3 copies to find the
>> differing copy, delete it, and run a repair — then you know it'll
>> repair from a good version. But yeah, it's not as reliable as we'd
>> like it to be on its own.
>
> I would like to be sure to well understand. The process could be (in
> the case where size == 3):
>
> 1. In each of the 3 OSDs where my object is put:
>
> md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
>
> 2. Normally, I will have the same result in 2 OSDs, and in the other
> OSD, let's call it OSD-X, the result will be different. So, in the OSD-X,
> I run:
>
> rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
>
> 3. And now I can run the "ceph pg repair" command without risk:
>
> ceph pg repair $pg_id
>
> Is it the correct process?

Yes, I would expect this to work.
-Greg


Re: [ceph-users] CephFS file to rados object mapping

2015-10-21 Thread David Zafman


See below

On 10/21/15 2:44 PM, Gregory Farnum wrote:

On Wed, Oct 14, 2015 at 7:20 PM, Francois Lafont  wrote:

Hi,

On 14/10/2015 06:45, Gregory Farnum wrote:


Ok, however during my tests I had been careful to replace the correct
file by a bad file with *exactly* the same size (the content of the
file was just a little string and I have changed it by a string with
exactly the same size). I had been careful to undo the mtime update
too (I had restore the mtime of the file before the change). Despite
this, the "repair" command worked well. Tested twice: 1. with the change
on the primary OSD and 2. on the secondary OSD. And I was surprised
because I though the test 1. (in primary OSD) will fail.

Hm. I'm a little confused by that, actually. Exactly what was the path
to the files you changed, and do you have before-and-after comparisons
on the content and metadata?

I didn't remember exactly the process I have made so I have just retried
today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu
Trusty) and I have ceph Hammer (version 0.94.3). I have mounted cephfs on
/mnt on one of the nodes.

~# cat /mnt/file.txt # yes it's a little file. ;)
123456

~# ls -i /mnt/file.txt
1099511627776 /mnt/file.txt

~# printf "%x\n" 1099511627776
100

~# rados -p data ls - | grep 100
100.

I have the name of the object mapped to my "file.txt".

~# ceph osd map data 100.
osdmap e76 pool 'data' (3) object '100.' -> pg 3.f0b56f30 (3.30) 
-> up ([1,2], p1) acting ([1,2], p1)

So my object is in the primary OSD OSD-1 and in the secondary OSD OSD-2.
So I open a terminal in the node which hosts the primary OSD OSD-1 and
then:

~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
123456

~# ll 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3

Now, I change the content with this script called "change_content.sh" to
preserve the mtime after the change:

-
#!/bin/sh

f="$1"
f_tmp="${f}.tmp"
content="$2"
cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # to restore the mtime after the change
rm "$f_tmp"
-

So, let's go, I replace the content by a new content with exactly
the same size (ie "ABCDEF" in this example):

~# ./change_content.sh 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 ABCDEF

~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
ABCDEF

~# ll 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3

Now, the secondary OSD contains the good version of the object and
the primary a bad version. Now, I launch a "ceph pg repair":

~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair

# I'm in the primary OSD and the file below has been repaired correctly.
~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
123456

As you can see, the repair command has worked well.
Maybe my little test is too trivial?

Hmm, maybe David has some idea.


As of the Hammer release, a replicated object that is written 
sequentially maintains a CRC of the entire object.  This no I/O cost CRC 
is saved with other object information like size and mtime.   So in your 
test the bad replica is identified by comparing the CRC of what is read 
off of disk with the value in the object info.


David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs best practice

2015-10-21 Thread Gregory Farnum
On Wed, Oct 21, 2015 at 3:12 PM, Erming Pei  wrote:
> Hi,
>
>   I am just wondering which use case is better: (within one single file
> system) set up one data pool for each project, or let project to share a big
> pool?

I don't think anybody has that kind of operational experience. I think
that if your projects have distinct requirements it might be worth
giving them their own pool, and right now individual pools are the
only way to enforce RADOS-level access controls or storage quotas. But
other than that I'd probably stick with just one pool.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Preparing Ceph for CBT, disk labels by-id

2015-10-21 Thread David Burley
My response got held up in moderation to the CBT list, so posting to
ceph-users and sending a copy to you as well to ensure you get it.


Artie,

I'd just use ceph-disk unless you need a config that it doesn't support.
It's a lot fewer commands, pre-tested, and it works.
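
(A hedged illustration of the ceph-disk path -- device names are placeholders, and
ceph-disk creates and labels the partitions itself:)

ceph-disk prepare /dev/sdb /dev/sdc1    # data disk, pre-made journal partition
ceph-disk activate /dev/sdb1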

That said, I had to create some journal partitions outside of ceph-disk
recently, and here's what I did:
sgdisk --new=1:0:+1M --change-name=1:"ceph journal"
--partition-guid=1:R --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106
/dev/$device

If you need more than one journal partition on the same device, modify the
above command by replacing 1: with 2:

For an OSD data partition, modify the partition size, set the change-name
to "ceph data" and the typecode UUID
to: 4fbd7e29-9d25-41b8-afd0-062c0ceff05d
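
A hedged sketch of what that data-partition command might look like -- the partition
number and the "use the rest of the device" size are assumptions, not from the
original mail:

sgdisk --new=2:0:0 --change-name=2:"ceph data" \
  --partition-guid=2:R --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/$device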

The list of typecode UUIDs is easily viewable in the ceph-disk source, here:
https://github.com/ceph/ceph/blob/master/src/ceph-disk#L83-L94

I assume this will work with CBT since it worked with Ceph OSDs.

--David

On Wed, Oct 21, 2015 at 5:34 PM, Artie Ziff  wrote:

> My inquiry may be a fundamental Linux thing and/or requiring basic
> Ceph guidance.
>
> According to the CBT ReadMe -- https://github.com/ceph/cbt
> 
> Currently CBT looks for specific partition labels in
> /dev/disk/by-partlabel for the Ceph OSD data and journal partitions.
> ...each OSD host partitions should be specified with the following gpt
> labels:
> osd-device--data
> osd-device--journal
> 
>
> Does this mean that a disk formatted with fdisk in MBR/DOS format
> style should be changed to GPT?
>
> I've been taking some advice from peers directing use of fdisk.
> What is recommended disk prep tool and partition format (GPT/MBR)?
> Or should I be using ceph-disk exclusively? (and be done with it! )
>
> Also on the CBT ReadMe is a script that Users are encouraged to
> inspect: mkpartmagna.sh
> https://github.com/ceph/cbt/blob/master/tools/mkpartmagna.sh
>
> The core task is iterating over items in directory /dev/disk/by-id.
>
> ==>  However, my /dev/disk/by-id is not populated with items.  <==
>
> I realize this is a Linux thing... however I am not familiar with it.
> When I google the topic I appears to be called persistent block device
> naming.
>
> Does the parted command create the necessary labels that CBT requires?
> Is there an extra step required to make the labels appear in
> /dev/disk/by-id
>
> Are the Ceph udev rules related to this disk by-id naming?
>
> And finally, are the Ceph udev rules a requirement for a proper
> installation of Ceph?
>
> And if you read this far... bonus question. :)
>
> In this parted command,
>
> parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data $sp% $ep%"
>
> What function/feature do the variables $sp and $ep hold for us?
> Or what may have been the author's intent?
>
> BTW, although cross-posted, I tried to set a reply-to for the CBT list
> only. We'll see how it goes. Thanks in advance.
> -az
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com



-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs best practice

2015-10-21 Thread John Spray
On Wed, Oct 21, 2015 at 11:12 PM, Erming Pei  wrote:
> Hi,
>
>   I am just wondering which use case is better: (within one single file
> system) set up one data pool for each project, or let project to share a big
> pool?

In general you want to use a single data pool.  Using multiple pools
is only useful if the pools will be somehow different, like being on
different hardware (ssds vs hdds, etc), having different replica count
etc.
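
(If a separate pool ever is needed, the rough Hammer-era steps look like this --
pool name, PG count and mount path are placeholders:)

ceph osd pool create projectA 128
ceph mds add_data_pool projectA
# New files created under this directory are then stored in the new pool:
setfattr -n ceph.dir.layout.pool -v projectA /mnt/cephfs/projectA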

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse crush

2015-10-21 Thread Gregory Farnum
On Thu, Oct 15, 2015 at 10:41 PM, 黑铁柱  wrote:
>
> cluster info:
>cluster b23b48bf-373a-489c-821a-31b60b5b5af0
>  health HEALTH_OK
>  monmap e1: 3 mons at
> {node1=192.168.0.207:6789/0,node2=192.168.0.208:6789/0,node3=192.168.0.209:6789/0},
> election epoch 24, quorum 0,1,2 node1,node2,node3
>  mdsmap e42: 2/2/1 up {0=0=up:active,1=1=up:active}, 1 up:standby
>  osdmap e474: 33 osds: 33 up, 33 in
>   pgmap v206523: 3200 pgs, 3 pools, 73443 MB data, 1505 kobjects
> 330 GB used, 8882 GB / 9212 GB avail
> 3200 active+clean
>
>
>
> ceph-client log:
> 2015-10-16 03:01:33.396095 7f63b1ffb700 -1 ./include/xlist.h: In function
> 'xlist::~xlist() [with T = ObjectCacher::Object*]' thread 7f63b1ffb700
> time 2015-10-16 03:01:33.336379
> ./include/xlist.h: 69: FAILED assert(_size == 0)
>
>  ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
>  1: ceph-fuse() [0x58fe48]
>  2: (Client::put_inode(Inode*, int)+0x3a6) [0x537f36]
>  3: (Client::_ll_put(Inode*, int)+0xa6) [0x539f86]
>  4: (Client::ll_forget(Inode*, int)+0x3ae) [0x53a84e]
>  5: ceph-fuse() [0x5275e7]
>  6: (()+0x16beb) [0x7f64e2801beb]
>  7: (()+0x13481) [0x7f64e27fe481]
>  8: (()+0x7df3) [0x7f64e2052df3]
>  9: (clone()+0x6d) [0x7f64e0f413dd]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
> --- begin dump of recent events ---
> -1> 2015-10-16 03:01:33.100899 7f63c97fa700  3 client.6429 ll_lookup
> 0x7f63c71faf20 D -> 0 (100013a8e0f)
>  -> 2015-10-16 03:01:33.100905 7f63c97fa700  3 client.6429 ll_forget
> 100013a8df3 1
>  -9998> 2015-10-16 03:01:33.100910 7f63c97fa700  3 client.6429 ll_getattr
> 100013a8e0f.head
>  -9997> 2015-10-16 03:01:33.100913 7f63c97fa700  3 client.6429 ll_getattr
> 100013a8e0f.head = 0
>  -9996> 2015-10-16 03:01:33.100916 7f63c97fa700  3 client.6429 ll_forget
> 100013a8e0f 1
>  -9995> 2015-10-16 03:01:33.100921 7f63c97fa700  3 client.6429 ll_lookup
> 0x7f64cd8fcd70 doss_web_rep
>  -9994> 2015-10-16 03:01:33.100924 7f63c97fa700  3 client.6429 ll_lookup
> 0x7f64cd8fcd70 doss_web_rep -> 0 (100013a8e10)
>  -9993> 2015-10-16 03:01:33.100928 7f63c97fa700  3 client.6429 ll_forget
> 100013a8e0f 1
>  -9992> 2015-10-16 03:01:33.100944 7f63d19ed700  3 client.6429 ll_getattr
> 100013a8e10.head
>  -9991> 2015-10-16 03:01:33.100949 7f63d19ed700  3 client.6429 ll_getattr
> 100013a8e10.head = 0
>  -9990> 2015-10-16 03:01:33.100955 7f63d19ed700  3 client.6429 ll_forget
> 100013a8e10 1
>  -9989> 2015-10-16 03:01:33.100960 7f63d19ed700  3 client.6429 ll_lookup
> 0x7f64cddee1c0 1051_SPOA3_proj
>  -9988> 2015-10-16 03:01:33.100964 7f63d19ed700  3 client.6429 ll_lookup
> 0x7f64cddee1c0 1051_SPOA3_proj -> 0 (2153d64)
>  -9987> 2015-10-16 03:01:33.100969 7f63d19ed700  3 client.6429 ll_forget
> 100013a8e10 1
>  -9986> 2015-10-16 03:01:33.100974 7f63d19ed700  3 client.6429 ll_getattr
> 2153d64.head
>  -9985> 2015-10-16 03:01:33.100979 7f63d19ed700  3 client.6429 ll_getattr
> 2153d64.head = 0
>  -9984> 2015-10-16 03:01:33.100983 7f63d19ed700  3 client.6429 ll_forget
> 2153d64 1
>  -9983> 2015-10-16 03:01:33.100987 7f63d19ed700  3 client.6429 ll_lookup
> 0x7f63c62a2a80 tags
>  -9982> 2015-10-16 03:01:33.100989 7f63d19ed700  3 client.6429 ll_lookup
> 0x7f63c62a2a80 tags -> 0 (2153d7d)
>  -9981> 2015-10-16 03:01:33.100994 7f63d19ed700  3 client.6429 ll_forget
> 2153d64 1
>  -9980> 2015-10-16 03:01:33.100999 7f63d19ed700  3 client.6429 ll_getattr
> 2153d7d.head
>  -9979> 2015-10-16 03:01:33.101002 7f63d19ed700  3 client.6429 ll_getattr
> 2153d7d.head = 0
>  -9978> 2015-10-16 03:01:33.101006 7f63d19ed700  3 client.6429 ll_forget
> 2153d7d 1
>  -9977> 2015-10-16 03:01:33.101011 7f63d19ed700  3 client.6429 ll_lookup
> 0x7f63c4a9b5a0 v20101206
>  -9976> 2015-10-16 03:01:33.101015 7f63d19ed700  3 client.6429 ll_lookup
> 0x7f63c4a9b5a0 v20101206 -> 0 (2153d7f)
>
>
> I always have this problem. How can I solve it?

I assume http://tracker.ceph.com/issues/13472 is yours, right? Can you
please upload your existing client log file to the tracker, and
reproduce this with "debug client = 20"?

Is the entire cluster running .80.10, or just the client?
-Greg
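
(A hedged aside on how that logging is typically enabled on the client side:)

[client]
debug client = 20
log file = /var/log/ceph/ceph-fuse.log    # assumption: any writable path works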
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs best practice

2015-10-21 Thread Erming Pei

Hi,

  I am just wondering which use case is better: (within one single file 
system) set up one data pool for each project, or let project to share a 
big pool?



Thanks,
Erming


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing pg and pgs

2015-10-21 Thread Paras pradhan
Thanks!

On Wed, Oct 21, 2015 at 12:52 PM, Michael Hackett 
wrote:

> Hello Paras,
>
> You pgp-num should mirror your pg-num on a pool. pgp-num is what the
> cluster will use for actual object placement purposes.
>
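
(For reference, a minimal sketch of the two-step change being discussed -- pool name
and target count are placeholders, and pg_num is normally raised in steps the
cluster will accept:)

ceph osd pool set rbd pg_num 1280
ceph osd pool set rbd pgp_num 1280
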
> - Original Message -
> From: "Paras pradhan" 
> To: "Michael Hackett" 
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, October 21, 2015 1:39:11 PM
> Subject: Re: [ceph-users] Increasing pg and pgs
>
> Thanks, Michael, for the clarification. I should set the pg_num and pgp_num on
> all the pools. Am I right? I am asking because setting the pg_num on just
> one pool already set the status to HEALTH_OK.
>
>
> -Paras.
>
> On Wed, Oct 21, 2015 at 12:21 PM, Michael Hackett 
> wrote:
>
> > Hello Paras,
> >
> > This is a limit that was added pre-firefly to prevent users from knocking
> > IO off clusters for several seconds when PG's are being split in existing
> > pools. This limit is not called into effect when creating new pools
> though.
> >
> > If you try and limit the number to
> >
> > # ceph osd pool set rbd pg_num 1280
> >
> > This should go fine as this will be at the 32 PG per OSD limit in the
> > existing pool.
> >
> > This limit is set when expanding PG's on an existing pool because splits
> > are a little more expensive for the OSD, and have to happen synchronously
> > instead of asynchronously.
> >
> > I believe Greg covered this in a previous email thread:
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041399.html
> >
> > Thanks,
> >
> > - Original Message -
> > From: "Paras pradhan" 
> > To: ceph-users@lists.ceph.com
> > Sent: Wednesday, October 21, 2015 12:31:57 PM
> > Subject: [ceph-users] Increasing pg and pgs
> >
> > Hi,
> >
> > When I check ceph health I see "HEALTH_WARN too few pgs per osd (11 < min
> > 20)"
> >
> > I have 40osds and tried to increase the pg to 2000 with the following
> > command. It says creating 1936 but not sure if it is working or not. Is
> > there a way to check the progress? It has passed more than 48hrs and I
> > still see the health warning.
> >
> > --
> >
> >
> > root@node-30:~# ceph osd pool set rbd pg_num 2000
> >
> > Error E2BIG: specified pg_num 2000 is too large (creating 1936 new PGs on
> > ~40 OSDs exceeds per-OSD max of 32)
> >
> > --
> >
> >
> >
> >
> > Thanks in advance
> >
> > Paras.
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > --
> > Michael Hackett
> > Software Maintenance Engineer CEPH Storage
> > Phone: 1-978-399-2196
> > Westford, MA
> >
> >
>
> --
> Michael Hackett
> Software Maintenance Engineer CEPH Storage
> Phone: 1-978-399-2196
> Westford, MA
>
> Hello
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and page cache

2015-10-21 Thread Gregory Farnum
On Sun, Oct 18, 2015 at 8:27 PM, Yan, Zheng  wrote:
> On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
>  wrote:
>> Hi,
>>
>> I've noticed that CephFS (both ceph-fuse and kernel client in version 4.2.3)
>> remove files from page cache as soon as they are not in use by a process
>> anymore.
>>
>> Is this intended behaviour? We use CephFS as a replacement for NFS in our
>> HPC cluster. It should serve large files which are read by multiple jobs on
>> multiple hosts, so keeping them in the page cache over the duration of
>> several job invocations is crucial.
>
> Yes. MDS needs resource to track the cached data. We don't want MDS
> use too much resource.

So if I'm reading things right, the code to drop the page cache for
ceph-fuse was added in https://github.com/ceph/ceph/pull/1594
(specifically 82015e409d09701a7048848f1d4379e51dd00892). I don't think
it's actually needed for the cap trimming stuff or to prevent MDS
cache pressure and it's actually not clear to me why it was added here
anyway. But you do say the PR as a whole fixed a lot of bugs. Do you
know if the page cache clearing was for any bugs in particular, Zheng?

In general I think proactively clearing the page cache is something we
really only want to do as part of our consistency and cap handling
story, and file closes don't really play into that. I've pushed a
TOTALLY UNTESTED (NOT EVEN COMPILED) branch client-pagecache-norevoke
based on master to the gitbuilders. If it does succeed in building you
should be able to download it and you can use it for testing, or
cherry-pick the top commit out of git and build your own packages.
Then set the (new to this branch) client_preserve_pagecache config
option to true (default: false) and it should avoid flushing the page
cache.
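
(Presumably -- this is an assumption, not something from the branch itself -- the
option would be set in the client's ceph.conf like any other client option:)

[client]
client preserve pagecache = true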

But there might be (probably are?) bugs as a result of that. No idea.
Use at your own risk. But let us know if it makes things better for
you.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Core dump when running OSD service

2015-10-21 Thread James O'Neill
I have an OSD that didn't come up after a reboot. I was getting the 
error shown below. It was running 0.94.3, so I reinstalled all packages. 
I then upgraded everything to 0.94.4 hoping that would fix it but it 
hasn't. There are three OSDs, this is the only one having problems (it 
also contains the inconsistent pgs). Can anyone tell me what the 
problem might be?



root@dbp-ceph03:/srv/data# ceph status
   cluster 4f6fb784-bd17-4105-a689-e8d1b4bc5643
health HEALTH_ERR
   53 pgs inconsistent
   542 pgs stale
   542 pgs stuck stale
   5 requests are blocked > 32 sec
   85 scrub errors
   too many PGs per OSD (544 > max 300)
   noout flag(s) set
monmap e3: 3 mons at 
{dbp-ceph01=172.17.241.161:6789/0,dbp-ceph02=172.17.241.162:6789/0,dbp-ceph03=172.17.241.163:6789/0}
   election epoch 52, quorum 0,1,2 
dbp-ceph01,dbp-ceph02,dbp-ceph03

osdmap e107: 2 osds: 2 up, 2 in
   flags noout
 pgmap v65678: 1088 pgs, 9 pools, 55199 kB data, 173 objects
   2265 MB used, 16580 MB / 19901 MB avail
546 active+clean
489 stale+active+clean
 53 stale+active+clean+inconsistent


root@dbp-ceph02:~# /usr/bin/ceph-osd --cluster=ceph -i 1 -d
2015-10-22 14:15:48.312507 7f4edabec900 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-osd, pid 31215
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2015-10-22 14:15:48.352013 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) backend generic (magic 0xef53)
2015-10-22 14:15:48.355621 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-10-22 14:15:48.355655 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-22 14:15:48.362016 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-22 14:15:48.372819 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) limited size xattrs
2015-10-22 14:15:48.387002 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal 
mode: checkpoint is not enabled
2015-10-22 14:15:48.394002 7f4edabec900 -1 journal FileJournal::_open: 
disabling aio for non-block journal. Use journal_force_aio to force use 
of aio anyway
2015-10-22 14:15:48.397803 7f4edabec900 0  
cls/hello/cls_hello.cc:271: loading cls_hello
terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

 what(): buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f4edabec900
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
2015-10-22 14:15:48.412520 7f4edabec900 -1 *** Caught signal (Aborted) 
**

in thread 7f4edabec900

ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


--- begin dump of recent events ---
  -61> 2015-10-22 14:15:48.308047 7f4edabec900 5 asok(0x5648000) 
register_command perfcounters_dump hook 0x55e8050
  -60> 2015-10-22 14:15:48.308138 7f4edabec900 5 asok(0x5648000) 
register_command 1 hook 0x55e8050
  -59> 2015-10-22 14:15:48.308164 7f4edabec900 5 asok(0x5648000) 
register_command perf dump hook 0x55e8050
  

Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-21 Thread Wido den Hollander
On 10/20/2015 07:44 PM, Mark Nelson wrote:
> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
>> Hi,
>>
>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>> bcache in production and Mark Nelson asked me to share some details.
>>
>> Bcache is running in two clusters now that I manage, but I'll keep this
>> information to one of them (the one at PCextreme behind CloudStack).
>>
>> In this cluster has been running for over 2 years now:
>>
>> epoch 284353
>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>> created 2013-09-23 11:06:11.819520
>> modified 2015-10-20 15:27:48.734213
>>
>> The system consists out of 39 hosts:
>>
>> 2U SuperMicro chassis:
>> * 80GB Intel SSD for OS
>> * 240GB Intel S3700 SSD for Journaling + Bcache
>> * 6x 3TB disk
>>
>> This isn't the newest hardware. The next batch of hardware will be more
>> disks per chassis, but this is it for now.
>>
>> All systems were installed with Ubuntu 12.04, but they are all running
>> 14.04 now with bcache.
>>
>> The Intel S3700 SSD is partitioned with a GPT label:
>> - 5GB Journal for each OSD
>> - 200GB Partition for bcache
>>
>> root@ceph11:~# df -h|grep osd
>> /dev/bcache0  2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
>> /dev/bcache1  2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
>> /dev/bcache2  2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
>> /dev/bcache3  2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
>> /dev/bcache4  2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
>> /dev/bcache5  2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
>> root@ceph11:~#
>>
>> root@ceph11:~# lsb_release -a
>> No LSB modules are available.
>> Distributor ID:Ubuntu
>> Description:Ubuntu 14.04.3 LTS
>> Release:14.04
>> Codename:trusty
>> root@ceph11:~# uname -r
>> 3.19.0-30-generic
>> root@ceph11:~#
>>
>> "apply_latency": {
>>  "avgcount": 2985023,
>>  "sum": 226219.891559000
>> }
>>
>> What did we notice?
>> - Less spikes on the disk
>> - Lower commit latencies on the OSDs
>> - Almost no 'slow requests' during backfills
>> - Cache-hit ratio of about 60%
>>
>> Max backfills and recovery active are both set to 1 on all OSDs.
>>
>> For the next generation hardware we are looking into using 3U chassis
>> with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
>> tested those yet, so nothing to say about it.
>>
>> The current setup is 200GB of cache for 18TB of disks. The new setup
>> will be 1200GB for 64TB, curious to see what that does.
>>
>> Our main conclusion however is that it does smoothen the I/O-pattern
>> towards the disks and that gives a overall better response of the disks.
> 
> Hi Wido, thanks for the big writeup!  Did you guys happen to do any
> benchmarking?  I think Xiaoxi looked at flashcache a while back but had
> mixed results if I remember right.  It would be interesting to know how
> bcache is affecting performance in different scenarios.
> 

No, we didn't do any benchmarking. Initially this cluster was built for
just the RADOS Gateway, so we went for 2Gbit (2x 1Gbit) per machine. 90%
is still Gbit networking and we are in the process of upgrading it all
to 10Gbit.

Since the 1Gbit network latency is about 4 times higher than 10Gbit, we
aren't really benchmarking the cluster.

What counts for us most is that we can do recovery operations without
any slow requests.

Before bcache we saw disks spike to 100% busy while a backfill was busy.
Now bcache smoothens this and we see peaks of maybe 70%, but that's it.
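
(As an aside, not Wido's exact procedure: a rough sketch of how a bcache-backed OSD
device is typically assembled with bcache-tools -- device names are placeholders:)

make-bcache -C /dev/sdb3     # SSD partition becomes the cache device
make-bcache -B /dev/sdc      # data disk becomes the backing device -> /dev/bcache0
# attach the backing device to the cache set (UUID from 'bcache-super-show /dev/sdb3'):
echo $CACHE_SET_UUID > /sys/block/bcache0/bcache/attach
# /dev/bcache0 is then formatted (e.g. XFS) and mounted as the OSD data directory.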

> Thanks,
> Mark
> 
>>
>> Wido
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help with Bug #12738: scrub bogus results when missing a clone

2015-10-21 Thread Chris Taylor
 

Is there some way to manually correct this error while this bug is still
needing review? I have one PG that is stuck inconsistent with the same
error. I already created a new RBD image and migrated the data to it.
The original RBD image was "rb.0.ac3386.238e1f29". The new image is
"rb.0.bfcb12.238e1f29". 

2015-10-20 19:18:07.686783 7f50e4c1d700 0 log_channel(cluster) log [INF]
: 8.e82 repair starts
2015-10-20 19:18:40.300721 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82
1fc8ce82/rb.0.ac3386.238e1f29.0008776e/snapdir//8 missing clones
2015-10-20 19:18:40.301094 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 9cc8ce82/rb.0.bfcb12.238e1f29.002acd39/head//8
expected clone 1fc8ce82/rb.0.ac3386.238e1f29.0008776e/44//8
2015-10-20 19:18:40.301124 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 fb78ce82/rb.0.bfcb12.238e1f29.000e69a3/head//8
expected clone 9cc8ce82/rb.0.bfcb12.238e1f29.002acd39/44//8
2015-10-20 19:18:40.301140 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 8038ce82/rb.0.bfcb12.238e1f29.002b7781/head//8
expected clone fb78ce82/rb.0.bfcb12.238e1f29.000e69a3/44//8
2015-10-20 19:18:40.301155 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 c8b7ce82/rb.0.bfcb12.238e1f29.00059252/head//8
expected clone 8038ce82/rb.0.bfcb12.238e1f29.002b7781/44//8
2015-10-20 19:18:40.301170 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 9d26ce82/rb.0.bfcb12.238e1f29.000cd86d/head//8
expected clone c8b7ce82/rb.0.bfcb12.238e1f29.00059252/44//8
2015-10-20 19:18:40.301185 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 c006ce82/rb.0.bfcb12.238e1f29.000c53d6/head//8
expected clone 9d26ce82/rb.0.bfcb12.238e1f29.000cd86d/44//8
2015-10-20 19:18:40.301200 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : repair 8.e82 3434ce82/rb.0.bfcb12.238e1f29.002cb957/head//8
expected clone c006ce82/rb.0.bfcb12.238e1f29.000c53d6/44//8
2015-10-20 19:18:47.724047 7f50e4c1d700 -1 log_channel(cluster) log
[ERR] : 8.e82 repair 8 errors, 0 fixed

Thanks, 

Chris 

 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-21 Thread Wido den Hollander
On 10/20/2015 09:45 PM, Martin Millnert wrote:
> The thing that worries me with your next-gen design (actually your current 
> design aswell) is SSD wear. If you use Intel SSD at 10 DWPD, that's 12TB/day 
> per 64TB total.  I guess use case dependant,  and perhaps 1:4 write read 
> ratio is quite high in terms of writes as-is.
> You're also throughput-limiting yourself to the pci-e bw of the NVME device 
> (regardless of NVRAM/SSD). Compared to traditonal interface, that may be ok 
> of course in relative terms. NVRAM vs SSD here is simply a choice between 
> wear (NVRAM as journal minimum), and cache hit probability (size).  
> Interesting thought experiment anyway for me, thanks for sharing Wido.
> /M

We are looking at the PC 3600DC 1.2TB, according to the specs from
Intel: 10.95PBW

Like I mentioned in my reply to Mark, we are still running on 1Gbit and
heading towards 10Gbit.

Bandwidth isn't really an issue in our cluster. During peak moments we
average about 30k IOps through the cluster, but the TOTAL client I/O is
just 1Gbit Read and Write. Sometimes a bit higher, but mainly small I/O.

Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
lower latency and thus more IOps.

Currently our S3700 SSDs are peaking at 50% utilization according to iostat.

After 2 years of operation the lowest Media_Wearout_Indicator we see is
33. On Intel SSDs this starts at 100 and counts down to 0. 0 indicating
that the SSD is worn out.

So in 24 months we have worn through 67% of the SSD. A quick calculation
tells me we still have 12 months left on that SSD before it dies.
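
(The arithmetic behind that estimate, spelled out -- the wearout indicator counts
down from 100, so 67 points were consumed in 24 months:)

awk 'BEGIN { print 33 / (67 / 24) }'    # ~11.8 months of wear budget left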

But this is the lowest, other SSDs which were taken into production at
the same moment are ranging between 36 and 61.

Also, when buying the 1.2TB SSD we'll probably allocate only 1TB of the
SSD and leave 200GB of cells spare so the Wear-Leveling inside the SSD
has some spare cells.

Wido

> 
>  Original message 
> From: Wido den Hollander  
> Date: 20/10/2015  16:00  (GMT+01:00) 
> To: ceph-users  
> Subject: [ceph-users] Ceph OSDs with bcache experience 
> 
> Hi,
> 
> In the "newstore direction" thread on ceph-devel I wrote that I'm using
> bcache in production and Mark Nelson asked me to share some details.
> 
> Bcache is running in two clusters now that I manage, but I'll keep this
> information to one of them (the one at PCextreme behind CloudStack).
> 
> In this cluster has been running for over 2 years now:
> 
> epoch 284353
> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
> created 2013-09-23 11:06:11.819520
> modified 2015-10-20 15:27:48.734213
> 
> The system consists out of 39 hosts:
> 
> 2U SuperMicro chassis:
> * 80GB Intel SSD for OS
> * 240GB Intel S3700 SSD for Journaling + Bcache
> * 6x 3TB disk
> 
> This isn't the newest hardware. The next batch of hardware will be more
> disks per chassis, but this is it for now.
> 
> All systems were installed with Ubuntu 12.04, but they are all running
> 14.04 now with bcache.
> 
> The Intel S3700 SSD is partitioned with a GPT label:
> - 5GB Journal for each OSD
> - 200GB Partition for bcache
> 
> root@ceph11:~# df -h|grep osd
> /dev/bcache0  2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
> /dev/bcache1  2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
> /dev/bcache2  2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
> /dev/bcache3  2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
> /dev/bcache4  2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
> /dev/bcache5  2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
> root@ceph11:~#
> 
> root@ceph11:~# lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 14.04.3 LTS
> Release:  14.04
> Codename: trusty
> root@ceph11:~# uname -r
> 3.19.0-30-generic
> root@ceph11:~#
> 
> "apply_latency": {
> "avgcount": 2985023,
> "sum": 226219.891559000
> }
> 
> What did we notice?
> - Less spikes on the disk
> - Lower commit latencies on the OSDs
> - Almost no 'slow requests' during backfills
> - Cache-hit ratio of about 60%
> 
> Max backfills and recovery active are both set to 1 on all OSDs.
> 
> For the next generation hardware we are looking into using 3U chassis
> with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
> tested those yet, so nothing to say about it.
> 
> The current setup is 200GB of cache for 18TB of disks. The new setup
> will be 1200GB for 64TB, curious to see what that does.
> 
> Our main conclusion however is that it does smoothen the I/O-pattern
> towards the disks and that gives a overall better response of the disks.
> 
> Wido
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] planet.ceph.com

2015-10-21 Thread Patrick McGarry
Hey Luis,

The planet was broken as a result of the new site (although it will be
rejuvenated in the ceph.com rebuild). The redirect to a dental site
was a DNS problem that has since been fixed. Thanks!



On Tue, Oct 20, 2015 at 4:21 AM, Luis Periquito  wrote:
> Hi,
>
> I was looking for some ceph resources and saw a reference to
> planet.ceph.com. However when I opened it I was sent to a dental
> clinic (?). That doesn't sound right, does it?
>
> I was at this page when I saw the reference...
>
> thanks
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread Alexandre DERUMIER
can you send me also your ceph.conf ?

do you have a ceph.conf on the vms hosts too ?


- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 10:31:56
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 

qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 

vm config: [the libvirt XML for the VM was stripped by the list archive]


Thanks! 

hzwuli...@gmail.com 



From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using ceph osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 43 0 0| 0 18M| 582B 3208B| 0 0 |3467 7061 
Finally, using iptraf to check the package size in the VM, almost packages's 
size are around 1 to 70 and 71 to 140 bytes. That's different from real 
machine. 
But maybe iptraf on the VM can't prove anything, i check the real machine which 
the VM located on. 
It seems no abnormal. 
BTW, my VM is located on the ceph storage node. 
Anyone can give me more sugestions? 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-20 19:36 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I'm able to reach around same performance with qemu-librbd vs qemu-krbd, 
when I compile qemu with jemalloc 
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363)
 
on my test, librbd with jemalloc still use 2x more cpu than krbd, 
so cpu could be bottleneck too. 
with fasts cpu (3.1ghz), I'm able to reach around 70k iops 4k with rbd volume, 
both with krbd or librbd 
- Mail original - 
De: hzwuli...@gmail.com 
À: "ceph-users"  
Envoyé: Mardi 20 Octobre 2015 10:22:33 
Objet: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine. 
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public 
& cluster networks) 
2. volume1: use rbd create a 100G volume from the ssd pool and map to the real 
machine 
3. volume2: use cinder create a 100G volume form the ssd pool and atach to a 
guest host 
4. disable rbd cache 
5. fio test on the two volues: 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 
volume1 got about 24k IOPS and volume2 got about 14k IOPS. 
We could see the performance of volume2 is not good compared to volume1, so 

Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread hzwuli...@gmail.com
Hi, 
Yeah, I have the ceph.conf on the real machine which the VM is located on.
A simple configuration, -:)
[global]
fsid = ***
mon_initial_members = *, *, *
mon_host = *, *, *
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

I changed the configuration online; let me post it:
"journal_queue_max_ops": "3000",
"journal_queue_max_bytes": "1048576000",
"journal_max_corrupt_search": "10485760",
"journal_max_write_bytes": "1048576000",
"journal_max_write_entries": "1000",
"filestore_queue_max_ops": "500",
"filestore_queue_max_bytes": "104857600",
"filestore_queue_committing_max_ops": "5000",
"filestore_queue_committing_max_bytes": "1048576000",
"filestore_max_inline_xattr_size": "254",
"filestore_max_inline_xattr_size_xfs": "65536",
"filestore_max_inline_xattr_size_btrfs": "2048",
"filestore_max_inline_xattr_size_other": "512",
"filestore_max_inline_xattrs": "6",
"filestore_max_inline_xattrs_xfs": "10",
"filestore_max_inline_xattrs_btrfs": "10",
"filestore_max_inline_xattrs_other": "2",
"filestore_max_alloc_hint_size": "1048576",
"filestore_max_sync_interval": "10",
"osd_op_num_shards": "10",

But anyway, from my tests, the configuration has little impact on performance.

Btw, ceph version, 0.94.3 hammer
Thanks!


hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-21 17:12
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
can you send me also your ceph.conf ?
 
do you have a ceph.conf on the vms hosts too ?
 
 
- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 10:31:56
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
 
vm config: [the libvirt XML for the VM was stripped by the list archive] 
 
Thanks! 
 
hzwuli...@gmail.com 
 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using cep osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 43 0 0| 0 18M| 582B 3208B| 0 0 |3467 7061 
Finally, using iptraf to check the package size in the VM, almost packages's 
size are around 1 to 70 and 71 to 140 bytes. That's different from real 
machine. 
But maybe iptraf on the VM can't prove anything, i check the real machine which 
the VM located on. 
It seems no abnormal. 
BTW, my VM is located on the 

Re: [ceph-users] v0.94.4 Hammer released

2015-10-21 Thread Christoph Adomeit
Hi there,

I was hoping for the following changes in the 0.94.4 release:

-Stable Object Maps for faster Image Handling (Backups, Diffs, du etc). 
-Link against better Malloc implementation like jemalloc

Does 0.94.4 bring any improvements in these areas?

Thanks
  Christoph



On Mon, Oct 19, 2015 at 02:07:39PM -0700, Sage Weil wrote:
> This Hammer point fixes several important bugs in Hammer, as well as
> fixing interoperability issues that are required before an upgrade to

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread Alexandre DERUMIER
>>But, anyway, from my test, the configuration impact less for the performance.

A quick win: disable cephx and debug logging:

[global] 
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_journaler = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
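
(A hedged aside: these debug settings can usually also be injected at runtime,
without restarting the daemons, e.g.:)

ceph tell osd.* injectargs '--debug_ms 0/0 --debug_osd 0/0 --debug_filestore 0/0'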



- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 11:20:09
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, 
Yeah, i have the ceph.conf on the real machine which VM located on. 
A simple configuration, -:) 
[global] 
fsid = *** 
mon_initial_members = *, *, * 
mon_host = *, *, * 
auth_cluster_required = cephx 
auth_service_required = cephx 
auth_client_required = cephx 
filestore_xattr_use_omap = true 

I change the configuration on line, let me post it: 
"journal_queue_max_ops": "3000", 
"journal_queue_max_bytes": "1048576000", 
"journal_max_corrupt_search": "10485760", 
"journal_max_write_bytes": "1048576000", 
"journal_max_write_entries": "1000", 
"filestore_queue_max_ops": "500", 
"filestore_queue_max_bytes": "104857600", 
"filestore_queue_committing_max_ops": "5000", 
"filestore_queue_committing_max_bytes": "1048576000", 
"filestore_max_inline_xattr_size": "254", 
"filestore_max_inline_xattr_size_xfs": "65536", 
"filestore_max_inline_xattr_size_btrfs": "2048", 
"filestore_max_inline_xattr_size_other": "512", 
"filestore_max_inline_xattrs": "6", 
"filestore_max_inline_xattrs_xfs": "10", 
"filestore_max_inline_xattrs_btrfs": "10", 
"filestore_max_inline_xattrs_other": "2", 
"filestore_max_alloc_hint_size": "1048576", 
"filestore_max_sync_interval": "10", 
"osd_op_num_shards": "10", 

But, anyway, from my test, the configuration impact less for the performance. 

Btw, ceph version, 0.94.3 hammer 
Thanks! 

hzwuli...@gmail.com 



From: Alexandre DERUMIER 
Date: 2015-10-21 17:12 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
can you send me also your ceph.conf ? 
do you have a ceph.conf on the vms hosts too ? 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 10:31:56 
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 
qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 
vm config: [the libvirt XML for the VM was stripped by the list archive] 
 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using cep osd perf to check the osd latency, all less 

Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread Alexandre DERUMIER
here is a libvirt sample to enable iothreads:

[the XML sample was stripped by the list archive; only the iothread count ("2") survives]

With this, you can scale with multiple disks. (but it should help a little bit 
with 1 disk too)
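
(Based on libvirt's documented iothreads syntax, the stripped sample presumably
looked roughly like this -- the disk source and names are placeholders, not the
original values:)

<domain>
  <iothreads>2</iothreads>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
      <source protocol='rbd' name='pool/vm-disk-1'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>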


- Mail original -
De: hzwuli...@gmail.com
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mercredi 21 Octobre 2015 10:31:56
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, 
let me post the version and configuration here first. 
host os: debian 7.8 kernel: 3.10.45 
guest os: debian 7.8 kernel: 3.2.0-4 

qemu version: 
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM 
images for qemu 
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 
hardware 
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (common files) 
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation 
binaries (x86) 
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities 

vm config: [the libvirt XML for the VM was stripped by the list archive] 


Thanks! 

hzwuli...@gmail.com 



From: Alexandre DERUMIER 
Date: 2015-10-21 14:01 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Damn, that's a huge difference. 
What is your host os, guest os , qemu version and vm config ? 
As an extra boost, you could enable iothread on virtio disk. 
(It's available on libvirt but not on openstack yet). 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor 
https://www.proxmox.com 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...) 
- Mail original - 
De: hzwuli...@gmail.com 
À: "aderumier"  
Cc: "ceph-users"  
Envoyé: Mercredi 21 Octobre 2015 06:11:20 
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
Thanks for you reply. 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
could achive about 18k iops. 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
Using cep osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 43 0 0| 0 18M| 582B 3208B| 0 0 |3467 7061 
Finally, using iptraf to check the package size in the VM, almost packages's 
size are around 1 to 70 and 71 to 140 bytes. That's different from real 
machine. 
But maybe iptraf on the VM can't prove anything, i check the real machine which 
the VM located on. 
It seems no abnormal. 
BTW, my VM is located on the ceph storage node. 
Anyone can give me more sugestions? 
Thanks! 
hzwuli...@gmail.com 
From: Alexandre DERUMIER 
Date: 2015-10-20 19:36 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I'm able to reach around same performance with qemu-librbd vs qemu-krbd, 
when I compile qemu with jemalloc 
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363)
 
on my test, librbd with jemalloc still use 2x more cpu than krbd, 
so cpu could be bottleneck too. 
with fasts cpu (3.1ghz), I'm able to reach around 70k iops 4k with rbd volume, 
both with krbd or librbd 
- Mail original - 
De: hzwuli...@gmail.com 
À: "ceph-users"  
Envoyé: Mardi 20 Octobre 2015 10:22:33 
Objet: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine. 
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public 
& cluster networks) 
2. volume1: use rbd create a 100G volume from the ssd pool and map to the real 
machine 
3. volume2: use cinder create a 100G volume form the ssd pool and atach to a 
guest host 
4. disable rbd cache 
5. fio test on the two volues: 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 

Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-21 Thread Jan Schermer

> On 21 Oct 2015, at 09:11, Wido den Hollander  wrote:
> 
> On 10/20/2015 09:45 PM, Martin Millnert wrote:
>> The thing that worries me with your next-gen design (actually your current 
>> design aswell) is SSD wear. If you use Intel SSD at 10 DWPD, that's 12TB/day 
>> per 64TB total.  I guess use case dependant,  and perhaps 1:4 write read 
>> ratio is quite high in terms of writes as-is.
>> You're also throughput-limiting yourself to the pci-e bw of the NVME device 
>> (regardless of NVRAM/SSD). Compared to traditonal interface, that may be ok 
>> of course in relative terms. NVRAM vs SSD here is simply a choice between 
>> wear (NVRAM as journal minimum), and cache hit probability (size).  
>> Interesting thought experiment anyway for me, thanks for sharing Wido.
>> /M
> 
> We are looking at the PC 3600DC 1.2TB, according to the specs from
> Intel: 10.95PBW
> 
> Like I mentioned in my reply to Mark, we are still running on 1Gbit and
> heading towards 10Gbit.
> 
> Bandwidth isn't really a issue in our cluster. During peak moments we
> average about 30k IOps through the cluster, but the TOTAL client I/O is
> just 1Gbit Read and Write. Sometimes a bit higher, but mainly small I/O.
> 
> Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
> lower latency and thus more IOps.
> 
> Currently our S3700 SSDs are peaking at 50% utilization according to iostat.
> 
> After 2 years of operation the lowest Media_Wearout_Indicator we see is
> 33. On Intel SSDs this starts at 100 and counts down to 0. 0 indicating
> that the SSD is worn out.
> 
> So in 24 months we have worn through 67% of the SSD. A quick calculation
> tells me we still have 12 months left on that SSD before it dies.

Could you maybe run isdct and compare what it says about expected lifetime? I 
think isdct will report a much longer lifetime than you expect.

For comparison one of my drives (S3610, 1.2TB) - this drive has 3 DWPD rating 
(~6.5PB written)

241 Total_LBAs_Written  0x0032   100   100   000Old_age   Always   
-   1487714 <-- units of 32MB, that translates to ~47TB
233 Media_Wearout_Indicator 0x0032   100   100   000Old_age   Always   
-   0 (maybe my smartdb needs updating, but this is what it says)
9 Power_On_Hours  0x0032   100   100   000Old_age   Always   -  
 1008

If I extrapolate this blindly I would expect the SSD to reach it's TBW of 6.5PB 
in about 15 years.

But isdct says:
EnduranceAnalyzer: 46.02 Years

If I reverse it and calculate the endurance based on the SMART values, that would 
give an expected endurance of over 18PB written (which is not impossible at all), 
but isdct is a bit smarter and looks at what the current use pattern is. It's 
clearly not only about discarding the initial bursts when the drive was filled 
during backfilling, because that's not that much, and all my S3610 drives indicate 
a similar endurance of 40 years (+-10).

I'd trust isdct over extrapolated SMART values - I think the SSD will actually 
switch to a different calculation scheme when it reaches a certain life point 
(when all reserve blocks are used, or when the first cells start to die...), which 
is why there's a discrepancy.

Jan


> 
> But this is the lowest, other SSDs which were taken into production at
> the same moment are ranging between 36 and 61.
> 
> Also, when buying the 1.2TB SSD we'll probably allocate only 1TB of the
> SSD and leave 200GB of cells spare so the Wear-Leveling inside the SSD
> has some spare cells.
> 
> Wido
> 
>> 
>>  Original message 
>> From: Wido den Hollander  
>> Date: 20/10/2015  16:00  (GMT+01:00) 
>> To: ceph-users  
>> Subject: [ceph-users] Ceph OSDs with bcache experience 
>> 
>> Hi,
>> 
>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>> bcache in production and Mark Nelson asked me to share some details.
>> 
>> Bcache is running in two clusters now that I manage, but I'll keep this
>> information to one of them (the one at PCextreme behind CloudStack).
>> 
>> In this cluster has been running for over 2 years now:
>> 
>> epoch 284353
>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>> created 2013-09-23 11:06:11.819520
>> modified 2015-10-20 15:27:48.734213
>> 
>> The system consists out of 39 hosts:
>> 
>> 2U SuperMicro chassis:
>> * 80GB Intel SSD for OS
>> * 240GB Intel S3700 SSD for Journaling + Bcache
>> * 6x 3TB disk
>> 
>> This isn't the newest hardware. The next batch of hardware will be more
>> disks per chassis, but this is it for now.
>> 
>> All systems were installed with Ubuntu 12.04, but they are all running
>> 14.04 now with bcache.
>> 
>> The Intel S3700 SSD is partitioned with a GPT label:
>> - 5GB Journal for each OSD
>> - 200GB Partition for bcache
>> 
>> root@ceph11:~# df -h|grep osd
>> /dev/bcache02.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
>> /dev/bcache12.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
>> /dev/bcache22.8T  930G  1.9T  

[ceph-users] disable cephx signing

2015-10-21 Thread Corin Langosch
Hi,

we have cephx authentication and signing enabled. For performance reasons we'd 
like to keep auth but disable signing.
Is this possible without service interruption and without having to restart the 
qemu rbd clients? Is it enough to just adapt ceph.conf and then restart the mons 
followed by the osds?
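
For reference, these are the signing-related options I'm thinking of changing in 
[global] (assuming these are still the right knobs in our release - please correct 
me if not):

[global]
    cephx require signatures = false
    cephx cluster require signatures = false
    cephx service require signatures = false
    cephx sign messages = false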

Thanks
Corin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [urgent] KVM issues after upgrade to 0.94.4

2015-10-21 Thread Andrei Mikhailovsky
Hello guys, 

I've upgraded to the latest Hammer release and I've just noticed a massive 
issue after the upgrade ((( 

I am using ceph for virtual machine rbd storage over cloudstack. I am having 
issues with starting virtual routers. The libvirt error message is: 


cat r-1407-VM.log 
2015-10-21 11:04:59.262+: starting up 
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin 
QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name r-1407-VM -S -machine 
pc-i440fx-trusty,accel=kvm,usb=off -m 256 -realtime mlock=off -smp 
1,sockets=1,cores=1,threads=1 -uuid 815d2860-cc7f-475d-bf63-02814c720fe4 
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/r-1407-VM.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive 
file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2
 -drive 
file=/usr/share/cloudstack-common/vms/systemvm.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none
 -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 
-netdev tap,fd=54,id=hostnet0,vhost=on,vhostfd=55 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:2e:f7:00:18,bus=pci.0,addr=0x3,rombar=0,romfile=
 -netdev tap,fd=56,id=hostnet1,vhost=on,vhostfd=57 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=0e:00:a9:fe:01:42,bus=pci.0,addr=0x4,rombar=0,romfile=
 -netdev tap,fd=58,id=hostnet2,vhost=on,vhostfd=59 -device 
virtio-net-pci,netdev=hostnet2,id=net2,mac=06:0c:b6:00:02:13,bus=pci.0,addr=0x5,rombar=0,romfile=
 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/r-1407-VM.agent,server,nowait 
-device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=r-1407-VM.vport
 -device usb-tablet,id=input0 -vnc 192.168.169.2:10,password -device 
cirrus-vga,id=video0,bus=pci.0,addr=0x2 
Domain id=42 is tainted: high-privileges 
libust[20136/20136]: Warning: HOME environment variable not set. Disabling 
LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) 
char device redirected to /dev/pts/13 (label charserial0) 
librbd/LibrbdWriteback.cc: In function 'virtual ceph_tid_t 
librbd::LibrbdWriteback::write(const object_t&, const object_locator_t&, 
uint64_t, uint64_t, const SnapContext&, const bufferlist&, utime_t, uint64_t, 
__u32, Context*)' thread 7ffa6b7fe700 time 2015-10-21 12:05:07.901876 
librbd/LibrbdWriteback.cc: 160: FAILED assert(m_ictx->owner_lock.is_locked()) 
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a) 
1: (()+0x17258b) [0x7ffa92ef758b] 
2: (()+0xa9573) [0x7ffa92e2e573] 
3: (()+0x3a90ca) [0x7ffa9312e0ca] 
4: (()+0x3b583d) [0x7ffa9313a83d] 
5: (()+0x7212c) [0x7ffa92df712c] 
6: (()+0x9590f) [0x7ffa92e1a90f] 
7: (()+0x969a3) [0x7ffa92e1b9a3] 
8: (()+0x4782a) [0x7ffa92dcc82a] 
9: (()+0x56599) [0x7ffa92ddb599] 
10: (()+0x7284e) [0x7ffa92df784e] 
11: (()+0x162b7e) [0x7ffa92ee7b7e] 
12: (()+0x163c10) [0x7ffa92ee8c10] 
13: (()+0x8182) [0x7ffa8ec49182] 
14: (clone()+0x6d) [0x7ffa8e97647d] 
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this. 
terminate called after throwing an instance of 'ceph::FailedAssertion' 
2015-10-21 11:05:08.091+: shutting down 

From what I can see, there seems to be an issue with locking 
(librbd/LibrbdWriteback.cc: 160: FAILED assert(m_ictx->owner_lock.is_locked())). 
However, the r-1407-VM virtual router is a new router and has not been created or 
run before, so I don't see why there is an issue with locking. 

Could someone please help me determine the cause of the error and how to fix 
it? I've not seen this on 0.94.1. 

Many thanks 


Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread Alexandre DERUMIER
Damn, that's a huge difference.

What is your host OS, guest OS, qemu version and VM config?



As an extra boost, you could enable an iothread on the virtio disk 
(it's available in libvirt but not in OpenStack yet).
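
A minimal sketch of what that looks like in the libvirt domain XML (assuming 
libvirt >= 1.2.8 and QEMU >= 2.1; adapt it to your existing disk definition):

<domain>
  ...
  <iothreads>1</iothreads>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none' iothread='1'/>
      ...
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>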

If it's a test server, maybe you could test it with the Proxmox 4.0 hypervisor:
https://www.proxmox.com

I have made a lot of patches in it to optimize rbd (qemu+jemalloc, 
iothreads, ...).


- Original message -
From: hzwuli...@gmail.com
To: "aderumier" 
Cc: "ceph-users" 
Sent: Wednesday 21 October 2015 06:11:20
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, 

Thanks for you reply. 

I did more tests here and things got stranger; now I can only get about 
4k IOPS in the VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 

[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 

could achive about 18k iops. 

2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 

[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 

Using ceph osd perf to check the osd latency: all less than 1 ms. 
Using iostat to check the osd %util: about 10 in the case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 43 0 0| 0 18M| 582B 3208B| 0 0 |3467 7061 

Finally, using iptraf to check the packet sizes in the VM, almost all packets 
are in the 1 to 70 and 71 to 140 byte ranges. That's different from the real 
machine. 

But maybe iptraf on the VM can't prove anything, i check the real machine which 
the VM located on. 
It seems no abnormal. 

BTW, my VM is located on the ceph storage node. 

Can anyone give me more suggestions? 

Thanks! 


hzwuli...@gmail.com 



From: Alexandre DERUMIER 
Date: 2015-10-20 19:36 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I'm able to reach around same performance with qemu-librbd vs qemu-krbd, 
when I compile qemu with jemalloc 
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363)
 
on my test, librbd with jemalloc still use 2x more cpu than krbd, 
so cpu could be bottleneck too. 
with fasts cpu (3.1ghz), I'm able to reach around 70k iops 4k with rbd volume, 
both with krbd or librbd 
- Original message - 
From: hzwuli...@gmail.com 
To: "ceph-users"  
Sent: Tuesday 20 October 2015 10:22:33 
Subject: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine. 
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public 
& cluster networks) 
2. volume1: use rbd create a 100G volume from the ssd pool and map to the real 
machine 
3. volume2: use cinder create a 100G volume form the ssd pool and atach to a 
guest host 
4. disable rbd cache 
5. fio test on the two volues: 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 
volume1 got about 24k IOPS and volume got about 14k IOPS. 
We could see performance of volume2 is not good compare to volume1, so is it 
normal behabior of guest host? 
If not, what maybe the problem? 
Thanks! 
hzwuli...@gmail.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread Lindsay Mathieson
On 21 October 2015 at 16:01, Alexandre DERUMIER  wrote:

> If it's a test server, maybe could you test it with proxmox 4.0 hypervisor
> https://www.proxmox.com
>
> I have made a lot of patch inside it to optimize rbd (qemu+jemalloc,
> iothreads,...)
>

Really gotta find time to upgrade my cluster ...


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Network performance

2015-10-21 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jonas Björklund
> Sent: 21 October 2015 09:23
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Network performance
> 
> Hello,
> 
> In the configuration I have read about "cluster network" and "cluster
addr".
> Is it possible to make the OSDs to listens to multiple IP addresses?
> I want to use several network interfaces to increase performance.
> 
> I hav tried
> 
> [global]
> cluster network = 172.16.3.0/24,172.16.4.0/24
> 
> [osd.0]
> public addr = 0.0.0.0
> #public addr = 172.16.3.1
> #public addr = 172.16.4.1
> 
> But I cant get them to listen to both 172.16.3.1 and 172.16.4.1 at the
same
> time.
> 
> Any ideas?


I don't think this is currently possible, but it would be nice if Ceph
supported something like

http://www.multipath-tcp.org/




> 
> /Jonas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with Bug #12738: scrub bogus results when missing a clone

2015-10-21 Thread Jan Schermer
We just had to look into a similar problem (missing clone objects, extraneous 
clone objects, wrong sizes on a few objects...).

You should do something like this:

1) find all OSDs hosting the PG
ceph pg map 8.e82
2) find the directory with the object on the OSDs
should be something like /var/lib/ceph/osd/ceph-XX/current/8.e82_head/

3) look in this directory for files named like what you see in logs 
(rb.0.bfcb12.238e1f29.002acd39) 
there are _head_ objects that contain the original data, and then objects named 
with the snapshot id instead (_1fc8ce82_ instead of _head_)

4) compare what files are there on the OSDs
5a) you are lucky and one of the OSDs has them - in that case you could either 
copy them to the others (don't forget xattrs!) or rebuild them via backfills 
from the good OSD
5b) you are not that lucky and the files are not there - I'm not that sure what 
to do then
You could in theory just copy the _head_ object contents to the missing objects 
and then drop the image.
Or you could maybe just delete the _head_ objects (since you don't need that 
image anymore), but I don't know whether there's some info stored (in leveldb, 
or somewhere else) about the rbd image or if all the info is in the objects 
themselves.
I think others here will help you more in that case. 
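
For steps 1-4, a rough shell sketch of what I mean (the PG, pool and object names 
below are just the ones from this thread - adjust them to your case):

ceph pg map 8.e82                                   # step 1: lists the acting OSDs
# steps 2-3: on each of those OSD hosts
cd /var/lib/ceph/osd/ceph-XX/current/8.e82_head/
find . -name '*rb.0.bfcb12.238e1f29.002acd39*' -ls
# step 4: compare contents and xattrs between the OSDs before copying anything
md5sum <object file>
getfattr -d -m '.*' <object file>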

I'm not sure if there's an option to "delete rbd image, ignore missing files, 
call it a day" - that one would be handy for situations like this.

Jan



> On 21 Oct 2015, at 09:01, Chris Taylor  wrote:
> 
> Is there some way to manually correct this error while this bug is still 
> needing review? I have one PG that is stuck inconsistent with the same error. 
> I already created a new RBD image and migrated the data to it. The original 
> RBD image was "rb.0.ac3386.238e1f29". The new image is "rb.0.bfcb12.238e1f29".
> 
>  
> 2015-10-20 19:18:07.686783 7f50e4c1d700 0 log_channel(cluster) log [INF] : 
> 8.e82 repair starts
> 2015-10-20 19:18:40.300721 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 1fc8ce82/rb.0.ac3386.238e1f29.0008776e/snapdir//8 missing 
> clones
> 2015-10-20 19:18:40.301094 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 9cc8ce82/rb.0.bfcb12.238e1f29.002acd39/head//8 expected 
> clone 1fc8ce82/rb.0.ac3386.238e1f29.0008776e/44//8
> 2015-10-20 19:18:40.301124 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 fb78ce82/rb.0.bfcb12.238e1f29.000e69a3/head//8 expected 
> clone 9cc8ce82/rb.0.bfcb12.238e1f29.002acd39/44//8
> 2015-10-20 19:18:40.301140 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 8038ce82/rb.0.bfcb12.238e1f29.002b7781/head//8 expected 
> clone fb78ce82/rb.0.bfcb12.238e1f29.000e69a3/44//8
> 2015-10-20 19:18:40.301155 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 c8b7ce82/rb.0.bfcb12.238e1f29.00059252/head//8 expected 
> clone 8038ce82/rb.0.bfcb12.238e1f29.002b7781/44//8
> 2015-10-20 19:18:40.301170 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 9d26ce82/rb.0.bfcb12.238e1f29.000cd86d/head//8 expected 
> clone c8b7ce82/rb.0.bfcb12.238e1f29.00059252/44//8
> 2015-10-20 19:18:40.301185 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 c006ce82/rb.0.bfcb12.238e1f29.000c53d6/head//8 expected 
> clone 9d26ce82/rb.0.bfcb12.238e1f29.000cd86d/44//8
> 2015-10-20 19:18:40.301200 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 3434ce82/rb.0.bfcb12.238e1f29.002cb957/head//8 expected 
> clone c006ce82/rb.0.bfcb12.238e1f29.000c53d6/44//8
> 2015-10-20 19:18:47.724047 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> 8.e82 repair 8 errors, 0 fixed
> 2
> 
>  
> Thanks,
> 
> Chris
> 
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-21 Thread Iban Cabrillo
Hi,
   The same for us. Everything is working fine after upgrade to 0.94.4
(first the MONs and then the OSDs).
Iban

2015-10-21 0:21 GMT+02:00 Lindsay Mathieson :

>
> On 21 October 2015 at 08:09, Andrei Mikhailovsky 
> wrote:
>
>> Same here, the upgrade went well. So far so good.
>>
>
> Ditto
>
>
> --
> Lindsay
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Iban Cabrillo Bartolome
Instituto de Fisica de Cantabria (IFCA)
Santander, Spain
Tel: +34942200969
PGP PUBLIC KEY:
http://pgp.mit.edu/pks/lookup?op=get&search=0xD9DF0B3D6C8C08AC

Bertrand Russell:
*"The problem with the world is that the stupid are sure of everything and the
intelligent are full of doubt"*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Network performance

2015-10-21 Thread Jonas Björklund

Hello,

In the configuration I have read about "cluster network" and "cluster addr".
Is it possible to make the OSDs listen on multiple IP addresses?
I want to use several network interfaces to increase performance.

I have tried:

[global]
cluster network = 172.16.3.0/24,172.16.4.0/24

[osd.0]
public addr = 0.0.0.0
#public addr = 172.16.3.1
#public addr = 172.16.4.1

But I can't get them to listen on both 172.16.3.1 and 172.16.4.1 at the same 
time.

Any ideas?

/Jonas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

2015-10-21 Thread hzwuli...@gmail.com
Hi,
Let me post the versions and configuration here first.
host os: debian 7.8   kernel: 3.10.45
guest os: debian 7.8 kernel: 3.2.0-4

qemu version: 
ii  ipxe-qemu   
1.0.0+git-2013.c3d1e78-2.1~bpo70+1  all  PXE boot firmware - ROM 
images for qemu
ii  qemu-kvm1:2.1+dfsg-12~bpo70+1   
amd64QEMU Full virtualization on x86 hardware
ii  qemu-system-common  1:2.1+dfsg-12~bpo70+1   
amd64QEMU full system emulation binaries (common files)
ii  qemu-system-x86 1:2.1+dfsg-12~bpo70+1   
amd64QEMU full system emulation binaries (x86)
ii  qemu-utils  1:2.1+dfsg-12~bpo70+1   
amd64QEMU utilities

vm config:
(the libvirt disk XML did not survive the list archive - the tags were stripped)



Thanks!


hzwuli...@gmail.com
 
From: Alexandre DERUMIER
Date: 2015-10-21 14:01
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Damn, that's a huge difference.
 
What is your host os, guest os , qemu version and vm config ?
 
 
 
As an extra boost, you could enable iothread on virtio disk.
(It's available on libvirt but not on openstack yet).
 
If it's a test server, maybe could you test it with proxmox 4.0 hypervisor
https://www.proxmox.com
 
I have made a lot of patch inside it to optimize rbd (qemu+jemalloc, 
iothreads,...)
 
 
- Original message -
From: hzwuli...@gmail.com
To: "aderumier" 
Cc: "ceph-users" 
Sent: Wednesday 21 October 2015 06:11:20
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
 
Hi, 
 
Thanks for you reply. 
 
I do more test here and things change more strange, now i only could get about 
4k iops in VM: 
1. use fio with ioengine rbd to test the volume on the real machine 
[global] 
ioengine=rbd 
clientname=admin 
pool=vol_ssd 
rbdname=volume-4f4f9789-4215-4384-8e65-127a2e61a47f 
rw=randwrite 
bs=4k 
group_reporting=1 
 
[rbd_iodepth32] 
iodepth=32 
[rbd_iodepth1] 
iodepth=32 
[rbd_iodepth28] 
iodepth=32 
[rbd_iodepth8] 
iodepth=32 
 
could achive about 18k iops. 
 
2. test the same volume in VM, achive about 4.3k iops 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
#ioengine=sync 
iodepth=128 
direct=1 
group_reporting=1 
thread=1 
filename=/dev/vdb 
 
[task1] 
iodepth=32 
[task2] 
iodepth=32 
[task3] 
iodepth=32 
[task4] 
iodepth=32 
 
Using cep osd perf to check the osd latency, all less than 1 ms. 
Using iostat to check the osd %util, about 10 in case 2 test. 
Using dstat to check VM status: 
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system-- 
usr sys idl wai hiq siq| read writ| recv send| in out | int csw 
2 4 51 43 0 0| 0 17M| 997B 3733B| 0 0 |3476 6997 
2 5 51 43 0 0| 0 18M| 714B 4335B| 0 0 |3439 6915 
2 5 50 43 0 0| 0 17M| 594B 3150B| 0 0 |3294 6617 
1 3 52 44 0 0| 0 18M| 648B 3726B| 0 0 |3447 6991 
1 5 51 43 0 0| 0 18M| 582B 3208B| 0 0 |3467 7061 
 
Finally, using iptraf to check the package size in the VM, almost packages's 
size are around 1 to 70 and 71 to 140 bytes. That's different from real 
machine. 
 
But maybe iptraf on the VM can't prove anything, i check the real machine which 
the VM located on. 
It seems no abnormal. 
 
BTW, my VM is located on the ceph storage node. 
 
Anyone can give me more sugestions? 
 
Thanks! 
 
 
hzwuli...@gmail.com 
 
 
 
From: Alexandre DERUMIER 
Date: 2015-10-20 19:36 
To: hzwulibin 
CC: ceph-users 
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I'm able to reach around same performance with qemu-librbd vs qemu-krbd, 
when I compile qemu with jemalloc 
(http://git.qemu.org/?p=qemu.git;a=commit;h=7b01cb974f1093885c40bf4d0d3e78e27e531363)
 
on my test, librbd with jemalloc still use 2x more cpu than krbd, 
so cpu could be bottleneck too. 
with fasts cpu (3.1ghz), I'm able to reach around 70k iops 4k with rbd volume, 
both with krbd or librbd 
- Original message - 
From: hzwuli...@gmail.com 
To: "ceph-users"  
Sent: Tuesday 20 October 2015 10:22:33 
Subject: [ceph-users] [performance] rbd kernel module versus qemu librbd 
Hi, 
I have a question about the IOPS performance for real machine and virtual 
machine. 
Here is my test situation: 
1. ssd pool (9 OSD servers with 2 osds on each server, 10Gb networks for public 
& cluster networks) 
2. volume1: use rbd create a 100G volume from the ssd pool and map to the real 
machine 
3. volume2: use cinder create a 100G volume form the ssd pool and atach to a 
guest host 
4. disable rbd cache 
5. fio test on the two volues: 
[global] 
rw=randwrite 
bs=4k 
ioengine=libaio 
iodepth=64 
direct=1 
size=64g 
runtime=300s 
group_reporting=1 
thread=1 
volume1 got about 24k IOPS and volume got about 14k IOPS. 
We could see performance of volume2 is not good compare 

Re: [ceph-users] Network performance

2015-10-21 Thread Jonas Björklund


On Wed, 21 Oct 2015, Nick Fisk wrote:


>> [global]
>> cluster network = 172.16.3.0/24,172.16.4.0/24
>>
>> [osd.0]
>> public addr = 0.0.0.0
>> #public addr = 172.16.3.1
>> #public addr = 172.16.4.1
>>
>> But I cant get them to listen to both 172.16.3.1 and 172.16.4.1 at the same
>> time.
>>
>> Any ideas?
>
> I don't think this is currently possible, but it would be nice if Ceph
> supported something like
>
> http://www.multipath-tcp.org/


In the documentation I can read how to add multiple networks. Why is that 
possible if it can't be used? :-)

cluster network

Description:	The IP address and netmask of the cluster (back-side) 
network (e.g., 10.0.0.0/24). Set in [global]. You may specify comma-delimited subnets.

Type:   {ip-address}/{netmask} [, {ip-address}/{netmask}]
Required:   No
Default:N/A

/Jonas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com