[ceph-users] Ceph cluster on AWS EC2 VMs using public ips

2015-05-01 Thread Sumit Gaur
Hi Experts,
I need some quick advice on deploying a Ceph cluster on AWS EC2 VMs.
1) I have two separate AWS accounts. I am trying to create a Ceph cluster
in one account, create a ceph-client in the other account, and connect the
two.

(EC2 Account A VMs + ceph client) --- public IP --- (EC2 Account B + ceph
cluster (1 mon + 3 OSD VMs))

2) I have allowed ALL inbound and outbound traffic, so there are no
restrictions on either account.

3) I have configured my Ceph cluster using the *public IPs* assigned to the EC2
instances. I know it is not recommended, but there is no other way I can
reach my Ceph cluster from the other account's VMs.

4) Now, after adding the public IP to the Ceph cluster VMs, I am able to get my
monitor running, but I am still not able to connect from the other account's ceph client.

ifconfig eth0:0 52.24.62.171 netmask 255.255.255.0 up
and
public network = 52.24.73.240/24 in ceph.conf

Please help: is it at all possible to set up Ceph on AWS EC2 based on *public
IPs*, so that my ceph client in the other account can contact this cluster?

Thanks a lot
sumit


Re: [ceph-users] Civet RadosGW S3 not storing complete obects; civetweb logs stop after rotation

2015-05-01 Thread Sean

Hey there,

Sorry for the delay. I have been moving apartments, ugh. Our dev team 
found out how to quickly identify the files that download at a smaller 
size::


Iterate through all of the objects in a bucket, call key.size on each item, 
and compare it to conn.get_bucket().get_key().size for the same key. 
Whenever the two sizes differ, the object corresponds exactly to one that 
seems to be missing objects in ceph.
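
For what it's worth, a rough boto 2 sketch of that check (endpoint, credentials 
and bucket name are placeholders; listed.size comes from the bucket listing, 
while bucket.get_key() does a HEAD on the object itself)::

    # Sketch only -- compare the "implicit" size from the bucket listing with
    # the "explicit" size returned by a HEAD on each object.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',            # placeholder
        aws_secret_access_key='SECRET_KEY',        # placeholder
        host='rgw.example.com',                    # placeholder RGW endpoint
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.get_bucket('mybucket')           # placeholder bucket name
    for listed in bucket.list():                   # size recorded in the bucket index
        headed = bucket.get_key(listed.name)       # size from a HEAD of the object itself
        if headed.size != listed.size:
            print('%s listed=%d headed=%d diff=%d' % (
                listed.name, listed.size, headed.size, listed.size - headed.size))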


The size differences always seem to be multiples of 512k as well, which is 
really odd.


==
http://pastebin.com/R34wF7PB
==

My main question is: why are these sizes different at all? Shouldn't they 
be exactly the same? Why are they off by multiples of 512k as well? 
Finally, I need a way to rule out that this is a Ceph issue, and the only 
way I can think of is grabbing a list of all of the data files and 
concatenating them together in order, in the hope that the manifest is wrong 
and I get the whole file.


For example::

implicit size 7745820218, explicit size 7744771642, absolute difference 
1048576; name = 
86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam


I explicitly called one of the gateways and then piped the output to a 
text file while downloading this bam:


https://drive.google.com/file/d/0B16pfLB7yY6GcTZXalBQM3RHT0U/view?usp=sharing 
(25 Mb of text)


As we can see above, Ceph is saying somewhere that the size is 7745820218 
bytes, but when we download the object we get a 7744771642 byte file. 
Finally, if I do a range request for all of the bytes from 7744771642 to the 
end, I get a "cannot complete request"::



http://pastebin.com/CVvmex4m -- traceback of the python range request.
http://pastebin.com/4sd1Jc0G -- the radoslog of the range request

If I request the file with a slightly shorter range (say 7744771642 - 2 bytes, 
i.e. starting at 7744771640), I am left with just a 2 byte file::


http://pastebin.com/Sn7Y0t9G -- range request of file - 2 bytes to end 
of file.

lacadmin@kh10-9:~$ ls -lhab 7gtest-range.bam
-rw-r--r-- 1 lacadmin lacadmin 2 Feb 24 01:00 7gtest-range.bam
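
For reference, the range request itself can be issued from boto 2 roughly like 
this (connection setup as in the listing sketch above; the bucket name is a 
placeholder, the key name and offset are the ones from this report)::

    # Sketch only -- ask RGW for everything from byte 7744771640 onward.
    bucket = conn.get_bucket('mybucket')           # placeholder bucket name
    key = bucket.get_key('86b6fad8-3c53-465f-8758-2009d6df01e9/'
                         'TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam')
    key.get_contents_to_filename('7gtest-range.bam',
                                 headers={'Range': 'bytes=7744771640-'})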


I think that rados-gw may possibly not be keeping track of multipart chunk 
errors? How did rados get the original and correct file size, and why does 
it come up short when it returns the actual chunks? Finally, why are the 
corrupt / missing chunks always a multiple of 512K? I do not see 
anything obvious that is set to 512K on the configuration/user side.



Sorry for the questions and babbling, but I am at a loss as to how to 
address this.






On 04/28/2015 05:03 PM, Yehuda Sadeh-Weinraub wrote:


- Original Message -

From: Sean seapasu...@uchicago.edu
To: ceph-users@lists.ceph.com
Sent: Tuesday, April 28, 2015 2:52:35 PM
Subject: [ceph-users] Civet RadosGW S3 not storing complete obects; civetweb 
logs stop after rotation

Hey yall!

I have a weird issue and I am not sure where to look so any help would
be appreciated. I have a large ceph giant cluster that has been stable
and healthy almost entirely since its inception. We have stored over
1.5PB into the cluster currently through RGW and everything seems to be
functioning great. We have downloaded smaller objects without issue but
last night we did a test on our largest file (almost 1 terabyte) and it
continuously times out at almost the exact same place. Investigating
further it looks like Civetweb/RGW is returning that the uploads
completed even though the objects are truncated. At least when we
download the objects they seem to be truncated.

I have tried searching through the mailing list archives to see what may
be going on, but it looks like the mailing list DB may be going through
some maintenance:


Unable to read word database file
'/dh/mailman/dap/archives/private/ceph-users-ceph.com/htdig/db.words.db'


After checking through the gzipped logs I see that civetweb just stops
logging after a rotation for some reason as well, and my last log is from
the 28th of March. I tried manually running /etc/init.d/radosgw reload,
but this didn't seem to work. As running the download again could take
all day to error out, we instead used a range request to try and pull
the missing bytes.

https://gist.github.com/MurphyMarkW/8e356823cfe00de86a48 -- there is the
code we are using to download via S3 / boto as well as the returned size
report and overview of our issue.
http://pastebin.com/cVLdQBMF -- Here is some of the log from the civetweb
server they are hitting.

Here is our current config ::
http://pastebin.com/2SGfSDYG

Current output of ceph health::
http://pastebin.com/3f6iJEbu

I am thinking that this must be a civetweb/radosgw bug of some kind. My
questions are: 1) Is there a way to try and download the object via rados
directly? I am guessing I will need to find the prefix and then just cat
all of the pieces together and hope I get it right. 2) Why would Ceph say the
upload went fine but then return a smaller object?




Note that the returned http 

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-05-01 Thread Sage Weil
On Fri, 1 May 2015, tuomas.juntu...@databasement.fi wrote:
 Hi
 
 I deleted the images and img pools and started osd's, they still die.
 
 Here's a log of one of the osd's after this, if you need it.
 
 http://beta.xaasbox.com/ceph/ceph-osd.19.log

I've pushed another commit that should avoid this case, sha1
425bd4e1dba00cc2243b0c27232d1f9740b04e34.

Note that once the pools are fully deleted (shouldn't take too long once 
the osds are up and stabilize) you should switch back to the normal 
packages that don't have these workarounds.

sage



 
 Br,
 Tuomas
 
 
  Thanks man. I'll try it tomorrow. Have a good one.
 
  Br,T
 
   Original message 
  From: Sage Weil s...@newdream.net
  Date: 30/04/2015  18:23  (GMT+02:00)
  To: Tuomas Juntunen tuomas.juntu...@databasement.fi
  Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
  Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic
 
  operations most of the OSD's went down
 
  On Thu, 30 Apr 2015, tuomas.juntu...@databasement.fi wrote:
  Hey
 
  Yes I can drop the images data, you think this will fix it?
 
  It's a slightly different assert that (I believe) should not trigger once
  the pool is deleted.  Please give that a try and if you still hit it I'll
  whip up a workaround.
 
  Thanks!
  sage
 
   
 
  Br,
 
  Tuomas
 
   On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
   Hi
  
   I updated that version and it seems that something did happen, the osd's
   stayed up for a while and 'ceph status' got updated. But then in couple 
   of
   minutes, they all went down the same way.
  
   I have attached new 'ceph osd dump -f json-pretty' and got a new log 
   from
   one of the osd's with osd debug = 20,
   http://beta.xaasbox.com/ceph/ceph-osd.15.log
  
   Sam mentioned that you had said earlier that this was not critical data?
   If not, I think the simplest thing is to just drop those pools.  The
   important thing (from my perspective at least :) is that we understand 
   the
   root cause and can prevent this in the future.
  
   sage
  
  
  
   Thank you!
  
   Br,
   Tuomas
  
  
  
   -Original Message-
   From: Sage Weil [mailto:s...@newdream.net]
   Sent: 28. huhtikuuta 2015 23:57
   To: Tuomas Juntunen
   Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
   Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some 
   basic
   operations most of the OSD's went down
  
   Hi Tuomas,
  
   I've pushed an updated wip-hammer-snaps branch.  Can you please try it?
   The build will appear here
  
  
   http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e
   2eb514067f72afda11bcde286
  
   (or a similar url; adjust for your distro).
  
   Thanks!
   sage
  
  
   On Tue, 28 Apr 2015, Sage Weil wrote:
  
[adding ceph-devel]
   
 Okay, I see the problem.  This seems to be unrelated to the giant ->
 hammer move... it's a result of the tiering changes you made:
   
  The following:
 
  ceph osd tier add img images --force-nonempty
  ceph osd tier cache-mode images forward
  ceph osd tier set-overlay img images
   
Specifically, --force-nonempty bypassed important safety checks.
   
1. images had snapshots (and removed_snaps)
   
2. images was added as a tier *of* img, and img's removed_snaps was
copied to images, clobbering the removed_snaps value (see
OSDMap::Incremental::propagate_snaps_to_tiers)
   
3. tiering relation was undone, but removed_snaps was still gone
   
 4. on OSD startup, when we load the PG, removed_snaps is initialized
 with the older map.  later, in PGPool::update(), we assume that
 removed_snaps always grows (never shrinks) and we trigger an assert.
   
To fix this I think we need to do 2 things:
   
 1. make the OSD forgiving about removed_snaps getting smaller.  This is
 probably a good thing anyway: once we know snaps are removed on all
 OSDs we can prune the interval_set in the OSDMap.  Maybe.
   
2. Fix the mon to prevent this from happening, *even* when
--force-nonempty is specified.  (This is the root cause.)
   
I've opened http://tracker.ceph.com/issues/11493 to track this.
   
sage
   
   
   
 
  The idea was to make images a tier of img, move the data to img,
  and then change
 clients to use the new img pool.
 
  Br,
  Tuomas
 
   Can you explain exactly what you mean by:
  
   Also I created one pool for tier to be able to move
   data without
 outage.
  
   -Sam
   - Original Message -
   From: tuomas juntunen
   tuomas.juntu...@databasement.fi
   To: Ian Colle ico...@redhat.com
   Cc: ceph-users@lists.ceph.com
   Sent: Monday, April 27, 2015 4:23:44 AM
   Subject: Re: [ceph-users] Upgrade from Giant to Hammer
   and after some basic 

Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

2015-05-01 Thread Christian Balzer

Hello,

On Fri, 1 May 2015 12:03:59 -0400 Anthony Levesque wrote:

 From what I read in some of the topics, is it you guys' opinion that Ceph
 cannot scale nicely on a full SSD cluster? Meaning that no matter how many
 OSD nodes we add, at some point you won't be able to scale past some
 throughput.

No, that's not what at least I'm saying at all.
Ceph scales quite well, much better than some other distributed storage
solutions. 
The more nodes and/or OSDs, the better. 

However those nodes need to be balanced and well designed; your original
try with the 1TB EVOs was limited by those SSDs.
Having 16 (fast) SSDs per node is going to be limited by the CPU resources
to handle the potential IOPS they're capable of.
Your network might be another limiting factor at some point.

The exercise with Ceph is to deploy well balanced storage nodes, where
"well" means the closest fit to your IOPS needs, budget and other constraints
(rack space, power).

Christian
 --- Anthony Lévesque GloboTech Communications
 Phone: 1-514-907-0050 x 208
 Toll Free: 1-(888)-GTCOMM1 x 208
 Phone Urgency: 1-(514) 907-0047
 1-(866)-500-1555
 Fax: 1-(514)-907-0750
 aleves...@gtcomm.net mailto:aleves...@gtcomm.net
 http://www.gtcomm.net http://www.gtcomm.net/
  On Apr 30, 2015, at 9:32 PM, Christian Balzer ch...@gol.com wrote:
  
  On Thu, 30 Apr 2015 18:01:44 -0400 Anthony Levesque wrote:
  
  I'm planning to set up 4-6 POCs in the next 2 weeks to test various
  scenarios here.
  
  I'm checking to get POCs with the S3610, S3710, P3500 (seems to be new; I
  know the lifespan is lower) and maybe the P3700
  
  Don't ignore the S3700, it is faster in sequential writes than the 3710
  because it uses older, less dense flash modules, thus more parallelism.
  
  And with Ceph, especially when looking at the journals, you will hit
  the max sequential write speed limit of the SSD long, long before
  you'll hit the IOPS limit. 
  Both due to the nature of journal writes and the little detail that
  you'll hit the CPU performance wall before that.
  
  The speed of the 400GB P3500 seems very nice and the price is alright.
  The major differences will be the durability between the P3700 and
  P3500, and the IOPS.
  
  Read the link below about write amplification, but that is something
  that happens mostly on the OSD part, which in your case of 1TB EVOs is
  already a scary prospect in my book.
  
  In both options, they are the models with the lowest price per MB/s when
  compared to the S series.
  
  Price per MB/s is a good start, don't forget to factor in TBW/$ and
  try to estimate what write loads your cluster will see.
  
  But all of this is irrelevant if your typical write patterns will
  exceed your CPU resources while your SSDs are bored.
  For example this fio in a VM here:
  ---
  # fio --size=4G --ioengine=libaio --invalidate=1 --direct=0
  --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
  
   write: io=1381.4MB, bw=16364KB/s, iops=4091 , runt= 86419msec
  ---
  
  Will utilize all 8 3.1 GHz cores here, on a 3 node firefly cluster
  with 8 HDD OSDs and 4 journal SSDs (100GB S3700) per node. 
  While the journal SSDs are at 11% and the OSD HDDs at 30-40%
  utilization. 
  
  When changing that fio to direct=1, the IOPS drop to half of that.
  
  With a block size of 4MB things of course change to the OSDs being 100%
  busy, the SSDs about 60% (they can only do 200MB/s) and with 3-4 cores
  worth being idle or in IOwait.
  
  Model      Capacity   Price per MB/s
  DC S3500   120GB      $1.10
             240GB      $1.01
             300GB      $1.03
             480GB      $1.28
  DC S3610   200GB      $0.99
             400GB      $1.14
             480GB      $1.24
  DC S3710   200GB      $1.17
  DC P3500   400GB      $0.64
  DC P3700   400GB      $0.96
  
  As a side note, the expense doesn't scare me directly. It's more that
  we are going in blind here, since it seems not a lot of people do full SSD
  setups (or share their experiences).
  
  See this:
  http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
  
  I'd suggest you try the above tests yourself, you seem to have a
  significant amount of hardware already.
  
  There are many SSD threads, but so far there's at best one example of a
  setup going from Firefly to Giant and Hammer.
  So for me it's hard to qualify and quantify the improvements Hammer
  brings to SSD based clusters other than "better", maybe about 50%.
  Which, while significant, is obviously nowhere near the raw performance
  the hardware would be capable of.
  
  But then again, my guestimate is that aside from the significant code
  that gets executed per Ceph IOP, any such Ceph IOP results in 5-10
  real IOPs down the line.
  
  Christian
  
  Anyway, still brainstorming this so we can work on some POCs. Will keep
  you guys posted here. ---
  Anthony Lévesque
  
  
  On Apr 29, 2015, at 11:27 PM, Christian Balzer ch...@gol.com wrote:
  
  
  
  Hello,
  
  On Wed, 29 Apr 2015 15:01:49 -0400 Anthony Levesque wrote:
  
  We redid the test with 4MB 

Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

2015-05-01 Thread tuomas . juntunen
Thanks, I'll do this when the commit is available and report back.

And indeed, I'll change to the official ones after everything is ok.

Br,
Tuomas

 On Fri, 1 May 2015, tuomas.juntu...@databasement.fi wrote:
 Hi

 I deleted the images and img pools and started osd's, they still die.

 Here's a log of one of the osd's after this, if you need it.

 http://beta.xaasbox.com/ceph/ceph-osd.19.log

 I've pushed another commit that should avoid this case, sha1
 425bd4e1dba00cc2243b0c27232d1f9740b04e34.

 Note that once the pools are fully deleted (shouldn't take too long once
 the osds are up and stabilize) you should switch back to the normal
 packages that don't have these workarounds.

 sage




 Br,
 Tuomas


  Thanks man. I'll try it tomorrow. Have a good one.
 
  Br,T
 
   Original message 
  From: Sage Weil s...@newdream.net
  Date: 30/04/2015  18:23  (GMT+02:00)
  To: Tuomas Juntunen tuomas.juntu...@databasement.fi
  Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
  Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic

  operations most of the OSD's went down
 
  On Thu, 30 Apr 2015, tuomas.juntu...@databasement.fi wrote:
  Hey
 
  Yes I can drop the images data, you think this will fix it?
 
  It's a slightly different assert that (I believe) should not trigger once
  the pool is deleted.  Please give that a try and if you still hit it I'll
  whip up a workaround.
 
  Thanks!
  sage
 
   
 
  Br,
 
  Tuomas
 
   On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
   Hi
  
   I updated that version and it seems that something did happen, the 
   osd's
   stayed up for a while and 'ceph status' got updated. But then in couple
 of
   minutes, they all went down the same way.
  
   I have attached new 'ceph osd dump -f json-pretty' and got a new log
 from
   one of the osd's with osd debug = 20,
   http://beta.xaasbox.com/ceph/ceph-osd.15.log
  
   Sam mentioned that you had said earlier that this was not critical data?
   If not, I think the simplest thing is to just drop those pools.  The
   important thing (from my perspective at least :) is that we understand
 the
   root cause and can prevent this in the future.
  
   sage
  
  
  
   Thank you!
  
   Br,
   Tuomas
  
  
  
   -Original Message-
   From: Sage Weil [mailto:s...@newdream.net]
   Sent: 28. huhtikuuta 2015 23:57
   To: Tuomas Juntunen
   Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
   Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some
 basic
   operations most of the OSD's went down
  
   Hi Tuomas,
  
   I've pushed an updated wip-hammer-snaps branch.  Can you please try 
   it?
   The build will appear here
  
  
   http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e
   2eb514067f72afda11bcde286
  
   (or a similar url; adjust for your distro).
  
   Thanks!
   sage
  
  
   On Tue, 28 Apr 2015, Sage Weil wrote:
  
[adding ceph-devel]
   
 Okay, I see the problem.  This seems to be unrelated to the giant ->
 hammer move... it's a result of the tiering changes you made:
   
  The following:
 
  ceph osd tier add img images --force-nonempty
  ceph osd tier cache-mode images forward
  ceph osd tier set-overlay img images
   
Specifically, --force-nonempty bypassed important safety checks.
   
1. images had snapshots (and removed_snaps)
   
2. images was added as a tier *of* img, and img's removed_snaps was
copied to images, clobbering the removed_snaps value (see
OSDMap::Incremental::propagate_snaps_to_tiers)
   
3. tiering relation was undone, but removed_snaps was still gone
   
4. on OSD startup, when we load the PG, removed_snaps is initialized
with the older map.  later, in PGPool::update(), we assume that
 removed_snaps always grows (never shrinks) and we trigger an assert.
   
To fix this I think we need to do 2 things:
   
 1. make the OSD forgiving about removed_snaps getting smaller.  This 
is
probably a good thing anyway: once we know snaps are removed on all
OSDs we can prune the interval_set in the OSDMap.  Maybe.
   
2. Fix the mon to prevent this from happening, *even* when
--force-nonempty is specified.  (This is the root cause.)
   
I've opened http://tracker.ceph.com/issues/11493 to track this.
   
sage
   
   
   
 
  Idea was to make images as a tier to img, move data to 
  img
  then change
 clients to use the new img pool.
 
  Br,
  Tuomas
 
   Can you explain exactly what you mean by:
  
   Also I created one pool for tier to be able to move
   data without
 outage.
  
   -Sam
   - Original Message -
   From: tuomas juntunen
   tuomas.juntu...@databasement.fi
   To: Ian Colle ico...@redhat.com
   Cc: 

Re: [ceph-users] Ceph Fuse Crashed when Reading and How to Backup the data

2015-05-01 Thread John Spray



On 30/04/2015 09:21, flisky wrote:

When I read the file through the ceph-fuse, the process crashed.

Here is the log -

terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

  what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
 in thread 7fe0814d3700
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (()+0x249805) [0x7fe08670b805]
 2: (()+0x10d10) [0x7fe085c39d10]
 3: (gsignal()+0x37) [0x7fe0844d3267]
 4: (abort()+0x16a) [0x7fe0844d4eca]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fe084de706d]
 6: (()+0x5eee6) [0x7fe084de4ee6]
 7: (()+0x5ef31) [0x7fe084de4f31]
 8: (()+0x5f149) [0x7fe084de5149]
 9: (ceph::buffer::list::substr_of(ceph::buffer::list const, unsigned 
int, unsigned int)+0x24b) [0x7fe08688993b]
 10: (ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool)+0x1423) [0x7fe0866c6b73]

 11: (ObjectCacher::C_RetryRead::finish(int)+0x20) [0x7fe0866cd870]
 12: (Context::complete(int)+0x9) [0x7fe086687eb9]
 13: (void finish_contextsContext(CephContext*, std::listContext*, 
std::allocatorContext* , int)+0xac) [0x7fe0866ca73c]
 14: (ObjectCacher::bh_read_finish(long, sobject_t, unsigned long, 
long, unsigned long, ceph::buffer::list, int, bool)+0x29e) 
[0x7fe0866bfd2e]

 15: (ObjectCacher::C_ReadFinish::finish(int)+0x7f) [0x7fe0866cc85f]
 16: (Context::complete(int)+0x9) [0x7fe086687eb9]
 17: (C_Lock::finish(int)+0x29) [0x7fe086688269]
 18: (Context::complete(int)+0x9) [0x7fe086687eb9]
 19: (Finisher::finisher_thread_entry()+0x1b4) [0x7fe08671f184]
 20: (()+0x76aa) [0x7fe085c306aa]
 21: (clone()+0x6d) [0x7fe0845a4eed]
=
Some part may be interesting -
   -11 2015-04-30 15:55:59.063828 7fd6a816c700 10 -- 
172.30.11.188:0/10443  172.16.3.153:6820/1532355 pipe(0x7fd6740344c0 
sd=8 :58596 s=2 pgs=3721 cs=1 l=1 c=0x7fd674038760).reader got message 
1 0x7fd65c001940 osd_op_reply(1 119. [read 0~4390] 
v0'0 uv0 ack = -1 ((1) Operation not permitted)) v6
   -10 2015-04-30 15:55:59.063848 7fd6a816c700  1 -- 
172.30.11.188:0/10443 == osd.9 172.16.3.153:6820/1532355 1  
osd_op_reply(1 119. [read 0~4390] v0'0 uv0 ack = -1 
((1) Operation not permitted)) v6  187+0+0 (689339676 0 0) 
0x7fd65c001940 con 0x7fd674038760



And the cephfs-journal seems okay.

Could anyone tell me why it is happening?


Hmm, the backtrace is the same as http://tracker.ceph.com/issues/11510

This isn't the same cluster by any chance?

John


Re: [ceph-users] Quick question - version query

2015-05-01 Thread Robert LeBlanc
ceph --admin-daemon path/to/admin/socket version
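
A small sketch to check every daemon on a host in one go (assuming the default 
admin socket location /var/run/ceph/*.asok, and that the version command 
returns JSON as it does on recent releases):

    # Sketch only -- print the version reported by each local Ceph daemon.
    import glob
    import json
    import subprocess

    for sock in sorted(glob.glob('/var/run/ceph/*.asok')):
        out = subprocess.check_output(['ceph', '--admin-daemon', sock, 'version'])
        print('%s: %s' % (sock, json.loads(out)['version']))

Run it on each monitor and OSD host to confirm everything reports 0.94.1.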

On Fri, May 1, 2015 at 10:44 AM, Tony Harris neth...@gmail.com wrote:
 Hi all,

 I feel a bit like an idiot at the moment - I know there is a command through
 ceph to query the monitor and OSD daemons to check their version level, but
 I can't remember what it is to save my life and I'm having trouble locating
 it in the docs.  I need to make sure the entire cluster is running 0.94.1 at
 this point..

 -Tony



[ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Piotr Wachowicz
Is there any way to confirm (beforehand) that using SSDs for journals will
help?

We're seeing very disappointing Ceph performance. We have 10GigE
interconnect (as a shared public/internal network).

We're wondering whether it makes sense to buy SSDs and put journals on
them. But we're looking for a way to verify that this will actually help
BEFORE we splash cash on SSDs.

The problem is that the way we have things configured now, with journals on
spinning HDDs (shared with OSDs as the backend storage), apart from the slow
read/write performance to Ceph I already mentioned, we're also seeing fairly
low disk utilization on OSDs.

This low disk utilization suggests that the journals are not really used to
their max, which raises the question of whether buying SSDs for the journals
will help.

This kind of suggests that the bottleneck is NOT the disk. But, yeah, we
cannot really confirm that.

Our typical data access use case is a lot of small random read/writes.
We're doing a lot of rsyncing (entire regular linux filesystems) from one
VM to another.

We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't
really help all that much.

So, is there any way to confirm beforehand that using SSDs for journals
will help in our case?

Kind Regards,
Piotr


Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Piotr Wachowicz
Thanks for your answer, Nick.

Typically it's a single rsync session at a time (sometimes two, but rarely
more concurrently). So it's a single ~5GB typical linux filesystem from one
random VM to another random VM.

Apart from using RBD Cache, is there any other way to improve the overall
performance of such a use case in a Ceph cluster?

In theory I guess we could always tarball it, and rsync the tarball, thus
effectively using sequential IO rather than random. But that's simply not
feasible for us at the moment. Any other ways?

Side question: does using RBDCache impact the way writes are acknowledged to
the client? (e.g. a write call returning after data has been written to the
journal (fast) vs. written all the way to the OSD data store (slow)). I'm
guessing it's always the first one, regardless of whether the client uses
RBDCache or not, right? My logic here is that otherwise that would imply that
clients can impact the way OSDs behave, which could be dangerous in some situations.

Kind Regards,
Piotr



On Fri, May 1, 2015 at 10:59 AM, Nick Fisk n...@fisk.me.uk wrote:

 How many Rsync's are doing at a time? If it is only a couple, you will not
 be able to take advantage of the full number of OSD's, as each block of
 data is only located on 1 OSD (not including replicas). When you look at
 disk statistics you are seeing an average over time, so it will look like
 the OSD's are not very busy, when in fact each one is busy for a very brief
 period.



 SSD journals will help your write latency, probably going down from around
 15-30ms to under 5ms



 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *Piotr Wachowicz
 *Sent:* 01 May 2015 09:31
 *To:* ceph-users@lists.ceph.com
 *Subject:* [ceph-users] How to estimate whether putting a journal on SSD
 will help with performance?



 Is there any way to confirm (beforehand) that using SSDs for journals will
 help?

 We're seeing very disappointing Ceph performance. We have 10GigE
 interconnect (as a shared public/internal network).



 We're wondering whether it makes sense to buy SSDs and put journals on
 them. But we're looking for a way to verify that this will actually help
 BEFORE we splash cash on SSDs.



 The problem is that the way we have things configured now, with journals
 on spinning HDDs (shared with OSDs as the backend storage), apart from slow
 read/write performance to Ceph I already mention, we're also seeing fairly
 low disk utilization on OSDs.



 This low disk utilization suggests that journals are not really used to
 their max, which begs for the questions whether buying SSDs for journals
 will help.



 This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we
 cannot really confirm that.



 Our typical data access use case is a lot of small random read/writes.
 We're doing a lot of rsyncing (entire regular linux filesystems) from one
 VM to another.



 We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't
 really help all that much.



 So, is there any way to confirm beforehand that using SSDs for journals
 will help in our case?



 Kind Regards,
 Piotr




Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Wido den Hollander


On 01-05-15 11:42, Nick Fisk wrote:
 Yeah, that’s your problem, doing a single thread rsync when you have
 quite poor write latency will not be quick. SSD journals should give you
 a fair performance boost, otherwise you need to coalesce the writes at
 the client so that Ceph is given bigger IOs at higher queue depths.
 

Exactly. But Ceph doesn't excel at serial I/O streams like these. It
performs best when I/O is done in parallel. So if you can figure out a way
to run multiple rsyncs at the same time, you might see a great
performance boost.

This way all OSDs can process the I/O instead of one by one.
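
As an illustration only, a sketch along these lines (source directories, 
destination and worker count are made up) splits one big copy into several 
concurrent rsync sessions so more OSDs see I/O at the same time:

    # Sketch only -- run several rsync sessions in parallel instead of one.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SRC_DIRS = ['/etc', '/usr', '/var', '/home', '/opt']   # hypothetical split
    DEST = 'other-vm:/backup'                              # hypothetical destination

    def sync(path):
        # -a preserves ownership/permissions/times; one rsync process per directory
        return subprocess.call(['rsync', '-a', path, DEST])

    with ThreadPoolExecutor(max_workers=4) as pool:        # 4 concurrent sessions
        exit_codes = list(pool.map(sync, SRC_DIRS))

    print(exit_codes)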

  
 
 RBD Cache can help here as well as potentially FS tuning to buffer more
 aggressively. If writeback RBD cache is enabled, data will be buffered
 by RBD until a sync is called by the client, so data loss can occur
 during this period if the app is not issuing fsyncs properly. Once a
 sync is called data is flushed to the journals and then later to the
 actual OSD store.
 
  
 
 *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *Piotr Wachowicz
 *Sent:* 01 May 2015 10:14
 *To:* Nick Fisk
 *Cc:* ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] How to estimate whether putting a journal on
 SSD will help with performance?
 
  
 
 Thanks for your answer, Nick.
 
  
 
 Typically it's a single rsync session at a time (sometimes two, but
 rarely more concurrently). So it's a single ~5GB typical linux
 filesystem from one random VM to another random VM.
 
  
 
 Apart from using RBD Cache, is there any other way to improve the
 overall performance of such a use case in a Ceph cluster?
 
  
 
 In theory I guess we could always tarball it, and rsync the tarball,
 thus effectively using sequential IO rather than random. But that's
 simply not feasible for us at the moment. Any other ways?
 
  
 
 Sidequestion: does using RBDCache impact the way data is stored on the
 client? (e.g. a write call returning after data has been written to
 Journal (fast) vs  written all the way to the OSD data store(slow)). I'm
 guessing it's always the first one, regardless of whether client uses
 RBDCache or not, right? My logic here is that otherwise that would imply
 that clients can impact the way OSDs behave, which could be dangerous in
 some situations.
 
  
 
 Kind Regards,
 
 Piotr
 
  
 
  
 
  
 
 On Fri, May 1, 2015 at 10:59 AM, Nick Fisk n...@fisk.me.uk
 mailto:n...@fisk.me.uk wrote:
 
 How many Rsync’s are doing at a time? If it is only a couple, you
 will not be able to take advantage of the full number of OSD’s, as
 each block of data is only located on 1 OSD (not including
 replicas). When you look at disk statistics you are seeing an
 average over time, so it will look like the OSD’s are not very busy,
 when in fact each one is busy for a very brief period.
 
  
 
 SSD journals will help your write latency, probably going down from
 around 15-30ms to under 5ms
 
  
 
 *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com
 mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Piotr
 Wachowicz
 *Sent:* 01 May 2015 09:31
 *To:* ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 *Subject:* [ceph-users] How to estimate whether putting a journal on
 SSD will help with performance?
 
  
 
 Is there any way to confirm (beforehand) that using SSDs for
 journals will help?
 
 We're seeing very disappointing Ceph performance. We have 10GigE
 interconnect (as a shared public/internal network). 
 
  
 
 We're wondering whether it makes sense to buy SSDs and put journals
 on them. But we're looking for a way to verify that this will
 actually help BEFORE we splash cash on SSDs.
 
  
 
 The problem is that the way we have things configured now, with
 journals on spinning HDDs (shared with OSDs as the backend storage),
 apart from slow read/write performance to Ceph I already mention,
 we're also seeing fairly low disk utilization on OSDs. 
 
  
 
 This low disk utilization suggests that journals are not really used
 to their max, which begs for the questions whether buying SSDs for
 journals will help.
 
  
 
 This kind of suggests that the bottleneck is NOT the disk. But,m
 yeah, we cannot really confirm that.
 
  
 
 Our typical data access use case is a lot of small random
 read/writes. We're doing a lot of rsyncing (entire regular linux
 filesystems) from one VM to another.
 
  
 
 We're using Ceph for OpenStack storage (kvm). Enabling RBD cache
 didn't really help all that much.
 
  
 
 So, is there any way to confirm beforehand that using SSDs for
 journals will help in our case?
 
  
 
 Kind Regards,
 Piotr
 
 
 

[ceph-users] Ceph hammer rgw : unbale to create bucket

2015-05-01 Thread Shashank Puntamkar
I have freshly installed the Ceph Hammer version 0.94.1.
I am facing problems while configuring the Rados gateway. I want to map
specific users to specific pools. For this I followed the following
links.
(1).  http://comments.gmane.org/gmane.comp.file-systems.ceph.user/4992
(2). http://cephnotes.ksperis.com/blog/2014/11/28/placement-pools-on-rados-gw

I followed the methods mentioned in these links. The problem which I
am facing is that I am not able to create a new bucket in federated mode.
Also, at one point in time I was able to create a bucket with a dot prefix,
but it also stopped working after a restart of rgw. At present it is
working with the default .rgw.bucket pool but not working with other
pools.
Any help in this regard is appreciated.


Re: [ceph-users] Ceph Fuse Crashed when Reading and How to Backup the data

2015-05-01 Thread flisky

It turns out to be the permission problem.

When I change to ceph.admin, I can read the file, and the file content 
seems to be garbage.


Best regards,

On 2015年05月01日 02:07, Gregory Farnum wrote:

The not permitted bit usually means that your client doesn't have
access permissions to the data pool in use.

I'm not sure why it would be getting aborted without any output though
— is there any traceback at all in the logs? A message about the
OOM-killer zapping it or something?
-Greg


On Thu, Apr 30, 2015 at 1:45 AM, flisky yinjif...@lianjia.com wrote:

Sorry, I cannot reproduce the "Operation not permitted" log.

Here is a small portion of log with debug_client = 20/20
==
-22 2015-04-30 16:29:12.858309 7fe9757f2700 10 client.58272 check_caps
on 115.head(ref=2 ll_ref=10 cap_refs={} open={1=1} mode=100664
size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr)
objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980
0x7fe968021c30) wanted pFscr used - is_delayed=1
-21 2015-04-30 16:29:12.858326 7fe9757f2700 10 client.58272  cap mds.0
issued pAsLsXsFscr implemented pAsLsXsFscr revoking -
-20 2015-04-30 16:29:12.858333 7fe9757f2700 10 client.58272 send_cap
115.head(ref=2 ll_ref=10 cap_refs={} open={1=1} mode=100664
size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr)
objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980
0x7fe968021c30) mds.0 seq 1 used - want pFscr flush - retain
pAsxLsxXsxFsxcrwbl held pAsLsXsFscr revoking - dropping -
-19 2015-04-30 16:29:12.858358 7fe9757f2700 15 client.58272 auth cap,
setting max_size = 0
-18 2015-04-30 16:29:12.858368 7fe9757f2700 10 client.58272 _create_fh
115 mode 1
-17 2015-04-30 16:29:12.858376 7fe9757f2700 20 client.58272 trim_cache
size 14 max 16384
-16 2015-04-30 16:29:12.858378 7fe9757f2700  3 client.58272 ll_open
115.head 32768 = 0 (0x7fe95c0052c0)
-15 2015-04-30 16:29:12.858385 7fe9757f2700  3 client.58272 ll_forget
115 1
-14 2015-04-30 16:29:12.858386 7fe9757f2700 20 client.58272 _ll_put
0x7fe968021c30 115 1 - 9
-13 2015-04-30 16:29:12.858500 7fe974ff1700 20 client.58272 _ll_get
0x7fe968021c30 115 - 10
-12 2015-04-30 16:29:12.858503 7fe974ff1700  3 client.58272 ll_getattr
115.head
-11 2015-04-30 16:29:12.858505 7fe974ff1700 10 client.58272 _getattr
mask pAsLsXsFs issued=1
-10 2015-04-30 16:29:12.858509 7fe974ff1700 10 client.58272 fill_stat on
115 snap/devhead mode 0100664 mtime 2015-04-20 14:14:57.961482 ctime
2015-04-20 14:14:57.960359
 -9 2015-04-30 16:29:12.858518 7fe974ff1700  3 client.58272 ll_getattr
115.head = 0
 -8 2015-04-30 16:29:12.858525 7fe974ff1700  3 client.58272 ll_forget
115 1
 -7 2015-04-30 16:29:12.858526 7fe974ff1700 20 client.58272 _ll_put
0x7fe968021c30 115 1 - 9
 -6 2015-04-30 16:29:12.858536 7fe9577fe700  3 client.58272 ll_read
0x7fe95c0052c0 115  0~4096
 -5 2015-04-30 16:29:12.858539 7fe9577fe700 10 client.58272 get_caps
115.head(ref=3 ll_ref=9 cap_refs={} open={1=1} mode=100664
size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr)
objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980
0x7fe968021c30) have pAsLsXsFscr need Fr want Fc but not Fc revoking -
 -4 2015-04-30 16:29:12.858561 7fe9577fe700 10 client.58272 _read_async
115.head(ref=3 ll_ref=9 cap_refs={2048=1} open={1=1} mode=100664
size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr)
objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980
0x7fe968021c30) 0~4096
 -3 2015-04-30 16:29:12.858575 7fe9577fe700 10 client.58272 max_byes=0
max_periods=4
 -2 2015-04-30 16:29:12.858692 7fe9577fe700  5 client.58272 get_cap_ref
got first FILE_CACHE ref on 115.head(ref=3 ll_ref=9
cap_refs={1024=0,2048=1} open={1=1} mode=100664 size=119/0 mtime=2015-04-20
14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[115 ts 0/0
objects 1 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30)
 -1 2015-04-30 16:29:12.867657 7fe9797fa700 10 client.58272
ms_handle_connect on 172.16.3.149:6823/982446
  0 2015-04-30 16:29:12.872787 7fe97bfff700 -1 *** Caught signal
(Aborted) **



On 2015年04月30日 16:21, flisky wrote:


When I read the file through the ceph-fuse, the process crashed.

Here is the log -

terminate called after throwing an instance of
'ceph::buffer::end_of_buffer'
what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
   in thread 7fe0814d3700
   ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
   1: (()+0x249805) [0x7fe08670b805]
   2: (()+0x10d10) [0x7fe085c39d10]
   3: (gsignal()+0x37) [0x7fe0844d3267]
   4: (abort()+0x16a) [0x7fe0844d4eca]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) 

Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Nick Fisk
How many rsyncs are you doing at a time? If it is only a couple, you will not
be able to take advantage of the full number of OSDs, as each block of data
is only located on one OSD (not including replicas). When you look at disk
statistics you are seeing an average over time, so it will look like the
OSDs are not very busy, when in fact each one is busy for a very brief
period. 

 

SSD journals will help your write latency, probably going down from around
15-30ms to under 5ms 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Piotr Wachowicz
Sent: 01 May 2015 09:31
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to estimate whether putting a journal on SSD will
help with performance?

 

Is there any way to confirm (beforehand) that using SSDs for journals will
help?

We're seeing very disappointing Ceph performance. We have 10GigE
interconnect (as a shared public/internal network). 

 

We're wondering whether it makes sense to buy SSDs and put journals on them.
But we're looking for a way to verify that this will actually help BEFORE we
splash cash on SSDs.

 

The problem is that the way we have things configured now, with journals on
spinning HDDs (shared with OSDs as the backend storage), apart from slow
read/write performance to Ceph I already mention, we're also seeing fairly
low disk utilization on OSDs. 

 

This low disk utilization suggests that journals are not really used to
their max, which begs for the questions whether buying SSDs for journals
will help.

 

This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we
cannot really confirm that.

 

Our typical data access use case is a lot of small random read/writes. We're
doing a lot of rsyncing (entire regular linux filesystems) from one VM to
another.

 

We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't
really help all that much.

 

So, is there any way to confirm beforehand that using SSDs for journals will
help in our case?

 

Kind Regards,
Piotr






Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Steffen W Sørensen
Also remember to drive your Ceph cluster as hard as you have the means to, e.g. 
by tuning the VM OSes/IO subsystems: using multiple RBD devices per VM (to 
issue more outstanding IOPs from the VM IO subsystem), the best IO scheduler, CPU 
power + memory per VM, and also ensuring low network latency + bandwidth between 
your rsyncing VMs, etc.

 On 01/05/2015, at 11.13, Piotr Wachowicz 
 piotr.wachow...@brightcomputing.com wrote:
 
 Thanks for your answer, Nick.
 
 Typically it's a single rsync session at a time (sometimes two, but rarely 
 more concurrently). So it's a single ~5GB typical linux filesystem from one 
 random VM to another random VM.
 
 Apart from using RBD Cache, is there any other way to improve the overall 
 performance of such a use case in a Ceph cluster?
 
 In theory I guess we could always tarball it, and rsync the tarball, thus 
 effectively using sequential IO rather than random. But that's simply not 
 feasible for us at the moment. Any other ways?
 
 Sidequestion: does using RBDCache impact the way data is stored on the 
 client? (e.g. a write call returning after data has been written to Journal 
 (fast) vs  written all the way to the OSD data store(slow)). I'm guessing 
 it's always the first one, regardless of whether client uses RBDCache or not, 
 right? My logic here is that otherwise that would imply that clients can 
 impact the way OSDs behave, which could be dangerous in some situations.
 
 Kind Regards,
 Piotr
 
 
 
 On Fri, May 1, 2015 at 10:59 AM, Nick Fisk n...@fisk.me.uk 
 mailto:n...@fisk.me.uk wrote:
 How many Rsync’s are doing at a time? If it is only a couple, you will not be 
 able to take advantage of the full number of OSD’s, as each block of data is 
 only located on 1 OSD (not including replicas). When you look at disk 
 statistics you are seeing an average over time, so it will look like the 
 OSD’s are not very busy, when in fact each one is busy for a very brief 
 period.
 
  
 
 SSD journals will help your write latency, probably going down from around 
 15-30ms to under 5ms
 
  
 
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
 mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz
 Sent: 01 May 2015 09:31
 To: ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 Subject: [ceph-users] How to estimate whether putting a journal on SSD will 
 help with performance?
 
  
 
 Is there any way to confirm (beforehand) that using SSDs for journals will 
 help?
 
 We're seeing very disappointing Ceph performance. We have 10GigE interconnect 
 (as a shared public/internal network). 
 
  
 
 We're wondering whether it makes sense to buy SSDs and put journals on them. 
 But we're looking for a way to verify that this will actually help BEFORE we 
 splash cash on SSDs.
 
  
 
 The problem is that the way we have things configured now, with journals on 
 spinning HDDs (shared with OSDs as the backend storage), apart from slow 
 read/write performance to Ceph I already mention, we're also seeing fairly 
 low disk utilization on OSDs. 
 
  
 
 This low disk utilization suggests that journals are not really used to their 
 max, which begs for the questions whether buying SSDs for journals will help.
 
  
 
 This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we 
 cannot really confirm that.
 
  
 
 Our typical data access use case is a lot of small random read/writes. We're 
 doing a lot of rsyncing (entire regular linux filesystems) from one VM to 
 another.
 
  
 
 We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't 
 really help all that much.
 
  
 
 So, is there any way to confirm beforehand that using SSDs for journals will 
 help in our case?
 
  
 
 Kind Regards,
 Piotr
 
 
 
 


Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Nick Fisk
Yeah, that's your problem: doing a single-threaded rsync when you have quite
poor write latency will not be quick. SSD journals should give you a fair
performance boost; otherwise you need to coalesce the writes at the client
so that Ceph is given bigger IOs at higher queue depths.

 

RBD Cache can help here as well as potentially FS tuning to buffer more
aggressively. If writeback RBD cache is enabled, data will be buffered by
RBD until a sync is called by the client, so data loss can occur during this
period if the app is not issuing fsyncs properly. Once a sync is called data
is flushed to the journals and then later to the actual OSD store.
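
A tiny sketch of what "issuing fsyncs properly" means from inside the guest 
(the path is hypothetical); the write may sit in the writeback cache until the 
fsync, which is the point at which it gets flushed towards the journals:

    # Sketch only -- force buffered writes out of the (writeback) cache.
    import os

    fd = os.open('/mnt/data/important.bin', os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b'payload')
    os.fsync(fd)   # returns only after the data has been flushed from the cache
    os.close(fd)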

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Piotr Wachowicz
Sent: 01 May 2015 10:14
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD
will help with performance?

 

Thanks for your answer, Nick.

 

Typically it's a single rsync session at a time (sometimes two, but rarely
more concurrently). So it's a single ~5GB typical linux filesystem from one
random VM to another random VM.

 

Apart from using RBD Cache, is there any other way to improve the overall
performance of such a use case in a Ceph cluster?

 

In theory I guess we could always tarball it, and rsync the tarball, thus
effectively using sequential IO rather than random. But that's simply not
feasible for us at the moment. Any other ways?

 

Sidequestion: does using RBDCache impact the way data is stored on the
client? (e.g. a write call returning after data has been written to Journal
(fast) vs  written all the way to the OSD data store(slow)). I'm guessing
it's always the first one, regardless of whether client uses RBDCache or
not, right? My logic here is that otherwise that would imply that clients
can impact the way OSDs behave, which could be dangerous in some situations.

 

Kind Regards,

Piotr

 

 

 

On Fri, May 1, 2015 at 10:59 AM, Nick Fisk n...@fisk.me.uk
mailto:n...@fisk.me.uk  wrote:

How many Rsync's are doing at a time? If it is only a couple, you will not
be able to take advantage of the full number of OSD's, as each block of data
is only located on 1 OSD (not including replicas). When you look at disk
statistics you are seeing an average over time, so it will look like the
OSD's are not very busy, when in fact each one is busy for a very brief
period. 

 

SSD journals will help your write latency, probably going down from around
15-30ms to under 5ms 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
mailto:ceph-users-boun...@lists.ceph.com ] On Behalf Of Piotr Wachowicz
Sent: 01 May 2015 09:31
To: ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com 
Subject: [ceph-users] How to estimate whether putting a journal on SSD will
help with performance?

 

Is there any way to confirm (beforehand) that using SSDs for journals will
help?

We're seeing very disappointing Ceph performance. We have 10GigE
interconnect (as a shared public/internal network). 

 

We're wondering whether it makes sense to buy SSDs and put journals on them.
But we're looking for a way to verify that this will actually help BEFORE we
splash cash on SSDs.

 

The problem is that the way we have things configured now, with journals on
spinning HDDs (shared with OSDs as the backend storage), apart from slow
read/write performance to Ceph I already mention, we're also seeing fairly
low disk utilization on OSDs. 

 

This low disk utilization suggests that journals are not really used to
their max, which begs for the questions whether buying SSDs for journals
will help.

 

This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we
cannot really confirm that.

 

Our typical data access use case is a lot of small random read/writes. We're
doing a lot of rsyncing (entire regular linux filesystems) from one VM to
another.

 

We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't
really help all that much.

 

So, is there any way to confirm beforehand that using SSDs for journals will
help in our case?

 

Kind Regards,
Piotr




 







 



Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Andrei Mikhailovsky
Piotr,

You may also investigate if the cache tier made of a couple of ssds could help 
you. Not sure how the data is used in your company, but if you have a bunch of 
hot data that moves around from one vm to another it might greatly speed up the 
rsync. On the other hand, if a lot of rsync data is cold, it might have an 
adverse effect on performance.

As a test, you could try to create a small pool with a couple of ssds in a 
cache tier on top of your spinning osds. You don't need to purchase tons of 
ssds in advance. As a test case, I would suggest 2-4 ssds in a cache tier 
should be okay for the PoC.

Andrei


- Original Message -
From: Nick Fisk n...@fisk.me.uk
To: Piotr Wachowicz piotr.wachow...@brightcomputing.com
Cc: ceph-users@lists.ceph.com
Sent: Friday, 1 May, 2015 10:42:12 AM
Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will 
help with performance?





Yeah, that’s your problem, doing a single thread rsync when you have quite poor 
write latency will not be quick. SSD journals should give you a fair 
performance boost, otherwise you need to coalesce the writes at the client so 
that Ceph is given bigger IOs at higher queue depths. 



RBD Cache can help here as well as potentially FS tuning to buffer more 
aggressively. If writeback RBD cache is enabled, data will be buffered by RBD 
until a sync is called by the client, so data loss can occur during this period 
if the app is not issuing fsyncs properly. Once a sync is called data is 
flushed to the journals and then later to the actual OSD store. 






From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr 
Wachowicz 
Sent: 01 May 2015 10:14 
To: Nick Fisk 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will 
help with performance? 





Thanks for your answer, Nick. 




Typically it's a single rsync session at a time (sometimes two, but rarely more 
concurrently). So it's a single ~5GB typical linux filesystem from one random 
VM to another random VM. 





Apart from using RBD Cache, is there any other way to improve the overall 
performance of such a use case in a Ceph cluster? 





In theory I guess we could always tarball it, and rsync the tarball, thus 
effectively using sequential IO rather than random. But that's simply not 
feasible for us at the moment. Any other ways? 





Sidequestion: does using RBDCache impact the way data is stored on the client? 
(e.g. a write call returning after data has been written to Journal (fast) vs 
written all the way to the OSD data store(slow)). I'm guessing it's always the 
first one, regardless of whether client uses RBDCache or not, right? My logic 
here is that otherwise that would imply that clients can impact the way OSDs 
behave, which could be dangerous in some situations. 





Kind Regards, 




Piotr 










On Fri, May 1, 2015 at 10:59 AM, Nick Fisk  n...@fisk.me.uk  wrote: 





How many Rsync’s are doing at a time? If it is only a couple, you will not be 
able to take advantage of the full number of OSD’s, as each block of data is 
only located on 1 OSD (not including replicas). When you look at disk 
statistics you are seeing an average over time, so it will look like the OSD’s 
are not very busy, when in fact each one is busy for a very brief period. 



SSD journals will help your write latency, probably going down from around 
15-30ms to under 5ms 






From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf Of 
Piotr Wachowicz 
Sent: 01 May 2015 09:31 
To: ceph-users@lists.ceph.com 
Subject: [ceph-users] How to estimate whether putting a journal on SSD will 
help with performance? 







Is there any way to confirm (beforehand) that using SSDs for journals will 
help? 

We're seeing very disappointing Ceph performance. We have 10GigE interconnect 
(as a shared public/internal network). 





We're wondering whether it makes sense to buy SSDs and put journals on them. 
But we're looking for a way to verify that this will actually help BEFORE we 
splash cash on SSDs. 





The problem is that the way we have things configured now, with journals on 
spinning HDDs (shared with OSDs as the backend storage), apart from slow 
read/write performance to Ceph I already mention, we're also seeing fairly low 
disk utilization on OSDs. 





This low disk utilization suggests that journals are not really used to their 
max, which begs for the questions whether buying SSDs for journals will help. 





This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we 
cannot really confirm that. 





Our typical data access use case is a lot of small random read/writes. We're 
doing a lot of rsyncing (entire regular linux filesystems) from one VM to 
another. 





We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't really 
help all that much. 





So, is there any way to confirm beforehand that using SSDs for journals 

Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Udo Lembke
Hi,

On 01.05.2015 10:30, Piotr Wachowicz wrote:
 Is there any way to confirm (beforehand) that using SSDs for journals
 will help?
Yes, an SSD journal helps a lot (if you use the right SSDs) for write speed,
and in my experience it also helped (but not too much) with
read performance.


 We're seeing very disappointing Ceph performance. We have 10GigE
 interconnect (as a shared public/internal network).
Which kind of CPU do you use for the OSD-hosts?


 We're wondering whether it makes sense to buy SSDs and put journals on
 them. But we're looking for a way to verify that this will actually
 help BEFORE we splash cash on SSDs.
I can recommend the Intel DC S3700 SSD for journaling! In the beginning
I started with different, much cheaper models, but that was the wrong
decision.

 The problem is that the way we have things configured now, with
 journals on spinning HDDs (shared with OSDs as the backend storage),
 apart from slow read/write performance to Ceph I already mention,
 we're also seeing fairly low disk utilization on OSDs. 

 This low disk utilization suggests that journals are not really used
 to their max, which begs for the questions whether buying SSDs for
 journals will help.

 This kind of suggests that the bottleneck is NOT the disk. But,m yeah,
 we cannot really confirm that.

 Our typical data access use case is a lot of small random read/writes.
 We're doing a lot of rsyncing (entire regular linux filesystems) from
 one VM to another.

 We're using Ceph for OpenStack storage (kvm). Enabling RBD cache
 didn't really help all that much.
The read speed can be optimized with a bigger read-ahead cache inside
the VM, like:
echo 4096 > /sys/block/vda/queue/read_ahead_kb

Udo


[ceph-users] Radosgw agent and federated config problems

2015-05-01 Thread Thomas Klaver
We run a ceph cluster with radosgw on top of it. During the installation we 
never specified any regions or zones, which means that every bucket 
currently resides in the default region. To support a federated config we have 
built a test cluster that replicates the current production setup, with the same 
default region and zone. Once that setup was running we went through the 
following steps to make the switch to a federated config. Our second zone is 
completely empty to begin with and has no data in it at this point.

1) We created a new region that includes the api_name, master_zone and 
endpoints for our two zones.
2) We created two users in zone1 and zone2 with the same access and secret key 
across the two zones.
3) We created two zones with default pools and specified the access and secret 
key.
4) We have changed ceph.conf to include the new region and zone and pushed it 
to our nodes.
5) The default region was set to our new region through radosgw-admin and the 
default was removed.
6) The regionmap was updated to reflect the changes we made to our regions.

This last step proved to be a little difficult, as radosgw-admin regionmap 
update returns:
7f7b36b7b840 -1 cannot update region map, master_region conflict

The master_region is set to 'ams' in both clusters.

It may be that we will be running into issues later on because we have solved this 
the 'hard way' by changing the regionmap manually.

6) As we have changed our region and zones we have restarted radosgw. As 
expected this takes our objects offline.
7) We have updated all buckets to sit in the new region.

After our buckets have changed all of our objects are back online again. 

We have not made any changes to our pools. The new region points to the 
existing pool so this has never resulted in any physical movement of data. Once 
this was all done the cluster was up and running, as expected, but serving its 
content from the new zone.

At this point we set up radosgw-admin with the users from steps 2 and 3 matching 
our zones. The first time we did this we ran into a couple of problems. 
The first was that the radosgw-admin available in the repository is a little 
older than the one on github. That version lacks a lot of exception handling 
and proper error output, making it difficult to diagnose issues as they come 
up. We've switched to the latest available version from github, which has helped 
us a lot to get where we are now. We had to switch radosgw from sockets to tcp 
first, but the manual didn't include a specific parameter, which led to radosgw 
not being able to handle /-characters properly. Once we added 
AllowEncodedSlashes it all magically worked. 

As it took us quite some time and fiddling around to get to this point, we 
wanted to replicate the exact same situation on another test environment again 
to make sure we know what to do when we change this in a live 
environment. And this is where it all fails. We are unable to get this setup 
back up again. We've compared configurations and checked every single setting 
we've played with, but we're unable to find what's going wrong. The error 
message is pretty obvious though:

2015-04-24 15:37:55,073 9406 [radosgw_agent.worker][DEBUG ] syncing object 
object/test.txt
2015-04-24 15:37:55,089 9406 [radosgw_agent.worker][DEBUG ] object 
object/test.txt not found on master, deleting from secondary

I was expecting to find this entry in our Apache log files. Surely it would 
trigger a 404. It turns out though that we're not seeing anything in the logs 
at all; the request doesn't show up at all. Though when I look at the logs in 
zone2 I see the following:

[24/Apr/2015:15:45:01 +] PUT 
/object/test.txt?rgwx-op-id=radosgw1%3A9727%3A1rgwx-source-zone=zone1rgwx-client-id=radosgw-agent
 HTTP/1.1 404 242 - Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic
[24/Apr/2015:15:45:01 +] GET /object/?max-keys=0 HTTP/1.1 200 408 - 
Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic
[24/Apr/2015:15:45:01 +] DELETE /object/test.txt HTTP/1.1 204 126 - 
Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic”

We’re running ceph and radosgw 0.94.1, the agent comes from github as the one 
that’s in the repository doesn’t seem entirely stable nor very clear on error 
messages.

I’m sure we may be missing something, but it feels like radosgw-agent isn’t 
production ready yet. Any thoughts?

Thanks,
Thomas



Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

2015-05-01 Thread Piotr Wachowicz
 yes SSD-Journal helps a lot (if you use the right SSDs)


Which SSDs should we avoid for journaling, in your experience? Why?


  We're seeing very disappointing Ceph performance. We have 10GigE
  interconnect (as a shared public/internal network).
 Which kind of CPU do you use for the OSD-hosts?


Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz

FYI, we are hosting VMs on our OSD nodes, but the VMs use very small
amounts of CPU and RAM.


  We're wondering whether it makes sense to buy SSDs and put journals on
  them. But we're looking for a way to verify that this will actually
  help BEFORE we splash cash on SSDs.
 I can recommend the Intel DC S3700 SSD for journaling! In the beginning
 I started with different much cheaper models, but this was the wrong
 decision.


What, apart from the price, made the difference? Sustained read/write
bandwidth? IOPS?

We're considering this one (PCI-e SSD). What do you think?
http://www.plextor-digital.com/index.php/en/M6e-BK/m6e-bk.html
PX-128M6e-BK


Also, we're thinking about sharing one SSD between two OSDs. Any reason why
this would be a bad idea?


  We're using Ceph for OpenStack storage (kvm). Enabling RBD cache
  didn't really help all that much.
 The read speed can be optimized with an bigger read ahead cache inside
 the VM, like:
 echo 4096 > /sys/block/vda/queue/read_ahead_kb



Thanks, we will try that.