Re: [ceph-users] Ceph VM Backup

2013-08-19 Thread Wido den Hollander

On 08/18/2013 10:58 PM, Wolfgang Hennerbichler wrote:

On Sun, Aug 18, 2013 at 06:57:56PM +1000, Martin Rudat wrote:

Hi,

On 2013-02-25 20:46, Wolfgang Hennerbichler wrote:

maybe some of you are interested in this - I'm using a dedicated VM to
back up important VMs which have their storage in RBD. This is nothing
fancy and not implemented perfectly, but it works. The VMs don't notice
that they're backed up; the only requirement is that the VM's filesystem
sits directly on the RBD, as the script doesn't calculate partition-table
offsets.

Looking at how you're doing that, if you trust the script to be able
to create new snapshots, couldn't you do that with less machinery
involved by installing the ceph binaries on the backup host,
creating the snapshot and attaching it with rbd, rather than
attaching it to the VM?
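For reference, a minimal sketch of that approach (the pool and image names are made up here, and it assumes the backup host's kernel can map the image - which, as noted in the reply below, wasn't possible for format 2 images at the time):

rbd snap create rbd/vm1@backup
rbd map rbd/vm1@backup              # appears as e.g. /dev/rbd0, read-only
mount -o ro /dev/rbd0 /mnt/backup
# ... copy the data off ...
umount /mnt/backup
rbd unmap /dev/rbd0
rbd snap rm rbd/vm1@backup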


this was written at a time when kernels could not map format 2 rbd images.


Also, where's the fsck call? You're snapshotting a running system, so
it's almost guaranteed that you've taken the snapshot in the middle
of a batch of writes. Then again, it would be cool to be able to ask
the VM to sync, to capture a consistent filesystem.
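As an aside, if the guest runs the qemu guest agent, the filesystem can be frozen around the snapshot; a rough sketch (domain and image names are examples, and domfsfreeze/domfsthaw need a reasonably recent libvirt):

virsh domfsfreeze vm1
rbd snap create rbd/vm1-disk@backup
virsh domfsthaw vm1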


I use journaling filesystems. The journal is replayed during mount (can be seen 
in kernel logs) and the FS is therefore considered to be clean.


I don't know about recent kernels, but older ones could be made to
crash by boldly mounting a filesystem that hadn't been fscked.


This works for production systems. That's what journals are all about, right?


Correct, but older kernels might not respect barriers correctly. But if 
you use a modern kernel (I think >2.6.36 or so) there won't be a problem.


Like you said, on mount the journal will be replayed and the FS will be 
clean.


It's nothing less than an unexpected shutdown.

Wido



wogri


--
Martin Rudat


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Destroyed Ceph Cluster

2013-08-19 Thread Georg Höllrigl

Hello Mark,
Hello list,


I fixed the monitor issue. There was another monitor which didn't run 
any more. I've removed that - now I'm lost with the MDS still replaying 
its journal.


root@vvx-ceph-m-02:/var/lib/ceph/mon# ceph health detail
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is degraded
pg 0.3f is stuck unclean since forever, current state active+degraded, 
last acting [28]

...
pg 2.2 is stuck unclean since forever, current state active+degraded, 
last acting [37]

pg 2.3d is active+degraded, acting [28]
...
pg 0.10 is active+degraded, acting [35]
pg 2.d is active+degraded, acting [27]
...
pg 0.0 is active+degraded, acting [23]
mds cluster is degraded
mds.vvx-ceph-m-01 at 10.0.0.176:6800/1098 rank 0 is replaying journal



# ceph mds stat
e8: 1/1/1 up {0=vvx-ceph-m-01=up:replay}, 2 up:standby

the logs for mds are empty on all three.

Removing the MDS is still not supported, when I look at:
http://ceph.com/docs/master/rados/deployment/ceph-deploy-mds/



Georg



On 16.08.2013 16:23, Mark Nelson wrote:

Hi Georg,

I'm not an expert on the monitors, but that's probably where I would
start.  Take a look at your monitor logs and see if you can get a sense
for why one of your monitors is down.  Some of the other devs will
probably be around later that might know if there are any known issues
with recreating the OSDs and missing PGs.

Mark

On 08/16/2013 08:21 AM, Georg Höllrigl wrote:

Hello,

I'm still evaluating ceph - now a test cluster with the 0.67 dumpling.
I've created the setup with ceph-deploy from GIT.
I've recreated a bunch of OSDs, to give them another journal.
There already was some test data on these OSDs.
I've already recreated the missing PGs with "ceph pg force_create_pg"


HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

Any idea how to fix the cluster, besides completely rebuilding the
cluster from scratch? What if such a thing happens in a production
environment...

The pgs from "ceph pg dump" looks all like creating for some time now:

2.3d    0   0   0   0   0   0   0   creating   2013-08-16 13:43:08.186537   0'0   0:0   []   []   0'0   0.00   0'0   0.00

Is there a way to just dump the data, that was on the discarded OSDs?




Kind Regards,
Georg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Dipl.-Ing. (FH) Georg Höllrigl
Technik



Xidras GmbH
Stockern 47
3744 Stockern
Austria

Tel: +43 (0) 2983 201 - 30505
Fax: +43 (0) 2983 201 - 930505
Email:   georg.hoellr...@xidras.com
Web: http://www.xidras.com

FN 317036 f | Landesgericht Krems | ATU64485024





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Samuel Just
You're right, PGLog::undirty() looks suspicious.  I just pushed a
branch wip-dumpling-pglog-undirty with a new config
(osd_debug_pg_log_writeout) which if set to false will disable some
strictly debugging checks which occur in PGLog::undirty().  We haven't
actually seen these checks causing excessive cpu usage, so this may be
a red herring.
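For anyone who wants to try the branch, disabling the checks would look something like this (option name taken from above; the exact form may differ in the branch):

# in ceph.conf on the OSD hosts, [osd] section
osd debug pg log writeout = false

# or injected at runtime, without restarting the OSDs
ceph tell osd.* injectargs '--osd_debug_pg_log_writeout=false'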
-Sam

On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey  wrote:
> Hey Mark,
>
> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
>> > Hey all,
>> >
>> > This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
>> > created in the tracker.  Thought I would pass it through the list as
>> > well, to get an idea if anyone else is running into it.  It may only
>> > show under higher loads.  More info about my setup is in the bug-report
>> > above.  Here goes:
>> >
>> >
>> > I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
>> > and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
>> > unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
>> > +MB/sec on simple linear writes to a file with `dd' inside a VM on this
>> > cluster under regular load and the osds usually averaged 20-100%
>> > CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
>> > the osds shot up to 100% to 400% in `top' (multi-core system) and the
>> > speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
>> > complained that disk-access inside the VMs was significantly slower and
>> > the backups of the RBD-store I was running, also got behind quickly.
>> >
>> > After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
>> > rest at 0.67 Dumpling, speed and load returned to normal. I have
>> > repeated this performance-hit upon upgrade on a similar test-cluster
>> > under no additional load at all. Although CPU-usage for the osds wasn't
>> > as dramatic during these tests because there was no base-load from other
>> > VMs, I/O-performance dropped significantly after upgrading during these
>> > tests as well, and returned to normal after downgrading the osds.
>> >
>> > I'm not sure what to make of it. There are no visible errors in the logs
>> > and everything runs and reports good health, it's just a lot slower,
>> > with a lot more CPU-usage.
>>
>> Hi Oliver,
>>
>> If you have access to the perf command on this system, could you try
>> running:
>>
>> "sudo perf top"
>>
>> And if that doesn't give you much,
>>
>> "sudo perf record -g"
>>
>> then:
>>
>> "sudo perf report | less"
>>
>> during the period of high CPU usage?  This will give you a call graph.
>> There may be symbols missing, but it might help track down what the OSDs
>> are doing.
>
> Thanks for your help!  I did a couple of runs on my test-cluster,
> loading it with writes from 3 VMs concurrently and measuring the results
> at the first node with all 0.67 Dumpling-components and with the osds
> replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
> while, then copied anything that showed in red and green into this post.
> Here are the results (sorry for the word-wraps):
>
> First, with 0.61.7 osds:
>
>  19.91%  [kernel][k] intel_idle
>  10.18%  [kernel][k] _raw_spin_lock_irqsave
>   6.79%  ceph-osd[.] ceph_crc32c_le
>   4.93%  [kernel][k] default_send_IPI_mask_sequence_phys
>   2.71%  [kernel][k] copy_user_generic_string
>   1.42%  libc-2.11.3.so  [.] memcpy
>   1.23%  [kernel][k] find_busiest_group
>   1.13%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
>   1.11%  [kernel][k] _raw_spin_lock
>   0.99%  kvm [.] 0x1931f8
>   0.92%  [igb]   [k] igb_poll
>   0.87%  [kernel][k] native_write_cr0
>   0.80%  [kernel][k] csum_partial
>   0.78%  [kernel][k] __do_softirq
>   0.63%  [kernel][k] hpet_legacy_next_event
>   0.53%  [ip_tables] [k] ipt_do_table
>   0.50%  libc-2.11.3.so  [.] 0x74433
>
> Second test, with 0.67 osds:
>
>  18.32%  [kernel]  [k] intel_idle
>   7.58%  [kernel]  [k] _raw_spin_lock_irqsave
>   7.04%  ceph-osd  [.] PGLog::undirty()
>   4.39%  ceph-osd  [.] ceph_crc32c_le_intel
>   3.92%  [kernel]  [k] default_send_IPI_mask_sequence_phys
>   2.25%  [kernel]  [k] copy_user_generic_string
>   1.76%  libc-2.11.3.so[.] memcpy
>   1.56%  librados.so.2.0.0 [.] ceph_crc32c_le_intel
>   1.40%  libc-2.11.3.so[.] vfprintf
>   1.12%  libc-2.11.3.so[.] 0x7217b
>   1.05%  [kernel]  [k] _raw_spin_lock
>   1.01%  [kernel]  [k] find_busiest_group
>   0.83% 

[ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
Hello, I just have some small questions about Ceph Deployment models and if
this would work for us.
Currently the first question would be, is it possible to have a ceph single
node setup, where everything is on one node?
Our Application, Ceph's object storage and a database? We focus on this
deployment model for our very small customers, who only have like 20
members that use our application, so the load wouldn't be very high.
And the next question would be, is it possible to extend the Ceph single
node to 3 nodes later, if they need more availability?

Also we always want to use Shared Nothing Machines, so every service would
be on one machine, is this Okai for Ceph, or does Ceph really need a lot of
CPU/Memory/Disk Speed?
Currently we make an archiving software for small customers and we want to
move things on the file system on a object storage. Currently we only have
customers that needs 1 machine or 3 machines. But everything should work as
fine on more.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Wolfgang Hennerbichler
On 08/19/2013 10:36 AM, Schmitt, Christian wrote:
> Hello, I just have some small questions about Ceph Deployment models and
> if this would work for us.
> Currently the first question would be, is it possible to have a ceph
> single node setup, where everything is on one node?

yes. depends on 'everything', but it's possible (though not recommended)
to run mon, mds, and osd's on the same host, and even do virtualisation.
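For reference, a minimal single-host sketch of the relevant ceph.conf settings (assuming the default CRUSH rules, which otherwise refuse to put replicas on the same host):

[global]
osd pool default size = 2
# choose replicas across OSDs instead of across hosts
osd crush chooseleaf type = 0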

> Our Application, Ceph's object storage and a database? 

what is 'a database'?

> We focus on this
> deployment model for our very small customers, who only have like 20
> members that use our application, so the load wouldn't be very high.
> And the next question would be, is it possible to extend the Ceph single
> node to 3 nodes later, if they need more availability?

yes.

> Also we always want to use Shared Nothing Machines, so every service
> would be on one machine, is this Okai for Ceph, or does Ceph really need
> a lot of CPU/Memory/Disk Speed?

ceph needs cpu / disk speed when disks fail and need to be recovered. it
also uses some cpu when you have a lot of i/o, but generally it is
rather lightweight.
shared nothing is possible with ceph, but in the end this really depends
on your application.

> Currently we make an archiving software for small customers and we want
> to move things on the file system on a object storage. 

you mean from the filesystem to an object storage?

> Currently we only
> have customers that needs 1 machine or 3 machines. But everything should
> work as fine on more.

it would with ceph. probably :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Martin Rudat

Hi,

On 2013-08-19 15:48, Guang Yang wrote:
After walking through some documents of Ceph, I have a couple of 
questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the 
ability to handle different work-loads (from KB to GB), with 
corresponding performance report?
No idea; I've not seen one, but I haven't gone looking either. I think 
that I've seen mention of benchmarks, though.


  2. Looking at some industry solutions for distributed storage, GFS / 
Haystack / HDFS all use meta-server to store the logical-to-physical 
mapping within memory and avoid disk I/O lookup for file reading, is 
the concern valid for Ceph (in terms of latency to read file)?
I'd imagine that with enough memory on the host running the mds, it 
would be the equivalent to explicitly holding everything in memory, as 
there would be enough data buffered that there's nearly no disk i/o to 
do metadata lookup; you're going to have to have i/o for writing, but 
that's unavoidable if you want to maintain data integrity.


I haven't a clue if there's any kind of striping between multiple 
metadata servers, if you have more metadata in flight than can 
comfortably fit entirely in memory on a single host, and given you can 
cram 48G of ram into a machine with an intel CPU (at the current 
8G/dimm), without needing to go to a multiple-socket motherboard, it 
would take quite some effort to reach that state.


  3. Some industry research shows that one issue of file systems is 
the metadata-to-data ratio, in terms of both access and storage, and 
some techniques combine small files into large physical files to 
reduce the ratio (Haystack, for example). If we want to use Ceph to 
store photos, should this be a concern, as Ceph uses one physical 
file per object?
What would the average object size be? The default size for a 
chunk/slice/...? in RBD is 4M (also the default extent size in LVM); I 
presume that it's not just a random number pulled out of the air, and 
there's at least some vague thought to balancing data/metadata ratio.


--
Martin Rudat


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] large memory leak on scrubbing

2013-08-19 Thread Mostowiec Dominik
Thanks for your response.
Great.

Is it also fixed in the latest Cuttlefish?

We have two problems with scrubbing:
- memory leaks
- slow requests, and the OSD holding the bucket index being wrongly marked down (during scrubbing)

For now we have decided to turn off scrubbing and trigger it in a maintenance window.
I noticed that "ceph osd scrub" or "ceph osd deep-scrub" triggers a scrub on an OSD, 
but not for all of its PGs.
Is it possible to trigger scrubbing of all PGs on one OSD?
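In case it helps: I believe "ceph osd scrub N" / "ceph osd deep-scrub N" only queue the PGs for which osd.N is primary. A rough sketch for deep-scrubbing every PG that has the OSD in its up/acting set - the field numbers of "ceph pg dump" vary between versions, so check your own output before relying on it:

ceph pg dump 2>/dev/null | awk -v osd=4 '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { re = "(\\[|,)" osd "(,|\\])"; if ($14 ~ re || $15 ~ re) print $1 }' | while read pg; do ceph pg deep-scrub "$pg"; done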

--
Regards 
Dominik


-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: Saturday, August 17, 2013 5:11 PM
To: Mostowiec Dominik
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Studziński 
Krzysztof; Sydor Bohdan
Subject: Re: [ceph-users] large memory leak on scrubbing

Hi Dominic,

There is a bug fixed a couple of months back that fixes excessive memory 
consumption during scrub.  You can upgrade to the latest 'bobtail' branch.  
See

 http://ceph.com/docs/master/install/debian/#development-testing-packages

Installing that package should clear this up.

sage


On Fri, 16 Aug 2013, Mostowiec Dominik wrote:

> Hi,
> We noticed some issues on our CEPH/S3 cluster which I think are related to scrubbing: 
> large memory leaks.
> 
> Logs 09.xx: 
> https://www.dropbox.com/s/4z1fzg239j43igs/ceph-osd.4.log_09xx.tar.gz
> From 09.30 to 09.44 (14 minutes) the osd.4 process grows to 28G. 
> 
> I think this is something curious:
> 2013-08-16 09:43:48.801331 7f6570d2e700  0 log [WRN] : slow request 
> 32.794125 seconds old, received at 2013-08-16 09:43:16.007104: 
> osd_sub_op(unknown.0.0:0 16.113d 0//0//-1 [scrub-reserve] v 0'0 
> snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
> 
> We have a large rgw index and a lot of large files on this cluster.
> ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
> Setup: 
> - 12 servers x 12 OSD
> - 3 mons
> Default scrubbing settings.
> Journal and filestore settings:
> journal aio = true
> filestore flush min = 0
> filestore flusher = false
> filestore fiemap = false
> filestore op threads = 4
> filestore queue max ops = 4096
> filestore queue max bytes = 10485760
> filestore queue committing max bytes = 10485760
> journal max write bytes = 10485760
> journal queue max bytes = 10485760
> ms dispatch throttle bytes = 10485760
> objecter infilght op bytes = 10485760
> 
> Is this a known bug in this version?
> (Do you know some workaround to fix this?)
> 
> ---
> Regards
> Dominik
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Mark Kirkwood

On 19/08/13 18:17, Guang Yang wrote:


   3. Some industry research shows that one issue of file systems is the
metadata-to-data ratio, in terms of both access and storage, and some
techniques combine small files into large physical files to reduce the
ratio (Haystack, for example). If we want to use Ceph to store photos,
should this be a concern, as Ceph uses one physical file per object?


If you use Ceph as a pure object store, and get and put data via the 
basic rados api then sure, one client data object will be stored in one 
Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike 
api) then each client data object will be broken up into chunks at the 
rados level (typically 4M sized chunks).



Regards

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Wolfgang Hennerbichler


On 08/19/2013 11:18 AM, Mark Kirkwood wrote:
> However if you use rados gateway (S3 or Swift look-alike
> api) then each client data object will be broken up into chunks at the
> rados level (typically 4M sized chunks).

=> which is a good thing in terms of replication and OSD usage
distribution.

> 
> Regards
> 
> Mark
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Destroyed Ceph Cluster

2013-08-19 Thread Georg Höllrigl

Hello List,

The troubles with fixing such a cluster continue... I get output like this now:

# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is 
degraded; mds vvx-ceph-m-03 is laggy



When checking for the ceph-mds processes, there are now none left... no 
matter which server I check. And they won't start up again!?


The log starts up with:
2013-08-19 11:23:30.503214 7f7e9dfbd780  0 ceph version 0.67 
(e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636

2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map standby
2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i am 
now mds.0.26
2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map state 
change up:standby --> up:replay

2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26  need osdmap epoch 
277, have 276
2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26  waiting for osdmap 
277 (which blacklists prior instance)
2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish 
got (2) No such file or directory
2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In 
function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 
7f7e9904b700 time 2013-08-19 11:23:30.534107

mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")


Does anyone have an idea how to get the cluster running again?





Georg




On 16.08.2013 16:23, Mark Nelson wrote:

Hi Georg,

I'm not an expert on the monitors, but that's probably where I would
start.  Take a look at your monitor logs and see if you can get a sense
for why one of your monitors is down.  Some of the other devs will
probably be around later that might know if there are any known issues
with recreating the OSDs and missing PGs.

Mark

On 08/16/2013 08:21 AM, Georg Höllrigl wrote:

Hello,

I'm still evaluating ceph - now a test cluster with the 0.67 dumpling.
I've created the setup with ceph-deploy from GIT.
I've recreated a bunch of OSDs, to give them another journal.
There already was some test data on these OSDs.
I've already recreated the missing PGs with "ceph pg force_create_pg"


HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

Any idea how to fix the cluster, besides completely rebuilding the
cluster from scratch? What if such a thing happens in a production
environment...

The pgs from "ceph pg dump" looks all like creating for some time now:

2.3d    0   0   0   0   0   0   0   creating   2013-08-16 13:43:08.186537   0'0   0:0   []   []   0'0   0.00   0'0   0.00

Is there a way to just dump the data, that was on the discarded OSDs?




Kind Regards,
Georg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Martin Rudat

On 2013-08-19 18:36, Schmitt, Christian wrote:
Currently the first question would be, is it possible to have a ceph 
single node setup, where everything is on one node?
Yes, definitely, I've currently got a single-node ceph 'cluster', but, 
to the best of my knowledge, it's not the recommended configuration for 
long-term usage; in the coming weeks (given this is a home server), I'll 
be attempting to bring up another two nodes.


Our Application, Ceph's object storage and a database? We focus on 
this deployment model for our very small customers, who only have like 
20 members that use our application, so the load wouldn't be very high.
And the next question would be, is it possible to extend the Ceph 
single node to 3 nodes later, if they need more availability?
I'm not sure how much ram the monitor and mds take, but each osd (disk) 
seems to nominally use 300M of ram. My 'server' is a micro-ATX board 
with 5 spinning disks and a SSD, plugged into a small UPS; total cost 
about 2000 AUD. It's running a mail-server, backuppc for the other VMs, 
PCs and laptops in the house, a file-server re-exporting the disk from 
ceph, and some other random stuff. The VMs chew up a little more than 8G 
of ram in total, and on the 16G machine, there doesn't seem to be any 
performance problems (with only two users, mind you).


Also we always want to use Shared Nothing Machines, so every service 
would be on one machine, is this Okai for Ceph, or does Ceph really 
need a lot of CPU/Memory/Disk Speed?
Currently we make an archiving software for small customers and we 
want to move things on the file system on a object storage. Currently 
we only have customers that needs 1 machine or 3 machines. But 
everything should work as fine on more.
Depending on your definition of 'machine', a cluster of 3 smaller 
machines may be substitutable for a single larger one; with the hope 
that hardware failure only takes out 1 node, leaving the whole cluster 
still online and able to be restored to full capacity at your (relative) 
leisure, rather than Right Now, as the backups aren't running anymore...


The two 'new' nodes I'm spinning up are my old desktop machine and its 
predecessor, which, arguably could be construed as being 'free'. =)


For firms of your target size, it may be an effective thing to suggest 
upgrading one or more desktops, and use the old machines to run the 
backup system on. Especially if you're charging for the service 
provided, more than for the hardware, you may be able to consolidate 
multiple existing servers into VMs running on a ceph cluster, with 
enough spare capacity to also run your backup suite, with minimal to no 
actual hardware outlay.


--
Martin Rudat


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Assert and monitor-crash when attemting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool

2013-08-19 Thread Joao Eduardo Luis

On 08/18/2013 07:11 PM, Oliver Daudey wrote:

Hey all,

Also created on the tracker, under http://tracker.ceph.com/issues/6047

While playing around on my test-cluster, I ran into a problem that I've
seen before, but have never been able to reproduce until now.  The use
of pool-snapshots and rbd-snapshots seems to be mutually exclusive in
the same pool, even if you have used one type of snapshot before and
have since deleted all snapshots of that type.  Unfortunately, the
condition doesn't appear to be handled gracefully yet, leading, in one
case, to monitors crashing.  I think this one goes back at least as far
as Bobtail and still exists in Dumpling.  My cluster is a
straightforward one with 3 Debian Squeeze-nodes, each running a mon, mds
and osd.  To reproduce:

# ceph osd pool create test 256 256
pool 'test' created
# ceph osd pool mksnap test snapshot
created pool test snap snapshot
# ceph osd pool rmsnap test snapshot
removed pool test snap snapshot

So far, so good.  Now we try to create an rbd-snapshot in the same pool:

# rbd --pool=test create --size=102400 image
# rbd --pool=test snap create image@snapshot
rbd: failed to create snapshot: (22) Invalid argument
2013-08-18 19:27:50.892291 7f983bc10780 -1 librbd: failed to create snap
id: (22) Invalid argument

That failed, but at least the cluster is OK.  Now we start over again
and create the rbd-snapshot first:

# ceph osd pool delete test test --yes-i-really-really-mean-it
pool 'test' deleted
# ceph osd pool create test 256 256
pool 'test' created
# rbd --pool=test create --size=102400 image
# rbd --pool=test snap create image@snapshot
# rbd --pool=test snap ls image
SNAPID NAME  SIZE
  2 snapshot 102400 MB
# rbd --pool=test snap rm image@snapshot
# ceph osd pool mksnap test snapshot
2013-08-18 19:35:59.494551 7f48d75a1700  0 monclient: hunting for new
mon
^CError EINTR:  (I pressed CTRL-C)


Thanks for the steps to reproduce Oliver!  Managed to reproduce this on 
0.67.1 on the first attempt.


This bug appears to be the same as #5959 on the tracker.  I spent some 
time last week looking into it, and although I realized it was far too 
easy to trigger it on cuttlefish, I never managed to trigger it on next 
-- which I attributed to d1501938f5d07c067d908501fc5cfe3c857d7281.


I'll be looking into this.

  -Joao





My leader monitor crashed at that last command, here's the apparent
critical point in the logs:

 -3> 2013-08-18 19:35:59.315956 7f9b870b1700  1 -- 194.109.43.18:6789/0 <== client.5856 194.109.43.18:0/1030570 8  mon_command({"snap": "snapshot", "prefix": "osd pool mksnap", "pool": "test"} v 0) v1  107+0+0 (9835600 0 0) 0x23e4200 con 0x2d202c0
 -2> 2013-08-18 19:35:59.316020 7f9b870b1700  0 mon.a@0(leader) e1
handle_command mon_command({"snap": "snapshot", "prefix": "osd pool
mksnap", "pool": "test"} v 0) v1
 -1> 2013-08-18 19:35:59.316033 7f9b870b1700  1
mon.a@0(leader).paxos(paxos active c 1190049..1190629) is_readable
now=2013-08-18 19:35:59.316034 lease_expire=2013-08-18 19:36:03.535809
has v0 lc 1190629
  0> 2013-08-18 19:35:59.317612 7f9b870b1700 -1 osd/osd_types.cc: In
function 'void pg_pool_t::add_snap(const char*, utime_t)' thread
7f9b870b1700 time 2013-08-18 19:35:59.316102
osd/osd_types.cc: 682: FAILED assert(!is_unmanaged_snaps_mode())

Apart from fixing this assert and maybe giving a clearer error message
when the creation of the rbd-snapshot fails, maybe there should be a way
to switch from one "snaps_mode" to the other without having to delete
the entire pool, if such a way doesn't already exist.  BTW:
How exactly does one use the pool-snapshots?  There doesn't seem to be a
documented way of listing or using them after creation.
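For what it's worth, the only interface to pool snapshots I'm aware of is the rados tool; a small sketch with the pool and snapshot names from above and a made-up object name (hedged, since this isn't well documented):

rados -p test lssnap                                       # list pool snapshots
rados -p test -s snapshot get someobject /tmp/someobject   # read an object as of snap 'snapshot'
rados -p test rmsnap snapshot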

More info available on request.



Regards,

  Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
> Date: Mon, 19 Aug 2013 10:50:25 +0200
> From: Wolfgang Hennerbichler 
> To: 
> Subject: Re: [ceph-users] Ceph Deployments
> Message-ID: <5211dc51.4070...@risc-software.at>
> Content-Type: text/plain; charset="ISO-8859-1"
>
> On 08/19/2013 10:36 AM, Schmitt, Christian wrote:
> > Hello, I just have some small questions about Ceph Deployment models and
> > if this would work for us.
> > Currently the first question would be, is it possible to have a ceph
> > single node setup, where everything is on one node?
>
> yes. depends on 'everything', but it's possible (though not recommended)
> to run mon, mds, and osd's on the same host, and even do virtualisation.

Currently we don't want to virtualise on this machine since the
machine is really small, as said we focus on small to midsize
businesses. Most of the time they even need a tower server due to the
lack of a correct rack. ;/

> > Our Application, Ceph's object storage and a database?
>
> what is 'a database'?

We run Postgresql or MariaDB (without/with Galera depending on the cluster size)

> > We focus on this
> > deployment model for our very small customers, who only have like 20
> > members that use our application, so the load wouldn't be very high.
> > And the next question would be, is it possible to extend the Ceph single
> > node to 3 nodes later, if they need more availability?
>
> yes.

Thats good!

> > Also we always want to use Shared Nothing Machines, so every service
> > would be on one machine, is this Okai for Ceph, or does Ceph really need
> > a lot of CPU/Memory/Disk Speed?
>
> ceph needs cpu / disk speed when disks fail and need to be recovered. it
> also uses some cpu when you have a lot of i/o, but generally it is
> rather lightweight.
> shared nothing is possible with ceph, but in the end this really depends
> on your application.

hm, when disk fails we already doing some backup on a dell powervault
rd1000, so i don't think thats a problem and also we would run ceph on
a Dell PERC Raid Controller with RAID1 enabled on the data disk.

> > Currently we make an archiving software for small customers and we want
> > to move things on the file system on a object storage.
>
> you mean from the filesystem to an object storage?

yes, currently everything is on the filesystem and this is really
horrible, thousands of pdfs just on the filesystem. we can't scale up
that easily with this setup.
Currently we run on Microsoft Servers, but we plan to rewrite our
whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
7, 9, ... X²-1 should be possible.

> > Currently we only
> > have customers that needs 1 machine or 3 machines. But everything should
> > work as fine on more.
>
> it would with ceph. probably :)

That's nice to hear. I was really scared that we don't find a solution
that can run on 1 system and scale up to even more. We first looked at
HDFS but this isn't lightweight. And the overhead of Metadata etc.
just isn't that cool.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Wolfgang Hennerbichler
On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
>> yes. depends on 'everything', but it's possible (though not recommended)
>> to run mon, mds, and osd's on the same host, and even do virtualisation.
> 
> Currently we don't want to virtualise on this machine since the
> machine is really small, as said we focus on small to midsize
> businesses. Most of the time they even need a tower server due to the
> lack of a correct rack. ;/

whoa :)

>>> Our Application, Ceph's object storage and a database?
>>
>> what is 'a database'?
> 
> We run Postgresql or MariaDB (without/with Galera depending on the cluster 
> size)

You wouldn't want to put the data of postgres or mariadb on cephfs. I
would run the native versions directly on the servers and use
mysql-multi-master circular replication. I don't know about similar
features of postgres.

>> shared nothing is possible with ceph, but in the end this really depends
>> on your application.
> 
> hm, when disk fails we already doing some backup on a dell powervault
> rd1000, so i don't think thats a problem and also we would run ceph on
> a Dell PERC Raid Controller with RAID1 enabled on the data disk.

this is open to discussion, and really depends on your use case.

>>> Currently we make an archiving software for small customers and we want
>>> to move things on the file system on a object storage.
>>
>> you mean from the filesystem to an object storage?
> 
> yes, currently everything is on the filesystem and this is really
> horrible, thousands of pdfs just on the filesystem. we can't scale up
> that easily with this setup.

Got it.

> Currently we run on Microsoft Servers, but we plan to rewrite our
> whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
> 7, 9, ... X²-1 should be possible.

cool.

>>> Currently we only
>>> have customers that needs 1 machine or 3 machines. But everything should
>>> work as fine on more.
>>
>> it would with ceph. probably :)
> 
> That's nice to hear. I was really scared that we don't find a solution
> that can run on 1 system and scale up to even more. We first looked at
> HDFS but this isn't lightweight. 

not only that, HDFS also has a single point of failure.

> And the overhead of Metadata etc.
> just isn't that cool.

:)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploy Ceph on RHEL6.4

2013-08-19 Thread Guang Yang
Hi ceph-users,
I would like to check if there is any manual / steps which can let me try to 
deploy ceph in RHEL?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
2013/8/19 Wolfgang Hennerbichler :
> On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
>>> yes. depends on 'everything', but it's possible (though not recommended)
>>> to run mon, mds, and osd's on the same host, and even do virtualisation.
>>
>> Currently we don't want to virtualise on this machine since the
>> machine is really small, as said we focus on small to midsize
>> businesses. Most of the time they even need a tower server due to the
>> lack of a correct rack. ;/
>
> whoa :)

Yep that's awful.

 Our Application, Ceph's object storage and a database?
>>>
>>> what is 'a database'?
>>
>> We run Postgresql or MariaDB (without/with Galera depending on the cluster 
>> size)
>
> You wouldn't want to put the data of postgres or mariadb on cephfs. I
> would run the native versions directly on the servers and use
> mysql-multi-master circular replication. I don't know about similar
> features of postgres.

No, I don't want to put a MariaDB cluster on CephFS. We want to put PDFs
in CephFS or Ceph's object storage and hold a key or path in the
database; other things like user management will also belong in the
database.

>>> shared nothing is possible with ceph, but in the end this really depends
>>> on your application.
>>
>> hm, when disk fails we already doing some backup on a dell powervault
>> rd1000, so i don't think thats a problem and also we would run ceph on
>> a Dell PERC Raid Controller with RAID1 enabled on the data disk.
>
> this is open to discussion, and really depends on your use case.

Yeah, we definitely know that it isn't good to use Ceph on a single
node, but I think it's easier to design the application so that it
always depends on Ceph. It wouldn't be easy to maintain a single-node
setup without Ceph and a multi-node setup with Ceph.

 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage.
>>>
>>> you mean from the filesystem to an object storage?
>>
>> yes, currently everything is on the filesystem and this is really
>> horrible, thousands of pdfs just on the filesystem. we can't scale up
>> that easily with this setup.
>
> Got it.
>
>> Currently we run on Microsoft Servers, but we plan to rewrite our
>> whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
>> 7, 9, ... X²-1 should be possible.
>
> cool.
>
 Currently we only
 have customers that needs 1 machine or 3 machines. But everything should
 work as fine on more.
>>>
>>> it would with ceph. probably :)
>>
>> That's nice to hear. I was really scared that we don't find a solution
>> that can run on 1 system and scale up to even more. We first looked at
>> HDFS but this isn't lightweight.
>
> not only that, HDFS also has a single point of failure.
>
>> And the overhead of Metadata etc.
>> just isn't that cool.
>
> :)

Yeah, that's why I came to Ceph. I think that's probably the way we want to go.
Thank you very much for your help. It's good to know that there is a
solution for the things that are badly designed in our current
setup.

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploy Ceph on RHEL6.4

2013-08-19 Thread xan.peng
On Mon, Aug 19, 2013 at 6:09 PM, Guang Yang  wrote:
> Hi ceph-users,
> I would like to check if there is any manual / steps which can let me try to
> deploy ceph in RHEL?

Setup with ceph-deploy: http://dachary.org/?p=1971
Official documentation will also be helpful:
http://ceph.com/docs/master/start/quick-ceph-deploy/
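For the impatient, a rough outline of what such a ceph-deploy run looks like (hostnames and disks below are placeholders; the links above have the authoritative steps, including setting up the RPM repos):

ceph-deploy new node1
ceph-deploy install node1 node2 node3
ceph-deploy mon create node1
ceph-deploy gatherkeys node1
ceph-deploy osd create node1:sdb node2:sdb node3:sdb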
-- 
-Thanks.
- xan.peng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
Hi,

I have an OSD which crash every time I try to start it (see logs below).
Is it a known problem ? And is there a way to fix it ?

root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
(8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
2013-08-19 11:07:48.516363 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:48.516380 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:48.516514 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:48.517087 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:48.517389 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount found snaps <>
2013-08-19 11:07:49.199483 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
2013-08-19 11:07:52.199908 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:52.199916 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:52.200058 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:52.200886 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:52.200919 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount found snaps <>
2013-08-19 11:07:52.215850 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for clients
2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
11:08:13.579519
osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, std::allocator > 
>*)+0x3c8) [0x6f8f48]
 3: (OSD::process_peering_events(std::list > const&, 
ThreadPool::TPHandle&)+0x31f) [0x6f975f]
 4: (OSD::PeeringWQ::_process(std::list > const&, 
ThreadPool::TPHandle&)+0x14) [0x7391d4]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
 7: (()+0x6b50) [0x7f6fe3070b50]
 8: (clone()+0x6d) [0x7f6fe15cba7d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

full logs here : http://pastebin.com/RphNyLU0


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
I have a 3-node, 15-OSD Ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network.
* Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz, for each node.
* 64G RAM for each node.

I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
Both the data and metadata pools are set with replica size 3.
Then I mounted the cephfs on one of the three nodes, and tested the performance with fio.

The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec

But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec

I am mostly surprised by the sequential write performance compared to the raw SATA disk performance (it can get 4127 IOPS when mounted with ext4). My cephfs only gets 1/10 of the performance of the raw disk.
How can I tune my cluster to improve the sequential write/random read/random write performance?
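(A quick way to check whether the bottleneck is below CephFS would be to benchmark the RADOS layer directly and compare; the pool name here is a placeholder:

rados bench -p cephfs_data 60 write -t 16

If raw RADOS writes are similarly slow, the problem is not in CephFS itself.)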


  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] dumpling "ceph" cli tool breaks openstack cinder

2013-08-19 Thread Øystein Lønning Nerhus
Hi,

I just noticed that in dumpling the "ceph" cli tool no longer utilises the 
"CEPH_ARGS" environment variable.  This is used by openstack cinder to specifiy 
the cephx user.   Ref: 
http://ceph.com/docs/next/rbd/rbd-openstack/#configure-openstack-to-use-ceph

I modifiied this line in /usr/share/pyshared/cinder/volume/driver.py

< stdout, _ = self._execute('ceph', 'fsid')
> stdout, _ = self._execute('ceph', '--id', 'volumes', 'fsid')

For my particular setup this seems to be sufficient as a quick workaround.  Is 
there a proper way to do this with the new tool?

Note: This only hit me when I tried to create a volume from an image (I'm using 
copy-on-write cloning).  Creating a fresh volume didn't invoke the "ceph fsid" 
command in the OpenStack script, so I guess some OpenStack users will not be 
affected.

Thanks,

Øystein
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
Sorry, I forgot to mention the OS and kernel version.
It's CentOS 6.4 with kernel 3.10.6, fio 2.0.13.

From: dachun...@outlook.com
To: ceph-users@lists.ceph.com
Date: Mon, 19 Aug 2013 11:28:24 +
Subject: [ceph-users] Poor write/random read/random write performance




I have a 3-node, 15-OSD Ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network.
* Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz, for each node.
* 64G RAM for each node.

I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
Both the data and metadata pools are set with replica size 3.
Then I mounted the cephfs on one of the three nodes, and tested the performance with fio.

The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec

But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec

I am mostly surprised by the sequential write performance compared to the raw SATA disk performance (it can get 4127 IOPS when mounted with ext4). My cephfs only gets 1/10 of the performance of the raw disk.
How can I tune my cluster to improve the sequential write/random read/random 
write performance?


  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy mon create stuck

2013-08-19 Thread Alfredo Deza
On Mon, Aug 19, 2013 at 4:26 AM, Nico Massenberg
 wrote:
> Hi Alfredo,
>
> thanks for your response. I updated ceph-deploy to v1.2.1 and got the v0.5.2
> of pushy from:
> https://launchpad.net/pushy/+download

Ah, there *should* be no need for that, as the packages we publish
also list pushy as a dependency.
>
> I then ran the pushy setup.py script with --build and --install, came back
> with no errors.
>
> Still, when trying to create a second mon, ceph-deploy gives me the
> following:
>
> ceph@vl0181:~/konkluster$ ceph-deploy mon create ceph02
> [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts ceph02
> [ceph_deploy.mon][DEBUG ] detecting platform for host ceph02 ...
> [ceph_deploy.mon][INFO  ] distro info: Debian 7.1 wheezy
> [ceph02][DEBUG ] deploying mon to ceph02
> [ceph02][DEBUG ] remote hostname: ceph02
> [ceph02][INFO  ] write cluster configuration to /etc/ceph/{cluster}.conf
> [ceph02][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-ceph02/done
> [ceph02][INFO  ] create a done file to avoid re-doing the mon deployment
> [ceph02][INFO  ] create the init path if it does not exist
> [ceph02][INFO  ] locating `service` executable...
> [ceph02][INFO  ] found `service` executable: /usr/sbin/service
> [ceph02][INFO  ] Running command: /usr/sbin/service ceph start mon.ceph02
> [ceph02][ERROR ] Traceback (most recent call last):
> [ceph02][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/hosts/debian/mon/create.py",
> line 35, in create
> [ceph02][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 10,
> in inner
> [ceph02][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/util/wrappers.py", line 6, in
> remote_call
> [ceph02][ERROR ]   File "/usr/lib/python2.7/subprocess.py", line 511, in
> check_call
> [ceph02][ERROR ] raise CalledProcessError(retcode, cmd)
> [ceph02][ERROR ] CalledProcessError: Command '['/usr/sbin/service', 'ceph',
> 'start', 'mon.ceph02']' returned non-zero exit status 1
> [ceph_deploy.mon][ERROR ] Failed to execute command: /usr/sbin/service ceph
> start mon.ceph02
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors
>
>

It is unfortunate that I can't tell why this is failing from this log
output. However, it does tell you
the last command attempted. Have you tried running that failing command (`sudo
service ceph start mon.ceph02`)
on the ceph02 host to see what is going on?
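If the init script fails without printing anything useful, running the monitor in the foreground usually shows the reason (id as in your output):

sudo service ceph start mon.ceph02
sudo /usr/bin/ceph-mon -i ceph02 -d -c /etc/ceph/ceph.conf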


> Any more ideas? Thanks.
>
>
> Am 15.08.2013 um 14:40 schrieb Alfredo Deza :
>
>
>
>
> On Thu, Aug 15, 2013 at 7:45 AM, Nico Massenberg
>  wrote:
>>
>> Hello there,
>>
>> I am deploying a development system with 3 hosts. I want to deploy a
>> monitor on each of those hosts and several osds, 1 per disk.
>> In addition I have an admin machine to use ceph-deploy from. So far I have
>> 1 mon on ceph01 and a total of 6 osds on ceph01 and ceph02 in a healthy
>> cluster:
>>
>> ceph@vl0181:~/konkluster$ ceph -s -k ceph.client.admin.keyring
>>health HEALTH_OK
>>monmap e1: 1 mons at {ceph01=192.168.111.10:6789/0}, election epoch 1,
>> quorum 0 ceph01
>>osdmap e78: 6 osds: 6 up, 6 in
>> pgmap v248: 192 pgs: 192 active+clean; 0 bytes data, 211 MB used, 3854
>> GB / 3854 GB avail
>>mdsmap e1: 0/0/1 up
>>
>> When trying to add mon2 and mon3 to ceph02 and ceph03 I am confronted with
>> the following error:
>>
>> ceph@vl0181:~/konkluster$ ceph-deploy mon create ceph02
>> ceph-mon: set fsid to 3dad736b-a9fc-42bf-a2fb-399cb8cbb880
>> ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph03 for mon.ceph02
>> === mon.ceph02 ===
>> Starting Ceph mon.ceph02 on ceph02...
>> failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i ceph02 --pid-file
>> /var/run/ceph/mon.ceph03.pid -c /etc/ceph/ceph.conf '
>> Starting ceph-create-keys on ceph02...
>> Traceback (most recent call last):
>>   File "/usr/bin/ceph-deploy", line 21, in 
>>
>>
>> ps aux | grep ceph on the target afterwards shows a quiet unusual output:
>>
>> root@ceph02:~# ps aux |grep ceph
>> root  2501  0.1  0.0  26652  6952 ?S11:47   0:08
>> /usr/bin/python /usr/sbin/ceph-create-keys -i ceph02
>> root  2677  0.0  0.1 413448 17324 ?Ssl  11:47   0:04
>> /usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c
>> /etc/ceph/ceph.conf
>> root  2684  0.0  0.0   4096   612 ?Ss   11:47   0:00 startpar
>> -f -- ceph
>> root  4069  0.0  0.0  71172  3564 ?Ss   11:53   0:00 sshd:
>> ceph [priv]
>> ceph  4071  0.0  0.0  71536  1804 ?S11:53   0:00 sshd:
>> ceph@notty
>> ceph  4072  0.0  0.0   4176   580 ?Ss   11:53   0:00 sh -c
>> "sudo" "python" "-u" "-c" "exec reduce(lambda a,b: a+b, map(chr,
>> (105,109,112,111,114,116,32,95,95,98,117,105,108,116,105,110,95,95,44,32,111,115,44,32,109,97,114,115,104,97,108,44,32,115,121,115,10,116,114,121,58,10,32,32,32,32,105,109,112,111,114,116,32,104,97,115,104,108,105,98,10,101,120,99,101,112,116,32,73,109,112,111,114,116,69,114,114,111,114,58,10,32,32,32,32,105,109,112,111,1

Re: [ceph-users] ceph-deploy and journal on separate disk

2013-08-19 Thread Alfredo Deza
On Fri, Aug 16, 2013 at 8:32 AM, Pavel Timoschenkov
 wrote:
> << causing this to << filesystem and prevent this.
>
> Hi. Any changes (
>
> Can you create a build that passes the -t flag with mount?
>

I tried going through these steps again and could not get any other
ideas except to pass in that flag
for mounting. Would you be willing to try a patch?
(http://fpaste.org/33099/37691580/)

You would need to apply it to the `ceph-disk` executable.


>
>
>
>
>
>
> From: Pavel Timoschenkov
> Sent: Thursday, August 15, 2013 3:43 PM
> To: 'Alfredo Deza'
> Cc: Samuel Just; ceph-us...@ceph.com
> Subject: RE: [ceph-users] ceph-deploy and journal on separate disk
>
>
>
> The separate commands (e.g. `ceph-disk -v prepare /dev/sda1`) work because
> then the journal is on the same device as the OSD data, so the execution path
> to get them to a working state is different.
>
> I suspect that there are left over partitions in /dev/sdaa that are causing
> this to fail, I *think* that we could pass the `-t` flag with the filesystem
> and prevent this.
>
> Just to be sure, could you list all the partitions on /dev/sdaa (if
> /dev/sdaa is the whole device)?
>
> Something like:
>
> sudo parted /dev/sdaa print
>
> Or if you prefer any other way that could tell use what are all the
> partitions in that device.
>
>
>
>
>
> After
>
> ceph-deploy disk zap ceph001:sdaa ceph001:sda1
>
>
>
> root@ceph001:~# parted /dev/sdaa print
>
> Model: ATA ST3000DM001-1CH1 (scsi)
>
> Disk /dev/sdaa: 3001GB
>
> Sector size (logical/physical): 512B/4096B
>
> Partition Table: gpt
>
>
>
> Number  Start  End  Size  File system  Name  Flags
>
>
>
> root@ceph001:~# parted /dev/sda1 print
>
> Model: Unknown (unknown)
>
> Disk /dev/sda1: 10.7GB
>
> Sector size (logical/physical): 512B/512B
>
> Partition Table: gpt
>
> So that is after running `disk zap`. What does it say after using
> ceph-deploy and failing?
>
>
>
> Number  Start  End  Size  File system  Name  Flags
>
>
>
> After ceph-disk -v prepare /dev/sdaa /dev/sda1:
>
>
>
> root@ceph001:~# parted /dev/sdaa print
>
> Model: ATA ST3000DM001-1CH1 (scsi)
>
> Disk /dev/sdaa: 3001GB
>
> Sector size (logical/physical): 512B/4096B
>
> Partition Table: gpt
>
>
>
> Number  Start   End SizeFile system  Name   Flags
>
> 1  1049kB  3001GB  3001GB  xfs  ceph data
>
>
>
> And
>
>
>
> root@ceph001:~# parted /dev/sda1 print
>
> Model: Unknown (unknown)
>
> Disk /dev/sda1: 10.7GB
>
> Sector size (logical/physical): 512B/512B
>
> Partition Table: gpt
>
>
>
> Number  Start  End  Size  File system  Name  Flags
>
>
>
> With the same errors:
>
>
>
> root@ceph001:~# ceph-disk -v prepare /dev/sdaa /dev/sda1
>
> DEBUG:ceph-disk:Journal /dev/sda1 is a partition
>
> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same
> device as the osd data
>
> DEBUG:ceph-disk:Creating osd partition on /dev/sdaa
>
> Information: Moved requested sector from 34 to 2048 in
>
> order to align on 2048-sector boundaries.
>
> The operation has completed successfully.
>
> DEBUG:ceph-disk:Creating xfs fs on /dev/sdaa1
>
> meta-data=/dev/sdaa1 isize=2048   agcount=32, agsize=22892700
> blks
>
>  =   sectsz=512   attr=2, projid32bit=0
>
> data =   bsize=4096   blocks=732566385, imaxpct=5
>
>  =   sunit=0  swidth=0 blks
>
> naming   =version 2  bsize=4096   ascii-ci=0
>
> log  =internal log   bsize=4096   blocks=357698, version=2
>
>  =   sectsz=512   sunit=0 blks, lazy-count=1
>
> realtime =none   extsz=4096   blocks=0, rtextents=0
>
> DEBUG:ceph-disk:Mounting /dev/sdaa1 on /var/lib/ceph/tmp/mnt.UkJbwx with
> options noatime
>
> mount: /dev/sdaa1: more filesystems detected. This should not happen,
>
>use -t  to explicitly specify the filesystem type or
>
>use wipefs(8) to clean up the device.
>
>
>
> mount: you must specify the filesystem type
>
> ceph-disk: Mounting filesystem failed: Command '['mount', '-o', 'noatime',
> '--', '/dev/sdaa1', '/var/lib/ceph/tmp/mnt.UkJbwx']' returned non-zero exit
> status 32
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Mark Nelson

On 08/19/2013 06:28 AM, Da Chun Ng wrote:

I have a 3-node, 15-OSD Ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
* 64G Ram for each node.

I deployed the cluster with ceph-deploy, and created a new data pool 
for cephfs.

Both the data and metadata pools are set with replica size 3.
Then mounted the cephfs on one of the three nodes, and tested the 
performance with fio.


The sequential read  performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K 
-size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60

read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec


Sounds like readahead and/or caching is helping out a lot here. Btw, you 
might want to make sure this is actually coming from the disks with 
iostat or collectl or something.




But the sequential write/random read/random write performance is very 
poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K 
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec


One thing to keep in mind is that unless you have SSDs in this system, 
you will be doing 2 writes for every client write to the spinning disks 
(since data and journals will both be on the same disk).


So let's do the math:

6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
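
(A quick shell sanity check of that arithmetic, purely illustrative:

echo $(( 6618 * 3 * 2 * 1024 / 16384 / 15 ))    # prints 165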


If there is no write coalescing going on, this isn't terrible.  If there 
is, this is terrible.  Have you tried buffered writes with the sync 
engine at the same IO size?


fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K 
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec


In this case:

11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.

Definitely not great!  You might want to try fiddling with read ahead 
both on the CephFS client and on the block devices under the OSDs 
themselves.


One thing I did notice back during bobtail is that increasing the number 
of osd op threads seemed to help small object read performance.  It 
might be worth looking at too.


http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
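
If you want to experiment with that, a minimal sketch (the value 8 is just 
an example, not a recommendation):

# ceph.conf, under [osd]:
#   osd op threads = 8
# or inject into a running OSD without restarting it:
ceph tell osd.0 injectargs '--osd-op-threads 8'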

Other than that, if you really want to dig into this, you can use tools 
like iostat, collectl, blktrace, and seekwatcher to try and get a feel 
for what the IO going to the OSDs looks like.  That can help when 
diagnosing this sort of thing.
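
For reference, a minimal sketch of capturing a trace on one OSD data disk 
while a fio run is going (/dev/sdb and the file names are placeholders):

blktrace -d /dev/sdb -o osd-trace -w 60      # capture 60 seconds of block IO
blkparse -i osd-trace | less                 # inspect the individual requests
seekwatcher -t osd-trace -o osd-trace.png    # plot the seek pattern over time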


fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio 
-bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec


6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 
(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive




I am mostly surprised by the seq write performance compared to the 
raw sata disk performance (it can get 4127 IOPS when mounted with 
ext4). My cephfs only gets 1/10 of the performance of the raw disk.


7200 RPM spinning disks typically top out at something like 150 IOPS 
(and some are lower).  With 15 disks, to hit 4127 IOPS you were probably 
seeing some write coalescing effects (or if these were random reads, 
some benefit to read ahead).




How can I tune my cluster to improve the sequential write/random 
read/random write performance?
I don't know what kind of controller you have, but in cases where 
journals are on the same disks as the data, using writeback cache helps 
a lot because the controller can coalesce the direct IO journal writes 
in cache and just do big periodic dumps to the drives.  That really 
reduces seek overhead for the writes.  Using SSDs for the journals 
accomplishes much of the same effect, and lets you get faster large IO 
writes too, but in many chassis there is a density (and cost) trade-off.
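
For what it's worth, with ceph-deploy that journal placement is just the 
optional third field, e.g. (host, disk, and SSD partition names made up):

ceph-deploy osd create node1:sdb:/dev/sdg1 node1:sdc:/dev/sdg2

with one journal partition on the SSD per OSD.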


Hope this helps!

Mark







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Request preinstalled Virtual Machines Images for cloning.

2013-08-19 Thread Johannes Klarenbeek
Dear Ceph Developers and Users,

I was wondering if there is any download location for preinstalled virtual 
machine images with the latest release of ceph. Preferably 4 different images 
with Ceph-OSD, Ceph-Mon, Ceph-MDS and last but not least a Ceph-Client with 
iscsi target server installed. But since the latter is the client, I guess any 
distro would do.

If this doesn't exist, maybe it's a great idea for distribution from the 
ceph.com website. I could just start up an image like ceph-osd on any hypervisor 
to add its local storage via disk passthrough to my ceph "private cloud", and 
just distribute some monitors and metadata servers over the rest of the 
hypervisors. Packages like this can be kept small (like using SliTaz, for example 
- since this one performs best on Hyper-V hypervisors).

Any Ideas?

Regards,
Johannes


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
Thanks very much! Mark.
Yes, I put the data and journal on the same disk, no SSD in my environment.
My controllers are general SATA II.
Some more questions below in blue.

Date: Mon, 19 Aug 2013 07:48:23 -0500
From: mark.nel...@inktank.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Poor write/random read/random write performance


  

  
  
On 08/19/2013 06:28 AM, Da Chun Ng
  wrote:



  
  I have a 3 nodes, 15 osds ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each
  node.
* 64G Ram for each node.

  

  
  I deployed the cluster with ceph-deploy, and created a
new data pool for cephfs.
  Both the data and metadata pools are set with replica
size 3.
  Then mounted the cephfs on
  one of the three nodes, and tested the performance with
  fio.
  

  
  The sequential read  performance looks good:
  fio -direct=1 -iodepth 1 -thread -rw=read
-ioengine=libaio -bs=16K -size=1G -numjobs=16
-group_reporting -name=mytest -runtime 60
  read : io=10630MB, bw=181389KB/s, iops=11336 , runt=
60012msec

  



Sounds like readahead and or caching is helping out a lot here. 
Btw, you might want to make sure this is actually coming from the
disks with iostat or collectl or something.
I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes before 
every test. I used collectl to watch every disk IO, the numbers should match. I 
think readahead is helping here.




  

  

  
  But the sequential write/random read/random write
performance is very poor:
  fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K 
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  write: io=397280KB, bw=6618.2KB/s, iops=413 , runt=
60029msec

  



One thing to keep in mind is that unless you have SSDs in this
system, you will be doing 2 writes for every client write to the
spinning disks (since data and journals will both be on the same
disk).



So let's do the math:



6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS
/ drive



If there is no write coalescing going on, this isn't terrible.  If
there is, this is terrible. 
How can I know if there is write coalescing going on?
Have you tried buffered writes with the
sync engine at the same IO size?
Do you mean as below?
fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K 
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60




  

  fio -direct=1 -iodepth 1 -thread -rw=randread
-ioengine=libaio -bs=16K -size=256M -numjobs=16
-group_reporting -name=mytest -runtime 60
  read : io=665664KB, bw=11087KB/s, iops=692 , runt=
60041msec

  



In this case:



11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.  



Definitely not great!  You might want to try fiddling with read
ahead both on the CephFS client and on the block devices under the
OSDs themselves.  
Could you please tell me how to enable read ahead on the CephFS client? 
For the block devices under the OSDs, the read ahead value is:
[root@ceph0 ~]# blockdev --getra /dev/sdi
256
How big is appropriate for it?


One thing I did notice back during bobtail is that increasing the
number of osd op threads seemed to help small object read
performance.  It might be worth looking at too.



http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread



Other than that, if you really want to dig into this, you can use
tools like iostat, collectl, blktrace, and seekwatcher to try and
get a feel for what the IO going to the OSDs looks like.  That can
help when diagnosing this sort of thing.




  

  fio -direct=1 -iodepth 1 -thread -rw=randwrite
-ioengine=libaio -bs=16K -size=256M -numjobs=16
-group_reporting -name=mytest -runtime 60
  write: io=361056KB, bw=6001.1KB/s, iops=375 , runt=
60157msec

  



6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024
(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS
/ drive




  

  

  
  I am mostly surprised by the seq write performance
comparing to the raw sata disk performance(It can get 4127
IOPS when mounted with ext4). My cephfs only gets 1/10
performance of the raw

Re: [ceph-users] large memory leak on scrubbing

2013-08-19 Thread Sage Weil
On Mon, 19 Aug 2013, Mostowiec Dominik wrote:
> Thanks for your response.
> Great.
> 
> In latest cuttlefish it is also fixed I think?
> 
> We have two problems with scrubbing:
> - memory leaks
> - slow requests and wrongly mark osd with bucket index down (when scrubbing)

The slow requests can trigger if you have very large objects (including 
a very large rgw bucket index object).  But the message you quote below is 
for a scrub-reserve operation, which should really be excluded from the op 
warnings entirely.  Is that the only slow request message you see?

> Now we have decided to turn off scrubbing and trigger it in a maintenance window.
> I noticed that "ceph osd scrub" or "ceph osd deep-scrub" triggers a scrub on the 
> osd, but not for all of its PGs.
> Is it possible to trigger scrubbing of all PGs on one osd?

It should trigger a scrub on all PGs that are clean.  If a PG is 
recovering it will be skipped.
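
In case it helps, the per-OSD and per-PG forms look like this (osd.4 and the 
pgid are just examples):

ceph osd deep-scrub 4          # ask osd.4 to deep-scrub its clean PGs
ceph pg deep-scrub 16.113d     # or target one PG directly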

sage


> 
> --
> Regards 
> Dominik
> 
> 
> -Original Message-
> From: Sage Weil [mailto:s...@inktank.com] 
> Sent: Saturday, August 17, 2013 5:11 PM
> To: Mostowiec Dominik
> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Studziński 
> Krzysztof; Sydor Bohdan
> Subject: Re: [ceph-users] large memory leak on scrubbing
> 
> Hi Dominic,
> 
> There is a bug fixed a couple of months back that fixes excessive memory 
> consumption during scrub.  You can upgrade to the latest 'bobtail' branch.  
> See
> 
>  http://ceph.com/docs/master/install/debian/#development-testing-packages
> 
> Installing that package should clear this up.
> 
> sage
> 
> 
> On Fri, 16 Aug 2013, Mostowiec Dominik wrote:
> 
> > Hi,
> > We noticed some issues on our CEPH/S3 cluster, I think related to 
> > scrubbing: large memory leaks.
> > 
> > Logs 09.xx: 
> > https://www.dropbox.com/s/4z1fzg239j43igs/ceph-osd.4.log_09xx.tar.gz
> > From 09.30 to 09.44 (14 minutes) the osd.4 process grows to 28G. 
> > 
> > I think this is something curious:
> > 2013-08-16 09:43:48.801331 7f6570d2e700  0 log [WRN] : slow request 
> > 32.794125 seconds old, received at 2013-08-16 09:43:16.007104: 
> > osd_sub_op(unknown.0.0:0 16.113d 0//0//-1 [scrub-reserve] v 0'0 
> > snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
> > 
> > We have a large rgw index and a lot of large files on this cluster.
> > ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
> > Setup: 
> > - 12 servers x 12 OSD
> > - 3 mons
> > Default scrubbing settings.
> > Journal and filestore settings:
> > journal aio = true
> > filestore flush min = 0
> > filestore flusher = false
> > filestore fiemap = false
> > filestore op threads = 4
> > filestore queue max ops = 4096
> > filestore queue max bytes = 10485760
> > filestore queue committing max bytes = 10485760
> > journal max write bytes = 10485760
> > journal queue max bytes = 10485760
> > ms dispatch throttle bytes = 10485760
> > objecter infilght op bytes = 10485760
> > 
> > Is this a known bug in this version?
> > (Do you know some workaround to fix this?)
> > 
> > ---
> > Regards
> > Dominik
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dumpling "ceph" cli tool breaks openstack cinder

2013-08-19 Thread Sage Weil
On Mon, 19 Aug 2013, Sébastien Han wrote:
> Hi,
> 
> The new version of the driver (for Havana) doesn't need the CEPH_ARGS 
> argument, the driver now uses the librbd and librados (not the CLI anymore).
> 
> I guess a better patch will result in:
> 
> stdout, _ = self._execute('ceph', '--id', 'self.configuration.rbd_user', 
> 'fsid')
> I'll report the bug. Thanks!
> 
> However I don't know how to fix this with the new CLI.

I opened http://tracker.ceph.com/issues/6052.  This is a simple matter of 
adding a call to rados_conf_parse_env(...).

Thanks!
sage


> 
> Cheers.
> 
> 
> Sébastien Han
> Cloud Engineer
> 
> "Always give 100%. Unless you're giving blood."
> 
> 
> 
> Phone: +33 (0)1 49 70 99 72 - Mobile: +33 (0)6 52 84 44 70
> Mail: sebastien@enovance.com - Skype : han.sbastien
> Address : 10, rue de la Victoire - 75009 Paris
> Web : www.enovance.com - Twitter : @enovance
> 
> On August 19, 2013 at 1:28:57 PM, Øystein Lønning Nerhus (ner...@vx.no) wrote:
> 
> Hi,
> 
> I just noticed that in dumpling the "ceph" cli tool no longer utilises the 
> "CEPH_ARGS" environment variable.  This is used by openstack cinder to 
> specify the cephx user.   Ref: 
> http://ceph.com/docs/next/rbd/rbd-openstack/#configure-openstack-to-use-ceph
> 
> I modifiied this line in /usr/share/pyshared/cinder/volume/driver.py
> 
> <         stdout, _ = self._execute('ceph', 'fsid')
> >         stdout, _ = self._execute('ceph', '--id', 'volumes', 'fsid')
> 
> For my particular setup this seems to be sufficient as a quick workaround.  
> Is there a proper way to do this with the new tool?
> 
> Note: This only hit when i tried to create a volume from an image (i'm using 
> copy on write cloning).  creating a fresh volume didnt invoke the "ceph fsid" 
> command in the openstack script, so i guess some openstack users will not be 
> affected.
> 
> Thanks,
> 
> Øystein
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD and balanced reads

2013-08-19 Thread Sage Weil
On Mon, 19 Aug 2013, Sébastien Han wrote:
> Hi guys,
> 
> While reading a developer doc, I came across the following options:
> 
> * osd balance reads = true
> * osd shed reads = true
> * osd shed reads min latency
> * osd shed reads min latency diff
> 
> The problem is that I can't find any of these options in config_opts.h.

These are left over from an old unimplemented experiment and were removed 
a while back.

> Loic Dachary also gave me a flag that he found from the code.
> 
> m->get_flags() & CEPH_OSD_FLAG_LOCALIZE_READS)
> 
> So my questions are:
> 
> * Which from the above flags are correct?
> * Do balanced reads really exist in RBD?

For localized reads you want

OPTION(rbd_balance_snap_reads, OPT_BOOL, false)
OPTION(rbd_localize_snap_reads, OPT_BOOL, false)

Note that the 'localize' logic is still very primitive (it matches by IP 
address).  There is a blueprint to improve this:


http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librados%2F%2Fobjecter%3A_smarter_localized_reads
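
If you want to try them, a minimal client-side ceph.conf sketch (note that, 
as the names say, these only affect reads from snapshots):

[client]
    rbd balance snap reads = true
    # or, to prefer a replica on the same IP as the client:
    rbd localize snap reads = true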

Cheers!
sage

> 
> Thanks in advance.
> 
> 
> Sébastien Han
> Cloud Engineer
> 
> "Always give 100%. Unless you're giving blood."
> 
> 
> 
> Phone : +33 (0)1 49 70 99 72 - Mobile : +33 (0)6 52 84 44 70
> Mail: sebastien@enovance.com - Skype : han.sbastien
> Address : 10, rue de la Victoire - 75009 Paris
> Web : www.enovance.com - Twitter : @enovance
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Gregory Farnum
On Sunday, August 18, 2013, Guang Yang wrote:

> Hi ceph-users,
> This is Guang and I am pretty new to ceph, glad to meet you guys in the
> community!
>
> After walking through some documents of Ceph, I have a couple of questions:
>   1. Is there any comparison between Ceph and AWS S3, in terms of the
> ability to handle different work-loads (from KB to GB), with corresponding
> performance report?
>

Not really; any comparison would be highly biased depending on your Amazon
ping and your Ceph cluster. We've got some internal benchmarks where Ceph
looks good, but they're not anything we'd feel comfortable publishing.


>   2. Looking at some industry solutions for distributed storage, GFS /
> Haystack / HDFS all use meta-server to store the logical-to-physical
> mapping within memory and avoid disk I/O lookup for file reading, is the
> concern valid for Ceph (in terms of latency to read file)?
>

These are very different systems. Thanks to CRUSH, RADOS doesn't need to do
any IO to find object locations; CephFS only does IO if the inode you
request has fallen out of the MDS cache (not terribly likely in general).
This shouldn't be an issue...


>   3. Some industry research shows that one issue of file system is the
> metadata-to-data ratio, in terms of both access and storage, and some
> technic uses the mechanism to combine small files to large physical files
> to reduce the ratio (Haystack for example), if we want to use ceph to store
> photos, should this be a concern as Ceph use one physical file per object?
>

...although this might be. The issue basically comes down to how many disk
seeks are required to retrieve an item, and one way to reduce that number
is to hack the filesystem by keeping a small number of very large files and
calculating (or caching) where different objects are inside that file.
Since Ceph is designed for MB-sized objects it doesn't go to these lengths
to optimize that path like Haystack might (I'm not familiar with Haystack
in particular).
That said, you need some pretty extreme latency requirements before this
becomes an issue and if you're also looking at HDFS or S3 I can't imagine
you're in that ballpark. You should be fine. :)
-Greg


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Destroyed Ceph Cluster

2013-08-19 Thread Gregory Farnum
Have you ever used the FS? It's missing an object which we're
intermittently seeing failures to create (on initial setup) when the
cluster is unstable.
If so, clear out the metadata pool and check the docs for "newfs".
-Greg
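
Roughly, that would look like the sketch below -- pool names/IDs are 
examples, and this throws away all CephFS metadata, so only do it if the 
filesystem contents are disposable:

ceph osd lspools                      # note the metadata and data pool ids
ceph osd pool delete metadata metadata --yes-i-really-really-mean-it
ceph osd pool create metadata 128
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it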

On Monday, August 19, 2013, Georg Höllrigl wrote:

> Hello List,
>
> The troubles to fix such a cluster continue... I get output like this now:
>
> # ceph health
> HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is
> degraded; mds vvx-ceph-m-03 is laggy
>
>
> When checking for the ceph-mds processes, there are now none left... no
> matter which server I check. And the won't start up again!?
>
> The log starts up with:
> 2013-08-19 11:23:30.503214 7f7e9dfbd780  0 ceph version 0.67 
> (e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636
> 2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map standby
> 2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i am
> now mds.0.26
> 2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map state
> change up:standby --> up:replay
> 2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
> 2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
> 2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26  need osdmap epoch
> 277, have 276
> 2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26  waiting for osdmap
> 277 (which blacklists prior instance)
> 2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish
> got (2) No such file or directory
> 2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In function
> 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f7e9904b700
> time 2013-08-19 11:23:30.534107
> mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")
>
>
> Anyone an idea how to get the cluster back running?
>
>
>
>
>
> Georg
>
>
>
>
> On 16.08.2013 16:23, Mark Nelson wrote:
>
>> Hi Georg,
>>
>> I'm not an expert on the monitors, but that's probably where I would
>> start.  Take a look at your monitor logs and see if you can get a sense
>> for why one of your monitors is down.  Some of the other devs will
>> probably be around later that might know if there are any known issues
>> with recreating the OSDs and missing PGs.
>>
>> Mark
>>
>> On 08/16/2013 08:21 AM, Georg Höllrigl wrote:
>>
>>> Hello,
>>>
>>> I'm still evaluating ceph - now a test cluster with the 0.67 dumpling.
>>> I've created the setup with ceph-deploy from GIT.
>>> I've recreated a bunch of OSDs, to give them another journal.
>>> There already was some test data on these OSDs.
>>> I've already recreated the missing PGs with "ceph pg force_create_pg"
>>>
>>>
>>> HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
>>> are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
>>> 0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03
>>>
>>> Any idea how to fix the cluster, besides completley rebuilding the
>>> cluster from scratch? What if such a thing happens in a production
>>> environment...
>>>
>>> The pgs from "ceph pg dump" looks all like creating for some time now:
>>>
>>> 2.3d0   0   0   0   0   0   0 creating
>>>   2013-08-16 13:43:08.186537   0'0 0:0 []  [] 0'0
>>> 0.000'0 0.00
>>>
>>> Is there a way to just dump the data, that was on the discarded OSDs?
>>>
>>>
>>>
>>>
>>> Kind Regards,
>>> Georg
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Mark Nelson
On 08/19/2013 08:59 AM, Da Chun Ng wrote:
> Thanks very much! Mark.
> Yes, I put the data and journal on the same disk, no SSD in my environment.
> My controllers are general SATA II.

Ok, so in this case the lack of WB cache on the controller and no SSDs
for journals is probably having an effect.

> 
> Some more questions below in blue.
> 
> 
> Date: Mon, 19 Aug 2013 07:48:23 -0500
> From: mark.nel...@inktank.com
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Poor write/random read/random write performance
> 
> On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> 
> I have a 3 nodes, 15 osds ceph cluster setup:
> * 15 7200 RPM SATA disks, 5 for each node.
> * 10G network
> * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
> * 64G Ram for each node.
> 
> I deployed the cluster with ceph-deploy, and created a new data pool
> for cephfs.
> Both the data and metadata pools are set with replica size 3.
> Then mounted the cephfs on one of the three nodes, and tested the
> performance with fio.
> 
> The sequential read  performance looks good:
> fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
> -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
> read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
> 
> 
> Sounds like readahead and or caching is helping out a lot here. Btw, you 
> might want to make sure this is actually coming from the disks with 
> iostat or collectl or something.
> 
> I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes 
> before every test. I used collectl to watch every disk IO, the numbers 
> should match. I think readahead is helping here.

Ok, good!  I suspect that readahead is indeed helping.

> 
> 
> But the sequential write/random read/random write performance is
> very poor:
> fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
> -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
> 
> 
> One thing to keep in mind is that unless you have SSDs in this system, 
> you will be doing 2 writes for every client write to the spinning disks 
> (since data and journals will both be on the same disk).
> 
> So let's do the math:
> 
> 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
> (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
> 
> If there is no write coalescing going on, this isn't terrible.  If there 
> is, this is terrible.
> 
> How can I know if there is write coalescing going on?

look in collectl at the average IO sizes going to the disks.  I bet they
will be 16KB.  If you were to look further with blktrace and
seekwatcher, I bet you'd see lots of seeking between OSD data writes and
journal writes since there is no controller cache helping smooth things
out (and your journals are on the same drives).
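
Concretely, one way to check is to watch the average request size hitting a 
data disk while fio runs (sdb is a placeholder):

iostat -x 1 /dev/sdb
# avgrq-sz is in 512-byte sectors, so ~32 means plain 16KB requests
# (no coalescing); noticeably larger values mean writes are being merged.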

> 
> Have you tried buffered writes with the sync engine at the same IO size?
> 
> Do you mean as below?
> fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K 
> -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

Yeah, that'd work.

> 
> fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
> -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
> 
> 
> In this case:
> 
> 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> 
> Definitely not great!  You might want to try fiddling with read ahead 
> both on the CephFS client and on the block devices under the OSDs 
> themselves.
> 
> Could you please tell me how to enable read ahead on the CephFS client?

It's one of the mount options:

http://ceph.com/docs/master/man/8/mount.ceph/

> 
> For the block devices under the OSDs, the read ahead value is:
> [root@ceph0 ~]# blockdev --getra /dev/sdi
> 256
> How big is appropriate for it?

To be honest I've seen different results depending on the hardware.  I'd
try anywhere from 32kb to 2048kb.
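
For reference, --getra/--setra work in 512-byte sectors, so trying that range 
looks like this (device name is a placeholder):

blockdev --setra 64 /dev/sdi      # 64 sectors   = 32KB readahead
blockdev --setra 4096 /dev/sdi    # 4096 sectors = 2048KB readahead
blockdev --getra /dev/sdi         # confirm the current value
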

> 
> One thing I did notice back during bobtail is that increasing the number 
> of osd op threads seemed to help small object read performance.  It 
> might be worth looking at too.
> 
> http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
> 
> Other than that, if you really want to dig into this, you can use tools 
> like iostat, collectl, blktrace, and seekwatcher to try and get a feel 
> for what the IO going to the OSDs looks like.  That can help when 
> diagnosing this sort of thing.
> 
> fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
> -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
> 
> 
> 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 
> (KB->bytes) / 16384

Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
Thank you! Testing now.
How about pg num? I'm using the default size 64, as I tried with (100 * 
osd_num)/replica_size, but it decreased the performance surprisingly.

> Date: Mon, 19 Aug 2013 11:33:30 -0500
> From: mark.nel...@inktank.com
> To: dachun...@outlook.com
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Poor write/random read/random write performance
> 
> On 08/19/2013 08:59 AM, Da Chun Ng wrote:
> > Thanks very much! Mark.
> > Yes, I put the data and journal on the same disk, no SSD in my environment.
> > My controllers are general SATA II.
> 
> Ok, so in this case the lack of WB cache on the controller and no SSDs
> for journals is probably having an effect.
> 
> > 
> > Some more questions below in blue.
> > 
> > 
> > Date: Mon, 19 Aug 2013 07:48:23 -0500
> > From: mark.nel...@inktank.com
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Poor write/random read/random write performance
> > 
> > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> > 
> > I have a 3 nodes, 15 osds ceph cluster setup:
> > * 15 7200 RPM SATA disks, 5 for each node.
> > * 10G network
> > * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
> > * 64G Ram for each node.
> > 
> > I deployed the cluster with ceph-deploy, and created a new data pool
> > for cephfs.
> > Both the data and metadata pools are set with replica size 3.
> > Then mounted the cephfs on one of the three nodes, and tested the
> > performance with fio.
> > 
> > The sequential read  performance looks good:
> > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
> > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
> > read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
> > 
> > 
> > Sounds like readahead and or caching is helping out a lot here. Btw, you 
> > might want to make sure this is actually coming from the disks with 
> > iostat or collectl or something.
> > 
> > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes 
> > before every test. I used collectl to watch every disk IO, the numbers 
> > should match. I think readahead is helping here.
> 
> Ok, good!  I suspect that readahead is indeed helping.
> 
> > 
> > 
> > But the sequential write/random read/random write performance is
> > very poor:
> > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
> > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
> > 
> > 
> > One thing to keep in mind is that unless you have SSDs in this system, 
> > you will be doing 2 writes for every client write to the spinning disks 
> > (since data and journals will both be on the same disk).
> > 
> > So let's do the math:
> > 
> > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
> > 
> > If there is no write coalescing going on, this isn't terrible.  If there 
> > is, this is terrible.
> > 
> > How can I know if there is write coalescing going on?
> 
> look in collectl at the average IO sizes going to the disks.  I bet they
> will be 16KB.  If you were to look further with blktrace and
> seekwatcher, I bet you'd see lots of seeking between OSD data writes and
> journal writes since there is no controller cache helping smooth things
> out (and your journals are on the same drives).
> 
> > 
> > Have you tried buffered writes with the sync engine at the same IO size?
> > 
> > Do you mean as below?
> > fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K 
> > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> 
> Yeah, that'd work.
> 
> > 
> > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
> > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
> > 
> > 
> > In this case:
> > 
> > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> > 
> > Definitely not great!  You might want to try fiddling with read ahead 
> > both on the CephFS client and on the block devices under the OSDs 
> > themselves.
> > 
> > Could you please tell me how to enable read ahead on the CephFS client?
> 
> It's one of the mount options:
> 
> http://ceph.com/docs/master/man/8/mount.ceph/
> 
> > 
> > For the block devices under the OSDs, the read ahead value is:
> > [root@ceph0 ~]# blockdev --getra /dev/sdi
> > 256
> > How big is appropriate for it?
> 
> To be honest I've seen different results depending on the hardware.  I'd
> try anywhere from 32kb to 2048kb.
> 
> > 
> > One thing I did notice back during bobtail is that increasing the number 
> > of osd op threads seemed to help small object read performance.  It 
> > might be worth looking at too.
> > 
> > http://ceph.com/communit

Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Mark Nelson
On 08/19/2013 12:05 PM, Da Chun Ng wrote:
> Thank you! Testing now.
> 
> How about pg num? I'm using the default size 64, as I tried with (100 * 
> osd_num)/replica_size, but it decreased the performance surprisingly.

Oh!  That's odd!  Typically you would want more than that.  Most likely
you aren't distributing PGs very evenly across OSDs with 64.  More PGs
shouldn't decrease performance unless the monitors are behaving badly.
We saw some issues back in early cuttlefish but you should be fine with
many more PGs.
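
For example, with 15 OSDs and 3x replication the (100 * OSDs / replicas) rule 
of thumb gives ~500, so something like this (pool name is an example; pg_num 
can only be increased, never decreased):

ceph osd pool set cephfs_data pg_num 512
ceph osd pool set cephfs_data pgp_num 512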

Mark

> 
>  > Date: Mon, 19 Aug 2013 11:33:30 -0500
>  > From: mark.nel...@inktank.com
>  > To: dachun...@outlook.com
>  > CC: ceph-users@lists.ceph.com
>  > Subject: Re: [ceph-users] Poor write/random read/random write performance
>  >
>  > On 08/19/2013 08:59 AM, Da Chun Ng wrote:
>  > > Thanks very much! Mark.
>  > > Yes, I put the data and journal on the same disk, no SSD in my 
> environment.
>  > > My controllers are general SATA II.
>  >
>  > Ok, so in this case the lack of WB cache on the controller and no SSDs
>  > for journals is probably having an effect.
>  >
>  > >
>  > > Some more questions below in blue.
>  > >
>  > > 
> 
>  > > Date: Mon, 19 Aug 2013 07:48:23 -0500
>  > > From: mark.nel...@inktank.com
>  > > To: ceph-users@lists.ceph.com
>  > > Subject: Re: [ceph-users] Poor write/random read/random write 
> performance
>  > >
>  > > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
>  > >
>  > > I have a 3 nodes, 15 osds ceph cluster setup:
>  > > * 15 7200 RPM SATA disks, 5 for each node.
>  > > * 10G network
>  > > * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
>  > > * 64G Ram for each node.
>  > >
>  > > I deployed the cluster with ceph-deploy, and created a new data pool
>  > > for cephfs.
>  > > Both the data and metadata pools are set with replica size 3.
>  > > Then mounted the cephfs on one of the three nodes, and tested the
>  > > performance with fio.
>  > >
>  > > The sequential read performance looks good:
>  > > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
>  > > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
>  > >
>  > >
>  > > Sounds like readahead and or caching is helping out a lot here. 
> Btw, you
>  > > might want to make sure this is actually coming from the disks with
>  > > iostat or collectl or something.
>  > >
>  > > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes
>  > > before every test. I used collectl to watch every disk IO, the numbers
>  > > should match. I think readahead is helping here.
>  >
>  > Ok, good! I suspect that readahead is indeed helping.
>  >
>  > >
>  > >
>  > > But the sequential write/random read/random write performance is
>  > > very poor:
>  > > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
>  > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
>  > >
>  > >
>  > > One thing to keep in mind is that unless you have SSDs in this system,
>  > > you will be doing 2 writes for every client write to the spinning 
> disks
>  > > (since data and journals will both be on the same disk).
>  > >
>  > > So let's do the math:
>  > >
>  > > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
>  > > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / 
> drive
>  > >
>  > > If there is no write coalescing going on, this isn't terrible. If 
> there
>  > > is, this is terrible.
>  > >
>  > > How can I know if there is write coalescing going on?
>  >
>  > look in collectl at the average IO sizes going to the disks. I bet they
>  > will be 16KB. If you were to look further with blktrace and
>  > seekwatcher, I bet you'd see lots of seeking between OSD data writes and
>  > journal writes since there is no controller cache helping smooth things
>  > out (and your journals are on the same drives).
>  >
>  > >
>  > > Have you tried buffered writes with the sync engine at the same IO 
> size?
>  > >
>  > > Do you mean as below?
>  > > fio -direct=0-iodepth 1 -thread -rw=write -ioengine=sync-bs=16K
>  > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>  >
>  > Yeah, that'd work.
>  >
>  > >
>  > > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
>  > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest 
> -runtime 60
>  > > read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
>  > >
>  > >
>  > > In this case:
>  > >
>  > > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
>  > >
>  > > Definitely not great! You might want to try fiddling with read ahead
>  > > both on the CephFS client and on the block devices under the OSDs
>  > > themselves.
>  > >
>  > > Could you please tell me how to enable read ahead on the CephFS client?
>  >
>  > It's one of th

Re: [ceph-users] Ceph Deployments

2013-08-19 Thread John Wilkins
Actually, I wrote the Quick Start guides so that you could do exactly
what you are trying to do, but mostly from a "kick the tires"
perspective so that people can learn to use Ceph without imposing
$100k worth of hardware as a requirement. See
http://ceph.com/docs/master/start/quick-ceph-deploy/

I even added a section so that you could do it on one disk--e.g., on
your laptop.  
http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only

It says "demo only," because you won't get great performance out of a
single node. Monitors, OSDs, and Journals writing to disk and fsync
issues would make performance sub-optimal.
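
For completeness, a condensed sketch of that single-node setup with 
ceph-deploy ("node1" and the disk names are placeholders; see the quick start 
for the real walkthrough):

ceph-deploy new node1
ceph-deploy install node1
ceph-deploy mon create node1
ceph-deploy gatherkeys node1
ceph-deploy osd create node1:sdb node1:sdc
# with a single host you may also want "osd crush chooseleaf type = 0" in
# ceph.conf so placement groups can go active with all replicas on one node.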

For better performance, you should consider a separate drive for each
Ceph OSD Daemon if you can, and potentially a separate SSD drive
partitioned for journals. If you can separate the OS and monitor
drives from the OSD drives, that's better too.

I wrote it as a two-node quick start, because you cannot kernel mount
the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
can get around this too. If your machine has enough RAM and CPU, you
can also install virtual machines and kernel mount cephfs and block
devices in the virtual machines with no kernel issues. You don't need
to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
OpenStack all on the same host too.  It's just not an ideal situation
from performance or high availability perspective.



On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
 wrote:
> 2013/8/19 Wolfgang Hennerbichler :
>> On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.
>>>
>>> Currently we don't want to virtualise on this machine since the
>>> machine is really small, as said we focus on small to midsize
>>> businesses. Most of the time they even need a tower server due to the
>>> lack of a correct rack. ;/
>>
>> whoa :)
>
> Yep that's awful.
>
> Our Application, Ceph's object storage and a database?

 what is 'a database'?
>>>
>>> We run Postgresql or MariaDB (without/with Galera depending on the cluster 
>>> size)
>>
>> You wouldn't want to put the data of postgres or mariadb on cephfs. I
>> would run the native versions directly on the servers and use
>> mysql-multi-master circular replication. I don't know about similar
>> features of postgres.
>
> No, I don't want to put a MariaDB cluster on CephFS. We want to put PDFs
> in CephFS or Ceph's Object Storage and hold a key or path in the
> database; other things like user management will also belong to the
> database.
>
 shared nothing is possible with ceph, but in the end this really depends
 on your application.
>>>
>>> hm, when disk fails we already doing some backup on a dell powervault
>>> rd1000, so i don't think thats a problem and also we would run ceph on
>>> a Dell PERC Raid Controller with RAID1 enabled on the data disk.
>>
>> this is open to discussion, and really depends on your use case.
>
> Yeah, we definitely know that it isn't good to use Ceph on a single
> node, but I think it's easier to design the application so that it
> depends on Ceph. It wouldn't be easy to maintain both a single-node
> version without Ceph and a multi-node version with Ceph.
>
> Currently we make an archiving software for small customers and we want
> to move things on the file system on a object storage.

 you mean from the filesystem to an object storage?
>>>
>>> yes, currently everything is on the filesystem and this is really
>>> horrible, thousands of pdfs just on the filesystem. we can't scale up
>>> that easily with this setup.
>>
>> Got it.
>>
>>> Currently we run on Microsoft Servers, but we plan to rewrite our
>>> whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
>>> 7, 9, ... X²-1 should be possible.
>>
>> cool.
>>
> Currently we only
> have customers that needs 1 machine or 3 machines. But everything should
> work as fine on more.

 it would with ceph. probably :)
>>>
>>> That's nice to hear. I was really scared that we don't find a solution
>>> that can run on 1 system and scale up to even more. We first looked at
>>> HDFS but this isn't lightweight.
>>
>> not only that, HDFS also has a single point of failure.
>>
>>> And the overhead of Metadata etc.
>>> just isn't that cool.
>>
>> :)
>
> Yeah that's why I came to Ceph. I think that's probably the way we want to go.
> Really thank you for your help. It's good to know that I have a
> solution for the things that are badly designed on our current
> solution.
>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.c

Re: [ceph-users] RBD and balanced reads

2013-08-19 Thread Gregory Farnum
On Mon, Aug 19, 2013 at 9:07 AM, Sage Weil  wrote:
> On Mon, 19 Aug 2013, Sébastien Han wrote:
>> Hi guys,
>>
>> While reading a developer doc, I came across the following options:
>>
>> * osd balance reads = true
>> * osd shed reads = true
>> * osd shed reads min latency
>> * osd shed reads min latency diff
>>
>> The problem is that I can't find any of these options in config_opts.h.
>
> These are left over from an old unimplemented experiment and were removed
> a while back.
>
>> Loic Dachary also gave me a flag that he found from the code.
>>
>> m->get_flags() & CEPH_OSD_FLAG_LOCALIZE_READS)
>>
>> So my questions are:
>>
>> * Which from the above flags are correct?
>> * Do balanced reads really exist in RBD?
>
> For localized reads you want
>
> OPTION(rbd_balance_snap_reads, OPT_BOOL, false)
> OPTION(rbd_localize_snap_reads, OPT_BOOL, false)
>
> Note that the 'localize' logic is still very primitive (it matches by IP
> address).  There is a blueprint to improve this:
>
> 
> http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librados%2F%2Fobjecter%3A_smarter_localized_reads

Also, there are some issues with read/write consistency when using
localized reads because the replicas do not provide the ordering
guarantees that primaries will. See
http://tracker.ceph.com/issues/5388
At present localized reads are really only suitable for spreading the
load on write-once, read-many workloads.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
That sounds bad for me.
As said, one of the things we are considering is a one-node setup, for production.
Not every customer can afford hardware worth more than ~4000 Euro.
Small business users don't need the biggest hardware, but I don't
think it's a good idea to have one version that uses the filesystem and
one version that uses Ceph.

We prefer an Object Storage for our files. It should work like the
Object Storage of the App Engine.
That scales from 1 to X servers.


2013/8/19 John Wilkins :
> Actually, I wrote the Quick Start guides so that you could do exactly
> what you are trying to do, but mostly from a "kick the tires"
> perspective so that people can learn to use Ceph without imposing
> $100k worth of hardware as a requirement. See
> http://ceph.com/docs/master/start/quick-ceph-deploy/
>
> I even added a section so that you could do it on one disk--e.g., on
> your laptop.  
> http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only
>
> It says "demo only," because you won't get great performance out of a
> single node. Monitors, OSDs, and Journals writing to disk and fsync
> issues would make performance sub-optimal.
>
> For better performance, you should consider a separate drive for each
> Ceph OSD Daemon if you can, and potentially a separate SSD drive
> partitioned for journals. If you can separate the OS and monitor
> drives from the OSD drives, that's better too.
>
> I wrote it as a two-node quick start, because you cannot kernel mount
> the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
> Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
> can get around this too. If your machine has enough RAM and CPU, you
> can also install virtual machines and kernel mount cephfs and block
> devices in the virtual machines with no kernel issues. You don't need
> to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
> OpenStack all on the same host too.  It's just not an ideal situation
> from performance or high availability perspective.
>
>
>
> On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
>  wrote:
>> 2013/8/19 Wolfgang Hennerbichler :
>>> On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
> yes. depends on 'everything', but it's possible (though not recommended)
> to run mon, mds, and osd's on the same host, and even do virtualisation.

 Currently we don't want to virtualise on this machine since the
 machine is really small, as said we focus on small to midsize
 businesses. Most of the time they even need a tower server due to the
 lack of a correct rack. ;/
>>>
>>> whoa :)
>>
>> Yep that's awful.
>>
>> Our Application, Ceph's object storage and a database?
>
> what is 'a database'?

 We run Postgresql or MariaDB (without/with Galera depending on the cluster 
 size)
>>>
>>> You wouldn't want to put the data of postgres or mariadb on cephfs. I
>>> would run the native versions directly on the servers and use
>>> mysql-multi-master circular replication. I don't know about similar
>>> features of postgres.
>>
>> No i don't want to put a MariaDB Cluster on CephFS we want to put PDFs
>> in CephFS or Ceph's Object Storage and hold a key or path in the
>> database, also other things like user management will belong to the
>> database
>>
> shared nothing is possible with ceph, but in the end this really depends
> on your application.

 hm, when disk fails we already doing some backup on a dell powervault
 rd1000, so i don't think thats a problem and also we would run ceph on
 a Dell PERC Raid Controller with RAID1 enabled on the data disk.
>>>
>>> this is open to discussion, and really depends on your use case.
>>
>> Yeah we definitely know that it isn't good to use Ceph on a single
>> node, but i think it's easier to design the application that it will
>> depends on ceph. it wouldn't be easy to manage to have a single node
>> without ceph and more than 1 node with ceph.
>>
>> Currently we make an archiving software for small customers and we want
>> to move things on the file system on a object storage.
>
> you mean from the filesystem to an object storage?

 yes, currently everything is on the filesystem and this is really
 horrible, thousands of pdfs just on the filesystem. we can't scale up
 that easily with this setup.
>>>
>>> Got it.
>>>
 Currently we run on Microsoft Servers, but we plan to rewrite our
 whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
 7, 9, ... X²-1 should be possible.
>>>
>>> cool.
>>>
>> Currently we only
>> have customers that needs 1 machine or 3 machines. But everything should
>> work as fine on more.
>
> it would with ceph. probably :)

 That's nice to hear. I was really scared that we don't find a solution
 that can run on 1 system and scale up to even more. We first looked at
 HDFS but this isn't lightweight.
>>>

Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Wolfgang Hennerbichler
What you are trying to do will work, because you will not need any kernel 
related code for object storage, so a one node setup will work for you. 

-- 
Sent from my mobile device

On 19.08.2013, at 20:29, "Schmitt, Christian"  wrote:

> That sounds bad for me.
> As said one of the things we consider is a one node setup, for production.
> Not every Customer will afford hardware worth more than ~4000 Euro.
> Small business users don't need need the biggest hardware, but i don't
> think it's a good way to have a version who uses the filesystem and
> one version who use ceph.
> 
> We prefer a Object Storage for our Files. It should work like the
> Object Storage of the App Engine.
> That scales from 1 to X Servers.
> 
> 
> 2013/8/19 John Wilkins :
>> Actually, I wrote the Quick Start guides so that you could do exactly
>> what you are trying to do, but mostly from a "kick the tires"
>> perspective so that people can learn to use Ceph without imposing
>> $100k worth of hardware as a requirement. See
>> http://ceph.com/docs/master/start/quick-ceph-deploy/
>> 
>> I even added a section so that you could do it on one disk--e.g., on
>> your laptop.  
>> http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only
>> 
>> It says "demo only," because you won't get great performance out of a
>> single node. Monitors, OSDs, and Journals writing to disk and fsync
>> issues would make performance sub-optimal.
>> 
>> For better performance, you should consider a separate drive for each
>> Ceph OSD Daemon if you can, and potentially a separate SSD drive
>> partitioned for journals. If you can separate the OS and monitor
>> drives from the OSD drives, that's better too.
>> 
>> I wrote it as a two-node quick start, because you cannot kernel mount
>> the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
>> Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
>> can get around this too. If your machine has enough RAM and CPU, you
>> can also install virtual machines and kernel mount cephfs and block
>> devices in the virtual machines with no kernel issues. You don't need
>> to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
>> OpenStack all on the same host too.  It's just not an ideal situation
>> from performance or high availability perspective.
>> 
>> 
>> 
>> On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
>>  wrote:
>>> 2013/8/19 Wolfgang Hennerbichler :
 On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
>> yes. depends on 'everything', but it's possible (though not recommended)
>> to run mon, mds, and osd's on the same host, and even do virtualisation.
> 
> Currently we don't want to virtualise on this machine since the
> machine is really small, as said we focus on small to midsize
> businesses. Most of the time they even need a tower server due to the
> lack of a correct rack. ;/
 
 whoa :)
>>> 
>>> Yep that's awful.
>>> 
>>> Our Application, Ceph's object storage and a database?
>> 
>> what is 'a database'?
> 
> We run Postgresql or MariaDB (without/with Galera depending on the 
> cluster size)
 
 You wouldn't want to put the data of postgres or mariadb on cephfs. I
 would run the native versions directly on the servers and use
 mysql-multi-master circular replication. I don't know about similar
 features of postgres.
>>> 
>>> No i don't want to put a MariaDB Cluster on CephFS we want to put PDFs
>>> in CephFS or Ceph's Object Storage and hold a key or path in the
>>> database, also other things like user management will belong to the
>>> database
>>> 
>> shared nothing is possible with ceph, but in the end this really depends
>> on your application.
> 
> hm, when disk fails we already doing some backup on a dell powervault
> rd1000, so i don't think thats a problem and also we would run ceph on
> a Dell PERC Raid Controller with RAID1 enabled on the data disk.
 
 this is open to discussion, and really depends on your use case.
>>> 
>>> Yeah we definitely know that it isn't good to use Ceph on a single
>>> node, but i think it's easier to design the application that it will
>>> depends on ceph. it wouldn't be easy to manage to have a single node
>>> without ceph and more than 1 node with ceph.
>>> 
>>> Currently we make an archiving software for small customers and we want
>>> to move things on the file system on a object storage.
>> 
>> you mean from the filesystem to an object storage?
> 
> yes, currently everything is on the filesystem and this is really
> horrible, thousands of pdfs just on the filesystem. we can't scale up
> that easily with this setup.
 
 Got it.
 
> Currently we run on Microsoft Servers, but we plan to rewrite our
> whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
> 7, 9, ... X²-1 should be possible.
 
 cool.

Re: [ceph-users] Ceph Deployments

2013-08-19 Thread John Wilkins
Wolfgang is correct. You do not need VMs at all if you are setting up
Ceph Object Storage. It's just Apache, FastCGI, and the radosgw daemon
interacting with the Ceph Storage Cluster. You can do that on one box
no problem. It's still better to have more drives for performance
though.
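
For reference, the radosgw part of such a one-box setup boils down to a client
section in ceph.conf along these lines (a sketch only; the host name, keyring path
and socket path are illustrative, and the matching Apache/FastCGI vhost still has
to be configured as described in the radosgw docs):

    [client.radosgw.gateway]
        host = gateway-host
        keyring = /etc/ceph/keyring.radosgw.gateway
        rgw socket path = /var/run/ceph/radosgw.sock
        log file = /var/log/ceph/radosgw.log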

On Mon, Aug 19, 2013 at 12:08 PM, Wolfgang Hennerbichler
 wrote:
> What you are trying to do will work, because you will not need any kernel 
> related code for object storage, so a one node setup will work for you.
>
> --
> Sent from my mobile device
>
> On 19.08.2013, at 20:29, "Schmitt, Christian"  
> wrote:
>
>> That sounds bad for me.
>> As said, one of the things we are considering is a one-node setup, for production.
>> Not every customer can afford hardware worth more than ~4000 Euro.
>> Small business users don't need the biggest hardware, but I don't
>> think it's a good idea to have one version that uses the filesystem and
>> one version that uses Ceph.
>>
>> We prefer an object storage for our files. It should work like the
>> object storage of App Engine.
>> That scales from 1 to X servers.
>>
>>
>> 2013/8/19 John Wilkins :
>>> Actually, I wrote the Quick Start guides so that you could do exactly
>>> what you are trying to do, but mostly from a "kick the tires"
>>> perspective so that people can learn to use Ceph without imposing
>>> $100k worth of hardware as a requirement. See
>>> http://ceph.com/docs/master/start/quick-ceph-deploy/
>>>
>>> I even added a section so that you could do it on one disk--e.g., on
>>> your laptop.  
>>> http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only
>>>
>>> It says "demo only," because you won't get great performance out of a
>>> single node. Monitors, OSDs, and Journals writing to disk and fsync
>>> issues would make performance sub-optimal.
>>>
>>> For better performance, you should consider a separate drive for each
>>> Ceph OSD Daemon if you can, and potentially a separate SSD drive
>>> partitioned for journals. If you can separate the OS and monitor
>>> drives from the OSD drives, that's better too.
>>>
>>> I wrote it as a two-node quick start, because you cannot kernel mount
>>> the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
>>> Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
>>> can get around this too. If your machine has enough RAM and CPU, you
>>> can also install virtual machines and kernel mount cephfs and block
>>> devices in the virtual machines with no kernel issues. You don't need
>>> to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
>>> OpenStack all on the same host too.  It's just not an ideal situation
>>> from performance or high availability perspective.
>>>
>>>
>>>
>>> On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
>>>  wrote:
 2013/8/19 Wolfgang Hennerbichler :
> On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
>>> yes. depends on 'everything', but it's possible (though not recommended)
>>> to run mon, mds, and osd's on the same host, and even do virtualisation.
>>
>> Currently we don't want to virtualise on this machine since the
>> machine is really small, as said we focus on small to midsize
>> businesses. Most of the time they even need a tower server due to the
>> lack of a correct rack. ;/
>
> whoa :)

 Yep that's awful.

 Our Application, Ceph's object storage and a database?
>>>
>>> what is 'a database'?
>>
>> We run Postgresql or MariaDB (without/with Galera depending on the 
>> cluster size)
>
> You wouldn't want to put the data of postgres or mariadb on cephfs. I
> would run the native versions directly on the servers and use
> mysql-multi-master circular replication. I don't know about similar
> features of postgres.

 No, I don't want to put a MariaDB cluster on CephFS. We want to put PDFs
 in CephFS or Ceph's object storage and hold a key or path in the
 database; other things like user management will also belong to the
 database.

>>> shared nothing is possible with ceph, but in the end this really depends
>>> on your application.
>>
>> Hm, when a disk fails we are already doing backups on a Dell PowerVault
>> RD1000, so I don't think that's a problem, and we would also run Ceph on
>> a Dell PERC RAID controller with RAID1 enabled on the data disks.
>
> this is open to discussion, and really depends on your use case.

 Yeah, we definitely know that it isn't good to use Ceph on a single
 node, but I think it's easier to design the application so that it
 depends on Ceph. It wouldn't be easy to maintain a single-node version
 without Ceph and a multi-node version with Ceph.

 Currently we make archiving software for small customers and we want
 to move things from the file system to an object storage.
>>>
>>> you mean from the filesystem to an object storage?

Re: [ceph-users] Flapping osd / continuously reported as failed

2013-08-19 Thread Gregory Farnum
On Fri, Aug 16, 2013 at 5:47 AM, Mostowiec Dominik
 wrote:
> Hi,
> Thanks for your response.
>
>> It's possible, as deep scrub in particular will add a bit of load (it
>> goes through and compares the object contents).
>
> Is it possible that scrubbing blocks access (RW, or only W) to the bucket index
> when it checks the .dir... file?
> When the rgw index is very large I guess it takes some time.

Yes, it definitely can as scrubbing takes locks on the PG, which will
prevent reads or writes while the message is being processed (which
will involve the rgw index being scanned).

>> Are you not having any
>> flapping issues any more, and did you try and find when it started the
>> scrub to see if it matched up with your troubles?
>
> No, I didn't.
> But on our second cluster with the same problem, disabling scrubbing also helps.
>
>> I'd be hesitant to turn it off as scrubbing can uncover corrupt
>> objects etc, but you can configure it with the settings at
>> http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing.
>> (Always check the surprisingly-helpful docs when you need to do some
>> config or operations work!)
>
> I think changing the scrub timeout or interval config won't fully remove the issue.
> Would changing "osd deep scrub stride" to a small value make scrubbing lighter?
You probably don't want to change the scrub stride; that is used to
keep reads at an appropriate size for the internal control threads but
won't relate to the object read/write locking.
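(For reference, the knobs on that docs page go into the [osd] section of ceph.conf;
the values below are purely illustrative, not recommendations:

    [osd]
        osd max scrubs = 1
        osd scrub min interval = 86400
        osd scrub max interval = 604800
        osd deep scrub interval = 604800
        osd scrub load threshold = 0.5
)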
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v0.61.8 Cuttlefish released

2013-08-19 Thread Sage Weil
We've made another point release for Cuttlefish. This release contains a 
number of fixes that are generally not individually critical, but do trip 
up users from time to time, are non-intrusive, and have held up under 
testing.

Notable changes include:

 * librados: fix async aio completion wakeup
 * librados: fix aio completion locking
 * librados: fix rare deadlock during shutdown
 * osd: fix race when queueing recovery operations
 * osd: fix possible race during recovery
 * osd: optionally preload rados classes on startup (disabled by default)
 * osd: fix journal replay corner condition
 * osd: limit size of peering work queue batch (to speed up peering)
 * mon: fix paxos recovery corner case
 * mon: fix rare hang when monmap updates during an election
 * mon: make osd pool mksnap ... avoid exposing uncommitted state
 * mon: make osd pool rmsnap ... not racy, avoid exposing uncommitted state
 * mon: fix bug during mon cluster expansion
 * rgw: fix crash during multi delete operation
 * msgr: fix race conditions during osd network reinitialization
 * ceph-disk: apply mount options when remounting

For more detailed information, please see the detailed release notes and 
complete changelog:

 * http://ceph.com/docs/master/release-notes/#v0-61-8-cuttlefish
 * http://ceph.com/docs/master/_downloads/v0.61.8.txt

You can get v0.61.8 from the usual locations:

 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.61.8.tar.gz
 * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
 * For RPMs, see http://ceph.com/docs/master/install/rpm

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Oliver Daudey
Hey Samuel,

Thanks!  I installed your version, repeated the same tests on my
test-cluster and the extra CPU-loading seems to have disappeared.  Then
I replaced one osd of my production-cluster with your modified version
and its config option, and it seems to be a lot less CPU-hungry now.
Although the Cuttlefish-osds still seem to be even more CPU-efficient,
your changes have definitely helped a lot.  We seem to be looking in the
right direction, at least for this part of the problem.

BTW, I ran `perf top' on the production-node with your modified osd and
didn't see anything osd-related stand out on top.  "PGLog::undirty()"
was in there, but with much lower usage, right at the bottom of the
green part of the output.

Many thanks for your help so far!


   Regards,

 Oliver

On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
> You're right, PGLog::undirty() looks suspicious.  I just pushed a
> branch wip-dumpling-pglog-undirty with a new config
> (osd_debug_pg_log_writeout) which if set to false will disable some
> strictly debugging checks which occur in PGLog::undirty().  We haven't
> actually seen these checks causing excessive cpu usage, so this may be
> a red herring.
> -Sam
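(A sketch of how such an option would normally be set, not tested here: in the [osd]
section of ceph.conf as "osd debug pg log writeout = false", or on a running OSD via
injectargs, e.g.

    ceph tell osd.0 injectargs '--osd-debug-pg-log-writeout=false'

assuming the option behaves like other OSD options.)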
> 
> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey  wrote:
> > Hey Mark,
> >
> > On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
> >> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
> >> > Hey all,
> >> >
> >> > This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
> >> > created in the tracker.  Thought I would pass it through the list as
> >> > well, to get an idea if anyone else is running into it.  It may only
> >> > show under higher loads.  More info about my setup is in the bug-report
> >> > above.  Here goes:
> >> >
> >> >
> >> > I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
> >> > and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
> >> > unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
> >> > +MB/sec on simple linear writes to a file with `dd' inside a VM on this
> >> > cluster under regular load and the osds usually averaged 20-100%
> >> > CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
> >> > the osds shot up to 100% to 400% in `top' (multi-core system) and the
> >> > speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
> >> > complained that disk-access inside the VMs was significantly slower and
> >> > the backups of the RBD-store I was running, also got behind quickly.
> >> >
> >> > After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
> >> > rest at 0.67 Dumpling, speed and load returned to normal. I have
> >> > repeated this performance-hit upon upgrade on a similar test-cluster
> >> > under no additional load at all. Although CPU-usage for the osds wasn't
> >> > as dramatic during these tests because there was no base-load from other
> >> > VMs, I/O-performance dropped significantly after upgrading during these
> >> > tests as well, and returned to normal after downgrading the osds.
> >> >
> >> > I'm not sure what to make of it. There are no visible errors in the logs
> >> > and everything runs and reports good health, it's just a lot slower,
> >> > with a lot more CPU-usage.
> >>
> >> Hi Oliver,
> >>
> >> If you have access to the perf command on this system, could you try
> >> running:
> >>
> >> "sudo perf top"
> >>
> >> And if that doesn't give you much,
> >>
> >> "sudo perf record -g"
> >>
> >> then:
> >>
> >> "sudo perf report | less"
> >>
> >> during the period of high CPU usage?  This will give you a call graph.
> >> There may be symbols missing, but it might help track down what the OSDs
> >> are doing.
> >
> > Thanks for your help!  I did a couple of runs on my test-cluster,
> > loading it with writes from 3 VMs concurrently and measuring the results
> > at the first node with all 0.67 Dumpling-components and with the osds
> > replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
> > while, then copied anything that showed in red and green into this post.
> > Here are the results (sorry for the word-wraps):
> >
> > First, with 0.61.7 osds:
> >
> >  19.91%  [kernel][k] intel_idle
> >  10.18%  [kernel][k] _raw_spin_lock_irqsave
> >   6.79%  ceph-osd[.] ceph_crc32c_le
> >   4.93%  [kernel][k]
> > default_send_IPI_mask_sequence_phys
> >   2.71%  [kernel][k] copy_user_generic_string
> >   1.42%  libc-2.11.3.so  [.] memcpy
> >   1.23%  [kernel][k] find_busiest_group
> >   1.13%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
> >   1.11%  [kernel][k] _raw_spin_lock
> >   0.99%  kvm [.] 0x1931f8
> >   0.92%  [igb]   [k] igb_poll
> >   0.87%  [kernel][k] native_write_cr0
> >   0.80%  [kernel][k] csum_partial
> >   0.78%

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Mark Nelson

Hi Oliver,

Glad that helped!  How much more efficient do the cuttlefish OSDs seem 
at this point (with wip-dumpling-pglog-undirty)?  On modern Intel 
platforms we were actually hoping to see CPU usage go down in many cases 
due to the use of hardware CRC32 instructions.


Mark

On 08/19/2013 03:06 PM, Oliver Daudey wrote:

Hey Samuel,

Thanks!  I installed your version, repeated the same tests on my
test-cluster and the extra CPU-loading seems to have disappeared.  Then
I replaced one osd of my production-cluster with your modified version
and its config option, and it seems to be a lot less CPU-hungry now.
Although the Cuttlefish-osds still seem to be even more CPU-efficient,
your changes have definitely helped a lot.  We seem to be looking in the
right direction, at least for this part of the problem.

BTW, I ran `perf top' on the production-node with your modified osd and
didn't see anything osd-related stand out on top.  "PGLog::undirty()"
was in there, but with much lower usage, right at the bottom of the
green part of the output.

Many thanks for your help so far!


Regards,

  Oliver

On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:

You're right, PGLog::undirty() looks suspicious.  I just pushed a
branch wip-dumpling-pglog-undirty with a new config
(osd_debug_pg_log_writeout) which if set to false will disable some
strictly debugging checks which occur in PGLog::undirty().  We haven't
actually seen these checks causing excessive cpu usage, so this may be
a red herring.
-Sam

On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey  wrote:

Hey Mark,

On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:

On 08/17/2013 06:13 AM, Oliver Daudey wrote:

Hey all,

This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
created in the tracker.  Thought I would pass it through the list as
well, to get an idea if anyone else is running into it.  It may only
show under higher loads.  More info about my setup is in the bug-report
above.  Here goes:


I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
+MB/sec on simple linear writes to a file with `dd' inside a VM on this
cluster under regular load and the osds usually averaged 20-100%
CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
the osds shot up to 100% to 400% in `top' (multi-core system) and the
speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
complained that disk-access inside the VMs was significantly slower and
the backups of the RBD-store I was running, also got behind quickly.

After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
rest at 0.67 Dumpling, speed and load returned to normal. I have
repeated this performance-hit upon upgrade on a similar test-cluster
under no additional load at all. Although CPU-usage for the osds wasn't
as dramatic during these tests because there was no base-load from other
VMs, I/O-performance dropped significantly after upgrading during these
tests as well, and returned to normal after downgrading the osds.

I'm not sure what to make of it. There are no visible errors in the logs
and everything runs and reports good health, it's just a lot slower,
with a lot more CPU-usage.


Hi Oliver,

If you have access to the perf command on this system, could you try
running:

"sudo perf top"

And if that doesn't give you much,

"sudo perf record -g"

then:

"sudo perf report | less"

during the period of high CPU usage?  This will give you a call graph.
There may be symbols missing, but it might help track down what the OSDs
are doing.


Thanks for your help!  I did a couple of runs on my test-cluster,
loading it with writes from 3 VMs concurrently and measuring the results
at the first node with all 0.67 Dumpling-components and with the osds
replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
while, then copied anything that showed in red and green into this post.
Here are the results (sorry for the word-wraps):

First, with 0.61.7 osds:

  19.91%  [kernel][k] intel_idle
  10.18%  [kernel][k] _raw_spin_lock_irqsave
   6.79%  ceph-osd[.] ceph_crc32c_le
   4.93%  [kernel][k]
default_send_IPI_mask_sequence_phys
   2.71%  [kernel][k] copy_user_generic_string
   1.42%  libc-2.11.3.so  [.] memcpy
   1.23%  [kernel][k] find_busiest_group
   1.13%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
   1.11%  [kernel][k] _raw_spin_lock
   0.99%  kvm [.] 0x1931f8
   0.92%  [igb]   [k] igb_poll
   0.87%  [kernel][k] native_write_cr0
   0.80%  [kernel][k] csum_partial
   0.78%  [kernel][k] __do_softirq
   0.63%  [kernel]

Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
Le lundi 19 août 2013 à 12:27 +0200, Olivier Bonvalet a écrit :
> Hi,
> 
> I have an OSD which crash every time I try to start it (see logs below).
> Is it a known problem ? And is there a way to fix it ?
> 
> root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
> 2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
> (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
> 2013-08-19 11:07:48.516363 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
> appears to work
> 2013-08-19 11:07:48.516380 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
> 'filestore fiemap' config option
> 2013-08-19 11:07:48.516514 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
> 2013-08-19 11:07:48.517087 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
> supported
> 2013-08-19 11:07:48.517389 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
> 2013-08-19 11:07:49.199483 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: 
> btrfs not detected
> 2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
> 2013-08-19 11:07:52.199908 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
> appears to work
> 2013-08-19 11:07:52.199916 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
> 'filestore fiemap' config option
> 2013-08-19 11:07:52.200058 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
> 2013-08-19 11:07:52.200886 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
> supported
> 2013-08-19 11:07:52.200919 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
> 2013-08-19 11:07:52.215850 7f6fe367a780  0 
> filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: 
> btrfs not detected
> 2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
> 2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has 
> features 262144, adjusting msgr requires for clients
> 2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has 
> features 262144, adjusting msgr requires for osds
> 2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
> OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
> 11:08:13.579519
> osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))
> 
>  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
>  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
>  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
> PG::RecoveryCtx*, std::set, 
> std::less >, std::allocator 
> > >*)+0x3c8) [0x6f8f48]
>  3: (OSD::process_peering_events(std::list > const&, 
> ThreadPool::TPHandle&)+0x31f) [0x6f975f]
>  4: (OSD::PeeringWQ::_process(std::list > const&, 
> ThreadPool::TPHandle&)+0x14) [0x7391d4]
>  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
>  6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
>  7: (()+0x6b50) [0x7f6fe3070b50]
>  8: (clone()+0x6d) [0x7f6fe15cba7d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 
> full logs here : http://pastebin.com/RphNyLU0
> 
> 

Hi,

still same problem with Ceph 0.61.8 :

2013-08-19 23:01:54.369609 7fdd667a4780  0 osd.65 144279 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-08-19 
23:01:58.313955
osd/OSD.cc: 4847: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f736b]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, 
std::less >, std::allocator > 
>*)+0x3c8) [0x6fa708]
 3: (OSD::process_peering_events(std::list > const&, 
ThreadPool::TPHandle&)+0x31f) [0x6faf1f]
 4: (OSD::PeeringWQ::_process(std::list > const&, 
ThreadPool::TPHandle&)+0x14) [0x73a9b4]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8fb69a]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x8fc940]
 7: (()+0x6b50) [0x7fdd6619ab50]
 8: (clone()+0x6d) [0x7fdd646f5a7d]
 NOTE: a copy of the executable, or

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Oliver Daudey
Hey Mark,

If I look at the "wip-dumpling-pglog-undirty"-version with regular top,
I see a slightly higher base-load on the osd, with significantly more
and higher spikes in it than the Dumpling-osds.  Looking with `perf
top', "PGLog::undirty()" is still there, although pulling significantly
less CPU.  With the Cuttlefish-osds, I don't see it at all, even under
load.  That may account for the extra load I'm still seeing, but I don't
know what is still going on in it and if that too can be safely disabled
to save some more CPU.

All in all, it's quite close and seems a bit difficult to measure.  I'd
say the CPU-usage with "wip-dumpling-pglog-undirty" is still a good 30%
higher than Cuttlefish on my production-cluster.  I have yet to upgrade
all osds and compare performance of the cluster as a whole.  Is the
"wip-dumpling-pglog-undirty"-version considered safe enough to do so?
If you have any tips for other safe benchmarks, I'll try those as well.
Thanks!


   Regards,

  Oliver

On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
> Hi Oliver,
> 
> Glad that helped!  How much more efficient do the cuttlefish OSDs seem 
> at this point (with wip-dumpling-pglog-undirty)?  On modern Intel 
> platforms we were actually hoping to see CPU usage go down in many cases 
> due to the use of hardware CRC32 instructions.
> 
> Mark
> 
> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
> > Hey Samuel,
> >
> > Thanks!  I installed your version, repeated the same tests on my
> > test-cluster and the extra CPU-loading seems to have disappeared.  Then
> > I replaced one osd of my production-cluster with your modified version
> > and its config option, and it seems to be a lot less CPU-hungry now.
> > Although the Cuttlefish-osds still seem to be even more CPU-efficient,
> > your changes have definitely helped a lot.  We seem to be looking in the
> > right direction, at least for this part of the problem.
> >
> > BTW, I ran `perf top' on the production-node with your modified osd and
> > didn't see anything osd-related stand out on top.  "PGLog::undirty()"
> > was in there, but with much lower usage, right at the bottom of the
> > green part of the output.
> >
> > Many thanks for your help so far!
> >
> >
> > Regards,
> >
> >   Oliver
> >
> > On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
> >> You're right, PGLog::undirty() looks suspicious.  I just pushed a
> >> branch wip-dumpling-pglog-undirty with a new config
> >> (osd_debug_pg_log_writeout) which if set to false will disable some
> >> strictly debugging checks which occur in PGLog::undirty().  We haven't
> >> actually seen these checks causing excessive cpu usage, so this may be
> >> a red herring.
> >> -Sam
> >>
> >> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey  wrote:
> >>> Hey Mark,
> >>>
> >>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>  On 08/17/2013 06:13 AM, Oliver Daudey wrote:
> > Hey all,
> >
> > This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
> > created in the tracker.  Thought I would pass it through the list as
> > well, to get an idea if anyone else is running into it.  It may only
> > show under higher loads.  More info about my setup is in the bug-report
> > above.  Here goes:
> >
> >
> > I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
> > and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
> > unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
> > +MB/sec on simple linear writes to a file with `dd' inside a VM on this
> > cluster under regular load and the osds usually averaged 20-100%
> > CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
> > the osds shot up to 100% to 400% in `top' (multi-core system) and the
> > speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
> > complained that disk-access inside the VMs was significantly slower and
> > the backups of the RBD-store I was running, also got behind quickly.
> >
> > After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
> > rest at 0.67 Dumpling, speed and load returned to normal. I have
> > repeated this performance-hit upon upgrade on a similar test-cluster
> > under no additional load at all. Although CPU-usage for the osds wasn't
> > as dramatic during these tests because there was no base-load from other
> > VMs, I/O-performance dropped significantly after upgrading during these
> > tests as well, and returned to normal after downgrading the osds.
> >
> > I'm not sure what to make of it. There are no visible errors in the logs
> > and everything runs and reports good health, it's just a lot slower,
> > with a lot more CPU-usage.
> 
>  Hi Oliver,
> 
>  If you have access to the perf command on this system, could you try
>  running:
> 
>  "sudo perf top"
> 
>  And if that does

Re: [ceph-users] large memory leak on scrubbing

2013-08-19 Thread Mostowiec Dominik
Hi,
> Is that the only slow request message you see?
No.
Full log: https://www.dropbox.com/s/i3ep5dcimndwvj1/slow_requests.txt.tar.gz 
It starts from:
2013-08-16 09:43:39.662878 mon.0 10.174.81.132:6788/0 4276384 : [DBG] osd.4 
10.174.81.131:6805/31460 reported failed by osd.50 10.174.81.135:6842/26019
2013-08-16 09:43:40.711911 mon.0 10.174.81.132:6788/0 4276386 : [DBG] osd.4 
10.174.81.131:6805/31460 reported failed by osd.14 10.174.81.132:6836/2958
2013-08-16 09:43:41.043016 mon.0 10.174.81.132:6788/0 4276388 : [DBG] osd.4 
10.174.81.131:6805/31460 reported failed by osd.13 10.174.81.132:6830/2482
2013-08-16 09:43:41.043047 mon.0 10.174.81.132:6788/0 4276389 : [INF] osd.4 
10.174.81.131:6805/31460 failed (3 reports from 3 peers after 2013-08-16 
09:43:56.042983 >= grace 20.00)
2013-08-16 09:43:41.122326 mon.0 10.174.81.132:6788/0 4276390 : [INF] osdmap 
e10294: 144 osds: 143 up, 143 in
2013-08-16 09:43:38.798833 osd.4 10.174.81.131:6805/31460 913 : [WRN] 6 slow 
requests, 6 included below; oldest blocked for > 30.190146 secs
2013-08-16 09:43:38.798843 osd.4 10.174.81.131:6805/31460 914 : [WRN] slow 
request 30.190146 seconds old, received at 2013-08-16 09:43:08.585504: 
osd_op(client.22301645.0:48987 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798854 osd.4 10.174.81.131:6805/31460 915 : [WRN] slow 
request 30.189643 seconds old, received at 2013-08-16 09:43:08.586007: 
osd_op(client.22301855.0:49374 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798859 osd.4 10.174.81.131:6805/31460 916 : [WRN] slow 
request 30.188236 seconds old, received at 2013-08-16 09:43:08.587414: 
osd_op(client.22307596.0:47674 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798862 osd.4 10.174.81.131:6805/31460 917 : [WRN] slow 
request 30.187853 seconds old, received at 2013-08-16 09:43:08.587797: 
osd_op(client.22303894.0:51846 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
...
2013-08-16 09:44:18.126318 mon.0 10.174.81.132:6788/0 4276427 : [INF] osd.4 
10.174.81.131:6805/31460 boot
...
2013-08-16 09:44:23.215918 mon.0 10.174.81.132:6788/0 4276437 : [DBG] osd.25 
10.174.81.133:6810/2961 reported failed by osd.83 10.174.81.137:6837/27963
2013-08-16 09:44:23.704769 mon.0 10.174.81.132:6788/0 4276438 : [INF] pgmap 
v17035051: 32424 pgs: 1 stale+active+clean+scrubbing+deep, 2 active, 31965 
active+clean, 7 stale+active+clean, 29 peering, 415 active+degraded, 5 
active+clean+scrubbing; 6630 GB data, 21420 GB used, 371 TB / 392 TB avail; 
246065/61089697 degraded (0.403%)
2013-08-16 09:44:23.711244 mon.0 10.174.81.132:6788/0 4276439 : [DBG] osd.133 
10.174.81.142:6803/21366 reported failed by osd.26 10.174.81.133:6814/3674
2013-08-16 09:44:23.713597 mon.0 10.174.81.132:6788/0 4276440 : [DBG] osd.133 
10.174.81.142:6803/21366 reported failed by osd.17 10.174.81.132:6806/9188
2013-08-16 09:44:23.753952 mon.0 10.174.81.132:6788/0 4276441 : [DBG] osd.133 
10.174.81.142:6803/21366 reported failed by osd.27 10.174.81.133:6822/5389
2013-08-16 09:44:23.753982 mon.0 10.174.81.132:6788/0 4276442 : [INF] osd.133 
10.174.81.142:6803/21366 failed (3 reports from 3 peers after 2013-08-16 
09:44:38.753913 >= grace 20.00)


2013-08-16 09:47:10.229099 mon.0 10.174.81.132:6788/0 4276646 : [INF] pgmap 
v17035216: 32424 pgs: 32424 active+clean; 6630 GB data, 21420 GB used, 371 TB / 
392 TB avail; 0B/s rd, 622KB/s wr, 85op/s

Why are osds 'reported failed' during scrubbing?
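(One quick way to check, as a sketch: pg states in "ceph pg dump" include
scrubbing/deep scrubbing, so running something like

    ceph pg dump | grep scrubbing

at the moment the failure reports appear should show whether a deep scrub of the
rgw index pg is in flight; "ceph -w" shows the failure reports and pg state
changes live.)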

--
Regards 
Dominik 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping osd / continuously reported as failed

2013-08-19 Thread Mostowiec Dominik
Hi,
> Yes, it definitely can as scrubbing takes locks on the PG, which will prevent 
> reads or writes while the message is being processed (which will involve the 
> rgw index being scanned).
Is it possible to tune the scrubbing config to eliminate slow requests and OSDs
being marked down while a large rgw bucket index is being scrubbed?

--
Regards
Dominik

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.8 Cuttlefish released

2013-08-19 Thread James Harper
> 
> We've made another point release for Cuttlefish. This release contains a
> number of fixes that are generally not individually critical, but do trip
> up users from time to time, are non-intrusive, and have held up under
> testing.
> 
> Notable changes include:
> 
>  * librados: fix async aio completion wakeup
>  * librados: fix aio completion locking
>  * librados: fix rare deadlock during shutdown

Could any of these be causing the segfaults I'm seeing in tapdisk rbd? Are 
these fixes in dumpling?

Thanks

James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping osd / continuously reported as failed

2013-08-19 Thread Gregory Farnum
On Mon, Aug 19, 2013 at 3:09 PM, Mostowiec Dominik
 wrote:
> Hi,
>> Yes, it definitely can as scrubbing takes locks on the PG, which will 
>> prevent reads or writes while the message is being processed (which will 
>> involve the rgw index being scanned).
> Is it possible to tune the scrubbing config to eliminate slow requests and
> OSDs being marked down while a large rgw bucket index is being scrubbed?

Unfortunately not, or we would have mentioned it before. :/ There are
some proposals for sharding bucket indexes that would ameliorate this
problem, and on Cuttlefish or Dumpling the OSD won't get marked down,
but it will still block incoming requests on that object (ie, requests
to access the bucket) while the scrubbing is in place.
That said, that improvement might be sufficient since you haven't
actually shown us how long the object scrub takes.
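(A rough way to measure that, as a sketch; the pool name and pg id below are
placeholders, only the object name comes from the logs earlier in this thread:

    ceph osd map .rgw.buckets .dir.1585245.1    # find the pg holding the bucket index object
    ceph pg deep-scrub 16.2a                    # pg id taken from the previous output
    grep deep-scrub /var/log/ceph/ceph.log      # "deep-scrub starts" ... "deep-scrub ok" timestamps

The gap between the "starts" and "ok" lines gives the overall scrub time for that pg.)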
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.8 Cuttlefish released

2013-08-19 Thread Sage Weil
On Mon, 19 Aug 2013, James Harper wrote:
> > 
> > We've made another point release for Cuttlefish. This release contains a
> > number of fixes that are generally not individually critical, but do trip
> > up users from time to time, are non-intrusive, and have held up under
> > testing.
> > 
> > Notable changes include:
> > 
> >  * librados: fix async aio completion wakeup
> >  * librados: fix aio completion locking
> >  * librados: fix rare deadlock during shutdown
> 
> Could any of these be causing the segfaults I'm seeing in tapdisk rbd? 
> Are these fixes in dumpling?

They are also in the dumpling branch and 0.67.1.  They might explain it... 
not a slam dunk though.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.8 Cuttlefish released

2013-08-19 Thread James Harper
> On Mon, 19 Aug 2013, James Harper wrote:
> > >
> > > We've made another point release for Cuttlefish. This release contains a
> > > number of fixes that are generally not individually critical, but do trip
> > > up users from time to time, are non-intrusive, and have held up under
> > > testing.
> > >
> > > Notable changes include:
> > >
> > >  * librados: fix async aio completion wakeup
> > >  * librados: fix aio completion locking
> > >  * librados: fix rare deadlock during shutdown
> >
> > Could any of these be causing the segfaults I'm seeing in tapdisk rbd?
> > Are these fixes in dumpling?
> 
> They are also in the dumpling branch and 0.67.1.  They might explain it...
> not a slam dunk though.
> 

Just upgraded to test and no joy - tapdisk still segfaults

Thanks

James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Mark.

What are the design considerations for breaking large files into 4M chunks
rather than storing the large file directly?

Thanks,
Guang



 From: Mark Kirkwood 
To: Guang Yang  
Cc: "ceph-users@lists.ceph.com"  
Sent: Monday, August 19, 2013 5:18 PM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 

On 19/08/13 18:17, Guang Yang wrote:

>    3. Some industry research shows that one issue of file systems is the
> metadata-to-data ratio, in terms of both access and storage, and some
> techniques combine small files into large physical files to reduce the
> ratio (Haystack, for example). If we want to use Ceph to store photos,
> should this be a concern, as Ceph uses one physical file per object?

If you use Ceph as a pure object store, and get and put data via the 
basic rados api then sure, one client data object will be stored in one 
Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike 
api) then each client data object will be broken up into chunks at the 
rados level (typically 4M sized chunks).
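(As a concrete illustration with the plain rados CLI; the pool and object names
below are made up:

    rados -p data put photo-123.jpg ./photo-123.jpg    # one client object -> one rados object
    rados -p data stat photo-123.jpg                   # reports the full size in that single object

whereas an S3/Swift PUT of the same file through radosgw ends up as multiple
~4M rados objects in the gateway's data pool.)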


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Greg.

Some comments inline...

On Sunday, August 18, 2013, Guang Yang  wrote:

Hi ceph-users,
>This is Guang and I am pretty new to ceph, glad to meet you guys in the 
>community!
>
>
>After walking through some documents of Ceph, I have a couple of questions:
>  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
>to handle different work-loads (from KB to GB), with corresponding performance 
>report?

Not really; any comparison would be highly biased depending on your Amazon ping 
and your Ceph cluster. We've got some internal benchmarks where Ceph looks 
good, but they're not anything we'd feel comfortable publishing.
 [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact
on the comparison.
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use meta-server to store the logical-to-physical mapping within 
memory and avoid disk I/O lookup for file reading, is the concern valid for 
Ceph (in terms of latency to read file)?

These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any 
IO to find object locations; CephFS only does IO if the inode you request has 
fallen out of the MDS cache (not terribly likely in general). This shouldn't be 
an issue...
[Guang] Regarding "CephFS only does IO if the inode you request has fallen out of
the MDS cache": my understanding is that if we use CephFS, we will need to
interact with RADOS twice, first to retrieve the metadata (file attributes,
owner, etc.) and a second time to load the data, and both times will need disk
I/O for the inode and the data. Is my understanding correct? The approach some
other storage systems take is to cache the file handle in memory, so that the
I/O to read the inode can be avoided.
 
  3. Some industry research shows that one issue of file systems is the
metadata-to-data ratio, in terms of both access and storage, and some techniques
combine small files into large physical files to reduce the ratio (Haystack, for
example). If we want to use Ceph to store photos, should this be a concern, as
Ceph uses one physical file per object?

...although this might be. The issue basically comes down to how many disk 
seeks are required to retrieve an item, and one way to reduce that number is to 
hack the filesystem by keeping a small number of very large files and
calculating (or caching) where different objects are inside that file. Since
Ceph is designed for MB-sized objects it doesn't go to these lengths to 
optimize that path like Haystack might (I'm not familiar with Haystack in 
particular).
That said, you need some pretty extreme latency requirements before this 
becomes an issue and if you're also looking at HDFS or S3 I can't imagine 
you're in that ballpark. You should be fine. :)
[Guang] Yep, that makes a lot sense.
-Greg

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Gregory Farnum
On Monday, August 19, 2013, Guang Yang wrote:

> Thanks Greg.
>
> Some comments inline...
>
> On Sunday, August 18, 2013, Guang Yang wrote:
>
> Hi ceph-users,
> This is Guang and I am pretty new to ceph, glad to meet you guys in the
> community!
>
> After walking through some documents of Ceph, I have a couple of questions:
>   1. Is there any comparison between Ceph and AWS S3, in terms of the
> ability to handle different work-loads (from KB to GB), with corresponding
> performance report?
>
>
> Not really; any comparison would be highly biased depending on your Amazon
> ping and your Ceph cluster. We've got some internal benchmarks where Ceph
> looks good, but they're not anything we'd feel comfortable publishing.
>  [Guang] Yeah, I mean solely the server-side time, regardless of the RTT
> impact on the comparison.
>
>   2. Looking at some industry solutions for distributed storage, GFS /
> Haystack / HDFS all use meta-server to store the logical-to-physical
> mapping within memory and avoid disk I/O lookup for file reading, is the
> concern valid for Ceph (in terms of latency to read file)?
>
>
> These are very different systems. Thanks to CRUSH, RADOS doesn't need to
> do any IO to find object locations; CephFS only does IO if the inode you
> request has fallen out of the MDS cache (not terribly likely in general).
> This shouldn't be an issue...
> [Guang] Regarding "CephFS only does IO if the inode you request has fallen out
> of the MDS cache": my understanding is that if we use CephFS, we will need to
> interact with RADOS twice, first to retrieve the metadata (file attributes,
> owner, etc.) and a second time to load the data, and both times will need disk
> I/O for the inode and the data. Is my understanding correct? The approach some
> other storage systems take is to cache the file handle in memory, so that the
> I/O to read the inode can be avoided.
>

In the worst case this can happen with CephFS, yes. However, the client is
not accessing metadata directly; it's going through the MetaData Server,
which caches (lots of) metadata on its own, and the client can get leases
as well (so it doesn't need to go to the MDS for each access, and can cache
information on its own). The typical case is going to depend quite a lot on
your scale.
That said, I'm not sure why you'd want to use CephFS for a
small-object store when you could just use raw RADOS, and avoid all the
posix overheads. Perhaps I've misunderstood your use case?
-Greg



>
>
>   3. Some industry research shows that one issue of file systems is the
> metadata-to-data ratio, in terms of both access and storage, and some
> techniques combine small files into large physical files to reduce the
> ratio (Haystack, for example). If we want to use Ceph to store photos,
> should this be a concern, as Ceph uses one physical file per object?
>
>
> ...although this might be. The issue basically comes down to how many disk
> seeks are required to retrieve an item, and one way to reduce that number
> is to hack the filesystem by keeping a small number of very large files and
> calculating (or caching) where different objects are inside that file.
> Since Ceph is designed for MB-sized objects it doesn't go to these lengths
> to optimize that path like Haystack might (I'm not familiar with Haystack
> in particular).
> That said, you need some pretty extreme latency requirements before this
> becomes an issue and if you're also looking at HDFS or S3 I can't imagine
> you're in that ballpark. You should be fine. :)
> [Guang] Yep, that makes a lot sense.
> -Greg
>
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Mark Kirkwood

On 20/08/13 13:27, Guang Yang wrote:

Thanks Mark.

What are the design considerations for breaking large files into 4M chunks
rather than storing the large file directly?




Quoting Wolfgang from previous reply:

=> which is a good thing in terms of replication and OSD usage
distribution


...which covers what I would have said quite well :-)

Cheers

Mark


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image

2013-08-19 Thread David Zafman

Transferring this back to ceph-users.  Sorry, I can't help with rbd issues.
One thing I will say is that if you are mounting an rbd device with a 
filesystem on a machine to export ftp, you can't also export the same device 
via iSCSI.

David Zafman
Senior Developer
http://www.inktank.com

On Aug 19, 2013, at 8:39 PM, PJ  wrote:

> 2013/8/14 David Zafman 
> 
> On Aug 12, 2013, at 7:41 PM, Josh Durgin  wrote:
> 
> > On 08/12/2013 07:18 PM, PJ wrote:
> >>
> >> If the target rbd device is only mapped on one virtual machine, format it as
> >> ext4 and mount it in two places
> >>   mount /dev/rbd0 /nfs --> for nfs server usage
> >>   mount /dev/rbd0 /ftp  --> for ftp server usage
> >> nfs and ftp servers run on the same virtual machine. Will the file system
> >> (ext4) help to handle the simultaneous access from nfs and ftp?
> > 
> > I doubt that'll work perfectly on a normal disk, although rbd should
> > behave the same in this case. There are going to be some
> > issues when the same files are modified at once by the ftp and nfs
> > servers. You could run ftp on an nfs client on a different machine
> > safely.
> >
> 
> 
> Modern Linux kernels will do a bind mount when a block device is mounted on 2 
> different directories.   Think directory hard links.  Simultaneous access 
> will NOT corrupt ext4, but as Josh said modifying the same file at once by 
> ftp and nfs isn't going to produce good results.  With file locking 2 nfs 
> clients could coordinate using advisory locking.
> 
> David Zafman
> Senior Developer
> http://www.inktank.com
> 
> 
> The first issue is reproduced, but there are changes to the system configuration. 
> Due to a hardware shortage, we only have one physical machine; it has one OSD 
> installed and runs 6 virtual machines. There is only one monitor (wistor-003) and 
> one FTP server (wistor-004); the other virtual machines are iSCSI servers.
> 
> The log size is big because when we enable the FTP service for an rbd device, we 
> have an rbd map retry loop in case mapping fails (retry rbd map every 10 sec, 
> for up to 3 minutes). Please download the monitor log from the link below:
> https://www.dropbox.com/s/88cb9q91cjszuug/ceph-mon.wistor-003.log.zip
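(For context, such a retry loop is essentially a small shell wrapper; a sketch,
with the image name made up since only the pool name "rex" appears in this thread:

    for i in $(seq 1 18); do
        rbd map rex/ftp-image && break   # 18 attempts x 10 sec = 3 minutes
        sleep 10
    done
)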
> 
> Here are the operation steps:
> 1. The pool rex is created
>Around 2013-08-20 09:16:38~09:16:39
> 2. First attempt to map the rbd device on wistor-004; it fails (all retries 
> failed)
>Around 2013-08-20 09:17:43~09:20:46 (180 sec)
> 3. Tried a second time and it works, but there are still 9 failures in the retry loop
>Around 2013-08-20 09:20:48~09:22:10 (82 sec)
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com