Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-04 Thread Colombo Marco
Hi Christian,



On 04/02/15 02:39, Christian Balzer ch...@gol.com wrote:

On Tue, 3 Feb 2015 15:16:57 + Colombo Marco wrote:

 Hi all,
  I have to build a new Ceph storage cluster. After I've read the
 hardware recommendations and some threads from this mailing list, I would
 like to buy these servers:
 

Nick mentioned a number of things already I totally agree with, so don't
be surprised if some of this feels like a repeat.

 OSD:
 SSG-6027R-E1R12L -
 http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12L.cfm
 Intel Xeon e5-2630 v2 64 GB RAM
As Nick said, v3 and more RAM might be helpful; depending on your use case
(small writes versus large ones), even faster CPUs as well.

OK, we'll switch from v2 to v3 and from 64 to 96 GB of RAM.


 LSI 2308 IT
 2 x SSD Intel DC S3700 400GB
 2 x SSD Intel DC S3700 200GB
Why the separation of SSDs? 
They aren't going to be that busy with regards to the OS.

We would like to use 400GB SSD for a cache pool, and 200GB SSD for the 
journaling.


Get a case like Nick mentioned with 2 2.5" bays in the back, put 2 DC S3700
400GBs in there (connected to onboard 6Gb/s SATA3), and partition them so that
you have a RAID1 for the OS and plain partitions for the journals of the now
12 OSD HDDs in your chassis.
Of course this optimization in terms of cost and density comes with a
price: if one SSD should fail, you will have 6 OSDs down.
Given how reliable the Intels are this is unlikely, but it is something you need
to consider.

If you want to limit the impact of a SSD failure and have just 2 OSD
journals per SSD, get a chassis like the one above and 4 DC S3700 200GB,
RAID10 them for the OS and put 2 journal partitions on each. 

I did the same with 8 3TB HDDs and 4 DC S3700 100GBs; the HDDs (and the CPU,
with 4KB IOPS) are the limiting factor, not the SSDs.
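In practice that layout boils down to something like the following (a rough
sketch only -- device names, partition sizes and the ceph.conf snippet are
assumptions, not a tested recipe):

  # two DC S3700 400GB in the rear hot-swap bays, assumed to appear as /dev/sdm and /dev/sdn
  # one small partition each for a RAID1 OS, the rest split into 6 journal partitions per SSD
  sgdisk -n 1:0:+40G -c 1:os /dev/sdm
  sgdisk -n 1:0:+40G -c 1:os /dev/sdn
  for i in 2 3 4 5 6 7; do
      sgdisk -n ${i}:0:+20G -c ${i}:journal-$i /dev/sdm
      sgdisk -n ${i}:0:+20G -c ${i}:journal-$i /dev/sdn
  done
  # software RAID1 for the OS
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdm1 /dev/sdn1
  # each OSD then points its journal at one of the plain partitions, e.g. in ceph.conf:
  #   [osd.0]
  #   osd journal = /dev/sdm2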

 8 x HDD Seagate Enterprise 6TB
Are you really sure you need that density? One disk failure will result in
a LOT of data movement once these become somewhat full.
If you were to go for a 12 OSD node as described above, consider 4TB ones
for the same overall density, while having more IOPS and likely the same
price or less.

We chose the 6TB disks because we need a lot of storage in a small
number of servers, and we prefer servers with not too many disks.
However, we plan to use at most 80% of each 6TB disk.


 2 x 40GbE for backend network
You'd be lucky to write more than 800MB/s sustained to your 8 HDDs
(remember they will have to deal with competing reads and writes, this is
not a sequential synthetic write benchmark). 
Incidentally 1GB/s to 1.2GB/s (depending on configuration) would also be
the limit of your journal SSDs.
Other than backfilling caused by cluster changes (OSD removed/added), your
limitation is nearly always going to be IOPS, not bandwidth.


OK, after some discussion, we'll switch to 2 x 10 GbE.


So 2x10GbE or if you're comfortable with it (I am ^o^) an Infiniband
backend (can be cheaper, less latency, plans for RDMA support in
Ceph) should be more than sufficient.

 2 x 10GbE  for public network
 
 META/MON:
 
 SYS-6017R-72RFTP -
 http://www.supermicro.com/products/system/1U/6017/SYS-6017R-72RFTP.cfm
 2 x Intel Xeon e5-2637 v2, 4 x SSD Intel DC S3500 240GB RAID 1+0
You're likely to get better performance and of course MUCH better
durability by using 2 DC S3700, at about the same price.

OK, we'll switch to 2 x SSD DC S3700.


 128 GB RAM
Total overkill for a MON, but I have no idea about MDS and RAM never 
hurts.

OK, we'll switch from 128 to 96 GB.


In your follow-up you mentioned 3 mons; I would suggest putting 2 more
mons (only, not MDS) on OSD nodes and make sure that within the IP
numbering the real mons have the lowest IP addresses, because the MON
with the lowest IP becomes master (and thus the busiest). 
This way you can survive a loss of 2 nodes and still have a valid quorum.
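To double-check which mon actually is the leader at any point, something like
this works (a sketch; names and output layout depend on your cluster):

  # the mon with the lowest IP should show up as the quorum leader
  ceph quorum_status -f json-pretty | grep -E 'quorum_leader_name|"addr"'
  # or simply list the monitors and their addresses
  ceph mon dump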

Ok, got it



Christian

 2 x 10 GbE
 
 What do you think?
 Any feedbacks, advices, or ideas are welcome!
 
 Thanks so much
 
 Regards,


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com  Global OnLine Japan/Fusion Communications
http://www.gol.com/

Thanks so much!




[ceph-users] rbd recover tool for stopped ceph cluster

2015-02-04 Thread minchen
rbd recover tool is an offline tool to recover an rbd image when the ceph cluster is
stopped.
It is useful when you urgently need to recover an rbd image from a broken ceph
cluster.
I have used a similar prototype tool to successfully recover a large rbd
image in a ceph cluster of 900+ osds, so I think this tool can help
us keep rbd data safe.
Before running this tool, just make sure to stop all ceph services:
ceph-mon, ceph-osd, ceph-mds.
Currently, this tool supports both raw images and snapshots; clone images
will be supported soon.


Here is the pull request:
https://github.com/ceph/ceph/pull/3611


[ceph-users] Question about output message and object update for ceph class

2015-02-04 Thread Dennis Chen
Hello,

I wrote a ceph client using the rados library to execute a function upon an object.

CLIENT SIDE CODE
===
int main()
{
  ...
    strcpy(in, "from client");
    err = rados_exec(io, objname, "devctl", "devctl_op", in,
                     strlen(in), out, 128);
    if (err < 0) {
        fprintf(stderr, "rados_exec() failed: %s\n", strerror(-err));
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        exit(1);
    }
    out[err] = '\0';
    printf("err = %d, exec result out = %s, in = %s\n", err, out, in);
  ...
}

CLASS CODE IN OSD SIDE
==
static int devctl_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  ...

  i = cls_cxx_stat(hctx, &size, NULL);
  if (i < 0)
    return i;

  bufferlist read_bl, write_bl;
  i = cls_cxx_read(hctx, 0, size, &read_bl);
  if (i < 0) {
    CLS_ERR("cls_cxx_read failed");
    return i;
  }

  // we generate our reply
  out->append("Hello, ");
  if (in->length() == 0)
    out->append("world");
  else
    out->append(*in);
  out->append("!");

#if 1
  const char *tstr = "from devctl func";
  write_bl.append(tstr);
  i = cls_cxx_write(hctx, size, write_bl.length(), &write_bl);
  if (i < 0) {
    CLS_ERR("cls_cxx_write failed: %s", strerror(-i));
    return i;
  }
#endif

  // this return value will be returned back to the librados caller
  return 0;
}

I found that if I update the content of the object by calling
cls_cxx_write(), then 'out' will be null on the client side;
otherwise out will be "Hello, from client!".

Can anybody here give some hints?

-- 
Den


Re: [ceph-users] snapshoting on btrfs vs xfs

2015-02-04 Thread Sage Weil
On Wed, 4 Feb 2015, Cristian Falcas wrote:
 Hi,
 
 We have an openstack installation that uses ceph as the storage backend.
 
 We use mainly snapshot and boot from snapshot from an original
 instance with a 200gb disk. Something like this:
 1. import original image
 2. make volume from image (those 2 steps were done only once, when we
 installed openstack)
 3. boot main instance from volume, update the db inside
 4. snapshot the instance
 5. make volumes from previous snapshot
 6. boot test instances from those volumes (the last 3 steps take less than
 30s)
 
 
 Currently the fs is btrfs and we are in love with the solution: the
 snapshots are instant and boot from snapshot is also instant. It cut
 our tests time (compared with the vmware solution + netap storage)
 from 12h to 2h. With vmware we were spending 10h with what now is done
 in a few seconds.

That's great to hear!

 I was wondering if the fs matters in this case, because we are a
 little worried about using btrfs after reading all the horror stories here
 and on the btrfs mailing list.
 
 Is the snapshoting performed by ceph or by the fs? Can we switch to
 xfs and have the same capabilities: instant snapshot + instant boot
 from snapshot?

The feature set and capabilities are identical.  The difference is that on 
btrfs we are letting btrfs do the efficient copy-on-write cloning when we 
touch a snapshotted object while with XFS we literally copy the object 
file (usually 4MB) on the first write.  You will likely see some penalty 
in the boot-from-clone scenario, although I have no idea how significant 
it will be.  On the other hand, we've also seen that btrfs fragmentation 
over time can lead to poor performance relative to XFS.
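For reference, the RBD-level operations behind that OpenStack snapshot /
boot-from-snapshot workflow are roughly the following (a sketch; pool and
image names are made up, and cloning requires format 2 images):

  rbd snap create rbd/golden-image@base         # instant, metadata-only snapshot
  rbd snap protect rbd/golden-image@base        # required before cloning
  rbd clone rbd/golden-image@base rbd/test-vm-01   # instant; objects are copied up lazily on first write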

So, no clear answer, really.  Sorry!

If you do stick with btrfs, please report back here and share what you see
as far as stability (along with the kernel version(s) you are using). Most
of the preference for XFS over btrfs is based on FUD (in the literal sense), and I
don't think we have seen much in the way of real user reports here in a
while.

Thanks!
sage



Re: [ceph-users] PG to pool mapping?

2015-02-04 Thread Lincoln Bryant
On Feb 4, 2015, at 3:27 PM, Gregory Farnum wrote:

 On Wed, Feb 4, 2015 at 1:20 PM, Chad William Seys
 cws...@physics.wisc.edu wrote:
 Hi all,
   How do I determine which pool a PG belongs to?
   (Also, is it the case that all objects in a PG belong to one pool?)
 
 PGs are of the form 1.a2b3c4. The part prior to the period is the
 pool ID; the part following distinguishes the PG and is based on the
 hash range it covers. :)
 
 Yes, all objects in a PG belong to a single pool; they are hash ranges
 of the pool.
 -Greg

You can also map the pool number to the pool name with:

'ceph osd lspools'

Similarly, 'rados lspools' will print out the pools line by line.
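Putting the two together on the command line looks roughly like this (a
sketch; pool names and PG IDs will differ on your cluster):

  # numeric pool IDs and their names
  ceph osd lspools
  # PG IDs, whose prefix (before the dot) is the pool ID
  ceph pg dump pgs_brief | awk '{print $1}' | head
  # e.g. a PG named 2.1f4 lives in pool 2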

Cheers,
Lincoln



Re: [ceph-users] PG to pool mapping?

2015-02-04 Thread Gregory Farnum
On Wed, Feb 4, 2015 at 1:20 PM, Chad William Seys
cws...@physics.wisc.edu wrote:
 Hi all,
How do I determine which pool a PG belongs to?
(Also, is it the case that all objects in a PG belong to one pool?)

PGs are of the form 1.a2b3c4. The part prior to the period is the
pool ID; the part following distinguishes the PG and is based on the
hash range it covers. :)

Yes, all objects in a PG belong to a single pool; they are hash ranges
of the pool.
-Greg


Re: [ceph-users] snapshoting on btrfs vs xfs

2015-02-04 Thread Cristian Falcas
Thank you for the clarifications.

We will try to report back, but I'm not sure our use case is relevant.
We are trying to use every dirty trick to speed up the VMs.

We have only 1 replica, and 2 pools.

One pool with journal on disk, where the original instance exists (we
want to keep this one safe).

The second pool is for the test machines and has the journal in RAM,
so this part is very volatile. We don't really care, because if the
worst happens and we have a power loss, we just redo the pool and start
new instances. The journal in RAM did wonders for us in terms of
read/write speed.



On Wed, Feb 4, 2015 at 11:22 PM, Sage Weil s...@newdream.net wrote:
 On Wed, 4 Feb 2015, Cristian Falcas wrote:
 Hi,

 We have an openstack installation that uses ceph as the storage backend.

 We use mainly snapshot and boot from snapshot from an original
 instance with a 200gb disk. Something like this:
 1. import original image
 2. make volume from image (those 2 steps were done only once, when we
 installed openstack)
 3. boot main instance from volume, update the db inside
 4. snapshot the instance
 5. make volumes from previous snapshot
 6. boot test instances from those volumes (the last 3 steps take less than
 30s)


 Currently the fs is btrfs and we are in love with the solution: the
 snapshots are instant and boot from snapshot is also instant. It cut
 our tests time (compared with the vmware solution + netap storage)
 from 12h to 2h. With vmware we were spending 10h with what now is done
 in a few seconds.

 That's great to hear!

 I was wondering if the fs matters in this case, because we are a
 little worried about using btrfs after reading all the horror stories here
 and on the btrfs mailing list.

 Is the snapshoting performed by ceph or by the fs? Can we switch to
 xfs and have the same capabilities: instant snapshot + instant boot
 from snapshot?

 The feature set and capabilities are identical.  The difference is that on
 btrfs we are letting btrfs do the efficient copy-on-write cloning when we
 touch a snapshotted object while with XFS we literally copy the object
 file (usually 4MB) on the first write.  You will likely see some penalty
 in the boot-from-clone scenario, although I have no idea how significant
 it will be.  On the other hand, we've also seen that btrfs fragmentation
 over time can lead to poor performance relative to XFS.

 So, no clear answer, really.  Sorry!

 If you do stick with btrfs, please report back here and share what you see
 as far as stability (along with the kernel version(s) you are using). Most
 of the preference for XFS over btrfs is based on FUD (in the literal sense), and I
 don't think we have seen much in the way of real user reports here in a
 while.

 Thanks!
 sage



[ceph-users] snapshoting on btrfs vs xfs

2015-02-04 Thread Cristian Falcas
Hi,

We have an openstack installation that uses ceph as the storage backend.

We use mainly snapshot and boot from snapshot from an original
instance with a 200gb disk. Something like this:
1. import original image
2. make volume from image (those 2 steps were done only once, when we
installed openstack)
3. boot main instance from volume, update the db inside
4. snapshot the instance
5. make volumes from previous snapshot
6. boot test instances from those volumes (the last 3 steps take less than 30s)


Currently the fs is btrfs and we are in love with the solution: the
snapshots are instant and boot from snapshot is also instant. It cut
our test time (compared with the vmware solution + NetApp storage)
from 12h to 2h. With vmware we were spending 10h on what is now done
in a few seconds.

I was wondering if the fs matters in this case, because we are a
little worried about using btrfs after reading all the horror stories here
and on the btrfs mailing list.

Is the snapshotting performed by ceph or by the fs? Can we switch to
xfs and have the same capabilities: instant snapshot + instant boot
from snapshot?

Best regards,
Cristian Falcas


[ceph-users] PG to pool mapping?

2015-02-04 Thread Chad William Seys
Hi all,
   How do I determine which pool a PG belongs to?
   (Also, is it the case that all objects in a PG belong to one pool?)

Thanks!
C.


Re: [ceph-users] snapshoting on btrfs vs xfs

2015-02-04 Thread Daniel Schwager
Hi Cristian,


 We will try to report back, but I'm not sure our use case is relevant.
 We are trying to use every dirty trick to speed up the VMs.

we have the same use-case.

 The second pool is for the tests machines and has the journal in ram,
 so this part is very volatile. We don't really care, because if the
 worst happens and we have a power loss we just redo the pool and start
 new instances. Journal in ram did wonders for us in terms of
 read/write speed.

How do you handle a reboot of a node serving your pool that has the journals in
RAM?
All the mons know about the volatile pool - do you remove & recreate the
pool automatically after rebooting this node?

Did you try to enable rbd caching? Is there a write-performance benefit to using
the journal in RAM instead of enabling rbd caching on the client (OpenStack) side?
I thought that with rbd caching the write performance should be fast enough.

regards
Danny




Re: [ceph-users] RGW put file question

2015-02-04 Thread baijia...@126.com
When I put the same file with multiple threads, sometimes putting the file head oid
(ref.ioctx.operate(ref.oid, op)) returns -ECANCELED. I think this is normal.
But the function then jumps to done_cancel and runs complete_update_index_cancel (or
index_op.cancel()), and the osd executes rgw_bucket_complete_op with
CLS_RGW_OP_ADD and a file size of 0;
so at this moment the bucket index records the file size as zero. I think this is not
right.




baijia...@126.com

From: Yehuda Sadeh-Weinraub
Date: 2015-02-05 12:06
To: baijiaruo
CC: ceph-users
Subject: Re: [ceph-users] RGW put file question


- Original Message -
 From: baijia...@126.com
 To: ceph-users ceph-users@lists.ceph.com
 Sent: Wednesday, February 4, 2015 5:47:03 PM
 Subject: [ceph-users] RGW put file question
 
 When a put of a file fails and the function
 RGWRados::cls_obj_complete_cancel runs,
 why do we use CLS_RGW_OP_ADD rather than CLS_RGW_OP_CANCEL?
 Why do we set poolid to -1 and epoch to 0?
 

I'm not sure, could very well be a bug. It should definitely be OP_CANCEL, but 
going back through the history it seems like it has been OP_ADD since at least 
argonaut. How did you notice it? It might explain a couple of issues that we've 
been seeing.

Yehuda


Re: [ceph-users] snapshoting on btrfs vs xfs

2015-02-04 Thread Cristian Falcas
We want to use this script as a service for start/stop (but it wasn't
tested yet):

#!/bin/bash
# chkconfig: - 50 90
# description: make a journal for osd.0 in ram
start () {
  [ -f /dev/shm/osd.0.journal ] || ceph-osd -i 0 --mkjournal
}
stop ()  {
  service ceph stop osd.0 && ceph-osd -i 0 --flush-journal && rm -f \
    /dev/shm/osd.0.journal
}
case "$1" in
  start) start;;
  stop)  stop;;
esac
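If it does get promoted to a real init script, installing it would look
something like this (the path and service name are made up):

  cp osd0-ram-journal /etc/init.d/osd0-ram-journal
  chmod +x /etc/init.d/osd0-ram-journal
  chkconfig --add osd0-ram-journal     # honours the "# chkconfig: - 50 90" header above
  chkconfig osd0-ram-journal on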

Also, we didn't see any noticeable improvements with rbd caching, but
we didn't perform any tests to measure it; that's just our impression.



On Thu, Feb 5, 2015 at 12:09 AM, Daniel Schwager
daniel.schwa...@dtnet.de wrote:
 Hi Cristian,


 We will try to report back, but I'm not sure our use case is relevant.
 We are trying to use every dirty trick to speed up the VMs.

 we have the same use-case.

 The second pool is for the tests machines and has the journal in ram,
 so this part is very volatile. We don't really care, because if the
 worst happens and we have a power loss we just redo the pool and start
 new instances. Journal in ram did wonders for us in terms of
 read/write speed.

 How do you handle a reboot of a node serving your pool that has the journals
 in RAM?
 All the mons know about the volatile pool - do you remove & recreate the
 pool automatically after rebooting this node?

 Did you try to enable rbd caching? Is there a write-performance benefit to using
 the journal in RAM instead of enabling rbd caching on the client (OpenStack) side?
 I thought that with rbd caching the write performance should be fast enough.

 regards
 Danny


Re: [ceph-users] snapshoting on btrfs vs xfs

2015-02-04 Thread Lindsay Mathieson
On 5 February 2015 at 07:22, Sage Weil s...@newdream.net wrote:

  Is the snapshoting performed by ceph or by the fs? Can we switch to
  xfs and have the same capabilities: instant snapshot + instant boot
  from snapshot?

 The feature set and capabilities are identical.  The difference is that on
 btrfs we are letting btrfs do the efficient copy-on-write cloning when we
 touch a snapshotted object while with XFS we literally copy the object
 file (usually 4MB) on the first write.



Are ceph snapshots really that much faster when using btrfs underneath? One
of the problems we have with ceph is that snapshot take/restore is insanely
slow, tens of minutes - but we are using xfs.


-- 
Lindsay


[ceph-users] RGW put file question

2015-02-04 Thread baijia...@126.com
When a put of a file fails and the function
RGWRados::cls_obj_complete_cancel runs,
why do we use CLS_RGW_OP_ADD rather than CLS_RGW_OP_CANCEL?
Why do we set poolid to -1 and epoch to 0?



baijia...@126.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Josh Durgin

On 02/05/2015 07:44 AM, Udo Lembke wrote:

Hi all,
is there any command to flush the rbd cache like the
echo 3 > /proc/sys/vm/drop_caches for the os cache?


librbd exposes it as rbd_invalidate_cache(), and qemu uses it
internally, but I don't think you can trigger that via any user-facing
qemu commands.

Exposing it through the admin socket would be pretty simple though:

http://tracker.ceph.com/issues/2468

You can also just detach and reattach the device to flush the rbd cache.
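For a libvirt-managed guest, that detach/re-attach cycle could look roughly
like this (domain name, target device and the XML extraction are assumptions):

  # save the disk definition so it can be re-attached unchanged
  virsh dumpxml myvm | xmllint --xpath '//disk[target/@dev="vdb"]' - > vdb-disk.xml
  # detaching closes the RBD image and drops its librbd cache
  virsh detach-device myvm vdb-disk.xml --live
  virsh attach-device myvm vdb-disk.xml --live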

Josh


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi Dan,
I mean qemu-kvm, i.e. librbd.
But how can I tell kvm to flush the buffer?

Udo

On 05.02.2015 07:59, Dan Mick wrote:
 On 02/04/2015 10:44 PM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 Udo
 Do you mean the kernel rbd or librbd?  The latter responds to flush
 requests from the hypervisor.  The former...I'm not sure it has a
 separate cache.




Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Dan Mick
On 02/04/2015 10:44 PM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?
 
 Udo

Do you mean the kernel rbd or librbd?  The latter responds to flush
requests from the hypervisor.  The former...I'm not sure it has a
separate cache.

-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi Josh,
thanks for the info.

detach/reattach should be fine for me, because it's only for
performance testing.

#2468 would be fine of course.

Udo

On 05.02.2015 08:02, Josh Durgin wrote:
 On 02/05/2015 07:44 AM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 librbd exposes it as rbd_invalidate_cache(), and qemu uses it
 internally, but I don't think you can trigger that via any user-facing
 qemu commands.

 Exposing it through the admin socket would be pretty simple though:

 http://tracker.ceph.com/issues/2468

 You can also just detach and reattach the device to flush the rbd cache.

 Josh



[ceph-users] command to flush rbd cache?

2015-02-04 Thread Udo Lembke
Hi all,
is there any command to flush the rbd cache like the
echo 3 > /proc/sys/vm/drop_caches for the os cache?

Udo


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Dan Mick
I don't know the details well; I know the device itself supports the
block-device-level cache-flush commands (I know there's a SCSI-specific
one but I don't know offhand if there's a device generic one) so the
guest OS can, and does, request flushing.  I can't remember if there's
also a qemu command to prompt the virtual device to flush without
telling the guest.

On 02/04/2015 11:08 PM, Udo Lembke wrote:
 Hi Dan,
 I mean qemu-kvm, i.e. librbd.
 But how can I tell kvm to flush the buffer?
 
 Udo
 
 On 05.02.2015 07:59, Dan Mick wrote:
 On 02/04/2015 10:44 PM, Udo Lembke wrote:
 Hi all,
 is there any command to flush the rbd cache like the
 echo 3 > /proc/sys/vm/drop_caches for the os cache?

 Udo
 Do you mean the kernel rbd or librbd?  The latter responds to flush
 requests from the hypervisor.  The former...I'm not sure it has a
 separate cache.


Re: [ceph-users] ceph Performance random write is more then sequential

2015-02-04 Thread Sumit Gaur
Yes. So far I have tried both options, and in both cases I am able to
get better sequential performance than random (as explained by Somnath). But
the performance numbers (IOPS, MBps) are way lower than with the default option. I can
understand that, as ceph is dealing with 1000 times more objects than with the
default option. So, keeping this in mind, I am running performance tests
for random writes only and leaving out the sequential tests. I'm still not sure how the
reports available on the internet from Intel and Mellanox show good numbers for
sequential writes; maybe they have enabled caching.

http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf

Thanks
sumit

On Thu, Feb 5, 2015 at 2:09 PM, Alexandre DERUMIER aderum...@odiso.com
wrote:

 Hi,

 What I saw after enabling the RBD cache is that it works as expected, meaning
 sequential writes have better MBps than random writes. Can somebody explain
 this behaviour?

 This is because rbd_cache merges coalesced IOs into bigger IOs, so it
 only helps with sequential workloads.

 You'll send fewer but bigger IOs to ceph, so less CPU.


 - Original message -
 From: Sumit Gaur sumitkg...@gmail.com
 To: Florent MONTHEL fmont...@flox-arts.net
 Cc: ceph-users ceph-users@lists.ceph.com
 Sent: Monday, 2 February 2015 03:54:36
 Subject: Re: [ceph-users] ceph Performance random write is more then
 sequential

 Hi All,
 What I saw after enabling the RBD cache is that it works as expected, meaning
 sequential writes have better MBps than random writes. Can somebody explain
 this behaviour? Is the RBD cache setting a must for a ceph cluster to behave
 normally?

 Thanks
 sumit

 On Mon, Feb 2, 2015 at 9:59 AM, Sumit Gaur  sumitkg...@gmail.com  wrote:



 Hi Florent,
 Cache tiering , No .

 ** Our Architecture :

 vdbench/FIO inside VM --> RBD without cache --> Ceph Cluster (6 OSDs + 3
 Mons)


 Thanks
 sumit

 [root@ceph-mon01 ~]# ceph -s
 cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a
 health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03
 monmap e1: 3 mons at {ceph-mon01=
 192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0
 }, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
 osdmap e603: 36 osds: 36 up, 36 in
 pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects
 522 GB used, 9349 GB / 9872 GB avail
 5120 active+clean


 On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL  fmont...@flox-arts.net
  wrote:

 Hi Sumit

 Do you have cache pool tiering activated ?
 Some feed-back regarding your architecture ?
 Thanks

 Sent from my iPad

  On 1 févr. 2015, at 15:50, Sumit Gaur  sumitkg...@gmail.com  wrote:
 
  Hi
  I have installed a 6 node ceph cluster and, to my surprise, when I ran rados
 bench I saw that random writes have better performance numbers than sequential
 writes. This is the opposite of normal disk behaviour. Can somebody let me know if
 I am missing some point about the ceph architecture here?











Re: [ceph-users] Question about output message and object update for ceph class

2015-02-04 Thread Dennis Chen
I take back the question, because I just found that for a successful
write operation in the class, there is *no* data in the out buffer...

On Wed, Feb 4, 2015 at 5:44 PM, Dennis Chen kernel.org@gmail.com wrote:
 Hello,

 I write a ceph client using rados lib to execute a funcution upon the object.

 CLIENT SIDE CODE
 ===
 int main()
 {
   ...
     strcpy(in, "from client");
     err = rados_exec(io, objname, "devctl", "devctl_op", in,
                      strlen(in), out, 128);
     if (err < 0) {
         fprintf(stderr, "rados_exec() failed: %s\n", strerror(-err));
         rados_ioctx_destroy(io);
         rados_shutdown(cluster);
         exit(1);
     }
     out[err] = '\0';
     printf("err = %d, exec result out = %s, in = %s\n", err, out, in);
   ...
 }

 CLASS CODE IN OSD SIDE
 ==
 static int devctl_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
 {
   ...

   i = cls_cxx_stat(hctx, &size, NULL);
   if (i < 0)
     return i;

   bufferlist read_bl, write_bl;
   i = cls_cxx_read(hctx, 0, size, &read_bl);
   if (i < 0) {
     CLS_ERR("cls_cxx_read failed");
     return i;
   }

   // we generate our reply
   out->append("Hello, ");
   if (in->length() == 0)
     out->append("world");
   else
     out->append(*in);
   out->append("!");

 #if 1
   const char *tstr = "from devctl func";
   write_bl.append(tstr);
   i = cls_cxx_write(hctx, size, write_bl.length(), &write_bl);
   if (i < 0) {
     CLS_ERR("cls_cxx_write failed: %s", strerror(-i));
     return i;
   }
 #endif

   // this return value will be returned back to the librados caller
   return 0;
 }

 I found that if I update the content of the object by calling
 cls_cxx_write(), then 'out' will be null on the client side;
 otherwise out will be "Hello, from client!".

 Can anybody here give some hints?

 --
 Den



-- 
Den


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-04 Thread Udo Lembke
Hi Marco,

Am 04.02.2015 10:20, schrieb Colombo Marco:
...
 We choosen the 6TB of disk, because we need a lot of storage in a small 
 amount of server and we prefer server with not too much disks.
 However we plan to use max 80% of a 6TB Disk
 

80% is too much! You will run into trouble.
Ceph doesn't write the data in an equal distribution. Sometimes I see a
difference of 20% in the usage between OSDs.

I recommend 60-70% as maximum.
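To see how uneven the distribution actually is on your cluster (which of
these commands is available depends on the Ceph release):

  ceph osd df            # per-OSD %USE and variance (newer releases)
  ceph pg dump osds      # older releases: kb_used / kb_avail per OSD
  # if a few OSDs run much fuller than the rest, reweighting can help
  ceph osd reweight-by-utilization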

Udo


Re: [ceph-users] Ceph Supermicro hardware recommendation

2015-02-04 Thread Christian Balzer

Hello,

On Wed, 4 Feb 2015 09:20:24 + Colombo Marco wrote:

 Hi Christian,
 
 
 
 On 04/02/15 02:39, Christian Balzer ch...@gol.com wrote:
 
 On Tue, 3 Feb 2015 15:16:57 + Colombo Marco wrote:
 
  Hi all,
   I have to build a new Ceph storage cluster. After I've read the
  hardware recommendations and some threads from this mailing list, I would
  like to buy these servers:
  
 
 Nick mentioned a number of things already I totally agree with, so don't
 be surprised if some of this feels like a repeat.
 
  OSD:
  SSG-6027R-E1R12L -
  http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12L.cfm
  Intel Xeon e5-2630 v2 64 GB RAM
 As nick said, v3 and more RAM might be helpful, depending on your use
 case (small writes versus large ones) even faster CPUs as well.
 
 Ok, we switch from v2 to v3 and from 64 to 96 GB of RAM.
 
 
  LSI 2308 IT
  2 x SSD Intel DC S3700 400GB
  2 x SSD Intel DC S3700 200GB
 Why the separation of SSDs? 
 They aren't going to be that busy with regards to the OS.
 
 We would like to use 400GB SSD for a cache pool, and 200GB SSD for the 
 journaling.

Don't, at least not like that.
First and foremost, SSD based OSDs/pools have different requirements,
especially when it comes to CPU. 
Mixing your HDD and SSD based OSDs in the same chassis is generally a bad
idea.
If you really want to use SSD based OSDs, go with at least Giant,
probably better even to wait for Hammer.
Otherwise your performance will be nowhere near the investment you're
making. 
Read up in the ML archives about SSD based clusters and their performance,
as well as cache pools.

Which brings us to the second point, cache pools are pretty pointless
currently when it comes to performance. So unless you're planning to use
EC pools, you will gain very little from them.
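For reference, should you go the EC + cache pool route anyway, wiring up a
tier is only a handful of commands (a sketch; the pool names are placeholders):

  ceph osd tier add ec-cold-pool ssd-hot-pool          # attach the SSD pool as a tier of the EC pool
  ceph osd tier cache-mode ssd-hot-pool writeback
  ceph osd tier set-overlay ec-cold-pool ssd-hot-pool  # send client traffic through the cache tier
  ceph osd pool set ssd-hot-pool hit_set_type bloom    # needed so the tiering agent can track hits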

Lastly, if you still want to do SSD based OSDs, go for something like this:
http://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-DC0TR.cfm
Add the fastest CPUs you can afford and voila, instant SSD based cluster
(replication of 2 should be fine with DC S3700). 
Now with _this_ particular type of nodes, you might want to consider 40GbE
links (front and back-end).
 
 
 Get a case like Nick mentioned with 2 2.5 bays in the back, put 2 DC
 S3700 400GBs in there (connected to onboard 6Gb/s SATA3), partition
 them so that you have a RAID1 for OS and plain partitions for the
 journals of the now 12
 OSD HDDs in your chassis. 
 Of course this optimization in terms of cost and density comes with a
 price, if one SSD should fail, you will have 6 OSDs down. 
 Given how reliable the Intels are this is unlikely, but something you
 need to consider.
 
 If you want to limit the impact of a SSD failure and have just 2 OSD
 journals per SSD, get a chassis like the one above and 4 DC S3700 200GB,
 RAID10 them for the OS and put 2 journal partitions on each. 
 
 I did the same with 8 3TB HDDs and 4 DC S3700 100GB, the HDDs (and CPU
 with 4KB IOPS), are the limiting factor, not the SSDs.
 
  8 x HDD Seagate Enterprise 6TB
 Are you really sure you need that density? One disk failure will result
 in a LOT of data movement once these become somewhat full.
 If you were to go for a 12 OSD node as described above, consider 4TB
 ones for the same overall density, while having more IOPS and likely
 the same price or less.
 
  We chose the 6TB disks because we need a lot of storage in a small
  number of servers, and we prefer servers with not too many disks.
  However, we plan to use at most 80% of each 6TB disk.

Fewer disks mean fewer IOPS and less bandwidth.
Reducing the number of servers (which are a fixed cost after all) is
understandable. But you have an option up there that gives you the same
density as with the 6TB disks, but with significantly improved
performance.
 
 
  2 x 40GbE for backend network
 You'd be lucky to write more that 800MB/s sustained to your 8 HDDs
 (remember they will have to deal with competing reads and writes, this
 is not a sequential synthetic write benchmark). 
 Incidentally 1GB/s to 1.2GB/s (depending on configuration) would also be
 the limit of your journal SSDs.
 Other than backfilling caused by cluster changes (OSD removed/added),
 your limitation is nearly always going to be IOPS, not bandwidth.
 
 
 Ok, after some discussion, we switch to 2 x 10 GbE.
 
 
 So 2x10GbE or if you're comfortable with it (I am ^o^) an Infiniband
 backend (can be cheaper, less latency, plans for RDMA support in
 Ceph) should be more than sufficient.
 
  2 x 10GbE  for public network
  
  META/MON:
  
  SYS-6017R-72RFTP -
  http://www.supermicro.com/products/system/1U/6017/SYS-6017R-72RFTP.cfm
  2 x Intel Xeon e5-2637 v2 4 x SSD Intel DC S3500 240GB raid 1+0
 You're likely to get better performance and of course MUCH better
 durability by using 2 DC S3700, at about the same price.
 
 Ok we switch to 2 x SSD DC S3700
 
 
  128 GB RAM
 Total overkill for a MON, but I have no idea about MDS and RAM never 
 hurts.
 
 Ok we switch from 128 to 96
 
Don't take my 

Re: [ceph-users] ceph Performance random write is more then sequential

2015-02-04 Thread Alexandre DERUMIER
Hi,

What I saw after enabling the RBD cache is that it works as expected, meaning
sequential writes have better MBps than random writes. Can somebody explain this
behaviour?

This is because rbd_cache merges coalesced IOs into bigger IOs, so it only
helps with sequential workloads.

You'll send fewer but bigger IOs to ceph, so less CPU.
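
One way to see the effect directly is to run the same 4k write workload
sequentially and randomly against an RBD image, with rbd cache on and off --
a sketch using fio's rbd engine (pool/image names and the exact option set
are assumptions):

  # assumes an existing test image: rbd create fio-test --size 10240 --pool rbd
  # toggle the cache via "rbd cache = true/false" in the [client] section of ceph.conf

  # sequential 4k writes -- rbd_cache can merge these into larger ops
  fio --name=seq  --ioengine=rbd --pool=rbd --rbdname=fio-test \
      --clientname=admin --rw=write     --bs=4k --iodepth=32 --runtime=60 --time_based

  # random 4k writes -- little for the cache to coalesce
  fio --name=rand --ioengine=rbd --pool=rbd --rbdname=fio-test \
      --clientname=admin --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based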


- Original message -
From: Sumit Gaur sumitkg...@gmail.com
To: Florent MONTHEL fmont...@flox-arts.net
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 2 February 2015 03:54:36
Subject: Re: [ceph-users] ceph Performance random write is more then
sequential

Hi All, 
What I saw after enabling the RBD cache is that it works as expected, meaning sequential
writes have better MBps than random writes. Can somebody explain this behaviour?
Is the RBD cache setting a must for a ceph cluster to behave normally?

Thanks 
sumit 

On Mon, Feb 2, 2015 at 9:59 AM, Sumit Gaur  sumitkg...@gmail.com  wrote: 



Hi Florent, 
Cache tiering , No . 

** Our Architecture : 

vdbench/FIO inside VM --> RBD without cache --> Ceph Cluster (6 OSDs + 3 Mons)


Thanks 
sumit 

[root@ceph-mon01 ~]# ceph -s 
cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a 
health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03 
monmap e1: 3 mons at {ceph-mon01= 
192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0
 }, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 
osdmap e603: 36 osds: 36 up, 36 in 
pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects 
522 GB used, 9349 GB / 9872 GB avail 
5120 active+clean 


On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL  fmont...@flox-arts.net  
wrote: 

Hi Sumit 

Do you have cache pool tiering activated ? 
Some feed-back regarding your architecture ? 
Thanks 

Sent from my iPad 

 On 1 févr. 2015, at 15:50, Sumit Gaur  sumitkg...@gmail.com  wrote: 
 
 Hi 
 I have installed a 6 node ceph cluster and, to my surprise, when I ran rados
 bench I saw that random writes have better performance numbers than sequential
 writes. This is the opposite of normal disk behaviour. Can somebody let me know if I
 am missing some point about the ceph architecture here?







