Re: [ceph-users] 2x replication: A BIG warning

2016-12-11 Thread Wido den Hollander

> Op 9 december 2016 om 22:31 schreef Oliver Humpage :
> 
> 
> 
> > On 7 Dec 2016, at 15:01, Wido den Hollander  wrote:
> > 
> > I would always run with min_size = 2 and manually switch to min_size = 1 if 
> > the situation really requires it at that moment.
> 
> Thanks for this thread, it’s been really useful.
> 
> I might have misunderstood, but does min_size=2 also mean that writes have to 
> wait for at least 2 OSDs to have data written before the write is confirmed? 
> I always assumed this would have a noticeable effect on performance and so 
> left it at 1.
> 
> Our use case is RBDs being exported as iSCSI for ESXi. OSDs are journalled on 
> enterprise SSDs, servers are linked with 10Gb, and we’re generally getting 
> very acceptable speeds. Any idea as to how upping min_size to 2 might affect 
> things, or should we just try it and see?
> 

As David already said, when all OSDs are up and in for a PG Ceph will wait for 
ALL OSDs to Ack the write. Writes in RADOS are always synchronous.

Only when OSDs go down do you need at least min_size OSDs up before writes or
reads are accepted.

So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to take 
place.
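
As a minimal sketch (the pool name "rbd" is an assumption), both settings can be
checked and changed at runtime:

# check the current replication settings
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# keep 3 copies, require 2 of them to be available for I/O to proceed
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2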

Wido

> Oliver.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread V Plus
Hi Udo,
I am not sure I understood what you said.
Did you mean that the 'dd' writes also got cached on the OSD nodes, or something else?


On Sun, Dec 11, 2016 at 10:46 PM, Udo Lembke  wrote:

> Hi,
> but I assume you are also measuring cache effects in this scenario - the OSD nodes have
> cached the writes in the file buffer
> (because of this, the latency should be very small).
>
> Udo
>
> On 12.12.2016 03:00, V Plus wrote:
> > Thanks Somnath!
> > As you recommended, I executed:
> > dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
> > dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
> >
> > Then the output results look more reasonable!
> > Could you tell me why??
> >
> > Btw, the purpose of my run is to test the performance of rbd in ceph.
> > Does my case mean that before every test, I have to "initialize" all
> > the images???
> >
> > Great thanks!!
> >
> > On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy  > > wrote:
> >
> > Fill up the image with big write (say 1M) first before reading and
> > you should see sane throughput.
> >
> >
> >
> > Thanks & Regards
> >
> > Somnath
> >
> > *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> > ] *On Behalf Of *V Plus
> > *Sent:* Sunday, December 11, 2016 5:44 PM
> > *To:* ceph-users@lists.ceph.com 
> > *Subject:* [ceph-users] Ceph performance is too good
> (impossible..)...
> >
> >
> >
> > Hi Guys,
> >
> > we have a ceph cluster with 6 machines (6 OSD per host).
> >
> > 1. I created 2 images in Ceph, and map them to another host A
> > (*outside* the Ceph cluster). On host A, I
> > got */dev/rbd0* and */dev/rbd1*.
> > 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio
> > job descriptions can be found below)
> > *"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output
> > b.txt  & wait"*
> > 3. After the test, in a.txt, we got *bw=1162.7MB/s*, in b.txt,
> > we get *bw=3579.6MB/s*.
> > The results do NOT make sense because there is only one NIC on
> > host A, and its limit is 10 Gbps (1.25GB/s).
> >
> > I suspect it is because of the cache setting.
> > But I am sure that in file */etc/ceph/ceph.conf* on host A, I
> > already added:
> > *[client]*
> > *rbd cache = false*
> >
> > Could anyone give me a hint what is missing? why
> > Thank you very much.
> >
> > *fioA.job:*
> > [A]
> > direct=1
> > group_reporting=1
> > unified_rw_reporting=1
> > size=100%
> > time_based=1
> > filename=/dev/rbd0
> > rw=read
> > bs=4MB
> > numjobs=16
> > ramp_time=10
> > runtime=20
> >
> > *fioB.job:*
> > [B]
> > direct=1
> > group_reporting=1
> > unified_rw_reporting=1
> > size=100%
> > time_based=1
> > filename=/dev/rbd1
> > rw=read
> > bs=4MB
> > numjobs=16
> > ramp_time=10
> > runtime=20
> >
> > *Thanks...*
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Udo Lembke
Hi,
but I assume you are also measuring cache effects in this scenario - the OSD nodes have
cached the writes in the file buffer
(because of this, the latency should be very small).

Udo

On 12.12.2016 03:00, V Plus wrote:
> Thanks Somnath!
> As you recommended, I executed:
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
>
> Then the output results look more reasonable!
> Could you tell me why??
>
> Btw, the purpose of my run is to test the performance of rbd in ceph.
> Does my case mean that before every test, I have to "initialize" all
> the images???
>
> Great thanks!!
>
> On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy  > wrote:
>
> Fill up the image with big write (say 1M) first before reading and
> you should see sane throughput.
>
>  
>
> Thanks & Regards
>
> Somnath
>
> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *V Plus
> *Sent:* Sunday, December 11, 2016 5:44 PM
> *To:* ceph-users@lists.ceph.com 
> *Subject:* [ceph-users] Ceph performance is too good (impossible..)...
>
>  
>
> Hi Guys,
>
> we have a ceph cluster with 6 machines (6 OSD per host). 
>
> 1. I created 2 images in Ceph, and map them to another host A
> (*outside* the Ceph cluster). On host A, I
> got */dev/rbd0* and */dev/rbd1*.
> 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio
> job descriptions can be found below)
> *"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output
> b.txt  & wait"*
> 3. After the test, in a.txt, we got *bw=1162.7MB/s*, in b.txt,
> we get *bw=3579.6MB/s*.
> The results do NOT make sense because there is only one NIC on
> host A, and its limit is 10 Gbps (1.25GB/s).
>
> I suspect it is because of the cache setting.
> But I am sure that in file */etc/ceph/ceph.conf* on host A, I
> already added:
> *[client]*
> *rbd cache = false*
>
> Could anyone give me a hint what is missing? why
> Thank you very much.
>
> *fioA.job:*
> [A]
> direct=1
> group_reporting=1
> unified_rw_reporting=1
> size=100%
> time_based=1
> filename=/dev/rbd0
> rw=read
> bs=4MB
> numjobs=16
> ramp_time=10
> runtime=20
>
> *fioB.job:*
> [B]
> direct=1
> group_reporting=1
> unified_rw_reporting=1
> size=100%
> time_based=1
> filename=/dev/rbd1
> rw=read
> bs=4MB
> numjobs=16
> ramp_time=10
> runtime=20
>
> *Thanks...*
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Somnath Roy
I generally do a 1M seq write to fill up the device. Block size doesn't matter
here, but a bigger block size fills the device faster, and that's why people use it.
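
As a hedged sketch of such a preconditioning run (the device path and job name
are assumptions, not from the thread; with no size= option fio writes the whole
block device):

# sequential 1M writes across the entire mapped image, bypassing the page cache
sudo fio --name=precondition --filename=/dev/rbd0 --rw=write --bs=1M \
     --direct=1 --ioengine=libaio --iodepth=16 --numjobs=1

Run it once per image (here rbd0; repeat for rbd1) before the read tests.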

From: V Plus [mailto:v.plussh...@gmail.com]
Sent: Sunday, December 11, 2016 7:03 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

Thanks!

One more question, what do you mean by "bigger" ?
Do you mean that bigger block size (say, I will run read test with bs=4K, then 
I need to first write the rbd with bs>4K?)? or size that is big enough to cover 
the area where the test will be executed?


On Sun, Dec 11, 2016 at 9:54 PM, Somnath Roy wrote:
A block needs to be written before read otherwise you will get funny result. 
For example, in case of flash (depending on how FW is implemented) , it will 
mostly return you 0 if a block is not written. Now, I have seen some flash FW 
is really inefficient on manufacturing this data (say 0) if not written and 
some are really fast.
So, to get predictable result you should be always reading a block that is 
written. In a device say half of the block is written and you are doing a full 
device random reads , you will get unpredictable/spiky/imbalanced result.
Same with rbd as well, consider it as a storage device and behavior would be 
similar. So, it is always recommended to precondition (fill up) a rbd image 
with bigger block seq write before you do any synthetic test on that. Now, for 
filestore backend added advantage of preconditioning rbd will be the files in 
the filesystem will be created beforehand.

Thanks & Regards
Somnath

From: V Plus [mailto:v.plussh...@gmail.com]
Sent: Sunday, December 11, 2016 6:01 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

Thanks Somnath!
As you recommended, I executed:
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1

Then the output results look more reasonable!
Could you tell me why??

Btw, the purpose of my run is to test the performance of rbd in ceph. Does my 
case mean that before every test, I have to "initialize" all the images???

Great thanks!!

On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy wrote:
Fill up the image with big write (say 1M) first before reading and you should 
see sane throughput.

Thanks & Regards
Somnath
From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a ceph cluster with 6 machines (6 OSD per host).
1. I created 2 images in Ceph, and map them to another host A (outside the Ceph 
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job 
descriptions can be found below)
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt, we got bw=1162.7MB/s, in b.txt, we get 
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its 
limit is 10 Gbps (1.25GB/s).

I suspect it is because of the cache setting.
But I am sure that in file /etc/ceph/ceph.conf on host A,I already added:
[client]
rbd cache = false

Could anyone give me a hint what is missing? why
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread V Plus
Thanks!

One more question: what do you mean by "bigger"?
Do you mean a bigger block size (say, if I will run a read test with bs=4K, do I
need to first write the rbd with bs>4K)? Or a size that is big enough
to cover the area where the test will be executed?


On Sun, Dec 11, 2016 at 9:54 PM, Somnath Roy 
wrote:

> A block needs to be written before read otherwise you will get funny
> result. For example, in case of flash (depending on how FW is implemented)
> , it will mostly return you 0 if a block is not written. Now, I have seen
> some flash FW is really inefficient on manufacturing this data (say 0) if
> not written and some are really fast.
>
> So, to get predictable result you should be always reading a block that is
> written. In a device say half of the block is written and you are doing a
> full device random reads , you will get unpredictable/spiky/imbalanced
> result.
>
> Same with rbd as well, consider it as a storage device and behavior would
> be similar. So, it is always recommended to precondition (fill up) a rbd
> image with bigger block seq write before you do any synthetic test on that.
> Now, for filestore backend added advantage of preconditioning rbd will be
> the files in the filesystem will be created beforehand.
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* V Plus [mailto:v.plussh...@gmail.com]
> *Sent:* Sunday, December 11, 2016 6:01 PM
> *To:* Somnath Roy
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Ceph performance is too good (impossible..)...
>
>
>
> Thanks Somnath!
>
> As you recommended, I executed:
>
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
>
> dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1
>
>
>
> Then the output results look more reasonable!
>
> Could you tell me why??
>
>
>
> Btw, the purpose of my run is to test the performance of rbd in ceph. Does
> my case mean that before every test, I have to "initialize" all the
> images???
>
>
>
> Great thanks!!
>
>
>
> On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy 
> wrote:
>
> Fill up the image with big write (say 1M) first before reading and you
> should see sane throughput.
>
>
>
> Thanks & Regards
>
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *V Plus
> *Sent:* Sunday, December 11, 2016 5:44 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] Ceph performance is too good (impossible..)...
>
>
>
> Hi Guys,
>
> we have a ceph cluster with 6 machines (6 OSD per host).
>
> 1. I created 2 images in Ceph, and map them to another host A (*outside *the
> Ceph cluster). On host A, I got */dev/rbd0* and* /dev/rbd1*.
>
> 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job
> descriptions can be found below)
>
> *"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  &
> wait"*
>
> 3. After the test, in a.txt, we got *bw=1162.7MB/s*, in b.txt, we get
> *bw=3579.6MB/s*.
>
> The results do NOT make sense because there is only one NIC on host A, and
> its limit is 10 Gbps (1.25GB/s).
>
>
>
> I suspect it is because of the cache setting.
>
> But I am sure that in file */etc/ceph/ceph.conf* on host A,I already
> added:
>
> *[client]*
>
> *rbd cache = false*
>
>
>
> Could anyone give me a hint what is missing? why
>
> Thank you very much.
>
>
>
> *fioA.job:*
>
> *[A]*
>
> *direct=1*
>
> *group_reporting=1*
>
> *unified_rw_reporting=1*
>
> *size=100%*
>
> *time_based=1*
>
> *filename=/dev/rbd0*
>
> *rw=read*
>
> *bs=4MB*
>
> *numjobs=16*
>
> *ramp_time=10*
>
> *runtime=20*
>
>
>
> *fioB.job:*
>
> *[B]*
>
> *direct=1*
>
> *group_reporting=1*
>
> *unified_rw_reporting=1*
>
> *size=100%*
>
> *time_based=1*
>
> *filename=/dev/rbd1*
>
> *rw=read*
>
> *bs=4MB*
>
> *numjobs=16*
>
> *ramp_time=10*
>
> *runtime=20*
>
>
>
> *Thanks...*
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Somnath Roy
A block needs to be written before it is read, otherwise you will get funny results.
For example, in the case of flash (depending on how the FW is implemented), it will
mostly return 0 if a block has not been written. Now, I have seen some flash FW
that is really inefficient at manufacturing this data (say, 0) for unwritten blocks,
and some that are really fast.
So, to get predictable results you should always be reading blocks that have been
written. If only half of a device's blocks are written and you are doing full-device
random reads, you will get unpredictable/spiky/imbalanced results.
The same applies to rbd: consider it a storage device and the behavior will be
similar. So, it is always recommended to precondition (fill up) an rbd image with a
bigger-block seq write before you do any synthetic test on it. For the filestore
backend, an added advantage of preconditioning the rbd is that the files in
the filesystem will be created beforehand.
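
A minimal sketch of filling a whole mapped image with dd (the device path is an
assumption; oflag=direct keeps the fill from inflating the client page cache):

SIZE=$(sudo blockdev --getsize64 /dev/rbd0)   # image size in bytes
sudo dd if=/dev/zero of=/dev/rbd0 bs=1M count=$((SIZE / 1048576)) oflag=direct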

Thanks & Regards
Somnath

From: V Plus [mailto:v.plussh...@gmail.com]
Sent: Sunday, December 11, 2016 6:01 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

Thanks Somnath!
As you recommended, I executed:
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1

Then the output results look more reasonable!
Could you tell me why??

Btw, the purpose of my run is to test the performance of rbd in ceph. Does my 
case mean that before every test, I have to "initialize" all the images???

Great thanks!!

On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy wrote:
Fill up the image with big write (say 1M) first before reading and you should 
see sane throughput.

Thanks & Regards
Somnath
From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a ceph cluster with 6 machines (6 OSD per host).
1. I created 2 images in Ceph, and map them to another host A (outside the Ceph 
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job 
descriptions can be found below)
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt, we got bw=1162.7MB/s, in b.txt, we get 
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its 
limit is 10 Gbps (1.25GB/s).

I suspect it is because of the cache setting.
But I am sure that in file /etc/ceph/ceph.conf on host A,I already added:
[client]
rbd cache = false

Could anyone give me a hint what is missing? why
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread V Plus
Thanks.
Then how can we avoid this if I want to test the ceph rbd performance?

BTW, it seems that is not the case.
I followed what Somnath said, and got reasonable results.
But I am still confused.

On Sun, Dec 11, 2016 at 8:59 PM, JiaJia Zhong 
wrote:

> >> 3. After the test, in a.txt, we got *bw=1162.7MB/s*, in b.txt, we get
> *bw=3579.6MB/s*.
>
> mostly, due to your kernel buffer of client host
>
>
> -- Original Message --
> From: "Somnath Roy"
> Date: Mon, Dec 12, 2016 09:47 AM
> To: "V Plus"; "CEPH list"
> Subject: Re: [ceph-users] Ceph performance is too good
> (impossible..)...
>
>
> Fill up the image with big write (say 1M) first before reading and you
> should see sane throughput.
>
>
>
> Thanks & Regards
>
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *V Plus
> *Sent:* Sunday, December 11, 2016 5:44 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] Ceph performance is too good (impossible..)...
>
>
>
> Hi Guys,
>
> we have a ceph cluster with 6 machines (6 OSD per host).
>
> 1. I created 2 images in Ceph, and map them to another host A (*outside *the
> Ceph cluster). On host A, I got */dev/rbd0* and* /dev/rbd1*.
>
> 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job
> descriptions can be found below)
>
> *"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  &
> wait"*
>
> 3. After the test, in a.txt, we got *bw=1162.7MB/s*, in b.txt, we get
> *bw=3579.6MB/s*.
>
> The results do NOT make sense because there is only one NIC on host A, and
> its limit is 10 Gbps (1.25GB/s).
>
>
>
> I suspect it is because of the cache setting.
>
> But I am sure that in file */etc/ceph/ceph.conf* on host A,I already
> added:
>
> *[client]*
>
> *rbd cache = false*
>
>
>
> Could anyone give me a hint what is missing? why
>
> Thank you very much.
>
>
>
> *fioA.job:*
>
> *[A]*
>
> *direct=1*
>
> *group_reporting=1*
>
> *unified_rw_reporting=1*
>
> *size=100%*
>
> *time_based=1*
>
> *filename=/dev/rbd0*
>
> *rw=read*
>
> *bs=4MB*
>
> *numjobs=16*
>
> *ramp_time=10*
>
> *runtime=20*
>
>
>
> *fioB.job:*
>
> *[B]*
>
> *direct=1*
>
> *group_reporting=1*
>
> *unified_rw_reporting=1*
>
> *size=100%*
>
> *time_based=1*
>
> *filename=/dev/rbd1*
>
> *rw=read*
>
> *bs=4MB*
>
> *numjobs=16*
>
> *ramp_time=10*
>
> *runtime=20*
>
>
>
> *Thanks...*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread V Plus
Thanks Somnath!
As you recommended, I executed:
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1

Then the output results look more reasonable!
Could you tell me why??

Btw, the purpose of my run is to test the performance of rbd in Ceph. Does
my case mean that before every test, I have to "initialize" all the
images?

Great thanks!!

On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy 
wrote:

> Fill up the image with big write (say 1M) first before reading and you
> should see sane throughput.
>
>
>
> Thanks & Regards
>
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *V Plus
> *Sent:* Sunday, December 11, 2016 5:44 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] Ceph performance is too good (impossible..)...
>
>
>
> Hi Guys,
>
> we have a ceph cluster with 6 machines (6 OSD per host).
>
> 1. I created 2 images in Ceph, and map them to another host A (*outside *the
> Ceph cluster). On host A, I got */dev/rbd0* and* /dev/rbd1*.
>
> 2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job
> descriptions can be found below)
>
> *"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  &
> wait"*
>
> 3. After the test, in a.txt, we got *bw=1162.7MB/s*, in b.txt, we get
> *bw=3579.6MB/s*.
>
> The results do NOT make sense because there is only one NIC on host A, and
> its limit is 10 Gbps (1.25GB/s).
>
>
>
> I suspect it is because of the cache setting.
>
> But I am sure that in file */etc/ceph/ceph.conf* on host A,I already
> added:
>
> *[client]*
>
> *rbd cache = false*
>
>
>
> Could anyone give me a hint what is missing? why
>
> Thank you very much.
>
>
>
> *fioA.job:*
>
> *[A]*
>
> *direct=1*
>
> *group_reporting=1*
>
> *unified_rw_reporting=1*
>
> *size=100%*
>
> *time_based=1*
>
> *filename=/dev/rbd0*
>
> *rw=read*
>
> *bs=4MB*
>
> *numjobs=16*
>
> *ramp_time=10*
>
> *runtime=20*
>
>
>
> *fioB.job:*
>
> *[B]*
>
> *direct=1*
>
> *group_reporting=1*
>
> *unified_rw_reporting=1*
>
> *size=100%*
>
> *time_based=1*
>
> *filename=/dev/rbd1*
>
> *rw=read*
>
> *bs=4MB*
>
> *numjobs=16*
>
> *ramp_time=10*
>
> *runtime=20*
>
>
>
> *Thanks...*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread JiaJia Zhong
>> 3. After the test, in a.txt, we got bw=1162.7MB/s, in b.txt, we get
>> bw=3579.6MB/s.

Mostly, this is due to the kernel buffer cache on your client host.
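
If client-side caching is suspected, a hedged way to rule it out before each run
(standard Linux commands, not something given in this thread):

sync                                         # flush dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
free -h                                      # confirm buffers/cache have shrunk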



 
 
-- Original Message --
From: "Somnath Roy"
Date: Mon, Dec 12, 2016 09:47 AM
To: "V Plus"; "CEPH list"
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

 
  
Fill up the image with big write (say 1M) first before reading and you should
see sane throughput.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a ceph cluster with 6 machines (6 OSD per host).
1. I created 2 images in Ceph, and map them to another host A (outside the Ceph
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job
descriptions can be found below)
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt, we got bw=1162.7MB/s, in b.txt, we get
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its
limit is 10 Gbps (1.25GB/s).

I suspect it is because of the cache setting.
But I am sure that in file /etc/ceph/ceph.conf on host A, I already added:
[client]
rbd cache = false

Could anyone give me a hint what is missing? why
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...
 
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Somnath Roy
Fill up the image with a big write (say 1M) first before reading, and you should
see sane throughput.

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a ceph cluster with 6 machines (6 OSD per host).
1. I created 2 images in Ceph, and map them to another host A (outside the Ceph 
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I start two fio job to perform READ test on rbd0 and rbd1. (fio job 
descriptions can be found below)
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt, we got bw=1162.7MB/s, in b.txt, we get 
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its 
limit is 10 Gbps (1.25GB/s).

I suspect it is because of the cache setting.
But I am sure that in file /etc/ceph/ceph.conf on host A,I already added:
[client]
rbd cache = false

Could anyone give me a hint what is missing? why
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread V Plus
Hi Guys,
we have a ceph cluster with 6 machines (6 OSDs per host).
1. I created 2 images in Ceph, and mapped them to another host A (*outside* the
Ceph cluster). On host A, I got */dev/rbd0* and */dev/rbd1*.
2. I start two fio jobs to perform a READ test on rbd0 and rbd1 (the fio job
descriptions can be found below):
*"sudo fio fioA.job -output a.txt & sudo fio fioB.job -output b.txt & wait"*
3. After the test, in a.txt we got *bw=1162.7MB/s*; in b.txt, we get
*bw=3579.6MB/s*.
The results do NOT make sense because there is only one NIC on host A, and
its limit is 10 Gbps (1.25GB/s).

I suspect it is because of the cache setting.
But I am sure that in the file */etc/ceph/ceph.conf* on host A, I already added:
*[client]*
*rbd cache = false*

Could anyone give me a hint about what is missing, and why?
Thank you very much.

*fioA.job:*
*[A]*
*direct=1*
*group_reporting=1*
*unified_rw_reporting=1*
*size=100%*
*time_based=1*
*filename=/dev/rbd0*
*rw=read*
*bs=4MB*
*numjobs=16*
*ramp_time=10*
*runtime=20*

*fioB.job:*
*[B]*
*direct=1*
*group_reporting=1*
*unified_rw_reporting=1*
*size=100%*
*time_based=1*
*filename=/dev/rbd1*
*rw=read*
*bs=4MB*
*numjobs=16*
*ramp_time=10*
*runtime=20*

*Thanks...*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device

2016-12-11 Thread John Spray
On Sun, Dec 11, 2016 at 4:38 PM, Mike Miller  wrote:
> Hi,
>
> you have given up too early. rsync is not a nice workload for cephfs, in
> particular, most linux kernel clients cephfs will end up caching all
> inodes/dentries. The result is that mds servers crash due to memory
> limitations. And rsync basically screens all inodes/dentries so it is the
> perfect application to gobble up all inode caps.

While historically there have been client bugs that prevented the MDS
from enforcing cache size limits, this is not expected behaviour --
manually calling drop_caches is most definitely a workaround and not
something that I would recommend unless you're stuck with a
known-buggy client version for some reason.

Just felt the need to point that out in case people started picking
this up as a best practice!

Cheers,
John

> We run a cronjob script flush_cache every few (2-5) minutes:
>
> #!/bin/bash
> echo 2 > /proc/sys/vm/drop_caches
>
> on all machines that mount cephfs. There is no performance drop in the
> client machines, but happily, the mds congestion is solved by this.
>
> We also went the rbd way before this, but for large rbd images we much
> prefer cephfs instead.
>
> Regards,
>
> Mike
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph erasure code profile

2016-12-11 Thread rmichel
Hi!

some questions about EC profiles:

Is it possible to create a profile that spreads the chunks over 3
racks with 6 hosts inside each rack (k=8, m=3, plugin=jerasure)?

If so, how?
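
As a hedged illustration only (not an answer from this thread): creating the profile
itself is a one-liner, but with rack as the failure domain and only 3 racks, the
k+m=11 chunks cannot each land in a distinct rack, so a custom CRUSH rule that picks
3 racks and then several hosts per rack would also be needed. The profile name below
is made up; the failure-domain option is called ruleset-failure-domain on older
releases and crush-failure-domain on newer ones.

ceph osd erasure-code-profile set ec-8-3 k=8 m=3 plugin=jerasure ruleset-failure-domain=rack
ceph osd erasure-code-profile get ec-8-3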


Thanks!
Michel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device

2016-12-11 Thread Mike Miller

Hi,

you have given up too early. rsync is not a nice workload for cephfs; in
particular, with most Linux kernel clients cephfs will end up caching all
inodes/dentries. The result is that MDS servers crash due to memory
limitations. And rsync basically scans all inodes/dentries, so it is
the perfect application to gobble up all inode caps.


We run a cronjob script flush_cache every few (2-5) minutes:

#!/bin/bash
# echo 2 drops dentries and inodes only; it does not touch the page cache
echo 2 > /proc/sys/vm/drop_caches

on all machines that mount cephfs. There is no performance drop on the
client machines, but happily, the mds congestion is solved by this.


We also went the rbd way before this, but for large rbd images we much 
prefer cephfs instead.


Regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sandisk SSDs

2016-12-11 Thread Mike Miller

Hi,

Some time ago, when starting a Ceph evaluation cluster, I used SSDs with
similar specs. I would strongly recommend against it: during normal
operation things might be fine, but wait until the first disk fails and
things have to be backfilled.


If you still try, please let me know how things turn out for you.

Regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com