Re: [ceph-users] omap vs. xattr in librados

2018-09-12 Thread Gregory Farnum
Nope, there shouldn’t be any impact apart from the potential issues that
arise from breaking up the I/O stream, which in the case of either a
saturated or mostly-idle RADOS cluster should not be a problem.
-Greg
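[Editorial note: the guidance across this thread (small, bounded metadata can live in xattrs even on an EC pool; omap is for unpredictable or non-trivial kv data and is unsupported on EC pools; metadata-heavy objects belong in a replicated pool) can be sketched as a small selection helper. The byte and key-count thresholds below are illustrative assumptions, not Ceph limits.]

```python
# Rule-of-thumb helper encoding the advice in this thread.
# The "small" cutoffs are assumptions for illustration, not Ceph limits.

def choose_kv_storage(total_kv_bytes: int, num_keys: int, ec_pool: bool) -> str:
    """Pick where an object's key-value metadata should live."""
    small = total_kv_bytes <= 4096 and num_keys <= 32  # assumed cutoff
    if ec_pool:
        # omap is not supported on EC pools; small xattrs are replicated
        # across all shards, anything bigger belongs in a replicated pool.
        return "xattr" if small else "separate replicated pool (omap)"
    # On a replicated pool, unpredictable or non-trivial kv data -> omap.
    return "xattr" if small else "omap"

print(choose_kv_storage(512, 4, ec_pool=True))         # xattr
print(choose_kv_storage(1 << 20, 1000, ec_pool=True))  # separate replicated pool (omap)
print(choose_kv_storage(1 << 20, 1000, ec_pool=False)) # omap
```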
On Wed, Sep 12, 2018 at 9:24 PM Benjamin Cherian 
wrote:

> Greg, Paul,
>
> Thank you for the feedback. This has been very enlightening. One last
> question (for now at least). Are there any expected performance impacts
> from having I/O to multiple pools from the same client? (Given how RGW and
> CephFS store metadata, I would hope not, but I thought I'd ask.) Based on
> everything that has been described, it makes sense to put metadata-heavy
> objects (i.e., objects with a large fraction of kv data) in a
> replicated pool while putting the large blobs in an EC pool.
>
> Thanks again,
> Ben
>
> On Wed, Sep 12, 2018 at 1:05 PM Gregory Farnum  wrote:
>
>> On Tue, Sep 11, 2018 at 5:32 PM Benjamin Cherian <
>> benjamin.cher...@gmail.com> wrote:
>>
>>> Ok, that’s good to know. I was planning on using an EC pool. Maybe I'll
>>> store some of the larger kv pairs in their own objects or move the metadata
>>> into its own replicated pool entirely. If the storage mechanism is the
>>> same, is there a reason xattrs are supported and omap is not? (Or is there
>>> some hidden cost to storing kv pairs in an EC pool I’m unaware of, e.g.,
>>> does the kv data get replicated across all OSDs being used for a PG or
>>> something?)
>>>
>>
>> Yeah, if you're on an EC pool there isn't a good way to erasure-code
>> key-value data. So we willingly replicate xattrs across all the nodes
>> (since they are presumed to be small and limited in number — I think we
>> actually have total limits, but not sure?) but don't support omap at all
>> (as it's presumed to be a lot of data).
>>
>> Do note that if small objects are a large proportion of your data you
>> might prefer to put them in a replicated pool — in an EC pool you'd need
>> very small chunk sizes to get any non-replication happening anyway, and for
>> something in the 10KB range at a reasonable k+m you'd be dominated by
>> metadata size anyway.
>> -Greg
>>
>>
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Tue, Sep 11, 2018 at 1:46 PM Patrick Donnelly 
>>> wrote:
>>>
 On Tue, Sep 11, 2018 at 12:43 PM, Benjamin Cherian
  wrote:
 > On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum 
 wrote:
 >>
 >> 
 >> In general, if the key-value storage is of unpredictable or
 non-trivial
 >> size, you should use omap.
 >>
 >> At the bottom layer where the data is actually stored, they're
 likely to
 >> be in the same places (if using BlueStore, they are the same — in
 FileStore,
 >> a rados xattr *might* be in the local FS xattrs, or it might not).
 It is
 >> somewhat more likely that something stored in an xattr will get
 pulled into
 >> memory at the same time as the object's internal metadata, but that
 only
 >> happens if it's quite small (think the xfs or ext4 xattr rules).
 >
 >
 > Based on this description, if I'm planning on using Bluestore, there
 is no
 > particular reason to ever prefer using xattrs over omap (outside of
 ease of
 > use in the API), correct?

 You may prefer xattrs on bluestore if the metadata is small, and you
 may need to store the xattrs on an EC pool: omap is not supported on
 EC pools.

 --
 Patrick Donnelly

>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption issue with "rbd export-diff/import-diff"

2018-09-12 Thread Jason Dillaman
On Wed, Sep 12, 2018 at 10:15 PM  wrote:
>
> On 2018-09-12 17:35:16-07:00 Jason Dillaman wrote:
>
>
> Any chance you know the LBA or byte offset of the corruption so I can
> compare it against the log?
>
> The LBAs of the corruption are 0xA74F000 through 175435776

Are you saying the corruption starts at byte offset 175435776 from the
start of the RBD image? If so, that would correspond to object 0x29:

2018-09-12 21:22:17.117246 7f268928f0c0 20 librbd::DiffIterate: object
rbd_data.4b383f1e836edc.0029: list_snaps complete
2018-09-12 21:22:17.117249 7f268928f0c0 20 librbd::DiffIterate:   diff
[499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
end_exists=1
2018-09-12 21:22:17.117251 7f268928f0c0 20 librbd::DiffIterate:
diff_iterate object rbd_data.4b383f1e836edc.0029 extent
0~4194304 from [0,4194304]
2018-09-12 21:22:17.117268 7f268928f0c0 20 librbd::DiffIterate:  opos
0 buf 0~4194304 overlap
[499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
2018-09-12 21:22:17.117270 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 499712~4096 logical 172466176~4096
2018-09-12 21:22:17.117271 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 552960~4096 logical 172519424~4096
2018-09-12 21:22:17.117272 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 589824~4096 logical 172556288~4096
2018-09-12 21:22:17.117273 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3338240~4096 logical 175304704~4096
2018-09-12 21:22:17.117274 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3371008~4096 logical 175337472~4096
2018-09-12 21:22:17.117275 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3469312~4096 logical 175435776~4096  <---
2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3502080~4096 logical 175468544~4096
2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3534848~4096 logical 175501312~4096
2018-09-12 21:22:17.117277 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3633152~4096 logical 175599616~4096

... and I can see it being imported ...

2018-09-12 22:07:38.698380 7f23ab2ec0c0 20 librbd::io::ObjectRequest:
0x5615cb507da0 send: write rbd_data.38abe96b8b4567.0029
3469312~4096

Therefore, I don't see anything structurally wrong w/ the
export/import behavior. Just to be clear, did you freeze/coalesce the
filesystem before you took the snapshot?
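[Editorial note: the offset-to-object mapping above can be reproduced with a few lines of arithmetic, assuming the default 4 MiB RBD object size. Byte offset 0xA74F000 (175435776) lands in object 0x29 at in-object offset 3469312, which is exactly the extent flagged in the diff log.]

```python
# Map a byte offset in an RBD image to (object index, in-object offset),
# assuming the default 4 MiB object size (order 22).

OBJECT_SIZE = 4 * 1024 * 1024  # default rbd object size

def lba_to_object(byte_offset: int, object_size: int = OBJECT_SIZE):
    return divmod(byte_offset, object_size)

obj, off = lba_to_object(0xA74F000)  # 0xA74F000 == 175435776
print(f"object {obj:#06x}, in-object offset {off}")
# object 0x0029, in-object offset 3469312
```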

> On Wed, Sep 12, 2018 at 8:32 PM patrick.mcl...@sony.com wrote:
> >
> > Hi Jason,
> >
> > On 2018-09-10 11:15:45-07:00 ceph-users wrote:
> >
> > On 2018-09-10 11:04:20-07:00 Jason Dillaman wrote:
> >
> >
> > > In addition to this, we are seeing a similar type of corruption in
> > another use case when we migrate RBDs and snapshots across pools. In this
> > case we clone a version of an RBD (e.g. HEAD-3) to a new pool and rely
> > on 'rbd export-diff/import-diff' to restore the last 3 snapshots on top.
> > Here too we see cases of fsck and RBD checksum failures.
> > > We maintain various metrics and logs. Looking back at our data we
> > have seen the issue at a small scale for a while on Jewel, but the
> > frequency increased recently. The timing may have coincided with a move
> > to Luminous, but this may be coincidence. We are currently on Ceph 12.2.5.
> > > We are wondering if people are experiencing similar issues with
> > 'rbd export-diff / import-diff'. I'm sure many people use it to keep
> > backups in sync. Since they are backups, many people may not inspect the
> > data often. In our use case, we use this mechanism to keep data in sync
> > and actually need the data in the other location often. We are wondering
> > if anyone else has encountered any issues; it's quite possible that many
> > people have this issue but simply don't realize. We are likely hitting it
> > much more frequently due to the scale of our operation (tens of thousands
> > of syncs a day).
> >
> > If you are able to recreate this reliably without tiering, it would
> > assist in debugging if you could capture RBD debug logs during the
> > export along w/ the LBA of the filesystem corruption to compare
> > against.
> >
> > We haven't been able to reproduce this reliably; we haven't yet
> > figured out the exact conditions that cause it, we have just been
> > seeing it happen on some percentage of export/import-diff operations.
> >
> >
> > Logs from both export-diff and import-diff in a case where the result
> > gets corrupted are attached. Please let me know if you need any more
> > information.
> >
>
>
>
> --
> Jason
> /patrick.mcl...@sony.com



-- 
Jason


Re: [ceph-users] cephfs speed

2018-09-12 Thread Joe Comeau


Hi

Replying to the list - I replied directly by accident.
---
Sorry I was camping for a week and was disconnected without data for the most 
part

Yes it is over iSCSI - 2 iscsi nodes 
We've set both iscsi and vmware hosts to SUSE recommended settings
in addition we've set round robin iops limit to 1 (from default 1000) on each 
vm host

the iscsi servers don't appear to be busy at any point


>>> David Byte  9/1/2018 8:31 PM >>>

Is this over ISCSI then?
 
David Byte
Sr. Technology Strategist
SCE Enterprise Linux 
SCE Enterprise Storage
Alliances and SUSE Embedded
db...@suse.com
918.528.4422

>>> "Joe Comeau"  9/1/2018 8:21 PM >>>

Yes, I was referring to Windows Explorer copies, as that is what users
typically use, but also with Windows robocopy set to 32 threads.

The difference is we may go from a peak of 300MB/s to a more normal
100MB/s, down to a stall at 0-30MB/s; about every 7-8 seconds it stalls
to 0 MB/s, as reported by Windows Resource Monitor.

These are large files, up to multi-TB each, probably 12 TB in total, that
we are copying in fewer than 100 files.

thanks Joe





>>> David Byte  8/31/2018 1:17 PM >>>

Are these single threaded writes that you are referring to?  It certainly 
appears so from the thread, but I thought it would be good to confirm that 
before digging in further.
 
 
David Byte
Sr. Technology Strategist
SCE Enterprise Linux 
SCE Enterprise Storage
Alliances and SUSE Embedded
db...@suse.com
918.528.4422
 
From: ceph-users  on behalf of Joe Comeau 

Date: Friday, August 31, 2018 at 1:07 PM
To: "ceph-users@lists.ceph.com" , Peter Eisch 

Subject: Re: [ceph-users] cephfs speed
 
Are you using bluestore OSDs ?
 
if so, my thought process is that the issue we are having is with caching
and bluestore
 
see the thread on bluestore caching
"Re: [ceph-users] Best practices for allocating memory to bluestore cache"
 
 
Before, when we were on Jewel and filestore, we could get a much better
sustained write.
Now on bluestore we are not getting more than a sustained 2GB file write
before it drastically slows down; then it fluctuates from 0kb/s to
100MB/s and back and forth as it is writing.

Thanks Joe

>>> Peter Eisch  8/31/2018 10:31 AM >>>
[replying to myself]

I set aside cephfs and created an rbd volume. I get the same splotchy 
throughput with rbd as I was getting with cephfs. (image attached)

So, withdrawing this as a question here as a cephfs issue.

#backingout

peter


Peter Eisch​



On 8/30/18, 12:25 PM, "Peter Eisch"  wrote:

Thanks for the thought. It’s mounted with this entry in fstab (one line, if 
email wraps it):

cephmon-s01,cephmon-s02,cephmon-s03:/ /loam ceph 
noauto,name=clientname,secretfile=/etc/ceph/secret,noatime,_netdev 0 2

Pretty plain, but I'm open to tweaking!

peter

From: Gregory Farnum 
Date: Thursday, August 30, 2018 at 11:47 AM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] cephfs speed

How are you mounting CephFS? It may be that the cache settings are just set 
very badly for a 10G pipe. Plus rados bench is a very parallel large-IO 
benchmark and many benchmarks you might dump into a filesystem are definitely 
not. 
-Greg

On Thu, Aug 30, 2018 at 7:54 AM Peter Eisch 
 wrote:
Hi,

I have a cluster serving cephfs and it works. It’s just slow. Client is using 
the kernel driver. I can ‘rados bench’ writes to the cephfs_data pool at wire 
speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it is rare to 
get above 100Mb/s. Large file writes may start fast (2Gb/s) but within a minute 
slows. In the dashboard at the OSDs I get lots of triangles (it doesn't stream) 
which seems to be lots of starts and stops. By contrast the graphs show 
constant flow when using 'rados bench.'

I feel like I'm missing something obvious. What can I do to help diagnose this 
better or resolve the issue? 

Errata:
Version: 12.2.7 (on everything)
mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
mgr: cephmon-s02(active), standbys: 

Re: [ceph-users] help me turn off "many more objects than average"

2018-09-12 Thread Chad William Seys

Hi Paul,
 Yes, all monitors have been restarted.

Chad.


Re: [ceph-users] omap vs. xattr in librados

2018-09-12 Thread Gregory Farnum
On Tue, Sep 11, 2018 at 5:32 PM Benjamin Cherian 
wrote:

> Ok, that’s good to know. I was planning on using an EC pool. Maybe I'll
> store some of the larger kv pairs in their own objects or move the metadata
> into its own replicated pool entirely. If the storage mechanism is the
> same, is there a reason xattrs are supported and omap is not? (Or is there
> some hidden cost to storing kv pairs in an EC pool I’m unaware of, e.g.,
> does the kv data get replicated across all OSDs being used for a PG or
> something?)
>

Yeah, if you're on an EC pool there isn't a good way to erasure-code
key-value data. So we willingly replicate xattrs across all the nodes
(since they are presumed to be small and limited in number — I think we
actually have total limits, but not sure?) but don't support omap at all
(as it's presumed to be a lot of data).

Do note that if small objects are a large proportion of your data you might
prefer to put them in a replicated pool — in an EC pool you'd need very
small chunk sizes to get any non-replication happening anyway, and for
something in the 10KB range at a reasonable k+m you'd be dominated by
metadata size anyway.
-Greg
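[Editorial note: Greg's point about small objects being dominated by metadata on an EC pool can be made concrete with some hedged arithmetic. The 64 KiB BlueStore allocation unit below is an assumption (the Luminous-era HDD default for bluestore_min_alloc_size_hdd), and real overhead also includes per-shard onode metadata.]

```python
# Back-of-envelope for a 10 KB object on EC 4+2 vs 3x replication.
# min_alloc of 64 KiB is an assumed (Luminous-era HDD) bluestore default.

import math

def stored_bytes(obj_size, k, m, min_alloc):
    chunk = math.ceil(obj_size / k)   # data chunk size per shard
    shards = k + m                    # data + coding shards
    alloc = max(chunk, min_alloc)     # each shard rounds up to min_alloc
    return chunk * shards, alloc * shards

obj = 10 * 1024
logical_ec, alloc_ec = stored_bytes(obj, k=4, m=2, min_alloc=64 * 1024)
alloc_rep = 3 * max(obj, 64 * 1024)   # 3x replication, same rounding

print(f"EC 4+2: {logical_ec} B logical, {alloc_ec} B allocated")
# EC 4+2: 15360 B logical, 393216 B allocated
print(f"3x rep: {3 * obj} B logical, {alloc_rep} B allocated")
# 3x rep: 30720 B logical, 196608 B allocated
```

Under these assumptions the EC pool allocates six 64 KiB units against replication's three, so the small object ends up costing more raw space on EC, not less.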


>
> Thanks,
> Ben
>
> On Tue, Sep 11, 2018 at 1:46 PM Patrick Donnelly 
> wrote:
>
>> On Tue, Sep 11, 2018 at 12:43 PM, Benjamin Cherian
>>  wrote:
>> > On Tue, Sep 11, 2018 at 10:44 AM Gregory Farnum 
>> wrote:
>> >>
>> >> 
>> >> In general, if the key-value storage is of unpredictable or non-trivial
>> >> size, you should use omap.
>> >>
>> >> At the bottom layer where the data is actually stored, they're likely
>> to
>> >> be in the same places (if using BlueStore, they are the same — in
>> FileStore,
>> >> a rados xattr *might* be in the local FS xattrs, or it might not). It
>> is
>> >> somewhat more likely that something stored in an xattr will get pulled
>> into
>> >> memory at the same time as the object's internal metadata, but that
>> only
>> >> happens if it's quite small (think the xfs or ext4 xattr rules).
>> >
>> >
>> > Based on this description, if I'm planning on using Bluestore, there is
>> no
>> > particular reason to ever prefer using xattrs over omap (outside of
>> ease of
>> > use in the API), correct?
>>
>> You may prefer xattrs on bluestore if the metadata is small, and you
>> may need to store the xattrs on an EC pool: omap is not supported on
>> EC pools.
>>
>> --
>> Patrick Donnelly
>>
>


Re: [ceph-users] help me turn off "many more objects than average"

2018-09-12 Thread Paul Emmerich
Did you restart the mons or inject the option?

Paul

2018-09-12 17:40 GMT+02:00 Chad William Seys :
> Hi all,
>   I'm having trouble turning off the warning "1 pools have many more objects
> per pg than average".
>
> I've tried a lot of variations on the below, my current ceph.conf:
>
> #...
> [mon]
>
> #...
> mon_pg_warn_max_object_skew = 0
>
> All of my monitors have been restarted.
>
> Seems like I'm missing something.  Syntax error?  Wrong section? No vertical
> blank whitespace allowed?  Not supported in Luminous?
>
> Thanks!
> Chad.



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread Maged Mokhtar



On 12/09/18 17:06, Ján Senko wrote:

We are benchmarking a test machine which has:
8 cores, 64GB RAM
12 * 12 TB HDD (SATA)
2 * 480 GB SSD (SATA)
1 * 240 GB SSD (NVME)
Ceph Mimic

Baseline benchmark for HDD only (Erasure Code 4+2)
Write 420 MB/s, 100 IOPS, 150ms latency
Read 1040 MB/s, 260 IOPS, 60ms latency

Now we moved WAL to the SSD (all 12 WALs on single SSD, default size 
(512MB)):

Write 640 MB/s, 160 IOPS, 100ms latency
Read identical as above.

Nice boost we thought, so we moved WAL+DB to the SSD (Assigned 30GB 
for DB)

All results are the same as above!

Q: This is suspicious, right? Why is the DB on SSD not helping with 
our benchmark? We use *rados bench*

We tried putting WAL on the NVME, and again, the results are the same 
as on SSD.

Same for WAL+DB on NVME

Again, the same speed. Any ideas why we don't gain speed by using 
faster HW here?


Jan

--
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818



Hi Jan,
putting 12 WALs + DBs on the same SSD is probably too much; trying 4-5 may
give better results.
The recommendation I have is to run a load-monitoring tool
(atop/sar/collectl) while you are benchmarking and look at %
utilization for disks and CPUs; this will let you know without guessing
where the bottleneck is: it could be your SSDs, or maybe your CPUs are
saturated.


/Maged


Re: [ceph-users] Mimic upgrade failure

2018-09-12 Thread Kevin Hrpcek
I couldn't find any sign of a networking issue at the OS or switches. No 
changes have been made in those to get the cluster stable again. I 
looked through a couple OSD logs and here is a selection of the most 
frequent errors they were getting. Maybe something below is more obvious 
to you.


2018-09-09 18:17:33.245 7feb92079700  2 osd.84 991324 ms_handle_refused 
con 0x560e428b9800 session 0x560eb26b0060
2018-09-09 18:17:33.245 7feb9307b700  2 osd.84 991324 ms_handle_refused 
con 0x560ea639f000 session 0x560eb26b0060


2018-09-09 18:18:55.919 7feb9307b700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e424a3600 for osd.20, reopening
2018-09-09 18:18:55.919 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560e447df600 session 0x560e9ec37680
2018-09-09 18:18:55.919 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e427a5600 session 0x560e9ec37680
2018-09-09 18:18:55.935 7feb92079700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e40afcc00 for osd.18, reopening
2018-09-09 18:18:55.935 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e44398c00 session 0x560e6a3a0620
2018-09-09 18:18:55.935 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560e42f4ea00 session 0x560e6a3a0620
2018-09-09 18:18:55.939 7feb9307b700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e424c1e00 for osd.9, reopening
2018-09-09 18:18:55.940 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560ea4d09600 session 0x560e115e8120
2018-09-09 18:18:55.940 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e424a3600 session 0x560e115e8120
2018-09-09 18:18:55.956 7febadf54700 20 osd.84 991337 share_map_peer 
0x560e411ca600 already has epoch 991337


2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362  new session 
0x560e40b5ce00 con=0x560e42471800 addr=10.1.9.13:6836/2276068
2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362  session 
0x560e40b5ce00 osd.376 has caps osdcap[grant(*)] 'allow *'
2018-09-09 18:24:59.596 7feb9407d700  2 osd.84 991362 ms_handle_reset 
con 0x560e42471800 session 0x560e40b5ce00
2018-09-09 18:24:59.606 7feb9407d700  2 osd.84 991362 ms_handle_refused 
con 0x560e42d04600 session 0x560e10dfd000
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 
OSD::ms_get_authorizer type=osd
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 ms_get_authorizer 
bailing, we are shutting down
2018-09-09 18:24:59.633 7febad753700  0 -- 10.1.9.9:6848/4287624 >> 
10.1.9.12:6801/2269104 conn(0x560e42326a00 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=18630 cs=1 
l=0).handle_connect_reply connect got BADAUTHORIZER


2018-09-09 18:22:56.434 7febadf54700  0 cephx: verify_authorizer could 
not decrypt ticket info: error: bad magic in decode_decrypt, 
3995972256093848467 != 18374858748799134293


2018-09-09 18:22:56.434 7febadf54700  0 -- 10.1.9.9:6848/4287624 >> 
10.1.9.12:6801/2269104 conn(0x560e41fad600 :6848 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg: got bad authorizer


2018-09-10 03:30:17.324 7ff0ab678700 -1 osd.84 992286 heartbeat_check: 
no reply from 10.1.9.28:6843 osd.578 since back 2018-09-10 
03:15:35.358240 front 2018-09-10 03:15:47.879015 (cutoff 2018-09-10 
03:29:17.326329)


Kevin


On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it coming up on the
failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage
  




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:


Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 osds, or maybe I've missed some tuning as my cluster
has scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:

Nothing too crazy for non default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought back
closer to 

Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread Ján Senko
Eugene:
Between tests we destroyed the OSDs and created them from scratch. We used
a Docker image to deploy Ceph on one machine.
I've seen that there are WAL/DB partitions created on the disks.
Should I also check somewhere in ceph config that it actually uses those?

David:
We used 4MB writes.

I know about the recommended journal size, however this is the machine we
have at the moment.
For final production I can change the size of SSD (if it makes sense)
The benchmark hasn't filled the 30GB of DB in the time it was running, so I
doubt that having a properly sized DB would change the results.
(It wrote 38GB per minute of testing, spread across 12 disks, with 50% EC
overhead, therefore about 5GB/minute)

Jan

On Wed, Sep 12, 2018 at 17:36 David Turner  wrote:

> If your writes are small enough (64k or smaller) they're being placed on
> the WAL device regardless of where your DB is.  If you change your testing
> to use larger writes you should see a difference by adding the DB.
>
> Please note that the community has never recommended using less than 120GB
> DB for a 12TB OSD and the docs have come out and officially said that you
> should use at least a 480GB DB for a 12TB OSD.  If you're setting up your
> OSDs with a 30GB DB, you're just going to fill that up really quick and
> spill over onto the HDD and have wasted your money on the SSDs.
>
> On Wed, Sep 12, 2018 at 11:07 AM Ján Senko  wrote:
>
>> We are benchmarking a test machine which has:
>> 8 cores, 64GB RAM
>> 12 * 12 TB HDD (SATA)
>> 2 * 480 GB SSD (SATA)
>> 1 * 240 GB SSD (NVME)
>> Ceph Mimic
>>
>> Baseline benchmark for HDD only (Erasure Code 4+2)
>> Write 420 MB/s, 100 IOPS, 150ms latency
>> Read 1040 MB/s, 260 IOPS, 60ms latency
>>
>> Now we moved WAL to the SSD (all 12 WALs on single SSD, default size
>> (512MB)):
>> Write 640 MB/s, 160 IOPS, 100ms latency
>> Read identical as above.
>>
>> Nice boost we thought, so we moved WAL+DB to the SSD (Assigned 30GB for
>> DB)
>> All results are the same as above!
>>
>> Q: This is suspicious, right? Why is the DB on SSD not helping with our
>> benchmark? We use *rados bench*
>>
>> We tried putting WAL on the NVME, and again, the results are the same as
>> on SSD.
>> Same for WAL+DB on NVME
>>
>> Again, the same speed. Any ideas why we don't gain speed by using faster
>> HW here?
>>
>> Jan
>>
>> --
>> Jan Senko, Skype janos-
>> Phone in Switzerland: +41 774 144 602
>> Phone in Czech Republic: +420 777 843 818
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>

-- 
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818


Re: [ceph-users] RADOS async client memory usage explodes when reading several objects in sequence

2018-09-12 Thread Gregory Farnum
That code bit is just "we have an incoming message with data", which is
what we'd expect, but means it's not very helpful for tracking down the
source of any leaks.

My guess is still very much that somehow there are deallocations missing
here. Internally, the synchronous API is wrapping the async one so it'd be
subject to the same bugs. (It could in principle be bad malloc behavior,
but if valgrind reports the same memory usage as the RSS being reported
then I don't think it can be malloc behavior.)

Are you actually freeing the buffers you provide when you're done with
them? I guess there could be something with the "C_bl_to_buf" structure
getting managed wrong as well since that looks to be unique to this code
path, but it wouldn't depend on the size of the objects since it's just 4
pointers/ints.
-Greg

On Wed, Sep 12, 2018 at 8:43 AM Daniel Goldbach 
wrote:

> The issue continues even when I do rados_aio_release(completion) at the
> end of the readobj(..) definition in the example. Also, in our production
> code we call rados_aio_release for each completion and we still see the
> issue there. The release command doesn't guarantee instant release, so
> could it be that the release operations are getting queued up but never
> executed?
>
> Valgrind massif shows that the relevant allocations are all happening in
> the bit of code in the following stack trace:
>
>
> 
>   n        time(i)         total(B)   useful-heap(B)   extra-heap(B)   stacks(B)
>  62    166,854,775       82,129,696       81,808,615         321,081          0
>  63    168,025,321       83,155,872       82,834,072         321,800          0
> 99.61% (82,834,072B) (heap allocation functions) malloc/new/new[],
> --alloc-fns, etc.
> ->93.75% (77,955,072B) 0x579AC05:
> ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int)
> (in /usr/lib/ceph/libceph-common.so.0)
> | ->93.75% (77,955,072B) 0x597BB48: AsyncConnection::process() (in
> /usr/lib/ceph/libceph-common.so.0)
> | | ->93.75% (77,955,072B) 0x598BC96: EventCenter::process_events(int,
> std::chrono::duration >*) (in
> /usr/lib/ceph/libceph-common.so.0)
> | |   ->93.75% (77,955,072B) 0x5990816: ??? (in
> /usr/lib/ceph/libceph-common.so.0)
> | | ->93.75% (77,955,072B) 0xE957C7E: ??? (in
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
> | |   ->93.75% (77,955,072B) 0xE6896B8: start_thread
> (pthread_create.c:333)
> | | ->93.75% (77,955,072B) 0x529741B: clone (clone.S:109)
> | |
> | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
> |
> ->05.87% (4,879,000B) in 387 places, all below massif's threshold (1.00%)
>
>
>
> On Wed, Sep 12, 2018 at 4:05 PM Gregory Farnum  wrote:
>
>> Yep, those completions are maintaining bufferlist references IIRC, so
>> they’re definitely holding the memory buffers in place!
>> On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley  wrote:
>>
>>>
>>>
>>> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
>>> > Hi all,
>>> >
>>> > We're reading from a Ceph Luminous pool using the librados asynchronous
>>> > I/O API. We're seeing some concerning memory usage patterns when we
>>> > read many objects in sequence.
>>> >
>>> > The expected behaviour is that our memory usage stabilises at a small
>>> > amount, since we're just fetching objects and ignoring their data.
>>> > What we instead find is that the memory usage of our program grows
>>> > linearly with the amount of data read for an interval of time, and
>>> > then continues to grow at a much slower but still consistent pace.
>>> > This memory is not freed until program termination. My guess is that
>>> > this is an issue with Ceph's memory allocator.
>>> >
>>> > To demonstrate, we create 20,000 objects of size 10KB, 20,000 of size
>>> > 100KB, and 20,000 of size 1MB:
>>> >
>>> > #include <rados/librados.h>
>>> > #include <stdio.h>
>>> > #include <string.h>
>>> > #include <stdlib.h>
>>> >
>>> > int main() {
>>> > rados_t cluster;
>>> > rados_create(&cluster, "test");
>>> > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>> > rados_connect(cluster);
>>> >
>>> > rados_ioctx_t io;
>>> > rados_ioctx_create(cluster, "test", &io);
>>> >
>>> > char data[1000000];
>>> > memset(data, 'a', 1000000);
>>> >
>>> > char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>>> > int i;
>>> > for (i = 0; i < 20000; i++) {
>>> > sprintf(smallobj_name, "10kobj_%d", i);
>>> > rados_write(io, smallobj_name, data, 10000, 0);
>>> >
>>> > sprintf(mediumobj_name, "100kobj_%d", i);
>>> > rados_write(io, mediumobj_name, data, 100000, 0);
>>> >
>>> > sprintf(largeobj_name, "1mobj_%d", i);
>>> > rados_write(io, largeobj_name, data, 1000000, 0);
>>> >
>>> > printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>>> >   smallobj_name, mediumobj_name, largeobj_name);
>>> > }
>>> >
>>> > return 0;
>>> > }
>>> >
>>> > 

Re: [ceph-users] Performance predictions moving bluestore wall, db to ssd

2018-09-12 Thread Marc Roos
What thread? I posted this with this specific subject so it is easier 
to find in the future, and it is not a 'sub question' of someone else's 
problem. I'm hoping for others to post their experience/results. I thought 
that if CERN can give estimates, people here can too.


-Original Message-
From: David Turner [mailto:drakonst...@gmail.com] 
Sent: Wednesday, September 12, 2018 18:20
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Performance predictions moving bluestore wall, 
db to ssd

You already have a thread talking about benchmarking the addition of WAL 
and DB partitions to an OSD.  Why are you creating a new one about the 
exact same thing?  As with everything, the performance increase isn't 
even solely answerable by which drives you have, there are a lot of 
factors that could introduce a bottleneck in your cluster.  But again, 
why create a new thread for the exact same topic?

On Wed, Sep 12, 2018 at 12:06 PM Marc Roos  
wrote:



When having a hdd bluestore osd with collocated wal and db. 


- What performance increase can be expected if one would move the 
wal to 
an ssd?

- What performance increase can be expected if one would move the 
db to 
an ssd?

	- Would the performance gain be large if you have a very slow hdd (and 
thus not so much if you have a very fast hdd (sas 15k))?

	- Would it be best to move the wal to the ssd first, and then maybe 
also the db?

In this CERN video (https://youtu.be/OopRMUYiY5E?t=931) of 2015 
they are 
talking about 5-10x increase etc. But that is filestore of course.












Re: [ceph-users] Performance predictions moving bluestore wall, db to ssd

2018-09-12 Thread David Turner
Sorry, I was wrong that it was you.  I just double checked.  But there is a
new thread as of this morning about this topic where someone is running
benchmark tests with numbers titled "Benchmark does not show gains with DB
on SSD".
On Wed, Sep 12, 2018 at 12:20 PM David Turner  wrote:

> You already have a thread talking about benchmarking the addition of WAL
> and DB partitions to an OSD.  Why are you creating a new one about the
> exact same thing?  As with everything, the performance increase isn't even
> solely answerable by which drives you have, there are a lot of factors that
> could introduce a bottleneck in your cluster.  But again, why create a new
> thread for the exact same topic?
>
> On Wed, Sep 12, 2018 at 12:06 PM Marc Roos 
> wrote:
>
>>
>> When having a hdd bluestore osd with collocated wal and db.
>>
>>
>> - What performance increase can be expected if one would move the wal to
>> an ssd?
>>
>> - What performance increase can be expected if one would move the db to
>> an ssd?
>>
>> - Would the performance gain be large if you have a very slow hdd (and thus
>> not so much if you have a very fast hdd (sas 15k))?
>>
>> - Would it be best to move the wal to the ssd first, and then maybe also
>> the db?
>>
>> In this CERN video (https://youtu.be/OopRMUYiY5E?t=931) of 2015 they are
>> talking about 5-10x increase etc. But that is filestore of course.
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: [ceph-users] Performance predictions moving bluestore wall, db to ssd

2018-09-12 Thread David Turner
You already have a thread talking about benchmarking the addition of WAL
and DB partitions to an OSD.  Why are you creating a new one about the
exact same thing?  As with everything, the performance increase isn't even
solely answerable by which drives you have, there are a lot of factors that
could introduce a bottleneck in your cluster.  But again, why create a new
thread for the exact same topic?

On Wed, Sep 12, 2018 at 12:06 PM Marc Roos  wrote:

>
> When having a hdd bluestore osd with collocated wal and db.
>
>
> - What performance increase can be expected if one would move the wal to
> an ssd?
>
> - What performance increase can be expected if one would move the db to
> an ssd?
>
> - Would the performance gain be large if you have a very slow hdd (and thus
> not so much if you have a very fast hdd (sas 15k))?
>
> - Would it be best to move the wal to the ssd first, and then maybe also
> the db?
>
> In this CERN video (https://youtu.be/OopRMUYiY5E?t=931) of 2015 they are
> talking about 5-10x increase etc. But that is filestore of course.
>
>
>
>
>
>
>
>


[ceph-users] Performance predictions moving bluestore wall, db to ssd

2018-09-12 Thread Marc Roos


When having a hdd bluestore osd with collocated wal and db. 


- What performance increase can be expected if one would move the wal to 
an ssd?

- What performance increase can be expected if one would move the db to 
an ssd?

- Would the performance gain be large if you have a very slow hdd (and 
thus not so much if you have a very fast hdd (sas 15k))?

- Would it be best to move the wal to the ssd first, and then maybe also 
the db?

In this CERN video (https://youtu.be/OopRMUYiY5E?t=931) of 2015 they are 
talking about 5-10x increase etc. But that is filestore of course.

 







Re: [ceph-users] [Ceph-community] Multisite replication jewel and luminous

2018-09-12 Thread Sage Weil
[Moving this to ceph-users where it will get more eyeballs.]

On Wed, 12 Sep 2018, Andrew Cassera wrote:

> Hello,
> 
> Any help would be appreciated.  I just created two clusters in the lab. One
> cluster is running jewel 10.2.10 and the other cluster is running luminous
> 12.2.8.  After creating the jewel cluster I created an S3 bucket, and moved
> objects into the bucket.  I then migrated the configuration to a multi-site
> configuration.  I anticipate doing this in production which is running
> jewel.  Replication works fine if I create new buckets and move objects
> into those buckets.  I can see those buckets and objects in both clusters.
> The existing bucket I created before configuring multisite will not sync
> between clusters.
> 
> Will this work between jewel and luminous?  Is there some configuration I
> am missing with the bucket I created before doing the multisite
> configuration.  Here are the steps I took:
> 
> Jewel Master
> 
> radosgw-admin realm create --rgw-realm=us --default
> 1. radosgw-admin zonegroup rename --rgw-zonegroup default
> --zonegroup-new-name=us-east
> 2. radosgw-admin zone rename --rgw-zone default --zone-new-name us-east-1
> --rgw-zonegroup=us-east
> 3. radosgw-admin zonegroup modify --rgw-realm=us --rgw-zonegroup=us-east
> --endpoints http://tests31.solidsupport.com:7480 --master --default
> 
> 4. radosgw-admin user create --uid=rgwuser --display-name="rgw system user"
> --system
> 5. radosgw-admin zone modify --rgw-realm=us --rgw-zonegroup=us-east
> --rgw-zone=us-east-1 --endpoints http://tests31.test.com:7480 --master
> --default
> 6. radosgw-admin zone modify --rgw-realm=us --rgw-zonegroup=us-east
> --rgw-zone=us-east-1 --access-key=35G7VDV5XKL09WGYY34U
> --secret=lQePAxX62j5HR0qj8f4FUkq8DTBpSOIiBeXvJa1
> 7. radosgw-admin period update --commit
> 8. systemctl restart ceph-rado...@rgw.ub-radosgw1-1.service
> 
> Luminous secondary
> 
> 1. radosgw-admin realm pull --url=tests31.test.com:7480
> --access-key=35G7VDV5XKL09WGYY34U
> --secret=lQePAxX62j5HR0qj8f4FUkq8DTBpSOIiBeXvJa1O
> 2. radosgw-admin realm default --rgw-realm=us
> 3. radosgw-admin period pull --url=tests31.test.com:7480
> --access-key=35G7VDV5XKL09WGYY34U
> --secret=lQePAxX62j5HR0qj8f4FUkq8DTBpSOIiBeXvJa1O
> 5. radosgw-admin zone create --rgw-zonegroup=us-east --rgw-zone=us-east-2
> --endpoints=http://tests32.test.com:7480 --access-key=35G7VDV5XKL09WGYY34U
> --secret=lQePAxX62j5HR0qj8f4FUkq8DTBpSOIiBeXvJa1O
> 6. radosgw-admin user create --uid=rgwuser --display-name="rgw system user"
> --system
> 7. radosgw-admin period update --commit
> 8. systemctl restart ceph-rado...@rgw.ub-radosgw1-2.service
> 
> Andrew
> 


Re: [ceph-users] RADOS async client memory usage explodes when reading several objects in sequence

2018-09-12 Thread Daniel Goldbach
The issue continues even when I do rados_aio_release(completion) at the end
of the readobj(..) definition in the example. Also, in our production code
we call rados_aio_release for each completion and we still see the issue
there. The release command doesn't guarantee instant release, so could it
be that the release operations are getting queued up but never executed?

Valgrind massif shows that the relevant allocations are all happening in
the bit of code in the following stack trace:


  n        time(i)         total(B)   useful-heap(B)   extra-heap(B)   stacks(B)
 62    166,854,775       82,129,696       81,808,615         321,081          0
 63    168,025,321       83,155,872       82,834,072         321,800          0
99.61% (82,834,072B) (heap allocation functions) malloc/new/new[],
--alloc-fns, etc.
->93.75% (77,955,072B) 0x579AC05:
ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int)
(in /usr/lib/ceph/libceph-common.so.0)
| ->93.75% (77,955,072B) 0x597BB48: AsyncConnection::process() (in
/usr/lib/ceph/libceph-common.so.0)
| | ->93.75% (77,955,072B) 0x598BC96: EventCenter::process_events(int,
std::chrono::duration >*) (in
/usr/lib/ceph/libceph-common.so.0)
| |   ->93.75% (77,955,072B) 0x5990816: ??? (in
/usr/lib/ceph/libceph-common.so.0)
| | ->93.75% (77,955,072B) 0xE957C7E: ??? (in
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |   ->93.75% (77,955,072B) 0xE6896B8: start_thread
(pthread_create.c:333)
| | ->93.75% (77,955,072B) 0x529741B: clone (clone.S:109)
| |
| ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
|
->05.87% (4,879,000B) in 387 places, all below massif's threshold (1.00%)



On Wed, Sep 12, 2018 at 4:05 PM Gregory Farnum  wrote:

> Yep, those completions are maintaining bufferlist references IIRC, so
> they’re definitely holding the memory buffers in place!
> On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley  wrote:
>
>>
>>
>> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
>> > Hi all,
>> >
>> > We're reading from a Ceph Luminous pool using the librados asynchronous
>> > I/O API. We're seeing some concerning memory usage patterns when we
>> > read many objects in sequence.
>> >
>> > The expected behaviour is that our memory usage stabilises at a small
>> > amount, since we're just fetching objects and ignoring their data.
>> > What we instead find is that the memory usage of our program grows
>> > linearly with the amount of data read for an interval of time, and
>> > then continues to grow at a much slower but still consistent pace.
>> > This memory is not freed until program termination. My guess is that
>> > this is an issue with Ceph's memory allocator.
>> >
>> > To demonstrate, we create 20,000 objects of size 10KB, 20,000 of size
>> > 100KB, and 20,000 of size 1MB:
>> >
>> > #include <rados/librados.h>
>> > #include <stdio.h>
>> > #include <string.h>
>> > #include <stdlib.h>
>> >
>> > int main() {
>> > rados_t cluster;
>> > rados_create(&cluster, "test");
>> > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>> > rados_connect(cluster);
>> >
>> > rados_ioctx_t io;
>> > rados_ioctx_create(cluster, "test", &io);
>> >
>> > char data[1000000];
>> > memset(data, 'a', 1000000);
>> >
>> > char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>> > int i;
>> > for (i = 0; i < 20000; i++) {
>> > sprintf(smallobj_name, "10kobj_%d", i);
>> > rados_write(io, smallobj_name, data, 10000, 0);
>> >
>> > sprintf(mediumobj_name, "100kobj_%d", i);
>> > rados_write(io, mediumobj_name, data, 100000, 0);
>> >
>> > sprintf(largeobj_name, "1mobj_%d", i);
>> > rados_write(io, largeobj_name, data, 1000000, 0);
>> >
>> > printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>> >   smallobj_name, mediumobj_name, largeobj_name);
>> > }
>> >
>> > return 0;
>> > }
>> >
>> > $ gcc create.c -lrados -o create
>> > $ ./create
>> > wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
>> > size 1000000
>> > wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
>> > size 1000000
>> > [...]
>> > wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
>> > 1mobj_19998 of size 1000000
>> > wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
>> > 1mobj_19999 of size 1000000
>> >
>> > Now we read each of these objects with the async API, into the same
>> > buffer. First we read just the 10KB objects:
>> >
>> > #include <rados/librados.h>
>> > #include <stdio.h>
>> > #include <string.h>
>> > #include <stdlib.h>
>> > #include <assert.h>
>> >
>> > void readobj(rados_ioctx_t* io, char objname[]);
>> >
>> > int main() {
>> > rados_t cluster;
>> > rados_create(&cluster, "test");
>> > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>> > rados_connect(cluster);
>> >
>> > rados_ioctx_t io;
>> > rados_ioctx_create(cluster, "test", &io);
>> >
>> > 

[ceph-users] help me turn off "many more objects that average"

2018-09-12 Thread Chad William Seys

Hi all,
  I'm having trouble turning off the warning "1 pools have many more 
objects per pg than average".


I've tried a lot of variations on the below, my current ceph.conf:

#...
[mon]

#...
mon_pg_warn_max_object_skew = 0

All of my monitors have been restarted.

Seems like I'm missing something.  Syntax error?  Wrong section? No 
vertical blank whitespace allowed?  Not supported in Luminous?


Thanks!
Chad.


Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread David Turner
If your writes are small enough (64k or smaller) they're being placed on
the WAL device regardless of where your DB is.  If you change your testing
to use larger writes you should see a difference by adding the DB.

Please note that the community has never recommended using less than 120GB
DB for a 12TB OSD and the docs have come out and officially said that you
should use at least a 480GB DB for a 12TB OSD.  If you're setting up your
OSDs with a 30GB DB, you're just going to fill that up really quick and
spill over onto the HDD and have wasted your money on the SSDs.

On Wed, Sep 12, 2018 at 11:07 AM Ján Senko  wrote:

> We are benchmarking a test machine which has:
> 8 cores, 64GB RAM
> 12 * 12 TB HDD (SATA)
> 2 * 480 GB SSD (SATA)
> 1 * 240 GB SSD (NVME)
> Ceph Mimic
>
> Baseline benchmark for HDD only (Erasure Code 4+2)
> Write 420 MB/s, 100 IOPS, 150ms latency
> Read 1040 MB/s, 260 IOPS, 60ms latency
>
> Now we moved WAL to the SSD (all 12 WALs on single SSD, default size
> (512MB)):
> Write 640 MB/s, 160 IOPS, 100ms latency
> Read identical as above.
>
> Nice boost we thought, so we moved WAL+DB to the SSD (Assigned 30GB for DB)
> All results are the same as above!
>
> Q: This is suspicious, right? Why is the DB on SSD not helping with our
> benchmark? We use *rados bench*
>
> We tried putting WAL on the NVME, and again, the results are the same as
> on SSD.
> Same for WAL+DB on NVME
>
> Again, the same speed. Any ideas why we don't gain speed by using faster
> HW here?
>
> Jan
>
> --
> Jan Senko, Skype janos-
> Phone in Switzerland: +41 774 144 602
> Phone in Czech Republic: +420 777 843 818 <+420%20777%20843%20818>
>


Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread Eugen Block

Hi Jan,

how did you move the WAL and DB to the SSD/NVMe? By recreating the  
OSDs or a different approach? Did you check afterwards that the  
devices were really used for that purpose? We had to deal with that a  
couple of months ago [1] and it's not really obvious if the new  
devices are really used.


Regards,
Eugen

[1]  
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/



Quoting Ján Senko :


We are benchmarking a test machine which has:
8 cores, 64GB RAM
12 * 12 TB HDD (SATA)
2 * 480 GB SSD (SATA)
1 * 240 GB SSD (NVME)
Ceph Mimic

Baseline benchmark for HDD only (Erasure Code 4+2)
Write 420 MB/s, 100 IOPS, 150ms latency
Read 1040 MB/s, 260 IOPS, 60ms latency

Now we moved WAL to the SSD (all 12 WALs on single SSD, default size
(512MB)):
Write 640 MB/s, 160 IOPS, 100ms latency
Read identical as above.

Nice boost we thought, so we moved WAL+DB to the SSD (Assigned 30GB for DB)
All results are the same as above!

Q: This is suspicious, right? Why is the DB on SSD not helping with our
benchmark? We use *rados bench*

We tried putting WAL on the NVME, and again, the results are the same as on
SSD.
Same for WAL+DB on NVME

Again, the same speed. Any ideas why we don't gain speed by using faster HW
here?

Jan

--
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818






[ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread Ján Senko
We are benchmarking a test machine which has:
8 cores, 64GB RAM
12 * 12 TB HDD (SATA)
2 * 480 GB SSD (SATA)
1 * 240 GB SSD (NVME)
Ceph Mimic

Baseline benchmark for HDD only (Erasure Code 4+2)
Write 420 MB/s, 100 IOPS, 150ms latency
Read 1040 MB/s, 260 IOPS, 60ms latency

Now we moved WAL to the SSD (all 12 WALs on single SSD, default size
(512MB)):
Write 640 MB/s, 160 IOPS, 100ms latency
Read identical as above.

Nice boost we thought, so we moved WAL+DB to the SSD (Assigned 30GB for DB)
All results are the same as above!

Q: This is suspicious, right? Why is the DB on SSD not helping with our
benchmark? We use *rados bench*

We tried putting WAL on the NVME, and again, the results are the same as on
SSD.
Same for WAL+DB on NVME

Again, the same speed. Any ideas why we don't gain speed by using faster HW
here?

Jan

-- 
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818


[ceph-users] osx support and performance testing

2018-09-12 Thread Marc Roos


Is osxfuse the only and best-performing way to mount a ceph 
filesystem on an osx client?
http://docs.ceph.com/docs/mimic/dev/macos/

I am now testing cephfs performance on a client with the fio libaio 
engine. This engine does not exist on osx, but there is a posixaio. Does 
anyone have experience comparing these results?
I would like to predict performance on an osx client.


https://tracker.ceph.com/projects/ceph/wiki/Increasing_Ceph_portability




Re: [ceph-users] RADOS async client memory usage explodes when reading several objects in sequence

2018-09-12 Thread Gregory Farnum
Yep, those completions are maintaining bufferlist references IIRC, so
they’re definitely holding the memory buffers in place!
On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley  wrote:

>
>
> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> > Hi all,
> >
> > We're reading from a Ceph Luminous pool using the librados asynchronous
> > I/O API. We're seeing some concerning memory usage patterns when we
> > read many objects in sequence.
> >
> > The expected behaviour is that our memory usage stabilises at a small
> > amount, since we're just fetching objects and ignoring their data.
> > What we instead find is that the memory usage of our program grows
> > linearly with the amount of data read for an interval of time, and
> > then continues to grow at a much slower but still consistent pace.
> > This memory is not freed until program termination. My guess is that
> > this is an issue with Ceph's memory allocator.
> >
> > To demonstrate, we create 20,000 objects of size 10KB, 20,000 of size
> > 100KB, and 20,000 of size 1MB:
> >
> > #include <rados/librados.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <stdlib.h>
> >
> > int main() {
> > rados_t cluster;
> > rados_create(&cluster, "test");
> > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> > rados_connect(cluster);
> >
> > rados_ioctx_t io;
> > rados_ioctx_create(cluster, "test", &io);
> >
> > char data[1000000];
> > memset(data, 'a', 1000000);
> >
> > char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
> > int i;
> > for (i = 0; i < 20000; i++) {
> > sprintf(smallobj_name, "10kobj_%d", i);
> > rados_write(io, smallobj_name, data, 10000, 0);
> >
> > sprintf(mediumobj_name, "100kobj_%d", i);
> > rados_write(io, mediumobj_name, data, 100000, 0);
> >
> > sprintf(largeobj_name, "1mobj_%d", i);
> > rados_write(io, largeobj_name, data, 1000000, 0);
> >
> > printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
> >   smallobj_name, mediumobj_name, largeobj_name);
> > }
> >
> > return 0;
> > }
> >
> > $ gcc create.c -lrados -o create
> > $ ./create
> > wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
> > size 1000000
> > wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
> > size 1000000
> > [...]
> > wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
> > 1mobj_19998 of size 1000000
> > wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
> > 1mobj_19999 of size 1000000
> >
> > Now we read each of these objects with the async API, into the same
> > buffer. First we read just the 10KB objects:
> >
> > #include <rados/librados.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <stdlib.h>
> > #include <assert.h>
> >
> > void readobj(rados_ioctx_t* io, char objname[]);
> >
> > int main() {
> > rados_t cluster;
> > rados_create(&cluster, "test");
> > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> > rados_connect(cluster);
> >
> > rados_ioctx_t io;
> > rados_ioctx_create(cluster, "test", &io);
> >
> > char smallobj_name[16];
> > int i, total_bytes_read = 0;
> >
> > for (i = 0; i < 20000; i++) {
> > sprintf(smallobj_name, "10kobj_%d", i);
> > readobj(&io, smallobj_name);
> >
> > total_bytes_read += 10000;
> > printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
> > }
> >
> > getchar();
> > return 0;
> > }
> >
> > void readobj(rados_ioctx_t* io, char objname[]) {
> > char data[10000];
> > unsigned long bytes_read;
> > rados_completion_t completion;
> > int retval;
> >
> > rados_read_op_t read_op = rados_create_read_op();
> > rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
> > retval = rados_aio_create_completion(NULL, NULL, NULL,
> > &completion);
> > assert(retval == 0);
> >
> > retval = rados_aio_read_op_operate(read_op, *io, completion,
> > objname, 0);
> > assert(retval == 0);
> >
> > rados_aio_wait_for_complete(completion);
> > rados_aio_get_return_value(completion);
> > }
> >
> > $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
> > Read 10kobj_0 for total 10000
> > Read 10kobj_1 for total 20000
> > [...]
> > Read 10kobj_19998 for total 199990000
> > Read 10kobj_19999 for total 200000000
> >
> > We read 200MB. A graph of the resident set size of the program is
> > attached as mem-graph-10k.png, with seconds on x axis and KB on the y
> > axis. You can see that the memory usage increases throughout, which
> > itself is unexpected since that memory should be freed over time and
> > we should only hold 10KB of object data in memory at a time. The rate
> > of growth decreases and eventually stabilises, and by the end we've
> > used 60MB of RAM.
> >
> > We repeat this experiment for the 100KB and 1MB objects and find that
> > after all reads they use 140MB and 500MB of RAM, and memory usage
> > presumably would continue to grow if there were more objects. This is
> > orders of magnitude more memory than what I 

Re: [ceph-users] RADOS async client memory usage explodes when reading several objects in sequence

2018-09-12 Thread Casey Bodley



On 09/12/2018 05:29 AM, Daniel Goldbach wrote:

Hi all,

We're reading from a Ceph Luminous pool using the librados asynchronous 
I/O API. We're seeing some concerning memory usage patterns when we 
read many objects in sequence.


The expected behaviour is that our memory usage stabilises at a small 
amount, since we're just fetching objects and ignoring their data. 
What we instead find is that the memory usage of our program grows 
linearly with the amount of data read for an interval of time, and 
then continues to grow at a much slower but still consistent pace. 
This memory is not freed until program termination. My guess is that 
this is an issue with Ceph's memory allocator.


To demonstrate, we create 20000 objects of size 10KB, 20000 of size 
100KB, and 20000 of size 1MB:


    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <rados/librados.h>

    int main() {
        rados_t cluster;
        rados_create(&cluster, "test");
        rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
        rados_connect(cluster);

        rados_ioctx_t io;
        rados_ioctx_create(cluster, "test", &io);

        char data[1000000];
        memset(data, 'a', 1000000);

        char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
        int i;
        for (i = 0; i < 20000; i++) {
            sprintf(smallobj_name, "10kobj_%d", i);
            rados_write(io, smallobj_name, data, 10000, 0);

            sprintf(mediumobj_name, "100kobj_%d", i);
            rados_write(io, mediumobj_name, data, 100000, 0);

            sprintf(largeobj_name, "1mobj_%d", i);
            rados_write(io, largeobj_name, data, 1000000, 0);

            printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
                   smallobj_name, mediumobj_name, largeobj_name);
        }

        return 0;
    }

    $ gcc create.c -lrados -o create
    $ ./create
    wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
    wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
    [...]
    wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
    wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000


Now we read each of these objects with the async API, into the same 
buffer. First we read just the 10KB objects:


    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #include <rados/librados.h>

    void readobj(rados_ioctx_t* io, char objname[]);

    int main() {
        rados_t cluster;
        rados_create(&cluster, "test");
        rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
        rados_connect(cluster);

        rados_ioctx_t io;
        rados_ioctx_create(cluster, "test", &io);

        char smallobj_name[16];
        int i, total_bytes_read = 0;

        for (i = 0; i < 20000; i++) {
            sprintf(smallobj_name, "10kobj_%d", i);
            readobj(&io, smallobj_name);

            total_bytes_read += 10000;
            printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
        }

        getchar();
        return 0;
    }

    void readobj(rados_ioctx_t* io, char objname[]) {
        char data[1000000];
        unsigned long bytes_read;
        rados_completion_t completion;
        int retval;

        rados_read_op_t read_op = rados_create_read_op();
        rados_read_op_read(read_op, 0, 1000000, data, &bytes_read, &retval);
        retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
        assert(retval == 0);

        retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
        assert(retval == 0);

        rados_aio_wait_for_complete(completion);
        rados_aio_get_return_value(completion);
    }

    $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
    Read 10kobj_0 for total 10000
    Read 10kobj_1 for total 20000
    [...]
    Read 10kobj_19998 for total 199990000
    Read 10kobj_19999 for total 200000000

We read 200MB. A graph of the resident set size of the program is 
attached as mem-graph-10k.png, with seconds on the x axis and KB on the 
y axis. You can see that the memory usage increases throughout, which is 
itself unexpected, since that memory should be freed over time and 
we should only hold 10KB of object data in memory at a time. The rate 
of growth decreases and eventually stabilises, and by the end we've 
used 60MB of RAM.


We repeat this experiment for the 100KB and 1MB objects and find that 
after all reads they use 140MB and 500MB of RAM, and memory usage 
presumably would continue to grow if there were more objects. This is 
orders of magnitude more memory than what I would expect these 
programs to use.


  * We do not get this behaviour with the synchronous API, and the
memory usage remains stable at just a few MB.
  * We've found that for some reason, this doesn't happen (or doesn't
happen as severely) if we intersperse large reads with much
smaller reads. In this case, the memory usage seems to stabilise
at a reasonable number.
  * Valgrind only reports a trivial amount of unreachable memory.
  * Memory usage doesn't increase in this manner if we repeatedly read
the same object over and over again. It hovers around 20MB.
  * In other experiments we've done, with different object data and
distributions of 

[ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-12 Thread Stefan Kooman
Hi,

Once in a while, today a bit more often, the MDS is logging the
following:

mds.mds1 [WRN]  replayed op client.15327973:15585315,15585103 used ino
0x19918de but session next is 0x1873b8b

Nothing of importance is logged in the mds ("debug_mds_log": "1/5").

What does this warning message mean / indicate?

At some point this client (ceph-fuse, mimic 13.2.1) triggers the following:

mon.mon1 [WRN] Health check failed: 1 MDSs report slow requests
(MDS_SLOW_REQUEST)
mds.mds2 [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.911624 secs
mds.mds2 [WRN] slow request 30.911624 seconds old, received at
2018-09-12 15:18:44.739321: client_request(client.15732335:9506 lookup
#0x16901a7/ctdb_recovery_lock caller_uid=0, caller_gid=0{}) currently 
failed to
rdlock, waiting

mds logging:

2018-09-12 11:35:07.373091 7f80af91e700  0 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >> [2001:7b8:81:7::11]:0/2366241118 
conn(0x56332404f000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg: challenging authorizer
2018-09-12 13:24:17.000787 7f80af91e700  0 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >> [2001:7b8:81:7::11]:0/526035198 
conn(0x56330c726000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg: challenging authorizer
2018-09-12 15:21:17.176405 7f80af91e700  0 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >> [2001:7b8:81:7::11]:0/526035198 
conn(0x56330c726000 :6800 s=STATE_OPEN pgs=3 cs=1 l=0).fault server, going to 
standby
2018-09-12 15:22:26.641501 7f80af91e700  0 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >> [2001:7b8:81:7::11]:0/526035198 
conn(0x5633678f7000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg: challenging authorizer
2018-09-12 15:22:26.641694 7f80af91e700  0 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >> [2001:7b8:81:7::11]:0/526035198 
conn(0x5633678f7000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 
existing_state=STATE_STANDBY
2018-09-12 15:22:26.641971 7f80af91e700  0 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >> [2001:7b8:81:7::11]:0/526035198 
conn(0x56330c726000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=3 cs=1 
l=0).handle_connect_msg: challenging authorizer

Thanks,

Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse slow cache?

2018-09-12 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):
> please add '-f' option (trace child processes' syscall)  to strace,

Good suggestion. We now see all the apache child processes doing their
thing. We have been, on and off, stracing / debugging this issue. Nothing
obvious so far. We are still trying to get our fingers around it. Some
websites are not "affected" and are just quick; some have these "stalls".
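Concretely, the tracing we have been running looks roughly like this (an illustrative invocation, not our exact command; the process name and output path are placeholders to adapt):

```shell
# Trace the oldest apache2 process and all forked children (-f),
# with wall-clock timestamps (-tt) and per-syscall latency (-T),
# writing everything to a file for later inspection.
strace -f -tt -T -o /tmp/apache.strace -p "$(pgrep -o apache2)"
```

The -T latencies are what make the "stalls" visible: a slow getattr or lookup on the ceph-fuse mount shows up as an outlier duration on the corresponding syscall.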

As soon as we have more ceph related info we will let you know.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] RADOS async client memory usage explodes when reading several objects in sequence

2018-09-12 Thread Daniel Goldbach
Hi all,

We're reading from a Ceph Luminous pool using the librados asynchronous I/O
API. We're seeing some concerning memory usage patterns when we read many
objects in sequence.

The expected behaviour is that our memory usage stabilises at a small
amount, since we're just fetching objects and ignoring their data. What we
instead find is that the memory usage of our program grows linearly with
the amount of data read for an interval of time, and then continues to grow
at a much slower but still consistent pace. This memory is not freed until
program termination. My guess is that this is an issue with Ceph's memory
allocator.

To demonstrate, we create 20000 objects of size 10KB, 20000 of size 100KB,
and 20000 of size 1MB:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <rados/librados.h>

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    char data[1000000];
    memset(data, 'a', 1000000);

    char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
    int i;
    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "10kobj_%d", i);
        rados_write(io, smallobj_name, data, 10000, 0);

        sprintf(mediumobj_name, "100kobj_%d", i);
        rados_write(io, mediumobj_name, data, 100000, 0);

        sprintf(largeobj_name, "1mobj_%d", i);
        rados_write(io, largeobj_name, data, 1000000, 0);

        printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
               smallobj_name, mediumobj_name, largeobj_name);
    }

    return 0;
}

$ gcc create.c -lrados -o create
$ ./create
wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
[...]
wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000

Now we read each of these objects with the async API, into the same buffer.
First we read just the 10KB objects:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <rados/librados.h>

void readobj(rados_ioctx_t* io, char objname[]);

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    char smallobj_name[16];
    int i, total_bytes_read = 0;

    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "10kobj_%d", i);
        readobj(&io, smallobj_name);

        total_bytes_read += 10000;
        printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
    }

    getchar();
    return 0;
}

void readobj(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    unsigned long bytes_read;
    rados_completion_t completion;
    int retval;

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, 1000000, data, &bytes_read, &retval);
    retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
    assert(retval == 0);

    retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
    assert(retval == 0);

    rados_aio_wait_for_complete(completion);
    rados_aio_get_return_value(completion);
}

$ gcc read.c -lrados -o read_small -Wall -g && ./read_small
Read 10kobj_0 for total 10000
Read 10kobj_1 for total 20000
[...]
Read 10kobj_19998 for total 199990000
Read 10kobj_19999 for total 200000000

We read 200MB. A graph of the resident set size of the program is attached
as mem-graph-10k.png, with seconds on the x axis and KB on the y axis. You
can see that the memory usage increases throughout, which is itself
unexpected, since that memory should be freed over time and we should only
hold 10KB of object data in memory at a time. The rate of growth decreases
and eventually stabilises, and by the end we've used 60MB of RAM.

We repeat this experiment for the 100KB and 1MB objects and find that after
all reads they use 140MB and 500MB of RAM, and memory usage presumably
would continue to grow if there were more objects. This is orders of
magnitude more memory than what I would expect these programs to use.
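
One variable worth isolating before pinning this on the allocator: readobj above never calls rados_release_read_op() or rados_aio_release(), so every iteration leaves behind a live read op handle and a reference-counted completion. A variant that releases both would look like the sketch below (we have not yet re-run the graphs with it, so whether it flattens the curve is an open question):

```c
#include <assert.h>
#include <rados/librados.h>

/* Same as readobj above, but releasing the read op and completion. */
void readobj_release(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    unsigned long bytes_read;
    rados_completion_t completion;
    int retval;

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, 1000000, data, &bytes_read, &retval);
    retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
    assert(retval == 0);

    retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
    assert(retval == 0);

    rados_aio_wait_for_complete(completion);
    rados_aio_get_return_value(completion);

    rados_release_read_op(read_op);  /* free the read op handle */
    rados_aio_release(completion);   /* drop our completion reference */
}
```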

   - We do not get this behaviour with the synchronous API, and the memory
   usage remains stable at just a few MB.
   - We've found that for some reason, this doesn't happen (or doesn't
   happen as severely) if we intersperse large reads with much smaller reads.
   In this case, the memory usage seems to stabilise at a reasonable number.
   - Valgrind only reports a trivial amount of unreachable memory.