[ceph-users] compacting omap doubles its size

2018-11-28 Thread Tomasz Płaza

Hi,

I have a ceph 12.2.8 cluster on filestore with rather large omap dirs 
(avg size is about 150G). Recently slow requests became a problem, so 
after some digging I decided to convert omap from leveldb to rocksdb. 
Conversion went fine and the slow request rate went down to an acceptable
level. Unfortunately, the conversion did not shrink most of the omap dirs, so I
tried online compaction:


Before compaction: 50G    /var/lib/ceph/osd/ceph-0/current/omap/

After compaction: 100G    /var/lib/ceph/osd/ceph-0/current/omap/

Purge and recreate: 1.5G /var/lib/ceph/osd/ceph-0/current/omap/


Before compaction: 135G    /var/lib/ceph/osd/ceph-5/current/omap/

After compaction: 260G    /var/lib/ceph/osd/ceph-5/current/omap/

Purge and recreate: 2.5G /var/lib/ceph/osd/ceph-5/current/omap/


For me, a compaction that makes the omap bigger is quite weird and
frustrating. Please help.
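
(A sketch of the compaction step, for reference; osd.0 and the path are just
examples, and the online variant may not be available on every release:)

# online, if your release supports it:
ceph tell osd.0 compact

# offline, with the OSD stopped:
ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap compact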



P.S. My cluster suffered from ongoing index resharding (it is disabled
now), and on many buckets with 4M+ objects I have a lot of old index instances:


634   bucket1
651   bucket2

...
1231 bucket17
1363 bucket18
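
(Counts like the above can be gathered roughly like this, assuming the
bucket.instance metadata keys are listed as "<bucket>:<instance-id>";
bucket1 is a placeholder:)

radosgw-admin metadata list bucket.instance > instances.txt
grep -c '"bucket1:' instances.txt    # index instances recorded for bucket1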


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MGR Dashboard

2018-11-28 Thread Ashley Merrick
Hey,

After rebooting a server that hosts the MGR Dashboard I am now unable to
get the dashboard module to run.

Upon restarting the mgr service I see the following :

ImportError: No module named ordered_dict
Nov 29 07:13:14 ceph-m01 ceph-mgr[12486]: [29/Nov/2018:07:13:14] ENGINE
Serving on http://:::9283
Nov 29 07:13:14 ceph-m01 ceph-mgr[12486]: [29/Nov/2018:07:13:14] ENGINE Bus
STARTED


I have checked using 'pip install ordereddict' and it states that the module is
already installed.
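
(A rough way to double-check what Python actually sees, assuming ceph-mgr
runs under python2 on this node:)

python2 -c "import ordereddict; print(ordereddict.__file__)"
python2 -c "import collections; print(collections.OrderedDict)"
pip2 show ordereddict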
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] problem on async+dpdk with ceph13.2.0

2018-11-28 Thread 冷镇宇
Hello,

 

I’m trying to find a way to use async+dpdk as the messenger network on Ceph 13.2.0. After
compiling Ceph with DPDK and mounting hugepages, I got a segmentation fault
like this:




 ./bin/ceph-mon -i a -c ceph.conf

EAL:Detected 48 lcore(s)

EAL: No free hugepages reported in hugepages-1048576KB

EAL: Probing VFIO support...

EAL:PCI device :03:00.0 on NUMA socket -1

EAL: probe driver: 8086:1521 net_e1000_igb

 Caught signal (Segmentation fault) **

  in thread 7fbb45511700 thread_name:lcore-slave-1




My kernel is CentOS 3.10.0-862.11.6, the DPDK version is 17.11, and the
ceph.conf looks like this:




ms_cluster_type = async+dpdk

ms_public_type = async+dpdk

public_addr = 10.10.2.24

cluster_addr = 10.10.2.24

ms_async_op_threads = 2

ms_dpdk_coremask = 0xF

ms_dpdk_hugepages = /mnt/huge

ms_dpdk_rx_buffer_count_per_core = 2048

ms_dpdk_memory_channel = 2
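
(For reference, a typical 2 MB hugepage setup looks roughly like this; the
page count is arbitrary and this is not necessarily what was done here:)

echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
grep Huge /proc/meminfo    # check that HugePages_Free is non-zero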

 

Any suggestions?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor ceph cluster performance

2018-11-28 Thread Paul Emmerich
Cody :
>
> > And this exact problem was one of the reasons why we migrated
> > everything to PXE boot where the OS runs from RAM.
>
> Hi Paul,
>
> I totally agree with and admire your diskless approach. If I may ask,
> what kind of OS image do you use? 1GB footprint sounds really small.

It's based on Debian, because Debian makes live boot really easy with
squashfs + overlayfs.
We also have a half-finished CentOS/RHEL-based version somewhere, but
that requires way more RAM because it doesn't use overlayfs (or didn't
when we last checked; I guess we need to check RHEL 8 again).

Current image size is 400 MB + 30 MB for kernel + initrd and it comes
with everything you need for Ceph. We don't even run aggressive
compression on the squashfs, it's just lzo.
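
As a rough illustration of the squashfs step only (not the exact croit build
pipeline), it boils down to something like:

mksquashfs rootfs/ filesystem.squashfs -comp lzo -noappend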

You can test it for yourself in a VM: https://croit.io/croit-virtual-demo

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> On Tue, Nov 27, 2018 at 1:53 PM Paul Emmerich  wrote:
> >
> > And this exact problem was one of the reasons why we migrated
> > everything to PXE boot where the OS runs from RAM.
> > That kind of failure is just the worst to debug...
> > Also, 1 GB of RAM is cheaper than a separate OS disk.
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Tue, 27 Nov 2018 at 19:22, Cody  wrote:
> > >
> > > Hi everyone,
> > >
> > > Many, many thanks to all of you!
> > >
> > > The root cause was a failed OS drive on one storage node. The
> > > server was responsive to ping, but I was unable to log in. After a reboot via
> > > IPMI, the docker daemon failed to start due to I/O errors and dmesg
> > > complained about the failing OS disk. I failed to catch the problem
> > > initially since 'ceph -s' kept showing HEALTH and the cluster was
> > > "functional" despite the slow performance.
> > >
> > > I really appreciate all the tips and advice received from you all and
> > > learned a lot. I will carry your advice (e.g. using bluestore,
> > > enterprise ssd/hdd, separating public and cluster traffic, etc.) into
> > > my next round of PoC.
> > >
> > > Thank you very much!
> > >
> > > Best regards,
> > > Cody
> > >
> > > On Tue, Nov 27, 2018 at 6:31 AM Vitaliy Filippov  
> > > wrote:
> > > >
> > > > > CPU: 2 x E5-2603 @1.8GHz
> > > > > RAM: 16GB
> > > > > Network: 1G port shared for Ceph public and cluster traffics
> > > > > Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
> > > > > OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)
> > > >
> > > > 0.84 MB/s sequential write is impossibly bad; it's not normal with any
> > > > kind of device, even with a 1G network. You probably have some kind of
> > > > problem in your setup - maybe the network RTT is very high, or maybe the
> > > > osd or mon nodes are shared with other running tasks and overloaded, or
> > > > maybe your disks are already dead... :))
> > > >
> > > > > As I moved on to test block devices, I got a following error message:
> > > > >
> > > > > # rbd map image01 --pool testbench --name client.admin
> > > >
> > > > You don't need to map it to run benchmarks, use `fio --ioengine=rbd`
> > > > (however you'll still need /etc/ceph/ceph.client.admin.keyring)
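(For completeness, a minimal fio job of the kind suggested above might look
like this; the pool and image names are taken from the earlier 'rbd map' test:)

fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=testbench \
    --rbdname=image01 --rw=randwrite --bs=4k --iodepth=32 --direct=1 \
    --runtime=60 --time_based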
> > > >
> > > > --
> > > > With best regards,
> > > >Vitaliy Filippov
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Raw space usage in Ceph with Bluestore

2018-11-28 Thread Igor Fedotov

Hi Jody,

yes, this is a known issue.

Indeed, currently 'ceph df detail' reports raw space usage in the GLOBAL
section and 'logical' usage in the POOLS one, and the logical numbers have some flaws.


There is a pending PR targeted to Nautilus to fix that:

https://github.com/ceph/ceph/pull/19454

If you want to do an analysis at exactly the per-pool level, this PR is the
only means AFAIK.



If per-cluster stats are fine then you can also inspect corresponding 
OSD performance counters and sum over all OSDs to get per-cluster info.


This is the most precise but quite inconvenient method for low-level 
per-osd space analysis.


 "bluestore": {
...

   "bluestore_allocated": 655360, # space allocated at BlueStore 
for the specific OSD
    "bluestore_stored": 34768,  # amount of data stored at 
BlueStore for the specific OSD

...

Please note that aggregate numbers built from these counters include
all the replication/EC overhead, and that the difference between
bluestore_stored and bluestore_allocated is due to allocation overhead
and/or applied compression.
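
A rough sketch of such a summation, run on each OSD host (assumes jq is
installed and the default admin socket paths):

total_alloc=0; total_stored=0
for sock in /var/run/ceph/ceph-osd.*.asok; do
    a=$(ceph daemon "$sock" perf dump | jq '.bluestore.bluestore_allocated')
    s=$(ceph daemon "$sock" perf dump | jq '.bluestore.bluestore_stored')
    total_alloc=$((total_alloc + a))
    total_stored=$((total_stored + s))
done
echo "allocated=$total_alloc stored=$total_stored"   # raw, including replication/EC overhead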



Thanks,

Igor


On 11/29/2018 12:27 AM, Glider, Jody wrote:


Hello,

I’m trying to find a way to determine real/physical/raw storage 
capacity usage when storing a similar set of objects in different 
pools, for example a 3-way replicated pool vs. a 4+2 erasure coded 
pool, and in particular how this ratio changes from small (where 
Bluestore block size matters more) to large object sizes.


I find that ceph df detail and rados df don’t report on really-raw 
storage, I guess because they’re perceiving ‘raw’ storage from their 
perspective only. If I write a set of objects to each pool, rados df 
shows the space used as the summation of the logical size of the 
objects, while ceph df detail shows the raw used storage as the object 
size * the redundancy factor (e.g. 3 for 3-way replication and 1.5 for 
4+2 erasure code).


Any suggestions?

Jody Glider, Principal Storage Architect

Cloud Architecture and Engineering, SAP Labs LLC

3412 Hillview Ave (PAL 02 23.357), Palo Alto, CA 94304

E j.gli...@sap.com , T   +1 650-320-3306, M   
+1 650-441-0241




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Raw space usage in Ceph with Bluestore

2018-11-28 Thread Paul Emmerich
You can get all the details from the admin socket of the OSDs:

ceph daemon osd.X perf dump

(must be run on the server the OSD is running on)

Examples of relevant metrics are bluestore_allocated/bluestore_stored and the
bluefs block for metadata.
Running 'perf schema' might give some details on the meaning of the
individual metrics.
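
For example, to pull just those counters out of one OSD (assuming jq is
installed; the bluefs field names may vary slightly by release):

ceph daemon osd.0 perf dump | jq '.bluestore | {bluestore_allocated, bluestore_stored}'
ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes}'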



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Wed, 28 Nov 2018 at 22:28, Glider, Jody  wrote:
>
>
>
> Hello,
>
>
>
> I’m trying to find a way to determine real/physical/raw storage capacity 
> usage when storing a similar set of objects in different pools, for example a 
> 3-way replicated pool vs. a 4+2 erasure coded pool, and in particular how 
> this ratio changes from small (where Bluestore block size matters more) to 
> large object sizes.
>
>
>
> I find that ceph df detail and rados df don’t report on really-raw storage, I 
> guess because they’re perceiving ‘raw’ storage from their perspective only. 
> If I write a set of objects to each pool, rados df shows the space used as 
> the summation of the logical size of the objects, while ceph df detail shows 
> the raw used storage as the object size * the redundancy factor (e.g. 3 for 
> 3-way replication and 1.5 for 4+2 erasure code).
>
>
>
> Any suggestions?
>
>
>
> Jody Glider, Principal Storage Architect
>
> Cloud Architecture and Engineering, SAP Labs LLC
>
> 3412 Hillview Ave (PAL 02 23.357), Palo Alto, CA 94304
>
> E   j.gli...@sap.com, T   +1 650-320-3306, M   +1 650-441-0241
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Raw space usage in Ceph with Bluestore

2018-11-28 Thread Glider, Jody

Hello,

I’m trying to find a way to determine real/physical/raw storage capacity usage 
when storing a similar set of objects in different pools, for example a 3-way 
replicated pool vs. a 4+2 erasure coded pool, and in particular how this ratio 
changes from small (where Bluestore block size matters more) to large object 
sizes.

I find that ceph df detail and rados df don’t report on really-raw storage, I 
guess because they’re perceiving ‘raw’ storage from their perspective only. If 
I write a set of objects to each pool, rados df shows the space used as the 
summation of the logical size of the objects, while ceph df detail shows the 
raw used storage as the object size * the redundancy factor (e.g. 3 for 3-way 
replication and 1.5 for 4+2 erasure code).

Any suggestions?

Jody Glider, Principal Storage Architect
Cloud Architecture and Engineering, SAP Labs LLC
3412 Hillview Ave (PAL 02 23.357), Palo Alto, CA 94304
E   j.gli...@sap.com, T   +1 650-320-3306, M   +1 
650-441-0241



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Swift metadata dropped when S3 bucket versioning enabled

2018-11-28 Thread Maxime Guyot
Hi Florian,

You assumed correctly: the "test" container (private) was created with
"openstack container create test", and then I am using the S3 API to
enable/disable object versioning on it.
I use the following Python snippet to enable/disable S3 bucket versioning:

import boto, boto.s3, boto.s3.connection
conn = boto.connect_s3(aws_access_key_id='***',
aws_secret_access_key='***', host='***', port=8080,
calling_format=boto.s3.connection.OrdinaryCallingFormat())
bucket = conn.get_bucket('test')
bucket.configure_versioning(True) # Or False to disable S3 bucket versioning
bucket.get_versioning_status()
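
(For context, the metadata itself goes in via the Swift API; the openstack
client call quoted below sends roughly the equivalent of the following, with
host and token as placeholders:)

curl -i -X POST \
     -H "X-Auth-Token: $TOKEN" \
     -H "X-Object-Meta-Foo: bar" \
     http://rgw.example.com:8080/swift/v1/test/test2.dat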

> Semi-related: I've seen some interesting things when mucking around with
> a single container/bucket while switching APIs, when it comes to
> container properties and metadata. For example, if you set a public read
> ACL on an S3 bucket, the corresponding Swift container is also
> publicly readable but its read ACL looks empty (i.e. private) when you
> ask via the Swift API.

This can definitely become a problem if Swift API says "private" but data
is actually publicly available.
Since the doc says "S3 and Swift APIs share a common namespace, so you may
write data with one API and retrieve it with the other", it might be useful
to document this kind of limitations somewhere.

Cheers,
/ Maxime

On Wed, 28 Nov 2018 at 17:58 Florian Haas  wrote:

> On 27/11/2018 20:28, Maxime Guyot wrote:
> > Hi,
> >
> > I'm running into an issue with the RadosGW Swift API when the S3 bucket
> > versioning is enabled. It looks like it silently drops any metadata sent
> > with the "X-Object-Meta-foo" header (see example below).
> > This is observed on a Luminous 12.2.8 cluster. Is that a normal thing?
> > Am I misconfiguring something here?
> >
> >
> > With S3 bucket versioning OFF:
> > $ openstack object set --property foo=bar test test.dat
> > $ os object show test test.dat
> > +----------------+----------------------------------+
> > | Field          | Value                            |
> > +----------------+----------------------------------+
> > | account        | v1                               |
> > | container      | test                             |
> > | content-length | 507904                           |
> > | content-type   | binary/octet-stream              |
> > | etag           | 03e8a398f343ade4e1e1d7c81a66e400 |
> > | last-modified  | Tue, 27 Nov 2018 13:53:54 GMT    |
> > | object         | test.dat                         |
> > | properties     | Foo='bar'                        |  <= Metadata is here
> > +----------------+----------------------------------+
> >
> > With S3 bucket versioning ON:
>
> Can you elaborate on what exactly you're doing here to enable S3 bucket
> versioning? Do I assume correctly that you are creating the "test"
> container using the swift or openstack client, then sending a
> VersioningConfiguration request against the "test" bucket, as explained
> in
>
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html#how-to-enable-disable-versioning-intro
> ?
>
> > $ openstack object set --property foo=bar test test2.dat
> > $ openstack object show test test2.dat
> > +----------------+----------------------------------+
> > | Field          | Value                            |
> > +----------------+----------------------------------+
> > | account        | v1                               |
> > | container      | test                             |
> > | content-length | 507904                           |
> > | content-type   | binary/octet-stream              |
> > | etag           | 03e8a398f343ade4e1e1d7c81a66e400 |
> > | last-modified  | Tue, 27 Nov 2018 13:56:50 GMT    |
> > | object         | test2.dat                        | <= Metadata is absent
> > +----------------+----------------------------------+
>
> Semi-related: I've seen some interesting things when mucking around with
> a single container/bucket while switching APIs, when it comes to
> container properties and metadata. For example, if you set a public read
> ACL on an S3 bucket, the corresponding Swift container is also
> publicly readable but its read ACL looks empty (i.e. private) when you
> ask via the Swift API.
>
> Cheers,
> Florian
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW Swift metadata dropped when S3 bucket versioning enabled

2018-11-28 Thread Florian Haas
On 27/11/2018 20:28, Maxime Guyot wrote:
> Hi,
> 
> I'm running into an issue with the RadosGW Swift API when the S3 bucket
> versioning is enabled. It looks like it silently drops any metadata sent
> with the "X-Object-Meta-foo" header (see example below).
> This is observed on a Luminous 12.2.8 cluster. Is that a normal thing?
> Am I misconfiguring something here?
> 
> 
> With S3 bucket versioning OFF:
> $ openstack object set --property foo=bar test test.dat
> $ os object show test test.dat
> +----------------+----------------------------------+
> | Field          | Value                            |
> +----------------+----------------------------------+
> | account        | v1                               |
> | container      | test                             |
> | content-length | 507904                           |
> | content-type   | binary/octet-stream              |
> | etag           | 03e8a398f343ade4e1e1d7c81a66e400 |
> | last-modified  | Tue, 27 Nov 2018 13:53:54 GMT    |
> | object         | test.dat                         |
> | properties     | Foo='bar'                        |  <= Metadata is here
> +----------------+----------------------------------+
> 
> With S3 bucket versioning ON:

Can you elaborate on what exactly you're doing here to enable S3 bucket
versioning? Do I assume correctly that you are creating the "test"
container using the swift or openstack client, then sending a
VersioningConfiguration request against the "test" bucket, as explained
in
https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html#how-to-enable-disable-versioning-intro?

> $ openstack object set --property foo=bar test test2.dat
> $ openstack object show test test2.dat
> +----------------+----------------------------------+
> | Field          | Value                            |
> +----------------+----------------------------------+
> | account        | v1                               |
> | container      | test                             |
> | content-length | 507904                           |
> | content-type   | binary/octet-stream              |
> | etag           | 03e8a398f343ade4e1e1d7c81a66e400 |
> | last-modified  | Tue, 27 Nov 2018 13:56:50 GMT    |
> | object         | test2.dat                        | <= Metadata is absent
> +----------------+----------------------------------+

Semi-related: I've seen some interesting things when mucking around with
a single container/bucket while switching APIs, when it comes to
container properties and metadata. For example, if you set a public read
ACL on an S3 bucket, the corresponding Swift container is also
publicly readable but its read ACL looks empty (i.e. private) when you
ask via the Swift API.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-11-28 Thread Florian Haas
On 28/11/2018 15:52, Mark Nelson wrote:
>> Shifting over a discussion from IRC and taking the liberty to resurrect
>> an old thread, as I just ran into the same (?) issue. I see
>> *significantly* reduced performance on RBD reads, compared to writes
>> with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
>> (with the default 4K I/O size), whereas "rbd bench --io-type write"
>> produces more than twice that.
>>
>> I should probably add that while my end result of doing an "rbd bench
>> --io-type read" is about half of what I get from a write benchmark, the
>> intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
>> write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
>> really, my read IOPS are all over the map (and terrible on average),
>> whereas my write IOPS are not stellar, but consistent.
>>
>> This is an all-bluestore cluster on spinning disks with Luminous, and
>> I've tried the following things:
>>
>> - run rbd bench with --rbd_readahead_disable_after_bytes=0 and
>> --rbd_readahead_max_bytes=4194304 (per
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)
>>
>>
>> - configure OSDs with a larger bluestore_cache_size_hdd (4G; default
>> is 1G)
>>
>> - configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
>> than using 1%/99%/0% for metadata/KV data/objects, the OSDs use
>> 1%/49%/50%
>>
>> None of the above produced any tangible improvement. Benchmark results
>> are at http://paste.openstack.org/show/736314/ if anyone wants to take a
>> look.
>>
>> I'd be curious to see if anyone has a suggestion on what else to try.
>> Thanks in advance!
> 
> 
> Hi Florian,

Hi Mark, thanks for the speedy reply!

> By default bluestore will cache buffers on reads but not on writes
> (unless there are hints):
> 
> 
> Option("bluestore_default_buffered_read", Option::TYPE_BOOL,
> Option::LEVEL_ADVANCED)
>     .set_default(true)
>     .set_flag(Option::FLAG_RUNTIME)
>     .set_description("Cache read results by default (unless hinted
> NOCACHE or WONTNEED)"),
> 
>     Option("bluestore_default_buffered_write", Option::TYPE_BOOL,
> Option::LEVEL_ADVANCED)
>     .set_default(false)
>     .set_flag(Option::FLAG_RUNTIME)
>     .set_description("Cache writes by default (unless hinted NOCACHE or
> WONTNEED)"),
> 
> 
> This is one area where bluestore is a lot more confusing for users than
> filestore was.  There was a lot of concern about enabling buffer cache
> on writes by default because there's some associated overhead
> (potentially both during writes and in the mempool thread when trimming
> the cache).  It might be worth enabling bluestore_default_buffered_write
> and see if it helps reads.

So yes this is rather counterintuitive, but I happily gave it a shot and
the results are... more head-scratching than before. :)

The output is here: http://paste.openstack.org/show/736324/

In summary:

1. Write benchmark is in the same ballpark as before (good).

2. Read benchmark *without* readahead is *way* better than before
(splendid!) but has a weird dip down to 9K IOPS that I find
inexplicable. Any ideas on that?

3. Read benchmark *with* readahead is still abysmal, which I also find
rather odd. What do you think about that one?

4. Rerunning the benchmark without readahead is slow at first and then
speeds up to where it was before, but is nowhere near as consistent,
even towards the end of the benchmark run.

I do much appreciate your continued insight, thanks a lot!

Cheers,
Florian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-11-28 Thread Mark Nelson


On 11/28/18 8:36 AM, Florian Haas wrote:

On 14/08/2018 15:57, Emmanuel Lacour wrote:

On 13/08/2018 at 16:58, Jason Dillaman wrote:

See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only
attempt to cache its key/value store and metadata.

I suppose so too, because the default ratio is to cache as much k/v as
possible, up to 512M, and the hdd cache is 1G by default.

I tried to increase the hdd cache up to 4G and it seems to be used; 4 osd
processes use 20GB now.


In general, however, I would think that attempting to have bluestore
cache data is just an attempt to optimize to the test instead of
actual workloads. Personally, I think it would be more worthwhile to
just run 'fio --ioengine=rbd' directly against a pre-initialized image
after you have dropped the cache on the OSD nodes.

So with bluestore, I assume that we need to think more about the client page
cache (at least when using a VM), whereas with old filestore both the osd and
client caches were used.
  
For benchmarking, I did a real benchmark here for the expected app workload
of this new cluster and it's ok for us :)


Thanks for your help Jason.

Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.

I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.

This is an all-bluestore cluster on spinning disks with Luminous, and
I've tried the following things:

- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)

- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)

- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%

None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.

I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!



Hi Florian,


By default bluestore will cache buffers on reads but not on writes 
(unless there are hints):



Option("bluestore_default_buffered_read", Option::TYPE_BOOL, 
Option::LEVEL_ADVANCED)

    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted 
NOCACHE or WONTNEED)"),


    Option("bluestore_default_buffered_write", Option::TYPE_BOOL, 
Option::LEVEL_ADVANCED)

    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or 
WONTNEED)"),



This is one area where bluestore is a lot more confusing for users than 
filestore was.  There was a lot of concern about enabling buffer cache 
on writes by default because there's some associated overhead 
(potentially both during writes and in the mempool thread when trimming 
the cache).  It might be worth enabling bluestore_default_buffered_write 
and seeing if it helps reads.  You'll probably also want to pay attention 
to writes though.  I think we might want to consider enabling it by 
default but we should go through and do a lot of careful testing first. 
FWIW I did have it enabled when testing the new memory target code (and 
the not-yet-merged age-binned autotuning).  It was doing OK in my tests, 
but I didn't do an apples-to-apples comparison with it off.
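
If someone wants to experiment with it, a sketch of how to flip it (it is
marked FLAG_RUNTIME above, so injectargs should take effect without a
restart, but verify on your version):

# persistent, in ceph.conf on the OSD nodes:
[osd]
bluestore default buffered write = true

# or at runtime:
ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'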



Mark




Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

2018-11-28 Thread Florian Haas
On 14/08/2018 15:57, Emmanuel Lacour wrote:
> On 13/08/2018 at 16:58, Jason Dillaman wrote:
>>
>> See [1] for ways to tweak the bluestore cache sizes. I believe that by
>> default, bluestore will not cache any data but instead will only
>> attempt to cache its key/value store and metadata.
> 
> I suppose too because default ratio is to cache as much as possible k/v
> up to 512M and hdd cache is 1G by default.
> 
> I tried to increase hdd cache up to 4G and it seems to be used, 4 osd
> processes uses 20GB now.
> 
>> In general, however, I would think that attempting to have bluestore
>> cache data is just an attempt to optimize to the test instead of
>> actual workloads. Personally, I think it would be more worthwhile to
>> just run 'fio --ioengine=rbd' directly against a pre-initialized image
>> after you have dropped the cache on the OSD nodes.
> 
> So with bluestore, I assume that we need to think more of client page
> cache (at least when using a VM)  when with old filestore both osd and
> client cache were used.
>  
> For benchmark, I did real benchmark here for the expected app workload
> of this new cluster and it's ok for us :)
> 
> 
> Thanks for your help Jason.

Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.

I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.

This is an all-bluestore cluster on spinning disks with Luminous, and
I've tried the following things:

- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)

- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)

- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%

None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.
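
(A sketch of that kind of readahead invocation, with a placeholder pool/image
spec; not necessarily the exact command used here:)

rbd bench --io-type read --io-size 4096 --io-threads 16 \
    --rbd_readahead_disable_after_bytes=0 \
    --rbd_readahead_max_bytes=4194304 \
    mypool/myimage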

I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rwg/civetweb log verbosity level

2018-11-28 Thread Casey Bodley
This stuff is logged under the 'civetweb' subsystem, so it can be turned 
off with 'debug_civetweb = 0'. You can configure 'debug_rgw' separately.
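
A sketch of what that looks like in ceph.conf (the section name matches
whatever your rgw instance is called, e.g. the one from the log path quoted
below; the debug_rgw value is just an example):

[client.rgw.node-1]
debug_civetweb = 0/0
debug_rgw = 1/5

# or at runtime via the admin socket on the rgw node:
ceph daemon client.rgw.node-1 config set debug_civetweb 0/0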


On 11/28/18 1:03 AM, zyn赵亚楠 wrote:


Hi there,

I have a question about rgw/civetweb log settings.

Currently, rgw/civetweb prints 3 lines of logs at log level 1 (high 
priority) for each HTTP request, like the following:


$ tail /var/log/ceph/ceph-client.rgw.node-1.log

2018-11-28 11:52:45.339229 7fbf2d693700  1 == starting new request 
req=0x7fbf2d68d190 =


2018-11-28 11:52:45.341961 7fbf2d693700  1 == req done 
req=0x7fbf2d68d190 op status=0 http_status=200 ==


2018-11-28 11:52:45.341993 7fbf2d693700  1 civetweb: 0x558f0433: 
127.0.0.1 - - [28/Nov/2018:11:48:10 +0800] "HEAD 
/swift/v1/images.xxx.com/8801234/BFAB307D-F5FE-4BC6-9449-E854944A460F_160_180.jpg 
HTTP/1.1" 1 0 - goswift/1.0


The above 3 lines occupy roughly 0.5 KB of space on average, varying a 
little with the lengths of bucket names and object names.


Now the problem is, when requests are intensive, this consumes a 
huge amount of space. For example, 4 million requests (on a single RGW 
node) result in 2GB of logs, which takes only ~6 hours to happen on a 
cluster node of ours during busy periods (a large part may be HEAD requests).


When troubleshooting, I usually need to turn the log level up to 5, 10 or 
even higher to check the detailed logs, but most of the log space is 
occupied by the above access logs (level 1), which don't provide 
much information.


My question is: is there a way to configure Ceph to skip those logs? E.g. 
only print logs with verbosity in a specified range (NOT supported, 
according to my investigation).


Or, are there any suggested ways for turning on more logs for debugging?

Best Regards

Arthur Chiao


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph IO stability issues

2018-11-28 Thread Smith, Eric
There are a couple of things I would look into:

  *   Any packet loss whatsoever – especially on your cluster / private replication network
  *   Test against an R3 (3x replicated) pool to see if EC on RBD with overwrites is the culprit (rough sketch below)
  *   Check to see what processes are in the “R” state during high iowait times

Those would be my next steps
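
For the second point, a rough sketch (pool name, PG count and sizes are
arbitrary):

ceph osd pool create rbdtest-r3 64 64 replicated
ceph osd pool application enable rbdtest-r3 rbd
rbd create rbdtest-r3/bench --size 10G
rbd bench --io-type write --io-size 4096 --io-threads 16 rbdtest-r3/bench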
Eric

From: ceph-users  On Behalf Of Jean-Philippe 
Méthot
Sent: Tuesday, November 27, 2018 11:48 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph IO stability issues

Hi,

We’re currently in the process of pushing a Ceph Mimic cluster into production and 
we’ve noticed a fairly strange behaviour. We use Ceph as a storage backend for 
OpenStack block devices. We’ve deployed a few VMs on this backend to test 
the waters. These VMs are practically empty, with only the regular cPanel 
services running on them and no actual websites set up. We notice that about twice 
in a span of about 5 minutes, the iowait will jump to ~10% without any VM-side 
explanation; no specific service is taking any more IO bandwidth than usual.

I must also add that the speed of the cluster is excellent. It’s really more of 
a stability issue that bothers me here. I see the jump in iowait as the VM 
being unable to read or write on the ceph cluster for a second or so. I've 
considered that it could be the deep scrub operations, but those seem to 
complete in 0.1 second, as there’s practically no data to scrub.

The cluster pool configuration is as such:
-RBD on erasure-coded pool (a replicated metadata pool and an erasure coded 
data pool) with overwrites enabled
-The data pool size is k=6 m=2, so 8, with 1024 PGs
-The metadata pool size is 3, with 64 PGs


Of course, this is running on bluestore.
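
(For context, the pools were created along the usual lines for EC-backed RBD,
roughly as follows; the names, image and sizes are illustrative, not the exact
commands used here:)

ceph osd erasure-code-profile set ec-6-2 k=6 m=2 crush-failure-domain=host
ceph osd pool create rbd-data 1024 1024 erasure ec-6-2
ceph osd pool set rbd-data allow_ec_overwrites true
ceph osd pool application enable rbd-data rbd
ceph osd pool create rbd-metadata 64 64 replicated
ceph osd pool application enable rbd-metadata rbd
rbd create --size 20G --data-pool rbd-data rbd-metadata/vm-disk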
As for the hardware, the config is as follow:
-10 hosts
-9 OSD per host
-Each OSD is a Intel DC S3510
-CPUs are dual E5-2680v2 (40 threads total @2.8GHz)
-Each host has 128 GB of ram
-Network is 2x bonded 10gbps, 1 for storage, 1 for replication

I understand that I will eventually hit a performance ceiling because of either the 
CPUs or the network, but maximum speed is not my current concern here and the 
hardware can be upgraded when needed. I’ve been wondering: could these hiccups be 
caused by data caching at the client level? If so, what could I do to fix this?

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com