Re: [ceph-users] Cephfs on an EC Pool - What determines object size

2019-04-29 Thread Gregory Farnum
Yes, check out the file layout options:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
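
As a sketch (the mount point, directory and size below are placeholders),
the object size for files created under a directory can be changed through
the layout xattrs:

  setfattr -n ceph.dir.layout.object_size -v 8388608 /mnt/cephfs/mydir   # 8 MiB objects
  getfattr -n ceph.dir.layout /mnt/cephfs/mydir                          # read the layout back

Note that a directory layout only applies to files created after it is set;
existing files keep the layout they were written with.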

On Mon, Apr 29, 2019 at 3:32 PM Daniel Williams  wrote:
>
> Is the 4MB configurable?
>
> On Mon, Apr 29, 2019 at 4:36 PM Gregory Farnum  wrote:
>>
> >> CephFS automatically chunks files into 4MB objects by default. For
>> an EC pool, RADOS internally will further subdivide them based on the
>> erasure code and striping strategy, with a layout that can vary. But
>> by default if you have eg an 8+3 EC code, you'll end up with a bunch
>> of (4MB/8=)512KB objects within the OSD.
>> -Greg
>>
>> On Sun, Apr 28, 2019 at 12:42 PM Daniel Williams  wrote:
>> >
>> > Hey,
>> >
> >> > What controls / determines the object size of a purely CephFS EC (6+3) pool?
> >> > I have large files but seemingly small objects.
>> >
>> > Daniel
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs on an EC Pool - What determines object size

2019-04-29 Thread Daniel Williams
Is the 4MB configurable?

On Mon, Apr 29, 2019 at 4:36 PM Gregory Farnum  wrote:

> CephFS automatically chunks files into 4MB objects by default. For
> an EC pool, RADOS internally will further subdivide them based on the
> erasure code and striping strategy, with a layout that can vary. But
> by default if you have eg an 8+3 EC code, you'll end up with a bunch
> of (4MB/8=)512KB objects within the OSD.
> -Greg
>
> On Sun, Apr 28, 2019 at 12:42 PM Daniel Williams 
> wrote:
> >
> > Hey,
> >
> > What controls / determines the object size of a purely CephFS EC (6+3) pool?
> I have large files but seemingly small objects.
> >
> > Daniel
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sanity check on unexpected data movement

2019-04-29 Thread Graham Allan
Now that I dig into this, I can see in the exported crush map that the 
choose_args weight_set for this bucket id is zero for the 9th member 
(which I assume corresponds to the evacuated node-98).



rack even01 {
id -10  # do not change unnecessarily
id -14 class ssd# do not change unnecessarily
id -18 class hdd# do not change unnecessarily
# weight 132.502
alg straw2
hash 0  # rjenkins1
item node-08 weight 12.912
item node-02 weight 25.619
item node-04 weight 12.912
item node-06 weight 12.912
item node-10 weight 12.912
item node-12 weight 12.912
item node-14 weight 12.912
item node-16 weight 12.912
item node-98 weight 16.500
}

...

# choose_args
choose_args 18446744073709551615 {
  {

...

  {
bucket_id -18
weight_set [
  [ 9.902 27.027 10.661 10.344 10.558 10.766 10.622 9.728 0.000 ]
]
  }

...

I assume it wasn't set to zero until recently, as it was holding data... 
I wonder what caused it to change?


Presumably I can edit the crush map manually to correct this at least 
approximately to get things working better.


Changing the value from 0.000 to (guess) 12.000, recompiling the map, 
and testing with
"crushtool --test -i crush.map --show-utilization-all ..." does show 
things being stored again on the affected devices...
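
(For reference, the full manual cycle described here is roughly the
following sketch -- file names are arbitrary:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # hand-edit the weight_set entry in crush.txt
  crushtool -c crush.txt -o crush.new
  crushtool --test -i crush.new --show-utilization
  ceph osd setcrushmap -i crush.new

with setcrushmap being the step that would actually apply the edited map
to the cluster.)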


Even more mysterious though: I rebooted the node-98 (why not, it was no 
longer hosting any data), and after it returned, I saw that its 
choose_args value had magically changed:



  {
bucket_id -18
weight_set [
  [ 9.902 27.027 10.661 10.344 10.558 10.766 10.622 9.728 16.450 ]
]
  }


and data is moving back. I love it when things "fix themselves" without 
apparent cause!


Graham

On 4/29/19 12:12 PM, Graham Allan wrote:
I think I need a second set of eyes to understand some unexpected data 
movement when adding new OSDs to a cluster (Luminous 12.2.11).


Our cluster ran low on space sooner than expected; so as a stopgap I 
recommissioned a couple of older storage nodes while we get new hardware 
purchases under way.


I spent a little time running drive tests and weeding out any weaklings
before creating new OSDs... because of this, one node was ready before
the other. Each has 30 HDDs/OSDs.


So for the first node I introduced the new OSDs by increasing their 
crush weight gradually to the final value (0.55 in steps of 0.1 - the 
values don't make much sense relative to hdd capacity but that's 
historic). We never had more than ~2% of pgs misplaced at any one time. 
All went well, the new OSDs acquired pgs in the expected proportions and 
the space crunch was mitigated.


Then I started adding the second node - first setting its osds to crush 
weight 0.1. All of a sudden, bam, 14-15% of pgs were misplaced! This 
didn't make any sense to me - what seems to have happened is that ceph 
evacuated almost all data from the previous new node. I just don't 
understand this given the osd crush weights...


What might cause this? The output of "ceph df tree" is below; the first 
new node is "node-98", the second is "node-99". Is there anything 
obvious I could be missing?


One note: we almost certainly need more PGs to improve the data
distribution, but it seems too risky to change that until more space is
available.


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Adrien Gillard
I would add that the use of cache tiering, though still possible, is not
recommended and comes with its own challenges.

On Mon, Apr 29, 2019 at 11:49 AM Igor Podlesny  wrote:

> On Mon, 29 Apr 2019 at 16:19, Rainer Krienke 
> wrote:
> [...]
> > - Do I still (nautilus) need two pools for EC based RBD images, one EC
> > data pool and a second replicated pool for metadatata?
>
> The answer is given at
>
> http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coding-with-overwrites
> "...
> Erasure coded pools do not support omap, so to use them with RBD and
> CephFS you must instruct them to store their data in an ec pool, and
> their metadata in a replicated pool
> ..."
>
> Another option is using tiered pools, specially when you can dedicate
> fast OSDs for that:
>
>
> http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coded-pool-and-cache-tiering
>
> --
> End of message. Next message?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

2019-04-29 Thread Gregory Farnum
Glad you got it working and thanks for the logs! Looks like we've seen
this once or twice before so I added them to
https://tracker.ceph.com/issues/38724.
-Greg

On Fri, Apr 26, 2019 at 5:52 PM Elise Burke  wrote:
>
> Thanks for the pointer to ceph-objectstore-tool, it turns out that removing 
> and exporting the PG from all three disks was enough to make it boot! I've 
> exported the three copies of the bad PG, let me know if you'd like me to 
> upload them anywhere for inspection.
>
> All data has been recovered (since I was originally removing the pool that 
> contained pg 25.0 anyway) and all systems are go on my end. Ceph's 
> architecture is very solid.
>
>
> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --op 
> export-remove --pgid 25.0 --file pg_25_0_from_osd_6.bin
> Exporting 25.0 info 25.0( v 7592'106 (0'0,7592'106] lb MIN (bitwise) 
> local-lis/les=7488/7489 n=9 ec=5191/5191 lis/c 7488/7488 les/c/f 7489/7489/0 
> 7593/7593/7593)
> Export successful
>  marking collection for removal
> setting '_remove' omap key
> finish_remove_pgs 25.0_head removing 25.0
> Remove successful
>
>
> On Fri, Apr 26, 2019 at 8:33 PM Elise Burke  wrote:
>>
>> Using ceph-objectstore-info on PG 25.0 (which indeed, was the one I remember 
>> having the error) shows this:
>>
>> struct_v 10
>> {
>> "pgid": "25.0",
>> "last_update": "7592'106",
>> "last_complete": "7592'106",
>> "log_tail": "0'0",
>> "last_user_version": 106,
>> "last_backfill": "MIN",
>> "last_backfill_bitwise": 1,
>> "purged_snaps": [],
>> "history": {
>> "epoch_created": 5191,
>> "epoch_pool_created": 5191,
>> "last_epoch_started": 7489,
>> "last_interval_started": 7488,
>> "last_epoch_clean": 7489,
>> "last_interval_clean": 7488,
>> "last_epoch_split": 0,
>> "last_epoch_marked_full": 0,
>> "same_up_since": 7593,
>> "same_interval_since": 7593,
>> "same_primary_since": 7593,
>> "last_scrub": "7592'106",
>> "last_scrub_stamp": "2019-04-25 21:34:52.079721",
>> "last_deep_scrub": "7485'70",
>> "last_deep_scrub_stamp": "2019-04-22 10:15:40.532014",
>> "last_clean_scrub_stamp": "2019-04-19 14:52:44.047548"
>> },
>> "stats": {
>> "version": "7592'105",
>> "reported_seq": "2621",
>> "reported_epoch": "7592",
>> "state": "active+clean+inconsistent",
>> "last_fresh": "2019-04-25 20:02:55.620028",
>> "last_change": "2019-04-24 19:52:45.072473",
>> "last_active": "2019-04-25 20:02:55.620028",
>> "last_peered": "2019-04-25 20:02:55.620028",
>> "last_clean": "2019-04-25 20:02:55.620028",
>> "last_became_active": "2019-04-22 17:55:37.578239",
>> "last_became_peered": "2019-04-22 17:55:37.578239",
>> "last_unstale": "2019-04-25 20:02:55.620028",
>> "last_undegraded": "2019-04-25 20:02:55.620028",
>> "last_fullsized": "2019-04-25 20:02:55.620028",
>> "mapping_epoch": 7593,
>> "log_start": "0'0",
>> "ondisk_log_start": "0'0",
>> "created": 5191,
>> "last_epoch_clean": 7489,
>> "parent": "0.0",
>> "parent_split_bits": 0,
>> "last_scrub": "7592'88",
>> "last_scrub_stamp": "2019-04-24 19:52:45.072367",
>> "last_deep_scrub": "7485'70",
>> "last_deep_scrub_stamp": "2019-04-22 10:15:40.532014",
>> "last_clean_scrub_stamp": "2019-04-19 14:52:44.047548",
>> "log_size": 105,
>> "ondisk_log_size": 105,
>> "stats_invalid": false,
>> "dirty_stats_invalid": false,
>> "omap_stats_invalid": false,
>> "hitset_stats_invalid": false,
>> "hitset_bytes_stats_invalid": false,
>> "pin_stats_invalid": false,
>> "manifest_stats_invalid": false,
>> "snaptrimq_len": 0,
>> "stat_sum": {
>> "num_bytes": 0,
>> "num_objects": 9,
>> "num_object_clones": 0,
>> "num_object_copies": 27,
>> "num_objects_missing_on_primary": 0,
>> "num_objects_missing": 0,
>> "num_objects_degraded": 0,
>> "num_objects_misplaced": 0,
>> "num_objects_unfound": 0,
>> "num_objects_dirty": 9,
>> "num_whiteouts": 0,
>> "num_read": 87,
>> "num_read_kb": 87,
>> "num_write": 98,
>> "num_write_kb": 98,
>> "num_scrub_errors": 0,
>> "num_shallow_scrub_errors": 0,
>> "num_deep_scrub_errors": 0,
>> "num_objects_recovered": 0,
>> "num_bytes_recovered": 0,
>> "num_keys_recovered": 0,
>> "num_objects_omap": 9,
>> "num_objects_hit_set_archive": 0,
>> "num_bytes_hit_set_archive": 0,
>> "num_flush": 0,
>> "num_flush_kb": 0,
>> "num_evict": 0,
>>   

Re: [ceph-users] Cephfs on an EC Pool - What determines object size

2019-04-29 Thread Gregory Farnum
CephFS automatically chunks files into 4MB objects by default. For
an EC pool, RADOS internally will further subdivide them based on the
erasure code and striping strategy, with a layout that can vary. But
by default if you have eg an 8+3 EC code, you'll end up with a bunch
of (4MB/8=)512KB objects within the OSD.
-Greg
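
As a quick way to see these numbers on a live cluster (the pool listing
below is generic; the file path is only a placeholder): the data pool's
stripe_width shows up in the pool listing, and a file's layout (including
object_size) can be read back via its xattr:

  ceph osd pool ls detail
  getfattr -n ceph.file.layout /mnt/cephfs/some-file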

On Sun, Apr 28, 2019 at 12:42 PM Daniel Williams  wrote:
>
> Hey,
>
> What controls / determines the object size of a purely CephFS EC (6+3) pool?
> I have large files but seemingly small objects.
>
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v14.2.1 Nautilus released

2019-04-29 Thread Abhishek
We're happy to announce the first bug fix release of the Ceph Nautilus
release series. We recommend all Nautilus users upgrade to this release.
For upgrading from older releases of Ceph, the general upgrade guidelines
for Nautilus must be followed.

Notable Changes
---

* The default value for `mon_crush_min_required_version` has been
  changed from `firefly` to `hammer`, which means the cluster will
  issue a health warning if your CRUSH tunables are older than hammer.
  There is generally a small (but non-zero) amount of data that will
  move around by making the switch to hammer tunables; for more
  information, see :ref:`crush-map-tunables`.

  If possible, we recommend that you set the oldest allowed client to
  `hammer` or later. You can tell what the current oldest allowed client
  is with::

    ceph osd dump | grep min_compat_client

  If the current value is older than hammer, you can tell whether it
  is safe to make this change by verifying that there are no clients
  older than hammer currently connected to the cluster with::

    ceph features
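
  If it is safe, the oldest allowed client can then be raised with, for
  example (run this only after confirming no pre-hammer clients are
  connected)::

    ceph osd set-require-min-compat-client hammer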

  The newer `straw2` CRUSH bucket type was introduced in hammer, and
  ensuring that all clients are hammer or newer allows new features
  only supported for `straw2` buckets to be used, including the
  `crush-compat` mode for the :ref:`balancer`.

* Ceph now packages python bindings for python3.6 instead of
  python3.4, because EPEL7 recently switched from python3.4 to
  python3.6 as the native python3. See the `announcement `_
  for more details on the background of this change.

Known Issues


* Nautilus-based librbd clients cannot open images stored on
  pre-Luminous clusters.


For a detailed changelog please refer to the official release notes
entry at the ceph blog 
https://ceph.com/releases/v14-2-1-nautilus-released/


Getting ceph:

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.1.tar.gz
* For packages, see
http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: d555a9489eb35f84f2e1ef49b77e19da9d113972


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does CEPH calculates PGs per OSD for erasure coded (EC) pools?

2019-04-29 Thread Christian Wuerdig
On Sun, 28 Apr 2019 at 21:45, Igor Podlesny  wrote:

> On Sun, 28 Apr 2019 at 16:14, Paul Emmerich 
> wrote:
> > Use k+m for PG calculation, that value also shows up as "erasure size"
> > in ceph osd pool ls detail
>
> So does it mean that for PG calculation those 2 pools are equivalent:
>
> 1) EC(4, 2)
> 2) replicated, size 6
>

Correct


>
> ? Sounds weird to be honest. Replicated with size 6 means each piece of
> logical data is stored 6 times, so what needed a single PG now requires 6
> PG placements. And with EC(4, 2) there's still only 1.5x overhead in terms
> of raw occupied space -- how come the PG calculation needs adjusting to 6
> instead of 1.5 then?
>

A single logical data unit (an object in ceph terms) will be allocated to a
single PG. For a replicated pool of size n this PG will simply be stored on
n OSDs. For an EC(k+m) pool this PG will get stored on k+m OSDs with the
difference that this single PG will contain different parts of the data on
the different OSDs.
http://docs.ceph.com/docs/master/architecture/#erasure-coding provides a
good overview on how this is actually achieved.
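
As a rough worked example (the numbers are made up and assume this pool
holds most of the cluster's data): with a target of roughly 100 PGs per
OSD,

  pg_num ~= (num_OSDs * 100) / (k + m)
          = (60 * 100) / 6 = 1000, rounded to the nearest power of two: 1024

which is the same arithmetic you would use for a replicated pool of size 6.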


> Also, why does CEPH documentation say "It is equivalent to a
> replicated pool of size __two__" when describing EC(2, 1) example?
>

This relates to fault tolerance. A replicated pool of size 2 can lose one
OSD without data loss, and so can an EC(2+1) pool.


>
> --
> End of message. Next message?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sanity check on unexpected data movement

2019-04-29 Thread Graham Allan
I think I need a second set of eyes to understand some unexpected data 
movement when adding new OSDs to a cluster (Luminous 12.2.11).


Our cluster ran low on space sooner than expected; so as a stopgap I 
recommissioned a couple of older storage nodes while we get new hardware 
purchases under way.


I spent a little time running drive tests and weeding out any weaklings
before creating new OSDs... because of this, one node was ready before
the other. Each has 30 HDDs/OSDs.


So for the first node I introduced the new OSDs by increasing their 
crush weight gradually to the final value (0.55 in steps of 0.1 - the 
values don't make much sense relative to hdd capacity but that's 
historic). We never had more than ~2% of pgs misplaced at any one time. 
All went well, the new OSDs acquired pgs in the expected proportions and 
the space crunch was mitigated.


Then I started adding the second node - first setting its osds to crush 
weight 0.1. All of a sudden, bam, 14-15% of pgs were misplaced! This 
didn't make any sense to me - what seems to have happened is that ceph 
evacuated almost all data from the previous new node. I just don't 
understand this given the osd crush weights...


What might cause this? The output of "ceph df tree" is below; the first 
new node is "node-98", the second is "node-99". Is there anything 
obvious I could be missing?


One note: we almost certainly need more PGs to improve the data
distribution, but it seems too risky to change that until more space is
available.


Thanks for any ideas, Graham

ID  CLASS WEIGHTREWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS TYPE NAME
 -2   267.21521- 4.69PiB 3.30PiB 1.39PiB 00   - root default 
-10   132.50159- 2.24PiB 1.53PiB  726TiB 68.40 0.97   - rack even01  
-5025.61932-  440TiB  345TiB 95.4TiB 78.32 1.11   - host node-02 
328   hdd   0.36699  1.0 7.28TiB 5.41TiB 1.87TiB 74.33 1.06  46 osd.328  
329   hdd   0.36699  1.0 7.28TiB 5.59TiB 1.69TiB 76.80 1.09  49 osd.329  
330   hdd   0.36699  1.0 7.28TiB 6.12TiB 1.16TiB 84.11 1.20  50 osd.330  
331   hdd   0.36699  1.0 7.28TiB 5.38TiB 1.90TiB 73.96 1.05  46 osd.331  
332   hdd   0.36699  1.0 7.28TiB 5.47TiB 1.80TiB 75.23 1.07  47 osd.332  
333   hdd   0.36699  1.0 7.28TiB 5.47TiB 1.81TiB 75.11 1.07  46 osd.333  
334   hdd   0.36699  1.0 7.28TiB 5.70TiB 1.58TiB 78.28 1.11  49 osd.334  
335   hdd   0.36699  0.7 7.28TiB 6.85TiB  442GiB 94.08 1.34  55 osd.335  
336   hdd   0.36699  1.0 7.28TiB 5.25TiB 2.03TiB 72.14 1.03  44 osd.336  
337   hdd   0.36699  1.0 7.28TiB 5.82TiB 1.46TiB 79.92 1.14  48 osd.337  
338   hdd   0.36699  1.0 7.28TiB 4.81TiB 2.47TiB 66.08 0.94  39 osd.338  
339   hdd   0.36699  1.0 7.28TiB 5.49TiB 1.78TiB 75.50 1.07  46 osd.339  
340   hdd   0.36699  1.0 7.28TiB 5.60TiB 1.68TiB 76.89 1.09  47 osd.340  
341   hdd   0.36699  1.0 7.28TiB 5.26TiB 2.02TiB 72.24 1.03  45 osd.341  
342   hdd   0.36699  1.0 7.28TiB 5.71TiB 1.57TiB 78.45 1.11  48 osd.342  
343   hdd   0.36699  1.0 7.28TiB 6.34TiB  958GiB 87.15 1.24  53 osd.343  
344   hdd   0.36699  1.0 7.28TiB 6.26TiB 1.02TiB 85.98 1.22  51 osd.344  
345   hdd   0.36699  1.0 7.28TiB 5.28TiB 1.99TiB 72.60 1.03  46 osd.345  
346   hdd   0.36699  1.0 7.28TiB 4.94TiB 2.33TiB 67.93 0.97  41 osd.346  
347   hdd   0.36699  1.0 7.28TiB 5.38TiB 1.90TiB 73.89 1.05  45 osd.347  
348   hdd   0.36699  1.0 7.28TiB 5.35TiB 1.92TiB 73.57 1.05  44 osd.348  
349   hdd   0.36699  0.8 7.28TiB 6.65TiB  642GiB 91.39 1.30  53 osd.349  
350   hdd   0.36699  1.0 7.28TiB 5.18TiB 2.09TiB 71.24 1.01  45 osd.350  
351   hdd   0.36699  1.0 7.28TiB 5.62TiB 1.66TiB 77.25 1.10  47 osd.351  
352   hdd   0.36699  1.0 7.28TiB 5.79TiB 1.49TiB 79.56 1.13  48 osd.352  
353   hdd   0.36699  1.0 7.28TiB 6.18TiB 1.10TiB 84.89 1.21  53 osd.353  
354   hdd   0.36699  1.0 7.28TiB 5.70TiB 1.58TiB 78.29 1.11  49 osd.354  
355   hdd   0.36699  1.0 7.28TiB 6.12TiB 1.16TiB 84.12 1.20  52 osd.355  
356   hdd   0.36699  1.0 7.28TiB 5.67TiB 1.61TiB 77.89 1.11  45 osd.356  
357   hdd   0.36699  1.0 7.28TiB 5.94TiB 1.33TiB 81.67 1.16  49 osd.357  
358   hdd   0.36699  1.0 7.28TiB 6.58TiB  713GiB 90.44 1.29  55 osd.358  
359   hdd   0.36699  1.0 7.28TiB 6.02TiB 1.26TiB 82.71 1.18  51 osd.359  
360   hdd   0.36699  1.0 7.28TiB 5.70TiB 1.58TiB 78.27 1.11  47 

[ceph-users] obj_size_info_mismatch error handling

2019-04-29 Thread Reed Dier
Hi list,

Woke up this morning to two PGs reporting scrub errors, in a way that I
haven't seen before.
> $ ceph versions
> {
> "mon": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
> },
> "mgr": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
> },
> "osd": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156
> },
> "mds": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 2
> },
> "overall": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156,
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 8
> }
> }


> OSD_SCRUB_ERRORS 8 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]

Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
> {
> "epoch": 134582,
> "inconsistents": [
> {
> "object": {
> "name": "10008536718.",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 0
> },
> "errors": [],
> "union_shard_errors": [
> "obj_size_info_mismatch"
> ],
> "shards": [
> {
> "osd": 7,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0x",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> }
> },
> {
> "osd": 16,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0x",
> "expected_object_size": 0,
> 

[ceph-users] adding crush ruleset

2019-04-29 Thread Luis Periquito
Hi,

I need to add a more complex crush ruleset to a cluster and was trying
to script that as I'll need to do it often.

Is there any way to create these other than manually editing the crush map?

This is to create a k=4 + m=2 pool across 3 rooms, with 2 shards in each
room. The ruleset would be something like this (I haven't tried/tuned the
rule yet):

rule xxx {
   type erasure
   min_size 5
   max_size 6
   step set_chooseleaf_tries 5
   step set_choose_tries 100
   step take default
   step choose indep 3 type room
   step chooseleaf indep 2 type host
}
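
One way to sanity-check a rule like this before injecting it (a sketch;
the rule id of 1 and the replica count of 6 are assumptions) is to compile
the edited map and ask crushtool for test mappings:

  crushtool -c crush.txt -o crush.new
  crushtool -i crush.new --test --rule 1 --num-rep 6 --show-mappings
  ceph osd setcrushmap -i crush.new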

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Igor Podlesny
On Mon, 29 Apr 2019 at 16:19, Rainer Krienke  wrote:
[...]
> - Do I still (nautilus) need two pools for EC based RBD images, one EC
> data pool and a second replicated pool for metadatata?

The answer is given at
http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coding-with-overwrites
"...
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool
..."

Another option is using tiered pools, specially when you can dedicate
fast OSDs for that:

http://docs.ceph.com/docs/nautilus/rados/operations/erasure-code/#erasure-coded-pool-and-cache-tiering

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Igor Podlesny
On Mon, 29 Apr 2019 at 16:37, Burkhard Linke
 wrote:
> On 4/29/19 11:19 AM, Rainer Krienke wrote:
[...]
> > - I also thought about the different k+m settings for a EC pool, for
> > example k=4, m=2 compared to k=8 and m=2. Both settings allow for two
> > OSDs to fail without any data loss, but I asked myself which of the two
> > settings would be more performant? On one hand distributing data to more
> > OSDs allows a higher parallel access to the data, that should result in
> > a faster access. On the other hand each OSD has a latency until
> > it can deliver its data shard. So is there a recommandation which of my
> > two k+m examples should be preferred?
>
> I cannot comment on speed (interesting question, since we are about to

In theory, the more stripes you have, the faster it works overall (IO
load is distributed among a bigger number of hosts).

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Burkhard Linke

Hi,

On 4/29/19 11:19 AM, Rainer Krienke wrote:

I am planning to set up a ceph cluster and already implemented a test
cluster where we are going to use RBD images for data storage (9 hosts,
each host has 16 OSDs, each OSD 4TB).
We would like to use erasure coded (EC)  pools here, and so all OSD are
bluestore. Since several projects are going to store data on this ceph
cluster I think it would make sense to use several EC coded pools for
separation of the projects and access control.

Now I have some questions I hope someone can help me with:

- Do I still (nautilus) need two pools for EC based RBD images, one EC
data pool and a second replicated pool for metadatata?
AFAIK the EC pools cannot store metadata at all, so you probably still 
need a separate replicated pool.


- If I do need two pools for RBD images and I want to separate the data of
different projects by using different pools with EC coding then how
should I handle the metadata pool which contains probably only a small
amount of data compared to the data pool?  Does it make sense to have
*one* replicated metadata pool (eg the default rbd pool) for all
projects and one EC pool for each project, or would it be better to
create one replicated and one EC pool for each project?


An alternative concept is using rados namespaces; each project uses its 
own namespace in a single replicated pool. Whether this works in your 
setup depends on the clients and whether they support namespaces.
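
If the clients do support them, a minimal sketch for RBD (the pool,
namespace and client names here are assumptions, not a tested recipe):

  rbd namespace create --pool rbd --namespace project-a
  ceph auth get-or-create client.project-a mon 'profile rbd' \
      osd 'profile rbd pool=rbd namespace=project-a'
  rbd create --pool rbd --namespace project-a --size 10G test-image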


On the other hand, the PG autoscaler in Nautilus can keep the number of
PGs low, so additional replicated pools won't be as bad as they were
pre-Nautilus.
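
For reference, enabling that per pool in Nautilus looks like this (the
pool name is a placeholder):

  ceph mgr module enable pg_autoscaler
  ceph osd pool set my-pool pg_autoscale_mode on    # or "warn" to only report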




- I also thought about the different k+m settings for a EC pool, for
example k=4, m=2 compared to k=8 and m=2. Both settings allow for two
OSDs to fail without any data loss, but I asked myself which of the two
settings would be more performant? On one hand distributing data to more
OSDs allows a higher parallel access to the data, that should result in
a faster access. On the other hand each OSD has a latency until
it can deliver its data shard. So is there a recommandation which of my
two k+m examples should be preferred?


I cannot comment on speed (interesting question, since we are about to
set up a new cluster, too)... but I won't use k=8,m=2 in a setup with only
9 hosts. You should have at least k+m+m hosts to handle host failures
gracefully. So with nine hosts even k=6,m=2 might (and will) be a problem.



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Need some advice about Pools and Erasure Coding

2019-04-29 Thread Rainer Krienke
I am planning to set up a ceph cluster and already implemented a test
cluster where we are going to use RBD images for data storage (9 hosts,
each host has 16 OSDs, each OSD 4TB).
We would like to use erasure coded (EC)  pools here, and so all OSD are
bluestore. Since several projects are going to store data on this ceph
cluster I think it would make sense to use several EC coded pools for
separation of the projects and access control.

Now I have some questions I hope someone can help me with:

- Do I still (nautilus) need two pools for EC based RBD images, one EC
data pool and a second replicated pool for metadatata?

- If I do need two pools for RBD images and I want to separate the data of
different projects by using different pools with EC coding then how
should I handle the metadata pool which contains probably only a small
amount of data compared to the data pool?  Does it make sense to have
*one* replicated metadata pool (eg the default rbd pool) for all
projects and one EC pool for each project, or would it be better to
create one replicated and one EC pool for each project?

- I also thought about the different k+m settings for a EC pool, for
example k=4, m=2 compared to k=8 and m=2. Both settings allow for two
OSDs to fail without any data loss, but I asked myself which of the two
settings would be more performant? On one hand distributing data to more
OSDs allows a higher parallel access to the data, that should result in
a faster access. On the other hand each OSD has a latency until
it can deliver its data shard. So is there a recommandation which of my
two k+m examples should be preferred?

Thanks in advance for your help
Rainer
-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Igor Podlesny
On Mon, 29 Apr 2019 at 15:13, Eugen Block  wrote:
>
> Sure there is:
>
> ceph pg ls-by-osd 

Thank you Eugen, I overlooked it somehow :)

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Igor Podlesny
Or is there no direct way to accomplish that?
What workarounds can be used then?

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Eugen Block

Sure there is:

ceph pg ls-by-osd 
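
For example (osd.12 is just a placeholder id):

  ceph pg ls-by-osd osd.12
  ceph pg ls-by-osd 12 --format=json-pretty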

Regards,
Eugen

Zitat von Igor Podlesny :


Or is there no direct way to accomplish that?
What workarounds can be used then?

--
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does ceph osd reweight-by-xxx work correctly if OSDs aren't of same size?

2019-04-29 Thread huang jun
Yes, 'ceph osd reweight-by-xxx' uses the OSD crush weight (which
represents how much data it can hold) in its calculation.
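
For completeness, the usual dry-run-then-apply pattern (the threshold of
110 is just an example value):

  ceph osd test-reweight-by-utilization 110   # report what would change
  ceph osd reweight-by-utilization 110        # apply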

Igor Podlesny  wrote on Mon, Apr 29, 2019 at 2:56 PM:
>
> Say some nodes have OSDs that are 1.5 times bigger than other nodes
> have, while the weights of all the nodes in question are almost equal
> (due to having different numbers of OSDs, obviously).
>
> --
> End of message. Next message?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Thank you!
HuangJun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Does ceph osd reweight-by-xxx work correctly if OSDs aren't of same size?

2019-04-29 Thread Igor Podlesny
Say some nodes have OSDs that are 1.5 times bigger than other nodes
have, while the weights of all the nodes in question are almost equal
(due to having different numbers of OSDs, obviously).

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com