Re: [ceph-users] Bucket resharding: "radosgw-admin bi list" ERROR

2017-07-04 Thread Orit Wasserman
Hi Maarten,

On Tue, Jul 4, 2017 at 9:46 PM, Maarten De Quick 
wrote:

> Hi,
>
> Background: We're having issues with our index pool (slow requests /
> timeouts cause an OSD to crash and recover -> application issues). We
> know we have very big buckets (e.g. a bucket of 77 million objects with only
> 16 shards) that need a reshard, so we were looking at the resharding process.
>
> The first thing we would like to do is make a backup of the bucket index,
> but this failed with:
>
> # radosgw-admin -n client.radosgw.be-west-3 bi list
> --bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup
> 2017-07-03 21:28:30.325613 7f07fb8bc9c0  0 System already converted
> ERROR: bi_list(): (4) Interrupted system call
>
>
What version are you using?
Can you rerun the command with --debug-rgw=20 --debug-ms=1?
Also please open a tracker issue (for rgw) with all the information.
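For example, something along these lines should capture the debug output
(the log path is only an illustration):

radosgw-admin -n client.radosgw.be-west-3 bi list --bucket=priv-prod-up-alex \
    --debug-rgw=20 --debug-ms=1 > /dev/null 2> /tmp/bi-list-debug.log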

Thanks,
Orit

> When I grep for "idx" and count these:
>  # grep idx priv-prod-up-alex.list.backup | wc -l
> 2294942
> When I do a bucket stats for that bucket I get:
> # radosgw-admin -n client.radosgw.be-west-3 bucket stats
> --bucket=priv-prod-up-alex | grep num_objects
> 2017-07-03 21:33:05.776499 7faca49b89c0  0 System already converted
> "num_objects": 20148575
>
> It looks like there are 18 million objects missing and the backup is not
> complete (not sure if that's a correct assumption?). We're also afraid that
> the resharding command will face the same issue.
> Has anyone seen this behaviour before or any thoughts on how to fix it?
>
> We were also wondering if we really need the backup. As the resharding
> process creates a complete new index and keeps the old bucket, is there
> maybe a possibility to relink your bucket to the old bucket in case of
> issues? Or am I missing something important here?
>
> Any help would be greatly appreciated, thanks!
>
> Regards,
> Maarten
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon stuck in synchronizing after upgrading from Hammer to Jewel

2017-07-04 Thread jiajia zhong
refer to http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/

I recall we encountered the same issue after upgrading to Jewel :(.
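A rough sketch of the remove/re-add cycle described on that page (the mon id,
data paths and keyring location are placeholders; please follow the exact steps
in the documentation for your release):

service ceph stop mon.<id>        # or: systemctl stop ceph-mon@<id>
ceph mon remove <id>
mv /var/lib/ceph/mon/ceph-<id> /var/lib/ceph/mon/ceph-<id>.bak
ceph mon getmap -o /tmp/monmap
ceph-mon -i <id> --mkfs --monmap /tmp/monmap --keyring /path/to/mon.keyring
service ceph start mon.<id>       # the rebuilt mon then does a full sync from its peers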

2017-07-05 11:21 GMT+08:00 许雪寒 :

> Hi, everyone.
>
> Recently, we upgraded one of our clusters from Hammer to Jewel. However, after
> the upgrade one of our monitors cannot finish the bootstrap procedure and is
> stuck in “synchronizing”. Does anyone have any clue about this?
>
> Thank you☺
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-04 Thread Eino Tuominen
​Hello,


I noticed the same behaviour in our cluster.


ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)



cluster 0a9f2d69-5905-4369-81ae-e36e4a791831

 health HEALTH_WARN

1 pgs backfill_toofull

4366 pgs backfill_wait

11 pgs backfilling

45 pgs degraded

45 pgs recovery_wait

45 pgs stuck degraded

4423 pgs stuck unclean

recovery 181563/302722835 objects degraded (0.060%)

recovery 57192879/302722835 objects misplaced (18.893%)

1 near full osd(s)

noout,nodeep-scrub flag(s) set

 monmap e3: 3 mons at 
{0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}

election epoch 356, quorum 0,1,2 0,1,2

 osdmap e388588: 260 osds: 260 up, 242 in; 4378 remapped pgs

flags nearfull,noout,nodeep-scrub,require_jewel_osds

  pgmap v80658624: 25728 pgs, 8 pools, 202 TB data, 89212 kobjects

612 TB used, 300 TB / 912 TB avail

181563/302722835 objects degraded (0.060%)

57192879/302722835 objects misplaced (18.893%)

   21301 active+clean

4366 active+remapped+wait_backfill

  45 active+recovery_wait+degraded

  11 active+remapped+backfilling

   4 active+clean+scrubbing

   1 active+remapped+backfill_toofull

recovery io 421 MB/s, 155 objects/s

  client io 201 kB/s rd, 2034 B/s wr, 75 op/s rd, 0 op/s wr

I'm currently doing a rolling migration from Puppet on Ubuntu to Ansible on 
RHEL, and I started with a healthy cluster, evacuated some nodes by setting 
their weight to 0, removed them from the cluster and re-added them with the
ansible playbook.

Basically I ran


ceph osd crush remove osd.$num

ceph osd rm $num

ceph auth del osd.$num

in a loop for the osds I was replacing, and then let the ansible ceph-osd
playbook bring the host back to the cluster (a sketch of the loop is below). The crushmap is attached.
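In shell terms it was roughly the following ($NUMS standing in for the osd ids
being replaced):

for num in $NUMS; do
    ceph osd crush remove osd.$num
    ceph osd rm $num
    ceph auth del osd.$num
done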
​
--
  Eino Tuominen



From: ceph-users  on behalf of Gregory 
Farnum 
Sent: Friday, June 30, 2017 23:38
To: Andras Pataki; ceph-users
Subject: Re: [ceph-users] Degraded objects while OSD is being added/filled

On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki 
mailto:apat...@flatironinstitute.org>> wrote:
Hi cephers,

I noticed something I don't understand about ceph's behavior when adding an 
OSD.  When I start with a clean cluster (all PG's active+clean) and add an OSD 
(via ceph-deploy for example), the crush map gets updated and PGs get 
reassigned to different OSDs, and the new OSD starts getting filled with data.  
As the new OSD gets filled, I start seeing PGs in degraded states.  Here is an 
example:

  pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
3164 TB used, 781 TB / 3946 TB avail
8017/994261437 objects degraded (0.001%)
2220581/994261437 objects misplaced (0.223%)
   42393 active+clean
  91 active+remapped+wait_backfill
   9 active+clean+scrubbing+deep
   1 active+recovery_wait+degraded
   1 active+clean+scrubbing
   1 active+remapped+backfilling

Any ideas why there would be any persistent degradation in the cluster while 
the newly added drive is being filled?  It takes perhaps a day or two to fill 
the drive - and during all this time the cluster seems to be running degraded.  
As data is written to the cluster, the number of degraded objects increases 
over time.  Once the newly added OSD is filled, the cluster comes back to clean 
again.

Here is the PG that is degraded in this picture:

7.87c10200419430477
active+recovery_wait+degraded2017-06-20 14:12:44.119921344610'7
583572:2797[402,521]402[402,521]402344610'72017-06-16 
06:04:55.822503344610'72017-06-16 06:04:55.822503

The newly added osd here is 521.  Before it got added, this PG had two replicas 
clean, but one got forgotten somehow?

This sounds a bit concerning at first glance. Can you provide some output of 
exactly what commands you're invoking, and the "ceph -s" output as it changes 
in response?

I really don't see how adding a new OSD can result in it "forgetting" about 
existing valid copies — it's definitely not supposed to — so I wonder if 
there's a collision in how it's deciding to remove old locations.

Are you running with only two copies of your data? It shouldn't matter but 
there could also be errors resulting in a behavioral difference between two and 
three copies.
-Greg


Other remapped PGs have 521 in their "up" set but still have the two existing 
copies in their "acting" set - and no degradation is shown.  Examples:

2.f24142820162856405101485080131023102
active+remapped+wait_backfill2017-06-2

[ceph-users] Mon stuck in synchronizing after upgrading from Hammer to Jewel

2017-07-04 Thread 许雪寒
Hi, everyone.

Recently, we upgraded one of our clusters from Hammer to Jewel. However, after
the upgrade one of our monitors cannot finish the bootstrap procedure and is stuck
in “synchronizing”. Does anyone have any clue about this?

Thank you☺
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] strange (collectd) Cluster.osdBytesUsed incorrect

2017-07-04 Thread Marc Roos


On a test cluster with 994GB used, collectd reports an incorrect
Cluster.osdBytesUsed of 9.3362651136e+10 (93GB) to InfluxDB, while this should be
933GB (or actually 994GB). Cluster.osdBytes is reported correctly as
3.3005833027584e+13 (30TB).



  cluster:
health: HEALTH_OK

  services:
mon: 3 daemons, quorum a,b,c
mgr: c(active), standbys: a, b
mds: 1/1/1 up {0=a=up:active}, 1 up:standby
osd: 6 osds: 6 up, 6 in

  data:
pools:   6 pools, 600 pgs
objects: 3477k objects, 327 GB
usage:   994 GB used, 29744 GB / 30739 GB avail
pgs: 600 active+clean
 
Influxdb:

1499201873311403849 c01  mon.aceph_bytes Cluster.osdBytesAvail   
  3.2912470376448e+13
1499201873311399889 c01  mon.aceph_bytes Cluster.osdBytesUsed
  9.3362651136e+10
1499201873311396462 c01  mon.aceph_bytes Cluster.osdBytes
  3.3005833027584e+13






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to set up bluestore manually?

2017-07-04 Thread Martin Emrich
Hi!

After getting some other stuff done, I finally got around to continuing here.

I set up a whole new cluster with ceph-deploy, but adding the first OSD fails:

ceph-deploy osd create --bluestore ${HOST}:/dev/sdc --block-wal 
/dev/cl/ceph-waldb-sdc --block-db /dev/cl/ceph-waldb-sdc
.
.
.
[WARNIN] get_partition_dev: Try 9/10 : partition 1 for /dev/cl/ceph-waldb-sdc 
does not exist in /sys/block/dm-2
 [WARNIN] get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid path is 
/sys/dev/block/253:2/dm/uuid
 [WARNIN] get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid is 
LVM-2r0bGcoyMB0VnWeGGS77eOD5IOu8wAPN3wPX4OWSS1XGkYZYoziXhfAFMjJf4FJR
 [WARNIN]
 [WARNIN] get_partition_dev: Try 10/10 : partition 1 for /dev/cl/ceph-waldb-sdc 
does not exist in /sys/block/dm-2
 [WARNIN] get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid path is 
/sys/dev/block/253:2/dm/uuid
 [WARNIN] get_dm_uuid: get_dm_uuid /dev/cl/ceph-waldb-sdc uuid is 
LVM-2r0bGcoyMB0VnWeGGS77eOD5IOu8wAPN3wPX4OWSS1XGkYZYoziXhfAFMjJf4FJR
 [WARNIN]
 [WARNIN] Traceback (most recent call last):
 [WARNIN]   File "/usr/sbin/ceph-disk", line 9, in 
 [WARNIN] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 
'ceph-disk')()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
5687, in run
 [WARNIN] main(sys.argv[1:])
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
5638, in main
 [WARNIN] args.func(args)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2004, in main
 [WARNIN] Prepare.factory(args).prepare()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
1993, in prepare
 [WARNIN] self._prepare()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2074, in _prepare
 [WARNIN] self.data.prepare(*to_prepare_list)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2807, in prepare
 [WARNIN] self.prepare_device(*to_prepare_list)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2983, in prepare_device
 [WARNIN] to_prepare.prepare()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2216, in prepare
 [WARNIN] self.prepare_device()
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
2310, in prepare_device
 [WARNIN] partition = device.get_partition(num)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
1714, in get_partition
 [WARNIN] dev = get_partition_dev(self.path, num)
 [WARNIN]   File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 
717, in get_partition_dev
 [WARNIN] (pnum, dev, error_msg))
 [WARNIN] ceph_disk.main.Error: Error: partition 1 for /dev/cl/ceph-waldb-sdc 
does not appear to exist in /sys/block/dm-2
 [ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-disk -v 
prepare --block.wal /dev/cl/ceph-waldb-sdc --block.db /dev/cl/ceph-waldb-sdc 
--bluestore --cluster ceph --fs-type xfs -- /dev/sdc
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs

I found some open issues about this: http://tracker.ceph.com/issues/6042, and 
http://tracker.ceph.com/issues/5461. Could this be related?

Cheers,

Martin


-----Original Message-----
From: Loris Cuoghi [mailto:loris.cuo...@artificiale.net]
Sent: Monday, 3 July 2017 15:48
To: Martin Emrich
Cc: ceph-users@lists.ceph.com; Vasu Kulkarni
Subject: Re: [ceph-users] How to set up bluestore manually?

On Mon, 3 Jul 2017 12:32:20 +,
Martin Emrich wrote:

> Hi!
> 
> Thanks for the super-fast response!
> 
> That did work somehow... Here's my commandline (As Bluestore seems to 
> still require a Journal,

No, it doesn't. :D

> I repurposed the SSD partitions for it and put the DB/WAL on the 
> spinning disk):

On the contrary, Bluestore's DB/WAL are good candidates for low-latency storage 
like an SSD.

> 
>ceph-deploy osd create --bluestore
> :/dev/sdc:/dev/mapper/cl-ceph_journal_sdc

Just
ceph-deploy osd create --bluestore ${hostname}:/device/path

should be sufficent to create a device composed of:

* 1 small (~100 MB) XFS partition
* 1 big (remaining space) partition formatted as bluestore

Additional options like:

--block-wal /path/to/ssd/partition
--block-db /path/to/ssd/partition

allow having SSD-backed WAL and DB.
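Combined, a call could look something like this (host name and SSD partition
paths are placeholders):

ceph-deploy osd create --bluestore ${HOST}:/dev/sdc --block-wal /dev/sdd1 --block-db /dev/sdd2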

> But it created two (!) new OSDs instead of one, and placed them under 
> the default CRUSH rule (thus making my cluster doing stuff; they 
> should be under a different rule)...

Default stuff is applied... by default :P

Take a good read:

http://docs.ceph.com/docs/master/rados/operations/crush-map/

In particular, on how editing an existing CRUSH map:

http://docs.ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
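The edit cycle from that page boils down to roughly the following (file names
are illustrative):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, e.g. add a rule or move the new OSDs under the right root
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin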

-- Loris

> Did I do something wrong or are
> the two OSDs part of the bluestore concept? If 

[ceph-users] Bucket resharding: "radosgw-admin bi list" ERROR

2017-07-04 Thread Maarten De Quick
Hi,

Background: We're having issues with our index pool (slow requests /
timeouts cause an OSD to crash and recover -> application issues). We
know we have very big buckets (e.g. a bucket of 77 million objects with only
16 shards) that need a reshard, so we were looking at the resharding process.
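(For reference, the offline reshard we have in mind would be something like the
command below; the shard count is only an illustration and the subcommand's
availability depends on the radosgw-admin version.)

radosgw-admin bucket reshard --bucket=priv-prod-up-alex --num-shards=128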

The first thing we would like to do is make a backup of the bucket index, but
this failed with:

# radosgw-admin -n client.radosgw.be-west-3 bi list
--bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup
2017-07-03 21:28:30.325613 7f07fb8bc9c0  0 System already converted
ERROR: bi_list(): (4) Interrupted system call

When I grep for "idx" and count these:
 # grep idx priv-prod-up-alex.list.backup | wc -l
2294942
When I do a bucket stats for that bucket I get:
# radosgw-admin -n client.radosgw.be-west-3 bucket stats
--bucket=priv-prod-up-alex | grep num_objects
2017-07-03 21:33:05.776499 7faca49b89c0  0 System already converted
"num_objects": 20148575

It looks like there are 18 million objects missing and the backup is not
complete (not sure if that's a correct assumption?). We're also afraid that
the resharding command will face the same issue.
Has anyone seen this behaviour before or any thoughts on how to fix it?

We were also wondering if we really need the backup. As the resharding
process creates a complete new index and keeps the old bucket, is there
maybe a possibility to relink your bucket to the old bucket in case of
issues? Or am I missing something important here?

Any help would be greatly appreciated, thanks!

Regards,
Maarten
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-04 Thread Lionel Bouton
On 04/07/2017 at 19:00, Jack wrote:
> You may just upgrade to Luminous, then replace filestore by bluestore

You don't just "replace" filestore by bluestore on a production cluster
: you transition over several weeks/months from the first to the second.
The two must be rock stable and have predictable performance
characteristics to do that.
We took more than 6 months with Firefly to migrate from XFS to Btrfs and
studied/tuned the cluster along the way. Simply replacing a store by
another without any experience of the real world behavior of the new one
is just playing with fire (and a huge heap of customer data).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-04 Thread Jack
You may just upgrade to Luminous, then replace filestore by bluestore

Don't be scared, as Sage said:
> The only good(ish) news is that we aren't touching FileStore if we can 
> help it, so it is less likely to regress than other things.  And we'll 
> continue testing filestore+btrfs on jewel for some time.

In my opinion, it should be fine that way

On 04/07/2017 18:54, Lionel Bouton wrote:
> On 30/06/2017 at 18:48, Sage Weil wrote:
>> On Fri, 30 Jun 2017, Lenz Grimmer wrote:
>>> Hi Sage,
>>>
>>> On 06/30/2017 05:21 AM, Sage Weil wrote:
>>>
 The easiest thing is to

 1/ Stop testing filestore+btrfs for luminous onward.  We've recommended 
 against btrfs for a long time and are moving toward bluestore anyway.
>>> Searching the documentation for "btrfs" does not really give a user any
>>> clue that the use of Btrfs is discouraged.
>>>
>>> Where exactly has this been recommended?
>>>
>>> The documentation currently states:
>>>
>>> http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=btrfs#osds
>>>
>>> "We recommend using the xfs file system or the btrfs file system when
>>> running mkfs."
>>>
>>> http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=btrfs#filesystems
>>>
>>> "btrfs is still supported and has a comparatively compelling set of
>>> features, but be mindful of its stability and support status in your
>>> Linux distribution."
>>>
>>> http://docs.ceph.com/docs/master/start/os-recommendations/?highlight=btrfs#ceph-dependencies
>>>
>>> "If you use the btrfs file system with Ceph, we recommend using a recent
>>> Linux kernel (3.14 or later)."
>>>
>>> As an end user, none of these statements would really sound as
>>> recommendations *against* using Btrfs to me.
>>>
>>> I'm therefore concerned about just disabling the tests related to
>>> filestore on Btrfs while still including and shipping it. This has
>>> potential to introduce regressions that won't get caught and fixed.
>> Ah, crap.  This is what happens when devs don't read their own 
>> documentation.  I recommend against btrfs every time it ever comes up, the 
>> downstream distributions all support only xfs, but yes, it looks like the 
>> docs never got updated... despite the xfs focus being 5ish years old now.
>>
>> I'll submit a PR to clean this up, but
>>  
 2/ Leave btrfs in the mix for jewel, and manually tolerate and filter out 
 the occasional ENOSPC errors we see.  (They make the test runs noisy but 
 are pretty easy to identify.)

 If we don't stop testing filestore on btrfs now, I'm not sure when we 
 would ever be able to stop, and that's pretty clearly not sustainable.
 Does that seem reasonable?  (Pretty please?)
>>> If you want to get rid of filestore on Btrfs, start a proper deprecation
>>> process and inform users that support for it it's going to be removed in
>>> the near future. The documentation must be updated accordingly and it
>>> must be clearly emphasized in the release notes.
>>>
>>> Simply disabling the tests while keeping the code in the distribution is
>>> setting up users who happen to be using Btrfs for failure.
>> I don't think we can wait *another* cycle (year) to stop testing this.
>>
>> We can, however,
>>
>>  - prominently feature this in the luminous release notes, and
>>  - require the 'enable experimental unrecoverable data corrupting features =
>> btrfs' in order to use it, so that users are explicitly opting in to 
>> luminous+btrfs territory.
>>
>> The only good(ish) news is that we aren't touching FileStore if we can 
>> help it, so it is less likely to regress than other things.  And we'll 
>> continue testing filestore+btrfs on jewel for some time.
>>
>> Is that good enough?
> 
> Not sure how we will handle the transition. Is bluestore considered
> stable in Jewel ? Then our current clusters (recently migrated from
> Firefly to Hammer) will have support for both BTRFS+Filestore and
> Bluestore when the next upgrade takes place. If Bluestore is only
> considered stable on Luminous I don't see how we can manage the
> transition easily. The only path I see is to :
> - migrate to XFS+filestore with Jewel (which will not only take time but
> will be a regression for us : this will cause performance and sizing
> problems on at least one of our clusters and we will lose the silent
> corruption detection from BTRFS)
> - then upgrade to Luminous and migrate again to Bluestore.
> I was not expecting the transition from Btrfs+Filestore to Bluestore to
> be this convoluted (we were planning to add Bluestore OSDs one at a time
> and study the performance/stability for months before migrating the
> whole clusters). Is there any way to restrict your BTRFS tests to at
> least a given stable configuration (BTRFS is known to have problems with
> the high rate of snapshot deletion Ceph generates by default for example
> and we use 'filestore btrfs snap = false') ?
> 
> Best regards,
> 
> Lionel
> _

Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-04 Thread Lionel Bouton
On 30/06/2017 at 18:48, Sage Weil wrote:
> On Fri, 30 Jun 2017, Lenz Grimmer wrote:
>> Hi Sage,
>>
>> On 06/30/2017 05:21 AM, Sage Weil wrote:
>>
>>> The easiest thing is to
>>>
>>> 1/ Stop testing filestore+btrfs for luminous onward.  We've recommended 
>>> against btrfs for a long time and are moving toward bluestore anyway.
>> Searching the documentation for "btrfs" does not really give a user any
>> clue that the use of Btrfs is discouraged.
>>
>> Where exactly has this been recommended?
>>
>> The documentation currently states:
>>
>> http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=btrfs#osds
>>
>> "We recommend using the xfs file system or the btrfs file system when
>> running mkfs."
>>
>> http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=btrfs#filesystems
>>
>> "btrfs is still supported and has a comparatively compelling set of
>> features, but be mindful of its stability and support status in your
>> Linux distribution."
>>
>> http://docs.ceph.com/docs/master/start/os-recommendations/?highlight=btrfs#ceph-dependencies
>>
>> "If you use the btrfs file system with Ceph, we recommend using a recent
>> Linux kernel (3.14 or later)."
>>
>> As an end user, none of these statements would really sound as
>> recommendations *against* using Btrfs to me.
>>
>> I'm therefore concerned about just disabling the tests related to
>> filestore on Btrfs while still including and shipping it. This has
>> potential to introduce regressions that won't get caught and fixed.
> Ah, crap.  This is what happens when devs don't read their own 
> documentation.  I recommend against btrfs every time it ever comes up, the 
> downstream distributions all support only xfs, but yes, it looks like the 
> docs never got updated... despite the xfs focus being 5ish years old now.
>
> I'll submit a PR to clean this up, but
>  
>>> 2/ Leave btrfs in the mix for jewel, and manually tolerate and filter out 
>>> the occasional ENOSPC errors we see.  (They make the test runs noisy but 
>>> are pretty easy to identify.)
>>>
>>> If we don't stop testing filestore on btrfs now, I'm not sure when we 
>>> would ever be able to stop, and that's pretty clearly not sustainable.
>>> Does that seem reasonable?  (Pretty please?)
>> If you want to get rid of filestore on Btrfs, start a proper deprecation
>> process and inform users that support for it it's going to be removed in
>> the near future. The documentation must be updated accordingly and it
>> must be clearly emphasized in the release notes.
>>
>> Simply disabling the tests while keeping the code in the distribution is
>> setting up users who happen to be using Btrfs for failure.
> I don't think we can wait *another* cycle (year) to stop testing this.
>
> We can, however,
>
>  - prominently feature this in the luminous release notes, and
>  - require the 'enable experimental unrecoverable data corrupting features =
> btrfs' in order to use it, so that users are explicitly opting in to 
> luminous+btrfs territory.
>
> The only good(ish) news is that we aren't touching FileStore if we can 
> help it, so it is less likely to regress than other things.  And we'll 
> continue testing filestore+btrfs on jewel for some time.
>
> Is that good enough?

Not sure how we will handle the transition. Is bluestore considered
stable in Jewel? Then our current clusters (recently migrated from
Firefly to Hammer) will have support for both BTRFS+Filestore and
Bluestore when the next upgrade takes place. If Bluestore is only
considered stable on Luminous, I don't see how we can manage the
transition easily. The only path I see is to:
- migrate to XFS+filestore with Jewel (which will not only take time but
will be a regression for us: this will cause performance and sizing
problems on at least one of our clusters and we will lose the silent
corruption detection from BTRFS)
- then upgrade to Luminous and migrate again to Bluestore.
I was not expecting the transition from Btrfs+Filestore to Bluestore to
be this convoluted (we were planning to add Bluestore OSDs one at a time
and study the performance/stability for months before migrating the
whole clusters). Is there any way to restrict your BTRFS tests to at
least a given stable configuration (BTRFS is known to have problems with
the high rate of snapshot deletion Ceph generates by default, for example,
and we use 'filestore btrfs snap = false')?

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster with Deep Scrub Error

2017-07-04 Thread Etienne Menguy
Hello,

Are you running 10.0.2.5 or 10.2.5?


If you are running 10.2 you can read this documentation:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent


'rados list-inconsistent-obj' will give you the reason for this scrub error.
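For example (the pg id 1.129 is taken from your log line; the pool name is a
placeholder):

rados list-inconsistent-pg <pool-name>
rados list-inconsistent-obj 1.129 --format=json-pretty
# once the damaged copy is identified, the repair can be re-issued:
ceph pg repair 1.129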


And I would not use Ceph with raid6.

Your data should already be safe with Ceph.


Etienne



From: ceph-users  on behalf of Hauke Homburg 

Sent: Tuesday, July 4, 2017 17:41
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph Cluster with Deeo Scrub Error

On 02.07.2017 at 13:23, Hauke Homburg wrote:
Hello,

I have a Ceph cluster with 5 Ceph servers, running under CentOS 7.2 and ceph
10.0.2.5. All OSDs are running on a RAID6.
In this cluster I have a deep scrub error:
/var/log/ceph/ceph-osd.6.log-20170629.gz:389 .356391 7f1ac4c57700 -1 
log_channel(cluster) log [ERR] : 1.129 deep-scrub 1 errors

This line is the only line I can find with the error.

I tried to repair it with ceph osd deep-scrub and ceph pg repair.
Both didn't fix the error.

What can I do to repair the error?

Regards

Hauke

--
www.w3-creative.de

www.westchat.de



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Hello,

Today I ran a ceph osd scrub . After some hours I had a running
Ceph again.

I wonder why the run takes such a long time for one OSD. Does ceph have queries
at this point?

regards

Hauke

--
www.w3-creative.de

www.westchat.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster with Deep Scrub Error

2017-07-04 Thread Hauke Homburg
On 02.07.2017 at 13:23, Hauke Homburg wrote:
> Hello,
>
> I have a Ceph cluster with 5 Ceph servers, running under CentOS 7.2
> and ceph 10.0.2.5. All OSDs are running on a RAID6.
> In this cluster I have a deep scrub error:
> /var/log/ceph/ceph-osd.6.log-20170629.gz:389 .356391 7f1ac4c57700 -1
> log_channel(cluster) log [ERR] : 1.129 deep-scrub 1 errors
>
> This line is the only line I can find with the error.
>
> I tried to repair it with ceph osd deep-scrub and ceph pg repair.
> Both didn't fix the error.
>
> What can I do to repair the error?
>
> Regards
>
> Hauke
>
> -- 
> www.w3-creative.de
>
> www.westchat.de
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hello,

Today I ran a ceph osd scrub . After some hours I had a
running Ceph again.

I wonder why the run takes such a long time for one OSD. Does ceph have
queries at this point?

regards

Hauke

-- 
www.w3-creative.de

www.westchat.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon leader election problem, should it be improved ?

2017-07-04 Thread Joao Eduardo Luis

On 07/04/2017 06:57 AM, Z Will wrote:

Hi:
   I am testing ceph-mon split brain. I have read the code, and if I
understand it right, I know it won't split-brain. But I think
there is still another problem. My ceph version is 0.94.10. Here
are my test details:

3 ceph-mons, their ranks are 0, 1, 2 respectively. I stop the rank 1
mon, and use iptables to block the communication between mon 0 and
mon 1. When the cluster is stable, I start mon.1. I found that the 3
monitors then all stop working well. They all keep trying to call a new
leader election. This means the cluster can't work anymore.

Here is my analysis. A mon will always respond to a leader
election message. In my test, communication between mon.0 and
mon.1 is blocked, so mon.1 will always try to be leader, because it
will always see mon.2, and it should win over mon.2. Mon.0 should
always win over mon.2. But mon.2 will always respond to the election
message issued by mon.1, so this loop will never end. Am I right?

Should this be considered a problem? Or was it just designed like this, and
should it be handled by a human?


This is a known behaviour, quite annoying, but easily identifiable by 
having the same monitor constantly calling an election and usually 
timing out because the peon did not defer to it.


In a way, the elector algorithm does what it is intended to. Solving 
this corner case would be nice, but I don't think there's a good way to 
solve it. We may be able to presume a monitor is in trouble during the 
probe phase, to disqualify a given monitor from the election, but in the 
end this is a network issue that may be transient or unpredictable and 
there's only so much we can account for.


Dealing with it automatically would be nice, but I think, thus far, the 
easiest way to address this particular issue is human intervention.


  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados maximum object size issue since Luminous? SOLVED

2017-07-04 Thread Martin Emrich
Awesome, that did it.

I'm considering creating a separate Bareos device with striping, testing there, and
then phasing out the old non-striped pool... Maybe that would also fix the
suboptimal throughput...

But from the Ceph side of things, it looks like I'm good now.

Thanks again :)

Cheers,

Martin 

-----Original Message-----
From: Jens Rosenboom [mailto:j.rosenb...@x-ion.de]
Sent: Tuesday, 4 July 2017 14:42
To: Martin Emrich
Cc: Gregory Farnum ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados maximum object size issue since Luminous?

2017-07-04 12:10 GMT+00:00 Martin Emrich :
...
> So as striping is not backwards-compatible (and this pools is indeed for 
> backup/archival purposes where large objects are no problem):
>
> How can I restore the behaviour of jewel (allowing 50GB objects)?
>
> The only option I found was "osd max write size" but that seems not to be the 
> right one, as its default of 90MB is lower than my observed 128MB.

That should be osd_max_object_size, see https://github.com/ceph/ceph/pull/15520
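A sketch of what that could look like (value in bytes, 50 GB chosen to match
the volume sizes mentioned above; set it in ceph.conf on the OSD hosts and
restart the OSDs, or inject it at runtime -- whether injection takes effect
without a restart may depend on the release):

[osd]
osd max object size = 53687091200

ceph tell osd.* injectargs '--osd_max_object_size 53687091200'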
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel : How to remove MDS ?

2017-07-04 Thread John Spray
On Tue, Jul 4, 2017 at 1:49 PM, Florent B  wrote:
> Hi everyone,
>
> I would like to remove a MDS from map. How to do this ?
>
> # ceph mds rm mds.$ID;
> Invalid command:  mds.host1 doesn't represent an int
> mds rm  :  remove nonactive mds
> Error EINVAL: invalid command

Avoid "mds rm" (you never really need that command, the naming is
unfortunate) -- use "mds fail" to drop an MDS from the filesystem
map/mds map.  mds fail will take a rank or a daemon name, so it's much
friendlier in practice.  For an explanation of ranks etc see
http://docs.ceph.com/docs/master/cephfs/standby/
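For instance (the rank and the daemon name here are only illustrations):

ceph mds fail 0          # by rank
ceph mds fail host1      # by daemon name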

Note that if the daemon is still running then you'll find it
immediately appears in the map again.  If you want to permanently
remove an MDS daemon on a particular server then you need to go and do
that on the server (or modify whatever configuration management
solution you're using).

John

> This int is supposed to be the "rank" of the MDS. But where do I find it ?
>
> Thank you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados maximum object size issue since Luminous?

2017-07-04 Thread Jens Rosenboom
2017-07-04 12:10 GMT+00:00 Martin Emrich :
...
> So as striping is not backwards-compatible (and this pools is indeed for 
> backup/archival purposes where large objects are no problem):
>
> How can I restore the behaviour of jewel (allowing 50GB objects)?
>
> The only option I found was "osd max write size" but that seems not to be the 
> right one, as its default of 90MB is lower than my observed 128MB.

That should be osd_max_object_size, see https://github.com/ceph/ceph/pull/15520
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados maximum object size issue since Luminous?

2017-07-04 Thread Martin Emrich
Hi!

I dug deeper, and apparently striping is not backwards-compatible with
"non-striping":

* "rados ls --stripe" lists only objects where striping was used to write them 
in the first place.
* If I enable striping in Bareos (tried different values for stripe_unit and 
stripe_count), it crashes here:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.0/rpm/el7/BUILD/ceph-12.1.0/src/osdc/Striper.cc:
 In function 'static void Striper::file_to_extents(CephContext*, const char*, 
const file_layout_t*, uint64_t, uint64_t, uint64_t, std::map >&, uint64_t)' thread 7f32d14da700 time 2017-07-04 
13:23:26.097884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.0/rpm/el7/BUILD/ceph-12.1.0/src/osdc/Striper.cc:
 64: FAILED assert(object_size >= su)
 ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x7f32db699120]
 2: (Striper::file_to_extents(CephContext*, char const*, file_layout_t const*, 
unsigned long, unsigned long, unsigned long, std::map >, std::less, 
std::allocator > > > >&, unsigned long)+0x1826) [0x7f32e5969c16]
 3: (Striper::file_to_extents(CephContext*, char const*, file_layout_t const*, 
unsigned long, unsigned long, unsigned long, std::vector >&, unsigned long)+0x5b) [0x7f32e596b65b]
 4: (libradosstriper::RadosStriperImpl::aio_read(std::string const&, 
librados::AioCompletionImpl*, ceph::buffer::list*, unsigned long, unsigned 
long)+0x584) [0x7f32e58f2054]
 5: (libradosstriper::RadosStriperImpl::read(std::string const&, 
ceph::buffer::list*, unsigned long, unsigned long)+0x55) [0x7f32e58f2315]
 6: (rados_striper_read()+0x112) [0x7f32e58eada2]
 7: (rados_device::read_object_data(long, char*, unsigned long)+0x3c) 
[0x7f32e6f0a08c]
 8: (rados_device::d_read(int, void*, unsigned long)+0x1a) [0x7f32e6f0a0ba]
 9: (DEVICE::read(void*, unsigned long)+0x27) [0x7f32e6ef1187]
 10: (DCR::read_block_from_dev(bool)+0xca) [0x7f32e6ee99aa]
 11: (read_dev_volume_label(DCR*)+0x2d8) [0x7f32e6ef4988]
 12: (DCR::check_volume_label(bool&, bool&)+0x10d) [0x7f32e6ef72dd]
 13: (DCR::mount_next_write_volume()+0x5c0) [0x7f32e6ef80e0]
 14: (acquire_device_for_append(DCR*)+0xdb) [0x7f32e6ee179b]
 15: /sbin/bareos-sd() [0x408031]
 16: /sbin/bareos-sd() [0x40f5c4]
 17: /sbin/bareos-sd() [0x40f9d9]
 18: /sbin/bareos-sd() [0x40fbd2]
 19: /sbin/bareos-sd() [0x41070b]
 20: /sbin/bareos-sd() [0x40ee02]
 21: /sbin/bareos-sd() [0x414c18]
 22: (workq_server()+0x1f5) [0x7f32e6a9ca85]
 23: (lmgr_thread_launcher()+0x55) [0x7f32e6a84fb5]
 24: (()+0x7dc5) [0x7f32e5da9dc5]
 25: (clone()+0x6d) [0x7f32e4a5876d]


I guess this is because it tries to read older Volumes (==Objects) which were 
not written with striping on?

So as striping is not backwards-compatible (and this pool is indeed for
backup/archival purposes where large objects are no problem):

How can I restore the behaviour of jewel (allowing 50GB objects)?

The only option I found was "osd max write size" but that seems not to be the 
right one, as its default of 90MB is lower than my observed 128MB.

Cheers,

Martin

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Martin Emrich
Sent: Tuesday, 4 July 2017 09:46
To: Gregory Farnum ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados maximum object size issue since Luminous?

Hi,

thanks for the explanation! I am just now diving into the C code of Bareos, it 
seems there is already code in there to use libradosstriper, I just would have 
to turn it on ;)

But there are two parameters (stripe_unit and stripe_count), but there are no 
default values.

What would be sane default values for these parameters (expecting objects of 
5-50GB) ? Can I retain backwards compatibility to existing larger objects 
written without striping?

Thanks so much,

Martin

-----Original Message-----
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Monday, 3 July 2017 19:59
To: Martin Emrich
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados maximum object size issue since Luminous?

On Mon, Jul 3, 2017 at 10:17 AM, Martin Emrich  
wrote:
> Hi!
>
>
>
> Having to interrupt my bluestore test, I have another issue since 
> upgrading from Jewel to Luminous: My backup system (Bareos with 
> RadosFile backend) can no longer write Volumes (objects) larger than around 
> 128MB.
>
> (Of course, I did not test that on my test cluster prior to upgrading 
> the production one :/ )
>
>
>
> At first, I suspected an incompatibility between the Bareos storage 
> daemon and the newer Ceph version, but I could replicate it with the rados 
> tool:
>
>
>
> Create a large file (1GB)
>
>
>
> Put it with rados
>
>
>
> rados --pool backup put rados-testfile rados-testfile-1G
>
> e

Re: [ceph-users] Rados maximum object size issue since Luminous?

2017-07-04 Thread Martin Emrich
Hi,

thanks for the explanation! I am just now diving into the C code of Bareos, it 
seems there is already code in there to use libradosstriper, I just would have 
to turn it on ;)

But there are two parameters (stripe_unit and stripe_count) with no
default values.

What would be sane default values for these parameters (expecting objects of
5-50GB)? Can I retain backwards compatibility with existing larger objects
written without striping?

Thanks so much,

Martin

-----Original Message-----
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Monday, 3 July 2017 19:59
To: Martin Emrich
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados maximum object size issue since Luminous?

On Mon, Jul 3, 2017 at 10:17 AM, Martin Emrich  
wrote:
> Hi!
>
>
>
> Having to interrupt my bluestore test, I have another issue since 
> upgrading from Jewel to Luminous: My backup system (Bareos with 
> RadosFile backend) can no longer write Volumes (objects) larger than around 
> 128MB.
>
> (Of course, I did not test that on my test cluster prior to upgrading 
> the production one :/ )
>
>
>
> At first, I suspected an incompatibility between the Bareos storage 
> daemon and the newer Ceph version, but I could replicate it with the rados 
> tool:
>
>
>
> Create a large file (1GB)
>
>
>
> Put it with rados
>
>
>
> rados --pool backup put rados-testfile rados-testfile-1G
>
> error putting backup-fra1/rados-testfile: (27) File too large
>
>
>
> Read it back:
>
>
>
> rados  --pool backup get rados-testfile rados-testfile-readback
>
>
>
> Indeed, it wrote just about 128MB
>
>
>
> Adding the "--striper" option to both get and put command lines, it works:
>
>
>
> -rw-r--r-- 1 root root 1073741824  3. Jul 18:47 rados-testfile-1G
>
> -rw-r--r-- 1 root root  134217728  3. Jul 19:12 
> rados-testfile-readback
>
>
>
> The error message I get from the backup system looks similar:
>
> block.c:659-29028 === Write error. fd=0 size=64512 rtn=-1 
> dev_blk=134185235
> blk_blk=10401 errno=28: ERR=Auf dem Gerät ist kein Speicherplatz mehr 
> verfügbar
>
>
>
> (German for „No space left on device”)
>
>
>
> The service worked fine with Ceph jewel, nicely writing 50GB objects. 
> Did the API change somehow?

We set a default maximum object size (of 128MB, probably?) in order to prevent 
people setting individual objects which are too large for the system to behave 
well with. It is configurable (I don't remember how, you'll need to look it up 
in hopefully-the-docs but probably-the-source), but there's generally not a 
good reason to create single individual objects instead of sharding them. 50GB 
objects probably work fine for archival, but if eg you have an OSD failure you 
won't be able to do any IO on objects which are being backfilled or recovered, 
and for a 50GB object that will take a while.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Full Ratio Luminous - Unset

2017-07-04 Thread Ashley Merrick
OK, I noticed there is a new command to set these:
"ceph osd set-{full,nearfull,backfillfull}-ratio".

I tried these, but the values still show as 0 and the "Full ratio(s) out of
order" error remains.
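For the record, the kind of invocation in question looks like this (the ratio
values shown are just the usual defaults, purely illustrative):

ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95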


,Ashley


From: ceph-users  on behalf of Ashley 
Merrick 
Sent: 04 July 2017 05:55:10
To: ceph-us...@ceph.com
Subject: [ceph-users] OSD Full Ratio Luminous - Unset


Hello,


On a Luminous cluster upgraded from Jewel I am seeing the following in ceph -s:
"Full ratio(s) out of order"


and


ceph pg dump | head
dumped all
version 44281
stamp 2017-07-04 05:52:08.337258
last_osdmap_epoch 0
last_pg_scan 0
full_ratio 0
nearfull_ratio 0

I have tried to inject the values, but it has no effect; these were
previously non-zero values and the issue only showed up after running "ceph osd
require-osd-release luminous".


Thanks,

Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com