Re: [ceph-users] NVRAM cache for ceph journal

2017-02-21 Thread Christian Balzer

Hello,

On Wed, 22 Feb 2017 15:07:48 +0800 (HKT) Horace wrote:

> Dear all,
> 
> Does anybody have any experience with this product? It is a BBU-backed NVRAM 
> cache; I think it would be a good fit for Ceph.
> 
> https://www.microsemi.com/products/storage/flashtec-nvram-drives/nv1616
> 
Not this product, but some people here on this ML looked at similar things
in the past.

And while these are fast and won't wear out, it's hard to find a good
(economical) use case for them.

1. For HDD backed OSDs this kind of NVRAM unit is overkill and at 16GB
also something that may prove too small with more than 8 HDDs. 

2. For SSD backed OSDs it's a much better fit; however, can you afford both,
and more importantly, do you need that performance and can you actually
realize it? Your CPUs are going to become a bottleneck at some point.

3. For NVMe backed OSDs it would be a good fit, too. But given the PCIe
lanes needed you may wind up with only 4 OSDs per node. You will also want
the fastest CPUs you can find (high clock speed, not core count). And the
price tag of this combo makes it something for specialist use and/or really
well funded operations. 

Lastly, while BlueStore will of course also profit from having some of its
data (WAL, DB) on fast storage, the exact size requirements are not entirely
clear to me at this time. 
What it definitely won't need is (relatively large) journals.
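
For the sake of illustration, a minimal sketch of how the 16GB card would be
carved up for FileStore journals (assumptions: the card shows up as a block
device, here hypothetically /dev/nvram0, and you deploy with ceph-disk; the
sizes are purely illustrative):

[osd]
# ~1.8GB per journal so that 8 journals fit into the 16GB device
osd journal size = 1800

$ ceph-disk prepare /dev/sdb /dev/nvram0   # data on the HDD, journal on the NVRAM card

With the default 5GB journal size the same device would only cover about 3
OSDs, which is why 16GB becomes the limiting factor so quickly.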

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Having many Pools

2017-02-21 Thread Christian Balzer
On Wed, 22 Feb 2017 06:21:41 + Mustafa AKIN wrote:

> Hi, I’m fairly new to Ceph. We are building a shared system on top of Ceph. I 
> know that OpenStack uses a few pools and handles the ownership itself. But 
> would it be undesirable to create a pool per user in Ceph? It would lead to 
> having too many placement groups; is there any bad effect from that?

As discussed/mentioned here many times before, PGs are not free.
They consume RAM and most of all CPU, especially when peering (when OSDs
get added or removed/fail). 

So ideally you want to limit yourself to what the formulas and PGcalc
recommend for your cluster size.
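
As a rough worked example of that formula (assuming 3x replication): with 30
OSDs you'd aim for about (30 * 100) / 3 = 1000 PGs in total, rounded to the
nearest power of two, i.e. 1024. That budget is for the whole cluster, so
every per-user pool has to share it; even a few dozen users would leave each
pool with far too few PGs to distribute data evenly, or blow the total far
past what your OSDs can comfortably handle.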

And that's obviously not a number that will scale up with the number of
users.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NVRAM cache for ceph journal

2017-02-21 Thread Horace
Dear all,

Does anybody have any experience with this product? It is a BBU-backed NVRAM cache; 
I think it would be a good fit for Ceph.

https://www.microsemi.com/products/storage/flashtec-nvram-drives/nv1616

Regards,
Horace Ng


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Having many Pools

2017-02-21 Thread Mustafa AKIN
Hi, I’m fairly new to Ceph. We are building a shared system on top of Ceph. I 
know that OpenStack uses a few pools and handles the ownership itself. But 
would it be undesirable to create a pool per user in Ceph? It would lead to 
having too many placement groups; is there any bad effect from that?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd export-diff bug? rbd export-diff generates different incremental files

2017-02-21 Thread Jason Dillaman
On Tue, Feb 21, 2017 at 8:28 PM, Zhongyan Gu  wrote:
> Well, we have already included this fix in our test setup. I think this time
> we encountered another potential bug in the export process. We are diving
> into the code and trying to find an easy reproduction case.

Even if you know you can eventually reproduce the issue, may I suggest
that you capture the logs from "rbd export-diff --debug-rbd=20
--debug-rados=20 --from-snap  @"
from before and after you detect the issue? Hopefully that will allow
the issue to be narrowed down.
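
To illustrate the shape of that command (the pool, image and snapshot names
below are made up; substitute your own):

$ rbd export-diff --debug-rbd=20 --debug-rados=20 \
      --from-snap snap1 rbd/myimage@snap2 myimage.diff --log-file=export-diff.log

i.e. pool/image@end-snapshot, the starting snapshot via --from-snap, the
output diff file as the final argument, and the debug output directed to a
log file.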

> On Wed, Feb 22, 2017 at 2:28 AM, Jason Dillaman  wrote:
>>
>> On Mon, Feb 20, 2017 at 10:13 PM, Zhongyan Gu 
>> wrote:
>> > You mentioned the fix is scheduled to be included in Hammer 0.94.10, Is
>> > there any fix already there??
>>
>> The fix for that specific diff issue is included in the hammer branch
>> [1][2] -- but 0.94.10 hasn't been released yet.
>>
>> [1] http://tracker.ceph.com/issues/18111
>> [2] https://github.com/ceph/ceph/pull/12446
>>
>> --
>> Jason
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd export-diff bug? rbd export-diff generates different incremental files

2017-02-21 Thread Zhongyan Gu
Well, we have already included this fix in our test setup. I think this
time we encountered another potential bug in the export process. We are
diving into the code and trying to find an easy reproduction case.

On Wed, Feb 22, 2017 at 2:28 AM, Jason Dillaman  wrote:

> On Mon, Feb 20, 2017 at 10:13 PM, Zhongyan Gu 
> wrote:
> > You mentioned the fix is scheduled to be included in Hammer 0.94.10, Is
> > there any fix already there??
>
> The fix for that specific diff issue is included in the hammer branch
> [1][2] -- but 0.94.10 hasn't been released yet.
>
> [1] http://tracker.ceph.com/issues/18111
> [2] https://github.com/ceph/ceph/pull/12446
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How safe is ceph pg repair these days?

2017-02-21 Thread David Zafman


Nick,

Yes, as you would expect, an OSD with a read error would not be used as a 
source for repair, no matter which OSD(s) are getting read errors.
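
For anyone following along, a rough sketch of how to inspect such a PG before
deciding to repair it (the pool name and PG id here are made up):

$ rados list-inconsistent-pg rbd                        # which PGs in pool 'rbd' are inconsistent
$ rados list-inconsistent-obj 0.6 --format=json-pretty  # per-object detail, including read errors
$ ceph pg repair 0.6                                    # only once you're happy with what it will do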



David

On 2/21/17 12:38 AM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Gregory Farnum
Sent: 20 February 2017 22:13
To: Nick Fisk ; David Zafman 
Cc: ceph-users 
Subject: Re: [ceph-users] How safe is ceph pg repair these days?

On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk  wrote:

 From what I understand, in Jewel+ Ceph has the concept of an
authoritative shard, so in the case of a 3x replica pool, it will
notice that 2 replicas match and one doesn't and use one of the good
replicas. However, in a 2x pool you're out of luck.

However, if someone could confirm my suspicions that would be good as well.

Hmm, I went digging in and sadly this isn't quite right. The code has a lot of
internal plumbing to allow more smarts than were previously feasible and
the erasure-coded pools make use of them for noticing stuff like local
corruption. Replicated pools make an attempt but it's not as reliable as one
would like and it still doesn't involve any kind of voting mechanism.
A self-inconsistent replicated primary won't get chosen. A primary is
self-inconsistent when its digest doesn't match the data, which happens when:
1) the object hasn't been written since it was last scrubbed, or
2) the object was written in full, or
3) the object has only been appended to since the last time its digest was
recorded, or
4) something has gone terribly wrong in/under LevelDB and the omap entries
don't match what the digest says should be there.


Thanks for the correction Greg. So I'm guessing that the probability of
overwriting with an incorrect primary is reduced in later releases, but it
can still happen.

Quick question, and maybe this is a #5 for your list: what about objects that
are marked inconsistent on the primary due to a read error? I would say 90% of
my inconsistent PGs are caused by a read error and an associated smartctl
error.

"rados list-inconsistent-obj" shows that it knows that the primary had a
read error, so I assume a "pg repair" wouldn't try and read from the primary
again?


David knows more and can correct me if I'm missing something. He's also
working on interfaces for scrub that are more friendly in general and allow
administrators to make more fine-grained decisions about recovery in ways
that cooperate with RADOS.
-Greg


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Tracy Reed
Sent: 18 February 2017 03:06
To: Shinobu Kinjo 
Cc: ceph-users 
Subject: Re: [ceph-users] How safe is ceph pg repair these days?

Well, that's the question...is that safe? Because the link to the mailing list
post (possibly outdated) says that what you just suggested is definitely NOT
safe. Is the mailing list post wrong? Has the situation changed? Exactly what
does ceph repair do now? I suppose I could go dig into the code but I'm not
an expert and would hate to get it wrong and post possibly bogus info to the
list for other newbies to find and worry about and possibly lose their data.

On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:

if ``ceph pg deep-scrub `` does not work then
   do
 ``ceph pg repair 


On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed wrote:

I have a 3 replica cluster. A couple times I have run into
inconsistent PGs. I googled it and ceph docs and various blogs
say run a repair first. But a couple people on IRC and a mailing
list thread from 2015 say that ceph blindly copies the primary
over the secondaries and calls it good.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html

I sure hope that isn't the case. If so it would seem highly
irresponsible to implement such a naive command called "repair".
I have recently learned how to properly analyze the OSD logs and
manually fix these things but not before having run repair on a
dozen inconsistent PGs. Now I'm worried about what sort of
corruption I may have introduced. Repairing things by hand is a
simple heuristic based on comparing the size or checksum (as
indicated by the logs) for each of the 3 copies and figuring out
which is correct. Presumably matching two out of three should win
and the odd object out should be deleted since having the exact
same kind of error on two different OSDs is highly improbable. I
don't understand why ceph repair wouldn't have done this all along.

What is the current best practice in the use of ceph repair?

Thanks!

--
Tracy Reed

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Tracy Reed

___
ceph-users 

Re: [ceph-users] help with crush rule

2017-02-21 Thread Brian Andrus
I don't think a CRUSH rule exception is currently possible, but it makes
sense to me for a feature request.
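
For reference, the closest a plain CRUSH rule gets today is an unconditional
host-level rule like the sketch below (standard rule syntax, nothing
invented); there is simply no way to express "and if fewer than 3 hosts are
available, fall back to separate OSDs" within a single rule:

rule replicated_per_host {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

With only 2 of 3 hosts up, this rule will only map 2 replicas, so the third
copy only comes back by manually switching the pool to an osd-level rule (or
waiting for the host to return).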

On Sat, Feb 18, 2017 at 6:16 AM, Maged Mokhtar  wrote:

>
> Hi,
>
> I have a need to support a small cluster with 3 hosts and 3 replicas given
> that in normal operation each replica will be placed on a separate host
> but in case one host dies then its replicas could be stored on separate
> osds on the 2 live hosts.
>
> I was hoping to write a rule that, in case it could only find 2 replicas on
> separate nodes, would emit those and then do another select/emit to place the
> remaining replica. Is this possible? I could not find a way to define an if
> condition or to determine the size of the working vector actually returned.
>
> Cheers /maged
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Passing LUA script via python rados execute

2017-02-21 Thread Patrick Donnelly
On Tue, Feb 21, 2017 at 4:45 PM, Nick Fisk  wrote:
> I'm trying to put some examples together for a book and so wanted to try and 
> come up with a more out-of-the-box experience someone could follow. I'm 
> guessing some basic examples in LUA and then some custom rados classes in C++ 
> might be the best approach for this for now?

FYI, since you are writing a book: Lua is not an acronym:
https://www.lua.org/about.html#name

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Passing LUA script via python rados execute

2017-02-21 Thread Nick Fisk
> 
> On 02/19/2017 12:15 PM, Patrick Donnelly wrote:
> > On Sat, Feb 18, 2017 at 2:55 PM, Noah Watkins 
> wrote:
> >> The least intrusive solution is to simply change the sandbox to allow
> >> the standard file system module loading function as expected. Then
> >> any user would need to make sure that every OSD had consistent
> >> versions of dependencies installed using something like LuaRocks.
> >> This is simple, but could make debugging and deployment a major
> headache.
> >
> > A locked down require which doesn't load C bindings (i.e. only load
> > .lua files) would probably be alright.
> >
> >> A more ambitious version would be to create an interface for users to
> >> upload scripts and dependencies into objects, and support referencing
> >> those objects as standard dependencies in Lua scripts as if they were
> >> standard modules on the file system. Each OSD could then cache
> >> scripts and dependencies, allowing applications to use references to
> >> scripts instead of sending a script with every request.
> >
> > This is very doable. I imagine we'd just put all of the Lua modules in
> > a flattened hierarchy under a RADOS namespace? The potentially
> > annoying nit in this is writing some kind of mechanism for installing
> > a Lua module tree into RADOS. Users would install locally and then
> > upload the tree through some tool.
> 
> Using rados objects for this is not really feasible. It would be incredibly
> complex within the osd - it involves multiple objects, cache invalidation, and
> has all kinds of potential issues with consistent versioning and atomic
> updates across objects.
> 
> The simple solution of loading modules from the local fs sounds way better
> to me. Putting modules on all osds and reloading the modules or restarting
> the osds seems like a pretty simple deployment model with any configuration
> management system.
> 
> That said, for research purposes you could resurrect something like the
> ability to load modules into the cluster from a client - just store them on 
> the
> local fs of each osd, not in rados objects. This was removed back in 2011:
> 
> https://github.com/ceph/ceph/commit/7c04f81ca16d11fc5a592992a4462b34ccb199dc
> https://github.com/ceph/ceph/commit/964a0a6e1326d4f773c547655ebb2a5c97794268
> 

Thanks Josh. They look like they refer to loading the actual rados classes 
themselves, whereas I just want the LUA rados class, which runs a LUA script 
passed via JSON, to be able to load extra Lua scripts from the local FS. Or 
have I misunderstood the contents of those commits?

I'm trying to put some examples together for a book and so wanted to try and 
come up with a more out-of-the-box experience someone could follow. I'm 
guessing some basic examples in LUA and then some custom rados classes in C++ 
might be the best approach for this for now?
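
In case it helps anyone else following this thread, the sort of out-of-the-box
flow I have in mind looks roughly like the sketch below. Assumptions are
flagged: that the cls_lua object class is loaded on the OSDs, that its class
name is 'lua' and its JSON entry point is 'eval_json', and that your
python-rados has the Ioctx.execute() binding. Please double-check those names
against the cls_lua documentation for your release before copying anything:

import json
import rados

# A trivial Lua handler; per the cls_lua docs, cls.register() is what exposes
# it to callers. The handler just appends a string to the output bufferlist.
script = """
function say_hello(input, output)
    output:append("hello from the OSD")
end
cls.register(say_hello)
"""

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')   # hypothetical pool name

cmd = json.dumps({"script": script, "handler": "say_hello", "input": ""})
ret, out = ioctx.execute('some-object', 'lua', 'eval_json', cmd, 256)
print(out)

ioctx.close()
cluster.shutdown()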

> Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-21 Thread Nick Fisk
Yep sure, will try and present some figures at tomorrow’s meeting again.

 

From: Samuel Just [mailto:sj...@redhat.com] 
Sent: 21 February 2017 18:14
To: Nick Fisk 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Ok, I've added explicit support for osd_snap_trim_sleep (same param, new 
non-blocking implementation) to that branch.  Care to take it for a whirl?

-Sam

 

On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk  > wrote:

Building now

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
 ] On Behalf Of Samuel Just
Sent: 09 February 2017 19:22
To: Nick Fisk  >
Cc: ceph-users@lists.ceph.com  


Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on 
master) passed a rados suite.  It adds a configurable limit to the 
number of pgs which can be trimming on any OSD (default: 2).  PGs trimming will 
be in snaptrim state, PGs waiting to trim will be in snaptrim_wait state.  I 
suspect this'll be adequate to throttle the amount of trimming.  If not, I can 
try to add an explicit limit to the rate at which the work items trickle into 
the queue.  Can someone test this branch?   Tester beware: this has not merged 
into master yet and should only be run on a disposable cluster.

-Sam

 

On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk  > wrote:

Yeah it’s probably just the fact that they have more PG’s so they will hold 
more data and thus serve more IO. As they have a fixed IO limit, they will 
always hit the limit first and become the bottleneck.

 

The main problem with reducing the filestore queue is that I believe you will 
start to lose the benefit of having IO’s queued up on the disk, so that the 
scheduler can re-arrange them to action them in the most efficient manner as the 
disk head moves across the platters. You might possibly see up to a 20% hit on 
performance, in exchange for more consistent client latency. 

 

From: Steve Taylor [mailto:steve.tay...@storagecraft.com 
 ] 
Sent: 07 February 2017 20:35
To: n...@fisk.me.uk  ; ceph-users@lists.ceph.com


Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Thanks, Nick.

 

One other data point that has come up is that nearly all of the blocked 
requests that are waiting on subops are waiting for OSDs with more PGs than the 
others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB OSDs. 
The cluster is well balanced based on OSD capacity, so those 7 OSDs 
individually have 33% more PGs than the others and are causing almost all of 
the blocked requests. It appears that maps updates are generally not blocking 
long enough to show up as blocked requests.

 

I set the reweight on those 7 OSDs to 0.75 and things are backfilling now. I’ll 
test some more when the PG counts per OSD are more balanced and see what I get. 
I’ll also play with the filestore queue. I was telling some of my colleagues 
yesterday that this looked likely to be related to buffer bloat somewhere. I 
appreciate the suggestion.

 

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

From: Nick Fisk [mailto:n...@fisk.me.uk  ] 
Sent: Tuesday, February 7, 2017 10:25 AM
To: Steve Taylor  >; ceph-users@lists.ceph.com 

Re: [ceph-users] RADOSGW S3 api ACLs

2017-02-21 Thread Andrew Bibby
Josef,

A co-maintainer of the radula project forwarded this message to me.

Our little project started specifically to address the handling of ACLs
of uploaded objects through the S3 api, but has since grown to include
other nice-to-haves.

We noted that it was possible to upload objects to a bucket that the
bucket owner could not control or even read. So we set about writing
an upload tool (similar to s3cmd, awscli) that took care of the extra
actions needed on our behalf.

For our clusters, we rely on bucket policies. The user that is the bucket
owner retains FULL_CONTROL, while optional read-only users may also be
present (with perms READ + READ_ACP). With newly uploaded objects,
radula synchronizes the object policy with the bucket policy, changing
ownership if need be.

We guard the write-enabled user closely, and typically issue keys to
the read-only user to research staff.

If you want to look at our implementation, the source is at
https://github.com/bibby/radula

But the short version is: after the upload, we set the object's ACL
to a copy of the bucket's ACL.

- bibby
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-admin bucket link: empty bucket instance id

2017-02-21 Thread Casey Bodley
When it complains about a missing bucket instance id, that's what it's 
expecting to get from the --bucket-id argument. That's the "id" field 
shown in bucket stats. Try this?


$ radosgw-admin bucket link --bucket=XXX --bucket-id=YYY --uid=ZZZ
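
For example, with the 'packer' bucket from Wido's output below (the target
uid here is made up):

$ radosgw-admin bucket stats --bucket=packer | grep '"id"'
    "id": "ams02.5862567.3564",
$ radosgw-admin bucket link --bucket=packer --bucket-id=ams02.5862567.3564 --uid=newowner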

Casey


On 02/21/2017 08:30 AM, Valery Tschopp wrote:

Hi,

I have the same problem with 'radosgw-admin bucket link --bucket XXX 
--uid YYY', but with a Jewel radosgw.


The admin REST API [1] does not work either :(

Any idea?

[1]: http://docs.ceph.com/docs/master/radosgw/adminops/#link-bucket


On 28/01/16 17:03 , Wido den Hollander wrote:

Hi,

I'm trying to link a bucket to a new user and this is failing for me.

The Ceph version is 0.94.5 (Hammer).

The bucket is called 'packer' and I can verify that it exists:

$ radosgw-admin bucket stats --bucket packer

{
"bucket": "packer",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets",
"id": "ams02.5862567.3564",
"marker": "ams02.5862567.3564",
"owner": "X_beta",
"ver": "0#21975",
"master_ver": "0#0",
"mtime": "2015-08-04 12:31:06.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 10737764,
"size_kb_actual": 10737836,
"num_objects": 27
},
"rgw.multimeta": {
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

Now when I try to link this bucket it fails:

$ radosgw-admin bucket link --bucket packer --uid 

"failure: (22) Invalid argument: empty bucket instance id"

It seems like this is a bug in the radosgw-admin tool where it doesn't
parse the --bucket argument properly.

Any ideas?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs with large numbers of files per directory

2017-02-21 Thread Rhian Resnick
Logan,


Thank you for the feedback.


Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222




From: Logan Kuhn 
Sent: Tuesday, February 21, 2017 8:42 AM
To: Rhian Resnick
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Cephfs with large numbers of files per directory

We had a very similar configuration at one point.

I was fairly new when we started to move away from it, but what happened to us 
is that anytime a directory needed a stat, backup, ls, rsync, etc., it would 
take minutes to return, and while it was waiting CPU load would spike due to 
iowait.  The difference between what you've said and what we did was that we 
used a gateway machine; the actual cluster never had any issues with it.  This 
was also on Infernalis, so things have probably changed in Jewel and Kraken.

Regards,
Logan

- On Feb 21, 2017, at 7:37 AM, Rhian Resnick  wrote:

Good morning,


We are currently investigating using Ceph for a KVM farm, block storage and 
possibly file systems (cephfs with ceph-fuse, and ceph hadoop). Our cluster 
will be composed of 4 nodes, ~240 OSD's, and 4 monitors providing mon and mds 
as required.


What experience has the community had with large numbers of files in a single 
directory (500,000 - 5 million)? We know that directory fragmentation will be 
required but are concerned about the stability of the implementation.


Your opinions and suggestions are welcome.


Thank you


Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd export-diff bug? rbd export-diff generates different incremental files

2017-02-21 Thread Jason Dillaman
On Mon, Feb 20, 2017 at 10:13 PM, Zhongyan Gu  wrote:
> You mentioned the fix is scheduled to be included in Hammer 0.94.10, Is
> there any fix already there??

The fix for that specific diff issue is included in the hammer branch
[1][2] -- but 0.94.10 hasn't been released yet.

[1] http://tracker.ceph.com/issues/18111
[2] https://github.com/ceph/ceph/pull/12446

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-21 Thread Samuel Just
Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
non-blocking implementation) to that branch.  Care to take it for a whirl?
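
For anyone who wants to try it, the parameter is set like any other OSD
option; a sketch, values purely illustrative:

[osd]
osd snap trim sleep = 0.1     # seconds to pause between snap trim work items

or injected into a running cluster:

$ ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'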
-Sam

On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk  wrote:

> Building now
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Samuel Just
> *Sent:* 09 February 2017 19:22
> *To:* Nick Fisk 
> *Cc:* ceph-users@lists.ceph.com
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep
> 
> (based on master) passed a rados suite.  It adds a configurable limit to
> the number of pgs which can be trimming on any OSD (default: 2).  PGs
> trimming will be in snaptrim state, PGs waiting to trim will be in
> snaptrim_wait state.  I suspect this'll be adequate to throttle the amount
> of trimming.  If not, I can try to add an explicit limit to the rate at
> which the work items trickle into the queue.  Can someone test this branch?
>   Tester beware: this has not merged into master yet and should only be run
> on a disposable cluster.
>
> -Sam
>
>
>
> On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk  wrote:
>
> Yeah it’s probably just the fact that they have more PG’s so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
>
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IO’s queued up on the disk, so
> that the scheduler can re-arrange them to action them in the most efficient
> manner as the disk head moves across the platters. You might possibly see up
> to a 20% hit on performance, in exchange for more consistent client
> latency.
>
>
>
> *From:* Steve Taylor [mailto:steve.tay...@storagecraft.com]
> *Sent:* 07 February 2017 20:35
> *To:* n...@fisk.me.uk; ceph-users@lists.ceph.com
>
>
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Thanks, Nick.
>
>
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that maps updates are generally not
> blocking long enough to show up as blocked requests.
>
>
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I’ll test some more when the PG counts per OSD are more balanced and see
> what I get. I’ll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
>
>
> --
>
>
> 
>
> *Steve* *Taylor* | Senior Software Engineer | StorageCraft Technology
> Corporation
> 
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office: *801.871.2799 <(801)%20871-2799> |
> --
>
> *From:* Nick Fisk [mailto:n...@fisk.me.uk]
> *Sent:* Tuesday, February 7, 2017 10:25 AM
> *To:* Steve Taylor ;
> ceph-users@lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Hi Steve,
>
>
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, ie the scheduler and disk layer in Linux. Once
> the IO’s leave Ceph it’s a bit of a free for all and the client IO’s tend
> to get 

Re: [ceph-users] CephFS : double objects in 2 pools

2017-02-21 Thread John Spray
On Tue, Feb 21, 2017 at 5:20 PM, Florent B  wrote:
> Hi everyone,
>
> I use a Ceph Jewel cluster.
>
> I have a CephFS with some directories at root, on which I defined some
> layouts :
>
> # getfattr -n ceph.dir.layout maildata1/
> # file: maildata1/
> ceph.dir.layout="stripe_unit=1048576 stripe_count=3 object_size=4194304
> pool=cephfs.maildata1"
>
>
> My problem is that the default "data" pool contains 44904 EMPTY objects
> (size of pool=0), and duplicates of my pool cephfs.maildata1.

This is normal: the MDS stores a "backtrace" for each file, that
allows it to find the file by inode number when necessary.  Usually,
when files are in the first data pool, the backtrace is stored along
with the data.  When your files are in a different data pool, the
backtrace is stored on an otherwise-empty object in the first data
pool.
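
A quick way to convince yourself of that (the object name is reconstructed
from your inode number plus the usual .00000000 head-object suffix, a guess,
so check against your own rados listing):

$ rados -p data listxattr 1dea15c.00000000

should show a 'parent' xattr on the otherwise empty object; that xattr is the
encoded backtrace.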

Cheers,
John

> An example :
>
> # stat
> maildata1/domain.net/test5/mdbox/mailboxes/1319/dbox/dovecot.index.cache
>   File:
> 'maildata1/domain.net/test5/mdbox/mailboxes/1319/dbox/dovecot.index.cache'
>   Size: 728   Blocks: 2  IO Block: 1048576 regular file
> Device: 54h/84dInode: 1099526218076  Links: 1
>
> # getfattr -n ceph.file.layout
> maildata1/domain.net/test5/mdbox/mailboxes/1319/dbox/dovecot.index.cache
> # file:
> maildata1/domain.net/test5/mdbox/mailboxes/1319/dbox/dovecot.index.cache
> ceph.file.layout="stripe_unit=1048576 stripe_count=3 object_size=4194304
> pool=cephfs.maildata1"
>
> 1099526218076 = 1dea15c in hex :
>
> # rados -p cephfs.maildata1 ls | grep "1dea15c"
> 1dea15c.
>
> # rados -p data ls | grep "1dea15c"
> 1dea15c.
>
> The object in maildata1 pool contains file data, wheras the one in data
> is empty :
>
> # rados -p data get 1dea15c. - | wc -c
> 0
>
> # rados -p cephfs.maildata1 get 1dea15c. - | wc -c
> 728
>
> Clients accessing these directories does not have permission on "data"
> pool, that's normal :
>
> # ceph auth get client.maildata1
> exported keyring for client.maildata1
> [client.maildata1]
> key = 
> caps mds = "allow r, allow rw path=/maildata1"
> caps mon = "allow r"
> caps osd = "allow * pool=cephfs.maildata1"
>
> Have you ever seen this ? What could be the cause ?
>
> Thank you for your help.
>
> Florent
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-21 Thread george.vasilakakos
I have noticed something odd with the ceph-objectstore-tool command:

It always reports PG X not found even on healthy OSDs/PGs. The 'list' op works 
on both healthy and unhealthy PGs.
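
One thing worth noting, since this is an EC pool: ceph-objectstore-tool seems
to want the shard-qualified PG id on erasure-coded OSDs, i.e. the id exactly
as it appears under current/, for example:

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op info --pgid 1.323s8

whereas --pgid 1.323 without the sN suffix reports "not found" even when the
PG is present.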


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
Sent: 21 February 2017 10:17
To: w...@42on.com; ceph-users@lists.ceph.com; bhubb...@redhat.com
Subject: Re: [ceph-users] PG stuck peering after host reboot

> Can you for the sake of redundancy post your sequence of commands you 
> executed and their output?

[root@ceph-sn852 ~]# systemctl stop ceph-osd@307
[root@ceph-sn852 ~]# ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn852 ~]# systemctl start ceph-osd@307

I did the same thing for 307 (new up but not acting primary) and all the OSDs 
in the original set (including 595). The output was the exact same. I don't 
have the whole session log handy from all those sessions but here's a sample 
from one that's easy to pick out:

[root@ceph-sn832 ~]# systemctl stop ceph-osd@7
[root@ceph-sn832 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 
--op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn832 ~]# systemctl start ceph-osd@7
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
0.18_head/  11.1c8s5_TEMP/  13.3b_head/ 1.74s1_TEMP/2.256s6_head/   
2.c3s10_TEMP/   3.b9s4_head/
0.18_TEMP/  1.16s1_head/13.3b_TEMP/ 1.8bs9_head/2.256s6_TEMP/   
2.c4s3_head/3.b9s4_TEMP/
1.106s10_head/  1.16s1_TEMP/1.3a6s0_head/   1.8bs9_TEMP/2.2d5s2_head/   
2.c4s3_TEMP/4.34s10_head/
1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  2.2d5s2_TEMP/   
2.dbs7_head/4.34s10_TEMP/
11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  2.340s8_head/   
2.dbs7_TEMP/commit_op_seq
11.12as10_TEMP/ 1.2ds8_head/1.3e4s9_TEMP/   2.1c1s10_head/  2.340s8_TEMP/   
3.159s3_head/   meta/
11.148s2_head/  1.2ds8_TEMP/14.1a_head/ 2.1c1s10_TEMP/  2.36es10_head/  
3.159s3_TEMP/   nosnap
11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/ 2.1d0s6_head/   2.36es10_TEMP/  
3.170s1_head/   omap/
11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/2.1d0s6_TEMP/   2.3d3s10_head/  
3.170s1_TEMP/
11.165s6_TEMP/  13.32_head/ 1.6fs9_TEMP/2.1efs2_head/   2.3d3s10_TEMP/  
3.1aas5_head/
11.1c8s5_head/  13.32_TEMP/ 1.74s1_head/2.1efs2_TEMP/   2.c3s10_head/   
3.1aas5_TEMP/
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
1.323s8_head/ 1.323s8_TEMP/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
DIR_3/ DIR_7/ DIR_B/ DIR_F/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ DIR_A/ 
DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
total 271276
-rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 
datadisk\srucio\sdata16\u13TeV\s11\sad\sDAOD\uTOPQ4.09383728.\u000436.pool.root.1.0001__head_2BA91323__1__8

> If you run a find in the data directory of the OSD, does that PG show up?

OSDs 595 (used to be 0), 1391(1), 240(2), 7(7, the one that started this) have 
a 1.323sX_head directory. OSD 307 does not.
I have not checked the other OSDs in the PG yet.

Wido

>
> Best regards,
>
> George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs with large numbers of files per directory

2017-02-21 Thread Logan Kuhn
We had a very similar configuration at one point. 

I was fairly new when we started to move away from it, but what happened to us 
is that anytime a directory needed a stat, backup, ls, rsync, etc., it would 
take minutes to return, and while it was waiting CPU load would spike due to 
iowait. The difference between what you've said and what we did was that we 
used a gateway machine; the actual cluster never had any issues with it. This 
was also on Infernalis, so things have probably changed in Jewel and Kraken. 

Regards, 
Logan 

- On Feb 21, 2017, at 7:37 AM, Rhian Resnick  wrote: 

| Good morning,

| We are currently investigating using Ceph for a KVM farm, block storage and
| possibly file systems (cephfs with ceph-fuse, and ceph hadoop). Our cluster
| will be composed of 4 nodes, ~240 OSD's, and 4 monitors providing mon and mds
| as required.

| What experience has the community had with large numbers of files in a single
| directory (500,000 - 5 million)? We know that directory fragmentation will be
| required but are concerned about the stability of the implementation.

| Your opinions and suggestions are welcome.

| Thank you

| Rhian Resnick

| Assistant Director Middleware and HPC

| Office of Information Technology

| Florida Atlantic University

| 777 Glades Road, CM22, Rm 173B

| Boca Raton, FL 33431

| Phone 561.297.2647

| Fax 561.297.0222

| ___
| ceph-users mailing list
| ceph-users@lists.ceph.com
| http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs with large numbers of files per directory

2017-02-21 Thread Rhian Resnick
Good morning,


We are currently investigating using Ceph for a KVM farm, block storage and 
possibly file systems (cephfs with ceph-fuse, and ceph hadoop). Our cluster 
will be composed of 4 nodes, ~240 OSD's, and 4 monitors providing mon and mds 
as required.


What experience has the community had with large numbers of files in a single 
directory (500,000 - 5 million)? We know that directory fragmentation will be 
required but are concerned about the stability of the implementation.
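
What we have gathered so far is that fragmentation has to be switched on
explicitly in current releases. A sketch of the Jewel-era knobs as we
understand them, worth double-checking against the CephFS documentation for
your release:

[mds]
mds bal frag = true     # allow the MDS to fragment large directories

plus, on Jewel, marking the filesystem itself with the allow_dirfrags flag
via 'ceph fs set'.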


Your opinions and suggestions are welcome.


Thank you


Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-admin bucket link: empty bucket instance id

2017-02-21 Thread Valery Tschopp

Hi,

I have the same problem with 'radosgw-admin bucket link --bucket XXX 
--uid YYY', but with a Jewel radosgw.


The admin REST API [1] does not work either :(

Any idea?

[1]: http://docs.ceph.com/docs/master/radosgw/adminops/#link-bucket


On 28/01/16 17:03 , Wido den Hollander wrote:

Hi,

I'm trying to link a bucket to a new user and this is failing for me.

The Ceph version is 0.94.5 (Hammer).

The bucket is called 'packer' and I can verify that it exists:

$ radosgw-admin bucket stats --bucket packer

{
"bucket": "packer",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets",
"id": "ams02.5862567.3564",
"marker": "ams02.5862567.3564",
"owner": "X_beta",
"ver": "0#21975",
"master_ver": "0#0",
"mtime": "2015-08-04 12:31:06.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 10737764,
"size_kb_actual": 10737836,
"num_objects": 27
},
"rgw.multimeta": {
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

Now when I try to link this bucket it fails:

$ radosgw-admin bucket link --bucket packer --uid 

"failure: (22) Invalid argument: empty bucket instance id"

It seems like this is a bug in the radosgw-admin tool where it doesn't
parse the --bucket argument properly.

Any ideas?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
SWITCH
--
Valery Tschopp, Software Engineer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
email: valery.tsch...@switch.ch phone: +41 44 268 1544




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw's swift api return 403, and user cann't be removed.

2017-02-21 Thread choury
Hi all,
I created a user to test the swift API like this:

{
"user_id": "test",
"display_name": "test",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [
{
"id": "test:swift",
"permissions": "full-control"
}
],
"keys": [
{
"user": "test",
"access_key": "E3AUWSVX2TX4QCXTTGK6",
"secret_key": "805UKOYIc484xwzeewMsBNMFpMofoZOjWvsapyDl"
},
{
"user": "test:swift",
"access_key": "QXPFRV8PAC87VPBIZR4K",
"secret_key": ""
}
],
"swift_keys": [
{
"user": "test:swift",
"secret_key": "pZr0ZDvH8BgMHCv8x52rf2wFJdaUKfXQWpB1LCzJ"
}
],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

But when accessing the API with curl:
curl -v localhost/auth -H "X-Auth-User: test:swift" -H "X-Auth-Key:
pZr0ZDvH8BgMHCv8x52rf2wFJdaUKfXQWpB1LCzJ"

This is the response:

 HTTP/1.1 403 Forbidden
 x-amz-request-id: tx0e8dc-0058ac133d-e21b2-cn-sh
 Content-Length: 23
 Accept-Ranges: bytes
 Content-Type: application/json
 Date: Tue, 21 Feb 2017 10:15:25 GMT

{"Code":"AccessDenied"}

Then when I tried removing this user, I got this:

radosgw-admin user rm --uid=test -n client.radosgw.cn-sh
could not remove user: unable to remove user, unable to remove user from RADOS
2017-02-21 18:18:15.711224 7f989f29b860  0 ERROR: could not remove
test:swift (swift name object), should be fixed (err=-22)

# radosgw-admin subuser rm --uid=test:swift -n client.radosgw.cn-sh
could not remove subuser: unable to parse request, user info was not populated


My ceph version is 0.94.9, what should I do with this?
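
For reference, the two things I plan to try next, assuming the standard
radosgw-admin flags (corrections welcome):

$ radosgw-admin key create --subuser=test:swift --key-type=swift --gen-secret   # regenerate the swift secret
$ radosgw-admin subuser rm --subuser=test:swift --purge-keys                    # note --subuser=, not --uid=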


Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-21 Thread george.vasilakakos
> Can you for the sake of redundancy post your sequence of commands you 
> executed and their output?

[root@ceph-sn852 ~]# systemctl stop ceph-osd@307
[root@ceph-sn852 ~]# ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn852 ~]# systemctl start ceph-osd@307

I did the same thing for 307 (new up but not acting primary) and all the OSDs 
in the original set (including 595). The output was the exact same. I don't 
have the whole session log handy from all those sessions but here's a sample 
from one that's easy to pick out:

[root@ceph-sn832 ~]# systemctl stop ceph-osd@7
[root@ceph-sn832 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 
--op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn832 ~]# systemctl start ceph-osd@7
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
0.18_head/  11.1c8s5_TEMP/  13.3b_head/ 1.74s1_TEMP/2.256s6_head/   
2.c3s10_TEMP/   3.b9s4_head/
0.18_TEMP/  1.16s1_head/13.3b_TEMP/ 1.8bs9_head/2.256s6_TEMP/   
2.c4s3_head/3.b9s4_TEMP/
1.106s10_head/  1.16s1_TEMP/1.3a6s0_head/   1.8bs9_TEMP/2.2d5s2_head/   
2.c4s3_TEMP/4.34s10_head/
1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  2.2d5s2_TEMP/   
2.dbs7_head/4.34s10_TEMP/
11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  2.340s8_head/   
2.dbs7_TEMP/commit_op_seq
11.12as10_TEMP/ 1.2ds8_head/1.3e4s9_TEMP/   2.1c1s10_head/  2.340s8_TEMP/   
3.159s3_head/   meta/
11.148s2_head/  1.2ds8_TEMP/14.1a_head/ 2.1c1s10_TEMP/  2.36es10_head/  
3.159s3_TEMP/   nosnap
11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/ 2.1d0s6_head/   2.36es10_TEMP/  
3.170s1_head/   omap/
11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/2.1d0s6_TEMP/   2.3d3s10_head/  
3.170s1_TEMP/   
11.165s6_TEMP/  13.32_head/ 1.6fs9_TEMP/2.1efs2_head/   2.3d3s10_TEMP/  
3.1aas5_head/   
11.1c8s5_head/  13.32_TEMP/ 1.74s1_head/2.1efs2_TEMP/   2.c3s10_head/   
3.1aas5_TEMP/   
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
1.323s8_head/ 1.323s8_TEMP/ 
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
DIR_3/ DIR_7/ DIR_B/ DIR_F/ 
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ DIR_A/ 
DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
total 271276
-rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 
datadisk\srucio\sdata16\u13TeV\s11\sad\sDAOD\uTOPQ4.09383728.\u000436.pool.root.1.0001__head_2BA91323__1__8

> If you run a find in the data directory of the OSD, does that PG show up?

OSDs 595 (used to be 0), 1391(1), 240(2), 7(7, the one that started this) have 
a 1.323sX_head directory. OSD 307 does not.
I have not checked the other OSDs in the PG yet.

Wido

>
> Best regards,
>
> George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How safe is ceph pg repair these days?

2017-02-21 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Gregory Farnum
> Sent: 20 February 2017 22:13
> To: Nick Fisk ; David Zafman 
> Cc: ceph-users 
> Subject: Re: [ceph-users] How safe is ceph pg repair these days?
> 
> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk  wrote:
> > From what I understand, in Jewel+ Ceph has the concept of an
> > authoritative shard, so in the case of a 3x replica pool, it will
> > notice that 2 replicas match and one doesn't and use one of the good
> > replicas. However, in a 2x pool you're out of luck.
> >
> > However, if someone could confirm my suspicions that would be good as well.
> 
> Hmm, I went digging in and sadly this isn't quite right. The code has a lot of
> internal plumbing to allow more smarts than were previously feasible and
> the erasure-coded pools make use of them for noticing stuff like local
> corruption. Replicated pools make an attempt but it's not as reliable as one
> would like and it still doesn't involve any kind of voting mechanism.
> A self-inconsistent replicated primary won't get chosen. A primary is self-
> inconsistent when its digest doesn't match the data, which happens when:
> 1) the object hasn't been written since it was last scrubbed, or
> 2) the object was written in full, or
> 3) the object has only been appended to since the last time its digest was
> recorded, or
> 4) something has gone terribly wrong in/under LevelDB and the omap entries
> don't match what the digest says should be there.
> 

Thanks for the correction Greg. So I'm guessing that the probability of
overwriting with an incorrect primary is reduced in later releases, but it
can still happen.

Quick question, and maybe this is a #5 for your list: what about objects that
are marked inconsistent on the primary due to a read error? I would say 90% of
my inconsistent PGs are caused by a read error and an associated smartctl
error. 

"rados list-inconsistent-obj" shows that it knows that the primary had a
read error, so I assume a "pg repair" wouldn't try and read from the primary
again?

> David knows more and can correct me if I'm missing something. He's also working on
> interfaces for scrub that are more friendly in general and allow
> administrators to make more fine-grained decisions about recovery in ways
> that cooperate with RADOS.
> -Greg
> 
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Tracy Reed
> >> Sent: 18 February 2017 03:06
> >> To: Shinobu Kinjo 
> >> Cc: ceph-users 
> >> Subject: Re: [ceph-users] How safe is ceph pg repair these days?
> >>
> >> Well, that's the question...is that safe? Because the link to the mailing
> >> list post (possibly outdated) says that what you just suggested is
> >> definitely NOT safe. Is the mailing list post wrong? Has the situation
> >> changed? Exactly what does ceph repair do now? I suppose I could go dig
> >> into the code but I'm not an expert and would hate to get it wrong and
> >> post possibly bogus info to the list for other newbies to find and worry
> >> about and possibly lose their data.
> >>
> >> On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:
> >> > if ``ceph pg deep-scrub `` does not work then
> >> >   do
> >> > ``ceph pg repair 
> >> >
> >> >
> >> > On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed
> >> > 
> >> wrote:
> >> > > I have a 3 replica cluster. A couple times I have run into
> >> > > inconsistent PGs. I googled it and ceph docs and various blogs
> >> > > say run a repair first. But a couple people on IRC and a mailing
> >> > > list thread from 2015 say that ceph blindly copies the primary
> >> > > over the secondaries and calls it good.
> >> > >
> >> > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html
> >> > >
> >> > > I sure hope that isn't the case. If so it would seem highly
> >> > > irresponsible to implement such a naive command called "repair".
> >> > > I have recently learned how to properly analyze the OSD logs and
> >> > > manually fix these things but not before having run repair on a
> >> > > dozen inconsistent PGs. Now I'm worried about what sort of
> >> > > corruption I may have introduced. Repairing things by hand is a
> >> > > simple heuristic based on comparing the size or checksum (as
> >> > > indicated by the logs) for each of the 3 copies and figuring out
> >> > > which is correct. Presumably matching two out of three should win
> >> > > and the odd object out should be deleted since having the exact
> >> > > same kind of error on two different OSDs is highly improbable. I
> >> > > don't understand why ceph repair wouldn't have done this all along.
> >> > >
> >> > > What is the current best practice in the use of ceph repair?
> 

Re: [ceph-users] PG stuck peering after host reboot

2017-02-21 Thread Wido den Hollander

> On 20 February 2017 at 17:52, george.vasilaka...@stfc.ac.uk wrote:
> 
> 
> Hi Wido,
> 
> Just to make sure I have everything straight,
> 
> > If the PG still doesn't recover do the same on osd.307 as I think that 
> > 'ceph pg X query' still hangs?
> 
> > The info from ceph-objectstore-tool might shed some more light on this PG.
> 
> You mean run the objectstore command on 307, not remove that from the CRUSH 
> map too. Am I correct?
> 

Correct. Stop the OSD, leave it marked as in and run this command.

> Assuming I am, I tried this command on all OSDs in that PG, including 307 and 
> they all say "PG '1.323' not found", which is weird and worrying.

Can you for the sake of redundancy post your sequence of commands you executed 
and their output?

If you run a find in the data directory of the OSD, does that PG show up?

Wido

> 
> Best regards,
> 
> George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com