Re: [ceph-users] Adventures with large RGW buckets

2019-08-02 Thread Josh Durgin

On 8/2/19 3:04 AM, Harald Staub wrote:
Right now our main focus is on the Veeam use case (VMWare backup), used 
with an S3 storage tier. Currently we host a bucket with 125M objects 
and one with 100M objects.


As Paul stated, searching common prefixes can be painful. We had some 
cases that did not work (taking too much time, radosgw taking too much 
memory) until the upgrade from 14.2.1 to 14.2.2, which includes an 
important fix for that :-)


We expect up to 400M objects per bucket. Following the 100k 
recommendation, we started with 4096 shards per bucket.


Other searches of common prefixes took several minutes. Resharding from 
4096 to 1024 helped; response times became nearly 3 times faster.


It feels like the main reason to have shards is to distribute the load of 
index operations over several PGs and therefore over several OSDs. 
So maybe a number of shards much higher than the number of PGs or OSDs 
does not help a lot, while still introducing some overhead? Maybe it would be 
better to have a recommendation based on the number of OSDs involved?


More shards are also important for recovery. Overall recovery time for a
given bucket is reduced since each shard can be recovered in parallel.
With fewer shards, you get larger rados objects, each of which will take
longer to recover, potentially causing a longer outage if there aren't
enough copies to be active.

The mentioned resharding (4096 -> 1024) itself worked ("completed 
successfully"), but the removal of one of the old indexes did not. The 
cluster saw an OSD being marked down, which seems to have aborted the cleanup. 
The OSD process itself stayed up, but there were timeouts, probably during 
RocksDB compaction (judging from the OSD log). The affected OSD has the 
highest number of PGs in the index pool. Again, this suggests that a lot of 
shards do not help when many of them end up in the same RocksDB instance.


Manually removing the objects of the old index one by one was no 
problem. Maybe dynamic resharding could do it similarly to avoid the 
RocksDB overload? Or RocksDB could be made to stay responsive?


There's a lot of work going into making rocksdb more effective for these 
cases - much of it discussed in the performance weekly from July 25:


https://pad.ceph.com/p/performance_weekly

One of the major pieces there is sharding rocksdb into multiple column
families. This reduces the size of an individual LSM-tree within
rocksdb, meaning levels are smaller and compactions are faster. In
testing so far this significantly reduces tail latency (60-70% reduced
99% latency for pure-omap writes, similar to an rgw bucket index
workload).
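
As a rough illustration of why smaller trees help (a toy sketch only, not
Ceph's actual sharding scheme - the hash-based split across a fixed number
of column families is an assumption made up for the example):

import hashlib

NUM_SHARDS = 8  # assumed number of column families for the example

def shard_for_key(key: bytes) -> int:
    # Pick a column family for an omap key via a stable hash.
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NUM_SHARDS

# Spreading N keys over NUM_SHARDS column families means each LSM tree only
# holds ~N / NUM_SHARDS entries, so its levels are smaller and an individual
# compaction has less data to rewrite.
counts = [0] * NUM_SHARDS
for i in range(1_000_000):
    counts[shard_for_key(b"omap_key_%d" % i)] += 1
print(counts)  # roughly 125k keys per column family instead of 1M in one tree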

Josh


On 31.07.19 20:02, Paul Emmerich wrote:

Hi,

We are seeing a trend towards rather large RGW S3 buckets lately. We've
worked on several clusters with 100 - 500 million objects in a single
bucket, and we've been asked about the possibilities of buckets with
several billion objects more than once.

From our experience: buckets with tens of millions of objects usually work
just fine with no big problems. Buckets with hundreds of millions of objects
require some attention. Buckets with billions of objects? "How about
indexless buckets?" - "No, we need to list them".


A few stories and some questions:


1. The recommended number of objects per shard is 100k. Why? How was this
default configuration derived?

It doesn't really match my experience. We know a few clusters running with
larger shards because resharding isn't possible for various reasons at the
moment. They sometimes work better than buckets with lots of shards.

So we've been considering at least doubling that 100k target shard size
for large buckets, which would make the following point far less annoying.



2. Many shards + ordered object listing = lots of IO

Unfortunately, telling people not to use ordered listings when they don't
really need them doesn't really work, as their software usually just
doesn't support that :(

A listing request for X objects will retrieve up to X objects from each
shard in order to sort them. That leads to quite a lot of traffic between
the OSDs and the radosgw instances, even for relatively innocent simple
queries, as X usually defaults to 1000.

Simple example: just getting the first page of a bucket listing with 4096
shards fetches around 1 GB of data from the OSDs to return ~300 KB or so
to the S3 client.
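
A back-of-the-envelope sketch of where that amplification comes from (the
entry size and the merge below are assumptions made up to illustrate the
per-shard fetch that an ordered listing requires, not radosgw's actual code):

import heapq

SHARDS = 4096          # bucket index shards
PAGE = 1000            # S3 max-keys per listing request
AVG_ENTRY_BYTES = 250  # assumed average size of one index entry

def list_first_page(shard_listings):
    # To guarantee global ordering, each shard may have to return up to
    # PAGE entries even though only PAGE entries survive the merge.
    return heapq.nsmallest(PAGE, heapq.merge(*shard_listings))

fetched = SHARDS * PAGE * AVG_ENTRY_BYTES   # pulled from the OSDs
returned = PAGE * AVG_ENTRY_BYTES           # sent on to the S3 client
print("fetched ~%.0f MB to return ~%.0f KB" % (fetched / 1e6, returned / 1e3))
# -> fetched ~1024 MB to return ~250 KB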

I've got two clusters here that are only used for a relatively low-bandwidth
backup use case. However, there are a few buckets with hundreds of millions
of objects that are sometimes being listed by the backup system.

The result is that this cluster has an average read IO of 1-2 GB/s, all going
to the index pool. Not a big deal since that's coming from SSDs and goes over
80 Gbit/s LACP bonds. But it does raise the question of scalability, as the
user-visible load created by the S3 clients is quite low.



3. Deleting large buckets

Someone accidentally put 450 million small o

Re: [ceph-users] Luminous v12.2.10 released

2018-11-30 Thread Josh Durgin

The only relevant component for this issue is OSDs. Upgrading the
monitors first as usual is fine. If your OSDs are all 12.2.8, moving 
them to 12.2.10 has no chance of hitting this bug.


If you upgrade the OSDs to 13.2.2, which does have the PG hard limit
patches, you may hit the bug as noted here:

http://docs.ceph.com/docs/master/releases/mimic/#v13-2-2-mimic

Josh

On 11/30/18 2:07 PM, Christa Doane wrote:

In a mixed node environment, does it matter what role a ceph component has, in 
terms of updating them all to 12.2.10?
I have monitor nodes at 12.2.9 and OSD nodes at 12.2.8.  I am preparing to 
upgrade to mimic 13.2.2 and will be running dist-upgrade+reboot prior to the 
upgrade to 13.2.2, which would move them all to 12.2.10.

Is that ok?
And should I still update my monitor nodes first to 12.2.10 and then OSD nodes 
to 12.2.10?

  Thank you,
Christa

On 11/27/18, 12:16 PM, "ceph-users on behalf of Josh Durgin" 
 wrote:

 On 11/27/18 9:40 AM, Graham Allan wrote:
 >
 >
 > On 11/27/2018 08:50 AM, Abhishek Lekshmanan wrote:
 >>
 >> We're happy to announce the tenth bug fix release of the Luminous
 >> v12.2.x long term stable release series. The previous release, v12.2.9,
 >> introduced the PG hard-limit patches which were found to cause an issue
 >> in certain upgrade scenarios, and this release was expedited to revert
 >> those patches. If you already successfully upgraded to v12.2.9, you
 >> should **not** upgrade to v12.2.10, but rather **wait** for a release in
 >> which http://tracker.ceph.com/issues/36686 is addressed. All other users
 >> are encouraged to upgrade to this release.
 >
 > I wonder if you can comment on upgrade policy for a mixed cluster - eg
 > where the majority is running 12.2.8 but a handful of newly-added osd
 > nodes were installed with 12.2.9. Should the 12.2.8 nodes be upgraded to
 > 12.2.10 (this does sound like it should have no negative effects) and
 > just the 12.2.9 nodes kept to wait for a future release - or wait on all?
 
 I'd suggest upgrading everything to 12.2.10. If you aren't hitting

 crashes already with this mixed 12.2.9 + 12.2.8 cluster, a further
 upgrade shouldn't cause any issues.
 
 Josh

 





Re: [ceph-users] Luminous v12.2.10 released

2018-11-27 Thread Josh Durgin

On 11/27/18 12:11 PM, Josh Durgin wrote:

13.2.3 will have a similar revert, so if you are running anything other
than 12.2.9 or 13.2.2 you can go directly to 13.2.3.


Correction: I misremembered here, we're not reverting these patches for
13.2.3, so 12.2.9 users can upgrade to 13.2.2 or later, but other
luminous users should avoid 13.2.2 or later for the time being, unless
they can accept some downtime during the upgrade.

See http://tracker.ceph.com/issues/36686#note-6 for more detail.

Josh


Re: [ceph-users] Luminous v12.2.10 released

2018-11-27 Thread Josh Durgin

On 11/27/18 12:00 PM, Robert Sander wrote:

On 27.11.18 at 15:50, Abhishek Lekshmanan wrote:


   As mentioned above if you've successfully upgraded to v12.2.9 DO NOT
   upgrade to v12.2.10 until the linked tracker issue has been fixed.


What about clusters currently running 12.2.9 (because this was the
version in the repos when they got installed / last upgraded) where new
nodes are scheduled to be set up?
Can the new nodes be installed with 12.2.10 and run with the other
12.2.9 nodes?
Should the new nodes be pinned to 12.2.9?


To be safe, pin them to 12.2.9 until we have a safe upgrade path in a
future luminous release. Alternately you can restart them all at once as
12.2.10 if you don't mind a short loss of availability.

Josh


Re: [ceph-users] Luminous v12.2.10 released

2018-11-27 Thread Josh Durgin

On 11/27/18 9:40 AM, Graham Allan wrote:



On 11/27/2018 08:50 AM, Abhishek Lekshmanan wrote:


We're happy to announce the tenth bug fix release of the Luminous
v12.2.x long term stable release series. The previous release, v12.2.9,
introduced the PG hard-limit patches which were found to cause an issue
in certain upgrade scenarios, and this release was expedited to revert
those patches. If you already successfully upgraded to v12.2.9, you
should **not** upgrade to v12.2.10, but rather **wait** for a release in
which http://tracker.ceph.com/issues/36686 is addressed. All other users
are encouraged to upgrade to this release.


I wonder if you can comment on upgrade policy for a mixed cluster - eg 
where the majority is running 12.2.8 but a handful of newly-added osd 
nodes were installed with 12.2.9. Should the 12.2.8 nodes be upgraded to 
12.2.10 (this does sound like it should have no negative effects) and 
just the 12.2.9 nodes kept to wait for a future release - or wait on all?


I'd suggest upgrading everything to 12.2.10. If you aren't hitting
crashes already with this mixed 12.2.9 + 12.2.8 cluster, a further
upgrade shouldn't cause any issues.

Josh


Re: [ceph-users] Luminous v12.2.10 released

2018-11-27 Thread Josh Durgin

On 11/27/18 8:26 AM, Simon Ironside wrote:

On 27/11/2018 14:50, Abhishek Lekshmanan wrote:


We're happy to announce the tenth bug fix release of the Luminous
v12.2.x long term stable release series. The previous release, v12.2.9,
introduced the PG hard-limit patches which were found to cause an issue
in certain upgrade scenarios, and this release was expedited to revert
those patches. If you already successfully upgraded to v12.2.9, you
should **not** upgrade to v12.2.10, but rather **wait** for a release in
which http://tracker.ceph.com/issues/36686 is addressed. All other users
are encouraged to upgrade to this release.


Is it safe for v12.2.9 users to upgrade to v13.2.2 Mimic?

http://tracker.ceph.com/issues/36686 suggests a similar revert might be 
on the cards for v13.2.3 so I'm not sure.


Yes, 13.2.2 has the same pg hard limit code as 12.2.9, so that upgrade
is safe. The only danger is running a mixed-version cluster where some
of the osds have the pg hard limit code, and others do not.

13.2.3 will have a similar revert, so if you are running anything other
than 12.2.9 or 13.2.2 you can go directly to 13.2.3.

Josh


Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-06 Thread Josh Durgin

You should be able to avoid the crash by setting:

osd recovery max single start = 1
osd recovery max active = 1

With that, you can unset norecover to let recovery start again.

A fix so you don't need those settings is here: 
https://github.com/ceph/ceph/pull/21273


If you see any other backtraces let me know - especially the
complete_read_op one from http://tracker.ceph.com/issues/21931

Josh

On 04/05/2018 08:25 PM, Adam Tygart wrote:

Thank you! Setting norecover has seemed to work in terms of keeping
the osds up. I am glad my logs were of use in tracking this down. I am
looking forward to future updates.

Let me know if you need anything else.

--
Adam

On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin  wrote:

On 04/05/2018 08:11 PM, Josh Durgin wrote:


On 04/05/2018 06:15 PM, Adam Tygart wrote:


Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.

I've run another one of them with more logging, debug osd = 20; debug
ms = 1 (definitely more than one crash in there):
http://people.cs.ksu.edu/~mozes/ceph-osd.422.log

Anyone have any thoughts? My cluster feels like it is getting more and
more unstable by the hour...



Thanks to your logs, I think I've found the root cause. It looks like a
bug in the EC recovery code that's triggered by EC overwrites. I'm working
on a fix.

For now I'd suggest setting the noout and norecover flags to avoid
hitting this bug any more by avoiding recovery. Backfilling with no client
I/O would also avoid the bug.



I forgot to mention the tracker ticket for this bug is:
http://tracker.ceph.com/issues/23195




Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Josh Durgin

On 04/05/2018 08:11 PM, Josh Durgin wrote:

On 04/05/2018 06:15 PM, Adam Tygart wrote:

Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.

I've run another one of them with more logging, debug osd = 20; debug
ms = 1 (definitely more than one crash in there):
http://people.cs.ksu.edu/~mozes/ceph-osd.422.log

Anyone have any thoughts? My cluster feels like it is getting more and
more unstable by the hour...


Thanks to your logs, I think I've found the root cause. It looks like a
bug in the EC recovery code that's triggered by EC overwrites. I'm 
working on a fix.


For now I'd suggest setting the noout and norecover flags to avoid
hitting this bug any more by avoiding recovery. Backfilling with no 
client I/O would also avoid the bug.


I forgot to mention the tracker ticket for this bug is:
http://tracker.ceph.com/issues/23195


Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Josh Durgin

On 04/05/2018 06:15 PM, Adam Tygart wrote:

Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.

I've run another one of them with more logging, debug osd = 20; debug
ms = 1 (definitely more than one crash in there):
http://people.cs.ksu.edu/~mozes/ceph-osd.422.log

Anyone have any thoughts? My cluster feels like it is getting more and
more unstable by the hour...


Thanks to your logs, I think I've found the root cause. It looks like a
bug in the EC recovery code that's triggered by EC overwrites. I'm 
working on a fix.


For now I'd suggest setting the noout and norecover flags to avoid
hitting this bug any more by avoiding recovery. Backfilling with no 
client I/O would also avoid the bug.



Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-15 Thread Josh Durgin
(Sorry for top posting, this email client isn't great at editing)


The mitigation strategy I mentioned before of forcing backfill could be 
backported to jewel, but I don't think it's a very good option for RBD users 
without SSDs.


In luminous there is a command (something like 'ceph pg force-recovery') that 
you can use to prioritize recovery of particular PGs (and thus rbd images with 
some scripting). This would at least let you limit the scope of affected 
images. A couple folks from OVH added it for just this purpose.
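
For what it's worth, the scripting could look something like this rough,
untested sketch (pool and image names are placeholders, it assumes
Luminous-era 'rbd'/'ceph' JSON output, and mapping every object with
'ceph osd map' is slow for large images):

#!/usr/bin/env python3
import json
import subprocess

POOL = "rbd"         # placeholder pool name
IMAGE = "vm-disk-1"  # placeholder image name

def run_json(cmd):
    # Run a ceph/rbd command with JSON output and parse it.
    return json.loads(subprocess.check_output(cmd + ["--format", "json"]))

# Object prefix (e.g. 'rbd_data.1234abcd') and object count for the image.
info = run_json(["rbd", "info", "%s/%s" % (POOL, IMAGE)])
prefix = info["block_name_prefix"]
num_objects = info["size"] // info["object_size"]

# Map each data object to its PG and collect the unique PG ids.
pgids = set()
for i in range(num_objects + 1):
    oid = "%s.%016x" % (prefix, i)
    pgids.add(run_json(["ceph", "osd", "map", POOL, oid])["pgid"])

# Ask the cluster to recover these PGs ahead of everything else.
subprocess.check_call(["ceph", "pg", "force-recovery"] + sorted(pgids))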


Neither of these is an ideal workaround, but I haven't thought of a better one 
for existing versions.


Josh


Sent from Nine

From: Florian Haas 
Sent: Sep 15, 2017 3:43 PM
To: Josh Durgin
Cc: ceph-users@lists.ceph.com; Christian Theune
Subject: Re: [ceph-users] Clarification on sequence of recovery and client ops 
after OSDs rejoin cluster (also, slow requests)

> On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin  wrote: 
> >> So this affects just writes. Then I'm really not following the 
> >> reasoning behind the current behavior. Why would you want to wait for 
> >> the recovery of an object that you're about to clobber anyway? Naïvely 
> >> thinking an object like that would look like a candidate for 
> >> *eviction* from the recovery queue, not promotion to a higher 
> >> priority. Is this because the write could be a partial write, whereas 
> >> recovery would need to cover the full object? 
> > 
> > 
> > Generally most writes are partial writes - for RBD that's almost always 
> > the case - often writes are 512b or 4kb. It's also true for e.g. RGW 
> > bucket index updates (adding an omap key/value pair). 
>
> Sure, makes sense. 
>
> >> This is all under the disclaimer that I have no detailed 
> >> knowledge of the internals so this is all handwaving, but would a more 
> >> logical sequence of events not look roughly like this: 
> >> 
> >> 1. Are all replicas of the object available? If so, goto 4. 
> >> 2. Is the write a full object write? If so, goto 4. 
> >> 3. Read the local copy of the object, splice in the partial write, 
> >> making it a full object write. 
> >> 4. Evict the object from the recovery queue. 
> >> 5. Replicate the write. 
> >> 
> >> Forgive the silly use of goto; I'm wary of email clients mangling 
> >> indentation if I were to write this as a nested if block. :) 
> > 
> > 
> > This might be a useful optimization in some cases, but it would be 
> > rather complex to add to the recovery code. It may be worth considering 
> > at some point - same with deletes or other cases where the previous data 
> > is not needed. 
>
> Uh, yeah, waiting for an object to recover just so you can then delete 
> it, and blocking the delete I/O in the process, does also seem rather 
> very strange. 
>
> I think we do agree that any instance of I/O being blocked upward of 
> 30s in a VM is really really bad, but the way you describe it, I see 
> little chance for a Ceph-deploying cloud operator to ever make a 
> compelling case to their customers that such a thing is unlikely to 
> happen. And I'm not even sure if a knee-jerk reaction to buy faster 
> hardware would be a very prudent investment: it's basically all just a 
> factor of (a) how much I/O happens on a cluster during an outage, (b) 
> how many nodes/OSDs will be affected by that outage. Neither is very 
> predictable, and only (b) is something you have any influence over in 
> a cloud environment. Beyond a certain threshold of either (a) or (b), 
> the probability of *recovery* slowing a significant number of VMs to a 
> crawl approximates 1. 
>
> For an rgw bucket index pool, that's usually a sufficiently small 
> amount of data that allows you to sprinkle a few fast drives 
> throughout your cluster, create a ruleset with a separate root 
> (pre-Luminous) or making use of classes (Luminous and later), and then 
> assign that ruleset to the pool. But for RBD storage, that's usually 
> not an option — not at non-prohibitive cost, anyway. 
>
> Can you share your suggested workaround / mitigation strategy for 
> users that are currently being bitten by this behavior? If async 
> recovery lands in mimic with no chance of a backport, then it'll be a 
> while before LTS users get any benefit out of it. 
>
> Cheers, 
> Florian 


Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-15 Thread Josh Durgin

On 09/15/2017 01:57 AM, Florian Haas wrote:

On Fri, Sep 15, 2017 at 8:58 AM, Josh Durgin  wrote:

This is more of an issue with write-intensive RGW buckets, since the
bucket index object is a single bottleneck if it needs recovery, and
all further writes to a shard of a bucket index will be blocked on that
bucket index object.


Well, yes, the problem impact may be even worse on rgw, but you do
agree that the problem does exist for RBD too, correct? (The hard
evidence points to that.)


Yes, of course it still exists for RBD or other uses.




There's a description of the idea here:

https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92


Thanks! That raises another question:

"Until now, this recovery process was synchronous - it blocked writes
to an object until it was recovered."

So this affects just writes. Then I'm really not following the
reasoning behind the current behavior. Why would you want to wait for
the recovery of an object that you're about to clobber anyway? Naïvely
thinking an object like that would look like a candidate for
*eviction* from the recovery queue, not promotion to a higher
priority. Is this because the write could be a partial write, whereas
recovery would need to cover the full object?


Generally most writes are partial writes - for RBD that's almost always
the case - often writes are 512b or 4kb. It's also true for e.g. RGW
bucket index updates (adding an omap key/value pair).


This is all under the disclaimer that I have no detailed
knowledge of the internals so this is all handwaving, but would a more
logical sequence of events not look roughly like this:

1. Are all replicas of the object available? If so, goto 4.
2. Is the write a full object write? If so, goto 4.
3. Read the local copy of the object, splice in the partial write,
making it a full object write.
4. Evict the object from the recovery queue.
5. Replicate the write.

Forgive the silly use of goto; I'm wary of email clients mangling
indentation if I were to write this as a nested if block. :)


This might be a useful optimization in some cases, but it would be
rather complex to add to the recovery code. It may be worth considering
at some point - same with deletes or other cases where the previous data
is not needed.

Josh


Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-14 Thread Josh Durgin

On 09/14/2017 12:44 AM, Florian Haas wrote:

On Thu, Sep 14, 2017 at 2:47 AM, Josh Durgin  wrote:

On 09/13/2017 03:40 AM, Florian Haas wrote:


So we have a client that is talking to OSD 30. OSD 30 was never down;
OSD 17 was. OSD 30 is also the preferred primary for this PG (via
primary affinity). The OSD now says that

- it does itself have a copy of the object,
- so does OSD 94,
- but that the object is "also" missing on OSD 17.

So I'd like to ask firstly: what does "also" mean here?



Nothing, it's just included in all the log messages in the loop looking
at whether objects are missing.


OK, maybe the "also" can be removed to reduce potential confusion?


Sure


Secondly, if the local copy is current, and we have no fewer than
min_size objects, and recovery is meant to be a background operation,
then why is the recovery in the I/O path here? Specifically, why is
that the case on a write, where the object is being modified anyway,
and the modification then needs to be replicated out to OSDs 17 and
94?



Mainly because recovery pre-dated the concept of min_size. We realized
this was a problem during luminous development, but did not complete the
fix for it in time for luminous. Nice analysis of the issue though!


Well I wasn't quite done with the analysis yet, I just wanted to check
whether my initial interpretation was correct.

So, here's what this behavior causes, if I understand things correctly:

- We have a bunch of objects that need to be recovered onto the
just-returned OSD(s).
- Clients access some of these objects while they are pending recovery.
- When that happens, recovery of those objects gets reprioritized.
Simplistically speaking, they get to jump the queue.

Did I get that right?


Yes


If so, let's zoom out a bit now and look at RBD's most frequent use
case, virtualization. While the OSDs were down, the RADOS objects that
were created or modified would have come from whatever virtual
machines were running at that time. When the OSDs return, there's a
very good chance that those same VMs are still running. While they're
running, they of course continue to access the same RBDs, and are
quite likely to access the same *data* as before on those RBDs — data
that now needs to be recovered.

So that means that there is likely a solid majority of to-be-recovered
RADOS objects that needs to be moved to the front of the queue at some
point during the recovery. Which, in the extreme, renders the
prioritization useless: if I have, say, 1,000 objects that need to be
recovered but 998 have been moved to the "front" of the queue, the
queue is rather meaningless.


This is more of an issue with write-intensive RGW buckets, since the
bucket index object is a single bottleneck if it needs recovery, and
all further writes to a shard of a bucket index will be blocked on that
bucket index object.


Again, on the assumption that this correctly describes what Ceph
currently does, do you have suggestions for how to mitigate this? It
seems to me that the only actual remedy for this issue in
Jewel/Luminous would be to not access objects pending recovery, but as
just pointed out, that's a rather unrealistic goal.


In luminous you can force the osds to backfill (which does not block
I/O) instead of using log-based recovery. This requires scanning
the disk to see which objects are missing, instead of looking at the pg
log, so it will take longer to recover. This is feasible for all-SSD
setups, but with pure HDD it may be too much slower, depending on your
desire to trade-off durability for availability.

You can do this by setting:

osd pg log min entries = 1
osd pg log max entries = 2


I'm working on the fix (aka async recovery) for mimic. This won't be
backportable unfortunately.


OK — is there any more information on this that is available and
current? A quick search turned up a Trello card
(https://trello.com/c/jlJL5fPR/199-osd-async-recovery), a mailing list
post (https://www.spinics.net/lists/ceph-users/msg37127.html), a slide
deck (https://www.slideshare.net/jupiturliu/ceph-recovery-improvement-v02),
a stale PR (https://github.com/ceph/ceph/pull/11918), and an inactive
branch (https://github.com/jdurgin/ceph/commits/wip-async-recovery),
but I was hoping for something a little more detailed. Thanks in
advance for any additional insight you can share here!


There's a description of the idea here:

https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92

Josh


Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-13 Thread Josh Durgin

On 09/13/2017 03:40 AM, Florian Haas wrote:

So we have a client that is talking to OSD 30. OSD 30 was never down;
OSD 17 was. OSD 30 is also the preferred primary for this PG (via
primary affinity). The OSD now says that

- it does itself have a copy of the object,
- so does OSD 94,
- but that the object is "also" missing on OSD 17.

So I'd like to ask firstly: what does "also" mean here?


Nothing, it's just included in all the log messages in the loop looking
at whether objects are missing.


Secondly, if the local copy is current, and we have no fewer than
min_size objects, and recovery is meant to be a background operation,
then why is the recovery in the I/O path here? Specifically, why is
that the case on a write, where the object is being modified anyway,
and the modification then needs to be replicated out to OSDs 17 and
94?


Mainly because recovery pre-dated the concept of min_size. We realized
this was a problem during luminous development, but did not complete the
fix for it in time for luminous. Nice analysis of the issue though!

I'm working on the fix (aka async recovery) for mimic. This won't be 
backportable unfortunately.


Josh


Re: [ceph-users] Oeps: lost cluster with: ceph osd require-osd-release luminous

2017-09-12 Thread Josh Durgin
Could you post your crushmap? PGs mapping to no OSDs is a symptom of something 
wrong there.


You can stop the osds from changing position at startup with 'osd crush update 
on start = false':


http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-location


Josh

Sent from Nine

From: Jan-Willem Michels 
Sent: Sep 11, 2017 23:50
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Oeps: lost cluster with: ceph osd require-osd-release 
luminous

> We have a kraken cluster, newly built at the time, with bluestore enabled. 
> It is 8 systems, each with 10 x 10 TB disks, and each computer has one 
> 2 TB NVMe disk, 
> plus 3 monitors etc. 
> About 700 TB total and 300 TB used. Mainly an S3 object store. 
>
> Of course there is more to the story:  We have one strange thing in our 
> cluster. 
> We tried  to create two pools of storage, default and ssd, and created a 
> new crush rule. 
> Worked without problems for months 
> But when we restart a computer / nvme-osd, it would "forget" that the 
> nvme should be connected to the SSD pool (for that particular computer). 
> Since we don't restart systems, we didn't notice that. 
> The nvme would appear back in the default pool. 
> When we re-apply the same crush rule again it would go back to the SSD 
> pool. 
> All while data kept working on the nvme disks. 
>
> Clearly something is not ideal there. And luminous has a different 
> approach to separating SSD from HDD. 
> So we thought we'd first go to luminous 12.2.0 and later see how to fix this. 
>
> We did an upgrade to luminous and that went well. That requires a reboot 
> / restart of osds, so all nvme devices were back in the default pool. 
> Reapplying the crush rule brought them back to the ssd pool. 
> Also, while doing the upgrade we switched off in ceph.conf the setting: 
> # enable experimental unrecoverable data corrupting features = 
> bluestore, since in luminous that was no problem. 
>
> Everything was working fine. 
> In Ceph -s we had this health warning 
>
>  all OSDs are running luminous or later but 
> require_osd_release < luminous 
>
> So I thought I would set the minimum OSD version to luminous with: 
>
> ceph osd require-osd-release luminous 
>
> To us that seemed nothing more than a minimum software version that was 
> required to connect to the cluster. 
> The system answered back 
>
> recovery_deletes is set 
>
> and that was it; the same second, ceph -s went to "0" 
>
>   ceph -s 
>    cluster: 
>  id: 5bafad08-31b2-4716-be77-07ad2e2647eb 
>  health: HEALTH_WARN 
>  noout flag(s) set 
>  Reduced data availability: 3248 pgs inactive 
>  Degraded data redundancy: 3248 pgs unclean 
>
>    services: 
>  mon: 3 daemons, quorum Ceph-Mon1,Ceph-Mon2,Ceph-Mon3 
>  mgr: Ceph-Mon2(active), standbys: Ceph-Mon3, Ceph-Mon1 
>  osd: 88 osds: 88 up, 88 in; 297 remapped pgs 
>   flags noout 
>
>    data: 
>  pools:   26 pools, 3248 pgs 
>  objects: 0 objects, 0 bytes 
>  usage:   0 kB used, 0 kB / 0 kB avail 
>  pgs: 100.000% pgs unknown 
>   3248 unknown 
>
> And it was something like this. The errors (apart from the scrub error) 
> you see would be from the upgrade / restarting, and I would expect 
> them to go away very fast. 
>
> ceph -s 
>    cluster: 
>  id: 5bafad08-31b2-4716-be77-07ad2e2647eb 
>  health: HEALTH_ERR 
>  385 pgs backfill_wait 
>  5 pgs backfilling 
>  135 pgs degraded 
>  1 pgs inconsistent 
>  1 pgs peering 
>  4 pgs recovering 
>  131 pgs recovery_wait 
>  98 pgs stuck degraded 
>  525 pgs stuck unclean 
>  recovery 119/612465488 objects degraded (0.000%) 
>  recovery 24/612465488 objects misplaced (0.000%) 
>  1 scrub errors 
>  noout flag(s) set 
>  all OSDs are running luminous or later but 
> require_osd_release < luminous 
>
>    services: 
>  mon: 3 daemons, quorum Ceph-Mon1,Ceph-Mon2,Ceph-Mon3 
>  mgr: Ceph-Mon2(active), standbys: Ceph-Mon1, Ceph-Mon3 
>  osd: 88 osds: 88 up, 88 in; 387 remapped pgs 
>   flags noout 
>
>    data: 
>  pools:   26 pools, 3248 pgs 
>  objects: 87862k objects, 288 TB 
>  usage:   442 TB used, 300 TB / 742 TB avail 
>  pgs: 0.031% pgs not active 
>   119/612465488 objects degraded (0.000%) 
>   24/612465488 objects misplaced (0.000%) 
>   2720 active+clean 
>   385  active+remapped+backfill_wait 
>   131  active+recovery_wait+degraded 
>   5    active+remapped+backfilling 
>   4    active+recovering+degraded 
>   1    active+clean+inconsistent 
>   1    peering 
>   1    active+clean+scrubbing+deep 
>
>    io: 
>  client:   34264 B/s rd, 2091 kB/s wr, 38 op/s rd, 48 op/s wr 
>  recovery: 423

Re: [ceph-users] Client features by IP?

2017-09-07 Thread Josh Durgin

On 09/07/2017 11:31 AM, Bryan Stillwell wrote:

On 09/07/2017 10:47 AM, Josh Durgin wrote:

On 09/06/2017 04:36 PM, Bryan Stillwell wrote:

I was reading this post by Josh Durgin today and was pretty happy to
see we can get a summary of features that clients are using with the
'ceph features' command:

http://ceph.com/community/new-luminous-upgrade-complete/

However, I haven't found an option to display the IP address of
those clients with the older feature sets.  Is there a flag I can
pass to 'ceph features' to list the IPs associated with each feature
set?


There is not currently, we should add that - it'll be easy to backport
to luminous too. The only place both features and IP are shown is in
'debug mon = 10' logs right now.


I think that would be great!  The first thing I would want to do after
seeing an old client listed would be to find it and upgrade it.  Having
the IP of the client would make that a ton easier!


Yup, should've included that in the first place!


Anything I could do to help make that happen?  File a feature request
maybe?


Sure, adding a short tracker.ceph.com ticket would help, that way we can 
track the backport easily too.


Thanks!
Josh


Re: [ceph-users] Client features by IP?

2017-09-07 Thread Josh Durgin

On 09/06/2017 04:36 PM, Bryan Stillwell wrote:

I was reading this post by Josh Durgin today and was pretty happy to see we can 
get a summary of features that clients are using with the 'ceph features' 
command:

http://ceph.com/community/new-luminous-upgrade-complete/

However, I haven't found an option to display the IP address of those clients 
with the older feature sets.  Is there a flag I can pass to 'ceph features' to 
list the IPs associated with each feature set?


There is not currently, we should add that - it'll be easy to backport
to luminous too. The only place both features and IP are shown is in
'debug mon = 10' logs right now.

Josh


Re: [ceph-users] Long OSD restart after upgrade to 10.2.9

2017-07-18 Thread Josh Durgin

On 07/17/2017 10:04 PM, Anton Dmitriev wrote:
My cluster stores more than 1.5 billion objects in RGW; I don't use 
cephfs. The bucket index pool is stored on separate SSD placement. But 
compaction occurs on all OSDs, including those which don't contain bucket 
indexes. After restarting every OSD 5 times nothing changed; each of them 
compacts again and again.


As an example, the omap dir size on one of the OSDs which doesn't contain 
bucket indexes:


root@storage01:/var/lib/ceph/osd/ceph-0/current/omap$ ls -l | wc -l
1455
root@storage01:/var/lib/ceph/osd/ceph-0/current/omap$ du -hd1
2,8G

Not so big at first look.


That is smaller than I'd expect to get long delays from compaction.
Could you provide more details on your environment - distro version and 
leveldb version? Has leveldb been updated recently? Perhaps there's some 
commonality between setups hitting this.


There weren't any changes in the way ceph was using leveldb from 10.2.7
to 10.2.9 that I could find.

Josh


On 17.07.2017 22:03, Josh Durgin wrote:

Both of you are seeing leveldb perform compaction when the osd starts
up. This can take a while for large amounts of omap data (created by
things like cephfs directory metadata or rgw bucket indexes).

The 'leveldb_compact_on_mount' option wasn't changed in 10.2.9, but 
leveldb will compact automatically if there is enough work to do.


Does restarting an OSD affected by this with 10.2.9 again after it's
completed compaction still have these symptoms?

Josh

On 07/17/2017 05:57 AM, Lincoln Bryant wrote:

Hi Anton,

We observe something similar on our OSDs going from 10.2.7 to 10.2.9 
(see thread "some OSDs stuck down after 10.2.7 -> 10.2.9 update"). 
Some of our OSDs are not working at all on 10.2.9 or die with suicide 
timeouts. Those that come up/in take a very long time to boot up. 
Seems to not affect every OSD in our case though.


--Lincoln

On 7/17/2017 1:29 AM, Anton Dmitriev wrote:
During startup it consumes ~90% CPU; strace shows that the OSD process is 
doing something with LevelDB.

Compact is disabled:
r...@storage07.main01.ceph.apps.prod.int.grcc:~$ cat 
/etc/ceph/ceph.conf | grep compact

#leveldb_compact_on_mount = true

But with debug_leveldb=20 I see, that compaction is running, but why?

2017-07-17 09:27:37.394008 7f4ed2293700  1 leveldb: Compacting 1@1 + 
12@2 files
2017-07-17 09:27:37.593890 7f4ed2293700  1 leveldb: Generated table 
#76778: 277817 keys, 2125970 bytes
2017-07-17 09:27:37.718954 7f4ed2293700  1 leveldb: Generated table 
#76779: 221451 keys, 2124338 bytes
2017-07-17 09:27:37.777362 7f4ed2293700  1 leveldb: Generated table 
#76780: 63755 keys, 809913 bytes
2017-07-17 09:27:37.919094 7f4ed2293700  1 leveldb: Generated table 
#76781: 231475 keys, 2026376 bytes
2017-07-17 09:27:38.035906 7f4ed2293700  1 leveldb: Generated table 
#76782: 190956 keys, 1573332 bytes
2017-07-17 09:27:38.127597 7f4ed2293700  1 leveldb: Generated table 
#76783: 148675 keys, 1260956 bytes
2017-07-17 09:27:38.286183 7f4ed2293700  1 leveldb: Generated table 
#76784: 294105 keys, 2123438 bytes
2017-07-17 09:27:38.469562 7f4ed2293700  1 leveldb: Generated table 
#76785: 299617 keys, 2124267 bytes
2017-07-17 09:27:38.619666 7f4ed2293700  1 leveldb: Generated table 
#76786: 277305 keys, 2124936 bytes
2017-07-17 09:27:38.711423 7f4ed2293700  1 leveldb: Generated table 
#76787: 110536 keys, 951545 bytes
2017-07-17 09:27:38.869917 7f4ed2293700  1 leveldb: Generated table 
#76788: 296199 keys, 2123506 bytes
2017-07-17 09:27:39.028395 7f4ed2293700  1 leveldb: Generated table 
#76789: 248634 keys, 2096715 bytes
2017-07-17 09:27:39.028414 7f4ed2293700  1 leveldb: Compacted 1@1 + 
12@2 files => 21465292 bytes
2017-07-17 09:27:39.053288 7f4ed2293700  1 leveldb: compacted to: 
files[ 0 0 48 549 948 0 0 ]
2017-07-17 09:27:39.054014 7f4ed2293700  1 leveldb: Delete type=2 
#76741


Strace:

open("/var/lib/ceph/osd/ceph-195/current/omap/043788.ldb", O_RDONLY) 
= 18
stat("/var/lib/ceph/osd/ceph-195/current/omap/043788.ldb", 
{st_mode=S_IFREG|0644, st_size=2154394, ...}) = 0

mmap(NULL, 2154394, PROT_READ, MAP_SHARED, 18, 0) = 0x7f96a67a
close(18)   = 0
brk(0x55d15664) = 0x55d15664
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK

Re: [ceph-users] Long OSD restart after upgrade to 10.2.9

2017-07-17 Thread Josh Durgin

Both of you are seeing leveldb perform compaction when the osd starts
up. This can take a while for large amounts of omap data (created by
things like cephfs directory metadata or rgw bucket indexes).

The 'leveldb_compact_on_mount' option wasn't changed in 10.2.9, but 
leveldb will compact automatically if there is enough work to do.


Does restarting an OSD affected by this with 10.2.9 again after it's
completed compaction still have these symptoms?

Josh

On 07/17/2017 05:57 AM, Lincoln Bryant wrote:

Hi Anton,

We observe something similar on our OSDs going from 10.2.7 to 10.2.9 
(see thread "some OSDs stuck down after 10.2.7 -> 10.2.9 update"). Some 
of our OSDs are not working at all on 10.2.9 or die with suicide 
timeouts. Those that come up/in take a very long time to boot up. Seems 
to not affect every OSD in our case though.


--Lincoln

On 7/17/2017 1:29 AM, Anton Dmitriev wrote:
During startup it consumes ~90% CPU; strace shows that the OSD process is 
doing something with LevelDB.

Compact is disabled:
r...@storage07.main01.ceph.apps.prod.int.grcc:~$ cat 
/etc/ceph/ceph.conf | grep compact

#leveldb_compact_on_mount = true

But with debug_leveldb=20 I see, that compaction is running, but why?

2017-07-17 09:27:37.394008 7f4ed2293700  1 leveldb: Compacting 1@1 + 
12@2 files
2017-07-17 09:27:37.593890 7f4ed2293700  1 leveldb: Generated table 
#76778: 277817 keys, 2125970 bytes
2017-07-17 09:27:37.718954 7f4ed2293700  1 leveldb: Generated table 
#76779: 221451 keys, 2124338 bytes
2017-07-17 09:27:37.777362 7f4ed2293700  1 leveldb: Generated table 
#76780: 63755 keys, 809913 bytes
2017-07-17 09:27:37.919094 7f4ed2293700  1 leveldb: Generated table 
#76781: 231475 keys, 2026376 bytes
2017-07-17 09:27:38.035906 7f4ed2293700  1 leveldb: Generated table 
#76782: 190956 keys, 1573332 bytes
2017-07-17 09:27:38.127597 7f4ed2293700  1 leveldb: Generated table 
#76783: 148675 keys, 1260956 bytes
2017-07-17 09:27:38.286183 7f4ed2293700  1 leveldb: Generated table 
#76784: 294105 keys, 2123438 bytes
2017-07-17 09:27:38.469562 7f4ed2293700  1 leveldb: Generated table 
#76785: 299617 keys, 2124267 bytes
2017-07-17 09:27:38.619666 7f4ed2293700  1 leveldb: Generated table 
#76786: 277305 keys, 2124936 bytes
2017-07-17 09:27:38.711423 7f4ed2293700  1 leveldb: Generated table 
#76787: 110536 keys, 951545 bytes
2017-07-17 09:27:38.869917 7f4ed2293700  1 leveldb: Generated table 
#76788: 296199 keys, 2123506 bytes
2017-07-17 09:27:39.028395 7f4ed2293700  1 leveldb: Generated table 
#76789: 248634 keys, 2096715 bytes
2017-07-17 09:27:39.028414 7f4ed2293700  1 leveldb: Compacted 1@1 + 
12@2 files => 21465292 bytes
2017-07-17 09:27:39.053288 7f4ed2293700  1 leveldb: compacted to: 
files[ 0 0 48 549 948 0 0 ]

2017-07-17 09:27:39.054014 7f4ed2293700  1 leveldb: Delete type=2 #76741

Strace:

open("/var/lib/ceph/osd/ceph-195/current/omap/043788.ldb", O_RDONLY) = 18
stat("/var/lib/ceph/osd/ceph-195/current/omap/043788.ldb", 
{st_mode=S_IFREG|0644, st_size=2154394, ...}) = 0

mmap(NULL, 2154394, PROT_READ, MAP_SHARED, 18, 0) = 0x7f96a67a
close(18)   = 0
brk(0x55d15664) = 0x55d15664
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_si

Re: [ceph-users] ask about async recovery

2017-06-30 Thread Josh Durgin

On 06/29/2017 08:16 PM, donglifec...@gmail.com wrote:

zhiqiang, Josh

what about the async recovery feature? I didn't see any update on
github recently; will it be further developed?


Yes, post-luminous at this point.



Re: [ceph-users] python3-rados

2017-04-13 Thread Josh Durgin

On 04/12/2017 09:26 AM, Gerald Spencer wrote:

Ah I'm running Jewel. Is there any information online about python3-rados
with Kraken? I'm having difficulties finding more than I initially posted.


What info are you looking for?

The interface for the python bindings is the same for python 2 and 3.
The python3-rados package was added in kraken to a) compile the cython
against python3 and b) put the module where python3 will find it.

The python-rados package still installs the python2 version of the
bindings.
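
For what it's worth, basic usage looks the same under either interpreter -
a minimal sketch (the pool name and object contents are placeholders):

import rados

# Connect using the local ceph.conf and keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')   # placeholder pool name
    try:
        ioctx.write_full('greeting', b'hello from python-rados')
        print(ioctx.read('greeting'))
        ioctx.set_xattr('greeting', 'lang', b'py')
        print(ioctx.get_xattr('greeting', 'lang'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()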

Josh


On Mon, Apr 10, 2017 at 10:37 PM, Wido den Hollander  wrote:




On 8 April 2017 at 4:03, Gerald Spencer wrote:


Do the rados bindings exist for python3?
I see this sprinkled in various areas..
https://github.com/ceph/ceph/pull/7621
https://github.com/ceph/ceph/blob/master/debian/python3-rados.install

This being said, I can not find said package


What version of Ceph do you have installed? You need at least Kraken for
Python 3.

Wido













Re: [ceph-users] Why is librados for Python so Neglected?

2017-03-08 Thread Josh Durgin

On 03/08/2017 02:15 PM, Kent Borg wrote:

On 03/08/2017 05:08 PM, John Spray wrote:

Specifically?
I'm not saying you're wrong, but I am curious which bits in particular
you missed.



Object maps. Those transaction-y things. Object classes. Maybe more I
don't know about because I have been learning via Python.


There are certainly gaps in the python bindings, but those are all
covered since jewel.

Hmm, you may have been confused by the docs website - I'd thought the
reference section was autogenerated from the docstrings, like it is for
librbd, but it's just static text: http://tracker.ceph.com/issues/19238

For reference, take a look at 'help(rados)' from the python
interpreter, or check out the source and tests:

https://github.com/ceph/ceph/blob/jewel/src/pybind/rados/rados.pyx
https://github.com/ceph/ceph/blob/jewel/src/test/pybind/test_rados.py
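
For example (a sketch only - the 'rbd' pool is a placeholder and the
cls_hello example class may not be enabled on every cluster), the jewel
bindings cover those areas roughly like this:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')   # placeholder pool name

# Transaction-y things: batch several updates into one atomic write op.
with rados.WriteOpCtx() as write_op:
    ioctx.set_omap(write_op, ('key1', 'key2'), (b'val1', b'val2'))
    ioctx.operate_write_op(write_op, 'myobject')

# Object maps: read the omap pairs back.
with rados.ReadOpCtx() as read_op:
    kv_iter, ret = ioctx.get_omap_vals(read_op, "", "", 10)
    ioctx.operate_read_op(read_op, 'myobject')
    print(dict(kv_iter))

# Object classes: invoke an OSD class method (cls_hello ships with ceph).
print(ioctx.execute('myobject', 'hello', 'say_hello', b''))

ioctx.close()
cluster.shutdown()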

Josh


Re: [ceph-users] Passing LUA script via python rados execute

2017-02-20 Thread Josh Durgin

On 02/19/2017 12:15 PM, Patrick Donnelly wrote:

On Sat, Feb 18, 2017 at 2:55 PM, Noah Watkins  wrote:

The least intrusive solution is to simply change the sandbox to allow
the standard file system module loading function as expected. Then any
user would need to make sure that every OSD had consistent versions of
dependencies installed using something like LuaRocks. This is simple,
but could make debugging and deployment a major headache.


A locked down require which doesn't load C bindings (i.e. only load
.lua files) would probably be alright.


A more ambitious version would be to create an interface for users to
upload scripts and dependencies into objects, and support referencing
those objects as standard dependencies in Lua scripts as if they were
standard modules on the file system. Each OSD could then cache scripts
and dependencies, allowing applications to use references to scripts
instead of sending a script with every request.


This is very doable. I imagine we'd just put all of the Lua modules in
a flattened hierarchy under a RADOS namespace? The potentially
annoying nit in this is writing some kind of mechanism for installing
a Lua module tree into RADOS. Users would install locally and then
upload the tree through some tool.


Using rados objects for this is not really feasible. It would be
incredibly complex within the osd - it involves multiple objects, cache
invalidation, and has all kinds of potential issues with consistent
versioning and atomic updates across objects.

The simple solution of loading modules from the local fs sounds
way better to me. Putting modules on all osds and reloading the modules
or restarting the osds seems like a pretty simple deployment model with
any configuration management system.

That said, for research purposes you could resurrect something like the
ability to load modules into the cluster from a client - just store
them on the local fs of each osd, not in rados objects. This was
removed back in 2011:

https://github.com/ceph/ceph/commit/7c04f81ca16d11fc5a592992a4462b34ccb199dc
https://github.com/ceph/ceph/commit/964a0a6e1326d4f773c547655ebb2a5c97794268

Josh


Re: [ceph-users] Erasure coding general information Openstack+kvm virtual machine block storage

2016-09-15 Thread Josh Durgin

On 09/16/2016 09:46 AM, Erick Perez - Quadrian Enterprises wrote:

Can someone point me to a thread or site that uses ceph+erasure coding
to serve block storage for Virtual Machines running with Openstack+KVM?
All references that I found are using erasure coding for cold data or
*not* VM block access.


Erasure coding is not supported by RBD currently, since EC pools only
support append operations. There's work in progress to make it possible
by allowing overwrites for EC pools, but it won't be usable until
Luminous at the earliest [0].

Josh

[0] http://tracker.ceph.com/issues/14031


Re: [ceph-users] librados API never kills threads

2016-09-13 Thread Josh Durgin

On 09/13/2016 01:13 PM, Stuart Byma wrote:

Hi,

Can anyone tell me why librados creates multiple threads per object, and never 
kills them, even when the ioctx is deleted? I am using the C++ API with a 
single connection and a single IO context. More threads and memory are used for 
each new object accessed. Is there a way to prevent this behaviour, like 
prevent the implementation from caching per-object connections? Or is there a 
way to “close” objects after reading/writing them that will kill these threads 
and release associated memory? Destroying and recreating the cluster connection 
is too expensive to do all the time.


The threads you're seeing are most likely associated with the cluster
connection - with SimpleMessenger, the default, a connection to an OSD
uses 2 threads. If you access objects that happen to live on different
OSDs, you'll notice more threads being created for each new OSD. These
are encapsulated in the Rados object, which you likely don't want to
recreate all the time.

An IoCtx is really just a small set of in-memory state, e.g. pool id, 
snapshot, namespace, etc. and doesn't consume many resources itself.


Josh


Re: [ceph-users] Is rados_write_op_* any more efficient than issuing the commands individually?

2016-09-07 Thread Josh Durgin

On 09/06/2016 10:16 PM, Dan Jakubiec wrote:

Hello, I need to issue the following commands on millions of objects:

rados_write_full(oid1, ...)
rados_setxattr(oid1, "attr1", ...)
rados_setxattr(oid1, "attr2", ...)


Would it make it any faster if I combined all 3 of these into a single
rados_write_op and issued them "together" as a single call?

My current application doesn't really care much about the atomicity, but
maximizing our write throughput is quite important.

Does rados_write_op save any roundtrips to the OSD or have any other
efficiency gains?


Yes, individual calls will send one message per call, adding more round
trips and overhead, whereas bundling changes in a rados_write_op will
only send one message. You can see this in the number of MOSDOp
messages shown with 'debug ms = 1' on the client.

Josh


Re: [ceph-users] RBD/Ceph as Physical boot volume

2016-03-19 Thread Josh Durgin

On 03/17/2016 03:51 AM, Schlacta, Christ wrote:

I posted about this a while ago, and someone else has since inquired,
but I am seriously wanting to know if anybody has figured out how to
boot from an RBD device yet using ipxe or similar. Last I read,
loading the kernel and initrd from object storage would be
theoretically easy, and would only require making an initramfs to
initialize and mount the rbd..  But I couldn't find any documented
instances of anybody having done this yet..  So..  Has anybody done
this yet?  If so, which distros is it working on, and where can I find
more info?


Not sure if anyone is doing this, though there was a patch for creating
an initramfs that would mount rbd:

https://lists.debian.org/debian-kernel/2015/06/msg00161.html

Josh


Re: [ceph-users] rbd cache did not help improve performance

2016-03-01 Thread Josh Durgin

On 03/01/2016 10:03 PM, min fang wrote:

thanks, with your help, I set the read ahead parameter. What is the
cache parameter for kernel module rbd?
Such as:
1) what is the cache size?
2) Does it support write back?
3) Will read ahead be disabled if max bytes have been read into cache?
(similar to the concept of "rbd_readahead_disable_after_bytes").


The kernel rbd module does not implement any caching itself. If you're
doing I/O to a file on a filesystem on top of a kernel rbd device, it
will go through the usual kernel page cache (unless you use O_DIRECT of
course).

Josh



2016-03-01 21:31 GMT+08:00 Adrien Gillard <gillard.adr...@gmail.com>:

As Tom stated, RBD cache only works if your client is using librbd
(KVM clients for instance).
Using the kernel RBD client, one of the parameter you can tune to
optimize sequential read is increasing
/sys/class/block/rbd4/queue/read_ahead_kb

Adrien



On Tue, Mar 1, 2016 at 12:48 PM, min fang <louisfang2...@gmail.com> wrote:

I can use the following command to change parameter, for example
as the following,  but not sure whether it will work.

  ceph --admin-daemon /var/run/ceph/ceph-mon.openpower-0.asok
config set rbd_readahead_disable_after_bytes 0

2016-03-01 15:07 GMT+08:00 Tom Christensen <pav...@gmail.com>:

If you are mapping the RBD with the kernel driver then
you're not using librbd so these settings will have no
effect I believe.  The kernel driver does its own caching
but I don't believe there are any settings to change its
default behavior.


On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo <ski...@redhat.com> wrote:

You may want to set "ioengine=rbd", I guess.

Cheers,

- Original Message -
From: "min fang" mailto:louisfang2...@gmail.com>>
To: "ceph-users" mailto:ceph-users@lists.ceph.com>>
Sent: Tuesday, March 1, 2016 1:28:54 PM
Subject: [ceph-users]  rbd cache did not help improve
performance

Hi, I set the following parameters in ceph.conf

[client]
rbd cache=true
rbd cache size= 25769803776
rbd readahead disable after bytes=0


map a rbd image to a rbd device then run fio testing on
4k read as the command
./fio -filename=/dev/rbd4 -direct=1 -iodepth 64 -thread
-rw=read -ioengine=aio -bs=4K -size=500G -numjobs=32
-runtime=300 -group_reporting -name=mytest2

Compared with the result with rbd cache=false and with the cache
enabled, I did not see performance improved by the librbd cache.

Is my setting not right, or is it true that the ceph librbd cache
has no benefit for 4k sequential reads?

thanks.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-26 Thread Josh Durgin

On 02/26/2016 03:17 PM, Shinobu Kinjo wrote:

In jewel, as you mentioned, there will be "--max-objects" and "--object-size" 
options.
That hint will go away or be mitigated with those options. Correct?


The io hint isn't sent by rados bench, just rbd. So even with those
options, rados bench still doesn't have the io hint, rbd caching,
object map, or cloning, which all make it a bit different from what rbd
does.


Are those options available in:

# ceph -v
ceph version 10.0.2 (86764eaebe1eda943c59d7d784b893ec8b0c6ff9)??


No, they were just recently merged and aren't in any releases yet.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-26 Thread Josh Durgin

On 02/26/2016 02:27 PM, Shinobu Kinjo wrote:

In this case it's likely rados bench using tiny objects that's
causing the massive overhead. rados bench is doing each write to a new
object, which ends up in a new file beneath the osd, with its own
xattrs too. For 4k writes, that's a ton of overhead.


That means that we don't see any proper results coming from rados bench in this
scenario (using very small objects), do we?
Or does rados bench itself work as expected, and 4k writes are just the problem?


It depends what workload you're trying to measure. If you want to
create new objects of a certain size, rados bench is perfect. Large
writes are reasonably similar to rbd, but not exactly the same. For
small writes it's particularly different from the typical I/O pattern of
something like rbd.


I'm just curious about that because someone could misunderstand performance of 
the Ceph cluster because of the result in hammer.


In general I'd recommend using a tool more closely matching your actual
workload, or at least the interface used, e.g. fio with the rbd backend
will be more accurate for rbd than rados bench, cosbench will be better
for radosgw, etc.

Josh


Rgds,
Shinobu

- Original Message -
From: "Josh Durgin" 
To: "Christian Balzer" , ceph-users@lists.ceph.com
Sent: Saturday, February 27, 2016 6:05:07 AM
Subject: Re: [ceph-users] Observations with a SSD based pool under Hammer

On 02/24/2016 07:10 PM, Christian Balzer wrote:

10 second rados bench with 4KB blocks, 219MB written in total.
nand-writes per SSD:41*32MB=1312MB.
10496MB total written to all SSDs.
Amplification:48!!!

Le ouch.
In my use case with rbd cache on all VMs I expect writes to be rather
large for the most part and not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.


In this case it's likely rados bench using tiny objects that's
causing the massive overhead. rados bench is doing each write to a new
object, which ends up in a new file beneath the osd, with its own
xattrs too. For 4k writes, that's a ton of overhead.

fio with the rbd backend will give you a more realistic picture.
In jewel there will be --max-objects and --object-size options for
rados bench to get closer to an rbd-like workload as well.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-26 Thread Josh Durgin

On 02/26/2016 01:42 PM, Jan Schermer wrote:

RBD backend might be even worse, depending on how large dataset you try. One 4KB 
block can end up creating a 4MB object, and depending on how well hole-punching 
and fallocate works on your system you could in theory end up with a >1000 
amplification if you always hit a different 4MB chunk (but that's not realistic).
Is that right?


Yes, the size hints rbd sends with writes will end up as an xfs ioctl
to ask for MIN(rbd object size, filestore_max_alloc_hint_size) (1MB for
the max by default) for writes to new objects.

Depending on how much the benchmark fills the image, this could be a
large or small overhead compared to the amount of data written.
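
The size hint itself is librados' allocation hint; a client can attach
one to its own writes roughly like this (a sketch only, with a 4MB
object size assumed, and not something rados bench currently does):

#include <rados/librados.h>

/* Attach an allocation hint to a write, roughly the way librbd does
 * for writes to new objects. */
static int write_with_hint(rados_ioctx_t ioctx, const char *oid,
                           const char *buf, size_t len, uint64_t off)
{
    uint64_t obj_size = 4 << 20;   /* assumed rbd object size */
    rados_write_op_t op = rados_create_write_op();

    rados_write_op_set_alloc_hint(op, obj_size, obj_size);
    rados_write_op_write(op, buf, len, off);

    int r = rados_write_op_operate(op, ioctx, oid, NULL, 0);
    rados_release_write_op(op);
    return r;
}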

Josh


Jan


On 26 Feb 2016, at 22:05, Josh Durgin  wrote:

On 02/24/2016 07:10 PM, Christian Balzer wrote:

10 second rados bench with 4KB blocks, 219MB written in total.
nand-writes per SSD:41*32MB=1312MB.
10496MB total written to all SSDs.
Amplification:48!!!

Le ouch.
In my use case with rbd cache on all VMs I expect writes to be rather
large for the most part and not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.


In this case it's likely rados bench using tiny objects that's
causing the massive overhead. rados bench is doing each write to a new
object, which ends up in a new file beneath the osd, with its own
xattrs too. For 4k writes, that's a ton of overhead.

fio with the rbd backend will give you a more realistic picture.
In jewel there will be --max-objects and --object-size options for
rados bench to get closer to an rbd-like workload as well.

Josh


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-26 Thread Josh Durgin

On 02/24/2016 07:10 PM, Christian Balzer wrote:

10 second rados bench with 4KB blocks, 219MB written in total.
nand-writes per SSD:41*32MB=1312MB.
10496MB total written to all SSDs.
Amplification:48!!!

Le ouch.
In my use case with rbd cache on all VMs I expect writes to be rather
large for the most part and not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.


In this case it's likely rados bench using tiny objects that's
causing the massive overhead. rados bench is doing each write to a new
object, which ends up in a new file beneath the osd, with its own
xattrs too. For 4k writes, that's a ton of overhead.

fio with the rbd backend will give you a more realistic picture.
In jewel there will be --max-objects and --object-size options for
rados bench to get closer to an rbd-like workload as well.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to check the block device space usage

2016-01-12 Thread Josh Durgin

On 01/12/2016 10:34 PM, Wido den Hollander wrote:

On 01/13/2016 07:27 AM, wd_hw...@wistron.com wrote:

Thanks Wido.
So it seems there is no way to do this under Hammer.



Not very easily, no. You'd have to count and stat all objects for an RBD
image to figure this out.


For hammer you'd need another loop and sum around this:
http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3684
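
Roughly, that loop-and-sum looks like the following in C - a sketch
only, not a polished tool: it assumes the image's data objects are named
with the image's block_name_prefix, and the per-object rados_stat size
ignores sparseness, so treat the total as an estimate (on older releases
without rados_nobjects_list_*, the equivalent rados_objects_list_*
calls apply):

#include <rados/librados.h>
#include <rbd/librbd.h>
#include <string.h>

static int image_usage(rados_ioctx_t ioctx, const char *image_name,
                       uint64_t *used_bytes)
{
    rbd_image_t image;
    rbd_image_info_t info;
    rados_list_ctx_t iter;
    const char *entry;
    uint64_t size, total = 0;
    int r;

    /* Get the image's object name prefix (e.g. "rbd_data.<id>"). */
    r = rbd_open(ioctx, image_name, &image, NULL);
    if (r < 0)
        return r;
    r = rbd_stat(image, &info, sizeof(info));
    rbd_close(image);
    if (r < 0)
        return r;

    /* Walk the pool and stat every object belonging to this image. */
    r = rados_nobjects_list_open(ioctx, &iter);
    if (r < 0)
        return r;
    while (rados_nobjects_list_next(iter, &entry, NULL, NULL) == 0) {
        if (strncmp(entry, info.block_name_prefix,
                    strlen(info.block_name_prefix)) != 0)
            continue;
        if (rados_stat(ioctx, entry, &size, NULL) == 0)
            total += size;
    }
    rados_nobjects_list_close(iter);

    *used_bytes = total;
    return 0;
}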

Josh


WD

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Wednesday, January 13, 2016 2:19 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to check the block device space usage

On 01/13/2016 06:48 AM, wd_hw...@wistron.com wrote:

Hi,

   Is there any way to check the block device space usage under the
specified pool? I need to know the capacity usage. If the block device
is used over 80%, I will send an alert to user.



This can be done in Infernalis / Jewel, but it requires new RBD features.

It needs the fast-diff (v2) feature, IIRC.




   Thanks a lot!



Best Regards,

WD






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Hammer and rbd image features

2016-01-12 Thread Josh Durgin

On 01/12/2016 11:11 AM, Stefan Priebe wrote:

Hi,

i want to add support for fast-diff and object map to our "old" firefly
v2 rbd images.

The current hammer release can't to this.

Is there any reason not to cherry-pick this one? (on my own)
https://github.com/ceph/ceph/commit/3a7b28d9a2de365d515ea1380ee9e4f867504e10


You'd need a lot of supporting commits (that one just adds the cli
command). There are a bunch of librbd and cls_rbd commits to allow
updating features, and fast-diff itself wasn't in hammer, so it's not
very simple to backport.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ocfs2 with RBDcache

2016-01-12 Thread Josh Durgin

On 01/09/2016 02:34 AM, Wukongming wrote:

Hi, all

I notice this sentence "Running GFS or OCFS on top of RBD will not work with 
caching enabled." on http://docs.ceph.com/docs/master/rbd/rbd-config-ref/. why? Is 
there any way to open rbd cache with ocfs2 based on? Because I have a fio test with 
qemu-kvm config setting cache=none, which gives a terrible result of IOPS less than 100 ( 
fio --numjobs=16 --iodepth=16 --ioengine=libaio --runtime=300 --direct=1 
--group_reporting --filename=/dev/sdd --name=mytest --rw=randwrite --bs=8k --size=8G)
, while other non-ceph cluster could give a result of IOPS to 1000+. Would 
disabling rbd cache cause this problem?


OCFS, GFS, and similar cluster filesystems assume they are using the
same physical disk. RBD caching is client side, so if you are accessing
the same rbd image from more than one client, they have independent
caches that are not coherent. This means something like ocfs2 could
cache data in one rbd client, try overwriting it in another rbd client,
and still see the original data in the first client. With a regular
physical disk, this is not possible, since its cache is part of the
device.

Your diagram shows you using qemu - in that case why not use the rbd
support built into qemu, and avoid a shared fs entirely?

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD export format for start and end snapshots

2016-01-12 Thread Josh Durgin

On 01/12/2016 06:10 AM, Alex Gorbachev wrote:

Good day!  I am working on a robust backup script for RBD and ran into a
need to reliably determine start and end snapshots for differential
exports (done with rbd export-diff).

I can clearly see these if dumping the ASCII header of the export file,
e.g.:

iss@lab2-b1:/data/volume1$ strings
exp-tst1-spin1-sctst1-0111-174938-2016-cons-thin.scexp|head -3
rbd diff v1
auto-0111-083856-2016-tst1t
auto-0111-174856-2016-tst1s

It appears that "auto-0111-083856-2016-tst1" is the start snapshot
(followed by t) and "auto-0111-174856-2016-tst1" is the end snapshot
(followed by s).

Is this the best way to determine snapshots and are letters "s" and "t"
going to stay the same?


The format won't change in an incompatible way, so we won't use those
fields for other purposes, but 'strings | head -3' might not always
work if we add fields or have longer strings later.

The format is documented here:

http://docs.ceph.com/docs/master/dev/rbd-diff/

It'd be more reliable if you decoded it by unpacking the diff with
your language of choice (e.g. using 
https://docs.python.org/2/library/struct.html for python.)
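
As an illustration, a small C sketch that pulls out just the metadata
records at the front of the stream, going by that format doc (a 1-byte
tag; 'f'/'t' followed by a little-endian u32 length and the snap name;
's' followed by a little-endian u64 image size; this assumes a
little-endian host):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char magic[12], name[256];
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <diff-file>\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (!f || fread(magic, 1, 12, f) != 12 ||
        memcmp(magic, "rbd diff v1\n", 12) != 0) {
        fprintf(stderr, "not an rbd diff v1 file\n");
        return 1;
    }

    for (;;) {
        int tag = fgetc(f);
        if (tag == 'f' || tag == 't') {        /* from-snap / to-snap name */
            uint32_t len;
            if (fread(&len, sizeof(len), 1, f) != 1 || len >= sizeof(name) ||
                fread(name, 1, len, f) != len)
                break;
            name[len] = '\0';
            printf("%s snap: %s\n", tag == 'f' ? "from" : "to", name);
        } else if (tag == 's') {               /* image size */
            uint64_t size;
            if (fread(&size, sizeof(size), 1, f) != 1)
                break;
            printf("image size: %llu\n", (unsigned long long)size);
        } else {
            break;  /* 'w'/'z' data records, 'e' end marker, or EOF */
        }
    }
    fclose(f);
    return 0;
}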


Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd merge-diff error

2015-12-09 Thread Josh Durgin

Hmm, perhaps there's a secondary bug.

Can you send the output from strace, i.e. strace.log after running:

cat snap1.diff | strace -f -o strace.log rbd merge-diff - snap2.diff combined.diff


for a case where it fails?

Josh

On 12/09/2015 08:38 PM, Alex Gorbachev wrote:

More oddity: retrying several times, the merge-diff sometimes works and
sometimes does not, using the same source files.

On Wed, Dec 9, 2015 at 10:15 PM, Alex Gorbachev <a...@iss-integration.com> wrote:

Hi Josh, looks like I celebrated too soon:

On Wed, Dec 9, 2015 at 2:25 PM, Josh Durgin <jdur...@redhat.com> wrote:

This is the problem:

http://tracker.ceph.com/issues/14030

As a workaround, you can pass the first diff in via stdin, e.g.:

cat snap1.diff | rbd merge-diff - snap2.diff combined.diff


one test worked - merging the initial full export (export-diff with
just one snapshot)

but the second one failed (merging two incremental diffs):

root@lab2-b1:/data/volume1# cat scrun1-120720151502.bck | rbd
merge-diff - scrun1-120720151504.bck scrun1-part04.bck
Merging image diff: 13% complete...failed.
rbd: merge-diff error

I am not sure how to run gdb in such scenario with stdin/stdout

Thanks,
Alex



Josh


On 12/08/2015 11:11 PM, Josh Durgin wrote:

On 12/08/2015 10:44 PM, Alex Gorbachev wrote:

Hi Josh,

On Mon, Dec 7, 2015 at 6:50 PM, Josh Durgin <jdur...@redhat.com> wrote:

 On 12/07/2015 03:29 PM, Alex Gorbachev wrote:

 When trying to merge two results of rbd
export-diff, the
 following error
 occurs:

 iss@lab2-b1:~$ rbd export-diff --from-snap
autosnap120720151500
 spin1/scrun1@autosnap120720151502
 /data/volume1/scrun1-120720151502.bck

 iss@lab2-b1:~$ rbd export-diff --from-snap
autosnap120720151504
 spin1/scrun1@autosnap120720151504
 /data/volume1/scrun1-120720151504.bck

 iss@lab2-b1:~$ rbd merge-diff
/data/volume1/scrun1-120720151502.bck
 /data/volume1/scrun1-120720151504.bck
 /data/volume1/mrg-scrun1-0204.bck
   Merging image diff: 11% complete...failed.
 rbd: merge-diff error

 That's all the output and I have found this link
http://tracker.ceph.com/issues/12911 but not sure if the
patch
 should
 have already been in hammer or how to get it?


 That patch fixed a bug that was only present after
hammer, due to
 parallelizing export-diff. You're likely seeing a
different (possibly
 new) issue.

 Unfortunately there's not much output we can enable for
export-diff in
 hammer. Could you try running the command via gdb
to figure out where
 and why it's failing? Make sure you have librbd-dbg
installed, then
 send the output from gdb doing:

 gdb --args rbd merge-diff
/data/volume1/scrun1-120720151502.bck \
 /data/volume1/scrun1-120720151504.bck
/data/volume1/mrg-scrun1-0204.bck
 break rbd.cc:1931
 break rbd.cc:1935
 break rbd.cc:1967
 break rbd.cc:1985
 break rbd.cc:1999
 break rbd.cc:2008
 break rbd.cc:2021
 break rbd.cc:2053
 break rbd.cc:2098
 run
 # (it will run now, stopping when it hits the error)
 info locals


Will do - how does one load librbd-dbg?  I have the
following on the
system:

librbd-dev - RADOS block device client library
(development files)
librbd1-dbg - debugging symbols for librbd1

is librbd1-dbg sufficient?


Yes, I just forgot the 1 in the package name.

Also a question - the merge-diff really stitches the two
diff files
together, not really merges, correct? For example, in
the following
workflow:

export-diff from full ima

Re: [ceph-users] rbd merge-diff error

2015-12-09 Thread Josh Durgin

This is the problem:

http://tracker.ceph.com/issues/14030

As a workaround, you can pass the first diff in via stdin, e.g.:

cat snap1.diff | rbd merge-diff - snap2.diff combined.diff

Josh

On 12/08/2015 11:11 PM, Josh Durgin wrote:

On 12/08/2015 10:44 PM, Alex Gorbachev wrote:

Hi Josh,

On Mon, Dec 7, 2015 at 6:50 PM, Josh Durgin <jdur...@redhat.com> wrote:

On 12/07/2015 03:29 PM, Alex Gorbachev wrote:

When trying to merge two results of rbd export-diff, the
following error
occurs:

iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151500
spin1/scrun1@autosnap120720151502
/data/volume1/scrun1-120720151502.bck

iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151504
spin1/scrun1@autosnap120720151504
/data/volume1/scrun1-120720151504.bck

iss@lab2-b1:~$ rbd merge-diff
/data/volume1/scrun1-120720151502.bck
/data/volume1/scrun1-120720151504.bck
/data/volume1/mrg-scrun1-0204.bck
  Merging image diff: 11% complete...failed.
rbd: merge-diff error

That's all the output and I have found this link
http://tracker.ceph.com/issues/12911 but not sure if the patch
should
have already been in hammer or how to get it?


That patch fixed a bug that was only present after hammer, due to
parallelizing export-diff. You're likely seeing a different (possibly
new) issue.

Unfortunately there's not much output we can enable for
export-diff in
hammer. Could you try running the command via gdb to figure out where
and why it's failing? Make sure you have librbd-dbg installed, then
send the output from gdb doing:

gdb --args rbd merge-diff /data/volume1/scrun1-120720151502.bck \
/data/volume1/scrun1-120720151504.bck
/data/volume1/mrg-scrun1-0204.bck
break rbd.cc:1931
break rbd.cc:1935
break rbd.cc:1967
break rbd.cc:1985
break rbd.cc:1999
break rbd.cc:2008
break rbd.cc:2021
break rbd.cc:2053
break rbd.cc:2098
run
# (it will run now, stopping when it hits the error)
info locals


Will do - how does one load librbd-dbg?  I have the following on the
system:

librbd-dev - RADOS block device client library (development files)
librbd1-dbg - debugging symbols for librbd1

is librbd1-dbg sufficient?


Yes, I just forgot the 1 in the package name.


Also a question - the merge-diff really stitches the two diff files
together, not really merges, correct? For example, in the following
workflow:

export-diff from full image - 10GB
export-diff from snap1 - 2 GB
export-diff from snap2 - 1 GB

My resulting merge export file would be 13GB, correct?


It does merge overlapping sections, i.e. part of snap1 that was
overwritten in snap2, so the merged diff may be smaller than the
original two.

Josh


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd merge-diff error

2015-12-08 Thread Josh Durgin

On 12/08/2015 10:44 PM, Alex Gorbachev wrote:

Hi Josh,

On Mon, Dec 7, 2015 at 6:50 PM, Josh Durgin <jdur...@redhat.com> wrote:

On 12/07/2015 03:29 PM, Alex Gorbachev wrote:

When trying to merge two results of rbd export-diff, the
following error
occurs:

iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151500
spin1/scrun1@autosnap120720151502
/data/volume1/scrun1-120720151502.bck

iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151504
spin1/scrun1@autosnap120720151504
/data/volume1/scrun1-120720151504.bck

iss@lab2-b1:~$ rbd merge-diff /data/volume1/scrun1-120720151502.bck
/data/volume1/scrun1-120720151504.bck
/data/volume1/mrg-scrun1-0204.bck
  Merging image diff: 11% complete...failed.
rbd: merge-diff error

That's all the output and I have found this link
http://tracker.ceph.com/issues/12911 but not sure if the patch
should
have already been in hammer or how to get it?


That patch fixed a bug that was only present after hammer, due to
parallelizing export-diff. You're likely seeing a different (possibly
new) issue.

Unfortunately there's not much output we can enable for export-diff in
hammer. Could you try running the command via gdb to figure out where
and why it's failing? Make sure you have librbd-dbg installed, then
send the output from gdb doing:

gdb --args rbd merge-diff /data/volume1/scrun1-120720151502.bck \
/data/volume1/scrun1-120720151504.bck /data/volume1/mrg-scrun1-0204.bck
break rbd.cc:1931
break rbd.cc:1935
break rbd.cc:1967
break rbd.cc:1985
break rbd.cc:1999
break rbd.cc:2008
break rbd.cc:2021
break rbd.cc:2053
break rbd.cc:2098
run
# (it will run now, stopping when it hits the error)
info locals


Will do - how does one load librbd-dbg?  I have the following on the system:

librbd-dev - RADOS block device client library (development files)
librbd1-dbg - debugging symbols for librbd1

is librbd1-dbg sufficient?


Yes, I just forgot the 1 in the package name.


Also a question - the merge-diff really stitches the two diff files
together, not really merges, correct? For example, in the following
workflow:

export-diff from full image - 10GB
export-diff from snap1 - 2 GB
export-diff from snap2 - 1 GB

My resulting merge export file would be 13GB, correct?


It does merge overlapping sections, i.e. part of snap1 that was
overwritten in snap2, so the merged diff may be smaller than the
original two.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd merge-diff error

2015-12-07 Thread Josh Durgin

On 12/07/2015 03:29 PM, Alex Gorbachev wrote:

When trying to merge two results of rbd export-diff, the following error
occurs:

iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151500
spin1/scrun1@autosnap120720151502 /data/volume1/scrun1-120720151502.bck

iss@lab2-b1:~$ rbd export-diff --from-snap autosnap120720151504
spin1/scrun1@autosnap120720151504 /data/volume1/scrun1-120720151504.bck

iss@lab2-b1:~$ rbd merge-diff /data/volume1/scrun1-120720151502.bck
/data/volume1/scrun1-120720151504.bck /data/volume1/mrg-scrun1-0204.bck
 Merging image diff: 11% complete...failed.
rbd: merge-diff error

That's all the output and I have found this link
http://tracker.ceph.com/issues/12911 but not sure if the patch should
have already been in hammer or how to get it?


That patch fixed a bug that was only present after hammer, due to
parallelizing export-diff. You're likely seeing a different (possibly
new) issue.

Unfortunately there's not much output we can enable for export-diff in
hammer. Could you try running the command via gdb to figure out where
and why it's failing? Make sure you have librbd-dbg installed, then
send the output from gdb doing:

gdb --args rbd merge-diff /data/volume1/scrun1-120720151502.bck \
/data/volume1/scrun1-120720151504.bck /data/volume1/mrg-scrun1-0204.bck
break rbd.cc:1931
break rbd.cc:1935
break rbd.cc:1967
break rbd.cc:1985
break rbd.cc:1999
break rbd.cc:2008
break rbd.cc:2021
break rbd.cc:2053
break rbd.cc:2098
run
# (it will run now, stopping when it hits the error)
info locals

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] python3 librados

2015-11-30 Thread Josh Durgin

On 11/30/2015 10:26 AM, misa-c...@hudrydum.cz wrote:

Hi John,

thanks for the info. It seems that patch adds a python3 compatibility support
but leaves the ugly thread spawning intact. No idea if it makes sense to try
to merge some of my changes back to the ceph source.


Yeah, like Wido mentioned these changes were designed to keep rados+rbd
working in python2 and python3.

It looks like you've fleshed out the object operation api - adding some
of the missing constants and methods, e.g. cmpxattr(), assert_exists(),
etc. Those changes would be quite useful upstream.

The asyncio interface seems useful as well. To keep the base rados.py
methods agnostic to the async framework used, asyncio versions could be
implemented as wrappers around them, or maybe just an optional new
module, e.g. rados.asyncio.

The thread spawning in rados.py is a separate issue - that was
introduced so that /usr/bin/ceph (or other librados python users) could
be interrupted by SIGINT (i.e. ^C). If there's a better way around that
(and maybe putting that logic into the ceph cli would be better at this
point) I for one would be happy to get rid of it.

Josh


On Monday 30 of November 2015 09:46:18 John Spray wrote:

On Sun, Nov 29, 2015 at 7:20 PM,   wrote:

Hi everyone,

for my pet project I've needed python3 rados library. So I've took the
existing python2 rados code and clean it up a little bit to fit my needs.
The lib contains basic interface, asynchronous operations and also
asyncio wrapper for convenience in asyncio programs.

If you are interested the code can be found at

https://github.com/mihu/python3-rados


I haven't played with this, but I do remember that some level of
python 3 support was merged to master recently:

commit b7de34bc7e4237fd407dc6c0d714deec026d0594
Merge: 773713b 615f8f4
Author: Josh Durgin 
Date:   Thu Nov 12 19:32:42 2015 -0800

 Merge branch 'pybind3' of https://github.com/dcoles/ceph into
wip-pybind3

 pybind: Add Python 3 support for rados and rbd modules

 Reviewed-by: Josh Durgin 

 Conflicts:
 src/pybind/rbd.py (new create args, minor fix to work with py3)

It would be good to reconcile this with that.

Cheers,
John


Cheers
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librbd ports to other language

2015-11-13 Thread Josh Durgin

On 11/13/2015 09:20 PM, Master user for YYcloud Groups wrote:

Hi,

I found there is only librbd's python wrapper.
Do you know how to ports this library to other language such as
Java/Ruby/Perl etc?


Here are a few other ports of librbd:

Java: https://github.com/ceph/rados-java

Ruby: https://github.com/ceph/ceph-ruby

Go: https://github.com/ceph/go-ceph

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd create => seg fault

2015-11-13 Thread Josh Durgin

On 11/13/2015 03:47 PM, Artie Ziff wrote:

Hello...

Want to share some more info about the rbd problem I am have.

I took Jason's suggestion to heart as there was a past errant configure
that was run with prefix not getting set so it installed into /. I never
did clean that up but today I configured a Ceph Makefile for / and ran
`make uninstall`. Thank you devs for the uninstall command.
But I still had the segfault on rbd.

Next...
Uninstalled the current build ceph app in /usr/local (using make uninstall)
And then cloned a new repo from Ceph's github repo. Instead of
performing a checkout to 0.94.5, I left it at HEAD. Built & installed.

$ ceph -v
ceph version 9.2.0-702-g7d926ce (7d926ce7e0e68d4416c4ae00e658a8d39d1540c6)

No rbd segfault but instead I see this:

rbd create image1 --size 35840 -p pool1
rbd: symbol lookup error: rbd: undefined symbol: _ZTIN8librados9WatchCtx2E

What could this indicate?
Perhaps an installation problem?
In one of the few search results, person pointed to a potential librbd1
and librados2 mismatch.


Yes, this is definitely an old version of librados getting picked up
somewhere in your library load path.

You can find where the old librados is via:

strace -f -e open ./rbd ls 2>&1 | grep librados.so


What approach is recommended to resolve this?


Remove the old librados. If you're determined to install via
'make install', you'll need to be careful about old versions of
libraries in the wrong place like this.

Using packages prevents this kind of mess.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Josh Durgin

On 10/19/2015 02:45 PM, Jan Schermer wrote:



On 19 Oct 2015, at 23:15, Gregory Farnum  wrote:

On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer  wrote:

I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what 
other people using Ceph think.

If I were to use RADOS directly in my app I'd probably rejoice at its 
capabilities and how useful and non-legacy it is, but my use is basically for 
RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities 
are unneeded.
I live in this RBD bubble so that's all I know, but isn't this also the only 
usage pattern that 90% (or more) people using Ceph care about? Isn't this what 
drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to 
displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of 
Ceph?*

What are the apps that actually use the RADOS features? I know Swift has some 
RADOS backend (which does the same thing Swift already did by itself, maybe 
with stronger consistency?), RGW (which basically does the same as Swift?) - 
doesn't seem either of those would need anything special. What else is there?
Apps that needed more than POSIX semantics (like databases for transactions) 
already developed mechanisms to do that - how likely is my database server to 
replace those mechanisms with RADOS API and objects in the future? It's all 
posix-filesystem-centric and that's not going away.

Ceph feels like a perfect example of this 
https://en.wikipedia.org/wiki/Inner-platform_effect

I was really hoping there was an easy way to just get rid of journal and 
operate on filestore directly - that should suffice for anyone using RBD only  
(in fact until very recently I thought it was possible to just disable journal 
in config...)


The biggest thing you're missing here is that Ceph needs to keep *its*
data and metadata consistent. The filesystem journal does *not* let us
do that, so we need a journal of our own.



I get that, but I can't see any reason for the client IO to cause any change in 
this data.
Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, probably 
a good idea to have several copies that are checksummed and versioned and put 
somewhere super-safe.
But I see no need for client IO to pass through here, ever...



Could something be constructed to do that more efficiently? Probably,
with enough effort...but it's hard, and we don't have it right now,
and it will still require a Ceph journal, because Ceph will always
have its own metadata that needs to be kept consistent with its data.
(Short example: rbd client sends two writes. OSDs crash and restart.
client dies before they finish. OSDs try to reconstruct consistent
view of the data. If OSDs don't have the right metadata about which
writes have been applied, they can't tell who's got the newest data or
if somebody's missing some piece of it, and without journaling you
could get the second write applied but not the first, etc)



If the writes were followed by a flush (barrier) then that blocks until the 
data (all data not flushed) is safe and durable on the disk. Whether that means 
in a journal or flushed to OSD filesystem makes no difference.
If the writes were not followed by a flush then anything can happen - there 
could be any state (like only the second write happening) and that's what the 
client _MUST_ be able to cope with, Ceph or not. It's the same as a physical 
drive - will it have the data or not after a crash? Who cares - the OS didn't 
get a confirmation so it's replayed (from filesystem journal in the guest, 
database transaction log, retried from application...).
Even if just the first write happened and then the whole cluster went down - no 
different then a power failure with local disk.
I can't see a scenario where something breaks - RBD is a block device, not a 
filesystem. The filesystem on top already has a journal and better 
understanding on what needs to be durable or not.
Until the guest VM asks for data to be durable, any state is acceptable.

You are right that without a "Ceph transaction log" it has no idea what was 
written and what wasn't - does that matter? It does not :-)
If a guest makes a write to a RBD image in a 3-replica cluster and power on all 
3 OSDs involved goes down at the same moment, what can it expect?
Did the guest get a confirmation for the write or not?
If it did then all replicas are consistent at that one moment.
If it did not then there might be different objects on those 3 OSDs - so what? The guest 
doesn't care what data is there because no disk gives that guarantee. All Ceph needs to 
do is stick to one version  (by a simple timestamp possibly) and replicate it. Even if it 
was not the "best" copy, the guest filesystem must and will cope with that.
You're trying to bring consistency to something that doesn't really need it by 
design. Even if you dutifully preserve every single IO the guest did - if it 
didn't get that confirmation then it will not u

Re: [ceph-users] Straw2 kernel version?

2015-09-10 Thread Josh Durgin

On 09/10/2015 11:53 AM, Robert LeBlanc wrote:


My notes show that it should have landed in 4.1, but I also have
written down that it wasn't merged yet. Just trying to get a
confirmation on the version that it did land in.


Yes, it landed in 4.1.

Josh


- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 12:45 PM, Lincoln Bryant  wrote:

Hi Robert,

I believe kernel versions 4.1 and beyond support straw2.

—Lincoln


On Sep 10, 2015, at 1:43 PM, Robert LeBlanc  wrote:


Has straw2 landed in the kernel and if so which version?

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Warning regarding LTTng while checking status or restarting service

2015-08-06 Thread Josh Durgin

On 08/06/2015 03:10 AM, Daleep Bais wrote:

Hi,

Whenever I restart or check the logs for OSD, MON, I get below warning
message..

I am running a test cluster of 09 OSD's and 03 MON nodes.

[ceph-node1][WARNIN] libust[3549/3549]: Warning: HOME environment
variable not set. Disabling LTTng-UST per-user tracing. (in
setup_local_apps() at lttng-ust-comm.c:375)


In short: this is harmless, you can ignore it.

liblttng-ust tries to listen for control commands from lttng-sessiond
in a few places by default, including under $HOME. It does this via a
shared mmaped file. If you were interested in tracing as a non-root
user, you could set LTTNG_HOME to a place that was usable, like
/var/lib/ceph/. Since ceph daemons run as root today, this is
irrelevant, and you can still use lttng as root just fine.
Unfortunately there's no simple way to silence liblttng-ust about this.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] readonly snapshots of live mounted rbd?

2015-08-04 Thread Josh Durgin

On 08/01/2015 07:52 PM, pixelfairy wrote:

Id like to look at a read-only copy of running virtual machines for
compliance and potentially malware checks that the VMs are unaware of.

the first note on http://ceph.com/docs/master/rbd/rbd-snapshot/ warns
that the filesystem has to be in a consistent state. does that just mean
you might get a "crashed" filesystem or will some other bad thing happen
if you snapshot a running filesystem that hasn't synced? would telling
the os to sync just before help?


Ideally you would do xfs_freeze -f, snap, xfs_freeze -u to get a
consistent fs for your snapshot. Despite the name this works on all
linux filesystems.
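
If you're scripting that sequence against a locally mounted kernel rbd
device, the same freeze/snap/thaw steps can be done programmatically -
a rough sketch, assuming an already-open rbd_image_t and the local
mountpoint path (for a VM you'd have the guest agent issue the freeze
inside the guest instead):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FIFREEZE / FITHAW */
#include <unistd.h>
#include <rbd/librbd.h>

static int consistent_snap(const char *mountpoint, rbd_image_t image,
                           const char *snap_name)
{
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;

    if (ioctl(fd, FIFREEZE, 0) < 0) {    /* like xfs_freeze -f */
        close(fd);
        return -1;
    }

    int r = rbd_snap_create(image, snap_name);

    ioctl(fd, FITHAW, 0);                /* like xfs_freeze -u */
    close(fd);
    return r;
}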

If you don't do this, like you said you get a crash-consistent snapshot,
which might require fs journal replay (writing to the image). This is
doable using a clone of the snapshot, but it's a bit more complicated
to manage.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-23 Thread Josh Durgin

On 07/23/2015 06:31 AM, Jan Schermer wrote:

Hi all,
I am looking for a way to alleviate the overhead of RBD snapshots/clones for 
some time.

In our scenario there are a few “master” volumes that contain production data, 
and are frequently snapshotted and cloned for dev/qa use. Those 
snapshots/clones live for a few days to a few weeks before they get dropped, 
and they sometimes grow very fast (databases, etc.).

With the default 4MB object size there seems to be huge overhead involved with 
this, could someone give me some hints on how to solve that?

I have some hope in

1) FIEMAP
I’ve calculated that files on my OSDs are approx. 30% filled with NULLs - I 
suppose this is what it could save (best-scenario) and it should also make COW 
operations much faster.
But there are lots of bugs in FIEMAP in kernels (i saw some reference to CentOS 
6.5 kernel being buggy - which is what we use) and filesystems (like XFS). No 
idea about ext4 which we’d like to use in the future.

Is enabling FIEMAP a good idea at all? I saw some mention of it being replaced 
with SEEK_DATA and SEEK_HOLE.


fiemap (and ceph's use of it) has been buggy on all fses in the past.
SEEK_DATA and SEEK_HOLE are the proper interfaces to use for these
purposes. That said, it's not incredibly well tested since it's off by
default, so I wouldn't recommend using it without careful testing on
the fs you're using. I wouldn't expect it to make much of a difference
if you use small objects.


2) object size < 4MB for clones
I did some quick performance testing and setting this lower for production is 
probably not a good idea. My sweet spot is 8MB object size, however this would 
make the overhead for clones even worse than it already is.
But I could make the cloned images with a different block size from the 
snapshot (at least according to docs). Does someone use it like that? Any 
caveats? That way I could have the production data with 8MB block size but make 
the development snapshots with for example 64KiB granularity, probably at 
expense of some performance, but most of the data would remain in the (faster) 
master snapshot anyway. This should drop overhead tremendously, maybe even more 
than neabling FIEMAP. (Even better when working in tandem I suppose?)


Since these clones are relatively short-lived this seems like a better
way to go in the short term. 64k may be extreme, but if there aren't
too many of these clones it's not a big deal. There is more overhead
for recovery and scrub with smaller objects, so I wouldn't recommend
using tiny objects in general.

It'll be interesting to see your results. I'm not sure many folks
have looked at optimizing this use case.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backing Hadoop with Ceph ??

2015-07-17 Thread Josh Durgin

On 07/15/2015 11:48 AM, Shane Gibson wrote:


Somnath - thanks for the reply ...

:-)  Haven't tried anything yet - just starting to gather
info/input/direction for this solution.

Looking at the S3 API info [2] - there is no mention of support for the
"S3a" API extensions - namely "rename" support.  The problem with
backing via S3 API - if you need to rename a large (say multi GB) data
object - you have to copy to new name and delete - this is a very IO
expensive operation - and something we do a lot of.  That in and of
itself might be a deal breaker ...   Any idea/input/intention of
supporting the S3a exentsions within the RadosGW S3 API implementation?


I see you're trying out cephfs now, and I think that makes sense.

I just wanted to mention that at CDS a couple weeks ago Yehuda noted
that RGW's rename is cheap, since it does not require copying the data,
just updating its location [1].

Josh

[1] http://pad.ceph.com/p/hadoop-over-rgw


Plus - it seems like it's considered a "bad idea" to back Hadoop via S3
(and indirectly Ceph via RGW) [3]; though not sure if the architectural
differences from Amazon's S3 implementation and the far superior Ceph
make it more palatable?

~~shane

[2] http://ceph.com/docs/master/radosgw/s3/
[3] https://wiki.apache.org/hadoop/AmazonS3




On 7/15/15, 9:50 AM, "Somnath Roy" <somnath@sandisk.com> wrote:

Did you try to integrate ceph +rgw+s3 with Hadoop?

Sent from my iPhone

On Jul 15, 2015, at 8:58 AM, Shane Gibson <shane_gib...@symantec.com> wrote:




We are in the (very) early stages of considering testing backing
Hadoop via Ceph - as opposed to HDFS.  I've seen a few very vague
references to doing that, but haven't found any concrete info
(architecture, configuration recommendations, gotchas, lessons
learned, etc...).   I did find the ceph.com/docs/ info [1] which
discusses use of CephFS for
backing Hadoop - but this would be foolish for production clusters
given that CephFS isn't yet considered production quality/grade.

Does anyone in the ceph-users community have experience with this
that they'd be willing to share?   Preferably ... via use of Ceph
- not via CephFS...but I am interested in any CephFS related
experiences too.

If we were to do this, and Ceph proved out as a backing store to
Hadoop - there is the potential to be creating a fairly large
multi-Petabyte (100s ??) class backing store for Ceph.  We do a
very large amount of analytics on a lot of data sets for security
trending correlations, etc...

Our current Ceph experience is limited to a few small (90 x 4TB
OSD size) clusters - which we are working towards putting in
production for Glance/Cinder backing and for Block storage for
various large storage need platforms (eg software and package
repo/mirrors, etc...).

Thanks in  advance for any input, thoughts, or pointers ...

~~shane

[1] http://ceph.com/docs/master/cephfs/hadoop/



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache + libvirt

2015-06-12 Thread Josh Durgin

On 06/08/2015 09:23 PM, Alexandre DERUMIER wrote:

In the short-term, you can remove the "rbd cache" setting from your ceph.conf


That's not true, you need to remove the ceph.conf file.
Removing rbd_cache is not enough or default rbd_cache=false will apply.


I have done tests, here the result matrix


host ceph.conf : no rbd_cache:  guest cache=writeback  : result : nocache   
(wrong)
host ceph.conf : rbd_cache=false :  guest cache=writeback  : result : nocache   
(wrong)
host ceph.conf : rbd_cache=true  :  guest cache=writeback  : result : cache
host ceph.conf : no rbd_cache:  guest cache=none   : result : nocache
host ceph.conf : rbd_cache=false :  guest cache=none   : result : no cache
host ceph.conf : rbd_cache=true  :  guest cache=none   : result : cache 
(wrong)


QEMU patch 3/4 fixes this:

http://comments.gmane.org/gmane.comp.emulators.qemu.block/2500

Josh


- Original message -
From: "Jason Dillaman" 
To: "Andrey Korolyov" 
Cc: "Josh Durgin" , "aderumier" , 
"ceph-users" 
Sent: Monday, June 8, 2015 22:29:10
Subject: Re: [ceph-users] rbd cache + libvirt


On Mon, Jun 8, 2015 at 10:43 PM, Josh Durgin  wrote:

On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote:


Hi,


looking at the latest version of QEMU,



It's seem that it's was already this behaviour since the add of rbd_cache
parsing in rbd.c by josh in 2012


http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85


I'll do tests on my side tomorrow to be sure.



It seems like we should switch the order so ceph.conf is overridden by
qemu's cache settings. I don't remember a good reason to have it the
other way around.

Josh



Erm, doesn`t this code *already* represent the right priorities?
Cache=none setting should set a BDRV_O_NOCACHE which is effectively
disabling cache in a mentioned snippet.



Yes, the override is applied (correctly) based upon your QEMU cache settings. However, it then reads your configuration 
file and re-applies the "rbd_cache" setting based upon what is in the file (if it exists). So in the case 
where a configuration file has "rbd cache = true", the override of "rbd cache = false" derived from 
your QEMU cache setting would get wiped out. The long term solution would be to, as Josh noted, switch the order (so 
long as there wasn't a use-case for applying values in this order). In the short-term, you can remove the "rbd 
cache" setting from your ceph.conf so that QEMU controls it (i.e. it cannot get overridden when reading the 
configuration file) or use a different ceph.conf for a drive which requires different cache settings from the default 
configuration's settings.

Jason



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache + libvirt

2015-06-08 Thread Josh Durgin

On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote:

Hi,

looking at the latest version of QEMU,


It's seem that it's was already this behaviour since the add of rbd_cache 
parsing in rbd.c by josh in 2012

http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85


I'll do tests on my side tomorrow to be sure.


It seems like we should switch the order so ceph.conf is overridden by
qemu's cache settings. I don't remember a good reason to have it the
other way around.

Josh


- Original message -
From: "Jason Dillaman" 
To: "Arnaud Virlet" 
Cc: "ceph-users" 
Sent: Monday, June 8, 2015 17:50:53
Subject: Re: [ceph-users] rbd cache + libvirt

Hmm ... looking at the latest version of QEMU, it appears that the RBD cache settings are 
changed prior to reading the configuration file instead of overriding the value after the 
configuration file has been read [1]. Try specifying the path to a new configuration file 
via the "conf=/path/to/my/new/ceph.conf" QEMU parameter where the RBD cache is 
explicitly disabled [2].


[1] 
http://git.qemu.org/?p=qemu.git;a=blob;f=block/rbd.c;h=fbe87e035b12aab2e96093922a83a3545738b68f;hb=HEAD#l478
[2] http://ceph.com/docs/master/rbd/qemu-rbd/#usage



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-04 Thread Josh Durgin

On 06/03/2015 04:15 AM, Jan Schermer wrote:

Thanks for a very helpful answer.
So if I understand it correctly then what I want (crash consistency with RPO>0) 
isn’t possible now in any way.
If there is no ordering in RBD cache then ignoring barriers sounds like a very 
bad idea also.


Yes, that's why the default rbd cache configuration in hammer stays in
writethrough mode until it sees a flush from the guest.


Any thoughts on ext4 with journal_async_commit? That should be safe in any 
circumstance, but it’s pretty hard to test that assumption…


It doesn't sound incredibly well-tested in general. It does something
like what you want, allowing some data to be lost but theoretically
preventing fs corruption, but I wouldn't trust it without a lot of
testing.

It seems like db-specific options for controlling how much data they
can lose may be best for your use case right now.


Is there someone running big database (OLTP) workloads on Ceph? What did you do 
to make them run? Out of box we are all limited to the same ~100 tqs/s (with 
5ms write latency)…


There is a lot of work going on to improve performance, and latency in
particular:

http://pad.ceph.com/p/performance_weekly

If you haven't seen them, Mark has a config optimized for latency at
the end of this:

http://nhm.ceph.com/Ceph_SSD_OSD_Performance.pdf

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph asok filling nova open files

2015-06-03 Thread Josh Durgin

On 06/03/2015 03:15 PM, Robert LeBlanc wrote:


Thank you for pointing to the information. I'm glad a fix is already
ready. I can't tell from https://github.com/ceph/ceph/pull/4657, will
this be included in the next point release of hammer?


It'll be in 0.94.3.

0.94.2 is close to release already: http://tracker.ceph.com/issues/11492

Josh


Thanks,
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jun 3, 2015 at 4:00 PM, Josh Durgin  wrote:

On 06/03/2015 02:31 PM, Robert LeBlanc wrote:


We are experiencing a problem where nova is opening up all kinds of
sockets like:

nova-comp 20740 nova 1996u  unix 0x8811b3116b40  0t0 41081179
/var/run/ceph/ceph-client.volumes.20740.81999792.asok

hitting the open file limits rather quickly and preventing any new
work from happening in Nova.

The thing is, there isn't even that many volumes in the pool. Any ideas?



This is http://tracker.ceph.com/issues/11535 combined with nova
checking storage utilization periodically. Backports to hammer and
firefly are ready but not in a point release yet.

You can turn off the admin socket option in the ceph.conf that nova
uses to work around it.

Josh





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph asok filling nova open files

2015-06-03 Thread Josh Durgin

On 06/03/2015 02:31 PM, Robert LeBlanc wrote:

We are experiencing a problem where nova is opening up all kinds of
sockets like:

nova-comp 20740 nova 1996u  unix 0x8811b3116b40  0t0 41081179
/var/run/ceph/ceph-client.volumes.20740.81999792.asok

hitting the open file limits rather quickly and preventing any new
work from happening in Nova.

The thing is, there isn't even that many volumes in the pool. Any ideas?


This is http://tracker.ceph.com/issues/11535 combined with nova
checking storage utilization periodically. Backports to hammer and
firefly are ready but not in a point release yet.

You can turn off the admin socket option in the ceph.conf that nova
uses to work around it.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-02 Thread Josh Durgin

On 06/01/2015 03:41 AM, Jan Schermer wrote:

Thanks, that’s it exactly.
But I think that’s really too much work for now, that’s why I really would like 
to see a quick-win by using the local RBD cache for now - that would suffice 
for most workloads (not too many people run big databases on CEPH now, those 
who do must be aware of this).

The issue is - and I have not yet seen an answer to that - would it be safe as 
it is now if the flushes were ignored (rbd cache = unsafe) or will it 
completely b0rk the filesystem when not flushed properly?


Generally the latter. Right now flushes are the only thing enforcing
ordering for rbd. As a block device it doesn't guarantee that e.g. the
extent at offset 0 is written before the extent at offset 4096 unless
it sees a flush between the writes.

As suggested earlier in this thread, maintaining order during writeback
would make not sending flushes (via mount -o nobarrier in the guest or
cache=unsafe for qemu) safer from a crash-consistency point of view.

An fs or database on top of rbd would still have to replay their
internal journal, and could lose some writes, but should be able to
end up in a consistent state that way. This would make larger caches
more useful, and would be a simple way to use a large local cache
devices as an rbd cache backend. Live migration should still work in
such a system because qemu will still tell rbd to flush data at that
point.

A distributed local cache like [1] might be better long term, but
much more complicated to implement.

Josh

[1] 
https://www.usenix.org/conference/fast15/technical-sessions/presentation/bhagwat


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-04-15 Thread Josh Durgin

On 04/14/2015 08:01 PM, shiva rkreddy wrote:

The clusters are in test environment, so its a new deployment of 0.80.9.
OS on the cluster nodes is reinstalled as well, so there shouldn't be
any fs aging unless the disks are slowing down.

The perf measurement is done initiating multiple cinder create/delete
commands and tracking the volume to be in available or completely gone
from "cinder list" output.

Even running an "rbd rm" command from the cinder node results in similar
behaviour.

I'll try with  increasing  rbd_concurrent_management in ceph.conf.
  Is the param name rbd_concurrent_management or rbd-concurrent-management ?


'rbd concurrent management ops' - spaces, hyphens, and underscores are
equivalent in ceph configuration.

A log with 'debug ms = 1' and 'debug rbd = 20' from 'rbd rm' on both 
versions might give clues about what's going slower.
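
For example, a sketch of the client-side settings for capturing that log,
assuming you can point the log file at a path writable by the user running
'rbd rm' (the path and the ops value are just examples):

[client]
debug ms = 1
debug rbd = 20
log file = /var/log/ceph/rbd-rm.$pid.log
# optionally raise the number of parallel deletes while testing
rbd concurrent management ops = 20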


Josh


On Tue, Apr 14, 2015 at 12:36 PM, Josh Durgin <jdur...@redhat.com> wrote:

I don't see any commits that would be likely to affect that between
0.80.7 and 0.80.9.

Is this after upgrading an existing cluster?
Could this be due to fs aging beneath your osds?

How are you measuring create/delete performance?

You can try increasing rbd concurrent management ops in ceph.conf on
the cinder node. This affects delete speed, since rbd tries to
delete each object in a volume.

Josh


*From:* shiva rkreddy <shiva.rkre...@gmail.com>
*Sent:* Apr 14, 2015 5:53 AM
*To:* Josh Durgin
*Cc:* Ken Dreyer; Sage Weil; Ceph Development; ceph-us...@ceph.com
*Subject:* Re: v0.80.8 and librbd performance

Hi Josh,

We are using firefly 0.80.9 and see both cinder create/delete
numbers slow down compared 0.80.7.
I don't see any specific tuning requirements and our cluster is
run pretty much on default configuration.
Do you recommend any tuning or can you please suggest some log
signatures we need to be looking at?

Thanks
shiva

On Wed, Mar 4, 2015 at 1:53 PM, Josh Durgin <jdur...@redhat.com> wrote:

On 03/03/2015 03:28 PM, Ken Dreyer wrote:

On 03/03/2015 04:19 PM, Sage Weil wrote:

Hi,

This is just a heads up that we've identified a
performance regression in
v0.80.8 from previous firefly releases.  A v0.80.9
is working it's way
through QA and should be out in a few days.  If you
haven't upgraded yet
you may want to wait.

Thanks!
sage


Hi Sage,

I've seen a couple Redmine tickets on this (eg
http://tracker.ceph.com/issues/9854 ,
http://tracker.ceph.com/issues/10956). It's not
totally clear to me which of the 70+ unreleased commits
on the firefly branch fix this librbd issue.  Is it only
the three commits in https://github.com/ceph/ceph/pull/3410 ,
or are there more?


Those are the only ones needed to fix the librbd performance
regression, yes.

Josh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-04-14 Thread Josh Durgin
I don't see any commits that would be likely to affect that between 0.80.7 and 
0.80.9.

Is this after upgrading an existing cluster?
Could this be due to fs aging beneath your osds?

How are you measuring create/delete performance?

You can try increasing rbd concurrent management ops in ceph.conf on the cinder 
node. This affects delete speed, since rbd tries to delete each object in a 
volume.

Josh


From: shiva rkreddy 
Sent: Apr 14, 2015 5:53 AM
To: Josh Durgin
Cc: Ken Dreyer; Sage Weil; Ceph Development; ceph-us...@ceph.com
Subject: Re: v0.80.8 and librbd performance

> Hi Josh,
>
> We are using firefly 0.80.9 and see both cinder create/delete numbers slow 
> down compared 0.80.7.
> I don't see any specific tuning requirements and our cluster is run pretty 
> much on default configuration.
> Do you recommend any tuning or can you please suggest some log signatures we 
> need to be looking at?
>
> Thanks
> shiva
>
> On Wed, Mar 4, 2015 at 1:53 PM, Josh Durgin  wrote:
>>
>> On 03/03/2015 03:28 PM, Ken Dreyer wrote:
>>>
>>> On 03/03/2015 04:19 PM, Sage Weil wrote:
>>>>
>>>> Hi,
>>>>
>>>> This is just a heads up that we've identified a performance regression in
>>>> v0.80.8 from previous firefly releases.  A v0.80.9 is working it's way
>>>> through QA and should be out in a few days.  If you haven't upgraded yet
>>>> you may want to wait.
>>>>
>>>> Thanks!
>>>> sage
>>>
>>>
>>> Hi Sage,
>>>
>>> I've seen a couple Redmine tickets on this (eg
>>> http://tracker.ceph.com/issues/9854 ,
>>> http://tracker.ceph.com/issues/10956). It's not totally clear to me
>>> which of the 70+ unreleased commits on the firefly branch fix this
>>> librbd issue.  Is it only the three commits in
>>> https://github.com/ceph/ceph/pull/3410 , or are there more?
>>
>>
>> Those are the only ones needed to fix the librbd performance
>> regression, yes.
>>
>> Josh
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] live migration fails with image on ceph

2015-04-10 Thread Josh Durgin

On 04/08/2015 09:37 PM, Yuming Ma (yumima) wrote:

Josh,

I think we are using plain live migration and not mirroring block drives
as the other test did.


Do you have the migration flags or more from the libvirt log? Also
which versions of qemu is this?

The libvirt log message about qemuMigrationCancelDriveMirror from your
first email is suspicious. Being unable to stop it may mean it was not 
running (fine, but libvirt shouldn't have tried to stop it), or it kept 
running (bad esp. if it's trying to copy to the same rbd).



What are the chances or scenario that disk image
can be corrupted during the live migration for both source and target
are connected to the same volume and RBD caches is turned on:


Generally rbd caching with live migration is safe. The way to get
corruption is to have drive-mirror try to copy over the rbd on the
destination while the source is still using the disk...

Did you observe fs corruption after a live migration, or just other odd
symptoms? Since a reboot fixed it, it sounds more like memory corruption
to me, unless it was fsck'd during reboot.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] long blocking with writes on rbds

2015-04-08 Thread Josh Durgin

On 04/08/2015 11:40 AM, Jeff Epstein wrote:

Hi, thanks for answering. Here are the answers to your questions.
Hopefully they will be helpful.

On 04/08/2015 12:36 PM, Lionel Bouton wrote:

I probably won't be able to help much, but people knowing more will
need at least: - your Ceph version, - the kernel version of the host
on which you are trying to format /dev/rbd1, - which hardware and
network you are using for this cluster (CPU, RAM, HDD or SSD models,
network cards, jumbo frames, ...).


ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

Linux 3.18.4pl2 #3 SMP Thu Jan 29 21:11:23 CET 2015 x86_64 GNU/Linux

The hardware is an Amazon AWS c3.large. So, a (virtual) Xeon(R) CPU
E5-2680 v2 @ 2.80GHz, 3845992 kB RAM, plus whatever other virtual
hardware Amazon provides.


AWS will cause some extra perf variance, but...


There's only one thing surprising me here: you have only 6 OSDs, 1504GB
(~ 250G / osd) and a total of 4400 pgs ? With a replication of 3 this is
2200 pgs / OSD, which might be too much and unnecessarily increase the
load on your OSDs.

Best regards,

Lionel Bouton


Our workload involves creating and destroying a lot of pools. Each pool
has 100 pgs, so it adds up. Could this be causing the problem? What
would you suggest instead?


...this is most likely the cause. Deleting a pool causes the data and
pgs associated with it to be deleted asynchronously, which can be a lot
of background work for the osds.

If you're using the cfq scheduler you can try decreasing the priority of 
these operations with the "osd disk thread ioprio..." options:


http://ceph.com/docs/master/rados/configuration/osd-config-ref/#operations
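
A sketch of what that looks like in ceph.conf, assuming your OSD disks use the
cfq scheduler (the values are only a starting point; see the link above):

[osd]
# run the OSD's disk thread (snap trimming, scrubbing) at low io priority
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7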

If that doesn't help enough, deleting data from pools before deleting
the pools might help, since you can control the rate more finely. And of
course not creating/deleting so many pools would eliminate the hidden
background cost of deleting the pools.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of ioctx per rados connection

2015-04-08 Thread Josh Durgin

Yes, you can use multiple ioctxs with the same underlying rados connection. 
There's no hard limit on how many, it depends on your usage if/when a single 
rados connection becomes a bottleneck.

It's safe to use different ioctxs from multiple threads. IoCtxs have some local 
state like namespace, object locator key and snapshot that limit what you can 
do safely with multiple threads using the same IoCtx. librados.h has more 
details, but it's simplest to use a separate ioctx for each thread.

Josh


From: Michel Hollands 
Sent: Apr 8, 2015 6:54 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Number of ioctx per rados connection

> Hello,
>
> This is a question about the C API for librados. Can you use multiple “IO 
> contexts” (ioctx) per rados connection and if so how many ? Can these then be 
> used by multiple threads ? 
>
> Thanks in advance,
>
> Michel___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] live migration fails with image on ceph

2015-04-06 Thread Josh Durgin

Like the last comment on the bug says, the message about block migration (drive 
mirroring) indicates that nova is telling libvirt to copy the virtual disks, 
which is not what should happen for ceph or other shared storage.

For ceph just plain live migration should be used, not block migration. It's 
either a configuration issue or a bug in nova.
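
If it turns out to be configuration, the usual suspect is the set of live
migration flags nova passes to libvirt. A sketch, assuming a nova.conf with a
[libvirt] section (the exact option name and section vary by nova release):

[libvirt]
# no VIR_MIGRATE_NON_SHARED_DISK / VIR_MIGRATE_NON_SHARED_INC here,
# since rbd is shared storage and must not be block-migrated
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST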

Josh


From: "Yuming Ma (yumima)" 
Sent: Apr 3, 2015 1:27 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] live migration fails with image on ceph


Problem: live-migrating a VM, the migration will complete but cause a VM to 
become unstable.  The VM may become unreachable on the network, or go through a 
cycle where it hangs for ~10 mins at a time. A hard-reboot is the only way to 
resolve this.

Related libvirt logs:

2015-03-30 01:18:23.429+: 244411: warning : 
qemuMigrationCancelDriveMirror:1383 : Unable to stop block job on 
drive-virtio-disk0

2015-03-30 01:17:41.899+: 244408: warning : 
qemuDomainObjEnterMonitorInternal:1175 : This thread seems to be the async job 
owner; entering monitor without asking for a nested job is dangerous


Nova env: 
Kernel : 3.11.0-26-generic

libvirt-bin : 1.1.1-0ubuntu11 

ceph-common : 0.67.9-1precise


Ceph:

Kernel: 3.13.0-36-generic

ceph        : 0.80.7-1precise 

ceph-common : 0.80.7-1precise  



Saw a post here (https://bugs.dogfood.paddev.net/mos/+bug/1371130) suggesting this
might have something to do with libvirt migration of an RBD-backed image, but I'm not sure
exactly how Ceph is related or how to resolve it, if anyone has hit this before.


Thanks.


— Yuming___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error DATE 1970

2015-04-02 Thread Josh Durgin

On 04/01/2015 02:42 AM, Jimmy Goffaux wrote:

English Version :

Hello,

I found a strange behavior in Ceph. This behavior is visible on Buckets
(RGW) and pools (RDB).
pools:

``
root@:~# qemu-img info rbd:pool/kibana2
image: rbd:pool/kibana2
file format: raw
virtual size: 30G (32212254720 bytes)
disk size: unavailable
Snapshot list:
ID                      TAG                     VM SIZE  DATE                 VM CLOCK
snap2014-08-26-kibana2  snap2014-08-26-kibana2  30G      1970-01-01 01:00:00  00:00:00.000
snap2014-09-05-kibana2  snap2014-09-05-kibana2  30G      1970-01-01 01:00:00  00:00:00.000
``

As you can see the all dates are set to 1970-01-01 ?


The reason for this is simple for rbd: it doesn't store the date for
snapshots, and that's just the default value qemu shows.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Josh Durgin

On 03/26/2015 10:46 AM, Gregory Farnum wrote:

I don't know why you're mucking about manually with the rbd directory;
the rbd tool and rados handle cache pools correctly as far as I know.


That's true, but the rados tool should be able to manipulate binary data 
more easily. It should probably be able to read from a file or stdin for 
this.


Josh



On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke  wrote:

Hi Greg,
ok!

It looks like my problem is more setomapval-related...

I must do something like
rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
"\0x0f\0x00\0x00\0x00"2cfc7ce74b0dc51

but "rados setomapval" don't use the hexvalues - instead of this I got
rados -p ssd-archiv listomapvals rbd_directory
name_vm-409-disk-2
value: (35 bytes) :
 : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
0020 : 63 35 31: c51


hmm, strange. With  "rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 
name_vm-409-disk-2"
I got the binary inside the file name_vm-409-disk-2, but the reverse, doing an
"rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
name_vm-409-disk-2"
fills the omap value with the string name_vm-409-disk-2 and not with the content of the 
file...

Are there other tools for the rbd_directory?

regards

Udo

Am 26.03.2015 15:03, schrieb Gregory Farnum:

You shouldn't rely on "rados ls" when working with cache pools. It
doesn't behave properly and is a silly operation to run against a pool
of any size even when it does. :)

More specifically, "rados ls" is invoking the "pgls" operation. Normal
read/write ops will go query the backing store for objects if they're
not in the cache tier. pgls is different — it just tells you what
objects are present in the PG on that OSD right now. So any objects
which aren't in cache won't show up when listing on the cache pool.
-Greg

On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke  wrote:

Hi all,
due to a very silly approach, I removed the cache tier of a filled EC pool.

After recreating the pool and connecting it with the EC pool I don't see any content.
How can I see the rbd_data and other files through the new ssd cache tier?

I think, that I must recreate the rbd_directory (and fill with setomapval), but 
I don't see anything yet!

$ rados ls -p ecarchiv | more
rbd_data.2e47de674b0dc51.00390074
rbd_data.2e47de674b0dc51.0020b64f
rbd_data.2fbb1952ae8944a.0016184c
rbd_data.2cfc7ce74b0dc51.00363527
rbd_data.2cfc7ce74b0dc51.0004c35f
rbd_data.2fbb1952ae8944a.0008db43
rbd_data.2cfc7ce74b0dc51.0015895a
rbd_data.31229f0238e1f29.000135eb
...

$ rados ls -p ssd-archiv
 nothing 

generation of the cache tier:
$ rados mkpool ssd-archiv
$ ceph osd pool set ssd-archiv crush_ruleset 5
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd pool set ssd-archiv hit_set_type bloom
$ ceph osd pool set ssd-archiv hit_set_count 1
$ ceph osd pool set ssd-archiv hit_set_period 3600
$ ceph osd pool set ssd-archiv target_max_bytes 500


rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
}


Are there any "magic" (or which command I missed?) to see the excisting data 
throug the cache tier?


regards - and hoping for answers

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-05 Thread Josh Durgin

On 03/05/2015 12:46 AM, koukou73gr wrote:

On 03/05/2015 03:40 AM, Josh Durgin wrote:


It looks like your libvirt rados user doesn't have access to whatever
pool the parent image is in:


librbd::AioRequest: write 0x7f1ec6ad6960
rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r
= -1

-1 is EPERM, for operation not permitted.

Check the libvirt user capabilites shown in ceph auth list - it should
have at least r and class-read access to the pool storing the parent
image. You can update it via the 'ceph auth caps' command.


Josh,

All  images, parent, snapshot and clone reside on the same pool
(libvirt-pool *) and the user (libvirt) seems to have the proper
capabilities. See:

client.libvirt
 key: 
 caps: [mon] allow r
 caps: [osd] allow class-read object_prefix rbd_children, allow rw
class-read pool=rbd


This includes everything except class-write on the pool you're using.
You'll need that so that a copy_up call (used just for clones) works.
That's what was getting a permissions error. You can use rwx for short.
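
For example, a sketch of the caps update for the setup above (pool name taken
from the existing caps; adjust it if your images live in a different rados pool):

ceph auth caps client.libvirt \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd'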

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-04 Thread Josh Durgin

On 03/04/2015 01:36 PM, koukou73gr wrote:

On 03/03/2015 05:53 PM, Jason Dillaman wrote:

Your procedure appears correct to me.  Would you mind re-running your
cloned image VM with the following ceph.conf properties:

[client]
rbd cache off
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

If you recreate the issue, would you mind opening a ticket at
http://tracker.ceph.com/projects/rbd/issues?

Jason,

Thanks for the reply. Recreating the issue is not a problem, I can
reproduce it any time.
The log file was getting a bit large, I destroyed the guest after
letting it thrash for about ~3 minutes, plenty of time to hit the
problem. I've uploaded it at:

http://paste.scsys.co.uk/468868 (~19MB)


It looks like your libvirt rados user doesn't have access to whatever
pool the parent image is in:


librbd::AioRequest: write 0x7f1ec6ad6960 
rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r = -1


-1 is EPERM, for operation not permitted.

Check the libvirt user capabilites shown in ceph auth list - it should
have at least r and class-read access to the pool storing the parent
image. You can update it via the 'ceph auth caps' command.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-04 Thread Josh Durgin

On 03/02/2015 04:16 AM, koukou73gr wrote:


Hello,

Today I thought I'd experiment with snapshots and cloning. So I did:

rbd import --image-format=2 vm-proto.raw rbd/vm-proto
rbd snap create rbd/vm-proto@s1
rbd snap protect rbd/vm-proto@s1
rbd clone rbd/vm-proto@s1 rbd/server

And then proceeded to create a qemu-kvm guest with rbd/server as its
backing store. The guest booted but as soon as it got to mount the root
fs, things got weird:


What does the qemu command line look like?


[...]
scsi2 : Virtio SCSI HBA
scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0
ANSI: 5
sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
  sda: sda1 sda2
sd 2:0:0:0: [sda] Attached SCSI disk
dracut: Scanning devices sda2  for LVM logical volumes vg_main/lv_swap
vg_main/lv_root
dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit
dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit
EXT4-fs (dm-1): INFO: recovery required on readonly filesystem


This suggests the disk is being exposed as read-only via QEMU,
perhaps via qemu's snapshot or other options.

You can use a clone in exactly the same way as any other rbd image.
If you're running QEMU manually, for example, something like:

qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback

is fine for using the clone. QEMU is supposed to be unaware of any
snapshots, parents, etc. at the rbd level.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-03-04 Thread Josh Durgin

On 03/03/2015 03:28 PM, Ken Dreyer wrote:

On 03/03/2015 04:19 PM, Sage Weil wrote:

Hi,

This is just a heads up that we've identified a performance regression in
v0.80.8 from previous firefly releases.  A v0.80.9 is working it's way
through QA and should be out in a few days.  If you haven't upgraded yet
you may want to wait.

Thanks!
sage


Hi Sage,

I've seen a couple Redmine tickets on this (eg
http://tracker.ceph.com/issues/9854 ,
http://tracker.ceph.com/issues/10956). It's not totally clear to me
which of the 70+ unreleased commits on the firefly branch fix this
librbd issue.  Is it only the three commits in
https://github.com/ceph/ceph/pull/3410 , or are there more?


Those are the only ones needed to fix the librbd performance
regression, yes.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] wider rados namespace support?

2015-02-18 Thread Josh Durgin

On 02/12/2015 05:59 PM, Blair Bethwaite wrote:

My particular interest is for a less dynamic environment, so manual
key distribution is not a problem. Re. OpenStack, it's probably good
enough to have the Cinder host creating them as needed (presumably
stored in its DB) and just send the secret keys over the message bus
to compute hosts as needed - if your infrastructure network is not
trusted then you've got bigger problems to worry about. It's true that
a lot of clouds would end up logging the secrets in various places,
but then they are only useful on particular hosts.

I guess there is nothing special about the default "" namespace
compared to any other as far as cephx is concerned. It would be nice
to have something of a nested auth, so that the client requires
explicit permission to read the default namespace (configured
out-of-band when setting up compute hosts) and further permission for
particular non-default namespaces (managed by the cinder rbd driver),
that way leaking secrets from cinder gives less exposure - but I guess
that would be a bit of a change from the current namespace
functionality.


You can restrict client access to the default namespace like this with
the existing ceph capabilities. For the proposed rbd usage of
namespaces, for example, you could allow read-only access to the
rbd_id.* objects in the default namespace, and full access to other
specific namespaces. Something like:

mon 'allow r' osd 'allow r class-read pool=foo namespace="" 
object_prefix rbd_id, allow rwx pool=foo namespace=bar'


Cinder or other management layers would still want broader access, but
these more restricted keys could be the only ones exposed to QEMU.

Josh


On 13 February 2015 at 05:57, Josh Durgin  wrote:

On 02/10/2015 07:54 PM, Blair Bethwaite wrote:


Just came across this in the docs:
"Currently (i.e., firefly), namespaces are only useful for
applications written on top of librados. Ceph clients such as block
device, object storage and file system do not currently support this
feature."

Then found:
https://wiki.ceph.com/Planning/Sideboard/rbd%3A_namespace_support

Is there any progress or plans to address this (particularly for rbd
clients but also cephfs)?



No immediate plans for rbd. That blueprint still seems like a
reasonable way to implement it to me.

The one part I'm less sure about is the OpenStack or other higher level
integration, which would need to start adding secret keys to libvirt
dynamically.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD on RBD (KVM)

2015-02-18 Thread Josh Durgin
> From: "Logan Barfield" 
> We've been running some tests to try to determine why our FreeBSD VMs
> are performing much worse than our Linux VMs backed by RBD, especially
> on writes.
> 
> Our current deployment is:
> - 4x KVM Hypervisors (QEMU 2.0.0+dfsg-2ubuntu1.6)
> - 2x OSD nodes (8x SSDs each, 10Gbit links to hypervisors, pool has 2x
> replication across nodes)
> - Hypervisors have "rbd_cache enabled"
> - All VMs use "cache=none" currently.

If you don't have rbd cache writethrough until flush = true, this
configuration is unsafe - with cache=none, qemu will not send flushes.
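
For reference, a sketch of the safer cache settings in the ceph.conf qemu uses;
this is just the option discussed above, not a tuning recommendation:

[client]
rbd cache = true
# stay in writethrough mode until the guest sends its first flush
rbd cache writethrough until flush = true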
 
> In testing we were getting ~30MB/s writes, and ~100MB/s reads on
> FreeBSD 10.1.  On Linux VMs we're seeing ~150+MB/s for writes and
> reads (dd if=/dev/zero of=output bs=1M count=1024 oflag=direct).

I'm not very familiar with FreeBSD, but I'd guess it's sending smaller
I/Os for some reason. This could be due to trusting the sector size
qemu reports (this can be changed, though I don't remember the syntax
offhand), lower fs block size, or scheduler or block subsystem
configurables.  It could also be related to differences in block
allocation strategies by whatever FS you're using in the guest and
Linux filesystems. What FS are you using in each guest?

You can check the I/O sizes seen by rbd by adding something like this
to ceph.conf on a node running qemu:

[client]
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

This will show the offset and length of requests in lines containing
aio_read and aio_write. If you're using giant you could instead gather
a trace of I/O to rbd via lttng.

> I tested several configurations on both RBD and local SSDs, and the
> only time FreeBSD performance was comparable to Linux was with the
> following configuration:
> - Local SSD
> - Qemu cache=writeback
> - GPT journaling enabled
> 
> We did see some performance improvement (~50MB/s writes instead of
> 30MB/s) when using cache=writeback on RBD.
> 
> I've read several threads regarding cache=none vs cache=writeback.
> cache=none is apparently safer for live migration, but cache=writeback
> is recommended by Ceph to prevent data loss.  Apparently there was a
> patch submitted for Qemu a few months ago to make cache=writeback
> safer for live migrations as well: http://tracker.ceph.com/issues/2467

RBD caching is already safe with live migration without this patch. It
just makes sure that it will continue to be safe in case of future
QEMU changes.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] wider rados namespace support?

2015-02-12 Thread Josh Durgin

On 02/10/2015 07:54 PM, Blair Bethwaite wrote:

Just came across this in the docs:
"Currently (i.e., firefly), namespaces are only useful for
applications written on top of librados. Ceph clients such as block
device, object storage and file system do not currently support this
feature."

Then found:
https://wiki.ceph.com/Planning/Sideboard/rbd%3A_namespace_support

Is there any progress or plans to address this (particularly for rbd
clients but also cephfs)?


No immediate plans for rbd. That blueprint still seems like a
reasonable way to implement it to me.

The one part I'm less sure about is the OpenStack or other higher level
integration, which would need to start adding secret keys to libvirt
dynamically.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Josh Durgin

On 02/05/2015 07:44 AM, Udo Lembke wrote:

Hi all,
is there any command to flush the rbd cache like the
"echo 3 > /proc/sys/vm/drop_caches" for the os cache?


librbd exposes it as rbd_invalidate_cache(), and qemu uses it
internally, but I don't think you can trigger that via any user-facing
qemu commands.

Exposing it through the admin socket would be pretty simple though:

http://tracker.ceph.com/issues/2468

You can also just detach and reattach the device to flush the rbd cache.
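
A sketch of the detach/reattach approach via libvirt, assuming a hypothetical
guest "guest1" with the rbd disk at target vdb and its original definition
saved in disk.xml:

virsh detach-disk guest1 vdb --live
# wait for the detach to complete, then reattach the same disk
virsh attach-device guest1 disk.xml --live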

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:

Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or a 1000 which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.


Yeah, a --skip-trimming option seems a bit dangerous. This trimming
actually is parallelized (10 ops at once by default, changeable via
--rbd-concurrent-management-ops) since dumpling.
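
For example, a sketch of raising that for a one-off shrink (image name taken
from earlier in the thread; 20 is only an illustrative value):

rbd resize --size 665600 --allow-shrink \
    --rbd-concurrent-management-ops 20 client-disk-img0/vol-x318644f-0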

What will really help without being dangerous is keeping a map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700


On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:



On Monday, January 5, 2015, Chen, Xiaoxi  wrote:


When you shrink the RBD, most of the time is spent in
librbd/internal.cc::trim_image(); in this function the client will iterate over all
unnecessary objects (no matter whether they exist) and delete them.



So in this case, when Edwin shrinks his RBD from 650PB to 650GB,
there are [ (650PB * 1024TB/PB * 1024GB/TB - 650GB) * 1024MB/GB ] / 4MB/object ≈
174,482,880,000 objects that need to be deleted. That will definitely take a long time,
since the rbd client needs to send a delete request per object to the OSD, and the OSD needs to find
the object context and delete it (or discover it doesn't exist at all). The time needed to
trim an image is proportional to the size being trimmed.



Making another image of the correct size and copying your VM's file system to
the new image, then deleting the old one, will NOT help in general, simply
because deleting the old volume will take exactly the same time as shrinking:
they both need to call trim_image().



The solution in my mind is that we could provide a "--skip-trimming" flag to
skip the trimming. When the administrator is absolutely sure that no writes have
taken place in the area being shrunk (that is, no objects have been created in
that area), they can use this flag to skip the time-consuming
trimming.



What do you think?



That sounds like a good solution. Like doing "undo grow image"





From: Jake Young [mailto:jak3...@gmail.com]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

You could use 'rbd info <image>' to see the block_name_prefix; the object names
consist of <block_name_prefix>.<object_number>, so for
example, rb.0.ff53.3d1b58ba.e6ad should be the 0xe6ad'th object of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

  $ rbd info huge
 rbd image 'huge':
  size 1024 TB in 268435456 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.8a14.2ae8944a
  format: 1

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understands?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:



On Sunday, January 4, 2015, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com > wrote:

 Hi,

 If it's the only thing in your pool, you could try deleting the
 pool instead.

 I found that to be faster in my testing; I had created 500TB when
 I meant to create 500GB.

 Note for the devs: it would be nice if rbd create/resize would
 accept sizes with units (i.e. MB, GB, TB, PB, etc).




 On 2015-01-04 08:45, Edwin Peer wrote:

 Hi there,

 I did something stupid while growing an rbd image. I
accidentally
 mistook the units of the resize command for bytes instead of
 megabytes
 and grew an rbd image to 650PB instead of 650GB. This all
happened
 instantaneously enough, but trying to rectify the mistake is
 not going
 nearly as well.

 
 ganymede ~ # rbd resize --size 665600 --allow-shrink
 client-disk-img0/vol-x318644f-0
 Resizing image: 1% complete...
 

 It took a couple days before it started showing 1% complete
 and has
 been stuck on 1% for a couple more. At this rate, I should be
 able to
 shrink the image back to the intended size in about 2016.

   

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it know which OSDs to contact (thus saving a round trip
to the OSD), or only for the OSD to know which objects exist on it's
disk?


It's purely at the rbd level, so librbd caches it and maintains its
consistency. The idea is that since it's kept consistent, librbd can do
things like delete exactly the objects that exist without any
extra communication with the osds. Many things that were
O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires
a single writer, so this does not work for the rare case of e.g. ocfs2
on top of rbd, where there are multiple clients writing to the same
rbd image at once.

Josh


On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin  wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:


Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or a 1000 which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.



Yeah, a --skip-trimming option seems a bit dangerous. This trimming
actually is parallelized (10 ops at once by default, changeable via
--rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700



On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:




On Monday, January 5, 2015, Chen, Xiaoxi  wrote:



When you shrinking the RBD, most of the time was spent on
librbd/internal.cc::trim_image(), in this function, client will iterator
all
unnecessary objects(no matter whether it exists) and delete them.



So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
170,227,200 Objects need to be deleted.That will definitely take a long
time
since rbd client need to send a delete request to OSD, OSD need to find
out
the object context and delete(or doesn’t exist at all). The time needed
to
trim an image is ratio to the size needed to trim.



make another image of the correct size and copy your VM's file system to
the new image, then delete the old one will  NOT help in general, just
because delete the old volume will take exactly the same time as
shrinking ,
they both need to call trim_image().



The solution in my mind may be we can provide a “—skip-triming” flag to
skip the trimming. When the administrator absolutely sure there is no
written have taken place in the shrinking area(that means there is no
object
created in these area), they can use this flag to skip the time
consuming
trimming.



How do you think?




That sounds like a good solution. Like doing "undo grow image"





From: Jake Young [mailto:jak3...@gmail.com]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

You could use rbd info   to see the block_name_prefix, the
object name consist like .,  so for
example, rb.0.ff53.3d1b58ba.e6ad should be the th object
of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

   $ rbd info huge
  rbd image 'huge':
   size 1024 TB in 268435456 objects
   order 22 (4096 kB objects)
   block_name_prefix: rb.0.8a14.2ae8944a
   format: 1

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understands?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:




On Sunday, January 4, 2015, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com <mailto:6exbab4fy...@dyweni.com>> wrote:

  Hi,

  If its the only think in your pool, you could try deleting the
  pool instead.

  I found that to be

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 04:45 PM, Robert LeBlanc wrote:

Seems like a message bus would be nice. Each opener of an RBD could
subscribe for messages on the bus for that RBD. Anytime the map is
modified a message could be put on the bus to update the others. That
opens up a whole other can of worms though.


Rados' watch/notify functions are used as a limited form of this. That's
how rbd can notice that e.g. snapshots are created or disks are resized
online. With the object map code the idea is to funnel all management
operations like that through a single client that's locked the image
for write access (all handled automatically by librbd).

Using watch/notify to coordinate multi-client access would get complex
and inefficient pretty fast, and in general is best left to cephfs
rather than rbd.

Josh


On Jan 6, 2015 5:35 PM, "Josh Durgin" <josh.dur...@inktank.com> wrote:

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of
code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it know which OSDs to contact (thus saving a
round trip
to the OSD), or only for the OSD to know which objects exist on it's
disk?


It's purely at the rbd level, so librbd caches it and maintains its
consistency. The idea is that since it's kept consistent, librbd can do
things like delete exactly the objects that exist without any
extra communication with the osds. Many things that were
O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires
a single writer, so this does not work for the rare case of e.g. ocfs2
on top of rbd, where there are multiple clients writing to the same
rbd image at once.

Josh

    On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.dur...@inktank.com> wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:


Can't this be done in parallel? If the OSD doesn't have
an object then
it is a noop and should be pretty quick. The number of
outstanding
operations can be limited to 100 or a 1000 which would
provide a
balance between speed and performance impact if there is
data to be
trimmed. I'm not a big fan of a "--skip-trimming" option
as there is
the potential to leave some orphan objects that may not
be cleaned up
correctly.



Yeah, a --skip-trimming option seems a bit dangerous. This
trimming
actually is parallelized (10 ops at once by default,
changeable via
--rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a
map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing
images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700


On Tue, Jan 6, 2015 at 8:09 AM, Jake Young
            <jak3...@gmail.com> wrote:




On Monday, January 5, 2015, Chen, Xiaoxi
                <xiaoxi.c...@intel.com> wrote:



When you shrinking the RBD, most of the time was
spent on
                librbd/internal.cc::trim_image(), in this
function, client will iterator
all
unnecessary objects(no matter whether it exists)
and delete them.



So in this case,  when Edwin shrinking his RBD
from 650PB to 650GB,
there are[ (650PB * 1024GB/PB -650GB) *
1024MB/GB ] / 4MB/Object =
170,227,200 Objects need to be deleted.That will
definitely take a long
time
since rbd client need to send a delete request
to OSD, OSD need to find
out
the object context and delete(or doesn’t exist
at all). The time needed
to
trim an image is ratio to the size needed to trim.



make another image of the correct size and copy
your VM's file system to
the new image, then delete the old one will  NOT
 

Re: [ceph-users] Ceph Block device and Trim/Discard

2014-12-18 Thread Josh Durgin

On 12/18/2014 10:49 AM, Travis Rhoden wrote:

One question re: discard support for kRBD -- does it matter which format
the RBD is?  Format 1 and Format 2 are okay, or just for Format 2?


It shouldn't matter which format you use.
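
For what it's worth, a quick sketch of exercising discard on a mapped krbd
device (image name is hypothetical; it needs a kernel with krbd discard
support):

rbd map rbd/test-image
mkfs.xfs /dev/rbd0
mount -o discard /dev/rbd0 /mnt   # online discard on delete
fstrim /mnt                       # or trim periodically instead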

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Double-mounting of RBD

2014-12-17 Thread Josh Durgin

On 12/17/2014 03:49 PM, Gregory Farnum wrote:

On Wed, Dec 17, 2014 at 2:31 PM, McNamara, Bradley
 wrote:

I have a somewhat interesting scenario.  I have an RBD of 17TB formatted
using XFS.  I would like it accessible from two different hosts, one
mapped/mounted read-only, and one mapped/mounted as read-write.  Both are
shared using Samba 4.x.  One Samba server gives read-only access to the
world for the data.  The other gives read-write access to a very limited set
of users who occasionally need to add data.


However, when testing this, when changes are made to the read-write Samba
server the changes don’t seem to be seen by the read-only Samba server.  Is
there some file system caching going on that will eventually be flushed?



Am I living dangerously doing what I have set up?  I thought I would avoid
most/all potential file system corruption by making sure there is only one
read-write access method.  Thanks for any answers.


Well, you'll avoid corruption by only having one writer, but the other
reader is still caching data in-memory that will prevent it from
seeing the writes on the disk.
Plus I have no idea if mounting xfs read-only actually prevents it
from making any writes to the disk; I think some FSes will do stuff
like defragment internal data structures in that mode, maybe?
-Greg


FSes mounted read-only still do tend to do things like journal replay,
but since the block device is mapped read-only that won't be a problem
in this case.
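
For the record, a sketch of the read-only side of such a setup, assuming an rbd
CLI new enough to support --read-only and an XFS filesystem on the image (names
are hypothetical):

rbd map --read-only rbd/bigimage
# norecovery avoids any journal replay attempt on the read-only mapping
mount -o ro,norecovery /dev/rbd0 /export/ro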
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados crash in nova-compute

2014-10-24 Thread Josh Durgin

On 10/24/2014 08:21 AM, Xu (Simon) Chen wrote:

Hey folks,

I am trying to enable OpenStack to use RBD as image backend:
https://bugs.launchpad.net/nova/+bug/1226351

For some reason, nova-compute segfaults due to librados crash:

./log/SubsystemMap.h: In function 'bool
ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread
7f1b477fe700 time 2014-10-24 03:20:17.382769
./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
1: (()+0x42785) [0x7f1b4c4db785]
2: (ObjectCacher::flusher_entry()+0xfda) [0x7f1b4c53759a]
3: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f1b4c54a16d]
4: (()+0x6b50) [0x7f1b6ea93b50]
5: (clone()+0x6d) [0x7f1b6df3e0ed]
NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted

I feel that there is some concurrency issue, since this sometimes happen
before and sometimes after this line:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/rbd_utils.py#L208

Any idea what are the potential causes of the crash?

Thanks.
-Simon


This is http://tracker.ceph.com/issues/8912, fixed in the latest
firefly and dumpling releases.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD import slow

2014-09-25 Thread Josh Durgin

On 09/24/2014 04:57 PM, Brian Rak wrote:

I've been doing some testing of importing virtual machine images, and
I've found that 'rbd import' is at least 2x as slow as 'qemu-img
convert'.  Is there anything I can do to speed this process up?  I'd
like to use rbd import because it gives me a little additional flexibility.

My test setup was a 40960MB LVM volume, and I used the following two
commands:

rbd import /dev/lvmtest/testvol test
qemu-img convert /dev/lvmtest/testvol rbd:test/test

rbd import took 13 minutes, qemu-img took 5.

I'm at a loss to explain this, I would have expected rbd import to be
faster.

This is with ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)


rbd import was doing one synchronous I/O after another. Recently import
and export were parallelized according to
--rbd-concurrent-management-ops (default 10), which helps quite a bit.
This will be in giant.
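
Once you're on a release with the parallel import, a sketch of the same test
with more ops in flight (20 is just an example value):

rbd import --rbd-concurrent-management-ops 20 /dev/lvmtest/testvol test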

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about librbd io

2014-09-10 Thread Josh Durgin

On 09/09/2014 07:06 AM, yuelongguang wrote:

hi, josh.durgin:
I want to know how librbd launches io requests.
use case:
inside the vm, I use fio to test the rbd disk's io performance.
fio's parameters are bs=4k, direct io, qemu cache=none.
in this case, if librbd just sends what it gets from the vm (i.e. no
gather/scatter), is the ratio of io inside the vm : io at librbd : io at the osd
filestore = 1:1:1?


If the rbd image is not a clone, the io issued from the vm's block
driver will match the io issued by librbd. With caching disabled
as you have it, the io from the OSDs will be similar, with some
small amount extra for OSD bookkeeping.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multipart upload on ceph 0.8 doesn't work?

2014-07-07 Thread Josh Durgin

On 07/07/2014 05:41 AM, Patrycja Szabłowska wrote:

OK, the mystery is solved.

 From https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10368.html
"During a multi part upload you can't upload parts smaller than 5M"

I've tried to upload smaller chunks, like 10KB. I've changed chunk size
to 5MB and it works now.

It's a pity that the Ceph docs don't mention the limit (or I couldn't
find it anywhere). And that the error wasn't helpful at all.


Glad you figured it out. This is in the s3 docs [1], but the lack of
error message is a regression. I added a couple tickets for this:

http://tracker.ceph.com/issues/8764
http://tracker.ceph.com/issues/8766

Josh

[1] http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-agent failed to parse

2014-07-07 Thread Josh Durgin

On 07/04/2014 08:36 AM, Peter wrote:

i am having issues running radosgw-agent to sync data between two
radosgw zones. As far as i can tell both zones are running correctly.

My issue is when i run the radosgw-agent command:



radosgw-agent -v --src-access-key  --src-secret-key
 --dest-access-key  --dest-secret-key
 --src-zone us-master http://us-secondary.example.com:80


i get the following error:

DEBUG:boto:Using access key provided by client.
DEBUG:boto:Using secret key provided by client.
DEBUG:boto:StringToSign:
GET

Fri, 04 Jul 2014 15:25:53 GMT
/admin/config
DEBUG:boto:Signature:
AWS EA20YO07DA8JJJX7ZIPJ:WbykwyXu5m5IlbEsBzo8bKEGIzg=
DEBUG:boto:url = 'http://us-secondary.example.comhttp://us-secondary.example.com/admin/config'
params={}
headers={'Date': 'Fri, 04 Jul 2014 15:25:53 GMT', 'Content-Length': '0', 'Authorization': 'AWS EA20YO07DA8JJJX7ZIPJ:WbykwyXu5m5IlbEsBzo8bKEGIzg=', 'User-Agent': 'Boto/2.20.1 Python/2.7.6 Linux/3.13.0-24-generic'}
data=None
ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 269, in main
    region_map = client.get_region_map(dest_conn)
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 391, in get_region_map
    region_map = request(connection, 'get', 'admin/config')
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 153, in request
    result = handler(url, params=params, headers=request.headers, data=data)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 349, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 287, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 287, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 334, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "/usr/lib/python2.7/dist-packages/urllib3/util.py", line 390, in parse_url
    raise LocationParseError("Failed to parse: %s" % url)
LocationParseError: Failed to parse: Failed to parse: us-secondary.example.comhttp:


Is this a bug? Or is my setup wrong? I can navigate to
http://us-secondary.example.com/admin/config and it correctly outputs
zone details (see the output above).


It seems like an issue with your environment. What version of
radosgw-agent and which distro is this running on?

Are there any special characters in the access or secret keys that
might need to be escaped on the command line?


DEBUG:boto:url = 'http://us-secondary.example.comhttp://us-secondary.example.com/admin/config'

should the url be repeated like that?


No, and it's rather strange since it should be the url passed on the
command line, parsed, and with /admin/config added.

Could post the result of this run in a python interpreter:

import urlparse
result = urlparse.urlparse('http://us-secondary.example.com:80')
print result.hostname, result.port

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: CEPH Multitenancy and Data Isolation

2014-06-10 Thread Josh Durgin

On 06/10/2014 01:56 AM, Vilobh Meshram wrote:

How does CEPH guarantee data isolation for volumes which are not meant
to be shared in a Openstack tenant?

When used with OpenStack the data isolation is provided by the
Openstack level so that all users who are part of same tenant will be
able to access/share the volumes created by users in that tenant.
Consider a case where we have one pool named “Volumes” for all the
tenants. All the tenants use the same keyring to access the volumes in
the pool.

 1. How do we guarantee that one user can’t see the contents of the
volumes created by another user; if the volume is not meant to be
shared.


OpenStack users or tenants have no access to the keyring. Cinder tracks
volume ownership and checks permissions when a volume is attached, and
qemu prevents users from seeing anything outside of their vm, including 
the keyring.



 2. If someone malicious user gets the access to the keyring (which we
used as a authentication mechanism between the client/Openstack
and CEPH) how does CEPH guarantee that the malicious user can’t
access the volumes in that pool.


The keyring gives a user access to the cluster. If someone has a valid 
keyring, Ceph treats them as a valid user, since there is no information

to say otherwise. Ceph can't tell whether the user of a keyring is
malicious.


 3. Lets say our Cinder services are running on the Openstack API
node. How does the CEPH keyring information gets transferred from
the API node to the Hypervisor node ? Does this keyring passed
through message queue? If yes can the malicious user have a look
at the message queue and grab this keyring information ? If not
then how does it reach from the API node to the Hypervisor node.


The keyring is static and configured by the administrator on the nodes
running cinder-volume and nova-compute. It's not sent over the network,
and is not needed by nova or cinder api nodes.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Export-Diff With Children Snapshots

2014-06-10 Thread Josh Durgin
On Fri, 6 Jun 2014 17:34:56 -0700
Tyler Wilson  wrote:

> Hey All,
> 
> Simple question, does 'rbd export-diff' work with children snapshot
> aka;
> 
> root:~# rbd children images/03cb46f7-64ab-4f47-bd41-e01ced45f0b4@snap
> compute/2b65c0b9-51c3-4ab1-bc3c-6b734cc796b8_disk
> compute/54f3b23c-facf-4a23-9eaa-9d221ddb7208_disk
> compute/592065d1-264e-4f7d-8504-011c2ea3bce3_disk
> compute/9ce6d6af-c4df-442c-b433-be2bb1cef9f6_disk
> compute/f0714add-683a-4ba2-a6f3-ded7dbf193eb_disk
> 
> Could I export a diff from that image snapshot vs one of the compute
> disks?
> 
> Thanks for your help!

The rbd diff-related commands compare points in time of a single
image. Since children are identical to their parent when they're cloned,
if you created a snapshot right after it was cloned, you could export
the diff between the used child and the parent. Something like:

rbd clone parent@snap child        # child is identical to parent@snap here
rbd snap create child@base         # record that starting point

# ... child is written to ...
rbd snap create child@changed
rbd export-diff --from-snap base child@changed child_changes.diff

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Quota Management in CEPH

2014-05-21 Thread Josh Durgin

On 05/21/2014 03:29 PM, Vilobh Meshram wrote:

Hi All,

I want to understand on how do CEPH users go about Quota Management when
CEPH is used with Openstack.

 1. Is it recommended to use a common pool, say “volumes”, shared by all
tenants for creating volumes? In this case a common keyring,
ceph.common.keyring, will be shared across all the tenants/the common
volume pool.


Yes, using a common pool is recommended. More pools take up more cpu and
memory on the osds, since placement groups (shards of pools) are the 
unit of recovery. Having a pool per tenant would be a scaling issue.


There is a further level of division in rados called a 'namespace',
which can provide finer-grained cephx security within a pool, but
rbd does not support it yet, and as it stands it would not be useful
for quotas [1].


 2. Or is it recommended to use a pool per tenant, say a “volume1” pool
for tenant1, a “volume2” pool for tenant2? In this case we will
have a keyring per volume pool/tenant, i.e. keyring 1 for
volume1/tenant1 and so on.

Considering both of these cases, how do we guarantee that we enforce a
quota for each user inside a tenant, say a quota of 5 volumes to be
created by each user?


When using OpenStack, Cinder does the quota management for volumes based
on its database, and can limit total space, number of volumes and
number of snapshots [2]. RBD is entirely unaware of OpenStack tenants.
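
For example (a hypothetical sketch; the numbers and tenant id are
illustrative), quotas are set per tenant through the cinder CLI:

# limit a tenant to 5 volumes and 500 GB of block storage (per tenant, not per user)
cinder quota-update --volumes 5 --gigabytes 500 <tenant-id>
cinder quota-show <tenant-id>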

Josh

[1] http://wiki.ceph.com/Planning/Sideboard/rbd%3A_namespace_support
[2] http://docs.openstack.org/user-guide-admin/content/cli_set_quotas.html#cli_set_block_storage_quotas



Re: [ceph-users] Data still in OSD directories after removing

2014-05-21 Thread Josh Durgin

On 05/21/2014 03:03 PM, Olivier Bonvalet wrote:

Le mercredi 21 mai 2014 à 08:20 -0700, Sage Weil a écrit :

You're certain that that is the correct prefix for the rbd image you
removed?  Do you see the objects listed when you do 'rados -p rbd ls - |
grep '?


I'm pretty sure, yes: since I didn't see much space freed by the
"rbd snap purge" command, I looked at the RBD prefix before doing the
"rbd rm" (it's not the first time I've seen this problem, but previous
times I didn't have the RBD prefix, so I wasn't able to check).

So:
- "rados -p sas3copies ls - | grep rb.0.14bfb5a.238e1f29" return nothing
at all
- # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
  error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.0002f026: No such
file or directory
- # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.
  error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.: No such
file or directory
- # ls -al 
/var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9
-rw-r--r-- 1 root root 4194304 oct.   8  2013 
/var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9



If the objects really are orphaned, the way to clean them up is via 'rados
-p rbd rm '.  I'd like to get to the bottom of how they ended
up that way first, though!


I suppose the problem came from me, by doing CTRL+C while running "rbd snap
purge $IMG".
"rados rm -p sas3copies rb.0.14bfb5a.238e1f29.0002f026" doesn't remove
those files, and just answers with "No such file or directory".


Those files are all for snapshots, which are removed by the osds
asynchronously in a process called 'snap trimming'. There's no
way to directly remove them via rados.

Since you stopped 'rbd snap purge' partway through, it may
have removed the reference to the snapshot before removing
the snapshot itself.

You can get a list of snapshot ids for the remaining objects
via the 'rados listsnaps' command, and use
rados_ioctx_selfmanaged_snap_remove() (no convenient wrapper
unfortunately) on each of those snapshot ids to be sure they are all
scheduled for asynchronous deletion.
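
For example, something like this (using the pool and object name from above)
should show the remaining snapshot ids:

# list the clones/snapshot ids still referenced by one of the leftover objects
rados -p sas3copies listsnaps rb.0.14bfb5a.238e1f29.0002f026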

Josh


Re: [ceph-users] ' rbd username specified but secret not found' error, virsh live migration on rbd

2014-05-19 Thread Josh Durgin

On 05/19/2014 01:48 AM, JinHwan Hwang wrote:

I have been trying to do live migration on vm which is running on rbd.
But so far, they only give me ' internal error: rbd username 'libvirt'
specified but secret not found' when i do live migration.

ceph-admin : source host
host : destination host

root@main:/home/ceph-admin# virsh migrate --live rbd1-1
  qemu+ssh://host/system
error: internal error: rbd username 'libvirt' specified but secret not found

This is the rbd1-1 vm dump. It worked for running rbd1-1.
[the libvirt domain XML for the rbd disk was stripped by the list archive]

Is '...' not sufficient for doing live migration? I have also tried
setting the same secret in virsh on both the source and destination hosts
(so both have uuid='b34526f2-8d32-ed5d-3153-e90d011dd37e'), but it didn't
work. I've followed these instructions ('http://ceph.com/docs/master/rbd/libvirt/')
and this is the only secret I know of so far. Am I missing something?
If so, where should I put the missing 'secrets'?
Thanks in advance for any help.


Did you set the value of the secret on the second node?
Is it defined with ephemeral='no' and private='no'?
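
If not, something along these lines on the destination host is usually needed
(a sketch assuming the uuid from your domain XML and a cephx user named
client.libvirt):

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>b34526f2-8d32-ed5d-3153-e90d011dd37e</uuid>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
EOF
virsh secret-define --file secret.xml
# set the value to the cephx key so qemu on this host can authenticate too
virsh secret-set-value --secret b34526f2-8d32-ed5d-3153-e90d011dd37e \
  --base64 $(ceph auth get-key client.libvirt)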

Josh


Re: [ceph-users] Info firefly qemu rbd

2014-05-08 Thread Josh Durgin

On 05/08/2014 09:42 AM, Federico Iezzi wrote:

Hi guys,

First of all congratulations on Firefly Release!
IMHO I think that this release is a huge step for Ceph Project!

Just for fun, this morning I upgraded one of my staging Ceph clusters used
by an OpenStack Havana installation (Canonical cloud archive, kernel
3.11, Ubuntu 12.04).
I had one issue during volume attach; see the libvirt log below.
With the same environment and Emperor I don’t have any issue (so there
isn't any network issue).

Do you think it is my fault during the upgrade procedure, or a bug?

I tried to use the admin keyring and I got an “open disk image file failed”.

Thanks a lot,
Regards,
Federico






2e5991bf-b76b-48b1-b271-cf13406e9825] libvirtError: Timed out during
operation: cannot acquire state change lock


This is the sign of a bug in libvirt. Try restarting libvirt-bin.
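
On Ubuntu 12.04 that would be something like:

service libvirt-bin restart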

Josh


Re: [ceph-users] [Qemu-devel] qemu + rbd block driver with cache=writeback, is live migration safe ?

2014-04-18 Thread Josh Durgin

On 04/18/2014 10:47 AM, Alexandre DERUMIER wrote:

Thanks Kevin for the full explanation!


cache.writeback=on,cache.direct=off,cache.no-flush=off


I didn't know about the cache options split, thanks.



rbd does, to my knowledge, not use the kernel page cache, so we're safe
from that part. It does however honour the cache.direct flag when it
decides whether to use its own cache. rbd doesn't implement
bdrv_invalidate_cache() in order to clear that cache when migration
completes.


Maybe some ceph devs could comment about this ?


That's correct, librbd uses its own in-memory cache instead of
the kernel page cache, and it honors flush requests. Furthermore,
librbd keeps its own metadata synchronized among different
clients via the ceph cluster (this is information like image
size, which rbd snapshots exist, and rbd parent image).

So as I understand it live migration with raw format images on
rbd is safe even with cache.writeback=true and cache.direct=false
(i.e. cache=writeback) because:

1) rbd metadata is synchronized internally

2) the source vm has any rbd caches flushed by vm_stop() before
   the destination starts

3) rbd does not read anything into its cache before the
   destination starts

4) raw format images have no extra metadata that needs invalidation

If librbd populated its cache when the disk was opened, the rbd driver
would need to implement bdrv_invalidate(), but since it does not, it's
unnecessary.
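
As a concrete illustration (a hypothetical command line; the pool, image and
client names are made up):

# raw rbd image attached with cache=writeback ...
qemu-system-x86_64 ... -drive file=rbd:rbd/vm1:id=cinder,format=raw,if=virtio,cache=writeback
# ... which, per the explanation above, is shorthand for:
#   cache.writeback=on,cache.direct=off,cache.no-flush=off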

Is this correct Kevin?

Josh


No, such a QMP command doesn't exist, though it would be possible to
implement (for toggling cache.direct, that is; cache.writeback is guest
visible and can therefore only be toggled by the guest)


yes, that's what I have in mind: toggling cache.direct=on before migration,
then disabling it after the migration.



- Mail original -

De: "Kevin Wolf" 
À: "Alexandre DERUMIER" 
Cc: "qemu-devel" , ceph-users@lists.ceph.com
Envoyé: Mardi 15 Avril 2014 11:36:22
Objet: Re: [Qemu-devel] qemu + rbd block driver with cache=writeback, is live 
migration safe ?

On 12.04.2014 at 17:01, Alexandre DERUMIER wrote:

Hello,

I know that qemu live migration of disks with cache=writeback is not safe
with storage like nfs, iscsi, ...

Is it also true with rbd?


First of all, in order to avoid misunderstandings, let's be clear that
there are three dimensions for the cache configuration of qemu block
devices. In current versions, they are separately configurable and
cache=writeback really expands to:

cache.writeback=on,cache.direct=off,cache.no-flush=off

The problematic part of this for live migration is generally not
cache.writeback being enabled, but cache.direct being disabled.

The reason for that is that the destination host will open the image
file immediately, because it needs things like the image size to
correctly initialise the emulated disk devices. Now during the migration
the source keeps working on the image, so if qemu read some metadata on
the destination host, that metadata may be stale by the time that the
migration actually completes.

In order to solve this problem, qemu calls bdrv_invalidate_cache(),
which throws away everything that is cached in qemu so that it is reread
from the image. However, this is ineffective if there are other caches
having stale data, such as the kernel page cache. cache.direct bypasses
the kernel page cache, so this is why it's important in many cases.

rbd does, to my knowledge, not use the kernel page cache, so we're safe
from that part. It does however honour the cache.direct flag when it
decides whether to use its own cache. rbd doesn't implement
bdrv_invalidate_cache() in order to clear that cache when migration
completes.

So the answer to your original question is that it's probably _not_ safe
to use live migration with rbd and cache.direct=off.


If yes, it is possible to disable manually writeback online with qmp ?


No, such a QMP command doesn't exist, though it would be possible to
implement (for toggling cache.direct, that is; cache.writeback is guest
visible and can therefore only be toggled by the guest).

Kevin




Re: [ceph-users] cannot remove rbd image, snapshot busy

2014-04-03 Thread Josh Durgin

On 04/03/2014 03:36 PM, Jonathan Gowar wrote:

On Tue, 2014-03-04 at 14:05 +0800, YIP Wai Peng wrote:

Dear all,


I have an rbd image that I can't delete. It contains a snapshot that is
"busy".


# rbd --pool openstack-images rm
2383ba62-b7ab-4964-a776-fb3f3723aabe-deleted

2014-03-04 14:02:04.062099 7f340b2d5760 -1 librbd: image has snapshots
- not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge'
before the image can be removed.


# rbd --pool openstack-images snap purge
2383ba62-b7ab-4964-a776-fb3f3723aabe-deleted
Removing all snapshots: 0% complete...failed.
rbd: removing snaps failed: (16) Device or resource busy
2014-03-04 14:03:27.311437 7f1991622760 -1 librbd: removing snapshot
from header failed: (16) Device or resource busy


Hi,

   The same is happening for me.  Did you find a solution?

Regards,
Jon


Are the snapshots protected? They need to be unprotected before they
can be deleted.
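
Roughly (the snapshot name is a placeholder; note that unprotect will fail if
clones of the snapshot still exist, so those would have to be flattened or
removed first):

rbd --pool openstack-images snap ls 2383ba62-b7ab-4964-a776-fb3f3723aabe-deleted
rbd --pool openstack-images snap unprotect 2383ba62-b7ab-4964-a776-fb3f3723aabe-deleted@<snapshot-name>
rbd --pool openstack-images snap purge 2383ba62-b7ab-4964-a776-fb3f3723aabe-deleted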

Josh


Re: [ceph-users] RBD snapshots aware of CRUSH map?

2014-03-31 Thread Josh Durgin

On 03/31/2014 03:03 PM, Brendan Moloney wrote:

Hi,

I was wondering if RBD snapshots use the CRUSH map to distribute
snapshot data and live data on different failure domains? If not, would
it be feasible in the future?


Currently rbd snapshots and live objects are stored in the same place,
since that is how rados snapshots are implemented. The simplest way to
store them separately would probably be an implementation of rbd
snapshots on top of regular rados objects, without using rados
snapshots. This would be a large project, but it would be simpler than
another implementation of rados snapshots since it would not require
osd changes.

Josh


Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU

2014-03-26 Thread Josh Durgin

On 03/26/2014 05:50 PM, Craig Lewis wrote:

I made a typo in my timeline too.

It should read:
At 14:14:00, I started OSD 4, and waited for ceph -w to stabilize. CPU
usage was normal.
At 14:15:10, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It
returned successfully.
At 14:16:00, I started OSD 8, and waited for ceph -w to stabilize.  CPU
usage started out normal, but went to 100% before 14:16:40.


The osd.8 log shows it doing some deep scrubbing here. Perhaps that is
what caused your earlier issues with CPU usage?


At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.
regions list hung, and I killed it.
At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It
returned successfully.
At 14:19:10, I stopped ceph-osd id=4.


Since you've got the noout flag set, when osd.8 goes down any objects
for which osd.8 is the primary will not be readable. Since ceph reads
from primaries, and the noout flag prevents another osd from being
selected, which would happen if osd.8 were marked out, these objects
(which apparently happen to include some needed for regions list or
regionmap get) are inaccessible.
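
If you need those objects while osd.8 is down, either clear the flag or mark
the osd out explicitly so its placement groups get remapped to other osds
(illustrative commands):

ceph osd unset noout   # allow down osds to be marked out again
ceph osd out 8         # or just mark osd.8 out so another osd serves its pgs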

Josh


Some newlines were added.  The only material change is the last line,
changing to id=4.

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 



On 3/26/14 15:04 , Craig Lewis wrote:

At 14:14:00, I started OSD 4, and waited for ceph -w to stabilize.  CPU
usage was normal.
At 14:15:10, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It
returned successfully.
At 14:16:00, I started OSD 8, and waited for ceph -w to stabilize.
CPU usage started out normal, but went to 100% before 14:16:40.
At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.
regions list hung, and I killed it.
At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It
returned successfully.
At 14:19:10, I stopped ceph-osd id=8.






Re: [ceph-users] recover partially lost rbd

2014-03-26 Thread Josh Durgin

On 03/26/2014 02:26 PM, gustavo panizzo  wrote:

hello
 one of our OSDs crashed; unfortunately it had an unreplicated pool
(size = 1).
I want to recover as much as possible using the raw files. I've tried using
this tool
https://raw.githubusercontent.com/smmoore/ceph/4eb806fdcc02632bf4ac60f302c4e1ee3bef6363/rbd_restore.sh
but I can't find any files named rb.0.1938.f8e1ca65f7a.*


# rbd -p cinder-simple info volume-e1dbb182-24b1-4729-a398-e494fe037678
rbd image 'volume-e1dbb182-24b1-4729-a398-e494fe037678':
 size 300 GB in 76800 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.1938f8e1ca65f7a
 format: 2
 features: layering

I bet the on-disk structure has changed; can anybody shed some light on how
it is now? I'm using plain XFS, ceph 0.72.2 (emperor).


That script refers to format 1 rbd image object names (rb.x.y.z.*).
Just use the block_name_prefix reported by rbd info instead, in this
case rbd_data.1938f8e1ca65f7a, and it should work fine.
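
For example, something like this should turn up the raw files on the
surviving OSDs (a sketch; the filestore escapes the underscore in object
names on disk, hence the loose pattern):

find /var/lib/ceph/osd/ceph-*/current -name 'rbd*data.1938f8e1ca65f7a.*'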

Josh


Re: [ceph-users] Backward compatibility of librados in Firefly

2014-03-26 Thread Josh Durgin

On 03/25/2014 12:18 AM, Ray Lv wrote:

Hi there,

We have a case that needs the C APIs for compound operations in librados. These
interfaces are only exported as C APIs starting with the Firefly release, such as
https://github.com/ceph/ceph/blob/firefly/src/include/rados/librados.h#L1834.
But our RADOS deployment will stick to the Dumpling release right now.
So here are a few questions on that:

  * Is the librados in Firefly backward compatible with RADOS servers in
Dumpling by design?


Yes. ABI compatibility is always preserved. There are some rare issues
that require minor semantic changes, but they are few and documented
in the release notes. In this case the most important change is that with
firefly osds and firefly librados, the individual op return codes in
compound operations are filled in.



  * Is Firefly librados + Dumpling RADOS a configuration validated by
the community?


Firefly isn't out yet, but we are running upgrade tests that include
this kind of configuration.


We tried to backport these C APIs from Firefly to Dumpling and noticed
there are a couple of commits on it.

  * Do you guys know a tracker issue or Blueprint that covers the C APIs
of compound operation in Firefly?


The related commits are:

https://github.com/ceph/ceph/commit/4425f9edaa9f4108bd4a693760c5d7be4359ee4a

and most of this pull request:

https://github.com/ceph/ceph/pull/1256


  * And moreover, does it make sense to backport these C APIs from
Firefly to Dumpling?


Backporting those will likely be more trouble than it's worth.
I'd suggest developing against librados from the firefly branch.
If you want to check individual op return codes, use firefly osds as
well, otherwise dumpling osds will work just fine.

Josh


Re: [ceph-users] RBD clone for OpenStack Nova ephemeral volumes

2014-03-21 Thread Josh Durgin

On 03/20/2014 07:03 PM, Dmitry Borodaenko wrote:

On Thu, Mar 20, 2014 at 3:43 PM, Josh Durgin  wrote:

On 03/20/2014 02:07 PM, Dmitry Borodaenko wrote:

The patch series that implemented clone operation for RBD backed
ephemeral volumes in Nova did not make it into Icehouse. We have tried
our best to help it land, but it was ultimately rejected. Furthermore,
an additional requirement was imposed to make this patch series
dependent on full support of Glance API v2 across Nova (due to its
dependency on direct_url that was introduced in v2).

You can find the most recent discussion of this patch series in the
FFE (feature freeze exception) thread on openstack-dev ML:
http://lists.openstack.org/pipermail/openstack-dev/2014-March/029127.html

As I explained in that thread, I believe this feature is essential for
using Ceph as a storage backend for Nova, so I'm going to try and keep
it alive outside of OpenStack mainline until it is allowed to land.

I have created rbd-ephemeral-clone branch in my nova repo fork on GitHub:
https://github.com/angdraug/nova/tree/rbd-ephemeral-clone

I will keep it rebased over nova master, and will create an
rbd-ephemeral-clone-stable-icehouse to track the same patch series
over nova stable/icehouse once it's branched. I also plan to make sure
that this patch series is included in Mirantis OpenStack 5.0 which
will be based on Icehouse.

If you're interested in this feature, please review and test. Bug
reports and patches are welcome, as long as their scope is limited to
this patch series and is not applicable for mainline OpenStack.


Thanks for taking this on Dmitry! Having rebased those patches many
times during icehouse, I can tell you it's often not trivial.


Indeed, I get conflicts every day lately, even in the current
bugfixing stage of the OpenStack release cycle. I have a feeling it
will not get easier when Icehouse is out and Juno is in full swing.


Do you think the imagehandler-based approach is best for Juno? I'm
leaning towards the older way [1] for simplicity of review, and to
avoid using glance's v2 api by default.
[1] https://review.openstack.org/#/c/46879/


Excellent question, I have thought long and hard about this. In
retrospect, requiring this change to depend on the imagehandler patch
back in December 2013 has proven to have been a poor decision.
Unfortunately, now that it's done, porting your original patch from
Havana to Icehouse is more work than keeping the new patch series up
to date with Icehouse, at least short term. Especially if we decide to
keep the rbd_utils refactoring, which I've grown to like.

As far as I understand, your original code made use of the same v2 api
call even before it was rebased over imagehandler patch:
https://github.com/jdurgin/nova/blob/8e4594123b65ddf47e682876373bca6171f4a6f5/nova/image/glance.py#L304

If I read this right, imagehandler doesn't create the dependency on v2
api, the only reason it caused a problem was because it exposed the
output of the same Glance API call to a code path that assumed a v1
data structure. If so, decoupling rbd clone patch from imagehandler
will not help lift the full Glance API v2 support requirement, that v2
api call will still be there.

Also, there's always a chance that imagehandler lands in Juno. If it
does, we'd be forced to dust off the imagehandler based patch series
again, and the effort spent on maintaining the old patch would be
wasted.

Given all that, and without making any assumptions about stability of
the imagehandler patch in its current state, I'm leaning towards
keeping it. If you think it's likely that it will cause us more
problems than the Glance API v2 issue, or if you disagree with my
analysis of that issue, please tell.


My impression was that full glance v2 support was more of an issue
with the imagehandler approach because it's used by default there,
while the earlier approach only uses glance v2 when rbd is enabled.


I doubt that full support for
v2 will land very fast in nova, although I'd be happy to be proven wrong.


I'm sceptical about this, too. That's why right now my first priority
is making sure this patch is usable and stable with Icehouse.
Post-Icehouse, we'll have to see where glance v2 support in nova goes,
if anywhere at all. Not much point making plans when we can't even
tell if we'll have to rewrite this patch yet again for Juno.


Sounds good. We can discuss more with nova folks once Juno opens,
since we'll need to go through the new blueprint approval process
anyway.

Josh


Re: [ceph-users] RBD clone for OpenStack Nova ephemeral volumes

2014-03-20 Thread Josh Durgin

On 03/20/2014 02:07 PM, Dmitry Borodaenko wrote:

The patch series that implemented clone operation for RBD backed
ephemeral volumes in Nova did not make it into Icehouse. We have tried
our best to help it land, but it was ultimately rejected. Furthermore,
an additional requirement was imposed to make this patch series
dependent on full support of Glance API v2 across Nova (due to its
dependency on direct_url that was introduced in v2).

You can find the most recent discussion of this patch series in the
FFE (feature freeze exception) thread on openstack-dev ML:
http://lists.openstack.org/pipermail/openstack-dev/2014-March/029127.html

As I explained in that thread, I believe this feature is essential for
using Ceph as a storage backend for Nova, so I'm going to try and keep
it alive outside of OpenStack mainline until it is allowed to land.

I have created rbd-ephemeral-clone branch in my nova repo fork on GitHub:
https://github.com/angdraug/nova/tree/rbd-ephemeral-clone

I will keep it rebased over nova master, and will create an
rbd-ephemeral-clone-stable-icehouse to track the same patch series
over nova stable/icehouse once it's branched. I also plan to make sure
that this patch series is included in Mirantis OpenStack 5.0 which
will be based on Icehouse.

If you're interested in this feature, please review and test. Bug
reports and patches are welcome, as long as their scope is limited to
this patch series and is not applicable for mainline OpenStack.


Thanks for taking this on Dmitry! Having rebased those patches many
times during icehouse, I can tell you it's often not trivial.

Do you think the imagehandler-based approach is best for Juno? I'm
leaning towards the older way [1] for simplicity of review, and to
avoid using glance's v2 api by default. I doubt that full support for
v2 will land very fast in nova, although I'd be happy to be proven
wrong.

Josh

[1] https://review.openstack.org/#/c/46879/


Re: [ceph-users] RGW Replication

2014-02-05 Thread Josh Durgin

On 02/05/2014 01:23 PM, Craig Lewis wrote:


On 2/4/14 20:02 , Josh Durgin wrote:


From the log it looks like you're hitting the default maximum number of
entries to be processed at once per shard. This was intended to prevent
one really busy shard from blocking progress on syncing other shards,
since the remainder will be synced the next time the shard is processed.
Perhaps the default is too low though, or the idea should be scrapped
altogether since you can sync other shards in parallel.

For your particular usage, since you're updating the same few buckets,
the max entries limit is hit constantly. You can increase it with
max-entries: 100 in the config file or --max-entries 1000 on
the command line.

Josh


This doesn't appear to have any effect:
root@ceph1c:/etc/init.d# grep max-entries /etc/ceph/radosgw-agent.conf
max-entries: 100
root@ceph1c:/etc/init.d# egrep 'has [0-9]+ entries after'
/var/log/ceph/radosgw-agent.log  | tail -1
2014-02-05T13:11:03.915 2743:INFO:radosgw_agent.worker:bucket instance
"live-2:us-west-1.35026898.2" has 1000 entries after "0206789.410535.3"

Neither does --max-entries 1000:
root@ceph1c:/etc/init.d# ps auxww | grep radosgw-agent | grep max-entries
root 19710  6.0  0.0  74492 18708 pts/3S13:22 0:00
/usr/bin/python /usr/bin/radosgw-agent --incremental-sync-delay=10
--max-entries 1000 -c /etc/ceph/radosgw-agent.conf
root@ceph1c:/etc/init.d# egrep 'has [0-9]+ entries after'
/var/log/ceph/radosgw-agent.us-west-1.us-central-1.log  | tail -1
2014-02-05T13:22:58.577 21626:INFO:radosgw_agent.worker:bucket instance
"live-2:us-west-1.35026898.2" has 1000 entries after "0207788.411542.2"


I guess I'll look into that too, since I'll be in that area of the code.


It seems to be a hardcoded limit on the server side to prevent a single
osd operation from taking too long (see cls_log_list() in ceph.git
src/cls/cls_log.cc).

This should probably be fixed in radosgw, but you could work around it
with a loop in radosgw-agent.

Josh


Re: [ceph-users] RGW Replication

2014-02-04 Thread Josh Durgin

On 02/04/2014 07:44 PM, Craig Lewis wrote:



On 2/4/14 17:06 , Craig Lewis wrote:


On 2/4/14 14:43 , Yehuda Sadeh wrote:

Does it ever catch up? You mentioned before that most of the writes
went to the same two buckets, so that's probably one of them. Note
that writes to the same bucket are being handled in-order by the
agent.

Yehuda


... I think so.  This is what my graph looks like:

I think it's still catching up, despite the graph.  radosgw-admin bucket
stats shows more objects are being created in the slave zone than the
master zone during the same time period.  Not a large number though.  If
it is catching up, it'll take months at this rate.  It's not entirely
clear, because the slave zone trails the master, but it's been pretty
consistent for the past hour.

It's not doing all of the missing objects though.  I have a bucket that
I stopped writing to a few days ago.  The slave is missing ~500k
objects, and none have been added to the slave in the past hour.



You can run

$ radosgw-admin bilog list --bucket= --marker=

E.g.,

$ radosgw-admin bilog list --bucket=live-2 --marker=0127871.328492.2

The entries there should have timestamp info.


Thanks!  I'll see what I can figure out.


From the log it looks like you're hitting the default maximum number of
entries to be processed at once per shard. This was intended to prevent
one really busy shard from blocking progress on syncing other shards,
since the remainder will be synced the next time the shard is processed.
Perhaps the default is too low though, or the idea should be scrapped 
altogether since you can sync other shards in parallel.


For your particular usage, since you're updating the same few buckets,
the max entries limit is hit constantly. You can increase it with
max-entries: 100 in the config file or --max-entries 1000 on the 
command line.


Josh


Re: [ceph-users] openstack -- does it use "copy-on-write"?

2014-01-08 Thread Josh Durgin

On 01/08/2014 11:07 AM, Gautam Saxena wrote:

When booting an image from Openstack in which CEPH is the back-end for
both volumes and images, I'm noticing that it takes about ~10 minutes
during the "spawning" phase -- I believe Openstack is making a fully
copy of the 30 GB Windows image. Shouldn't it be a "copy-on-write" image
and therefore take only a few seconds to spawn? (The other reason I
think it's copying the whole 30 GB is that I monitored ceph using "ceph
-w" and I saw the data volume size increase.)

I've configured OpenStack to use CEPH. (It's virtually identical to the
documentation, except that the glance api is version 1.0 and I've not
configured CEPH backup service.)


You need to enable glance's v2 api. v1 doesn't expose the image location
so cinder can't clone it.
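
A sketch of the relevant settings (option names as used in the glance/cinder
docs of that era; adjust to your deployment):

# glance-api.conf
enable_v2_api = True
show_image_direct_url = True

# cinder.conf
glance_api_version = 2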


Re: [ceph-users] snapshot atomicity

2014-01-03 Thread Josh Durgin

On 01/02/2014 10:51 PM, James Harper wrote:

I've not used ceph snapshots before. The documentation says that the rbd device 
should not be in use before creating a snapshot. Does this mean that creating a 
snapshot is not an atomic operation? I'm happy with a crash consistent 
filesystem if that's all the warning is about.


It's atomic, the warning is just that it's crash consistent, not
application-level consistent.
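
If you do want application-level consistency, the usual approach is to
quiesce inside the guest around the snapshot, e.g. (a sketch; the mount point
and image names are made up):

# inside the guest
fsfreeze -f /mnt/data
# from a ceph client with access to the image
rbd snap create mypool/myimage@consistent
# inside the guest again
fsfreeze -u /mnt/data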


If it is atomic, can you create multiple snapshots as an atomic operation? The 
use case for this would be a database spread across multiple volumes, eg 
database on one rbd, logfiles on another.


No, but now that you mention it this would be technically pretty
simple to implement. If multiple rbds referred to the same place to get
their snapshot context, they could all be snapshotted atomically.

Josh


Re: [ceph-users] vm fs corrupt after pgs stuck

2014-01-03 Thread Josh Durgin

On 01/02/2014 01:40 PM, James Harper wrote:


I just had to restore an ms exchange database after a ceph hiccup (no actual
data lost - Exchange is very good like that with its no loss restore!). The
order of events went something like:

. Loss of connection on osd to the cluster network (public network was okay)
. pgs reported stuck
. stopped osd on the bad server
. resolved network problem
. restarted osd on the bad server
. noticed that the vm running exchange had hung
. rebooted and vm did a chkdsk automatically
. exchange refused to mount the main mailbox store

I'm not using rbd caching or anything, so for ntfs to lose files like that means
something fairly nasty happened. My best guess is that the loss of
connectivity and function while ceph was figuring out what was going on
meant that windows IO was frozen and started timing out, but I still can't see
how that could result in corruption.


NTFS may have gotten confused if some I/Os completed fine but others
timed out. It looks like ntfs journals metadata, but not data, so it
could lose data not written out yet after this kind of failure,
assuming it stops doing I/O after some timeouts are hit, so it's
similar to a sudden power loss. If the application was not doing the
windows equivalent of O_SYNC it could still lose writes. I'm not too
familiar with windows, but perhaps there's a way to configure disk
timeout behavior or NTFS writeback.


Any suggestions on how I could avoid this situation in the future would be
greatly appreciated!



Forgot to mention. This has also happened once previously when the OOM killer 
targeted ceph-osd.


If this caused I/O timeouts, it would make sense. If you can't adjust
the guest timeouts, you might want to decrease the ceph timeouts for
noticing and marking out osds with network or other issues.
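
For example, something like this in ceph.conf (values are illustrative, not
recommendations):

[global]
    mon osd down out interval = 120   # mark a down osd out sooner
    osd heartbeat grace = 10          # report unresponsive osds down sooner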

Josh


Re: [ceph-users] Ephemeral RBD with Havana and Dumpling

2013-12-06 Thread Josh Durgin

On 12/05/2013 02:37 PM, Dmitry Borodaenko wrote:

Josh,

On Tue, Nov 19, 2013 at 4:24 PM, Josh Durgin  wrote:

I hope I can release or push commits to this branch containing live-migration,
the incorrect filesystem size fix and ceph-snapshot support in a few days.


Can't wait to see this patch! Are you getting rid of the shared
storage requirement for live-migration?


Yes, that's what Haomai's patch will fix for rbd-based ephemeral
volumes (bug https://bugs.launchpad.net/nova/+bug/1250751).


We've got a version of a Nova patch that makes live migrations work
for non volume-backed instances, and hopefully addresses the concerns
raised in code review in https://review.openstack.org/56527, along
with a bunch of small bugfixes, e.g. missing max_size parameter in
direct_fetch, and a fix for http://tracker.ceph.com/issues/6693. I
have submitted it as a pull request to your nova fork on GitHub:

https://github.com/jdurgin/nova/pull/1


Thanks!


Our changes depend on the rest of commits on your havana-ephemeral-rbd
branch, and the whole patchset is now at 7 commits, which is going to
be rather tedious to submit to the OpenStack Gerrit as a series of
dependent changes. Do you think we should keep the current commit
history in its current form, or would it be easier to squash it down
to a more manageable number of patches?


As discussed on irc yesterday, most of these are submitted to icehouse
already in slightly different form, since this branch is based on
stable/havana.

I'd prefer to keep the commits small and self contained in this branch
at least. If it takes too long to get them upstream, I'm fine with
having them squashed for faster upstream review.

Josh


