Re: [ceph-users] Ceph Day Sunnyvale Presentations

2016-04-12 Thread Shinobu Kinjo
Alexandre,

Based on a discussion with them at Ceph Day in Tokyo, Japan, they have frozen
their own fork of the Ceph repository.
And their own team has been optimizing the code to meet their requirements.
AFAICT they have not submitted any PRs.

Cheers,
Shinobu

- Original Message -
From: "Alexandre DERUMIER" 
To: "Patrick McGarry" 
Cc: "ceph-devel" , "ceph-users" 

Sent: Wednesday, April 13, 2016 12:45:31 PM
Subject: Re: [ceph-users] Ceph Day Sunnyvale Presentations

Hi,

I was reading this presentation from SK telecom about flash optimisations

AFCeph: Ceph Performance Analysis & Improvement on Flash [Slides]
http://fr.slideshare.net/Inktank_Ceph/af-ceph-ceph-performance-analysis-and-improvement-on-flash
Byung-Su Park, SK Telecom


They seem to have made optimisations in the Ceph code. Is there any patch
reference? (Applied to infernalis/jewel?)


They also seem to have done Ceph config tuning and system tuning, but no
config details are provided :(
It would be great to share those with the community :)

Regards,

Alexandre

- Original Message -
From: "Patrick McGarry" 
To: "ceph-devel" , "ceph-users" 
Sent: Wednesday, 6 April 2016 18:18:28
Subject: [ceph-users] Ceph Day Sunnyvale Presentations

Hey cephers, 

I have all but one of the presentations from Ceph Day Sunnyvale, so 
rather than wait for a full hand I went ahead and posted the link to 
the slides on the event page: 

http://ceph.com/cephdays/ceph-day-sunnyvale/ 

The videos probably won't be processed until after next week, but I’ll
add those once we get them. Thanks to all of the presenters and 
attendees that made this another great event. 


-- 

Best Regards, 

Patrick McGarry 
Director Ceph Community || Red Hat 
http://ceph.com || http://community.redhat.com 
@scuttlemonkey || @ceph 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Day Sunnyvale Presentations

2016-04-12 Thread Alexandre DERUMIER
Hi,

I was reading this presentation from SK telecom about flash optimisations

AFCeph: Ceph Performance Analysis & Improvement on Flash [Slides]
http://fr.slideshare.net/Inktank_Ceph/af-ceph-ceph-performance-analysis-and-improvement-on-flash
Byung-Su Park, SK Telecom


They seem to have made optimisations in the Ceph code. Is there any patch
reference? (Applied to infernalis/jewel?)


They also seem to have done Ceph config tuning and system tuning, but no
config details are provided :(
It would be great to share those with the community :)

Regards,

Alexandre

- Original Message -
From: "Patrick McGarry" 
To: "ceph-devel" , "ceph-users" 
Sent: Wednesday, 6 April 2016 18:18:28
Subject: [ceph-users] Ceph Day Sunnyvale Presentations

Hey cephers, 

I have all but one of the presentations from Ceph Day Sunnyvale, so 
rather than wait for a full hand I went ahead and posted the link to 
the slides on the event page: 

http://ceph.com/cephdays/ceph-day-sunnyvale/ 

The videos probably won't be processed until after next week, but I’ll
add those once we get them. Thanks to all of the presenters and 
attendees that made this another great event. 


-- 

Best Regards, 

Patrick McGarry 
Director Ceph Community || Red Hat 
http://ceph.com || http://community.redhat.com 
@scuttlemonkey || @ceph 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Christian Balzer

Hello,

On Tue, 12 Apr 2016 09:56:32 -0400 (EDT) Sage Weil wrote:

> Hi all,
> 
> I've posted a pull request that updates any mention of ext4 in the docs:
> 
>   https://github.com/ceph/ceph/pull/8556
> 
> In particular, I would appreciate any feedback on
> 
>   
> https://github.com/ceph/ceph/pull/8556/commits/49604303124a2b546e66d6e130ad4fa296602b01
> 
> both on substance and delivery.
> 
> Given the previous lack of clarity around ext4, and that it works well 
> enough for RBD and other short object name workloads, I think the most
> we can do now is deprecate it to steer any new OSDs away.
> 
A clear statement of what "short" means in this context, and whether this (in
general) applies to RBD and CephFS, would probably be helpful.

> And at least in the non-RGW case, I mean deprecate in the "recommend 
> alternative" sense of the word, not that it won't be tested or that any 
> code will be removed.
> 
>   https://en.wikipedia.org/wiki/Deprecation#Software_deprecation
> 
> If there are ext4 + RGW users, that is still a difficult issue, since it 
> is broken now, and expensive to fix.
> 
I'm wondering what the cross-section of RGW (being "stable" a lot longer
than CephFS) and Ext4 users is, for this to pop up so late in the game.

Also, since Sam didn't pipe up, I'd still like to know whether this is
"fixed" by having larger-than-default (256 byte) Ext4 inodes (2KB in my
case), as it isn't purely academic for me.
Or maybe for other people like Michael Metz-Martini, who need Ext4 for
performance reasons and obviously can't go to BlueStore yet.

> 
> On Tue, 12 Apr 2016, Christian Balzer wrote:
> > Only RBD on all clusters so far and definitely no plans to change that 
> > for the main, mission critical production cluster. I might want to add 
> > CephFS to the other production cluster at some time, though.
> 
> That's good to hear.  If you continue to use ext4 (by adjusting down the 
> max object length), the only limitation you should hit is an indirect
> cap on the max RBD image name length.
> 
Just to parse this sentence correctly: is it the name of the object (output
of "rados ls"), the name of the image ("rbd ls"), or either?
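For reference, the knob Sage names would look like this in ceph.conf - a
minimal sketch using his "say, 64" value, only safe if nothing on the
cluster ever needs longer object names:

  [osd]
  # let ext4's limited inline xattr space cope; RBD-style short names only
  osd max object name len = 64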

> > No RGW, but if/when RGW supports "listing objects quickly" (that is what I
> > vaguely remember from my conversation with Timo Sirainen, the Dovecot
> > author) we would be very interested in that particular piece of Ceph as
> > well. On a completely new cluster though, so no issue.
> 
> OT, but I suspect he was referring to something slightly different
> here. Our conversations about object listing vs the dovecot backend
> surrounded the *rados* listing semantics (hash-based, not prefix/name
> based).  RGW supports fast sorted/prefix name listings, but you pay for
> it by maintaining an index (which slows down PUT).  The latest RGW in
> Jewel has experimental support for a non-indexed 'blind' bucket as well
> for users that need some of the RGW features (ACLs, striping, etc.) but
> not the ordered object listing and other index-dependent features.
> 
Sorry about the OT, but since the Dovecot (Pro) backend supports S3 I
would have thought that RGW would be a logical expansion from there, not
going for a completely new (but likely a lot faster) backend using rados.
Oh well, I shall go poke them.

> > Again, most people that deploy Ceph in a commercial environment (that
> > is, working for a company) will be under pressure from the penny-pinching
> > department to use their HW for 4-5 years (never mind the pace of
> > technology and Moore's law).
> > 
> > So you will want to:
> > a) Announce the end of FileStore ASAP, but then again you can't really
> > do that before BlueStore is stable.
> > b) Support FileStore for at least 4 years after BlueStore is the
> > default. This could be done by having a _real_ LTS release, instead of
> > dragging FileStore into newer versions.
> 
> Right.  Nothing can be done until the preferred alternative is
> completely stable, and from then it will take quite some time to drop
> support or remove it given the install base.
> 
> > > > Which brings me to the reasons why people would want to migrate
> > > > (NOT talking about starting freshly) to bluestore.
> > > > 
> > > > 1. Will it be faster (IOPS) than filestore with SSD journals? 
> > > > Don't think so, but feel free to prove me wrong.
> > > 
> > > It will absolutely be faster on the same hardware.  Whether BlueStore on
> > > HDD only is faster than FileStore HDD + SSD journal will depend on
> > > the workload.
> > > 
> > Where would the Journal SSDs enter the picture with BlueStore? 
> > Not at all, AFAIK, right?
> 
> BlueStore can use as many as three devices: one for the WAL (journal, 
> though it can be much smaller than FileStores, e.g., 128MB), one for 
> metadata (e.g., an SSD partition), and one for data.
> 
Right, I blanked on that, despite having read about the K/V storage backends
back when they first showed up. Just didn't make the connection with BlueStore.

OK, so we have a small write-intent-log, 
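For concreteness, a hedged ceph.conf sketch of those three devices - paths
are illustrative, and the bluestore_block_* option names come from the
current (experimental) BlueStore code, so treat them as assumptions:

  [osd]
  osd objectstore = bluestore
  bluestore block path = /dev/sdc            # data device
  bluestore block db path = /dev/nvme0n1p1   # metadata (key/value DB)
  bluestore block wal path = /dev/nvme0n1p2  # small WAL/journal, e.g. 128MB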

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Christian Balzer

Hello,

On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner
GmbH wrote:

> Hi,
> 
> On 11.04.2016 at 23:39, Sage Weil wrote:
> > ext4 has never been recommended, but we did test it.  After Jewel is
> > out, we would like explicitly recommend *against* ext4 and stop
> > testing it.
> Hmmm. We're currently migrating away from xfs as we had some strange
> performance issues which were resolved / got better by switching to
> ext4. We think this is related to our high number of objects (4358
> Mobjects according to ceph -s).
> 
It would be interesting to see how this maps out to the OSDs/PGs.
I'd guess loads and loads of subdirectories per PG, which is probably where
Ext4 performs better than XFS.

> 
> > Recently we discovered an issue with the long object name handling
> > that is not fixable without rewriting a significant chunk of
> > FileStore's filename handling.  (There is a limit in the amount of
> > xattr data ext4 can store in the inode, which causes problems in
> > LFNIndex.)
> We're only using cephfs so we shouldn't be affected by your discovered
> bug, right?
> 
I don't use CephFS, but you should be able to tell this yourself by doing
a "rados -p  ls" on your data and metadata pools and see the
resulting name lengths.
However since you have so many objects, I'd do that on a test cluster, if
you have one. ^o^
If CephFS is using the same/similar hashing to create object names as it
does with RBD images I'd imagine you're OK.
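Something along these lines would do the check - a minimal sketch, pool
names illustrative; it prints the longest object name in each pool:

  for p in cephfs_data cephfs_metadata; do
      printf '%s: ' "$p"
      rados -p "$p" ls | awk 'length > max { max = length } END { print max }'
  done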

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Christian Balzer

Hello,

On Tue, 12 Apr 2016 09:56:13 +0200 Udo Lembke wrote:

> Hi Sage,
Not Sage, but since he hasn't piped up yet...

> we run ext4 only on our 8-node cluster with 110 OSDs and are quite happy 
> with ext4.
> We started with xfs but the latency was much higher compared to ext4...
> 
Welcome to the club. ^o^

> But we use RBD only, with "short" filenames like 
> rbd_data.335986e2ae8944a.000761e1.
> If we can switch from Jewel to K* and, during the update, change the 
> filestore for each OSD to BlueStore, that will be OK for us.
I don't think K* will be a truly, fully stable BlueStore platform, but
I'll be happy to be proven wrong.
Also, would you really want to upgrade to a non-LTS version?

> I hope we will then get better performance with BlueStore??
That seems to be a given, after having read up on it last night.

> Will BlueStore be production ready during the Jewel lifetime, so that we 
> can switch to BlueStore before the next big upgrade?
>
Again doubtful from my perspective. 
 
For example, cache-tiering was introduced in Firefly (and not as a technology
preview requiring "will eat your data" flags to be set in ceph.conf).

It worked seemingly well enough, but was broken in certain situations.
And in the latest Hammer release it is again dangerously broken by a
backport from Infernalis/Jewel.

Christian
> 
> Udo
> 
> On 11.04.2016 at 23:39, Sage Weil wrote:
> > Hi,
> >
> > ext4 has never been recommended, but we did test it.  After Jewel is
> > out, we would like explicitly recommend *against* ext4 and stop
> > testing it.
> >
> > Why:
> >
> > Recently we discovered an issue with the long object name handling
> > that is not fixable without rewriting a significant chunk of
> > FileStore's filename handling.  (There is a limit in the amount of
> > xattr data ext4 can store in the inode, which causes problems in
> > LFNIndex.)
> >
> > We *could* invest a ton of time rewriting this to fix, but it only
> > affects ext4, which we never recommended, and we plan to deprecate
> > FileStore once BlueStore is stable anyway, so it seems like a waste of
> > time that would be better spent elsewhere.
> >
> > Also, by dropping ext4 test coverage in ceph-qa-suite, we can
> > significantly improve time/coverage for FileStore on XFS and on
> > BlueStore.
> >
> > The long file name handling is problematic anytime someone is storing
> > rados objects with long names.  The primary user that does this is RGW,
> > which means any RGW cluster using ext4 should recreate their OSDs to
> > use XFS.  Other librados users could be affected too, though, like
> > users with very long rbd image names (e.g., > 100 characters), or
> > custom librados users.
> >
> > How:
> >
> > To make this change as visible as possible, the plan is to make
> > ceph-osd refuse to start if the backend is unable to support the
> > configured max object name (osd_max_object_name_len).  The OSD will
> > complain that ext4 cannot store such an object and refuse to start.  A
> > user who is only using RBD might decide they don't need long file
> > names to work and can adjust the osd_max_object_name_len setting to
> > something small (say, 64) and run successfully.  They would be taking
> > a risk, though, because we would like to stop testing on ext4.
> >
> > Is this reasonable?  If there are significant ext4 users that are
> > unwilling to recreate their OSDs, now would be the time to speak up.
> >
> > Thanks!
> > sage
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rebalance near full osd

2016-04-12 Thread Christian Balzer

Hello,

On Tue, 12 Apr 2016 09:46:55 +0100 (BST) Andrei Mikhailovsky wrote:

> I've done the ceph osd reweight-by-utilization and it seems to have
> solved the issue. However, I'm not sure if this will be the long-term
> solution.
>
No.
As I said in my reply, use "crush reweight" to permanently adjust weights.
reweight-by-utilization is a band-aid and not permanent, see:
http://cephnotes.ksperis.com/blog/2014/12/23/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight

More OSDs may or may not result in better uniformity; the 30% difference
you're seeing is definitely at the far end of what one would expect with
Ceph.
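For the archives, the two forms side by side (OSD id and weights are
illustrative):

  # temporary 0.0-1.0 override; reset when the OSD is marked out and in again
  ceph osd reweight 7 0.85
  # permanent CRUSH weight, conventionally the disk size in TB
  ceph osd crush reweight osd.7 1.82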

Christian

> Thanks for your help
> 
> Andrei
> 
> - Original Message -
> > From: "Shinobu Kinjo" 
> > To: "Andrei Mikhailovsky" 
> > Cc: "Christian Balzer" , "ceph-users"
> >  Sent: Friday, 8 April, 2016 01:35:18
> > Subject: Re: [ceph-users] rebalance near full osd
> 
> > There was a discussion before regarding to the situation where you are
> > facing now. [1]
> > Would you have a look, if it's helpful or not for you.
> > 
> > [1]
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007622.html
> > 
> > Cheers,
> > Shinobu
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
I apologise, I probably should have dialed it down a bit.
I'd like to personally apologise to Sage, who has been so patient with my ranting.

To be clear: We are so lucky to have Ceph. It was something we sorely needed 
and for the right price (free).
It was a dream come true for cloud providers - and it still is.

However, working with it in production, spending much time getting to know how 
ceph works, what it does, and also seeing how and where it fails prompted my 
interest in where it's going, because big public clouds are one thing, 
traditional SMB/small enterprise needs are another, and that's where I feel it 
fails hard. So I tried prodding here on the ML, watched performance talks (which, 
frankly, reinforced my confirmation bias) and hoped to see some hint of it 
getting better. That for me equals simpler, faster, not reinventing the wheel. I 
truly don't see that and it makes me sad.

You are talking about the big picture - Ceph for storing anything, new 
architecture - and it sounds cool. Given enough money and time it can 
materialise, I won't elaborate on that. I just hope you don't forget about the 
measly RBD users like me (I'd guesstimate a silent 90%+ majority, but no idea, 
hopefully the product manager has a better one) who are frustrated from the 
current design. I'd like to think I represent those users who used to solve HA 
with DRBD 10 years ago, who had to battle NFS shares with rsync and inotify 
scripts, who were the only people on-call every morning at 3AM when logrotate 
killed their IO, all while having to work with rotting hardware and no budget. 
We are still out there and there's nothing for us - RBD is not as fast, simple 
or reliable as DRBD, the filesystem is not as simple nor as fast as rsync, and 
scrubbing still wakes us at 3AM...

I'd very much like Ceph to be my storage system of choice in the future again, 
which is why I am so vocal with my opinions, and maybe truly selfish with my 
needs. I have not yet been convinced of the bright future, and -  being the 
sceptical^Wcynical monster I turned into - I expect everything which makes my 
spidey sense tingle to fail, as it usually does. But that's called confirmation 
bias, which can make my whole point moot I guess :)

Jan 




> On 12 Apr 2016, at 23:08, Nick Fisk  wrote:
> 
> Jan,
> 
> I would like to echo Sage's response here. It seems you only want a subset
> of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
> which requires a lot more intelligence at the lower levels.
> 
> I must say I have found your attitude to both Sage and the Ceph project as a
> whole over the last few emails quite disrespectful. I spend a lot of my time
> trying to sell the benefits of open source, which centre on the openness of
> the idea/code and not around the fact that you can get it for free. One of
> the things that I like about open source is the constructive, albeit
> sometimes abrupt, criticism that results in a better product.
> Simply shouting that Ceph is slow and it's because devs don't understand
> filesystems is not constructive.
> 
> I've just come back from an expo at ExCel London where many providers are
> passionately talking about Ceph. There seems to be a lot of big money
> sloshing about for something that is inherently "wrong"
> 
> Sage and the core Ceph team seem like very clever people to me, and I trust
> that, over the years of development, if they have decided that standard
> FSs are not the ideal backing store for Ceph, this is probably the correct
> decision. However I am also aware that the human condition "Can't see the
> wood for the trees" is everywhere, and I'm sure if you have any clever
> insights into filesystem behaviour, the Ceph Dev team would be more than
> open to suggestions.
> 
> Personally I wish I could contribute more to the project as I feel that I
> (and my company) get more from Ceph than we put in, but it strikes a nerve
> when there is such negative criticism for what effectively is a free
> product.
> 
> Yes, I also suffer from the problem of slow sync writes, but the benefit of
> being able to shift 1U servers around a Rack/DC compared to a SAS-tethered
> 4U JBOD somewhat outweighs that, as well as several other advantages. A new
> cluster that we are deploying has several hardware choices which go a long
> way to improve this performance as well. Coupled with the coming BlueStore,
> the future looks bright.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Sage Weil
>> Sent: 12 April 2016 21:48
>> To: Jan Schermer 
>> Cc: ceph-devel ; ceph-users > us...@ceph.com>; ceph-maintain...@ceph.com
>> Subject: Re: [ceph-users] Deprecating ext4 support
>> 
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Still the answer to most of your points from me is "but who needs that?"
>>> Who needs to have exactly the same data in two separate objects
>>> 

[ceph-users] rbd/rados consistency mismatch (was "Deprecating ext4 support")

2016-04-12 Thread Gregory Farnum
On Tue, Apr 12, 2016 at 1:33 PM, Jan Schermer  wrote:
> Still the answer to most of your points from me is "but who needs that?"
> Who needs to have exactly the same data in two separate objects (replicas)? 
> Ceph needs it because "consistency"?, but the app (VM filesystem) is fine 
> with whatever version because the flush didn't happen (if it did the contents 
> would be the same).

This isn't something that can happen quickly due to the fundamental
design of the RADOS architecture (I think we've discussed this
before?), built to underlie a POSIX filesystem and expecting all
object operations to be latency-sensitive.

However, figuring out a roadmap to reduce the correspondence of rbd
cache-too-full flushes to latency-sensitive disk hits is something
that's been tickling my brain for a while, and I think it's on the
mind of some other contributors as well. Many Ceph developers will be
seeing each other at conferences over the next couple of weeks and
this will be a topic of discussion. Assuming I figure out it's even
possible, and manage to persuade a few others, you should start
hearing things about it in the Ceph Dev Monthlies. :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Gregory Farnum
Thank you for the votes of confidence, everybody. :)
It would be good if we could keep this thread focused on who is harmed
by retiring ext4 as a tested configuration at what speed, and break
out other threads for other issues. (I'm about to do that for one of
them!)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Oliver Dzombic
Hi Jan,

I can answer your question very quickly: We.

We need that!

We need and want a stable, self-healing, scalable, robust, reliable
storage system which can talk to our infrastructure in different languages.

I fully understand that people who are using an infrastructure
which is going to lose support from a piece of software are not too amused.

I don't understand your strict refusal to look at that matter from
different points of view.

And if you just think about it for a moment, you will remind
yourself that this software is not designed for a single purpose.

It's designed for multiple purposes, where "purpose" means the different
flavours/ways in which different people are trying to use the software.

I am very thankful when software designers are trying to make their
product better and better. If that means that they have to drop the
support for a filesystem type, then so be it.

You will not die from that, and neither will anyone else.

I am waiting for the upcoming Jewel to build a new cluster and migrate
the old Hammer cluster into it.

Jewel will have a new feature that will allow migrating clusters.

So what's your problem? For now I don't see any drawback for you.

If the software is able to serve your RBD VMs, then you should
not care whether it's ext2/3/4/200, xfs, or $what_ever_new.

As long as it's working, and maybe even providing more features than
before, then what's the problem?

That YOU don't need those features? That you don't want your running
system to be changed? That you are not the only Ceph user and the
software is not privately developed for your needs?

Seriously?

So, let me welcome you to this world, where you are not alone, and where
there are other people who also have wishes and wants.

I am sure that the people who so badly need/want ext4
support are in the minority. Otherwise the Ceph developers wouldn't drop it,
because they are not stupid enough to drop a feature which is wanted/needed by
a majority of people.

So please, try to open your eyes a bit to the rest of the Ceph users.

And, once you have managed that, try to open your eyes to the Ceph developers
who have made a product here that enables you to manage your stuff and
whatever you use Ceph for.

And if that is all not OK/right from your side, then become a Ceph
developer and code contributor. Keep up the ext4 support and try to
convince the other developers to maintain a feature which is technically
not needed, technically in the way of better software design, and used by
a minority of users. Good luck with that!


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 12.04.2016 at 22:33, Jan Schermer wrote:
> Still the answer to most of your points from me is "but who needs that?"
> Who needs to have exactly the same data in two separate objects (replicas)? 
> Ceph needs it because "consistency"?, but the app (VM filesystem) is fine 
> with whatever version because the flush didn't happen (if it did the contents 
> would be the same).
> 
> You say "Ceph needs", but I say "the guest VM needs" - there's the problem.
> 
>> On 12 Apr 2016, at 21:58, Sage Weil  wrote:
>>
>> Okay, I'll bite.
>>
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
 Local kernel file systems maintain their own internal consistency, but 
 they only provide what consistency promises the POSIX interface 
 does--which is almost nothing.
>>>
>>> ... which is exactly what everyone expects
>>> ... which is everything any app needs
>>>
 That's why every complicated data 
structure (e.g., database) stored on a file system ever includes its own 
 journal.
>>> ... see?
>>
>> They do this because POSIX doesn't give them what they want.  They 
>> implement a *second* journal on top.  The result is that you get the 
>> overhead from both--the fs journal keeping its data structures consistent, 
>> the database keeping its consistent.  If you're not careful, that means 
>> the db has to do something like file write, fsync, db journal append, 
>> fsync.
> It's more like
> transaction log write, flush
> data write
> That's simply because most filesystems don't journal data, but some do.
> 
> 
>> And both fsyncs turn into a *fs* journal io and flush.  (Smart 
>> databases often avoid most of the fs overhead by putting everything in a 
>> single large file, but at that point the file system isn't actually doing 
>> anything except passing IO to the block layer).
>>
>> There is nothing wrong with POSIX file systems.  They have the unenviable 
>> task of catering to a huge variety of workloads and applications, but are 
>> truly optimal for very few.  And that's fine.  If you want a local file 
>> system, you should use ext4 or XFS, not 

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread w...@42on.com


> On 12 Apr 2016, at 23:09, Nick Fisk wrote the
> following:
> 
> Jan,
> 
> I would like to echo Sage's response here. It seems you only want a subset
> of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
> which requires a lot more intelligence at the lower levels.
> 

I fully agree with your e-mail. I think the Ceph devs have earned their 
respect over the years and they know what they are talking about.

For years I have been wondering why there even was a POSIX filesystem 
underneath Ceph.

> I must say I have found your attitude to both Sage and the Ceph project as a
> whole over the last few emails quite disrespectful. I spend a lot of my time
> trying to sell the benefits of open source, which centre on the openness of
> the idea/code and not around the fact that you can get it for free. One of
> the things that I like about open source is the constructive, albeit
> sometimes abrupt, criticism that results in a better product.
> Simply shouting that Ceph is slow and it's because devs don't understand
> filesystems is not constructive.
> 
> I've just come back from an expo at ExCel London where many providers are
> passionately talking about Ceph. There seems to be a lot of big money
> sloshing about for something that is inherently "wrong"
> 
> Sage and the core Ceph team seem like very clever people to me, and I trust
> that, over the years of development, if they have decided that standard
> FSs are not the ideal backing store for Ceph, this is probably the correct
> decision. However I am also aware that the human condition "Can't see the
> wood for the trees" is everywhere, and I'm sure if you have any clever
> insights into filesystem behaviour, the Ceph Dev team would be more than
> open to suggestions.
> 
> Personally I wish I could contribute more to the project as I feel that I
> (and my company) get more from Ceph than we put in, but it strikes a nerve
> when there is such negative criticism for what effectively is a free
> product.
> 
> Yes, I also suffer from the problem of slow sync writes, but the benefit of
> being able to shift 1U servers around a Rack/DC compared to a SAS-tethered
> 4U JBOD somewhat outweighs that, as well as several other advantages. A new
> cluster that we are deploying has several hardware choices which go a long
> way to improve this performance as well. Coupled with the coming BlueStore,
> the future looks bright.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Sage Weil
>> Sent: 12 April 2016 21:48
>> To: Jan Schermer 
>> Cc: ceph-devel ; ceph-users > us...@ceph.com>; ceph-maintain...@ceph.com
>> Subject: Re: [ceph-users] Deprecating ext4 support
>> 
>>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Still the answer to most of your points from me is "but who needs that?"
>>> Who needs to have exactly the same data in two separate objects
>>> (replicas)? Ceph needs it because "consistency"?, but the app (VM
>>> filesystem) is fine with whatever version because the flush didn't
>>> happen (if it did the contents would be the same).
>> 
>> If you want replicated VM store that isn't picky about consistency, try
>> Sheepdog.  Or your mdraid over iSCSI proposal.
>> 
>> We care about these things because VMs are just one of many users of
>> rados, and because even if we could get away with being sloppy in some (or
>> even most) cases with VMs, we need the strong consistency to build other
>> features people want, like RBD journaling for multi-site async
> replication.
>> 
>> Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that
>> chose rados for a reason.
>> 
>> And we want to make sense of an inconsistency when we find one on scrub.
>> (Does it mean the disk is returning bad data, or we just crashed during a
> write
>> a while back?)
>> 
>> ...
>> 
>> Cheers-
>> sage
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread ceph
On 12/04/2016 22:33, Jan Schermer wrote:
> I don't think it's apples and oranges.
> If I export two files via losetup over iSCSI and make a raid1 swraid out of 
> them in guest VM, I bet it will still be faster than ceph with bluestore.
> And yet it will provide the same guarantees and do the same job without 
> eating significant CPU time.
> True or false?
False.
First, your iSCSI server will be a SPOF (single point of failure).
Second, you won't aggregate many things (you're limited by the network, at least).

Saying you don't care about consistency made me laugh ...
You are using xfs/ext4 options like nobarrier etc. in production, right?
They can really improve performance, and only provide that oh-so-useless
consistency that nobody cares about :)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Nick Fisk
Jan,

I would like to echo Sage's response here. It seems you only want a subset
of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
which requires a lot more intelligence at the lower levels.

I must say I have found your attitude to both Sage and the Ceph project as a
whole over the last few emails quite disrespectful. I spend a lot of my time
trying to sell the benefits of open source, which centre on the openness of
the idea/code and not around the fact that you can get it for free. One of
the things that I like about open source is the constructive, albeit
sometimes abrupt, criticism that results in a better product.
Simply shouting that Ceph is slow and it's because devs don't understand
filesystems is not constructive.

I've just come back from an expo at ExCel London where many providers are
passionately talking about Ceph. There seems to be a lot of big money
sloshing about for something that is inherently "wrong"

Sage and the core Ceph team seem like very clever people to me, and I trust
that, over the years of development, if they have decided that standard
FSs are not the ideal backing store for Ceph, this is probably the correct
decision. However I am also aware that the human condition "Can't see the
wood for the trees" is everywhere, and I'm sure if you have any clever
insights into filesystem behaviour, the Ceph Dev team would be more than
open to suggestions.

Personally I wish I could contribute more to the project as I feel that I
(and my company) get more from Ceph than we put in, but it strikes a nerve
when there is such negative criticism for what effectively is a free
product.

Yes, I also suffer from the problem of slow sync writes, but the benefit of
being able to shift 1U servers around a Rack/DC compared to a SAS-tethered
4U JBOD somewhat outweighs that, as well as several other advantages. A new
cluster that we are deploying has several hardware choices which go a long
way to improve this performance as well. Coupled with the coming BlueStore,
the future looks bright.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Sage Weil
> Sent: 12 April 2016 21:48
> To: Jan Schermer 
> Cc: ceph-devel ; ceph-users  us...@ceph.com>; ceph-maintain...@ceph.com
> Subject: Re: [ceph-users] Deprecating ext4 support
> 
> On Tue, 12 Apr 2016, Jan Schermer wrote:
> > Still the answer to most of your points from me is "but who needs that?"
> > Who needs to have exactly the same data in two separate objects
> > (replicas)? Ceph needs it because "consistency"?, but the app (VM
> > filesystem) is fine with whatever version because the flush didn't
> > happen (if it did the contents would be the same).
> 
> If you want replicated VM store that isn't picky about consistency, try
> Sheepdog.  Or your mdraid over iSCSI proposal.
> 
> We care about these things because VMs are just one of many users of
> rados, and because even if we could get away with being sloppy in some (or
> even most) cases with VMs, we need the strong consistency to build other
> features people want, like RBD journaling for multi-site async
replication.
> 
> Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that
> chose rados for a reason.
> 
> And we want to make sense of an inconsistency when we find one on scrub.
> (Does it mean the disk is returning bad data, or we just crashed during a
write
> a while back?)
> 
> ...
> 
> Cheers-
> sage
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Sage Weil
On Tue, 12 Apr 2016, Jan Schermer wrote:
> Still the answer to most of your points from me is "but who needs that?" 
> Who needs to have exactly the same data in two separate objects 
> (replicas)? Ceph needs it because "consistency"?, but the app (VM 
> filesystem) is fine with whatever version because the flush didn't 
> happen (if it did the contents would be the same).

If you want replicated VM store that isn't picky about consistency, 
try Sheepdog.  Or your mdraid over iSCSI proposal.

We care about these things because VMs are just one of many users of 
rados, and because even if we could get away with being sloppy in some (or 
even most) cases with VMs, we need the strong consistency to build other 
features people want, like RBD journaling for multi-site async 
replication.

Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that 
chose rados for a reason.

And we want to make sense of an inconsistency when we find one on scrub.  
(Does it mean the disk is returning bad data, or we just crashed during a 
write a while back?)

...

Cheers-
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects (replicas)? 
Ceph needs it because "consistency"?, but the app (VM filesystem) is fine with 
whatever version because the flush didn't happen (if it did the contents would 
be the same).

You say "Ceph needs", but I say "the guest VM needs" - there's the problem.

> On 12 Apr 2016, at 21:58, Sage Weil  wrote:
> 
> Okay, I'll bite.
> 
> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Local kernel file systems maintain their own internal consistency, but 
>>> they only provide what consistency promises the POSIX interface 
>>> does--which is almost nothing.
>> 
>> ... which is exactly what everyone expects
>> ... which is everything any app needs
>> 
>>> That's why every complicated data 
>>> structure (e.g., database) stored on a file system ever includes its own 
>>> journal.
>> ... see?
> 
> They do this because POSIX doesn't give them what they want.  They 
> implement a *second* journal on top.  The result is that you get the 
> overhead from both--the fs journal keeping its data structures consistent, 
> the database keeping its consistent.  If you're not careful, that means 
> the db has to do something like file write, fsync, db journal append, 
> fsync.
It's more like
transaction log write, flush
data write
That's simply because most filesystems don't journal data, but some do.


> And both fsyncs turn into a *fs* journal io and flush.  (Smart 
> databases often avoid most of the fs overhead by putting everything in a 
> single large file, but at that point the file system isn't actually doing 
> anything except passing IO to the block layer).
> 
> There is nothing wrong with POSIX file systems.  They have the unenviable 
> task of catering to a huge variety of workloads and applications, but are 
> truly optimal for very few.  And that's fine.  If you want a local file 
> system, you should use ext4 or XFS, not Ceph.
> 
> But it turns out ceph-osd isn't a generic application--it has a pretty 
> specific workload pattern, and POSIX doesn't give us the interfaces we 
> want (mainly, atomic transactions or ordered object/file enumeration).

The workload (with RBD) is inevitably expecting POSIX. Who needs more than 
that? To me that indicates unnecessary guarantees.

> 
>>> We could "wing it" and hope for 
>>> the best, then do an expensive crawl and rsync of data on recovery, but we 
>>> chose very early on not to do that.  If you want a system that "just" 
>>> layers over an existing filesystem, you can try Gluster (although note 
>>> that they have a different sort of pain with the ordering of xattr 
>>> updates, and are moving toward a model that looks more like Ceph's backend 
>>> in their next version).
>> 
>> True, which is why we dismissed it.
> 
> ...and yet it does exactly what you asked for:

I was implying it suffers from the same flaws. In any case it wasn't really fast, and 
it seemed overly complex.
To be fair, it was quite a while ago when I tried it.
Can't talk about consistency - I don't think I ever used it in production as 
more than a PoC.

> 
 IMO, If Ceph was moving in the right direction [...] Ceph would 
 simply distribute our IO around with CRUSH.
> 
> You want ceph to "just use a file system."  That's what gluster does--it 
> just layers the distributed namespace right on top of a local namespace.  
> If you didn't care about correctness or data safety, it would be 
> beautiful, and just as fast as the local file system (modulo network).  
> But if you want your data safe, you immediately realize that local POSIX 
> file systems don't get you what you need: the atomic update of two files 
> on different servers so that you can keep your replicas in sync.  Gluster 
> originally took the minimal path to accomplish this: a "simple" 
> prepare/write/commit, using xattrs as transaction markers.  We took a 
> heavyweight approach to support arbitrary transactions.  And both of us 
> have independently concluded that the local fs is the wrong tool for the 
> job.
> 
>>> Offloading stuff to the file system doesn't save you CPU--it just makes 
>>> someone else responsible.  What does save you CPU is avoiding the 
>>> complexity you don't need (i.e., half of what the kernel file system is 
>>> doing, and everything we have to do to work around an ill-suited 
>>> interface) and instead implement exactly the set of features that we need 
>>> to get the job done.
>> 
>> In theory you are right.
>> In practice in-kernel filesystems are fast, and fuse filesystems are slow.
>> Ceph is like that - slow. And you want to be fast by writing more code :)
> 
> You get fast by writing the *right* code, and eliminating layers of the 
> stack (the local file system, in this case) that are providing 
> functionality you don't want (or more functionality than you need at too 
> high a price).
> 
>> I dug into bluestore and how you want to implement it, 

Re: [ceph-users] CephFS writes = Permission denied

2016-04-12 Thread Nate Curry
I thought that I had corrected that already, and apparently I was wrong.
The permissions set on the MDS for the user mounting the filesystem need to be
"rw".  Mine were set to "r".

ceph auth caps client.cephfs mon 'allow r' mds 'allow rw' osd 'allow rwx
pool=cephfs_metadata,allow rwx pool=cephfs_data'
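The effective caps can then be double-checked with:

  ceph auth get client.cephfs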

Thanks!


*Nate Curry*


On Tue, Apr 12, 2016 at 3:56 PM, Gregory Farnum  wrote:

> On Tue, Apr 12, 2016 at 12:20 PM, Nate Curry  wrote:
> > I am seeing an issue with cephfs where I am unable to write changes to the
> > file system in any way.  I am running commands using sudo with a user
> > account as well as the root user itself to modify ownership of files,
> > delete files, and create new files, and all I get is "Permission denied".
> >
> > At first I thought maybe there was something wrong with the file system and
> > it was no longer read-write, but everything seems to check out.  It is not
> > mounted as read-only, ceph is reporting HEALTH_OK, and there is nothing in
> > any of the logs that looks like errors.  I am able to unmount and remount the
> > filesystem without any issues.  It also reboots and mounts no problem.  I am
> > not sure what this could be caused by.  Any ideas?
>
> Sounds like you've got your cephx permission caps set wrong.
> http://docs.ceph.com/docs/master/cephfs/client-auth/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Sage Weil
Okay, I'll bite.

On Tue, 12 Apr 2016, Jan Schermer wrote:
> > Local kernel file systems maintain their own internal consistency, but 
> > they only provide what consistency promises the POSIX interface 
> > does--which is almost nothing.
> 
> ... which is exactly what everyone expects
> ... which is everything any app needs
> 
> >  That's why every complicated data 
> > structure (e.g., database) stored on a file system ever includes its own 
> > journal.
> ... see?

They do this because POSIX doesn't give them what they want.  They 
implement a *second* journal on top.  The result is that you get the 
overhead from both--the fs journal keeping its data structures consistent, 
the database keeping its consistent.  If you're not careful, that means 
the db has to do something like file write, fsync, db journal append, 
fsync.  And both fsyncs turn into a *fs* journal io and flush.  (Smart 
databases often avoid most of the fs overhead by putting everything in a 
single large file, but at that point the file system isn't actually doing 
anything except passing IO to the block layer).
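As a rough shell illustration of that pattern (a hedged sketch assuming GNU
dd, where each oflag=dsync write behaves like a write+fsync pair and thus
costs its own fs journal commit and flush):

  # the database's data write, synchronously flushed
  dd if=/dev/zero of=dbfile bs=4k count=1 conv=notrunc oflag=dsync
  # the database's own journal append, synchronously flushed again
  dd if=/dev/zero of=db.journal bs=4k count=1 conv=notrunc oflag=append,dsync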

There is nothing wrong with POSIX file systems.  They have the unenviable 
task of catering to a huge variety of workloads and applications, but are 
truly optimal for very few.  And that's fine.  If you want a local file 
system, you should use ext4 or XFS, not Ceph.

But it turns out ceph-osd isn't a generic application--it has a pretty 
specific workload pattern, and POSIX doesn't give us the interfaces we 
want (mainly, atomic transactions or ordered object/file enumeration).

> >  We could "wing it" and hope for 
> > the best, then do an expensive crawl and rsync of data on recovery, but we 
> > chose very early on not to do that.  If you want a system that "just" 
> > layers over an existing filesystem, you can try Gluster (although note 
> > that they have a different sort of pain with the ordering of xattr 
> > updates, and are moving toward a model that looks more like Ceph's backend 
> > in their next version).
> 
> True, which is why we dismissed it.

...and yet it does exactly what you asked for:

> > > IMO, If Ceph was moving in the right direction [...] Ceph would 
> > > simply distribute our IO around with CRUSH.

You want ceph to "just use a file system."  That's what gluster does--it 
just layers the distributed namespace right on top of a local namespace.  
If you didn't care about correctness or data safety, it would be 
beautiful, and just as fast as the local file system (modulo network).  
But if you want your data safe, you immediately realize that local POSIX 
file systems don't get you what you need: the atomic update of two files 
on different servers so that you can keep your replicas in sync.  Gluster 
originally took the minimal path to accomplish this: a "simple" 
prepare/write/commit, using xattrs as transaction markers.  We took a 
heavyweight approach to support arbitrary transactions.  And both of us 
have independently concluded that the local fs is the wrong tool for the 
job.

> > Offloading stuff to the file system doesn't save you CPU--it just makes 
> > someone else responsible.  What does save you CPU is avoiding the 
> > complexity you don't need (i.e., half of what the kernel file system is 
> > doing, and everything we have to do to work around an ill-suited 
> > interface) and instead implement exactly the set of features that we need 
> > to get the job done.
> 
> In theory you are right.
> In practice in-kernel filesystems are fast, and fuse filesystems are slow.
> Ceph is like that - slow. And you want to be fast by writing more code :)

You get fast by writing the *right* code, and eliminating layers of the 
stack (the local file system, in this case) that are providing 
functionality you don't want (or more functionality than you need at too 
high a price).

> I dug into bluestore and how you want to implement it, and from what I 
> understood you are reimplementing what the filesystem journal does...

Yes.  The difference is that a single journal manages all of the metadata 
and data consistency in the system, instead of a local fs journal managing 
just block allocation and a second ceph journal managing ceph's data 
structures.

The main benefit, though, is that we can choose a different set of 
semantics, like the ability to overwrite data in a file/object and update 
metadata atomically.  You can't do that with POSIX without building a 
write-ahead journal and double-writing.

> Btw I think at least i_version xattr could be atomic.

Nope.  All major file systems (other than btrfs) overwrite data in place, 
which means it is impossible for any piece of metadata to accurately 
indicate whether you have the old data or the new data (or perhaps a bit 
of both).

> It makes sense it will be 2x faster if you avoid the double-journalling, 
> but I'd be very much surprised if it helped with CPU usage one bit - I 
> certainly don't see my filesystems consuming significant amount 

Re: [ceph-users] CephFS writes = Permission denied

2016-04-12 Thread Gregory Farnum
On Tue, Apr 12, 2016 at 12:20 PM, Nate Curry  wrote:
> I am seeing an issue with cephfs where I am unable to write changes to the
> file system in any way.  I am running commands using sudo with a user
> account as well as the root user itself to modify ownership of files, delete
> files, and create new files and all I get is "Permission denied".
>
> At first I thought maybe there was something wrong with the file system and
> it was no longer read-write, but everything seems to check out.  It is not
> mounted as read-only, ceph is reporting HEALTH_OK, and there is nothing in
> any of the logs that looks like errors.  I am able to unmount and remount the
> filesystem without any issues.  It also reboots and mounts no problem.  I am
> not sure what this could be caused by.  Any ideas?

Sounds like you've got your cephx permission caps set wrong.
http://docs.ceph.com/docs/master/cephfs/client-auth/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
The "out" OSD was "out" before the crash and doesn't hold any data as it 
was weighted out prior.


Restarting the OSDs named as repeat offenders by 'ceph health 
detail' has cleared the problems.


Thanks to all for the guidance and suffering my panic,
--
Eric


On 4/12/16 12:38 PM, Eric Hall wrote:

Ok, mon2 and mon3 are happy together, but mon1 dies with
mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db")

I take this to mean mon1:store.db is corrupt as I see no permission issues.

So... remove mon1 and add a mon?

Nothing special to worry about re-adding a mon on mon1, other than rm/mv
the current store.db path, correct?

Thanks again,
--
Eric
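For the record, the osdmap keys in the quoted exchange below can be read
(with the mon stopped) using ceph-kvstore-tool - a minimal sketch, store
path illustrative:

  ceph-kvstore-tool /var/lib/ceph/mon/ceph-mon1/store.db get osdmap first_committed
  ceph-kvstore-tool /var/lib/ceph/mon/ceph-mon1/store.db get osdmap last_committed

The rewrite Joao describes would go through the same tool's set subcommand.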

On 4/12/16 11:18 AM, Joao Eduardo Luis wrote:

On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.


Let me clarify this a bit: the easy way out for mon1 would be to fix the
other two monitors and recreate mon1.

If you prefer to also fix mon1, you can simply follow the same steps from
the previous email for all the monitors, but ensure that osdmap:full_latest
on mon1 reflects the last available full_ version on its store.

   -Joao

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread ceph
On 12/04/2016 21:19, Jan Schermer wrote:
> 
>> On 12 Apr 2016, at 20:00, Sage Weil  wrote:
>>
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> I'd like to raise these points, then
>>>
>>> 1) some people (like me) will never ever use XFS if they have a choice
>>> given no choice, we will not use something that depends on XFS
>>>
>>> 2) choice is always good
>>
>> Okay!
>>
>>> 3) doesn't majority of Ceph users only care about RBD?
>>
>> Probably that's true now.  We shouldn't recommend something that prevents 
>> them from adding RGW to an existing cluster in the future, though.
>>
>>> (Angry rant coming)
>>> Even our last performance testing of Ceph (Infernalis) showed abysmal 
>>> performance. The most damning sign is the consumption of CPU time at 
>>> unprecedented rate. Was it faster than Dumpling? Slightly, but it ate 
>>> more CPU also, so in effect it was not really "faster".
>>>
>>> It would make *some* sense to only support ZFS or BTRFS because you can 
>>> offload things like clones/snapshots and consistency to the filesystem - 
>>> which would make the architecture much simpler and everything much 
>>> faster. Instead you insist on XFS and reimplement everything in 
>>> software. I always dismissed this because CPU time was ususally cheap, 
>>> but in practice it simply doesn't work. You duplicate things that 
>>> filesystems had solved for years now (namely crash consistency - though 
>>> we have seen that fail as well), instead of letting them do their work 
>>> and stripping the IO path to the bare necessity and letting someone 
>>> smarter and faster handle that.
>>>
>>> IMO, If Ceph was moving in the right direction there would be no 
>>> "supported filesystem" debate, instead we'd be free to choose whatever 
>>> is there that provides the guarantees we need from filesystem (which is 
>>> usually every filesystem in the kernel) and Ceph would simply distribute 
>>> our IO around with CRUSH.
>>>
>>> Right now CRUSH (and in effect what it allows us to do with data) is 
>>> _the_ reason people use Ceph, as there simply wasn't much else to use 
>>> for distributed storage. This isn't true anymore and the alternatives 
>>> are orders of magnitude faster and smaller.
>>
>> This touched on pretty much every reason why we are ditching file 
>> systems entirely and moving toward BlueStore.
> 
> Nooo!
> 
>>
>> Local kernel file systems maintain their own internal consistency, but 
>> they only provide what consistency promises the POSIX interface 
>> does--which is almost nothing.
> 
> ... which is exactly what everyone expects
> ... which is everything any app needs
Correction: this is what every non-storage-related app needs.
mdadm is an app, and it runs over block storage (an extreme comparison);
ext4 is an app too, with the same result.

Ceph is there to store the data; it is much more "an FS" than "a regular app".

> 
>>  That's why every complicated data 
>> structure (e.g., database) stored on a file system ever includes it's own 
>> journal.
> ... see?
> 
> 
>>  In our case, what POSIX provides isn't enough.  We can't even 
>> update a file and it's xattr atomically, let alone the much more 
>> complicated transitions we need to do.
> ... have you thought that maybe xattrs weren't meant to be abused this way? 
> Filesystems usually aren't designed to be performant key=value stores.
> btw at least i_version should be atomic?
> 
> And I still feel (ironically) that you don't understand what journals and 
> commits/flushes are for if you make this argument...
> 
> Btw I think at least i_version xattr could be atomic.
> 
> 
>>  We could "wing it" and hope for 
>> the best, then do an expensive crawl and rsync of data on recovery, but we 
>> chose very early on not to do that.  If you want a system that "just" 
>> layers over an existing filesystem, you can try Gluster (although note 
>> that they have a different sort of pain with the ordering of xattr 
>> updates, and are moving toward a model that looks more like Ceph's backend 
>> in their next version).
> 
> True, which is why we dismissed it.
> 
>>
>> Offloading stuff to the file system doesn't save you CPU--it just makes 
>> someone else responsible.  What does save you CPU is avoiding the 
>> complexity you don't need (i.e., half of what the kernel file system is 
>> doing, and everything we have to do to work around an ill-suited 
>> interface) and instead implement exactly the set of features that we need 
>> to get the job done.
> 
> In theory you are right.
> In practice in-kernel filesystems are fast, and fuse filesystems are slow.
> Ceph is like that - slow. And you want to be fast by writing more code :)
Yep, let's push ceph down next to btrfs, where it belongs.
Would be awesome

> 
>>
>> FileStore is slow, mostly because of the above, but also because it is an 
>> old and not-very-enlightened design.  BlueStore is roughly 2x faster in 
>> early testing.
> ... which is still literally orders of magnitude slower than a filesystem.

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread ceph
On Tue, 12 Apr 2016, Jan Schermer wrote:
> I'd like to raise these points, then
>
> 1) some people (like me) will never ever use XFS if they have a choice
> given no choice, we will not use something that depends on XFS
Huh ?

> 3) doesn't the majority of Ceph users only care about RBD?
Well, half of the users do.
The other half, including myself, use radosgw.

> Finally, remember you *are* completely free to run Ceph on whatever file 
> system you want--and many do.
Yep


About the "ext4 support" stuff, the wiki was pretty clear : you *can*
use ext4, but you *should* use xfs
This is why, despite I mostly run ext4, my OSD are built upon xfs.

So, I think it is a good idea to disable ext4 testing, and make the wiki
more expressive about that.
Beyond that point, as Sage said, people can you whatever FS they want
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS writes = Permission denied

2016-04-12 Thread Nate Curry
I am seeing an issue with cephfs where I am unable to write changes to the
file system in any way.  I am running commands using sudo with a user
account as well as the root user itself to modify ownership of files,
delete files, and create new files, and all I get is "Permission denied".

At first I thought maybe there was something wrong with the file system and
it was no longer read-write, but everything seems to check out.  It is not
mounted as read only, ceph is reporting HEALTH_OK, and there is nothing in
any of the logs that look like errors.  I am able to unmount and remount
the filesystem without any issues.  It also reboots and mounts no problem.
I am not sure what this could be caused by.  Any ideas?
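
For reference, the checks I ran look roughly like this (the mount point
path here is illustrative):

    mount | grep ceph                # confirms it is not mounted read-only
    ceph -s                          # reports HEALTH_OK
    sudo touch /mnt/cephfs/newfile   # -> "Permission denied"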



*Nate Curry*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer

> On 12 Apr 2016, at 20:00, Sage Weil  wrote:
> 
> On Tue, 12 Apr 2016, Jan Schermer wrote:
>> I'd like to raise these points, then
>> 
>> 1) some people (like me) will never ever use XFS if they have a choice
>> given no choice, we will not use something that depends on XFS
>> 
>> 2) choice is always good
> 
> Okay!
> 
>> 3) doesn't the majority of Ceph users only care about RBD?
> 
> Probably that's true now.  We shouldn't recommend something that prevents 
> them from adding RGW to an existing cluster in the future, though.
> 
>> (Angry rant coming)
>> Even our last performance testing of Ceph (Infernalis) showed abysmal 
>> performance. The most damning sign is the consumption of CPU time at an 
>> unprecedented rate. Was it faster than Dumpling? Slightly, but it ate 
>> more CPU also, so in effect it was not really "faster".
>> 
>> It would make *some* sense to only support ZFS or BTRFS because you can 
>> offload things like clones/snapshots and consistency to the filesystem - 
>> which would make the architecture much simpler and everything much 
>> faster. Instead you insist on XFS and reimplement everything in 
>> software. I always dismissed this because CPU time was usually cheap, 
>> but in practice it simply doesn't work. You duplicate things that 
>> filesystems had solved for years now (namely crash consistency - though 
>> we have seen that fail as well), instead of letting them do their work 
>> and stripping the IO path to the bare necessity and letting someone 
>> smarter and faster handle that.
>> 
>> IMO, if Ceph was moving in the right direction there would be no 
>> "supported filesystem" debate, instead we'd be free to choose whatever 
>> is there that provides the guarantees we need from filesystem (which is 
>> usually every filesystem in the kernel) and Ceph would simply distribute 
>> our IO around with CRUSH.
>> 
>> Right now CRUSH (and in effect what it allows us to do with data) is 
>> _the_ reason people use Ceph, as there simply wasn't much else to use 
>> for distributed storage. This isn't true anymore and the alternatives 
>> are orders of magnitude faster and smaller.
> 
> This touched on pretty much every reason why we are ditching file 
> systems entirely and moving toward BlueStore.

Nooo!

> 
> Local kernel file systems maintain their own internal consistency, but 
> they only provide what consistency promises the POSIX interface 
> does--which is almost nothing.

... which is exactly what everyone expects
... which is everything any app needs

>  That's why every complicated data 
> structure (e.g., database) stored on a file system ever includes its own 
> journal.
... see?


>  In our case, what POSIX provides isn't enough.  We can't even 
> update a file and its xattr atomically, let alone the much more 
> complicated transitions we need to do.
... have you thought that maybe xattrs weren't meant to be abused this way? 
Filesystems usually aren't designed to be performant key=value stores.
btw at least i_version should be atomic?

And I still feel (ironically) that you don't understand what journals and 
commits/flushes are for if you make this argument...

Btw I think at least i_version xattr could be atomic.


>  We could "wing it" and hope for 
> the best, then do an expensive crawl and rsync of data on recovery, but we 
> chose very early on not to do that.  If you want a system that "just" 
> layers over an existing filesystem, you can try Gluster (although note 
> that they have a different sort of pain with the ordering of xattr 
> updates, and are moving toward a model that looks more like Ceph's backend 
> in their next version).

True, which is why we dismissed it.

> 
> Offloading stuff to the file system doesn't save you CPU--it just makes 
> someone else responsible.  What does save you CPU is avoiding the 
> complexity you don't need (i.e., half of what the kernel file system is 
> doing, and everything we have to do to work around an ill-suited 
> interface) and instead implement exactly the set of features that we need 
> to get the job done.

In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)

> 
> FileStore is slow, mostly because of the above, but also because it is an 
> old and not-very-enlightened design.  BlueStore is roughly 2x faster in 
> early testing.
... which is still literally orders of magnitude slower than a filesystem.
I dug into bluestore and how you want to implement it, and from what I 
understood you are reimplementing what the filesystem journal does...
It makes sense it will be 2x faster if you avoid the double-journalling, but 
I'd be very much surprised if it helped with CPU usage one bit - I certainly 
don't see my filesystems consuming a significant amount of CPU time on any of my 
machines, and I seriously doubt you're going to do that better, sorry.




Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Sage Weil
On Tue, 12 Apr 2016, Jan Schermer wrote:
> I'd like to raise these points, then
> 
> 1) some people (like me) will never ever use XFS if they have a choice
> given no choice, we will not use something that depends on XFS
> 
> 2) choice is always good

Okay!

> 3) doesn't the majority of Ceph users only care about RBD?

Probably that's true now.  We shouldn't recommend something that prevents 
them from adding RGW to an existing cluster in the future, though.

> (Angry rant coming)
> Even our last performance testing of Ceph (Infernalis) showed abysmal 
> performance. The most damning sign is the consumption of CPU time at an 
> unprecedented rate. Was it faster than Dumpling? Slightly, but it ate 
> more CPU also, so in effect it was not really "faster".
> 
> It would make *some* sense to only support ZFS or BTRFS because you can 
> offload things like clones/snapshots and consistency to the filesystem - 
> which would make the architecture much simpler and everything much 
> faster. Instead you insist on XFS and reimplement everything in 
> software. I always dismissed this because CPU time was usually cheap, 
> but in practice it simply doesn't work. You duplicate things that 
> filesystems had solved for years now (namely crash consistency - though 
> we have seen that fail as well), instead of letting them do their work 
> and stripping the IO path to the bare necessity and letting someone 
> smarter and faster handle that.
> 
> IMO, if Ceph was moving in the right direction there would be no 
> "supported filesystem" debate, instead we'd be free to choose whatever 
> is there that provides the guarantees we need from filesystem (which is 
> usually every filesystem in the kernel) and Ceph would simply distribute 
> our IO around with CRUSH.
> 
> Right now CRUSH (and in effect what it allows us to do with data) is 
> _the_ reason people use Ceph, as there simply wasn't much else to use 
> for distributed storage. This isn't true anymore and the alternatives 
> are orders of magnitude faster and smaller.

This touched on pretty much every reason why we are ditching file 
systems entirely and moving toward BlueStore.

Local kernel file systems maintain their own internal consistency, but 
they only provide what consistency promises the POSIX interface 
does--which is almost nothing.  That's why every complicated data 
structure (e.g., database) stored on a file system ever includes its own 
journal.  In our case, what POSIX provides isn't enough.  We can't even 
update a file and its xattr atomically, let alone the much more 
complicated transitions we need to do.  We could "wing it" and hope for 
the best, then do an expensive crawl and rsync of data on recovery, but we 
chose very early on not to do that.  If you want a system that "just" 
layers over an existing filesystem, you can try Gluster (although note 
that they have a different sort of pain with the ordering of xattr 
updates, and are moving toward a model that looks more like Ceph's backend 
in their next version).

Offloading stuff to the file system doesn't save you CPU--it just makes 
someone else responsible.  What does save you CPU is avoiding the 
complexity you don't need (i.e., half of what the kernel file system is 
doing, and everything we have to do to work around an ill-suited 
interface) and instead implement exactly the set of features that we need 
to get the job done.

FileStore is slow, mostly because of the above, but also because it is an 
old and not-very-enlightened design.  BlueStore is roughly 2x faster in 
early testing.

Finally, remember you *are* completely free to run Ceph on whatever file 
system you want--and many do.  We just aren't going to test them all for 
you and promise they will all work.  Remember that we have hit different 
bugs in every single one we've tried. It's not as simple as saying they 
just have to "provide the guarantees we need" given the complexity of the 
interface, and almost every time we've tried to use "supported" APIs that 
are remotely unusual (fallocate, zeroing extents... even xattrs) we've 
hit bugs or undocumented limits and idiosyncrasies on one fs or another.

Cheers-
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread LOPEZ Jean-Charles
Hi,

Looks like one of your OSDs has been marked as out. Just make sure it’s in, so 
you can read '67 osds: 67 up, 67 in' rather than '67 osds: 67 up, 66 in' in the 
‘ceph -s’ output.

You can quickly check which one is not in with the ‘ceph osd tree’ command
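
Roughly (the osd id below is just an example):

    ceph osd tree      # look for the OSD that is not marked "in"
    ceph osd in 12     # bring it back in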

JC

> On Apr 12, 2016, at 11:21, Joao Eduardo Luis  wrote:
> 
> On 04/12/2016 07:16 PM, Eric Hall wrote:
>> Removed mon on mon1, added mon on mon1 via ceph-deply.  mons now have
>> quorum.
>> 
>> I am left with:
>>cluster 5ee52b50-838e-44c4-be3c-fc596dc46f4e
>>  health HEALTH_WARN 1086 pgs peering; 1086 pgs stuck inactive; 1086
>> pgs stuck unclean; pool vms has too few pgs
>>  monmap e5: 3 mons at
>> {cephsecurestore1=172.16.250.7:6789/0,cephsecurestore2=172.16.250.8:6789/0,cephsecurestore3=172.16.250.9:6789/0},
>> election epoch 28, quorum 0,1,2
>> cephsecurestore1,cephsecurestore2,cephsecurestore3
>>  mdsmap e2: 0/0/1 up
>>  osdmap e38769: 67 osds: 67 up, 66 in
>>   pgmap v33886066: 7688 pgs, 24 pools, 4326 GB data, 892 kobjects
>> 11620 GB used, 8873 GB / 20493 GB avail
>>3 active+clean+scrubbing+deep
>> 1086 peering
>> 6599 active+clean
>> 
>> All OSDs are up/in as reported.  But I see no recovery I/O for those in
>> inactive/peering/unclean.
> 
> Someone else will probably be able to chime in with more authority than me, 
> but I would first try to restart the osds to which those stuck pgs are being 
> mapped.
> 
>  -Joao
> 
>> 
>> Thanks,
>> --
>> Eric
>> 
>> On 4/12/16 1:14 PM, Joao Eduardo Luis wrote:
>>> On 04/12/2016 06:38 PM, Eric Hall wrote:
 Ok, mon2 and mon3 are happy together, but mon1 dies with
 mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db")
 
 I take this to mean mon1:store.db is corrupt as I see no permission
 issues.
 
 So... remove mon1 and add a mon?
 
 Nothing special to worry about re-adding a mon on mon1, other than rm/mv
 the current store.db path, correct?
>>> 
>>> You'll actually need to recreate the mon with 'ceph-mon --mkfs' for that
>>> to work, and that will likely require you to rm/mv the mon data
>>> directory.
>>> 
>>> You *could* copy the mon dir from one of the other monitors and use that
>>> instead. But given you have a functioning quorum, I don't think there's
>>> any reason to resort to that.
>>> 
>>> Follow the docs on removing monitors[1] and recreate the monitor from
>>> scratch, adding it to the cluster. It will sync up from scratch from the
>>> other monitors. That'll make them happy.
>>> 
>>>   -Joao
>>> 
>>> [1]
>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors
>>> 
>>> 
>>> 
 
 Thanks again,
 --
 Eric
 
 On 4/12/16 11:18 AM, Joao Eduardo Luis wrote:
> On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:
>> On 04/12/2016 04:27 PM, Eric Hall wrote:
>>> On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:
>>> 
 So this looks like the monitors didn't remove version 1, but this
 may
 just be a red herring.
 
 What matters, really, is the values in 'first_committed' and
 'last_committed'. If either first or last_committed happens to be
 '1',
 then there may be a bug somewhere in the code, but I doubt that.
 This
 seems just an artefact.
 
 So, it would be nice if you could provide the value of both
 'osdmap:first_committed' and 'osdmap:last_committed'.
>>> 
>>> mon1:
>>> (osdmap, last_committed)
>>>  : 01 00 00 00 00 00 00 00 : 
>>> (osdmap, first_committed) does not exist
>>> 
>>> mon2:
>>> (osdmap, last_committed)
>>>  : 01 00 00 00 00 00 00 00 : 
>>> (osdmap, first_committed) does not exist
>>> 
>>> mon3:
>>> (osdmap, last_committed)
>>>  : 01 00 00 00 00 00 00 00 : 
>>> (osdmap, first_committed)
>>>  : b8 94 00 00 00 00 00 00
>> 
>> Wow! This is unexpected, but fits the assertion just fine.
>> 
>> The solution, I think, will be rewriting first_committed and
>> last_committed on all monitors - except on mon1.
> 
> Let me clarify this a bit: the easy way out for mon1 would be to fix
> the
> other two monitors and recreate mon1.
> 
> If you prefer to also fix mon1, you can simply follow the same steps on
> the previous email for all the monitors, but ensuring
> osdmap:full_latest
> on mon1 reflects the last available full_ version on its store.
> 
>   -Joao
>>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis

On 04/12/2016 07:16 PM, Eric Hall wrote:

Removed mon on mon1, added mon on mon1 via ceph-deply.  mons now have
quorum.

I am left with:
cluster 5ee52b50-838e-44c4-be3c-fc596dc46f4e
  health HEALTH_WARN 1086 pgs peering; 1086 pgs stuck inactive; 1086
pgs stuck unclean; pool vms has too few pgs
  monmap e5: 3 mons at
{cephsecurestore1=172.16.250.7:6789/0,cephsecurestore2=172.16.250.8:6789/0,cephsecurestore3=172.16.250.9:6789/0},
election epoch 28, quorum 0,1,2
cephsecurestore1,cephsecurestore2,cephsecurestore3
  mdsmap e2: 0/0/1 up
  osdmap e38769: 67 osds: 67 up, 66 in
   pgmap v33886066: 7688 pgs, 24 pools, 4326 GB data, 892 kobjects
 11620 GB used, 8873 GB / 20493 GB avail
3 active+clean+scrubbing+deep
 1086 peering
 6599 active+clean

All OSDs are up/in as reported.  But I see no recovery I/O for those in
inactive/peering/unclean.


Someone else will probably be able to chime in with more authority than 
me, but I would first try to restart the osds to which those stuck pgs 
are being mapped.


  -Joao



Thanks,
--
Eric

On 4/12/16 1:14 PM, Joao Eduardo Luis wrote:

On 04/12/2016 06:38 PM, Eric Hall wrote:

Ok, mon2 and mon3 are happy together, but mon1 dies with
mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db")

I take this to mean mon1:store.db is corrupt as I see no permission
issues.

So... remove mon1 and add a mon?

Nothing special to worry about re-adding a mon on mon1, other than rm/mv
the current store.db path, correct?


You'll actually need to recreate the mon with 'ceph-mon --mkfs' for that
to work, and that will likely require you to rm/mv the mon data
directory.

You *could* copy the mon dir from one of the other monitors and use that
instead. But given you have a functioning quorum, I don't think there's
any reason to resort to that.

Follow the docs on removing monitors[1] and recreate the monitor from
scratch, adding it to the cluster. It will sync up from scratch from the
other monitors. That'll make them happy.

   -Joao

[1]
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors





Thanks again,
--
Eric

On 4/12/16 11:18 AM, Joao Eduardo Luis wrote:

On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this
may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be
'1',
then there may be a bug somewhere in the code, but I doubt that.
This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.


Let me clarify this a bit: the easy way out for mon1 would be to fix
the
other two monitors and recreate mon1.

If you prefer to also fix mon1, you can simply follow the same steps on
the previous email for all the monitors, but ensuring
osdmap:full_latest
on mon1 reflects the last available full_ version on its store.

   -Joao




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
Removed mon on mon1, added mon on mon1 via ceph-deply.  mons now have 
quorum.


I am left with:
   cluster 5ee52b50-838e-44c4-be3c-fc596dc46f4e
 health HEALTH_WARN 1086 pgs peering; 1086 pgs stuck inactive; 1086 
pgs stuck unclean; pool vms has too few pgs
 monmap e5: 3 mons at 
{cephsecurestore1=172.16.250.7:6789/0,cephsecurestore2=172.16.250.8:6789/0,cephsecurestore3=172.16.250.9:6789/0}, 
election epoch 28, quorum 0,1,2 
cephsecurestore1,cephsecurestore2,cephsecurestore3

 mdsmap e2: 0/0/1 up
 osdmap e38769: 67 osds: 67 up, 66 in
  pgmap v33886066: 7688 pgs, 24 pools, 4326 GB data, 892 kobjects
11620 GB used, 8873 GB / 20493 GB avail
   3 active+clean+scrubbing+deep
1086 peering
6599 active+clean

All OSDs are up/in as reported.  But I see no recovery I/O for those in 
inactive/peering/unclean.


Thanks,
--
Eric

On 4/12/16 1:14 PM, Joao Eduardo Luis wrote:

On 04/12/2016 06:38 PM, Eric Hall wrote:

Ok, mon2 and mon3 are happy together, but mon1 dies with
mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db")

I take this to mean mon1:store.db is corrupt as I see no permission
issues.

So... remove mon1 and add a mon?

Nothing special to worry about re-adding a mon on mon1, other than rm/mv
the current store.db path, correct?


You'll actually need to recreate the mon with 'ceph-mon --mkfs' for that
to work, and that will likely require you to rm/mv the mon data directory.

You *could* copy the mon dir from one of the other monitors and use that
instead. But given you have a functioning quorum, I don't think there's
any reason to resort to that.

Follow the docs on removing monitors[1] and recreate the monitor from
scratch, adding it to the cluster. It will sync up from scratch from the
other monitors. That'll make them happy.

   -Joao

[1]
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors




Thanks again,
--
Eric

On 4/12/16 11:18 AM, Joao Eduardo Luis wrote:

On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be
'1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.


Let me clarify this a bit: the easy way out for mon1 would be to fix the
other two monitors and recreate mon1.

If you prefer to also fix mon1, you can simply follow the same steps on
the previous email for all the monitors, but ensuring osdmap:full_latest
on mon1 reflects the last available full_ version on its store.

   -Joao



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis

On 04/12/2016 06:38 PM, Eric Hall wrote:

Ok, mon2 and mon3 are happy together, but mon1 dies with
mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db")

I take this to mean mon1:store.db is corrupt as I see no permission issues.

So... remove mon1 and add a mon?

Nothing special to worry about re-adding a mon on mon1, other than rm/mv
the current store.db path, correct?


You'll actually need to recreate the mon with 'ceph-mon --mkfs' for that 
to work, and that will likely require you to rm/mv the mon data directory.


You *could* copy the mon dir from one of the other monitors and use that 
instead. But given you have a functioning quorum, I don't think there's 
any reason to resort to that.


Follow the docs on removing monitors[1] and recreate the monitor from 
scratch, adding it to the cluster. It will sync up from scratch from the 
other monitors. That'll make them happy.
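
In outline, that boils down to something like this (mon id and paths are
examples):

    ceph mon remove mon1
    mv /var/lib/ceph/mon/ceph-mon1 /var/lib/ceph/mon/ceph-mon1.bad
    ceph mon getmap -o /tmp/monmap
    ceph-mon --mkfs -i mon1 --monmap /tmp/monmap --keyring /path/to/mon.keyring
    # then start mon1 again and let it sync from the quorum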


  -Joao

[1] 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors




Thanks again,
--
Eric

On 4/12/16 11:18 AM, Joao Eduardo Luis wrote:

On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.


Let me clarify this a bit: the easy way out for mon1 would be to fix the
other two monitors and recreate mon1.

If you prefer to also fix mon1, you can simply follow the same steps on
the previous email for all the monitors, but ensuring osdmap:full_latest
on mon1 reflects the last available full_ version on its store.

   -Joao


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
Ok, mon2 and mon3 are happy together, but mon1 dies with 
mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db")


I take this to mean mon1:store.db is corrupt as I see no permission issues.

So... remove mon1 and add a mon?

Nothing special to worry about re-adding a mon on mon1, other than rm/mv 
the current store.db path, correct?


Thanks again,
--
Eric

On 4/12/16 11:18 AM, Joao Eduardo Luis wrote:

On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.


Let me clarify this a bit: the easy way out for mon1 would be to fix the
other two monitors and recreate mon1.

If you prefer to also fix mon1, you can simply follow the same steps on
the previous email for all the monitors, but ensuring osdmap:full_latest
on mon1 reflects the last available full_ version on its store.

   -Joao

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis

On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote:

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.


Let me clarify this a bit: the easy way out for mon1 would be to fix the 
other two monitors and recreate mon1.


If you prefer to also fix mon1, you can simply follow the same steps on 
the previous email for all the monitors, but ensuring osdmap:full_latest 
on mon1 reflects the last available full_ version on its store.


  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis

On 04/12/2016 04:27 PM, Eric Hall wrote:

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Wow! This is unexpected, but fits the assertion just fine.

The solution, I think, will be rewriting first_committed and 
last_committed on all monitors - except on mon1.


As per your previous email, in which you listed the osdmap version 
intervals for each monitor, it seems like mon1 contains incremental 
versions [38072..38630] and full versions [38072..38456] - i.e., there's 
a bunch of full versions missing from 38456 to 38630.


The other two monitors do not seem afflicted by this gap.

This will not be necessarily a problem as long as osdmap:full_latest 
contains the version of the latest full map in the monitor's store. If 
by any chance osdmap:full_latest contains a lower version than the 
lowest full map version available, or a greater version than the highest 
full map version, then problems will ensue. That said,


I would advise performing the following in a copy of your monitors 
(injecting a custom monmap to make it run solo[1]), so that any 
still-running osds are not affected by any eventual side effects. Once 
you are sure no assertions have been hit and the monitor is running 
fine, feel free to apply these to your monitors.


1. set osdmap:first_committed to 38072
2. set osdmap:last_committed to 38630
3. set osdmap:full_latest to whatever is the latest full_X version 
on the monitor.

  3.1. this means 38630 on mon2 and mon3 - but 38456 on mon1

Setting versions should be as simple as

ceph-kvstore-tool ${MONDATA}/store.db set osdmap ${KEY} ver ${VER}

with ${KEY} being either first_committed, last_committed or full_latest

and ${VER} being the appropriate value.
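
Concretely, for mon2 that would be something like (the store path is
illustrative):

    MON=/var/lib/ceph/mon/ceph-mon2
    ceph-kvstore-tool $MON/store.db set osdmap first_committed ver 38072
    ceph-kvstore-tool $MON/store.db set osdmap last_committed ver 38630
    ceph-kvstore-tool $MON/store.db set osdmap full_latest ver 38630  # 38456 on mon1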


Hope this helps.

  -Joao

[1] This assert is only triggered once a quorum is formed, which means
you'll either have to have all the monitors running, or force the
quorum to be just one single monitor.
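
Roughly, on the copy (mon ids are examples):

    ceph-mon -i mon2 --extract-monmap /tmp/monmap   # from the stopped copy
    monmaptool /tmp/monmap --rm mon1 --rm mon3      # leave a single mon
    ceph-mon -i mon2 --inject-monmap /tmp/monmap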




Furthermore, the code is asserting on a basic check on
OSDMonitor::update_from_paxos(), which is definitely unexpected to fail.
It would also be nice if you could point us to a mon log with
'--debug-mon 20' from start to hitting the assertion. Feel free to send
>> it directly to me if you don't want them sitting on the internet.


Here is from mon2 (cephsecurestore2 IRL), which starts and dies with the
assert:
http://www.isis.vanderbilt.edu/mon2.log

Here is from mon3 (cephsecurestore3 IRL), which starts and runs, but
can't form quorum and never gives up on mon1 and mon2.  Removing mon1
and mon2 from mon3's monmap via extract/rm/inject results in the same FAILED 
assert as others:
http://www.isis.vanderbilt.edu/mon3.log


My thought was that if I could resolve the last_committed problem on
mon3, then it might have a chance sans mon1 and mon2.

Thank you,
--
Eric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall

On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:


So this looks like the monitors didn't remove version 1, but this may
just be a red herring.

What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.

So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.


mon1:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon2:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed) does not exist

mon3:
(osdmap, last_committed)
 : 01 00 00 00 00 00 00 00 : 
(osdmap, first_committed)
 : b8 94 00 00 00 00 00 00


Furthermore, the code is asserting on a basic check on
OSDMonitor::update_from_paxos(), which is definitely unexpected to fail.
It would also be nice if you could point us to a mon log with
'--debug-mon 20' from start to hitting the assertion. Feel free to send
it directly to me if you don't want them sitting on the internet.


Here is from mon2 (cephsecurestore2 IRL), which starts and dies with the 
assert:

http://www.isis.vanderbilt.edu/mon2.log

Here is from mon3 (cephsecurestore3 IRL), which starts and runs, but 
can't form quorum and never gives up on mon1 and mon2.  Removing mon1 
and mon2 from mon3's monmap via extract/rm/inject results in the same FAILED 
assert as others:

http://www.isis.vanderbilt.edu/mon3.log


My thought was that if I could resolve the last_committed problem on 
mon3, then it might have a chance sans mon1 and mon2.


Thank you,
--
Eric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis

On 04/12/2016 03:33 PM, Eric Hall wrote:

On 4/12/16 9:02 AM, Gregory Farnum wrote:

On Tue, Apr 12, 2016 at 4:41 AM, Eric Hall 
wrote:

On 4/12/16 12:01 AM, Gregory Farnum wrote:

Exactly what values are you reading that's giving you those values?
The "real" OSDMap epoch is going to be at least 38630...if you're very
lucky it will be exactly 38630. But since it reset itself to 1 in the
monitor's store, I doubt you'll be lucky.


It's been my week...


I'm getting this from ceph-kvstore-tool list.


I meant the keys that it was outputting...I forgot we actually had one
called "osdmap".


 From ceph-kvstore-tool /path/monN/store.db list |grep osd:

mon1:
osdmap:1
osdmap:38072
[...]
osdmap:38630
osdmap:first_committed
osdmap:full_38072
[...]
osdmap:full_38456
osdmap:last_committed


So this looks like the monitors didn't remove version 1, but this may 
just be a red herring.


What matters, really, is the values in 'first_committed' and 
'last_committed'. If either first or last_committed happens to be '1', 
then there may be a bug somewhere in the code, but I doubt that. This 
seems just an artefact.


So, it would be nice if you could provide the value of both 
'osdmap:first_committed' and 'osdmap:last_committed'.


Furthermore, the code is asserting on a basic check on 
OSDMonitor::update_from_paxos(), which is definitely unexpected to fail. 
It would also be nice if you could point us to a mon log with 
'--debug-mon 20' from start to hitting the assertion. Feel free to send 
it directly to me if you don't want them sitting on the internet.
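
Something like the following should do it (mon id is an example):

    # run the monitor in the foreground with verbose mon debugging
    ceph-mon -i mon1 -d --debug-mon 20 2>&1 | tee /tmp/mon1-debug.log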


  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and Ubuntu Backport Kernel Problem

2016-04-12 Thread Mathias Buresch
Thank you so much Ilya!

This is exactly what I have searched for!!

-Original Message-
From: Ilya Dryomov 
To: Mathias Buresch 
Cc: ceph-us...@ceph.com 
Subject: Re: [ceph-users] CephFS and Ubuntu Backport Kernel Problem
Date: Tue, 12 Apr 2016 16:21:04 +0200

On Tue, Apr 12, 2016 at 4:08 PM, Mathias Buresch
 wrote:
> 
> Hi there,
> 
> I have an issue with using Ceph and Ubuntu Backport Kernel newer than
> 3.19.0-43.
> 
> Following setup I have:
> 
> Ubuntu 14.04
> Kernel 3.19.0-43 (Backport Kernel)
> Ceph 0.94.6
> 
> I am using CephFS! The kernel 3.19.0-43 was the last working kernel.
> Every newer kernel is failing and has a kernel panic or something.
> When starting the server, the processes themselves start normally, but
> when mounting CephFS (the kernel client - not FUSE!) it hangs and I
> can only restart the server.
> 
> Does anyone know about that issue or that it would be fixed if I
> upgrade to one of the newer Ceph versions?!

See

http://www.spinics.net/lists/ceph-devel/msg29504.html
http://tracker.ceph.com/issues/15302

and search for "[ceph-users] cephfs Kernel panic" thread from yesterday
here on ceph-users - archives haven't caught up yet.

Thanks,

Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall

On 4/12/16 9:02 AM, Gregory Farnum wrote:

On Tue, Apr 12, 2016 at 4:41 AM, Eric Hall  wrote:

On 4/12/16 12:01 AM, Gregory Farnum wrote:

Exactly what values are you reading that's giving you those values?
The "real" OSDMap epoch is going to be at least 38630...if you're very
lucky it will be exactly 38630. But since it reset itself to 1 in the
monitor's store, I doubt you'll be lucky.


It's been my week...


I'm getting this from ceph-kvstore-tool list.


I meant the keys that it was outputting...I forgot we actually had one
called "osdmap".


From ceph-kvstore-tool /path/monN/store.db list |grep osd:

mon1:
osdmap:1
osdmap:38072
[...]
osdmap:38630
osdmap:first_committed
osdmap:full_38072
[...]
osdmap:full_38456
osdmap:last_committed

mon2:
osdmap:1
osdmap:38072
[...]
osdmap:38630
osdmap:first_committed
osdmap:full_38072
[...]
osdmap:full_38630
osdmap:full_latest
osdmap:last_committed

mon3:
osdmap:1
osdmap:38072
[...]
osdmap:38630
osdmap:first_committed
osdmap:full_38072
[...]
osdmap:full_38630
osdmap:full_latest
osdmap:last_committed


So in order to get your cluster back up, you need to find the largest
osdmap version in your cluster. You can do that, very tediously, by
looking at the OSDMap stores. Or you may have debug logs indicating it
more easily on the monitors.



I don't see info like this in any logs.  How/where do I inspect this?


If you had debugging logs up high enough, it would tell you things
like each map commit. And every time the monitor subsystems (like the
OSD Monitor) print out any debugging info they include what
epoch/version they are on, so it's in the log output prefix.


I doubt I have debug high enough... example lines from mon3 log:
2016-04-11 02:59:27.534149 7fef19a86700  0 mon.mon3@2(peon) e1 
handle_command mon_command({"prefix": "status"} v 0) v1
2016-04-11 02:59:34.556487 7fef19a86700  1 mon.mon3@2(peon).log 
v32366957 check_sub sending message to client.6567304 
172.16.250.1:0/3381977473 with 1 entries (version 32366957)


Where is the OSDMap store if not in store.db?

Thank you,
--
Eric



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Sage Weil
Hi all,

I've posted a pull request that updates any mention of ext4 in the docs:

https://github.com/ceph/ceph/pull/8556

In particular, I would appreciate any feedback on


https://github.com/ceph/ceph/pull/8556/commits/49604303124a2b546e66d6e130ad4fa296602b01

both on substance and delivery.

Given the previous lack of clarity around ext4, and that it works well 
enough for RBD and other short object name workloads, I think the most we 
can do now is deprecate it to steer any new OSDs away.

And at least in the non-RGW case, I mean deprecate in the "recommend 
alternative" sense of the word, not that it won't be tested or that any 
code will be removed.

https://en.wikipedia.org/wiki/Deprecation#Software_deprecation

If there are ext4 + RGW users, that is still a difficult issue, since it 
is broken now, and expensive to fix.


On Tue, 12 Apr 2016, Christian Balzer wrote:
> Only RBD on all clusters so far and definitely no plans to change that 
> for the main, mission critical production cluster. I might want to add 
> CephFS to the other production cluster at some time, though.

That's good to hear.  If you continue to use ext4 (by adjusting down the 
max object length), the only limitation you should hit is an indirect cap 
on the max RBD image name length.
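
For reference, that adjustment is a couple of OSD settings; a sketch with
illustrative values (the namespace knob is from memory and may be named
differently):

    [osd]
    osd max object name len = 256
    osd max object namespace len = 64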

> No RGW, but if/when RGW supports "listing objects quickly" (is what I
> vaguely remember from my conversation with Timo Sirainen, the Dovecot
> author) we would be very interested in that particular piece of Ceph as
> well. On a completely new cluster though, so no issue.

OT, but I suspect he was referring to something slightly different here.  
Our conversations about object listing vs the dovecot backend surrounded 
the *rados* listing semantics (hash-based, not prefix/name based).  RGW 
supports fast sorted/prefix name listings, but you pay for it by 
maintaining an index (which slows down PUT).  The latest RGW in Jewel has 
experimental support for a non-indexed 'blind' bucket as well for users 
that need some of the RGW features (ACLs, striping, etc.) but not the 
ordered object listing and other index-dependent features.

> Again, most people that deploy Ceph in a commercial environment (that is
> working for a company) will be under pressure from the penny-pinching 
> department to use their HW for 4-5 years (never mind the pace of
> technology and Moore's law).
> 
> So you will want to:
> a) Announce the end of FileStore ASAP, but then again you can't really
> do that before BlueStore is stable.
> b) support FileStore for 4 years at least after BlueStore is the default. 
> This could be done by having a _real_ LTS release, instead of dragging
> Filestore into newer versions. 

Right.  Nothing can be done until the preferred alternative is completely 
stable, and from then it will take quite some time to drop support or 
remove it given the install base.

> > > Which brings me to the reasons why people would want to migrate (NOT
> > > talking about starting freshly) to bluestore.
> > > 
> > > 1. Will it be faster (IOPS) than filestore with SSD journals? 
> > > Don't think so, but feel free to prove me wrong.
> > 
> > It will absolutely be faster on the same hardware.  Whether BlueStore on
> > HDD only is faster than FileStore HDD + SSD journal will depend on the 
> > workload.
> > 
> Where would the Journal SSDs enter the picture with BlueStore? 
> Not at all, AFAIK, right?

BlueStore can use as many as three devices: one for the WAL (journal, 
though it can be much smaller than FileStores, e.g., 128MB), one for 
metadata (e.g., an SSD partition), and one for data.
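
As a rough sketch of what that three-device layout could look like in
ceph.conf (option names as of the current experimental code; device paths
purely illustrative):

    [osd]
    osd objectstore = bluestore
    bluestore block path = /dev/sdc            # data
    bluestore block db path = /dev/sdb1        # metadata (SSD partition)
    bluestore block wal path = /dev/nvme0n1p1  # small WAL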

> I'm thinking again about people with existing HW again. 
> What do they do with those SSDs, which aren't necessarily sized in a
> fashion to be sensible SSD pools/cache tiers?

We can either use them for BlueStore wal and metadata, or as a cache for 
the data device (e.g., dm-cache, bcache, FlashCache), or some combination 
of the above.  It will take some time to figure out which gives the 
best performance (and for which workloads).

> > > 2. Will it be bit-rot proof? Note the deafening silence from the devs
> > > in this thread: 
> > > http://www.spinics.net/lists/ceph-users/msg26510.html
> > 
> > I missed that thread, sorry.
> > 
> > We (Mirantis, SanDisk, Red Hat) are currently working on checksum
> > support in BlueStore.  Part of the reason why BlueStore is the preferred
> > path is because we will probably never see full checksumming in ext4 or
> > XFS.
> > 
> Now this (when done correctly) and BlueStore being a stable default will
> be a much, MUCH higher motivation for people to migrate to it than
> terminating support for something that works perfectly well (for my use
> case at least).

Agreed.

> > > > How:
> > > > 
> > > > To make this change as visible as possible, the plan is to make
> > > > ceph-osd refuse to start if the backend is unable to support the
> > > > configured max object name (osd_max_object_name_len).  The OSD will
> > > > 

Re: [ceph-users] CephFS and Ubuntu Backport Kernel Problem

2016-04-12 Thread Ilya Dryomov
On Tue, Apr 12, 2016 at 4:08 PM, Mathias Buresch
 wrote:
>
> Hi there,
>
> I have an issue with using Ceph and Ubuntu Backport Kernel newer than
> 3.19.0-43.
>
> Following setup I have:
>
> Ubuntu 14.04
> Kernel 3.19.0-43 (Backport Kernel)
> Ceph 0.94.6
>
> I am using CephFS! The kernel 3.19.0-43 was the last working kernel.
> Every newer kernel is failing and has a kernel panic or something.
> When starting the server, the processes themselves start normally, but when
> mounting CephFS (the kernel client - not FUSE!) it hangs and I
> can only restart the server.
>
> Does anyone know about that issue or that it would be fixed if I
> upgrade to one of the newer Ceph versions?!

See

http://www.spinics.net/lists/ceph-devel/msg29504.html
http://tracker.ceph.com/issues/15302

and search for "[ceph-users] cephfs Kernel panic" thread from yesterday
here on ceph-users - archives haven't caught up yet.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and Ubuntu Backport Kernel Problem

2016-04-12 Thread John Spray
On Tue, Apr 12, 2016 at 3:08 PM, Mathias Buresch
 wrote:
>
> Hi there,
>
> I have an issue with using Ceph and Ubuntu Backport Kernel newer than
> 3.19.0-43.
>
> Following setup I have:
>
> Ubuntu 14.04
> Kernel 3.19.0-43 (Backport Kernel)
> Ceph 0.94.6
>
> I am using CephFS! The kernel 3.19.0-43 was the last working kernel.
> Every newer kernel is failing and has a kernel panic or something.

You're going to have to be more specific than that.  Is it a kernel
panic?  Is there a trace being dumped to the server's console?  Is
there anything in your system logs?

John

> When starting the server, the processes themselves start normally, but when
> mounting CephFS (the kernel client - not FUSE!) it hangs and I
> can only restart the server.
>
> Does anyone know about that issue or that it would be fixed if I
> upgrade to one of the newer Ceph versions?!
>
>
> Greetz
> Mathias
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS and Ubuntu Backport Kernel Problem

2016-04-12 Thread Mathias Buresch

Hi there,

I have an issue with using Ceph and Ubuntu Backport Kernel newer than
3.19.0-43.

Following setup I have:

Ubuntu 14.04
Kernel 3.19.0-43 (Backport Kernel)
Ceph 0.94.6

I am using CephFS! The kernel 3.19.0-43 was the last working kernel.
Every newer kernel is failing and has a kernel panic or something.
When starting the server, the processes themselves start normally, but when
mounting CephFS (the kernel client - not FUSE!) it hangs and I
can only restart the server.
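
The mount itself is the standard kernel mount, roughly (address and paths
illustrative):

    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret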

Does anyone know about that issue or that it would be fixed if I
upgrade to one of the newer Ceph versions?!


Greetz
Mathias

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Gregory Farnum
On Tue, Apr 12, 2016 at 4:41 AM, Eric Hall  wrote:
> On 4/12/16 12:01 AM, Gregory Farnum wrote:
>>
>> On Mon, Apr 11, 2016 at 3:45 PM, Eric Hall 
>> wrote:
>>>
>>> Power failure in data center has left 3 mons unable to start with
>>> mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)
>>>
>>> Have found simliar problem discussed at
>>> http://irclogs.ceph.widodh.nl/index.php?date=2015-05-29, but am unsure
>>> how
>>> to proceed.
>>>
>>> If I read
>>> ceph-kvstore-tool /var/lib/ceph/mon/ceph-cephsecurestore1/store.db list
>>> correctly, they believe osdmap is 1, but they also have osdmap:full_38456
>>> and osdmap:38630 in the store.
>>
>>
>> Exactly what values are you reading that's giving you those values?
>> The "real" OSDMap epoch is going to be at least 38630...if you're very
>> lucky it will be exactly 38630. But since it reset itself to 1 in the
>> monitor's store, I doubt you'll be lucky.
>
>
> I'm getting this from ceph-kvstore-tool list.

I meant the keys that it was outputting...I forgot we actually had one
called "osdmap".

>
>> So in order to get your cluster back up, you need to find the largest
>> osdmap version in your cluster. You can do that, very tediously, by
>> looking at the OSDMap stores. Or you may have debug logs indicating it
>> more easily on the monitors.
>
>
> I don't see info like this in any logs.  How/where do I inspect this?

If you had debugging logs up high enough, it would tell you things
like each map commit. And every time the monitor subsystems (like the
OSD Monitor) print out any debugging info they include what
epoch/version they are on, so it's in the log output prefix.
-Greg


>
> Thank you,
> --
> Eric
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall

On 4/12/16 12:01 AM, Gregory Farnum wrote:

On Mon, Apr 11, 2016 at 3:45 PM, Eric Hall  wrote:

Power failure in data center has left 3 mons unable to start with
mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)

Have found simliar problem discussed at
http://irclogs.ceph.widodh.nl/index.php?date=2015-05-29, but am unsure how
to proceed.

If I read
ceph-kvstore-tool /var/lib/ceph/mon/ceph-cephsecurestore1/store.db list
correctly, they believe osdmap is 1, but they also have osdmap:full_38456
and osdmap:38630 in the store.


Exactly what values are you reading that's giving you those values?
The "real" OSDMap epoch is going to be at least 38630...if you're very
lucky it will be exactly 38630. But since it reset itself to 1 in the
monitor's store, I doubt you'll be lucky.


I'm getting this from ceph-kvstore-tool list.


So in order to get your cluster back up, you need to find the largest
osdmap version in your cluster. You can do that, very tediously, by
looking at the OSDMap stores. Or you may have debug logs indicating it
more easily on the monitors.


I don't see info like this in any logs.  How/where do I inspect this?

Thank you,
--
Eric



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs Kernel panic

2016-04-12 Thread Ilya Dryomov
On Tue, Apr 12, 2016 at 12:21 PM, Simon Ferber
 wrote:
> Am 12.04.2016 um 12:09 schrieb Florian Haas:
>> On Tue, Apr 12, 2016 at 11:53 AM, Simon Ferber
>>  wrote:
>>> Thank you! That's it. I have installed the Kernel from the Jessie
>>> backport. Now the crashes are gone.
>>> How often do these things happen? It would be a worst case scenario, if
>>> a system update breaks a productive system.
>>
>> For what it's worth, what you saw is kernel (i.e. client) side
>> breakage. You didn't mess up your Ceph cluster, nor your CephFS
>> metadata, nor any data. Also, anything you do in CephFS using a
>> release before Jewel must be considered experimental, and while things
>> will generally not break even on the client, you shouldn't be
>> surprised if they do. Thirdly, my recommendation for any Ceph
>> client-side kernel functionality (both rbd.ko and CephFS) would be to
>> use nothing older than a 4.x kernel.
>
> Thank you for clarification, Florian.

Florian is correct.  I'd like to add that this (i.e. breaking an
already released kernel) happened maybe once before in many years.
It was truly an accident - a commit that wasn't explicitly set to be
backported to 3.16.* got backported semi-automatically.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Suggestion: flag HEALTH_WARN state if monmap has 2 mons

2016-04-12 Thread Wido den Hollander

> On 12 April 2016 at 12:21, Florian Haas wrote:
> 
> 
> Hi everyone,
> 
> I wonder what others think about the following suggestion: running an
> even number of mons almost never makes sense, and specifically two
> mons never does at all. Wouldn't it make sense to just flag a
> HEALTH_WARN state if the monmap contained an even number of mons, or
> maybe only if the number of mons in the monmap is exactly 2?
> 
> Documentation on why this is a bad idea does exist and is actually
> just as comprehensive as it needs to be
> (http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/),
> but it still seems to me that this is a very common rookie mistake.
> 
> Thoughts?
> 

Good point. It should indeed! A config setting to override it might be good. But
in general I agree with you. 2 mons is a bad thing and should trigger a warning.

> Cheers,
> Florian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph striping

2016-04-12 Thread Christian Balzer
On Tue, 12 Apr 2016 10:53:50 +0200 Alwin Antreich wrote:

> 
> On 04/12/2016 01:48 AM, Christian Balzer wrote:
> > On Mon, 11 Apr 2016 09:25:35 -0400 (EDT) Jason Dillaman wrote:
> >
> > > In general, RBD "fancy" striping can help under certain workloads
> > > where small IO would normally be hitting the same object (e.g. small
> > > sequential IO).
> > >
> >
> > While the above is very true (especially for single/few clients), I
> > never bothered to deploy fancy striping because you have to plan it
> > very carefully, as you can't change it later on.
> >
> > For example if you start with 8 OSDs and set your striping accordingly
> > (as Alwin's example suggested) but later add more OSDs you won't be
> > taking full advantage of the IOPS available.
> 
> I didn't think about that, thanks for pointing that out. As we have a
> mixed workload on our new cluster, VMs and cephfs for login directories
> and sources, I am definitely going to test these settings.
> 
Note that once you get to a point where all your OSDs are somewhat busy
all the time and/or you have many clients, striping starts to make
somewhat less sense.

That said, one way around the issue above is of course to come up with a
planned maximum size for the cluster (like 64 OSDs) and set the stripe count
accordingly.
This SHOULD result in the same thing with 8 initial OSDs, but I haven't
tested it myself.
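
For illustration, such an up-front layout might look like this (a sketch
with hammer-era rbd flags; the image name is made up, and --stripe-unit is
given in bytes):

# 4 MB objects (order 22), 64 KB stripe unit, stripe count sized for a
# planned 64-OSD cluster rather than the initial 8
rbd create rbd/testimage --size 102400 --order 22 \
    --stripe-unit 65536 --stripe-count 64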

Also there is the question about what happens if you wind up with OSD
numbers that are uneven, will the odd one wind up with more data?
Test this if you have the chance.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs Kernel panic

2016-04-12 Thread Christian Balzer
On Tue, 12 Apr 2016 12:21:51 +0200 Simon Ferber wrote:

> Am 12.04.2016 um 12:09 schrieb Florian Haas:
> > On Tue, Apr 12, 2016 at 11:53 AM, Simon Ferber
> >  wrote:
> >> Thank you! That's it. I have installed the kernel from the Jessie
> >> backport. Now the crashes are gone.
> >> How often do these things happen? It would be a worst-case scenario
> >> if a system update broke a production system.
> > 
> > For what it's worth, what you saw is kernel (i.e. client) side
> > breakage. You didn't mess up your Ceph cluster, nor your CephFS
> > metadata, nor any data. Also, anything you do in CephFS using a
> > release before Jewel must be considered experimental, and while things
> > will generally not break even on the client, you shouldn't be
> > surprised if they do. Thirdly, my recommendation for any Ceph
> > client-side kernel functionality (both rbd.ko and CephFS) would be to
> > use nothing older than a 4.x kernel.
> 
> Thank you for clarification, Florian.
> 
> > 
> > A good update on the current state of CephFS is this tech talk, which
> > John Spray did in February:
> > 
> > https://www.youtube.com/watch?v=GbdHxL0vc9I
> > slideshare.net/JohnSpray1/cephfs-update-february-2016
> > 
> > Also, please don't ever do this:
> > 
> > cluster 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
> >  health HEALTH_OK
> >  monmap e2: 2 mons at
> > {ollie2=129.217.207.207:6789/0,stan2=129.217.207.206:6789/0}
> > election epoch 12, quorum 0,1 stan2,ollie2
> >  mdsmap e10: 1/1/1 up {0=ollie2=up:active}, 1 up:standby
> >  osdmap e72: 8 osds: 8 up, 8 in
> > flags sortbitwise
> >   pgmap v137: 428 pgs, 4 pools, 2396 bytes data, 20 objects
> > 281 MB used, 14856 GB / 14856 GB avail
> >  428 active+clean
> > 
> > 2 mons. Never, and I repeat never, run your Ceph cluster with 2 mons.
> > You want to run 3.
> 
> Thus if there are only two servers (which used to run DRBD), what would
> be the best solution? Just grab another Linux server and install a ceph
> cluster node with a monitor only and no OSDs?
> 
Yes, even an independent VM will do. 
The busiest MON will be the leader, which is always the one with the
lowest IP address, so keep that in mind.
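
A sketch of adding that third mon with ceph-deploy (the hostname is made up;
the manual procedure from the add-or-rm-mons doc works just as well):

# from the admin node, after putting mon3 into ceph.conf / DNS
ceph-deploy mon add mon3
ceph quorum_status --format json-pretty   # verify all three are in quorum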

Christian
> Best
> Simon
> 
> > 
> > Cheers,
> > Florian
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Suggestion: flag HEALTH_WARN state if monmap has 2 mons

2016-04-12 Thread Florian Haas
Hi everyone,

I wonder what others think about the following suggestion: running an
even number of mons almost never makes sense, and specifically two
mons never does at all. Wouldn't it make sense to just flag a
HEALTH_WARN state if the monmap contained an even number of mons, or
maybe only if the number of mons in the monmap is exactly 2?

Documentation on why this is a bad idea does exist and is actually
just as comprehensive as it needs to be
(http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/),
but it still seems to me that this is a very common rookie mistake.

Thoughts?

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs Kernel panic

2016-04-12 Thread Simon Ferber
Am 12.04.2016 um 12:09 schrieb Florian Haas:
> On Tue, Apr 12, 2016 at 11:53 AM, Simon Ferber
>  wrote:
>> Thank you! That's it. I have installed the kernel from the Jessie
>> backport. Now the crashes are gone.
>> How often do these things happen? It would be a worst-case scenario, if
>> a system update broke a production system.
> 
> For what it's worth, what you saw is kernel (i.e. client) side
> breakage. You didn't mess up your Ceph cluster, nor your CephFS
> metadata, nor any data. Also, anything you do in CephFS using a
> release before Jewel must be considered experimental, and while things
> will generally not break even on the client, you shouldn't be
> surprised if they do. Thirdly, my recommendation for any Ceph
> client-side kernel functionality (both rbd.ko and CephFS) would be to
> use nothing older than a 4.x kernel.

Thank you for clarification, Florian.

> 
> A good update on the current state of CephFS is this tech talk, which
> John Spray did in February:
> 
> https://www.youtube.com/watch?v=GbdHxL0vc9I
> slideshare.net/JohnSpray1/cephfs-update-february-2016
> 
> Also, please don't ever do this:
> 
> cluster 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
>  health HEALTH_OK
>  monmap e2: 2 mons at
> {ollie2=129.217.207.207:6789/0,stan2=129.217.207.206:6789/0}
> election epoch 12, quorum 0,1 stan2,ollie2
>  mdsmap e10: 1/1/1 up {0=ollie2=up:active}, 1 up:standby
>  osdmap e72: 8 osds: 8 up, 8 in
> flags sortbitwise
>   pgmap v137: 428 pgs, 4 pools, 2396 bytes data, 20 objects
> 281 MB used, 14856 GB / 14856 GB avail
>  428 active+clean
> 
> 2 mons. Never, and I repeat never, run your Ceph cluster with 2 mons.
> You want to run 3.

Thus if there are only two servers (which used to run DRBD), what would
be the best solution? Just grab another Linux server and install a ceph
cluster node with a monitor only and no OSDs?

Best
Simon

> 
> Cheers,
> Florian
> 


-- 
Simon Ferber
Techniker

Technische Universität Dortmund
Fakultät Statistik
Vogelpothsweg 87
44227 Dortmund

Tel.: +49 231-755 3188
Fax: +49 231-755 5305
simon.fer...@tu-dortmund.de
www.tu-dortmund.de



Important note: The information included in this e-mail is confidential.
It is solely intended for the recipient. If you are not the intended
recipient of this e-mail please contact the sender and delete this
message. Thank you.
Without prejudice of e-mail correspondence, our statements are only
legally binding when they are made in the conventional written form
(with personal signature) or when such documents are sent by fax.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs Kernel panic

2016-04-12 Thread Florian Haas
On Tue, Apr 12, 2016 at 11:53 AM, Simon Ferber
 wrote:
> Thank you! That's it. I have installed the kernel from the Jessie
> backport. Now the crashes are gone.
> How often do these things happen? It would be a worst-case scenario, if
> a system update broke a production system.

For what it's worth, what you saw is kernel (i.e. client) side
breakage. You didn't mess up your Ceph cluster, nor your CephFS
metadata, nor any data. Also, anything you do in CephFS using a
release before Jewel must be considered experimental, and while things
will generally not break even on the client, you shouldn't be
surprised if they do. Thirdly, my recommendation for any Ceph
client-side kernel functionality (both rbd.ko and CephFS) would be to
use nothing older than a 4.x kernel.

A good update on the current state of CephFS is this tech talk, which
John Spray did in February:

https://www.youtube.com/watch?v=GbdHxL0vc9I
slideshare.net/JohnSpray1/cephfs-update-february-2016

Also, please don't ever do this:

cluster 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
 health HEALTH_OK
 monmap e2: 2 mons at
{ollie2=129.217.207.207:6789/0,stan2=129.217.207.206:6789/0}
election epoch 12, quorum 0,1 stan2,ollie2
 mdsmap e10: 1/1/1 up {0=ollie2=up:active}, 1 up:standby
 osdmap e72: 8 osds: 8 up, 8 in
flags sortbitwise
  pgmap v137: 428 pgs, 4 pools, 2396 bytes data, 20 objects
281 MB used, 14856 GB / 14856 GB avail
 428 active+clean

2 mons. Never, and I repeat never, run your Ceph cluster with 2 mons.
You want to run 3.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs Kernel panic

2016-04-12 Thread Simon Ferber
Thank you! That's it. I have installed the kernel from the Jessie
backport. Now the crashes are gone.
How often do these things happen? It would be a worst-case scenario if
a system update broke a production system.

Best
Simon

Am 11.04.2016 um 16:58 schrieb Ilya Dryomov:
> On Mon, Apr 11, 2016 at 4:37 PM, Simon Ferber
>  wrote:
>> Hi,
>>
>> I'm trying to set up a ceph cluster on Debian 8.4. Mainly I followed a
>> tutorial at
>> http://adminforge.de/raid/ceph/ceph-cluster-unter-debian-wheezy-installieren/
>>
>> As far as I can see, the first steps are just working fine. I have two
>> nodes with four OSD on both nodes.
>> This is the output of ceph -s
>>
>> cluster 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
>>  health HEALTH_OK
>>  monmap e2: 2 mons at
>> {ollie2=129.217.207.207:6789/0,stan2=129.217.207.206:6789/0}
>> election epoch 12, quorum 0,1 stan2,ollie2
>>  mdsmap e10: 1/1/1 up {0=ollie2=up:active}, 1 up:standby
>>  osdmap e72: 8 osds: 8 up, 8 in
>> flags sortbitwise
>>   pgmap v137: 428 pgs, 4 pools, 2396 bytes data, 20 objects
>> 281 MB used, 14856 GB / 14856 GB avail
>>  428 active+clean
>>
>> Then I tried to add cephfs following the manual at
>> http://docs.ceph.com/docs/hammer/cephfs/createfs/ which seems to do its
>> magic:
>> root@stan2:~# ceph fs ls
>> name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
>>
>> However, as soon as I try to mount the cephfs with mount.ceph
>> 129.217.207.206:6789:/ /mnt/ -v -o
>> name=cephfs,secretfile=/etc/ceph/client.cephfs the server which tries to
>> mount crashes and has to be cold-started again. To be able to use
>> mount.ceph I had to install ceph-fs-common - if that matters...
>>
>> Here is the kernel.log. Can you give me hints? I am pretty stuck on this
>> for the last few days.
>>
>> Apr 11 16:25:02 stan2 kernel: [  171.086381] Key type ceph registered
>> Apr 11 16:25:02 stan2 kernel: [  171.086649] libceph: loaded (mon/osd
>> proto 15/24)
>> Apr 11 16:25:02 stan2 kernel: [  171.090582] FS-Cache: Netfs 'ceph'
>> registered for caching
>> Apr 11 16:25:02 stan2 kernel: [  171.090596] ceph: loaded (mds proto 32)
>> Apr 11 16:25:02 stan2 kernel: [  171.096727] libceph: client34164 fsid
>> 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
>> Apr 11 16:25:02 stan2 kernel: [  171.133832] libceph: mon0
>> 129.217.207.206:6789 session established
>> Apr 11 16:25:02 stan2 kernel: [  171.161199] [ cut here
>> ]
>> Apr 11 16:25:02 stan2 kernel: [  171.161239] kernel BUG at
>> /build/linux-lqALYs/linux-3.16.7-ckt25/fs/ceph/mds_client.c:1846!
>> Apr 11 16:25:02 stan2 kernel: [  171.161294] invalid opcode:  [#1] SMP
>> Apr 11 16:25:02 stan2 kernel: [  171.161328] Modules linked in: cbc ceph
>> libceph xfs libcrc32c crc32c_generic binfmt_misc mptctl mptbase nfsd
>> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc nls_utf8
>> nls_cp437 vfat fat x86_pkg_temp_thermal intel_powerclamp intel_rapl
>> coretemp kvm_intel kvm crc32_pclmul cryptd iTCO_wdt iTCO_vendor_support
>> efi_pstore efivars pcspkr joydev evdev ast i2c_i801 ttm drm_kms_helper
>> drm lpc_ich mfd_core mei_me mei shpchp ioatdma tpm_tis wmi tpm ipmi_si
>> ipmi_msghandler processor thermal_sys acpi_power_meter button acpi_pad
>> fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod hid_generic sg
>> usbhid hid sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul
>> crct10dif_common crc32c_intel ahci libahci ehci_pci mpt3sas igb
>> raid_class i2c_algo_bit xhci_hcd libata ehci_hcd scsi_transport_sas
>> i2c_core dca usbcore ptp usb_common scsi_mod pps_core
>> Apr 11 16:25:02 stan2 kernel: [  171.162046] CPU: 0 PID: 3513 Comm:
>> kworker/0:9 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
>> Apr 11 16:25:02 stan2 kernel: [  171.162104] Hardware name: Supermicro
>> SYS-6028R-WTR/X10DRW-i, BIOS 1.0c 01/07/2015
>> Apr 11 16:25:02 stan2 kernel: [  171.162158] Workqueue: ceph-msgr
>> con_work [libceph]
>> Apr 11 16:25:02 stan2 kernel: [  171.162194] task: 88103f2e8ae0 ti:
>> 88103bfbc000 task.ti: 88103bfbc000
>> Apr 11 16:25:02 stan2 kernel: [  171.162243] RIP:
>> 0010:[]  []
>> __prepare_send_request+0x801/0x810 [ceph]
>> Apr 11 16:25:02 stan2 kernel: [  171.162312] RSP: 0018:88103bfbfba8
>> EFLAGS: 00010283
>> Apr 11 16:25:02 stan2 kernel: [  171.162347] RAX: 88103f88ad42 RBX:
>> 88103f7f7400 RCX: 
>> Apr 11 16:25:02 stan2 kernel: [  171.162394] RDX: 164c5ec6 RSI:
>>  RDI: 88103f88ad32
>> Apr 11 16:25:02 stan2 kernel: [  171.162440] RBP: 88103f7f95e0 R08:
>>  R09: 
>> Apr 11 16:25:02 stan2 kernel: [  171.162485] R10:  R11:
>> 002c R12: 88103f7f7c00
>> Apr 11 16:25:02 stan2 kernel: [  171.162531] R13: 88103f88acc0 R14:
>>  R15: 88103f88ad3a
>> Apr 11 16:25:02 stan2 kernel: [  171.162578] FS:  

Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Max A. Krasilnikov
Hello!

On Tue, Apr 12, 2016 at 07:48:58AM +, Maxime.Guyot wrote:

> Hi Adrian,

> Looking at the documentation RadosGW has multi region support with the 
> “federated gateways” 
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical locales, 
> configuring Ceph Object Gateway regions and metadata synchronization agents 
> enables the service to maintain a global namespace, even though Ceph Object 
> Gateway instances run in different geographic locales and potentially on 
> different Ceph Storage Clusters.”

> Maybe that could do the trick for your multi metro EC pools?

> Disclaimer: I haven't tested the federated gateways RadosGW.

As I can see in the docs, Jewel should be able to perform per-image async mirroring:

There is new support for mirroring (asynchronous replication) of RBD images
across clusters. This is implemented as a per-RBD image journal that can be
streamed across a WAN to another site, and a new rbd-mirror daemon that performs
the cross-cluster replication.

© http://docs.ceph.com/docs/master/release-notes/

I will test it in 1-2 months, later this year :)
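
For the record, the per-image flow should look roughly like this on Jewel
(a sketch pieced together from the release notes and docs, untested; pool
and image names are made up):

# on both clusters: enable per-image mirroring mode for the pool
rbd mirror pool enable rbd image
# journaling (which requires exclusive-lock) must be enabled on the image
rbd feature enable rbd/myimage exclusive-lock
rbd feature enable rbd/myimage journaling
rbd mirror image enable rbd/myimage
# then run the rbd-mirror daemon on the peer cluster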

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph striping

2016-04-12 Thread Alwin Antreich

On 04/12/2016 01:48 AM, Christian Balzer wrote:
> On Mon, 11 Apr 2016 09:25:35 -0400 (EDT) Jason Dillaman wrote:
>
> > In general, RBD "fancy" striping can help under certain workloads where
> > small IO would normally be hitting the same object (e.g. small
> > sequential IO).
> >
>
> While the above is very true (especially for single/few clients), I never
> bothered to deploy fancy striping because you have to plan it very
> carefully, as you can't change it later on.
>
> For example if you start with 8 OSDs and set your striping accordingly (as
> Alwin's example suggested) but later add more OSDs you won't be taking
> full advantage of the IOPS availalbe.

I didn't think about that, thanks for pointing that out. As we have a mixed 
workload on our new cluster, VMs and cephfs
for login directories and sources, I am definitely going to test these settings.

>
> Christian
>

Thanks for your replies.

-- 
with best regards,
Alwin Antreich
IT Analyst
antre...@cognitec.com
Cognitec Systems GmbH
Grossenhainer Strasse 101
01127, Dresden
Germany

Geschäftsführer: Alfredo Herrera
Amtsgericht Dresden, HRB 20776
Tel.: +49-351-862-92 0
Fax: +49-351-862-92 10
i...@cognitec.com
http://www.cognitec.com
VAT ID: DE 222661897

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3cmd with RGW

2016-04-12 Thread Micha Krause

Hi,

> However, while creating a bucket using *s3cmd mb s3://buck* gives an error

message

DEBUG: ConnMan.get(): creating new connection:
http://buck.s3.amazonaws.com:7480
ERROR: [Errno 110] Connection timed out

Can anyone show a path forward to check this further?


Not sure if all of these settings are necessary, but I have set these variables
in .s3cfg to point to our radosgw servers:

cloudfront_host = rgw.noris.net
host_base = rgw.noris.net
host_bucket = %(bucket)s.rgw.noris.net
simpledb_host = rgw.noris.net

Also check your DNS settings; you should have a wildcard DNS record for your
base:

micha@micha:~$ host *.rgw.noris.net
*.rgw.noris.net has address 62.128.8.6
*.rgw.noris.net has address 62.128.8.7
*.rgw.noris.net has IPv6 address 2001:780:6::6
*.rgw.noris.net has IPv6 address 2001:780:6::7
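
The radosgw side also has to know its own hostname for bucket-style requests
to resolve; something like this in ceph.conf (a sketch - the section name
depends on how your gateway instance is named):

[client.radosgw.gateway]
rgw dns name = rgw.noris.net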

Micha Krause
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rebalance near full osd

2016-04-12 Thread Andrei Mikhailovsky
I've done the ceph osd reweight-by-utilization and it seems to have solved the
issue. However, I'm not sure if this will be the long-term solution.
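
For others hitting this, the command takes an optional utilization threshold
in percent (a sketch; check your version's defaults before running it):

# reweight only OSDs that are more than 20% above the average utilization
ceph osd reweight-by-utilization 120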

Thanks for your help

Andrei

- Original Message -
> From: "Shinobu Kinjo" 
> To: "Andrei Mikhailovsky" 
> Cc: "Christian Balzer" , "ceph-users" 
> 
> Sent: Friday, 8 April, 2016 01:35:18
> Subject: Re: [ceph-users] rebalance near full osd

> There was a discussion before regarding to the situation where you are
> facing now. [1]
> Would you have a look, if it's helpful or not for you.
> 
> [1]
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007622.html
> 
> Cheers,
> Shinobu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Max A. Krasilnikov
Hello!

On Mon, Apr 11, 2016 at 05:39:37PM -0400, sage wrote:

> Hi,

> ext4 has never been recommended, but we did test it.  After Jewel is out, 
> we would like explicitly recommend *against* ext4 and stop testing it.

1. Does filestore_xattr_use_omap fix issues with ext4? So, can I continue using
ext4 for a cluster with RBD && CephFS plus this option set to true? (See the
sketch below.)
2. Agree with Christian, it would be better to warn but not drop support for
legacy filesystems until old HW is out of service, 4-5 years.
3. Also, if BlueStore turns out to be so good, one would prefer to use it
instead of FileStore, so the fs deprecation would not be so painful.

I'm not that big a Ceph user, but I have limitations like Christian's, and
changing the fs would cost me 24 nights for now :(
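
For reference, the option from point 1 as a ceph.conf snippet (my assumption:
this only moves xattrs into omap, and it does not lift the ext4 long-name
limit Sage describes):

[osd]
filestore xattr use omap = true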

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] s3cmd with RGW

2016-04-12 Thread Daleep Singh Bais
Hi All,

I am trying to create a bucket using s3cmd on ceph radosgw. I am able to
get a list of buckets using

#s3cmd ls
2016-04-12 07:02  s3://my-new-bucket
2016-04-11 14:46  s3://new-bucket-6f2327c1

However, while creating a bucket using *s3cmd mb s3://buck* gives an error
message

DEBUG: ConnMan.get(): creating new connection:
http://buck.s3.amazonaws.com:7480
ERROR: [Errno 110] Connection timed out

Can anyone show a path forward to check this further?

Thanks.

Daleep Singh Bais

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Adrian Saul

At this stage the RGW component is down the line - pretty much just a concept
while we build out the RBD side first.

What I wanted to get out of EC was distributing the data across multiple DCs 
such that we were not simply replicating data - which would give us much better 
storage efficiency and redundancy. Some of what I had read in the past was
around using EC to spread data over multiple DCs to be able to sustain loss of 
multiple sites.  Most of this was implied fairly clearly in the documentation 
under "CHEAP MULTIDATACENTER STORAGE":

http://docs.ceph.com/docs/hammer/dev/erasure-coded-pool/

Although I note that section appears to have disappeared in the later
documentation versions.

It seems a little disheartening that much of this promise and capability for 
Ceph appears to be just not there in practice.






> -Original Message-
> From: Maxime Guyot [mailto:maxime.gu...@elits.com]
> Sent: Tuesday, 12 April 2016 5:49 PM
> To: Adrian Saul; Christian Balzer; 'ceph-users@lists.ceph.com'
> Subject: Re: [ceph-users] Mon placement over wide area
>
> Hi Adrian,
>
> Looking at the documentation RadosGW has multi region support with the
> “federated gateways”
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical
> locales, configuring Ceph Object Gateway regions and metadata
> synchronization agents enables the service to maintain a global namespace,
> even though Ceph Object Gateway instances run in different geographic
> locales and potentially on different Ceph Storage Clusters.”
>
> Maybe that could do the trick for your multi metro EC pools?
>
> Disclaimer: I haven't tested the federated gateways RadosGW.
>
> Best Regards
>
> Maxime Guyot
> System Engineer
>
>
>
>
>
>
>
>
>
> On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul"  boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au>
> wrote:
>
> >Hello again Christian :)
> >
> >
> >> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> >> > will be distributed over every major capital in Australia.The config
> >> > will be dual sites in each city that will be coupled as HA pairs - 12
> >> > sites in total.   The vast majority of CRUSH rules will place data
> >> > either locally to the individual site, or replicated to the other HA
> >> > site in that city.   However there are future use cases where I think we
> >> > could use EC to distribute data wider or have some replication that
> >> > puts small data sets across multiple cities.
> >> This will very, very, VERY much depend on the data (use case) in question.
> >
>The EC use case would be using RGW to act as an archival backup
>store
> >
> >> > The concern I have is around the placement of mons.  In the current
> >> > design there would be two monitors in each site, running separate to
> the
> >> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> >> > will also be a "tiebreaker" mon placed on a separate host which
> >> > will house some management infrastructure for the whole platform.
> >> >
> >> Yes, that's the preferable way, might want to up this to 5 mons so
> >> you can loose one while doing maintenance on another one.
> >> But if that would be a coupled, national cluster you're looking both
> >> at significant MON traffic, interesting "split-brain" scenarios and
> >> latencies as well (MONs get chosen randomly by clients AFAIK).
> >
> >In the case I am setting up it would be 2 per site plus the extra, so 25 -
> >but I
> fear that would make the mon syncing too heavy.  Once we
> build up to multiple sites though we can maybe reduce to one per site to
> reduce the workload on keeping the mons in sync.
> >
> >> > Obviously a concern is latency - the east coast to west coast
> >> > latency is around 50ms, and on the east coast it is 12ms between
> >> > Sydney and the other two sites, and 24ms Melbourne to Brisbane.
> >> In any situation other than "write speed doesn't matter at all"
> >> combined with "large writes, not small ones" and "read-mostly" you're
> >> going to be in severe pain.
> >
> >For data yes, but the main case for that would be backup data where it
> would be large writes, read rarely, and as long as streaming performance
> keeps up latency won't matter.   My concern with the latency would be how
> that impacts the monitors having to keep in sync and how that would impact
> client operations, especially with the rate of change that would occur with the
> predominant RBD use in most sites.
> >
> >> > Most of the data
> >> > traffic will remain local but if we create a single national
> >> > cluster then how much of an impact will it be having all the mons
> >> > needing to keep in sync, as well as monitor and communicate with
> >> > all OSDs (in the end goal design there will be some 2300+ OSDs).
> >> >
> >> Significant.
> >> I wouldn't suggest it, but even if you deploy differently I'd suggest

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Udo Lembke

Hi Sage,
we run ext4 only on our 8-node cluster with 110 OSDs and are quite happy
with ext4.

We started with xfs but the latency was much higher compared to ext4...

But we use RBD only, with "short" filenames like
rbd_data.335986e2ae8944a.000761e1.
If we can switch from Jewel to K* and, during the update, change the
filestore for each OSD to BlueStore, it will be OK for us.

I hope we will then get better performance with BlueStore??
Will BlueStore be production-ready during the Jewel lifetime, so that we
can switch to BlueStore before the next big upgrade?



Udo

Am 11.04.2016 um 23:39 schrieb Sage Weil:

Hi,

ext4 has never been recommended, but we did test it.  After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.

Why:

Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStore's filename
handling.  (There is a limit in the amount of xattr data ext4 can store in
the inode, which causes problems in LFNIndex.)

We *could* invest a ton of time rewriting this to fix, but it only affects
ext4, which we never recommended, and we plan to deprecate FileStore once
BlueStore is stable anyway, so it seems like a waste of time that would be
better spent elsewhere.

Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on BlueStore.

The long file name handling is problematic anytime someone is storing
rados objects with long names.  The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS.  Other librados users could be affected too, though, like users
with very long rbd image names (e.g., > 100 characters), or custom
librados users.

How:

To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len).  The OSD will complain that ext4
cannot store such an object and refuse to start.  A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully.  They would be taking a risk, though, because we would like
to stop testing on ext4.
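
For an RBD-only user accepting that risk, the override would presumably be a
plain ceph.conf knob - a sketch based on the option name above, not a tested
recommendation:

[osd]
osd max object name len = 64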

Is this reasonable?  If there are significant ext4 users that are unwilling to
recreate their OSDs, now would be the time to speak up.

Thanks!
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Maxime Guyot
Hi Adrian,

Looking at the documentation RadosGW has multi region support with the 
“federated gateways” 
(http://docs.ceph.com/docs/master/radosgw/federated-config/):
"When you deploy a Ceph Object Store service that spans geographical locales, 
configuring Ceph Object Gateway regions and metadata synchronization agents 
enables the service to maintain a global namespace, even though Ceph Object 
Gateway instances run in different geographic locales and potentially on 
different Ceph Storage Clusters.”

Maybe that could do the trick for your multi metro EC pools?

Disclaimer: I haven't tested the federated gateways RadosGW.
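
From the federated-config doc, the setup revolves around region/zone JSON
maps fed to radosgw-admin, roughly like this (a sketch; file and instance
names are made up, and I haven't run this either):

radosgw-admin region set --infile us.json --name client.radosgw.us-east-1
radosgw-admin zone set --rgw-zone=us-east --infile us-east.json \
    --name client.radosgw.us-east-1
radosgw-admin regionmap update --name client.radosgw.us-east-1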

Best Regards 

Maxime Guyot
System Engineer









On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul" 
 
wrote:

>Hello again Christian :)
>
>
>> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
>> > will be distributed over every major capital in Australia.The config
>> > will be dual sites in each city that will be coupled as HA pairs - 12
>> > sites in total.   The vast majority of CRUSH rules will place data
>> > either locally to the individual site, or replicated to the other HA
>> > site in that city.   However there are future use cases where I think we
>> > could use EC to distribute data wider or have some replication that puts
>> > small data sets across multiple cities.
>> This will very, very, VERY much depend on the data (use case) in question.
>
>The EC use case would be using RGW to act as an archival backup store
>
>> > The concern I have is around the placement of mons.  In the current
>> > design there would be two monitors in each site, running separate to the
>> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
>> > will also be a "tiebreaker" mon placed on a separate host which will
>> > house some management infrastructure for the whole platform.
>> >
>> Yes, that's the preferable way, might want to up this to 5 mons so you can
>> loose one while doing maintenance on another one.
>> But if that would be a coupled, national cluster you're looking both at
>> significant MON traffic, interesting "split-brain" scenarios and latencies as
>> well (MONs get chosen randomly by clients AFAIK).
>
>In the case I am setting up it would be 2 per site plus the extra, so 25 - but
>I fear that would make the mon syncing too heavy.  Once we build
>up to multiple sites though we can maybe reduce to one per site to reduce the
>workload on keeping the mons in sync.
>
>> > Obviously a concern is latency - the east coast to west coast latency
>> > is around 50ms, and on the east coast it is 12ms between Sydney and
>> > the other two sites, and 24ms Melbourne to Brisbane.
>> In any situation other than "write speed doesn't matter at all" combined with
>> "large writes, not small ones" and "read-mostly" you're going to be in severe
>> pain.
>
>For data yes, but the main case for that would be backup data where it would
>be large writes, read rarely, and as long as streaming performance keeps up
>latency won't matter.   My concern with the latency would be how that impacts
>the monitors having to keep in sync and how that would impact client
>operations, especially with the rate of change that would occur with the
>predominant RBD use in most sites.
>
>> > Most of the data
>> > traffic will remain local but if we create a single national cluster
>> > then how much of an impact will it be having all the mons needing to
>> > keep in sync, as well as monitor and communicate with all OSDs (in the
>> > end goal design there will be some 2300+ OSDs).
>> >
>> Significant.
>> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
>> run/setup and sharing the experience with us. ^.^
>
>Someone has to be the canary right :)
>
>> > The other options I  am considering:
>> > - split into east and west coast clusters, most of the cross city need
>> > is in the east coast, any data moves between clusters can be done with
>> > snap replication
>> > - city based clusters (tightest latency) but loose the multi-DC EC
>> > option, do cross city replication using snapshots
>> >
>> The later, I seem to remember that there was work in progress to do this
>> (snapshot replication) in an automated fashion.
>>
>> > Just want to get a feel for what I need to consider when we start
>> > building at this scale.
>> >
>> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
>> the only well known/supported way to do geo-replication with Ceph is via
>> RGW.
>
>iSCSI is working fairly well.  We have decided not to use Ceph for the
>latency-sensitive workloads, so while we are still working to keep that low, we
>won't be putting the heavier IOP or latency-sensitive workloads onto it until
>we get a better feel for how it behaves at scale and can be sure of the
>performance.
>
>As above - for the most part we are going 

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
I'd like to raise these points, then

1) some people (like me) will never ever use XFS if they have a choice;
given no choice, we will not use something that depends on XFS

2) choice is always good

3) doesn't majority of Ceph users only care about RBD?

(Angry rant coming)
Even our last performance testing of Ceph (Infernalis) showed abysmal
performance. The most damning sign is the consumption of CPU time at an
unprecedented rate. Was it faster than Dumpling? Slightly, but it ate more CPU
as well, so in effect it was not really "faster".

It would make *some* sense to only support ZFS or BTRFS because you can offload 
things like clones/snapshots and consistency to the filesystem - which would 
make the architecture much simpler and everything much faster.
Instead you insist on XFS and reimplement everything in software. I always 
dismissed this because CPU time was usually cheap, but in practice it simply
doesn't work.
You duplicate things that filesystems had solved for years now (namely crash 
consistency - though we have seen that fail as well), instead of letting them 
do their work and stripping the IO path to the bare necessity and letting 
someone smarter and faster handle that.

IMO, If Ceph was moving in the right direction there would be no "supported 
filesystem" debate, instead we'd be free to choose whatever is there that 
provides the guarantees we need from filesystem (which is usually every 
filesystem in the kernel) and Ceph would simply distribute our IO around with 
CRUSH.

Right now CRUSH (and in effect what it allows us to do with data) is _the_ 
reason people use Ceph, as there simply wasn't much else to use for distributed 
storage. This isn't true anymore and the alternatives are orders of magnitude 
faster and smaller.

Jan

P.S. If anybody needs a way out I think I found it, with no need to trust a 
higher power :P


> On 11 Apr 2016, at 23:44, Sage Weil  wrote:
> 
> On Mon, 11 Apr 2016, Sage Weil wrote:
>> Hi,
>> 
>> ext4 has never been recommended, but we did test it.  After Jewel is out, 
>> we would like explicitly recommend *against* ext4 and stop testing it.
> 
> I should clarify that this is a proposal and solicitation of feedback--we 
> haven't made any decisions yet.  Now is the time to weigh in.
> 
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I monitor current ceph operation at cluster

2016-04-12 Thread Christian Balzer
On Mon, 11 Apr 2016 10:01:15 +0100 Nick Fisk wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of nick
> > Sent: 11 April 2016 08:26
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] How can I monitor current ceph operation at
> > cluster
> > 
> > Hi,
> > > We're parsing the output of 'ceph daemon osd.N perf dump' for the
> > > admin sockets in /var/run/ceph/ceph-osd.*.asok on each node in our
> > cluster.
> > > We then push that data into carbon-cache/graphite and using grafana
> > > for visualization.
> > which of those values are you using for monitoring? I can see a lot of
> > numbers when doing a 'ceph daemon osd.N perf dump'. Do you know if
> > there is some documentation what each value means? I could only find:
> > http://docs.ceph.com/docs/hammer/dev/perf_counters/ which describes
> > the schema.
> 
> I'm currently going through them and trying to write a short doc
> explaining what each one measures. Are you just interested in the total
> number of read and write ops over the whole cluster?
> 
That would be much appreciated.

Note that I'm still seeing cache flushes that are not registering in ANY
(I checked them all) of the "osd_*bytes*" counters, but certainly do on the
actual disk.
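
For what it's worth, the collection side is simple enough to sketch (this
assumes readable admin sockets and python2; op_r/op_w are cumulative per-OSD
read/write op counters, so graph them as rates):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon "$sock" perf dump |
        python -c 'import json,sys; d=json.load(sys.stdin)["osd"]; print d["op_r"], d["op_w"]'
done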

Christian

> 
> > 
> > Best Regards
> > Nick
> > 
> > > Our numbers are much more consistent than yours appear.
> > >
> > > Bob
> > >
> > > On Thu, Apr 7, 2016 at 2:34 AM, David Riedl 
> > wrote:
> > > > Hi.
> > > >
> > > > I use this for my zabbix environment:
> > > >
> > > > https://github.com/thelan/ceph-zabbix/
> > > >
> > > > It works really well for me.
> > > >
> > > >
> > > > Regards
> > > >
> > > > David
> > > >
> > > > On 07.04.2016 11:20, Nick Fisk wrote:
> > > >   Hi.
> > > >
> > > > I have a small question about monitoring performance of a ceph cluster.
> > > >
> > > > We have a cluster with 5 nodes and 8 drives on each node, and 5
> > > > monitors, one on every node. For monitoring the cluster we use zabbix.
> > > > It asks every node every 30 seconds about current ceph operations and
> > > > gets a different result from every node.
> > > > first node: 350op/s
> > > > second node: 900op/s
> > > > third node: 200ops/s
> > > > fourth node:   700op/s
> > > > fifth node: 1200ops/
> > > >
> > > > I don't understand how I can get the total performance value of the
> > > > ceph cluster?
> > > >
> > > > Easy Answer
> > > > Capture and parse the output from "ceph -s", not 100% accurate, but
> > > > probably good enough for a graph
> > > >
> > > > Complex Answer
> > > > Use something like Graphite to capture all the counters for every
> > > > OSD and then use something like sumSeries to add all the op/s
> > > > counters
> > together.
> > > >
> > > >
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > > --
> > > > Mit freundlichen Grüßen
> > > >
> > > > David Riedl
> > > >
> > > >
> > > >
> > > > WINGcon GmbH Wireless New Generation - Consulting & Solutions
> > > >
> > > > Phone: +49 (0) 7543 9661 - 26
> > > > E-Mail: david.ri...@wingcon.com
> > > > Web: http://www.wingcon.com
> > > >
> > > > Sitz der Gesellschaft: Langenargen
> > > > Registergericht: ULM, HRB 632019
> > > > USt-Id.: DE232931635, WEEE-Id.: DE74015979
> > > > Geschäftsführer: Thomas Ehrle, Fritz R. Paul
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > --
> > Sebastian Nickel
> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Michael Metz-Martini | SpeedPartner GmbH
Hi,

Am 11.04.2016 um 23:39 schrieb Sage Weil:
> ext4 has never been recommended, but we did test it.  After Jewel is out, 
> we would like explicitly recommend *against* ext4 and stop testing it.
Hmmm. We're currently migrating away from xfs as we had some strange
performance issues which were resolved / got better by switching to
ext4. We think this is related to our high number of objects (4358
Mobjects according to ceph -s).


> Recently we discovered an issue with the long object name handling
> that is not fixable without rewriting a significant chunk of
> FileStores filename handling.  (There is a limit in the amount of
> xattr data ext4 can store in the inode, which causes problems in
> LFNIndex.)
We're only using cephfs, so we shouldn't be affected by the bug you
discovered, right?


-- 
Kind regards
 Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] Deprecating ext4 support

2016-04-12 Thread Loic Dachary
Hi Sage,

I suspect most people nowadays run tests and develop on ext4. Not supporting 
ext4 in the future means we'll need to find a convenient way for developers to 
run tests against the supported file systems.
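
One convenient middle ground is a throwaway loopback XFS image for the dev
cluster's data directory (a sketch; the mount point assumes a vstart-style
checkout):

truncate -s 20G /tmp/osd-xfs.img
mkfs.xfs -f /tmp/osd-xfs.img
sudo mount -o loop /tmp/osd-xfs.img ~/ceph/src/dev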

My 2cts :-)

On 11/04/2016 23:39, Sage Weil wrote:
> Hi,
> 
> ext4 has never been recommended, but we did test it.  After Jewel is out, 
> we would like explicitly recommend *against* ext4 and stop testing it.
> 
> Why:
> 
> Recently we discovered an issue with the long object name handling that is 
> not fixable without rewriting a significant chunk of FileStore's filename
> handling.  (There is a limit in the amount of xattr data ext4 can store in 
> the inode, which causes problems in LFNIndex.)
> 
> We *could* invest a ton of time rewriting this to fix, but it only affects 
> ext4, which we never recommended, and we plan to deprecate FileStore once 
> BlueStore is stable anyway, so it seems like a waste of time that would be 
> better spent elsewhere.
> 
> Also, by dropping ext4 test coverage in ceph-qa-suite, we can 
> significantly improve time/coverage for FileStore on XFS and on BlueStore.
> 
> The long file name handling is problematic anytime someone is storing 
> rados objects with long names.  The primary user that does this is RGW, 
> which means any RGW cluster using ext4 should recreate their OSDs to use 
> XFS.  Other librados users could be affected too, though, like users 
> with very long rbd image names (e.g., > 100 characters), or custom 
> librados users.
> 
> How:
> 
> To make this change as visible as possible, the plan is to make ceph-osd 
> refuse to start if the backend is unable to support the configured max 
> object name (osd_max_object_name_len).  The OSD will complain that ext4 
> cannot store such an object and refuse to start.  A user who is only using 
> RBD might decide they don't need long file names to work and can adjust 
> the osd_max_object_name_len setting to something small (say, 64) and run 
> successfully.  They would be taking a risk, though, because we would like 
> to stop testing on ext4.
> 
> Is this reasonable?  If there are significant ext4 users that are unwilling to
> recreate their OSDs, now would be the time to speak up.
> 
> Thanks!
> sage
> 
> ___
> Ceph-maintainers mailing list
> ceph-maintain...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph breizh meetup

2016-04-12 Thread eric mourgaya
Hi,

The next ceph breizh meetup will be organized at Nantes, on April 19th,
in the Suravenir building
at 2 Impasse Vasco de Gama, 44800 Saint-Herblain

Here is the doodle:

http://doodle.com/poll/3mxqqgfkn4ttpfib

See you soon at Nantes

-- 
Eric Mourgaya,


Let's respect the planet!
Let's fight mediocrity!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com