Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Huan Zhang
Since fio against /dev/rbd0 with sync=1 works well, it doesn't seem to be an issue with
the ceph servers; is it just related to the librbd (rbd_aio_flush) implementation?
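
For reference, a minimal sketch of the librbd-side test I mean, using fio's rbd
engine with the fsync=1 option Jason mentions below (pool/image/client names are
placeholders only):
---
fio --name=rbd-sync-write --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=test-img --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=60 --time_based
---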

2016-02-26 14:50 GMT+08:00 Huan Zhang :

> rbd engine with fsync=1 seems stuck.
> Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> 1244d:10h:39m:18s]
>
> But fio using /dev/rbd0 with sync=1 direct=1 ioengine=libaio iodepth=64 gets
> very high IOPS, ~35K, similar to a direct write.
>
> I'm confused by that result. IMHO, ceph could just ignore the sync cache
> command since it always uses sync writes to the journal, right?
>
> Why do we get such bad sync IOPS, and how does ceph handle it?
> Your reply would be very much appreciated!
>
> 2016-02-25 22:44 GMT+08:00 Jason Dillaman :
>
>> > 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't
>> actually
>> > work. Or it's not touching the same object (but I wonder whether write
>> > ordering is preserved at that rate?).
>>
>> The fio rbd engine does not support "sync=1"; however, it should support
>> "fsync=1" to accomplish roughly the same effect.
>>
>> Jason
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Huan Zhang
rbd engine with fsync=1 seems stuck.
Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
1244d:10h:39m:18s]

But fio using /dev/rbd0 with sync=1 direct=1 ioengine=libaio iodepth=64 gets
very high IOPS, ~35K, similar to a direct write.

I'm confused by that result. IMHO, ceph could just ignore the sync cache
command since it always uses sync writes to the journal, right?

Why do we get such bad sync IOPS, and how does ceph handle it?
Your reply would be very much appreciated!

2016-02-25 22:44 GMT+08:00 Jason Dillaman :

> > 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't
> actually
> > work. Or it's not touching the same object (but I wonder whether write
> > ordering is preserved at that rate?).
>
> The fio rbd engine does not support "sync=1"; however, it should support
> "fsync=1" to accomplish roughly the same effect.
>
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bug in rados bench with 0.94.6 (regression, not present in 0.94.5)

2016-02-25 Thread Christian Balzer

Hello,

On my crappy test cluster (Debian Jessie, Hammer 0.94.6) I'm seeing rados
bench crashing doing "seq" runs. 
As I'm testing cache tiers at the moment I also tried it with a normal,
replicated pool with the same result.

After creating some benchmark objects with:
---
rados -p data bench 20 write -t 32 --no-cleanup
---

A consecutive run of this ends in tears:
---
# rados -p data bench 10 seq -t 32 
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
rados: ./common/Mutex.h:96: void Mutex::_pre_unlock(): Assertion `nlock > 0' 
failed.
*** Caught signal (Aborted) **
 in thread 7f1894100780
 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: rados() [0x4e5e23]
 2: (()+0xf8d0) [0x7f18915268d0]
 3: (gsignal()+0x37) [0x7f188fde6067]
 4: (abort()+0x148) [0x7f188fde7448]
 5: (()+0x2e266) [0x7f188fddf266]
 6: (()+0x2e312) [0x7f188fddf312]
 7: (Mutex::Unlock()+0xb3) [0x4fda93]
 8: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0x127c) [0x4da37c]
 9: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, 
bool)+0x2df) [0x4ded8f]
 10: (main()+0xa664) [0x4be834]
 11: (__libc_start_main()+0xf5) [0x7f188fdd2b45]
 12: rados() [0x4c2c97]
2016-02-26 14:18:52.641052 7f1894100780 -1 *** Caught signal (Aborted) **
 in thread 7f1894100780
---

There's nothing particularly outstanding or malicious in the recent events,
here are the last 2:
---
-2> 2016-02-26 14:23:12.439214 7f18c113f780  1 -- 10.0.0.83:0/877189211 --> 
10.0.0.85:6804/2921 -- osd_op(client.31691145.0:34 
benchmark_data_engtest03_32406_object32 [read 0~4096] 0.def1bb6e 
ack+read+known_if_redirected e11724) v5 -- ?+0 0x39090d0 con 0x389bed0
-1> 2016-02-26 14:23:12.439930 7f18b4549700  1 -- 10.0.0.83:0/877189211 <== 
osd.11 10.0.0.34:6802/2973 1  osd_op_reply(9 
benchmark_data_engtest03_32406_object7 [read 0~4096] v0'0 uv15 ondisk = 0) v6 
 205+0+4096 (2792458300 0 1108541644) 0x7f1864000ca0 con 0x38bbf80
---

Note that "rand" works fine, as does "seq" on a 0.94.5 cluster.

While certainly not production related (or so one hopes!), this cinches it
for me: no upgrade to .6 tomorrow on the mission-critical cluster.

Also created a tracker issue, despite the resounding success of my previous
one (none; it probably was silently fixed ^o^):
http://tracker.ceph.com/issues/14873

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-25 Thread Robert LeBlanc
We replaced 32 S3500s with 48 Micron M600s in our production cluster. The
S3500s were only doing journals because they were too small and we still
ate 3-4% of their life in a couple of months. We started having high wait
times on the M600s so we got 6 S3610s, 6 M500dcs, and 6 500 GB M600s (they
have the SLC to MLC conversion that we thought might work better). And we
swapped out 18 of the M600s throughout our cluster with these test drives.
We have graphite gathering stats on the admin sockets for Ceph and the
standard system stats. We weighted the drives so they had the same byte
usage and let them run for a week or so, then made them the same percentage
of used space, let them run a couple of weeks, then set them to 80% full
and let them run a couple of weeks. We compared IOPS and IO time of the
drives to get our comparison. This was done on live production clusters and
not synthetic benchmarks. Some of the data about the S3500s is from my test
cluster that has them.
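
(A rough sketch of that kind of reweighting, for anyone curious -- the OSD ids
and weights below are purely illustrative, not the actual values we used:)
---
ceph osd crush reweight osd.12 0.87
ceph osd crush reweight osd.13 0.91
---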

Sent from a mobile device, please excuse any typos.
On Feb 25, 2016 9:20 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote:
>
> > We are moving to the Intel S3610, from our testing it is a good balance
> > between price, performance and longevity. But as with all things, do your
> > testing ahead of time. This will be our third model of SSDs for our
> > cluster. The S3500s didn't have enough life and performance tapers off
> > as it gets full. The Micron M600s looked good with the Sebastian journal
> > tests, but once in use for a while go downhill pretty bad. We also tested
> > Micron M500dc drives and they were on par with the S3610s and are more
> > expensive and are closer to EoL. The S3700s didn't have quite the same
> > performance as the S3610s, but they will last forever and are very stable
> > in terms of performance and have the best power loss protection.
> >
> That's interesting, how did you come to that conclusion and how did you test
> it?
> Also which models did you compare?
>
>
> > Short answer is test them for yourself to make sure they will work. You
> > are pretty safe with the Intel S3xxx drives. The Micron M500dc is also
> > pretty safe based on my experience. It had also been mentioned that
> > someone has had good experience with a Samsung DC Pro (has to have both
> > DC and Pro in the name), but we weren't able to get any quick enough to
> > test so I can't vouch for them.
> >
> I have some Samsung DC Pro EVOs in production (non-Ceph, see that
> non-barrier thread).
> They do have issues with LSI occasionally; haven't gotten around to making
> that FS non-barrier to see if it fixes things.
>
> The EVOs are also similar to the Intel DC S3500s, meaning that they are
> not really suitable for Ceph due to their endurance.
>
> Never tested the "real" DC Pro ones, but they are likely to be OK.
>
> Christian
>
> > Sent from a mobile device, please excuse any typos.
> > On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  wrote:
> >
> > > Hello,
> > >
> > > There has been a bunch of discussion about using SSD.
> > > Does anyone have any list of SSDs describing which SSD is highly
> > > recommended, which SSD is not.
> > >
> > > Rgds,
> > > Shinobu
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-25 Thread Robert LeBlanc
I was only testing one SSD per node and it used 3.5-4.5 cores on my 8 core
Atom boxes. I've also set these boxes to only 4 GB of RAM to reduce the
effects of page cache. So no, I still had some headroom, but I was also
running fio on my nodes too. I don't remember how much idle I had overall,
but there was some.

Sent from a mobile device, please excuse any typos.
On Feb 25, 2016 9:15 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> On Wed, 24 Feb 2016 23:01:43 -0700 Robert LeBlanc wrote:
>
> > With my S3500 drives in my test cluster, the latest master branch gave me
> > an almost 2x increase in performance compared to just a month or two ago.
> > There looks to be some really nice things coming in Jewel around SSD
> > performance. My drives are now 80-85% busy doing about 10-12K IOPS when
> > doing 4K fio to libRBD.
> >
> That's good news, but then again the future is always bright. ^o^
> Before that (or even now with the SSDs still 15% idle), were you
> exhausting your CPUs or are they also still not fully utilized as I am
> seeing below?
>
> Christian
>
> > Sent from a mobile device, please excuse any typos.
> > On Feb 24, 2016 8:10 PM, "Christian Balzer"  wrote:
> >
> > >
> > > Hello,
> > >
> > > For posterity and of course to ask some questions, here are my
> > > experiences with a pure SSD pool.
> > >
> > > SW: Debian Jessie, Ceph Hammer 0.94.5.
> > >
> > > HW:
> > > 2 nodes (thus replication of 2) with each:
> > > 2x E5-2623 CPUs
> > > 64GB RAM
> > > 4x DC S3610 800GB SSDs
> > > Infiniband (IPoIB) network
> > >
> > > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> > > Ceph journal is inline (journal file).
> > >
> > > Performance:
> > > A test run with "rados -p cache  bench 30 write -t 32" (4MB blocks)
> > > gives me about 620MB/s, the storage nodes are I/O bound (all SSDs are
> > > 100% busy according to atop) and this meshes nicely with the speeds I
> > > saw when testing the individual SSDs with fio before involving Ceph.
> > >
> > > To elaborate on that, an individual SSD of that type can do about
> > > 500MB/s sequential writes, so ideally you would see 1GB/s writes with
> > > Ceph (500*8 / 2 (replication) / 2 (journal on same disk)).
> > > However my experience tells me that other activities (FS journals,
> > > leveldb PG updates, etc) impact things as well.
> > >
> > > A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
> > > blocks) gives me about 7200 IOPS, the SSDs are about 40% busy.
> > > All OSD processes are using about 2 cores and the OS another 2, but
> > > that leaves about 6 cores unused (MHz on all cores scales to max
> > > during the test run).
> > > Closer inspection with all CPUs being displayed in atop shows that no
> > > single core is fully used, they all average around 40% and even the
> > > busiest ones (handling IRQs) still have ample capacity available.
> > > I'm wondering if this is an indication of insufficient parallelism or if
> > > it's latency of sorts.
> > > I'm aware of the many tuning settings for SSD based OSDs, however I was
> > > expecting to run into a CPU wall first and foremost.
> > >
> > >
> > > Write amplification:
> > > 10 second rados bench with 4MB blocks, 6348MB written in total.
> > > nand-writes per SSD:118*32MB=3776MB.
> > > 30208MB total written to all SSDs.
> > > Amplification:4.75
> > >
> > > Very close to what you would expect with a replication of 2 and
> > > journal on same disk.
> > >
> > >
> > > 10 second rados bench with 4KB blocks, 219MB written in total.
> > > nand-writes per SSD:41*32MB=1312MB.
> > > 10496MB total written to all SSDs.
> > > Amplification:48!!!
> > >
> > > Le ouch.
> > > In my use case with rbd cache on all VMs I expect writes to be rather
> > > large for the most part and not like this extreme example.
> > > But as I wrote the last time I did this kind of testing, this is an
> > > area where caveat emptor most definitely applies when planning and
> > > buying SSDs. And where the Ceph code could probably do with some
> > > attention.
> > >
> > > Regards,
> > >
> > > Christian
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can not disable rbd cache

2016-02-25 Thread Robert LeBlanc
My guess would be that if you are already running Hammer on the client, it
is already using the new watcher API. This would be a fix on the OSDs to
allow the object to be moved, because the current client is smart enough to
try again. It would be watchers per object.

Sent from a mobile device, please excuse any typos.
On Feb 25, 2016 9:10 PM, "Christian Balzer"  wrote:

> On Thu, 25 Feb 2016 10:07:37 -0500 (EST) Jason Dillaman wrote:
>
> > > > Let's start from the top. Where are you stuck with [1]? I have
> > > > noticed that after evicting all the objects with RBD that one object
> > > > for each active RBD is still left, I think this is the head object.
> > > Precisely.
> > > That came up in my extensive tests as well.
> >
> > Is this in reference to the RBD image header object (i.e. XYZ.rbd or
> > rbd_header.XYZ)?
> Yes.
>
> > The cache tier doesn't currently support evicting
> > objects that are being watched.  This guard was added to the OSD because
> > it wasn't previously possible to alert clients that a watched object had
> > encountered an error (such as it no longer exists in the cache tier).
> > Now that Hammer (and later) librbd releases will reconnect the watch on
> > error (eviction), perhaps this guard can be loosened [1].
> >
> > [1] http://tracker.ceph.com/issues/14865
> >
>
> How do I interpret "all watchers" in the issue above?
> As in, all watchers of an object, or all watchers in general.
>
> If it is per object (which I guess/hope), then this fix would mean that
> after an upgrade to Hammer or later on the client side a restart of the VM
> would allow the header object to be evicted, while the header objects for
> VMs that have been running since the dawn of time cannot.
>
> Correct?
>
> This would definitely be better than having to stop the VM, flush things
> and then start it up again.
>
> Christian
>
> > --
> >
> > Jason
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Christian Balzer

Hello,

On Thu, 25 Feb 2016 23:09:52 -0600 Adam Tygart wrote:

> The docs are already split by version, although it doesn't help that
> it isn't linked in an obvious manner.
> 
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> 
> http://docs.ceph.com/docs/hammer/rados/operations/cache-tiering/
>
Indeed, but besides finding it, somebody going explicitly to the Hammer
version would be foiled by the fact that it has NOT been updated like
master to reflect the need for setting absolute cache sizes. 
Which at best would be confusing during tests and at worst in production
is liable to run your cluster into the ground...
 
>  Updating the documentation takes a lot of effort by all involved, and
> in a project this size, it probably needs a team of people. From what
> I can tell, all the documentation is in the ceph source tree, and
> submitting pull requests/tickets is probably a good option to keep it
> up to date. From my perspective it is also our failure (the users),
> not updating the docs when we run into issues.
>
I have a feeling some dedicated editors, including knowledgeable and vetted
volunteers, would do a better job than just spamming PRs, which tend to be
forgotten/ignored by the already overworked devs.

Christian

> --
> Adam
> 
> On Thu, Feb 25, 2016 at 10:59 PM, Nigel Williams
>  wrote:
> > On Fri, Feb 26, 2016 at 3:10 PM, Christian Balzer 
> > wrote:
> >>
> >> Then we come to a typical problem for fast evolving SW like Ceph,
> >> things that are not present in older versions.
> >
> >
> > I was going to post on this too (I had similar frustrations), and
> > would like to propose that a move to splitting the documentation by
> > versions:
> >
> > OLD
> > http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> >
> >
> > NEW
> > http://docs.ceph.com/docs/master/hammer/rados/operations/cache-tiering/
> >
> > http://docs.ceph.com/docs/master/infernalis/rados/operations/cache-tiering/
> >
> > http://docs.ceph.com/docs/master/jewel/rados/operations/cache-tiering/
> >
> > and so on.
> >
> > When a new version is started, the documentation should be 100% cloned
> > and the tree restructured around the version. It could equally be a
> > drop-down on the page to select the version.
> >
> > Postgres for example uses a similar mechanism:
> >
> > http://www.postgresql.org/docs/
> >
> > Note the version numbers are embedded in the URL. I like their
> > commenting mechanism too as it provides a running narrative of changes
> > that should be considered as practice develops around things to do or
> > avoid.
> >
> > Once the documentation is cloned for the new version, all the
> > inapplicable material should be removed and the new features/practice
> > changes should be added.
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Adam Tygart
Unfortunately, what seems to happen is that as users (and developers) get
more in tune with software projects, we forget what is and isn't common
knowledge.

Perhaps said "wall of text" should be a glossary of terms: a
definition list, something that can be kept open in one tab and that defines
any ceph-specific or domain-specific terms. Maybe linking back to the
glossary for any specific instance of that term. Maybe there should be
a glossary per topic, as cephfs has its own set of domain-specific
language that isn't necessarily any use to those using rbd.

Comment systems are great, until you need people to moderate them, and
then that takes time away from people that could either be developing
the software or updating documentation.

On Thu, Feb 25, 2016 at 11:24 PM, Nigel Williams
 wrote:
> On Fri, Feb 26, 2016 at 4:09 PM, Adam Tygart  wrote:
>> The docs are already split by version, although it doesn't help that
>> it isn't linked in an obvious manner.
>>
>> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>
> Is there any reason to keep this "master" (version-less variant) given
> how much confusion it causes?
>
> I think I noticed the version split one time back but it didn't lodge
> in my mind, and when I looked for something today I hit the "master"
> and there were no hits for the version (which I should have been
> looking at).
>
> I'd be glad to contribute to the documentation effort. For example I
> would like to be able to ask questions around the terminology that is
> scattered through the documentation that I think needs better
> explanation. I'm not sure if pull-requests that try to annotate what
> is there would mean some parts would become a wall of text whereas the
> explanation would be better suited as a (more informal) comment-thread
> at the bottom of the page that can be browsed (mainly by beginners
> trying to navigate an unfamiliar architecture).
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Nigel Williams
On Fri, Feb 26, 2016 at 4:09 PM, Adam Tygart  wrote:
> The docs are already split by version, although it doesn't help that
> it isn't linked in an obvious manner.
>
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

Is there any reason to keep this "master" (version-less variant) given
how much confusion it causes?

I think I noticed the version split one time back but it didn't lodge
in my mind, and when I looked for something today I hit the "master"
and there were no hits for the version (which I should have been
looking at).

I'd be glad to contribute to the documentation effort. For example I
would like to be able to ask questions around the terminology that is
scattered through the documentation that I think needs better
explanation. I'm not sure if pull-requests that try to annotate what
is there would mean some parts would become a wall of text whereas the
explanation would be better suited as a (more informal) comment-thread
at the bottom of the page that can be browsed (mainly by beginners
trying to navigate an unfamiliar architecture).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Adam Tygart
The docs are already split by version, although it doesn't help that
it isn't linked in an obvious manner.

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

http://docs.ceph.com/docs/hammer/rados/operations/cache-tiering/

 Updating the documentation takes a lot of effort by all involved, and
in a project this size, it probably needs a team of people. From what
I can tell, all the documentation is in the ceph source tree, and
submitting pull requests/tickets is probably a good option to keep it
up to date. From my perspective it is also our failure (the users),
not updating the docs when we run into issues.
--
Adam

On Thu, Feb 25, 2016 at 10:59 PM, Nigel Williams
 wrote:
> On Fri, Feb 26, 2016 at 3:10 PM, Christian Balzer  wrote:
>>
>> Then we come to a typical problem for fast evolving SW like Ceph, things
>> that are not present in older versions.
>
>
> I was going to post on this too (I had similar frustrations), and would like
> to propose that a move to splitting the documentation by versions:
>
> OLD
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>
>
> NEW
> http://docs.ceph.com/docs/master/hammer/rados/operations/cache-tiering/
>
> http://docs.ceph.com/docs/master/infernalis/rados/operations/cache-tiering/
>
> http://docs.ceph.com/docs/master/jewel/rados/operations/cache-tiering/
>
> and so on.
>
> When a new version is started, the documentation should be 100% cloned and
> the tree restructured around the version. It could equally be a drop-down on
> the page to select the version.
>
> Postgres for example uses a similar mechanism:
>
> http://www.postgresql.org/docs/
>
> Note the version numbers are embedded in the URL. I like their commenting
> mechanism too as it provides a running narrative of changes that should be
> considered as practice develops around things to do or avoid.
>
> Once the documentation is cloned for the new version, all the inapplicable
> material should be removed and the new features/practice changes should be
> added.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Christian Balzer

Hello,

On Fri, 26 Feb 2016 15:59:51 +1100 Nigel Williams wrote:

> On Fri, Feb 26, 2016 at 3:10 PM, Christian Balzer  wrote:
> 
> > Then we come to a typical problem for fast evolving SW like Ceph,
> > things that are not present in older versions.
> 
> 
> I was going to post on this too (I had similar frustrations), and would
> like to propose that a move to splitting the documentation by versions:
> 
> OLD
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> 
> 
> NEW
> http://docs.ceph.com/docs/master/hammer/rados/operations/cache-tiering/
> 
> http://docs.ceph.com/docs/master/infernalis/rados/operations/cache-tiering/
> 
> http://docs.ceph.com/docs/master/jewel/rados/operations/cache-tiering/
>
Yup, that's a nice approach; besides Postgres, Ganeti and MySQL use that
setup as well, off the top of my head.

Given that backports in the past have introduced new features
(osd_scrub_sleep comes to mind), an even finer grained split by actual
version number might be called for.
 
Christian

> and so on.
> 
> When a new version is started, the documentation should be 100% cloned
> and the tree restructured around the version. It could equally be a
> drop-down on the page to select the version.
> 
> Postgres for example uses a similar mechanism:
> 
> http://www.postgresql.org/docs/
> 
> Note the version numbers are embedded in the URL. I like their commenting
> mechanism too as it provides a running narrative of changes that should
> be considered as practice develops around things to do or avoid.
> 
> Once the documentation is cloned for the new version, all the
> inapplicable material should be removed and the new features/practice
> changes should be added.


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Nigel Williams
On Fri, Feb 26, 2016 at 3:10 PM, Christian Balzer  wrote:

> Then we come to a typical problem for fast evolving SW like Ceph, things
> that are not present in older versions.


I was going to post on this too (I had similar frustrations), and would
like to propose that a move to splitting the documentation by versions:

OLD
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/


NEW
http://docs.ceph.com/docs/master/hammer/rados/operations/cache-tiering/

http://docs.ceph.com/docs/master/infernalis/rados/operations/cache-tiering/

http://docs.ceph.com/docs/master/jewel/rados/operations/cache-tiering/

and so on.

When a new version is started, the documentation should be 100% cloned and
the tree restructured around the version. It could equally be a drop-down
on the page to select the version.

Postgres for example uses a similar mechanism:

http://www.postgresql.org/docs/

Note the version numbers are embedded in the URL. I like their commenting
mechanism too as it provides a running narrative of changes that should be
considered as practice develops around things to do or avoid.

Once the documentation is cloned for the new version, all the inapplicable
material should be removed and the new features/practice changes should be
added.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-25 Thread Christian Balzer

Hello,

On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote:

> We are moving to the Intel S3610, from our testing it is a good balance
> between price, performance and longevity. But as with all things, do your
> testing ahead of time. This will be our third model of SSDs for our
> cluster. The S3500s didn't have enough life and performance tapers off
> as it gets full. The Micron M600s looked good with the Sebastian journal
> tests, but once in use for a while go downhill pretty bad. We also tested
> Micron M500dc drives and they were on par with the S3610s and are more
> expensive and are closer to EoL. The S3700s didn't have quite the same
> performance as the S3610s, but they will last forever and are very stable
> in terms of performance and have the best power loss protection.
> 
That's interesting, how did you come to that conclusion and how did you test
it?
Also which models did you compare?


> Short answer is test them for yourself to make sure they will work. You
> are pretty safe with the Intel S3xxx drives. The Micron M500dc is also
> pretty safe based on my experience. It had also been mentioned that
> someone has had good experience with a Samsung DC Pro (has to have both
> DC and Pro in the name), but we weren't able to get any quick enough to
> test so I can't vouch for them.
>
I have some Samsung DC Pro EVOs in production (non-Ceph, see that
non-barrier thread). 
They do have issues with LSI occasionally; haven't gotten around to making
that FS non-barrier to see if it fixes things.

The EVOs are also similar to the Intel DC S3500s, meaning that they are
not really suitable for Ceph due to their endurance.

Never tested the "real" DC Pro ones, but they are likely to be OK.

Christian

> Sent from a mobile device, please excuse any typos.
> On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  wrote:
> 
> > Hello,
> >
> > There has been a bunch of discussion about using SSD.
> > Does anyone have any list of SSDs describing which SSD is highly
> > recommended, which SSD is not.
> >
> > Rgds,
> > Shinobu
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-25 Thread Christian Balzer

Hello,

On Wed, 24 Feb 2016 23:01:43 -0700 Robert LeBlanc wrote:

> With my S3500 drives in my test cluster, the latest master branch gave me
> an almost 2x increase in performance compared to just a month or two ago.
> There looks to be some really nice things coming in Jewel around SSD
> performance. My drives are now 80-85% busy doing about 10-12K IOPS when
> doing 4K fio to libRBD.
> 
That's good news, but then again the future is always bright. ^o^
Before that (or even now with the SSDs still 15% idle), were you
exhausting your CPUs or are they also still not fully utilized as I am
seeing below?

Christian

> Sent from a mobile device, please excuse any typos.
> On Feb 24, 2016 8:10 PM, "Christian Balzer"  wrote:
> 
> >
> > Hello,
> >
> > For posterity and of course to ask some questions, here are my
> > experiences with a pure SSD pool.
> >
> > SW: Debian Jessie, Ceph Hammer 0.94.5.
> >
> > HW:
> > 2 nodes (thus replication of 2) with each:
> > 2x E5-2623 CPUs
> > 64GB RAM
> > 4x DC S3610 800GB SSDs
> > Infiniband (IPoIB) network
> >
> > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> > Ceph journal is inline (journal file).
> >
> > Performance:
> > A test run with "rados -p cache  bench 30 write -t 32" (4MB blocks)
> > gives me about 620MB/s, the storage nodes are I/O bound (all SSDs are
> > 100% busy according to atop) and this meshes nicely with the speeds I
> > saw when testing the individual SSDs with fio before involving Ceph.
> >
> > To elaborate on that, an individual SSD of that type can do about
> > 500MB/s sequential writes, so ideally you would see 1GB/s writes with
> > Ceph (500*8 / 2 (replication) / 2 (journal on same disk)).
> > However my experience tells me that other activities (FS journals,
> > leveldb PG updates, etc) impact things as well.
> >
> > A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
> > blocks) gives me about 7200 IOPS, the SSDs are about 40% busy.
> > All OSD processes are using about 2 cores and the OS another 2, but
> > that leaves about 6 cores unused (MHz on all cores scales to max
> > during the test run).
> > Closer inspection with all CPUs being displayed in atop shows that no
> > single core is fully used, they all average around 40% and even the
> > busiest ones (handling IRQs) still have ample capacity available.
> > I'm wondering if this is an indication of insufficient parallelism or if
> > it's latency of sorts.
> > I'm aware of the many tuning settings for SSD based OSDs, however I was
> > expecting to run into a CPU wall first and foremost.
> >
> >
> > Write amplification:
> > 10 second rados bench with 4MB blocks, 6348MB written in total.
> > nand-writes per SSD:118*32MB=3776MB.
> > 30208MB total written to all SSDs.
> > Amplification:4.75
> >
> > Very close to what you would expect with a replication of 2 and
> > journal on same disk.
> >
> >
> > 10 second rados bench with 4KB blocks, 219MB written in total.
> > nand-writes per SSD:41*32MB=1312MB.
> > 10496MB total written to all SSDs.
> > Amplification:48!!!
> >
> > Le ouch.
> > In my use case with rbd cache on all VMs I expect writes to be rather
> > large for the most part and not like this extreme example.
> > But as I wrote the last time I did this kind of testing, this is an
> > area where caveat emptor most definitely applies when planning and
> > buying SSDs. And where the Ceph code could probably do with some
> > attention.
> >
> > Regards,
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-25 Thread Christian Balzer
On Thu, 25 Feb 2016 16:41:59 -0500 (EST) Shinobu Kinjo wrote:

> > Just beware of HBA compatibility, even in passthrough mode some crappy
> > firmwares can try and be smart about what you can do (LSI-Avago, I'm
> > looking your way for crippling TRIM, seriously WTH).
> 
> This is very good to know.
> Can anybody elaborate on this a bit more?
> 
I suggest you read the entire "XFS and nobarriers on Intel SSD" thread on
this very ML, the first ML hit when googling "Ceph LSI SSD".

Note I saw problems with un-patched S3610s (never with 3700s) even when
connected to the onboard Intel SATA controller. 
After updating their firmware this was fixed, but for good measure I also
went no-barrier and upgraded to the latest Debian kernel (4.3), for all
the good that will do.
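
(To be clear, "no-barrier" here just means mounting the OSD filesystem with the
nobarrier option, roughly as sketched below -- only sane on drives with working
power-loss protection; the mount point is illustrative:)
---
mount -o remount,nobarrier /var/lib/ceph/osd/ceph-0
---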

Christian
> Rgds,
> Shinobu
> 
> - Original Message -
> From: "Jan Schermer" 
> To: "Nick Fisk" 
> Cc: "Robert LeBlanc" , "Shinobu Kinjo"
> , ceph-users@lists.ceph.com Sent: Thursday, February
> 25, 2016 11:10:41 PM Subject: Re: [ceph-users] List of SSDs
> 
> We are very happy with S3610s in our cluster.
> We had to flash a new firmware because of latency spikes (NCQ-related),
> but had zero problems after that... Just beware of HBA compatibility,
> even in passthrough mode some crappy firmwares can try and be smart
> about what you can do (LSI-Avago, I'm looking your way for crippling
> TRIM, seriously WTH).
> 
> Jan
> 
> 
> > On 25 Feb 2016, at 14:48, Nick Fisk  wrote:
> > 
> > There’s two factors really
> >  
> > 1.   Suitability for use in ceph
> > 2.   Number of people using them
> >  
> > For #1, there are a number of people using various different drives,
> > so lots of options. The blog articled linked is a good place to start. 
> > For #2 and I think this is quite important. Lots of people use the
> > S3xx’s intel drives. This means any problems you face will likely have
> > a lot of input from other people. Also you are less likely to face
> > surprises, as most usage cases have already been covered.
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Robert LeBlanc
> > Sent: 25 February 2016 05:56
> > To: Shinobu Kinjo
> > Cc: ceph-users
> > Subject: Re: [ceph-users] List of SSDs
> >
> > We are moving to the Intel
> > S3610, from our testing it is a good balance between price,
> > performance and longevity. But as with all things, do your testing
> > ahead of time. This will be our third model of SSDs for our cluster.
> > The S3500s didn't have enough life and performance tapers off add it
> > gets full. The Micron M600s looked good with the Sebastian journal
> > tests, but once in use for a while go downhill pretty bad. We also
> > tested Micron M500dc drives and they were on par with the S3610s and
> > are more expensive and are closer to EoL. The S3700s didn't have quite
> > the same performance as the S3610s, but they will last forever and are
> > very stable in terms of performance and have the best power loss
> > protection. 
> > 
> > Short answer is test them for yourself to make sure they will work.
> > You are pretty safe with the Intel S3xxx drives. The Micron M500dc is
> > also pretty safe based on my experience. It had also been mentioned
> > that someone has had good experience with a Samsung DC Pro (has to
> > have both DC and Pro in the name), but we weren't able to get any
> > quick enough to test so I can't vouch for them. 
> > 
> > Sent from a mobile device, please excuse any typos.
> > 
> > On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  wrote:
> >
> > Hello,
> >
> > There has been a bunch of discussion about using SSD.
> > Does anyone have any list of SSDs describing which SSD is highly
> > recommended, which SSD is not.
> > 
> > Rgds,
> > Shinobu
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > ___ ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/

Re: [ceph-users] Can not disable rbd cache

2016-02-25 Thread Christian Balzer
On Thu, 25 Feb 2016 10:07:37 -0500 (EST) Jason Dillaman wrote:

> > > Let's start from the top. Where are you stuck with [1]? I have
> > > noticed that after evicting all the objects with RBD that one object
> > > for each active RBD is still left, I think this is the head object.
> > Precisely.
> > That came up in my extensive tests as well.
> 
> Is this in reference to the RBD image header object (i.e. XYZ.rbd or
> rbd_header.XYZ)? 
Yes.

> The cache tier doesn't currently support evicting
> objects that are being watched.  This guard was added to the OSD because
> it wasn't previously possible to alert clients that a watched object had
> encountered an error (such as it no longer exists in the cache tier).
> Now that Hammer (and later) librbd releases will reconnect the watch on
> error (eviction), perhaps this guard can be loosened [1].
> 
> [1] http://tracker.ceph.com/issues/14865
> 

How do I interpret "all watchers" in the issue above?
As in, all watchers of an object, or all watchers in general.

If it is per object (which I guess/hope), then this fix would mean that
after an upgrade to Hammer or later on the client side a restart of the VM
would allow the header object to be evicted, while the header objects for
VMs that have been running since the dawn of time cannot.

Correct?

This would definitely be better than having to stop the VM, flush things
and then start it up again.
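
(As an aside, a quick sketch of how one can check which clients still hold a
watch on a given header object; the pool name and image id are placeholders:)
---
rados -p rbd listwatchers rbd_header.<image_id>
---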

Christian

> --
> 
> Jason
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

2016-02-25 Thread Christian Balzer
On Thu, 25 Feb 2016 13:44:30 - Nick Fisk wrote:

> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Jason Dillaman
> > Sent: 25 February 2016 01:30
> > To: Christian Balzer 
> > Cc: ceph-us...@ceph.com
> > Subject: Re: [ceph-users] ceph hammer : rbd info/Status : operation not
> > supported (95) (EC+RBD tier pools)
> > 
> > I'll speak to what I can answer off the top of my head.  The most important
> > point is that this issue is only related to EC pool base tiers, not
> > replicated pools.
> > 
> > > Hello Jason (Ceph devs et al),
> > >
> > > On Wed, 24 Feb 2016 13:15:34 -0500 (EST) Jason Dillaman wrote:
> > >
> > > > If you run "rados -p  ls | grep "rbd_id."
> > > > and don't see that object, you are experiencing that issue [1].
> > > >
> > > > You can attempt to work around this issue by running "rados -p
> > > > irfu-virt setomapval rbd_id. dummy value" to
> > > > force-promote the object to the cache pool.  I haven't tested /
> > > > verified that will alleviate the issue, though.
> > > >
> > > > [1] http://tracker.ceph.com/issues/14762
> > > >
> > >
> > > This concerns me greatly, as I'm about to phase in a cache tier this
> > > weekend into a very busy, VERY mission critical Ceph cluster.
> > > That is on top of a replicated pool, Hammer.
> > >
> > > That issue and the related git blurb are less than crystal clear, so
> > > for my and everybody else's benefit could you elaborate a bit more on
> > this?
> > >
> > > 1. Does this only affect EC base pools?
> > 
> > Correct -- this is only an issue because EC pools do not directly
> > support several operations required by RBD.  Placing a replicated
> > cache tier in front of
> > an EC pool was, in effect, a work-around to this limitation.
> > 
> > > 2. Is this a regression of sorts, and when did it come about?
> > >I have a hard time imagining people not running into this earlier,
> > >unless that problem is very hard to trigger.
> > > 3. One assumes that this isn't fixed in any released version of Ceph,
> > >correct?
> > >
> > > Robert, sorry for CC'ing you, but AFAICT your cluster is about the
> > > closest approximation in terms of busyness to mine here.
> > > And I assume that you're not using EC pools (since you need
> > > performance, not space) and haven't experienced this bug at all?
> > >
> > > Also, would you consider the benefits of the recency fix (thanks for
> > > that) being worth risk of being an early adopter of 0.94.6?
> > > In other words, are you eating your own dog food already and 0.94.6
> > > hasn't eaten your data babies yet? ^o^
> > 
> > Per the referenced email chain, it was potentially the recency fix that
> > exposed this issue for EC pools fronted by a cache tier.
> 
> Just to add. It's possible this bug was present for a while, but the
> broken recency logic effectively always promoted blocks regardless. Once
> this was fixed and ceph could actually make a decision of whether a
> block needed to be promoted or not this bug surfaced. You can always set
> the recency to 0 (possibly 1) and have the same behaviour as before the
> recency fix to ensure that you won't hit this bug.
> 
Of course the only immediate reason for me to go to 0.94.6 would be the
recency fix. ^o^

But it seems to be clear that only EC base pools are affected, which I
don't have and likely never will.

Christian

> > 
> > >
> > > Regards,
> > >
> > > Christian
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > >
> > 
> > --
> > 
> > Jason Dillaman
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] State of Ceph documention

2016-02-25 Thread Christian Balzer

Hello,

I know somebody will ask me to open a tracker issue, etc, but I feel
sufficiently frustrated to rant a bit here.

Case in point:

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

Let me start on a positive note, though. 
Somebody pretty recently added the much needed detail that you NEED to set
the absolute size and that the relative options do depend on it. 
That will help a lot of people doing this for the first time.

The whole thing is a bit sparse, the hit type has only one option, so why
not default to it instead of having to set it manually?

In the hit set section we get an example of:
---
ceph osd pool set {cachepool} hit_set_count 1
ceph osd pool set {cachepool} hit_set_period 3600
ceph osd pool set {cachepool} target_max_bytes 1
---
The last one is out of place here; that bit is explained later, and if
people cut and paste (cargo-culting), it is a likely source of mistakes.
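
For anyone cargo-culting anyway, a more complete sketch of what the page says
needs setting (the hit set type plus the absolute sizes the relative ratios
depend on); all values below are purely illustrative:
---
ceph osd pool set {cachepool} hit_set_type bloom
ceph osd pool set {cachepool} hit_set_count 1
ceph osd pool set {cachepool} hit_set_period 3600
ceph osd pool set {cachepool} target_max_bytes 1099511627776
ceph osd pool set {cachepool} target_max_objects 1000000
ceph osd pool set {cachepool} cache_target_dirty_ratio 0.4
ceph osd pool set {cachepool} cache_target_full_ratio 0.8
---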

Then we come to a typical problem for fast evolving SW like Ceph, things
that are not present in older versions.
If a new option/parameter is added, state clearly from which version this
has been introduced.
Neither "cache_target_dirty_high_ratio" nor
"min_write_recency_for_promote" are present in the latest Hammer.

The first one fills me with dread when thinking about the potential I/O
storms when the cache gets full, the latter removes some of the reasons
why I would want to go to 0.94.6 for the recency fix.
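
(For reference, in releases and docs that do have them, they are set like any
other pool option; values purely illustrative:)
---
ceph osd pool set {cachepool} cache_target_dirty_high_ratio 0.6
ceph osd pool set {cachepool} min_write_recency_for_promote 1
---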

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

2016-02-25 Thread Christian Balzer

Hello Robert,

Thanks for the speedy reply. 

On Wed, 24 Feb 2016 22:44:47 -0700 Robert LeBlanc wrote:

> We have not seen this issue, but we don't run EC pools yet (we are
> waiting for multiple layers to be available). 

Yeah, that seems to be the consensus here, only EC is affected.

> We are not running 0.94.6
> in production yet either. We have adopted the policy to only run released
> versions in production unless there is a really pressing need to have a
> patch. 

Well, it has been released since Wednesday. ^o^

But then again we don't update things here either unless we're hitting a
bug.
If not for the need of a working cache tier, we'd still be on Firefly.

>We are running 0.94.6 through our alpha and staging clusters and
> hoping to do the upgrade in the next couple of weeks. We won't know how
> much the recency fix will help until then because we have not been able
> to replicate our workload with fio accurately enough to get good test
> results. 

I had/have high hopes for working recency, as it will avoid getting the
cache filled with cold objects and, in turn, having to evict stuff and
pound the base pool again.
Alas I just found that write recency isn't supported with Hammer, see
another mail soon.

> Unfortunately we will probably be swapping out our M600s with
> S3610s. We've burned through 30% of the life in 2 months and they have
> 8x the op latency. 
Ouch, that's quite the wear-out. 
Aside from the SSDs having insufficient endurance, what level of
write-amplification are you seeing on average?
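
(For comparison, this is roughly how I arrived at the 4.75 figure in my earlier
post -- assuming NAND writes are reported in 32 MiB units, 8 SSDs, and 6348MB
written by the client during the 4MB rados bench run:)
---
echo "scale=2; (118 * 32 * 8) / 6348" | bc    # => 4.75
---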

As for the 3610s, make sure to update their firmware before deployment. 

> Due to the 10 Minutes of Terror, we are going to have
> to do both at the same time to reduce the impact. Luckily, when you have
> weighted out OSDs or empty ones, it is much less impactful. If you get
> your upgrade done before ours, I'd like to know how it went. I'll be
> posting the results from ours when it is done.
> 
I think I'll pass on 0.94.6 for the moment, as I seem to have found
another bug, more on that later if I can confirm it.
Right now I'm rebooting my entire test cluster to make sure this isn't a
residual effect from doing multiple upgrades w/o ever rebooting nodes.

Christian

> Sent from a mobile device, please excuse any typos.
> On Feb 24, 2016 5:43 PM, "Christian Balzer"  wrote:
> 
> >
> > Hello Jason (Ceph devs et al),
> >
> > On Wed, 24 Feb 2016 13:15:34 -0500 (EST) Jason Dillaman wrote:
> >
> > > If you run "rados -p  ls | grep "rbd_id." and
> > > don't see that object, you are experiencing that issue [1].
> > >
> > > You can attempt to work around this issue by running "rados -p
> > > irfu-virt setomapval rbd_id. dummy value" to
> > > force-promote the object to the cache pool.  I haven't tested /
> > > verified that will alleviate the issue, though.
> > >
> > > [1] http://tracker.ceph.com/issues/14762
> > >
> >
> > This concerns me greatly, as I'm about to phase in a cache tier this
> > weekend into a very busy, VERY mission critical Ceph cluster.
> > That is on top of a replicated pool, Hammer.
> >
> > That issue and the related git blurb are less than crystal clear, so
> > for my and everybody else's benefit could you elaborate a bit more on
> > this?
> >
> > 1. Does this only affect EC base pools?
> > 2. Is this a regression of sorts, and when did it come about?
> >I have a hard time imagining people not running into this earlier,
> >unless that problem is very hard to trigger.
> > 3. One assumes that this isn't fixed in any released version of Ceph,
> >correct?
> >
> > Robert, sorry for CC'ing you, but AFAICT your cluster is about the
> > closest approximation in terms of busyness to mine here.
> > And I assume that you're not using EC pools (since you need
> > performance, not space) and haven't experienced this bug at all?
> >
> > Also, would you consider the benefits of the recency fix (thanks for
> > that) being worth risk of being an early adopter of 0.94.6?
> > In other words, are you eating your own dog food already and 0.94.6
> > hasn't eaten your data babies yet? ^o^
> >
> > Regards,
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why my cluster performance is so bad?

2016-02-25 Thread yang
Thanks for your suggestion,
I have re-tested my cluster and the result is much better.


Regards,
Yang

-- Original --
From:  "Christian Balzer";;
Date:  Tue, Feb 23, 2016 09:49 PM
To:  "ceph-users";
Cc:  "yang";
Subject:  Re: [ceph-users] Why my cluster performance is so bad?

Hello,

This is sort of a FAQ, google is your friend.

For example find the recent thread "Performance Testing of CEPH on ARM
MicroServer" in this ML which addresses some points pertinent to your query.
Read it, I will reference things from it below

On Tue, 23 Feb 2016 19:55:22 +0800 yang wrote:

> My ceph cluster config:
Kernel, OS, Ceph version.

> 7 nodes(including 3 mons, 3 mds).
> 9 SATA HDD in every node and each HDD as an OSD(deployed by
What replication, default of 3?

That would give the theoretical IOPS of 21 HDDs, but your slow (more
precisely high latency) network and lack of SSD journals mean it will be
even lower than that.

> ceph-deploy). CPU:  32core
> Mem: 64GB
> public network: 1Gbx2 bond0,
> cluster network: 1Gbx2 bond0.
Latency in that kind of network will slow you down, especially when doing
small I/Os.

> 
As always, atop is a very nice tool to find where the bottlenecks and
hotspots are; you will have to run it preferably on all storage nodes with
nice large terminal windows to get the most out of it, though.

> The read bw is 109910KB/s for 1M-read, and 34329KB/s for 1M-write.
> Why is it so bad?

Because your testing is flawed.

> Anyone who can give me some suggestion?
>
For starters to get a good baseline, do rados bench tests (see thread)
with the default block size (4MB) and 4KB size.
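
Something along these lines (pool name and runtime are placeholders; the -b
4096 run gives you the 4KB baseline):
---
rados -p rbd bench 60 write -t 32 --no-cleanup
rados -p rbd bench 60 write -t 32 -b 4096
---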

> 
> fio jobfile:
> [global]
> direct=1
> thread
Not sure how this affects things versus the default of fork.

> ioengine=psync
Definitely never used this, either use libaio or the rbd engine in newer
fio versions.

> size=10G
> runtime=300
> time_based
> iodepth=10
This is your main problem, Ceph/RBD does not do well with a low number of
threads.
Simply because you're likely to hit just a single OSD for a prolonged
time, thus getting more or less single disk speeds.

See more about this in the results below.
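
As a sketch of what I mean, a command-line variant of your 4k random write job
with libaio and a real queue depth (filename and size as in your job file):
---
fio --name=write4k-rand --filename=/mnt/rbd/data --size=10G --direct=1 \
    --ioengine=libaio --iodepth=32 --numjobs=8 --bs=4k --rw=randwrite \
    --runtime=300 --time_based --group_reporting
---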

> group_reporting
> stonewall
> filename=/mnt/rbd/data

Are we to assume that this is mounted via the kernel RBD module?
Where, different client node that's not part of the cluster?
Which FS?

> 
> [read1M]
> bs=1M
> rw=read
> numjobs=1
> name=read1M
> 
> [write1M]
> bs=1M
> rw=write
> numjobs=1
> name=write1M
> 
> [read4k-seq]
> bs=4k
> rw=read
> numjobs=8
> name=read4k-seq
> 
> [read4k-rand]
> bs=4k
> rw=randread
> numjobs=8
> name=read4k-rand
> 
> [write4k-seq]
> bs=4k
> rw=write
> numjobs=8
> name=write4k-seq
> 
> [write4k-rand]
> bs=4k
> rw=randwrite
> numjobs=8
> name=write4k-rand
> 
> 
> and the fio result is as follows:
> 
> read1M: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=10
> write1M: (g=1): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=psync,
> iodepth=10 read4k-seq: (g=2): rw=read, bs=4K-4K/4K-4K/4K-4K,
> ioengine=psync, iodepth=10 ...
> read4k-rand: (g=3): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync,
> iodepth=10 ...
> write4k-seq: (g=4): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync,
> iodepth=10 ...
> write4k-rand: (g=5): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync,
> iodepth=10 ...
> fio-2.3
> Starting 34 threads
> read1M: Laying out IO file(s) (1 file(s) / 10240MB)
> Jobs: 8 (f=8): [_(26),w(8)] [18.8% done] [0KB/1112KB/0KB /s] [0/278/0
> iops] [eta 02h:10m:00s] read1M: (groupid=0, jobs=1): err= 0: pid=17606:
> Tue Feb 23 14:28:45 2016 read : io=32201MB, bw=109910KB/s, iops=107,
> runt=37msec clat (msec): min=1, max=74, avg= 9.31, stdev= 2.78
>  lat (msec): min=1, max=74, avg= 9.31, stdev= 2.78
> clat percentiles (usec):
>  |  1.00th=[ 1448],  5.00th=[ 2040], 10.00th=[ 3952],
> 20.00th=[ 9792], | 30.00th=[ 9920], 40.00th=[ 9920], 50.00th=[ 9920],
> 60.00th=[10048], | 70.00th=[10176], 80.00th=[10304], 90.00th=[10688],
> 95.00th=[10944], | 99.00th=[11968], 99.50th=[19072], 99.90th=[27008],
> 99.95th=[29568], | 99.99th=[38144]
> bw (KB  /s): min=93646, max=139912, per=100.00%, avg=110022.09,
> stdev=7759.48 lat (msec) : 2=4.20%, 4=5.98%, 10=43.37%, 20=46.00%,
> 50=0.45% lat (msec) : 100=0.01%
>   cpu  : usr=0.05%, sys=0.81%, ctx=32209, majf=0, minf=1055
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,

According to this output, the IO depth was actually 1, not 10, probably
caused by the choice of your engine or the threads option.
And this explains a LOT of your results.

Regards,

Christian
> >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >64=0.0%, >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0% issued: total=r=32201/w=0/d=0, short=r=0/w=0/d=0,
> >drop=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=10
> write1M: (groupid=1, 

Re: [ceph-users] "ceph-installer" in GitHub

2016-02-25 Thread Shinobu Kinjo
Thank you for the pointer.

Rgds,
Shinobu

- Original Message -
From: "Yuri Weinstein" 
To: "Shinobu Kinjo" 
Cc: "Ken Dreyer" , "ceph-devel" 
, "ceph-users" 
Sent: Friday, February 26, 2016 8:01:36 AM
Subject: Re: [ceph-users] "ceph-installer" in GitHub

The code is here https://github.com/ceph/ceph-installer

Thx
YuriW

On Thu, Feb 25, 2016 at 2:57 PM, Shinobu Kinjo  wrote:
> Where should I go to get "ceph-installer" source code?
>
> Rgds,
> Shinobu
>
> - Original Message -
> From: "Ken Dreyer" 
> To: "ceph-devel" , "ceph-users" 
> 
> Sent: Friday, February 26, 2016 6:07:54 AM
> Subject: [ceph-users] "ceph-installer" in GitHub
>
> Hi folks,
>
> A few of us at RH are working on a project called "ceph-installer",
> which is a Pecan web app that exposes endpoints for running
> ceph-ansible under the hood.
>
> The idea is that other applications will be able to consume this REST
> API in order to orchestrate Ceph installations.
>
> Another team within Red Hat is also working on a GUI component that
> will interact with the ceph-installer web service, and that is
> https://github.com/skyrings
>
> These are all nascent projects that are very much works-in-progress,
> and so the workflows are very rough and there are a hundred things we
> could do to improve the experience and integration, etc. We welcome
> feedback from the rest of the community :)
>
> - Ken
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph-installer" in GitHub

2016-02-25 Thread Yuri Weinstein
The code is here https://github.com/ceph/ceph-installer

Thx
YuriW

On Thu, Feb 25, 2016 at 2:57 PM, Shinobu Kinjo  wrote:
> Where should I go to get "ceph-installer" source code?
>
> Rgds,
> Shinobu
>
> - Original Message -
> From: "Ken Dreyer" 
> To: "ceph-devel" , "ceph-users" 
> 
> Sent: Friday, February 26, 2016 6:07:54 AM
> Subject: [ceph-users] "ceph-installer" in GitHub
>
> Hi folks,
>
> A few of us at RH are working on a project called "ceph-installer",
> which is a Pecan web app that exposes endpoints for running
> ceph-ansible under the hood.
>
> The idea is that other applications will be able to consume this REST
> API in order to orchestrate Ceph installations.
>
> Another team within Red Hat is also working on a GUI component that
> will interact with the ceph-installer web service, and that is
> https://github.com/skyrings
>
> These are all nascent projects that are very much works-in-progress,
> and so the workflows are very rough and there are a hundred things we
> could do to improve the experience and integration, etc. We welcome
> feedback from the rest of the community :)
>
> - Ken
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph-installer" in GitHub

2016-02-25 Thread Shinobu Kinjo
Where should I go to get "ceph-installer" source code?

Rgds,
Shinobu

- Original Message -
From: "Ken Dreyer" 
To: "ceph-devel" , "ceph-users" 

Sent: Friday, February 26, 2016 6:07:54 AM
Subject: [ceph-users] "ceph-installer" in GitHub

Hi folks,

A few of us at RH are working on a project called "ceph-installer",
which is a Pecan web app that exposes endpoints for running
ceph-ansible under the hood.

The idea is that other applications will be able to consume this REST
API in order to orchestrate Ceph installations.

Another team within Red Hat is also working on a GUI component that
will interact with the ceph-installer web service, and that is
https://github.com/skyrings

These are all nascent projects that are very much works-in-progress,
and so the workflows are very rough and there are a hundred things we
could do to improve the experience and integration, etc. We welcome
feedback from the rest of the community :)

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-25 Thread Jan Schermer

> On 25 Feb 2016, at 22:41, Shinobu Kinjo  wrote:
> 
>> Just beware of HBA compatibility, even in passthrough mode some crappy 
>> firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
>> your way for crippling TRIM, seriously WTH).
> 
> This is very good to know.
> Can anybody elaborate on this a bit more?
> 

To some degree, it's been a while since I investigated this.
For TRIM/discard to work, you need to have
1) a working TRIM/discard command on the drive
2) the scsi/libata layer (?) to somehow detect how many blocks can be discarded at 
once and what the block size is, etc.
those properties are found in /sys/block/xxx/queue/discard_*

3) filesystem that supports discard (and it looks at those discard_* properties 
to determine when/what to discard).
4) there are also flags (hdparm -I shows them) for what happens after trim - either 
the data is zeroed or random data is returned (it is possible to TRIM a sector 
and then read the original data - it doesn't actually need to erase anything, 
it simply marks that sector as unused in a bitmap and GC does its magic when it 
feels like it, if ever)

RAID controllers need to have some degree of control over this, because they 
need to be able to compare the drive contents when scrubbing (the same probably 
somehow applies to mdraid) either by maintaining some bitmap of used blocks or 
by trusting the drives to be deterministic. If you discard a sector on a HW 
RAID, both drives need to start returning the same data or scrubbing will fail. 
Some drives guarantee that and some don't.
You either have DRAT - Deterministic Read After Trim (this only guarantees 
that the data doesn't change, but it can be random)
or you have DZAT - Deterministic read Zero After Trim (subsequent reads only 
return NULLs)
or you can have none of the above (which is no big deal, except for RAID).

Even though I don't use LSI HBAs in IR (RAID) mode, the firmware doesn't like 
that my drives don't have DZAT/DRAT (or rather didn't, this doesn't apply to 
the Intels I have now) and crippled the discard_* parameters to try and 
disallow the use of TRIM. And it mostly works because the filesystem doesn't 
have the discard_* parameters it needs for discard to work...
... BUT it doesn't cripple the TRIM command itself so running hdparm 
--trim-sector-ranges still works (lol) and I suppose if those discard_* 
parameters were made read/write (actually I found a patch that does exactly 
that back then) then we could re-enable trim in spite of the firmware nonsense, 
but with modern SSDs it's mostly pointless anyway and LSI sucks, so who cares 
:-)

*
Sorry if I mixed some layers, maybe it's not the filesystem that calls discard but 
another layer in the kernel; also not sure exactly how the discard_* values are 
detected and when, but in essence it works like that.
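
For anyone wanting to poke at their own setup, a small sketch of the relevant
probes (the device name is an example):

# what the kernel currently allows to be discarded on this device
cat /sys/block/sda/queue/discard_max_bytes
cat /sys/block/sda/queue/discard_granularity

# whether the drive advertises TRIM and deterministic (DRAT/DZAT) behaviour
hdparm -I /dev/sda | grep -i trim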

Jan



> Rgds,
> Shinobu
> 
> - Original Message -
> From: "Jan Schermer" 
> To: "Nick Fisk" 
> Cc: "Robert LeBlanc" , "Shinobu Kinjo" 
> , ceph-users@lists.ceph.com
> Sent: Thursday, February 25, 2016 11:10:41 PM
> Subject: Re: [ceph-users] List of SSDs
> 
> We are very happy with S3610s in our cluster.
> We had to flash a new firmware because of latency spikes (NCQ-related), but 
> had zero problems after that...
> Just beware of HBA compatibility, even in passthrough mode some crappy 
> firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
> your way for crippling TRIM, seriously WTH).
> 
> Jan
> 
> 
>> On 25 Feb 2016, at 14:48, Nick Fisk  wrote:
>> 
>> There’s two factors really
>> 
>> 1.   Suitability for use in ceph
>> 2.   Number of people using them
>> 
>> For #1, there are a number of people using various different drives, so lots 
>> of options. The blog articled linked is a good place to start.
>> 
>> For #2 and I think this is quite important. Lots of people use the S3xx’s 
>> intel drives. This means any problems you face will likely have a lot of 
>> input from other people. Also you are less likely to face surprises, as most 
>> usage cases have already been covered. 
>> 
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>> ] On Behalf Of Robert LeBlanc
>> Sent: 25 February 2016 05:56
>> To: Shinobu Kinjo >
>> Cc: ceph-users >
>> Subject: Re: [ceph-users] List of SSDs
>> 
>> We are moving to the Intel S3610, from our testing it is a good balance 
>> between price, performance and longevity. But as with all things, do your 
>> testing ahead of time. This will be our third model of SSDs for our cluster. 
>> The S3500s didn't have enough life and performance tapers off add it gets 
>> full. The Micron M600s looked good with the Sebastian journal tests, but 
>> once in use for a while go downhill pretty bad. 

Re: [ceph-users] List of SSDs

2016-02-25 Thread Shinobu Kinjo
> Just beware of HBA compatibility, even in passthrough mode some crappy 
> firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
> your way for crippling TRIM, seriously WTH).

This is very good to know.
Can anybody elaborate on this a bit more?

Rgds,
Shinobu

- Original Message -
From: "Jan Schermer" 
To: "Nick Fisk" 
Cc: "Robert LeBlanc" , "Shinobu Kinjo" 
, ceph-users@lists.ceph.com
Sent: Thursday, February 25, 2016 11:10:41 PM
Subject: Re: [ceph-users] List of SSDs

We are very happy with S3610s in our cluster.
We had to flash a new firmware because of latency spikes (NCQ-related), but had 
zero problems after that...
Just beware of HBA compatibility, even in passthrough mode some crappy 
firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
your way for crippling TRIM, seriously WTH).

Jan


> On 25 Feb 2016, at 14:48, Nick Fisk  wrote:
> 
> There’s two factors really
>  
> 1.   Suitability for use in ceph
> 2.   Number of people using them
>  
> For #1, there are a number of people using various different drives, so lots 
> of options. The blog articled linked is a good place to start.
>  
> For #2 and I think this is quite important. Lots of people use the S3xx’s 
> intel drives. This means any problems you face will likely have a lot of 
> input from other people. Also you are less likely to face surprises, as most 
> usage cases have already been covered. 
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Robert LeBlanc
> Sent: 25 February 2016 05:56
> To: Shinobu Kinjo >
> Cc: ceph-users >
> Subject: Re: [ceph-users] List of SSDs
>  
> We are moving to the Intel S3610, from our testing it is a good balance 
> between price, performance and longevity. But as with all things, do your 
> testing ahead of time. This will be our third model of SSDs for our cluster. 
> The S3500s didn't have enough life and performance tapers off add it gets 
> full. The Micron M600s looked good with the Sebastian journal tests, but once 
> in use for a while go downhill pretty bad. We also tested Micron M500dc 
> drives and they were on par with the S3610s and are more expensive and are 
> closer to EoL. The S3700s didn't have quite the same performance as the 
> S3610s, but they will last forever and are very stable in terms of 
> performance and have the best power loss protection. 
> 
> Short answer is test them for yourself to make sure they will work. You are 
> pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty 
> safe based on my experience. It had also been mentioned that someone has had 
> good experience with a Samsung DC Pro (has to have both DC and Pro in the 
> name), but we weren't able to get any quick enough to test so I can't vouch 
> for them. 
> 
> Sent from a mobile device, please excuse any typos.
> 
> On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  > wrote:
> Hello,
> 
> There has been a bunch of discussion about using SSD.
> Does anyone have any list of SSDs describing which SSD is highly recommended, 
> which SSD is not.
> 
> Rgds,
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-02-25 Thread Odintsov Vladislav
Hi,

am I right, that official 0.94.5 el6 was built here?
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos6-5-amd64-basic/log.cgi?log=9764da52395923e0b32908d83a9f7304401fee43

If yes, it seems like the hammer autobuild has been broken for more than one month 
(the first failed build is from the 11th of Jan):
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos6-5-amd64-basic/#origin/hammer

Error "sudo: no tty present and no askpass program specified".
Maybe there were changes in Jenkins?
How is the build called? From Jenkins via ssh? Maybe it should be called with 
ssh -t?
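
For illustration only (the real Jenkins job isn't visible from here), the two
usual workarounds for that sudo error are forcing a pseudo-TTY on the ssh call
or exempting the build user from requiretty; the host, user and script names
below are made up:

# force a pseudo-TTY so sudo can run without an askpass helper
ssh -t builduser@buildhost 'sudo ./build-ceph.sh'

# or, on the build host, relax requiretty for the build user
echo 'Defaults:builduser !requiretty' > /etc/sudoers.d/builduser-notty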


Regards,

Vladislav Odintsov



From: ceph-users  on behalf of Alfredo Deza 

Sent: Thursday, February 25, 2016 22:59
To: Udo Lembke
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] v0.94.6 Hammer released

On Thu, Feb 25, 2016 at 7:00 AM, Udo Lembke  wrote:
> Hi,
>
> Am 24.02.2016 um 17:27 schrieb Alfredo Deza:
>> On Wed, Feb 24, 2016 at 4:31 AM, Dan van der Ster  
>> wrote:
>>> Thanks Sage, looking forward to some scrub randomization.
>>>
>>> Were binaries built for el6? http://download.ceph.com/rpm-hammer/el6/x86_64/
>>
>> We are no longer building binaries for el6. Just for Centos 7, Ubuntu
>> Trusty, and Debian Jessie.
>>
> this means that our proxmox-ve server 3.4, which run debian wheezy, could not 
> be updated from ceph 0.94.5 to 0.94.6!
> The OSD-nodes run's wheezy too - they can be upgraded. But the MONs must be 
> also upgraded (first).
>
> I can understand, that newer versions will not supplied to an older OS, but 
> stop from minor.5 to minor.6 makes realy no
> sense to me.
>
> Of course, I can update to proxmox-ve 4.x, which is jessie based, but in this 
> case I have trouble with DRBD...

It would be really nice if the community could step up to help us out
in building binaries. Building Ceph is non-trivial and coming
up with all the different distros, distro versions, and architectures
(at some point we were close to 12 variations) is a tremendous
effort.

>
>
> Udo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd not removed from crush map after ceph osd crush remove

2016-02-25 Thread Stillwell, Bryan
Have you tried restarting each OSD one-by-one to see if that clears up the
problem?

Also, what does the output of this command look like:

ceph osd dump | grep 'replicated size'


As for whether or not 'ceph pg repair' will work, I doubt it.  It uses the
copy on the primary OSD to fix the other OSDs in the PG.  From the
information you've provided, it looks to me like the PGs that are on osd.4 only
have one copy.  This seems like it would make 'ceph pg repair' fail
to do anything, since the only copy is on an OSD that is out of the cluster.
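
For illustration, a restart-one-at-a-time pass plus the replica-count check
could look roughly like this (OSD ids and the init flavour are assumptions;
adjust for your distro):

# restart OSDs one by one, letting the cluster settle in between
for id in 0 1 2 3; do
    sudo service ceph restart osd.$id
    sleep 60
    ceph health detail
done

# confirm every pool really has more than one replica
ceph osd dump | grep 'replicated size'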

Bryan

On 2/24/16, 2:30 AM, "Dimitar Boichev" 
wrote:

>I think this happened because of the wrongly removed OSD...
>A bug maybe ?
>
>Do you think that "ceph pg repair" will force the remove of the PG from
>the missing osd ?
>I am concerned about executing "pg repair" or "osd lost" because maybe it
>will decide that the stuck one is the right data and try to do stuff with
>it and discard the active running copy ..
>
>
>Regards.
>
>Dimitar Boichev
>SysAdmin Team Lead
>AXSMarine Sofia
>Phone: +359 889 22 55 42
>Skype: dimitar.boichev.axsmarine
>E-mail: dimitar.boic...@axsmarine.com
>
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>Stillwell, Bryan
>Sent: Tuesday, February 23, 2016 7:31 PM
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>crush remove
>
>Dimitar,
>
>I would agree with you that getting the cluster into a healthy state
>first is probably the better idea.  Based on your pg query, it appears
>like you're using only 1 replica.  Any ideas why that would be?
>
>The output should look like this (with 3 replicas):
>
>osdmap e133481 pg 11.1b8 (11.1b8) -> up [13,58,37] acting [13,58,37]
>
>Bryan
>
>From:  Dimitar Boichev 
>Date:  Tuesday, February 23, 2016 at 1:08 AM
>To:  CTG User , "ceph-users@lists.ceph.com"
>
>Subject:  RE: [ceph-users] osd not removed from crush map after ceph osd
>crush remove
>
>
>>Hello,
>>Thank you Bryan.
>>
>>I was just trying to upgrade to hammer or upper but before that I was
>>wanting to get the cluster in Healthy state.
>>Do you think it is safe to upgrade now first to latest firefly then to
>>Hammer ?
>>
>>
>>Regards.
>>
>>Dimitar Boichev
>>SysAdmin Team Lead
>>AXSMarine Sofia
>>Phone: +359 889 22 55 42
>>Skype: dimitar.boichev.axsmarine
>>E-mail:
>>dimitar.boic...@axsmarine.com
>>
>>
>>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
>>On Behalf Of Stillwell, Bryan
>>Sent: Tuesday, February 23, 2016 1:51 AM
>>To: ceph-users@lists.ceph.com
>>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>>crush remove
>>
>>
>>
>>Dimitar,
>>
>>
>>
>>I'm not sure why those PGs would be stuck in the stale+active+clean
>>state.  Maybe try upgrading to the 0.80.11 release to see if it's a bug
>>that was fixed already?  You can use the 'ceph tell osd.*  version'
>>command after the upgrade to make sure all OSDs are running the new
>>version.  Also since firefly (0.80.x) is near its EOL, you should
>>consider upgrading to hammer (0.94.x).
>>
>>
>>
>>As for why osd.4 didn't get fully removed, the last command you ran
>>isn't correct.  It should be 'ceph osd rm 4'.  Trying to remember when
>>to use the CRUSH name (osd.4) versus the OSD number (4)  can be a pain.
>>
>>
>>
>>Bryan
>>
>>
>>
>>From: ceph-users  on behalf of
>>Dimitar Boichev 
>>Date: Monday, February 22, 2016 at 1:10 AM
>>To: Dimitar Boichev ,
>>"ceph-users@lists.ceph.com" 
>>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>>crush remove
>>
>>
>>
>>>Anyone ?
>>>
>>>Regards.
>>>
>>>
>>>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
>>>On Behalf Of Dimitar Boichev
>>>Sent: Thursday, February 18, 2016 5:06 PM
>>>To: ceph-users@lists.ceph.com
>>>Subject: [ceph-users] osd not removed from crush map after ceph osd
>>>crush remove
>>>
>>>
>>>
>>>Hello,
>>>I am running a tiny cluster of 2 nodes.
>>>ceph -v
>>>ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>
>>>One osd died and I added a new osd (not replacing the old one).
>>>After that I wanted to remove the failed osd completely from the
>>>cluster.
>>>Here is what I did:
>>>ceph osd reweight osd.4 0.0
>>>ceph osd crush reweight osd.4 0.0
>>>ceph osd out osd.4
>>>ceph osd crush remove osd.4
>>>ceph auth del osd.4
>>>ceph osd rm osd.4
>>>
>>>
>>>But after the rebalancing I ended up with 155 PGs in
>>>stale+active+clean state.
>>>
>>>@storage1:/tmp# ceph -s
>>>cluster 7a9120b9-df42-4308-b7b1-e1f3d0f1e7b3
>>> health HEALTH_WARN 155 pgs stale; 155 pgs stuck stale; 1 requests
>>>are blocked > 32 sec; nodeep-scrub flag(s) set
>>> monmap e1: 1 mons at {storage1=192.168.10.3:6789/0}, election
>>>epoch 1, quorum 0 storage1
>>> 

[ceph-users] "ceph-installer" in GitHub

2016-02-25 Thread Ken Dreyer
Hi folks,

A few of us at RH are working on a project called "ceph-installer",
which is a Pecan web app that exposes endpoints for running
ceph-ansible under the hood.

The idea is that other applications will be able to consume this REST
API in order to orchestrate Ceph installations.

Another team within Red Hat is also working on a GUI component that
will interact with the ceph-installer web service, and that is
https://github.com/skyrings

These are all nascent projects that are very much works-in-progress,
and so the workflows are very rough and there are a hundred things we
could do to improve the experience and integration, etc. We welcome
feedback from the rest of the community :)

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tracking data to buckets, owners

2016-02-25 Thread Jeffrey McDonald
Hi,

I'm trying to track shards of an EC4+2 ceph filesystem back to users and
buckets.   Is there a procedure outlined somewhere for this?   All I have
is a file name from an osd data pool, e.g:

default.724733.17\u\ushadow\uprostate\srnaseq\sd959d5dd-2454-4f07-b69e-9ead4a58b5f2\sUNCID\u2256596.bf46c30c-14fa-4e2a-a013-4e84f24eb63b.130722\uUNC9-SN296\u0385\uAD2F28ACXX\u8\uGTTTCG.tar.gz.2~RGMpBL1jBOB6Pa4ZQrdgVMxKHw0CIGu.6_392587ace40e89b50fac_0_long


How do I find the bucket and owner of this file?

Thanks in advance,
Jeff

-- 

Jeffrey McDonald, PhD
Assistant Director for HPC Operations
Minnesota Supercomputing Institute
University of Minnesota Twin Cities
599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
117 Pleasant St SE   phone: +1 612 625-6905
Minneapolis, MN 55455fax:   +1 612 624-8861
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Over 13,000 osdmaps in current/meta

2016-02-25 Thread Stillwell, Bryan
It's good to hear that I'm not the only one affected by this!  After the
node was brought back into the cluster (I weighted it out for hardware
repairs) it appears to have removed some of the old maps, as I'm down to
8,000 now.  Although I did find another OSD in the cluster which has
95,000 osd maps (46GB)!  That's substantial considering each OSD in our
cluster is only 1.2TB.
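
A quick way to get the same numbers per OSD (sketch; the path is the one from
the original report, adjust to wherever your OSD data actually lives):

# count and size the cached osdmap files under each OSD's current/meta
for d in /usr/lib/ceph/osd/ceph-*/current/meta; do
    maps=$(find "$d" -name 'osdmap*' | wc -l)
    size=$(du -sh "$d" | cut -f1)
    echo "$d: $maps osdmap files, $size"
done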

Bryan

From:  ceph-users  on behalf of Tom
Christensen 
Date:  Thursday, February 25, 2016 at 10:36 AM
To:  "ceph-users@lists.ceph.com" 
Subject:  Re: [ceph-users] Over 13,000 osdmaps in current/meta


>We've seen this as well as early as 0.94.3 and have a bug,
>http://tracker.ceph.com/issues/13990
> which we're working through
>currently.  Nothing fixed yet, still trying to nail down exactly why the
>osd maps aren't being trimmed as they should.
>
>
>
>On Thu, Feb 25, 2016 at 10:16 AM, Stillwell, Bryan
> wrote:
>
>After evacuated all the PGs from a node in hammer 0.94.5, I noticed that
>each of the OSDs was still using ~8GB of storage.  After investigating it
>appears like all the data is coming from around 13,000 files in
>/usr/lib/ceph/osd/ceph-*/current/meta/ with names like:
>
>DIR_4/DIR_0/DIR_0/osdmap.303231__0_C23E4004__none
>DIR_4/DIR_2/DIR_F/osdmap.314431__0_C24ADF24__none
>DIR_4/DIR_0/DIR_A/osdmap.312688__0_C2510A04__none
>
>They're all around 500KB in size.  I'm guessing these are all old OSD
>maps, but I'm wondering why there are so many of them?
>
>Thanks,
>Bryan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-02-25 Thread Alfredo Deza
On Thu, Feb 25, 2016 at 7:00 AM, Udo Lembke  wrote:
> Hi,
>
> Am 24.02.2016 um 17:27 schrieb Alfredo Deza:
>> On Wed, Feb 24, 2016 at 4:31 AM, Dan van der Ster  
>> wrote:
>>> Thanks Sage, looking forward to some scrub randomization.
>>>
>>> Were binaries built for el6? http://download.ceph.com/rpm-hammer/el6/x86_64/
>>
>> We are no longer building binaries for el6. Just for Centos 7, Ubuntu
>> Trusty, and Debian Jessie.
>>
> this means that our proxmox-ve server 3.4, which run debian wheezy, could not 
> be updated from ceph 0.94.5 to 0.94.6!
> The OSD-nodes run's wheezy too - they can be upgraded. But the MONs must be 
> also upgraded (first).
>
> I can understand, that newer versions will not supplied to an older OS, but 
> stop from minor.5 to minor.6 makes realy no
> sense to me.
>
> Of course, I can update to proxmox-ve 4.x, which is jessie based, but in this 
> case I have trouble with DRBD...

It would be really nice if the community could step up to help us out
in building binaries. Building Ceph is non-trivial and coming
up with all the different distros, distro versions, and architectures
(at some point we were close to 12 variations) is a tremendous
effort.

>
>
> Udo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Dump Historic Ops Breakdown

2016-02-25 Thread Nick Fisk
I'm just trying to understand the steps each IO goes through and have been
looking at the output of the dump historic ops command from the admin socket.
There are a couple of steps I'm not quite sure about, and I'm also slightly
puzzled by the delay, so I was wondering if anybody could share some
knowledge around this.

Here is what I think I understand so far:

Initiated = When the OSD received the OP

Queued for PG / Reached PG / Started = This seems to be how long the OSD has
to wait to get a lock on the PG before actually starting the write. Correct?
Are there any perf stats to track this number? And why do I see a 150ms delay
before started? Am I possibly hitting some sort of queue on the PG? Is this
just a large queue of requests on the PG that are waiting to be written to
the journal? Any tips to reduce this?

Waiting for Sub Ops = Self-explanatory, its waiting for replica OSD's to
apply the op to journal

commit_queued_for_journal_write/ write_thread_in_journal_buffer/
journaled_completion_queued/ op_commit = How long it takes to queue and
write to the journal. In the example case it's 4ms, which seems very high for an
s3700 SSD? Maybe lots of ops are queued up? Most other ops show this at <1ms.

sub_op_commit_rec = This is where we hear back from the replica OSD's

op_applied/done = We have finished so send ACK back to client


Thanks for any insight anyone can offer.
Nick

Sample Op

"description": "osd_op(client.9539566.0:292915056
rb.0.265a6.2ae8944a.00072421 [] 0.c1a473f3
ack+ondisk+write+known_if_redirected e51777)",
"initiated_at": "2016-02-25 17:02:53.017589",
"age": 445.814991,
"duration": 0.164949,
"type_data": [
"commit sent; apply or cleanup",
{
"client": "client.9539566",
"tid": 292915056
},
[
{
"time": "2016-02-25 17:02:53.017589",
"event": "initiated"
},
{
"time": "2016-02-25 17:02:53.017960",
"event": "queued_for_pg"
},
{
"time": "2016-02-25 17:02:53.018029",
"event": "reached_pg"
},
{
"time": "2016-02-25 17:02:53.173131",
"event": "started"
},
{
"time": "2016-02-25 17:02:53.175146",
"event": "waiting for subops from 24,43"
},
{
"time": "2016-02-25 17:02:53.177185",
"event": "commit_queued_for_journal_write"
},
{
"time": "2016-02-25 17:02:53.177285",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2016-02-25 17:02:53.177649",
"event": "journaled_completion_queued"
},
{
"time": "2016-02-25 17:02:53.181831",
"event": "op_commit"
},
{
"time": "2016-02-25 17:02:53.181958",
"event": "sub_op_commit_rec from 43"
},
{
"time": "2016-02-25 17:02:53.182257",
"event": "sub_op_commit_rec from 24"
},
{
"time": "2016-02-25 17:02:53.182491",
"event": "commit_sent"
},
{
"time": "2016-02-25 17:02:53.182512",
"event": "op_applied"
},
{
"time": "2016-02-25 17:02:53.182538",
"event": "done"
}
]
]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Over 13,000 osdmaps in current/meta

2016-02-25 Thread Tom Christensen
We've seen this as well as early as 0.94.3 and have a bug,
http://tracker.ceph.com/issues/13990 which we're working through
currently.  Nothing fixed yet, still trying to nail down exactly why the
osd maps aren't being trimmed as they should.


On Thu, Feb 25, 2016 at 10:16 AM, Stillwell, Bryan <
bryan.stillw...@twcable.com> wrote:

> After evacuated all the PGs from a node in hammer 0.94.5, I noticed that
> each of the OSDs was still using ~8GB of storage.  After investigating it
> appears like all the data is coming from around 13,000 files in
> /usr/lib/ceph/osd/ceph-*/current/meta/ with names like:
>
> DIR_4/DIR_0/DIR_0/osdmap.303231__0_C23E4004__none
> DIR_4/DIR_2/DIR_F/osdmap.314431__0_C24ADF24__none
> DIR_4/DIR_0/DIR_A/osdmap.312688__0_C2510A04__none
>
> They're all around 500KB in size.  I'm guessing these are all old OSD
> maps, but I'm wondering why there are so many of them?
>
> Thanks,
> Bryan
>
>
> 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] download.ceph.com has AAAA record that points to unavailable address

2016-02-25 Thread Dan Mick
Because we thought that the infrastructure did at the time.   We'll get that 
removed; I can see where it could cause hassles. 

Sent from Nine

From: Andy Allan 
Sent: Feb 25, 2016 6:11 AM
To: Dan Mick
Cc: Artem Fokin; ceph-users
Subject: Re: [ceph-users] [Ceph-maintainers] download.ceph.com has AAAA record 
that points to unavailable address

Hi Dan, 

If download.ceph.com doesn't support IPv6, then why is there an AAAA 
record for it? 

Thanks, 
Andy 

On 25 February 2016 at 02:21, Dan Mick  wrote: 
> Yes.  download.ceph.com does not currently support IPv6 access. 
> 
> On 02/14/2016 11:53 PM, Artem Fokin wrote: 
>> Hi 
>> 
>> It seems like download.ceph.com has some outdated IPv6 address 
>> 
>> ~ curl -v -s download.ceph.com > /dev/null 
>> * About to connect() to download.ceph.com port 80 (#0) 
>> *   Trying 2607:f298:6050:51f3:f816:3eff:fe50:5ec... Connection refused 
>> *   Trying 173.236.253.173... connected 
>> 
>> 
>> 
>> ~ dig AAAA download.ceph.com | grep AAAA 
>> ; <<>> DiG 9.8.1-P1 <<>> AAAA download.ceph.com 
>> ;download.ceph.com.    IN    AAAA 
>> download.ceph.com.    286    IN    AAAA 
>> 2607:f298:6050:51f3:f816:3eff:fe50:5ec 
>> 
>> If this is the wrong mailing list, please refer to the correct one. 
>> 
>> Thanks! 
>> ___ 
>> Ceph-maintainers mailing list 
>> ceph-maintain...@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem: silently corrupted RadosGW objects caused by slow requests

2016-02-25 Thread Ritter Sławomir
Hi,

We have two CEPH clusters running on Dumpling 0.67.11 and some of our 
"multipart objects" are incomplete. It seems that some slow requests could 
cause corruption of the related S3 objects. Moreover, GETs for those objects 
work without any error messages. There are only HTTP 200s in the logs, as well as 
no information about problems from popular client tools/libs.

The situation looks very similar to the one described in bug #8269, but we are 
using the fixed 0.67.11 version:  http://tracker.ceph.com/issues/8269

Regards,

Sławomir Ritter



EXAMPLE#1

slow_request

2016-02-23 13:49:58.818640 osd.260 10.176.67.27:6800/688083 2119 : [WRN] 4 slow 
requests, 4 included below; oldest blocked for > 30.727096 secs
2016-02-23 13:49:58.818673 osd.260 10.176.67.27:6800/688083 2120 : [WRN] slow 
request 30.727096 seconds old, received at 2016-02-23 13:49:28.091460: osd_op(c
lient.47792965.0:185007087 
default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
 [writef
ull 0~524288] 10.ce729ebe e107594) v4 currently waiting for subops from [469,9]


HTTP_500 in apache.log
==
127.0.0.1 - - [23/Feb/2016:13:49:27 +0100] "PUT 
/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z=56
 HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3 
Linux/3.13.0-39-generic(syncworker)"
127.0.0.1 - - [23/Feb/2016:13:49:28 +0100] "PUT 
/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z=57
 HTTP/1.0" 500 751 "-" "Boto/2.31.1 Python/2.7.3 
Linux/3.13.0-39-generic(syncworker)"
127.0.0.1 - - [23/Feb/2016:13:49:58 +0100] "PUT 
/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z=57
 HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3 
Linux/3.13.0-39-generic(syncworker)"
127.0.0.1 - - [23/Feb/2016:13:49:59 +0100] "PUT 
/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv?uploadId=b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z=58
 HTTP/1.0" 200 221 "-" "Boto/2.31.1 Python/2.7.3 
Linux/3.13.0-39-generic(syncworker)"


Empty RADOS object (real size = 0 bytes), list generated from the MANIFEST
==
found  
default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.56_2
  2097152   ok  2097152   10.7acc9476 (10.1476) [278,142,436] 
[278,142,436]
found  
default.14654.445__multipart_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57
 0 diff4194304   10.4f5be025 (10.25)   [57,310,428]  
[57,310,428]
found  
default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_1
  4194304   ok  4194304   10.81191602 (10.1602) [441,109,420] 
[441,109,420]
found  
default.14654.445__shadow_c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv.b73N3OmW4OhCjDYSR-RTkZNNIKA1C9Z.57_2
  2097152   ok  2097152   10.ce729ebe (10.1ebe) [260,469,9]   
[260,469,9]
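
For context, a list like the one above can be reproduced by stat'ing every part
named in the object's manifest; a rough sketch (the pool name and the
object-list file are examples):

# report the actual stored size of each multipart/shadow object
while read obj; do
    rados -p .rgw.buckets stat "$obj"
done < manifest-objects.txt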


"Silent" GETs
=
# object size from headers
$ s3 -u head 
video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv  
 Content-Type: 
binary/octet-stream
Content-Length: 641775701
Server: nginx

# but GETs only 637581397 (641775701 - missing 4194304 = 637581397)
$ s3 -u get 
video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv > 
/tmp/test
$  ls -al /tmp/test
-rw-r--r-- 1 root root 637581397 Feb 23 17:05 /tmp/test

# no error in logs
127.0.0.1 - - [23/Feb/2016:17:05:00 +0100] "GET 
/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv 
HTTP/1.0" 200 637581711 "-" "Mozilla/4.0 (Compatible; s3; libs3 2.0; Linux 
x86_64)"

# wget - retry for missing part, but there is no missing part, so it GETs 
head/tail of the file again
$ wget 
http://127.0.0.1:88/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
--2016-02-23 17:10:11--  
http://127.0.0.1:88/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
Connecting to 127.0.0.1:88... connected.
HTTP request sent, awaiting response... 200 OK
Length: 641775701 (612M) [binary/octet-stream]
Saving to: `c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv'

99% 
[==>
 ] 637,581,397 63.9M/s   in 9.5s

2016-02-23 17:10:20 (64.1 MB/s) - Connection closed at byte 637581397. Retrying.

--2016-02-23 17:10:21--  (try: 2)  
http://127.0.0.1:88/video-shbc/c9f8db1b-cee2-4ec8-8fb3-8b4bc7585d80.1456231572.877051.ismv
Connecting to 127.0.0.1:88... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 641775701 (612M), 4194304 (4.0M) 

Re: [ceph-users] Can not disable rbd cache

2016-02-25 Thread Jason Dillaman
> > Let's start from the top. Where are you stuck with [1]? I have noticed
> > that after evicting all the objects with RBD that one object for each
> > active RBD is still left, I think this is the head object.
> Precisely.
> That came up in my extensive tests as well.

Is this in reference to the RBD image header object (i.e. XYZ.rbd or 
rbd_header.XYZ)? The cache tier doesn't currently support evicting objects that 
are being watched.  This guard was added to the OSD because it wasn't 
previously possible to alert clients that a watched object had encountered an 
error (such as it no longer exists in the cache tier).  Now that Hammer (and 
later) librbd releases will reconnect the watch on error (eviction), perhaps 
this guard can be loosened [1].

[1] http://tracker.ceph.com/issues/14865
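
As a quick check, the watchers holding a header object in place can be listed
directly; a sketch (the pool name and image id are examples):

# shows which clients currently hold a watch on the RBD header object
rados -p rbd listwatchers rbd_header.102674b0dc51
# format 1 images use the <imagename>.rbd header object instead
rados -p rbd listwatchers myimage.rbd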

--

Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Jason Dillaman
> 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't actually
> work. Or it's not touching the same object (but I wonder whether write
> ordering is preserved at that rate?).

The fio rbd engine does not support "sync=1"; however, it should support 
"fsync=1" to accomplish roughly the same effect.

Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] download.ceph.com has AAAA record that points to unavailable address

2016-02-25 Thread Andy Allan
Hi Dan,

If download.ceph.com doesn't support IPv6, then why is there an AAAA
record for it?

Thanks,
Andy

On 25 February 2016 at 02:21, Dan Mick  wrote:
> Yes.  download.ceph.com does not currently support IPv6 access.
>
> On 02/14/2016 11:53 PM, Artem Fokin wrote:
>> Hi
>>
>> It seems like download.ceph.com has some outdated IPv6 address
>>
>> ~ curl -v -s download.ceph.com > /dev/null
>> * About to connect() to download.ceph.com port 80 (#0)
>> *   Trying 2607:f298:6050:51f3:f816:3eff:fe50:5ec... Connection refused
>> *   Trying 173.236.253.173... connected
>>
>>
>>
>> ~ dig AAAA download.ceph.com | grep AAAA
>> ; <<>> DiG 9.8.1-P1 <<>> AAAA download.ceph.com
>> ;download.ceph.com.    IN    AAAA
>> download.ceph.com.    286    IN    AAAA
>> 2607:f298:6050:51f3:f816:3eff:fe50:5ec
>>
>> If this is the wrong mailing list, please refer to the correct one.
>>
>> Thanks!
>> ___
>> Ceph-maintainers mailing list
>> ceph-maintain...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-25 Thread Jan Schermer
We are very happy with S3610s in our cluster.
We had to flash a new firmware because of latency spikes (NCQ-related), but had 
zero problems after that...
Just beware of HBA compatibility, even in passthrough mode some crappy 
firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
your way for crippling TRIM, seriously WTH).

Jan


> On 25 Feb 2016, at 14:48, Nick Fisk  wrote:
> 
> There’s two factors really
>  
> 1.   Suitability for use in ceph
> 2.   Number of people using them
>  
> For #1, there are a number of people using various different drives, so lots 
> of options. The blog articled linked is a good place to start.
>  
> For #2 and I think this is quite important. Lots of people use the S3xx’s 
> intel drives. This means any problems you face will likely have a lot of 
> input from other people. Also you are less likely to face surprises, as most 
> usage cases have already been covered. 
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Robert LeBlanc
> Sent: 25 February 2016 05:56
> To: Shinobu Kinjo >
> Cc: ceph-users >
> Subject: Re: [ceph-users] List of SSDs
>  
> We are moving to the Intel S3610, from our testing it is a good balance 
> between price, performance and longevity. But as with all things, do your 
> testing ahead of time. This will be our third model of SSDs for our cluster. 
> The S3500s didn't have enough life and performance tapers off add it gets 
> full. The Micron M600s looked good with the Sebastian journal tests, but once 
> in use for a while go downhill pretty bad. We also tested Micron M500dc 
> drives and they were on par with the S3610s and are more expensive and are 
> closer to EoL. The S3700s didn't have quite the same performance as the 
> S3610s, but they will last forever and are very stable in terms of 
> performance and have the best power loss protection. 
> 
> Short answer is test them for yourself to make sure they will work. You are 
> pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty 
> safe based on my experience. It had also been mentioned that someone has had 
> good experience with a Samsung DC Pro (has to have both DC and Pro in the 
> name), but we weren't able to get any quick enough to test so I can't vouch 
> for them. 
> 
> Sent from a mobile device, please excuse any typos.
> 
> On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  > wrote:
> Hello,
> 
> There has been a bunch of discussion about using SSD.
> Does anyone have any list of SSDs describing which SSD is highly recommended, 
> which SSD is not.
> 
> Rgds,
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread nick



On 25 Feb 2016 1:47 pm, Jan Schermer  wrote:

>> On 25 Feb 2016, at 14:39, Nick Fisk  wrote:
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Huan Zhang
>>> Sent: 25 February 2016 11:11
>>> To: josh.dur...@inktank.com
>>> Cc: ceph-users 
>>> Subject: [ceph-users] Guest sync write iops so poor.
>>>
>>> Hi,
>>>   We test sync iops with fio sync=1 for database workloads in VM,
>>> the backend is librbd and ceph (all SSD setup).
>>>   The result is sad to me. we only get ~400 IOPS sync randwrite with
>>> iodepth=1 to iodepth=32.
>>>   But test in physical machine with fio ioengine=rbd sync=1, we can reache
>>> ~35K IOPS. seems the qemu rbd is the bottleneck.
>>>   qemu version is 2.1.2 with rbd_aio_flush patched.
>>>   rbd cache is off, qemu cache=none.
>>>
>>> So what's wrong with it? Is that normal? Could you give me some help?
>>
>> Yes, this is normal at QD=1. As the write needs to be acknowledged by both
>> replica OSD's across a network connection the round trip latency severely
>> limits you as compared to travelling along a 30cm sata cable.
>>
>> The two biggest contributors to latency is the network and the speed at
>> which the CPU can process the ceph code.  To improve performance look at
>> these two areas first. Easy win is to disable debug logging in ceph.
>>
>> However this number should scale as you increase the QD, so something is
>> not right if you are seeing the same performance at QD=1 as QD=32.
>
> Are you sure?

Ah, sorry. It's sync and not direct io. Yes you are right, it will not
scale. 400 iops at all qd is correct.

> Unless something (io elevator) coalesces the writes then they should
> be serialized and blocking, QD doesn't necessarily help there. Either
> way, you're benchmarking the elevator and not RBD if you reach higher
> IOPS with QD>1, IMO.
>
> 35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't
> actually work. Or it's not touching the same object (but I wonder
> whether write ordering is preserved at that rate?). 400 IOPS is
> sadly the same figure I can reach on a raw device... testing with
> filesystem you can easily reach <200 IOPS (because of journal,
> metadata... but again, then you're benchmarking filesystem journal
> and ioelevator efficiency, not RBD itself).
>
> Jan
>
>>> Thanks very much.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Quoting Jan Schermer 




On 25 Feb 2016, at 14:39, Nick Fisk  wrote:




-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Huan Zhang
Sent: 25 February 2016 11:11
To: josh.dur...@inktank.com
Cc: ceph-users 
Subject: [ceph-users] Guest sync write iops so poor.

Hi,
  We test sync iops with fio sync=1 for database workloads in VM,
the backend is librbd and ceph (all SSD setup).
  The result is sad to me. we only get ~400 IOPS sync randwrite with
iodepth=1
to iodepth=32.
  But test in physical machine with fio ioengine=rbd sync=1, we can reache
~35K IOPS.
seems the qemu rbd is the bottleneck.
  qemu version is 2.1.2 with rbd_aio_flush patched.
   rbd cache is off, qemu cache=none.

So what's wrong with it? Is that normal? Could you give me some help?


Yes, this is normal at QD=1. As the write needs to be acknowledged  
by both replica OSD's across a network connection the round trip  
latency severely limits you as compared to travelling along a 30cm  
sata cable.


The two biggest contributors to latency is the network and the  
speed at which the CPU can process the ceph code.  To improve  
performance look at these two areas first. Easy win is to disable  
debug logging in ceph.


However this number should scale as you increase the QD, so  
something is not right if you are seeing the same performance at  
QD=1 as QD=32.


Are you sure? Unless something (io elevator) coalesces the writes  
then they should be serialized and blocking, QD doesn't necessarily  
help there. Either way, you're benchmarking the elevator and not RBD  
if you reach higher IOPS with QD>1, IMO.


35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't  
actually work. Or it's not touching the same object (but I wonder  
whether write ordering is preserved at that rate?).


400 IOPS is sadly the same figure I can reach on a raw device...  
testing with filesystem you can easily reach <200 IOPS (because of  
journal, metadata... but again, then you're benchmarking filesystem  
journal and ioelevator efficiency, not RBD itself).


Jan





Thanks very much.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users 

Re: [ceph-users] List of SSDs

2016-02-25 Thread Nick Fisk
There’s two factors really

 

1.   Suitability for use in ceph

2.   Number of people using them

 

For #1, there are a number of people using various different drives, so lots of 
options. The blog article linked is a good place to start.
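
If it's the journal-test post people usually link, the check it describes is a
single-job O_DSYNC write run with fio; a sketch (the device name is an example,
and the run overwrites data on it):

# 4k sync writes at queue depth 1, roughly what a Ceph journal demands
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test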

 

For #2 and I think this is quite important. Lots of people use the S3xx’s intel 
drives. This means any problems you face will likely have a lot of input from 
other people. Also you are less likely to face surprises, as most usage cases 
have already been covered. 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert 
LeBlanc
Sent: 25 February 2016 05:56
To: Shinobu Kinjo 
Cc: ceph-users 
Subject: Re: [ceph-users] List of SSDs

 

We are moving to the Intel S3610, from our testing it is a good balance between 
price, performance and longevity. But as with all things, do your testing ahead 
of time. This will be our third model of SSDs for our cluster. The S3500s 
didn't have enough life and performance tapers off add it gets full. The Micron 
M600s looked good with the Sebastian journal tests, but once in use for a while 
go downhill pretty bad. We also tested Micron M500dc drives and they were on 
par with the S3610s and are more expensive and are closer to EoL. The S3700s 
didn't have quite the same performance as the S3610s, but they will last 
forever and are very stable in terms of performance and have the best power 
loss protection. 

Short answer is test them for yourself to make sure they will work. You are 
pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty safe 
based on my experience. It had also been mentioned that someone has had good 
experience with a Samsung DC Pro (has to have both DC and Pro in the name), but 
we weren't able to get any quick enough to test so I can't vouch for them. 

Sent from a mobile device, please excuse any typos.

On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  > wrote:

Hello,

There has been a bunch of discussion about using SSD.
Does anyone have any list of SSDs describing which SSD is highly recommended, 
which SSD is not.

Rgds,
Shinobu
___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Jan Schermer

> On 25 Feb 2016, at 14:39, Nick Fisk  wrote:
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Huan Zhang
>> Sent: 25 February 2016 11:11
>> To: josh.dur...@inktank.com
>> Cc: ceph-users 
>> Subject: [ceph-users] Guest sync write iops so poor.
>> 
>> Hi,
>>   We test sync iops with fio sync=1 for database workloads in VM,
>> the backend is librbd and ceph (all SSD setup).
>>   The result is sad to me. we only get ~400 IOPS sync randwrite with
>> iodepth=1
>> to iodepth=32.
>>   But test in physical machine with fio ioengine=rbd sync=1, we can reache
>> ~35K IOPS.
>> seems the qemu rbd is the bottleneck.
>>   qemu version is 2.1.2 with rbd_aio_flush patched.
>>rbd cache is off, qemu cache=none.
>> 
>> So what's wrong with it? Is that normal? Could you give me some help?
> 
> Yes, this is normal at QD=1. As the write needs to be acknowledged by both 
> replica OSD's across a network connection the round trip latency severely 
> limits you as compared to travelling along a 30cm sata cable.
> 
> The two biggest contributors to latency is the network and the speed at which 
> the CPU can process the ceph code.  To improve performance look at these two 
> areas first. Easy win is to disable debug logging in ceph.
> 
> However this number should scale as you increase the QD, so something is not 
> right if you are seeing the same performance at QD=1 as QD=32.

Are you sure? Unless something (io elevator) coalesces the writes then they 
should be serialized and blocking, QD doesn't necessarily help there. Either 
way, you're benchmarking the elevator and not RBD if you reach higher IOPS with 
QD>1, IMO.

35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't actually 
work. Or it's not touching the same object (but I wonder whether write ordering 
is preserved at that rate?).

400 IOPS is sadly the same figure I can reach on a raw device... testing with 
filesystem you can easily reach <200 IOPS (because of journal, metadata... but 
again, then you're benchmarking filesystem journal and ioelevator efficiency, 
not RBD itself).

Jan


> 
>> Thanks very much.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

2016-02-25 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: 25 February 2016 01:30
> To: Christian Balzer 
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] ceph hammer : rbd info/Status : operation not
> supported (95) (EC+RBD tier pools)
> 
> I'll speak to what I can answer off the top of my head.  The most important
> point is that this issue is only related to EC pool base tiers, not replicated
> pools.
> 
> > Hello Jason (Ceph devs et al),
> >
> > On Wed, 24 Feb 2016 13:15:34 -0500 (EST) Jason Dillaman wrote:
> >
> > > If you run "rados -p  ls | grep "rbd_id." and
> > > don't see that object, you are experiencing that issue [1].
> > >
> > > You can attempt to work around this issue by running "rados -p
> > > irfu-virt setomapval rbd_id. dummy value" to
> > > force-promote the object to the cache pool.  I haven't tested /
> > > verified that will alleviate the issue, though.
> > >
> > > [1] http://tracker.ceph.com/issues/14762
> > >
> >
> > This concerns me greatly, as I'm about to phase in a cache tier this
> > weekend into a very busy, VERY mission critical Ceph cluster.
> > That is on top of a replicated pool, Hammer.
> >
> > That issue and the related git blurb are less than crystal clear, so
> > for my and everybody else's benefit could you elaborate a bit more on
> this?
> >
> > 1. Does this only affect EC base pools?
> 
> Correct -- this is only an issue because EC pools do not directly support
> several operations required by RBD.  Placing a replicated cache tier in front of
> an EC pool was, in effect, a work-around to this limitation.
> 
> > 2. Is this a regressions of sorts and when came it about?
> >I have a hard time imagining people not running into this earlier,
> >unless that problem is very hard to trigger.
> > 3. One assumes that this isn't fixed in any released version of Ceph,
> >correct?
> >
> > Robert, sorry for CC'ing you, but AFAICT your cluster is about the
> > closest approximation in terms of busyness to mine here.
> > And I a assume that you're neither using EC pools (since you need
> > performance, not space) and haven't experienced this bug all?
> >
> > Also, would you consider the benefits of the recency fix (thanks for
> > that) being worth risk of being an early adopter of 0.94.6?
> > In other words, are you eating your own dog food already and 0.94.6
> > hasn't eaten your data babies yet? ^o^
> 
> Per the referenced email chain, it was potentially the recency fix that
> exposed this issue for EC pools fronted by a cache tier.

Just to add: it's possible this bug was present for a while, but the broken
recency logic effectively always promoted blocks regardless. Once this was
fixed and ceph could actually make a decision about whether a block needed to
be promoted or not, this bug surfaced. You can always set the recency to 0
(possibly 1) and get the same behaviour as before the recency fix, to ensure
that you won't hit this bug.
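
For reference, recency is a per-pool property; a sketch of pinning it back to
the old always-promote behaviour (the cache pool name is an example):

# promote on every read again, as before the recency fix (hammer 0.94.x)
ceph osd pool set hot-cache min_read_recency_for_promote 0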

> 
> >
> > Regards,
> >
> > Christian
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
> 
> --
> 
> Jason Dillaman
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Huan Zhang
> Sent: 25 February 2016 11:11
> To: josh.dur...@inktank.com
> Cc: ceph-users 
> Subject: [ceph-users] Guest sync write iops so poor.
> 
> Hi,
>We test sync iops with fio sync=1 for database workloads in VM,
> the backend is librbd and ceph (all SSD setup).
>The result is sad to me. we only get ~400 IOPS sync randwrite with
> iodepth=1
> to iodepth=32.
>But test in physical machine with fio ioengine=rbd sync=1, we can reache
> ~35K IOPS.
> seems the qemu rbd is the bottleneck.
>qemu version is 2.1.2 with rbd_aio_flush patched.
> rbd cache is off, qemu cache=none.
> 
> So what's wrong with it? Is that normal? Could you give me some help?

Yes, this is normal at QD=1. As the write needs to be acknowledged by both 
replica OSD's across a network connection the round trip latency severely 
limits you as compared to travelling along a 30cm sata cable.

The two biggest contributors to latency are the network and the speed at which
the CPU can process the Ceph code. To improve performance, look at these two
areas first. An easy win is to disable debug logging in Ceph.
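
For reference, one commonly used set of overrides to silence most of the debug
logging looks roughly like this in ceph.conf on the OSD nodes (a sketch, not an
exhaustive list of debug subsystems):

  [global]
  debug ms = 0/0
  debug osd = 0/0
  debug filestore = 0/0
  debug journal = 0/0
  debug monc = 0/0
  debug auth = 0/0

The same values can also be injected into running daemons with
"ceph tell osd.* injectargs" if you want to test the effect without a restart.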

However, this number should scale as you increase the QD, so something is not
right if you are seeing the same performance at QD=32 as at QD=1.
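
A quick way to verify that inside the guest is a fio job that only varies the
queue depth against the same device (a sketch; /dev/vdb, block size and runtime
are assumptions to adjust to your setup, and the job writes to that device):

  [global]
  ioengine=libaio
  direct=1
  sync=1
  rw=randwrite
  bs=4k
  runtime=60
  time_based
  filename=/dev/vdb

  [qd1]
  iodepth=1
  stonewall

  [qd32]
  iodepth=32
  stonewall

If the qd32 job doesn't deliver substantially more IOPS than qd1, the requests
are being serialised somewhere between the guest and the OSDs.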

> Thanks very much.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs corruption

2016-02-25 Thread Ferhat Ozkasgarli
This has happened to me before, but in a virtual machine environment.

The VM was KVM and the storage was RBD. My problem was a bad network cable.

You should check following details:

1-) Do you use any kind of hardware RAID configuration? (RAID 0, 5 or 10)

Ceph does not work well on hardware RAID systems. You should put the RAID cards
in HBA (non-RAID) mode and let the card pass the disks straight through.

2-) Check your network connections

It may seem an obvious solution, but believe me, the network is one of the top
culprits in Ceph environments (a quick check-list sketch follows after point 3).

3-) If you are using SSD disks, make sure you use a non-RAID configuration.
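
For point 2, a quick sketch of the kind of thing to check on every node
(interface name and MTU are assumptions, adjust to your environment):

  ip -s link show eth0                        # RX/TX errors and drops
  ethtool -S eth0 | grep -iE 'err|drop|crc'   # NIC-level error counters
  ping -M do -s 8972 <peer-node>              # path check for a 9000 byte MTU

Any steadily growing error counter, or fragmentation-needed replies on the
ping, points at cabling, switch or MTU problems rather than at Ceph or XFS.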



On Tue, Feb 23, 2016 at 10:55 PM, fangchen sun 
wrote:

> Dear all:
>
> I have a ceph object storage cluster with 143 OSDs and 7 radosgw, and
> chose XFS as the underlying file system.
> I recently ran into a problem where an OSD is sometimes marked down when the
> return value of the function "chain_setxattr()" is -117. I then unmount
> the disk and repair it with "xfs_repair".
>
> os: centos 6.5
> kernel version: 2.6.32
>
> the log for dmesg command:
> [41796028.532225] Pid: 1438740, comm: ceph-osd Not tainted
> 2.6.32-925.431.23.3.letv.el6.x86_64 #1
> [41796028.532227] Call Trace:
> [41796028.532255]  [] ? xfs_error_report+0x3f/0x50 [xfs]
> [41796028.532276]  [] ? xfs_da_read_buf+0x2a/0x30 [xfs]
> [41796028.532296]  [] ? xfs_corruption_error+0x5e/0x90
> [xfs]
> [41796028.532316]  [] ? xfs_da_do_buf+0x6cc/0x770 [xfs]
> [41796028.532335]  [] ? xfs_da_read_buf+0x2a/0x30 [xfs]
> [41796028.532359]  [] ? kmem_zone_alloc+0x77/0xf0 [xfs]
> [41796028.532380]  [] ? xfs_da_read_buf+0x2a/0x30 [xfs]
> [41796028.532399]  [] ? xfs_attr_leaf_addname+0x61/0x3d0
> [xfs]
> [41796028.532426]  [] ? xfs_attr_leaf_addname+0x61/0x3d0
> [xfs]
> [41796028.532455]  [] ? xfs_trans_add_item+0x57/0x70
> [xfs]
> [41796028.532476]  [] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
> [41796028.532495]  [] ? xfs_attr_set_int+0x3c4/0x510
> [xfs]
> [41796028.532517]  [] ? xfs_da_do_buf+0x6db/0x770 [xfs]
> [41796028.532536]  [] ? xfs_attr_set+0x81/0x90 [xfs]
> [41796028.532560]  [] ? __xfs_xattr_set+0x43/0x60 [xfs]
> [41796028.532584]  [] ? xfs_xattr_user_set+0x11/0x20
> [xfs]
> [41796028.532592]  [] ? generic_setxattr+0xa2/0xb0
> [41796028.532596]  [] ? __vfs_setxattr_noperm+0x4e/0x160
> [41796028.532600]  [] ? inode_permission+0xa7/0x100
> [41796028.532604]  [] ? vfs_setxattr+0xbc/0xc0
> [41796028.532607]  [] ? setxattr+0xd0/0x150
> [41796028.532612]  [] ? __dequeue_entity+0x30/0x50
> [41796028.532617]  [] ? __switch_to+0x26e/0x320
> [41796028.532621]  [] ? __sb_start_write+0x80/0x120
> [41796028.532626]  [] ? thread_return+0x4e/0x760
> [41796028.532630]  [] ? sys_fsetxattr+0xad/0xd0
> [41796028.532633]  [] ? system_call_fastpath+0x16/0x1b
> [41796028.532636] XFS (sdi1): Corruption detected. Unmount and run
> xfs_repair
>
> Any comments will be much appreciated!
>
> Best Regards!
> sunspot
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-02-25 Thread Udo Lembke
Hi,

Am 24.02.2016 um 17:27 schrieb Alfredo Deza:
> On Wed, Feb 24, 2016 at 4:31 AM, Dan van der Ster  wrote:
>> Thanks Sage, looking forward to some scrub randomization.
>>
>> Were binaries built for el6? http://download.ceph.com/rpm-hammer/el6/x86_64/
> 
> We are no longer building binaries for el6. Just for Centos 7, Ubuntu
> Trusty, and Debian Jessie.
> 
this means that our proxmox-ve 3.4 servers, which run Debian Wheezy, cannot
be updated from Ceph 0.94.5 to 0.94.6!
The OSD nodes run Wheezy too - they can be upgraded. But the MONs must also
be upgraded (first).

I can understand that newer versions will not be supplied to an older OS, but
stopping support between minor release .5 and .6 really makes no sense to me.

Of course, I could update to proxmox-ve 4.x, which is Jessie based, but in
that case I have trouble with DRBD...


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Erasure code Plugins

2016-02-25 Thread Adrien Gillard
The way LRC works is that it creates an additional locality parity chunk for
every l chunks.
So with k=m=l=2, you will have 2 data chunks, 2 parity chunks and 2
locality parity chunks, i.e. 6 chunks in total.

Your ruleset-failure-domain is set to "host", as is your ruleset-locality,
so you will need 6 hosts in order to create the placement groups.

You can edit your EC profile / crushmap to set ruleset-failure-domain
and ruleset-locality to "osd", e.g. along the lines sketched below.
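
A sketch of creating such a profile and a pool that uses it (the names are just
examples; note that you cannot change the profile of an existing pool, so this
means creating a new pool):

  ceph osd erasure-code-profile set lrctest2 \
      plugin=lrc k=2 m=2 l=2 \
      ruleset-failure-domain=osd \
      ruleset-locality=osd
  ceph osd pool create ecpool-lrc 128 128 erasure lrctest2

With the failure domain at "osd", your 6 OSDs are enough to place all 6 chunks,
although you obviously lose the guarantee that the chunks end up on different
hosts.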

Adrien

On Thu, Feb 25, 2016 at 6:51 AM, Sharath Gururaj 
wrote:

> Try using more OSDs.
> I was encountering this scenario when my OSDs were equal to k+m.
> The errors went away when I used k+m+2.
> So in your case try with 8 or 10 OSDs.
>
> On Thu, Feb 25, 2016 at 11:18 AM, Daleep Singh Bais 
> wrote:
>
>> hi All,
>>
>> Any help in this regard will be appreciated.
>>
>> Thanks..
>> Daleep Singh Bais
>>
>>
>>  Forwarded Message 
>> Subject: Erasure code Plugins
>> Date: Fri, 19 Feb 2016 12:13:36 +0530
>> From: Daleep Singh Bais  
>> To: ceph-users  
>>
>> Hi All,
>>
>> I am experimenting with erasure profiles and would like to understand
>> more about them. I created an LRC profile based on
>> http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/
>>
>> The LRC profile created by me is
>>
>> *ceph osd erasure-code-profile get lrctest1*
>> k=2
>> l=2
>> m=2
>> plugin=lrc
>> ruleset-failure-domain=host
>> ruleset-locality=host
>> ruleset-root=default
>>
>> However, when I create a pool based on this profile, I see a health
>> warning in ceph -w ( 128 pgs stuck inactive and 128 pgs stuck unclean).
>> This is the first pool in cluster.
>>
>> As i understand, m is parity bit and l will create additional parity bit
>> for data bit k. Please correct me if I am wrong.
>>
>> Below is output of ceph -w
>>
>> health HEALTH_WARN
>> *128 pgs stuck inactive*
>> *128 pgs stuck unclean*
>>  monmap e7: 1 mons at {node1=192.168.1.111:6789/0}
>> election epoch 101, quorum 0 node1
>>  osdmap e928: *6 osds: 6 up, 6 in*
>> flags sortbitwise
>>   pgmap v54114: 128 pgs, 1 pools, 0 bytes data, 0 objects
>> 10182 MB used, 5567 GB / 5589 GB avail
>>  *128 creating*
>>
>>
>> Any help or guidance in this regard is highly appreciated.
>>
>> Thanks,
>>
>> Daleep Singh Bais
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Huan Zhang
Hi,
   We test sync IOPS with fio sync=1 for database workloads in a VM;
the backend is librbd and Ceph (an all-SSD setup).

   The result is sad to me: we only get ~400 IOPS of sync randwrite, from
iodepth=1 to iodepth=32.

   But testing on a physical machine with fio ioengine=rbd sync=1, we can
reach ~35K IOPS.
It seems the qemu rbd layer is the bottleneck.

   The qemu version is 2.1.2 with rbd_aio_flush patched.
rbd cache is off, qemu cache=none.

So what's wrong with it? Is that normal? Could you give me some help?
Thanks very much.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO

2016-02-25 Thread Saverio Proto
I confirm that the bug is fixed with the 0.94.6 release packages.

thank you

Saverio


2016-02-22 10:20 GMT+01:00 Saverio Proto :
> Hello Jason,
>
> from this email on ceph-dev
> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29692
>
> it looks like 0.94.6 is coming out very soon. We will avoid testing the
> unreleased packages then and wait for the official release. Thank
> you
>
> Saverio
>
>
> 2016-02-19 18:53 GMT+01:00 Jason Dillaman :
>> Correct -- a v0.94.6 tag on the hammer branch won't be created until the 
>> release.
>>
>> --
>>
>> Jason Dillaman
>>
>>
>> - Original Message -
>>> From: "Saverio Proto" 
>>> To: "Jason Dillaman" 
>>> Cc: ceph-users@lists.ceph.com
>>> Sent: Friday, February 19, 2016 11:38:08 AM
>>> Subject: Re: [ceph-users] Cannot reliably create snapshot after freezing 
>>> QEMU IO
>>>
>>> Hello,
>>>
>>> thanks for the pointer. Just to make sure, for dev/QE hammer release,
>>> do you mean the "hammer" branch ? So following the documentation,
>>> because I use Ubuntu Trusty, this should be the repository right ?
>>>
>>> deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer
>>> trusty main
>>>
>>> thanks
>>>
>>> Saverio
>>>
>>>
>>>
>>>
>>> 2016-02-19 16:41 GMT+01:00 Jason Dillaman :
>>> > I believe 0.94.6 is still in testing because of a possible MDS issue [1].
>>> > You can download the interim dev/QE hammer release by following the
>>> > instructions here [2] if you are in a hurry.  You would only need to
>>> > upgrade librbd1 (and its dependencies) to pick up the fix.  When you do
>>> > upgrade (either with the interim or the official release), I would
>>> > appreciate it if you could update the ticket to let me know if it resolved
>>> > your issue.
>>> >
>>> > [1] http://tracker.ceph.com/issues/13356
>>> > [2] http://docs.ceph.com/docs/master/install/get-packages/
>>> >
>>> > --
>>> >
>>> > Jason Dillaman
>>> >
>>> >
>>> > - Original Message -
>>> >> From: "Saverio Proto" 
>>> >> To: ceph-users@lists.ceph.com
>>> >> Sent: Friday, February 19, 2016 10:11:01 AM
>>> >> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU
>>> >> IO
>>> >>
>>> >> Hello,
>>> >>
>>> >> we are hitting here Bug #14373 in our production cluster
>>> >> http://tracker.ceph.com/issues/14373
>>> >>
>>> >> Since we introduced the object map feature in our cinder rbd volumes,
>>> >> we are not able to make snapshot the volumes, unless they pause the
>>> >> VMs.
>>> >>
>>> >> We are running the latest Hammer and so we are really looking forward
>>> >> release v0.94.6
>>> >>
>>> >> Does anyone know when the release is going to happen ?
>>> >>
>>> >> If the release v0.94.6 is far away, we might have to build custom
>>> >> packages for Ubuntu and we really would like to avoid that.
>>> >> Any input ?
>>> >> Anyone else sharing the same bug ?
>>> >>
>>> >> thank you
>>> >>
>>> >> Saverio
>>> >> ___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-25 Thread Ferhat Ozkasgarli
Hello,

I have also had some good experience with the Micron M510DC. The disk has
pretty solid performance scores and works well with Ceph.

P.S.: Do not forget: if you are going to use a RAID controller, make sure
your RAID card is in HBA (non-RAID) mode.
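
As the quoted reply below also stresses, test candidate drives yourself before
buying in bulk. A minimal sync-write test in the spirit of the well-known
journal benchmark looks roughly like this (a sketch; /dev/sdX is a placeholder
and the run overwrites data on that device):

  fio --name=journal-test --filename=/dev/sdX \
      --ioengine=libaio --direct=1 --sync=1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

Drives without proper power-loss protection tend to collapse to a few hundred
IOPS on this kind of workload, which is what ends up limiting the Ceph journal.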



On Thu, Feb 25, 2016 at 8:23 AM, Shinobu Kinjo  wrote:

> Thanks, Robert for your more specific explanation.
>
> Rgds,
> Shinobu
>
> - Original Message -
> From: "Robert LeBlanc" 
> To: "Shinobu Kinjo" 
> Cc: "ceph-users" 
> Sent: Thursday, February 25, 2016 2:56:15 PM
> Subject: Re: [ceph-users] List of SSDs
>
> We are moving to the Intel S3610, from our testing it is a good balance
> between price, performance and longevity. But as with all things, do your
> testing ahead of time. This will be our third model of SSDs for our
> cluster. The S3500s didn't have enough life and performance tapers off add
> it gets full. The Micron M600s looked good with the Sebastian journal
> tests, but once in use for a while go downhill pretty bad. We also tested
> Micron M500dc drives and they were on par with the S3610s and are more
> expensive and are closer to EoL. The S3700s didn't have quite the same
> performance as the S3610s, but they will last forever and are very stable
> in terms of performance and have the best power loss protection.
>
> Short answer is test them for yourself to make sure they will work. You are
> pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty
> safe based on my experience. It had also been mentioned that someone has
> had good experience with a Samsung DC Pro (has to have both DC and Pro in
> the name), but we weren't able to get any quick enough to test so I can't
> vouch for them.
>
> Sent from a mobile device, please excuse any typos.
> On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  wrote:
>
> > Hello,
> >
> > There has been a bunch of discussion about using SSD.
> > Does anyone have any list of SSDs describing which SSD is highly
> > recommended, which SSD is not.
> >
> > Rgds,
> > Shinobu
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com