Re: [ceph-users] Does anyone know why cephfs do not support EC pool?

2016-10-17 Thread huang jun
you can look into this:
https://github.com/ceph/ceph/pull/10334
https://github.com/ceph/ceph/compare/master...athanatos:wip-ec-cache
the community has done a lot of work related to EC for the rbd and fs interfaces.
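Until that lands, the usual workaround is to put a replicated cache tier in front of
the EC data pool, so that CephFS only ever issues partial writes against the
replicated tier. A minimal sketch, assuming purely illustrative pool names, PG counts
and the default EC profile:

---
# EC data pool plus a replicated pool to act as its cache tier
ceph osd pool create cephfs_data_ec 128 128 erasure
ceph osd pool create cephfs_cache 128 128
ceph osd pool create cephfs_metadata 128 128

# put the replicated pool in front of the EC pool as a writeback cache
ceph osd tier add cephfs_data_ec cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data_ec cephfs_cache
# (a real setup also needs hit_set_* and target_max_bytes/objects on the cache pool)

# create the filesystem on the now cache-fronted EC data pool
ceph fs new cephfs cephfs_metadata cephfs_data_ec
---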

2016-10-18 13:06 GMT+08:00 Erick Perez - Quadrian Enterprises <
epe...@quadrianweb.com>:

> On Mon, Oct 17, 2016 at 9:23 PM, huang jun  wrote:
>
>> EC pools only support write-full and append operations, not partial writes;
>> you can try it by doing random writes and see whether the OSDs crash or not.
>>
>> 2016-10-18 10:10 GMT+08:00 Liuxuan :
>> > Hello:
>> >
>> >
>> >
>> >   I have created a CephFS whose data pool is erasure coded and whose metadata
>> > pool is replicated. The cluster reported errors from the MDSMonitor::_check_pool
>> > function.
>> >
>> >  But when I skip the pool type check, CephFS can write and read
>> > data. Does anyone know why CephFS does not support EC pools?
>> >
>> >
>> >
>> > 
>> >
>> > liuxuan
>> >
>> >
>> >
>>
>>
>>
>> --
>> Thank you!
>> HuangJun
>>
>
>
> Is EC on the roadmap for Ceph? I can't seem to find it. My question is
> because "all the others" (Nutanix, Hypergrid) do EC storage for VMs as the
> default way of storing data. It seems EC in Ceph (as of Sept 2016) is considered
> by many "experimental" unless it is used for cold data.
> --
>
> -
> Erick Perez
> Soluciones Tacticas Pasivas/Activas de Inteligencia y Analitica de Datos
> para Gobiernos
> Quadrian Enterprises S.A. - Panama, Republica de Panama
> Skype chat: eaperezh
> WhatsApp IM: +507-6675-5083
> POBOX 0819-12679, Panama
> Tel. (507) 391.8174 / (507) 391.8175
>



-- 
Thank you!
HuangJun


Re: [ceph-users] Does anyone know why cephfs do not support EC pool?

2016-10-17 Thread Erick Perez - Quadrian Enterprises
On Mon, Oct 17, 2016 at 9:23 PM, huang jun  wrote:

> EC pools only support write-full and append operations, not partial writes;
> you can try it by doing random writes and see whether the OSDs crash or not.
>
> 2016-10-18 10:10 GMT+08:00 Liuxuan :
> > Hello:
> >
> >
> >
> >   I have created a CephFS whose data pool is erasure coded and whose metadata
> > pool is replicated. The cluster reported errors from the MDSMonitor::_check_pool
> > function.
> >
> >  But when I skip the pool type check, CephFS can write and read
> > data. Does anyone know why CephFS does not support EC pools?
> >
> >
> >
> > 
> >
> > liuxuan
> >
> >
> >
> > 
> >
>
>
>
> --
> Thank you!
> HuangJun
>


Is EC on the roadmap for Ceph? I can't seem to find it. My question is because
"all the others" (Nutanix, Hypergrid) do EC storage for VMs as the default way
of storing data. It seems EC in Ceph (as of Sept 2016) is considered by many
"experimental" unless it is used for cold data.
-- 

-
Erick Perez
Soluciones Tacticas Pasivas/Activas de Inteligencia y Analitica de Datos
para Gobiernos
Quadrian Enterprises S.A. - Panama, Republica de Panama
Skype chat: eaperezh
WhatsApp IM: +507-6675-5083
POBOX 0819-12679, Panama
Tel. (507) 391.8174 / (507) 391.8175


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Christian Balzer

Hello,

On Tue, 18 Oct 2016 00:19:53 +0200 Lars Marowsky-Bree wrote:

> On 2016-10-17T15:31:31, Maged Mokhtar  wrote:
> 
> > This is our first beta version, we do not support cache tiering. We
> > definitely intend to support it.
> 
> Cache tiering in Ceph works for this use case. I assume you mean in
> your UI?
> 
May well be, but Oliver suggested that cache-tiering is not supported with
Hammer (0.94.x), which it most certainly is.
Unless you use 0.94.6, which will eat your babies and data.

> Though we all are waiting for Luminous to do away with the need for
> cache tiering to do rbd to ec pools ...
> 
Well, there's the EC band-aid use case for cache-tiers, but they can be
very helpful otherwise, depending on the size of working set/cache-pool,
configuration of the cache-pool (write-back vs. read-forward) and
specific use case.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Does anyone know why cephfs do not support EC pool?

2016-10-17 Thread huang jun
EC pools only support write-full and append operations, not partial writes;
you can try it by doing random writes and see whether the OSDs crash or not.

2016-10-18 10:10 GMT+08:00 Liuxuan :
> Hello:
>
>
>
>   I have created a CephFS whose data pool is erasure coded and whose metadata pool
> is replicated. The cluster reported errors from the MDSMonitor::_check_pool function.
>
>  But when I skip the pool type check, CephFS can write and read
> data. Does anyone know why CephFS does not support EC pools?
>
>
>
> 
>
> liuxuan
>
>
>
>



-- 
Thank you!
HuangJun


[ceph-users] Does anyone know why cephfs do not support EC pool?

2016-10-17 Thread Liuxuan
Hello:

  I have created a CephFS whose data pool is erasure coded and whose metadata pool is
replicated. The cluster reported errors from the MDSMonitor::_check_pool function.
 But when I skip the pool type check, CephFS can write and read data.
Does anyone know why CephFS does not support EC pools?


liuxuan



Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-17 Thread Christian Balzer

Hello,

As I had this written mostly already and since it covers some points Nick
raised in more detail, here we go.

On Mon, 17 Oct 2016 16:30:48 +0800 William Josefsson wrote:

> Thx Christian for helping troubleshooting the latency issues. I have
> attached my fio job template below.
> 
There's no trouble here per se, just facts of life (Ceph).

You'll be well advised to search the ML, especially with what Nick Fisk
had to write about these things (several times).

> I thought to eliminate the factor that the VM is the bottleneck, I've
> created a 128GB 32 cCPU flavor. 
Nope, The client is not the issue.

>Here's the latest fio benchmark.
> http://pastebin.ca/raw/3729693   I'm trying to benchmark the clusters
> performance for SYNCED WRITEs and how well suited it would be for disk
> intensive workloads or DBs
>

A single IOPS of that type and size will only hit the journal and be
ACK'ed quickly (well, quicker than what you see now), but FIO is creating
a constant stream of requests, eventually hitting the actual OSDs as well.

Aside from CPU load, of course.

> 
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
> 
> To consume more of the SSDs in the hope to achieve lower latency, can
> you pls advice what parameters I should be looking at? 

Not going to help with your prolonged FIO runs, and once the flushing to the
OSDs commences, stalls will ensue.
The moment the journal is full or the timers kick in, things will go down
to OSD (HDD) speed. 
The journal is there to help with small, short bursts.

>I have already
> tried to what's mentioned in RaySun's ceph blog, which eventually
> lowered my overall sync write IOPs performance by 1-2k.
>
Unsurprisingly, the default values are there for a reason.
 
> # These are from RaySun's  write up, and worsen my total IOPs.
> # 
> http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
> 
> filestore xattr use omap = true
> filestore min sync interval = 10
Way too high, 0.5 is probably already excessive, I run with 0.1.

> filestore max sync interval = 15

> filestore queue max ops = 25000
> filestore queue max bytes = 10485760
> filestore queue committing max ops = 5000
> filestore queue committing max bytes = 1048576
Your HDDs will choke on those 4. With a 10k SAS HDD a small increase of
the defaults may help.

> journal max write bytes = 1073714824
> journal max write entries = 1
> journal queue max ops = 5
> journal queue max bytes = 1048576
>
> My Journals are Intel s3610 200GB, split in 4-5 partitions each. 
Again, you want to even that out.

>When
> I did FIO on the disks locally with direct=1 and sync=1 the WRITE
> performance was 50k iops for 7 threads.
>
Yes, but as I wrote that's not how journals work, think more of 7
sequential writes, not rand-writes. 

And as I tried to explain before, the SSDs are not the bottleneck, your
CPUs may be and your OSD HDDs eventually will be. 
Run atop on all your nodes when doing those tests and see how much things
get pushed (CPUs, disks, the OSD processes).
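If you want something concrete to start from, a minimal sketch along those lines
(the 0.1s min sync interval mentioned above, defaults elsewhere; the values are
illustrative, not a tuned recommendation for your cluster):

---
[osd]
filestore min sync interval = 0.1   # 10 is far too high; 0.1 is what I run with
filestore max sync interval = 5     # the default
# leave the filestore/journal queue limits at their defaults unless testing on
# your own hardware shows that a small increase actually helps
---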

> My hardware specs:
> 
> - 3 Controllers, The mons run here
> Dell PE R630, 64GB, Intel SSD s3610
> - 9 Storage nodes
> Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
> OSD: 18x1.8TB Hitachi 10krpm SAS
> 
I can't really fault you for the choice of CPU, but smaller nodes with
higher speed and fewer cores may help with this extreme test case (in
normal production you're fine).

> RAID Controller is PERC 730
> 
> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
> Arista 7050X 10Gbit Switches with VARP, and LACP interfaces. I have
> from my VM pinged all hosts and the RTT is 0.3ms on the LAN. I did
> iperf, and I can do 10Gbps from the VM to the storage nodes.
> 
Bandwidth is irrelevant in this case, the RTT of 0.3ms feels a bit high.
If you look again at the flow in 
http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale

those will add up to a significant part of your Ceph latency.

To elaborate and demonstrate:

I have a test cluster consisting of 4 nodes, 2 of them HDD-backed OSDs with
SSD journals and 2 of them SSD based (4x DC S3610 400GB each) as a
cache-tier for the "normal" ones. All replication 2.
So for the purpose of this test, this is all 100% against the SSDs in the
cache-pool only.

The network is IPoIB (QDDR, 40Gb/s Infiniband) with 0.1ms latency between
nodes, CPU is a single E5-2620 v3.

If I run this from a VM:
---
fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
---

We wind up with:
---
  write: io=1024.0MB, bw=34172KB/s, iops=8543, runt= 30685msec
slat (usec): min=1, max=2874, avg= 4.66, stdev= 7.07
clat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
 lat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
---
During this 

Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Lars Marowsky-Bree
On 2016-10-17T15:31:31, Maged Mokhtar  wrote:

> This is our first beta version, we do not support cache tiering. We
> definitely intend to support it.

Cache tiering in Ceph works for this use case. I assume you mean in
your UI?

Though we all are waiting for Luminous to do away with the need for
cache tiering to do rbd to ec pools ...


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [ceph-users] resolve split brain situation in ceph cluster

2016-10-17 Thread Gregory Farnum
On Mon, Oct 17, 2016 at 4:58 AM, Manuel Lausch  wrote:
> Hi Gregory,
>
> each datacenter has its own IP subnet, which is routed. We simultaneously created
> iptables rules on each host which drop all packets going to or coming from the
> other datacenter. After this our application wrote to DC A, where 3 of the 5
> monitor nodes are.
> Now we modified the monmap in B (removed all mon nodes from DC A, so there
> are now 2 of 2 mons active). The monmap in A is untouched. The cluster in B
> was now active as well and the applications in B could now write to it. So
> we definitely wrote data to both cluster parts.
> After this we shut down the mon nodes in A. The part in A was now
> unavailable.
>
> Some hours later we removed the iptables rules and tried to rejoin the two
> parts.
> We rejoined the three mon nodes from A as new nodes; the old mon data from
> those nodes was destroyed.
>
>
> Do you need further information?

Oh, so you actually forced both data centers to go active on purpose.
Yeah, there's no realistic recovery from that besides throwing out one
side and then adding it back to the cluster in the other DC. Sorry.
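For reference, "throwing out one side" is the usual remove-and-re-add dance,
roughly as follows (IDs are placeholders; repeat the OSD steps for every OSD in
the discarded DC):

---
# remove the discarded DC's monitors from the surviving quorum
ceph mon remove <mon-id>

# evict each of its OSDs, then re-provision them as brand-new OSDs
ceph osd out <osd-id>
ceph osd crush remove osd.<osd-id>
ceph auth del osd.<osd-id>
ceph osd rm <osd-id>
# ...then wipe and redeploy the daemons so they rejoin with fresh identities
# instead of their stale maps
---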
-Greg


Re: [ceph-users] Appending to an erasure coded pool

2016-10-17 Thread Gregory Farnum
On Mon, Oct 17, 2016 at 3:34 AM, James Norman  wrote:
> Hi Gregory,
>
> Many thanks for your reply. I couldn't spot any resources that describe/show
> how you can successfully write / append to an EC pool with the librados API
> on those links. Do you know of any such examples or resources? Or is it just
> simply not possible?

If it's not in there I guess it's all "spoken" knowledge and you'll
have to dig through ceph-devel archives (probably for emails from
Sam). I'm not on the RADOS team, but the concept you need:
*) objects in EC pools can only be appended or truncated+recreated
*) because otherwise you'd need round-trip read-modify-write operations
*) so all operations must be in the block size you specify (or maybe
it's implicit based on stripe size and EC n count?) at pool create
time
*) including the appends.

I'm afraid that's about all the info I've got on it though.
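For completeness, a quick way to see the block size in question for an existing EC
pool (the pool name is illustrative; programmatically, librados exposes the same
value via rados_ioctx_pool_required_alignment()):

---
# the erasure-code profile gives k/m; the pool's stripe_width is the append alignment
ceph osd erasure-code-profile get default
ceph osd dump | grep ecpool        # look for the stripe_width field
# client side: buffer data and only call rados_append() in multiples of stripe_width
---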
-Greg

>
> Best regards,
>
> James Norman
>
> On 6 Oct 2016, at 19:17, Gregory Farnum  wrote:
>
> On Thu, Oct 6, 2016 at 4:08 AM, James Norman 
> wrote:
>
> Hi there,
>
> I am developing a web application that supports browsing, uploading,
> downloading, moving files in Ceph Rados pool. Internally to write objects we
> use rados_append, as it's often too memory intensive for us to have the full
> file in memory to do a rados_write_full.
>
> We do not control our customer's Ceph installations, such as whether they
> use replicated pools, EC pools etc. We've found that when dealing with a EC
> pool, our rados_append calls return error code 95 and message "Operation not
> supported".
>
> I've had several discussions with members in the IRC chatroom regarding
> this, and the general consensus I've got is:
> 1) Use write alignment.
> 2) Put a replicated pool in front of the EC pool
> 3) EC pools have a limited feature set
>
> Regarding point 1), are there any actual code example for how you would
> handle this in the context of rados_append? I have struggled to find even
> one. This seems to me something that should be handled by either the API
> libraries, or Ceph itself, not the client trying to write some data.
>
>
> librados requires a fair bit of knowledge from the user applications,
> yes. One thing you mention that sounds concerning is that you can't
> hold the objects in-memory — RADOS is not comfortable with very large
> objects and you'll find that things like backfill might not perform as
> you expect. (At this point everything will *probably* function, but it
> may be so slow as to make no difference to you when it hits that
> situation.) Certainly if your objects do not all fit neatly into
> buckets of a particular size and you have some that are very large,
> you will have a very not-uniform balance.
>
> But, if you want to learn about EC pools there is some documentation
> at http://docs.ceph.com/docs/master/dev/osd_internals/erasure_coding/
> (or in ceph.git/doc/dev/osd_internals/erasure_coding) from when they
> were being created.
>
>
> Regarding point 2) This seems to be a workaround, and generally not
> something we want to recommend to our customers. Is it detrimental to us an
> EC pool without a replicated pool? What are the performance costs of doing
> so?
>
>
> Yeah, don't do that. Cache pools are really tricky to use properly and
> turned out not to perform very well.
>
>
> Regarding point 3) Can you point me towards resources that describe what
> features / abilities you lose by adopting an EC pool?
>
>
> Same as above links, apparently. But really, you can read from and
> append to them. There are no object classes, no arbitrary overwrites,
> no omaps.
> -Greg
>
>


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Mike Christie
On 10/17/2016 02:40 PM, Mike Christie wrote:
> For the (non target_mode approach), everything that is needed for basic

Oops. Meant to write for the non target_mod_rbd approach.



Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Mike Christie
If it is just a couple kernel changes you should post them, so SUSE can
merge them in target_core_rbd and we can port them to upstream. You will
not have to carry them and SUSE and I will not have to re-debug the
problems :)

For the (non target_mode approach), everything that is needed for basic
IO, failover and failback (we only support active/passive right now and
no distributed PRs like SUSE) support is merged upstream:

- Linus's tree
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) for
4.9 has the kernel changes.
- The Ceph tree (https://github.com/ceph/ceph) has some rbd command line
tool changes that are needed.
- The multipath tools tree (https://github.com/ceph/ceph) has changes
needed for how we are doing active/passive with the rbd exclusive lock.

So you can build patches against those trees.

For SUSE's approach, I think everything is in SUSE's git trees which you
probably are familiar with already.

Also, if you are going to build off of upstream/distros and/or also
support other distros as a base, Kraken will have these features, and so
will RHEL 7.3 and RHCS 2.1.

And for setup/management Paul Cuzner (https://github.com/pcuzner)
implemented ansible playbooks to set everything up:

https://github.com/pcuzner/ceph-iscsi-ansible
https://github.com/pcuzner/ceph-iscsi-config

Maybe you can use that too, but since you are SUSE based I am guessing
you are using lrbd.


On 10/17/2016 10:24 AM, Maged Mokhtar wrote:
> Hi Lars,
> Yes I was aware of David Disseldorp & Mike Christie efforts to upstream
> the patches from a while back ago. I understand there will be a move
> away from the SUSE target_mod_rbd to support a more generic device
> handling but do not know what the current status of this work is. We
> have made a couple of tweaks to target_mod_rbd to support some issues
> with found with hyper-v which could be of use, we would be glad to help
> in any way.
> We will be moving to Jewel soon, but are still using Hammer simply
> because we did not have time to test it well.
> In our project we try to focus on HA clustered iSCSI only and make it
> easy to setup and use. Drbd will not give a scale-out solution.
> I will look into github, maybe it will help us in the future.
> 
> Cheers /maged
> 
> --
> From: "Lars Marowsky-Bree" 
> Sent: Monday, October 17, 2016 4:21 PM
> To: 
> Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project
> 
>> On 2016-10-17T13:37:29, Maged Mokhtar  wrote:
>>
>> Hi Maged,
>>
>> glad to see our patches caught your attention. You're aware that they
>> are being upstreamed by David Disseldorp and Mike Christie, right? You
>> don't have to uplift patches from our backported SLES kernel ;-)
>>
>> Also, curious why you based this on Hammer; SUSE Enterprise Storage at
>> this point is based on Jewel. Did you experience any problems with the
>> older release? The newer one has important fixes.
>>
>> Is this supposed to be a separate product/project forever? I mean, there
>> are several management frontends for Ceph at this stage gaining the
>> iSCSI functionality.
>>
>> And, lastly, if all I wanted to build was an iSCSI target and not expose
>> the rest of Ceph's functionality, I'd probably build it around drbd9.
>>
>> But glad to see the iSCSI frontend is gaining more traction. We have
>> many customers in the field deploying it successfully with our support
>> package.
>>
>> OK, not quite lastly - could you be convinced to make the source code
>> available in a bit more convenient form? I doubt that's the preferred
>> form of distribution for development ;-) A GitHub repo maybe?
>>
>>
>> Regards,
>>Lars
>>
>> -- 
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>


Re: [ceph-users] radowsg keystone integration in mitaka

2016-10-17 Thread Andrew Woodward
Some config hints here: if you convert your config, you have to unset the
admin_token and change the API version to 3; then you can specify the
keystone user, password, domain, tenant, etc.

You can see what we do for puppet-ceph [1] if you need a reference.
[1]
https://github.com/openstack/puppet-ceph/blob/master/manifests/rgw/keystone.pp
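If you also need the keystone side, creating the service user that rgw
authenticates as looks roughly like this (user, project and role names are
examples, not taken from this thread):

---
openstack user create --domain default --project service --password-prompt radosgw
openstack role add --user radosgw --project service admin
---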

On Sat, Oct 15, 2016 at 9:22 AM Logan V.  wrote:

> The ability to use Keystone v3 and authtokens in lieu of admin token was
> added in jewel. The release notes state it but unfortunately the Jewel docs
> don't reflect it, so you'll need to visit
> http://docs.ceph.com/docs/master/radosgw/keystone/ to find the
> configuration information.
>
> When I tested this out, I had something like:
>
> [client.rgw.radosgw-1]
> rgw keystone admin user = radosgw
> rgw keystone admin password = 
> rgw keystone token cache size = 1
> keyring = /var/lib/ceph/radosgw/ceph-rgw.radosgw-1/keyring
> rgw keystone url = http://keystone-admin-endpoint:35357
> rgw data = /var/lib/ceph/radosgw/ceph-rgw.radosgw-1
> rgw keystone admin tenant = service
> rgw keystone admin domain = default
> rgw keystone api version = 3
> host = radosgw-1
> rgw s3 auth use keystone = true
> rgw socket path = /tmp/radosgw-radosgw-1.sock
> log file = /var/log/ceph/ceph-rgw-radosgw-1.log
> rgw keystone accepted roles = Member, _member_, admin
> rgw frontends = civetweb port=10.13.32.15:8080 num_threads=50
> rgw keystone revocation interval = 900
>
> Logan
>
>
> On Friday, October 14, 2016, Jonathan Proulx  wrote:
>
> Hi All,
>
> Recently upgraded from Kilo->Mitaka on my OpenStack deploy and now
> radowsgw nodes (jewel) are unable to validate keystone tokens.
>
>
> Initially I thought it was because radosgw relies on admin_token
> (which is a bad idea, but ...) and that's now deprecated.  I
> verified the token was still in keystone.conf and fixed it when I found
> it had been commented out of keystone-paste.ini, but even after fixing
> that and restarting my keystone I get:
>
>
> -- grep req-a5030a83-f265-4b25-b6e5-1918c978f824
> /var/log/keystone/keystone.log
> 2016-10-14 15:12:47.631 35977 WARNING keystone.middleware.auth
> [req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] Deprecated:
> build_auth_context middleware checking for the admin token is deprecated as
> of the Mitaka release and will be removed in the O release. If your
> deployment requires use of the admin token, update keystone-paste.ini so
> that admin_token_auth is before build_auth_context in the paste pipelines,
> otherwise remove the admin_token_auth middleware from the paste pipelines.
> 2016-10-14 15:12:47.671 35977 INFO keystone.common.wsgi
> [req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] GET
> https://nimbus-1.csail.mit.edu:35358/v2.0/tokens/
> 2016-10-14 15:12:47.672 35977 WARNING oslo_log.versionutils
> [req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] Deprecated:
> validate_token of the v2 API is deprecated as of Mitaka in favor of a
> similar function in the v3 API and may be removed in Q.
> 2016-10-14 15:12:47.684 35977 WARNING keystone.common.wsgi
> [req-a5030a83-f265-4b25-b6e5-1918c978f824 - - - - -] You are not authorized
> to perform the requested action: identity:validate_token
>
> I've dug through keystone/policy.json and identity:validate_token is
> authorized to "role:admin or is_admin:1" which I *think* should cover
> the token use case...but not 100% sure.
>
> Can radosgw use a proper keystone user so I can avoid the admin_token
> mess (http://docs.ceph.com/docs/jewel/radosgw/keystone/ seems to
> indicate no)?
>
> Or anyone see where in my keystone chain I might have dropped a link?
>
> Thanks,
> -Jon
-- 
Andrew Woodward


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Thanks Wei/Pavan for the response; it seems I need to debug the OSDs to find out
what is causing the slowdown.
Will update the community if I find anything conclusive.

Regards
Somnath

-Original Message-
From: Wei Jin [mailto:wjin...@gmail.com] 
Sent: Monday, October 17, 2016 2:13 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 3:16 PM, Somnath Roy  wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed to 
> either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based)  is 
> stressed with large block size and very high QD. Lowering QD it is working 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size 
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890 
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
> subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly and 
> rebalancing started. This is hurting performance very badly.

I think you need to tune the threads' timeout values, as heartbeat messages will be
dropped during a timeout and suicide (the health check will fail).
That's why you observe the 'wrongly marked me down' message while the osd process is
still alive. See the function OSD::handle_osd_ping().

Also, you could backport this
PR (https://github.com/ceph/ceph/pull/8808) to accelerate the handling of heartbeat
messages.

After that, you may consider tuning grace time.
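For illustration only, the knobs in question go into ceph.conf roughly like this
(the option names are the standard ones, the values are made-up examples; raising
them only delays failure detection, as discussed below):

---
[osd]
osd heartbeat grace = 40               # default is 20s
osd op thread suicide timeout = 300    # default is 150s; the suicide/health check mentioned above
[mon]
mon osd min down reporters = 5         # require more peers to report an OSD down
---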


>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
> 10-12Gb/s , no network error is reported. So, why this lossy connection 
> message is coming ? what could go wrong here ? Is it network prioritization 
> issue of smaller ping packets ? I tried to gaze ping round time during this 
> and nothing seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk 
> is left. So, I doubt my osds are unresponsive but yes it is really busy on IO 
> path. Heartbeat is going through separate messenger and threads as well, so, 
> busy op threads should not be making heartbeat delayed. Increasing osd 
> heartbeat grace is only delaying this phenomenon , but, eventually happens 
> after several hours. Anything else we can tune here ?
>
> 3. What could be the side effect of big grace period ? I understand that 
> detecting a faulty osd will be delayed, anything else ?
>
> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
> instantaneously and it is not waiting till this grace period. How it is 
> distinguishing between unresponsive and crashed osds ? In which scenario this 
> heartbeat grace is coming into picture ?
>
> Any help on clarifying this would be very helpful.
>
> Thanks & Regards
> Somnath


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Thank you David very much and thank you for the correction.

--
From: "David Disseldorp" 
Sent: Monday, October 17, 2016 5:24 PM
To: "Maged Mokhtar" 
Cc: ; "Oliver Dzombic" ; 
"Mike Christie" 

Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi Maged,

Thanks for the announcement - good luck with the project!
One comment...

On Mon, 17 Oct 2016 13:37:29 +0200, Maged Mokhtar wrote:


if you are refering to clustering reservations through VAAI. We are using
upstream code from SUSE Enterprise Storage which adds clustered support 
for

VAAI (compare and write, write same) in the kernel as well as in ceph
(implemented as atomic  osd operations). We have tested VMware HA and
vMotion and they work fine. We have a guide you can download on this use
case.


Just so there's no ambiguity here, the vast majority of the clustered
compare-and-write and write-same implementation was done by Mike
Christie from Red Hat.

Cheers, David 




Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Hi Lars,
Yes, I was aware of David Disseldorp's & Mike Christie's efforts to upstream the
patches from a while back. I understand there will be a move away from
the SUSE target_mod_rbd to support more generic device handling, but I do not
know what the current status of this work is. We have made a couple of
tweaks to target_mod_rbd to address some issues we found with Hyper-V,
which could be of use; we would be glad to help in any way.
We will be moving to Jewel soon, but are still using Hammer simply because
we did not have time to test it well.
In our project we try to focus on HA clustered iSCSI only and make it easy
to set up and use. Drbd will not give a scale-out solution.

I will look into github, maybe it will help us in the future.

Cheers /maged

--
From: "Lars Marowsky-Bree" 
Sent: Monday, October 17, 2016 4:21 PM
To: 
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


On 2016-10-17T13:37:29, Maged Mokhtar  wrote:

Hi Maged,

glad to see our patches caught your attention. You're aware that they
are being upstreamed by David Disseldorp and Mike Christie, right? You
don't have to uplift patches from our backported SLES kernel ;-)

Also, curious why you based this on Hammer; SUSE Enterprise Storage at
this point is based on Jewel. Did you experience any problems with the
older release? The newer one has important fixes.

Is this supposed to be a separate product/project forever? I mean, there
are several management frontends for Ceph at this stage gaining the
iSCSI functionality.

And, lastly, if all I wanted to build was an iSCSI target and not expose
the rest of Ceph's functionality, I'd probably build it around drbd9.

But glad to see the iSCSI frontend is gaining more traction. We have
many customers in the field deploying it successfully with our support
package.

OK, not quite lastly - could you be convinced to make the source code
available in a bit more convenient form? I doubt that's the preferred
form of distribution for development ;-) A GitHub repo maybe?


Regards,
   Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nürnberg)

"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread David Disseldorp
Hi Maged,

Thanks for the announcement - good luck with the project!
One comment...

On Mon, 17 Oct 2016 13:37:29 +0200, Maged Mokhtar wrote:

> if you are referring to clustering reservations through VAAI. We are using
> upstream code from SUSE Enterprise Storage which adds clustered support for 
> VAAI (compare and write, write same) in the kernel as well as in ceph 
> (implemented as atomic  osd operations). We have tested VMware HA and 
> vMotion and they work fine. We have a guide you can download on this use 
> case.

Just so there's no ambiguity here, the vast majority of the clustered
compare-and-write and write-same implementation was done by Mike
Christie from Red Hat.

Cheers, David


Re: [ceph-users] debian jewel jessie packages missing from Packages file

2016-10-17 Thread Jon Morby (FidoNet)
Thanks

Yes … working again … *phew* :)

> On 17 Oct 2016, at 14:01, Dan Milon  wrote:
> 
> debian/jessie/jewel is fine now.

—
Jon Morby
FidoNet - the internet made simple!
tel: 0345 004 3050 / fax: 0345 004 3051
twitter: @fido | skype://jmorby  | web: https://www.fido.net







Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Lars Marowsky-Bree
On 2016-10-17T13:37:29, Maged Mokhtar  wrote:

Hi Maged,

glad to see our patches caught your attention. You're aware that they
are being upstreamed by David Disseldorp and Mike Christie, right? You
don't have to uplift patches from our backported SLES kernel ;-)

Also, curious why you based this on Hammer; SUSE Enterprise Storage at
this point is based on Jewel. Did you experience any problems with the
older release? The newer one has important fixes.

Is this supposed to be a separate product/project forever? I mean, there
are several management frontends for Ceph at this stage gaining the
iSCSI functionality.

And, lastly, if all I wanted to build was an iSCSI target and not expose
the rest of Ceph's functionality, I'd probably build it around drbd9.

But glad to see the iSCSI frontend is gaining more traction. We have
many customers in the field deploying it successfully with our support
package.

OK, not quite lastly - could you be convinced to make the source code
available in a bit more convenient form? I doubt that's the preferred
form of distribution for development ;-) A GitHub repo maybe?


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Oliver Dzombic
Hi Maged,

sounds very valid.

And as soon as we can, we will try it out.

Thank you, and good luck with your project !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 17.10.2016 at 15:31, Maged Mokhtar wrote:
> Hi Oliver,
> 
> This is our first beta version, we do not support cache tiering. We
> definitely intend to support it.
> Cheers /maged
> 
> --
> From: "Oliver Dzombic" 
> Sent: Monday, October 17, 2016 2:05 PM
> To: 
> Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project
> 
>> Hi Maged,
>>
>> thank you for your clearification ! That makes it intresting.
>>
>> I have red that your base is ceph 0.94, in this version using cache tier
>> is not recommanded, if i remember correctly.
>>
>> Does your codemodification also take care of this issue ?
>>
>> -- 
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> On 17.10.2016 at 13:37, Maged Mokhtar wrote:
>>> Hi Oliver,
>>>
>>> if you are refering to clustering reservations through VAAI. We are
>>> using upstream code from SUSE Enterprise Storage which adds clustered
>>> support for VAAI (compare and write, write same) in the kernel as well
>>> as in ceph (implemented as atomic  osd operations). We have tested
>>> VMware HA and vMotion and they work fine. We have a guide you can
>>> download on this use case.
>>>
>>> --
>>> From: "Oliver Dzombic" 
>>> Sent: Sunday, October 16, 2016 10:58 PM
>>> To: 
>>> Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project
>>>
 Hi,

 its using LIO, means it will have the same compatibelity issues with
 vmware.

 So i am wondering, why they call it an idial solution.

 -- 
 Mit freundlichen Gruessen / Best regards

 Oliver Dzombic
 IP-Interactive

 mailto:i...@ip-interactive.de

 Anschrift:

 IP Interactive UG ( haftungsbeschraenkt )
 Zum Sonnenberg 1-3
 63571 Gelnhausen

 HRB 93402 beim Amtsgericht Hanau
 Geschäftsführung: Oliver Dzombic

 Steuer Nr.: 35 236 3622 1
 UST ID: DE274086107


 On 16.10.2016 at 18:57, Maged Mokhtar wrote:
> Hello,
>
> I am happy to announce PetaSAN, an open source scale-out SAN that uses
> Ceph storage and LIO iSCSI Target.
> visit us at:
> www.petasan.org
>
> your feedback will be much appreciated.
> maged mokhtar


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Hi Oliver,

This is our first beta version, we do not support cache tiering. We 
definitely intend to support it.

Cheers /maged

--
From: "Oliver Dzombic" 
Sent: Monday, October 17, 2016 2:05 PM
To: 
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi Maged,

thank you for your clarification! That makes it interesting.

I have read that your base is ceph 0.94; in this version using cache tiering
is not recommended, if I remember correctly.

Does your code modification also take care of this issue?

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 17.10.2016 at 13:37, Maged Mokhtar wrote:

Hi Oliver,

if you are refering to clustering reservations through VAAI. We are
using upstream code from SUSE Enterprise Storage which adds clustered
support for VAAI (compare and write, write same) in the kernel as well
as in ceph (implemented as atomic  osd operations). We have tested
VMware HA and vMotion and they work fine. We have a guide you can
download on this use case.

--
From: "Oliver Dzombic" 
Sent: Sunday, October 16, 2016 10:58 PM
To: 
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi,

its using LIO, means it will have the same compatibelity issues with
vmware.

So i am wondering, why they call it an idial solution.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 16.10.2016 at 18:57, Maged Mokhtar wrote:

Hello,

I am happy to announce PetaSAN, an open source scale-out SAN that uses
Ceph storage and LIO iSCSI Target.
visit us at:
www.petasan.org

your feedback will be much appreciated.
maged mokhtar


Re: [ceph-users] Ubuntu repo's broken

2016-10-17 Thread Alfredo Deza
On Mon, Oct 17, 2016 at 4:54 AM, Jon Morby (Fido)  wrote:
> full output at https://pastebin.com/tH65tNQy
>
>
> cephadmin@cephadmin:~$ cat /etc/apt/sources.list.d/ceph.list
> deb https://download.ceph.com/debian-jewel/ xenial main
>
>
> oh and fyi
> [osd04][WARNIN] W: 
> https://download.ceph.com/debian-jewel/dists/xenial/InRelease: Signature by 
> key 08B73419AC32B4E966C1A330E84AC2C0460F3994 uses weak digest algorithm (SHA1)
>
> - On 17 Oct, 2016, at 08:19, Wido den Hollander w...@42on.com wrote:
>
>>> Op 16 oktober 2016 om 11:57 schreef "Jon Morby (FidoNet)" :
>>>
>>>
>>> Morning
>>>
>>> It’s been a few days now since the outage however we’re still unable to 
>>> install
>>> new nodes, it seems the repo’s are broken … and have been for at least 2 
>>> days
>>> now (so not just a brief momentary issue caused by an update)
>>>
>>> [osd04][WARNIN] E: Package 'ceph-osd' has no installation candidate
>>> [osd04][WARNIN] E: Package 'ceph-mon' has no installation candidate
>>> [osd04][ERROR ] RuntimeError: command returned non-zero exit status: 100
>>> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
>>> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get 
>>> --assume-yes -q
>>> --no-install-recommends install -o Dpkg::Options::=--force-confnew ceph-osd
>>> ceph-mds ceph-mon radosgw
>>>
>>> Is there any eta for when this might be fixed?

This should be now fixed. We had some issues where the published
packages didn't fully sync so the repository database
was out of sync even though the binaries were there.

Sorry for the troubles. Let us know if you encounter any more issues.

>>>
>>
>> What is the line in your sources.list on your system?
>>
>> Afaik the mirrors are working fine.
>>
>> Wido
>>
>>> —
>>> Jon Morby
>>> FidoNet - the internet made simple!
>>> tel: 0345 004 3050 / fax: 0345 004 3051
>>> twitter: @fido | skype://jmorby  | web: https://www.fido.net
>>>
>>>
>>>
>
> --
> Jon Morby
> FidoNet - the internet made simple!
> 10 - 16 Tiller Road, London, E14 8PX
> tel: 0345 004 3050 / fax: 0345 004 3051
>
> Need more rack space?
> Check out our Co-Lo offerings at http://www.fido.net/services/colo/ 32 amp 
> racks in London and Brighton
> Linx ConneXions available at all Fido sites! 
> https://www.fido.net/services/backbone/connexions/
> PGP Key : 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - 
> http://jonmorby.com/B3B5AD3A.asc


Re: [ceph-users] Missing arm64 Ubuntu packages for 10.2.3

2016-10-17 Thread Alfredo Deza
On Fri, Oct 14, 2016 at 5:42 PM, Stillwell, Bryan J
 wrote:
> On 10/14/16, 2:29 PM, "Alfredo Deza"  wrote:
>
>>On Thu, Oct 13, 2016 at 5:19 PM, Stillwell, Bryan J
>> wrote:
>>> On 10/13/16, 2:32 PM, "Alfredo Deza"  wrote:
>>>
On Thu, Oct 13, 2016 at 11:33 AM, Stillwell, Bryan J
 wrote:
> I have a basement cluster that is partially built with Odroid-C2
>boards
>and
> when I attempted to upgrade to the 10.2.3 release I noticed that this
> release doesn't have an arm64 build.  Are there any plans on
>continuing
>to
> make arm64 builds?

We have a couple of machines for building ceph releases on ARM64 but
unfortunately they sometimes have issues and since Arm64 is
considered a "nice to have" at the moment we usually skip them if
anything comes up.

So it is an on-and-off kind of situation (I don't recall what happened
for 10.2.3)

But since you've asked, I can try to get them built and see if we can
get 10.2.3 out.
>>>
>>> Sounds good, thanks Alfredo!
>>
>>10.2.3 arm64 for xenial (and centos7) is out. We only have xenial
>>available for arm64, hopefully that will work for you.
>
> Thanks Alfredo, but I'm only seeing xenial arm64 dbg packages here:
>
> http://download.ceph.com/debian-jewel/pool/main/c/ceph/
>
>
> There's also a report on IRC that the Packages file no longer contains the
> 10.2.3 amd64 packages for xenial.

It looks like some files didn't make it when publishing the
repositories. This has since been corrected and all arm64 and
other missing entries in the db are now fixed.

Let me know if you have any issues. Thanks for reporting it
>
> Bryan
>


Re: [ceph-users] debian jewel jessie packages missing from Packages file

2016-10-17 Thread Dan Milon
debian/jessie/jewel is fine now.

On 10/17/2016 02:36 PM, Jon Morby (FidoNet) wrote:
> Hi Dan
>
> The repos do indeed seem to be messed up …. it’s been like it for at
> least 4 days now (since everything went offline)
>
> I raised it via IRC over the weekend and also on this list on Saturday … 
>
> All the mirrors seem to be affected too (GiGo I guess) :(
>
> Jon
>
>> On 17 Oct 2016, at 11:33, Dan Milon > > wrote:
>>
>> Hello,
>>
>> I'm trying to install ceph jewel from the debian repository, but it
>> seems to be in a very weird state.
>> ceph, ceph-mon, ceph-osd exist in the pool, but the Packages file does
>> not have any of them.
>>
>> https://download.ceph.com/debian-jewel/dists/jessie/main/binary-amd64/Packages
>>
>> The other days I did the same thing and the repo was fine.
>> Am I doing something wrong, or the repo is indeed messed up?
>>
>>
>> Thank you,
>> Dan.
>>
>
> — 
> Jon Morby
> FidoNet - the internet made simple!
> tel: 0345 004 3050 / fax: 0345 004 3051
> twitter: @fido | skype://jmorby | web: https://www.fido.net
>
>
>






Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Oliver Dzombic
Hi Maged,

thank you for your clarification! That makes it interesting.

I have read that your base is ceph 0.94; in this version using cache tiering
is not recommended, if I remember correctly.

Does your code modification also take care of this issue?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 17.10.2016 at 13:37, Maged Mokhtar wrote:
> Hi Oliver,
> 
> if you are referring to clustering reservations through VAAI. We are
> using upstream code from SUSE Enterprise Storage which adds clustered
> support for VAAI (compare and write, write same) in the kernel as well
> as in ceph (implemented as atomic  osd operations). We have tested
> VMware HA and vMotion and they work fine. We have a guide you can
> download on this use case.
> 
> --
> From: "Oliver Dzombic" 
> Sent: Sunday, October 16, 2016 10:58 PM
> To: 
> Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project
> 
>> Hi,
>>
>> it's using LIO, which means it will have the same compatibility issues with
>> vmware.
>>
>> So I am wondering why they call it an ideal solution.
>>
>> -- 
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> On 16.10.2016 at 18:57, Maged Mokhtar wrote:
>>> Hello,
>>>
>>> I am happy to announce PetaSAN, an open source scale-out SAN that uses
>>> Ceph storage and LIO iSCSI Target.
>>> visit us at:
>>> www.petasan.org
>>>
>>> your feedback will be much appreciated.
>>> maged mokhtar


Re: [ceph-users] resolve split brain situation in ceph cluster

2016-10-17 Thread Manuel Lausch

Hi Gregory,

each datacenter has its own IP subnet, which is routed. We simultaneously
created iptables rules on each host which drop all packets going to or coming
from the other datacenter. After this our application wrote
to DC A, where 3 of the 5 monitor nodes are.
Now we modified the monmap in B (removed all mon nodes from DC A, so
there are now 2 of 2 mons active). The monmap in A is untouched. The
cluster in B was now active as well and the applications in B could now
write to it. So we definitely wrote data to both cluster parts.
After this we shut down the mon nodes in A. The part in A was now
unavailable.
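For reference, the offline monmap edit described above is roughly the following
(monitor IDs and paths are illustrative):

---
# on a surviving mon in B, with the daemon stopped:
ceph-mon -i mon-b1 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm mon-a1 /tmp/monmap     # drop each DC A monitor
monmaptool --rm mon-a2 /tmp/monmap
monmaptool --rm mon-a3 /tmp/monmap
ceph-mon -i mon-b1 --inject-monmap /tmp/monmap
# repeat the inject on the other B monitor, then start the mons
---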


Some hours later we removed the iptables rules and tried to rejoin the
two parts.
We rejoined the three mon nodes from A as new nodes; the old mon data
from those nodes was destroyed.



Do you need further information?

Regards,
Manuel


On 14.10.2016 at 17:58, Gregory Farnum wrote:

On Fri, Oct 14, 2016 at 7:27 AM, Manuel Lausch  wrote:

Hi,

I need some help to fix a broken cluster. I think we broke the cluster, but
I want to know your opinion and if you see a possibility to recover it.

Let me explain what happend.

We have a cluster (version 0.94.9) in two datacenters (A and B), with 12
nodes of 60 OSDs each. In A we have 3 monitor nodes and in B 2. The CRUSH rule and
replication factor force two replicas in each datacenter.

We write objects via librados into the cluster. The objects are immutable, so
they are either present or absent.

In this cluster we tested what happens if datacenter A fails and we need
to bring up the cluster in B by creating a monitor quorum in B. We did this
by cutting off the network connection between the two datacenters. The OSDs from
DC B went down as expected. Now we removed the mon nodes from the monmap
in B (by extracting it offline and editing it). Our clients then wrote data in
both independent cluster parts before we stopped the mons in A. (YES, I know,
this is a really bad thing.)

This story line seems to be missing some points. How did you cut off
the network connection? What leads you to believe the OSDs accepted
writes on both sides of the split? Did you edit the monmap in both
data centers, or just DC A (that you wanted to remain alive)? What
monitor counts do you have in each DC?
-Greg


Now we try to join the two sides again. But so far without success.

Only the OSDs in B are running. The OSDs in A started but the OSDs stay
down. In the mon log we see a lot of „...(leader).pg v3513957 ignoring stats
from non-active osd“ alerts.

We see, that the current osdmap epoch in the running cluster is „28873“. In
the OSDs in A the epoch is „29003“. We assume that this is the reason why
the OSDs won't to jump in.


BTW: This is only a testcluster, so no important data are harmed.


Regards
Manuel


--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Frank Einhellinger, Thomas Ludwig, Jan Oetjen


Member of United Internet






--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Frank Einhellinger, Thomas Ludwig, Jan Oetjen


Member of United Internet


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Maged Mokhtar

Hi Oliver,

If you are referring to clustered reservations through VAAI: we are using 
upstream code from SUSE Enterprise Storage which adds clustered support for 
the VAAI primitives (compare-and-write, write-same) in the kernel as well as 
in Ceph (implemented as atomic OSD operations). We have tested VMware HA and 
vMotion and they work fine. We have a guide on this use case that you can 
download.
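
If you want to check it from the ESXi side yourself, the per-device VAAI 
primitive status can be listed with something like the following (the device 
id here is only an example, use your PetaSAN LUN's naa id):

esxcli storage core device vaai status get -d naa.6001405xxxxxxxxxxxxxxxxxxxx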


--
From: "Oliver Dzombic" 
Sent: Sunday, October 16, 2016 10:58 PM
To: 
Subject: Re: [ceph-users] new Open Source Ceph based iSCSI SAN project


Hi,

It's using LIO, which means it will have the same compatibility issues with 
VMware.


So I am wondering why they call it an ideal solution.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.10.2016 um 18:57 schrieb Maged Mokhtar:

Hello,

I am happy to announce PetaSAN, an open source scale-out SAN that uses
Ceph storage and LIO iSCSI Target.
visit us at:
www.petasan.org

your feedback will be much appreciated.
maged mokhtar


Re: [ceph-users] debian jewel jessie packages missing from Packages file

2016-10-17 Thread Jon Morby (FidoNet)
Hi Dan

The repos do indeed seem to be messed up … it's been like this for at least 4 
days now (since everything went offline).

I raised it via IRC over the weekend and also on this list on Saturday.

All the mirrors seem to be affected too (GIGO, I guess) :(

Jon

> On 17 Oct 2016, at 11:33, Dan Milon  wrote:
> 
> Hello,
> 
> I'm trying to install ceph jewel from the debian repository, but it
> seems to be in a very weird state.
> ceph, ceph-mon, ceph-osd exist in the pool, but the Packages file does
> not have any of them.
> 
> https://download.ceph.com/debian-jewel/dists/jessie/main/binary-amd64/Packages
> 
> The other days I did the same thing and the repo was fine.
> Am I doing something wrong, or the repo is indeed messed up?
> 
> 
> Thank you,
> Dan.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

—
Jon Morby
FidoNet - the internet made simple!
tel: 0345 004 3050 / fax: 0345 004 3051
twitter: @fido | skype://jmorby  | web: https://www.fido.net







Re: [ceph-users] Appending to an erasure coded pool

2016-10-17 Thread James Norman
Hi Gregory,

Many thanks for your reply. I couldn't spot anything in those links that 
describes or shows how to successfully write/append to an EC pool with the 
librados API. Do you know of any such examples or resources? Or is it simply 
not possible?

Best regards,

James Norman

> On 6 Oct 2016, at 19:17, Gregory Farnum  wrote:
> 
> On Thu, Oct 6, 2016 at 4:08 AM, James Norman  > wrote:
>> Hi there,
>> 
>> I am developing a web application that supports browsing, uploading,
>> downloading, moving files in Ceph Rados pool. Internally to write objects we
>> use rados_append, as it's often too memory intensive for us to have the full
>> file in memory to do a rados_write_full.
>> 
>> We do not control our customer's Ceph installations, such as whether they
>> use replicated pools, EC pools etc. We've found that when dealing with a EC
>> pool, our rados_append calls return error code 95 and message "Operation not
>> supported".
>> 
>> I've had several discussions with members in the IRC chatroom regarding
>> this, and the general consensus I've got is:
>> 1) Use write alignment.
>> 2) Put a replicated pool in front of the EC pool
>> 3) EC pools have a limited feature set
>> 
>> Regarding point 1), are there any actual code example for how you would
>> handle this in the context of rados_append? I have struggled to find even
>> one. This seems to me something that should be handled by either the API
>> libraries, or Ceph itself, not the client trying to write some data.
> 
> librados requires a fair bit of knowledge from the user applications,
> yes. One thing you mention that sounds concerning is that you can't
> hold the objects in-memory — RADOS is not comfortable with very large
> objects and you'll find that things like backfill might not perform as
> you expect. (At this point everything will *probably* function, but it
> may be so slow as to make no difference to you when it hits that
> situation.) Certainly if your objects do not all fit neatly into
> buckets of a particular size and you have some that are very large,
> you will have a very not-uniform balance.
> 
> But, if you want to learn about EC pools there is some documentation
> at http://docs.ceph.com/docs/master/dev/osd_internals/erasure_coding/ 
> 
> (or in ceph.git/doc/dev/osd_internals/erasure_coding) from when they
> were being created.
> 
>> 
>> Regarding point 2) This seems to be a workaround, and generally not
>> something we want to recommend to our customers. Is it detrimental to us an
>> EC pool without a replicated pool? What are the performance costs of doing
>> so?
> 
> Yeah, don't do that. Cache pools are really tricky to use properly and
> turned out not to perform very well.
> 
>> 
>> Regarding point 3) Can you point me towards resources that describe what
>> features / abilities you lose by adopting an EC pool?
> 
> Same as above links, apparently. But really, you can read from and
> append to them. There are no object classes, no arbitrary overwrites,
> no omaps.
> -Greg
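
For anyone else hitting this: below is a minimal, untested sketch of the kind
of alignment handling discussed above for rados_append() on an EC pool. It
queries the pool's required alignment and zero-pads each appended chunk up to
the next boundary. The helper name is just illustrative, and note that padding
changes the stored object length, so the logical length has to be tracked
separately (e.g. in an xattr).

/* Sketch only: append to an EC pool by padding each chunk to the
 * pool's required alignment. Error handling trimmed for brevity. */
#include <rados/librados.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

int append_aligned(rados_ioctx_t io, const char *oid,
                   const char *buf, size_t len)
{
    uint64_t align = 0;
    if (rados_ioctx_pool_requires_alignment(io))
        align = rados_ioctx_pool_required_alignment(io);

    /* replicated pool, or chunk already aligned: plain append is fine */
    if (align == 0 || len % align == 0)
        return rados_append(io, oid, buf, len);

    /* zero-pad the chunk up to the next alignment boundary */
    size_t padded = ((len / align) + 1) * align;
    char *tmp = calloc(1, padded);
    if (!tmp)
        return -ENOMEM;
    memcpy(tmp, buf, len);
    int ret = rados_append(io, oid, tmp, padded);
    free(tmp);
    return ret;
}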





[ceph-users] debian jewel jessie packages missing from Packages file

2016-10-17 Thread Dan Milon
Hello,

I'm trying to install ceph jewel from the debian repository, but it
seems to be in a very weird state.
ceph, ceph-mon, ceph-osd exist in the pool, but the Packages file does
not have any of them.

https://download.ceph.com/debian-jewel/dists/jessie/main/binary-amd64/Packages

The other days I did the same thing and the repo was fine.
Am I doing something wrong, or the repo is indeed messed up?


Thank you,
Dan.





Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-17 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> William Josefsson
> Sent: 17 October 2016 10:39
> To: n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
> 
> hi nick, I earlier did cpupower frequency-set --cpu-governor performance on 
> all my hosts, which bumped all CPUs up to almost max
> speed or more.

Did you also set/check the C-states? This can have a large impact as well.

> 
> It didn't really help much, and I still experience 5-10ms latency in my fio 
> benchmarks in VMs with this job description.
> 
> Is there anything else I can do to force the SSDs to be used more? 

Not really; for small IOs you are limited by the end-to-end latency of the whole 
system. Each request has to be actioned before the next
can be sent. You probably get 100-200us of latency per network hop, plus 
Ceph introduces latency as it processes requests,
somewhere in the region of 500us to 1.5ms depending on CPU speed, and finally 
your SSDs probably take between 50-100us per write. 

So:

Client -> Net -> OSD1 -> SSD journal -> Net -> OSD2+3 -> SSD journals -> and then 
ACK back to the client

It all adds up, so you will never get the same speed as when testing against a 
locally attached SSD.

It might be worth running a single-threaded test to get an idea of best-case 
latency; at the least this will give you an idea of the best
you will be able to achieve. I would expect you to be able to get around ~1.5ms, 
or 600-700 IOPS, for a single-threaded test with your
hardware.
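
Something along these lines would do as the single-threaded run (same style of 
job file as yours, adjust the filename to your test volume):

[single-write]
bs=4k
rw=write
sync=1
direct=1
iodepth=1
numjobs=1
runtime=30
filename=/dev/vdb1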



> I know DIRECT SYNCED WRITE may not be the most common
> application case, however I need help to improve a worst case. Benchmarking 
> these ssd locally with fio and direct sync write, can
do
> 40-50k IOPS.  I'm not sure exactly what, but something is holding back the 
> max performance.
> I know the journals are sparely used from collectd graphs. appreciate any 
> advice. thx will
> 
> >> [global]
> >> bs=4k
> >> rw=write
> >> sync=1
> >> direct=1
> >> iodepth=1
> >> filename=/dev/vdb1
> >> runtime=30
> >> stonewall=1
> >> group_reporting
> 
> 
> grep "cpu MHz" /proc/cpuinfo
> cpu MHz : 2945.250
> cpu MHz : 2617.500
> cpu MHz : 3065.062
> cpu MHz : 2574.281
> cpu MHz : 2739.468
> cpu MHz : 2857.593
> cpu MHz : 2602.125
> cpu MHz : 2581.687
> cpu MHz : 2958.656
> cpu MHz : 2793.093
> cpu MHz : 2682.750
> cpu MHz : 2699.718
> cpu MHz : 2620.125
> cpu MHz : 2926.875
> cpu MHz : 2740.031
> cpu MHz : 2559.656
> cpu MHz : 2758.875
> cpu MHz : 2656.593
> cpu MHz : 1476.187
> cpu MHz : 2545.125
> cpu MHz : 2792.718
> cpu MHz : 2630.156
> cpu MHz : 3090.750
> cpu MHz : 2951.906
> cpu MHz : 2845.875
> cpu MHz : 2553.281
> cpu MHz : 2602.125
> cpu MHz : 2600.906
> cpu MHz : 2737.031
> cpu MHz : 2552.156
> cpu MHz : 2624.625
> cpu MHz : 2614.125
> 
> 
> 
> 
> On Mon, Oct 17, 2016 at 5:17 PM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of William Josefsson
> >> Sent: 17 October 2016 09:31
> >> To: Christian Balzer 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
> >>
> >> Thx Christian for helping troubleshooting the latency issues. I have 
> >> attached my fio job template below.
> >>
> >> I thought to eliminate the factor that the VM is the bottleneck, I've
> >> created a 128GB 32 cCPU flavor. Here's the latest fio
> > benchmark.
> >> http://pastebin.ca/raw/3729693   I'm trying to benchmark the clusters
> >> performance for SYNCED WRITEs and how well suited it would be for
> >> disk intensive workloads or DBs
> >>
> >>
> >> > The size (45GB) of these journals is only going to be used by a
> >> > little fraction, unlikely to be more than 1GB in normal operations
> >> > and with default filestore/journal parameters.
> >>
> >> To consume more of the SSDs in the hope to achieve lower latency, can
> >> you pls advice what parameters I should be looking at? I
> > have
> >> already tried to what's mentioned in RaySun's ceph blog, which eventually 
> >> lowered my overall sync write IOPs performance by 1-
> 2k.
> >
> > You biggest gains will probably be around forcing the CPU's to max 
> > frequency and forcing c-state to 1.
> >
> > intel_idle.max_cstate=0 on kernel parameters and echo 100 >
> > /sys/devices/system/cpu/intel_pstate/min_perf_pct ( I think this is
> > the same as performance governor)
> >
> > Use something like powertop to check that all cores are running at max
> > freq and are staying in cstate1
> >
> > I have managed to get the latency on my cluster down to about 600us,
> > but with your hardware I don't suspect you would be able to 

Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-17 Thread William Josefsson
hi nick, I earlier did cpupower frequency-set --cpu-governor
performance on all my hosts, which bumped all CPUs up to almost max
speed or more.

It didn't really help much, and I still experience 5-10ms latency in
my fio benchmarks in VMs with this job description.

Is there anything else I can do to force the SSDs to be used more? I
know direct synced writes may not be the most common application case,
however I need help to improve a worst case. Benchmarking these SSDs
locally with fio and direct sync writes, they can do 40-50k IOPS. I'm not
sure exactly what, but something is holding back the max performance.
I can see from the collectd graphs that the journals are barely used.
Appreciate any advice. Thx, Will

>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=/dev/vdb1
>> runtime=30
>> stonewall=1
>> group_reporting


grep "cpu MHz" /proc/cpuinfo
cpu MHz : 2945.250
cpu MHz : 2617.500
cpu MHz : 3065.062
cpu MHz : 2574.281
cpu MHz : 2739.468
cpu MHz : 2857.593
cpu MHz : 2602.125
cpu MHz : 2581.687
cpu MHz : 2958.656
cpu MHz : 2793.093
cpu MHz : 2682.750
cpu MHz : 2699.718
cpu MHz : 2620.125
cpu MHz : 2926.875
cpu MHz : 2740.031
cpu MHz : 2559.656
cpu MHz : 2758.875
cpu MHz : 2656.593
cpu MHz : 1476.187
cpu MHz : 2545.125
cpu MHz : 2792.718
cpu MHz : 2630.156
cpu MHz : 3090.750
cpu MHz : 2951.906
cpu MHz : 2845.875
cpu MHz : 2553.281
cpu MHz : 2602.125
cpu MHz : 2600.906
cpu MHz : 2737.031
cpu MHz : 2552.156
cpu MHz : 2624.625
cpu MHz : 2614.125




On Mon, Oct 17, 2016 at 5:17 PM, Nick Fisk  wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> William Josefsson
>> Sent: 17 October 2016 09:31
>> To: Christian Balzer 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
>>
>> Thx Christian for helping troubleshooting the latency issues. I have 
>> attached my fio job template below.
>>
>> I thought to eliminate the factor that the VM is the bottleneck, I've 
>> created a 128GB 32 cCPU flavor. Here's the latest fio
> benchmark.
>> http://pastebin.ca/raw/3729693   I'm trying to benchmark the clusters
>> performance for SYNCED WRITEs and how well suited it would be for disk 
>> intensive workloads or DBs
>>
>>
>> > The size (45GB) of these journals is only going to be used by a little
>> > fraction, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>>
>> To consume more of the SSDs in the hope to achieve lower latency, can you 
>> pls advice what parameters I should be looking at? I
> have
>> already tried to what's mentioned in RaySun's ceph blog, which eventually 
>> lowered my overall sync write IOPs performance by 1-2k.
>
> You biggest gains will probably be around forcing the CPU's to max frequency 
> and forcing c-state to 1.
>
> intel_idle.max_cstate=0 on kernel parameters
> and
> echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct ( I think this 
> is the same as performance governor)
>
> Use something like powertop to check that all cores are running at max freq 
> and are staying in cstate1
>
> I have managed to get the latency on my cluster down to about 600us, but with 
> your hardware I don't suspect you would be able to get
> it below ~1-1.5ms best case.
>
>>
>> # These are from RaySun's  write up, and worsen my total IOPs.
>> # 
>> http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>>
>> filestore xattr use omap = true
>> filestore min sync interval = 10
>> filestore max sync interval = 15
>> filestore queue max ops = 25000
>> filestore queue max bytes = 10485760
>> filestore queue committing max ops = 5000 filestore queue committing max 
>> bytes = 1048576 journal max write bytes =
>> 1073714824 journal max write entries = 1 journal queue max ops = 5 
>> journal queue max bytes = 1048576
>>
>> My Journals are Intel s3610 200GB, split in 4-5 partitions each. When I did 
>> FIO on the disks locally with direct=1 and sync=1 the
> WRITE
>> performance was 50k iops for 7 threads.
>>
>> My hardware specs:
>>
>> - 3 Controllers, The mons run here
>> Dell PE R630, 64GB, Intel SSD s3610
>> - 9 Storage nodes
>> Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
>> OSD: 18x1.8TB Hitachi 10krpm SAS
>>
>> RAID Controller is PERC 730
>>
>> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to Arista 
>> 7050X 10Gbit Switches with VARP, and LACP interfaces.
> I
>> have from my VM pinged all hosts and the RTT is 0.3ms on the LAN. I did 
>> iperf, and I can do 10Gbps from the VM to the storage
> nodes.
>>
>> I've already been tuning, CPU scaling governor to 'performance' 

Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-17 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> William Josefsson
> Sent: 17 October 2016 09:31
> To: Christian Balzer 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
> 
> Thx Christian for helping troubleshooting the latency issues. I have attached 
> my fio job template below.
> 
> I thought to eliminate the factor that the VM is the bottleneck, I've created 
> a 128GB 32 cCPU flavor. Here's the latest fio
benchmark.
> http://pastebin.ca/raw/3729693   I'm trying to benchmark the clusters
> performance for SYNCED WRITEs and how well suited it would be for disk 
> intensive workloads or DBs
> 
> 
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
> 
> To consume more of the SSDs in the hope to achieve lower latency, can you pls 
> advice what parameters I should be looking at? I
have
> already tried to what's mentioned in RaySun's ceph blog, which eventually 
> lowered my overall sync write IOPs performance by 1-2k.

Your biggest gains will probably be around forcing the CPUs to max frequency 
and forcing the C-state to 1.

intel_idle.max_cstate=0 on kernel parameters
and
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct ( I think this is 
the same as performance governor) 

Use something like powertop to check that all cores are running at max freq and 
are staying in cstate1

I have managed to get the latency on my cluster down to about 600us, but with 
your hardware I don't suspect you would be able to get
it below ~1-1.5ms best case.
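
On CentOS 7 that is roughly the following (from memory, double-check the paths
for your distro):

# /etc/default/grub: append intel_idle.max_cstate=0 to GRUB_CMDLINE_LINUX,
# then rebuild the grub config and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg

# pin the intel_pstate driver at max frequency
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

# verify that all cores run at max freq and stay in C-state 1
powertop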

> 
> # These are from RaySun's  write up, and worsen my total IOPs.
> # 
> http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
> 
> filestore xattr use omap = true
> filestore min sync interval = 10
> filestore max sync interval = 15
> filestore queue max ops = 25000
> filestore queue max bytes = 10485760
> filestore queue committing max ops = 5000 filestore queue committing max 
> bytes = 1048576 journal max write bytes =
> 1073714824 journal max write entries = 1 journal queue max ops = 5 
> journal queue max bytes = 1048576
> 
> My Journals are Intel s3610 200GB, split in 4-5 partitions each. When I did 
> FIO on the disks locally with direct=1 and sync=1 the
WRITE
> performance was 50k iops for 7 threads.
> 
> My hardware specs:
> 
> - 3 Controllers, The mons run here
> Dell PE R630, 64GB, Intel SSD s3610
> - 9 Storage nodes
> Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
> OSD: 18x1.8TB Hitachi 10krpm SAS
> 
> RAID Controller is PERC 730
> 
> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to Arista 
> 7050X 10Gbit Switches with VARP, and LACP interfaces.
I
> have from my VM pinged all hosts and the RTT is 0.3ms on the LAN. I did 
> iperf, and I can do 10Gbps from the VM to the storage
nodes.
> 
> I've already been tuning, CPU scaling governor to 'performance' on all hosts 
> for all cores. My CEPH release is latest hammer on
> CentOS7.
> 
> The best write currently happens at 62 threads it seems, the IOPS is 8.3k for 
> the direct synced writes. The latency and stddev are
still
> concerning.. :(
> 
> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17
> 15:20:05 2016
>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
> clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>  lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> clat percentiles (usec):
>  |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>  | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>  | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>  | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>  | 99.99th=[17792]
> bw (KB  /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
> lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>   cpu  : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  issued: total=r=0/w=250527/d=0, short=r=0/w=0/d=0
> 
> 
> From the above we can tell that the latency for clients doing synced writes, 
> is somewhere 5-10ms which seems very high, especially
> with quite high performing hardware, network, and SSD journals. I'm not sure 
> whether it may be the syncing from Journal to OSD
that
> causes these fluctuations or high latencies.
> 
> Any help or advice would be much appreciates. thx will
> 
> 
> [global]
> bs=4k
> rw=write
> sync=1
> direct=1
> iodepth=1
> filename=${FILE}
> runtime=30
> stonewall=1
> 

Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Wei Jin
On Mon, Oct 17, 2016 at 3:16 PM, Somnath Roy  wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed to 
> either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based)  is 
> stressed with large block size and very high QD. Lowering QD it is working 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] 
> v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping 
> message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
> subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly and 
> rebalancing started. This is hurting performance very badly.

I think you need to tune the threads' timeout values, as heartbeat messages
will be dropped during a timeout and suicide (the health check will fail).
That's why you observe the 'wrongly marked me down' message while the osd
process is still alive. See the function OSD::handle_osd_ping().

Also, you could backport this
PR (https://github.com/ceph/ceph/pull/8808) to accelerate the handling of
heartbeat messages.

After that, you may consider tuning the grace time.
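
As a starting point, the relevant options look something like this (the values
shown are just the defaults for illustration, tune them for your own workload):

[osd]
osd heartbeat grace = 20            # raise carefully, see below
osd heartbeat interval = 6
osd op thread timeout = 15          # per-thread health check timeout
osd op thread suicide timeout = 150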


>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
> 10-12Gb/s , no network error is reported. So, why this lossy connection 
> message is coming ? what could go wrong here ? Is it network prioritization 
> issue of smaller ping packets ? I tried to gaze ping round time during this 
> and nothing seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk 
> is left. So, I doubt my osds are unresponsive but yes it is really busy on IO 
> path. Heartbeat is going through separate messenger and threads as well, so, 
> busy op threads should not be making heartbeat delayed. Increasing osd 
> heartbeat grace is only delaying this phenomenon , but, eventually happens 
> after several hours. Anything else we can tune here ?
>
> 3. What could be the side effect of big grace period ? I understand that 
> detecting a faulty osd will be delayed, anything else ?
>
> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
> instantaneously and it is not waiting till this grace period. How it is 
> distinguishing between unresponsive and crashed osds ? In which scenario this 
> heartbeat grace is coming into picture ?
>
> Any help on clarifying this would be very helpful.
>
> Thanks & Regards
> Somnath


Re: [ceph-users] Ubuntu repo's broken

2016-10-17 Thread Jon Morby (Fido)
full output at https://pastebin.com/tH65tNQy


cephadmin@cephadmin:~$ cat /etc/apt/sources.list.d/ceph.list
deb https://download.ceph.com/debian-jewel/ xenial main


oh and fyi
[osd04][WARNIN] W: 
https://download.ceph.com/debian-jewel/dists/xenial/InRelease: Signature by key 
08B73419AC32B4E966C1A330E84AC2C0460F3994 uses weak digest algorithm (SHA1)
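
For what it's worth, the breakage is visible without ceph-deploy too; for 
example, against the xenial index I'm pointing at, this comes back empty even 
though the .deb files are present in the pool:

curl -s https://download.ceph.com/debian-jewel/dists/xenial/main/binary-amd64/Packages | grep -A3 '^Package: ceph-osd'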

- On 17 Oct, 2016, at 08:19, Wido den Hollander w...@42on.com wrote:

>> Op 16 oktober 2016 om 11:57 schreef "Jon Morby (FidoNet)" :
>> 
>> 
>> Morning
>> 
>> It’s been a few days now since the outage however we’re still unable to 
>> install
>> new nodes, it seems the repo’s are broken … and have been for at least 2 days
>> now (so not just a brief momentary issue caused by an update)
>> 
>> [osd04][WARNIN] E: Package 'ceph-osd' has no installation candidate
>> [osd04][WARNIN] E: Package 'ceph-mon' has no installation candidate
>> [osd04][ERROR ] RuntimeError: command returned non-zero exit status: 100
>> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
>> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes 
>> -q
>> --no-install-recommends install -o Dpkg::Options::=--force-confnew ceph-osd
>> ceph-mds ceph-mon radosgw
>> 
>> Is there any eta for when this might be fixed?
>> 
> 
> What is the line in your sources.list on your system?
> 
> Afaik the mirrors are working fine.
> 
> Wido
> 
>> —
>> Jon Morby
>> FidoNet - the internet made simple!
>> tel: 0345 004 3050 / fax: 0345 004 3051
>> twitter: @fido | skype://jmorby  | web: https://www.fido.net
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Jon Morby 
FidoNet - the internet made simple! 
10 - 16 Tiller Road, London, E14 8PX 
tel: 0345 004 3050 / fax: 0345 004 3051 

Need more rack space? 
Check out our Co-Lo offerings at http://www.fido.net/services/colo/ 32 amp 
racks in London and Brighton 
Linx ConneXions available at all Fido sites! 
https://www.fido.net/services/backbone/connexions/ 
PGP Key : 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - 
http://jonmorby.com/B3B5AD3A.asc


Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-17 Thread William Josefsson
Thanks Christian for helping troubleshoot the latency issues. I have
attached my fio job template below.

To eliminate the possibility that the VM is the bottleneck, I've
created a 128GB, 32-vCPU flavor. Here's the latest fio benchmark:
http://pastebin.ca/raw/3729693   I'm trying to benchmark the cluster's
performance for synced writes and how well suited it would be for
disk-intensive workloads or DBs.


> The size (45GB) of these journals is only going to be used by a little
> fraction, unlikely to be more than 1GB in normal operations and with
> default filestore/journal parameters.

To make more use of the SSDs in the hope of achieving lower latency, can
you please advise what parameters I should be looking at? I have already
tried what's mentioned in RaySun's Ceph blog, which actually
lowered my overall sync write IOPS by 1-2k.

# These are from RaySun's  write up, and worsen my total IOPs.
# 
http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/

filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 1048576
journal max write bytes = 1073714824
journal max write entries = 1
journal queue max ops = 5
journal queue max bytes = 1048576

My Journals are Intel s3610 200GB, split in 4-5 partitions each. When
I did FIO on the disks locally with direct=1 and sync=1 the WRITE
performance was 50k iops for 7 threads.

My hardware specs:

- 3 Controllers, The mons run here
Dell PE R630, 64GB, Intel SSD s3610
- 9 Storage nodes
Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
OSD: 18x1.8TB Hitachi 10krpm SAS

RAID Controller is PERC 730

All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
Arista 7050X 10Gbit Switches with VARP, and LACP interfaces. I have
from my VM pinged all hosts and the RTT is 0.3ms on the LAN. I did
iperf, and I can do 10Gbps from the VM to the storage nodes.

I've already been tuning, CPU scaling governor to 'performance' on all
hosts for all cores. My CEPH release is latest hammer on CentOS7.

The best write currently happens at 62 threads it seems, the IOPS is
8.3k for the direct synced writes. The latency and stddev are still
concerning.. :(

simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17
15:20:05 2016
  write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
 lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
clat percentiles (usec):
 |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
 | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
 | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
 | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
 | 99.99th=[17792]
bw (KB  /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
  cpu  : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=250527/d=0, short=r=0/w=0/d=0


From the above we can tell that the latency for clients doing synced
writes is somewhere around 5-10ms, which seems very high, especially with
quite high-performing hardware, network, and SSD journals. I'm not
sure whether it may be the syncing from journal to OSD that causes
these fluctuations or high latencies.

Any help or advice would be much appreciated. Thx, Will


[global]
bs=4k
rw=write
sync=1
direct=1
iodepth=1
filename=${FILE}
runtime=30
stonewall=1
group_reporting

[simple-write-6]
numjobs=6
[simple-write-10]
numjobs=10
[simple-write-14]
numjobs=14
[simple-write-18]
numjobs=18
[simple-write-22]
numjobs=22
[simple-write-26]
numjobs=26
[simple-write-30]
numjobs=30
[simple-write-34]
numjobs=34
[simple-write-38]
numjobs=38
[simple-write-42]
numjobs=42
[simple-write-46]
numjobs=46
[simple-write-50]
numjobs=50
[simple-write-54]
numjobs=54
[simple-write-58]
numjobs=58
[simple-write-62]
numjobs=62
[simple-write-66]
numjobs=66
[simple-write-70]
numjobs=70
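
For reference, the job file is run with the target device passed in through the
environment (fio expands ${FILE}), e.g. if it is saved as synced-write.fio:

FILE=/dev/vdb1 fio synced-write.fio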

On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer  wrote:
>
> Hello,
>
>
> On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>
>> Ok thanks for sharing. yes my journals are Intel S3610 200GB, which I
>> partition in 4 partitions each ~45GB. When I ceph-deploy I declare
>> these as the journals of the OSDs.
>>
> The size (45GB) of these journals is only going to be used by a little
> fraction, unlikely to be more than 1GB in normal operations and with
> default filestore/journal parameters.
>
> Because those 

Re: [ceph-users] Ubuntu repo's broken

2016-10-17 Thread Vy Nguyen Tan
Hello,


I have the same problem. I am using Debian 8.6 and ceph-deploy 1.5.36


Logs from ceph-deploy:

[*hv01*][*INFO*  ] Running command: env DEBIAN_FRONTEND=noninteractive
DEBIAN_PRIORITY=critical apt-get --assume-yes -q
--no-install-recommends install -o Dpkg::Options::=--force-confnew
ceph-osd ceph-mds ceph-mon radosgw

[*hv01*][*DEBUG* ] Reading package lists...

[*hv01*][*DEBUG* ] Building dependency tree...

[*hv01*][*DEBUG* ] Reading state information...

[*hv01*][*DEBUG* ] Package ceph-osd is not available, but is referred
to by another package.

[*hv01*][*DEBUG* ] This may mean that the package is missing, has been
obsoleted, or

[*hv01*][*DEBUG* ] is only available from another source

[*hv01*][*DEBUG* ]

[*hv01*][*WARNIN*] E: Package 'ceph-osd' has no installation candidate

[*hv01*][*WARNIN*] E: Unable to locate package ceph-mon

[*hv01*][*ERROR* ] RuntimeError: command returned non-zero exit status: 100

[*ceph_deploy*][*ERROR* ] RuntimeError: Failed to execute command: env
DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get
--assume-yes -q --no-install-recommends install -o
Dpkg::Options::=--force-confnew ceph-osd ceph-mds ceph-mon radosgw


> Op 16 oktober 2016 om 11:57 schreef "Jon Morby (FidoNet)" :
>
>
> Morning
>
> It’s been a few days now since the outage however we’re still unable to
> install new nodes, it seems the repo’s are broken … and have been for at
> least 2 days now (so not just a brief momentary issue caused by an update)
>
> [osd04][WARNIN] E: Package 'ceph-osd' has no installation candidate
> [osd04][WARNIN] E: Package 'ceph-mon' has no installation candidate
> [osd04][ERROR ] RuntimeError: command returned non-zero exit status: 100
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes
> -q --no-install-recommends install -o Dpkg::Options::=--force-confnew
> ceph-osd ceph-mds ceph-mon radosgw
>
> Is there any eta for when this might be fixed?
>

>> What is the line in your sources.list on your system?

>> Afaik the mirrors are working fine.

>> Wido

> —
> Jon Morby
> FidoNet - the internet made simple!
> tel: 0345 004 3050 / fax: 0345 004 3051
> twitter: @fido | skype://jmorby  | web: https://www.fido.net
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Pavan Rallabhandi
Regarding mon_osd_min_down_reports, I was looking at it recently; this commit 
could provide some insight: 
https://github.com/ceph/ceph/commit/0269a0c17723fd3e22738f7495fe017225b924a4

Thanks!

On 10/17/16, 1:36 PM, "ceph-users on behalf of Somnath Roy" 
 wrote:

Thanks Piotr, Wido for quick response.

@Wido , yes, I thought of trying with those values but I am seeing in the 
log messages at least 7 osds are reporting failure , so, didn't try. BTW, I 
found default mon_osd_min_down_reporters is 2 , not 1 and latest master is not 
having mon_osd_min_down_reports anymore. Not sure what it is replaced with..

@Piotr , yes, your PR really helps , thanks !  Regarding each messenger 
needs to respond to HB is confusing, I know each thread has a HB timeout value 
and beyond which it will crash with suicide timeout , are you talking about 
that ?

Regards
Somnath

-Original Message-
From: Piotr Dałek [mailto:bra...@predictor.org.pl]
Sent: Monday, October 17, 2016 12:52 AM
To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org
Subject: Re: OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed 
to either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based) 
 is stressed with large block size and very high QD. Lowering QD it is working 
just fine.
> We are seeing the lossy connection message like below and followed by the 
osd marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly 
and rebalancing started. This is hurting performance very badly.
>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
10-12Gb/s , no network error is reported. So, why this lossy connection message 
is coming ? what could go wrong here ? Is it network prioritization issue of 
smaller ping packets ? I tried to gaze ping round time during this and nothing 
seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of 
network/memory/cpu/disk is left. So, I doubt my osds are unresponsive but yes 
it is really busy on IO path. Heartbeat is going through separate messenger and 
threads as well, so, busy op threads should not be making heartbeat delayed. 
Increasing osd heartbeat grace is only delaying this phenomenon , but, 
eventually happens after several hours. Anything else we can tune here ?

There's a bunch of messengers in OSD code, if ANY of them doesn't respond 
to heartbeat messages in reasonable time, it is marked as down. Since packets 
are processed in FIFO/synchronous manner, overloading OSD with large I/O will 
cause it to time-out on at least one messenger.
There was an idea to have heartbeat messages go in the OOB TCP/IP stream 
and process them asynchronously, but I don't know if that went beyond the idea 
stage.

> 3. What could be the side effect of big grace period ? I understand that 
detecting a faulty osd will be delayed, anything else ?

Yes - stalled ops. Assume that primary OSD goes down and replicas are still 
alive. Having big grace period will cause all ops going to that OSD to stall 
until that particular OSD is marked down or resumes normal operation.

> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
instantaneously and it is not waiting till this grace period. How it is 
distinguishing between unresponsive and crashed osds ? In which scenario this 
heartbeat grace is coming into picture ?

This is the effect of my PR#8558 (https://github.com/ceph/ceph/pull/8558)
which causes any OSD that crash to be immediately marked as down, 
preventing stalled I/Os in most common cases. Grace period is only applied to 
unresponsive OSDs (i.e. temporary packet loss, bad cases of network lags, 
routing issues, in other words, everything that is known to be at least 
possible to resolve by itself in a finite amount of time). OSDs that crash and 
burn won't respond - instead, OS will respond with ECONNREFUSED indicating that 
OSD is not listening and in that case the OSD will be immediately marked down.

--
Piotr Dałek
bra...@predictor.org.pl
http://blog.predictor.org.pl
   

Re: [ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-17 Thread Christian Balzer

Hello,

On Mon, 17 Oct 2016 09:42:09 +0200 (CEST) i...@witeq.com wrote:

> Hi Wido, 
> 
> thanks for the explanation, generally speaking what is the best practice when 
> a couple of OSDs are reaching near-full capacity? 
> 
This has (of course) been discussed here many times.
Google is your friend (when it's not creepy).

> I could set their weight do something like 0.9 but this seems only a 
> temporary solution. 
> Of course i can add more OSDs, but this change radically my prospective in 
> terms of capacity planning, what would you do in production? 
>

Manually re-weighting (CRUSH weight, not reweight) is one approach; IMHO it's
better to give the least utilized OSDs a higher score than the other way
around, while keeping the per-node scores as equal as possible.

Doing the reweight-by-utilization dance, which in the latest Hammer
versions is much improved and has a dry-run option, is another approach.
I don't like it because it's a temporary setting, lost if the OSD is ever
set OUT.

Both of these can create a very even cluster, but may make imbalances
during OSD adds/removals worse than otherwise.

The larger your cluster gets, the less likely you'll run into extreme
outliers, but of course it's something to monitor (graph) anyway.
The smaller your cluster, the less painful it is to manually adjust things.

If you have only 1 or a few over-utilized OSDs and don't want to add more
OSDs, fiddle their weight.
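
For example (OSD id and threshold are made up, check utilization first):

ceph osd df                               # utilization per OSD
ceph osd crush reweight osd.13 0.90       # permanent CRUSH weight change
ceph osd test-reweight-by-utilization 110 # dry-run (recent Hammer/Jewel)
ceph osd reweight-by-utilization 110      # the temporary reweight mentioned above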

However, if one OSD is getting near-full I'd also take that as a hint to check
the numbers, i.e. what would happen if you were to lose an OSD (or two) or a
host; could Ceph survive this without everything getting full?

Christian
 
> Thanks 
> Giordano 
> 
> 
> From: "Wido den Hollander"  
> To: ceph-users@lists.ceph.com, i...@witeq.com 
> Sent: Monday, October 17, 2016 8:57:16 AM 
> Subject: Re: [ceph-users] Even data distribution across OSD - Impossible 
> Achievement? 
> 
> > Op 14 oktober 2016 om 19:13 schreef i...@witeq.com: 
> > 
> > 
> > Hi all, 
> > 
> > after encountering a warning about one of my OSDs running out of space i 
> > tried to study better how data distribution works. 
> > 
> 
> 100% perfect data distribution is not possible with straw. It is even very 
> hard to accomplish this with a deterministic algorithm. It's a trade-off 
> between balance and performance. 
> 
> You might want to read the original paper from Sage: 
> http://ceph.com/papers/weil-crush-sc06.pdf 
> 
> Another thing to look at is: 
> http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#crush-map-parameters
>  
> 
> With different algorithms like list and uniform you could do other things, 
> but use them carefully! I would say, read the PDF first. 
> 
> Wido 
> 
> > I'm running a Hammer Ceph cluster v. 0.94.7 
> > 
> > I did some test with crushtool trying to figure out how to achieve even 
> > data distribution across OSDs. 
> > 
> > Let's take this simple CRUSH MAP: 
> > 
> > # begin crush map 
> > tunable choose_local_tries 0 
> > tunable choose_local_fallback_tries 0 
> > tunable choose_total_tries 50 
> > tunable chooseleaf_descend_once 1 
> > tunable straw_calc_version 1 
> > tunable chooseleaf_vary_r 1 
> > 
> > # devices 
> > # ceph-osd-001 
> > device 0 osd.0 # sata-p 
> > device 1 osd.1 # sata-p 
> > device 3 osd.3 # sata-p 
> > device 4 osd.4 # sata-p 
> > device 5 osd.5 # sata-p 
> > device 7 osd.7 # sata-p 
> > device 9 osd.9 # sata-p 
> > device 10 osd.10 # sata-p 
> > device 11 osd.11 # sata-p 
> > device 13 osd.13 # sata-p 
> > # ceph-osd-002 
> > device 14 osd.14 # sata-p 
> > device 15 osd.15 # sata-p 
> > device 16 osd.16 # sata-p 
> > device 18 osd.18 # sata-p 
> > device 19 osd.19 # sata-p 
> > device 21 osd.21 # sata-p 
> > device 23 osd.23 # sata-p 
> > device 24 osd.24 # sata-p 
> > device 25 osd.25 # sata-p 
> > device 26 osd.26 # sata-p 
> > # ceph-osd-003 
> > device 28 osd.28 # sata-p 
> > device 29 osd.29 # sata-p 
> > device 30 osd.30 # sata-p 
> > device 31 osd.31 # sata-p 
> > device 32 osd.32 # sata-p 
> > device 33 osd.33 # sata-p 
> > device 34 osd.34 # sata-p 
> > device 35 osd.35 # sata-p 
> > device 36 osd.36 # sata-p 
> > device 41 osd.41 # sata-p 
> > # types 
> > type 0 osd 
> > type 1 server 
> > type 3 datacenter 
> > 
> > # buckets 
> > 
> > ### CEPH-OSD-003 ### 
> > server ceph-osd-003-sata-p { 
> > id -12 
> > alg straw 
> > hash 0 # rjenkins1 
> > item osd.28 weight 1.000 
> > item osd.29 weight 1.000 
> > item osd.30 weight 1.000 
> > item osd.31 weight 1.000 
> > item osd.32 weight 1.000 
> > item osd.33 weight 1.000 
> > item osd.34 weight 1.000 
> > item osd.35 weight 1.000 
> > item osd.36 weight 1.000 
> > item osd.41 weight 1.000 
> > } 
> > 
> > ### CEPH-OSD-002 ### 
> > server ceph-osd-002-sata-p { 
> > id -9 
> > alg straw 
> > hash 0 # rjenkins1 
> > item osd.14 weight 1.000 
> > item osd.15 weight 1.000 
> > item osd.16 weight 1.000 
> > item osd.18 weight 1.000 
> > item osd.19 weight 1.000 
> > item osd.21 weight 1.000 
> > item osd.23 weight 1.000 

Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Thanks Piotr, Wido for quick response.

@Wido, yes, I thought of trying those values, but I can see in the log 
messages that at least 7 OSDs are reporting the failure, so I didn't try. BTW, 
I found the default mon_osd_min_down_reporters is 2, not 1, and latest master 
no longer has mon_osd_min_down_reports. Not sure what it was replaced with.

@Piotr, yes, your PR really helps, thanks! The point that each messenger needs 
to respond to heartbeats is confusing to me; I know each thread has a heartbeat 
timeout value beyond which it will crash with a suicide timeout. Is that what 
you are talking about?

Regards
Somnath

-Original Message-
From: Piotr Dałek [mailto:bra...@predictor.org.pl]
Sent: Monday, October 17, 2016 12:52 AM
To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org
Subject: Re: OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed to 
> either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based)  is 
> stressed with large block size and very high QD. Lowering QD it is working 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
> subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly and 
> rebalancing started. This is hurting performance very badly.
>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
> 10-12Gb/s , no network error is reported. So, why this lossy connection 
> message is coming ? what could go wrong here ? Is it network prioritization 
> issue of smaller ping packets ? I tried to gaze ping round time during this 
> and nothing seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk 
> is left. So, I doubt my osds are unresponsive but yes it is really busy on IO 
> path. Heartbeat is going through separate messenger and threads as well, so, 
> busy op threads should not be making heartbeat delayed. Increasing osd 
> heartbeat grace is only delaying this phenomenon , but, eventually happens 
> after several hours. Anything else we can tune here ?

There's a bunch of messengers in OSD code, if ANY of them doesn't respond to 
heartbeat messages in reasonable time, it is marked as down. Since packets are 
processed in FIFO/synchronous manner, overloading OSD with large I/O will cause 
it to time-out on at least one messenger.
There was an idea to have heartbeat messages go in the OOB TCP/IP stream and 
process them asynchronously, but I don't know if that went beyond the idea 
stage.

> 3. What could be the side effect of big grace period ? I understand that 
> detecting a faulty osd will be delayed, anything else ?

Yes - stalled ops. Assume that primary OSD goes down and replicas are still 
alive. Having big grace period will cause all ops going to that OSD to stall 
until that particular OSD is marked down or resumes normal operation.

> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
> instantaneously and it is not waiting till this grace period. How it is 
> distinguishing between unresponsive and crashed osds ? In which scenario this 
> heartbeat grace is coming into picture ?

This is the effect of my PR#8558 (https://github.com/ceph/ceph/pull/8558)
which causes any OSD that crash to be immediately marked as down, preventing 
stalled I/Os in most common cases. Grace period is only applied to unresponsive 
OSDs (i.e. temporary packet loss, bad cases of network lags, routing issues, in 
other words, everything that is known to be at least possible to resolve by 
itself in a finite amount of time). OSDs that crash and burn won't respond - 
instead, OS will respond with ECONNREFUSED indicating that OSD is not listening 
and in that case the OSD will be immediately marked down.

--
Piotr Dałek
bra...@predictor.org.pl
http://blog.predictor.org.pl

Re: [ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-17 Thread info
Hi Wido, 

thanks for the explanation. Generally speaking, what is the best practice when a 
couple of OSDs are reaching near-full capacity? 

I could set their weight to something like 0.9, but this seems like only a 
temporary solution. 
Of course I can add more OSDs, but that radically changes my perspective in 
terms of capacity planning. What would you do in production? 

Thanks 
Giordano 


From: "Wido den Hollander"  
To: ceph-users@lists.ceph.com, i...@witeq.com 
Sent: Monday, October 17, 2016 8:57:16 AM 
Subject: Re: [ceph-users] Even data distribution across OSD - Impossible 
Achievement? 

> Op 14 oktober 2016 om 19:13 schreef i...@witeq.com: 
> 
> 
> Hi all, 
> 
> after encountering a warning about one of my OSDs running out of space i 
> tried to study better how data distribution works. 
> 

100% perfect data distribution is not possible with straw. It is even very hard 
to accomplish this with a deterministic algorithm. It's a trade-off between 
balance and performance. 

You might want to read the original paper from Sage: 
http://ceph.com/papers/weil-crush-sc06.pdf 

Another thing to look at is: 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#crush-map-parameters
 

With different algorithms like list and uniform you could do other things, but 
use them carefully! I would say, read the PDF first. 

Wido 

> I'm running a Hammer Ceph cluster v. 0.94.7 
> 
> I did some test with crushtool trying to figure out how to achieve even data 
> distribution across OSDs. 
> 
> Let's take this simple CRUSH MAP: 
> 
> # begin crush map 
> tunable choose_local_tries 0 
> tunable choose_local_fallback_tries 0 
> tunable choose_total_tries 50 
> tunable chooseleaf_descend_once 1 
> tunable straw_calc_version 1 
> tunable chooseleaf_vary_r 1 
> 
> # devices 
> # ceph-osd-001 
> device 0 osd.0 # sata-p 
> device 1 osd.1 # sata-p 
> device 3 osd.3 # sata-p 
> device 4 osd.4 # sata-p 
> device 5 osd.5 # sata-p 
> device 7 osd.7 # sata-p 
> device 9 osd.9 # sata-p 
> device 10 osd.10 # sata-p 
> device 11 osd.11 # sata-p 
> device 13 osd.13 # sata-p 
> # ceph-osd-002 
> device 14 osd.14 # sata-p 
> device 15 osd.15 # sata-p 
> device 16 osd.16 # sata-p 
> device 18 osd.18 # sata-p 
> device 19 osd.19 # sata-p 
> device 21 osd.21 # sata-p 
> device 23 osd.23 # sata-p 
> device 24 osd.24 # sata-p 
> device 25 osd.25 # sata-p 
> device 26 osd.26 # sata-p 
> # ceph-osd-003 
> device 28 osd.28 # sata-p 
> device 29 osd.29 # sata-p 
> device 30 osd.30 # sata-p 
> device 31 osd.31 # sata-p 
> device 32 osd.32 # sata-p 
> device 33 osd.33 # sata-p 
> device 34 osd.34 # sata-p 
> device 35 osd.35 # sata-p 
> device 36 osd.36 # sata-p 
> device 41 osd.41 # sata-p 
> # types 
> type 0 osd 
> type 1 server 
> type 3 datacenter 
> 
> # buckets 
> 
> ### CEPH-OSD-003 ### 
> server ceph-osd-003-sata-p { 
> id -12 
> alg straw 
> hash 0 # rjenkins1 
> item osd.28 weight 1.000 
> item osd.29 weight 1.000 
> item osd.30 weight 1.000 
> item osd.31 weight 1.000 
> item osd.32 weight 1.000 
> item osd.33 weight 1.000 
> item osd.34 weight 1.000 
> item osd.35 weight 1.000 
> item osd.36 weight 1.000 
> item osd.41 weight 1.000 
> } 
> 
> ### CEPH-OSD-002 ### 
> server ceph-osd-002-sata-p { 
> id -9 
> alg straw 
> hash 0 # rjenkins1 
> item osd.14 weight 1.000 
> item osd.15 weight 1.000 
> item osd.16 weight 1.000 
> item osd.18 weight 1.000 
> item osd.19 weight 1.000 
> item osd.21 weight 1.000 
> item osd.23 weight 1.000 
> item osd.24 weight 1.000 
> item osd.25 weight 1.000 
> item osd.26 weight 1.000 
> } 
> 
> ### CEPH-OSD-001 ### 
> server ceph-osd-001-sata-p { 
> id -5 
> alg straw 
> hash 0 # rjenkins1 
> item osd.0 weight 1.000 
> item osd.1 weight 1.000 
> item osd.3 weight 1.000 
> item osd.4 weight 1.000 
> item osd.5 weight 1.000 
> item osd.7 weight 1.000 
> item osd.9 weight 1.000 
> item osd.10 weight 1.000 
> item osd.11 weight 1.000 
> item osd.13 weight 1.000 
> } 
> 
> # DATACENTER 
> datacenter dc1 { 
> id -1 
> alg straw 
> hash 0 # rjenkins1 
> item ceph-osd-001-sata-p weight 10.000 
> item ceph-osd-002-sata-p weight 10.000 
> item ceph-osd-003-sata-p weight 10.000 
> } 
> 
> # rules 
> rule sata-p { 
> ruleset 0 
> type replicated 
> min_size 2 
> max_size 10 
> step take dc1 
> step chooseleaf firstn 0 type server 
> step emit 
> } 
> 
> # end crush map 
> 
> 
> Basically it's 30 OSDs spanned across 3 servers. One rule exists, the classic 
> replica-3 
> 
> 
> cephadm@cephadm01:/etc/ceph/$ crushtool -i crushprova.c --test 
> --show-utilization --num-rep 3 --tree --max-x 1 
> 
> ID WEIGHT TYPE NAME 
> -1 30.0 datacenter milano1 
> -5 10.0 server ceph-osd-001-sata-p 
> 0 1.0 osd.0 
> 1 1.0 osd.1 
> 3 1.0 osd.3 
> 4 1.0 osd.4 
> 5 1.0 osd.5 
> 7 1.0 osd.7 
> 9 1.0 osd.9 
> 10 1.0 osd.10 
> 11 1.0 osd.11 
> 13 1.0 osd.13 
> -9 10.0 server ceph-osd-002-sata-p 
> 14 1.0 osd.14 
> 15 1.0 osd.15 
> 16 1.0 osd.16 
> 18 1.0 

Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Wido den Hollander

> On 17 October 2016 at 9:16, Somnath Roy wrote:
> 
> 
> Hi Sage et al.,
> 
> I know this issue has been reported a number of times in the community and 
> attributed to either network issues or unresponsive OSDs.
> Recently, we are seeing this issue when our all-SSD cluster (Jewel based) is 
> stressed with large block sizes and a very high QD. With a lower QD it works 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
> 
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] 
> v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping 
> message
> 
> In the monitor log, I can see the OSD being reported down by its peers and 
> the monitor subsequently marking it down.
> The OSD then rejoins the cluster after detecting it was marked down wrongly, 
> and rebalancing starts. This is hurting performance very badly.
> 
> My question is the following.
> 
> 1. I have a 40Gb network and I am seeing it is not utilized beyond 
> 10-12Gb/s, and no network errors are reported. So why is this lossy 
> connection message appearing? What could go wrong here? Is it a network 
> prioritization issue with the smaller ping packets? I watched the ping 
> round-trip times during this and nothing seemed abnormal.
> 
> 2. Nothing is saturated on the OSD side; plenty of network/memory/cpu/disk 
> is left. So I doubt my OSDs are unresponsive, though they are really busy on 
> the IO path. Heartbeats go through a separate messenger and threads as well, 
> so busy op threads should not be delaying the heartbeats. Increasing the osd 
> heartbeat grace only delays this phenomenon, which eventually happens after 
> several hours. Anything else we can tune here?
> 
> 3. What could be the side effects of a big grace period? I understand that 
> detecting a faulty OSD will be delayed; anything else?
> 

You might want to look at:

OPTION(mon_osd_min_down_reporters, OPT_INT, 1)   // number of OSDs who need to 
report a down OSD for it to count
OPTION(mon_osd_min_down_reports, OPT_INT, 3) // number of times a down OSD 
must be reported for it to count

Setting 'mon_osd_min_down_reporters' to 3 means that 3 individual OSDs have to 
mark an OSD as down. You could also increase the number of reports.

In larger environments I always set reporters to 3 or 5, just to prevent such 
flapping.
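
A minimal sketch of how that could look (3 is just an example value; pick what
fits your failure domains):

  # ceph.conf, [mon] section
  mon osd min down reporters = 3

  # or inject it at runtime, no restart needed
  ceph tell mon.* injectargs '--mon-osd-min-down-reporters 3'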

> 4. I saw that if an OSD crashes, the monitor detects the down OSD almost 
> instantaneously and does not wait for this grace period. How does it 
> distinguish between unresponsive and crashed OSDs? In which scenario does 
> this heartbeat grace come into the picture?
> 

A crashed OSD will not be detected by the MON itself. It is the other OSDs that 
inform the monitor about this OSD crashing, but you will have to wait for the 
heartbeats to time out.

Only when an OSD gracefully shuts down will it mark itself down instantly.
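
A quick way to see the difference (a sketch, assuming osd.0 on a
systemd-managed Jewel node; don't do this on a busy production cluster):

  # graceful stop: the OSD marks itself down, the map updates right away
  systemctl stop ceph-osd@0

  # hard kill: peers have to report it, so the monitor only marks it down
  # once the heartbeat grace expires
  kill -9 $(pidof ceph-osd)   # on a host running a single OSD

  # watch the cluster react in either case
  ceph -w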

Wido

> Any help on clarifying this would be very helpful.
> 
> Thanks & Regards
> Somnath
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu repo's broken

2016-10-17 Thread Wido den Hollander

> On 16 October 2016 at 11:57, "Jon Morby (FidoNet)" wrote:
> 
> 
> Morning
> 
> It’s been a few days now since the outage, however we’re still unable to 
> install new nodes; it seems the repos are broken … and have been for at 
> least 2 days now (so not just a brief momentary issue caused by an update).
> 
> [osd04][WARNIN] E: Package 'ceph-osd' has no installation candidate
> [osd04][WARNIN] E: Package 'ceph-mon' has no installation candidate
> [osd04][ERROR ] RuntimeError: command returned non-zero exit status: 100
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env 
> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes 
> -q --no-install-recommends install -o Dpkg::Options::=--force-confnew 
> ceph-osd ceph-mds ceph-mon radosgw
> 
> Is there any ETA for when this might be fixed?
> 

What is the line in your sources.list on your system?

Afaik the mirrors are working fine.
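
For comparison, a working entry usually looks like this (a sketch, assuming
Jewel on Ubuntu 16.04/xenial):

  # /etc/apt/sources.list.d/ceph.list
  deb https://download.ceph.com/debian-jewel/ xenial main

  # then refresh and check what apt actually sees
  apt-get update
  apt-cache policy ceph-osd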

Wido

> — 
> Jon Morby
> FidoNet - the internet made simple!
> tel: 0345 004 3050 / fax: 0345 004 3051
> twitter: @fido | skype://jmorby  | web: https://www.fido.net
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Hi Sage et al.,

I know this issue has been reported a number of times in the community and 
attributed to either network issues or unresponsive OSDs.
Recently, we are seeing this issue when our all-SSD cluster (Jewel based) is 
stressed with large block sizes and a very high QD. With a lower QD it works 
just fine.
We are seeing the lossy connection message like below and followed by the osd 
marked down by monitor.

2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633 
[set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144] 
v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping 
message

In the monitor log, I can see the OSD being reported down by its peers and the 
monitor subsequently marking it down.
The OSD then rejoins the cluster after detecting it was marked down wrongly, and 
rebalancing starts. This is hurting performance very badly.

My question is the following.

1. I have a 40Gb network and I am seeing it is not utilized beyond 10-12Gb/s, 
and no network errors are reported. So why is this lossy connection message 
appearing? What could go wrong here? Is it a network prioritization issue with 
the smaller ping packets? I watched the ping round-trip times during this and 
nothing seemed abnormal.

2. Nothing is saturated on the OSD side; plenty of network/memory/cpu/disk is 
left. So I doubt my OSDs are unresponsive, though they are really busy on the IO 
path. Heartbeats go through a separate messenger and threads as well, so busy op 
threads should not be delaying the heartbeats. Increasing the osd heartbeat 
grace only delays this phenomenon, which eventually happens after several 
hours. Anything else we can tune here?

3. What could be the side effects of a big grace period? I understand that 
detecting a faulty OSD will be delayed; anything else?

4. I saw that if an OSD crashes, the monitor detects the down OSD almost 
instantaneously and does not wait for this grace period. How does it 
distinguish between unresponsive and crashed OSDs? In which scenario does this 
heartbeat grace come into the picture?

Any help on clarifying this would be very helpful.

Thanks & Regards
Somnath
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Even data distribution across OSD - Impossible Achievement?

2016-10-17 Thread Wido den Hollander

> On 14 October 2016 at 19:13, i...@witeq.com wrote:
> 
> 
> Hi all, 
> 
> After encountering a warning about one of my OSDs running out of space, I 
> tried to get a better understanding of how data distribution works. 
> 

100% perfect data distribution is not possible with straw. It is even very hard 
to accomplish this with a deterministic algorithm. It's a trade-off between 
balance and performance.

You might want to read the original paper from Sage: 
http://ceph.com/papers/weil-crush-sc06.pdf

Another thing to look at is: 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#crush-map-parameters

With different algorithms like list and uniform you could do other things, but 
use them carefully! I would say, read the PDF first.
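
Apart from the bucket algorithms, on Hammer you can also nudge an uneven
cluster back into balance by reweighting; a minimal sketch (osd.13 is used
purely as an example of a near-full OSD):

  # show per-OSD utilization
  ceph osd df

  # lower the CRUSH weight of the overfull OSD a little (this moves data)
  ceph osd crush reweight osd.13 0.95

  # or let Ceph adjust the override weights of the worst outliers,
  # here anything above 110% of the average utilization
  ceph osd reweight-by-utilization 110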

Wido

> I'm running a Hammer Ceph cluster v. 0.94.7 
> 
> I did some test with crushtool trying to figure out how to achieve even data 
> distribution across OSDs. 
> 
> Let's take this simple CRUSH MAP: 
> 
> # begin crush map 
> tunable choose_local_tries 0 
> tunable choose_local_fallback_tries 0 
> tunable choose_total_tries 50 
> tunable chooseleaf_descend_once 1 
> tunable straw_calc_version 1 
> tunable chooseleaf_vary_r 1 
> 
> # devices 
> # ceph-osd-001 
> device 0 osd.0 # sata-p 
> device 1 osd.1 # sata-p 
> device 3 osd.3 # sata-p 
> device 4 osd.4 # sata-p 
> device 5 osd.5 # sata-p 
> device 7 osd.7 # sata-p 
> device 9 osd.9 # sata-p 
> device 10 osd.10 # sata-p 
> device 11 osd.11 # sata-p 
> device 13 osd.13 # sata-p 
> # ceph-osd-002 
> device 14 osd.14 # sata-p 
> device 15 osd.15 # sata-p 
> device 16 osd.16 # sata-p 
> device 18 osd.18 # sata-p 
> device 19 osd.19 # sata-p 
> device 21 osd.21 # sata-p 
> device 23 osd.23 # sata-p 
> device 24 osd.24 # sata-p 
> device 25 osd.25 # sata-p 
> device 26 osd.26 # sata-p 
> # ceph-osd-003 
> device 28 osd.28 # sata-p 
> device 29 osd.29 # sata-p 
> device 30 osd.30 # sata-p 
> device 31 osd.31 # sata-p 
> device 32 osd.32 # sata-p 
> device 33 osd.33 # sata-p 
> device 34 osd.34 # sata-p 
> device 35 osd.35 # sata-p 
> device 36 osd.36 # sata-p 
> device 41 osd.41 # sata-p 
> # types 
> type 0 osd 
> type 1 server 
> type 3 datacenter 
> 
> # buckets 
> 
> ### CEPH-OSD-003 ### 
> server ceph-osd-003-sata-p { 
> id -12 
> alg straw 
> hash 0 # rjenkins1 
> item osd.28 weight 1.000 
> item osd.29 weight 1.000 
> item osd.30 weight 1.000 
> item osd.31 weight 1.000 
> item osd.32 weight 1.000 
> item osd.33 weight 1.000 
> item osd.34 weight 1.000 
> item osd.35 weight 1.000 
> item osd.36 weight 1.000 
> item osd.41 weight 1.000 
> } 
> 
> ### CEPH-OSD-002 ### 
> server ceph-osd-002-sata-p { 
> id -9 
> alg straw 
> hash 0 # rjenkins1 
> item osd.14 weight 1.000 
> item osd.15 weight 1.000 
> item osd.16 weight 1.000 
> item osd.18 weight 1.000 
> item osd.19 weight 1.000 
> item osd.21 weight 1.000 
> item osd.23 weight 1.000 
> item osd.24 weight 1.000 
> item osd.25 weight 1.000 
> item osd.26 weight 1.000 
> } 
> 
> ### CEPH-OSD-001 ### 
> server ceph-osd-001-sata-p { 
> id -5 
> alg straw 
> hash 0 # rjenkins1 
> item osd.0 weight 1.000 
> item osd.1 weight 1.000 
> item osd.3 weight 1.000 
> item osd.4 weight 1.000 
> item osd.5 weight 1.000 
> item osd.7 weight 1.000 
> item osd.9 weight 1.000 
> item osd.10 weight 1.000 
> item osd.11 weight 1.000 
> item osd.13 weight 1.000 
> } 
> 
> # DATACENTER 
> datacenter dc1 { 
> id -1 
> alg straw 
> hash 0 # rjenkins1 
> item ceph-osd-001-sata-p weight 10.000 
> item ceph-osd-002-sata-p weight 10.000 
> item ceph-osd-003-sata-p weight 10.000 
> } 
> 
> # rules 
> rule sata-p { 
> ruleset 0 
> type replicated 
> min_size 2 
> max_size 10 
> step take dc1 
> step chooseleaf firstn 0 type server 
> step emit 
> } 
> 
> # end crush map 
> 
> 
> Basically it's 30 OSDs spread across 3 servers. One rule exists: the classic 
> replica-3. 
> 
> 
> cephadm@cephadm01:/etc/ceph/$ crushtool -i crushprova.c --test 
> --show-utilization --num-rep 3 --tree --max-x 1 
> 
> ID WEIGHT TYPE NAME 
> -1 30.0 datacenter milano1 
> -5 10.0 server ceph-osd-001-sata-p 
> 0 1.0 osd.0 
> 1 1.0 osd.1 
> 3 1.0 osd.3 
> 4 1.0 osd.4 
> 5 1.0 osd.5 
> 7 1.0 osd.7 
> 9 1.0 osd.9 
> 10 1.0 osd.10 
> 11 1.0 osd.11 
> 13 1.0 osd.13 
> -9 10.0 server ceph-osd-002-sata-p 
> 14 1.0 osd.14 
> 15 1.0 osd.15 
> 16 1.0 osd.16 
> 18 1.0 osd.18 
> 19 1.0 osd.19 
> 21 1.0 osd.21 
> 23 1.0 osd.23 
> 24 1.0 osd.24 
> 25 1.0 osd.25 
> 26 1.0 osd.26 
> -12 10.0 server ceph-osd-003-sata-p 
> 28 1.0 osd.28 
> 29 1.0 osd.29 
> 30 1.0 osd.30 
> 31 1.0 osd.31 
> 32 1.0 osd.32 
> 33 1.0 osd.33 
> 34 1.0 osd.34 
> 35 1.0 osd.35 
> 36 1.0 osd.36 
> 41 1.0 osd.41 
> 
> rule 0 (sata-performance), x = 0..1023, numrep = 3..3 
> rule 0 (sata-performance) num_rep 3 result size == 3: 1024/1024 
> device 0: stored : 95 expected : 102.49 
> device 1: stored : 95 expected : 102.49 
> device 3: 

Re: [ceph-users] Does marking OSD "down" trigger "AdvMap" event in other OSD?

2016-10-17 Thread Wido den Hollander

> On 17 October 2016 at 6:37, xxhdx1985126 wrote:
> 
> 
> Hi, everyone.
> 
> 
> If one OSD's state changes from up to down, by "kill -i" for example, will 
> an "AdvMap" event be triggered on the other related 
> OSDs?

IIRC it will. A down OSD will trigger a new OSDMap epoch, so all daemons have to 
advance the map by one epoch.
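
You can watch that happen on a test cluster (a sketch; marking an OSD down by
hand triggers some peering, so avoid doing this on a busy production cluster):

  # note the current osdmap epoch, printed as "eNNN"
  ceph osd stat

  # mark one OSD down (or kill its process) and check again
  ceph osd down 0
  ceph osd stat   # the epoch has advanced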

Wido

> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com