[ceph-users] Deep scrub distribution

2017-07-05 Thread Adrian Saul

During a recent snafu with a production cluster I disabled scrubbing and deep 
scrubbing in order to reduce load on the cluster while things backfilled and 
settled down.  The PTSD caused by the incident meant I was not keen to 
re-enable it until I was confident we had fixed the root cause of the issues 
(driver issues with a new NIC type introduced with new hardware that did not 
show up until production load hit them).   My cluster is using Jewel 10.2.1, 
and is a mix of SSD and SATA over 20 hosts, 352 OSDs in total.

Fast forward a few weeks and I was ready to re-enable it.  On some reading I 
was concerned the cluster might kick off excessive scrubbing once I unset the 
flags, so I tried increasing the deep scrub interval from 7 days to 60 days - 
with most of the last deep scrubs being from over a month before I was hoping 
it would distribute them over the next 30 days.  Having unset the flag and 
carefully watched the cluster, it seems to have just run a steady catch-up 
without significant impact.  What I am noticing, though, is that the scrubbing 
seems to just run through the full set of PGs: it did some 2280 PGs last night 
over 6 hours, and so far today another 4000-odd in 12 hours.  With 13408 PGs, 
I am guessing that all this will stop some time early tomorrow.
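
(For reference, that interval is the osd_deep_scrub_interval option, in 
seconds, so 60 days is roughly 5184000.  A runtime change along these lines 
should do it, with the same value kept in ceph.conf so it survives restarts:

ceph tell osd.* injectargs '--osd_deep_scrub_interval 5184000')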

ceph-glb-fec-01[/var/log]$ sudo ceph pg dump | awk '{print $20}' | grep 2017 | sort | uniq -c
dumped all in format plain
      5 2017-05-23
     18 2017-05-24
     33 2017-05-25
     52 2017-05-26
     89 2017-05-27
    114 2017-05-28
    144 2017-05-29
    172 2017-05-30
    256 2017-05-31
    191 2017-06-01
    230 2017-06-02
    369 2017-06-03
    606 2017-06-04
    680 2017-06-05
    919 2017-06-06
   1261 2017-06-07
   1876 2017-06-08
     15 2017-06-09
   2280 2017-07-05
   4098 2017-07-06

My concern is whether I am now set up to have all 13408 PGs deep scrub again 
in 60 days' time, serially over about 3 days.  I would much rather they be 
distributed over that period.

Will the OSDs spread this out themselves now that they have caught up, or do I 
need to, say, create a script that triggers batches of PGs to deep scrub over 
time to push out the distribution again?
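
Something like this rough, untested sketch is what I had in mind - it assumes 
column 20 of "ceph pg dump" is still the last deep-scrub stamp, as in the 
output earlier in this mail, and kicks the oldest few hundred PGs per run 
from cron:

BATCH=250
ceph pg dump pgs 2>/dev/null | awk '$20 ~ /^20/ {print $1, $20}' \
  | sort -k2 | head -n $BATCH \
  | while read pg stamp; do ceph pg deep-scrub $pg; done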





Adrian Saul | Infrastructure Projects Team Lead
IT
T 02 9009 9041 | M +61 402 075 760
30 Ross St, Glebe NSW 2037
adrian.s...@tpgtelecom.com.au | 
www.tpg.com.au

TPG Telecom (ASX: TPM)




This email and any attachments are confidential and may be subject to 
copyright, legal or some other professional privilege. They are intended solely 
for the attention and use of the named addressee(s). They may only be copied, 
distributed or disclosed with the consent of the copyright owner. If you have 
received this email by mistake or by breach of the confidentiality clause, 
please notify the sender immediately by return email and delete or destroy all 
copies of the email. Any confidentiality, privilege or copyright is not waived 
or lost because this email has been sent to you by mistake.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CDM APAC

2017-07-05 Thread Patrick McGarry
Hey cephers,

While tonight was my last CDM, I am considering recommending to my
replacement that we stop alternating time zones and just
standardize on a NA/EMEA time slot. Attendance is typically much
better (even amongst APAC developers) during the NA/EMEA times.

So, if you want us to continue alternating times please speak up and
make yourself known. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Maged Mokhtar
On 2017-07-05 23:22, David Clarke wrote:

> On 07/05/2017 08:54 PM, Massimiliano Cuttini wrote: 
> 
>> Dear all,
>> 
>> luminous is coming and sooner we should be allowed to avoid double writing.
>> This means use 100% of the speed of SSD and NVMe.
>> Cluster made all of SSD and NVMe will not be penalized and start to make
>> sense.
>> 
>> Looking forward I'm building the next pool of storage which we'll setup
>> on next term.
>> We are taking in consideration a pool of 4 with the following single
>> node configuration:
>> 
>> * 2x E5-2603 v4 - 6 cores - 1.70GHz
>> * 2x 32Gb of RAM
>> * 2x NVMe M2 for OS
>> * 6x NVMe U2 for OSD
>> * 2x 100Gib ethernet cards
>> 
>> We have yet not sure about which Intel and how much RAM we should put on
>> it to avoid CPU bottleneck.
>> Can you help me to choose the right couple of CPU?
>> Did you see any issue on the configuration proposed?
> 
> There are notes on ceph.com regarding flash, and NVMe in particular,
> deployments:
> 
> http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This is a nice link, but the Ceph configuration is a bit dated: it was
done for Hammer, and a couple of the config params were dropped in Jewel. I
hope Intel publishes some updated settings for Luminous/Bluestore!

In addition to tuning ceph.conf, sysctl and udev, it is important to run
stress benchmarks such as rados bench / rbd bench and measure the system
load via atop/collectl/sysstat. This will tell you where your bottlenecks
are. If you plan to do many tests, you may find CBT (the Ceph Benchmarking
Tool) handy, as you can script incremental tests.
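
For example (pool and image names below are just placeholders):

rados bench -p testpool 60 write -t 16 --no-cleanup
rados bench -p testpool 60 rand -t 16
rbd bench-write testpool/testimage --io-size 4096 --io-threads 16 --io-pattern rand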
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to force "rbd unmap"

2017-07-05 Thread Maged Mokhtar
On 2017-07-05 20:42, Ilya Dryomov wrote:

> On Wed, Jul 5, 2017 at 8:32 PM, David Turner  wrote: 
> 
>> I had this problem occasionally in a cluster where we were regularly mapping
>> RBDs with KRBD.  Something else we saw was that after this happened for
>> un-mapping RBDs, was that it would start preventing mapping some RBDs as
>> well.  We were able to use strace and kill the sub-thread that was stuck to
>> allow the RBD to finish un-mapping, but as it turned out, the server would
>> just continue to hang on KRBD functions until the server was restarted.  The
> 
> Did you ever report this?
> 
> Thanks,
> 
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This may or may not be related, but if you do a lot of mapping/unmapping
it may be better to load the rbd module with:

modprobe rbd single_major=Y

Load it before running targetcli or rtslib, as they load it without this
param.
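
To make the option persistent regardless of which tool loads the module
first, an options line in modprobe.d should work (the path below is just
the usual convention):

echo "options rbd single_major=Y" > /etc/modprobe.d/rbd.conf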

/Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread David Clarke
On 07/05/2017 08:54 PM, Massimiliano Cuttini wrote:
> Dear all,
> 
> luminous is coming and sooner we should be allowed to avoid double writing.
> This means use 100% of the speed of SSD and NVMe.
> Cluster made all of SSD and NVMe will not be penalized and start to make
> sense.
> 
> Looking forward I'm building the next pool of storage which we'll setup
> on next term.
> We are taking in consideration a pool of 4 with the following single
> node configuration:
> 
>   * 2x E5-2603 v4 - 6 cores - 1.70GHz
>   * 2x 32Gb of RAM
>   * 2x NVMe M2 for OS
>   * 6x NVMe U2 for OSD
>   * 2x 100Gib ethernet cards
> 
> We have yet not sure about which Intel and how much RAM we should put on
> it to avoid CPU bottleneck.
> Can you help me to choose the right couple of CPU?
> Did you see any issue on the configuration proposed?

There are notes on ceph.com regarding flash, and NVMe in particular,
deployments:

http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments


-- 
David Clarke
Systems Architect
Catalyst IT



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to force "rbd unmap"

2017-07-05 Thread Ilya Dryomov
On Wed, Jul 5, 2017 at 8:32 PM, David Turner  wrote:
> I had this problem occasionally in a cluster where we were regularly mapping
> RBDs with KRBD.  Something else we saw was that after this happened for
> un-mapping RBDs, was that it would start preventing mapping some RBDs as
> well.  We were able to use strace and kill the sub-thread that was stuck to
> allow the RBD to finish un-mapping, but as it turned out, the server would
> just continue to hang on KRBD functions until the server was restarted.  The

Did you ever report this?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to force "rbd unmap"

2017-07-05 Thread David Turner
I had this problem occasionally in a cluster where we were regularly
mapping RBDs with KRBD.  Something else we saw was that after this happened
when un-mapping RBDs, it would start preventing mapping some RBDs as well.
We were able to use strace and kill the sub-thread that was stuck to allow
the RBD to finish un-mapping, but as it turned out, the server would just
continue to hang on KRBD functions until it was restarted.  The kernel just
seemed to be in a bad state, and the 6 different kernels we tried were all
affected the same way.

We switched to using rbd-fuse to get away from KRBD problems like this, but
rbd-fuse came with its own problems.

On Wed, Jul 5, 2017 at 1:56 PM Stanislav Kopp  wrote:

> Hello,
>
> I have problem that sometimes I can't unmap rbd device, I get "sysfs
> write failed rbd: unmap failed: (16) Device or resource busy", there
> is no open files and "holders" directory is empty. I saw on the
> mailling list that you can "force" unmapping the device, but I cant
> find how does it work. "man rbd" only mentions "force" as "KERNEL RBD
> (KRBD) OPTION", but "modinfo rbd" doesn't show this option. Did I miss
> something?
>
> As client where rbd is mapped I use Debian stretch with kernel 4.9,
> ceph cluster is on version 11.2.
>
> Thanks,
> Stan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to force "rbd unmap"

2017-07-05 Thread Ilya Dryomov
On Wed, Jul 5, 2017 at 7:55 PM, Stanislav Kopp  wrote:
> Hello,
>
> I have problem that sometimes I can't unmap rbd device, I get "sysfs
> write failed rbd: unmap failed: (16) Device or resource busy", there
> is no open files and "holders" directory is empty. I saw on the
> mailling list that you can "force" unmapping the device, but I cant
> find how does it work. "man rbd" only mentions "force" as "KERNEL RBD
> (KRBD) OPTION", but "modinfo rbd" doesn't show this option. Did I miss
> something?

Forcing unmap on an open device is not a good idea.  I'd suggest
looking into what's holding the device and fixing that instead.
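
A few quick ways to check what is holding it (assuming /dev/rbd0 here,
adjust to your device):

grep rbd0 /proc/mounts            # still mounted somewhere?
fuser -vm /dev/rbd0               # processes with it open
ls /sys/block/rbd0/holders/       # LVM/dm or other devices stacked on top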

Did you see http://tracker.ceph.com/issues/12763?

>
> As client where rbd is mapped I use Debian stretch with kernel 4.9,
> ceph cluster is on version 11.2.

rbd unmap -o force $DEV

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to force "rbd unmap"

2017-07-05 Thread Stanislav Kopp
Hello,

I have problem that sometimes I can't unmap rbd device, I get "sysfs
write failed rbd: unmap failed: (16) Device or resource busy", there
is no open files and "holders" directory is empty. I saw on the
mailling list that you can "force" unmapping the device, but I cant
find how does it work. "man rbd" only mentions "force" as "KERNEL RBD
(KRBD) OPTION", but "modinfo rbd" doesn't show this option. Did I miss
something?

As client where rbd is mapped I use Debian stretch with kernel 4.9,
ceph cluster is on version 11.2.

Thanks,
Stan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon stuck in synchronizing after upgrading from Hammer to Jewel

2017-07-05 Thread David Turner
Did you make sure that your upgraded mon was chown'd to ceph:ceph?
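
For reference, the usual Jewel upgrade step is something along these lines
(default paths and unit names assumed):

systemctl stop ceph-mon@$(hostname -s)
chown -R ceph:ceph /var/lib/ceph
systemctl start ceph-mon@$(hostname -s)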

On Wed, Jul 5, 2017, 1:54 AM jiajia zhong  wrote:

> refer to http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
>
> I recalled we encoutered the same issue after upgrading to Jewel :(.
>
> 2017-07-05 11:21 GMT+08:00 许雪寒 :
>
>> Hi, everyone.
>>
>> Recently, we upgraded one of our clusters from Hammer to Jewel. However,
>> after upgrading, one of our monitors cannot finish the bootstrap procedure
>> and is stuck in “synchronizing”. Does anyone have any clue about this?
>>
>> Thank you☺
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Wido den Hollander

> Op 5 juli 2017 om 12:39 schreef c...@jack.fr.eu.org:
> 
> 
> Beware, a single 10G NIC is easily saturated by a single NVMe device
> 

Yes, it is. But that was what I was pointing at. Bandwidth is usually not a 
problem, latency is.

Take a look at a Ceph cluster running out there, it is probably doing a lot of 
IOps, but not that much bandwidth.

A production cluster I took a look at:

"client io 405 MB/s rd, 116 MB/s wr, 12211 op/s rd, 13272 op/s wr"

This cluster is 15 machines with 10 OSDs (SSD, PM863a) each.

So 405/15 = 27MB/sec

It's doing 13k IOps now, that increases to 25k during higher load, but the 
bandwidth stays below 500MB/sec in TOTAL.

So yes, you are right, an NVMe device can saturate a single NIC, but most of the 
time latency and IOps are what count. Not bandwidth.

Wido

> On 05/07/2017 11:54, Wido den Hollander wrote:
> > 
> >> Op 5 juli 2017 om 11:41 schreef "Van Leeuwen, Robert" 
> >> :
> >>
> >>
> >> Hi Max,
> >>
> >> You might also want to look at the PCIE lanes.
> >> I am not an expert on the matter but my guess would be the 8 NVME drives + 
> >> 2x100Gbit would be too much for
> >> the current Xeon generation (40 PCIE lanes) to fully utilize.
> >>
> > 
> > Fair enough, but you might want to think about if you really, really need 
> > 100Gbit. Those cards are expensive, same goes for the Gbics and switches.
> > 
> > Storage is usually latency bound and not so much bandwidth. Imho a lot of 
> > people focus on raw TBs and bandwidth, but in the end IOps and latency are 
> > what usually matters.
> > 
> > I'd probably stick with 2x10Gbit for now and use the money I saved on more 
> > memory and faster CPUs.
> > 
> > Wido
> > 
> >> I think the upcoming AMD/Intel offerings will improve that quite a bit so 
> >> you may want to wait for that.
> >> As mentioned earlier. Single Core cpu speed matters for latency so you 
> >> probably want to up that.
> >>
> >> You can also look at the DIMM configuration.
> >> TBH I am not sure how much it impacts Ceph performance but having just 2 
> >> DIMMS slots populated will not give you max memory bandwidth.
> >> Having some extra memory for read-cache probably won’t hurt either (unless 
> >> you know your workload won’t include any cacheable reads)
> >>
> >> Cheers,
> >> Robert van Leeuwen
> >>
> >> From: ceph-users  on behalf of 
> >> Massimiliano Cuttini 
> >> Organization: PhoenixWeb Srl
> >> Date: Wednesday, July 5, 2017 at 10:54 AM
> >> To: "ceph-users@lists.ceph.com" 
> >> Subject: [ceph-users] New cluster - configuration tips and reccomendation 
> >> - NVMe
> >>
> >>
> >> Dear all,
> >>
> >> luminous is coming and sooner we should be allowed to avoid double writing.
> >> This means use 100% of the speed of SSD and NVMe.
> >> Cluster made all of SSD and NVMe will not be penalized and start to make 
> >> sense.
> >>
> >> Looking forward I'm building the next pool of storage which we'll setup on 
> >> next term.
> >> We are taking in consideration a pool of 4 with the following single node 
> >> configuration:
> >>
> >>   *   2x E5-2603 v4 - 6 cores - 1.70GHz
> >>   *   2x 32Gb of RAM
> >>   *   2x NVMe M2 for OS
> >>   *   6x NVMe U2 for OSD
> >>   *   2x 100Gib ethernet cards
> >>
> >> We have yet not sure about which Intel and how much RAM we should put on 
> >> it to avoid CPU bottleneck.
> >> Can you help me to choose the right couple of CPU?
> >> Did you see any issue on the configuration proposed?
> >>
> >> Thanks,
> >> Max
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread ceph
Beware, a single 10G NIC is easily saturated by a single NVMe device

On 05/07/2017 11:54, Wido den Hollander wrote:
> 
>> Op 5 juli 2017 om 11:41 schreef "Van Leeuwen, Robert" 
>> :
>>
>>
>> Hi Max,
>>
>> You might also want to look at the PCIE lanes.
>> I am not an expert on the matter but my guess would be the 8 NVME drives + 
>> 2x100Gbit would be too much for
>> the current Xeon generation (40 PCIE lanes) to fully utilize.
>>
> 
> Fair enough, but you might want to think about if you really, really need 
> 100Gbit. Those cards are expensive, same goes for the Gbics and switches.
> 
> Storage is usually latency bound and not so much bandwidth. Imho a lot of 
> people focus on raw TBs and bandwidth, but in the end IOps and latency are 
> what usually matters.
> 
> I'd probably stick with 2x10Gbit for now and use the money I saved on more 
> memory and faster CPUs.
> 
> Wido
> 
>> I think the upcoming AMD/Intel offerings will improve that quite a bit so 
>> you may want to wait for that.
>> As mentioned earlier. Single Core cpu speed matters for latency so you 
>> probably want to up that.
>>
>> You can also look at the DIMM configuration.
>> TBH I am not sure how much it impacts Ceph performance but having just 2 
>> DIMMS slots populated will not give you max memory bandwidth.
>> Having some extra memory for read-cache probably won’t hurt either (unless 
>> you know your workload won’t include any cacheable reads)
>>
>> Cheers,
>> Robert van Leeuwen
>>
>> From: ceph-users  on behalf of 
>> Massimiliano Cuttini 
>> Organization: PhoenixWeb Srl
>> Date: Wednesday, July 5, 2017 at 10:54 AM
>> To: "ceph-users@lists.ceph.com" 
>> Subject: [ceph-users] New cluster - configuration tips and reccomendation - 
>> NVMe
>>
>>
>> Dear all,
>>
>> luminous is coming and sooner we should be allowed to avoid double writing.
>> This means use 100% of the speed of SSD and NVMe.
>> Cluster made all of SSD and NVMe will not be penalized and start to make 
>> sense.
>>
>> Looking forward I'm building the next pool of storage which we'll setup on 
>> next term.
>> We are taking in consideration a pool of 4 with the following single node 
>> configuration:
>>
>>   *   2x E5-2603 v4 - 6 cores - 1.70GHz
>>   *   2x 32Gb of RAM
>>   *   2x NVMe M2 for OS
>>   *   6x NVMe U2 for OSD
>>   *   2x 100Gib ethernet cards
>>
>> We have yet not sure about which Intel and how much RAM we should put on it 
>> to avoid CPU bottleneck.
>> Can you help me to choose the right couple of CPU?
>> Did you see any issue on the configuration proposed?
>>
>> Thanks,
>> Max
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Blair Bethwaite
On 5 July 2017 at 19:54, Wido den Hollander  wrote:

> I'd probably stick with 2x10Gbit for now and use the money I saved on more
> memory and faster CPUs.
>

On the latency point - you will get an improvement going from 10Gb to
25Gb, but stepping up to 100Gb won't significantly change things, as 100Gb =
4x 25Gb lanes.

If I had to buy this sort of cluster today I'd probably look at a
multi-node chassis like Dell's C6320, that holds 4x 2-socket E5v4 nodes,
each of which can take 4x NVMe SSDs, I'm not sure of the exact network
daughter card options available in that configuration but they typically
have Mellanox options which would open up a 2x25Gb NIC option. At least
this way you are managing to get a reasonable number of storage devices per
rack unit, but still a terrible use of space compared to a dedicated flash
jbod array thing.

Also not sure if the end-to-end storage op latencies achieved with NVMe
versus SAS or SATA SSDs in a Ceph cluster really make it that much
better... would be interested to hear about any comparisons!

Wait another month and there should be a whole slew of new Intel and AMD
platform choices available on the market.

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon leader election problem, should it be improved ?

2017-07-05 Thread Joao Eduardo Luis

On 07/05/2017 08:01 AM, Z Will wrote:

Hi Joao:
I think this is all because we choose the monitor with the
smallest rank number to be leader. For this kind of network error,
whichever mon has lost connection with the mon that has the smallest
rank num will constantly call elections, that is to say, it will
constantly affect the cluster until it is stopped by a human. So do you
think it makes sense if I try to figure out a way to choose as leader
the monitor that can see the most monitors, or the one with the smallest
rank num if the view num is the same?
In the probing phase:
   they will know their own view, so they can set a view num.
In the election phase:
   they send the view num and rank num.
   When receiving an election message, a monitor compares the view num
(higher is leader) and the rank num (lower is leader).


As I understand it, our elector trades off reliability in the case of 
network failure for expediency in forming a quorum. This by itself is 
not a problem, since we don't see many real-world cases where this 
behaviour happens, and we are a lot more interested in making sure we 
have a quorum - given that without a quorum your cluster is effectively 
unusable.


Currently, we form a quorum with a minimal number of messages passed.
From my poor recollection, I think the Elector works something like

- 1 probe message to each monitor in the monmap
- receives defer from a monitor, or defers to a monitor
- declares victory if number of defers is an absolute majority 
(including one's defer).


An election cycle takes about 4-5 messages to complete, with roughly two 
round-trips (in the best case scenario).


Figuring out which monitor is able to contact the highest number of 
monitors, and having said monitor being elected the leader, will 
necessarily increase the number of messages transferred.


A rough idea would be

- all monitors will send probes to all other monitors in the monmap;
- all monitors need to ack the other's probes;
- each monitor will count the number of monitors it can reach, and then 
send a message proposing itself as the leader to the other monitors, 
with the list of monitors they see;
- each monitor will propose itself as the leader, or defer to some other 
monitor.


This is closer to 3 round-trips.

Additionally, we'd have to account for the fact that some monitors may 
be able to reach all other monitors, while some may only be able to 
reach a portion. How do we handle this scenario?


- What do we do with monitors that do not reach all other monitors?
- Do we ignore them for electoral purposes?
- Are they part of the final quorum?
- What if we need those monitors to form a quorum?

Personally, I think the easiest solution to this problem would be 
blacklisting a problematic monitor (for a given amount of time, or until 
a new election is needed due to loss of quorum, or by human intervention).


For example, if a monitor believes it should be the leader, and if all 
other monitors are deferring to someone else that is not reachable, the 
monitor could then enter a special case branch:


- send a probe to all monitors
- receive acks
- share that with other monitors
- if that list is missing monitors, then blacklist the monitor for a 
period, and send a message to that monitor with that decision

- the monitor would blacklist itself and retry in a given amount of time.

Basically, this would be something similar to heartbeats. If a monitor 
can't reach all monitors in an existing quorum, then just don't do anything.


In any case, you are more than welcome to propose a solution. Let us 
know what you come up with and if you want to discuss this a bit more ;)


  -Joao



On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis  wrote:

On 07/04/2017 06:57 AM, Z Will wrote:


Hi:
   I am testing ceph-mon brain split. I have read the code. If I
understand it right, I know it won't brain-split. But I think
there is still another problem. My ceph version is 0.94.10. And here
is my test detail:

3 ceph-mons, their ranks are 0, 1, 2 respectively. I stop the rank 1
mon, and use iptables to block the communication between mon 0 and
mon 1. When the cluster is stable, I start mon.1.  I found that the 3
monitors all stop working well. They all keep trying to call a new
leader election. This means the cluster can't work anymore.

Here is my analysis. A mon will always respond to a leader
election message. In my test, communication between mon.0 and
mon.1 is blocked, so mon.1 will always try to be leader, because it
will always see mon.2, and it should win over mon.2. Mon.0 should
always win over mon.2. But mon.2 will always respond to the election
message issued by mon.1, so this loop will never end. Am I right?

Should this be considered a problem? Or was it just designed like this, and
should it be handled by a human?



This is a known behaviour, quite annoying, but easily identifiable by having
the same monitor constantly calling an election and usually timing out
because the peon did not defer to it.

Re: [ceph-users] bluestore behavior on disks sector read errors

2017-07-05 Thread Wido den Hollander

> Op 27 juni 2017 om 11:17 schreef SCHAER Frederic :
> 
> 
> Hi,
> 
> Every now and then , sectors die on disks.
> When this happens on my bluestore (kraken) OSDs, I get 1 PG that becomes 
> degraded.
> The exact status is :
> 
> 
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> 
> pg 12.127 is active+clean+inconsistent, acting [141,67,85]
> 
> If I do a # rados list-inconsistent-obj 12.127 --format=json-pretty
> I get :
> (...)
> 
> "osd": 112,
> 
> "errors": [
> 
> "read_error"
> 
> ],
> 
> "size": 4194304
> 
> When this happens, I'm forced to manually run "ceph pg repair" on the 
> inconsistent PGs after I made sure this was a read error : I feel this should 
> not be a manual process.
> 
> If I go on the machine and look at the syslogs, I indeed see a sector read 
> error happened once or twice.
> But if I try to read the sector manually, then I can because it was 
> reallocated on the disk I presume.
> Last time this happened, I ran badblocks on the disk and it found no issue...
> 
> My questions therefore are:
> 
> why doesn't bluestore retry reading the sector (in case of transient errors)? 
> (maybe it does)
> why isn't the pg automatically fixed when a read error was detected ?
> what will happen when the disks get old and reach up to 2048 bad sectors 
> before the controllers/smart declare them as "failure predicted" ?
> I can't imagine manually fixing  up to Nx2048 PGs in an infrastructure of N 
> disks where N could reach the sky...
> 
> Ideas ?

Try the Luminous RC? A lot has changed since Kraken was released. Using 
Luminous might help you.

Wido

> 
> Thanks && regards
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Wido den Hollander

> Op 5 juli 2017 om 11:41 schreef "Van Leeuwen, Robert" :
> 
> 
> Hi Max,
> 
> You might also want to look at the PCIE lanes.
> I am not an expert on the matter but my guess would be the 8 NVME drives + 
> 2x100Gbit would be too much for
> the current Xeon generation (40 PCIE lanes) to fully utilize.
> 

Fair enough, but you might want to think about if you really, really need 
100Gbit. Those cards are expensive, same goes for the Gbics and switches.

Storage is usually latency bound and not so much bandwidth. Imho a lot of 
people focus on raw TBs and bandwidth, but in the end IOps and latency are what 
usually matters.

I'd probably stick with 2x10Gbit for now and use the money I saved on more 
memory and faster CPUs.

Wido

> I think the upcoming AMD/Intel offerings will improve that quite a bit so you 
> may want to wait for that.
> As mentioned earlier. Single Core cpu speed matters for latency so you 
> probably want to up that.
> 
> You can also look at the DIMM configuration.
> TBH I am not sure how much it impacts Ceph performance but having just 2 
> DIMMS slots populated will not give you max memory bandwidth.
> Having some extra memory for read-cache probably won’t hurt either (unless 
> you know your workload won’t include any cacheable reads)
> 
> Cheers,
> Robert van Leeuwen
> 
> From: ceph-users  on behalf of 
> Massimiliano Cuttini 
> Organization: PhoenixWeb Srl
> Date: Wednesday, July 5, 2017 at 10:54 AM
> To: "ceph-users@lists.ceph.com" 
> Subject: [ceph-users] New cluster - configuration tips and reccomendation - 
> NVMe
> 
> 
> Dear all,
> 
> luminous is coming and sooner we should be allowed to avoid double writing.
> This means use 100% of the speed of SSD and NVMe.
> Cluster made all of SSD and NVMe will not be penalized and start to make 
> sense.
> 
> Looking forward I'm building the next pool of storage which we'll setup on 
> next term.
> We are taking in consideration a pool of 4 with the following single node 
> configuration:
> 
>   *   2x E5-2603 v4 - 6 cores - 1.70GHz
>   *   2x 32Gb of RAM
>   *   2x NVMe M2 for OS
>   *   6x NVMe U2 for OSD
>   *   2x 100Gib ethernet cards
> 
> We have yet not sure about which Intel and how much RAM we should put on it 
> to avoid CPU bottleneck.
> Can you help me to choose the right couple of CPU?
> Did you see any issue on the configuration proposed?
> 
> Thanks,
> Max
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread ceph
Interesting point: a 100Gbps NIC is PCIe x16 and each NVMe is x4, so that's 64
PCIe lanes required.

Should work at full rate on a dual-socket server
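
(Rough math, assuming x4 per NVMe drive and x16 per 100G NIC: 8 NVMe x 4 = 32
lanes plus 2 NICs x 16 = 32 lanes, 64 in total - more than the 40 lanes of a
single E5 v4 socket, but within the 80 of a dual-socket board.)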

On 05/07/2017 11:41, Van Leeuwen, Robert wrote:
> Hi Max,
> 
> You might also want to look at the PCIE lanes.
> I am not an expert on the matter but my guess would be the 8 NVME drives + 
> 2x100Gbit would be too much for
> the current Xeon generation (40 PCIE lanes) to fully utilize.
> 
> Cheers,
> Robert van Leeuwen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Van Leeuwen, Robert
Hi Max,

You might also want to look at the PCIE lanes.
I am not an expert on the matter but my guess would be the 8 NVME drives + 
2x100Gbit would be too much for
the current Xeon generation (40 PCIE lanes) to fully utilize.

I think the upcoming AMD/Intel offerings will improve that quite a bit so you 
may want to wait for that.
As mentioned earlier. Single Core cpu speed matters for latency so you probably 
want to up that.

You can also look at the DIMM configuration.
TBH I am not sure how much it impacts Ceph performance but having just 2 DIMMS 
slots populated will not give you max memory bandwidth.
Having some extra memory for read-cache probably won’t hurt either (unless you 
know your workload won’t include any cacheable reads)

Cheers,
Robert van Leeuwen

From: ceph-users  on behalf of Massimiliano 
Cuttini 
Organization: PhoenixWeb Srl
Date: Wednesday, July 5, 2017 at 10:54 AM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] New cluster - configuration tips and reccomendation - NVMe


Dear all,

luminous is coming and sooner we should be allowed to avoid double writing.
This means use 100% of the speed of SSD and NVMe.
Cluster made all of SSD and NVMe will not be penalized and start to make sense.

Looking forward I'm building the next pool of storage which we'll setup on next 
term.
We are taking in consideration a pool of 4 with the following single node 
configuration:

  *   2x E5-2603 v4 - 6 cores - 1.70GHz
  *   2x 32Gb of RAM
  *   2x NVMe M2 for OS
  *   6x NVMe U2 for OSD
  *   2x 100Gib ethernet cards

We have yet not sure about which Intel and how much RAM we should put on it to 
avoid CPU bottleneck.
Can you help me to choose the right couple of CPU?
Did you see any issue on the configuration proposed?

Thanks,
Max
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bucket resharding: "radosgw-admin bi list" ERROR

2017-07-05 Thread Andreas Calminder
Sure thing!
I noted the new and old bucket instance id.

Back up the bucket metadata:
# radosgw-admin --cluster ceph-prod metadata get bucket:1001/large_bucket > large_bucket.metadata.bak.json

# cp large_bucket.metadata.bak.json large_bucket.metadata.patched.json

Set bucket_id in large_bucket.metadata.patched.json to the new bucket
instance id and replace the metadata in the bucket with
large_bucket.metadata.patched.json:
# radosgw-admin --cluster ceph-prod metadata put bucket:1001/large_bucket < large_bucket.metadata.patched.json

Verify that bucket_id has been updated:
# radosgw-admin --cluster ceph-prod metadata get bucket:1001/large_bucket

Try to access some objects in the updated bucket read/write; note that any
write operations at this point will still be slow, as the old instance id
still has a large index (at least our cluster behaved like that). Then
purge the index from the old bucket instance id:
# radosgw-admin --cluster ceph-prod bi purge --bucket 1001/large_bucket --bucket-id old_bucket_instance_id

After that, write operations against the index went smoothly.

As said before, I didn't care about the data in the bucket at all; the
above steps are potentially dangerous and flat out wrong. But..
worksforme(tm)

/andreas

On 5 July 2017 at 10:45, Maarten De Quick  wrote:
> Hi Andreas,
>
> Interesting as we are also on Jewel 10.2.7. We do care about the data in the
> bucket so we really need the reshard process to run properly :).
> Could you maybe share how you linked the bucket to the new index by hand?
> That would already give me some extra insight.
> Thanks!
>
> Regards,
> Maarten
>
> On Wed, Jul 5, 2017 at 10:21 AM, Andreas Calminder
>  wrote:
>>
>> Hi,
>> I had a similar problem while resharding an oversized non-sharded
>> bucket in Jewel (10.2.7), the bi_list exited with ERROR: bi_list():
>> (4) Interrupted system call at, what seemed like the very end of the
>> operation. I went ahead and resharded the bucket anyway and the
>> reshard process ended the same way, seemingly at the end. Reshard
>> didn't link the bucket to new instance id though so I had to do that
>> by hand and then purge the index from the old instance id.
>> Note that I didn't care about the data in the bucket, I just wanted to
>> reshard the index so I could delete the bucket without my radosgw and
>> osds crashing due to out of memory issues.
>>
>> Regards,
>> Andreas
>>
>> On 4 July 2017 at 20:46, Maarten De Quick  wrote:
>> > Hi,
>> >
>> > Background: We're having issues with our index pool (slow requests /
>> > time
>> > outs causes crashing of an OSD and a recovery -> application issues). We
>> > know we have very big buckets (eg. bucket of 77 million objects with
>> > only 16
>> > shards) that need a reshard so we were looking at the resharding
>> > process.
>> >
>> > First thing we would like to do is making a backup of the bucket index,
>> > but
>> > this failed with:
>> >
>> > # radosgw-admin -n client.radosgw.be-west-3 bi list
>> > --bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup
>> > 2017-07-03 21:28:30.325613 7f07fb8bc9c0  0 System already converted
>> > ERROR: bi_list(): (4) Interrupted system call
>> >
>> > When I grep for "idx" and I count these:
>> >  # grep idx priv-prod-up-alex.list.backup | wc -l
>> > 2294942
>> > When I do a bucket stats for that bucket I get:
>> > # radosgw-admin -n client.radosgw.be-west-3 bucket stats
>> > --bucket=priv-prod-up-alex | grep num_objects
>> > 2017-07-03 21:33:05.776499 7faca49b89c0  0 System already converted
>> > "num_objects": 20148575
>> >
>> > It looks like there are 18 million objects missing and the backup is not
>> > complete (not sure if that's a correct assumption?). We're also afraid
>> > that
>> > the resharding command will face the same issue.
>> > Has anyone seen this behaviour before or any thoughts on how to fix it?
>> >
>> > We were also wondering if we really need the backup. As the resharding
>> > process creates a complete new index and keeps the old bucket, is there
>> > maybe a possibility to relink your bucket to the old bucket in case of
>> > issues? Or am I missing something important here?
>> >
>> > Any help would be greatly appreciated, thanks!
>> >
>> > Regards,
>> > Maarten
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-05 Thread Lars Marowsky-Bree
On 2017-06-30T16:48:04, Sage Weil  wrote:

> > Simply disabling the tests while keeping the code in the distribution is
> > setting up users who happen to be using Btrfs for failure.
> 
> I don't think we can wait *another* cycle (year) to stop testing this.
> 
> We can, however,
> 
>  - prominently feature this in the luminous release notes, and
>  - require the 'enable experimental unrecoverable data corrupting features =
> btrfs' in order to use it, so that users are explicitly opting in to 
> luminous+btrfs territory.
> 
> The only good(ish) news is that we aren't touching FileStore if we can 
> help it, so it less likely to regress than other things.  And we'll 
> continue testing filestore+btrfs on jewel for some time.

That makes sense. Though btrfs is something users really shouldn't run
unless they get a heavily debugged and supported version from somewhere.

I'd also not mind just plain out dropping it completely, since I don't
believe any of our users runs it, they're all on XFS and will upconvert
to BlueStore.

That might be a good reasoning though: upgrading folks should be able to
get the OSDs on btrfs up (if they still have any) and go directly to
BlueStore, without having to first go via XFS.




Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Maxime Guyot
Hi Massimiliano,

I am a little surprised to see 6x NVMe, 64GB of RAM, 2x 100G NICs and an E5-2603
v4; that's one of the cheapest Intel E5 CPUs mixed with some pretty high-end
gear, and it does not make sense. Wido's right: go with a much higher frequency:
E5-2637 v4, E5-2643 v4, E5-1660 v4, E5-1650 v4. If you need to go cheap, the
E3 series is interesting (E3-1220 v6, E3-1230 v6, ...) if you can work with
the limitations: max 64GB of RAM, max 4 cores and a single CPU.

Higher frequency should reduce latency when communicating with NICs and
SSDs which benefits Ceph's performance.

100G NICs is overkill for throughput, but it should reduce the latency. 25G
NIC are becoming popular for servers (replacing 10G NICs).

Cheers,
Maxime

On Wed, 5 Jul 2017 at 10:55 Massimiliano Cuttini  wrote:

> Dear all,
>
> luminous is coming and sooner we should be allowed to avoid double writing.
> This means use 100% of the speed of SSD and NVMe.
> Cluster made all of SSD and NVMe will not be penalized and start to make
> sense.
>
> Looking forward I'm building the next pool of storage which we'll setup on
> next term.
> We are taking in consideration a pool of 4 with the following single node
> configuration:
>
>- 2x E5-2603 v4 - 6 cores - 1.70GHz
>- 2x 32Gb of RAM
>- 2x NVMe M2 for OS
>- 6x NVMe U2 for OSD
>- 2x 100Gib ethernet cards
>
> We have yet not sure about which Intel and how much RAM we should put on
> it to avoid CPU bottleneck.
> Can you help me to choose the right couple of CPU?
> Did you see any issue on the configuration proposed?
>
>
> Thanks,
> Max
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread ceph
You will need CPUs as well if you want to push/fetch 200Gbps.

The 2603 really falls short here.

(Not really an issue, but NVMe for the OS seems useless to me.)

On 05/07/2017 11:02, Wido den Hollander wrote:
> 
>> Op 5 juli 2017 om 10:54 schreef Massimiliano Cuttini :
>>
>>
>> Dear all,
>>
>> luminous is coming and sooner we should be allowed to avoid double writing.
>> This means use 100% of the speed of SSD and NVMe.
>> Cluster made all of SSD and NVMe will not be penalized and start to make 
>> sense.
>>
>> Looking forward I'm building the next pool of storage which we'll setup 
>> on next term.
>> We are taking in consideration a pool of 4 with the following single 
>> node configuration:
>>
>>   * 2x E5-2603 v4 - 6 cores - 1.70GHz
> 
> You will need a faster CPU to get the maximum out of your OSDs.
> 
> IOps inside a OSD are single threaded, so if you want low-latency for a 
> single I/O you will need a faster CPU. Something like a 3Ghz CPU.
> 
> For example the E5-2643 v4 6core, 3.7Ghz
> 
>>   * 2x 32Gb of RAM
>>   * 2x NVMe M2 for OS
>>   * 6x NVMe U2 for OSD
>>   * 2x 100Gib ethernet cards
>>
>> We have yet not sure about which Intel and how much RAM we should put on 
>> it to avoid CPU bottleneck.
>> Can you help me to choose the right couple of CPU?
>> Did you see any issue on the configuration proposed?
>>
>>
>> Thanks,
>> Max
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Massive slowrequests causes OSD daemon to eat whole RAM

2017-07-05 Thread pwoszuk

Hello

We have a cluster of 10 ceph servers.

On that cluster there are EC pool with replicated SSD cache tier, used 
by OpenStack Cinder for volumes storage for production environment.


From 2 days we observe messages like this in logs:

2017-07-05 10:50:13.451987 osd.114 [WRN] slow request 1165.927215 
seconds old, received at 2017-07-05 10:30:47.104746: 
osd_op(osd.130.50779:43441 11.57a05c54 
rbd_data.5bc14d3135d111a.0084 [copy-get max 8388608] snapc 
0=[] 
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected 
e50881) currently waiting for rw locks


in this example:

 * OSD.114 is on HDD backend with EC pool in it
 * OSD.130 is on SSD tier

We've analyzed the logs and found that the RBD image listed above 
[rbd_data.5bc14d3135d111a] has been causing problems from the very beginning. 
The virtual machine (OpenStack uses the Ceph cluster as backend storage for 
Cinder) is DOWN/STOPPED. Our conclusion is that the problem lies on the 
cluster side, not the client side.


This unfortunately results in a huge amount of blocked requests and RAM 
consumption. As a result, the system restarts the OSD daemon, and the 
situation starts to repeat.


We've tried to temporarily mark the problematic OSDs down, but the problem 
propagates to a different OSD pair.


Using "ceph daemon osd.<id> dump_ops_in_flight" on the problematic OSDs 
causes the OSD to hang and, within a few minutes, be marked down by the 
cluster, with no response from the command.


SSD model used for SSD cache tier pool is: SAMSUNG MZ7KM240

Could anyone tell what those log messages mean? Has anyone had such a 
problem and could help to diagnose/repair?


Thanks for any help

-
Pawel Woszuk
PSNC, Poznan Supercomputing and Networking Center
Poznań, Poland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Wido den Hollander

> Op 5 juli 2017 om 10:54 schreef Massimiliano Cuttini :
> 
> 
> Dear all,
> 
> luminous is coming and sooner we should be allowed to avoid double writing.
> This means use 100% of the speed of SSD and NVMe.
> Cluster made all of SSD and NVMe will not be penalized and start to make 
> sense.
> 
> Looking forward I'm building the next pool of storage which we'll setup 
> on next term.
> We are taking in consideration a pool of 4 with the following single 
> node configuration:
> 
>   * 2x E5-2603 v4 - 6 cores - 1.70GHz

You will need a faster CPU to get the maximum out of your OSDs.

IOps inside a OSD are single threaded, so if you want low-latency for a single 
I/O you will need a faster CPU. Something like a 3Ghz CPU.

For example the E5-2643 v4 6core, 3.7Ghz

>   * 2x 32Gb of RAM
>   * 2x NVMe M2 for OS
>   * 6x NVMe U2 for OSD
>   * 2x 100Gib ethernet cards
> 
> We have yet not sure about which Intel and how much RAM we should put on 
> it to avoid CPU bottleneck.
> Can you help me to choose the right couple of CPU?
> Did you see any issue on the configuration proposed?
> 
> 
> Thanks,
> Max
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bucket resharding: "radosgw-admin bi list" ERROR

2017-07-05 Thread Maarten De Quick
Hi Orit,

We're running on jewel, version 10.2.7.

I've run the bi list with the debugging commands and this is the end of it:

2017-07-05 08:50:19.705673 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2297  osd_op_reply(2571 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 ondisk = 0) v7  145+0+385625 (3432176685 0 1775102993) 0x7ff3b00041a0 con 0x7ff4272f48f0
2017-07-05 08:50:19.724193 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6810/3633200 -- osd_op(client.5971646.0:2572 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f0
2017-07-05 08:50:19.767758 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2298  osd_op_reply(2572 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 ondisk = 0) v7  145+0+385625 (3432176685 0 2330398289) 0x7ff3b00041a0 con 0x7ff4272f48f0
2017-07-05 08:50:19.786309 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6810/3633200 -- osd_op(client.5971646.0:2573 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f0
2017-07-05 08:50:19.827960 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2299  osd_op_reply(2573 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 ondisk = 0) v7  145+0+385625 (3432176685 0 1724305540) 0x7ff3b00041a0 con 0x7ff4272f48f0
2017-07-05 08:50:19.846588 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6810/3633200 -- osd_op(client.5971646.0:2574 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f0
2017-07-05 08:50:19.870830 7ff3bfefe700  1 -- 10.21.4.1:0/3313807338 <== osd.3 10.21.4.111:6810/3633200 2300  osd_op_reply(2574 .dir.be-east.5582981.76.0 [call] v0'0 uv0 ondisk = -4 ((4) Interrupted system call)) v7  145+0+0 (798610401 0 0) 0x7ff3b00041a0 con 0x7ff4272f48f0
ERROR: bi_list(): (4) Interrupted system call
2017-07-05 08:50:19.872489 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.112:6822/2795125 -- osd_op(client.5971646.0:2575 24.4322fa9f notify.0 [watch unwatch cookie 140686606221264] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427302b10
2017-07-05 08:50:19.878128 7ff3bf0f7700  1 -- 10.21.4.1:0/3313807338 <== osd.23 10.21.4.112:6822/2795125 63  osd_op_reply(2575 notify.0 [watch unwatch cookie 140686606221264] v31545'6808 uv6416 ondisk = 0) v7  128+0+0 (3462997515 0 0) 0x7ff3980014f0 con 0x7ff427302b10
2017-07-05 08:50:19.878221 7ff4250219c0 20 remove_watcher() i=0
2017-07-05 08:50:19.878229 7ff4250219c0  2 removed watcher, disabling cache
2017-07-05 08:50:19.878278 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.113:6807/2176843 -- osd_op(client.5971646.0:2576 24.16dafda0 notify.1 [watch unwatch cookie 140686606235888] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427304ae0
2017-07-05 08:50:19.880843 7ff3beef5700  1 -- 10.21.4.1:0/3313807338 <== osd.27 10.21.4.113:6807/2176843 63  osd_op_reply(2576 notify.1 [watch unwatch cookie 140686606235888] v31545'6706 uv6304 ondisk = 0) v7  128+0+0 (4086455760 0 0) 0x7ff3900014f0 con 0x7ff427304ae0
2017-07-05 08:50:19.880910 7ff4250219c0 20 remove_watcher() i=1
2017-07-05 08:50:19.880940 7ff4250219c0  1 -- 10.21.4.1:0/3313807338 --> 10.21.4.111:6802/3632911 -- osd_op(client.5971646.0:2577 24.88aa5c95 notify.2 [watch unwatch cookie 140686606250416] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff4273083d0
2017-07-05 08:50:19.886387 7ff3becf3700  1 -- 10.21.4.1:0/3313807338 <== osd.1 10.21.4.111:6802/3632911 94  osd_op_reply(2577 notify.2 [watch unwatch cookie 140686606250416] v31545'10057 uv9497 ondisk = 0) v7  128+0+0 (2583541993 0 0) 0x7ff388001630 con 0x7ff4273083d0
2017-07-05 08:50:19.886476 7ff4250219c0 20 remove_watcher() i=2
2017-07-05 08:50:19.886513 7ff4250

[ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Massimiliano Cuttini

Dear all,

Luminous is coming, and soon we should be able to avoid double writes.
This means using 100% of the speed of SSDs and NVMe.
Clusters made entirely of SSD and NVMe will no longer be penalized and will 
start to make sense.


Looking forward, I'm building the next pool of storage, which we'll set up 
next term.
We are considering a pool of 4 nodes with the following single-node 
configuration:


 * 2x E5-2603 v4 - 6 cores - 1.70GHz
 * 2x 32Gb of RAM
 * 2x NVMe M2 for OS
 * 6x NVMe U2 for OSD
 * 2x 100Gib ethernet cards

We are not yet sure which Intel CPU and how much RAM we should put in 
it to avoid a CPU bottleneck.

Can you help me choose the right pair of CPUs?
Do you see any issue with the proposed configuration?


Thanks,
Max

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bucket resharding: "radosgw-admin bi list" ERROR

2017-07-05 Thread Maarten De Quick
Hi Andreas,

Interesting as we are also on Jewel 10.2.7. We do care about the data in
the bucket so we really need the reshard process to run properly :).
Could you maybe share how you linked the bucket to the new index by hand?
That would already give me some extra insight.
Thanks!

Regards,
Maarten

On Wed, Jul 5, 2017 at 10:21 AM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hi,
> I had a similar problem while resharding an oversized non-sharded
> bucket in Jewel (10.2.7), the bi_list exited with ERROR: bi_list():
> (4) Interrupted system call at, what seemed like the very end of the
> operation. I went ahead and resharded the bucket anyway and the
> reshard process ended the same way, seemingly at the end. Reshard
> didn't link the bucket to new instance id though so I had to do that
> by hand and then purge the index from the old instance id.
> Note that I didn't care about the data in the bucket, I just wanted to
> reshard the index so I could delete the bucket without my radosgw and
> osds crashing due to out of memory issues.
>
> Regards,
> Andreas
>
> On 4 July 2017 at 20:46, Maarten De Quick  wrote:
> > Hi,
> >
> > Background: We're having issues with our index pool (slow requests / time
> > outs causes crashing of an OSD and a recovery -> application issues). We
> > know we have very big buckets (eg. bucket of 77 million objects with
> only 16
> > shards) that need a reshard so we were looking at the resharding process.
> >
> > First thing we would like to do is making a backup of the bucket index,
> but
> > this failed with:
> >
> > # radosgw-admin -n client.radosgw.be-west-3 bi list
> > --bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup
> > 2017-07-03 21:28:30.325613 7f07fb8bc9c0  0 System already converted
> > ERROR: bi_list(): (4) Interrupted system call
> >
> > When I grep for "idx" and I count these:
> >  # grep idx priv-prod-up-alex.list.backup | wc -l
> > 2294942
> > When I do a bucket stats for that bucket I get:
> > # radosgw-admin -n client.radosgw.be-west-3 bucket stats
> > --bucket=priv-prod-up-alex | grep num_objects
> > 2017-07-03 21:33:05.776499 7faca49b89c0  0 System already converted
> > "num_objects": 20148575
> >
> > It looks like there are 18 million objects missing and the backup is not
> > complete (not sure if that's a correct assumption?). We're also afraid
> that
> > the resharding command will face the same issue.
> > Has anyone seen this behaviour before or any thoughts on how to fix it?
> >
> > We were also wondering if we really need the backup. As the resharding
> > process creates a complete new index and keeps the old bucket, is there
> > maybe a possibility to relink your bucket to the old bucket in case of
> > issues? Or am I missing something important here?
> >
> > Any help would be greatly appreciated, thanks!
> >
> > Regards,
> > Maarten
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bucket resharding: "radosgw-admin bi list" ERROR

2017-07-05 Thread Andreas Calminder
Hi,
I had a similar problem while resharding an oversized non-sharded
bucket in Jewel (10.2.7): the bi_list exited with ERROR: bi_list():
(4) Interrupted system call at what seemed like the very end of the
operation. I went ahead and resharded the bucket anyway, and the
reshard process ended the same way, seemingly at the end. Reshard
didn't link the bucket to the new instance id though, so I had to do that
by hand and then purge the index from the old instance id.
Note that I didn't care about the data in the bucket, I just wanted to
reshard the index so I could delete the bucket without my radosgw and
osds crashing due to out of memory issues.

Regards,
Andreas

On 4 July 2017 at 20:46, Maarten De Quick  wrote:
> Hi,
>
> Background: We're having issues with our index pool (slow requests / time
> outs causes crashing of an OSD and a recovery -> application issues). We
> know we have very big buckets (eg. bucket of 77 million objects with only 16
> shards) that need a reshard so we were looking at the resharding process.
>
> First thing we would like to do is making a backup of the bucket index, but
> this failed with:
>
> # radosgw-admin -n client.radosgw.be-west-3 bi list
> --bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup
> 2017-07-03 21:28:30.325613 7f07fb8bc9c0  0 System already converted
> ERROR: bi_list(): (4) Interrupted system call
>
> When I grep for "idx" and I count these:
>  # grep idx priv-prod-up-alex.list.backup | wc -l
> 2294942
> When I do a bucket stats for that bucket I get:
> # radosgw-admin -n client.radosgw.be-west-3 bucket stats
> --bucket=priv-prod-up-alex | grep num_objects
> 2017-07-03 21:33:05.776499 7faca49b89c0  0 System already converted
> "num_objects": 20148575
>
> It looks like there are 18 million objects missing and the backup is not
> complete (not sure if that's a correct assumption?). We're also afraid that
> the resharding command will face the same issue.
> Has anyone seen this behaviour before or any thoughts on how to fix it?
>
> We were also wondering if we really need the backup. As the resharding
> process creates a complete new index and keeps the old bucket, is there
> maybe a possibility to relink your bucket to the old bucket in case of
> issues? Or am I missing something important here?
>
> Any help would be greatly appreciated, thanks!
>
> Regards,
> Maarten
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon leader election problem, should it be improved ?

2017-07-05 Thread Z Will
Hi Joao:
I think this is all because we choose the monitor with the
smallest rank number to be leader. For this kind of network error,
whichever mon has lost connection with the mon that has the smallest
rank num will constantly call elections, that is to say, it will
constantly affect the cluster until it is stopped by a human. So do you
think it makes sense if I try to figure out a way to choose as leader
the monitor that can see the most monitors, or the one with the smallest
rank num if the view num is the same?
In the probing phase:
   they will know their own view, so they can set a view num.
In the election phase:
   they send the view num and rank num.
   When receiving an election message, a monitor compares the view num
(higher is leader) and the rank num (lower is leader).

On Tue, Jul 4, 2017 at 9:25 PM, Joao Eduardo Luis  wrote:
> On 07/04/2017 06:57 AM, Z Will wrote:
>>
>> Hi:
>>I am testing ceph-mon brain split . I have read the code . If I
>> understand it right , I know it won't be brain split. But I think
>> there is still another problem. My ceph version is 0.94.10. And here
>> is my test detail :
>>
>> 3 ceph-mons , there ranks are 0, 1, 2 respectively.I stop the rank 1
>> mon , and use iptables to block the communication between mon 0 and
>> mon 1. When the cluster is stable, start mon.1 .  I found the 3
>> monitors will all can not work well. They are all trying to call  new
>> leader  election . This means the cluster can't work anymore.
>>
>> Here is my analysis. Because mon will always respond to leader
>> election message, so , in my test, communication between  mon.0 and
>> mon.1 is blocked , so mon.1 will always try to be leader, because it
>> will always see mon.2, and it should win over mon.2. Mon.0 should
>> always win over mon.2. But mon.2 will always responsd to the election
>> message issued by mon.1, so this loop will never end. Am I right ?
>>
>> This should be a problem? Or is it  was just designed like this , and
>> should be handled by human ?
>
>
> This is a known behaviour, quite annoying, but easily identifiable by having
> the same monitor constantly calling an election and usually timing out
> because the peon did not defer to it.
>
> In a way, the elector algorithm does what it is intended to. Solving this
> corner case would be nice, but I don't think there's a good way to solve it.
> We may be able to presume a monitor is in trouble during the probe phase, to
> disqualify a given monitor from the election, but in the end this is a
> network issue that may be transient or unpredictable and there's only so
> much we can account for.
>
> Dealing with it automatically would be nice, but I think, thus far, the
> easiest way to address this particular issue is human intervention.
>
>   -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com