Re: [ceph-users] fibre channel as ceph storage interconnect

2016-04-21 Thread Paul Evans
On Apr 21, 2016, at 11:10 PM, Schlacta, Christ <aarc...@aarcane.org> wrote:

Would it be worth while development effort to establish a block
protocol between the nodes so that something like fibre channel could
be used to communicate internally?

With 25/100 Ethernet & IB becoming available now...and with some effort to 
integrate IB and ceph already completed, I can’t see FC for OSDs getting any 
traction.

 Looks like I'll have to look into infiniband or
CE, and possibly migrate away from Fibre Channel, even though it kinda
just works, and therefore I really like it :(

If by CE you're referring to Converged Enhanced Ethernet and its brethren 
(DCB/DCE), those technologies should be transparent to Ceph and provide some 
degree of improvement over standard Ethernet behaviors.  YMMV.

As for FC ‘just works…’  +1   (but I really don’t want to inspire a flame war 
on the topic)

- Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-24 Thread Paul Evans
As the single-job spec being referenced (from Sebastien Han's blog, as I 
understand) includes the use of the --sync flag, the HBA and storage bus are 
unlikely to dominate the test results.  O_DSYNC operations must wait until a 
write is complete before starting another operation, which leaves both the 
CPU and the HBA 'waiting' on the storage media for the completion.
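
For reference, the single-job test in question is roughly of the following 
form (a sketch from memory - adjust the device name, and note that fio will 
overwrite whatever you point it at, so use a scratch device):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test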

All that said (mostly for those not familiar with the impact of --sync on SSDs): 
the communications path is generally longer and more complicated with the 
introduction of (any) SAS HBA in this use case.  SAS is great for handling 
multipath, multi-device communications, but for a simple point-to-point link 
the SATA protocol 'wins' out: the SATA controller is closer to the CPU, avoids 
the PCIe bus, avoids expander chips (hops), and the SATA->SAS->SATA tunneling 
operations are avoided too.  Hence: using SATA SSDs on a SAS bus is a 
'sub-optimal' solution in most cases, although 12G SAS overcomes some of this 
while SATA is stuck at 6G.

Bottom line: for the high-level goal of getting good O_DSYNC performance, 
finding a 'better' HBA isn't a useful exercise. Having faster storage media and 
a simpler IO path (like NVMe) should yield the best results, especially for 
synchronous IO.   My $.02.
-- Paul


I've checked with a MegaRAID SAS 2208, and there I get ~40 MB/s, both with a
1.9TB and a 240GB SM863 model.
So it seems the LSI MegaRAID HBAs are not optimized for a lot of single-job
IOPS...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configure Ceph client network

2015-12-24 Thread Paul Evans
Yes: make sure there is an entry in your ceph.conf to align the public (client) 
network with the IP space of the NIC where you want the Ceph IO...

[global]
public_network = <network-address>/<prefix>

There is also an option to isolate the OSD (cluster) traffic, which is 
recommended for production clusters, but I don’t believe it will apply to your 
case:

[global]
cluster_network = <network-address>/<prefix>
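
A minimal sketch of what that looks like in practice (the subnets below are 
made-up examples - substitute your own):

[global]
public_network  = 192.168.10.0/24    # client/MON-facing traffic
cluster_network = 192.168.20.0/24    # OSD replication and backfill traffic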

-Paul


Thanks. Do I need something in the client configuration file?

--- Original Message ---
Just give a different IP to each port and that should be okay.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Building a Pb EC cluster for a cheaper cold storage

2015-11-10 Thread Paul Evans
Mike - unless things have changed in the latest version(s) of Ceph, I do *not* 
believe CRUSH will be successful in creating a valid PG map if the 'n' value is 
10 (k+m), your host count is 6, and your failure domain is set to host.  You'll 
need to increase your host count to match or exceed 'n', change the failure 
domain to OSD, or alter the k+m config to something more compatible with 
your host count… otherwise you'll end up with incomplete PGs.
Also note that having more failure domains (i.e. - hosts) than your ’n’ value 
is recommended.
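
To make that concrete, either of the following would give CRUSH a solvable 
problem with 6 hosts (sketches only - the profile/pool names are hypothetical, 
and the profile must be set before the pool is created):

ceph osd erasure-code-profile set cctv-ec-osd k=7 m=3 plugin=isa technique=reed_sol_van ruleset-failure-domain=osd
# ...or keep host as the failure domain and shrink k+m to fit the host count:
ceph osd erasure-code-profile set cctv-ec-42 k=4 m=2 ruleset-failure-domain=host
ceph osd pool create cctv-data 4096 4096 erasure cctv-ec-42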

Beyond that, you're likely to run into operational challenges putting that many 
drives behind a single CPU complex when the host count is quite low. My $.02.
--
Paul

On Nov 10, 2015, at 2:29 AM, Mike Almateia <mike.almat...@gmail.com> wrote:

Hello.

For our CCTV stream-storage project we decided to use a Ceph cluster with an EC 
pool.
The input requirements are not scary: max. 15 Gbit/s of incoming traffic from 
the CCTV, 30 days of retention, 99% write operations, and the cluster must be 
able to grow without downtime.

Our current vision of the architecture is:
* 6 JBODs with 90 x 8TB HDDs each (540 HDDs total)
* 6 Ceph servers, each connected to its own JBOD (we will have 6 pairs: 1 
server + 1 JBOD).

Ceph server hardware details:
* 2 x E5-2690v3: 24 cores (w/o HT), 2.6 GHz each
* 256 GB DDR4 RAM
* 4 x 10 Gbit/s NIC ports (2 for the client network and 2 for the cluster network)
* servers also have 4 (8) x 2.5" SATA HDDs on board for the cache tiering 
feature (because Ceph clients can't talk directly to an EC pool)
* two SAS HBA controllers with multipathing, for an HA scenario
* for Ceph monitor functionality, 3 of the servers have 2 SSDs in software RAID 1

Some Ceph configuration rules:
* EC pools with K=7 and M=3
* EC plugin - ISA
* technique = reed_sol_van
* ruleset-failure-domain = host
* near full ratio = 0.75
* OSD journal partition on the same disk

We think the first and second problems will be CPU and RAM on the Ceph servers.

Any ideas? Can it fly?






--
Paul Evans
Principal Architect
Daystrom Technology Group
m: 707-479-1034   o: 800-656-3224 x511
f: 650-472-4005    e: paul.ev...@daystrom.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] high density machines

2015-09-03 Thread Paul Evans
Echoing what Jan said, the 4U Fat Twin is the better choice of the two options, 
as it is very difficult to get long-term reliable and efficient operation of 
many OSDs when they are serviced by just one or two CPUs.
I don’t believe the FatTwin design has much of a backplane, primarily sharing 
power and cooling. That said: the cost savings would need to be solid to choose 
the FatTwin over 1U boxes, especially as (personally) I dislike lots of 
front-side cabling in the rack.
--
Paul Evans


On Sep 3, 2015, at 7:01 AM, Gurvinder Singh <gurvindersinghdah...@gmail.com> wrote:

Hi,

I am wondering if anybody in the community is running a Ceph cluster with
high-density machines, e.g. Supermicro SYS-F618H-OSD288P (288 TB),
Supermicro SSG-6048R-OSD432 (432 TB) or some other high-density
machines. I am assuming that the installation will be of petabyte scale,
as you would want to have at least 3 of these boxes.

It would be good to hear about experiences with them in terms of reliability and
performance (especially during node failures). As these machines have a
40 Gbit network connection it may be OK, but experience from real users
would be great to hear, as these machines are mentioned in the reference
architecture published by Red Hat and Supermicro.

Thanks for your time.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH RBD with ESXi

2015-07-20 Thread Paul Evans
Hi Nikhil. We just posted slides from Ceph Day (Los Angeles) about the use of 
iSCSI with Ceph at Electronic Arts. The slides can be found here if 
you want to review them. (Note: it doesn't answer your specific question, but 
might provide some insights.)
--
Paul

On Jul 20, 2015, at 11:07 AM, Nikhil Mitra (nikmitra) <nikmi...@cisco.com> wrote:

Hi,

Has anyone implemented Ceph RBD with the VMware ESXi hypervisor? Just looking 
to use it as a native VMFS datastore to host VMDKs. Please let me know if 
there are any documents out there that might point me in the right direction to 
get started on this. Thank you.

Regards,
Nikhil Mitra


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC cluster design considerations

2015-07-05 Thread Paul Evans
On Jul 4, 2015, at 2:44 PM, Adrien Gillard <gillard.adr...@gmail.com> wrote:

Lastly, regarding Cluster Throughput:  EC seems to require a bit more CPU and 
memory than straight replication, which begs the question of how much RAM and 
CPU are you putting into the chassis?  With proper amounts, you should be able 
to hit your throughput targets.
Yes, I have read about that. I was thinking 64 GB of RAM (maybe overkill, even 
with the 1 GB of RAM per TB? but I would rather have an optimal RAM 
configuration in terms of DIMMs / channels / CPU) and 2x8 Intel cores per host 
(around 2 GHz per core). As the cluster will be used for backups, the goal is 
not to be limited by the storage backend during the backup window overnight. I 
do not expect much load during daytime.

64G is "OK" provided you tune the system well and DON'T add extra services onto 
your OSD nodes.  If you'll also have 3 of them acting as MONs, more memory is 
advised (probably 96-128G).
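
The rough arithmetic behind that, assuming the 9 x 4TB OSD layout you describe:

9 OSDs x 4TB = 36TB per host  ->  ~36G by the 1GB-per-TB rule of thumb
+ OS, page cache, and EC encode/decode overhead  ->  64G is workable but tight
+ a co-located MON (leveldb plus peak recovery load)  ->  96-128G is safer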

 At the moment I am planning to have a smaller dedicated node for the master 
monitor (~8 cores, 32G RAM, SSD) and virtual machines for MON 2 and 3 (with 
enough resources and virtual disks on SSD)


It would be good to have others comment on the practicality of this design, as 
I don’t believe there is a benefit to having a single MON that is ‘better' than 
the other two. My reasoning comes from a limited understanding of the Paxos 
implementation within Ceph, which suggests that a majority of MONs must be 
available at all times (i.e. - 2 of the 3), and that MON activities will be 
processed according to the speed of the slowest quorum member.  If two of the 
MONs are running as VMs on OSD hosts, and you have a write-heavy workload, I 
can foresee some interesting resource contention issues that might sometimes 
destabilize your entire cluster.   YMMV.
- Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC cluster design considerations

2015-07-03 Thread Paul Evans
HI Adrien.  I can offer some feedback, and have a couple of questions myself:

1)  if you're going to deploy 9x4TB OSDs per host, with 7 hosts, and 4+2 EC, do 
you really want to put extra OSDs in 'inner' drive bays if the target capacity 
is 100TB?   My rough calculations indicate 150TB usable capacity from your 
baseline of 9 OSDs per host, which makes the use of non-hot-swap bays somewhat 
troublesome over time.  That said: taking down a host is easy if the cluster 
isn't loaded, and problematic if you want to maintain write throughput.
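
For reference, the rough math behind that 150TB figure:

7 hosts x 9 OSDs x 4TB            = 252TB raw
x 4/6 usable fraction for 4+2 EC  = ~168TB
less near-full headroom and filesystem overhead = roughly 150TB practically usable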

2) In regards to the ‘fine tuning of how OSDs are marked out’:  many production 
clusters these days are tuned to minimize the impact of recovery & backfill 
operations by limiting the number of operations allowed, or are simply left in 
a ’noout’ state to allow an administrator to make the decision about recovery.  
If you’re faced with backfilling 4-16TB worth of data while under full write 
load, it will take a while. Running in a ‘noout’ state might be best while your 
cluster remains small (<20 nodes).
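
For what it's worth, the usual knobs look something like this (values are 
illustrative, not a recommendation):

ceph osd set noout        # keep OSDs from being marked out during planned maintenance
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
ceph osd unset noout      # once the node is back and healthy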

Also:  how are you going to access the Ceph cluster for your backups? Perhaps 
via a block device?  If so, you’ll need a cache tier in front of the EC pool.
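
If you do go that route, the tier wiring is roughly as follows (pool names are 
placeholders; sizing the cache is its own topic):

ceph osd tier add backup-ec backup-cache
ceph osd tier cache-mode backup-cache writeback
ceph osd tier set-overlay backup-ec backup-cache
ceph osd pool set backup-cache hit_set_type bloom
ceph osd pool set backup-cache target_max_bytes <bytes-to-fit-your-SSDs>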

Lastly, regarding Cluster Throughput:  EC seems to require a bit more CPU and 
memory than straight replication, which begs the question of how much RAM and 
CPU are you putting into the chassis?  With proper amounts, you should be able 
to hit your throughput targets.
--
Paul

On Jul 3, 2015, at 6:47 AM, Adrien Gillard <gillard.adr...@gmail.com> wrote:

Hi everyone,

I am currently looking at Ceph to build a cluster to backup VMs. I am 
leveraging the solution against others like traditionnal SANs, etc. and to this 
point Ceph is economically more interesting and technically more challenging 
(not to bother me :) ).

OSD hosts would be based on Dell R730xd hardware; I plan to put 3 SSDs and 9 
OSDs (4TB) per host.
I need approximately 100TB and, in order to save some space and still get the 
level of resiliency you can expect for backups, I am leaning towards EC (4+2) 
and 7 hosts.

I would like some input on the questions that still remain :

 - I can put more OSDs directly inside the server (up to 4 additional disks), but 
that would require powering down the host to replace an "inner" OSD in case of 
failure. I was thinking I could add 3 internal disks to have 12 OSDs per node 
instead of 9 for higher density, at the cost of more complex maintenance 
and higher risk for the cluster, as there would be 4 OSD journals per SSD 
instead of 3. How manageable is bringing down a complete node to replace a 
disk? noout will surely come into play. How will the cluster behave when the 
host is back online to sync data?

 - I also wonder about SSD failure, even if I intend to use Intel 3700 or at 
least 3610, in order not to be bothered with such issues :) So, in case of an 
SSD failure, the cluster should start backfilling / rebalancing the data of 3 
to 4 OSDs. With proper monitoring and spare disks, one could replace the SSD 
within hours and avoid the impact of backfilling lots of data, but this would 
require fine-tuning of how OSDs are marked out. I know it is bending the 
natural features and behaviours of Ceph a bit, but has anyone tested this 
approach? With custom monitoring scripts or otherwise? Do you think it can be 
considered, or is the only way to buy SSDs that can sustain the load? Also, 
same question as above: how does Ceph handle down OSDs that come back up after 
a while?

 - My goal is to reach a bandwidth of several hundred MB/s of mostly 
sequential writes. Do you think a cluster of this type and size will be able to 
handle it? The only benchmarks I could find on the mailing list are Loic's on 
EC plugins and Roy's on a full-SSD EC backend.

Lots of thanks in advance,

Adrien




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple journals and an OSD on one SSD doable?

2015-06-07 Thread Paul Evans
Cameron,  Somnath already covered most of these points, but I’ll add my $.02…

The key question to me is this: will these 1TB SSDs perform well as a journal 
target for Ceph?  They'll need to be fast at synchronous writes to fill that 
role, and if they aren't, I would use them for other OSD-related tasks and get 
the right SSDs for the journal workload.  For more thoughts on the matter, read 
below…


  *   1TB-capacity SSDs for journals are certainly overkill... unless the 
underlying SSD controller is able to extend the lifespan of the SSD by using 
the unallocated portions.  I would normally put the extra 950G of capacity to 
use, either as a cache tier or an isolated pool depending on the workload… but 
both of those efforts have their own considerations too, especially regarding 
performance and fault domains, which brings us to...
  *   Performance is going to vary depending on the SSD you have: is it PCIe, 
NVMe, SATA, or SAS?  The connection type and SSD characteristics need to 
sustain the amount of bandwidth and IOPS you need for your workload, especially 
as you'll be doing double writes if you use them as both journals and some 
kind of OSD storage (either cache tier or dedicated pool).  Also, do you *know* 
if these SSDs handle writes effectively?  Many SSDs don't perform well for the 
types of journal writes that Ceph performs.  Somnath already mentioned placing 
the primary OSDs on the spare space - a good way to get a boost in read 
performance if your Ceph architecture will support it.
  *   Fault domain is another consideration: the more journals you put on one 
SSD, the larger your fault domain will be.  If you have non-enterprise SSDs 
this is an important point, as the wrong SSD will die quickly in a busy cluster.
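
As a concrete sketch of the multi-journal layout (hammer-era ceph-disk, 
hypothetical device names: sdb/sdc/sdd are spinners, sdf is the 1TB SSD) - each 
prepare call carves another journal partition out of the SSD and leaves the 
remainder free for whatever you decide to do with it:

ceph-disk prepare /dev/sdb /dev/sdf
ceph-disk prepare /dev/sdc /dev/sdf
ceph-disk prepare /dev/sdd /dev/sdf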


--
Paul


On Jun 7, 2015, at 1:48 PM, cameron.scr...@solnet.co.nz wrote:

Setting up a Ceph cluster and we want the journals for our spinning disks to be 
on SSDs but all of our SSDs are 1TB. We were planning on putting 3 journals on 
each SSD, but that leaves 900+GB unused on the drive, is it possible to use the 
leftover space as another OSD or will it affect performance too much?

Thanks,

Cameron Scrace
Infrastructure Engineer

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions

2015-06-02 Thread Paul Evans
Kenneth,
  My guess is that you’re hitting the cache_target_full_ratio on an individual 
OSD, which is easy to do since most of us tend to think of the 
cache_target_full_ratio as an aggregate of the OSDs (which it is not according 
to Greg Farnum).   This posting may shed more light on the issue, if it is 
indeed what you are bumping up against.  
https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg20207.html
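
As a rough illustration using the target_max_bytes quoted below: 14 * 75GB is 
roughly 1.05TB for the whole cache pool, and with cache_target_full_ratio at 
0.8 eviction should hold the pool near ~840GB... but because the ratio is 
evaluated per OSD/PG rather than pool-wide, one unbalanced or hot OSD can hit 
its share of that limit and block writes well before the aggregate numbers 
look full.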

  BTW: how are you determining that your OSDs are ‘not overloaded?’  Are you 
judging that by iostat utilization, or by capacity consumed?
--
Paul


On Jun 2, 2015, at 9:53 AM, Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:

Hi,

we were rsync-streaming with 4 CephFS clients to a Ceph cluster with a cache 
layer on top of an erasure-coded pool.
This had been going on for some time, and we didn't have real problems.

Today we added 2 more streams, and very soon we saw some strange behaviour:
- We are getting blocked requests on our cache pool OSDs
- Our cache pool is often near / at max ratio
- Our data streams have very bursty IO (streaming a few hundred MB for a minute 
and then nothing)

Our OSDs are not overloaded (neither the EC nor the cache OSDs, checked with 
iostat), though it seems the cache pool cannot evict objects in time, and gets 
blocked until that is OK, each time again.
If I raise the target_max_bytes limit, it starts streaming again until it is 
full again.

cache parameters we have are these:
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((14*75*1024*1024*1024))
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8


What could be the issue here? I tried to find some information about the 'cache 
agent', but could only find some old references.

Thank you!

Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kicking 'Remapped' PGs

2015-05-07 Thread Paul Evans
It brings some comfort to know you found it weird too.
In the end, we noted that the tunables were in 'legacy' mode - a holdover from 
prior experimentation, and a possible source of how we ended up with the 
remapped PGs in the first place.  Setting that back to 'firefly' cleared up the 
remaining two 'remapped' PGs, bringing them online and restoring the cluster to 
health.
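For the record, the change itself was a one-liner, run with the expectation of 
some data movement:

ceph osd crush tunables firefly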
Thanks for the tips along the way to getting back to 'healthy', Greg!   (And it 
would still be great to have a targeted command to kick a PG.)


On May 7, 2015, at 8:58 PM, Gregory Farnum <g...@gregs42.com> wrote:

This is pretty weird to me. Normally those PGs should be reported as
active, or stale, or something else in addition to remapped. Sam
suggests that they're probably stuck activating for some reason (which
is a state in new enough code, but not all versions), but I can't tell
or imagine why from these settings. You might have hit a bug I'm not
familiar with that will be jostled by just restarting the OSDs in
question... :/
-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kicking 'Remapped' PGs

2015-05-05 Thread Paul Evans
Gregory Farnum <g...@gregs42.com> wrote:

Oh. That's strange; they are all mapped to two OSDs but are placed on
two different ones. I'm...not sure why that would happen. Are these
PGs active? What's the full output of "ceph -s"?

Those 4 PGs went inactive at some point, and we had the luxury of time to 
understand how we arrived at this state before we truly have to fix it (but 
that time is soon).
So... we kicked a couple of OSDs out yesterday to let the cluster re-shuffle 
things (osd.19 and osd.34… both of which were non-primary copies of the 'acting' 
PG map) and now the cluster status is even more interesting, IMHO:

ceph@nc48-n1:/ceph-deploy/nautilus$ ceph -s
cluster 68bc69c1-1382-4c30-9bf8-480e32cc5b92
 health HEALTH_WARN 2 pgs stuck inactive; 2 pgs stuck unclean; nodeep-scrub 
flag(s) set; crush map has legacy tunables
 monmap e1: 3 mons at 
{nc48-n1=10.253.50.211:6789/0,nc48-n2=10.253.50.212:6789/0,nc48-n3=10.253.50.213:6789/0},
 election epoch 564, quorum 0,1,2 nc48-n1,nc48-n2,nc48-n3
 osdmap e80862: 94 osds: 94 up, 92 in
flags nodeep-scrub
  pgmap v1954234: 6144 pgs, 2 pools, 35251 GB data, 4419 kobjects
91727 GB used, 245 TB / 334 TB avail
6140 active+clean
   2 remapped
   2 active+clean+scrubbing
ceph@nc48-n1:/ceph-deploy/nautilus$ ceph pg dump_stuck
ok
pg_stat  objects  mip  degr  unf  bytes       log   disklog  state
11.e2f   280      0    0     0    2339844181  3001  3001     remapped
11.323   282      0    0     0    2357186647  3001  3001     remapped

pg_stat  state_stamp                 v            reported      up      up_primary  acting   acting_primary
11.e2f   2015-04-23 13:18:59.299589  68310'51082  80862:121916  [77,4]  77          [77,34]  77
11.323   2015-04-23 13:18:58.970396  70105'48961  80862:126346  [0,37]  0           [0,19]   0

pg_stat  last_scrub   scrub_stamp                 last_deep_scrub  deep_scrub_stamp
11.e2f   68310'51082  2015-04-23 11:40:11.565487  0'0              2014-10-20 13:41:46.122624
11.323   70105'48961  2015-04-23 11:47:02.980145  8145'44375       2015-03-30 16:09:36.975875

--
Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kicking 'Remapped' PGs

2015-05-03 Thread Paul Evans
Thanks, Greg.  Following your lead, we discovered the proper 'set_choose_tries 
xxx’ value had not been applied to *this* pool’s rule, and we updated the 
cluster accordingly. We then moved a random OSD out and back in to ‘kick’ 
things, but no joy: we still have the 4 ‘remapped’ PGs.  BTW: the 4 PGs look OK 
from a basic rule perspective: they’re on different OSDs/on different Hosts, 
which is what we’re concerned with… but it seems CRUSH has different goals for 
them and they are inactive.
So... back to the basic question: can we get just the 'remapped' PGs to re-sort 
themselves without causing massive data movement… or is a complete re-sort the 
only way to get to a desired CRUSH state?

As for the force_create_pg command: if it creates a blank PG element on a 
specific OSD (yes?), what happens to an existing PG element on other OSDs? 
Could we use force_create_pg followed by a ‘pg repair’ command to get things 
back to the proper state (in a very targeted way)?

For reference, below is the (reduced) output of dump_stuck:

pg_stat  objects  mip  degr  unf  bytes       log   disklog  state
11.6e5   284      0    0     0    2366787669  3012  3012     remapped
11.8bb   283      0    0     0    2349260884  3001  3001     remapped
11.e2f   280      0    0     0    2339844181  3001  3001     remapped
11.323   282      0    0     0    2357186647  3001  3001     remapped

pg_stat  state_stamp                 v            reported      up      up_pri  acting   acting_pri
11.6e5   2015-04-23 13:19:02.373507  68310'49068  78500:123712  [0,92]  0       [0,84]   0
11.8bb   2015-04-23 13:19:02.550735  70105'49776  78500:125026  [0,92]  0       [0,88]   0
11.e2f   2015-04-23 13:18:59.299589  68310'51082  78500:119555  [77,4]  77      [77,34]  77
11.323   2015-04-23 13:18:58.970396  70105'48961  78500:123987  [0,37]  0       [0,19]   0



On Apr 30, 2015, at 10:30 AM, Gregory Farnum <g...@gregs42.com> wrote:

Remapped PGs that are stuck that way mean that CRUSH is failing to map
them appropriately — I think we talked about the circumstances around
that previously. :) So nudging CRUSH can't do anything; it will just
fail to map them appropriately again. (And indeed this is what happens
whenever anyone does something to that PG or the OSD Map gets
changed.)

The force_create_pg command does exactly what it sounds like: it tells
the OSDs which should currently host the named PG to create it. You
shouldn't need to run it and I don't remember exactly what checks it
goes through, but it's generally for when you've given up on
retrieving any data out of a PG whose OSDs died and want to just start
over with a completely blank one.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kicking 'Remapped' PGs

2015-04-29 Thread Paul Evans
In one of our clusters we sometimes end up with PGs that are mapped incorrectly 
and settle into a 'remapped' state (forever).  Is there a way to nudge a 
specific PG to recalculate placement and relocate the data?  One option that 
we're *dangerously* unclear about is the use of ceph pg force_create_pg <pgid>. 
Is this a viable command to use in a 'remapped' situation?

Paul Evans
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interesting problem: 2 pgs stuck in EC pool with missing OSDs

2015-04-12 Thread Paul Evans
Thank you Loic & Greg. We followed the troubleshooting directions and ran the 
crushtool in test mode to verify that CRUSH was giving up too soon, and then 
confirmed that changing the set_choose_tries value to 100 would resolve the 
issue (it did).
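
For anyone retracing this, the test was along these lines (a sketch - the rule 
id and rep count shown are illustrative; use your own pool's values):

ceph osd getcrushmap -o crush.map
crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 8 --min-x 1 --max-x 1024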
We then implemented the change in the cluster, while also changing the tunable 
‘choose_total_tries’ to 150 from 50 (without that bump it seemed that we could 
still get a bad mapping).
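
The relevant parts of the decompiled map ended up looking roughly like this 
(rule name and numbers are illustrative):

tunable choose_total_tries 150
...
rule ecpool {
        ruleset 1
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}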
It only took a few minutes for the remaining 2 PGs to successfully 
re-distribute their data, and we have finally reached HEALTH_OK.   Thanks!
--
Paul Evans


On Apr 8, 2015, at 10:36 AM, Loic Dachary <l...@dachary.org> wrote:

Hi Paul,

Contrary to what the documentation states at

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

the crush ruleset can be modified (an update at 
https://github.com/ceph/ceph/pull/4306 will fix that). Placement groups will 
move around, but that's to be expected.
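
The usual edit-and-reinject cycle applies, even when the rule is referenced by 
an EC pool (a sketch):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt (e.g. add 'step set_choose_tries 100' to the rule), then:
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new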

Cheers

On 06/04/2015 20:40, Paul Evans wrote:
Thanks for the insights, Greg.  It would be great if the CRUSH rule for an EC 
pool could be dynamically changed… but if that's not the case, the troubleshooting 
doc also offers up the idea of adding more OSDs, and we have another 8 OSDs 
(one from each node) we can move into the default root.
However: just to clarify the point of adding OSDs: the current EC profile has a 
failure domain of ‘host’... will adding more OSDs still improve the odds of 
CRUSH finding a good mapping within the given timeout period?

BTW, I’m a little concerned about moving all 8 OSDs at once, as we’re skinny on 
RAM, and the EC pools seem to like more RAM than replicated pools do. 
Considering the RAM issue, is adding 2-4 OSDs at a time the recommendation? 
(other than adding more RAM).

--
Paul Evans
This looks like it's just the standard risk of using a pseudo-random
algorithm: you need to "randomly" map 8 pieces into 8 slots. Sometimes
the CRUSH calculation will return the same 7 slots so many times in a
row that it simply fails to get all 8 of them inside of the time
bounds that are currently set.

If you look through the list archives we've discussed this a few
times, especially Loïc in the context of erasure coding. See
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
for the fix.
But I think that doc is wrong and you can change the CRUSH rule in use
without creating a new pool — right, Loïc?
-Greg


--
Loïc Dachary, Artisan Logiciel Libre


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-09 Thread Paul Evans
al /var/lib/ceph/osd/ceph-$i/journal --op 
> remove --pgid $j ; done ; done
> 
> Then I imported the PGs onto OSD.0 and OSD.15 with:
> for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data 
> /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op 
> import --file ~/${j}.export ; done ; done
> for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm 
> /var/log/ceph/osd/ceph-$i/journal ; done
> 
> Then I moved the disks back to Storage1 and started them all back up again. I 
> think that this should have worked but what happened in this case was that 
> OSD.0 didn't start up for some reason. I initially thought that that wouldn't 
> matter because OSD.15 did start and so we should have had everything but a 
> ceph pg query of the PGs showed something like:
> "blocked": "peering is blocked due to down osds",
> "down_osds_we_would_probe": [0],
> "peering_blocked_by": [{
> "osd": 0,
> "current_lost_at": 0,
> "comment": "starting or marking this osd lost may let us proceed"
> }]
> 
> So I then removed OSD.0 from the cluster and everything came back to life. 
> Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interesting problem: 2 pgs stuck in EC pool with missing OSDs

2015-04-06 Thread Paul Evans
Thanks for the insights, Greg.  It would be great if the CRUSH rule for an EC 
pool could be dynamically changed… but if that's not the case, the troubleshooting 
doc also offers up the idea of adding more OSDs, and we have another 8 OSDs 
(one from each node) we can move into the default root.
However: just to clarify the point of adding OSDs: the current EC profile has a 
failure domain of ‘host’... will adding more OSDs still improve the odds of 
CRUSH finding a good mapping within the given timeout period?

BTW, I’m a little concerned about moving all 8 OSDs at once, as we’re skinny on 
RAM, and the EC pools seem to like more RAM than replicated pools do. 
Considering the RAM issue, is adding 2-4 OSDs at a time the recommendation? 
(other than adding more RAM).

--
Paul Evans

This looks like it's just the standard risk of using a pseudo-random
algorithm: you need to "randomly" map 8 pieces into 8 slots. Sometimes
the CRUSH calculation will return the same 7 slots so many times in a
row that it simply fails to get all 8 of them inside of the time
bounds that are currently set.

If you look through the list archives we've discussed this a few
times, especially Loïc in the context of erasure coding. See
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
for the fix.
But I think that doc is wrong and you can change the CRUSH rule in use
without creating a new pool — right, Loïc?
-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com