Re: Proposal for adding disable FileJournal option
It is not only for consistency between memory and disk. The key point is to implement the atomicity of a transaction: when a transaction needs to write an object and update the pglog at the same time, we must make sure the two IOs either both happen or neither does. With the journal, when an OSD recovers from a failure, the replay process can redo the transaction. I think that is why the journal cannot be disabled.

On 11 January 2014 13:24, Haomai Wang haomaiw...@gmail.com wrote:

On Fri, Jan 10, 2014 at 11:13 AM, Gregory Farnum g...@inktank.com wrote:

Exactly. We can't do a safe update without a journal — what if power goes out while the write is happening? When we boot back up, we don't know what version the object is actually at. So if you're using btrfs, you can run without a journal already (and depend on snapshots for recovering after failures); if you are using xfs or ext4, a journal is required for any safety at all, even when it's fronted by a cache pool.

I don't fully agree with that. Why can't we call fdatasync() during each transaction to ensure consistency if a cache sits in front of it?

On Thu, Jan 9, 2014 at 7:08 PM, Dong Yuan yuandong1...@gmail.com wrote:

The journal is part of the implementation of the ObjectStore Transaction interface, and the transaction is used by the PG to write the pglog together with the object data in one transaction. So if the FileJournal could be disabled, something else would have to implement the Transaction interface. That seems hard while no local filesystem provides such a facility, in my opinion.

On 10 January 2014 10:04, Haomai Wang haomaiw...@gmail.com wrote:

On Fri, Jan 10, 2014 at 1:28 AM, Gregory Farnum g...@inktank.com wrote:

The FileJournal is also for data safety whenever we're using write-ahead. To disable it we need a backing store that we know can provide us consistent checkpoints (i.e., we can use parallel journaling mode — so for the FileJournal, that means btrfs, or maybe zfs someday). But for those systems you can already configure the system not to use a journal.

Yes, it depends on the backend. For example, FileStore can write an object with a sync to ensure consistency. If we add a disable-FileJournal option, some work is needed on FileStore to implement it.

On Thu, Jan 9, 2014 at 12:13 AM, Haomai Wang haomaiw...@gmail.com wrote:

Hi all,

We know the FileJournal plays an important role in the FileStore backend: it can hugely reduce write latency and speed up small write operations. But in practice there are exceptions, such as when we already use FlashCache or a cache pool (although the latter isn't ready). With a cache pool enabled, we may want a journal in the cache pool but not in the base pool. The main reason to drop the journal in the base pool is that the journal takes over a whole physical device, which is too wasteful there. Likewise, if I enable FlashCache or some other cache, I'd rather not enable the journal at the OSD layer. So is it worth being able to disable the journal in this (not really so special) case?
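To make the atomicity argument above concrete, here is a minimal sketch of write-ahead journaling, in Python rather than Ceph's actual C++ FileJournal, with a made-up journal path and record format: the whole transaction is made durable in the journal before either store update begins, and replay on restart redoes anything that may have been interrupted.

    import json, os

    JOURNAL = "journal.log"  # hypothetical path/format, not Ceph's on-disk layout

    def journal_write(tx):
        # Append the whole transaction (object write + pglog update) as one
        # record, then fdatasync so it is durable before the store is touched.
        with open(JOURNAL, "a") as j:
            j.write(json.dumps(tx) + "\n")
            j.flush()
            os.fdatasync(j.fileno())

    def apply(tx):
        # Apply each op to the backing store. A power loss can tear these
        # writes, which is exactly why the journal record must land first.
        for path, data in tx.items():
            with open(path, "w") as f:
                f.write(data)
                f.flush()
                os.fdatasync(f.fileno())

    def replay():
        # On OSD restart, redo every journaled transaction. Ops are
        # idempotent, so re-applying an already-applied record is harmless.
        if os.path.exists(JOURNAL):
            with open(JOURNAL) as j:
                for line in j:
                    try:
                        apply(json.loads(line))
                    except ValueError:
                        break  # torn final record from a crash mid-append

    replay()  # crash recovery: object and pglog become consistent again
    tx = {"obj_A": "new data", "pglog_A": "entry 42"}
    journal_write(tx)  # durable first...
    apply(tx)          # ...then applied; a crash anywhere is now recoverable

Note how this relates to the fdatasync() question above: calling fdatasync() on the data writes alone makes each write durable, but it cannot make two writes atomic; a crash between the object write and the pglog write still leaves them inconsistent, and that is the gap the journal record closes.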
Re: Pyramid erasure codes and replica hinted recovery
On 11/01/2014 00:40, Kyle Bader wrote:

I've been researching what features might be necessary in Ceph to build multi-site RADOS clusters, whether for purposes of scale or to meet SLA requirements more stringent than is achievable with a single datacenter. According to [1], typical datacenter availability estimates used in the industry range from 99.7% for tier II to 99.98% and 99.995% for tiers III and IV respectively. Combine that with the possibility of a border and/or core networking meltdown and it's all but impossible to achieve a Ceph service SLA that requires 3-5 nines of availability in a single facility.

When we start looking at multi-site network configurations we need to make sure there is sufficient cluster-level bandwidth for the following activities:

1. Write fan-out from replication on ingest
2. Backfills from OSD recovery
3. Backfills from OSD remapping

Number 1 can be estimated from historical usage with some additional padding for traffic spikes. Recovery backfills (number 2) can be roughly estimated from the size of the disk population in each facility and the OSD annualized failure rate. Number 3 makes multi-site configurations extremely challenging unless the organization building the cluster is willing to pay 7 zeros for 5 nines. Consider the following:

    1x 16-port 40GbE switch, 8x ports used for access, 8x used for inter-site links (4x 10GbE breakout per port)
    32x Ceph OSD nodes with a 10GbE cluster link (working out to ~3PB raw)

Topology:

    [A]-[B]
     \   /
      \ /
      [C]

Since 40GbE is likely only an option if running over dark fiber, non-blocking multi-site would require a total of 12 leased 10GbE lines (6 for 2:1 oversubscription, 3 for 4:1). These lines will be extremely stressed each and every time capacity is added to the cluster, because PGs will be remapped and the OSD that is new to a PG must be backfilled by the primary, which may be at another site (for 3x replication). Erasure coding with regular MDS codes, or even pyramid codes, will exhibit similar issues, as described in [2] and [3].

It would be fantastic to see Ceph gain a facility similar to what I describe in this bug for replication: http://tracker.ceph.com/issues/7114

For erasure coding, something similar to Facebook's LRC as described in [2] would be advantageous. For example, RS(8:4:2):

    [k][k][k][k][k][k][k][k] -> [k][k][k][k][k][k][k][k][m][m][m][m]

Split over 3 sites:

    [k][k][k][k]
    [k][k][k][k]
    [k][k][k][k]

Generate 2 more parity units per site:

    [k][k][k][k][m][m]
    [k][k][k][k][m][m]
    [k][k][k][k][m][m]

Now if each *set* of units could be placed such that they share a common ancestor in the CRUSH hierarchy, then local unit sets from the lower level of the pyramid could be remapped/recovered without consuming inter-site bandwidth (maybe treat each set as a replica instead of treating each individual unit as a replica). Thoughts?
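A toy sketch of the local-recovery property being asked for here, using plain XOR local parity over made-up byte blocks rather than the full RS/LRC construction from [2]: a single lost unit inside one site's set is rebuilt entirely from that site's surviving units, so no inter-site bandwidth is consumed.

    import os
    from functools import reduce

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    UNIT = 16  # toy unit size; real chunks would be megabytes

    # Three sites, four data units per site (the lower level of the pyramid).
    sites = [[os.urandom(UNIT) for _ in range(4)] for _ in range(3)]

    # One XOR "local" parity per site; the global RS parities that would
    # protect against whole-site loss are omitted from this sketch.
    local_parity = [reduce(xor, units) for units in sites]

    # Lose one unit at site 0 and rebuild it from site-local data only:
    lost = sites[0][2]
    survivors = [u for i, u in enumerate(sites[0]) if i != 2]
    rebuilt = reduce(xor, survivors, local_parity[0])
    assert rebuilt == lost  # no inter-site traffic was needed

The point is only that recovery of a single unit stays inside the set that shares a CRUSH ancestor; cross-site links are touched only when an entire set is lost.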
If we had RS(6:3:3) (6 data chunks, 3 coding chunks, 3 local chunks), the following rule could be used to spread them over 3 datacenters:

    rule erasure_ruleset {
        ruleset 1
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take root
        step choose indep 3 type datacenter
        step choose indep 4 type device
        step emit
    }

    crushtool -o /tmp/t.map --num_osds 500 --build node straw 10 datacenter straw 10 root straw 0
    crushtool -d /tmp/t.map -o /tmp/t.txt
    # edit the ruleset as above
    crushtool -c /tmp/t.txt -o /tmp/t.map
    crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 1 --x 1 --num-rep 12

    rule 1 (erasure_ruleset), x = 1..1, numrep = 12..12
    CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
    rule 1 (erasure_ruleset) num_rep 12 result size == 12: 1/1

399 is in datacenter 3, node 9, device 9, etc. It shows that the first four are in datacenter 3, the next four in datacenter 0 and the last four in datacenter 2. If the function calculating the erasure code spreads local chunks evenly (321, 12, 213 for instance), they will effectively be located as you suggest. Andreas may have a different view on this question though.

In case 78 goes missing (and assuming all other chunks are good), it can be rebuilt with 51, 9 and 12 only. However, if the primary driving the reconstruction is 270, data will need to cross datacenter boundaries. Would it be cheaper to elect a primary closest (in the sense of get_common_ancestor_distance, https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.h#L487) to the OSD to be recovered? Only Sam or David could give you an authoritative answer.

Cheers

[1] http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024
[2] http://arxiv.org/pdf/1301.3791.pdf
[3] https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36737.pdf

--
Loïc Dachary, Artisan Logiciel Libre
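To illustrate the "closest primary" idea, here is a hypothetical sketch in plain Python, not CrushWrapper's actual C++ API: each OSD's CRUSH location is approximated as a path of buckets, the common-ancestor distance is the number of levels above the leaf at which two paths first meet, and the survivor with the smallest distance is elected to drive the rebuild. The locations below are invented, loosely matching the mapping above.

    # osd -> (root, datacenter, node). Real code would query the CRUSH map.
    loc = {
        399: ("root", "dc3", "node9"),
        51:  ("root", "dc0", "node5"),
        9:   ("root", "dc0", "node0"),
        12:  ("root", "dc0", "node1"),
        270: ("root", "dc2", "node7"),
        78:  ("root", "dc0", "node7"),
    }

    def ancestor_distance(a, b):
        # Levels above the leaf before the two paths share a bucket; a toy
        # stand-in for CrushWrapper::get_common_ancestor_distance.
        shared = 0
        for x, y in zip(loc[a], loc[b]):
            if x != y:
                break
            shared += 1
        return len(loc[a]) - shared

    def elect_primary(lost, survivors):
        # Prefer the survivor topologically closest to the OSD being rebuilt.
        return min(survivors, key=lambda osd: ancestor_distance(lost, osd))

    # Rebuilding 78: 51 (same datacenter) beats 270 (a datacenter away).
    print(elect_primary(78, [270, 51, 399]))  # -> 51

The min() over survivors is only an illustration; a real implementation would have to weigh this against Ceph's existing primary-selection constraints.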