Re: Proposal for adding disable FileJournal option

2014-01-11 Thread Dong Yuan
It is not only about consistency between memory and disk. The key point
is to implement the atomicity of a transaction.

That is, when a transaction needs to write an object and update the
pglog at the same time, we must make sure the two IOs either both happen
or neither does.

With the journal, when the OSD recovers from a failure, the replay
process can redo the transaction. I think that is why the journal cannot
be disabled.
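
A minimal sketch of the write-ahead idea above, assuming an in-memory
stand-in for the journal and the backing store; all names (TxRecord,
Journal, Store) are made up for illustration and are not the real
FileJournal/FileStore code:

// One transaction = object write + pglog update, appended to the journal,
// made durable, then applied; after a crash, committed-but-unapplied
// records are replayed so the pair is never observed half-done.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct TxRecord {
    std::string object_name, object_data;
    std::string pglog_key, pglog_entry;
};

struct Journal {                       // stand-in for an fsync'd on-disk journal
    std::vector<TxRecord> entries;
    uint64_t committed = 0;            // durable up to here
    uint64_t applied = 0;              // applied to the backing store up to here
    void append(const TxRecord& t) { entries.push_back(t); committed = entries.size(); }
};

struct Store {                         // stand-in for the FileStore backing filesystem
    std::map<std::string, std::string> objects, pglog;
    void apply(const TxRecord& t) {
        objects[t.object_name] = t.object_data;
        pglog[t.pglog_key] = t.pglog_entry;
    }
};

// On OSD start-up, redo everything that reached the journal but not the store.
void replay(Journal& j, Store& s) {
    for (uint64_t i = j.applied; i < j.committed; ++i) s.apply(j.entries[i]);
    j.applied = j.committed;
}

int main() {
    Journal j;
    Store s;
    j.append({"obj.0", "DATA", "pglog.0", "epoch 5, version 42"});
    // ... crash here: the record is durable, the store was never touched ...
    replay(j, s);                      // redo on restart; object and pglog agree
    std::cout << s.objects["obj.0"] << " / " << s.pglog["pglog.0"] << "\n";
}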

On 11 January 2014 13:24, Haomai Wang haomaiw...@gmail.com wrote:
 On Fri, Jan 10, 2014 at 11:13 AM, Gregory Farnum g...@inktank.com wrote:
 Exactly. We can't do a safe update without a journal — what if power
 goes out while the write is happening? When we boot back up, we don't
 know what version the object is actually at. So if you're using btrfs,
 you can run without a journal already (and depend on snapshots for
 recovering after failures); if you are using xfs or ext4 a journal is
 required for any safety at all, even when it's fronted by a cache
 pool.

 I don't fully agree with that. Why can't we call fdatasync() during each
 transaction to ensure consistency if there is a cache in front of it?
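
A sketch of where the fdatasync()-only approach discussed here falls
short, assuming plain POSIX calls and made-up paths: each write can be
made durable on its own, but there is no point at which the object write
and the pglog write are durable together, which is the atomicity gap the
journal closes.

// Each call is individually durable, but a crash between the two calls
// leaves the object updated and the pglog stale (or vice versa).
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

static bool durable_write(const char* path, const char* buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return false;
    bool ok = write(fd, buf, len) == (ssize_t)len && fdatasync(fd) == 0;
    close(fd);
    return ok;
}

int main() {
    const char* data = "object payload";
    const char* entry = "pglog: epoch 5, version 42";
    durable_write("/tmp/object.0", data, strlen(data));
    // <-- power loss here: on restart the OSD cannot tell which version the
    //     object is actually at, exactly the failure Greg describes.
    durable_write("/tmp/pglog.0", entry, strlen(entry));
}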


 On Thu, Jan 9, 2014 at 7:08 PM, Dong Yuan yuandong1...@gmail.com wrote:
 The Journal is part of the implementation of the ObjectStore Transaction
 interface, and a transaction is used by the PG to write the pglog
 together with the object data in a single transaction.
 So I think if the FileJournal could be disabled, there must be something
 else to implement the Transaction interface. But that seems hard, in my
 opinion, while no local file system provides such a function.
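
A simplified sketch of the calling pattern described here, i.e. the PG
bundling the object write and the pglog update into one transaction and
handing it to the backend in a single call; the names (Op, Transaction,
Backend, queue_transaction) are hypothetical and not the real
ObjectStore::Transaction API:

// Whatever replaces the FileJournal must still make this bundle all-or-nothing.
#include <iostream>
#include <string>
#include <vector>

struct Op { std::string kind, target, payload; };

struct Transaction {
    std::vector<Op> ops;
    void write(std::string obj, std::string data) { ops.push_back({"write", obj, data}); }
    void log(std::string key, std::string entry)  { ops.push_back({"pglog", key, entry}); }
};

struct Backend {
    virtual void queue_transaction(const Transaction& t) = 0;   // must be atomic
    virtual ~Backend() = default;
};

struct PrintingBackend : Backend {               // stand-in that just shows the bundle
    void queue_transaction(const Transaction& t) override {
        for (const Op& op : t.ops)
            std::cout << op.kind << " " << op.target << " = " << op.payload << "\n";
    }
};

int main() {
    Transaction t;
    t.write("obj.0", "object payload");          // object data and ...
    t.log("pglog.0", "epoch 5, version 42");     // ... pglog entry travel together
    PrintingBackend b;
    b.queue_transaction(t);                      // one call, one atomic unit
}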


 On 10 January 2014 10:04, Haomai Wang haomaiw...@gmail.com wrote:
 On Fri, Jan 10, 2014 at 1:28 AM, Gregory Farnum g...@inktank.com wrote:

 The FileJournal is also for data safety whenever we're using write
 ahead. To disable it we need a backing store that we know can provide
 us consistent checkpoints (i.e., we can use parallel journaling mode —
 so for the FileJournal, we're using btrfs, or maybe zfs someday). But
 for those systems you can already configure the system not to use a
 journal.

 Yes, it depends on the backend. For example, FileStore can write an
 object with a sync to ensure consistency. If we add a disable-FileJournal
 option, we need some work on FileStore to implement it.

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Thu, Jan 9, 2014 at 12:13 AM, Haomai Wang haomaiw...@gmail.com wrote:
  Hi all,
 
  We know FileJournal plays an important role in the FileStore backend: it
  can hugely reduce write latency and improve small write operations.

  But in practice there are exceptions, such as when we already use
  FlashCache or cachepool (although it's not ready).

  If cachepool is enabled, we may want a journal in the cache_pool but not
  in the base_pool. The main reason to drop the journal in the base_pool is
  that the journal takes over a whole physical device, which is wasted in
  the base_pool.

  Like the above, if I enable FlashCache or another cache, I would rather
  not enable the journal at the OSD layer.

  So is it necessary to be able to disable the journal in such special (not
  really special) cases?
 
  Best regards,
  Wheats
 
 
 




 --

 Best Regards,

 Wheat



 --
 Dong Yuan
 Email:yuandong1...@gmail.com



 --
 Best Regards,

 Wheat



-- 
Dong Yuan
Email:yuandong1...@gmail.com


Re: Pyramid erasure codes and replica hinted recovery

2014-01-11 Thread Loic Dachary


On 11/01/2014 00:40, Kyle Bader wrote:
 I've been researching what features might be necessary in Ceph to
 build multi-site RADOS clusters, whether for purposes of scale or to
 meet SLA requirements more stringent than is achievable with a single
 datacenter. According to [1], typical [datacenter] availability
 estimates used in the industry range from 99.7% for tier II to 99.98%
 and 99.995% for tiers III and IV respectively. Combine that with the
 possibility of a border and/or core networking meltdown and it's all but impossible
 to achieve a Ceph service SLA that requires 3-5 nines of availability
 in a single facility.
 
 When we start looking at multi-site network configurations we need to
 make sure there is sufficient cluster level bandwidth for the
 following activities:
 
 1. Write fan-out from replication on ingest
 2. Backfills from OSD recovery
 3. Backfills from OSD remapping
 
 Number 1 can be estimated based on historical usage with some
 additional padding for traffic spikes. Recovery backfills can be
 roughly estimated based on the size of the disk population in each
 facility and the OSD annualized failure rate. Number 3 makes
 multi-site configurations extremely challenging unless the
 organization building the cluster is willing to pay 7 zeros for 5
 nines.
 
 Consider the following:
 
 1x 16x40GbE switch with 8x used for access ports, 8x used for
 inter-site (x4 10GbE breakout per port)
 32x Ceph OSD nodes with a 10GbE cluster link (working out to ~3PB raw)
 
 Topology:
 
 [A]---[B]
   \   /
    \ /
    [C]
 
 Since 40GbE is likely only an option if running over dark fiber,
 non-blocking multi-site would require a total of 12 leased 10GbE
 lines, 6 for 2:1 oversubscription, and 3 for 4:1. These lines will be
 severely stressed every time capacity is added to the cluster, because
 PGs will be remapped and the OSD that is new to a PG will need to be
 backfilled by the primary at another site (for 3x
 replication). Erasure coding with regular MDS codes or even pyramid
 codes will exhibit similar issues, as described in [2] and [3]. It
 would be fantastic to see Ceph have a facility similar to what I
 describe in this bug for replication:
 
 http://tracker.ceph.com/issues/7114
 
 For erasure coding, something similar to Facebook's LRC as described
 in [2] would be advantageous. For example:
 
 RS(8:4:2)
 
 [k][k][k][k][k][k][k][k] - [k][k][k][k][k][k][k][k][m][m][m][m]
 
 Split over 3 sites
 
 [k][k][k][k]  [k][k][k][k]  [k][k][k][k]
 
 Generate 2 more parity units
 
 [k][k][k][k][m][m]  [k][k][k][k][m][m]   [k][k][k][k][m][m]
 
 Now if each *set* of units could be placed such that they share a
 common ancestor in the CRUSH hierarchy then local unit sets from the
 lower level of the pyramid could be remapped/recovered without
 consuming inter-site bandwidth (maybe treat each set as a replica
 instead of treating each individual unit as a replica).
 
 Thoughts?
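
A toy illustration of the local-repair property described above,
simplified to a single XOR parity per site group (the LRC in [2] uses
more elaborate local parities); a unit lost inside a group is rebuilt
from that group alone, without crossing site boundaries:

// One site group: 4 data units plus their local XOR parity.
#include <array>
#include <cstdint>
#include <iostream>

using Unit = uint8_t;                     // one byte stands in for a chunk

int main() {
    std::array<Unit, 4> k = {0x11, 0x22, 0x33, 0x44};
    Unit local_parity = Unit(k[0] ^ k[1] ^ k[2] ^ k[3]);

    // lose unit 2; recover it from the surviving local units + local parity
    Unit recovered = Unit(local_parity ^ k[0] ^ k[1] ^ k[3]);
    std::cout << std::hex << "recovered 0x" << int(recovered)
              << (recovered == k[2] ? " (matches)" : " (mismatch)") << "\n";
}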

If we had RS(6:3:3), i.e. 6 data chunks, 3 coding chunks and 3 local chunks, the
following rule could be used to spread them over 3 datacenters:

rule erasure_ruleset {
        ruleset 1
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take root
        step choose indep 3 type datacenter
        step choose indep 4 type device
        step emit
}

crushtool -o /tmp/t.map --num_osds 500 --build node straw 10 datacenter straw 10 root straw 0
crushtool -d /tmp/t.map -o /tmp/t.txt # edit the ruleset as above
crushtool -c /tmp/t.txt -o /tmp/t.map ; crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 1 --x 1 --num-rep 12
rule 1 (erasure_ruleset), x = 1..1, numrep = 12..12
CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
rule 1 (erasure_ruleset) num_rep 12 result size == 12:  1/1

399 is in datacenter 3, node 9, device 9, and so on. It shows that the first four
chunks are in datacenter 3, the next four in datacenter 0 and the last four in
datacenter 2.

If the function calculating the erasure code spreads the local chunks evenly (to
321, 12 and 213, for instance), they will effectively be located as you suggest.
Andreas may have a different view on this question though.

In case 78 goes missing (and assuming all other chunks are good), it can be
rebuilt from 51, 9 and 12 only. However, if the primary driving the
reconstruction is 270, data will need to cross datacenter boundaries. Would it
be cheaper to elect a primary that is closest (in the sense of
get_common_ancestor_distance,
https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.h#L487) to the
OSD to be recovered? Only Sam or David could give you an authoritative answer.
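
A small sketch of the locality above, assuming the synthetic map built by
the crushtool --build command (10 devices per node, 10 nodes per
datacenter, so for this map only datacenter = osd / 100); it lists the
repair peers that share a datacenter with the lost chunk:

#include <iostream>
#include <vector>

int datacenter_of(int osd) { return osd / 100; }   // valid for this synthetic map only

int main() {
    // the placement printed by crushtool above
    std::vector<int> mapping = {399,344,343,321,51,78,9,12,274,263,270,213};
    int failed = 78;                                // the chunk assumed lost above

    std::cout << "local peers of osd." << failed << ":";
    for (int osd : mapping)
        if (osd != failed && datacenter_of(osd) == datacenter_of(failed))
            std::cout << " " << osd;                // prints 51, 9 and 12
    std::cout << "\n";
    // a primary outside datacenter 0 (e.g. 270) would instead pull these
    // chunks across the site boundary to drive the reconstruction
}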

Cheers

 
 [1] http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024
 [2] http://arxiv.org/pdf/1301.3791.pdf
 [3] https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36737.pdf
 

-- 
Loïc Dachary, Artisan Logiciel Libre





[no subject]

2014-01-11 Thread Songjiang Zhao

subscribe ceph-devel