On Tue, Dec 3, 2013 at 2:03 PM, Jonathan Hsieh <j...@cloudera.com> wrote:
> On Tue, Dec 3, 2013 at 11:42 AM, Enis Söztutar <enis....@gmail.com> wrote:
> >
> > On Mon, Dec 2, 2013 at 10:20 PM, Jonathan Hsieh <j...@cloudera.com> wrote:
> > >
> > > Deveraj:
> > > > Jonathan Hsieh, WAL per region (WALpr) would give you the locality (and hence HDFS short circuit) of reads if you were to couple it with the favored nodes. The cost is of course more WAL files... In the current situation (no WALpr) it would create quite some cross-machine traffic, no?
> > >
> > > I think we all agree that wal per region isn't efficient in today's spinning hard drive world, where we are limited to a relatively low budget of seeks (though it may be better in the future with SSDs).
> >
> > WALpr makes sense in a fully SSD world and if hdfs had journaling for writes. I don't think anybody is working on this yet.
>
> What do you mean by journaling for writes? Do you mean where sync operations update the length at the nn on every call?

I think the hdfs guys were using "super sync" to refer to that. I was referring to a journaling file system (http://en.wikipedia.org/wiki/Journaling_file_system), where the writes to multiple files are persisted to a journal disk so that you do not pay the constant seeks for writing to a lot of files (the region wals) in parallel.

> > Full SSD clusters are already in place (Pinterest, for example), so I think having WALpr as a pluggable implementation makes sense. HBase should work with both WAL-per-regionserver (or multi) and WAL-per-region.
>
> I agree here.
>
> > > With this in mind, I am actually making the case that we would group all the regions from RS-A onto the same set of preferred region servers. This way we only need to have one or two other RS's tailing the RS.
> > >
> > > So for example, if regions X, Y and Z were on RS-A and its hlog, the shadow region memstores for X, Y, and Z would be assigned to the same one or two other RSs. Ideally this would be where the HLog file replicas have locality (helped by favored nodes/block affinity). Doing this, we hold the number of readers on the active hlogs to a constant number and do not add any new cross-machine traffic (though tailing currently has costs on the NN).
> > >
> > > One inefficiency we have is that if there is a single log per RS, we end up reading log entries for tables that may not have the shadow feature enabled. However, with HBase multi-wals coming, one strategy is to shard wals to a number on the order of the number of disks on a machine (12-24 these days). I think a wal per namespace (this could be used to have a wal per table) would make sense. Sharding the hlog this way would reduce the amount of reading of irrelevant log entries in a log-tailing scheme. It would have the added benefit of reducing the log splitting work, thus reducing MTTR, and allowing for recovery priorities if the primaries and shadows also go down. (This is a generalization of the idea of separating META out into its own log.)
> > >
> > > Jon.
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // j...@cloudera.com
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // j...@cloudera.com
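
To make the "wal per namespace" sharding point above more concrete, here is a minimal sketch. WalShardPolicy and its methods are made-up names for illustration, not actual HBase code; it just shows how edits could be hashed by namespace into a fixed number of wal shards so a tailing shadow RS only reads the shards for the namespaces it shadows:

// Illustrative sketch only -- WalShardPolicy is hypothetical, not an HBase class.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WalShardPolicy {

  private final int numShards; // e.g. 12-24, roughly one shard per disk

  public WalShardPolicy(int numShards) {
    this.numShards = numShards;
  }

  /** Map a table's namespace to the index of the wal shard its edits go to. */
  public int shardFor(String namespace) {
    return (namespace.hashCode() & Integer.MAX_VALUE) % numShards;
  }

  /** The shards a tailing shadow RS must read, given the namespaces it shadows. */
  public List<Integer> shardsToTail(List<String> shadowNamespaces) {
    List<Integer> shards = new ArrayList<Integer>();
    for (String ns : shadowNamespaces) {
      int shard = shardFor(ns);
      if (!shards.contains(shard)) {
        shards.add(shard);
      }
    }
    return shards;
  }

  public static void main(String[] args) {
    WalShardPolicy policy = new WalShardPolicy(16);
    // A shadow RS hosting shadows for only two namespaces tails at most two
    // shards instead of every edit written by the primary RS.
    System.out.println(policy.shardsToTail(Arrays.asList("meta", "userspace")));
  }
}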
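
Similarly, a rough sketch of the grouping idea for the shadow memstores, again with made-up names (ShadowPlacement is hypothetical): all regions from a primary RS share the same one or two shadow RSs, ideally chosen from the favored nodes holding that RS's hlog block replicas, so the active hlog always has a constant number of tailers:

// Illustrative sketch only -- ShadowPlacement is hypothetical, not an HBase class.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShadowPlacement {

  /**
   * @param regionsOnPrimary regions currently hosted on the primary RS (e.g. X, Y, Z)
   * @param favoredNodes     RSs holding replicas of the primary's hlog blocks
   * @param numShadows       number of shadow copies to keep (one or two)
   * @return region -> shadow RSs; every region shares the same shadow hosts
   */
  public Map<String, List<String>> placeShadows(List<String> regionsOnPrimary,
      List<String> favoredNodes, int numShadows) {
    List<String> shadowHosts =
        favoredNodes.subList(0, Math.min(numShadows, favoredNodes.size()));
    Map<String, List<String>> placement = new HashMap<String, List<String>>();
    for (String region : regionsOnPrimary) {
      // Same hosts for every region from this primary: the hlog gets a constant
      // number of tailing readers no matter how many regions the primary carries.
      placement.put(region, shadowHosts);
    }
    return placement;
  }

  public static void main(String[] args) {
    ShadowPlacement p = new ShadowPlacement();
    System.out.println(p.placeShadows(
        Arrays.asList("regionX", "regionY", "regionZ"),
        Arrays.asList("rs-b", "rs-c", "rs-d"), 2));
  }
}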