[ https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026802#comment-13026802 ]

Jitendra Nath Pandey commented on HDFS-1580:
--------------------------------------------

> In the file-based storage there's no clean way to seek to a particular 
> transaction ID
 A savenamespace will be preceded by a call to mark (like the current roll). A 
file implementation can close the current file and start a new file at that 
point. Therefore, in normal operation, when a namenode starts up, loads an 
fsimage, and requests to read the transactions after that point, it will most 
likely find a file that starts at the next transaction id.
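 As a rough illustration of that behavior (class and method names here are hypothetical, not the HDFS API): a file-based journal can treat mark as "close the current segment, start a new file named by its first txid", so a reader looking for transactions after an fsimage finds a segment that begins exactly there.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a file-based journal that rolls on mark().
class FileJournalSketch {
    private final List<String> segments = new ArrayList<>(); // segment file names
    private long nextTxId = 1;
    private boolean open = false;

    // mark(): close the current segment and open a new one whose name records
    // the first txid it will contain, analogous to the roll before savenamespace.
    void mark() {
        open = false;                      // close the current file
        segments.add("edits_" + nextTxId); // new file named by its first txid
        open = true;
    }

    void logTransaction() {
        if (!open) mark();
        nextTxId++;
    }

    List<String> segmentNames() { return segments; }
}
```

With this naming scheme, locating the segment that starts at a given transaction id is a directory listing rather than a scan.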
 Alternatively, a file implementation can ignore mark and close a file every 
100000 transactions. If it then has to seek to the 50000th transaction, it can 
simply read and discard the preceding transactions. Since transaction files are 
read only for checkpointing, at namenode startup, or by the backup at failover, 
this is not very expensive: in a recent measurement the namenode loaded 1.4M 
transactions in 27 seconds.
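 The seek arithmetic for that alternative is simple. A sketch (illustrative names only): with segments closed every fixed number of transactions, the segment containing a txid and the count of transactions to read and ignore are both direct computations.

```java
// Illustrative only: fixed-size segments, rolled every SEGMENT_SIZE txns.
class SeekSketch {
    static final long SEGMENT_SIZE = 100000;

    // First txid of the segment file that contains txId.
    static long segmentStart(long txId) {
        return ((txId - 1) / SEGMENT_SIZE) * SEGMENT_SIZE + 1;
    }

    // Number of transactions to read and discard before reaching txId.
    static long toSkip(long txId) {
        return txId - segmentStart(txId);
    }
}
```

So seeking to the 50000th transaction means opening the segment that starts at txid 1 and discarding 49999 records, which is cheap at the measured load rate above.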

 Also, if we store edit logs in BookKeeper, the 2NN can read from BookKeeper 
directly and there will be no need for edit transfer; that is another 
attraction of using BookKeeper.

> This seems like a somewhat serious flaw. If we anticipate using BK for HA.. 
  Agreed that the backup will lag behind the primary, but on failover it can 
quickly read the additional transactions before declaring itself active. Won't 
that be an acceptable delay? There is some discussion of this in 
ZOOKEEPER-1016.

> Another way of doing this is to say that, if an implementation does have this 
> limitation, it can choose to "mark" whenever it likes.
  That is correct; however, mark will still be useful in the interface, to be 
called before a savenamespace.

> Most operations write the edit to the log while holding the FSN lock (to 
> ensure serialized order between ops) and then drop the FSN lock to sync
  Good catch! A sync method is needed in EditLogOutputStream, to be called 
after releasing the lock.
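 A minimal sketch of the pattern being discussed (class names are illustrative, not the HDFS API): the edit is appended to an in-memory buffer while holding the namesystem lock, which fixes the serialized order, and the durable sync happens only after the lock is released.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: write under the FSN lock, sync to durable storage outside it.
class LoggingSketch {
    private final ReentrantLock fsnLock = new ReentrantLock();
    private final StringBuilder buffer = new StringBuilder();  // in-memory edit buffer
    private final StringBuilder durable = new StringBuilder(); // stands in for the disk

    void logEdit(String op) {
        fsnLock.lock();
        try {
            synchronized (this) {
                buffer.append(op).append('\n'); // order serialized by the FSN lock
            }
        } finally {
            fsnLock.unlock();
        }
        sync(); // slow durability work happens after the FSN lock is released
    }

    // The kind of sync method being proposed for EditLogOutputStream:
    // flush buffered edits to durable storage.
    synchronized void sync() {
        durable.append(buffer);
        buffer.setLength(0);
    }

    String durableContents() { return durable.toString(); }
}
```

The point of the split is that slow I/O in sync() never blocks other namespace operations waiting on the FSN lock.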

> edit log transfer right now is based around the concept of discrete files 
> which can be entirely fetched, with an associated md5sum
  I think it should be the file storage implementation's responsibility to keep 
an md5sum with every file, so the safety check while transferring files can 
still be supported.
  This interface does not manage the transfer of edit logs; it only covers 
reading and writing transactions from/to a storage. When the 2NN wants to do a 
checkpoint, it will download the files from the primary, then obtain an 
EditLogInputStream object for the edit log files through this interface, and 
read the transactions.
 For BookKeeper storage, transfer will not be required.
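 The per-file safety check itself needs nothing beyond the standard JDK digest API. A sketch (helper names are hypothetical): recompute the MD5 of the downloaded edits file and compare it with the digest recorded at the source.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the per-file md5 check a file storage implementation could keep.
class Md5Check {
    // Hex-encoded MD5 digest of the file's bytes.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // true iff the downloaded bytes match the digest recorded at the source.
    static boolean verify(byte[] downloaded, String recordedMd5)
            throws NoSuchAlgorithmException {
        return md5Hex(downloaded).equals(recordedMd5);
    }
}
```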

> md5sum /data/{1..4}/dfs/name/current/
  If we use a system like BookKeeper, we won't have the ability to perform this 
sanity check anyway. For the different file storages, the ability will continue 
to exist, because a) mark will be called for all journal instances at the same 
time, and b) even if the file storage implementation closes a file every 100000 
transactions, it will do so consistently across all files.

> Refer to the discussion on HDFS-1073 about this property.
 Sure, I will look at it.


> Add interface for generic Write Ahead Logging mechanisms
> --------------------------------------------------------
>
>                 Key: HDFS-1580
>                 URL: https://issues.apache.org/jira/browse/HDFS-1580
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: EditlogInterface.1.pdf, HDFS-1580+1521.diff, 
> HDFS-1580.diff, HDFS-1580.diff, HDFS-1580.diff, generic_wal_iface.pdf, 
> generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.txt
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
