[ https://issues.apache.org/jira/browse/HDFS-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-3726:
------------------------------

    Attachment: hdfs-3726.txt

The attached patch introduces the improvement described above.

There is a new unit test, and I also tested manually as follows:

- Start cluster configured to write to QJM
- Start 10 threads performing HDFS transactions (sketched below)
- Restart one JN
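
For reference, a minimal sketch of the kind of load generator this implies (class name and paths hypothetical; the real test drove a live cluster configured for QJM):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical load generator: 10 threads issuing namespace ops so the
// NN writes a steady stream of edits to the JNs while one JN restarts.
public class EditLogLoadGen {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    for (int t = 0; t < 10; t++) {
      final int id = t;
      new Thread(new Runnable() {
        public void run() {
          try {
            // Runs until killed; each mkdirs/delete pair is a pair of
            // namespace transactions logged through QJM.
            for (int i = 0; ; i++) {
              Path p = new Path("/loadgen/thread-" + id + "/dir-" + i);
              fs.mkdirs(p);
              fs.delete(p, true);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }).start();
    }
  }
}
{code}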

This used to cause incessant log spew on the console of the restarted JN. With 
the patch, the restart instead produced just the following on the server side:

{code}
12/09/03 20:55:55 INFO ipc.Server: IPC Server handler 0 on 13001, call 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from 
127.0.0.1:47669: error: 
org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't 
write, no segment open
org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't 
write, no segment open
        at 
org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:384)
        at 
org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:278)
        at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:121)
        at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:111)
        at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:12442)
{code}

and on the NN:

{code}
12/09/03 20:55:55 WARN client.QuorumJournalManager: Remote journal Channel to 
journal node localhost/127.0.0.1:13001 is not in sync. Will retry on next roll.
{code}

The web UI noted: "Written txid 33668659 (120608 behind) (will re-join on next 
segment)"

Upon the next roll, it logged:

{code}
12/09/03 20:59:09 INFO namenode.FSEditLog: Rolling edit logs.
12/09/03 20:59:09 INFO namenode.FSEditLog: Ending log segment 33668125
12/09/03 20:59:09 INFO namenode.FSEditLog: Number of transactions: 133332 Total 
time for transactions(ms): 2171Number of transactions batched in Syncs: 102034 
Number of syncs: 31297 SyncTimes(ms): 31840 9550 7644 
12/09/03 20:59:09 INFO namenode.FSEditLog: Number of transactions: 133332 Total 
time for transactions(ms): 2171Number of transactions batched in Syncs: 102034 
Number of syncs: 31298 SyncTimes(ms): 31844 9550 7644 
12/09/03 20:59:11 INFO namenode.FileJournalManager: Finalizing edits file 
/tmp/name1-name/current/edits_inprogress_0000000000033668125 -> 
/tmp/name1-name/current/edits_0000000000033668125-0000000000033801456
12/09/03 20:59:11 INFO namenode.FileJournalManager: Finalizing edits file 
/tmp/name1-name2/current/edits_inprogress_0000000000033668125 -> 
/tmp/name1-name2/current/edits_0000000000033668125-0000000000033801456
12/09/03 20:59:11 INFO namenode.FSEditLog: Starting log segment at 33801457
12/09/03 20:59:11 INFO client.QuorumJournalManager: Retrying Channel to journal 
node localhost/127.0.0.1:13001 in new segment starting at txid 33801457
{code}

and the restarted JN was up-to-date again.
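
The client-side behavior boils down to latching an out-of-sync flag on the logger channel and clearing it when the next segment starts. A rough sketch of the idea (class and method names hypothetical, not the actual patch code):

{code}
import java.io.IOException;

// Rough sketch (names hypothetical): once a logger misses an RPC, mark
// it out of sync and stop sending journal() RPCs for the rest of the
// segment; the flag clears when a new segment starts.
class LoggerChannelSketch {
  private boolean outOfSync = false;

  void journal(byte[] edits) throws IOException {
    if (outOfSync) {
      // Fail fast locally instead of sending an RPC that is
      // guaranteed to fail on the JN side.
      throw new IOException("Channel is out of sync; will retry on next roll");
    }
    try {
      sendJournalRpc(edits);
    } catch (IOException e) {
      outOfSync = true;  // latch until the next roll
      throw e;
    }
  }

  void startLogSegment(long txid) throws IOException {
    sendStartLogSegmentRpc(txid);
    outOfSync = false;  // re-join the quorum in the new segment
  }

  private void sendJournalRpc(byte[] edits) throws IOException { /* RPC */ }
  private void sendStartLogSegmentRpc(long txid) throws IOException { /* RPC */ }
}
{code}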

                
> QJM: if a logger misses an RPC, don't retry that logger until next segment
> --------------------------------------------------------------------------
>
>                 Key: HDFS-3726
>                 URL: https://issues.apache.org/jira/browse/HDFS-3726
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3726.txt
>
>
> Currently, if a logger misses an RPC in the middle of a log segment, or 
> misses the {{startLogSegment}} RPC (e.g., it was down or the network was 
> disconnected during that period), then it will throw an exception on 
> every subsequent {{journal()}} call in that segment, since it knows that it 
> missed some edits in the middle.
> We should change this exception to a specific IOE subclass, and have the 
> client side of QJM detect the situation and stop sending IPCs until the next 
> {{startLogSegment}} call.
> This isn't critical for correctness but will help reduce log spew on both 
> sides.
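
For context, the specific IOE subclass described above is the JournalOutOfSyncException visible in the server-side log earlier; a simplified sketch of the exception and the server-side guard (paraphrased, not the exact patch code):

{code}
import java.io.IOException;

// Simplified sketch (paraphrased, not the exact patch): a dedicated
// IOException subclass lets the QJM client distinguish "this JN is out
// of sync" from other RPC failures and back off until the next segment.
public class JournalOutOfSyncException extends IOException {
  public JournalOutOfSyncException(String msg) {
    super(msg);
  }
}

// On the JournalNode, journal() guards against writing when no segment
// is open, along the lines of:
//
//   checkSync(curSegment != null, "Can't write, no segment open");
//
// where checkSync is roughly:
//
//   private void checkSync(boolean expression, String msg, Object... args)
//       throws JournalOutOfSyncException {
//     if (!expression) {
//       throw new JournalOutOfSyncException(String.format(msg, args));
//     }
//   }
{code}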

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
