[ 
https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091084#comment-17091084
 ] 

Kihwal Lee commented on HDFS-15287:
-----------------------------------

[~shv], sorry, I guess I was too terse. Here is what happens: the upload is 
clearly a rollback image, but the active namenode still rejects it. When I 
disabled the check in code (changed the default to false), the upload works. 
This is branch-2.10, so the check is not working as you intended. I did not 
check trunk.

{noformat}
2020-04-07 20:17:05,686 [TransferFsImageUpload-62] INFO 
namenode.TransferFsImage: Sending fileName: 
/xxx/current/fsimage_rollback_000000000123456789, fileSize: 591328984. Sent 
total: 655360 bytes. Size of last segment intended to send: 131072 bytes.
java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3479)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3462)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:377)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:321)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:275)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
{noformat}

On the active namenode side, I see:
{noformat}
2020-04-07 20:17:05,686 [qtp2000648320-936001] INFO namenode.ImageServlet: 
ImageServlet allowing checkpointer: hdfs/mycluster....@myrealm.com
2020-04-07 20:17:05,686 [qtp2000648320-936001] WARN conf.Configuration: No unit 
for dfs.namenode.checkpoint.period(43200) assuming SECONDS
{noformat}

This WARN message indicates that the interval check is kicking in.

bq. Active NameNode checks whether to accept a checkpoint from a StandbyNode in 
order to avoid too frequent checkpoints in case of multiple Standby 
checkpointers.

I understand the original intention, but the check breaks existing use cases. 
Normal checkpointing can happen under two conditions: either the configured 
time has passed, or the number of transactions since the last checkpoint has 
exceeded the configured limit. This check rejects images produced under the 
latter condition. That is a legitimate use case, and we have relied on it for 
over a decade. At minimum, please make the check configurable.
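
To make the two conditions concrete, here is a minimal sketch of the logic as I 
understand it. It is not the actual Hadoop code: the class, method names, and 
the checkEnabled flag are hypothetical, the 43200-second period is taken from 
the log above, and the transaction limit is only an illustrative value.

```java
/** Hypothetical sketch; names are illustrative, not Hadoop's actual classes. */
public class CheckpointSketch {
    // 43200 s matches the dfs.namenode.checkpoint.period value in the log above.
    static final long PERIOD_SECS = 43200;
    // Illustrative transaction limit (an assumption, not read from any config).
    static final long TXN_LIMIT = 1_000_000;

    /** Standby side: a normal checkpoint is due when EITHER condition holds. */
    public static boolean shouldCheckpoint(long secsSinceLast, long txnsSinceLast) {
        return secsSinceLast >= PERIOD_SECS || txnsSinceLast >= TXN_LIMIT;
    }

    /** Active side, time-only acceptance: the behaviour reported as a problem. */
    public static boolean acceptTimeOnly(long secsSinceLast) {
        return secsSinceLast >= PERIOD_SECS;
    }

    /**
     * Active side, one possible relaxed check (an assumption, not a patch):
     * skip the check when disabled or for rollback images, and otherwise
     * honour the transaction-count trigger as well as the time trigger.
     */
    public static boolean acceptRelaxed(boolean checkEnabled, boolean isRollbackImage,
                                        long secsSinceLast, long txnsSinceLast) {
        if (!checkEnabled || isRollbackImage) {
            return true;
        }
        return shouldCheckpoint(secsSinceLast, txnsSinceLast);
    }

    public static void main(String[] args) {
        // One minute after the last checkpoint, but 2M transactions later.
        long secs = 60;
        long txns = 2_000_000;
        System.out.println("standby wants to checkpoint: " + shouldCheckpoint(secs, txns));
        System.out.println("time-only check accepts:     " + acceptTimeOnly(secs));
        System.out.println("relaxed check accepts:       "
            + acceptRelaxed(true, false, secs, txns));
        System.out.println("rollback image accepted:     "
            + acceptRelaxed(true, true, secs, 0));
    }
}
```

The point of the sketch is that the standby's trigger and the active's 
acceptance test need to agree; a time-only test on the active side turns away 
uploads that the standby produced for a valid reason.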

> HDFS rollingupgrade prepare never finishes
> ------------------------------------------
>
>                 Key: HDFS-15287
>                 URL: https://issues.apache.org/jira/browse/HDFS-15287
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.10.0, 3.3.0
>            Reporter: Kihwal Lee
>            Priority: Blocker
>
> After HDFS-12979, the prepare step of rolling upgrade does not work. This is 
> because it added an additional check that sufficient time has passed since 
> the last checkpoint. Since RU rollback image creation and upload can happen 
> at any time, uploading of the rollback image never succeeds. On a new cluster 
> deployed for testing, it might work since the cluster has never checkpointed 
> before.
> It was found that this check is disabled for unit tests, defeating the very 
> purpose of testing.


