[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570305#comment-17570305 ] Hui Fei commented on HDFS-15079: This critical issue has been resolved. Thanks you all here. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Assignee: ZanderXu >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > Time Spent: 2h 20m > Remaining Estimate: 0h > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562192#comment-17562192 ] ZanderXu commented on HDFS-15079: - Thanks [~hexiaoqiao][~ayushtkn][~elgoiri], I have created a [PR-4530|https://github.com/apache/hadoop/pull/4530] to fix this issue, please help me review it. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Labels: pull-request-available > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > Time Spent: 10m > Remaining Estimate: 0h > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561997#comment-17561997 ] Xiaoqiao He commented on HDFS-15079: [~xuzq_zander] Thanks to involve me here. I am not fans about `CallerContext`, but considering this is serious issue, I totally support to push this solution forward to fix it first, then improve it if need. Actually, in my practice, I try to extend RPC interface to support super user to get/set ClientId and CallId for Router, this solution is also applied to `data locality` and some other similar cases. It seems to work fine for more than two years. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561729#comment-17561729 ] ZanderXu commented on HDFS-15079: - Thanks [~elgoiri] and [~ayushtkn] for your comment, I will create a PR like HDFS-13248 to fix this issue. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561642#comment-17561642 ] Ayush Saxena commented on HDFS-15079: - I think RetryCache of Namenode solves this problem, and that uses both callId and clientId to fetch an entry, clientId we are still passing correct, if we pass the callId also from the actual client, this should get sorted. HDFS-13248 added clientIp details, I feel we can do similarly for callId as well and things should be sorted post that. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561554#comment-17561554 ] Íñigo Goiri commented on HDFS-15079: We definitely should fix this issue. Could you provide a PR with your proposal? Otherwise, we should create a PR with the current patch in this JIRA. In addition, maybe we can leverage some of the latest changes we've made to use the caller context in RBF. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561347#comment-17561347 ] ZanderXu commented on HDFS-15079: - [~hexiaoqiao][~ferhui][~ayushtkn] [~elgoiri] Do you plan to push this jira forward? I think this is a serious bug. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461176#comment-17461176 ] Hadoop QA commented on HDFS-15079: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 10s{color} | {color:red}{color} | {color:red} HDFS-15079 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HDFS-15079 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12999335/HDFS-15079.002.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/751/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Hui Fei >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078604#comment-17078604 ] Hadoop QA commented on HDFS-15079: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 50s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 4s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 21m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 24m 18s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 50s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 21m 3s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 21m 3s{color} | {color:red} root generated 1 new + 1870 unchanged - 0 fixed = 1871 total (was 1870) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 3m 21s{color} | {color:orange} root: The patch generated 21 new + 281 unchanged - 1 fixed = 302 total (was 282) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 32s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 27s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 29s{color} | {color:green} hadoop-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}111m 1s{color} | {color:red} hadoop-hdfs in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 43s{color} | {color:red} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 49s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}272m 57s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor | | | hadoop.hdfs.server.datanode.TestBPOfferService | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.server.federation.router.TestRouterFaultTolerant | | | hadoop.hdfs.server.federation.router.TestRouterClientRejectOverload | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:e6455cc864d | | JIRA Issue | HDFS-15079 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12999335/HDFS-15079.002.patch | | Optional Tests | dupname as
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078329#comment-17078329 ] Fei Hui commented on HDFS-15079: Upload workaround patch. * use RetryCache of NN * Call maybe fail when namenode failover with network anomaly, but not overwrite > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, > UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008298#comment-17008298 ] Xiaoqiao He commented on HDFS-15079: Thanks [~ferhui] for your works and ping. I believe this issue is clear and obvious based on above discussion. About the solution, it seems they are same issue with data locality and we will turn back to discussion again. Due to we have to involve other modules to change and both solutions have its own pros and cons, we do not reach the final conclusion up to now. I would like to list both solutions here. a. limited open interface get/set ClientId and CallId to some super user. b. set ClientId/CallId to CallerContext and parse them at NameNode side. The detailed Pros and Cons have stated at HDFS-13248, FYI. In my own opinion, I prefer to solution a + strict access control. pending some other more suggestions? > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006363#comment-17006363 ] Hadoop QA commented on HDFS-15079: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 30s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 27s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 12s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 50s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 8s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 16m 8s{color} | {color:red} root generated 1 new + 1868 unchanged - 0 fixed = 1869 total (was 1868) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 45s{color} | {color:orange} root: The patch generated 14 new + 268 unchanged - 1 fixed = 282 total (was 269) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 54s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 6s{color} | {color:red} hadoop-common-project/hadoop-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 7s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 58s{color} | {color:green} hadoop-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}120m 56s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 57s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 51s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}260m 56s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-common-project/hadoop-common | | | org.apache.hadoop.ipc.RetryCache$CacheEntryContext overrides equals in RetryCache$CacheEntry and may not be symmetric At RetryCache.java:and may not be symmetric At RetryCache.java:[lines 198-205] | | Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeUUID | | | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped | | | hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean | |
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006279#comment-17006279 ] Hadoop QA commented on HDFS-15079: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 45s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 1s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 19m 27s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 32s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 1s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 15m 1s{color} | {color:red} root generated 1 new + 1868 unchanged - 0 fixed = 1869 total (was 1868) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 37s{color} | {color:orange} root: The patch generated 14 new + 268 unchanged - 1 fixed = 282 total (was 269) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 0s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 49s{color} | {color:red} hadoop-common-project/hadoop-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 34s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 30s{color} | {color:red} hadoop-common in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}106m 18s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 45s{color} | {color:red} hadoop-hdfs-rbf in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 47s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}236m 38s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-common-project/hadoop-common | | | org.apache.hadoop.ipc.RetryCache$CacheEntryContext overrides equals in RetryCache$CacheEntry and may not be symmetric At RetryCache.java:and may not be symmetric At RetryCache.java:[lines 198-205] | | Failed junit tests | hadoop.security.TestFixKerberosTicketOrder | | | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes | | | hadoop.hdfs.server.namenode.TestRedudantBlocks | | | hadoop.hdfs.server.federation.router.TestRouterQuota | | | hadoop.fs.contract.router.
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005794#comment-17005794 ] Hadoop QA commented on HDFS-15079: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 58s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 18s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 19s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 38s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 15m 38s{color} | {color:red} root generated 2 new + 1868 unchanged - 0 fixed = 1870 total (was 1868) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 37s{color} | {color:orange} root: The patch generated 21 new + 268 unchanged - 1 fixed = 289 total (was 269) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 54s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 48s{color} | {color:red} hadoop-common-project/hadoop-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 32s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 29s{color} | {color:red} hadoop-common in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}109m 33s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 7m 35s{color} | {color:red} hadoop-hdfs-rbf in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 48s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}246m 32s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-common-project/hadoop-common | | | org.apache.hadoop.ipc.RetryCache$CacheEntryContext overrides equals in RetryCache$CacheEntry and may not be symmetric At RetryCache.java:and may not be symmetric At RetryCache.java:[lines 197-204] | | Failed junit tests | hadoop.security.TestFixKerberosTicketOrder | | | hadoop.hdfs.TestDFSShell | | | hadoop.cli.TestHDFSCLI | | | hadoop.hdfs.TestTrashWithEncryptionZones | | | hadoop.hdfs.TestMaintenanceState | | | hadoop.hdfs.TestDecommissionWithBackoffMonitor | |
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005301#comment-17005301 ] Fei Hui commented on HDFS-15079: Upoad rough fix. I think maybe CallerContext is suitable here. If we change router client clientId and callId, it will have problem for router client retry or failover. And maybe we will add more fields to callercontext if needed. Here are some issues: * Normalize the callerContext,how should we use it. json format or string? * Checking callId for the same clientId is suitable? Delayed callId will be dropped. [~ayushtkn] [~elgoiri] [~hexiaoqiao] Any thoughts? > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004001#comment-17004001 ] Ayush Saxena commented on HDFS-15079: - [~ferhui] if in case you want to follow the previous discussions related to the same Caller context problems, You may check HDFS-13248 , HADOOP-16254, HDFS-13293 We had a lot of discussions on this problem to solve the Data Locality problem, Give a try you get any Ideas or solutions. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003619#comment-17003619 ] Fei Hui commented on HDFS-15079: Thanks [~hexiaoqiao] {quote} ClientId & CallId of request from Router to NameNode are both created by Router itself {quote} Yes. I was wrong, regards clientName as clientId :( Digging in > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003570#comment-17003570 ] Xiaoqiao He commented on HDFS-15079: Hi [~ferhui], IIUC, ClientId & CallId of request from Router to NameNode are both created by Router itself and ClientId matches with connection one by one. Ref. ConnectionPool#newConnection {code:java} Object proxy = RPC.getProtocolProxy(classes.protoPb, version, socket, ugi, conf, factory, RPC.getRpcTimeout(conf), defaultPolicy, null).getProxy(); {code} If router could reuse ClientId and CallId received from client to request NameNode, we should not worry about non-idempotent operation request to different routers and get the wrong result in some case. The only question is this solution has to involve RPC changes, we need to open ClientId & CallId getter and setter to support router handle them. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003430#comment-17003430 ] Fei Hui commented on HDFS-15079: [~ayushtkn] {quote} The namenode Logic that you tend to add, that kind of logic is there in Namenode in form of RetryCache, It checks whether the call isn't a repeated one due to failover, if so, it doesn't execute it again rather sends the old response from the cache. {quote} Great. CallId is the client id i tend to add. CallId and clientId maybe can resolve the problem. For NN clientId is from client, but callId is from routerclient. Try to dig in more too. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003325#comment-17003325 ] Xiaoqiao He commented on HDFS-15079: Thanks [~ayushtkn] for your detailed comments. RetryCache may be one solution, however we have to pass clientid and callid to router (it could need to improve protocol, I am not very sure, will check it later) and reset these two parameter then forward this rpc request to namenode. IIRC, we have discussed this solution for long time about data locality but no conclusion up to now. Another side, if we could do that, I am not sure if there are some security issue such as some one using fake clientid + callid and it could pollute next other rpc requests. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003288#comment-17003288 ] Ayush Saxena commented on HDFS-15079: - Thanx [~ferhui] for the UT. The namenode Logic that you tend to add, that kind of logic is there in Namenode in form of RetryCache, It checks whether the call isn't a repeated one due to failover, if so, it doesn't execute it again rather sends the old response from the cache. You can check {{ClientProtocol.java}} there is an annotation above Create method, you can read its description. That RetryCache logic doesn't seems to be working here, I guess since for it the clientId and callId should be same, but here, since the call is going from different router, not from the same client. So the NN doesn't consider it as a repeated call. Will try to dig in more, but on a quick look, I feel this is the problem. I guess we have a JIRA too as Pass Client CallerContext to NN. May be this issue is also because of that. Not Sure. [~elgoiri] [~hexiaoqiao] Can you also give a check once? > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003233#comment-17003233 ] Fei Hui commented on HDFS-15079: [~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078 > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003153#comment-17003153 ] Fei Hui commented on HDFS-15079: [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:1507) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy19.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy20.create(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135) at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228) at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008) {quote} I think maybe consistency is not guaranteed if resolve it on nn side. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003148#comment-17003148 ] Ayush Saxena commented on HDFS-15079: - Thanx [~ferhui] for the report, The issue is at Router side, and the proposed solution gets changes both at NN and client side, I don't think the proposed solution would get much of agreement, So, I didn't dig in much about this solution. I need to check how actually this is happening, in general case, if it was a namenode-client scenario, I guess the client tries 10 times({{ipc.client.connect.max.retries}}) and then failover to the next. May be this isn't happening? I think I am missing some context. It would be good you can put up a UT to repro this. May be then it could be easy for us to find a solution. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003139#comment-17003139 ] Xiaoqiao He commented on HDFS-15079: Move this JIRA to subtask of HDFS-14603. Thanks [~ferhui] for involves. {quote}Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and failovers to r1. {quote} I am interested in which and where exception throws? a. namenode(throw exception) -> router -> client? b. router(throw exception) -> client? In my experience we try to scheduler client request to the original router if it meet some retry exceptions unless this router set to offline. it may avoid this case. To be honest, this is not very graceful solution. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003138#comment-17003138 ] Íñigo Goiri commented on HDFS-15079: The solution might be a little involved in the NN side. Can we start with a test showing the issue? I'd like to go through the debugger. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002762#comment-17002762 ] Fei Hui commented on HDFS-15079: General Idea: * client generate id and send it with call to namenode * namenode keeps last id for the file of each lease * drop the call if its id less than last id [~ayushtkn] [~elgoiri] [~hexiaoqiao] Any thoughts? > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org