[jira] [Updated] (HDFS-16283) RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls

2021-11-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-16283:

Attachment: RBF_ improve renewLease() to call only a specific NameNode 
rather than make fan-out calls.pdf

> RBF: improve renewLease() to call only a specific NameNode rather than make 
> fan-out calls
> -
>
> Key: HDFS-16283
> URL: https://issues.apache.org/jira/browse/HDFS-16283
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ improve renewLease() to call only a specific 
> NameNode rather than make fan-out calls.pdf
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently renewLease() against a router fans out to all the NameNodes. Since 
> renewLease() calls are so frequent, if one of the NameNodes is slow, the 
> router queues eventually get blocked by renewLease() calls and cause router 
> degradation.
> We will make a change on the client side to keep track of the NameNode Id in 
> addition to the current fileId, so routers know which NameNodes the client is 
> renewing the lease against.






[jira] [Commented] (HDFS-16283) RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls

2021-10-27 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435067#comment-17435067
 ] 

Aihua Xu commented on HDFS-16283:
-

[~jingzhao] and [~inigoiri] Can you help review the change?

> RBF: improve renewLease() to call only a specific NameNode rather than make 
> fan-out calls
> -
>
> Key: HDFS-16283
> URL: https://issues.apache.org/jira/browse/HDFS-16283
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently renewLease() against a router fans out to all the NameNodes. Since 
> renewLease() calls are so frequent, if one of the NameNodes is slow, the 
> router queues eventually get blocked by renewLease() calls and cause router 
> degradation.
> We will make a change on the client side to keep track of the NameNode Id in 
> addition to the current fileId, so routers know which NameNodes the client is 
> renewing the lease against.






[jira] [Created] (HDFS-16283) RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls

2021-10-25 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-16283:
---

 Summary: RBF: improve renewLease() to call only a specific 
NameNode rather than make fan-out calls
 Key: HDFS-16283
 URL: https://issues.apache.org/jira/browse/HDFS-16283
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: rbf
Reporter: Aihua Xu
Assignee: Aihua Xu


Currently renewLease() against a router fans out to all the NameNodes. Since 
renewLease() calls are so frequent, if one of the NameNodes is slow, the router 
queues eventually get blocked by renewLease() calls and cause router 
degradation.

We will make a change on the client side to keep track of the NameNode Id in 
addition to the current fileId, so routers know which NameNodes the client is 
renewing the lease against.
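
As a rough illustration (this is a sketch of the idea only, not the actual 
HDFS-16283 patch, and the class and field names below are made up), the 
client-side bookkeeping could look like this:

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Tracks which NameNode each open file belongs to, so renewLease() can be
// routed to only the NameNodes the client actually holds leases on.
public class LeaseTargetTracker {
  // nameNodeId -> fileIds currently open against that NameNode
  private final Map<String, Set<Long>> openFilesByNameNode = new HashMap<>();

  public synchronized void fileOpened(String nameNodeId, long fileId) {
    openFilesByNameNode.computeIfAbsent(nameNodeId, k -> new HashSet<>()).add(fileId);
  }

  public synchronized void fileClosed(String nameNodeId, long fileId) {
    Set<Long> files = openFilesByNameNode.get(nameNodeId);
    if (files != null && files.remove(fileId) && files.isEmpty()) {
      openFilesByNameNode.remove(nameNodeId);
    }
  }

  // The router would forward renewLease() only to these NameNodes instead of
  // fanning out to every downstream namespace.
  public synchronized Set<String> nameNodesToRenew() {
    return new HashSet<>(openFilesByNameNode.keySet());
  }
}
{code}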







[jira] [Commented] (HDFS-16200) Improve NameNode failover

2021-09-15 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415781#comment-17415781
 ] 

Aihua Xu commented on HDFS-16200:
-

[~hexiaoqiao] Thanks for checking. Regarding improving topology resolution 
performance, there is TableMapping with precomputed topology info, but you need 
to know the list of hosts and precompute the topology. We could convert the 
script into a built-in implementation, but I believe we would still hit some 
slowness there.
For our particular case, we don't colocate storage with compute, and failover 
improved from over 10 minutes to just seconds by disabling topology resolution. 
There are now more deployments that separate storage and compute. Should we 
have a global configuration to optimize for those cases?
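
For reference, a minimal sketch of switching to the precomputed TableMapping 
resolver mentioned above (these are the standard Hadoop topology keys; the 
table file path is just an example):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.TableMapping;

public class TopologyConfigSketch {
  public static Configuration withTableMapping() {
    Configuration conf = new Configuration();
    // Use the table-based resolver instead of the default topology script.
    conf.setClass("net.topology.node.switch.mapping.impl",
        TableMapping.class, DNSToSwitchMapping.class);
    // Two-column whitespace-separated file: host (name or IP) and rack,
    // one entry per line; it has to be precomputed for all client hosts.
    conf.set("net.topology.table.file.name", "/etc/hadoop/conf/topology.table");
    return conf;
  }
}
{code}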


> Improve NameNode failover
> -
>
> Key: HDFS-16200
> URL: https://issues.apache.org/jira/browse/HDFS-16200
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: namenode
>Affects Versions: 2.8.2
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In a busy cluster, we are noticing that NameNode failover takes a long time 
> (over 10 minutes), which causes cluster downtime during that period.
> One bottleneck lies in resolving the client hosts' topology when the cluster 
> is not colocated with the computing hosts. The NameNode resolves a client 
> host's topology and uses it to sort the hosts where the blocks are located. 
> The topology is cached so subsequent accesses are efficient, but if the 
> standby NameNode is newly restarted, all the client hosts (e.g., YARN hosts) 
> need to be resolved again.
> Possible solutions: 1) expose an API in DFSAdmin to load the topology cache, 
> or 2) add a new configuration to the HDFS cluster to skip resolving topology 
> for non-colocated clusters. Since client hosts and HDFS hosts are not 
> colocated, it's unnecessary to sort the DataNodes for the clients.
>  






[jira] [Created] (HDFS-16200) Improve NameNode failover

2021-08-31 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-16200:
---

 Summary: Improve NameNode failover
 Key: HDFS-16200
 URL: https://issues.apache.org/jira/browse/HDFS-16200
 Project: Hadoop HDFS
  Issue Type: Task
  Components: namenode
Affects Versions: 2.8.2
Reporter: Aihua Xu
Assignee: Aihua Xu


In a busy cluster, we are noticing that NameNode failover takes a long time 
(over 10 minutes), which causes cluster downtime during that period.

One bottleneck lies in resolving the client hosts' topology when the cluster is 
not colocated with the computing hosts. The NameNode resolves a client host's 
topology and uses it to sort the hosts where the blocks are located. The 
topology is cached so subsequent accesses are efficient, but if the standby 
NameNode is newly restarted, all the client hosts (e.g., YARN hosts) need to be 
resolved again.

Possible solutions: 1) expose an API in DFSAdmin to load the topology cache, or 
2) add a new configuration to the HDFS cluster to skip resolving topology for 
non-colocated clusters. Since client hosts and HDFS hosts are not colocated, 
it's unnecessary to sort the DataNodes for the clients.






[jira] [Created] (HDFS-16159) Support resolvePath() in DistributedFileSystem for federated HDFS

2021-08-10 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-16159:
---

 Summary: Support resolvePath() in DistributedFileSystem for 
federated HDFS
 Key: HDFS-16159
 URL: https://issues.apache.org/jira/browse/HDFS-16159
 Project: Hadoop HDFS
  Issue Type: Task
  Components: federation
Affects Versions: 3.3.1
Reporter: Aihua Xu
Assignee: Aihua Xu


DistributedFileSystem needs to support resolvePath(), similar to 
ViewFileSystem, since DistributedFileSystem can be used to talk to a Router 
file system. Clients such as Hive need this functionality to determine the 
physical clusters so they can choose between copying and moving data when the 
source and destination are on different physical clusters, even though both 
are on the same router file system. See HIVE-24742.
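
As a rough sketch of how a client such as Hive could use this (resolvePath() is 
the existing FileSystem API; the copy-vs-move decision logic below is 
illustrative, not Hive's actual implementation):

{code}
import java.io.IOException;
import java.net.URI;
import java.util.Objects;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ResolvePathSketch {
  // resolvePath() should return the fully qualified path on the underlying
  // physical cluster (ViewFileSystem already does this; this JIRA asks
  // DistributedFileSystem backed by routers to do the same).
  static boolean sameUnderlyingCluster(FileSystem fs, Path src, Path dst)
      throws IOException {
    URI srcUri = fs.resolvePath(src).toUri();
    URI dstUri = fs.resolvePath(dst).toUri();
    return Objects.equals(srcUri.getScheme(), dstUri.getScheme())
        && Objects.equals(srcUri.getAuthority(), dstUri.getAuthority());
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path(args[0]);
    Path dst = new Path(args[1]);
    if (sameUnderlyingCluster(fs, src, dst)) {
      fs.rename(src, dst);   // same physical cluster: a cheap metadata move
    } else {
      System.out.println("Different physical clusters: a copy (e.g. DistCp) is needed");
    }
  }
}
{code}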






[jira] [Commented] (HDFS-15905) Improve Router performance with router redirection

2021-03-18 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304294#comment-17304294
 ] 

Aihua Xu commented on HDFS-15905:
-

[~elgoiri], [~jingzhao], [~fengnanli] Can you provide any feedback/suggestion? 
Thanks a lot.

> Improve Router performance with router redirection
> --
>
> Key: HDFS-15905
> URL: https://issues.apache.org/jira/browse/HDFS-15905
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Affects Versions: 3.1.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> The Router implementation currently takes a proxy approach to handle client 
> requests: the routers receive requests from the clients and send them to the 
> target clusters on the clients' behalf.
> This approach works well, but after moving more clusters on top of routers, 
> we are seeing the routers become the bottleneck. For example, without RBF the 
> clients manage their own connections, while with RBF a limited number of 
> routers manage far more connections on behalf of the clients; we also keep 
> idle connections open to boost connection performance. We have done some work 
> to tune connection management, but it doesn't help much.
> We are proposing to reduce the functionality on the router side and use the 
> routers as actual routers instead of proxies: the clients talk to the routers 
> to resolve target cluster info for a given path and obtain a router 
> delegation token, then send their requests directly to the target cluster.
> A big challenge here is token authentication against the target cluster with 
> only a router token. One approach: the router returns a target cluster token 
> along with the router token so the clients can authenticate against the 
> target cluster. A second approach: similar to the block token mechanism, the 
> router exchanges secret keys with the target clusters through heartbeats so 
> the clients can authenticate with the target cluster using the router token.
> I would like to know your feedback.






[jira] [Created] (HDFS-15905) Improve Router performance with router redirection

2021-03-18 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-15905:
---

 Summary: Improve Router performance with router redirection
 Key: HDFS-15905
 URL: https://issues.apache.org/jira/browse/HDFS-15905
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: rbf
Affects Versions: 3.1.0
Reporter: Aihua Xu
Assignee: Aihua Xu


The Router implementation currently takes a proxy approach to handle client 
requests: the routers receive requests from the clients and send them to the 
target clusters on the clients' behalf.

This approach works well, but after moving more clusters on top of routers, we 
are seeing the routers become the bottleneck. For example, without RBF the 
clients manage their own connections, while with RBF a limited number of 
routers manage far more connections on behalf of the clients; we also keep idle 
connections open to boost connection performance. We have done some work to 
tune connection management, but it doesn't help much.

We are proposing to reduce the functionality on the router side and use the 
routers as actual routers instead of proxies: the clients talk to the routers 
to resolve target cluster info for a given path and obtain a router delegation 
token, then send their requests directly to the target cluster.

A big challenge here is token authentication against the target cluster with 
only a router token. One approach: the router returns a target cluster token 
along with the router token so the clients can authenticate against the target 
cluster. A second approach: similar to the block token mechanism, the router 
exchanges secret keys with the target clusters through heartbeats so the 
clients can authenticate with the target cluster using the router token.

I would like to know your feedback.
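
To make the proposed flow concrete, a very rough sketch (the interfaces below 
are hypothetical and exist only for illustration; none of this is an existing 
RBF API):

{code}
import java.io.IOException;

public class RouterRedirectionSketch {
  /** Hypothetical answer from a router running in "redirection" mode. */
  interface ResolvedTarget {
    String nameserviceId();      // which physical cluster owns the path
    String delegationToken();    // credential usable against that cluster
  }

  /** Hypothetical router client: only resolves paths and issues tokens. */
  interface RouterClient {
    ResolvedTarget resolve(String path) throws IOException;
  }

  /** Hypothetical direct client to a physical HDFS cluster. */
  interface ClusterClient {
    byte[] read(String nameserviceId, String path, String token) throws IOException;
  }

  // Instead of proxying every RPC through the router, the client resolves the
  // target once and then talks to the target cluster directly.
  byte[] readViaRedirection(RouterClient router, ClusterClient clusters, String path)
      throws IOException {
    ResolvedTarget target = router.resolve(path);
    return clusters.read(target.nameserviceId(), path, target.delegationToken());
  }
}
{code}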






[jira] [Updated] (HDFS-15800) DataNode to handle NameNode IP changes

2021-01-28 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15800:

Assignee: Aihua Xu
  Status: Patch Available  (was: Open)

A simple change to update remoteId.address directly rather than only the local 
variable, so connection re-creation will pick up the new IP address from 
remoteId.address.
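
A minimal, self-contained illustration of the pattern (this is not Hadoop's 
actual ipc.Client code or the patch itself; the class below just mirrors the 
described bug and fix):

{code}
import java.net.InetSocketAddress;

public class AddressRefreshSketch {
  static class ConnectionId {
    // Used whenever a new connection is created.
    volatile InetSocketAddress address;
    ConnectionId(InetSocketAddress address) { this.address = address; }
  }

  private final ConnectionId remoteId;
  private InetSocketAddress server;  // working copy used by the current connection

  AddressRefreshSketch(ConnectionId remoteId) {
    this.remoteId = remoteId;
    this.server = remoteId.address;
  }

  void onAddressChangeDetected(InetSocketAddress currentAddr) {
    if (!server.equals(currentAddr)) {
      // Original behaviour: only the local copy is refreshed, so the next
      // connection created from remoteId.address still uses the stale IP.
      server = currentAddr;
      // The change described above: also refresh remoteId.address so that
      // connection re-creation picks up the new NameNode IP immediately.
      remoteId.address = currentAddr;
    }
  }
}
{code}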

> DataNode to handle NameNode IP changes
> --
>
> Key: HDFS-15800
> URL: https://issues.apache.org/jira/browse/HDFS-15800
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HDFS-15800.patch
>
>
> HADOOP-17068 handles the case of NameNode IP address changes, in which the 
> HDFS client updates the IP address after a connection failure.
> DataNodes use the same logic to refresh the IP address for the connection. 
> Such a connection is reused with a default idle time of 10 seconds (set by 
> ipc.client.connection.maxidletime). If the connection is closed, the DataNode 
> will use the old NameNode IP address to connect and only refresh to the new 
> IP address after the first failure.
> The problem with the refresh logic in org.apache.hadoop.ipc.Client is that 
> the refreshed server value is not reflected in remoteId.address, while the 
> next connection creation uses remoteId.address:
> {{if (!server.equals(currentAddr)) {}}
> {{  LOG.warn("Address change detected. Old: " + server.toString() +}}
> {{           " New: " + currentAddr.toString());}}
> {{  server = currentAddr;}}
>  
> Such a retry in a big cluster causes random "BLOCK* 
> blk_16987635027_18010098516 is COMMITTED but not COMPLETE(numNodes= 0 < 
> minimum = 1) in file" errors if all three replicas take one retry to 
> read/write the block.
>  






[jira] [Updated] (HDFS-15800) DataNode to handle NameNode IP changes

2021-01-28 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15800:

Attachment: HDFS-15800.patch

> DataNode to handle NameNode IP changes
> --
>
> Key: HDFS-15800
> URL: https://issues.apache.org/jira/browse/HDFS-15800
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Aihua Xu
>Priority: Major
> Attachments: HDFS-15800.patch
>
>
> HADOOP-17068 handles the case of NameNode IP address changes, in which the 
> HDFS client updates the IP address after a connection failure.
> DataNodes use the same logic to refresh the IP address for the connection. 
> Such a connection is reused with a default idle time of 10 seconds (set by 
> ipc.client.connection.maxidletime). If the connection is closed, the DataNode 
> will use the old NameNode IP address to connect and only refresh to the new 
> IP address after the first failure.
> The problem with the refresh logic in org.apache.hadoop.ipc.Client is that 
> the refreshed server value is not reflected in remoteId.address, while the 
> next connection creation uses remoteId.address:
> {{if (!server.equals(currentAddr)) {}}
> {{  LOG.warn("Address change detected. Old: " + server.toString() +}}
> {{           " New: " + currentAddr.toString());}}
> {{  server = currentAddr;}}
>  
> Such a retry in a big cluster causes random "BLOCK* 
> blk_16987635027_18010098516 is COMMITTED but not COMPLETE(numNodes= 0 < 
> minimum = 1) in file" errors if all three replicas take one retry to 
> read/write the block.
>  






[jira] [Created] (HDFS-15800) DataNode to handle NameNode IP changes

2021-01-28 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-15800:
---

 Summary: DataNode to handle NameNode IP changes
 Key: HDFS-15800
 URL: https://issues.apache.org/jira/browse/HDFS-15800
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.8.0
Reporter: Aihua Xu


HADOOP-17068 handles the case of NameNode IP address changes, in which the HDFS 
client updates the IP address after a connection failure.

DataNodes use the same logic to refresh the IP address for the connection. Such 
a connection is reused with a default idle time of 10 seconds (set by 
ipc.client.connection.maxidletime). If the connection is closed, the DataNode 
will use the old NameNode IP address to connect and only refresh to the new IP 
address after the first failure.

The problem with the refresh logic in org.apache.hadoop.ipc.Client is that the 
refreshed server value is not reflected in remoteId.address, while the next 
connection creation uses remoteId.address:

{{if (!server.equals(currentAddr)) {}}
{{  LOG.warn("Address change detected. Old: " + server.toString() +}}
{{           " New: " + currentAddr.toString());}}
{{  server = currentAddr;}}

Such a retry in a big cluster causes random "BLOCK* blk_16987635027_18010098516 
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file" errors if all 
three replicas take one retry to read/write the block.

 






[jira] [Updated] (HDFS-15727) RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy

2020-12-11 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15727:

Description: 
RpcQueueTimeAvgTime of the NameNode on port 8020 (the client RPC calls) 
increases after it becomes StandBy. It gets resolved after the NameNode is 
restarted. It seems there is something incorrect about this metric.

See the following graph: the NameNode becomes StandBy at 10:13, yet 
RpcQueueTimeAvgTime increases instead.

!image-2020-12-10-13-30-44-288.png!

  was:
RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy. It gets 
resolved after the NameNode is restarted. It seems there is something incorrect 
about this metric.

See the following graph: the NameNode becomes StandBy at 10:13, yet 
RpcQueueTimeAvgTime increases instead.

!image-2020-12-10-13-30-44-288.png!


> RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy
> --
>
> Key: HDFS-15727
> URL: https://issues.apache.org/jira/browse/HDFS-15727
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Aihua Xu
>Priority: Major
> Attachments: image-2020-12-10-13-30-44-288.png
>
>
> RpcQueueTimeAvgTime of the NameNode on port 8020 (the client RPC calls) 
> increases after it becomes StandBy. It gets resolved after the NameNode is 
> restarted. It seems there is something incorrect about this metric.
> See the following graph: the NameNode becomes StandBy at 10:13, yet 
> RpcQueueTimeAvgTime increases instead.
> !image-2020-12-10-13-30-44-288.png!






[jira] [Commented] (HDFS-15727) RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy

2020-12-11 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248080#comment-17248080
 ] 

Aihua Xu commented on HDFS-15727:
-

[~kihwal] It seems it doesn't cause functional issues, just a wrong metric. 
BTW: this metric is actually for port 8020 (the client RPC calls), not for port 
8022 (RPC calls from internal communication).

> RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy
> --
>
> Key: HDFS-15727
> URL: https://issues.apache.org/jira/browse/HDFS-15727
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Aihua Xu
>Priority: Major
> Attachments: image-2020-12-10-13-30-44-288.png
>
>
> RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy. It 
> gets resolved after the NameNode is restarted. It seems there is something 
> incorrect about this metric.
> See the following graph: the NameNode becomes StandBy at 10:13, yet 
> RpcQueueTimeAvgTime increases instead.
> !image-2020-12-10-13-30-44-288.png!






[jira] [Created] (HDFS-15727) RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy

2020-12-10 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-15727:
---

 Summary: RpcQueueTimeAvgTime of the NameNode increases after it 
becomes StandBy
 Key: HDFS-15727
 URL: https://issues.apache.org/jira/browse/HDFS-15727
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 2.8.2
Reporter: Aihua Xu
 Attachments: image-2020-12-10-13-30-44-288.png

RpcQueueTimeAvgTime of the NameNode increases after it becomes StandBy. It gets 
resolved after the NameNode is restarted. It seems there is something incorrect 
about this metric.

See the following graph: the NameNode becomes StandBy at 10:13, yet 
RpcQueueTimeAvgTime increases instead.

!image-2020-12-10-13-30-44-288.png!






[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed

2020-11-16 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233178#comment-17233178
 ] 

Aihua Xu commented on HDFS-15562:
-

Thanks [~shv] for your comment. When I get time, I will focus on not recreating 
the image if there is a recent one. 

> StandbyCheckpointer will do checkpoint repeatedly while connecting 
> observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: SunHao
>Assignee: Aihua Xu
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15562.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We find that the standby namenode will checkpoint over and over when 
> connecting to the observer/active namenode fails.
> StandbyCheckpointer won't update “lastCheckpointTime” when uploading the new 
> fsimage to the other namenode fails, so the standby namenode keeps 
> checkpointing repeatedly.






[jira] [Commented] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-10 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229783#comment-17229783
 ] 

Aihua Xu commented on HDFS-15467:
-

[~csun] It doesn't fail, but it prints the INFO message indicating a failover, 
as shown above. So it was designed to use the upper-level retry logic for 
msync()? I feel FailoverProxyProvider should have its own retry logic to find 
the active namenode, while ObserverReadProxyProvider has its own retry logic to 
find the right Observer NameNode.

> ObserverReadProxyProvider should skip logging first failover from each proxy
> 
>
> Key: HDFS-15467
> URL: https://issues.apache.org/jira/browse/HDFS-15467
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Hanisha Koneru
>Assignee: Aihua Xu
>Priority: Major
>
> After HADOOP-17116, \{{RetryInvocationHandler}} skips logging the first 
> failover INFO message from each proxy. But {{ObserverReadProxyProvider}} uses 
> {{combinedProxy}} object which combines all proxies into one and assigns 
> {{combinedInfo}} as the ProxyInfo.
> {noformat}
> ObserverReadProxyProvider# Lines 197-207:
> for (int i = 0; i < nameNodeProxies.size(); i++) {
>   if (i > 0) {
> combinedInfo.append(",");
>   }
>   combinedInfo.append(nameNodeProxies.get(i).proxyInfo);
> }
> combinedInfo.append(']');
> T wrappedProxy = (T) Proxy.newProxyInstance(
> ObserverReadInvocationHandler.class.getClassLoader(),
> new Class[] {xface}, new ObserverReadInvocationHandler());
> combinedProxy = new ProxyInfo<>(wrappedProxy, 
> combinedInfo.toString()){noformat}
> {{RetryInvocationHandler}} depends on the {{ProxyInfo}} to differentiate 
> between proxies while checking if failover from that proxy happened before. 
> And since the combined proxy has only 1 proxy, HADOOP-17116 doesn't work on 
> {{ObserverReadProxyProvider}}. It would need to be handled separately.






[jira] [Comment Edited] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-09 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228899#comment-17228899
 ] 

Aihua Xu edited comment on HDFS-15467 at 11/10/20, 1:33 AM:


[~csun] In ObserverReadProxyProvider, the FailoverProxyProvider (failoverProxy) 
for active/standby namenode failover doesn't seem to have retry logic. When 
msync() is called against failoverProxy, it can fail when it reaches out to a 
standby namenode. The exception is thrown up to the retry logic of 
ObserverReadProxyProvider to handle (see the stack trace below). Is this by 
design? Logically it seems FailoverProxyProvider should also have retry logic 
around it, like:

{{DfsClientConf config = new DfsClientConf(conf);}}
{{ClientProtocol proxy = (ClientProtocol) RetryProxy.create(xface,}}
{{    failoverProxyProvider,}}
{{    RetryPolicies.failoverOnNetworkException(}}
{{        RetryPolicies.TRY_ONCE_THEN_FAIL, config.getMaxFailoverAttempts(),}}
{{        config.getMaxRetryAttempts(), config.getFailoverSleepBaseMillis(),}}
{{        config.getFailoverSleepMaxMillis()));}}
{quote}20/10/29 04:22:33 INFO retry.RetryInvocationHandler: Exception while 
invoking $Proxy5.getFileInfo over 
[hadoopetanamenode01-dca1.prod.uber.internal/10.22.3.137:8020,hadoopetanamenode02-dca1.prod.uber.internal/10.18.6.167:8020,hadoopetaobserver01-dca1.prod.uber.internal/10.14.137.154:8020]
 after 1 failover attempts. Trying to failover after sleeping for 693ms.
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
[http://t.uber.com/hdfs_faq]
 at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1942)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1387)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.msync(NameNodeRpcServer.java:1318)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.msync(ClientNamenodeProtocolServerSideTranslatorPB.java:1617)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:508)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:930)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:865)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2726)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1524)
 at org.apache.hadoop.ipc.Client.call(Client.java:1470)
 at org.apache.hadoop.ipc.Client.call(Client.java:1369)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:117)
 at com.sun.proxy.$Proxy15.msync(Unknown Source)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.msync(ClientNamenodeProtocolTranslatorPB.java:1634)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.initializeMsync(ObserverReadProxyProvider.java:350)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.access$600(ObserverReadProxyProvider.java:69)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider$ObserverReadInvocationHandler.invoke(ObserverReadProxyProvider.java:427)
 at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
 at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1700)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1439)
 at 

[jira] [Comment Edited] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-09 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228899#comment-17228899
 ] 

Aihua Xu edited comment on HDFS-15467 at 11/10/20, 1:33 AM:


[~csun] In ObserverReadProxyProvider, the FailoverProxyProvider (failoverProxy) 
for active/standby namenode failover doesn't seem to have retry logic. When 
msync() is called against failoverProxy, it can fail when it reaches out to a 
standby namenode. The exception is thrown up to the retry logic of 
ObserverReadProxyProvider to handle (see the stack trace below). Is this by 
design? Logically it seems FailoverProxyProvider should also have retry logic 
around it, like:

{{DfsClientConf config = new DfsClientConf(conf);}}
{{ClientProtocol proxy = (ClientProtocol) RetryProxy.create(xface,}}
{{    failoverProxyProvider,}}
{{    RetryPolicies.failoverOnNetworkException(}}
{{        RetryPolicies.TRY_ONCE_THEN_FAIL, config.getMaxFailoverAttempts(),}}
{{        config.getMaxRetryAttempts(), config.getFailoverSleepBaseMillis(),}}
{{        config.getFailoverSleepMaxMillis()));}}
{quote}20/10/29 04:22:33 INFO retry.RetryInvocationHandler: Exception while 
invoking $Proxy5.getFileInfo over 
[hadoopetanamenode01-dca1.prod.uber.internal/10.22.3.137:8020,hadoopetanamenode02-dca1.prod.uber.internal/10.18.6.167:8020,hadoopetaobserver01-dca1.prod.uber.internal/10.14.137.154:8020]
 after 1 failover attempts. Trying to failover after sleeping for 693ms.
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
[http://t.uber.com/hdfs_faq]
 at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1942)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1387)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.msync(NameNodeRpcServer.java:1318)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.msync(ClientNamenodeProtocolServerSideTranslatorPB.java:1617)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:508)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:930)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:865)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2726)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1524)
 at org.apache.hadoop.ipc.Client.call(Client.java:1470)
 at org.apache.hadoop.ipc.Client.call(Client.java:1369)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:117)
 at com.sun.proxy.$Proxy15.msync(Unknown Source)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.msync(ClientNamenodeProtocolTranslatorPB.java:1634)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.initializeMsync(ObserverReadProxyProvider.java:350)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.access$600(ObserverReadProxyProvider.java:69)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider$ObserverReadInvocationHandler.invoke(ObserverReadProxyProvider.java:427)
 at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
 at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1700)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1439)
 at 

[jira] [Comment Edited] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-09 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228899#comment-17228899
 ] 

Aihua Xu edited comment on HDFS-15467 at 11/10/20, 1:32 AM:


[~csun] In ObserverReadProxyProvider, the FailoverProxyProvider (failoverProxy) 
for active/standby namenode failover doesn't seem to have retry logic. When 
msync() is called against failoverProxy, it can fail when it reaches out to a 
standby namenode. The exception is thrown up to the retry logic of 
ObserverReadProxyProvider to handle (see the stack trace below). Is this by 
design? Logically it seems FailoverProxyProvider should also have retry logic 
around it, like:

{{DfsClientConf config = new DfsClientConf(conf);}}
{{ClientProtocol proxy = (ClientProtocol) RetryProxy.create(xface,}}
{{    failoverProxyProvider,}}
{{    RetryPolicies.failoverOnNetworkException(}}
{{        RetryPolicies.TRY_ONCE_THEN_FAIL, config.getMaxFailoverAttempts(),}}
{{        config.getMaxRetryAttempts(), config.getFailoverSleepBaseMillis(),}}
{{        config.getFailoverSleepMaxMillis()));}}
{quote}20/10/29 04:22:33 INFO retry.RetryInvocationHandler: Exception while 
invoking $Proxy5.getFileInfo over 
[hadoopetanamenode01-dca1.prod.uber.internal/10.22.3.137:8020,hadoopetanamenode02-dca1.prod.uber.internal/10.18.6.167:8020,hadoopetaobserver01-dca1.prod.uber.internal/10.14.137.154:8020]
 after 1 failover attempts. Trying to failover after sleeping for 693ms.
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
[http://t.uber.com/hdfs_faq]
 at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1942)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1387)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.msync(NameNodeRpcServer.java:1318)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.msync(ClientNamenodeProtocolServerSideTranslatorPB.java:1617)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:508)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:930)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:865)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2726)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1524)
 at org.apache.hadoop.ipc.Client.call(Client.java:1470)
 at org.apache.hadoop.ipc.Client.call(Client.java:1369)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:117)
 at com.sun.proxy.$Proxy15.msync(Unknown Source)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.msync(ClientNamenodeProtocolTranslatorPB.java:1634)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.initializeMsync(ObserverReadProxyProvider.java:350)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.access$600(ObserverReadProxyProvider.java:69)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider$ObserverReadInvocationHandler.invoke(ObserverReadProxyProvider.java:427)
 at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
 at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1700)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1439)
 at 

[jira] [Commented] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-09 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228899#comment-17228899
 ] 

Aihua Xu commented on HDFS-15467:
-

[~csun] In ObserverReadProxyProvider, the FailoverProxyProvider (failoverProxy) 
for active/standby namenode failover doesn't seem to have retry logic. When 
msync() is called against failoverProxy, it can fail when it reaches out to a 
standby namenode. The exception is thrown up to the retry logic of 
ObserverReadProxyProvider to handle (see the stack trace below). Is this by 
design? Logically it seems FailoverProxyProvider should also have retry logic 
around it, like:

{{DfsClientConf config = new DfsClientConf(conf);}}
{{ClientProtocol proxy = (ClientProtocol) RetryProxy.create(xface,}}
{{    failoverProxyProvider,}}
{{    RetryPolicies.failoverOnNetworkException(}}
{{        RetryPolicies.TRY_ONCE_THEN_FAIL, config.getMaxFailoverAttempts(),}}
{{        config.getMaxRetryAttempts(), config.getFailoverSleepBaseMillis(),}}
{{        config.getFailoverSleepMaxMillis()));}}


{quote}20/10/29 04:22:33 INFO retry.RetryInvocationHandler: Exception while 
invoking $Proxy5.getFileInfo over 
[hadoopetanamenode01-dca1.prod.uber.internal/10.22.3.137:8020,hadoopetanamenode02-dca1.prod.uber.internal/10.18.6.167:8020,hadoopetaobserver01-dca1.prod.uber.internal/10.14.137.154:8020]
 after 1 failover attempts. Trying to failover after sleeping for 693ms.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
http://t.uber.com/hdfs_faq
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1942)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1387)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.msync(NameNodeRpcServer.java:1318)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.msync(ClientNamenodeProtocolServerSideTranslatorPB.java:1617)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:508)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:930)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:865)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2726)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1524)
at org.apache.hadoop.ipc.Client.call(Client.java:1470)
at org.apache.hadoop.ipc.Client.call(Client.java:1369)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:117)
at com.sun.proxy.$Proxy15.msync(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.msync(ClientNamenodeProtocolTranslatorPB.java:1634)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.initializeMsync(ObserverReadProxyProvider.java:350)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.access$600(ObserverReadProxyProvider.java:69)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider$ObserverReadInvocationHandler.invoke(ObserverReadProxyProvider.java:427)
at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
at com.sun.proxy.$Proxy5.getFileInfo(Unknown Source)
at 

[jira] [Commented] (HDFS-15597) ContentSummary.getSpaceConsumed does not consider replication

2020-11-07 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227882#comment-17227882
 ] 

Aihua Xu commented on HDFS-15597:
-

Thanks for reviewing [~weichiu] and [~ayushtkn]. Let me take a look to address 
the comments.

> ContentSummary.getSpaceConsumed does not consider replication
> -
>
> Key: HDFS-15597
> URL: https://issues.apache.org/jira/browse/HDFS-15597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfs
>Affects Versions: 2.6.0
>Reporter: Ajmal Ahammed
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HDFS-15597.patch
>
>
> I am trying to get the disk space consumed by an HDFS directory using the 
> {{ContentSummary.getSpaceConsumed}} method. I can't get the space consumption 
> to correctly account for the replication factor. The replication factor is 2, 
> and I was expecting twice the actual file size from the above method.
> {code}
> ubuntu@ubuntu:~/ht$ sudo -u hdfs hdfs dfs -ls /var/lib/ubuntu
> Found 2 items
> -rw-r--r--   2 ubuntu ubuntu3145728 2020-09-08 09:55 
> /var/lib/ubuntu/size-test
> drwxrwxr-x   - ubuntu ubuntu  0 2020-09-07 06:37 /var/lib/ubuntu/test
> {code}
> But when I run the following code,
> {code}
> String path = "/etc/hadoop/conf/";
> conf.addResource(new Path(path + "core-site.xml"));
> conf.addResource(new Path(path + "hdfs-site.xml"));
> long size = 
> FileContext.getFileContext(conf).util().getContentSummary(fileStatus).getSpaceConsumed();
> System.out.println("Replication : " + fileStatus.getReplication());
> System.out.println("File size : " + size);
> {code}
> The output is
> {code}
> Replication : 0
> File size : 3145728
> {code}
> Both the file size and the replication factor seem to be incorrect.
> /etc/hadoop/conf/hdfs-site.xml contains the following config:
> {code}
>   <property>
>     <name>dfs.replication</name>
>     <value>2</value>
>   </property>
> {code}






[jira] [Commented] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-02 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225120#comment-17225120
 ] 

Aihua Xu commented on HDFS-15467:
-

I will take a look.

> ObserverReadProxyProvider should skip logging first failover from each proxy
> 
>
> Key: HDFS-15467
> URL: https://issues.apache.org/jira/browse/HDFS-15467
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Hanisha Koneru
>Assignee: Aihua Xu
>Priority: Major
>
> After HADOOP-17116, \{{RetryInvocationHandler}} skips logging the first 
> failover INFO message from each proxy. But {{ObserverReadProxyProvider}} uses 
> {{combinedProxy}} object which combines all proxies into one and assigns 
> {{combinedInfo}} as the ProxyInfo.
> {noformat}
> ObserverReadProxyProvider# Lines 197-207:
> for (int i = 0; i < nameNodeProxies.size(); i++) {
>   if (i > 0) {
> combinedInfo.append(",");
>   }
>   combinedInfo.append(nameNodeProxies.get(i).proxyInfo);
> }
> combinedInfo.append(']');
> T wrappedProxy = (T) Proxy.newProxyInstance(
> ObserverReadInvocationHandler.class.getClassLoader(),
> new Class[] {xface}, new ObserverReadInvocationHandler());
> combinedProxy = new ProxyInfo<>(wrappedProxy, 
> combinedInfo.toString()){noformat}
> {{RetryInvocationHandler}} depends on the {{ProxyInfo}} to differentiate 
> between proxies while checking if failover from that proxy happened before. 
> And since the combined proxy has only 1 proxy, HADOOP-17116 doesn't work on 
> {{ObserverReadProxyProvider}}. It would need to be handled separately.






[jira] [Assigned] (HDFS-15467) ObserverReadProxyProvider should skip logging first failover from each proxy

2020-11-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-15467:
---

Assignee: Aihua Xu

> ObserverReadProxyProvider should skip logging first failover from each proxy
> 
>
> Key: HDFS-15467
> URL: https://issues.apache.org/jira/browse/HDFS-15467
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Hanisha Koneru
>Assignee: Aihua Xu
>Priority: Major
>
> After HADOOP-17116, \{{RetryInvocationHandler}} skips logging the first 
> failover INFO message from each proxy. But {{ObserverReadProxyProvider}} uses 
> {{combinedProxy}} object which combines all proxies into one and assigns 
> {{combinedInfo}} as the ProxyInfo.
> {noformat}
> ObserverReadProxyProvider# Lines 197-207:
> for (int i = 0; i < nameNodeProxies.size(); i++) {
>   if (i > 0) {
> combinedInfo.append(",");
>   }
>   combinedInfo.append(nameNodeProxies.get(i).proxyInfo);
> }
> combinedInfo.append(']');
> T wrappedProxy = (T) Proxy.newProxyInstance(
> ObserverReadInvocationHandler.class.getClassLoader(),
> new Class[] {xface}, new ObserverReadInvocationHandler());
> combinedProxy = new ProxyInfo<>(wrappedProxy, 
> combinedInfo.toString()){noformat}
> {{RetryInvocationHandler}} depends on the {{ProxyInfo}} to differentiate 
> between proxies while checking if failover from that proxy happened before. 
> And since the combined proxy has only 1 proxy, HADOOP-17116 doesn't work on 
> {{ObserverReadProxyProvider}}. It would need to be handled separately.






[jira] [Resolved] (HDFS-15664) Prevent Observer NameNode from becoming StandBy NameNode

2020-11-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu resolved HDFS-15664.
-
Resolution: Duplicate

> Prevent Observer NameNode from becoming StandBy NameNode
> 
>
> Key: HDFS-15664
> URL: https://issues.apache.org/jira/browse/HDFS-15664
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover
>Affects Versions: 2.10.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> When the cluster performs a failover from NN1 to NN2, NN2 asks all the other 
> NNs, including the Observer NameNodes, to cede active state and transition to 
> StandBy.
> It seems we should block the Observer from becoming StandBy and participating 
> in failover. Of course, since we can transition a StandBy NameNode to 
> Observer, we can separately support promoting an Observer NameNode to StandBy 
> NameNode.






[jira] [Commented] (HDFS-15664) Prevent Observer NameNode from becoming StandBy NameNode

2020-11-02 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224985#comment-17224985
 ] 

Aihua Xu commented on HDFS-15664:
-

Didn't know it was fixed in another place. Thanks [~csun]

> Prevent Observer NameNode from becoming StandBy NameNode
> 
>
> Key: HDFS-15664
> URL: https://issues.apache.org/jira/browse/HDFS-15664
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover
>Affects Versions: 2.10.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> When the cluster performs a failover from NN1 to NN2, NN2 asks all the other 
> NNs, including the Observer NameNodes, to cede active state and transition to 
> StandBy.
> It seems we should block the Observer from becoming StandBy and participating 
> in failover. Of course, since we can transition a StandBy NameNode to 
> Observer, we can separately support promoting an Observer NameNode to StandBy 
> NameNode.






[jira] [Commented] (HDFS-15664) Prevent Observer NameNode from becoming StandBy NameNode

2020-11-02 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224828#comment-17224828
 ] 

Aihua Xu commented on HDFS-15664:
-

+ [~sunchao] Comments on this?

> Prevent Observer NameNode from becoming StandBy NameNode
> 
>
> Key: HDFS-15664
> URL: https://issues.apache.org/jira/browse/HDFS-15664
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover
>Affects Versions: 2.10.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>
> When the cluster performs a failover from NN1 to NN2, NN2 asks all the other 
> NNs, including the Observer NameNodes, to cede active state and transition to 
> StandBy.
> It seems we should block the Observer from becoming StandBy and participating 
> in failover. Of course, since we can transition a StandBy NameNode to 
> Observer, we can separately support promoting an Observer NameNode to StandBy 
> NameNode.






[jira] [Created] (HDFS-15664) Prevent Observer NameNode from becoming StandBy NameNode

2020-11-02 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-15664:
---

 Summary: Prevent Observer NameNode from becoming StandBy NameNode
 Key: HDFS-15664
 URL: https://issues.apache.org/jira/browse/HDFS-15664
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: auto-failover
Affects Versions: 2.10.0
Reporter: Aihua Xu
Assignee: Aihua Xu


When the cluster performs a failover from NN1 to NN2, NN2 asks all the other 
NNs, including the Observer NameNodes, to cede active state and transition to 
Standby.

It seems we should block an Observer from becoming Standby and participating in 
failover. Of course, since we can transition a Standby NameNode to Observer, we 
can separately support promoting an Observer NameNode to Standby.
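
As a rough illustration of the proposed guard, here is a minimal, self-contained 
sketch; the class, state names, and method names are hypothetical and this is not 
the actual NameNode failover code:

{code:java}
/**
 * Sketch of the guard discussed above (hypothetical names, not NameNode code):
 * a node currently serving as Observer refuses a transition-to-Standby request
 * that arrives as part of failover, while Observer-to-Standby promotion remains
 * a separate, explicit operation.
 */
public class ObserverTransitionGuardSketch {

  enum HaState { ACTIVE, STANDBY, OBSERVER }

  private HaState state;

  public ObserverTransitionGuardSketch(HaState initial) {
    this.state = initial;
  }

  /** Called when another NameNode asks this one to cede/enter Standby during failover. */
  public synchronized void transitionToStandby() {
    if (state == HaState.OBSERVER) {
      // Observers do not participate in failover; reject instead of silently complying.
      throw new IllegalStateException(
          "Observer NameNode cannot be transitioned to Standby as part of failover");
    }
    state = HaState.STANDBY;
  }

  /** Separate, operator-driven promotion path for an Observer. */
  public synchronized void promoteObserverToStandby() {
    if (state != HaState.OBSERVER) {
      throw new IllegalStateException("Only an Observer can be promoted this way");
    }
    state = HaState.STANDBY;
  }

  public synchronized HaState getState() {
    return state;
  }

  public static void main(String[] args) {
    ObserverTransitionGuardSketch observer =
        new ObserverTransitionGuardSketch(HaState.OBSERVER);
    try {
      observer.transitionToStandby();            // rejected during failover
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    observer.promoteObserverToStandby();         // explicit promotion is allowed
    System.out.println("state after promotion: " + observer.getState());
  }
}
{code}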



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed

2020-10-29 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223287#comment-17223287
 ] 

Aihua Xu commented on HDFS-15562:
-

[~weichiu], [~csun] Can you help review the change? Thanks.

> StandbyCheckpointer will do checkpoint repeatedly while connecting 
> observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: SunHao
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HDFS-15562.patch
>
>
> We find that the standby namenode does checkpoints over and over when it fails 
> to connect to the observer/active namenode.
> StandbyCheckpointer won't update “lastCheckpointTime” when uploading the new 
> fsimage to the other namenode fails, so the standby namenode keeps doing 
> checkpoints repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed

2020-10-29 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15562:

Status: Patch Available  (was: Open)

patch-1: The Standby NameNode does the checkpoint and uploads the image to the 
active and Observer NameNodes. Currently, if any remote NameNode is down and the 
upload fails, the standby NameNode immediately does the checkpoint again and 
retries the upload.

With multiple Observer NameNodes, it's not required that all the Observers are 
running. The patch throws an exception for failures of the checkpoint itself but 
not for upload failures.
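
To make the scheduling idea concrete, here is a simplified, self-contained sketch; 
the interface and class names are made up and this is not the actual 
StandbyCheckpointer change:

{code:java}
import java.util.List;

/**
 * Sketch of the scheduling idea (hypothetical names): the checkpoint timer is
 * advanced once the local checkpoint succeeds, so a failed upload to one remote
 * NameNode only logs an error and waits for the next regular cycle instead of
 * triggering an immediate re-checkpoint.
 */
public class CheckpointCycleSketch {

  interface Checkpointer { void doCheckpoint() throws Exception; }   // hypothetical
  interface ImageUploader { boolean upload(String remoteNn); }       // hypothetical

  private final long checkpointPeriodMs;
  private long lastCheckpointTimeMs;

  public CheckpointCycleSketch(long checkpointPeriodMs) {
    this.checkpointPeriodMs = checkpointPeriodMs;
    this.lastCheckpointTimeMs = System.currentTimeMillis();
  }

  public void runOnce(Checkpointer checkpointer, ImageUploader uploader,
                      List<String> remoteNameNodes) {
    long now = System.currentTimeMillis();
    if (now - lastCheckpointTimeMs < checkpointPeriodMs) {
      return; // not due yet
    }
    try {
      checkpointer.doCheckpoint();   // the checkpoint itself must succeed
    } catch (Exception e) {
      // Checkpoint failure is still surfaced and retried on the next cycle.
      System.err.println("checkpoint failed: " + e.getMessage());
      return;
    }
    // Advance the timer before uploading, so a down Observer cannot cause
    // back-to-back checkpoints on the Standby.
    lastCheckpointTimeMs = now;
    for (String nn : remoteNameNodes) {
      if (!uploader.upload(nn)) {
        System.err.println("upload to " + nn + " failed; will retry at the next checkpoint");
      }
    }
  }
}
{code}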

> StandbyCheckpointer will do checkpoint repeatedly while connecting 
> observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: SunHao
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HDFS-15562.patch
>
>
> We find that the standby namenode does checkpoints over and over when it fails 
> to connect to the observer/active namenode.
> StandbyCheckpointer won't update “lastCheckpointTime” when uploading the new 
> fsimage to the other namenode fails, so the standby namenode keeps doing 
> checkpoints repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed

2020-10-29 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15562:

Attachment: HDFS-15562.patch

> StandbyCheckpointer will do checkpoint repeatedly while connecting 
> observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: SunHao
>Assignee: Aihua Xu
>Priority: Major
> Attachments: HDFS-15562.patch
>
>
> We find that the standby namenode does checkpoints over and over when it fails 
> to connect to the observer/active namenode.
> StandbyCheckpointer won't update “lastCheckpointTime” when uploading the new 
> fsimage to the other namenode fails, so the standby namenode keeps doing 
> checkpoints repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14327) Using FQDN instead of IP to access servers with DNS resolving

2020-10-22 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-14327:

Description: 
*strong text*With 
[HDFS-14118|https://issues.apache.org/jira/browse/HDFS-14118], clients can get 
the IP of the servers (NN/Routers) and use the IP addresses to access the 
machine. This will fail in secure environment as Kerberos is using the domain 
name  (FQDN) in the principal so it won't recognize the IP addresses.

This task is mainly adding a reverse look up on the current basis and get the 
domain name after the IP is fetched. After that clients will still use the 
domain name to access the servers.

  was:
With [HDFS-14118|https://issues.apache.org/jira/browse/HDFS-14118], clients can 
get the IP of the servers (NN/Routers) and use the IP addresses to access the 
machine. This will fail in secure environment as Kerberos is using the domain 
name  (FQDN) in the principal so it won't recognize the IP addresses.

This task is mainly adding a reverse look up on the current basis and get the 
domain name after the IP is fetched. After that clients will still use the 
domain name to access the servers.


> Using FQDN instead of IP to access servers with DNS resolving
> -
>
> Key: HDFS-14327
> URL: https://issues.apache.org/jira/browse/HDFS-14327
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14327.001.patch, HDFS-14327.002.patch
>
>
> *strong text*With 
> [HDFS-14118|https://issues.apache.org/jira/browse/HDFS-14118], clients can 
> get the IP of the servers (NN/Routers) and use the IP addresses to access the 
> machine. This will fail in secure environment as Kerberos is using the domain 
> name  (FQDN) in the principal so it won't recognize the IP addresses.
> This task is mainly adding a reverse look up on the current basis and get the 
> domain name after the IP is fetched. After that clients will still use the 
> domain name to access the servers.
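
For reference, the reverse lookup itself can be done with the standard JDK 
resolver. A minimal sketch (not the HDFS-14327 patch; the fallback handling shown 
here is an assumption):

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

/**
 * Illustrates the reverse lookup described above: given an IP obtained from DNS
 * resolution, ask the resolver for the canonical host name (FQDN) so Kerberos
 * principals built from it match the server's principal.
 */
public class ReverseLookupSketch {
  public static void main(String[] args) throws UnknownHostException {
    String ip = args.length > 0 ? args[0] : "127.0.0.1";
    InetAddress addr = InetAddress.getByName(ip);
    // getCanonicalHostName() triggers a reverse (PTR) lookup; if it fails it
    // falls back to returning the textual IP, which the caller can detect.
    String fqdn = addr.getCanonicalHostName();
    if (fqdn.equals(addr.getHostAddress())) {
      System.out.println("reverse lookup failed for " + ip + "; keeping the IP");
    } else {
      System.out.println(ip + " resolves back to " + fqdn);
    }
  }
}
{code}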



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14327) Using FQDN instead of IP to access servers with DNS resolving

2020-10-22 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-14327:

Description: 
With [HDFS-14118|https://issues.apache.org/jira/browse/HDFS-14118], clients can 
get the IP of the servers (NN/Routers) and use the IP addresses to access the 
machine. This will fail in secure environment as Kerberos is using the domain 
name  (FQDN) in the principal so it won't recognize the IP addresses.

This task is mainly adding a reverse look up on the current basis and get the 
domain name after the IP is fetched. After that clients will still use the 
domain name to access the servers.

  was:
*strong text*With 
[HDFS-14118|https://issues.apache.org/jira/browse/HDFS-14118], clients can get 
the IP of the servers (NN/Routers) and use the IP addresses to access the 
machine. This will fail in secure environment as Kerberos is using the domain 
name  (FQDN) in the principal so it won't recognize the IP addresses.

This task is mainly adding a reverse look up on the current basis and get the 
domain name after the IP is fetched. After that clients will still use the 
domain name to access the servers.


> Using FQDN instead of IP to access servers with DNS resolving
> -
>
> Key: HDFS-14327
> URL: https://issues.apache.org/jira/browse/HDFS-14327
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14327.001.patch, HDFS-14327.002.patch
>
>
> With [HDFS-14118|https://issues.apache.org/jira/browse/HDFS-14118], clients 
> can get the IP of the servers (NN/Routers) and use the IP addresses to access 
> the machine. This will fail in secure environment as Kerberos is using the 
> domain name  (FQDN) in the principal so it won't recognize the IP addresses.
> This task is mainly adding a reverse look up on the current basis and get the 
> domain name after the IP is fetched. After that clients will still use the 
> domain name to access the servers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2020-10-09 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211161#comment-17211161
 ] 

Aihua Xu commented on HDFS-15601:
-

It seems I won't have time to work on this. Assigning it back.

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Priority: Major
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> for users to use a newer client to talk to an older HDFS (say 2.10). Currently 
> the client simply fails in this scenario. A better approach, perhaps, is to have 
> the client fall back to non-batched listing on the input directories.
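
A minimal sketch of the proposed fallback. The batched-listing call is abstracted 
behind a stand-in interface because its exact client API depends on the HDFS 
version, and the class and method names here are hypothetical:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Fallback pattern: try the batched listing once; if the server rejects it as
 * unsupported, remember that and serve this and all later calls with plain
 * per-directory listStatus().
 */
public class ListingWithFallbackSketch {

  /** Stand-in for the HDFS-13616 batched-listing call. */
  interface BatchedLister {
    List<FileStatus> listBatched(List<Path> dirs) throws IOException;
  }

  private final FileSystem fs;
  private final BatchedLister batchedLister;
  private volatile boolean serverSupportsBatching = true;

  public ListingWithFallbackSketch(FileSystem fs, BatchedLister batchedLister) {
    this.fs = fs;
    this.batchedLister = batchedLister;
  }

  public List<FileStatus> list(List<Path> dirs) throws IOException {
    if (serverSupportsBatching) {
      try {
        return batchedLister.listBatched(dirs);
      } catch (UnsupportedOperationException | IOException e) {
        // An older NameNode (e.g. 2.10) rejects the unknown RPC; real code would
        // check specifically for the "no such method" RPC error before degrading.
        serverSupportsBatching = false;
      }
    }
    List<FileStatus> result = new ArrayList<>();
    for (Path dir : dirs) {
      for (FileStatus status : fs.listStatus(dir)) {   // non-batched fallback
        result.add(status);
      }
    }
    return result;
  }
}
{code}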



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2020-10-09 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-15601:
---

Assignee: (was: Aihua Xu)

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Priority: Major
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> for users to use a newer client to talk to an older HDFS (say 2.10). Currently 
> the client simply fails in this scenario. A better approach, perhaps, is to have 
> the client fall back to non-batched listing on the input directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2020-10-04 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-15601:
---

Assignee: Aihua Xu

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Assignee: Aihua Xu
>Priority: Major
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> for users to use a newer client to talk to an older HDFS (say 2.10). Currently 
> the client simply fails in this scenario. A better approach, perhaps, is to have 
> the client fall back to non-batched listing on the input directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2020-10-04 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207801#comment-17207801
 ] 

Aihua Xu commented on HDFS-15601:
-

I will take a look.

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Assignee: Aihua Xu
>Priority: Major
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> for users to use a newer client to talk to an older HDFS (say 2.10). Currently 
> the client simply fails in this scenario. A better approach, perhaps, is to have 
> the client fall back to non-batched listing on the input directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed

2020-10-03 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206838#comment-17206838
 ] 

Aihua Xu commented on HDFS-15562:
-

[~aswqazxsd] I will take a look. If you can provide more details like the 
version, stack trace, etc., that will be helpful.

> StandbyCheckpointer will do checkpoint repeatedly while connecting 
> observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: SunHao
>Assignee: Aihua Xu
>Priority: Major
>
> We find that the standby namenode does checkpoints over and over when it fails 
> to connect to the observer/active namenode.
> StandbyCheckpointer won't update “lastCheckpointTime” when uploading the new 
> fsimage to the other namenode fails, so the standby namenode keeps doing 
> checkpoints repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed

2020-10-03 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-15562:
---

Assignee: Aihua Xu

> StandbyCheckpointer will do checkpoint repeatedly while connecting 
> observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: SunHao
>Assignee: Aihua Xu
>Priority: Major
>
> We find that the standby namenode does checkpoints over and over when it fails 
> to connect to the observer/active namenode.
> StandbyCheckpointer won't update “lastCheckpointTime” when uploading the new 
> fsimage to the other namenode fails, so the standby namenode keeps doing 
> checkpoints repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15597) ContentSummary.getSpaceConsumed does not consider replication

2020-10-02 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206583#comment-17206583
 ] 

Aihua Xu commented on HDFS-15597:
-

[~weichiu] Can you review the simple fix? Thanks.

> ContentSummary.getSpaceConsumed does not consider replication
> -
>
> Key: HDFS-15597
> URL: https://issues.apache.org/jira/browse/HDFS-15597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfs
>Affects Versions: 2.6.0
>Reporter: Ajmal Ahammed
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HDFS-15597.patch
>
>
> I am trying to get the disk space consumed by an HDFS directory using the 
> {{ContentSummary.getSpaceConsumed}} method. I can't get the space consumption 
> correctly considering the replication factor. The replication factor is 2, 
> and I was expecting twice the size of the actual file size from the above 
> method.
> {code}
> ubuntu@ubuntu:~/ht$ sudo -u hdfs hdfs dfs -ls /var/lib/ubuntu
> Found 2 items
> -rw-r--r--   2 ubuntu ubuntu3145728 2020-09-08 09:55 
> /var/lib/ubuntu/size-test
> drwxrwxr-x   - ubuntu ubuntu  0 2020-09-07 06:37 /var/lib/ubuntu/test
> {code}
> But when I run the following code,
> {code}
> String path = "/etc/hadoop/conf/";
> conf.addResource(new Path(path + "core-site.xml"));
> conf.addResource(new Path(path + "hdfs-site.xml"));
> long size = 
> FileContext.getFileContext(conf).util().getContentSummary(fileStatus).getSpaceConsumed();
> System.out.println("Replication : " + fileStatus.getReplication());
> System.out.println("File size : " + size);
> {code}
> The output is
> {code}
> Replication : 0
> File size : 3145728
> {code}
> Both the file size and the replication factor seem to be incorrect.
> /etc/hadoop/conf/hdfs-site.xml contains the following config:
> {code}
>   
> dfs.replication
> 2
>   
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15597) ContentSummary.getSpaceConsumed does not consider replication

2020-10-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15597:

Status: Patch Available  (was: Open)

Patch-1: update the getContentSummary function to take replication into account 
for the spaceConsumed field.
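
A small client-side usage example to sanity-check the expected behavior (this is 
not the patch itself; it assumes an HDFS configuration on the classpath and a 
fully replicated, non-EC file):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * For a fully replicated, non-EC file, spaceConsumed should equal
 * length * replication once replication is taken into account.
 */
public class SpaceConsumedCheck {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);                // e.g. /var/lib/ubuntu/size-test
    Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      FileStatus status = fs.getFileStatus(file);
      ContentSummary summary = fs.getContentSummary(file);
      System.out.println("replication    : " + status.getReplication());
      System.out.println("logical length : " + summary.getLength());
      System.out.println("space consumed : " + summary.getSpaceConsumed());
      // Expected for this report's setup (dfs.replication = 2):
      //   spaceConsumed == length * 2
    }
  }
}
{code}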

> ContentSummary.getSpaceConsumed does not consider replication
> -
>
> Key: HDFS-15597
> URL: https://issues.apache.org/jira/browse/HDFS-15597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfs
>Affects Versions: 2.6.0
>Reporter: Ajmal Ahammed
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HDFS-15597.patch
>
>
> I am trying to get the disk space consumed by an HDFS directory using the 
> {{ContentSummary.getSpaceConsumed}} method. I can't get the space consumption 
> correctly considering the replication factor. The replication factor is 2, 
> and I was expecting twice the size of the actual file size from the above 
> method.
> {code}
> ubuntu@ubuntu:~/ht$ sudo -u hdfs hdfs dfs -ls /var/lib/ubuntu
> Found 2 items
> -rw-r--r--   2 ubuntu ubuntu3145728 2020-09-08 09:55 
> /var/lib/ubuntu/size-test
> drwxrwxr-x   - ubuntu ubuntu  0 2020-09-07 06:37 /var/lib/ubuntu/test
> {code}
> But when I run the following code,
> {code}
> String path = "/etc/hadoop/conf/";
> conf.addResource(new Path(path + "core-site.xml"));
> conf.addResource(new Path(path + "hdfs-site.xml"));
> long size = 
> FileContext.getFileContext(conf).util().getContentSummary(fileStatus).getSpaceConsumed();
> System.out.println("Replication : " + fileStatus.getReplication());
> System.out.println("File size : " + size);
> {code}
> The output is
> {code}
> Replication : 0
> File size : 3145728
> {code}
> Both the file size and the replication factor seem to be incorrect.
> /etc/hadoop/conf/hdfs-site.xml contains the following config:
> {code}
>   
> dfs.replication
> 2
>   
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15597) ContentSummary.getSpaceConsumed does not consider replication

2020-10-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-15597:

Attachment: HDFS-15597.patch

> ContentSummary.getSpaceConsumed does not consider replication
> -
>
> Key: HDFS-15597
> URL: https://issues.apache.org/jira/browse/HDFS-15597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfs
>Affects Versions: 2.6.0
>Reporter: Ajmal Ahammed
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HDFS-15597.patch
>
>
> I am trying to get the disk space consumed by an HDFS directory using the 
> {{ContentSummary.getSpaceConsumed}} method. I can't get the space consumption 
> correctly considering the replication factor. The replication factor is 2, 
> and I was expecting twice the size of the actual file size from the above 
> method.
> {code}
> ubuntu@ubuntu:~/ht$ sudo -u hdfs hdfs dfs -ls /var/lib/ubuntu
> Found 2 items
> -rw-r--r--   2 ubuntu ubuntu3145728 2020-09-08 09:55 
> /var/lib/ubuntu/size-test
> drwxrwxr-x   - ubuntu ubuntu  0 2020-09-07 06:37 /var/lib/ubuntu/test
> {code}
> But when I run the following code,
> {code}
> String path = "/etc/hadoop/conf/";
> conf.addResource(new Path(path + "core-site.xml"));
> conf.addResource(new Path(path + "hdfs-site.xml"));
> long size = 
> FileContext.getFileContext(conf).util().getContentSummary(fileStatus).getSpaceConsumed();
> System.out.println("Replication : " + fileStatus.getReplication());
> System.out.println("File size : " + size);
> {code}
> The output is
> {code}
> Replication : 0
> File size : 3145728
> {code}
> Both the file size and the replication factor seem to be incorrect.
> /etc/hadoop/conf/hdfs-site.xml contains the following config:
> {code}
>   
> dfs.replication
> 2
>   
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15597) ContentSummary.getSpaceConsumed does not consider replication

2020-10-02 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206492#comment-17206492
 ] 

Aihua Xu commented on HDFS-15597:
-

Let me take a look.

> ContentSummary.getSpaceConsumed does not consider replication
> -
>
> Key: HDFS-15597
> URL: https://issues.apache.org/jira/browse/HDFS-15597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfs
>Affects Versions: 2.6.0
>Reporter: Ajmal Ahammed
>Assignee: Aihua Xu
>Priority: Minor
>
> I am trying to get the disk space consumed by an HDFS directory using the 
> {{ContentSummary.getSpaceConsumed}} method. I can't get the space consumption 
> correctly considering the replication factor. The replication factor is 2, 
> and I was expecting twice the size of the actual file size from the above 
> method.
> {code}
> ubuntu@ubuntu:~/ht$ sudo -u hdfs hdfs dfs -ls /var/lib/ubuntu
> Found 2 items
> -rw-r--r--   2 ubuntu ubuntu3145728 2020-09-08 09:55 
> /var/lib/ubuntu/size-test
> drwxrwxr-x   - ubuntu ubuntu  0 2020-09-07 06:37 /var/lib/ubuntu/test
> {code}
> But when I run the following code,
> {code}
> String path = "/etc/hadoop/conf/";
> conf.addResource(new Path(path + "core-site.xml"));
> conf.addResource(new Path(path + "hdfs-site.xml"));
> long size = 
> FileContext.getFileContext(conf).util().getContentSummary(fileStatus).getSpaceConsumed();
> System.out.println("Replication : " + fileStatus.getReplication());
> System.out.println("File size : " + size);
> {code}
> The output is
> {code}
> Replication : 0
> File size : 3145728
> {code}
> Both the file size and the replication factor seem to be incorrect.
> /etc/hadoop/conf/hdfs-site.xml contains the following config:
> {code}
>   
> dfs.replication
> 2
>   
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15597) ContentSummary.getSpaceConsumed does not consider replication

2020-10-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-15597:
---

Assignee: Aihua Xu

> ContentSummary.getSpaceConsumed does not consider replication
> -
>
> Key: HDFS-15597
> URL: https://issues.apache.org/jira/browse/HDFS-15597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfs
>Affects Versions: 2.6.0
>Reporter: Ajmal Ahammed
>Assignee: Aihua Xu
>Priority: Minor
>
> I am trying to get the disk space consumed by an HDFS directory using the 
> {{ContentSummary.getSpaceConsumed}} method. I can't get the space consumption 
> correctly considering the replication factor. The replication factor is 2, 
> and I was expecting twice the size of the actual file size from the above 
> method.
> {code}
> ubuntu@ubuntu:~/ht$ sudo -u hdfs hdfs dfs -ls /var/lib/ubuntu
> Found 2 items
> -rw-r--r--   2 ubuntu ubuntu3145728 2020-09-08 09:55 
> /var/lib/ubuntu/size-test
> drwxrwxr-x   - ubuntu ubuntu  0 2020-09-07 06:37 /var/lib/ubuntu/test
> {code}
> But when I run the following code,
> {code}
> String path = "/etc/hadoop/conf/";
> conf.addResource(new Path(path + "core-site.xml"));
> conf.addResource(new Path(path + "hdfs-site.xml"));
> long size = 
> FileContext.getFileContext(conf).util().getContentSummary(fileStatus).getSpaceConsumed();
> System.out.println("Replication : " + fileStatus.getReplication());
> System.out.println("File size : " + size);
> {code}
> The output is
> {code}
> Replication : 0
> File size : 3145728
> {code}
> Both the file size and the replication factor seem to be incorrect.
> /etc/hadoop/conf/hdfs-site.xml contains the following config:
> {code}
>   
> dfs.replication
> 2
>   
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-5389) A Namenode that keeps only a part of the namespace in memory

2020-10-02 Thread Aihua Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-5389:
--

Assignee: Aihua Xu  (was: Haohui Mai)

> A Namenode that keeps only a part of the namespace in memory
> 
>
> Key: HDFS-5389
> URL: https://issues.apache.org/jira/browse/HDFS-5389
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 0.23.1
>Reporter: Lin Xiao
>Assignee: Aihua Xu
>Priority: Minor
>
> *Background:*
> Currently, the NN keeps all of its namespace in memory. This has had the benefit 
> that the NN code is very simple and, more importantly, helps the NN scale to 
> over 4.5K machines with 60K to 100K concurrent tasks. The HDFS namespace can 
> currently be scaled by using more RAM on the NN and/or using Federation, which 
> scales both namespace and performance. The current federation implementation 
> does not allow renames across volumes without data copying, but there are 
> proposals to remove that limitation.
> *Motivation:*
>  Hadoop lets customers store huge amounts of data at very economical prices 
> and hence allows customers to store their data for several years. While most 
> customers perform analytics on recent data (last hour, day, week, month, 
> quarter, year), the ability to have five-year-old data online for analytics 
> is very attractive for many businesses. Although one can use larger RAM in a 
> NN and/or use Federation, it is not really necessary to store the entire 
> namespace in memory since only the recent data is typically heavily accessed. 
> *Proposed Solution:*
> Store only a portion of the NN's namespace in memory - the "working set" of the 
> applications that are currently operating. LSM data structures are quite 
> appropriate for maintaining the full namespace on disk. One choice is 
> Google's LevelDB open-source implementation.
> *Benefits:*
>  *  Store larger namespaces without resorting to Federated namespace volumes.
>  * Complementary to NN Federated namespace volumes; indeed, this will allow a 
> single NN to easily store multiple larger volumes.
>  *  Faster cold startup - the NN does not have to read its full namespace before 
> responding to clients.
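
As a toy illustration of the proposed direction (assuming the org.iq80.leveldb 
artifact is on the classpath; the key scheme and class names are hypothetical and 
this is not NameNode code):

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;

/**
 * Keeps inode metadata in an on-disk LSM store (LevelDB) and fetches entries on
 * demand, so only the working set needs to live in memory.
 */
public class LsmNamespaceSketch implements AutoCloseable {

  private final DB db;

  public LsmNamespaceSketch(File dir) throws IOException {
    Options options = new Options();
    options.createIfMissing(true);
    this.db = Iq80DBFactory.factory.open(dir, options);
  }

  public void putInode(String path, String serializedInode) {
    db.put(path.getBytes(StandardCharsets.UTF_8),
           serializedInode.getBytes(StandardCharsets.UTF_8));
  }

  public String getInode(String path) {
    byte[] value = db.get(path.getBytes(StandardCharsets.UTF_8));
    return value == null ? null : new String(value, StandardCharsets.UTF_8);
  }

  @Override
  public void close() throws IOException {
    db.close();
  }

  public static void main(String[] args) throws IOException {
    try (LsmNamespaceSketch ns = new LsmNamespaceSketch(new File("/tmp/ns-sketch"))) {
      ns.putInode("/tmp/tpcds-generate/25", "{type:dir}");
      System.out.println(ns.getInode("/tmp/tpcds-generate/25"));
    }
  }
}
{code}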



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-5389) A Namenode that keeps only a part of the namespace in memory

2020-10-02 Thread Aihua Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206394#comment-17206394
 ] 

Aihua Xu commented on HDFS-5389:


There seems to be no activity on this, but it sounds like a great area to help 
scale the NN and reduce NN memory pressure. I will take a look. [~weichiu] and [~yzhangal]

> A Namenode that keeps only a part of the namespace in memory
> 
>
> Key: HDFS-5389
> URL: https://issues.apache.org/jira/browse/HDFS-5389
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 0.23.1
>Reporter: Lin Xiao
>Assignee: Haohui Mai
>Priority: Minor
>
> *Background:*
> Currently, the NN keeps all of its namespace in memory. This has had the benefit 
> that the NN code is very simple and, more importantly, helps the NN scale to 
> over 4.5K machines with 60K to 100K concurrent tasks. The HDFS namespace can 
> currently be scaled by using more RAM on the NN and/or using Federation, which 
> scales both namespace and performance. The current federation implementation 
> does not allow renames across volumes without data copying, but there are 
> proposals to remove that limitation.
> *Motivation:*
>  Hadoop lets customers store huge amounts of data at very economical prices 
> and hence allows customers to store their data for several years. While most 
> customers perform analytics on recent data (last hour, day, week, month, 
> quarter, year), the ability to have five-year-old data online for analytics 
> is very attractive for many businesses. Although one can use larger RAM in a 
> NN and/or use Federation, it is not really necessary to store the entire 
> namespace in memory since only the recent data is typically heavily accessed. 
> *Proposed Solution:*
> Store only a portion of the NN's namespace in memory - the "working set" of the 
> applications that are currently operating. LSM data structures are quite 
> appropriate for maintaining the full namespace on disk. One choice is 
> Google's LevelDB open-source implementation.
> *Benefits:*
>  *  Store larger namespaces without resorting to Federated namespace volumes.
>  * Complementary to NN Federated namespace volumes; indeed, this will allow a 
> single NN to easily store multiple larger volumes.
>  *  Faster cold startup - the NN does not have to read its full namespace before 
> responding to clients.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14450) Erasure Coding: decommissioning datanodes cause replicate a large number of duplicate EC internal blocks

2019-04-23 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824561#comment-16824561
 ] 

Aihua Xu commented on HDFS-14450:
-

[~wuweiwei] Is this issue always reproducible?

> Erasure Coding: decommissioning datanodes cause replicate a large number of 
> duplicate EC internal blocks
> 
>
> Key: HDFS-14450
> URL: https://issues.apache.org/jira/browse/HDFS-14450
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Wu Weiwei
>Assignee: Wu Weiwei
>Priority: Major
>
> {code:java}
> //  [WARN] [RedundancyMonitor] : Failed to place enough replicas, still in 
> need of 2 to reach 167 (unavailableStorages=[DISK, ARCHIVE], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All 
> required storage types are unavailable:  unavailableStorages=[DISK, ARCHIVE], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> In a large-scale cluster, decommissioning a large number of datanodes causes EC 
> block groups to replicate a large number of duplicate internal blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14364) DataNode crashes without any workload

2019-03-12 Thread Aihua Xu (JIRA)
Aihua Xu created HDFS-14364:
---

 Summary: DataNode crashes without any workload
 Key: HDFS-14364
 URL: https://issues.apache.org/jira/browse/HDFS-14364
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode
Affects Versions: 3.1.1
Reporter: Aihua Xu
 Attachments: hs_err_pid106000.log

All the datanodes crash on the cluster, which has EC enabled on part of HDFS. At 
the time of the crash, there is no active workload. Please see the attached JVM 
crash report.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14363) Namenode crashes without any workload

2019-03-12 Thread Aihua Xu (JIRA)
Aihua Xu created HDFS-14363:
---

 Summary: Namenode crashes without any workload
 Key: HDFS-14363
 URL: https://issues.apache.org/jira/browse/HDFS-14363
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: 3.1.1
Reporter: Aihua Xu
 Attachments: hs_err_pid32124.log, hs_err_pid33479.log, 
hs_err_pid44708.log

The namenode and QJM both crash without active workloads. Please see the attached 
logs for the JVM crash reports. Erasure coding is enabled on the cluster for part 
of HDFS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14362) HDFS cluster with EC enabled crashes without any workload

2019-03-12 Thread Aihua Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HDFS-14362:

Summary: HDFS cluster with EC enabled crashes without any workload  (was: 
HDFS cluster crashes without any workload)

> HDFS cluster with EC enabled crashes without any workload
> -
>
> Key: HDFS-14362
> URL: https://issues.apache.org/jira/browse/HDFS-14362
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.1.1
>Reporter: Aihua Xu
>Priority: Major
>
> We have a small test cluster on 3.1.1 with erasure coding enabled. We loaded 
> some data but don't have active data access. Then we are seeing the namenode 
> and datanode crash. We use this parent jira to track the issues. Right now we 
> are not sure if it's related to EC or if it has been fixed in a later version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14362) HDFS cluster crashes without any workload

2019-03-12 Thread Aihua Xu (JIRA)
Aihua Xu created HDFS-14362:
---

 Summary: HDFS cluster crashes without any workload
 Key: HDFS-14362
 URL: https://issues.apache.org/jira/browse/HDFS-14362
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.1.1
Reporter: Aihua Xu


We have a small test cluster on 3.1.1 with erasure coding enabled. We loaded 
some data but don't have active data access. Then we are seeing the namenode and 
datanode crash. We use this parent jira to track the issues. Right now we are not 
sure if it's related to EC or if it has been fixed in a later version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-10586) Erasure Code misfunctions when 3 DataNode down

2019-01-10 Thread Aihua Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-10586:
---

Assignee: Aihua Xu

> Erasure Code misfunctions when 3 DataNode down
> --
>
> Key: HDFS-10586
> URL: https://issues.apache.org/jira/browse/HDFS-10586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 9 DataNodes and 1 NameNode; the erasure coding policy is 
> set to "6-3". When 3 DataNodes are down, erasure coding fails and an exception 
> is thrown
>Reporter: gao shan
>Assignee: Aihua Xu
>Priority: Major
>
> The following are the steps to reproduce:
> 1) hadoop fs -mkdir /ec
> 2) set the erasure coding policy to "6-3"
> 3) "write" data by: 
> time hadoop jar 
> /opt/hadoop/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
>   TestDFSIO -D test.build.data=/ec -write -nrFiles 30 -fileSize 12288 
> -bufferSize 1073741824
> 4) Manually take down 3 nodes: kill the "datanode" and "nodemanager" processes 
> on 3 DataNodes.
> 5) "read" the data using erasure coding by:
> time hadoop jar 
> /opt/hadoop/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
>   TestDFSIO -D test.build.data=/ec -read -nrFiles 30 -fileSize 12288 
> -bufferSize 1073741824
> then the failure occurs and the exception is thrown as:
> INFO mapreduce.Job: Task Id : attempt_1465445965249_0008_m_34_2, Status : 
> FAILED
> Error: java.io.IOException: 4 missing blocks, the stripe is: Offset=0, 
> length=8388608, fetchedChunksNum=0, missingChunksNum=4
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.checkMissingBlocks(DFSStripedInputStream.java:614)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.readParityChunks(DFSStripedInputStream.java:647)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.readStripe(DFSStripedInputStream.java:762)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:316)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:450)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:941)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.hadoop.fs.TestDFSIO$ReadMapper.doIO(TestDFSIO.java:531)
>   at org.apache.hadoop.fs.TestDFSIO$ReadMapper.doIO(TestDFSIO.java:508)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:134)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:37)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10586) Erasure Code misfunctions when 3 DataNode down

2019-01-10 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739908#comment-16739908
 ] 

Aihua Xu commented on HDFS-10586:
-

I tested this out briefly and it doesn't seem to be an issue any more. I will 
do more investigation and confirm.

> Erasure Code misfunctions when 3 DataNode down
> --
>
> Key: HDFS-10586
> URL: https://issues.apache.org/jira/browse/HDFS-10586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 9 DataNodes and 1 NameNode; the erasure coding policy is 
> set to "6-3". When 3 DataNodes are down, erasure coding fails and an exception 
> is thrown
>Reporter: gao shan
>Assignee: Aihua Xu
>Priority: Major
>
> The following are the steps to reproduce:
> 1) hadoop fs -mkdir /ec
> 2) set the erasure coding policy to "6-3"
> 3) "write" data by: 
> time hadoop jar 
> /opt/hadoop/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
>   TestDFSIO -D test.build.data=/ec -write -nrFiles 30 -fileSize 12288 
> -bufferSize 1073741824
> 4) Manually take down 3 nodes: kill the "datanode" and "nodemanager" processes 
> on 3 DataNodes.
> 5) "read" the data using erasure coding by:
> time hadoop jar 
> /opt/hadoop/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
>   TestDFSIO -D test.build.data=/ec -read -nrFiles 30 -fileSize 12288 
> -bufferSize 1073741824
> then the failure occurs and the exception is thrown as:
> INFO mapreduce.Job: Task Id : attempt_1465445965249_0008_m_34_2, Status : 
> FAILED
> Error: java.io.IOException: 4 missing blocks, the stripe is: Offset=0, 
> length=8388608, fetchedChunksNum=0, missingChunksNum=4
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.checkMissingBlocks(DFSStripedInputStream.java:614)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.readParityChunks(DFSStripedInputStream.java:647)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.readStripe(DFSStripedInputStream.java:762)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:316)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:450)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:941)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.hadoop.fs.TestDFSIO$ReadMapper.doIO(TestDFSIO.java:531)
>   at org.apache.hadoop.fs.TestDFSIO$ReadMapper.doIO(TestDFSIO.java:508)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:134)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:37)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14190) Copying folders containing = - characters between hdfs (using webhdfs) does not work in distcp

2019-01-08 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737572#comment-16737572
 ] 

Aihua Xu commented on HDFS-14190:
-

I will take a look.

> Copying folders containing = - characters between hdfs (using webhdfs) does 
> not work in distcp
> --
>
> Key: HDFS-14190
> URL: https://issues.apache.org/jira/browse/HDFS-14190
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: distcp
>Affects Versions: 3.1.1
>Reporter: yinsong
>Assignee: Aihua Xu
>Priority: Major
>
> Copying folders containing "=" and "-" characters between HDFS clusters (using 
> webhdfs) does not work in distcp.
> For example:
> src: hadoop 2.7, target: hadoop 3.1.1
> (1)
> hadoop distcp \
> -pugp \
> -i \
> webhdfs://1.1.1.1:50070/sudiyi_datawarehouse 
> webhdfs://2.2.2.2:50070/sudiyi_datawarehouse
> ERROR tools.SimpleCopyListing: FileNotFoundException exception in listStatus: 
> File /sudiyi_datawarehouse/st_device_standard_ds/date_time%3D2018-10-10 does 
> not exist
>  
> (2)
> hadoop distcp \
> -Dmapreduce.framework.name=yarn \
> -pugp \
> -i \
> webhdfs://1.1.1.1:50070/druid webhdfs://2.2.2.2:50070/druid
> Error: java.io.IOException: File copy failed: 
> webhdfs://10.26.93.65:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  --> 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:217)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> Caused by: java.io.IOException: Couldn't run retriable-command: Copying 
> webhdfs://10.26.93.65:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  to 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:256)
>  ... 10 more
> Caused by: java.io.IOException: Failed to promote 
> tmp-file:webhdfs://10.27.234.198:50070/druid/.distcp.tmp.attempt_1545990837043_0016_m_15_2
>  to: 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.promoteTmpToTarget(RetriableFileCopyCommand.java:250)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:140)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
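
For context, the %3D in the first error above is simply the URL-encoded form of 
"=", which suggests an encoded path segment is being used as a literal HDFS path 
somewhere in the listing/copy path. A small demo of the encoding (illustration 
only, not distcp code):

{code:java}
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/**
 * "=" becomes "%3D" when a path segment is URL-encoded, so a listing that uses
 * the encoded form as a literal HDFS path ("date_time%3D2018-10-10") will not
 * find the original directory.
 */
public class PercentEncodingDemo {
  public static void main(String[] args) throws Exception {
    String original = "date_time=2018-10-10";
    String encoded = URLEncoder.encode(original, StandardCharsets.UTF_8.name());
    String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8.name());
    System.out.println("original: " + original);   // date_time=2018-10-10
    System.out.println("encoded : " + encoded);    // date_time%3D2018-10-10
    System.out.println("decoded : " + decoded);    // date_time=2018-10-10
  }
}
{code}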



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14190) Copying folders containing = - characters between hdfs (using webhdfs) does not work in distcp

2019-01-08 Thread Aihua Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-14190:
---

Assignee: Aihua Xu

> Copying folders containing = - characters between hdfs (using webhdfs) does 
> not work in distcp
> --
>
> Key: HDFS-14190
> URL: https://issues.apache.org/jira/browse/HDFS-14190
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: distcp
>Affects Versions: 3.1.1
>Reporter: yinsong
>Assignee: Aihua Xu
>Priority: Major
>
> Copying folders containing "=" and "-" characters between HDFS clusters (using 
> webhdfs) does not work in distcp.
> For example:
> src: hadoop 2.7, target: hadoop 3.1.1
> (1)
> hadoop distcp \
> -pugp \
> -i \
> webhdfs://1.1.1.1:50070/sudiyi_datawarehouse 
> webhdfs://2.2.2.2:50070/sudiyi_datawarehouse
> ERROR tools.SimpleCopyListing: FileNotFoundException exception in listStatus: 
> File /sudiyi_datawarehouse/st_device_standard_ds/date_time%3D2018-10-10 does 
> not exist
>  
> (2)
> hadoop distcp \
> -Dmapreduce.framework.name=yarn \
> -pugp \
> -i \
> webhdfs://1.1.1.1:50070/druid webhdfs://2.2.2.2:50070/druid
> Error: java.io.IOException: File copy failed: 
> webhdfs://10.26.93.65:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  --> 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:217)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> Caused by: java.io.IOException: Couldn't run retriable-command: Copying 
> webhdfs://10.26.93.65:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  to 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:256)
>  ... 10 more
> Caused by: java.io.IOException: Failed to promote 
> tmp-file:webhdfs://10.27.234.198:50070/druid/.distcp.tmp.attempt_1545990837043_0016_m_15_2
>  to: 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.promoteTmpToTarget(RetriableFileCopyCommand.java:250)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:140)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10815) The state of the EC file is erroneously recognized when you restart the NameNode.

2019-01-07 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736229#comment-16736229
 ] 

Aihua Xu commented on HDFS-10815:
-

Along with HDFS-10775, I will try out the scenario and close this out if it's 
not an issue. 

> The state of the EC file is erroneously recognized when you restart the 
> NameNode.
> -
>
> Key: HDFS-10815
> URL: https://issues.apache.org/jira/browse/HDFS-10815
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 2 NameNodes, 5 DataNodes, Erasured code policy is set as 
> "RS-DEFAULT-3-2-64k"
>Reporter: Eisuke Umeda
>Assignee: Aihua Xu
>Priority: Major
>
> After carrying out a test using the following procedure, some EC files came to 
> be recognized as corrupt files.
> These files could still be retrieved with "hdfs dfs -get".
> The NameNode might be causing the false recognition.
> DataNodes: datanode[1-5]
> Rack awareness: not set
> Copy target files: /tmp/tpcds-generate/25/store_sales/*
> {code}
> $ hdfs dfs -ls /tmp/tpcds-generate/25/store_sales
> Found 25 items
> -rw-r--r--   0 root supergroup  399430918 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-0
> -rw-r--r--   0 root supergroup  399054598 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-1
> -rw-r--r--   0 root supergroup  399329373 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-2
> -rw-r--r--   0 root supergroup  399528459 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-3
> -rw-r--r--   0 root supergroup  399329624 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-4
> -rw-r--r--   0 root supergroup  399085924 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-5
> -rw-r--r--   0 root supergroup  399337384 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-6
> -rw-r--r--   0 root supergroup  399199458 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-7
> -rw-r--r--   0 root supergroup  399679096 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-8
> -rw-r--r--   0 root supergroup  399440431 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-9
> -rw-r--r--   0 root supergroup  399403931 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00010
> -rw-r--r--   0 root supergroup  399472465 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00011
> -rw-r--r--   0 root supergroup  399451784 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00012
> -rw-r--r--   0 root supergroup  399240168 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00013
> -rw-r--r--   0 root supergroup  399370507 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00014
> -rw-r--r--   0 root supergroup  399633351 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00015
> -rw-r--r--   0 root supergroup  396532952 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00016
> -rw-r--r--   0 root supergroup  396258715 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00017
> -rw-r--r--   0 root supergroup  396382486 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00018
> -rw-r--r--   0 root supergroup  399016456 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00019
> -rw-r--r--   0 root supergroup  399465745 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00020
> -rw-r--r--   0 root supergroup  399208235 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00021
> -rw-r--r--   0 root supergroup  399198296 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00022
> -rw-r--r--   0 root supergroup  399599711 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00023
> -rw-r--r--   0 root supergroup  395150855 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00024
> {code}
> NameNodes:
>   namenode1(active)
>   namenode2(standby)
> The directory in which there are "Under-erasure-coded block groups": 
> /tmp/tpcds-generate/test
> {code}
> $ sudo -u hdfs hdfs erasurecode -getPolicy /tmp/tpcds-generate/test
> ErasureCodingPolicy=[Name=RS-DEFAULT-3-2-64k, 
> Schema=[ECSchema=[Codec=rs-default, numDataUnits=3, numParityUnits=2]], 
> CellSize=65536 ]
> {code}
> The following are the steps to reproduce:
> 1) hdfs dfs -cp /tmp/tpcds-generate/25/store_sales/* /tmp/tpcds-generate/test
> 2) datanode1: (in the middle of the copy) sudo pkill -9 -f datanode
> 3) start the datanode process on datanode1 two minutes later
> 4) carry out hdfs fsck and confirm that Under-Replicated Blocks occurred
> 5) wait until Under-Replicated Blocks becomes 0
> 6) (namenode1) /etc/init.d/hadoop-hdfs-namenode restart
> 7) (namenode2) /etc/init.d/hadoop-hdfs-namenode restart

[jira] [Assigned] (HDFS-10815) The state of the EC file is erroneously recognized when you restart the NameNode.

2019-01-07 Thread Aihua Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-10815:
---

Assignee: Aihua Xu

> The state of the EC file is erroneously recognized when you restart the 
> NameNode.
> -
>
> Key: HDFS-10815
> URL: https://issues.apache.org/jira/browse/HDFS-10815
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 2 NameNodes, 5 DataNodes, Erasure code policy is set as 
> "RS-DEFAULT-3-2-64k"
>Reporter: Eisuke Umeda
>Assignee: Aihua Xu
>Priority: Major
>
> After carrying out an examination using the following procedure, some EC files 
> came to be recognized as corrupt files.
> These files could still be retrieved with "hdfs dfs -get".
> The NameNode might be causing the false recognition.
> DataNodes: datanode[1-5]
> Rack awareness: not set
> Copy target files: /tmp/tpcds-generate/25/store_sales/*
> {code}
> $ hdfs dfs -ls /tmp/tpcds-generate/25/store_sales
> Found 25 items
> -rw-r--r--   0 root supergroup  399430918 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-0
> -rw-r--r--   0 root supergroup  399054598 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-1
> -rw-r--r--   0 root supergroup  399329373 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-2
> -rw-r--r--   0 root supergroup  399528459 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-3
> -rw-r--r--   0 root supergroup  399329624 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-4
> -rw-r--r--   0 root supergroup  399085924 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-5
> -rw-r--r--   0 root supergroup  399337384 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-6
> -rw-r--r--   0 root supergroup  399199458 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-7
> -rw-r--r--   0 root supergroup  399679096 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-8
> -rw-r--r--   0 root supergroup  399440431 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-9
> -rw-r--r--   0 root supergroup  399403931 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00010
> -rw-r--r--   0 root supergroup  399472465 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00011
> -rw-r--r--   0 root supergroup  399451784 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00012
> -rw-r--r--   0 root supergroup  399240168 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00013
> -rw-r--r--   0 root supergroup  399370507 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00014
> -rw-r--r--   0 root supergroup  399633351 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00015
> -rw-r--r--   0 root supergroup  396532952 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00016
> -rw-r--r--   0 root supergroup  396258715 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00017
> -rw-r--r--   0 root supergroup  396382486 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00018
> -rw-r--r--   0 root supergroup  399016456 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00019
> -rw-r--r--   0 root supergroup  399465745 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00020
> -rw-r--r--   0 root supergroup  399208235 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00021
> -rw-r--r--   0 root supergroup  399198296 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00022
> -rw-r--r--   0 root supergroup  399599711 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00023
> -rw-r--r--   0 root supergroup  395150855 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00024
> {code}
> NameNodes:
>   namenode1(active)
>   namenode2(standby)
> The directory in which there are "Under-erasure-coded block groups": 
> /tmp/tpcds-generate/test
> {code}
> $ sudo -u hdfs hdfs erasurecode -getPolicy /tmp/tpcds-generate/test
> ErasureCodingPolicy=[Name=RS-DEFAULT-3-2-64k, 
> Schema=[ECSchema=[Codec=rs-default, numDataUnits=3, numParityUnits=2]], 
> CellSize=65536 ]
> {code}
> The following are the steps to reproduce:
> 1) hdfs dfs -cp /tmp/tpcds-generate/25/store_sales/* /tmp/tpcds-generate/test
> 2) datanode1: (in the middle of the copy) sudo pkill -9 -f datanode
> 3) start the datanode process on datanode1 two minutes later
> 4) carry out hdfs fsck and confirm that Under-Replicated Blocks occurred
> 5) wait until Under-Replicated Blocks becomes 0
> 6) (namenode1) /etc/init.d/hadoop-hdfs-namenode restart
> 7) (namenode2) /etc/init.d/hadoop-hdfs-namenode restart



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HDFS-10775) Under-Replicated Blocks can not be recovered

2019-01-07 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736221#comment-16736221
 ] 

Aihua Xu commented on HDFS-10775:
-

Not sure if it's still an issue since it's a very old one. I will take a look.
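
When I look at it, the first thing to check is whether the NameNode's redundancy 
counters ever drain after datanode1 rejoins. A minimal client-side sketch for 
watching that (assumes fs.defaultFS points at this cluster; it only observes, it 
is not part of any fix):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RedundancyWatch {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // Poll the NameNode counters; for a healthy recovery the under-replicated
    // count should drain back to 0 some time after the DataNode comes back.
    for (int i = 0; i < 30; i++) {
      System.out.println("under-replicated=" + dfs.getUnderReplicatedBlocksCount()
          + ", missing=" + dfs.getMissingBlocksCount());
      Thread.sleep(60_000L);
    }
  }
}
{code}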

> Under-Replicated Blocks can not be recovered
> 
>
> Key: HDFS-10775
> URL: https://issues.apache.org/jira/browse/HDFS-10775
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 2 NameNodes, 5 DataNodes, Erasure code policy is set as 
> "RS-DEFAULT-3-2-64k"
>Reporter: Eisuke Umeda
>Assignee: Aihua Xu
>Priority: Major
>
> I killed a DataNode in the middle of writing an EC file. 
> Under-Replicated Blocks occurred, but did not recover.
> DataNodes: datanode[1-5]
> Rack awareness: not set
> Copy target files: /tmp/tpcds-generate/25/store_sales/*
> {code}
> $ hdfs dfs -ls /tmp/tpcds-generate/25/store_sales
> Found 25 items
> -rw-r--r--   0 root supergroup  399430918 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-0
> -rw-r--r--   0 root supergroup  399054598 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-1
> -rw-r--r--   0 root supergroup  399329373 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-2
> -rw-r--r--   0 root supergroup  399528459 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-3
> -rw-r--r--   0 root supergroup  399329624 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-4
> -rw-r--r--   0 root supergroup  399085924 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-5
> -rw-r--r--   0 root supergroup  399337384 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-6
> -rw-r--r--   0 root supergroup  399199458 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-7
> -rw-r--r--   0 root supergroup  399679096 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-8
> -rw-r--r--   0 root supergroup  399440431 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-9
> -rw-r--r--   0 root supergroup  399403931 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00010
> -rw-r--r--   0 root supergroup  399472465 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00011
> -rw-r--r--   0 root supergroup  399451784 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00012
> -rw-r--r--   0 root supergroup  399240168 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00013
> -rw-r--r--   0 root supergroup  399370507 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00014
> -rw-r--r--   0 root supergroup  399633351 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00015
> -rw-r--r--   0 root supergroup  396532952 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00016
> -rw-r--r--   0 root supergroup  396258715 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00017
> -rw-r--r--   0 root supergroup  396382486 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00018
> -rw-r--r--   0 root supergroup  399016456 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00019
> -rw-r--r--   0 root supergroup  399465745 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00020
> -rw-r--r--   0 root supergroup  399208235 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00021
> -rw-r--r--   0 root supergroup  399198296 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00022
> -rw-r--r--   0 root supergroup  399599711 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00023
> -rw-r--r--   0 root supergroup  395150855 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00024
> {code}
> Destination directory: /tmp/tpcds-generate/test
> {code}
> $ sudo -u hdfs hdfs erasurecode -getPolicy /tmp/tpcds-generate/test
> ErasureCodingPolicy=[Name=RS-DEFAULT-3-2-64k, 
> Schema=[ECSchema=[Codec=rs-default, numDataUnits=3, numParityUnits=2]], 
> CellSize=65536 ]
> {code}
> The following are the steps to reproduce:
> 1) hdfs dfs -cp /tmp/tpcds-generate/25/store_sales/* /tmp/tpcds-generate/test
> 2) datanode1: (in the middle of the copy) sudo pkill -9 -f datanode
> 3) start the datanode process on datanode1 two minutes later
> 4) wait for a while



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-10775) Under-Replicated Blocks can not be recovered

2019-01-07 Thread Aihua Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-10775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu reassigned HDFS-10775:
---

Assignee: Aihua Xu

> Under-Replicated Blocks can not be recovered
> 
>
> Key: HDFS-10775
> URL: https://issues.apache.org/jira/browse/HDFS-10775
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 2 NameNodes, 5 DataNodes, Erasure code policy is set as 
> "RS-DEFAULT-3-2-64k"
>Reporter: Eisuke Umeda
>Assignee: Aihua Xu
>Priority: Major
>
> I killed a DataNode in the middle of writing an EC file. 
> Under-Replicated Blocks occurred, but did not recover.
> DataNodes: datanode[1-5]
> Rack awareness: not set
> Copy target files: /tmp/tpcds-generate/25/store_sales/*
> {code}
> $ hdfs dfs -ls /tmp/tpcds-generate/25/store_sales
> Found 25 items
> -rw-r--r--   0 root supergroup  399430918 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-0
> -rw-r--r--   0 root supergroup  399054598 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-1
> -rw-r--r--   0 root supergroup  399329373 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-2
> -rw-r--r--   0 root supergroup  399528459 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-3
> -rw-r--r--   0 root supergroup  399329624 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-4
> -rw-r--r--   0 root supergroup  399085924 2016-08-16 15:11 
> /tmp/tpcds-generate/25/store_sales/data-m-5
> -rw-r--r--   0 root supergroup  399337384 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-6
> -rw-r--r--   0 root supergroup  399199458 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-7
> -rw-r--r--   0 root supergroup  399679096 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-8
> -rw-r--r--   0 root supergroup  399440431 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-9
> -rw-r--r--   0 root supergroup  399403931 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00010
> -rw-r--r--   0 root supergroup  399472465 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00011
> -rw-r--r--   0 root supergroup  399451784 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00012
> -rw-r--r--   0 root supergroup  399240168 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00013
> -rw-r--r--   0 root supergroup  399370507 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00014
> -rw-r--r--   0 root supergroup  399633351 2016-08-16 15:12 
> /tmp/tpcds-generate/25/store_sales/data-m-00015
> -rw-r--r--   0 root supergroup  396532952 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00016
> -rw-r--r--   0 root supergroup  396258715 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00017
> -rw-r--r--   0 root supergroup  396382486 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00018
> -rw-r--r--   0 root supergroup  399016456 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00019
> -rw-r--r--   0 root supergroup  399465745 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00020
> -rw-r--r--   0 root supergroup  399208235 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00021
> -rw-r--r--   0 root supergroup  399198296 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00022
> -rw-r--r--   0 root supergroup  399599711 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00023
> -rw-r--r--   0 root supergroup  395150855 2016-08-16 15:13 
> /tmp/tpcds-generate/25/store_sales/data-m-00024
> {code}
> Destination directory: /tmp/tpcds-generate/test
> {code}
> $ sudo -u hdfs hdfs erasurecode -getPolicy /tmp/tpcds-generate/test
> ErasureCodingPolicy=[Name=RS-DEFAULT-3-2-64k, 
> Schema=[ECSchema=[Codec=rs-default, numDataUnits=3, numParityUnits=2]], 
> CellSize=65536 ]
> {code}
> The following are the steps to reproduce:
> 1) hdfs dfs -cp /tmp/tpcds-generate/25/store_sales/* /tmp/tpcds-generate/test
> 2) datanode1: (in the middle of the copy) sudo pkill -9 -f datanode
> 3) start the datanode process on datanode1 two minutes later
> 4) wait for a while



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13293) RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer

2019-01-03 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733707#comment-16733707
 ] 

Aihua Xu commented on HDFS-13293:
-

[~ferhui] How is this jira progressing? We are also interested in getting 
callerContext to work with RBF. Thanks.
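
For context, what gets lost at the router today is whatever the client puts into 
its CallerContext before issuing RPCs. A minimal sketch of the client side 
(illustrative context string; standard org.apache.hadoop.ipc.CallerContext usage, 
nothing router-specific):

{code:java}
import org.apache.hadoop.ipc.CallerContext;

public class CallerContextExample {
  public static void main(String[] args) {
    // The client tags its RPCs before touching HDFS; the NameNode records this
    // string in its audit log so operators can attribute load to a workload.
    CallerContext.setCurrent(
        new CallerContext.Builder("hive_query:q_20190103_0001").build());

    // Any DFSClient calls on this thread now carry the context. With RBF in the
    // middle, the router issues the downstream RPC itself, so unless it forwards
    // this context (what this jira proposes), the NameNode audit log only shows
    // the router as the caller.
    System.out.println("caller context = " + CallerContext.getCurrent().getContext());
  }
}
{code}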

> RBF: The RouterRPCServer should transfer CallerContext and client ip to 
> NamenodeRpcServer
> -
>
> Key: HDFS-13293
> URL: https://issues.apache.org/jira/browse/HDFS-13293
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: maobaolong
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-13293.001.patch
>
>
> Otherwise, the namenode doesn't know the client's callerContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12953) XORRawDecoder.doDecode throws NullPointerException

2019-01-03 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733614#comment-16733614
 ] 

Aihua Xu commented on HDFS-12953:
-

[~xiaochen] Are you working on this? I can investigate the issue further if you 
are not. Does the test case you attached fail?
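
In the meantime, one thing worth keeping in mind when narrowing this down: the raw 
decoder's contract is that exactly the erased units are passed as null slots in the 
inputs array and named again in erasedIndexes. A standalone sketch of a legal XOR 
decode call (XOR 2+1 layout chosen purely for illustration; coder construction 
details may differ slightly between 3.x releases):

{code:java}
import org.apache.hadoop.io.erasurecode.ErasureCoderOptions;
import org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder;
import org.apache.hadoop.io.erasurecode.rawcoder.RawErasureEncoder;
import org.apache.hadoop.io.erasurecode.rawcoder.XORRawDecoder;
import org.apache.hadoop.io.erasurecode.rawcoder.XORRawEncoder;

public class XorDecodeSketch {
  public static void main(String[] args) {
    ErasureCoderOptions opts = new ErasureCoderOptions(2, 1); // 2 data + 1 parity
    RawErasureEncoder encoder = new XORRawEncoder(opts);
    RawErasureDecoder decoder = new XORRawDecoder(opts);

    byte[][] data = { "unit0".getBytes(), "unit1".getBytes() };
    byte[][] parity = { new byte[5] };
    encoder.encode(data, parity);

    // Lose data unit 1: a null slot in inputs plus its index in erasedIndexes.
    // A null anywhere the decoder does not expect one is the kind of thing that
    // surfaces as an NPE inside doDecode.
    byte[][] inputs = { data[0], null, parity[0] };
    int[] erasedIndexes = { 1 };
    byte[][] outputs = { new byte[5] };
    decoder.decode(inputs, erasedIndexes, outputs);

    System.out.println("recovered: " + new String(outputs[0])); // "unit1"
  }
}
{code}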

> XORRawDecoder.doDecode throws NullPointerException
> --
>
> Key: HDFS-12953
> URL: https://issues.apache.org/jira/browse/HDFS-12953
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: Lei (Eddy) Xu
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HDFS-12953.test.patch
>
>
> Thanks [~danielpol] report on HDFS-12860.
> {noformat}
> 17/11/30 04:19:55 INFO mapreduce.Job: map 0% reduce 0%
> 17/11/30 04:20:01 INFO mapreduce.Job: Task Id : 
> attempt_1512036058655_0003_m_02_0, Status : FAILED
> Error: java.lang.NullPointerException
> at 
> org.apache.hadoop.io.erasurecode.rawcoder.XORRawDecoder.doDecode(XORRawDecoder.java:83)
> at 
> org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:106)
> at 
> org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:170)
> at 
> org.apache.hadoop.hdfs.StripeReader.decodeAndFillBuffer(StripeReader.java:423)
> at 
> org.apache.hadoop.hdfs.StatefulStripeReader.decode(StatefulStripeReader.java:94)
> at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:382)
> at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:318)
> at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:391)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:813)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at 
> org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:257)
> at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:563)
> at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:794)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14098) use fsck tools for EC blockId will throw NullPointerException

2019-01-03 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733607#comment-16733607
 ] 

Aihua Xu commented on HDFS-14098:
-

[~luoge123] Are you working on the fix? We are also seeing the same issue and I 
can take it over if you are not. Thanks.

> use fsck tools for EC blockId will throw NullPointerException
> -
>
> Key: HDFS-14098
> URL: https://issues.apache.org/jira/browse/HDFS-14098
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: luoge123
>Assignee: luoge123
>Priority: Major
>
>  
> When an EC file has some blocks missing, using the fsck tool with an EC blockId will throw a 
> NullPointerException.
> {code:java}
> hdfs fsck -blockId blk_-9223372036800049376 
> fsck information is:
> Block Id: blk_-9223372036800049376
> Block belongs to: /logdata/test.lzo
> No. of Expected Replica: 9
> No. of live Replica: 5
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> null
> {code}
> The namenode will throw a NullPointerException:
> {code:java}
> 2018-11-26 15:59:35,107 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Fsck on blockId 'blk_-9223372036800049376
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.blockIdCK(NamenodeFsck.java:270)
> at 
> org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.fsck(NamenodeFsck.java:313)
> at 
> org.apache.hadoop.hdfs.server.namenode.FsckServlet$1.run(FsckServlet.java:67)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10586) Erasure Code misfunctions when 3 DataNode down

2018-10-24 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662600#comment-16662600
 ] 

Aihua Xu commented on HDFS-10586:
-

[~gaoshbj] This is a very old jira. I'm wondering if you have any further 
update on the issue.
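
One note for whoever revisits this: with an RS 6+3 layout a stripe survives at most 
3 missing units, and the failed read above reports missingChunksNum=4, so the 
reader is right to give up on that stripe. The open question is why a 3-node outage 
on a 9-node cluster leaves more than 3 units of one stripe unreachable. The 
arithmetic, spelled out (illustrative only):

{code:java}
public class StripeRecoverability {
  public static void main(String[] args) {
    int dataUnits = 6, parityUnits = 3; // the "6-3" policy from this report
    int missingChunks = 4;              // from the exception message above

    // A Reed-Solomon stripe can be rebuilt from any dataUnits of its
    // (dataUnits + parityUnits) units, i.e. it tolerates at most parityUnits
    // missing units per stripe.
    boolean recoverable = missingChunks <= parityUnits;
    System.out.println("recoverable = " + recoverable); // false here
  }
}
{code}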

> Erasure Code misfunctions when 3 DataNode down
> --
>
> Key: HDFS-10586
> URL: https://issues.apache.org/jira/browse/HDFS-10586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
> Environment: 9 DataNodes and 1 NameNode; erasure code policy is 
> set as "6-3". When 3 DataNodes are down, erasure coding fails and an exception 
> is thrown
>Reporter: gao shan
>Priority: Major
>
> The following are the steps to reproduce:
> 1) hadoop fs -mkdir /ec
> 2) set the erasure code policy to "6-3"
> 3) "write" data by:
> time hadoop jar 
> /opt/hadoop/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
>   TestDFSIO -D test.build.data=/ec -write -nrFiles 30 -fileSize 12288 
> -bufferSize 1073741824
> 4) Manually bring down 3 nodes: kill the "datanode" and "nodemanager" processes 
> on 3 DataNodes.
> 5) "read" the data with erasure coding by:
> time hadoop jar 
> /opt/hadoop/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
>   TestDFSIO -D test.build.data=/ec -read -nrFiles 30 -fileSize 12288 
> -bufferSize 1073741824
> Then the failure occurs and the following exception is thrown:
> INFO mapreduce.Job: Task Id : attempt_1465445965249_0008_m_34_2, Status : 
> FAILED
> Error: java.io.IOException: 4 missing blocks, the stripe is: Offset=0, 
> length=8388608, fetchedChunksNum=0, missingChunksNum=4
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.checkMissingBlocks(DFSStripedInputStream.java:614)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.readParityChunks(DFSStripedInputStream.java:647)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream$StripeReader.readStripe(DFSStripedInputStream.java:762)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:316)
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:450)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:941)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.hadoop.fs.TestDFSIO$ReadMapper.doIO(TestDFSIO.java:531)
>   at org.apache.hadoop.fs.TestDFSIO$ReadMapper.doIO(TestDFSIO.java:508)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:134)
>   at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:37)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-10385) LocalFileSystem rename() function should return false when destination file exists

2016-05-10 Thread Aihua Xu (JIRA)
Aihua Xu created HDFS-10385:
---

 Summary: LocalFileSystem rename() function should return false 
when destination file exists
 Key: HDFS-10385
 URL: https://issues.apache.org/jira/browse/HDFS-10385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: fs
Affects Versions: 2.6.0
Reporter: Aihua Xu


Currently LocalFileSystem's rename() returns true and renames successfully even 
when the destination file exists. That behavior differs from 
DistributedFileSystem. 

If they had the same behavior, we could use a single rename() call 
rather than first checking whether the destination exists and then calling rename().
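
A small sketch of the difference against the public FileSystem API: the 
exists()+rename() pattern below is the guard callers have to write today, and the 
ask is that a bare rename() behave the same way on LocalFileSystem as it does on 
DistributedFileSystem (this is only an illustration, not the proposed patch):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class RenameBehavior {
  public static void main(String[] args) throws Exception {
    LocalFileSystem local = FileSystem.getLocal(new Configuration());

    Path src = new Path("/tmp/rename-src");
    Path dst = new Path("/tmp/rename-dst");
    local.create(src, true).close();
    local.create(dst, true).close(); // destination already exists

    // Today the caller has to guard the rename itself to get HDFS-like semantics:
    boolean renamed = !local.exists(dst) && local.rename(src, dst);
    System.out.println("guarded rename returned " + renamed); // false, dst untouched

    // What this jira asks for is that local.rename(src, dst) alone return false
    // here, the way DistributedFileSystem does, so the extra exists() check and
    // round trip can go away.
  }
}
{code}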



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org