Is it possible to configure HDFS in federation mode and in HA mode at the same time?

2016-08-15 Thread Alexandr Porunov
Hello all, I don't understand whether it is possible to configure HDFS in both modes at the same time. Does it make sense? Can somebody show a simple configuration of HDFS in both modes? (nameNode1, nameNode2, nameNodeStandby1, nameNodeStandby2) Sincerely, Alexandr
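For the record: yes, the two features compose. Federation means multiple independent nameservices, and each nameservice can itself be an HA pair. A minimal hdfs-site.xml sketch, using the host names from the question (nameservice names, ports, and the shared-edits/ZKFC failover settings are illustrative assumptions, not a complete config):

```xml
<!-- Two federated nameservices, each an active/standby HA pair (sketch) -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1.nn1</name>
  <value>nameNode1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1.nn2</name>
  <value>nameNodeStandby1:8020</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns2</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2.nn1</name>
  <value>nameNode2:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2.nn2</name>
  <value>nameNodeStandby2:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.ns1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

Each nameservice also needs its own shared edits storage (e.g. a JournalNode quorum) and, for automatic failover, ZKFC; those settings are per-nameservice and omitted here.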

Re: How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Shady Xu
Thanks Wei-Chiu and Sunil, I have read the docs you mentioned before starting. The specific problem now is that the DataNodes of the source cluster report their local IPs instead of the public ones, which cannot be accessed from the NodeManagers of the destination cluster. Seems the solution is to
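One remedy commonly used for exactly this symptom (a sketch of where the thread seems to be heading, not a confirmed fix from it) is to have DataNodes advertise hostnames rather than IPs, and have clients resolve those hostnames themselves; the destination cluster can then map the hostnames to the public addresses via DNS or /etc/hosts:

```xml
<!-- hdfs-site.xml sketch on the source cluster -->
<property>
  <!-- DataNodes register with the NameNode by hostname, not IP -->
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>
<property>
  <!-- Clients (here: the distcp tasks on the destination side)
       connect to DataNodes by hostname, not the reported IP -->
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```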

Re: Hadoop archives (.har) are really really slow

2016-08-15 Thread Aaron Turner
Oh, I should mention that creating the archive took only a few hours, but copying the files out of the archive back to HDFS ran at 80 MB/min. It would take years to copy back, which seems really surprising. -Aaron > On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze wrote: > > ls over

Re: Hadoop archives (.har) are really really slow

2016-08-15 Thread Aaron Turner
I can list all the files out of HDFS in a few hours, not a day. Listing the files in a single directory in the har takes ~50 min. Honestly I'd be happy with only a 10x performance hit. I'm seeing closer to 100-150x. -Aaron > On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze

Hadoop archives (.har) are really really slow

2016-08-15 Thread Aaron Turner
Basically I want to list all the files in a .har file and compare the file list/sizes to an existing directory in HDFS. The problem is that running commands like: hdfs dfs -ls -R is orders of magnitude slower than running the same command against a live HDFS file system. How much slower? I've
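For context, the comparison being made is presumably between these two forms (paths are hypothetical). Access through the `har://` scheme goes through the HarFileSystem layer, which resolves entries via the archive's index files rather than the NameNode directly, which is where the slowdown is felt:

```
# Create an archive from a directory (sketch)
hadoop archive -archiveName files.har -p /user/data /user/archives

# Recursive listing inside the archive, via the har:// scheme
hdfs dfs -ls -R har:///user/archives/files.har

# Equivalent listing against the live directory, for comparison
hdfs dfs -ls -R /user/data
```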

Re: Yarn web UI shows more memory used than actual

2016-08-15 Thread Ravi Prakash
Hi Suresh! YARN's accounting for memory on each node is completely different from the Linux kernel's accounting of memory used. E.g., I could launch a MapReduce task which in reality allocates just 100 MB, and tell YARN to give it 8 GB. The kernel would show the memory requested by the task, the

Re: How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Sunil Govind
Hi I think you can also refer below link too. http://aajisaka.github.io/hadoop-project/hadoop-distcp/DistCp.html Thanks Sunil On Mon, Aug 15, 2016 at 7:26 PM Wei-Chiu Chuang wrote: > Hello, > if I understand your question correctly, you are actually building a > multi-home

Re: How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Wei-Chiu Chuang
Hello, if I understand your question correctly, you are actually building a multi-homed Hadoop setup, correct? Multi-homed Hadoop clusters can be tricky to set up, to the extent that Cloudera does not recommend it. I've not set up a multi-homed Hadoop cluster before, but I think you have to make sure the

Re: Yarn web UI shows more memory used than actual

2016-08-15 Thread Sunil Govind
Hi Suresh, "This 'memory used' would be the memory used by all containers running on that node" >> "Memory Used" on the Nodes page indicates how much memory is allocated across all the NodeManagers with respect to the corresponding demand made to the RM. For example, if an application has asked for 4 GB of resources and if its
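The accounting both replies describe (YARN reports what was granted, rounded up to the scheduler's minimum allocation, regardless of what the kernel sees) can be illustrated with a small Python sketch. The function is hypothetical, not YARN code; only the rounding behavior mirrors `yarn.scheduler.minimum-allocation-mb`:

```python
import math

def yarn_reported_memory_mb(container_requests_mb, min_alloc_mb=1024):
    """Hypothetical model of the 'Memory Used' figure on the Nodes page:
    each container request is rounded up to a multiple of the scheduler's
    minimum allocation, independent of the task's actual resident memory."""
    return sum(math.ceil(r / min_alloc_mb) * min_alloc_mb
               for r in container_requests_mb)

# Ravi's example: a task that really allocates ~100 MB but asks YARN for 8 GB
print(yarn_reported_memory_mb([8192]))  # YARN reports the full 8192 MB
```

So a node can show many gigabytes "used" while the kernel's free/top output shows far less actually resident.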

How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Shady Xu
Hi all, Recently I tried to use distcp to copy data between two clusters which are not in the same local network. Fortunately, the nodes of the source cluster each have an extra interface and IP which can be accessed from the destination cluster. But during the process of distcp, the map tasks