Re: copy files from ftp to hdfs in parallel, distcp failed

2013-07-11 Thread பாலாஜி நாராயணன்
On 11 July 2013 06:27, Hao Ren h@claravista.fr wrote:

 Hi,

 I am running HDFS on Amazon EC2.

 Say I have an FTP server that stores some data.

 I just want to copy this data directly to HDFS in parallel (which may be
 more efficient).

 I think hadoop distcp is what I need.


http://hadoop.apache.org/docs/stable/distcp.html

DistCp (distributed copy) is a tool used for large inter/intra-cluster
copying. It uses MapReduce to effect its distribution, error handling and
recovery, and reporting


I doubt this is going to help. Are these a lot of files? If yes, how about
multiple copy jobs to HDFS?
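For what it's worth, here is a minimal, untested sketch of the simplest version
of that idea, using Hadoop's FileSystem API and the built-in ftp:// filesystem.
The hosts ftp.example.com and namenode.example.com and the /data paths are
placeholders, not details from the original post; distcp does essentially the
same copy but spreads it across a MapReduce job:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileUtil;
  import org.apache.hadoop.fs.Path;

  public class FtpToHdfs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();

      // Source: Hadoop's FTP filesystem (credentials go in the URI).
      FileSystem ftp = FileSystem.get(
          URI.create("ftp://user:password@ftp.example.com/"), conf);
      // Destination: the HDFS cluster running on EC2.
      FileSystem hdfs = FileSystem.get(
          URI.create("hdfs://namenode.example.com:8020/"), conf);

      // Copy every file under /data on the FTP server into HDFS.
      for (FileStatus stat : ftp.listStatus(new Path("/data"))) {
        FileUtil.copy(ftp, stat.getPath(),
            hdfs, new Path("/user/hao/data", stat.getPath().getName()),
            false /* do not delete the source */, conf);
      }
    }
  }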
-balaji


Re: yarn Failed to bind to: 0.0.0.0/0.0.0.0:8080

2013-07-10 Thread பாலாஜி நாராயணன்
On Wednesday, 10 July 2013, ch huang wrote:

 I have 3 NMs. On one of the NM boxes, port 8080 is already occupied by
 Tomcat, so I want to change the 8080 port on all NMs to 8090. The problem
 is that I do not know which option in YARN controls port 8080. Can anyone
 help?


Why would you want to do that? If you want to test out any multi-node
features you are better off running them in VMs.
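For reference, and as an assumption rather than a confirmed diagnosis: in early
Hadoop 2.x releases the NodeManager component that binds 0.0.0.0:8080 by default
is usually the MapReduce shuffle handler, controlled by mapreduce.shuffle.port.
A minimal sketch of checking and overriding that key; in practice it would go in
the NodeManager's site configuration on every node, not in code:

  import org.apache.hadoop.conf.Configuration;

  public class ShufflePortSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // The early 2.x default shuffle port was 8080, which is presumably what
      // collides with Tomcat here (assumption, not confirmed in the thread).
      System.out.println("current: " + conf.get("mapreduce.shuffle.port", "8080"));
      // Moving the shuffle service to 8090:
      conf.set("mapreduce.shuffle.port", "8090");
    }
  }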

-balaji


-- 
http://balajin.net/blog
http://flic.kr/balajijegan


Re: disk used percentage is not symmetric on datanodes (balancer)

2013-03-24 Thread பாலாஜி நாராயணன்
Are you running balancer? If balancer is running and if it is slow, try
increasing the balancer bandwidth


On 24 March 2013 09:21, Tapas Sarangi tapas.sara...@gmail.com wrote:

 Thanks for the follow up. I don't know whether the attachment will pass
 through this mailing list, but I am attaching a pdf that contains the usage
 of all live nodes.

 All nodes starting with the letter g are the ones with smaller storage
 space, whereas nodes starting with the letter s have larger storage space.
 As you will see, most of the gXX nodes are completely full whereas the sXX
 nodes have a lot of unused space.

 Recently, we have been facing a crisis frequently, as HDFS goes into a mode
 where it is not able to write any further even though the total space
 available in the cluster is about 500 TB. We believe this has something to
 do with the way it is balancing the nodes, but we don't understand the
 problem yet. Maybe the attached PDF will help some of you (experts) see
 what is going wrong here...

 Thanks
 --







 The balancer knows about topology, but when it calculates balancing it
 operates only on nodes, not on racks.
 You can see how it works in Balancer.java, in BalancerDatanode, around
 line 509.

 I was wrong about 350 TB / 35 TB; it is calculated this way:

 For example:
 cluster_capacity=3.5Pb
 cluster_dfsused=2Pb

 avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% of cluster
 capacity used.
 Then we know the average node utilization (node_dfsused / node_capacity * 100).
 The balancer thinks everything is fine if
 avgutil + 10 >= node_utilization >= avgutil - 10.

 In the ideal case every node uses avgutil of its capacity, but for a 12 TB
 node that is only about 6.9 TB, and for a 72 TB node it is about 41 TB.
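To make those numbers concrete, here is a small stand-alone sketch of the same
arithmetic (plain Java, no Hadoop dependency; the cluster and node sizes are
just the examples used in this thread):

  public class BalancerMath {
    public static void main(String[] args) {
      double clusterCapacityTb = 3500;  // 3.5 PB
      double clusterUsedTb = 2000;      // 2 PB
      double avgUtil = clusterUsedTb / clusterCapacityTb * 100;  // ~57.14%

      double threshold = 10;  // default balancer threshold, in percentage points
      for (double nodeCapacityTb : new double[] {12, 72}) {
        double idealUsedTb = nodeCapacityTb * avgUtil / 100;
        System.out.printf(
            "%.0f TB node: ideal ~%.1f TB used, accepted utilization %.1f%%..%.1f%%%n",
            nodeCapacityTb, idealUsedTb, avgUtil - threshold, avgUtil + threshold);
      }
    }
  }

This prints roughly 6.9 TB for the 12 TB node and 41 TB for the 72 TB node,
which is where the figures above come from.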

 The balancer can't help you.

 Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if
 you can.





  In the ideal case, with replication factor 2 and two nodes of 12 TB and
 72 TB, you will be able to have only 12 TB of replicated data.


 Yes, this is true for exactly two nodes in the cluster with 12 TB and 72
 TB, but not true for more than two nodes in the cluster.


 The best way, in my opinion, is to use multiple racks. Nodes in a rack
 must have identical capacity, and racks must have identical capacity.
 For example:

 rack1: 1 node with 72 TB
 rack2: 6 nodes with 12 TB
 rack3: 3 nodes with 24 TB

 It helps with balancing, because the duplicated block must be on another rack.


 The same question I asked earlier in this message: do multiple racks with
 the default threshold for the balancer minimize the difference between
 racks?

 Why did you select HDFS? Maybe Lustre, CephFS, or something else would be a
 better choice.


 It wasn't my decision, and I probably can't change it now. I am new to
 this cluster and trying to understand a few issues. I will explore the
 other options you mentioned.

 --
 http://balajin.net/blog
 http://flic.kr/balajijegan



Re: question for committer

2013-03-24 Thread பாலாஜி நாராயணன்
Is there a reason why you don't want to run MRv2 under YARN?


On 22 March 2013 22:49, Azuryy Yu azury...@gmail.com wrote:

 Is there a way to separate hdfs2 from hadoop2? I want to use hdfs2 with
 mapreduce 1.0.4 and exclude yarn, because I need HDFS-HA.

 --
 http://balajin.net/blog
 http://flic.kr/balajijegan



Re: disk used percentage is not symmetric on datanodes (balancer)

2013-03-24 Thread பாலாஜி நாராயணன்
 -setBalancerBandwidth <bandwidth in bytes per second>

So the value is in bytes per second. If it is running and exiting, it means
it has completed the balancing.
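As a sanity check on the numbers quoted below -- a minimal sketch, assuming
the 1.x property name dfs.balance.bandwidthPerSec used in this thread (newer
releases call it dfs.datanode.balance.bandwidthPerSec) and its usual default
of 1048576 bytes/s:

  import org.apache.hadoop.conf.Configuration;

  public class BalancerBandwidthCheck {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // The value from this thread, 2x10^9 bytes/s, is roughly 1.9 GiB/s per
      // datanode -- far above the 1 MB/s default.
      long perSec = conf.getLong("dfs.balance.bandwidthPerSec", 1048576L);
      System.out.printf("balancer bandwidth: %d bytes/s (~%.2f GiB/s)%n",
          perSec, perSec / (1024.0 * 1024.0 * 1024.0));
    }
  }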


On 24 March 2013 11:32, Tapas Sarangi tapas.sara...@gmail.com wrote:

 Yes, we are running the balancer, though a balancer process runs for almost
 a day or more before exiting and starting over.
 The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume
 that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it
 is in bits then we have a problem.
 What's the unit for dfs.balance.bandwidthPerSec?

 -

 On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) 
 li...@balajin.net wrote:

 Are you running balancer? If balancer is running and if it is slow, try
 increasing the balancer bandwidth


-- 
http://balajin.net/blog
http://flic.kr/balajijegan


Re: Cluster lost IP addresses

2013-03-22 Thread பாலாஜி நாராயணன்
Assuming you are using hostnames and not IP addresses in your config files,
what happens when you start the cluster? If you are using IP addresses in
your configs, just update them and start. It should work with no issues.
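A minimal sketch of checking what the loaded configuration actually points at
(the key names assume the classic fs.default.name, with fs.defaultFS as the
newer equivalent); if these print IP addresses, those are the values to update:

  import org.apache.hadoop.conf.Configuration;

  public class CheckNameNodeAddress {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Prints whatever core-site.xml on the classpath resolves to.
      System.out.println("fs.default.name = " + conf.get("fs.default.name"));
      System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
    }
  }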

On Friday, March 22, 2013, John Meza wrote:

 I have an 18-node cluster that had to be physically moved.
 Unfortunately, all the IP addresses were lost (recreated).

 This must have happened to someone before.
 Nothing else on the machines has been changed. Most importantly, the data
 in HDFS is still sitting there.

 Is there a way to recover this cluster to a usable state?
 thanks
 John



-- 
http://balajin.net/blog
http://flic.kr/balajijegan


Re: HDFS disk space requirement

2013-01-10 Thread பாலாஜி நாராயணன்
If the replication factor is 5, you will need at least 5x the size of the
file: 115 GB x 5 = 575 GB, far more than the 130 GB available. So this is
not going to be enough.

On Thursday, January 10, 2013, Panshul Whisper wrote:

 Hello,

 I have a Hadoop cluster of 5 nodes with a total available HDFS space of
 130 GB, with replication set to 5.
 I have a file of 115 GB, which needs to be copied to HDFS and processed.
 Do I need any more HDFS space to perform all the processing without
 running into problems, or is this space sufficient?

 --
 Regards,
 Ouch Whisper
 010101010101



-- 
http://balajin.net/blog
http://flic.kr/balajijegan


Re: HDFS HA IO Fencing

2012-10-27 Thread பாலாஜி நாராயணன்
If you use NFSv4 you should be able to use locks, and when a machine dies or
fails to renew its lease, the other machine can take over.

On Friday, October 26, 2012, Todd Lipcon wrote:

 NFS Locks typically last forever if you disconnect abruptly. So they are
 not sufficient -- your standby wouldn't be able to take over without manual
 intervention to remove the lock.

 If you want to build an unreliable system that might corrupt your data,
 you could set up 'shell(/bin/true)' as a second fencing method. But, it's
 really a bad idea. There are failure scenarios which could cause split
 brain if you do this, and you'd very likely lose data.
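For concreteness, a minimal sketch of the configuration being described (and
warned against) here. The dfs.ha.fencing.methods property and the
sshfence/shell fencing methods are standard, but listing shell(/bin/true) as a
fallback is exactly the unsafe shortcut discussed above. Shown as Java only for
illustration -- normally this lives in hdfs-site.xml:

  import org.apache.hadoop.conf.Configuration;

  public class FencingConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Fencing methods are tried in order, one per line. The first does real
      // fencing over SSH; the second, shell(/bin/true), always "succeeds",
      // which is what makes split-brain possible.
      conf.set("dfs.ha.fencing.methods", "sshfence\nshell(/bin/true)");
      System.out.println(conf.get("dfs.ha.fencing.methods"));
    }
  }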

 -Todd

 On Fri, Oct 26, 2012 at 1:59 AM, lei liu liulei...@gmail.com wrote:

 We are using NFS for shared storage. Can we use the Linux nfslock service
 to implement IO fencing?


 2012/10/26 Steve Loughran ste...@hortonworks.com



 On 25 October 2012 14:08, Todd Lipcon t...@cloudera.com wrote:

 Hi Liu,

 Locks are not sufficient, because there is no way to enforce a lock in
 a distributed system without unbounded blocking. What you might be
 referring to is a lease, but leases are still problematic unless you can
 put bounds on the speed with which clocks progress on different machines,
 _and_ have strict guarantees on the way each node's scheduler works. With
 Linux and Java, the latter is tough.


 on any OS running in any virtual environment, including EC2, time is
 entirely unpredictable, just to make things worse.


 On a single machine you can use file locking, as the OS will know that the
 process is dead and closes the file; other programs can attempt to open the
 same file with exclusive locking and, by getting the right failures, know
 that something else has the file, hence the other process is live.
 Shared NFS storage you need to mount with soft locking set, precisely to
 stop file locks lasting until some lease has expired, because the on-host
 liveness probes detect failure faster and want to react to it.


 -Steve





 --
 Todd Lipcon
 Software Engineer, Cloudera



-- 
Thanks
-balaji

--
http://balajin.net/blog/
http://flic.kr/balajijegan