Re: how to control (or understand) the memory usage in hdfs
oh, really? ulimit -n is 2048; I'd assumed that would be sufficient for just testing on my machine. I was going to use 4096 in production. My hdfs-site.xml has dfs.datanode.max.xcievers set to 4096.

As for my logs... there are a lot of INFO entries; I haven't gotten around to configuring it down yet, and I'm not quite sure why it's so extensive at INFO level. My log file is 4.4gb (is this a sign I've configured or done something wrong?). I grep -v INFO in the log to get the actual error entries (assuming the stack trace is actually on the same line, or else those stack lines may be misleading):

2013-03-23 15:11:43,653 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1419421989-192.168.1.5-50010-1363780956652, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to: java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:691)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:133)
    at java.lang.Thread.run(Thread.java:722)

2013-03-23 15:11:44,177 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1419421989-192.168.1.5-50010-1363780956652, infoPort=50075, ipcPort=50020):DataXceiver java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:292)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:339)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:403)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:581)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:406)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
    at java.lang.Thread.run(Thread.java:722)

On 3/23/13, Harsh J ha...@cloudera.com wrote: I'm guessing your OutOfMemory then is due to an "unable to create native thread" message? Do you mind sharing your error logs with us? Because if it's that, then it's a ulimit/system limits issue and not a real memory issue.

On Sat, Mar 23, 2013 at 2:30 PM, Ted r6squee...@gmail.com wrote: I just checked, and after running my tests I generate only 670mb of data, on 89 blocks. What's more, when I ran the test this time I had increased my memory to 2048mb, so it completed fine - but I decided to run jconsole through the test so I could see what's happening. The data node never exceeded 200mb of memory usage. It mostly stayed under 100mb. I'm not sure why it would complain about out of memory and shut itself down when it was only 1024. It was fairly consistently doing that the last few days, including this morning right before I switched it to 2048.
I'm going to run the test again with 1024mb and jconsole running; none of this makes any sense to me.

On 3/23/13, Harsh J ha...@cloudera.com wrote: I run a 128 MB heap size DN for my simple purposes on my Mac and it runs well for what load I apply on it. A DN's primary, growing memory consumption comes from the # of blocks it carries. All of these blocks' file paths are mapped and kept in RAM during its lifetime. If your DN has acquired a lot of blocks by now, say close to a million or more, then 1 GB may not suffice anymore to hold them, and you'd need to scale up (add more RAM, or increase the heap size if you have more RAM) or scale out (add another node and run the balancer).

On Sat, Mar 23, 2013 at 10:03 AM, Ted r6squee...@gmail.com wrote: Hi, I'm new to hadoop/hdfs and I'm just running some tests on my local machine in a single node setup. I'm encountering out of memory errors on the jvm running my data node. I'm pretty sure I can just increase the heap size to fix the errors, but my question is about how memory is actually used. As an example, with other things like an OS's disk cache or, say, databases, if you have or let it use, as an example, 1gb of ram, it will work with what it has available; if the data is more than 1gb of ram it just means it'll swap in and out of memory/disk more
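As a side note on Harsh's point that this particular OutOfMemoryError is a system-limits problem rather than a heap problem: the following is a minimal sketch (not from the original thread) that reproduces "unable to create new native thread". Creation fails once the OS refuses to give the JVM another native thread (ulimit -u / max user processes, available native stack space), regardless of how large -Xmx is.

    public class ThreadLimitDemo {
        public static void main(String[] args) {
            long count = 0;
            try {
                // Cap the attempt so the demo stops even on very permissive systems.
                while (count < 100000) {
                    Thread t = new Thread(new Runnable() {
                        public void run() {
                            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) {}
                        }
                    });
                    t.setDaemon(true); // let the JVM exit even though the threads never finish
                    t.start();
                    count++;
                }
                System.out.println("Created " + count + " threads without hitting a limit.");
            } catch (OutOfMemoryError e) {
                // Typically "unable to create new native thread": an OS/ulimit limit, not heap.
                System.out.println("Failed after " + count + " threads: " + e.getMessage());
            }
        }
    }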
Best practices to learn Hadoop for new users
Re: DistributedCache - why not read directly from HDFS?
Thanks for your reply Harsh. So if I want to read a simple text file, choosing whether to use DistributedCache or HDFS becomes just a matter of performance. Alberto On 23 March 2013 16:17, Harsh J ha...@cloudera.com wrote: The DistributedCache is not used just to distribute simple files but also native libraries and such, which cannot be loaded directly if they are on HDFS. Also, keeping a file on HDFS could prove less performant, as non-local reads could happen (depending on the file's replication factor). On Sat, Mar 23, 2013 at 8:23 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Hi all, I was not able to find an answer to the following question. If the question has already been answered please give me the pointer to the right thread. What are actually the differences between reading a file from HDFS in a mapper and using the DistributedCache? I saw that with the DistributedCache you can give an HDFS path and the task nodes will get the data on the local file system. But what advantages do we have compared with a simple HDFS read with the FSDataInputStream.open() method? Thank you very much, Alberto -- Alberto Cordioli -- Harsh J -- Alberto Cordioli
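To make the comparison concrete, here is a small hedged sketch (Hadoop 1.x APIs; the class name and the /data/lookup.txt path are made up for illustration) of the two approaches being discussed: opening the file from HDFS inside the task, versus registering it in the DistributedCache so each task node gets a local copy.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SideFileAccess {

        // Option 1: read straight from HDFS inside the task. Every task opens the
        // file over the network unless a replica happens to be local.
        static BufferedReader openFromHdfs(Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path("/data/lookup.txt")); // hypothetical path
            return new BufferedReader(new InputStreamReader(in));
        }

        // Option 2: register the file in the DistributedCache at job-submission time;
        // the framework copies it once per node and tasks read it from local disk.
        static void registerInCache(Configuration conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/data/lookup.txt"), conf);
        }

        // Inside a task: resolve the local filesystem copy placed by the framework.
        static Path firstCachedFile(Configuration conf) throws Exception {
            Path[] local = DistributedCache.getLocalCacheFiles(conf);
            return local[0];
        }
    }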
Child JVM memory allocation / Usage
Hi, I configured my child jvm heap to 2 GB. So, I thought I could really read 1.5GB of data and store it in memory (mapper/reducer). I wanted to confirm the same and wrote the following piece of code in the configure method of the mapper:

@Override
public void configure(JobConf job) {
    System.out.println("FREE MEMORY -- " + Runtime.getRuntime().freeMemory());
    System.out.println("MAX MEMORY --- " + Runtime.getRuntime().maxMemory());
}

Surprisingly the output was:

FREE MEMORY -- 341854864 = 320 MB
MAX MEMORY --- 1908932608 = 1.9 GB

I am just wondering what processes are taking up that extra 1.6GB of the heap which I configured for the child jvm. I'd appreciate help understanding the scenario. Regards, Nagarjuna K
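One likely explanation: Runtime.freeMemory() reports free space within the heap the JVM has committed so far (totalMemory()), not within the -Xmx ceiling (maxMemory()), so the gap does not mean 1.6 GB is already in use; the JVM simply has not grown the heap yet. A small sketch (not from the original mail) that prints all three values and a rough "still allocatable" estimate:

    public class HeapProbe {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long max   = rt.maxMemory();    // -Xmx ceiling (about 1.9 GB for a 2 GB child heap)
            long total = rt.totalMemory();  // heap the JVM has committed so far (grows from -Xms)
            long free  = rt.freeMemory();   // free space within that committed heap only
            long available = max - total + free; // rough headroom before an OutOfMemoryError
            System.out.println("max=" + mb(max) + " total=" + mb(total)
                    + " free=" + mb(free) + " available~=" + mb(available));
        }
        private static String mb(long bytes) { return (bytes / (1024 * 1024)) + " MB"; }
    }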
2 Reduce method in one Job
I want to get reduce output as key and value then I want to pass them to a new reduce as input key and input value. So is there any Map-Reduce-Reduce kind of method? Thanks to all.
Re: 2 Reduce method in one Job
There isn't such a method; you have to submit another MR job. On Mar 24, 2013 9:03 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: I want to get reduce output as key and value then I want to pass them to a new reduce as input key and input value. So is there any Map-Reduce-Reduce kind of method? Thanks to all.
Re: 2 Reduce method in one Job
You seem to want to re-sort/partition your data without materializing it onto HDFS. Azuryy is right: there isn't a way right now, and a second job (with an identity mapper) is necessary. With YARN, though, it becomes more feasible to implement this in the project. The newly inducted incubator project Tez sort of targets this. It's in its nascent stages (for general user use), and the website should hopefully appear at http://incubator.apache.org/projects/tez.html soon. Meanwhile, you can read the proposal behind this project at http://wiki.apache.org/incubator/TezProposal. Initial sources are at https://svn.apache.org/repos/asf/incubator/tez/trunk/. On Sun, Mar 24, 2013 at 6:33 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: I want to get reduce output as key and value then I want to pass them to a new reduce as input key and input value. So is there any Map-Reduce-Reduce kind of method? Thanks to all. -- Harsh J
Re: 2 Reduce method in one Job
Thank you very much. You are right Harsh, it is exactly what I am trying to do. I want to process my result according to the keys, and I do not want to spend time writing this data to HDFS; I want to pass the data as input to another reduce. One more question then: creating 2 different jobs, where the second one has only a reduce, for example, is it possible to pass the first job's output as an argument to the second job? On Sun, Mar 24, 2013 at 5:44 PM, Harsh J ha...@cloudera.com wrote: You seem to want to re-sort/partition your data without materializing it onto HDFS. Azuryy is right: there isn't a way right now, and a second job (with an identity mapper) is necessary. -- Harsh J
Re: 2 Reduce method in one Job
Yes, just use an identity mapper (in the new API, the base Mapper class itself identity-maps; in the old API use the IdentityMapper class) and set the input path to the output path of the first job. If you'll be ending up doing more such step-wise job chaining, consider using Apache Oozie's workflow system. On Sun, Mar 24, 2013 at 7:23 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: One more question then: creating 2 different jobs, where the second one has only a reduce, for example, is it possible to pass the first job's output as an argument to the second job? -- Harsh J
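For reference, here is a hedged sketch of the reduce-only second job Harsh describes, using the new mapreduce API. It assumes the first job wrote its output as a SequenceFile of Text/Text pairs so that the identity mapper (the plain Mapper class) re-emits the same keys and values; the driver class, reducer logic, and paths are placeholders, not from the original mails.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SecondStageDriver {

        // Placeholder second-stage reducer: concatenates the values seen for each key.
        public static class SecondReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws java.io.IOException, InterruptedException {
                StringBuilder sb = new StringBuilder();
                for (Text v : values) {
                    sb.append(v.toString()).append(' ');
                }
                ctx.write(key, new Text(sb.toString().trim()));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "reduce-only second stage");
            job.setJarByClass(SecondStageDriver.class);
            job.setMapperClass(Mapper.class);               // identity map: records pass through
            job.setReducerClass(SecondReducer.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));  // first job's output dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }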
Re: disk used percentage is not symmetric on datanodes (balancer)
Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.

On 24 March 2013 09:21, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks for the follow up. I don't know whether an attachment will pass through this mailing list, but I am attaching a pdf that contains the usage of all live nodes. All nodes starting with letter g are the ones with smaller storage space, whereas nodes starting with letter s have larger storage space. As you will see, most of the gXX nodes are completely full whereas the sXX nodes have a lot of unused space. Recently, we are facing a crisis frequently as 'hdfs' goes into a mode where it is not able to write any further even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but we don't understand the problem yet. Maybe the attached PDF will help some of you (experts) to see what is going wrong here... Thanks

-- The balancer knows about topology, but when calculating balancing it operates only with nodes, not with racks. You can see how it works in Balancer.java, in BalancerDatanode, around line 509. I was wrong about 350Tb/35Tb; it calculates it in this way. For example: cluster_capacity=3.5Pb, cluster_dfsused=2Pb, avgutil=cluster_dfsused/cluster_capacity*100=57.14% used cluster capacity. Then we know the node utilization (node_dfsused/node_capacity*100). The balancer thinks everything is fine if avgutil+10 >= node_utilization >= avgutil-10. The ideal case is that every node uses avgutil of its capacity, but for a 12 TB node that is only about 6.5 TB and for a 72 TB node it is about 40 TB. The balancer can't help you. Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can. In the ideal case with replication factor 2, with two nodes of 12Tb and 72Tb you will be able to have only 12Tb of replicated data.

Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.

The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and racks must have identical capacity. For example: rack1: 1 node with 72Tb; rack2: 6 nodes with 12Tb; rack3: 3 nodes with 24Tb. It helps with balancing, because a duplicated block must be on another rack.

The same question I asked earlier in this message: does multiple racks with the default threshold for the balancer minimize the difference between racks?

Why did you select hdfs? Maybe lustre, cephfs or others are a better choice.

It wasn't my decision, and I probably can't change it now. I am new to this cluster and trying to understand a few issues. I will explore other options as you mentioned. -- http://balajin.net/blog http://flic.kr/balajijegan
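To spell out the threshold arithmetic quoted above, here is a small sketch (not from the original mails; the numbers come from the example in the thread): at a cluster average of ~57% utilization, a 12 TB node holding 6.5 TB and a 72 TB node holding 40 TB are both within the default 10-point threshold, so the balancer moves nothing even though the absolute free space differs by a factor of about six.

    public class BalancerMath {
        public static void main(String[] args) {
            double clusterCapacityTb = 3500;   // 3.5 PB, from the example in the thread
            double clusterUsedTb = 2000;       // 2 PB
            double avgUtil = clusterUsedTb / clusterCapacityTb * 100.0;  // ~57.14 %
            double threshold = 10.0;           // balancer default

            // {capacity TB, used TB}: the 12 TB and 72 TB nodes from the example
            double[][] nodes = { {12, 6.5}, {72, 40} };
            for (double[] n : nodes) {
                double util = n[1] / n[0] * 100.0;
                boolean withinThreshold = Math.abs(util - avgUtil) <= threshold;
                System.out.printf(
                    "capacity=%.0f TB used=%.1f TB free=%.1f TB util=%.1f%% balanced=%b%n",
                    n[0], n[1], n[0] - n[1], util, withinThreshold);
            }
            // Both nodes count as "balanced" by percentage even though one has ~5.5 TB free
            // and the other ~32 TB free, which is the asymmetry discussed in this thread.
        }
    }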
Re: question for committer
Is there a reason why you don't want to run MRv2 under yarn? On 22 March 2013 22:49, Azuryy Yu azury...@gmail.com wrote: Is there a way to separate hdfs2 from hadoop2? I want to use hdfs2 and mapreduce 1.0.4 and exclude yarn, because I need HDFS-HA. -- http://balajin.net/blog http://flic.kr/balajijegan
Re: disk used percentage is not symmetric on datanodes (balancer)
Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over. The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it is in bits then we have a problem. What's the unit for dfs.balance.bandwidthPerSec? - On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) li...@balajin.net wrote: Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.
Re: disk used percentage is not symmetric on datanodes (balancer)
-setBalancerBandwidth <bandwidth in bytes per second> So the value is in bytes per second. If it is running and exiting, it means it has completed the balancing. On 24 March 2013 11:32, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over. The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. What's the unit for dfs.balance.bandwidthPerSec? -- http://balajin.net/blog http://flic.kr/balajijegan
Re: disk used percentage is not symmetric on datanodes (balancer)
Yes, thanks for pointing that out, but I already know that it is completing the balancing when exiting; otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message: 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster, even though df shows the cluster has about 500 TB of free space. --- On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) bal...@balajin.net wrote: -setBalancerBandwidth <bandwidth in bytes per second> So the value is in bytes per second. If it is running and exiting, it means it has completed the balancing.
Re: disk used percentage is not symmetric on datanodes (balancer)
On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set of drives, or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of available capacity incorrectly, in that it assumes a 1-1 relationship between folder and total capacity of the drive. On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, thanks for pointing that out, but I already know that it is completing the balancing when exiting; otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message: 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster, even though df shows the cluster has about 500 TB of free space.
Re: disk used percentage is not symmetric on datanodes (balancer)
Thanks. We have a 1-1 configuration of drives and folders in all the datanodes. -Tapas On Mar 24, 2013, at 3:29 PM, Jamal B jm151...@gmail.com wrote: On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set of drives, or is it 1-1 between folder and drive?
Re: disk used percentage is not symmetric on datanodes (balancer)
Then I think the only way around this would be to decommission the smaller nodes, one at a time, and ensure that the blocks are moved to the larger nodes. And once complete, bring the smaller nodes back in, but maybe only after you tweak the rack topology to match your disk layout more than the network layout, to compensate for the unbalanced nodes. Just my 2 cents. On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks. We have a 1-1 configuration of drives and folders in all the datanodes. -Tapas
Re: disk used percentage is not symmetric on datanodes (balancer)
You said that threshold=10. Run the command manually: hadoop balancer -threshold 9.5, then 9, and so on with a 0.5 step. On Sun, Mar 24, 2013 at 11:01 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, thanks for pointing that out, but I already know that it is completing the balancing when exiting; otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message: 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster, even though df shows the cluster has about 500 TB of free space.
Re: disk used percentage is not symmetric on datanodes (balancer)
Hi, Thanks for the idea, I will give this a try and report back. My worry is, if we decommission a small node (one at a time), will it move the data to larger nodes or choke other smaller nodes? In principle it should distribute the blocks; the point is it is not distributing the way we expect it to, so do you think this may cause further problems? - On Mar 24, 2013, at 3:37 PM, Jamal B jm151...@gmail.com wrote: Then I think the only way around this would be to decommission the smaller nodes, one at a time, and ensure that the blocks are moved to the larger nodes. And once complete, bring the smaller nodes back in, but maybe only after you tweak the rack topology to match your disk layout more than the network layout, to compensate for the unbalanced nodes. Just my 2 cents.
Re: disk used percentage is not symmetric on datanodes (balancer)
On Mar 24, 2013, at 3:40 PM, Alexey Babutin zorlaxpokemon...@gmail.com wrote: You said that threshold=10. Run the command manually: hadoop balancer -threshold 9.5, then 9, and so on with a 0.5 step. We are not setting the threshold anywhere in our configuration and are thus using the default, which I believe is 10. Why do you suggest such steps need to be tested for the balancer? Please explain. I guess we had a discussion earlier on this thread and came to the conclusion that the threshold will not help in this situation. -
Re: disk used percentage is not symmetric on datanodes (balancer)
I think that it may help, but start with 1 node and watch where the data has moved. On Mon, Mar 25, 2013 at 12:44 AM, Tapas Sarangi tapas.sara...@gmail.com wrote: Hi, Thanks for the idea, I will give this a try and report back. My worry is, if we decommission a small node (one at a time), will it move the data to larger nodes or choke other smaller nodes?
Re: disk used percentage is not symmetric on datanodes (balancer)
It shouldn't cause further problems since most of your small nodes are already their capacity. You could set or increase the dfs reserved property on your smaller nodes to force the flow of blocks onto the larger nodes. On Mar 24, 2013 4:45 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Hi, Thanks for the idea, I will give this a try and report back. My worry is if we decommission a small node (one at a time), will it move the data to larger nodes or choke another smaller nodes ? In principle it should distribute the blocks, the point is it is not distributing the way we expect it to, so do you think this may cause further problems ? - On Mar 24, 2013, at 3:37 PM, Jamal B jm151...@gmail.com wrote: Then I think the only way around this would be to decommission 1 at a time, the smaller nodes, and ensure that the blocks are moved to the larger nodes. And once complete, bring back in the smaller nodes, but maybe only after you tweak the rack topology to match your disk layout more than network layout to compensate for the unbalanced nodes. Just my 2 cents On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi tapas.sara...@gmail.comwrote: Thanks. We have a 1-1 configuration of drives and folder in all the datanodes. -Tapas On Mar 24, 2013, at 3:29 PM, Jamal B jm151...@gmail.com wrote: On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set's of drives or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of available capacity incorrectly in that it assumes a 1-1 relationship between folder and total capacity of the drive. On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi tapas.sara...@gmail.comwrote: Yes, thanks for pointing, but I already know that it is completing the balancing when exiting otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message. 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster even though df shows the cluster has about 500 TB of free space. --- On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) bal...@balajin.net wrote: -setBalancerBandwidth bandwidth in bytes per second So the value is bytes per second. If it is running and exiting,it means it has completed the balancing. On 24 March 2013 11:32, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, we are running balancer, though a balancer process runs for almost a day or more before exiting and starting over. Current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes so about 2 GigaByte/sec. Shouldn't that be reasonable ? If it is in Bits then we have a problem. What's the unit for dfs.balance.bandwidthPerSec ? - On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) li...@balajin.net wrote: Are you running balancer? If balancer is running and if it is slow, try increasing the balancer bandwidth On 24 March 2013 09:21, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks for the follow up. I don't know whether attachment will pass through this mailing list, but I am attaching a pdf that contains the usage of all live nodes. All nodes starting with letter g are the ones with smaller storage space where as nodes starting with letter s have larger storage space. As you will see, most of the gXX nodes are completely full whereas sXX nodes have a lot of unused space. 
Recently, we are facing crisis frequently as 'hdfs' goes into a mode where it is not able to write any further even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but don't understand the problem yet. May be the attached PDF will help some of you (experts) to see what is going wrong here... Thanks -- Balancer know about topology,but when calculate balancing it operates only with nodes not with racks. You can see how it work in Balancer.java in BalancerDatanode about string 509. I was wrong about 350Tb,35Tb it calculates in such way : For example: cluster_capacity=3.5Pb cluster_dfsused=2Pb avgutil=cluster_dfsused/cluster_capacity*100=57.14% used cluster capacity Then we know avg node utilization (node_dfsused/node_capacity*100) .Balancer think that all good if avgutil +10node_utilizazation=avgutil-10. Ideal case that all node used avgutl of capacity.but for 12TB node its only 6.5Tb and for 72Tb its about 40Tb. Balancer cant help you. Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can. In ideal case with replication factor 2 ,with two nodes 12Tb and 72Tb you will be able to have only 12Tb replication data. Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.
Re: question for commetter
Good question; I just want HA and don't want to change more configuration. On Mar 25, 2013 2:32 AM, Balaji Narayanan (பாலாஜி நாராயணன்) li...@balajin.net wrote: Is there a reason why you don't want to run MRv2 under YARN? On 22 March 2013 22:49, Azuryy Yu azury...@gmail.com wrote: Is there a way to separate HDFS 2 from Hadoop 2? I want to use HDFS 2 and MapReduce 1.0.4 and exclude YARN, because I need HDFS HA. -- http://balajin.net/blog http://flic.kr/balajijegan
Re: disk used percentage is not symmetric on datanodes (balancer)
Thanks. Does this need a restart of hadoop in the nodes where this modification is made ? - On Mar 24, 2013, at 8:06 PM, Jamal B jm151...@gmail.com wrote: dfs.datanode.du.reserved You could tweak that param on the smaller nodes to force the flow of blocks to other nodes. A short term hack at best, but should help the situation a bit. On Mar 24, 2013 7:09 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: On Mar 24, 2013, at 4:34 PM, Jamal B jm151...@gmail.com wrote: It shouldn't cause further problems since most of your small nodes are already their capacity. You could set or increase the dfs reserved property on your smaller nodes to force the flow of blocks onto the larger nodes. Thanks. Can you please specify which are the dfs properties that we can set or modify to force the flow of blocks directed towards the larger nodes than the smaller nodes ? - On Mar 24, 2013 4:45 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Hi, Thanks for the idea, I will give this a try and report back. My worry is if we decommission a small node (one at a time), will it move the data to larger nodes or choke another smaller nodes ? In principle it should distribute the blocks, the point is it is not distributing the way we expect it to, so do you think this may cause further problems ? - On Mar 24, 2013, at 3:37 PM, Jamal B jm151...@gmail.com wrote: Then I think the only way around this would be to decommission 1 at a time, the smaller nodes, and ensure that the blocks are moved to the larger nodes. And once complete, bring back in the smaller nodes, but maybe only after you tweak the rack topology to match your disk layout more than network layout to compensate for the unbalanced nodes. Just my 2 cents On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks. We have a 1-1 configuration of drives and folder in all the datanodes. -Tapas On Mar 24, 2013, at 3:29 PM, Jamal B jm151...@gmail.com wrote: On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set's of drives or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of available capacity incorrectly in that it assumes a 1-1 relationship between folder and total capacity of the drive. On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, thanks for pointing, but I already know that it is completing the balancing when exiting otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message. 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster even though df shows the cluster has about 500 TB of free space. --- On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) bal...@balajin.net wrote: -setBalancerBandwidth bandwidth in bytes per second So the value is bytes per second. If it is running and exiting,it means it has completed the balancing. On 24 March 2013 11:32, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, we are running balancer, though a balancer process runs for almost a day or more before exiting and starting over. Current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes so about 2 GigaByte/sec. Shouldn't that be reasonable ? If it is in Bits then we have a problem. What's the unit for dfs.balance.bandwidthPerSec ? - On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) li...@balajin.net wrote: Are you running balancer? 
If balancer is running and if it is slow, try increasing the balancer bandwidth On 24 March 2013 09:21, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks for the follow up. I don't know whether attachment will pass through this mailing list, but I am attaching a pdf that contains the usage of all live nodes. All nodes starting with letter g are the ones with smaller storage space where as nodes starting with letter s have larger storage space. As you will see, most of the gXX nodes are completely full whereas sXX nodes have a lot of unused space. Recently, we are facing crisis frequently as 'hdfs' goes into a mode where it is not able to write any further even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but don't understand the problem yet. May be the attached PDF will help some of you (experts) to see what is going wrong here... Thanks -- Balancer know about topology,but when calculate balancing it operates only with nodes not with racks. You can see how it work in Balancer.java in BalancerDatanode about string 509. I was wrong about 350Tb,35Tb it calculates in such way :
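As a side note on the dfs.datanode.du.reserved suggestion in the thread above: the value (configured per datanode) is subtracted from what the DataNode advertises as remaining space, so raising it on the small nodes makes them look full sooner and steers new blocks toward the larger nodes. A rough sketch of the accounting with made-up numbers (the actual bookkeeping inside the DataNode is more involved; this only shows the shape of the calculation):

public class ReservedSpaceExample {
    public static void main(String[] args) {
        long tb = 1024L * 1024 * 1024 * 1024;

        long diskCapacity = 12 * tb; // a small 12 TB datanode
        long dfsUsed      = 11 * tb; // almost full
        long duReserved   = 1 * tb;  // hypothetical dfs.datanode.du.reserved value

        // Roughly what the node ends up advertising as remaining for new blocks:
        long remaining = Math.max(0, diskCapacity - dfsUsed - duReserved);

        System.out.println("Remaining bytes advertised for new blocks: " + remaining);
        // With 1 TB reserved this node advertises ~0 bytes free, so the NameNode
        // stops choosing it as a target and writes flow to the larger nodes.
    }
}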
Re: Child JVM memory allocation / Usage
Did you set the min heap size == your max heap size? If you didn't, free memory only shows you the difference between used and committed, not used and max. On 3/24/13, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi, I configured my child JVM heap to 2 GB. So, I thought I could really read 1.5 GB of data and store it in memory (mapper/reducer). I wanted to confirm the same and wrote the following piece of code in the configure method of the mapper. @Override public void configure(JobConf job) { System.out.println("FREE MEMORY -- " + Runtime.getRuntime().freeMemory()); System.out.println("MAX MEMORY --- " + Runtime.getRuntime().maxMemory()); } Surprisingly the output was FREE MEMORY -- 341854864 = 320 MB, MAX MEMORY --- 1908932608 = 1.9 GB. I am just wondering what processes are taking up that extra 1.6 GB of heap which I configured for the child JVM. Appreciate your help in understanding the scenario. Regards, Nagarjuna K -- Ted.
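A small Java illustration of Ted's point (the exact numbers depend on your JVM and -Xms setting): Runtime.freeMemory() is measured against the currently committed heap, not against -Xmx, so a low "free" value right after startup does not mean the 2 GB heap is already consumed.

public class HeapNumbers {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();

        long max   = rt.maxMemory();    // roughly -Xmx (what the heap may grow to)
        long total = rt.totalMemory();  // currently committed heap (starts near -Xms)
        long free  = rt.freeMemory();   // free space within the committed heap only

        // A better estimate of how much the task could still allocate:
        long effectivelyFree = max - (total - free);

        System.out.println("max=" + max + " committed=" + total + " free=" + free);
        System.out.println("effectively free (max - used) = " + effectivelyFree);
        // With -Xmx2048m but a small default -Xms, 'free' can report only a few
        // hundred MB right after startup even though close to 2 GB is still
        // available for allocation.
    }
}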
Re: Child JVM memory allocation / Usage
Hi Ted, As far as I can recollect, I only configured these parameters:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
  <description>This number is the number of megabytes of memory that each mapper and each reducer will have available to use. If jobs start running out of heap space, this may need to be increased.</description>
</property>

<property>
  <name>mapred.child.ulimit</name>
  <value>3145728</value>
  <description>This number is the number of kilobytes of memory that each mapper and each reducer will have available to use. If jobs start running out of heap space, this may need to be increased.</description>
</property>

On Mon, Mar 25, 2013 at 6:57 AM, Ted r6squee...@gmail.com wrote: Did you set the min heap size == your max heap size? If you didn't, free memory only shows you the difference between used and committed, not used and max. On 3/24/13, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi, I configured my child JVM heap to 2 GB. So, I thought I could really read 1.5 GB of data and store it in memory (mapper/reducer). I wanted to confirm the same and wrote the following piece of code in the configure method of the mapper. @Override public void configure(JobConf job) { System.out.println("FREE MEMORY -- " + Runtime.getRuntime().freeMemory()); System.out.println("MAX MEMORY --- " + Runtime.getRuntime().maxMemory()); } Surprisingly the output was FREE MEMORY -- 341854864 = 320 MB, MAX MEMORY --- 1908932608 = 1.9 GB. I am just wondering what processes are taking up that extra 1.6 GB of heap which I configured for the child JVM. Appreciate your help in understanding the scenario. Regards, Nagarjuna K -- Ted.
[no subject]
Dear Sir, I have a question about Hadoop. When I use Hadoop and MapReduce to finish a job (only one job here), can I control which node each file is processed on? For example, I have only one job and this job has 10 files (10 mappers need to run). Also, among my servers I have one head node and four worker nodes. My question is: can I control which node each of those 10 files is processed on? For example: file No.1 on node1, file No.3 on node2, file No.5 on node3 and file No.8 on node4. If I can do this, that means I can control the tasks. Does that also mean I can still control the files in the next round (I have a loop on the head node, so I can run another MapReduce job)? For example, I could have file No.5 processed on node3 in the 1st round and then have file No.5 processed on node2 in the 2nd round. If I cannot, does that mean that, for Hadoop, which node a file is processed on is a “black box”: the user cannot control it, because you think the user does not need to control it and should just let HDFS handle the parallel work? In that case Hadoop cannot control the tasks within one job, but can control multiple jobs. Thank you so much! Fan Bai PhD Candidate Computer Science Department Georgia State University Atlanta, GA 30303
Re: disk used percentage is not symmetric on datanodes (balancer)
Yes On Mar 24, 2013 9:25 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks. Does this need a restart of hadoop in the nodes where this modification is made ? - On Mar 24, 2013, at 8:06 PM, Jamal B jm151...@gmail.com wrote: dfs.datanode.du.reserved You could tweak that param on the smaller nodes to force the flow of blocks to other nodes. A short term hack at best, but should help the situation a bit. On Mar 24, 2013 7:09 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: On Mar 24, 2013, at 4:34 PM, Jamal B jm151...@gmail.com wrote: It shouldn't cause further problems since most of your small nodes are already their capacity. You could set or increase the dfs reserved property on your smaller nodes to force the flow of blocks onto the larger nodes. Thanks. Can you please specify which are the dfs properties that we can set or modify to force the flow of blocks directed towards the larger nodes than the smaller nodes ? - On Mar 24, 2013 4:45 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Hi, Thanks for the idea, I will give this a try and report back. My worry is if we decommission a small node (one at a time), will it move the data to larger nodes or choke another smaller nodes ? In principle it should distribute the blocks, the point is it is not distributing the way we expect it to, so do you think this may cause further problems ? - On Mar 24, 2013, at 3:37 PM, Jamal B jm151...@gmail.com wrote: Then I think the only way around this would be to decommission 1 at a time, the smaller nodes, and ensure that the blocks are moved to the larger nodes. And once complete, bring back in the smaller nodes, but maybe only after you tweak the rack topology to match your disk layout more than network layout to compensate for the unbalanced nodes. Just my 2 cents On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi tapas.sara...@gmail.comwrote: Thanks. We have a 1-1 configuration of drives and folder in all the datanodes. -Tapas On Mar 24, 2013, at 3:29 PM, Jamal B jm151...@gmail.com wrote: On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set's of drives or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of available capacity incorrectly in that it assumes a 1-1 relationship between folder and total capacity of the drive. On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, thanks for pointing, but I already know that it is completing the balancing when exiting otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message. 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster even though df shows the cluster has about 500 TB of free space. --- On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) bal...@balajin.net wrote: -setBalancerBandwidth bandwidth in bytes per second So the value is bytes per second. If it is running and exiting,it means it has completed the balancing. On 24 March 2013 11:32, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, we are running balancer, though a balancer process runs for almost a day or more before exiting and starting over. Current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes so about 2 GigaByte/sec. Shouldn't that be reasonable ? If it is in Bits then we have a problem. What's the unit for dfs.balance.bandwidthPerSec ? 
- On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) li...@balajin.net wrote: Are you running balancer? If balancer is running and if it is slow, try increasing the balancer bandwidth On 24 March 2013 09:21, Tapas Sarangi tapas.sara...@gmail.comwrote: Thanks for the follow up. I don't know whether attachment will pass through this mailing list, but I am attaching a pdf that contains the usage of all live nodes. All nodes starting with letter g are the ones with smaller storage space where as nodes starting with letter s have larger storage space. As you will see, most of the gXX nodes are completely full whereas sXX nodes have a lot of unused space. Recently, we are facing crisis frequently as 'hdfs' goes into a mode where it is not able to write any further even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but don't understand the problem yet. May be the attached PDF will help some of you (experts) to see what is going wrong here... Thanks -- Balancer know about topology,but when calculate balancing it operates only with nodes not with racks. You can see how it work in Balancer.java in BalancerDatanode about string 509. I was wrong about 350Tb,35Tb it calculates in such way : For
Re: is it possible to disable security in MapReduce to avoid having PriviledgedActionException?
What is the exact error you're getting? Can you please paste the full stack trace and the version you are using? Many times the PriviledgedActionException is just a wrapper around the real cause and gets overlooked. It does not necessarily appear due to security code (whether security is enabled or disabled). In any case, if you meant to run MR with zero UGI.doAs calls (which is what wraps with that exception), then no, that's not possible to do. On Mon, Mar 25, 2013 at 12:57 AM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, is it possible to disable security in MapReduce to avoid having PriviledgedActionException? Thanks, -- Harsh J
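To illustrate Harsh's point that the wrapper is rarely the real problem: when Hadoop runs code under UserGroupInformation.doAs, the PriviledgedActionException line in the logs is emitted by the security layer, while the exception your code actually needs to read is the underlying one. A minimal sketch (the work inside run() is only a placeholder, not a specific API):

import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class UnwrapExample {
    public static void main(String[] args) throws Exception {
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        try {
            ugi.doAs(new PrivilegedExceptionAction<Void>() {
                @Override
                public Void run() throws Exception {
                    // ... the real work (job submission, HDFS access) goes here;
                    // this stand-in failure represents whatever actually went wrong.
                    throw new IOException("real underlying failure");
                }
            });
        } catch (IOException e) {
            // This is the exception worth reading and posting to the list; the
            // PriviledgedActionException line in the logs is printed by the
            // security layer around it.
            e.printStackTrace();
        }
    }
}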
Re:Re: disk used percentage is not symmetric on datanodes (balancer)
If the balancer is not running, or is running with a low bandwidth and reacting slowly, I think there may be a significant asymmetry between datanodes. At 2013-03-25 04:37:05, Jamal B jm151...@gmail.com wrote: Then I think the only way around this would be to decommission 1 at a time, the smaller nodes, and ensure that the blocks are moved to the larger nodes. And once complete, bring back in the smaller nodes, but maybe only after you tweak the rack topology to match your disk layout more than network layout to compensate for the unbalanced nodes. Just my 2 cents On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks. We have a 1-1 configuration of drives and folder in all the datanodes. -Tapas On Mar 24, 2013, at 3:29 PM, Jamal B jm151...@gmail.com wrote: On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set's of drives or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of available capacity incorrectly in that it assumes a 1-1 relationship between folder and total capacity of the drive. On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, thanks for pointing, but I already know that it is completing the balancing when exiting otherwise it shouldn't exit. Your answer doesn't solve the problem I mentioned earlier in my message. 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster even though df shows the cluster has about 500 TB of free space. --- On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) bal...@balajin.net wrote: -setBalancerBandwidth bandwidth in bytes per second So the value is bytes per second. If it is running and exiting, it means it has completed the balancing. On 24 March 2013 11:32, Tapas Sarangi tapas.sara...@gmail.com wrote: Yes, we are running balancer, though a balancer process runs for almost a day or more before exiting and starting over. Current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes so about 2 GigaByte/sec. Shouldn't that be reasonable ? If it is in Bits then we have a problem. What's the unit for dfs.balance.bandwidthPerSec ? - On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) li...@balajin.net wrote: Are you running balancer? If balancer is running and if it is slow, try increasing the balancer bandwidth On 24 March 2013 09:21, Tapas Sarangi tapas.sara...@gmail.com wrote: Thanks for the follow up. I don't know whether attachment will pass through this mailing list, but I am attaching a pdf that contains the usage of all live nodes. All nodes starting with letter g are the ones with smaller storage space where as nodes starting with letter s have larger storage space. As you will see, most of the gXX nodes are completely full whereas sXX nodes have a lot of unused space. Recently, we are facing crisis frequently as 'hdfs' goes into a mode where it is not able to write any further even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but don't understand the problem yet. May be the attached PDF will help some of you (experts) to see what is going wrong here... Thanks -- Balancer know about topology,but when calculate balancing it operates only with nodes not with racks. You can see how it work in Balancer.java in BalancerDatanode about string 509.
I was wrong about 350Tb/35Tb; it calculates it this way: For example: cluster_capacity = 3.5 PB, cluster_dfsused = 2 PB, avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% of cluster capacity used. Then we know the average node utilization (node_dfsused / node_capacity * 100). The balancer thinks all is good if avgutil + 10 >= node_utilization >= avgutil - 10. The ideal case is that every node uses avgutil of its capacity, but for a 12 TB node that is only about 6.5 TB and for a 72 TB node it is about 40 TB. The balancer can't help you. Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can. In the ideal case with replication factor 2, with two nodes of 12 TB and 72 TB you will be able to have only 12 TB of replicated data. Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster. The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and racks must have identical capacity. For example: rack1: 1 node with 72 TB; rack2: 6 nodes with 12 TB; rack3: 3 nodes with 24 TB. It helps with balancing, because the duplicated block must be on another rack. The same question I asked earlier in this message: do multiple racks with the default threshold for the balancer minimize the difference between racks ? Why did you select HDFS? Maybe Lustre, CephFS or something else would be a better choice. It wasn't my decision,
Hadoop-2.x native libraries
Hi, How do I get the hadoop-2.0.3-alpha native libraries? In the currently released package they were compiled under a 32-bit OS.
Re: Hadoop-2.x native libraries
If you're using a tarball, you'll need to build a native-added tarball yourself with mvn package -Pdist,native,docs -DskipTests -Dtar and then use that. Alternatively, if you're interested in packages, use the Apache Bigtop's scripts from http://bigtop.apache.org/ project's repository and generate the packages with native libs as well. On Mon, Mar 25, 2013 at 9:27 AM, Azuryy Yu azury...@gmail.com wrote: Hi, How to get hadoop-2.0.3-alpha native libraries, it was compiled under 32bits OS in the released package currently. -- Harsh J
Re: Child JVM memory allocation / Usage
The MapTask may consume some memory of its own as well. What is your io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to? On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi, I configured my child jvm heap to 2 GB. So, I thought I could really read 1.5GB of data and store it in memory (mapper/reducer). I wanted to confirm the same and wrote the following piece of code in the configure method of mapper. @Override public void configure(JobConf job) { System.out.println("FREE MEMORY -- " + Runtime.getRuntime().freeMemory()); System.out.println("MAX MEMORY --- " + Runtime.getRuntime().maxMemory()); } Surprisingly the output was FREE MEMORY -- 341854864 = 320 MB MAX MEMORY --- 1908932608 = 1.9 GB I am just wondering what processes are taking up that extra 1.6GB of heap which I configured for the child jvm heap. Appreciate in helping me understand the scenario. Regards Nagarjuna K -- Harsh J
Re: Child JVM memory allocation / Usage
io.sort.mb = 256 MB On Monday, March 25, 2013, Harsh J wrote: The MapTask may consume some memory of its own as well. What is your io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to? On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi, I configured my child jvm heap to 2 GB. So, I thought I could really read 1.5GB of data and store it in memory (mapper/reducer). I wanted to confirm the same and wrote the following piece of code in the configure method of mapper. @Override public void configure(JobConf job) { System.out.println("FREE MEMORY -- " + Runtime.getRuntime().freeMemory()); System.out.println("MAX MEMORY --- " + Runtime.getRuntime().maxMemory()); } Surprisingly the output was FREE MEMORY -- 341854864 = 320 MB MAX MEMORY --- 1908932608 = 1.9 GB I am just wondering what processes are taking up that extra 1.6GB of heap which I configured for the child jvm heap. Appreciate in helping me understand the scenario. Regards Nagarjuna K -- Harsh J -- Sent from iPhone
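Rough back-of-the-envelope for Harsh's point, using the numbers from this thread (2 GB child heap, io.sort.mb = 256 MB); note that io.sort.mb applies to map tasks, and the framework-overhead figure below is only an assumed placeholder to show the shape of the accounting, not a measured value:

public class MapTaskHeapBudget {
    public static void main(String[] args) {
        long mb = 1024L * 1024;

        long childHeap      = 2048 * mb; // mapred.child.java.opts = -Xmx2048m
        long ioSortBuffer   = 256 * mb;  // io.sort.mb, allocated inside the same heap
        long frameworkGuess = 100 * mb;  // assumed extra for MapTask bookkeeping, buffers, etc.

        long roughlyForUserData = childHeap - ioSortBuffer - frameworkGuess;
        System.out.println("Roughly available for user objects: "
                + roughlyForUserData / mb + " MB");
        // Roughly 1.6-1.7 GB of the 2 GB heap is realistically usable by the
        // mapper's own data; the rest is claimed by the sort buffer and the
        // framework itself.
    }
}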
shuffling one intermediate pair to more than one reducer
Hello, I have a use case where I want to shuffle the same pair to more than one reducer. Has anyone tried this, or can anyone suggest how to implement it? I have created a JIRA for the same: https://issues.apache.org/jira/browse/MAPREDUCE-5063 Thank you. -- Thanx and Regards, Vikas Jadhav
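One common workaround today (a sketch against the old mapred API; it is not what MAPREDUCE-5063 itself proposes) is to emit each pair once per target reducer from the mapper, tagging the key with a partition number and routing on that tag in a custom partitioner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reporter;

// Mapper: duplicate each pair, prefixing the key with the reducer it should reach.
public class FanOutMapper extends MapReduceBase
        implements Mapper<Text, IntWritable, Text, IntWritable> {
    private int numReducers;

    @Override
    public void configure(JobConf job) {
        numReducers = job.getNumReduceTasks();
    }

    @Override
    public void map(Text key, IntWritable value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        for (int r = 0; r < numReducers; r++) {
            out.collect(new Text(r + "#" + key.toString()), value);
        }
    }
}

// Partitioner: route on the numeric tag so each copy lands on a different reducer.
class TagPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) {}

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int tag = Integer.parseInt(key.toString().split("#", 2)[0]);
        return tag % numPartitions;
    }
}

The reducers then strip the numeric tag from the key before processing; the obvious cost is that the shuffled data volume grows by the fan-out factor.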
Re: MapReduce Failed and Killed
Any MapReduce task needs to communicate with the tasktracker that launched it periodically in order to let the tasktracker know it is still alive and active. The time for which silence is tolerated is controlled by a configuration property mapred.task.timeout. It looks like in your case, this has already been bumped up to 20 minutes (from the default 10 minutes). It also looks like this is not sufficient. You could bump this value even further up. However, the correct approach could be to see what the reducer is actually doing to become inactive during this time. Can you look at the reducer attempt's logs (which you can access from the web UI of the Jobtracker) and post them here ? Thanks hemanth On Fri, Mar 22, 2013 at 5:32 PM, Jinchun Kim cien...@gmail.com wrote: Hi, All. I'm trying to create category-based splits of Wikipedia dataset(41GB) and the training data set(5GB) using Mahout. I'm using following command. $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/temp/categories.txt I had no problem with the training data set, but Hadoop showed following messages when I tried to do a same job with Wikipedia dataset, . 13/03/21 22:31:00 INFO mapred.JobClient: map 27% reduce 1% 13/03/21 22:40:31 INFO mapred.JobClient: map 27% reduce 2% 13/03/21 22:58:49 INFO mapred.JobClient: map 27% reduce 3% 13/03/21 23:22:57 INFO mapred.JobClient: map 27% reduce 4% 13/03/21 23:46:32 INFO mapred.JobClient: map 27% reduce 5% 13/03/22 00:27:14 INFO mapred.JobClient: map 27% reduce 6% 13/03/22 01:06:55 INFO mapred.JobClient: map 27% reduce 7% 13/03/22 01:14:06 INFO mapred.JobClient: map 27% reduce 3% 13/03/22 01:15:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_r_00_1, Status : FAILED Task attempt_201303211339_0002_r_00_1 failed to report status for 1200 seconds. Killing! 13/03/22 01:20:09 INFO mapred.JobClient: map 27% reduce 4% 13/03/22 01:33:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_37_1, Status : FAILED Task attempt_201303211339_0002_m_37_1 failed to report status for 1228 seconds. Killing! 13/03/22 01:35:12 INFO mapred.JobClient: map 27% reduce 5% 13/03/22 01:40:38 INFO mapred.JobClient: map 27% reduce 6% 13/03/22 01:52:28 INFO mapred.JobClient: map 27% reduce 7% 13/03/22 02:16:27 INFO mapred.JobClient: map 27% reduce 8% 13/03/22 02:19:02 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_18_1, Status : FAILED Task attempt_201303211339_0002_m_18_1 failed to report status for 1204 seconds. Killing! 13/03/22 02:49:03 INFO mapred.JobClient: map 27% reduce 9% 13/03/22 02:52:04 INFO mapred.JobClient: map 28% reduce 9% Because I just started to learn how to run Hadoop, I have no idea how to solve this problem... Does anyone have an idea how to handle this weird thing? -- *Jinchun Kim*
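Two things usually help with the "failed to report status for 1200 seconds" kills described above: raising mapred.task.timeout (milliseconds) further, and, better, having long-running map or reduce code tell the framework it is still alive. A minimal sketch with the old mapred API (the summing logic is only a placeholder for whatever the real reducer does):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HeartbeatReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            // Ping the tasktracker so the attempt is not killed as unresponsive
            // even if a single key takes a very long time to process.
            reporter.progress();
        }
        out.collect(key, new LongWritable(sum));
    }
}

Calling reporter.progress() (or updating a counter/status) inside the long-running loop resets the timeout clock, but as noted above, the first step is still to check the attempt logs and find out why the reducer goes silent for 20 minutes at a time.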
Re: Hadoop-2.x native libraries
Thanks Harsh! I used -Pnative and got it; I am compiling the source code. I made MRv1 work with HDFSv2 successfully. On Mar 25, 2013 12:56 PM, Harsh J ha...@cloudera.com wrote: If you're using a tarball, you'll need to build a native-added tarball yourself with mvn package -Pdist,native,docs -DskipTests -Dtar and then use that. Alternatively, if you're interested in packages, use the Apache Bigtop scripts from the http://bigtop.apache.org/ project's repository and generate the packages with native libs as well. On Mon, Mar 25, 2013 at 9:27 AM, Azuryy Yu azury...@gmail.com wrote: Hi, How to get hadoop-2.0.3-alpha native libraries, it was compiled under 32bits OS in the released package currently. -- Harsh J
Any answer ? Candidate application for map reduce
Any answers from any of you? :) Regards, Bala From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] Sent: 22 March 2013 10:25 To: user@hadoop.apache.org Subject: Candidate application for map reduce Hello, I am looking for a sample application (preferably image processing, feature detection, etc.) that is a good candidate for the map reduce paradigm. To be very specific, I am looking for a simple open source application that processes data and produces some result (A), such that you can split the data into chunks, feed the chunks to the application, and merge the processed chunks to get A back. Is there any website where I can look at such benchmark applications? Any pointers and thoughts will be helpful here. With thanks and regards, Balachandar The information in this e-mail is confidential. The contents may not be disclosed or used by anyone other than the addressee. Access to this e-mail by anyone else is unauthorised. If you are not the intended recipient, please notify Airbus immediately and delete this e-mail. Airbus cannot accept any responsibility for the accuracy or completeness of this e-mail as it has been sent over public networks. If you have any concerns over the content of this message or its Accuracy or Integrity, please contact Airbus immediately. All outgoing e-mails from Airbus are checked using regularly updated virus scanning software but you should take whatever measures you deem to be appropriate to ensure that this message and any attachments are virus free.