Hadoop Learning Environment
Hey all, I want to set up an environment where I can teach myself Hadoop. Usually the way I'll handle this is to grab a machine off the Amazon free tier and set up whatever software I want. However, I realize that Hadoop is a memory-intensive, big-data solution. So what I'm wondering is: would a t2.micro instance be sufficient for setting up a cluster of Hadoop nodes with the intention of learning it? To keep things running longer in the free tier I would either set up however many nodes I want and keep them stopped when I'm not actively using them, or just set up a few nodes across a few different accounts (with a different gmail address for each one.. easy enough to do). Failing that, what are some other free/cheap solutions for setting up a Hadoop learning environment? Thanks, Tim -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
Job failing due to premature EOF
Hi All, when I query larger datasets the job fails with the issue below:

2014-11-03 13:40:15,279 INFO datanode.DataNode - Exception for BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
    at java.lang.Thread.run(Thread.java:724)
2014-11-03 13:40:15,279 INFO datanode.DataNode - PacketResponder: BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
2014-11-03 13:40:15,279 INFO datanode.DataNode - PacketResponder: BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2014-11-03 13:40:15,279 INFO datanode.DataNode - opWriteBlock BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702 received exception java.io.IOException: Premature EOF from inputStream
2014-11-03 13:40:15,280 ERROR datanode.DataNode - task4-14.sj2.net:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.29.15.23:51767 dest: /10.29.15.23:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)

Any pointers on this type of issue? Thanks
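One commonly suggested first check for "Premature EOF from inputStream" on busy write pipelines is DataNode transfer-thread exhaustion. Whether that is the cause here is an assumption, but the usual tuning is to raise the transfer-thread cap in hdfs-site.xml (the value 8192 below is an illustrative figure, not a tuned recommendation):

```xml
<!-- hdfs-site.xml sketch: raise the DataNode data-transfer thread limit
     (this property replaced the older dfs.datanode.max.xcievers).
     8192 is only an example value; tune for your cluster. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
```

Network/firewall drops between pipeline nodes can produce the same error, so this config change is only worth trying after checking connectivity between the DataNodes involved.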
Re: Hadoop Learning Environment
Hi Tim, I'd suggest using Apache Bigtop for this. Bigtop integrates the Hadoop ecosystem into a single upstream distribution, packages everything, and curates smoke tests plus Vagrant and Docker recipes for deployment. We also curate a blueprint Hadoop application (BigPetStore) which you can build yourself, easily, and run to generate, process, and visualize big data across the ecosystem. You can also easily deploy Bigtop onto EC2 if you want to pay for it. On Tue, Nov 4, 2014 at 2:28 PM, Tim Dunphy bluethu...@gmail.com wrote: [original question snipped] -- jay vyas
Re: Hadoop Learning Environment
Tim, download the Sandbox from http://hortonworks.com. You will have everything needed in a small VM instance which will run on your home desktop. Thank you! Sincerely, Leonid Fedotov, Systems Architect - Professional Services, lfedo...@hortonworks.com, office: +1 855 846 7866 ext 292, mobile: +1 650 430 1673 On Tue, Nov 4, 2014 at 11:28 AM, Tim Dunphy bluethu...@gmail.com wrote: [original question snipped] -- CONFIDENTIALITY NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: Hadoop Learning Environment
Hello Tim, Hortonworks and Cloudera both offer VMs (including VirtualBox images, which are free) you can pull down to play with, if you're looking just for something small to get you started. I'm partial to the Hortonworks one myself. Hope that helps. JC On Nov 4, 2014, at 2:28 PM, Tim Dunphy bluethu...@gmail.com wrote: [original question snipped]
Re: Hadoop Learning Environment
Or, on your local laptop or desktop, you can set up the environment using a VM and a VM image of Hadoop and related components. I wrote up instructions a while back here: https://www.linkedin.com/today/post/article/20140924133831-2560863-new-to-hadoop-and-want-to-setup-dev-environment On Nov 5, 2014 2:25 AM, Jim Colestock j...@ramblingredneck.com wrote: [quoted replies snipped]
Re: Hadoop Learning Environment
You can try the Pivotal VM as well: http://pivotalhd.docs.pivotal.io/tutorial/getting-started/pivotalhd-vm.html On Tue, Nov 4, 2014 at 3:13 PM, Leonid Fedotov lfedo...@hortonworks.com wrote: [quoted replies snipped]
Re: Hadoop Learning Environment
What you want as a sandbox depends on what you are trying to learn. If you are trying to learn to code in e.g. Pig Latin, Sqoop, or similar, all of the suggestions (perhaps excluding Bigtop, due to its setup complexities) are great. A laptop? Perhaps, but laptops are really kind of infuriatingly slow (because of the hardware -- you pay a price for a 30-45 watt average heating bill). A laptop is an OK place to start if it is e.g. an i5 or i7 with lots of memory, but you will pretty quickly graduate to wanting a small-ish desktop for your sandbox. A simple, single-node Hadoop instance will let you learn many things. The next level of complexity comes when you are dealing with data whose processing needs to be split up, so you can learn how data gets split across map tasks, how the splits get combined in reduce jobs, etc. For that, you could get a Windows desktop box, or e.g. RedHat/CentOS, and use virtualization: something like a 4-core i5 with 32 GB of memory, running 3 (or for some things 4) VMs. You could load e.g. Hortonworks into each of the VMs and practice setting up a 3/4-way cluster. Throw in 2-3 1 TB drives off of eBay and you can have a lot of learning. "The race is not to the swift, nor the battle to the strong, but to those who can see it coming and jump aside." - Hunter Thompson -- Daemeon On Tue, Nov 4, 2014 at 1:24 PM, oscar sumano osum...@gmail.com wrote: [quoted replies snipped]
Re: Questions about Capacity scheduler behavior
Hi Fabio, to answer your questions:

1) The CapacityScheduler allocates from root to leaf, level by level. Among queues with the same parent, it allocates to the queue with the least used resource first; if queues have the same used resource (say root.qA and root.qB have each used 1G), it breaks the tie by queue name, so qA goes first.

2) There are two capacity parameters: capacity and maximum-capacity. A queue can only be allocated resources in multiples of the cluster's minimum-allocation, and its usage will always be <= the queue's maximum-capacity.

3) I think 1) already answers this: queues are ordered by absolute used resource, not by percentage of used resource, so in your example queue A will go first.

4) Sorry, I don't quite understand this question - could you explain it a bit more?

5) You did not misunderstand the example; this is how capacity/maximum-capacity apply to queues. The maximum-capacity of a queue is used to do exactly this kind of resource provisioning. With

root.a.capacity = 50
root.a.maximum-capacity = 80
root.a.a1.capacity = 50
root.a.a1.maximum-capacity = 90

the guaranteed resource of a.a1 is a.capacity% * a.a1.capacity% = 25%, and the maximum resource of a.a1 is a.maximum-capacity% * a.a1.maximum-capacity% = 72% (0.8 * 0.9). And as you said, the resource a child can use is always <= the resource of its parent. If you want all leaf queues to be able to leverage all the resources in the cluster, you can simply set the maximum-capacity of all parent and leaf queues to 100. Does this make sense to you?

Thanks, Wangda

On Mon, Nov 3, 2014 at 6:30 AM, Fabio anyte...@gmail.com wrote: Hi guys, I'm posting this in the user mailing list since I got no reply on yarn-dev. I have to model the capacity scheduler behavior as accurately as possible, and I have some questions; I hope someone can help me with this.
In the following I will consider all containers to be equal, for simplicity:

1) When multiple queues have the same level of assigned resources, what is the policy for deciding which comes first in resource allocation?

2) Let's consider this configuration: we have a cluster hosting a total of 40 containers and 3 queues. A is configured to get 39% of cluster capacity, B also gets 39%, and C gets 22%. The number of containers works out to 15.6, 15.6 and 8.8 for A, B and C. Since we can't split a container, how does the CapacityScheduler round these values in a real case? Who gets the two contended containers? I would guess they are treated as extra containers, shared on demand among the three queues. Is this correct?

3) Say I have queues A and B. A is configured to get 20% (20 containers) of the total cluster capacity (100 containers); B gets 80% (80 containers). The CapacityScheduler gives available resources first to the most under-served queue. If A is using 10 containers and B is using 20, who gets the first available container? A is already using 50% of its assigned capacity, B just 25%, but A has fewer containers than B... which one counts as more under-served?

4) Does the previous question make sense at all? I suspect that when there are free containers the scheduler just serves requests as they arrive, possibly over-provisioning a queue (that is: if I get a container request for an app in A, I give it a container, since I don't know that a few milliseconds later I will get a new request from B, or vice versa). The previous question would only make sense if there were some sort of buffer of incoming requests, filled because they are hard to serve in real time, letting the scheduler pick the request from the most under-served queue. Is this what happens?
5) According to the example about resource allocation with the capacity scheduler presented in "Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2", what I understood is that the chance for a leaf queue to get resources above its assigned capacity is always upper-limited by the fraction of cluster capacity assigned to its closest parent queue. That is: if I am a leaf queue A1, I can get at most the resources dedicated to my parent A, while I can't get the ones from B, a sibling of A, even if B doesn't have any running application. At first I actually thought this over-provisioning was not limited, and that regardless of the queue configuration a single application could get the whole cluster (excluding per-application limits). Did I misunderstand the example? Thanks a lot, Fabio
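Wangda's point 5 can be written out as a capacity-scheduler.xml fragment. The queue layout (root.a with child a1) mirrors the numbers in his example; the property-name pattern is the standard yarn.scheduler.capacity.<queue-path>.* form:

```xml
<!-- capacity-scheduler.xml sketch of the example above.
     Guaranteed share of root.a.a1 = 50% of 50% = 25% of the cluster.
     Maximum share of root.a.a1   = 80% * 90%  = 72%, because a child
     can never exceed its parent's maximum-capacity. -->
<property>
  <name>yarn.scheduler.capacity.root.a.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.maximum-capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.a1.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.a1.maximum-capacity</name>
  <value>90</value>
</property>
```

Setting every maximum-capacity to 100, as suggested, lets any leaf queue borrow up to the whole cluster when the other queues are idle.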
Re: Hadoop Learning Environment
Hi daemeon, actually, for most folks who would want to actually use a Hadoop cluster, I would think setting up Bigtop is super easy! If you have issues with it, ping me and I can help you get started. Also, we have Docker containers, so you don't even *need* a VM to run a 4 or 5 node Hadoop cluster. The Vagrant route is:

install Vagrant
install VirtualBox
git clone https://github.com/apache/bigtop
cd bigtop/bigtop-deploy/vm/vagrant-puppet
vagrant up

Then vagrant destroy when you're done. To me this is easier than manually downloading an appliance, picking memory, starting the VirtualBox GUI, loading the appliance, etc., and it's also easy to turn the simple single-node Bigtop VM into a multi-node one, just by modifying the Vagrantfile. On Tue, Nov 4, 2014 at 5:32 PM, daemeon reiydelle daeme...@gmail.com wrote: [quoted replies snipped]
-- jay vyas
Re: Hadoop Learning Environment
Try Docker! http://ferry.opencore.io/en/latest/examples/hadoop.html On Tue, Nov 4, 2014 at 6:36 PM, jay vyas jayunit100.apa...@gmail.com wrote: [quoted replies snipped]
CPU usage of a container.
From: Smita Deshpande
Sent: Wednesday, November 05, 2014 10:52 AM
To: user@hadoop.apache.org
Subject: CPU usage of a container.

Hi All, I am facing sort of a weird issue in YARN. I am running a single container on a cluster whose cpu configuration is as follows:

NODEMANAGER1: 4 cpu cores
NODEMANAGER2: 4 cpu cores
NODEMANAGER3: 16 cpu cores

All processors are hyperthreaded, so if I am using 1 cpu core then the max usage could be 200%. When I run different numbers of threads in that container (basically a cpu-intensive calculation), it shows cpu usage higher than the number of cores allotted to it. Please refer to the table below for the different test cases; highlighted values in red seem to have exceeded their allotment. I am using DominantResourceCalculator in the CapacityScheduler. PFA the screenshot for the same. Any help would be appreciated.

Resource Ask | %cpu usage (from htop) | # of threads launched in container
1024,1       | 176.8 | 4
1024,1       | 108   | 1
1024,1       | 177   | 2
1024,1       | 291   | 3
1024,1       | 342   | 4
1024,1       | 337   | 4 (container launched on NODEMANAGER3)
1024,2       | 177   | 3
1024,2       | 182.6 | 9
1024,2       | 336   | 4 (container launched on NODEMANAGER3)
1024,2       | 189   | 2 (container launched on NODEMANAGER2)
1024,2       | 291   | 3
1024,2       | 337   | 4
1024,3       | 283   | 3
1024,3       | 329.7 | 9
1024,3       | 343   | 4 (container launched on NODEMANAGER3)
1024,3       | 122   | 1
1024,3       | 216   | 2
1024,3       | 290   | 3
1024,4       | 289   | 3
1024,4       | 123   | 1
1024,4       | 217   | 2
1024,4       | 292   | 3
1024,4       | 338   | 4
1024,4       | 177.3 | 32

Regards, Smita
RE: CPU usage of a container.
Hi Smita, can you please provide the following information:

1. Which version of Hadoop?
2. Is the Linux Container Executor with DominantResourceCalculator and CgroupsLCEResourcesHandler configured?
3. If it's against the trunk code, have you configured yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage, which is false by default?

In general cpu usage is not restrictive: cgroups only tries to restrict a container's usage when all the cpu cores are in use; otherwise a container is allowed to use cpu while it's free. Please refer to the comments from Chris Riccomini in https://issues.apache.org/jira/browse/YARN-600, which give a rough idea of how cpu isolation can be validated, and also his blog post http://riccomini.name/posts/hadoop/2013-06-14-yarn-with-cgroups, which might help you understand cgroups and cpu isolation. After YARN-2531, yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage is supported, so if you are using the Hadoop trunk code you can restrict a single container's cpu usage.

Regards, Naga

Huawei Technologies Co., Ltd. Mobile: +91 9980040283 Email: naganarasimh...@huawei.com Bantian, Longgang District, Shenzhen 518129, P.R.China http://www.huawei.com

From: Smita Deshpande [smita.deshpa...@cumulus-systems.com] Sent: Wednesday, November 05, 2014 13:21 To: user@hadoop.apache.org Subject: CPU usage of a container. [original message snipped]
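For reference, the cgroups-based strict CPU limiting described in this thread is enabled with yarn-site.xml settings along these lines. This is a sketch under the assumption of a trunk build with YARN-2531 (the executor and handler class names are the standard ones; additional cgroups mount/path settings vary by distro and are omitted):

```xml
<!-- yarn-site.xml sketch: enforce (rather than merely schedule) CPU limits via cgroups. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- Cap a container at its allocated vcores even when the node is otherwise idle.
       Default is false, i.e. containers may borrow free CPU - which matches the
       over-allocation behavior reported above. -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

Without strict-resource-usage, the htop numbers in the table are expected: a container asking for 1 vcore may still burn several cores' worth of CPU as long as the node has idle cycles.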
hadoop admin user name with space
Hi All, I am using a Hadoop cluster with an Active Directory setup, where I have usernames containing spaces, e.g. "AB CD". Can I run MR jobs as such users? Will there be any issues? -- Warm Regards, Bharath Kumar