Hadoop Learning Environment

2014-11-04 Thread Tim Dunphy
Hey all,

 I want to set up an environment where I can teach myself Hadoop. Usually
the way I handle this is to grab a machine off the Amazon free tier and set
up whatever software I want.

However, I realize that Hadoop is a memory-intensive, big-data solution. So
what I'm wondering is: would a t2.micro instance be sufficient for setting
up a cluster of Hadoop nodes with the intention of learning it? To keep
things running longer in the free tier I would either set up however many
nodes as I want and keep them stopped when I'm not actively using them, or
just set up a few nodes under a few different accounts (with a different
Gmail address for each one.. easy enough to do).
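As a rough sanity check on the stopped-when-idle idea (assuming the commonly cited 750 t2.micro instance-hours per month free-tier allowance; the node count and usage pattern below are made-up examples):

```python
# Free-tier budget check: can N nodes fit in ~750 t2.micro hours/month
# if they only run while actively being used?
FREE_TIER_HOURS = 750  # per account, per month (commonly cited allowance)

def hours_used(nodes, hours_per_day, days_per_month):
    """Total instance-hours consumed by `nodes` machines that are
    started for study sessions and stopped otherwise."""
    return nodes * hours_per_day * days_per_month

# e.g. a 4-node cluster used 3 hours/day, 20 days/month:
used = hours_used(nodes=4, hours_per_day=3, days_per_month=20)
print(used, used <= FREE_TIER_HOURS)  # 240 hours -> fits in one account
```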

Failing that, what are some other free/cheap solutions for setting up a
hadoop learning environment?

Thanks,
Tim

-- 
GPG me!!

gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B


Job failing due to premature EOF

2014-11-04 Thread Giri P
Hi All,

When I query larger datasets, the job fails due to the issue below:

2014-11-03 13:40:15,279 INFO  datanode.DataNode - Exception for
BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
at java.lang.Thread.run(Thread.java:724)
2014-11-03 13:40:15,279 INFO  datanode.DataNode - PacketResponder:
BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702,
type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
2014-11-03 13:40:15,279 INFO  datanode.DataNode - PacketResponder:
BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702,
type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2014-11-03 13:40:15,279 INFO  datanode.DataNode - opWriteBlock
BP-1442477155-10.28.12.10-1391025835784:blk_1096748808_1099611172702
received exception java.io.IOException: Premature EOF from inputStream
2014-11-03 13:40:15,280 ERROR datanode.DataNode -
task4-14.sj2.net:50010:DataXceiver
error processing WRITE_BLOCK operation  src: /10.29.15.23:51767 dest: /
10.29.15.23:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)

Any pointers on this type of issue?

Thanks


Re: Hadoop Learning Environment

2014-11-04 Thread jay vyas
Hi Tim.  I'd suggest using Apache Bigtop for this.

Bigtop integrates the Hadoop ecosystem into a single upstream distribution:
it packages everything, curates smoke tests, and provides Vagrant and Docker
recipes for deployment.
Also, we curate a blueprint Hadoop application (BigPetStore) which you can
easily build yourself and run to generate, process, and visualize data
across the big-data ecosystem.

You can also easily deploy Bigtop onto EC2 if you want to pay for it.





-- 
jay vyas


Re: Hadoop Learning Environment

2014-11-04 Thread Leonid Fedotov
Tim,
download the Sandbox from http://hortonworks.com
You will have everything needed in a small VM instance which will run on
your home desktop.


*Thank you!*


*Sincerely,*

*Leonid Fedotov*

Systems Architect - Professional Services

lfedo...@hortonworks.com

office: +1 855 846 7866 ext 292

mobile: +1 650 430 1673


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: Hadoop Learning Environment

2014-11-04 Thread Jim Colestock
Hello Tim, 

Hortonworks and Cloudera both offer VMs (including for VirtualBox, which is
free) that you can pull down to play with, if you're looking for something
small to get you started. I'm partial to the Hortonworks one myself.

Hope that helps.

JC






Re: Hadoop Learning Environment

2014-11-04 Thread Sandeep Khurana
Or, on your local laptop or desktop, you can set up the environment using a
VM image of Hadoop and related components. I wrote up instructions some time
back here:
https://www.linkedin.com/today/post/article/20140924133831-2560863-new-to-hadoop-and-want-to-setup-dev-environment



Re: Hadoop Learning Environment

2014-11-04 Thread oscar sumano
you can try the pivotal vm as well.

http://pivotalhd.docs.pivotal.io/tutorial/getting-started/pivotalhd-vm.html



Re: Hadoop Learning Environment

2014-11-04 Thread daemeon reiydelle
What you want as a sandbox depends on what you are trying to learn.

If you are trying to learn to code in e.g. Pig Latin, Sqoop, or similar, all
of the suggestions (perhaps excluding Bigtop due to its setup complexities)
are great. A laptop? Perhaps, but laptops are really kind of infuriatingly
slow (because of the hardware - you pay a price for a 30-45 watt average
heating bill). A laptop is an OK place to start if it is e.g. an i5 or i7
with lots of memory. What do you think of the thought that you will pretty
quickly graduate to wanting a smallish desktop for your sandbox?

A simple, single-node Hadoop instance will let you learn many things. The
next level of complexity comes when you are attempting to deal with data
whose processing needs to be split up, so you can learn how to split data
across map tasks, reduce the splits via reduce jobs, etc. For that, you
could get a desktop box running Windows or e.g. RedHat/CentOS and use
virtualization. Something like a 4-core i5 with 32 GB of memory, running 3
(or for some things 4) VMs. You could load e.g. Hortonworks into each of
the VMs and practice setting up a 3/4-way cluster. Throw in 2-3 1 TB drives
off of eBay and you can have a lot of learning.

*...“The race is not to the swift, nor the battle to the strong, but to
those who can see it coming and jump aside.” - Hunter Thompson*
*Daemeon*





Re: Questions about Capacity scheduler behavior

2014-11-04 Thread Wangda Tan
Hi Fabio,
To answer your questions:
1)
CS (the CapacityScheduler) will do allocation from root to leaf, level by
level, and for queues with the same parent it will allocate in the following
way: it will first allocate to the queue with the least used resource.
If queues have the same used resource - say root.qA has used 1G and root.qB
has used 1G as well - it will allocate by queue name, so qA will go first.

2)
There are two parameters about capacity: one is capacity, the other is
maximum-capacity. A queue can only allocate resources in multiples of the
minimum-allocation of the cluster, and its usage will always be <= the
maximum-capacity of the queue.
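As an illustration of point 2 (the helper and numbers here are mine, not the scheduler's source): a queue's guaranteed share only materializes in whole allocation units, so fractional "containers" like the 15.6 in the 39/39/22-over-40 example in the question quoted below get floored:

```python
import math

def usable_containers(capacity_pct, cluster_containers):
    """A queue's guaranteed share, floored to whole containers
    (allocation happens in multiples of the minimum allocation)."""
    return math.floor(capacity_pct / 100.0 * cluster_containers)

# The 39/39/22 split over 40 containers from the quoted question:
shares = [usable_containers(p, 40) for p in (39, 39, 22)]
print(shares, sum(shares))  # [15, 15, 8] 38 -> 2 containers left to contend for
```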

3)
I think 1) should already answer your question: queues are ordered by used
resource, not by percentage of used resource, so in your example queue A
will go first.

4)
Sorry, I don't quite understand this question - could you explain it a bit
more?

5)
You did not misunderstand the example; this is how capacity/maximum-capacity
are applied to queues. The maximum-capacity of a queue is used to do such
resource provisioning for queues.

root.a.capacity = 50
root.a.maximum-capacity = 80
root.a.a1.capacity = 50
root.a.a1.maximum-capacity = 90

The guaranteed resource of a.a1 is a.capacity% * a.a1.capacity% = 25%
The maximum resource of a.a1 is a.maximum-capacity% * a.a1.maximum-capacity%
= 72% (0.8 * 0.9).

And as you said, the resource a child can use will always be <= the
resource of its parent.

If you want all leaf queues to be able to leverage all resources in the
cluster, you can simply set the maximum-capacity of all parent queues and
leaf queues to 100.
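The two multiplications above can be sketched as a tiny helper (illustrative only; the percentages are from the example configuration quoted earlier):

```python
# Absolute cluster share of a nested queue: the product of the
# per-level capacity fractions down the path from the root.
def absolute(path_percentages):
    """Multiply per-level percentages into an absolute cluster share (%)."""
    frac = 1.0
    for pct in path_percentages:
        frac *= pct / 100.0
    return frac * 100.0

guaranteed = absolute([50, 50])  # a.capacity% * a.a1.capacity% -> 25%
maximum    = absolute([80, 90])  # a.max-capacity% * a.a1.max-capacity% -> ~72%
print(guaranteed, maximum)
```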

Does this make sense to you?

Thanks,
Wangda

On Mon, Nov 3, 2014 at 6:30 AM, Fabio anyte...@gmail.com wrote:

 Hi guys, I'm posting this in the user mailing list since I got no reply in
 the Yarn-dev. I have to model as well as possible the capacity scheduler
 behavior and I have some questions, hope someone can help me with this. In
 the following I will consider all containers to be equal for simplicity:

 1) In case of multiple queues having the same level of assigned resources,
 what's the policy to decide which comes first in the resource allocation?

 2) Let's consider this configuration:
 We have a cluster hosting a total of 40 containers. We have 3 queues: A is
 configured to get 39% of cluster capacity, B also gets 39% and C gets 22%.
 The number of containers is going to be 15.6, 15.6 and 8.8 for A, B and C.
 Since we can't split a container, how does the Capacity scheduler round
 these values in a real case? Who gets the two contended containers? I may
 think they are considered as extra containers, thus shared upon need among
 the three queues. Is this correct?

 3) Let's say I have queues A and B. A is configured to get 20% (20
 containers) of the total cluster capacity (100 containers), B gets 80% (80
 containers). Capacity scheduler gives available resources firstly to the
 most under-served queue.
 In case A is using 10 containers and B is using 20, who is going to get
 the first available container? A is already using 50% of its assigned
 capacity, B just 25%, but A has fewer containers than B... who is considered
 to be more under-served?

 4) Does the previous question make sense at all? Because I have a doubt
 that when I have free containers I will just serve requests as they arrive,
 possibly over-provisioning a queue (that is: if I get a container request
 for an app in A, I will give it a container since I don't know that after a
 few milliseconds I will get a new request from B, or vice versa). The
 previous question may have sense if there was some sort of buffer that is
 filled with incoming requests, due to the difficulty of serving them in
 real time, thus making the scheduler able to choose the request from the
 most under-served queue. Is this what happens?

 5) According to the example presented in Apache Hadoop YARN: Moving
 beyond MapReduce and Batch Processing with Apache Hadoop 2 about the
 resource allocation with the capacity scheduler, what I understood is that
 the chance for a leaf queue to get resources above its assigned capacity
 is always upper-limited by the fraction of cluster capacity assigned to its
 first/closest parent queue. That is: if I am a leaf queue A1, I can only get
 at most the resources dedicated to my parent A, while I can't get the ones
 from B, sibling of A, even if it doesn't have any running application.
 Actually at first I thought this over-provisioning was not limited, and
 regardless of the queue configuration a single application could get the
 whole cluster (excluding per-application limits). Did I misunderstand the
 example?

 Thanks a lot

 Fabio



Re: Hadoop Learning Environment

2014-11-04 Thread jay vyas
Hi Daemeon:  Actually, for most folks who would want to actually use a
Hadoop cluster, I would think setting up Bigtop is super easy! If you
have issues with it, ping me and I can help you get started.
Also, we have Docker containers - so you don't even *need* a VM to run a 4-
or 5-node Hadoop cluster.

install vagrant
install VirtualBox
git clone https://github.com/apache/bigtop
cd bigtop/bigtop-deploy/vm/vagrant-puppet
vagrant up
Then `vagrant destroy` when you're done.

This to me is easier than manually downloading an appliance, picking memory,
starting the VirtualBox GUI, loading the appliance, etc... and it's also
easy to turn the simple single-node Bigtop VM into a multi-node one by just
modifying the Vagrantfile.




-- 
jay vyas


Re: Hadoop Learning Environment

2014-11-04 Thread Gavin Yue
Try docker!

http://ferry.opencore.io/en/latest/examples/hadoop.html






CPU usage of a container.

2014-11-04 Thread Smita Deshpande



Hi All,
I am facing a somewhat weird issue in YARN. I am running a
single container on a cluster whose CPU configuration is as follows:
NODEMANAGER1 : 4 CPU cores
NODEMANAGER2 : 4 CPU cores
NODEMANAGER3 : 16 CPU cores
All processors are hyperthreaded ones, so if I am using 1 CPU
core then the max usage could be 200%.
When I run different numbers of threads in that container
(basically a CPU-intensive calculation), it shows CPU usage higher than the
number of cores allotted to it. Please refer to the table below for the
different test cases; the values highlighted in red seem to have exceeded
their allotment. I am using the DominantResourceCalculator in the
CapacityScheduler.
PFA the screenshot for the same.
Any help would be appreciated.

Resource Ask   %CPU usage (from htop)   # of threads launched in container
1024,1         176.8                    4
               108                      1
               177                      2
               291                      3
               342                      4
               337                      4 [container launched on NODEMANAGER3]
1024,2         177                      3
               182.6                    9
               336                      4 [container launched on NODEMANAGER3]
               189                      2 [container launched on NODEMANAGER2]
               291                      3
               337                      4
1024,3         283                      3
               329.7                    9
               343                      4 [container launched on NODEMANAGER3]
               122                      1
               216                      2
               290                      3
1024,4         289                      3
               123                      1
               217                      2
               292                      3
               338                      4
               177.3                    32


Regards,
Smita


RE: CPU usage of a container.

2014-11-04 Thread Naganarasimha G R (Naga)
Hi Smita,
Can you please provide the following information:
1. Which version of Hadoop?
2. Are the LinuxContainerExecutor with DRC and the CgroupsLCEResourcesHandler
configured?
3. If it's against the trunk code, have you configured
yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage, which
is false by default?

In general CPU usage is not restrictive; i.e. only when all the CPU cores are
in use does cgroups try to restrict a container's usage - otherwise a
container is allowed to use the CPU when it is free.
Please refer to the comments from Chris Riccomini in
https://issues.apache.org/jira/browse/YARN-600, which will give a rough idea
of how CPU isolation can be validated, and also his blog
http://riccomini.name/posts/hadoop/2013-06-14-yarn-with-cgroups
which might help you in understanding cgroups and CPU isolation.

After YARN-2531,
yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage is
supported, so if you are using the Hadoop trunk code then you can restrict a
single container's CPU usage.
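For reference, a rough sketch of what the strict flag amounts to at the cgroups level (the period value is the usual kernel default; this helper is illustrative, not YARN's actual code):

```python
# Under CFS bandwidth control, a hard CPU cap is expressed as
# cpu.cfs_quota_us relative to cpu.cfs_period_us: quota/period is the
# number of cores the cgroup may use. Non-strict mode leaves the quota
# unset (-1), so a container can borrow idle CPU - matching the
# behavior described above.
CFS_PERIOD_US = 100_000  # common kernel default for cpu.cfs_period_us

def cfs_quota_us(vcores, strict):
    """Quota for a container allowed `vcores` cores; -1 means no cap."""
    return int(vcores * CFS_PERIOD_US) if strict else -1

print(cfs_quota_us(2, strict=True))   # capped at 200% CPU
print(cfs_quota_us(2, strict=False))  # may exceed 200% while cores are idle
```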



Regards,

Naga



Huawei Technologies Co., Ltd.
Phone:
Fax:
Mobile:  +91 9980040283
Email: naganarasimh...@huawei.com
Huawei Technologies Co., Ltd.
Bantian, Longgang District,Shenzhen 518129, P.R.China
http://www.huawei.com





hadoop admin user name with space

2014-11-04 Thread Bharath Kumar
Hi All,
I am using a Hadoop cluster with an Active Directory setup, where I
have usernames with spaces, e.g. "AB CD". Can I run MR jobs? Will there be
any issues?
-- 
Warm Regards,
 *Bharath Kumar *