Re: Slow read from S3 on CDH 5.8.0 (includes HADOOP-12346)
Just out of curiosity, have you enabled an S3 endpoint for this? Hopefully you are running this cluster inside a VPC; if so, an endpoint would help, as the S3 traffic would not go out over the Internet... Also, were any new policies put in place for your S3 bucket? Others have mentioned something about throttling.

On Wed, Aug 17, 2016, 3:22 PM Sebastian Nagel wrote:

> Hi Dheeren, hi Chris,
>
>> Are you able to share a bit more about your deployment architecture? Are these EC2 VMs? If so, are they co-located in the same AWS region as the S3 bucket?
>
> Running a cluster of 100 m1.xlarge EC2 instances with Ubuntu 14.04 (ami-41a20f2a). The cluster is running in a single availability zone (us-east-1d); the S3 bucket is in the same region (us-east-1).
>
> % lsb_release -d
> Description: Ubuntu 14.04.3 LTS
>
> % uname -a
> Linux ip-10-91-235-121 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
>> Did you change the Java JDK version as well, as part of the upgrade?
>
> Java is taken as provided by Ubuntu:
>
> % java -version
> java version "1.7.0_111"
> OpenJDK Runtime Environment (IcedTea 2.6.7) (7u111-2.6.7-0ubuntu0.14.04.3)
> OpenJDK 64-Bit Server VM (build 24.111-b01, mixed mode)
>
> Cloudera CDH is installed from
> http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
>
> After the jobs are done the cluster is shut down and bootstrapped (bash + cloud-init) anew on demand. A new launch of the cluster may, of course, include updates of
> - the underlying Amazon machine image
> - Ubuntu packages
> - Cloudera packages
>
> The real reason for the problem may come from any of these changes. The update to Cloudera CDH 5.8.0 was just the most obvious one when the problems appeared (seen first 2016-08-01).
>
>> If the cluster is not running in EC2 (e.g. on-premises physical hardware), then are there any notable differences on nodes that experienced this problem (e.g. smaller capacity on the outbound NIC)?
>
> Probably not, although I cannot exclude it. In the last few days I've run into problems which could be related: a few tasks are slow and even seem to hang, e.g., reducers during copy. But that looks more like a Hadoop (configuration) problem. Network throughput between nodes, measured with iperf, is not super-performant but generally ok (5-20 MBit/s).
>
>> This is just a theory, but if your bandwidth to the S3 service is intermittently saturated or throttled or somehow compromised, then I could see how longer timeouts and more retries might increase overall job time. With the shorter settings, it might cause individual task attempts to fail sooner. Then, if the next attempt gets scheduled to a different node with better bandwidth to S3, it would start making progress faster in the second attempt. Then, the effect on overall job execution might be faster.
>
> That's also my assumption. When connecting to S3, a server is selected which is fast at that moment. While copying 1 GB, which takes a couple of minutes just because of general network throughput, the server may become more loaded; when reconnecting, a better server is chosen.
>
> Btw., tasks are not failing when choosing a moderate timeout - 30 sec. is ok; with lower values (a few seconds) the file uploads frequently fail.
>
> I've seen this behavior with a simple distcp from S3: with the default values, it took 1 day to copy 300 GB from S3 to HDFS. After choosing a shorter timeout, the job finished within 5 hours.
>
> Thanks,
> Sebastian
>
> On 08/16/2016 09:11 PM, Dheeren Bebortha wrote:
>> Did you change the Java JDK version as well, as part of the upgrade?
>> Dheeren
>>
>>> On Aug 16, 2016, at 11:59 AM, Chris Nauroth wrote:
>>>
>>> Hello Sebastian,
>>>
>>> This is an interesting finding. Thank you for reporting it.
>>>
>>> Are you able to share a bit more about your deployment architecture? Are these EC2 VMs? If so, are they co-located in the same AWS region as the S3 bucket? If the cluster is not running in EC2 (e.g. on-premises physical hardware), then are there any notable differences on nodes that experienced this problem (e.g. smaller capacity on the outbound NIC)?
>>>
>>> This is just a theory, but if your bandwidth to the S3 service is intermittently saturated or throttled or somehow compromised, then I could see how longer timeouts and more retries might increase overall job time. With the shorter settings, it might cause individual task attempts to fail sooner. Then, if the next attempt gets scheduled to a different node with better bandwidth to S3, it would start making progress faster in the second attempt. Then, the effect on overall job execution might be faster.
>>>
>>> --Chris Nauroth
>>>
>>> On 8/7/16, 12:12 PM, "Sebastian Nagel" wrote:
>>>
>>> Hi,
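[Editor's note] For anyone wanting to experiment with the knob Sebastian describes, the S3A client's timeout and retry settings can be overridden per job. A minimal sketch, assuming the s3a:// connector is in use; the property names are the standard S3A keys, but the values, bucket, and paths are illustrative only, not recommendations:

    # Shorter socket timeout (ms) and fewer retries than the larger
    # HADOOP-12346 defaults; 30s mirrors the value reported to work above.
    hadoop distcp \
      -Dfs.s3a.connection.timeout=30000 \
      -Dfs.s3a.connection.establish.timeout=5000 \
      -Dfs.s3a.attempts.maximum=5 \
      s3a://my-bucket/input hdfs:///data/input

The same properties can also be set cluster-wide in core-site.xml instead of per job.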
Re: HDFS backup to S3
Hi Anu,

Thanks for the information, but the link you provided does not work. @Hari, let me do some quick research on what you guys can provide and get back to you.

On Wed, Jun 15, 2016, 10:59 AM Anu Engineer <aengin...@hortonworks.com> wrote:

> Hi Max,
>
> Unfortunately, we don't have a better solution at the moment. I am wondering if the right approach might be to use user-defined metadata (http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) and put that information along with the object that we are backing up.
>
> However, that would be a code change in DistCp, and not as easy as a script. But it would address the scalability issue that you are worried about.
>
> Thanks
> Anu
>
> *From: *max scalf <oracle.bl...@gmail.com>
> *Date: *Wednesday, June 15, 2016 at 7:15 AM
> *To: *HDP mailing list <user@hadoop.apache.org>
> *Subject: *HDFS backup to S3
>
> Hello Hadoop community,
>
> We are running Hadoop in AWS (not EMR), the Hortonworks distro on EC2 instances. Everything is set up and working as expected. Our design calls for running HDFS/datanodes on local/ephemeral storage, we have 3x replication enabled by default, and all of the metastores (Hive, Oozie, Ranger, Ambari, etc.) are external to the cluster, using RDS/MySQL.
>
> The question that I have is with regard to backups. We want to run a nightly job that copies data from HDFS into S3. Given that our cluster lives in AWS, the obvious choice is to back up to S3. We do not want a warm backup (backing up this cluster to another cluster); our RTO/RPO is 5 days for this cluster. So we can run distcp (something like the link below) to back up our HDFS to S3, and we have tested this and it works just fine, but how do we go about storing the ownership/permissions on these files?
>
> http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script
>
> As S3 is blob storage and does not store any ownership/permissions, how do we go about backing those up? One idea I had was to run hdfs dfs -lsr (recursively getting all file and folder permissions/ownership), dump that into a file, and send that file over to S3 as well. I am guessing it will work now, but as the cluster grows it might not scale...
>
> So I wanted to find out how people manage backing up ownership/permissions of HDFS files/folders when backing up to a blob store like S3.
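[Editor's note] A minimal sketch of the -lsr idea discussed in this thread; the paths and bucket name are hypothetical, and `hdfs dfs -ls -R` is the non-deprecated spelling of -lsr:

    # Snapshot ownership/permissions alongside the nightly data copy.
    DATE=$(date +%Y%m%d)
    hdfs dfs -ls -R /data > /tmp/hdfs-perms-${DATE}.txt
    hadoop fs -put /tmp/hdfs-perms-${DATE}.txt s3a://backup-bucket/meta/

As noted above, listing the whole namespace may stop scaling as the cluster grows; Anu's user-defined-metadata idea avoids that, but requires a DistCp code change.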
HDFS backup to S3
Hello Hadoop community,

We are running Hadoop in AWS (not EMR), the Hortonworks distro on EC2 instances. Everything is set up and working as expected. Our design calls for running HDFS/datanodes on local/ephemeral storage, we have 3x replication enabled by default, and all of the metastores (Hive, Oozie, Ranger, Ambari, etc.) are external to the cluster, using RDS/MySQL.

The question that I have is with regard to backups. We want to run a nightly job that copies data from HDFS into S3. Given that our cluster lives in AWS, the obvious choice is to back up to S3. We do not want a warm backup (backing up this cluster to another cluster); our RTO/RPO is 5 days for this cluster. So we can run distcp (something like the link below) to back up our HDFS to S3, and we have tested this and it works just fine, but how do we go about storing the ownership/permissions on these files?

http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script

As S3 is blob storage and does not store any ownership/permissions, how do we go about backing those up? One idea I had was to run hdfs dfs -lsr (recursively getting all file and folder permissions/ownership), dump that into a file, and send that file over to S3 as well. I am guessing it will work now, but as the cluster grows it might not scale...

So I wanted to find out how people manage backing up ownership/permissions of HDFS files/folders when backing up to a blob store like S3.
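[Editor's note] For reference, the basic distcp-to-S3 step being described might look like the following; the bucket and paths are placeholders, and -update semantics against an object store should be verified before relying on them:

    # Nightly HDFS -> S3 copy (credentials assumed configured for s3a://).
    hadoop distcp -update hdfs:///data s3a://backup-bucket/hdfs-backup/data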
Re: HDFS how to specify the exact datanode to put data on?
May I ask why you need to do that? Why not let Hadoop handle it for you?

On Sunday, July 19, 2015, Shiyao Ma i...@introo.me wrote:

Hi,

I'd like to put my data selectively on some datanodes. Currently I can do that by shutting down the un-needed datanodes, but this is a little laborious. Is it possible to directly specify the datanodes I'd like to put the data on when doing operations like: hdfs dfs -put /

TIA.

-- I am a cat. My homepage is https://introo.me
Re: copy data from one hadoop cluster to another hadoop cluster + cant use distcp
Not to hijack this post, but how would you deal with data that is maintained by Hive (ORC-format files, Hive-created tables, etc.)? Would we copy the Hive metastore (MySQL) and move that over to the new cluster?

On Friday, June 19, 2015, Joep Rottinghuis jrottingh...@gmail.com wrote:

You can't set up a proxy? You probably want to avoid writing to the local file system because, aside from being slow, it limits the size of your file to the free space on your local disk. If you do need to go commando and go through a single client machine that can see both clusters, you probably want to pipe a get to a put. Any kind of serious data volume pulled through a straw is going to be rather slow, though.

Cheers,

Joep

Sent from my iPhone

On Jun 19, 2015, at 12:09 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

yes

On Fri, Jun 19, 2015 at 11:36 AM, Divya Gehlot divya.htco...@gmail.com wrote:

In that case it will be like a three-step process:
1. first cluster (secure zone): HDFS -> copyToLocal -> user's local file system
2. user's local space -> copy data -> second cluster's user local file system
3. second cluster's user local file system -> copyFromLocal -> second cluster's HDFS
Am I on the right track?

On 19 June 2015 at 12:38, Nitin Pawar nitinpawar...@gmail.com wrote:

What's the size of the data? If you cannot do distcp between the clusters, then the other way is doing an hdfs get on the data and then an hdfs put on the other cluster.

On 19-Jun-2015 9:56 am, Divya Gehlot divya.htco...@gmail.com wrote:

Hi,

I need to copy data from the first hadoop cluster to the second hadoop cluster. I can't access the second hadoop cluster from the first hadoop cluster due to a security issue. Can anyone point me to how I can do this, apart from the distcp command? For instance: cluster 1 (secure zone) -> copy hdfs data to -> cluster 2 in a non-secure zone.

Thanks,
Divya

--
Nitin Pawar
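[Editor's note] Joep's "pipe a get to a put" can be done without any local temp file by streaming through stdout/stdin. A hedged sketch, with cluster addresses and paths as placeholders (run on a client that can reach both clusters, and subject to whatever security boundary exists between them):

    # Stream a file between clusters without staging it on local disk;
    # "-" makes put read from stdin.
    hdfs dfs -cat hdfs://cluster1-nn:8020/data/part-00000 \
      | hdfs dfs -put - hdfs://cluster2-nn:8020/data/part-00000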
Re: Swap requirements
Thank you, Harsh. Can you please explain what you meant by "just simple virtual memory used by the process"? Doesn't virtual memory mean swap?

On Wednesday, March 25, 2015, Harsh J ha...@cloudera.com wrote:

The suggestion (regarding swappiness) is not for disabling swap so much as it is for "not using swap (until really necessary)". When you run a constant memory-consuming service such as HBase, you'd ideally want the RAM to serve up as much as it can, which setting that swappiness value helps do (the OS otherwise begins swapping way before its available physical RAM is nearing a full state).

The vmem-pmem ratio is something else entirely. The vmem of a process does not mean swap space usage, just simple virtual memory used by the process. I'd recommend disabling YARN's vmem checks on today's OSes (but keep pmem checks on). You can read some more on this at http://www.quora.com/Why-do-some-applications-use-significantly-more-virtual-memory-on-RHEL-6-compared-to-RHEL-5

On Thu, Mar 26, 2015 at 3:37 AM, Abdul I Mohammed oracle.bl...@gmail.com wrote:

Thanks Mith... any idea about the yarn.nodemanager.vmem-pmem-ratio parameter? If datanodes do not require swap, then what about the above parameter - what is it used for in YARN?

--
Harsh J
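[Editor's note] To make Harsh's two recommendations concrete, a hedged sketch; the sysctl value and the yarn-site.xml property names are the commonly used ones, but check your distribution's documentation before applying:

    # OS side: swap only when physical RAM is nearly exhausted.
    sysctl vm.swappiness              # show the current value
    sudo sysctl -w vm.swappiness=1    # 0 or 1 are common choices
    # YARN side (set in yarn-site.xml):
    #   yarn.nodemanager.vmem-check-enabled = false   (disable vmem checks)
    #   yarn.nodemanager.pmem-check-enabled = true    (keep pmem checks)
    #   yarn.nodemanager.vmem-pmem-ratio    = 2.1     (the default ratio)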
Re: AWS Private and Public Ip
The private IP will only work if you are inside your VPC, connected via a VPN or Direct Connect. For what you are doing, I would use the public IP; that should work just fine.

On Fri, Mar 13, 2015 at 3:00 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi,

I am using Elastic IP addresses and assigned them to the instances. After a lot of trial and error, I could at least install Cloudera Manager on AWS. But I have noticed a very strange thing, and I am not sure if I am doing something wrong. When I installed CM on an AWS instance, it gave me a message at the end to open http://privateipaddress:7180 for the Cloudera Manager GUI. However, when I tried to open http://privateaddress:7180 it didn't open, but opening http://Publicip:7180 worked. How should I use only one type of IP address, either public or private?

Thanks
Krish
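[Editor's note] From inside an instance, both addresses can be read from the EC2 metadata service (the same endpoint the user-data script later in this digest uses); a quick check:

    curl http://169.254.169.254/latest/meta-data/local-ipv4    # private IP
    curl http://169.254.169.254/latest/meta-data/public-ipv4   # public IP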
Re: Not able to ping AWS host
Inside your VPC -> subnet -> does the route table have an internet gateway attached (there should be a 0.0.0.0/0 route pointing to it as well)?

On Mon, Mar 9, 2015 at 10:23 PM, Krish Donald gotomyp...@gmail.com wrote:

Yes, the security group has all ports open to 0.0.0.0, and yes, the cluster is inside a VPC.

On Mon, Mar 9, 2015 at 5:15 PM, max scalf oracle.bl...@gmail.com wrote:

When you say the security group has all open ports, is that open to the public (0.0.0.0) or to your specific IP (and if so, is your IP correct)? Also, are the instances inside a VPC?

On Mon, Mar 9, 2015 at 5:05 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi,

I am trying to set up a Hadoop cluster on AWS. After creating an instance, I got the public IP and DNS name, but when I tried to ping it from my Windows machine I was not able to. I am also not able to log on to the machine using PuTTY; it says the network timed out. The security group in the AWS cluster has all TCP, UDP, ICMP, and SSH open. Please let me know if anybody has any idea.

Thanks
Krish
Re: Not able to ping AWS host
That is very interesting - are any network ACLs blocking your inbound connections? If you would like, you can set up a WebEx/GoToMeeting conference and we can troubleshoot this together; we can take this offline, as it is specific to AWS and has nothing to do with Hadoop. I can be reached at oracle.bl...@gmail.com, and I am available later today after 4 PM CST.

On Tue, Mar 10, 2015 at 12:00 PM, Krish Donald gotomyp...@gmail.com wrote:

It is as below:

Route Table: rtb-f377cbxx | myroute
Destination     Target
172.31.0.0/16   local
0.0.0.0/0       igw-6d16cxxx

On Tue, Mar 10, 2015 at 6:47 AM, max scalf oracle.bl...@gmail.com wrote:

Inside your VPC -> subnet -> does the route table have an internet gateway attached (there should be a 0.0.0.0/0 route pointing to it as well)?

On Mon, Mar 9, 2015 at 10:23 PM, Krish Donald gotomyp...@gmail.com wrote:

Yes, the security group has all ports open to 0.0.0.0, and yes, the cluster is inside a VPC.

On Mon, Mar 9, 2015 at 5:15 PM, max scalf oracle.bl...@gmail.com wrote:

When you say the security group has all open ports, is that open to the public (0.0.0.0) or to your specific IP (and if so, is your IP correct)? Also, are the instances inside a VPC?

On Mon, Mar 9, 2015 at 5:05 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi,

I am trying to set up a Hadoop cluster on AWS. After creating an instance, I got the public IP and DNS name, but when I tried to ping it from my Windows machine I was not able to. I am also not able to log on to the machine using PuTTY; it says the network timed out. The security group in the AWS cluster has all TCP, UDP, ICMP, and SSH open. Please let me know if anybody has any idea.

Thanks
Krish
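[Editor's note] The same check can be scripted with the AWS CLI; a hedged sketch with a placeholder VPC id - the route Krish pasted (0.0.0.0/0 -> igw-*) is exactly what this lists:

    # List the route tables (and their routes) for the VPC in question.
    aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxxxxx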
Re: What skills to Learn to become Hadoop Admin
Hi Jay,

Is there a blog or anything that talks about setting up this BigPetStore application? I looked at the Git README file and was a little bit lost - maybe that's because I am new to Hadoop.

On Sat, Mar 7, 2015 at 10:34 AM, jay vyas jayunit100.apa...@gmail.com wrote:

Setting up vendor distros is a great first step.

1) Running TeraSort and benchmarking is a good step. You can also run larger, full-stack hadoop applications like BigPetStore, which we curate here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/.

2) Write some MapReduce or Spark jobs which write data to a persistent transactional store, such as SOLR or HBase. This is a hugely important part of real-world hadoop administration, where you will encounter problems like running out of memory, possibly CPU overclocking on some nodes, and so on.

3) Now, do you want to go deeper into the build/setup/deployment of hadoop? It's worth it to try building/deploying/debugging hadoop ecosystem components from scratch by setting up Apache BigTop, which packages RPM/DEB artifacts and provides puppet recipes for distributions. It's the original root of both the Cloudera and Hortonworks distributions, so you will learn something about both by playing with it. We have some exercises you can use to guide you and get started: https://cwiki.apache.org/confluence/display/BIGTOP/BigTop+U%3A+Exersizes . Feel free to join the mailing list for questions.

On Sat, Mar 7, 2015 at 9:32 AM, max scalf oracle.bl...@gmail.com wrote:

Krish, I don't mean to hijack your mail here, but I wanted to find out how/what you did for the portion below, as I am trying to go down your path as well. I was able to get a 4-5 node cluster up using Ambari and CDH and now want to take it to the next level. What have you done for the below?

"I have done a web log integration using flume and twitter sentiment analysis."

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

Hi,

I would like to enter the Big Data world as a Hadoop Admin, and I have set up a 7-node cluster using Ambari, Cloudera Manager, and Apache Hadoop. I have installed services like hive, oozie, zookeeper, etc. I have done a web log integration using flume and twitter sentiment analysis. I wanted to understand what other skills I should learn.

Thanks
Krish

--
jay vyas
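[Editor's note] For point 1, a typical TeraSort benchmark run looks roughly like this; the examples-jar path varies by distribution, so treat it as a placeholder:

    # Generate ~10 GB of input (100,000,000 rows of 100 bytes), then sort it.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      teragen 100000000 /benchmarks/teragen
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      terasort /benchmarks/teragen /benchmarks/terasort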
Re: sorting in hive -- general
Thank you very much for the explanation, Alexander.

On Sun, Mar 8, 2015 at 1:14 PM, Alexander Pivovarov apivova...@gmail.com wrote:

1. sort by - keys are distributed according to the MR partitioner (controlled by distribute by in hive). Let's assume the hash partitioner uses the same column as sort by and uses the formula x mod 16 to get the reducer id:
reducer 0 will have keys 0, 16, 32
reducer 1 will have keys 1, 17, 33
If you merge reducer 0's and reducer 1's output, you will have 0 16 32 1 17 33.

2. order by will use 1 reducer; hive will send all keys to reducer 0.

So order by in hive works differently from terasort. In the case of terasort you can merge the output files and get one file with globally sorted data.

On Sun, Mar 8, 2015 at 7:55 AM, max scalf oracle.bl...@gmail.com wrote:

Thank you, Alexander. So is it fair to assume that when sort by is used and multiple files are produced (one per reducer), at the end they are all put together/merged to get the results back? And can sort by be used without distribute by and give the same result as order by?

On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov apivova...@gmail.com wrote:

A sort by query produces multiple independent files; order by produces just one file. Usually sort by is used with distribute by. In older hive versions (0.7) they might be used to implement a local sort within a partition, similar to RANK() OVER (PARTITION BY A ORDER BY B).

On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote:

Hello all,

I am new to hadoop and hive in general, and I am reading Hadoop: The Definitive Guide by Tom White. On page 504, in the hive chapter, Tom says the following with regard to sorting:

*Sorting and Aggregating*

*Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in "Total Sort" on page 261). When a globally sorted result is not required - and in many cases it isn't - you can use Hive's nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*

My question is: what exactly does he mean by a globally sorted result? If the sort by operation produces a sorted file per reducer, does that mean at the end of the sort all the reducer outputs are put back together to give the correct results?
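[Editor's note] An illustration of the difference, runnable from the shell with the hive CLI; the table and column names are hypothetical:

    # Single reducer, one globally sorted output file.
    hive -e "SELECT * FROM sales ORDER BY amount"
    # Rows routed to reducers by region; each reducer's output is sorted
    # by amount within itself - sorted per file, not globally.
    hive -e "SELECT * FROM sales DISTRIBUTE BY region SORT BY amount"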
Re: sorting in hive -- general
Thank you, Alexander. So is it fair to assume that when sort by is used and multiple files are produced (one per reducer), at the end they are all put together/merged to get the results back? And can sort by be used without distribute by and give the same result as order by?

On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov apivova...@gmail.com wrote:

A sort by query produces multiple independent files; order by produces just one file. Usually sort by is used with distribute by. In older hive versions (0.7) they might be used to implement a local sort within a partition, similar to RANK() OVER (PARTITION BY A ORDER BY B).

On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote:

Hello all,

I am new to hadoop and hive in general, and I am reading Hadoop: The Definitive Guide by Tom White. On page 504, in the hive chapter, Tom says the following with regard to sorting:

*Sorting and Aggregating*

*Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in "Total Sort" on page 261). When a globally sorted result is not required - and in many cases it isn't - you can use Hive's nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*

My question is: what exactly does he mean by a globally sorted result? If the sort by operation produces a sorted file per reducer, does that mean at the end of the sort all the reducer outputs are put back together to give the correct results?
Re: What skills to Learn to become Hadoop Admin
Krish, I don't mean to hijack your mail here, but I wanted to find out how/what you did for the portion below, as I am trying to go down your path as well. I was able to get a 4-5 node cluster up using Ambari and CDH and now want to take it to the next level. What have you done for the below?

"I have done a web log integration using flume and twitter sentiment analysis."

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

Hi,

I would like to enter the Big Data world as a Hadoop Admin, and I have set up a 7-node cluster using Ambari, Cloudera Manager, and Apache Hadoop. I have installed services like hive, oozie, zookeeper, etc. I have done a web log integration using flume and twitter sentiment analysis. I wanted to understand what other skills I should learn.

Thanks
Krish
sorting in hive -- general
Hello all,

I am new to hadoop and hive in general, and I am reading Hadoop: The Definitive Guide by Tom White. On page 504, in the hive chapter, Tom says the following with regard to sorting:

*Sorting and Aggregating*

*Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in "Total Sort" on page 261). When a globally sorted result is not required - and in many cases it isn't - you can use Hive's nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*

My question is: what exactly does he mean by a globally sorted result? If the sort by operation produces a sorted file per reducer, does that mean at the end of the sort all the reducer outputs are put back together to give the correct results?
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
@Jonathan, I totally agree that this is reinventing the wheel, but think about the folks who want to do this setup from scratch to better understand hadoop, or maybe the folks who are going to do admin-related work... hence the need to set it up from scratch.

@Alexander, yes, you are right - it is a one-time effort to set up FreeDNS, but for me it was easy enough because I gave static hostnames through the user-data script and also a static IP address for each host. Once that was done, the way I pushed out /etc/hosts was as follows: on, let's say, the master node, I edited the /etc/hosts file and put all my other nodes' info in there; next, set up SSH (as we have to do this anyway for the hadoop install); once SSH is set up, just create a new file called hosts.txt, put all your hostnames in there, and run a for loop like the one below:

for host in `cat hosts.txt`; do scp /etc/hosts root@$host:/etc/hosts; done

When I first got started on HDP, I used the link below, which helped me; it pushes out the /etc/hosts file and also does other stuff... check it out:

http://sacharya.com/deploying-multinode-hadoop-20-cluster-using-apache-ambari/

On Fri, Mar 6, 2015 at 12:43 AM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

The only limitation I know of is how many nodes you can have and how many instances of that particular size the host your instances are on can support. You can load hive in EMR, and any other features of the cluster are managed at the master node level, as you have SSH access there. What are the advantages of 2.6 over 2.4, for example? I just feel you guys are reinventing the wheel when amazon already caters for hadoop - granted, it might not be 2.6.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 07:31, Alexander Pivovarov wrote:

I think EMR has its own limitations. E.g., I want to set up hadoop 2.6.0 with kerberos + hive-1.2.0 to test my hive patch. How can EMR help me? It supports hadoop only up to 2.4.0 (not even 2.4.1):
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html

On Thu, Mar 5, 2015 at 9:51 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

Hi guys, I know you want to keep costs down, but why go through all the effort of setting up EC2 instances when deploying EMR takes the time to provision and set up the EC2 instances for you? All configuration for the entire cluster is then done on the master node of the particular cluster, and setting up additional software is all done through the EMR console. We were doing some geospatial calculations, and we loaded a 3rd-party jar file called esri into the EMR cluster. I then had to pass a small bootstrap action (script) to have it distribute esri to the entire cluster. Why are you guys reinventing the wheel?

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 03:35, Alexander Pivovarov wrote:

I found the following solution to this problem. I registered 2 subdomains (public and local) for each computer on https://freedns.afraid.org/subdomain/, e.g.

myhadoop-nn.crabdance.com
myhadoop-nn-local.crabdance.com

Then I added a cron job which sends http requests to update the public and local IPs on the freedns server. Hint: the public IP is detected automatically; the IP address for the local name can be set using the request parameter address=10.x.x.x (don't forget to escape it). As a result, my nn computer has 2 DNS names with the currently assigned IP addresses, e.g.:

myhadoop-nn.crabdance.com        54.203.181.177
myhadoop-nn-local.crabdance.com  10.220.149.103

In the hadoop configuration I can use the local machine names; to access my cluster from outside of AWS I can use the public names. Just curious, does AWS provide an easier way to name EC2 computers?

On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

I don't know how you would do that, to be honest. With EMR you have distinctions between master, core, and task nodes. If you need to change configuration, you just ssh into the EMR master node.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 02:11, Alexander Pivovarov wrote:

What is the easiest way to assign names to AWS EC2 computers? I guess a computer needs a static hostname and DNS name before it can be used in a hadoop cluster.

On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

When I started with EMR, it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what amazon calls a transient cluster - meaning you fire it up, run the job, and tear it back down.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 01:10, Krish Donald wrote:

Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also? I would like to set up a Hadoop cluster using Cloudera Manager and then do the below things:

setup kerberos
setup federation
setup monitoring
setup HA/DR backup and recovery
authorization using sentry
backup and recovery of individual components
performance tuning
upgrade of CDH
upgrade of CM
Hue user administration
Spark
Solr

Thanks
Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
Here is an easy way to go about assigning a static name to your EC2 instance. When you launch an EC2 instance from the AWS console and get to the point of selecting the VPC and IP address, there is a screen that says USER DATA. Put the script below in with an appropriate host name (change CHANGE_HOST_NAME_HERE to whatever you want) and that should get you a static name.

#!/bin/bash
HOSTNAME_TAG=CHANGE_HOST_NAME_HERE
# write the static hostname configuration
cat > /etc/sysconfig/network <<EOF
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=${HOSTNAME_TAG}
EOF
# register this host's private IP and hostname locally
IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
echo "${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG}" >> /etc/hosts
echo "${HOSTNAME_TAG}" > /proc/sys/kernel/hostname
service network restart

Also note that I was able to do this on a couple of spot instances for a cheap price. The only thing is, once you shut one down or someone outbids you, you lose that instance, but it's easy/cheap to play around with, and I have used a couple of m3.medium instances for my NN/SNN and a couple more for datanodes...

On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

I don't know how you would do that, to be honest. With EMR you have distinctions between master, core, and task nodes. If you need to change configuration, you just ssh into the EMR master node.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 02:11, Alexander Pivovarov wrote:

What is the easiest way to assign names to AWS EC2 computers? I guess a computer needs a static hostname and DNS name before it can be used in a hadoop cluster.

On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

When I started with EMR, it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what amazon calls a transient cluster - meaning you fire it up, run the job, and tear it back down.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 01:10, Krish Donald wrote:

Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also? I would like to set up a Hadoop cluster using Cloudera Manager and then do the below things:

setup kerberos
setup federation
setup monitoring
setup HA/DR backup and recovery
authorization using sentry
backup and recovery of individual components
performance tuning
upgrade of CDH
upgrade of CM
Hue user administration
Spark
Solr

Thanks
Krish

On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

Krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as java can be pretty RAM hungry.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 00:41, Krish Donald wrote:

Hi,

I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS - is it enough for setting up a Hadoop cluster? I would like to use the free tier as of now. Please advise.

Thanks
Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
Unfortunately, without DNS you have to rely on /etc/hosts, so put an entry for all your nodes (nn, snn, dn1, dn2, etc.) in the /etc/hosts file on all nodes. I have tested that with Hortonworks (using Ambari) and with Cloudera Manager, and I am fairly certain it will work for MapR.

On Thu, Mar 5, 2015 at 8:47 PM, Alexander Pivovarov apivova...@gmail.com wrote:

What about DNS? If you have 2 computers (nn and dn), how does the nn know the dn's IP? The script puts only this computer's IP into /etc/hosts.

On Thu, Mar 5, 2015 at 6:39 PM, max scalf oracle.bl...@gmail.com wrote:

Here is an easy way to go about assigning a static name to your EC2 instance. When you launch an EC2 instance from the AWS console and get to the point of selecting the VPC and IP address, there is a screen that says USER DATA. Put the script below in with an appropriate host name (change CHANGE_HOST_NAME_HERE to whatever you want) and that should get you a static name.

#!/bin/bash
HOSTNAME_TAG=CHANGE_HOST_NAME_HERE
# write the static hostname configuration
cat > /etc/sysconfig/network <<EOF
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=${HOSTNAME_TAG}
EOF
# register this host's private IP and hostname locally
IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
echo "${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG}" >> /etc/hosts
echo "${HOSTNAME_TAG}" > /proc/sys/kernel/hostname
service network restart

Also note that I was able to do this on a couple of spot instances for a cheap price. The only thing is, once you shut one down or someone outbids you, you lose that instance, but it's easy/cheap to play around with, and I have used a couple of m3.medium instances for my NN/SNN and a couple more for datanodes...

On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

I don't know how you would do that, to be honest. With EMR you have distinctions between master, core, and task nodes. If you need to change configuration, you just ssh into the EMR master node.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 02:11, Alexander Pivovarov wrote:

What is the easiest way to assign names to AWS EC2 computers? I guess a computer needs a static hostname and DNS name before it can be used in a hadoop cluster.

On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

When I started with EMR, it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what amazon calls a transient cluster - meaning you fire it up, run the job, and tear it back down.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 01:10, Krish Donald wrote:

Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also? I would like to set up a Hadoop cluster using Cloudera Manager and then do the below things:

setup kerberos
setup federation
setup monitoring
setup HA/DR backup and recovery
authorization using sentry
backup and recovery of individual components
performance tuning
upgrade of CDH
upgrade of CM
Hue user administration
Spark
Solr

Thanks
Krish

On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:

Krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as java can be pretty RAM hungry.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 00:41, Krish Donald wrote:

Hi,

I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS - is it enough for setting up a Hadoop cluster? I would like to use the free tier as of now. Please advise.

Thanks
Krish
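[Editor's note] To make the /etc/hosts approach in this thread concrete, a hedged sketch combining the two ideas above (the node names and IPs are placeholders; root SSH access between the nodes is assumed, as in the earlier post):

    # Append every node to /etc/hosts once, then push the same file to all
    # nodes so each host can resolve every other.
    cat >> /etc/hosts <<'EOF'
    10.0.0.10 nn
    10.0.0.11 snn
    10.0.0.12 dn1
    10.0.0.13 dn2
    EOF
    for host in nn snn dn1 dn2; do
      scp /etc/hosts root@"$host":/etc/hosts
    done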