Re: Slow read from S3 on CDH 5.8.0 (includes HADOOP-12346)

2016-08-20 Thread max scalf
Just out of curiosity, have you enabled an S3 endpoint for this?  Hopefully you
are running this cluster inside a VPC; if so, an endpoint would help, as the
S3 traffic would not go out over the Internet...

Any new policies put in place for your S3 bucket, since others have mentioned
something about throttling?
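
For reference, a minimal sketch of creating an S3 gateway endpoint with the
AWS CLI (the VPC and route table IDs below are placeholders):

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0

Once the endpoint is associated with the subnet's route table, S3 traffic
stays on the AWS network instead of going through a NAT or Internet gateway.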

On Wed, Aug 17, 2016, 3:22 PM Sebastian Nagel 
wrote:

> Hi Dheeren, hi Chris,
>
>
> >> Are you able to share a bit more about your deployment architecture?
> Are these EC2 VMs?  If so,
> are they co-located in the same AWS region as the S3 bucket?
>
> Running a cluster of 100 m1.xlarge EC2 instances with Ubuntu 14.04
> (ami-41a20f2a).
> The cluster is running in a single availability zone (us-east-1d), the S3
> bucket
> is in the same region (us-east-1).
>
> % lsb_release -d
> Description:Ubuntu 14.04.3 LTS
>
> % uname -a
> Linux ip-10-91-235-121 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29
> 11:21:34 UTC 2015 x86_64 x86_64
> x86_64 GNU/Linux
>
> > Did you change the java jdk version as well, as part of the upgrade?
>
> Java is taken as provided by Ubuntu:
>
> % java -version
> java version "1.7.0_111"
> OpenJDK Runtime Environment (IcedTea 2.6.7) (7u111-2.6.7-0ubuntu0.14.04.3)
> OpenJDK 64-Bit Server VM (build 24.111-b01, mixed mode)
>
> Cloudera CDH is installed from
>
> http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
>
> After the jobs are done the cluster is shut down and bootstrapped (bash +
> cloudinit) anew on demand.
> A new launch of the cluster may, of course, include updates of
>  - the underlying Amazon machine image
>  - Ubuntu packages
>  - Cloudera packages
>
> And the real cause of the problem may come from any of these changes.
> The update to Cloudera CDH 5.8.0 was just the most obvious change around
> the time the problems appeared (first seen 2016-08-01).
>
> >> If the cluster is not running in EC2 (e.g. on-premises physical
> hardware), then are there any
> notable differences on nodes that experienced this problem (e.g. smaller
> capacity on the outbound NIC)?
>
> Probably not, although I cannot exclude it. Over the last few days I've run
> into problems which could be related: a few tasks are slow, and even seem to
> hang, e.g., reducers during the copy phase. But that also looks more like a
> Hadoop (configuration) problem. Network throughput between nodes measured
> with iperf is not super-performant but generally ok (5-20 MBit/s).
>
 >> This is just a theory, but if your bandwidth to the S3 service is
> intermittently saturated or
> throttled or somehow compromised, then I could see how longer timeouts and
> more retries might
> increase overall job time.  With the shorter settings, it might cause
> individual task attempts to
> fail sooner.  Then, if the next attempt gets scheduled to a different node
> with better bandwidth to
> S3, it would start making progress faster in the second attempt.  Then,
> the effect on overall job
> execution might be faster.
>
> That's also my assumption. When connecting to S3, a server is selected that
> is fast at that moment. While copying 1 GB, which takes a couple of minutes
> just because of general network throughput, that server may become more
> loaded. On reconnecting, a better server is chosen.
>
> Btw., tasks are not failing when choosing a moderate timeout - 30 sec. is
> ok; with lower values (a few seconds) the file uploads frequently fail.
>
> I've seen this behavior with a simple distcp from S3: with the default
> values, it took 1 day to copy
> 300 GB from S3 to HDFS. After choosing a shorter timeout the job finished
> within 5 hours.
>
> Thanks,
> Sebastian
>
> On 08/16/2016 09:11 PM, Dheeren Bebortha wrote:
> > Did you change the java jdk version as well, as part of the upgrade?
> > Dheeren
> >
> >> On Aug 16, 2016, at 11:59 AM, Chris Nauroth 
> wrote:
> >>
> >> Hello Sebastian,
> >>
> >> This is an interesting finding.  Thank you for reporting it.
> >>
> >> Are you able to share a bit more about your deployment architecture?
> Are these EC2 VMs?  If so, are they co-located in the same AWS region as
> the S3 bucket?  If the cluster is not running in EC2 (e.g. on-premises
> physical hardware), then are there any notable differences on nodes that
> experienced this problem (e.g. smaller capacity on the outbound NIC)?
> >>
> >> This is just a theory, but if your bandwidth to the S3 service is
> intermittently saturated or throttled or somehow compromised, then I could
> see how longer timeouts and more retries might increase overall job time.
> With the shorter settings, it might cause individual task attempts to fail
> sooner.  Then, if the next attempt gets scheduled to a different node with
> better bandwidth to S3, it would start making progress faster in the second
> attempt.  Then, the effect on overall job execution might be faster.
> >>
> >> --Chris Nauroth
> >>
> >> On 8/7/16, 12:12 PM, "Sebastian Nagel" 
> wrote:
> >>
> >>Hi,
> >>
> >>  
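
For reference, the S3A timeout and retry settings discussed in this thread can
also be tightened per job rather than cluster-wide, e.g. on a distcp command
line (the property names are the standard s3a ones; the values below are only
illustrative, not recommendations):

hadoop distcp \
  -D fs.s3a.connection.timeout=30000 \
  -D fs.s3a.attempts.maximum=3 \
  s3a://my-bucket/input hdfs:///data/input

With a shorter socket timeout and fewer retries, a task attempt stuck on a
slow S3 connection fails quickly and gets rescheduled, possibly onto a node
with better bandwidth to S3, which matches the behavior described above.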

Re: HDFS backup to S3

2016-06-15 Thread max scalf
Hi Anu,

Thanks for the information, but the link you provided does not work.

@Hari,

Let me do some quick research on what you guys can provide and get back to
you.

On Wed, Jun 15, 2016, 10:59 AM Anu Engineer <aengin...@hortonworks.com>
wrote:

> Hi Max,
>
>
>
> Unfortunately, we don’t have a better solution at the moment. I am
> wondering if the right approach might be to use user-defined metadata (
> http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) and
> put that information along with the object that we are backing up.
>
>
>
> However, that would be a code change in DistCp, and not as easy as a
> script. But that would address the scalability issue that you are worried
> about.
>
>
>
> Thanks
>
> Anu
>
>
>
>
>
>
>
> *From: *max scalf <oracle.bl...@gmail.com>
> *Date: *Wednesday, June 15, 2016 at 7:15 AM
> *To: *HDP mailing list <user@hadoop.apache.org>
> *Subject: *HDFS backup to S3
>
>
>
> Hello Hadoop community,
>
>
>
> we are running Hadoop in AWS (not EMR), using the Hortonworks distro on EC2
> instances.  Everything is set up and working as expected.  Our design
> calls for running HDFS/data nodes on local/ephemeral storage and we have 3x
> replication enabled by default; all of the metastores (hive, oozie, ranger,
> ambari, etc.) are external to the cluster, using RDS/MySQL.
>
>
>
> The question that I have is with regards to backups.  We want to run a
> nightly job that copies data from HDFS into S3.  Since our cluster
> lives in AWS, the obvious choice is to back up to S3.  We do not
> want a warm backup (backing this cluster up to another cluster); our RTO/RPO
> is 5 days for this cluster.  So we can run distcp (something like the link
> below) to back up our HDFS to S3, and we have tested this and it works just
> fine, but how do we go about storing the ownership/permissions of these files?
>
>
>
> http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script
>
>
>
> As S3 is blob storage and does not store any ownership/permissions, how
> do we go about backing those up?  One idea I had was to run hdfs dfs
> -lsr (recursively listing all files and folders with their
> permissions/ownership), dump that into a file, and send that file over to S3
> as well. I am guessing it will work for now, but as the cluster grows it
> might not scale...
>
>
>
> So I wanted to find out how people manage backing up the
> ownership/permissions of HDFS files/folders when sending backups to blob
> storage like S3.
>
>
>
>
>


HDFS backup to S3

2016-06-15 Thread max scalf
Hello Hadoop community,

we are running Hadoop in AWS (not EMR), using the Hortonworks distro on EC2
instances.  Everything is set up and working as expected.  Our design
calls for running HDFS/data nodes on local/ephemeral storage and we have 3x
replication enabled by default; all of the metastores (hive, oozie, ranger,
ambari, etc.) are external to the cluster, using RDS/MySQL.

The question that I have is with regards to backups.  We want to run a
nightly job that copies data from HDFS into S3.  Since our cluster
lives in AWS, the obvious choice is to back up to S3.  We do not
want a warm backup (backing this cluster up to another cluster); our RTO/RPO is
5 days for this cluster.  So we can run distcp (something like the link below)
to back up our HDFS to S3, and we have tested this and it works just fine, but
how do we go about storing the ownership/permissions of these files?

http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script
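
For reference, the nightly copy itself can be as simple as the sketch below
(the bucket name and paths are placeholders, and it assumes s3a credentials
are already configured for the cluster):

hadoop distcp -update hdfs:///data s3a://my-backup-bucket/data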

As S3 is blob storage and does not store any ownership/permissions, how do
we go about backing those up?  One idea I had was to run hdfs dfs
-lsr (recursively listing all files and folders with their permissions/ownership),
dump that into a file, and send that file over to S3 as well. I am
guessing it will work for now, but as the cluster grows it might not scale...
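
A minimal sketch of that idea (paths and bucket are placeholders; it assumes
the AWS CLI is installed on the node running the backup, and uses ls -R, the
non-deprecated form of -lsr):

hdfs dfs -ls -R / > /tmp/hdfs-perms-$(date +%F).txt
aws s3 cp /tmp/hdfs-perms-$(date +%F).txt s3://my-backup-bucket/metadata/

On restore, the listing could be replayed with hdfs dfs -chown / -chmod calls.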

So I wanted to find out how people manage backing up the ownership/permissions
of HDFS files/folders when sending backups to blob storage like S3.


Re: HDFS how to specify the exact datanode to put data on?

2015-07-20 Thread max scalf
May I ask why you need to do that?  Why not let Hadoop handle that for you?

On Sunday, July 19, 2015, Shiyao Ma i...@introo.me wrote:

 Hi,


 I'd like to put my data selectively on some datanodes.

 Currently I can do that by shutting down unneeded datanodes. But this is
 a little laborious.

 Is it possible to directly specify the datanodes I'd like to put the data
 on when doing operations like: hdfs dfs -put /


 TIA.

 --

 I am a cat. My homepage is https://introo.me http://introo.me



Re: copy data from one hadoop cluster to another hadoop cluster + cant use distcp

2015-06-19 Thread max scalf
Not to hijack this post, but how would you deal with data that is maintained
by Hive (ORC format files, Hive-created tables, etc.)... Would we copy the
Hive metastore (MySQL) and move that over to the new cluster?
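
For reference, Joep's "pipe a get to a put" suggestion (quoted below) could
look roughly like the sketch here, run on a single client machine that can
reach both clusters (the namenode addresses and paths are placeholders):

hadoop fs -cat hdfs://nn-cluster1:8020/data/part-00000 | \
  hadoop fs -put - hdfs://nn-cluster2:8020/data/part-00000

This avoids staging the file on the local disk, but all the data still flows
through that one client, so it will be slow for any serious volume.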

On Friday, June 19, 2015, Joep Rottinghuis jrottingh...@gmail.com wrote:

 You can't set up a proxy ?
 You probably want to avoid writing to local file system because aside from
 that being slow, it limits the size of your file to the free space on your
 local disc.

 If you do need to go commando and go through a single client machine that
 can see both clusters you probably want to pipe a get to a put.

 Any kind of serious data volume pulled through a straw is going to be
 rather slow though.

 Cheers,

 Joep

 Sent from my iPhone

 On Jun 19, 2015, at 12:09 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 yes

 On Fri, Jun 19, 2015 at 11:36 AM, Divya Gehlot divya.htco...@gmail.com wrote:

 In that case it will be a three-step process:
 1. first cluster (secure zone) HDFS -> copyToLocal -> user local file
 system
 2. user local space -> copy data -> second cluster user local file system
 3. second cluster user local file system -> copyFromLocal -> second
 cluster HDFS

 Am I on the right track?



 On 19 June 2015 at 12:38, Nitin Pawar nitinpawar...@gmail.com wrote:

 What's the size of the data?
 If you cannot do distcp between clusters, then the other way is doing an hdfs
 get on the data and then an hdfs put on the other cluster.
 On 19-Jun-2015 9:56 am, Divya Gehlot divya.htco...@gmail.com wrote:

 Hi,
 I need to copy data from the first hadoop cluster to the second hadoop cluster.
 I can't access the second hadoop cluster from the first hadoop cluster due to
 some security issues.
 Can anyone point me to how I can do this, apart from the distcp command?
 For instance:
 Cluster 1 (secured zone) -> copy hdfs data to -> cluster 2 (non-secured zone)



 Thanks,
 Divya






 --
 Nitin Pawar




Re: Swap requirements

2015-03-25 Thread max scalf
Thank you Harsh.  Can you please explain what you meant when you said "just
simple virtual memory used by the process"?  Doesn't virtual memory mean swap?
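
For reference, a minimal sketch of the knobs Harsh describes below (the values
are illustrative, not recommendations):

# keep swap configured but strongly prefer RAM
sysctl -w vm.swappiness=1

# in yarn-site.xml: keep the physical-memory check, relax the virtual-memory check
#   yarn.nodemanager.pmem-check-enabled  = true
#   yarn.nodemanager.vmem-check-enabled  = false
#   yarn.nodemanager.vmem-pmem-ratio     = 2.1   (only consulted while vmem checks are on)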

On Wednesday, March 25, 2015, Harsh J ha...@cloudera.com wrote:

 The suggestion (regarding swappiness) is not for disabling swap as much as
 it is to 'not using swap (until really necessary)'. When you run a constant
 memory-consuming service such as HBase you'd ideally want the RAM to serve
 up as much as it can, which setting that swappiness value helps do (the OS
 otherwise begins swapping way before its available physical RAM is nearing
 full state).

 The vmem-pmem ratio is something else entirely. The vmem of a process does
 not mean swap space usage, just the virtual memory used by the process. I'd
 recommend disabling YARN's vmem checks on today's OSes (but keep pmem
 checks on). You can read some more on this at
 http://www.quora.com/Why-do-some-applications-use-significantly-more-virtual-memory-on-RHEL-6-compared-to-RHEL-5

 On Thu, Mar 26, 2015 at 3:37 AM, Abdul I Mohammed oracle.bl...@gmail.com wrote:

 Thanks Mith... any idea about the yarn.nodemanager.vmem-pmem-ratio parameter?

 If data nodes do not require swap, then what about the above parameter?
 What is it used for in YARN?




 --
 Harsh J



Re: AWS Private and Public Ip

2015-03-13 Thread max scalf
The private IP will only work if you are connected into your VPC, e.g. over a
VPN or Direct Connect.  For what you are doing, I would use the public IP;
that should work just fine.

On Fri, Mar 13, 2015 at 3:00 PM, Krish Donald gotomyp...@gmail.com wrote:

 Hi,

 I am using Elastic Ip Address and assigned Elastic Ip Address to the
 instances.

 After doing a lot of trial and error, I could at least install Cloudera
 Manager on AWS.

 But I have noticed a very strange thing, and I am not sure if I am doing
 something wrong.
 When I installed CM on AWS on an instance, it gave me a message at the
 end to open http://privateipaddress:7180 to get to the Cloudera Manager
 GUI.

 However, when I tried to open http://privateaddress:7180 it didn't open,
 but when I tried opening http://Publicip:7180 it did.

 How should I use only one type of ip address, either public or private?

 Thanks
 Krish



Re: Not able to ping AWS host

2015-03-10 Thread max scalf
Inside your VPC -> subnet -> does the route table have an internet gateway
attached (there should be a route for 0.0.0.0/0 pointing to it as well)?
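
For reference, one way to check this from the AWS CLI (the VPC ID is a
placeholder):

aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-0123456789abcdef0

In the output, look for a route with destination 0.0.0.0/0 whose target is an
igw-... entry.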

On Mon, Mar 9, 2015 at 10:23 PM, Krish Donald gotomyp...@gmail.com wrote:

 Yes, the security group has all ports open to 0.0.0.0, and yes, the cluster is
 in a VPC.

 On Mon, Mar 9, 2015 at 5:15 PM, max scalf oracle.bl...@gmail.com wrote:

 when you say the security group has all open ports, is that open to the
 public (0.0.0.0) or to your specific IP (if so, is your IP correct)?

 Also, are the instances inside of a VPC?

 On Mon, Mar 9, 2015 at 5:05 PM, Krish Donald gotomyp...@gmail.com
 wrote:

 Hi,

 I am trying to set up a Hadoop cluster on AWS.
 After creating an instance, I got the public IP and DNS.
 But when I tried to ping it from my Windows machine, I was not able to.

 I am also not able to log on to the machine using PuTTY;
 it says the network timed out.

 The security group in the AWS cluster has all TCP, UDP, ICMP and SSH open
 as well.

 Please let me know if anybody has any idea.

 Thanks
 Krish






Re: Not able to ping AWS host

2015-03-10 Thread max scalf
That is very interesting. Are any network ACLs blocking your inbound
connections?  If you would like, you can set up a WebEx/GoToMeeting conference
and we can troubleshoot this together; we can take this offline since it is
specific to AWS and has nothing to do with Hadoop.

I can be reached at oracle.bl...@gmail.com; I am available to do this later
today after 4 PM CST.

On Tue, Mar 10, 2015 at 12:00 PM, Krish Donald gotomyp...@gmail.com wrote:

 It is as below:

 Route Table: rtb-f377cbxx | myroute
 https://us-west-2.console.aws.amazon.com/vpc/home?region=us-west-2#routetables:filter=rtb-f377cb96

 Destination      Target
 172.31.0.0/16    local
 0.0.0.0/0        igw-6d16cxxx

 On Tue, Mar 10, 2015 at 6:47 AM, max scalf oracle.bl...@gmail.com wrote:

 Inside your VPC -> subnet -> does the route table have an internet
 gateway attached (there should be a route for 0.0.0.0/0 pointing to it as well)?

 On Mon, Mar 9, 2015 at 10:23 PM, Krish Donald gotomyp...@gmail.com
 wrote:

 Yes, the security group has all ports open to 0.0.0.0, and yes, the cluster
 is in a VPC.

 On Mon, Mar 9, 2015 at 5:15 PM, max scalf oracle.bl...@gmail.com
 wrote:

 when you say the security group has all open ports, is that open to the
 public (0.0.0.0) or to your specific IP (if so, is your IP correct)?

 Also, are the instances inside of a VPC?

 On Mon, Mar 9, 2015 at 5:05 PM, Krish Donald gotomyp...@gmail.com
 wrote:

 Hi,

 I am trying to set up a Hadoop cluster on AWS.
 After creating an instance, I got the public IP and DNS.
 But when I tried to ping it from my Windows machine, I was not able
 to.

 I am also not able to log on to the machine using PuTTY;
 it says the network timed out.

 The security group in the AWS cluster has all TCP, UDP, ICMP and SSH open
 as well.

 Please let me know if anybody has any idea.

 Thanks
 Krish








Re: What skills to Learn to become Hadoop Admin

2015-03-09 Thread max scalf
Hi Jay,

Is there a blog or anything that talks about setting up this big pet store
application?  I looked at the Git README file and was a little bit
lost.  Maybe that's because I am new to Hadoop.

On Sat, Mar 7, 2015 at 10:34 AM, jay vyas jayunit100.apa...@gmail.com
wrote:

 Setting up vendor distros is a great first step.

 1) Running TeraSort and benchmarking is a good step.  You can also run
 larger, full stack hadoop applications like bigpetstore, which we curate
 here : https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/.

 2) Write some mapreduce or spark jobs which write data to a persistent
 transactional store, such as SOLR or HBase.  This is a hugely important
 part of real world hadoop administration, where you will encounter problems
 like running out of memory, possibly CPU overclocking on some nodes, and so
 on.

 3) Now, did you want to go deeper into the build/setup/deployment of
 hadoop?  It's worth it to try building/deploying/debugging hadoop ecosystem
 components from scratch by setting up Apache BigTop, which packages
 RPM/DEB artifacts and provides puppet recipes for distributions.  It's the
 original root of both the cloudera and hortonworks distributions, so you
 will learn something about both by playing with it.

 We have some exercises you can use to guide you and get started:
 https://cwiki.apache.org/confluence/display/BIGTOP/BigTop+U%3A+Exersizes
 Feel free to join the mailing list for questions.




 On Sat, Mar 7, 2015 at 9:32 AM, max scalf oracle.bl...@gmail.com wrote:

 Krish,

 I don't mean to hijack your mail here, but I wanted to find out how/what
 you did for the below portion, as I am trying to go down your path as well.
 I was able to get a 4-5 node cluster up using Ambari and CDH and now want to
 take it to the next level.  What have you done for the below?

 I have done a web log integration using flume and twitter sentiment
 analysis.

 On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com
 wrote:

 Hi,

 I would like to enter the Big Data world as a Hadoop Admin, and I have
 set up a 7-node cluster using Ambari, Cloudera Manager and Apache Hadoop.
 I have installed services like hive, oozie, zookeeper etc.

 I have done a web log integration using flume and twitter sentiment
 analysis.

 I wanted to understand what other skills I should learn?

 Thanks
 Krish





 --
 jay vyas



Re: sorting in hive -- general

2015-03-08 Thread max scalf
Thank you very much for the explanation Alexander.

On Sun, Mar 8, 2015 at 1:14 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 1. sort by -
 keys are distributed according to the MR partitioner (controlled by
 DISTRIBUTE BY in hive)

 Let's assume the hash partitioner uses the same column as sort by and uses an
 x mod 16 formula to get the reducer id

 reducer 0 will have keys
 0
 16
 32

 reducer 1 will have keys
 1
 17
 33


 if you merge the reducer 0 and reducer 1 outputs you will have
 0
 16
 32
 1
 17
 33


 2. order by will use 1 reducer and hive will send all keys to reducer 0

 So order by in hive works differently from terasort. In the case of terasort
 you can merge the output files and get one file with globally sorted data.




 On Sun, Mar 8, 2015 at 7:55 AM, max scalf oracle.bl...@gmail.com wrote:

 Thank you Alexander.  So is it fair to assume that when sort by is used and
 multiple files are produced (one per reducer), at the end all of them are
 put together/merged to get the results back?

 And can sort by be used without distribute by and still give the same result
 as order by?

 On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov apivova...@gmail.com
  wrote:

 sort by query produces multiple independent files.

 order by - just one file

 usually sort by is used with distribute by.
 In older hive versions (0.7) they might be used to implement a local sort
 within a partition,
 similar to RANK() OVER (PARTITION BY A ORDER BY B)


 On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com
 wrote:

 Hello all,

 I am new to hadoop and hive in general, and I am reading Hadoop: The
 Definitive Guide by Tom White. On page 504, in the Hive chapter, Tom
 says the below with regards to sorting:

 *Sorting and Aggregating*
 *Sorting data in Hive can be achieved by using a standard ORDER BY
 clause. ORDER BY performs a parallel total sort of the input (like that
 described in “Total Sort” on page 261). When a globally sorted result is
 not required—and in many cases it isn’t—you can use Hive’s nonstandard
 extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


 My question is: what exactly does he mean by a "globally sorted
 result"? If the sort by operation produces a sorted file per reducer, does
 that mean that at the end of the sort all the reducer outputs are put back
 together to give the correct result?









Re: sorting in hive -- general

2015-03-08 Thread max scalf
Thank you Alexander.  So is it fair to assume that when sort by is used and
multiple files are produced (one per reducer), at the end all of them are
put together/merged to get the results back?

And can sort by be used without distribute by and still give the same result
as order by?

On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 sort by query produces multiple independent files.

 order by - just one file

 usually sort by is used with distribute by.
 In older hive versions (0.7) they might be used to implement a local sort
 within a partition,
 similar to RANK() OVER (PARTITION BY A ORDER BY B)


 On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote:

 Hello all,

 I am new to hadoop and hive in general, and I am reading Hadoop: The
 Definitive Guide by Tom White. On page 504, in the Hive chapter, Tom
 says the below with regards to sorting:

 *Sorting and Aggregating*
 *Sorting data in Hive can be achieved by using a standard ORDER BY
 clause. ORDER BY performs a parallel total sort of the input (like that
 described in “Total Sort” on page 261). When a globally sorted result is
 not required—and in many cases it isn’t—you can use Hive’s nonstandard
 extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


 My question is: what exactly does he mean by a "globally sorted result"?
 If the sort by operation produces a sorted file per reducer, does that mean
 that at the end of the sort all the reducer outputs are put back together to
 give the correct result?







Re: What skills to Learn to become Hadoop Admin

2015-03-07 Thread max scalf
Krish,

I don't mean to hijack your mail here, but I wanted to find out how/what you
did for the below portion, as I am trying to go down your path as well. I
was able to get a 4-5 node cluster up using Ambari and CDH and now want to
take it to the next level.  What have you done for the below?

I have done a web log integration using flume and twitter sentiment
analysis.

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

 Hi,

 I would like to enter the Big Data world as a Hadoop Admin, and I have set up
 a 7-node cluster using Ambari, Cloudera Manager and Apache Hadoop.
 I have installed services like hive, oozie, zookeeper etc.

 I have done a web log integration using flume and twitter sentiment
 analysis.

 I wanted to understand what other skills I should learn?

 Thanks
 Krish



sorting in hive -- general

2015-03-07 Thread max scalf
Hello all,

I am new to hadoop and hive in general, and I am reading Hadoop: The
Definitive Guide by Tom White. On page 504, in the Hive chapter, Tom
says the below with regards to sorting:

*Sorting and Aggregating*
*Sorting data in Hive can be achieved by using a standard ORDER BY clause.
ORDER BY performs a parallel total sort of the input (like that described
in “Total Sort” on page 261). When a globally sorted result is not
required—and in many cases it isn’t—you can use Hive’s nonstandard
extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


My question is: what exactly does he mean by a "globally sorted result"? If
the sort by operation produces a sorted file per reducer, does that mean that
at the end of the sort all the reducer outputs are put back together to give
the correct result?
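
For illustration, a minimal sketch of the two variants from the Hive CLI
(assuming a table named "logs" with an integer column "id" exists; the names
are placeholders):

# ORDER BY: Hive forces a single reducer, producing one globally sorted file.
hive -e "SELECT id FROM logs ORDER BY id;"

# DISTRIBUTE BY + SORT BY: each reducer writes its own sorted file;
# simply concatenating those files does not give one globally sorted result.
hive -e "SET mapred.reduce.tasks=4; SELECT id FROM logs DISTRIBUTE BY id SORT BY id;"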


Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-06 Thread max scalf
@Jonathan,

I totally agree that this is reinventing the wheel, but think about the
folks who want to do this setup from scratch to better understand hadoop, or
maybe those folks who are going to do admin-related work... hence the need
to set it up from scratch...

@Alexander,

Yes, you are right, there is a one-time effort in setting up freedns, but for me
it was easy enough because I gave each host a static hostname through the user
data script and also a static ip address... Once that was done, the way I pushed
out /etc/hosts was as below: on, let's say, the master node I edited the
/etc/hosts file and put all my other nodes' info in there; next, set up SSH (as
we have to do this anyway) for the hadoop install; once SSH is set up, just
create a new file called hosts.txt, put all your hostnames in there, and run a
for loop like the one below

# push the master's /etc/hosts to every host listed in hosts.txt
for host in `cat hosts.txt`; do
  scp /etc/hosts root@$host:/etc/hosts
done

When I first was getting started on HDP I used the below link, which helped
me; it pushes out the /etc/hosts file and also does other stuff... check it out:

http://sacharya.com/deploying-multinode-hadoop-20-cluster-using-apache-ambari/




On Fri, Mar 6, 2015 at 12:43 AM, Jonathan Aquilina jaquil...@eagleeyet.net
wrote:

  The only limitation I know of is how many nodes you can have and
 how many instances of that particular size the underlying host can support. You
 can load hive in EMR, and then any other features of the cluster are managed
 at the master node level, as you have SSH access there.

 What are the advantages of 2.6 over 2.4, for example?

 I just feel you guys are reinventing the wheel when amazon already caters
 for hadoop, granted it might not be 2.6.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

  On 2015-03-06 07:31, Alexander Pivovarov wrote:

I think EMR has its own limitations.

 e.g. I want to set up hadoop 2.6.0 with kerberos + hive-1.2.0 to test my
 hive patch.
 How can EMR help me?  It supports hadoop only up to 2.4.0 (not even 2.4.1).

 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html






 On Thu, Mar 5, 2015 at 9:51 PM, Jonathan Aquilina jaquil...@eagleeyet.net
  wrote:

  Hi guys, I know you want to keep costs down, but why go through all
 the effort to set up ec2 instances when EMR takes care of provisioning and
 setting up the ec2 instances for you? All configuration for the entire
 cluster is then done on the master node of the particular cluster, and
 setting up additional software is all done through the EMR console.
 We were doing some geospatial calculations and we loaded a 3rd party jar
 file called esri into the EMR cluster. I then had to pass a small bootstrap
 action (script) to have it distribute esri to the entire cluster.

 Why are you guys reinventing the wheel?



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

   On 2015-03-06 03:35, Alexander Pivovarov wrote:

I found the following solution to this problem

 I registered 2 subdomains  (public and local) for each computer on
 https://freedns.afraid.org/subdomain/
 e.g.
 myhadoop-nn.crabdance.com
 myhadoop-nn-local.crabdance.com

 then I added a cron job which sends http requests to update the public and
 local ip on the freedns server
 hint: the public ip is detected automatically
 the ip address for the local name can be set using the request parameter
 &address=10.x.x.x
 (don't forget to escape the &)

 as a result my nn computer has 2 DNS names with currently assigned ip
 addresses , e.g.
 myhadoop-nn.crabdance.com  54.203.181.177
 myhadoop-nn-local.crabdance.com   10.220.149.103

 in the hadoop configuration I can use the local machine names;
 to access my cluster from outside AWS I can use the public names

 Just curious if AWS provides an easier way to name EC2 computers?

 On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina 
 jaquil...@eagleeyet.net wrote:

  I don't know how you would do that, to be honest. With EMR you have
 distinct master, core and task nodes. If you need to change
 configuration you just ssh into the EMR master node.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

   On 2015-03-06 02:11, Alexander Pivovarov wrote:

 What is the easiest way to assign names to aws ec2 computers?
 I guess a computer needs a static hostname and dns name before it can be used
 in a hadoop cluster.
 On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net
 wrote:

  When I started with EMR it was a lot of testing and trial and error.
 HUE is already supported as something that can be installed from the AWS
 console. What I need to know is if you need this cluster on all the time or
 if this is going to be what amazon calls a transient cluster, meaning you fire
 it up, run the job, and tear it back down.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

  On 2015-03-06 01:10, Krish Donald wrote:

  Thanks Jonathan,

 I will try to explore EMR option also.
 Can you please let me know the configuration which you have used it?
 Can you please recommend for me also?
 I would like to 

Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread max scalf
Here is an easy way to go about assigning a static name to your ec2 instance.
When you launch an EC2 instance from the aws console and get to the point of
selecting the VPC / ip address, there is a field that says
USER DATA... put the below in with the appropriate host name (change
CHANGE_HOST_NAME_HERE to whatever you want) and that should get
you a static name.

#!/bin/bash

# goes in the EC2 "user data" field; CHANGE_HOST_NAME_HERE is a placeholder
HOSTNAME_TAG=CHANGE_HOST_NAME_HERE

# persist the hostname across reboots
cat > /etc/sysconfig/network << EOF
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=${HOSTNAME_TAG}
EOF

# map this instance's private ip to the chosen hostname and apply it
IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
echo "${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG}" >> /etc/hosts

echo ${HOSTNAME_TAG} > /proc/sys/kernel/hostname
service network restart


Also note I was able to do this on a couple of spot instances for a cheap
price; the only thing is once you shut it down or someone outbids you, you
lose that instance, but it's easy/cheap to play around with, and I have used
a couple of m3.medium for my NN/SNN and a couple of them for data nodes...

On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net
wrote:

  I don't know how you would do that, to be honest. With EMR you have
 distinct master, core and task nodes. If you need to change
 configuration you just ssh into the EMR master node.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

  On 2015-03-06 02:11, Alexander Pivovarov wrote:

 What is the easiest way to assign names to aws ec2 computers?
 I guess a computer needs a static hostname and dns name before it can be used
 in a hadoop cluster.
 On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net
 wrote:

  When I started with EMR it was a lot of testing and trial and error. HUE
 is already supported as something that can be installed from the AWS
 console. What I need to know is if you need this cluster on all the time or
 if this is going to be what amazon calls a transient cluster, meaning you fire
 it up, run the job, and tear it back down.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

  On 2015-03-06 01:10, Krish Donald wrote:

  Thanks Jonathan,

 I will try to explore EMR option also.
 Can you please let me know the configuration which you have used?
 Can you please recommend one for me also?
 I would like to set up a Hadoop cluster using cloudera manager and then
 would like to do the below things:

 setup kerberos
 setup federation
 setup monitoring
 setup hadr
 backup and recovery
 authorization using sentry
 backup and recovery of individual components
 performance tuning
 upgrade of cdh
 upgrade of CM
 Hue User Administration
 Spark
 Solr


 Thanks
 Krish


 On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina 
 jaquil...@eagleeyet.net wrote:

  Krish, EMR won't cost you much; with all the testing and data we ran
 through the test systems, as well as the large amount of data, when everything
 was read we paid about 15.00 USD. I honestly do not think that the specs
 there would be enough, as java can be pretty ram hungry.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

   On 2015-03-06 00:41, Krish Donald wrote:

  Hi,

 I am new to AWS and would like to setup Hadoop cluster using cloudera
 manager for 6-7 nodes.

 t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
 I would like to use free service as of now.

 Please advise.

 Thanks
 Krish




Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread max scalf
Unfortunately, without DNS you have to rely on /etc/hosts, so put an entry
for all your nodes (nn, snn, dn1, dn2, etc.) in the /etc/hosts file on all
nodes. I have tested that with hortonworks (using ambari) and cloudera
manager, and I am fairly sure it will work for MapR.
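
For illustration, a minimal sketch of what such a shared /etc/hosts could look
like (the hostnames and private IPs are placeholders):

# same entries pushed to every node in the cluster
10.0.0.10  nn
10.0.0.11  snn
10.0.0.21  dn1
10.0.0.22  dn2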

On Thu, Mar 5, 2015 at 8:47 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 what about DNS?
 if you have 2 computers (nn and dn), how does the nn know the dn's ip?

 The script only puts this computer's ip into /etc/hosts

 On Thu, Mar 5, 2015 at 6:39 PM, max scalf oracle.bl...@gmail.com wrote:

 Here is an easy way to go about assigning a static name to your ec2
 instance.  When you launch an EC2 instance from the aws console and get
 to the point of selecting the VPC / ip address, there is a field that says
 USER DATA... put the below in with the appropriate host name (change
 CHANGE_HOST_NAME_HERE to whatever you want) and that should get
 you a static name.

 #!/bin/bash

 HOSTNAME_TAG=CHANGE_HOST_NAME_HERE
 cat > /etc/sysconfig/network << EOF
 NETWORKING=yes
 NETWORKING_IPV6=no
 HOSTNAME=${HOSTNAME_TAG}
 EOF

 IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
 echo "${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG}" >> /etc/hosts

 echo ${HOSTNAME_TAG} > /proc/sys/kernel/hostname
 service network restart


 Also note I was able to do this on a couple of spot instances for a cheap
 price; the only thing is once you shut it down or someone outbids you, you
 lose that instance, but it's easy/cheap to play around with, and I have
 used a couple of m3.medium for my NN/SNN and a couple of them for data nodes...

 On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina 
 jaquil...@eagleeyet.net wrote:

  I don't know how you would do that, to be honest. With EMR you have
 distinct master, core and task nodes. If you need to change
 configuration you just ssh into the EMR master node.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

  On 2015-03-06 02:11, Alexander Pivovarov wrote:

 What is the easiest way to assign names to aws ec2 computers?
 I guess a computer needs a static hostname and dns name before it can be used
 in a hadoop cluster.
 On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net
 wrote:

  When I started with EMR it was a lot of testing and trial and error.
 HUE is already supported as something that can be installed from the AWS
 console. What I need to know is if you need this cluster on all the time or
 if this is going to be what amazon calls a transient cluster, meaning you fire
 it up, run the job, and tear it back down.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

  On 2015-03-06 01:10, Krish Donald wrote:

  Thanks Jonathan,

 I will try to explore EMR option also.
 Can you please let me know the configuration which you have used?
 Can you please recommend one for me also?
 I would like to set up a Hadoop cluster using cloudera manager and then
 would like to do the below things:

 setup kerberos
 setup federation
 setup monitoring
 setup hadr
 backup and recovery
 authorization using sentry
 backup and recovery of individual components
 performance tuning
 upgrade of cdh
 upgrade of CM
 Hue User Administration
 Spark
 Solr


 Thanks
 Krish


 On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina 
 jaquil...@eagleeyet.net wrote:

  Krish, EMR won't cost you much; with all the testing and data we ran
 through the test systems, as well as the large amount of data, when
 everything
 was read we paid about 15.00 USD. I honestly do not think that the specs
 there would be enough, as java can be pretty ram hungry.



 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T

   On 2015-03-06 00:41, Krish Donald wrote:

  Hi,

 I am new to AWS and would like to setup Hadoop cluster using cloudera
 manager for 6-7 nodes.

 t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
 I would like to use free service as of now.

 Please advise.

 Thanks
 Krish