RE: hadoop-hdfs-native-client Help

2021-09-10 Thread Jonathan Aquilina
Hi Paula,

I am not sure how to answer your questions, but is there a reason why you are 
using an EC2 instance instead of Amazon's EMR (Elastic MapReduce) Hadoop 
cluster? As far as I know, you can set that up to work with HDFS as well 
as with S3 buckets if you don't need a long-term cluster to stay online.

Regards,
Jonathan

From: Paula Logan 
Sent: 10 September 2021 16:13
To: user@hadoop.apache.org
Subject: hadoop-hdfs-native-client Help

Hello,

I am new to building Hadoop locally and am having some issues.  Please let me 
know if this information should be sent to a different list.


(1) Can Hadoop 3.3.1 be compiled and run with OpenJDK 11, or is OpenJDK 1.8 
needed for compilation while either 1.8 or 11 can be used to run Hadoop?


(2) I am compiling and testing Hadoop 3.3.1 on RHEL 8.4 on the command line (not 
via any IDE) inside an AWS instance.  I have encountered an issue
 with Native Test Case #35 (the other 39 native test cases succeed).

First here is my maven command:

mvn -e -X test -Pnative,parallel-tests,shelltest,yarn-ui -Dtest=allNative 
-Dparallel-tests=true -Drequire.bzip2=true -Drequire.fuse=true 
-Drequire.isal=true -Disal.prefix=/usr/local -Disal.lib=/usr/local/lib64 
-Dbundle.isal=true -Drequire.openssl=true -Dopenssl.prefix=/usr 
-Dopenssl.include=/usr/include -Dopenssl.lib=/usr/lib64 -Dbundle.openssl=true 
-Dbundle.openssl.in.bin=true -Drequire.pmdk=true -Dpmdk.lib=/usr/lib64 
-Dbundle.pmdk=true -Drequire.snappy=true -Dsnappy.prefix=/usr 
-Dsnappy.include=/usr/include -Dsnappy.lib=/usr/lib64 -Dbundle.snappy=true 
-Drequire.valgrind=true -Dhbase.profile=2.0 -Drequire.zstd=true 
-Dzstd.prefix=/usr -Dzstd.include=/usr/include -Dzstd.lib=/usr/lib64 
-Dbundle.zstd=true -Dbundle.zstd.in.bin=true -Drequire.test.libhadoop=true
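
A side note on iterating: once the native bits are built, the single failing test can 
usually be re-run with CTest directly instead of repeating the whole Maven invocation. 
A hedged sketch, assuming the CMake build tree was generated under the module's target 
directory (the exact path can differ):

$ cd hadoop-hdfs-project/hadoop-hdfs-native-client/target
$ ctest -R test_libhdfs_threaded_hdfspp_test_shim_static --output-on-failure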

This is what I get for Test Case #35:

 [exec] 35/40 Test #35: test_libhdfs_threaded_hdfspp_test_shim_static 
..***Failed   31.58 sec
 [exec] testRecursiveJvmMutex error:
 [exec] ClassNotFoundException: 
RuntimeExceptionjava.lang.NoClassDefFoundError: RuntimeException
 [exec] Caused by: java.lang.ClassNotFoundException: RuntimeException
 [exec] at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 [exec] at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
 [exec] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 [exec] at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
 [exec] 2021-09-02 22:31:09,706 INFO  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:(529)) - starting cluster: numNameNodes=1, 
numDataNodes=1
 [exec] 2021-09-02 22:31:10,134 INFO  namenode.NameNode 
(NameNode.java:format(1249)) - Formatting using clusterid: testClusterID
 [exec] 2021-09-02 22:31:10,156 INFO  namenode.FSEditLog 
(FSEditLog.java:newInstance(229)) - Edit logging is async:true
 [exec] 2021-09-02 22:31:10,182 INFO  namenode.FSNamesystem 
(FSNamesystem.java:(814)) - KeyProvider: null
 [exec] 2021-09-02 22:31:10,184 INFO  namenode.FSNamesystem 
(FSNamesystemLock.java:(141)) - fsLock is fair: true
 [exec] 2021-09-02 22:31:10,185 INFO  namenode.FSNamesystem 
(FSNamesystemLock.java:(159)) - Detailed lock hold time metrics enabled: 
false
 [exec] 2021-09-02 22:31:10,185 INFO  namenode.FSNamesystem 
(FSNamesystem.java:(847)) - fsOwner= ec2-user 
(auth:SIMPLE)
 [exec] 2021-09-02 22:31:10,185 INFO  namenode.FSNamesystem 
(FSNamesystem.java:(848)) - supergroup
 ...
   [exec] 2021-09-02 22:31:13,204 INFO  ipc.Server 
(Server.java:logException(3020)) - IPC Server handler 7 on default port 44945, 
call Call#6 Retry#-1 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 
127.0.0.1:37362: java.io.FileNotFoundException: File does not exist: 
/tlhData0001/file1
 ...
 [exec] 98% tests passed, 1 tests failed out of 40
 [exec]
 [exec] Total Test time (real) = 270.30 sec
 [exec]
 [exec] The following tests FAILED:
 [exec]  35 - test_libhdfs_threaded_hdfspp_test_shim_static (Failed)
 [exec] Errors while running CTest
[INFO] 
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main 3.3.1 ... SUCCESS [  0.707 s]
[INFO] Apache Hadoop Build Tools .. SUCCESS [  2.743 s]
[INFO] Apache Hadoop Project POM .. SUCCESS [  0.692 s]
[INFO] Apache Hadoop Annotations .. SUCCESS [  1.955 s]
[INFO] Apache Hadoop Project Dist POM . SUCCESS [  0.106 s]
[INFO] Apache Hadoop Assemblies ... SUCCESS [  0.101 s]
[INFO] Apache Hadoop Maven Plugins  SUCCESS [  3.194 s]
[INFO] Apache Hadoop MiniKDC .. SUCCESS [  0.806 s]
[INFO] Apache Hadoop Auth . SUCCESS [  4.192 s]
[INFO] Apache Hadoop Auth Examples  SUCCESS [  0.452 s]

RE: Applications always showing in pending state even after cluster restart

2020-06-13 Thread Jonathan Aquilina
What you are describing has a fairly easy fix.

In the Azure network security group, lock down those public IP addresses so they are 
only accessible from your IP address, or from the IP addresses that are meant to have 
access to them.

Regards,
Jonathan Aquilina
EagleEyeT

Phone: +356 2033 0099
Mobile: +356 7995 7942
Email: sa...@eagleeyet.net<mailto:sa...@eagleeyet.net>
Website: https://eagleeyet.net

From: Gaurav Chhabra 
Sent: 13 June 2020 11:45
To: Hariharan 
Cc: common-u...@hadoop.apache.org 
Subject: Re: Applications always showing in pending state even after cluster 
restart

Wow! What a guess, Hari! :) I wasn't sure those pending tasks could have been 
related to an attack. This happened to me from the 1st to the 5th of June '20. I didn't 
check my Azure usage during that time, though I was keeping tabs almost every day 
in May. On 8th June (Mon), when I checked the charges, the Azure 'data transfer 
out' charges were showing $88, $90 & $110 for bigdataserver-{5,6,7} 
respectively. I was shocked, as my last month's charge was around $53. I opened a 
ticket with Azure and then we started the cluster again (with an Azure networking 
guy along with me), and within 3-4 minutes data transfer out was again around 
10-12 GB in total (from the 3 instances). We could only figure out that the hits 
were going to some blob storage in Azure. He said it most likely was a 
virus or some kind of attack.

I have now removed public IPs from all instances except two (one 
where Cloudera Manager is hosted and another where the Resource Manager is 
running). Even those two exposed ones only allow incoming requests 
from my laptop's IP. Things are fine now.

One thing that I don't get is how the attacker 'personally' benefits from 
this, except for obviously raising my monthly bill?


Regards



On Sat, 13 Jun 2020 at 11:00, Hariharan <hariharan...@gmail.com> wrote:
This is most likely an attempt to attack your system. If you are running your 
cluster in the cloud, you should run it in a private network so it is not 
exposed to the Internet. Alternatively you can secure your installation as 
described here - 
https://blog.cloudera.com/how-to-secure-internet-exposed-apache-hadoop/

Thanks,
Hari

On Fri, 12 Jun 2020, 12:20 Gaurav Chhabra <varuag.chha...@gmail.com> wrote:
Hi All,


I have started learning Hadoop and its related components. I am following a 
tutorial on Hadoop Administration on Udemy. As part of the learning process, I 
ran the following command:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter \
-Ddfs.replication=1 /user/bigdata/randomtextwriter

The above command created 30 files, each of size 1 GB. Then I ran the wordcount 
job below:

$ yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount \
-Dmapreduce.input.fileinputformat.split.minsize=268435456 \
-Dmapreduce.job.reduces=8 \
/user/bigdata/randomtext \
/user/bigdata/wordcount

After executing the above command, I just thought of killing the application 
after some time, so I first ran 'yarn application -list', which listed a lot of 
applications, out of which one was wordcount. I killed that particular 
application using 'yarn application -kill <application-id>'. However, when I 
checked the scheduler, I could see that several applications were still showing 
in the Pending state, so I ran the following command:

$ for x in $(yarn application -list -appStates ACCEPTED | awk 'NR > 2 { print 
$1 }'); do yarn application -kill $x; done
It was killing the applications, as I could see the 'Apps Completed' count 
going up, but as soon as all the apps got killed, I saw those applications 
getting created again. Even if I stop the whole cluster and start it again, the 
scheduler shows that there are submitted applications in the Pending state.
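
As a hedged aside, the loop above only matches applications in the ACCEPTED state; 
the same idea with a wider state filter also sweeps up submitted and running ones:

$ for x in $(yarn application -list -appStates SUBMITTED,ACCEPTED,RUNNING | awk 'NR > 2 { print $1 }'); do
    yarn application -kill $x
  done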


Here's the content of fair-scheduler.xml (the XML tags were stripped by the mail 
archive; all that survives is that two queues set their schedulingPolicy to drf):

This is just a test cluster. I just want to kill the applications / clear the 
application queue. Any help will really be appreciated, as I have been struggling 
with this for the last few days.
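
A hedged guess about the "reappearing after restart" part: if ResourceManager recovery 
is enabled, submitted applications are restored from the RM state store on startup. 
Assuming this is a throwaway test cluster where losing that state is acceptable, the 
store can be wiped while the ResourceManager is stopped:

$ yarn resourcemanager -format-state-store
# start the ResourceManager again afterwards; previously pending applications
# should no longer be restored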


Regards


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org


Re: Warning from user@hadoop.apache.org

2016-09-03 Thread Jonathan Aquilina
I think they are testing some new mechanisms for list moderation.

On 2016-09-04 03:41, Ted Yu wrote:

> There is nothing to worry about on your side.  
> 
> I received such email too.  
> 
> On Sep 3, 2016, at 5:57 PM, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:
> 
>> Can someone tell me if the below is something to worry about? I hardly 
>> post to the list, and I know that when I have posted to the list my emails have 
>> not bounced.
>> 
>>  Original Message  
>> 
>> SUBJECT:
>> Warning from user@hadoop.apache.org
>> 
>> DATE:
>> 2016-09-04 02:36
>> 
>> FROM:
>> user-h...@hadoop.apache.org
>> 
>> TO:
>> jaquil...@eagleeyet.net
>> 
>> Hi! This is the ezmlm program. I'm managing the
>> user@hadoop.apache.org mailing list.
>> 
>> Messages to you from the user mailing list seem to
>> have been bouncing. I've attached a copy of the first bounce
>> message I received.
>> 
>> If this message bounces too, I will send you a probe. If the probe bounces,
>> I will remove your address from the user mailing list,
>> without further notice.
>> 
>> I've kept a list of which messages from the user mailing list have 
>> bounced from your address.
>> 
>> Copies of these messages may be in the archive.
>> To retrieve a set of messages 123-145 (a maximum of 100 per request),
>> send a short message to:
>> <user-get.123_...@hadoop.apache.org>
>> 
>> To receive a subject and author list for the last 100 or so messages,
>> send a short message to:
>> <user-in...@hadoop.apache.org>
>> 
>> Here are the message numbers:
>> 
>> 23009
>> 
>> --- Enclosed is a copy of the bounce message I received.
>> 
>> Return-Path: <>
>> Received: (qmail 55605 invoked for bounce); 24 Aug 2016 13:51:02 -
>> Date: 24 Aug 2016 13:51:02 -
>> From: mailer-dae...@apache.org
>> To: user-return-230...@hadoop.apache.org
>> Subject: failure notice

Fwd: Warning from user@hadoop.apache.org

2016-09-03 Thread Jonathan Aquilina
Can someone tell me if the below is something to worry about? I hardly
post to the list, and I know that when I have posted to the list my emails
have not bounced.

 Original Message  

SUBJECT:
Warning from user@hadoop.apache.org

DATE:
2016-09-04 02:36

FROM:
user-h...@hadoop.apache.org

TO:
jaquil...@eagleeyet.net

Hi! This is the ezmlm program. I'm managing the
user@hadoop.apache.org mailing list.

Messages to you from the user mailing list seem to
have been bouncing. I've attached a copy of the first bounce
message I received.

If this message bounces too, I will send you a probe. If the probe
bounces,
I will remove your address from the user mailing list,
without further notice.

I've kept a list of which messages from the user mailing list have 
bounced from your address.

Copies of these messages may be in the archive.
To retrieve a set of messages 123-145 (a maximum of 100 per request),
send a short message to:
   <user-get.123_...@hadoop.apache.org>

To receive a subject and author list for the last 100 or so messages,
send a short message to:
   <user-in...@hadoop.apache.org>

Here are the message numbers:

   23009

--- Enclosed is a copy of the bounce message I received.

Return-Path: <>
Received: (qmail 55605 invoked for bounce); 24 Aug 2016 13:51:02 -
Date: 24 Aug 2016 13:51:02 -
From: mailer-dae...@apache.org
To: user-return-230...@hadoop.apache.org
Subject: failure notice

Re: EC2 Hadoop Cluster VS Amazon EMR

2016-03-11 Thread Jonathan Aquilina
When I was testing EMR I had only spent around 17 USD, and that was with a
decent-sized EMR cluster.

On 2016-03-11 12:31, José Luis Larroque wrote:

> Hi Jonathan!  
> I was trying to decide which of those options to use a while ago. For now I'm using 
> Amazon EMR, because it's easier: you have some stuff configured already. 
> 
> But a few benefits could be that, with EC2, you can use the free tier and 
> save some money while you are testing your stuff. And EC2 is probably 
> cheaper to use than EMR, but I'm not 100% sure of this. 
> 
> Bye! 
> Jose 
> 
> 2016-03-07 6:17 GMT-03:00 Jonathan Aquilina <jaquil...@eagleeyet.net>:
> 
>> Good Morning, 
>> 
>> Just some food for thought: of late I'm noticing people using EC2 to set up 
>> their own Hadoop clusters. What is the advantage of using EC2 over Amazon's 
>> EMR Hadoop cluster? 
>> 
>> Regards, 
>> 
>> Jonathan
 

Re: fs.s3a.endpoint not working

2016-01-14 Thread Jonathan Aquilina
I'm not totally following this thread from the beginning, but I might be
able to help, as I have some experience with Amazon EMR (Elastic MapReduce)
when working with custom jar files and S3.

Are you using EMR or something internal and offloading storage to S3?

---
Regards,
Jonathan Aquilina
Founder 

On 2016-01-13 23:21, Phillips, Caleb wrote:

> Hi Billy (and others),
> 
> One of the threads suggested using the core-site.xml. Did you try putting 
> your configuration in there?
> 
> Yes, I did try that. I've also tried setting it dynamically in e.g., spark. I 
> can verify that it is getting the configuration correctly:
> 
> hadoop org.apache.hadoop.conf.Configuration
> 
> Still it never connects to our internal S3-compatable store and always 
> connects to AWS.
> 
> One thing I've noticed is that the AWS stuff is handled by an underlying 
> library (I think jets3t in < 2.6 versions, forget what in 2.6+) and when I 
> was trying to mess with stuff and spelunking through the hadoop code, I kept 
> running into blocks with that library.
> 
> I started digging into the code. I found that the custom endpoint was 
> introduced with this patch:
> 
> https://issues.apache.org/jira/browse/HADOOP-11261
> 
> It seems it was integrated in 2.7.0, so just to be sure I downloaded 2.7.1, 
> but the problem persists.
> 
> That code calls this function in the AWS Java SDK:
> 
> http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Client.html#setEndpoint(java.lang.String)
> 
> However, no matter what configuration I use, it still seems to connect to 
> Amazon AWS. Is it possible that the AWS Java SDK cannot work with 
> S3-compatible (non-AWS) stores? If so, it would seem there is currently no way 
> to connect Hadoop to an S3-compatible non-AWS store.
> 
> If anyone else has any insight, particularly success using Hadoop with a 
> non-AWS, S3-compatible store, please chime in!
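
For reference, a minimal smoke test along these lines, runnable on 2.7.0+ where 
HADOOP-11261 is present; the endpoint URL, bucket name and credentials below are 
placeholders:

$ hadoop fs \
    -Dfs.s3a.endpoint=https://objectstore.example.internal \
    -Dfs.s3a.access.key=MYKEY \
    -Dfs.s3a.secret.key=MYSECRET \
    -ls s3a://mybucket/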
> 
> William Watson
> Software Engineer
> (904) 705-7056 PCS
> 
> On Mon, Jan 11, 2016 at 10:39 AM, Phillips, Caleb 
> <caleb.phill...@nrel.gov> wrote:
> Hi All,
> 
> Just wanted to send this out again since there was no response
> (admittedly, originally sent in the midst of the US holiday season) and it
> seems to be an issue that continues to come up (see e.g., the email from
> Han Ju on Jan 5).
> 
> If anyone has successfully connected Hadoop to a non-AWS S3-compatible
> object store, it'd be very helpful to hear how you made it work. The
> fs.s3a.endpoint configuration directive appears non-functional at our site
> (with Hadoop 2.6.3).
> 
> --
> Caleb Phillips, Ph.D.
> Data Scientist | Computational Science Center
> 
> National Renewable Energy Laboratory (NREL)
> 15013 Denver West Parkway | Golden, CO 80401
> 303-275-4297 | caleb.phill...@nrel.gov
> 
> On 12/22/15, 1:39 PM, "Phillips, Caleb" <caleb.phill...@nrel.gov> wrote:
> 
>> Hi All,
>> 
>> New to this list. Looking for a bit of help:
>> 
>> I'm having trouble connecting Hadoop to an S3-compatible (non-AWS) object
>> store.
>> 
>> This issue was discussed, but left unresolved, in this thread:
>> 
>> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+0W_au5es_flugzmgwkkga3jya1asi3u+isjcuymfntvnk...@mail.gmail.com%3E
>> 
>> And here, on Cloudera's forums (the second post is mine):
>> 
>> https://community.cloudera.com/t5/Data-Ingestion-Integration/fs-s3a-endpoi
>> nt-ignored-in-hdfs-site-xml/m-p/33694#M1180
>> 
>> I'm running Hadoop 2.6.3 with Java 1.8 (65) on a Linux host. Using
>> Hadoop, I'm able to connect to S3 on AWS, and e.g., list/put/get files.
>> 
>> However, when I point the fs.s3a.endpoint configuration directive at my
>> non-AWS S3-compatible object store, it appears to still point at (and
>> authenticate against) AWS.
>> 
>> I've checked and double-checked my credentials and configuration using
>> both Python's boto library and the s3cmd tool, both of which connect to
>> this non-AWS data store just fine.
>> 
>> Any help would be much appreciated. Thanks!
>> 
>> --
>> Caleb Phillips, Ph.D.
>> Data Scientist | Computational Science Center
>> 
>> National Renewable Energy Laboratory (NREL)
>> 15013 Denver West Parkway | Golden, CO 80401
>> 303-275-4297 | caleb.phill...@nrel.gov
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
>> For additional commands, e-mail: user-h...@hadoop.apache.org
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
 

Re: Use of hadoop in AWS - Build it from scratch on a EC2 instance / MapR hadoop distribution / Amazon hadoop distribution

2015-10-19 Thread Jonathan Aquilina
Hey Jose

Have you looked at Amazon EMR (Elastic MapReduce)? Where I work we have used 
it, and when you provision the EMR cluster you can use custom jars like the one 
you mentioned. 

In terms of storage, you can use HDFS if you are going to keep a 
persistent cluster. If not, you can store your data in an Amazon S3 bucket. 

Documentation for EMR is really good. At the time we did this, which was at 
the beginning of this year, they supported Hadoop 2.6. 

In my honest opinion you are giving yourself a lot of extra work for nothing just to 
get going with Hadoop. Try out EMR with a temporary cluster and go from there. I 
managed to tool up and learn how to work with EMR in a week.

Sent from my iPhone

> On 19 Oct 2015, at 02:10, José Luis Larroque  wrote:
> 
> Thanks for your answer Anders.
> 
> - The amount of data that I'm going to manipulate is about the size of Wikipedia (I 
> will use a dump).
> - I already have the basics of Hadoop (I hope); I have a local multi-node 
> cluster set up and I have already executed some algorithms.
> - Because the amount of data is significant, I believe that I should use 
> several nodes.
> 
> Maybe another thing to consider is that I'm running Giraph on top 
> of the selected Hadoop distribution/EC2.
> 
> Bye!
> Jose
> 
> 2015-10-18 18:53 GMT-03:00 Anders Nielsen :
>> Dear Jose, 
>> 
>> It will help people answer your question if you specify your goals :
>> 
>> - If you do it to learn how to USE a running Hadoop cluster, then go for one of the 
>> prebuilt distributions (Amazon or MapR).
>> - If you do it to learn more about setting up and administering Hadoop, 
>> then you are better off setting everything up from scratch on EC2.
>> - Do you need to run on many nodes, or just 1 node to test some MapReduce 
>> scripts on a small data set?
>> 
>> Regards, 
>> 
>> Anders
>> 
>> 
>> 
>> 
>>> On Sun, Oct 18, 2015 at 10:03 PM, José Luis Larroque 
>>>  wrote:
>>> Hi all !
>>> 
>>> I started to use Hadoop with AWS, and a big question appeared in front of me!
>>> 
>>> I'm using a MapR distribution, for Hadoop 2.4.0, in AWS. I have already tried 
>>> some trivial examples, and before moving forward I have one question.
>>> 
>>> What is the best option for using Hadoop on AWS?
>>> - Build it from scratch on an EC2 instance 
>>> - Use the MapR distribution of Hadoop
>>> - Use the Amazon distribution of Hadoop
>>> 
>>> Sorry if my question is too broad.
>>> 
>>> Bye!
>>> Jose
> 


Re: IMPORTANT: "HOW TO" UNSUBSCRIBE

2015-09-29 Thread Jonathan Aquilina
 

Because none of the key Hadoop contributors have a
signature that contains the unsubscribe email, which technically could
classify these emails as spam. If one looks at other mailing lists, be it in a
key member's signature or otherwise, they always explain how you can
unsubscribe.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-09-29 10:54, Daniel Jankovic wrote: 

> well ... why, when they can always send UNSUBSCRIBE to the whole group :) 
> 
> On Tue, Sep 22, 2015 at 5:31 PM, Namikaze Minato <lloydsen...@gmail.com> 
> wrote:
> 
>> Step 1:
>> Send an e-mail to user-unsubscr...@hadoop.apache.org
>> 
>> Done.
 

Re: Comparing CheckSum of Local and HDFS File

2015-08-16 Thread Jonathan Aquilina
 

Correct me if I am wrong, but the command you ran on the cluster seems to
be doing a CRC check as well. I am still a novice with Hadoop, but that is
the most obvious thing I see in the output below.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-08-07 12:34, Shashi Vishwakarma wrote: 

 Hi 
 
 I have a small confusion regarding checksum verification. Let's say I have a 
 file abc.txt and I transferred this file to HDFS. How do I ensure data 
 integrity? 
 
 I followed the steps below to check that the file was transferred correctly. 
 
 ON LOCAL FILE SYSTEM: 
 
 md5sum abc.txt 
 
 276fb620d097728ba1983928935d6121 TestFile 
 
 ON HADOOP CLUSTER : 
 
 hadoop fs -checksum /abc.txt 
 
 /abc.txt MD5-of-0MD5-of-512CRC32C 
 0200911156a9cf0d906c56db7c8141320df0 
 
 Both outputs look different to me. Let me know if I am doing anything wrong. 
 
 How do I verify if my file is transferred properly into HDFS? 
 
 Thanks 
 Shashi
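
A minimal way to compare the two ends directly, assuming the file is small enough to 
stream through a pipe, is to hash the HDFS copy with the same tool used locally, since 
`hadoop fs -checksum` reports a composite MD5-of-CRC value rather than a plain MD5 of 
the file contents:

$ md5sum abc.txt                      # digest of the local file
$ hadoop fs -cat /abc.txt | md5sum    # digest of the bytes stored in HDFS
# the two MD5 sums should match if the transfer was clean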
 

Re: ssh localhost returns connection closed by ::1 under Cygwin installation on windows 8

2015-07-27 Thread Jonathan Aquilina
 

Hi, 

Is there a reason why you are using IPv6? 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-07-23 23:35, Yepeng Sun wrote: 

 Hi, 
 
 I tried to install Hadoop on Windows 8 to form a multi-node cluster. So first 
 I had to install Cygwin in order to make SSH work. I installed Cygwin 
 with the sshd service successfully, and I also generated passwordless keys. But 
 when I run ssh localhost, it returns "connection closed by ::1". Does 
 anyone have the same experience? How did you solve it? 
 
 Thanks! 
 
 Yepeng
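
A quick way to rule IPv6 in or out, assuming a stock OpenSSH client under Cygwin:

$ ssh -4 localhost    # force IPv4 (127.0.0.1) instead of ::1
$ ssh -v localhost    # verbose output shows which address and which step fails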
 

Re: Unable to Find S3N Filesystem Hadoop 2.6

2015-04-20 Thread Jonathan Aquilina
 

One thing I think I most likely missed completely: are you using
an Amazon EMR cluster or something in-house?

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-04-20 16:21, Billy Watson wrote: 

 I appreciate the response. These JAR files aren't 3rd party. They're included 
 with the Hadoop distribution, but in Hadoop 2.6 they stopped being loaded by 
 default and now they have to be loaded manually, if needed. 
 
 Essentially the problem boils down to: 
 
 - need to access s3n URLs 
 - cannot access without including the tools directory 
 - after including tools directory in HADOOP_CLASSPATH, failures start 
 happening later in job 
 - need to find the right env variable (or shell script or w/e) to include jets3t 
  & other JARs needed to access s3n URLs (I think) 
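
One hedged way to wire this up on a 2.6 install; the install path and jar versions 
below are assumptions to adjust to the local layout, and -libjars only takes effect if 
the driver goes through ToolRunner:

# make the s3n-related jars visible to the client side
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/lib/hadoop/share/hadoop/tools/lib/*"
# and ship them with the job so map/reduce tasks can load them as well
hadoop jar myjob.jar MyJob \
  -libjars /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-aws-2.6.0.jar,/usr/lib/hadoop/share/hadoop/tools/lib/jets3t-0.9.0.jar \
  input/ output/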
 
 William Watson
 Software Engineer 
 (904) 705-7056 PCS 
 
 On Mon, Apr 20, 2015 at 9:58 AM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 You mention an environment variable. In the step before you specify the steps 
 to run to get to the result, you can specify a bash script that will allow 
 you to put any 3rd-party jar files (for us we used esri) on the cluster and 
 propagate them to all nodes in the cluster as well. You can ping me off-list 
 if you need further help. Thing is, I haven't used Pig, but my boss and coworker 
 wrote the mappers and reducers. Getting these jars to the entire cluster was a 
 super small and simple bash script. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-04-20 15:17, Billy Watson wrote: 
 
 Hi,
 
 I am able to run a `hadoop fs -ls s3n://my-s3-bucket` from the command line 
 without issue. I have set some options in hadoop-env.sh to make sure all the 
 S3 stuff for Hadoop 2.6 is set up correctly. (This was very confusing, BTW, 
 and there is not enough searchable documentation on changes to the S3 stuff in 
 Hadoop 2.6, IMHO.)
 
 Anyways, when I run a Pig job which accesses S3, it gets to 16%, does not 
 fail in Pig, but rather fails in MapReduce with "Error: java.io.IOException: 
 No FileSystem for scheme: s3n". 
 
 I have added [hadoop-install-loc]/lib and 
 [hadoop-install-loc]/share/hadoop/tools/lib/ to the HADOOP_CLASSPATH env 
 variable in hadoop-env.sh.erb. When I do not do this, the Pig job will fail 
 at 0% (before it ever gets to MapReduce) with a very similar "No FileSystem 
 for scheme: s3n" error.
 
 I feel like at this point I just have to add the share/hadoop/tools/lib 
 directory (and maybe lib) to the right environment variable, but I can't 
 figure out which environment variable that should be.
 
 I appreciate any help, thanks!!
 
 Stack trace:
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at 
 org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at 
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at 
 org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at 
 org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:498)
  at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:467)
  at 
 org.apache.pig.piggybank.storage.CSVExcelStorage.setLocation(CSVExcelStorage.java:609)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.mergeSplitSpecificConf(PigInputFormat.java:129)
  at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.createRecordReader(PigInputFormat.java:103)
 at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.init(MapTask.java:512)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 
 -- Billy Watson
 
 -- 
 
 William Watson
 Software Engineer 
 (904) 705-7056 PCS
 



Re: Unable to Find S3N Filesystem Hadoop 2.6

2015-04-20 Thread Jonathan Aquilina
Sadly I'll have to pull back; I have only run a Hadoop MapReduce cluster with 
Amazon EMR.

Sent from my iPhone

 On 20 Apr 2015, at 16:53, Billy Watson williamrwat...@gmail.com wrote:
 
 This is an install on a CentOS 6 virtual machine used in our test 
 environment. We use HDP in staging and production and we discovered these 
 issues while trying to build a new cluster using HDP 2.2 which upgrades from 
 Hadoop 2.4 to Hadoop 2.6. 
 
 William Watson
 Software Engineer
 (904) 705-7056 PCS
 
 On Mon, Apr 20, 2015 at 10:26 AM, Jonathan Aquilina 
 jaquil...@eagleeyet.net wrote:
 One thing I think which i most likely missed completely is are you using an 
 amazon EMR cluster or something in house?
 
  
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 On 2015-04-20 16:21, Billy Watson wrote:
 
 I appreciate the response. These JAR files aren't 3rd party. They're 
 included with the Hadoop distribution, but in Hadoop 2.6 they stopped being 
 loaded by default and now they have to be loaded manually, if needed. 
  
 Essentially the problem boils down to:
  
 - need to access s3n URLs
 - cannot access without including the tools directory
 - after including tools directory in HADOOP_CLASSPATH, failures start 
 happening later in job
 - need to find right env variable (or shell script or w/e) to include 
 jets3t  other JARs needed to access s3n URLs (I think)
  
  
 
 William Watson
 Software Engineer
 (904) 705-7056 PCS
 
 On Mon, Apr 20, 2015 at 9:58 AM, Jonathan Aquilina 
 jaquil...@eagleeyet.net wrote:
 you mention an environmental variable. the step before you specify the 
 steps to run to get to the result. you can specify a bash script that will 
 allow you to put any 3rd party jar files, for us we used esri, on the 
 cluster and propagate them to all nodes in the cluster as well. You can 
 ping me off list if you need further help. Thing is I havent used pig but 
 my boss and coworker wrote the mappers and reducers. to get these jars to 
 the entire cluster was a super small and simple bash script.
 
  
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 On 2015-04-20 15:17, Billy Watson wrote:
 
 Hi,
 
 I am able to run a `hadoop fs -ls s3n://my-s3-bucket` from the command 
 line without issue. I have set some options in hadoop-env.sh to make sure 
 all the S3 stuff for hadoop 2.6 is set up correctly. (This was very 
 confusing, BTW and not enough searchable documentation on changes to the 
 s3 stuff in hadoop 2.6 IMHO).
 
 Anyways, when I run a pig job which accesses s3, it gets to 16%, does not 
 fail in pig, but rather fails in mapreduce with Error: 
 java.io.IOException: No FileSystem for scheme: s3n. 
 
 I have added [hadoop-install-loc]/lib and 
 [hadoop-install-loc]/share/hadoop/tools/lib/ to the HADOOP_CLASSPATH env 
 variable in hadoop-env.sh.erb. When I do not do this, the pig job will 
 fail at 0% (before it ever gets to mapreduce) with a very similar No 
 fileystem for scheme s3n error.
 
 I feel like at this point I just have to add the share/hadoop/tools/lib 
 directory (and maybe lib) to the right environment variable, but I can't 
 figure out which environment variable that should be.
 
 I appreciate any help, thanks!!
 
 
 Stack trace:
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) 
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) 
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at 
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at 
 org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at 
 org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:498)
  at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:467)
  at 
 org.apache.pig.piggybank.storage.CSVExcelStorage.setLocation(CSVExcelStorage.java:609)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.mergeSplitSpecificConf(PigInputFormat.java:129)
  at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.createRecordReader(PigInputFormat.java:103)
  at 
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.init(MapTask.java:512)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755) at 
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
 org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at 
 java.security.AccessController.doPrivileged(Native Method) at 
 javax.security.auth.Subject.doAs(Subject.java:415) at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 
 
 — Billy Watson
 
 --
 
 William Watson
 Software Engineer
 (904) 705-7056 PCS
 


Re: Unsubscribe

2015-04-14 Thread Jonathan Aquilina
 

All emails coming through this list, I feel, should have a signature with
an unsubscribe link. Who do I need to contact to be able to get that
done?

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-04-14 18:12, Preya Shah wrote: 

 unsubscribe
 

Re: Can we run mapreduce job from eclipse IDE on fully distributed mode hadoop cluster?

2015-04-11 Thread Jonathan Aquilina
 

I could be wrong here, but the way I understand things, I do not think
it is even possible to run the JAR file from your PC. There are two
things that you need to consider:

1) How is the JAR file going to connect to the cluster?

2) How is the JAR file going to be distributed to the cluster?

Again, I could be wrong here in my response, so anyone else on the list
feel free to correct me. I am still a novice with Hadoop and have only
worked with it on Amazon EMR.

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-04-11 08:23, Answer Agrawal wrote: 

 A MapReduce job can be run as a jar file from the terminal or directly from the 
 Eclipse IDE. When a job runs as a jar file from the terminal, it uses multiple JVMs 
 and all the resources of the cluster. Does the same thing happen when we run from the 
 IDE? I have run a job both ways, and it takes less time from the IDE than as a jar 
 file from the terminal. 
 
 Thanks
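
A hedged way to see where such a job actually ran, assuming the cluster is reachable 
from the same shell: if the IDE-launched run never shows up in the ResourceManager, it 
executed in-process via the LocalJobRunner, which would explain the shorter runtime on 
small inputs.

$ yarn application -list -appStates RUNNING,FINISHED
# jobs submitted to the cluster appear here; a run that used the default
# local configuration from the IDE will not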
 

getting amazon emr to access a single file when reducing

2015-03-10 Thread Jonathan Aquilina
 

Hi guys, 

I need to run a job where the data, when being reduced, needs to
access another file for data that is needed; in my case way points. The
way points themselves do not need to be processed.

On Amazon EMR it's proving very tricky. How would one do this in
the simplest way possible?
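
One hedged approach is to ship the waypoint file to every task through the distributed 
cache with the generic -files option; the jar, class and S3 paths below are placeholders 
(an HDFS path works the same way), and the driver is assumed to use ToolRunner:

$ hadoop jar myjob.jar MyJob \
    -files s3://my-bucket/reference/waypoints.csv \
    s3://my-bucket/input/ s3://my-bucket/output/
# each mapper/reducer can then open "waypoints.csv" from its working directory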

-- 
Regards,
Jonathan Aquilina
Founder Eagle Eye T
 

Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-07 Thread Jonathan Aquilina
 

When I was testing I was using the default setup: 1 master node, 2 core nodes
and no task nodes. I would spin up the cluster, then terminate it. The term for
that is a transient cluster. 

When the big data needed to be crunched I changed the setup a bit.
An important note: there is a limitation of 20 nodes, be they core or task,
with EMR; a request can be submitted to lift that limitation. 

When actually live I had 1 master node, 3 core nodes (which have HDFS
storage) and 10 task nodes. All instances used were of size m3.large.
Ran another batch of data for 2013 through EMR with this setup in 31 min,
just to run the data; that isn't including cluster spin-up time. 

One thing to note: you do not need to use HDFS storage, as that can and
will drive up the cost quickly, and there is a chance of data
corruption or even data loss if a core node crashes. I have been using
Amazon S3 and pulling the data from there. The biggest advantage is that
you can spawn up multiple clusters and share the same data to be
processed that way. Using HDFS has its perks too, but costs can
drastically increase as well. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-07 09:54, tesm...@gmail.com wrote: 

 Dear Jonathan,
 
 Would you please describe the process of running EMR-based Hadoop for $15.00? 
 I tried, and my costs were rocketing, like $60 for one hour.
 
 Regards
 
 On 05/03/2015 23:57, Jonathan Aquilina wrote: 
 
 krish EMR wont cost you much with all the testing and data we ran through the 
 test systems as well as the large amont of data when everythign was read we 
 paid about 15.00 USD. I honestly do not think that the specs there would be 
 enough as java can be pretty ram hungry. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 00:41, Krish Donald wrote: 
 
 Hi, 
 
 I am new to AWS and would like to setup Hadoop cluster using cloudera manager 
 for 6-7 nodes. 
 
 t2.micro on AWS; Is it enough for setting up Hadoop cluster ? 
 I would like to use free service as of now. 
 
 Please advise. 
 
 Thanks 
 Krish
 

Re: AWS Setting for setting up Hadoop cluster

2015-03-05 Thread Jonathan Aquilina
 

I have experience using EMR at my full-time job; the thing is quick
and cheap. The interesting part is wrapping your head around the
concepts. If you need things quickly, EMR is the way to go. It
spins up a number of EC2 instances. 

By default you have 1 master and 2 core nodes. The three of them are
m3.large nodes, which run you 7 cents per hour. To run one year's worth of
data, which is about 1.1 billion records from the database, it took 50 min
from cluster spawn-up to completion and shutting down of the cluster. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-05 23:41, Dieter De Witte wrote: 

 You can install Hadoop on Amazon EC2 instances and use the free tier for new 
 members but you can also use Amazon EMR which is not free but is up and 
 running in a couple of seconds... 
 
 2015-03-05 23:28 GMT+01:00 Krish Donald gotomyp...@gmail.com:
 
 Hi, 
 
 I am tired of setting up a Hadoop cluster on my laptop, which has 8 GB RAM. 
 I tried 2 GB for the namenode and 1 GB each for 3 datanodes, so in total I was 
 using 5 GB. 
 And I was using very basic Hadoop services only. 
 But it is so slow that I am not able to do anything on it. 
 
 Hence I would like to try the AWS service now. 
 
 Can anybody please help me, which configuration I should use it without 
 paying at all? 
 What are the tips you have for AWS ? 
 
 Thanks 
 Krish
 

Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread Jonathan Aquilina
 

Krish, EMR won't cost you much; with all the testing and data we ran
through the test systems, as well as the large amount of data when
everything was ready, we paid about 15.00 USD. I honestly do not think
that the specs there would be enough, as Java can be pretty RAM hungry. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 00:41, Krish Donald wrote: 

 Hi, 
 
 I am new to AWS and would like to setup Hadoop cluster using cloudera manager 
 for 6-7 nodes. 
 
 t2.micro on AWS; Is it enough for setting up Hadoop cluster ? 
 I would like to use free service as of now. 
 
 Please advise. 
 
 Thanks 
 Krish
 

Re: AWS Setting for setting up Hadoop cluster

2015-03-05 Thread Jonathan Aquilina
 

The advantage of EMR is that you don't have to spend time screwing around with
installing Hadoop; it does all that for you, so you are ready to go. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-05 23:51, Krish Donald wrote: 

 Because I am new to AWS, I would like to explore the free service first and 
 then later I can use EMR. 
 Which one is fast in EC2 and free too? 
 
 Thanks 
 
 On Thu, Mar 5, 2015 at 2:47 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 I have experience with my full time job using EMR damn thing is quick and 
 cheap. The interesting part is wrapping your head around the concepts. If you 
 need things quickly and fast EMR is the way to go. It spawns up a number of 
 ec2 instances 
 
 by default you have 1 master and 2 core nodes. The three of them are m3.large 
 nodes which run you 7 cents per hour. to run one years with of data which is 
 about 1.1 billion records from the database it took 50 min from cluster spawn 
 up to completion and shutting down of the cluster. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-05 23:41, Dieter De Witte wrote: 
 You can install Hadoop on Amazon EC2 instances and use the free tier for new 
 members but you can also use Amazon EMR which is not free but is up and 
 running in a couple of seconds... 
 
 2015-03-05 23:28 GMT+01:00 Krish Donald gotomyp...@gmail.com:
 
 Hi, 
 
 I am tired of setting Hadoop cluster using my laptop which has 8GB RAM. 
 I tried 2gb for namenode and 1-1 gb for 3 datanoded so total 5gb I was using 
 . 
 And I was using very basic Hadoop services only. 
 But it is so slow that I am not able to do anything on that. 
 
 Hence I would like to try the AWS service now. 
 
 Can anybody please help me, which configuration I should use it without 
 paying at all? 
 What are the tips you have for AWS ? 
 
 Thanks 
 Krish
 

Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread Jonathan Aquilina
 

The only limitation I know of is how many nodes you can have and
how many instances of that particular size the host can support.
You can load Hive in EMR, and then any other features of the cluster are
managed at the master node level, as you have SSH access there. 

What are the advantages of 2.6 over 2.4, for example? 

I just feel you guys are reinventing the wheel when Amazon already
caters for Hadoop, granted it might not be 2.6. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 07:31, Alexander Pivovarov wrote: 

 I think EMR has its own limitations,
 
 e.g. I want to set up Hadoop 2.6.0 with Kerberos + Hive 1.2.0 to test my Hive 
 patch. How can EMR help me? It supports Hadoop only up to 2.4.0 (not even 2.4.1):
 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html
 
 On Thu, Mar 5, 2015 at 9:51 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 Hi guys I know you guys want to keep costs down, but why go through all the 
 effort to setup ec2 instances when you deploy EMR it takes the time to 
 provision and setup the ec2 instances for you. All configuration then for the 
 entire cluster is done on the master node of the particular cluster or 
 setting up of additional software that is all done through the EMR console. 
 We were doing some geospatial calculations and we loaded a 3rd party jar file 
 called esri into the EMR cluster. I then had to pass a small bootstrap action 
 (script) to have it distribute esri to the entire cluster. 
 
 Why are you guys reinventing the wheel? 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 03:35, Alexander Pivovarov wrote: 
 
 I found the following solution to this problem
 
 I registered 2 subdomains (public and local) for each computer on 
 https://freedns.afraid.org/subdomain/ [2] 
 e.g. 
 myhadoop-nn.crabdance.com [3]
 myhadoop-nn-local.crabdance.com [4] 
 then I added cron job which sends http requests to update public and local ip 
 on freedns server hint: public ip is detected automatically ip address for 
 local name can be set using request parameter address=10.x.x.x (don't forget 
 to escape )
 
 as a result my nn computer has 2 DNS names with currently assigned ip 
 addresses , e.g.
 myhadoop-nn.crabdance.com [3] 54.203.181.177
 myhadoop-nn-local.crabdance.com [4] 10.220.149.103
 
 in hadoop configuration I can use local machine names to access my cluster 
 outside of AWS I can use public names
 
 Just curious if AWS provides easier way to name EC2 computers?
 
 On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 I dont know how you would do that to be honest. With EMR you have 
 destinctions master core and task nodes. If you need to change configuration 
 you just ssh into the EMR master node. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 02:11, Alexander Pivovarov wrote: 
 
 What is the easiest way to assign names to aws ec2 computers?
 I guess computer need static hostname and dns name before it can be used in 
 hadoop cluster. 
 On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:
 
 When I started with EMR it was alot of testing and trial and error. HUE is 
 already supported as something that can be installed from the AWS console. 
 What I need to know is if you need this cluster on all the time or this is 
 goign ot be what amazon call a transient cluster. Meaning you fire it up run 
 the job and tear it back down. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 01:10, Krish Donald wrote: 
 
 Thanks Jonathan, 
 
 I will try to explore EMR option also. 
 Can you please let me know the configuration which you have used it? 
 Can you please recommend for me also? 
 I would like to setup Hadoop cluster using cloudera manager and then would 
 like to do below things: 
 
 setup kerberos
 setup federation
 setup monitoring
 setup hadr
 backup and recovery
 authorization using sentry
 backup and recovery of individual componenets
 performamce tuning
 upgrade of cdh 
 upgrade of CM
 Hue User Administration 
 Spark 
 Solr 
 
 Thanks 
 Krish 
 
 On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 krish EMR wont cost you much with all the testing and data we ran through the 
 test systems as well as the large amont of data when everythign was read we 
 paid about 15.00 USD. I honestly do not think that the specs there would be 
 enough as java can be pretty ram hungry. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 00:41, Krish Donald wrote: 
 
 Hi, 
 
 I am new to AWS and would like to setup Hadoop cluster using cloudera manager 
 for 6-7 nodes. 
 
 t2.micro on AWS; Is it enough for setting up Hadoop cluster ? 
 I would like to use free service as of now. 
 
 Please advise. 
 
 Thanks 
 Krish
 


Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread Jonathan Aquilina
 

I don't know how you would do that, to be honest. With EMR you have
distinct master, core and task nodes. If you need to change
configuration you just SSH into the EMR master node. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 02:11, Alexander Pivovarov wrote: 

 What is the easiest way to assign names to aws ec2 computers?
 I guess computer need static hostname and dns name before it can be used in 
 hadoop cluster. 
 On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:
 
 When I started with EMR it was alot of testing and trial and error. HUE is 
 already supported as something that can be installed from the AWS console. 
 What I need to know is if you need this cluster on all the time or this is 
 goign ot be what amazon call a transient cluster. Meaning you fire it up run 
 the job and tear it back down. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 01:10, Krish Donald wrote: 
 
 Thanks Jonathan, 
 
 I will try to explore EMR option also. 
 Can you please let me know the configuration which you have used it? 
 Can you please recommend for me also? 
 I would like to setup Hadoop cluster using cloudera manager and then would 
 like to do below things: 
 
 setup kerberos
 setup federation
 setup monitoring
 setup hadr
 backup and recovery
 authorization using sentry
 backup and recovery of individual componenets
 performamce tuning
 upgrade of cdh 
 upgrade of CM
 Hue User Administration 
 Spark 
 Solr 
 
 Thanks 
 Krish 
 
 On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 krish EMR wont cost you much with all the testing and data we ran through the 
 test systems as well as the large amont of data when everythign was read we 
 paid about 15.00 USD. I honestly do not think that the specs there would be 
 enough as java can be pretty ram hungry. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 00:41, Krish Donald wrote: 
 
 Hi, 
 
 I am new to AWS and would like to setup Hadoop cluster using cloudera manager 
 for 6-7 nodes. 
 
 t2.micro on AWS; Is it enough for setting up Hadoop cluster ? 
 I would like to use free service as of now. 
 
 Please advise. 
 
 Thanks 
 Krish
 

Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread Jonathan Aquilina
 

Hi guys, I know you want to keep costs down, but why go through all
the effort of setting up EC2 instances when, if you deploy EMR, it takes the
time to provision and set up the EC2 instances for you? All configuration for
the entire cluster is then done on the master node of the particular
cluster, and setting up additional software is all done through
the EMR console. We were doing some geospatial calculations and we
loaded a 3rd-party jar file called esri into the EMR cluster. I then had
to pass a small bootstrap action (script) to have it distribute esri to
the entire cluster. 

Why are you guys reinventing the wheel? 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 03:35, Alexander Pivovarov wrote: 

 I found the following solution to this problem.
 
 I registered 2 subdomains (public and local) for each computer on 
 https://freedns.afraid.org/subdomain/ 
 e.g. 
 myhadoop-nn.crabdance.com
 myhadoop-nn-local.crabdance.com 
 Then I added a cron job which sends HTTP requests to update the public and local IP 
 on the freedns server. Hint: the public IP is detected automatically; the IP address 
 for the local name can be set using the request parameter &address=10.x.x.x (don't 
 forget to escape the &).
 
 As a result my NN computer has 2 DNS names with currently assigned IP 
 addresses, e.g.
 myhadoop-nn.crabdance.com 54.203.181.177
 myhadoop-nn-local.crabdance.com 10.220.149.103
 
 In the Hadoop configuration I can use the local machine names to access my cluster; 
 outside of AWS I can use the public names.
 
 Just curious if AWS provides an easier way to name EC2 computers?
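
A hedged sketch of what such crontab entries can look like; the update URL format and 
the per-subdomain tokens are assumptions about freedns.afraid.org's dynamic-update 
interface:

*/5 * * * * curl -s "https://freedns.afraid.org/dynamic/update.php?PUBLIC_TOKEN" >/dev/null
*/5 * * * * curl -s "https://freedns.afraid.org/dynamic/update.php?LOCAL_TOKEN&address=10.220.149.103" >/dev/null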
 
 On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 I dont know how you would do that to be honest. With EMR you have 
 destinctions master core and task nodes. If you need to change configuration 
 you just ssh into the EMR master node. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 02:11, Alexander Pivovarov wrote: 
 
 What is the easiest way to assign names to aws ec2 computers?
 I guess computer need static hostname and dns name before it can be used in 
 hadoop cluster. 
 On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote:
 
 When I started with EMR it was alot of testing and trial and error. HUE is 
 already supported as something that can be installed from the AWS console. 
 What I need to know is if you need this cluster on all the time or this is 
 goign ot be what amazon call a transient cluster. Meaning you fire it up run 
 the job and tear it back down. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 01:10, Krish Donald wrote: 
 
 Thanks Jonathan, 
 
 I will try to explore EMR option also. 
 Can you please let me know the configuration which you have used it? 
 Can you please recommend for me also? 
 I would like to setup Hadoop cluster using cloudera manager and then would 
 like to do below things: 
 
 setup kerberos
 setup federation
 setup monitoring
 setup hadr
 backup and recovery
 authorization using sentry
 backup and recovery of individual componenets
 performamce tuning
 upgrade of cdh 
 upgrade of CM
 Hue User Administration 
 Spark 
 Solr 
 
 Thanks 
 Krish 
 
 On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 krish EMR wont cost you much with all the testing and data we ran through the 
 test systems as well as the large amont of data when everythign was read we 
 paid about 15.00 USD. I honestly do not think that the specs there would be 
 enough as java can be pretty ram hungry. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 00:41, Krish Donald wrote: 
 
 Hi, 
 
 I am new to AWS and would like to setup Hadoop cluster using cloudera manager 
 for 6-7 nodes. 
 
 t2.micro on AWS; Is it enough for setting up Hadoop cluster ? 
 I would like to use free service as of now. 
 
 Please advise. 
 
 Thanks 
 Krish
 



Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-05 Thread Jonathan Aquilina
 

When I started with EMR it was a lot of testing and trial and error. HUE
is already supported as something that can be installed from the AWS
console. What I need to know is if you need this cluster on all the time,
or if this is going to be what Amazon calls a transient cluster, meaning you
fire it up, run the job and tear it back down. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-06 01:10, Krish Donald wrote: 

 Thanks Jonathan, 
 
 I will try to explore EMR option also. 
 Can you please let me know the configuration which you have used it? 
 Can you please recommend for me also? 
 I would like to setup Hadoop cluster using cloudera manager and then would 
 like to do below things: 
 
 setup kerberos
 setup federation
 setup monitoring
 setup hadr
 backup and recovery
 authorization using sentry
 backup and recovery of individual componenets
 performamce tuning
 upgrade of cdh 
 upgrade of CM
 Hue User Administration 
 Spark 
 Solr 
 
 Thanks 
 Krish 
 
 On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote:
 
 krish EMR wont cost you much with all the testing and data we ran through the 
 test systems as well as the large amont of data when everythign was read we 
 paid about 15.00 USD. I honestly do not think that the specs there would be 
 enough as java can be pretty ram hungry. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 00:41, Krish Donald wrote: 
 
 Hi, 
 
 I am new to AWS and would like to setup Hadoop cluster using cloudera manager 
 for 6-7 nodes. 
 
 t2.micro on AWS; Is it enough for setting up Hadoop cluster ? 
 I would like to use free service as of now. 
 
 Please advise. 
 
 Thanks 
 Krish
 

changing log verbosity

2015-02-24 Thread Jonathan Aquilina
 

How does one go about changing the log verbosity in Hadoop? Which
configuration file should I be looking at? 
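
A few hedged starting points; the paths assume a default install layout and the 
host/port are placeholders:

# per command, via the environment:
$ HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls /
# persistently, by editing the root logger line in etc/hadoop/log4j.properties:
#   hadoop.root.logger=INFO,console
# or on a running daemon, without a restart:
$ hadoop daemonlog -setlevel nn-host:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG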

-- 
Regards,
Jonathan Aquilina
Founder Eagle Eye T
 

mssql bulk copy dat files

2015-02-23 Thread Jonathan Aquilina
 

Can Hadoop process .dat files that are generated by MS SQL bulk copy? 

-- 
Regards,
Jonathan Aquilina
Founder Eagle Eye T
 

Re: mssql bulk copy dat files

2015-02-23 Thread Jonathan Aquilina
 

We are using sqlcmd at the moment. I was just curious, though, as bulk copy
can copy rows to files really quickly, into and out of a DB. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-23 12:07, Alexander Alten-Lorenz wrote: 

 Not per default. But you can use sqoop to offload the DBs into 
 something-delimited text. 
 http://mapredit.blogspot.de/2011/10/sqoop-and-microsoft-sql-server.html 
 http://www.microsoft.com/en-us/download/details.aspx?id=27584 
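
A hedged sketch of such a Sqoop import from SQL Server; the connection string, 
credentials and paths are placeholders, and the Microsoft JDBC driver jar has to be on 
Sqoop's classpath:

$ sqoop import \
    --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
    --username myuser -P \
    --table mytable \
    --fields-terminated-by '\t' \
    --target-dir /data/mytable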
 
 On 23 Feb 2015, at 12:01, Jonathan Aquilina jaquil...@eagleeyet.net wrote: 
 
 Can hadoop process dat files that are generated by MS SQL bulk copy? 
 
 -- 
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 



Re: recombining split files after data is processed

2015-02-22 Thread Jonathan Aquilina
 

Thanks Alex. Where would that command be placed: in a mapper or reducer,
or run as a standalone command? Here at work we are looking to use Amazon EMR to do
our number crunching, and we have access to the master node, but not
really the rest of the cluster. Can this be added as a step to be run
after initial processing? 
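
A hedged example of doing it outside the MapReduce job itself, run on the EMR master 
node after the job finishes; the paths and bucket are placeholders and the AWS CLI is 
assumed to be available there:

$ hadoop fs -getmerge /output/myjob merged_output.txt
$ aws s3 cp merged_output.txt s3://my-bucket/results/merged_output.txt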

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-23 08:05, Alexander Alten-Lorenz wrote: 

 Hi, 
 
 You can use a single reducer 
 (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) for smaller 
 datasets, or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name 
 
 BR, 
 Alex 
 
 On 23 Feb 2015, at 08:00, Jonathan Aquilina jaquil...@eagleeyet.net wrote: 
 
 Hey all, 
 
 I understand that the purpose of splitting files is to distribute the data 
 to multiple core and task nodes in a cluster. My question is: after the 
 output is complete, is there a way to combine all the parts into a 
 single file? 
 
 -- 
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 



Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Jonathan Aquilina
 

Where I am working, we are working on a transient (temporary) cluster using
Amazon EMR. When I was reading up on how things work, they suggested
using Ganglia for monitoring memory usage, network usage, etc.
That way, depending on how things are set up, be it using an Amazon S3
bucket for example and pulling data directly into the cluster, the
network link will always be saturated to ensure a constant flow of data.

What I am suggesting is potentially looking at Ganglia. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-22 07:42, Fang Zhou wrote: 

 Hi Jonathan, 
 
 Thank you. 
 
 The number of files impacts the memory usage in the Namenode. 
 
 I just want to get the real memory usage situation in the Namenode. 
 
 The memory used in the heap always changes, so I have no idea about which 
 value is the right one. 
 
 Thanks, 
 Tim 
 
 On Feb 22, 2015, at 12:22 AM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote: 
 
 I am rather new to hadoop, but wouldnt the difference be potentially in how 
 the files are split in terms of size? 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-02-21 21:54, Fang Zhou wrote: 
 
 Hi All,
 
 I want to test the memory usage on Namenode and Datanode.
 
 I try to use jmap, jstat, proc/pid/stat, top, ps aux, and Hadoop website 
 interface to check the memory.
 The values I get from them are different. I also found that the memory always 
 changes periodically.
 This is the first thing confused me.
 
 I thought the more files stored in Namenode, the more memory usage in 
 Namenode and Datanode.
 I also thought the memory used in Namenode should be larger than the memory 
 used in each Datanode.
 However, some results show my ideas are wrong.
 For example, I tested the memory usage of the Namenode with 6000 files and with 1000 files.
 The memory with 6000 files is less than with 1000 files, according to jmap's results. 
 I also found that the memory usage in Datanode is larger than the memory used 
 in Namenode.
 
 I really don't know how to get the memory usage in Namenode and Datanode.
 
 Can anyone give me some advices?
 
 Thanks,
 Tim
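
A hedged way to read the NameNode's own view of its heap, independent of jmap/top; the 
hostname is a placeholder, and 50070/50075 are the default NameNode/DataNode HTTP ports 
in Hadoop 2.x:

$ curl -s 'http://nn-host:50070/jmx?qry=java.lang:type=Memory'
# "HeapMemoryUsage" -> "used" in the JSON is the figure the NameNode web UI
# reports; the same query against a DataNode's HTTP port (50075) gives its heap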
 

Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Jonathan Aquilina
 

Hi Tim, 

Not sure if this might be of any use in terms of improving overall
cluster performance for you, but I hope that it might shed some ideas
for you and others. 

https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-22 07:57, Tim Chou wrote: 

 Hi Jonathan, 
 
 Very useful information. I will look at the ganglia. 
 
 However, I do not have the administrative privilege for the cluster. I don't 
 know if I can install Ganglia in the cluster. 
 
 Thank you for your information. 
 
 Best, 
 Tim 
 
 2015-02-22 0:53 GMT-06:00 Jonathan Aquilina jaquil...@eagleeyet.net:
 
 Where I am working we are working on transient cluster (temporary) using 
 Amazon EMR. When I was reading up on how things work they suggested for 
 monitoring to use ganglia to monitor memory usage and network usage etc. That 
 way depending on how things are setup be it using an amazon s3 bucket for 
 example and pulling data directly into the cluster the network link will 
 always be saturated to ensure a constant flow of data. 
 
 What I am suggesting is potentially looking at ganglia. 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-02-22 07:42, Fang Zhou wrote: Hi Jonathan, 
 
 Thank you. 
 
 The number of files impact on the memory usage in Namenode. 
 
 I just want to get the real memory usage situation in Namenode. 
 
 The memory used in heap always changes so that I have no idea about which 
 value is the right one. 
 
 Thanks, 
 Tim 
 
 On Feb 22, 2015, at 12:22 AM, Jonathan Aquilina jaquil...@eagleeyet.net 
 wrote: 
 
 I am rather new to hadoop, but wouldnt the difference be potentially in how 
 the files are split in terms of size? 
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-02-21 21:54, Fang Zhou wrote: 
 
 Hi All,
 
 I want to test the memory usage on Namenode and Datanode.
 
 I try to use jmap, jstat, proc/pid/stat, top, ps aux, and Hadoop website 
 interface to check the memory.
 The values I get from them are different. I also found that the memory always 
 changes periodically.
 This is the first thing confused me.
 
 I thought the more files stored in Namenode, the more memory usage in 
 Namenode and Datanode.
 I also thought the memory used in Namenode should be larger than the memory 
 used in each Datanode.
 However, some results show my ideas are wrong.
 For example, I test the memory usage of Namenode with 6000 and 1000 files.
 The 6000 memory is less than 1000 memory from jmap's results. 
 I also found that the memory usage in Datanode is larger than the memory used 
 in Namenode.
 
 I really don't know how to get the memory usage in Namenode and Datanode.
 
 Can anyone give me some advices?
 
 Thanks,
 Tim
 

Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Jonathan Aquilina
 

I am rather new to Hadoop, but wouldn't the difference potentially be in
how the files are split in terms of size? 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-21 21:54, Fang Zhou wrote: 

 Hi All,
 
 I want to test the memory usage on Namenode and Datanode.
 
 I try to use jmap, jstat, proc/pid/stat, top, ps aux, and Hadoop website 
 interface to check the memory.
 The values I get from them are different. I also found that the memory always 
 changes periodically.
 This is the first thing confused me.
 
 I thought the more files stored in Namenode, the more memory usage in 
 Namenode and Datanode.
 I also thought the memory used in Namenode should be larger than the memory 
 used in each Datanode.
 However, some results show my ideas are wrong.
 For example, I test the memory usage of Namenode with 6000 and 1000 files.
 The 6000 memory is less than 1000 memory from jmap's results. 
 I also found that the memory usage in Datanode is larger than the memory used 
 in Namenode.
 
 I really don't know how to get the memory usage in Namenode and Datanode.
 
 Can anyone give me some advices?
 
 Thanks,
 Tim
 

writing mappers and reducers question

2015-02-19 Thread Jonathan Aquilina
 

Hey guys, is it safe to guess that one would need a single-node setup to
be able to write mappers and reducers for Hadoop? 
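
A hedged illustration of one alternative: mappers and reducers can be developed and 
exercised entirely in local mode with no cluster at all, assuming the driver goes 
through ToolRunner so the generic options are honoured; the jar and paths are 
placeholders:

$ hadoop jar myjob.jar MyJob \
    -Dmapreduce.framework.name=local -fs file:/// \
    input/ output/
# runs the map and reduce code in a single local JVM against the local
# filesystem; no single-node or multi-node cluster is required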

-- 
Regards,
Jonathan Aquilina
Founder Eagle Eye T