RE: hadoop-hdfs-native-client Help
Hi Paula,

I am not sure how to answer your questions, but is there a reason why you are using an EC2 instance instead of Amazon's EMR (Elastic MapReduce) Hadoop cluster? As far as I know you can set that up to work with an HDFS setup as well as S3 buckets if you don't need a long-term cluster to stay online.

Regards,
Jonathan

From: Paula Logan
Sent: 10 September 2021 16:13
To: user@hadoop.apache.org
Subject: hadoop-hdfs-native-client Help

Hello,

I am new to building Hadoop locally and am having some issues. Please let me know if this information should be sent to a different list.

(1) Can Hadoop 3.3.1 be compiled and run with OpenJDK 11, or is OpenJDK 1.8 needed for compiling while either 1.8 or 11 can be used to run Hadoop?

(2) I am compiling and testing Hadoop 3.3.1 on RHEL 8.4 on the command line (not via any IDE) inside an AWS instance. I have encountered an issue with Native Test Case #35 (the other 39 native test cases succeed). First, here is my Maven command:

mvn -e -X test -Pnative,parallel-tests,shelltest,yarn-ui -Dtest=allNative -Dparallel-tests=true \
  -Drequire.bzip2=true -Drequire.fuse=true \
  -Drequire.isal=true -Disal.prefix=/usr/local -Disal.lib=/usr/local/lib64 -Dbundle.isal=true \
  -Drequire.openssl=true -Dopenssl.prefix=/usr -Dopenssl.include=/usr/include -Dopenssl.lib=/usr/lib64 -Dbundle.openssl=true -Dbundle.openssl.in.bin=true \
  -Drequire.pmdk=true -Dpmdk.lib=/usr/lib64 -Dbundle.pmdk=true \
  -Drequire.snappy=true -Dsnappy.prefix=/usr -Dsnappy.include=/usr/include -Dsnappy.lib=/usr/lib64 -Dbundle.snappy=true \
  -Drequire.valgrind=true -Dhbase.profile=2.0 \
  -Drequire.zstd=true -Dzstd.prefix=/usr -Dzstd.include=/usr/include -Dzstd.lib=/usr/lib64 -Dbundle.zstd=true -Dbundle.zstd.in.bin=true \
  -Drequire.test.libhadoop=true

This is what I get for Test Case #35:

[exec] 35/40 Test #35: test_libhdfs_threaded_hdfspp_test_shim_static ..***Failed 31.58 sec
[exec] testRecursiveJvmMutex error:
[exec] ClassNotFoundException: RuntimeExceptionjava.lang.NoClassDefFoundError: RuntimeException
[exec] Caused by: java.lang.ClassNotFoundException: RuntimeException
[exec] at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
[exec] at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
[exec] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
[exec] at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
[exec] 2021-09-02 22:31:09,706 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:(529)) - starting cluster: numNameNodes=1, numDataNodes=1
[exec] 2021-09-02 22:31:10,134 INFO namenode.NameNode (NameNode.java:format(1249)) - Formatting using clusterid: testClusterID
[exec] 2021-09-02 22:31:10,156 INFO namenode.FSEditLog (FSEditLog.java:newInstance(229)) - Edit logging is async:true
[exec] 2021-09-02 22:31:10,182 INFO namenode.FSNamesystem (FSNamesystem.java:(814)) - KeyProvider: null
[exec] 2021-09-02 22:31:10,184 INFO namenode.FSNamesystem (FSNamesystemLock.java:(141)) - fsLock is fair: true
[exec] 2021-09-02 22:31:10,185 INFO namenode.FSNamesystem (FSNamesystemLock.java:(159)) - Detailed lock hold time metrics enabled: false
[exec] 2021-09-02 22:31:10,185 INFO namenode.FSNamesystem (FSNamesystem.java:(847)) - fsOwner= ec2-user (auth:SIMPLE)
[exec] 2021-09-02 22:31:10,185 INFO namenode.FSNamesystem (FSNamesystem.java:(848)) - supergroup ...
[exec] 2021-09-02 22:31:13,204 INFO ipc.Server (Server.java:logException(3020)) - IPC Server handler 7 on default port 44945, call Call#6 Retry#-1 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 127.0.0.1:37362: java.io.FileNotFoundException: File does not exist: /tlhData0001/file1
...
[exec] 98% tests passed, 1 tests failed out of 40
[exec]
[exec] Total Test time (real) = 270.30 sec
[exec]
[exec] The following tests FAILED:
[exec] 35 - test_libhdfs_threaded_hdfspp_test_shim_static (Failed)
[exec] Errors while running CTest
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main 3.3.1 ....... SUCCESS [ 0.707 s]
[INFO] Apache Hadoop Build Tools ...... SUCCESS [ 2.743 s]
[INFO] Apache Hadoop Project POM ...... SUCCESS [ 0.692 s]
[INFO] Apache Hadoop Annotations ...... SUCCESS [ 1.955 s]
[INFO] Apache Hadoop Project Dist POM . SUCCESS [ 0.106 s]
[INFO] Apache Hadoop Assemblies ....... SUCCESS [ 0.101 s]
[INFO] Apache Hadoop Maven Plugins .... SUCCESS [ 3.194 s]
[INFO] Apache Hadoop MiniKDC .......... SUCCESS [ 0.806 s]
[INFO] Apache Hadoop Auth ............. SUCCESS [ 4.192 s]
[INFO] Apache Hadoop Auth Examples .... SUCCESS [ 0.452 s]
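A small, hedged aside for anyone wading through a full native-test log like the one above: the failed test names can be pulled out of a saved build log with standard tools. The log filename below is an assumption (tee the mvn output into it first), and CTest itself can re-run a single test by name with `ctest -R <regex>` from the native build directory.

```shell
# Hypothetical helper: list only the failed native tests from a saved
# build log. "build.log" is an assumed name -- tee the mvn output into it.
grep -E '\*\*\*Failed' build.log | sed 's/.*Test #/Test #/'
```

On the log shown in this thread, this would print just the `Test #35: test_libhdfs_threaded_hdfspp_test_shim_static` line.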
RE: Applications always showing in pending state even after cluster restart
What you are describing has a fairly easy fix: on the Azure network security group, lock down those public IP addresses so they are accessible only from your IP address, or from the IP addresses that are meant to have access.

Regards,
Jonathan Aquilina
EagleEyeT
Phone: +356 2033 0099
Mobile: +356 7995 7942
Email: sa...@eagleeyet.net
Website: https://eagleeyet.net

From: Gaurav Chhabra
Sent: 13 June 2020 11:45
To: Hariharan
Cc: common-u...@hadoop.apache.org
Subject: Re: Applications always showing in pending state even after cluster restart

Wow! What a guess, Hari! :) I wasn't sure those pending tasks could have been related to an attack. This happened to me from 1st to 5th June '20. I didn't check my Azure usage during that time, though I was keeping tabs almost every day in May. On 8th June (Mon), when I checked the charges, the Azure 'data transfer out' charges were showing $88, $90 and $110 for bigdataserver-{5,6,7} respectively. I was shocked, as my last month's charge was around $53. I opened a ticket with Azure and then we started the cluster again (with an Azure networking guy along with me), and within 3-4 minutes data transfer out was again around 10-12 GB in total (from the 3 instances). We could only figure out that the hits were going to some blob storage in Azure. He said it most likely seems to be a virus or some attack.

I have now removed public IPs from all instances except two (one where Cloudera Manager is hosted and another where the Resource Manager is running). Even those two exposed ones allow incoming requests only from my laptop's IP. Things are fine now. One thing that I don't get is how the attacker is 'personally' benefitting from this, except for obviously raising my monthly bill?

Regards

On Sat, 13 Jun 2020 at 11:00, Hariharan <hariharan...@gmail.com> wrote:
This is most likely an attempt to attack your system.
If you are running your cluster in the cloud, you should run it in a private network so it is not exposed to the Internet. Alternatively, you can secure your installation as described here: https://blog.cloudera.com/how-to-secure-internet-exposed-apache-hadoop/

Thanks,
Hari

On Fri, 12 Jun 2020, 12:20 Gaurav Chhabra <varuag.chha...@gmail.com> wrote:
Hi All,

I have started learning Hadoop and its related components. I am following a tutorial on Hadoop Administration on Udemy. As part of the learning process, I ran the following command:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Ddfs.replication=1 /user/bigdata/randomtextwriter

The above command created 30 files, each of size 1 GB. Then I ran the below wordcount command:

$ yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount \
    -Dmapreduce.input.fileinputformat.split.minsize=268435456 \
    -Dmapreduce.job.reduces=8 \
    /user/bigdata/randomtext \
    /user/bigdata/wordcount

After executing the above command, I thought of killing the application after some time, so I first ran 'yarn application -list', which listed a lot of applications, one of which was wordcount. I killed that particular application using 'yarn application -kill <application-id>'. However, when I checked the scheduler, I could see that several applications were still showing in the Pending state, so I ran the following command:

$ for x in $(yarn application -list -appStates ACCEPTED | awk 'NR > 2 { print $1 }'); do yarn application -kill $x; done

It was killing the applications, as I could see the 'Apps Completed' count going up, but as soon as all the apps got killed, I saw those applications getting created again. Even if I stop the whole cluster and start it again, the scheduler shows that there are submitted applications in the Pending state. Here's the content of fair-scheduler.xml: drf drf

This is just a test cluster. I just want to kill the applications/clear the application queue.
Any help will really be appreciated, as I have been struggling with this for the last few days.

Regards

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org
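As a side note for future readers: the kill loop above can be dry-run against canned output before pointing it at a live ResourceManager. The sketch below simulates that; the application IDs and the exact shape of the two header lines are invented for illustration (the original loop's `NR > 2` assumes `yarn application -list` prints two header lines before the data rows).

```shell
# Simulated `yarn application -list -appStates ACCEPTED` output; the IDs
# below are made up. awk's "NR > 2" skips the two header lines, and $1
# takes the application ID from each remaining row.
yarn_list() {
  cat <<'EOF'
Total number of applications:2
                Application-Id     Application-Name
application_1591780000000_0001     word count
application_1591780000000_0002     word count
EOF
}

for x in $(yarn_list | awk 'NR > 2 { print $1 }'); do
  echo "would run: yarn application -kill $x"
done
```

Swapping `yarn_list` for the real `yarn application -list -appStates ACCEPTED` gives the loop from the message above.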
Re: Warning from user@hadoop.apache.org
I think they are testing some new mechanisms for list moderation. On 2016-09-04 03:41, Ted Yu wrote: > There is nothing to worry about on your side. > > I received such email too. > > On Sep 3, 2016, at 5:57 PM, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote: > >> Can someone tell me if the below is something to worry about as I hardly >> post to the list and I know when I have posted to the list my emails have >> not bounced. >> >> Original Message >> >> SUBJECT: >> Warning from user@hadoop.apache.org >> >> DATE: >> 2016-09-04 02:36 >> >> FROM: >> user-h...@hadoop.apache.org >> >> TO: >> jaquil...@eagleeyet.net >> >> Hi! This is the ezmlm program. I'm managing the >> user@hadoop.apache.org mailing list. >> >> Messages to you from the user mailing list seem to >> have been bouncing. I've attached a copy of the first bounce >> message I received. >> >> If this message bounces too, I will send you a probe. If the probe bounces, >> I will remove your address from the user mailing list, >> without further notice. >> >> I've kept a list of which messages from the user mailing list have >> bounced from your address. >> >> Copies of these messages may be in the archive. >> To retrieve a set of messages 123-145 (a maximum of 100 per request), >> send a short message to: >> <user-get.123_...@hadoop.apache.org> >> >> To receive a subject and author list for the last 100 or so messages, >> send a short message to: >> <user-in...@hadoop.apache.org> >> >> Here are the message numbers: >> >> 23009 >> >> --- Enclosed is a copy of the bounce message I received. >> >> Return-Path: <> >> Received: (qmail 55605 invoked for bounce); 24 Aug 2016 13:51:02 - >> Date: 24 Aug 2016 13:51:02 - >> From: mailer-dae...@apache.org >> To: user-return-230...@hadoop.apache.org >> Subject: failure notice
Re: EC2 Hadoop Cluster VS Amazon EMR
When I was testing EMR I had only spent around 17 USD, and that was with a decent-sized EMR cluster.

On 2016-03-11 12:31, José Luis Larroque wrote:
> Hi Jonathan!
> I was trying to decide which of those options to use a while ago. For now I'm using Amazon EMR, because it's easier; you have some stuff configured already.
>
> But a few benefits could be that, with EC2, you can use the Free Tier and save some money while you are testing your stuff. And EC2 could well be cheaper than EMR, but I'm not 100% sure of this.
>
> Bye!
> Jose
>
> 2016-03-07 6:17 GMT-03:00 Jonathan Aquilina <jaquil...@eagleeyet.net>:
>
>> Good morning,
>>
>> Just some food for thought: of late I'm noticing people using EC2 to set up their own Hadoop cluster. What is the advantage of using EC2 over Amazon's EMR Hadoop cluster?
>>
>> Regards,
>>
>> Jonathan
Re: fs.s3a.endpoint not working
I'm not totally following this thread from the beginning, but I might be able to help, as I have some experience with Amazon EMR (Elastic MapReduce) when working with custom jar files and S3. Are you using EMR, or something internal and offloading storage to S3?

---
Regards,
Jonathan Aquilina
Founder

On 2016-01-13 23:21, Phillips, Caleb wrote:
> Hi Billy (and others),
>
> One of the threads suggested using the core-site.xml. Did you try putting your configuration in there?
>
> Yes, I did try that. I've also tried setting it dynamically in e.g. Spark. I can verify that it is getting the configuration correctly:
>
> hadoop org.apache.hadoop.conf.Configuration
>
> Still it never connects to our internal S3-compatible store and always connects to AWS.
>
> One thing I've noticed is that the AWS stuff is handled by an underlying library (I think jets3t in < 2.6 versions, forget what in 2.6+), and when I was trying to mess with stuff and spelunking through the Hadoop code, I kept running into blocks with that library.
>
> I started digging into the code. I found that the custom endpoint was introduced with this patch:
>
> https://issues.apache.org/jira/browse/HADOOP-11261
>
> It seems it was integrated in 2.7.0, so just to be sure I downloaded 2.7.1, but the problem persists.
>
> That code calls this function in the AWS Java SDK:
>
> http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Client.html#setEndpoint(java.lang.String)
>
> However, no matter what configuration I use, it still seems to connect to Amazon AWS. Is it possible that the AWS Java SDK cannot work with S3-compatible (non-AWS) stores? If so, it would seem there is currently no way to connect Hadoop to an S3-compatible, non-AWS store.
>
> If anyone else has any insight, particularly success using Hadoop with a non-AWS, S3-compatible store, please chime in!
> William Watson
> Software Engineer
> (904) 705-7056 PCS
>
> On Mon, Jan 11, 2016 at 10:39 AM, Phillips, Caleb <caleb.phill...@nrel.gov> wrote:
> Hi All,
>
> Just wanted to send this out again since there was no response (admittedly, originally sent in the midst of the US holiday season) and it seems to be an issue that continues to come up (see e.g. the email from Han Ju on Jan 5).
>
> If anyone has successfully connected Hadoop to a non-AWS, S3-compatible object store, it'd be very helpful to hear how you made it work. The fs.s3a.endpoint configuration directive appears non-functional at our site (with Hadoop 2.6.3).
>
> --
> Caleb Phillips, Ph.D.
> Data Scientist | Computational Science Center
>
> National Renewable Energy Laboratory (NREL)
> 15013 Denver West Parkway | Golden, CO 80401
> 303-275-4297 | caleb.phill...@nrel.gov
>
> On 12/22/15, 1:39 PM, "Phillips, Caleb" <caleb.phill...@nrel.gov> wrote:
>
>> Hi All,
>>
>> New to this list. Looking for a bit of help:
>>
>> I'm having trouble connecting Hadoop to an S3-compatible (non-AWS) object store.
>>
>> This issue was discussed, but left unresolved, in this thread:
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+0W_au5es_flugzmgwkkga3jya1asi3u+isjcuymfntvnk...@mail.gmail.com%3E
>>
>> And here, on Cloudera's forums (the second post is mine):
>>
>> https://community.cloudera.com/t5/Data-Ingestion-Integration/fs-s3a-endpoint-ignored-in-hdfs-site-xml/m-p/33694#M1180
>>
>> I'm running Hadoop 2.6.3 with Java 1.8 (65) on a Linux host. Using Hadoop, I'm able to connect to S3 on AWS, and e.g. list/put/get files.
>>
>> However, when I point the fs.s3a.endpoint configuration directive at my non-AWS, S3-compatible object storage, it appears to still point at (and authenticate against) AWS.
>> I've checked and double-checked my credentials and configuration using both Python's boto library and the s3cmd tool, both of which connect to this non-AWS data store just fine.
>>
>> Any help would be much appreciated. Thanks!
>>
>> --
>> Caleb Phillips, Ph.D.
>> Data Scientist | Computational Science Center
>>
>> National Renewable Energy Laboratory (NREL)
>> 15013 Denver West Parkway | Golden, CO 80401
>> 303-275-4297 | caleb.phill...@nrel.gov

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org
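For anyone finding this thread later: once the endpoint support from HADOOP-11261 is in place (Hadoop 2.7+), the relevant s3a properties live in core-site.xml. The fragment below is only a sketch; the endpoint host and the credential values are placeholders, not values from this thread.

```xml
<!-- Sketch of core-site.xml settings for a non-AWS, S3-compatible store.
     The endpoint host/port and the keys below are placeholders. -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>objectstore.example.internal:9020</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS_KEY_HERE</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET_KEY_HERE</value>
</property>
```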
Re: Use of hadoop in AWS - Build it from scratch on a EC2 instance / MapR hadoop distribution / Amazon hadoop distribution
Hey Jose,

Have you looked at Amazon EMR (Elastic MapReduce)? Where I work we have used it, and when you provision the EMR instance you can use custom jars like the one you mentioned. In terms of storage, you can use HDFS if you are going to keep a persistent cluster; if not, you can store your data in an Amazon S3 bucket. The documentation for EMR is really good. At the time we did this, at the beginning of this year, they supported Hadoop 2.6. In my honest opinion you are giving yourself a lot of extra work for nothing to get set up in Hadoop. Try out EMR with a temporary cluster and go from there. I managed to tool up and learn how to work with EMR in a week.

Sent from my iPhone

> On 19 Oct 2015, at 02:10, José Luis Larroque wrote:
>
> Thanks for your answer Anders.
>
> - The amount of data that I'm going to manipulate is like Wikipedia (I will use a dump).
> - I already have the basics of Hadoop (I hope); I have a local multi-node cluster setup and I have already executed some algorithms.
> - Because the amount of data is significant, I believe that I should use several nodes.
>
> Maybe another option to consider is that I'm running Giraph on top of the selected Hadoop distribution/EC2.
>
> Bye!
> Jose
>
> 2015-10-18 18:53 GMT-03:00 Anders Nielsen:
>> Dear Jose,
>>
>> It will help people answer your question if you specify your goals:
>>
>> - If you do it to learn how to USE a running Hadoop, then go for one of the prebuilt distributions (Amazon or MapR).
>> - If you do it to learn more about setting up and administrating Hadoop, then you are better off setting everything up from scratch on EC2.
>> - Do you need to run on many nodes, or just 1 node to test some MapReduce scripts on a small data set?
>>
>> Regards,
>>
>> Anders
>>
>>> On Sun, Oct 18, 2015 at 10:03 PM, José Luis Larroque wrote:
>>> Hi all!
>>>
>>> I started to use Hadoop with AWS, and a big question appears in front of me!
>>>
>>> I'm using a MapR distribution, for Hadoop 2.4.0 in AWS. I already tried some trivial examples, and before moving forward I have one question.
>>>
>>> What is the better option for using Hadoop on AWS?
>>> - Build it from scratch on an EC2 instance
>>> - Use the MapR distribution of Hadoop
>>> - Use the Amazon distribution of Hadoop
>>>
>>> Sorry if my question is too broad.
>>>
>>> Bye!
>>> Jose
Re: IMPORTANT: "HOW TO" UNSUBSCRIBE
Because none of the key Hadoop contributors have a signature that contains the unsubscribe address, which technically could classify these emails as spam. If one looks at other mailing lists, be it in a key member's signature or otherwise, they always explain how you can unsubscribe.

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-09-29 10:54, Daniel Jankovic wrote:
> well ... why, when they can always send UNSUBSCRIBE to the whole group :)
>
> On Tue, Sep 22, 2015 at 5:31 PM, Namikaze Minato <lloydsen...@gmail.com> wrote:
>
>> Step 1:
>> Send an e-mail to user-unsubscr...@hadoop.apache.org
>>
>> Done.
Re: Comparing CheckSum of Local and HDFS File
Correct me if I am wrong, but the command you ran on the cluster seems to be doing a CRC check as well. I am still a novice at Hadoop, but that is the most obvious thing I see in the output below.

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-08-07 12:34, Shashi Vishwakarma wrote:

Hi

I have a small confusion regarding checksum verification. Let's say I have a file abc.txt and I transferred this file to HDFS. How do I ensure data integrity? I followed the below steps to check that the file was transferred correctly.

ON LOCAL FILE SYSTEM:

md5sum abc.txt
276fb620d097728ba1983928935d6121 TestFile

ON HADOOP CLUSTER:

hadoop fs -checksum /abc.txt
/abc.txt MD5-of-0MD5-of-512CRC32C 0200911156a9cf0d906c56db7c8141320df0

The two outputs look different to me. Let me know if I am doing anything wrong. How do I verify that my file was transferred properly into HDFS?

Thanks
Shashi
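For what it's worth, the two outputs differ by design rather than by error: `hadoop fs -checksum` does not return an MD5 of the file's bytes. The label `MD5-of-0MD5-of-512CRC32C` says it is an MD5 computed over per-block CRC32C checksums, so it will never match a local `md5sum`. One simple integrity check that does work is to round-trip the file and compare plain md5sums on the local side; the sketch below stands a `cp` in for the `hadoop fs -put`/`hadoop fs -get` pair.

```shell
# Round-trip integrity check, simulated locally: `cp` stands in for
# `hadoop fs -put abc.txt /abc.txt` followed by
# `hadoop fs -get /abc.txt abc.roundtrip.txt`.
printf 'some test data\n' > abc.txt
cp abc.txt abc.roundtrip.txt
a=$(md5sum abc.txt | awk '{print $1}')
b=$(md5sum abc.roundtrip.txt | awk '{print $1}')
if [ "$a" = "$b" ]; then echo "checksums match"; fi
```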
Re: ssh localhost returns connection closed by ::1 under Cygwin installation on windows 8
Hi,

Is there a reason why you are using IPv6?

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-07-23 23:35, Yepeng Sun wrote:

Hi,

I tried to install Hadoop on Windows 8 to form a multi-node cluster, so first I had to install Cygwin in order to get SSH working. I installed Cygwin with the sshd service successfully, and I also generated passwordless keys. But when I run ssh localhost, it returns "connection closed by ::1". Does anyone have the same experience? How did you solve it?

Thanks!
Yepeng
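A follow-up thought, in case it helps later readers: "connection closed by ::1" means the connection is going over the IPv6 loopback. Two hedged options using standard OpenSSH settings (the sshd_config path varies by Cygwin install): restrict the daemon to IPv4, or force the client to IPv4 for one connection.

```
# sshd_config fragment: restrict the daemon to IPv4 so `ssh localhost`
# cannot land on ::1. Restart the sshd service after editing.
AddressFamily inet
ListenAddress 0.0.0.0
```

From the client side, `ssh -4 localhost` forces IPv4 for a single connection without touching the server configuration.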
Re: Unable to Find S3N Filesystem Hadoop 2.6
One thing I think I most likely missed completely: are you using an Amazon EMR cluster or something in-house?

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-04-20 16:21, Billy Watson wrote:

I appreciate the response. These JAR files aren't 3rd party. They're included with the Hadoop distribution, but in Hadoop 2.6 they stopped being loaded by default and now they have to be loaded manually, if needed.

Essentially the problem boils down to:
- need to access s3n URLs
- cannot access without including the tools directory
- after including the tools directory in HADOOP_CLASSPATH, failures start happening later in the job
- need to find the right env variable (or shell script or w/e) to include jets3t and other JARs needed to access s3n URLs (I think)

William Watson
Software Engineer
(904) 705-7056 PCS

On Mon, Apr 20, 2015 at 9:58 AM, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:

You mention an environment variable. In the step before you specify the steps to run to get to the result, you can specify a bash script that will allow you to put any 3rd-party jar files (for us, we used Esri) on the cluster and propagate them to all nodes in the cluster as well. You can ping me off-list if you need further help. Thing is, I haven't used Pig, but my boss and coworker wrote the mappers and reducers; getting these jars to the entire cluster was a super small and simple bash script.

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-04-20 15:17, Billy Watson wrote:

Hi,

I am able to run a `hadoop fs -ls s3n://my-s3-bucket` from the command line without issue. I have set some options in hadoop-env.sh to make sure all the S3 stuff for Hadoop 2.6 is set up correctly. (This was very confusing, BTW, and there is not enough searchable documentation on the changes to the S3 stuff in Hadoop 2.6, IMHO.)

Anyway, when I run a Pig job which accesses S3, it gets to 16%, does not fail in Pig, but rather fails in MapReduce with "Error: java.io.IOException: No FileSystem for scheme: s3n".

I have added [hadoop-install-loc]/lib and [hadoop-install-loc]/share/hadoop/tools/lib/ to the HADOOP_CLASSPATH env variable in hadoop-env.sh.erb. When I do not do this, the Pig job will fail at 0% (before it ever gets to MapReduce) with a very similar "No FileSystem for scheme: s3n" error.

I feel like at this point I just have to add the share/hadoop/tools/lib directory (and maybe lib) to the right environment variable, but I can't figure out which environment variable that should be.

I appreciate any help, thanks!!

Stack trace:
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:498)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:467)
at org.apache.pig.piggybank.storage.CSVExcelStorage.setLocation(CSVExcelStorage.java:609)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.mergeSplitSpecificConf(PigInputFormat.java:129)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.createRecordReader(PigInputFormat.java:103)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.init(MapTask.java:512)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

--
Billy Watson

--
William Watson
Software Engineer
(904) 705-7056 PCS
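One hedged guess at the missing piece, for future readers: HADOOP_CLASSPATH is read by the client-side JVM (which is why the job now passes the 0% planning stage), but the MapReduce task JVMs on the cluster build their classpath from the `mapreduce.application.classpath` property instead. That would explain a failure that only appears once tasks start running. A sketch of the mapred-site.xml change, with placeholder paths:

```xml
<!-- Sketch: make the tools jars (hadoop-aws, jets3t, etc.) visible to the
     task JVMs, not just the client. $HADOOP_MAPRED_HOME here is a
     placeholder for the actual install location. -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*</value>
</property>
```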
Re: Unable to Find S3N Filesystem Hadoop 2.6
Sadly I'll have to pull back: I have only run a Hadoop MapReduce cluster with Amazon EMR.

Sent from my iPhone

On 20 Apr 2015, at 16:53, Billy Watson <williamrwat...@gmail.com> wrote:

This is an install on a CentOS 6 virtual machine used in our test environment. We use HDP in staging and production, and we discovered these issues while trying to build a new cluster using HDP 2.2, which upgrades from Hadoop 2.4 to Hadoop 2.6.

William Watson
Software Engineer
(904) 705-7056 PCS
Re: Unsubscribe
I feel all emails coming through this list should have a signature with an unsubscribe link. Who do I need to contact to get that done?

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-04-14 18:12, Preya Shah wrote:

unsubscribe
Re: Can we run mapreduce job from eclipse IDE on fully distributed mode hadoop cluster?
I could be wrong here, but the way I understand things, I do not think it is even possible to run the JAR file from your PC. There are two things that you need to consider:

1) How is the JAR file going to connect to the cluster?
2) How is the JAR file going to be distributed to the cluster?

Again, I could be wrong in my response, so anyone else on the list should feel free to correct me. I am still a novice at Hadoop and have only worked with it on Amazon EMR.

---
Regards,
Jonathan Aquilina
Founder
Eagle Eye T

On 2015-04-11 08:23, Answer Agrawal wrote:

A MapReduce job can be run as a jar file from the terminal or directly from the Eclipse IDE. When a job is run as a jar file from the terminal, it uses multiple JVMs and all the resources of the cluster. Does the same thing happen when we run from the IDE? I have run a job both ways, and it takes less time in the IDE than as a jar file from the terminal.

Thanks
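A possible explanation for the timing difference, offered tentatively: when no cluster configuration is on the classpath (the usual case inside an IDE), Hadoop 2.x defaults `mapreduce.framework.name` to `local`, so the whole job runs in a single JVM via LocalJobRunner. That means no container launches and no scheduling overhead, hence faster on small inputs, but also no use of the cluster's resources. Submitting to the cluster requires the client to see something like:

```xml
<!-- mapred-site.xml: "yarn" submits the job to the cluster; "local"
     (the default when this is unset) runs it in one JVM via
     LocalJobRunner. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```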
getting amazon emr to access a single file when reducing
Hi guys, I need to run a job where the data, while being reduced, needs to access another file for additional data (in my case waypoints; the waypoints themselves do not need to be processed). On Amazon EMR this is proving very tricky. How would one do this in the simplest way possible? -- Regards, Jonathan Aquilina Founder Eagle Eye T
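One common approach (assuming the job's driver goes through Hadoop's ToolRunner/GenericOptionsParser) is to ship the side file to every task with the generic `-files` option, so each mapper and reducer can open it as an ordinary local file. This is only a sketch: the jar name, driver class, and S3 paths below are placeholders.

```shell
# Ship waypoints.txt to every task's working directory; map/reduce code
# can then open it as a plain local file named "waypoints.txt".
# (waypoints-job.jar, WaypointsDriver and the bucket paths are hypothetical.)
hadoop jar waypoints-job.jar WaypointsDriver \
  -files s3://my-bucket/reference/waypoints.txt \
  s3://my-bucket/input/ s3://my-bucket/output/
```

On EMR this can be passed as the step's arguments; since the waypoints are read-only reference data, distributing them this way avoids touching the job's input splits at all.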
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
When I was testing I was using the default setup: 1 master node, 2 core nodes and no task nodes. I would spin up the cluster and then terminate it; the term for that is a transient cluster. When the big data needed to be crunched, I changed the setup a bit. An important note: there is a limitation of 20 nodes, be they core or task, with EMR; a request can be submitted to lift that limitation. When actually live I had 1 master node, 3 core nodes (which have HDFS storage) and 10 task nodes. All instances used were of size m3.large. I ran another batch of data for 2013 through EMR with this setup in 31 minutes, and that is just to run the data, not including cluster spin-up time. One thing to note: you do not need to use HDFS storage, as that can and will drive up the cost quickly, and there is a chance of data corruption or even data loss if a core node crashes. I have been using Amazon S3 and pulling the data from there. The biggest advantage is that you can spawn multiple clusters and share the same data to be processed that way. Using HDFS has its perks too, but costs can drastically increase as well. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-07 09:54, tesm...@gmail.com wrote: Dear Jonathan, Would you please describe the process of running EMR-based Hadoop for $15.00? I tried, and my costs were rocketing, like $60 for one hour. Regards On 05/03/2015 23:57, Jonathan Aquilina wrote: krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as Java can be pretty RAM hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS; is it enough for setting up a Hadoop cluster? I would like to use the free service as of now. Please advise. Thanks Krish
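A transient cluster of the kind described above can be launched from the AWS CLI so that it runs its steps and then tears itself down. The sketch below is illustrative only: the cluster name, release label, instance type, bucket paths and jar are all placeholders, and should be adapted to the account's own setup.

```shell
# Spin up a transient EMR cluster: provision, run the step, auto-terminate.
# (my-logs, my-bucket, wordcount.jar and the release label are hypothetical.)
aws emr create-cluster \
  --name "transient-batch" \
  --release-label emr-5.36.0 \
  --instance-type m3.xlarge \
  --instance-count 4 \
  --log-uri s3://my-logs/emr/ \
  --steps Type=CUSTOM_JAR,Jar=s3://my-bucket/wordcount.jar,Args=[s3://my-bucket/in,s3://my-bucket/out] \
  --auto-terminate
```

The `--auto-terminate` flag is what makes the cluster transient: billing stops as soon as the last step finishes, which is the cost model being described in this thread.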
Re: AWS Setting for setting up Hadoop cluster
I have experience using EMR at my full-time job; the damn thing is quick and cheap. The interesting part is wrapping your head around the concepts. If you need things up quickly, EMR is the way to go. It spawns a number of EC2 instances; by default you have 1 master and 2 core nodes. The three of them are m3.large nodes, which run you 7 cents per hour. To run one year's worth of data, which is about 1.1 billion records from the database, it took 50 minutes from cluster spin-up to completion and shutdown of the cluster. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-05 23:41, Dieter De Witte wrote: You can install Hadoop on Amazon EC2 instances and use the free tier for new members, but you can also use Amazon EMR, which is not free but is up and running in a couple of seconds... 2015-03-05 23:28 GMT+01:00 Krish Donald gotomyp...@gmail.com: Hi, I am tired of setting up a Hadoop cluster on my laptop, which has 8GB RAM. I tried 2GB for the namenode and 1GB each for 3 datanodes, so in total I was using 5GB. And I was using only very basic Hadoop services. But it is so slow that I am not able to do anything on it. Hence I would like to try the AWS service now. Can anybody please help me with which configuration I should use without paying at all? What tips do you have for AWS? Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as Java can be pretty RAM hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS; is it enough for setting up a Hadoop cluster? I would like to use the free service as of now. Please advise. Thanks Krish
Re: AWS Setting for setting up Hadoop cluster
The advantage of EMR is that you don't have to screw around with installing Hadoop; it does all that for you, so you are ready to go. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-05 23:51, Krish Donald wrote: Because I am new to AWS, I would like to explore the free service first, and later I can use EMR. Which one is fast on EC2 and free too? Thanks On Thu, Mar 5, 2015 at 2:47 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I have experience using EMR at my full-time job; the damn thing is quick and cheap. The interesting part is wrapping your head around the concepts. If you need things up quickly, EMR is the way to go. It spawns a number of EC2 instances; by default you have 1 master and 2 core nodes. The three of them are m3.large nodes, which run you 7 cents per hour. To run one year's worth of data, which is about 1.1 billion records from the database, it took 50 minutes from cluster spin-up to completion and shutdown of the cluster. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-05 23:41, Dieter De Witte wrote: You can install Hadoop on Amazon EC2 instances and use the free tier for new members, but you can also use Amazon EMR, which is not free but is up and running in a couple of seconds... 2015-03-05 23:28 GMT+01:00 Krish Donald gotomyp...@gmail.com: Hi, I am tired of setting up a Hadoop cluster on my laptop, which has 8GB RAM. I tried 2GB for the namenode and 1GB each for 3 datanodes, so in total I was using 5GB. And I was using only very basic Hadoop services. But it is so slow that I am not able to do anything on it. Hence I would like to try the AWS service now. Can anybody please help me with which configuration I should use without paying at all? What tips do you have for AWS? Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
The only limitations I know of are how many nodes you can have and how many instances of that particular size the underlying host can support. You can load Hive in EMR, and any other features of the cluster are managed at the master node level, as you have SSH access there. What are the advantages of 2.6 over 2.4, for example? I just feel you guys are reinventing the wheel when Amazon already caters for Hadoop, granted it might not be 2.6. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 07:31, Alexander Pivovarov wrote: I think EMR has its own limitations, e.g. I want to set up hadoop 2.6.0 with kerberos + hive-1.2.0 to test my hive patch. How can EMR help me? It supports hadoop only up to 2.4.0 (not even 2.4.1) http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html [1] On Thu, Mar 5, 2015 at 9:51 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: Hi guys, I know you want to keep costs down, but why go through all the effort of setting up EC2 instances when, if you deploy EMR, it provisions and sets up the EC2 instances for you? All configuration for the entire cluster is then done on the master node of that cluster, and setting up additional software is all done through the EMR console. We were doing some geospatial calculations and we loaded a 3rd-party jar file called esri into the EMR cluster. I then had to pass a small bootstrap action (script) to have it distribute esri to the entire cluster. Why are you guys reinventing the wheel? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 03:35, Alexander Pivovarov wrote: I found the following solution to this problem. I registered 2 subdomains (public and local) for each computer on https://freedns.afraid.org/subdomain/ [2] e.g.
myhadoop-nn.crabdance.com [3] myhadoop-nn-local.crabdance.com [4] Then I added a cron job which sends HTTP requests to update the public and local IPs on the freedns server. Hint: the public IP is detected automatically; the IP address for the local name can be set using the request parameter address=10.x.x.x (don't forget to escape ). As a result my nn computer has 2 DNS names with the currently assigned IP addresses, e.g. myhadoop-nn.crabdance.com [3] 54.203.181.177, myhadoop-nn-local.crabdance.com [4] 10.220.149.103. In the hadoop configuration I can use the local machine names; to access my cluster from outside of AWS I can use the public names. Just curious if AWS provides an easier way to name EC2 computers? On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I don't know how you would do that, to be honest. With EMR you have distinct master, core and task nodes. If you need to change configuration, you just SSH into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to AWS EC2 computers? I guess a computer needs a static hostname and DNS name before it can be used in a hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what Amazon calls a transient cluster, meaning you fire it up, run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also?
I would like to set up a Hadoop cluster using Cloudera Manager and then would like to do the following: setup Kerberos, setup federation, setup monitoring, setup HA/DR, backup and recovery, authorization using Sentry, backup and recovery of individual components, performance tuning, upgrade of CDH, upgrade of CM, Hue user administration, Spark, Solr. Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as Java can be pretty RAM hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS; is it enough for setting up a Hadoop cluster? I would like to use the free service as of now. Please advise. Thanks Krish Links: -- [1] http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan
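The freedns update scheme described in this thread can be automated with a cron entry along these lines. This is a sketch only: the update tokens are placeholders, and the exact URL format should be checked against freedns.afraid.org's own dynamic-DNS documentation.

```shell
# Crontab fragment: refresh both DNS names every 5 minutes.
# (PUBLIC_TOKEN and LOCAL_TOKEN are hypothetical per-subdomain update keys.)
*/5 * * * * curl -s "https://freedns.afraid.org/dynamic/update.php?PUBLIC_TOKEN"
*/5 * * * * curl -s "https://freedns.afraid.org/dynamic/update.php?LOCAL_TOKEN&address=10.220.149.103"
```

The first entry relies on the server auto-detecting the caller's public IP; the second pins the local name to the instance's private address via the `address=` parameter mentioned above.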
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
I don't know how you would do that, to be honest. With EMR you have distinct master, core and task nodes. If you need to change configuration, you just SSH into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to AWS EC2 computers? I guess a computer needs a static hostname and DNS name before it can be used in a hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what Amazon calls a transient cluster, meaning you fire it up, run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also? I would like to set up a Hadoop cluster using Cloudera Manager and then would like to do the following: setup Kerberos, setup federation, setup monitoring, setup HA/DR, backup and recovery, authorization using Sentry, backup and recovery of individual components, performance tuning, upgrade of CDH, upgrade of CM, Hue user administration, Spark, Solr. Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as Java can be pretty RAM hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes.
t2.micro on AWS; is it enough for setting up a Hadoop cluster? I would like to use the free service as of now. Please advise. Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
Hi guys, I know you want to keep costs down, but why go through all the effort of setting up EC2 instances when, if you deploy EMR, it provisions and sets up the EC2 instances for you? All configuration for the entire cluster is then done on the master node of that cluster, and setting up additional software is all done through the EMR console. We were doing some geospatial calculations and we loaded a 3rd-party jar file called esri into the EMR cluster. I then had to pass a small bootstrap action (script) to have it distribute esri to the entire cluster. Why are you guys reinventing the wheel? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 03:35, Alexander Pivovarov wrote: I found the following solution to this problem. I registered 2 subdomains (public and local) for each computer on https://freedns.afraid.org/subdomain/ [1] e.g. myhadoop-nn.crabdance.com [2] myhadoop-nn-local.crabdance.com [3] Then I added a cron job which sends HTTP requests to update the public and local IPs on the freedns server. Hint: the public IP is detected automatically; the IP address for the local name can be set using the request parameter address=10.x.x.x (don't forget to escape ). As a result my nn computer has 2 DNS names with the currently assigned IP addresses, e.g. myhadoop-nn.crabdance.com [2] 54.203.181.177, myhadoop-nn-local.crabdance.com [3] 10.220.149.103. In the hadoop configuration I can use the local machine names; to access my cluster from outside of AWS I can use the public names. Just curious if AWS provides an easier way to name EC2 computers? On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I don't know how you would do that, to be honest. With EMR you have distinct master, core and task nodes. If you need to change configuration, you just SSH into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to AWS EC2 computers?
I guess a computer needs a static hostname and DNS name before it can be used in a hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what Amazon calls a transient cluster, meaning you fire it up, run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also? I would like to set up a Hadoop cluster using Cloudera Manager and then would like to do the following: setup Kerberos, setup federation, setup monitoring, setup HA/DR, backup and recovery, authorization using Sentry, backup and recovery of individual components, performance tuning, upgrade of CDH, upgrade of CM, Hue user administration, Spark, Solr. Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as Java can be pretty RAM hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS; is it enough for setting up a Hadoop cluster? I would like to use the free service as of now. Please advise. Thanks Krish Links: -- [1] https://freedns.afraid.org/subdomain/ [2] http://myhadoop-nn.crabdance.com [3] http://myhadoop-nn-local.crabdance.com
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
When I started with EMR it was a lot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is whether you need this cluster on all the time, or whether this is going to be what Amazon calls a transient cluster, meaning you fire it up, run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore the EMR option also. Can you please let me know the configuration which you used? Can you please recommend one for me also? I would like to set up a Hadoop cluster using Cloudera Manager and then would like to do the following: setup Kerberos, setup federation, setup monitoring, setup HA/DR, backup and recovery, authorization using Sentry, backup and recovery of individual components, performance tuning, upgrade of CDH, upgrade of CM, Hue user administration, Spark, Solr. Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish, EMR won't cost you much; with all the testing and data we ran through the test systems, as well as the large amount of data when everything was ready, we paid about 15.00 USD. I honestly do not think that the specs there would be enough, as Java can be pretty RAM hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. t2.micro on AWS; is it enough for setting up a Hadoop cluster? I would like to use the free service as of now. Please advise. Thanks Krish
changing log verbosity
How does one go about changing the log verbosity in Hadoop? Which configuration file should I be looking at? -- Regards, Jonathan Aquilina Founder Eagle Eye T
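For the stock Apache Hadoop distribution, log verbosity is typically controlled through log4j. A minimal sketch of the relevant file, with the caveat that exact property names can vary between Hadoop versions:

```properties
# etc/hadoop/log4j.properties -- raise or lower the root logger level
hadoop.root.logger=WARN,console

# Or per subsystem, e.g. quiet down just the datanode classes:
log4j.logger.org.apache.hadoop.hdfs.server.datanode=ERROR
```

For a one-off change without editing files, many versions also honor the environment variable form, e.g. `HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls /`.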
mssql bulk copy dat files
Can Hadoop process .dat files that are generated by MS SQL bulk copy? -- Regards, Jonathan Aquilina Founder Eagle Eye T
Re: mssql bulk copy dat files
We are using sqlcmd at the moment; I was just curious, though, as bulk copy can move rows into and out of a DB via files really quickly. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-23 12:07, Alexander Alten-Lorenz wrote: Not by default. But you can use Sqoop to offload the DBs into something-delimited text. http://mapredit.blogspot.de/2011/10/sqoop-and-microsoft-sql-server.html [1] http://www.microsoft.com/en-us/download/details.aspx?id=27584 [2] On 23 Feb 2015, at 12:01, Jonathan Aquilina jaquil...@eagleeyet.net wrote: Can Hadoop process .dat files that are generated by MS SQL bulk copy? -- Regards, Jonathan Aquilina Founder Eagle Eye T Links: -- [1] http://mapredit.blogspot.de/2011/10/sqoop-and-microsoft-sql-server.html [2] http://www.microsoft.com/en-us/download/details.aspx?id=27584
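Following the Sqoop suggestion above, an import from SQL Server into delimited text on HDFS might look like this sketch. The host, database, table and credentials are placeholders, and the SQL Server JDBC driver jar must be on Sqoop's classpath.

```shell
# Offload a SQL Server table into tab-delimited text files on HDFS.
# (dbhost, mydb, sales and the credentials below are hypothetical.)
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username loader --password-file /user/loader/.pw \
  --table sales \
  --fields-terminated-by '\t' \
  --target-dir /data/sales
```

Compared with shipping bcp-generated .dat files around, this lets Hadoop pull rows straight from the database in parallel and land them in a format MapReduce jobs can read directly.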
Re: recombining split files after data is processed
Thanks Alex. Where would that command be placed: in a mapper, in a reducer, or run as a standalone command? Here at work we are looking to use Amazon EMR to do our number crunching, and we have access to the master node but not really the rest of the cluster. Can this be added as a step to be run after the initial processing? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-23 08:05, Alexander Alten-Lorenz wrote: Hi, You can use a single reducer (http://wiki.apache.org/hadoop/HowManyMapsAndReduces [1]) for smaller datasets, or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name BR, Alex On 23 Feb 2015, at 08:00, Jonathan Aquilina jaquil...@eagleeyet.net wrote: Hey all, I understand that the purpose of splitting files is to distribute the data to multiple core and task nodes in a cluster. My question is: after the output is complete, is there a way to combine all the parts into a single file? -- Regards, Jonathan Aquilina Founder Eagle Eye T Links: -- [1] http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Re: How can I get the memory usage in Namenode and Datanode?
Where I am working, we are using a transient (temporary) cluster on Amazon EMR. When I was reading up on how things work, the suggestion for monitoring was to use Ganglia to track memory usage, network usage, etc. That way, depending on how things are set up, be it using an Amazon S3 bucket for example and pulling data directly into the cluster, the network link will always be saturated to ensure a constant flow of data. What I am suggesting is potentially looking at Ganglia. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-22 07:42, Fang Zhou wrote: Hi Jonathan, Thank you. The number of files does impact the memory usage in the Namenode. I just want to get the real memory usage situation in the Namenode. The memory used in the heap always changes, so I have no idea which value is the right one. Thanks, Tim On Feb 22, 2015, at 12:22 AM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I am rather new to hadoop, but wouldn't the difference potentially be in how the files are split in terms of size? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-21 21:54, Fang Zhou wrote: Hi All, I want to test the memory usage on the Namenode and Datanode. I tried to use jmap, jstat, /proc/pid/stat, top, ps aux, and the Hadoop web interface to check the memory. The values I get from them are different. I also found that the memory always changes periodically. This is the first thing that confused me. I thought the more files stored in the Namenode, the more memory usage in the Namenode and Datanode. I also thought the memory used in the Namenode should be larger than the memory used in each Datanode. However, some results show my ideas are wrong. For example, I tested the memory usage of the Namenode with 6000 and 1000 files. The 6000-file memory is less than the 1000-file memory in jmap's results. I also found that the memory usage in the Datanode is larger than the memory used in the Namenode. I really don't know how to get the memory usage in the Namenode and Datanode. Can anyone give me some advice?
Thanks, Tim
Re: How can I get the memory usage in Namenode and Datanode?
Hi Tim, Not sure if this might be of any use in terms of improving overall cluster performance for you, but I hope it gives you and others some ideas. https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-22 07:57, Tim Chou wrote: Hi Jonathan, Very useful information. I will look at Ganglia. However, I do not have administrative privileges for the cluster, so I don't know if I can install Ganglia on it. Thank you for your information. Best, Tim 2015-02-22 0:53 GMT-06:00 Jonathan Aquilina jaquil...@eagleeyet.net: Where I am working, we are using a transient (temporary) cluster on Amazon EMR. When I was reading up on how things work, the suggestion for monitoring was to use Ganglia to track memory usage, network usage, etc. That way, depending on how things are set up, be it using an Amazon S3 bucket for example and pulling data directly into the cluster, the network link will always be saturated to ensure a constant flow of data. What I am suggesting is potentially looking at Ganglia. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-22 07:42, Fang Zhou wrote: Hi Jonathan, Thank you. The number of files does impact the memory usage in the Namenode. I just want to get the real memory usage situation in the Namenode. The memory used in the heap always changes, so I have no idea which value is the right one. Thanks, Tim On Feb 22, 2015, at 12:22 AM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I am rather new to hadoop, but wouldn't the difference potentially be in how the files are split in terms of size? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-21 21:54, Fang Zhou wrote: Hi All, I want to test the memory usage on the Namenode and Datanode. I tried to use jmap, jstat, /proc/pid/stat, top, ps aux, and the Hadoop web interface to check the memory. The values I get from them are different.
I also found that the memory always changes periodically. This is the first thing that confused me. I thought the more files stored in the Namenode, the more memory usage in the Namenode and Datanode. I also thought the memory used in the Namenode should be larger than the memory used in each Datanode. However, some results show my ideas are wrong. For example, I tested the memory usage of the Namenode with 6000 and 1000 files. The 6000-file memory is less than the 1000-file memory in jmap's results. I also found that the memory usage in the Datanode is larger than the memory used in the Namenode. I really don't know how to get the memory usage in the Namenode and Datanode. Can anyone give me some advice? Thanks, Tim
Re: How can I get the memory usage in Namenode and Datanode?
I am rather new to hadoop, but wouldn't the difference potentially be in how the files are split in terms of size? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-02-21 21:54, Fang Zhou wrote: Hi All, I want to test the memory usage on the Namenode and Datanode. I tried to use jmap, jstat, /proc/pid/stat, top, ps aux, and the Hadoop web interface to check the memory. The values I get from them are different. I also found that the memory always changes periodically. This is the first thing that confused me. I thought the more files stored in the Namenode, the more memory usage in the Namenode and Datanode. I also thought the memory used in the Namenode should be larger than the memory used in each Datanode. However, some results show my ideas are wrong. For example, I tested the memory usage of the Namenode with 6000 and 1000 files. The 6000-file memory is less than the 1000-file memory in jmap's results. I also found that the memory usage in the Datanode is larger than the memory used in the Namenode. I really don't know how to get the memory usage in the Namenode and Datanode. Can anyone give me some advice? Thanks, Tim
writing mappers and reducers question
Hey guys, is it safe to assume that one would need a single-node setup to be able to write mappers and reducers for Hadoop? -- Regards, Jonathan Aquilina Founder Eagle Eye T
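Not necessarily: with Hadoop Streaming, mappers and reducers are just programs that read stdin and write stdout, so the logic can be written and tested locally with an ordinary shell pipe before it ever touches a cluster (single-node or otherwise). A minimal word-count sketch in Python, where the simulated pipeline and the input lines are purely illustrative:

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit tab-separated (word, 1) pairs, one per output line: exactly
    # what a streaming mapper would print to stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    # Input must arrive sorted by key (Hadoop's shuffle guarantees this);
    # sum the counts for each distinct word.
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    # Simulate `cat input | mapper | sort | reducer` without any cluster.
    mapped = sorted(mapper(["the cat sat", "the mat"]))
    for out in reducer(mapped):
        print(out)
```

On a real cluster the same two functions would live in separate mapper and reducer scripts passed to the hadoop-streaming jar; the local `sorted(...)` stands in for the shuffle phase.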