UserGroupInformation.getLoginUser: failure to login.
Hi hadoop. I recently ran a spark job that uses the hadoop.security libraries for login (the spark context does this)... It threw an exception:

java.io.IOException: failure to login
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:700)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2181)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2181)
    at scala.Option.getOrElse(Option.scala:120)

And the root exception was:

Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

This is running in a docker container. Is there anything in particular I need to do to run such containers (e.g. is there a privileged requirement for UserGroupInformation or anything like that?)...

-- jay vyas
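One common cause of this particular LoginException, offered as an assumption to check rather than a confirmed diagnosis for this thread: hadoop's Unix login path asks the OS for the name of the current uid, and a docker container started with a uid that has no /etc/passwd entry gets nothing back, which surfaces as "invalid null input: name". The same lookup can be sketched in plain Python (illustrative only, not Hadoop code):

```python
import os
import pwd

def current_os_user():
    """Return the login name for the current uid, or None when the uid
    has no passwd entry - the condition that makes the Unix login module
    fail with "invalid null input: name"."""
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        return None
```

If the equivalent lookup fails inside your container, start the container with a uid that exists in the image's /etc/passwd (for example via `docker run -u` with a known user), or add a matching passwd entry at image build time; a privileged container is not required for login itself.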
Re: Use of hadoop in AWS - Build it from scratch on an EC2 instance / MapR hadoop distribution / Amazon hadoop distribution
Also, ASF BigTop packages hadoop for you. You can always grab our releases: http://www.apache.org/dist/bigtop/bigtop-1.0.0/repos/ We package pig, spark, hive, hbase, and more. It's not hard to set up a bigtop build server, as we have dockerized the packaging of both RPM and Deb packages, and you can experiment locally with this stuff using the vagrant recipes.

On Mon, Oct 19, 2015 at 6:26 AM, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:
> Hey Jose
>
> Have you looked at Amazon EMR (Elastic MapReduce)? Where I work we have used it, and when you provision the EMR instance you can use custom jars like the one you mentioned.
>
> In terms of storage you can use hdfs if you are going to keep a persistent cluster. If not, you can store your data in an Amazon S3 bucket.
>
> Documentation for EMR is really good. At the time when we did this (the beginning of this year), they supported Hadoop 2.6.
>
> In my honest opinion you are giving yourself a lot of extra work for nothing just to get going in Hadoop. Try out EMR with a temporary cluster and go from there. I managed to tool up and learn how to work with EMR in a week.
>
> Sent from my iPhone
>
> On 19 Oct 2015, at 02:10, José Luis Larroque <larroques...@gmail.com> wrote:
>
> Thanks for your answer Anders.
>
> - The amount of data that I'm going to manipulate is about the size of Wikipedia (I will use a dump).
> - I already have the basics of hadoop (I hope); I have a local multinode cluster set up and I have already executed some algorithms.
> - Because the amount of data is significant, I believe that I should use several nodes.
>
> Maybe another option to consider is that I'm running Giraph on top of the selected hadoop distribution/EC2.
>
> Bye!
> Jose > > 2015-10-18 18:53 GMT-03:00 Anders Nielsen <anders.shinde.niel...@gmail.com > >: > >> Dear Jose, >> >> It will help people answer your question if you specify your goals : >> >> -If you do it to learn how to USE a running Hadoop then go for one of the >> prebuilt distributions (Amazon or MapR) >> -If you do it to learn more about the setting up and administrating >> Hadoop then you are better off setting everything up from scratch on EC2. >> -Do you need to run on many nodes or just a 1 node to test some Mapreduce >> scripts on a small data set? >> >> Regards, >> >> Anders >> >> >> >> >> On Sun, Oct 18, 2015 at 10:03 PM, José Luis Larroque < >> larroques...@gmail.com> wrote: >> >>> Hi all ! >>> >>> I started to use hadoop with aws, and a big question appears in front of >>> me! >>> >>> I'm using a MapR distribution, for hadoop 2.4.0 in AWS. I already tried >>> some trivial examples, and before moving forward i have one question. >>> >>> What is the better option for using Hadoop on AWS? >>> - Build it from scratch on a EC2 instance >>> - Use MapR distribution of Hadoop >>> - Use Amazon distribution of Hadoop >>> >>> Sorry if my question is too broad. >>> >>> Bye! >>> Jose >>> >>> >>> >>> >>> >> > -- jay vyas
Re: spark
For a start, compare spark's word count with mapreduce's word count. Then compare spark sql with hive. If you get that far, for the final exercise, find out for yourself by running bigpetstore-mapreduce and bigpetstore-spark side by side :). They are two similar applications, curated in Apache bigtop, which generate data sets and process them for ETL and product recommendations.

On Aug 17, 2015, at 6:33 PM, Publius t...@yahoo.com wrote:

Hello, what is the difference between Hadoop and Spark? How is Spark better?
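To make the word count comparison concrete without a cluster, here is the same computation in plain Python in both shapes. This is a conceptual sketch only, not the Spark or MapReduce APIs: the first version spells out the map / shuffle / reduce phases, the second reads as one chained pipeline, which is the main ergonomic difference the exercise above will show you.

```python
from collections import defaultdict
from itertools import groupby

lines = ["the quick fox", "the lazy dog"]

# MapReduce shape: map emits (word, 1) pairs; the framework sorts and
# shuffles by key; reduce sums the values for each key.
mapped = [(w, 1) for line in lines for w in line.split()]
mr_counts = {
    word: sum(v for _, v in pairs)
    for word, pairs in groupby(sorted(mapped), key=lambda kv: kv[0])
}

# Spark shape: the same logic as one chained pipeline
# (flatMap -> map -> reduceByKey), approximated here with a dict fold.
spark_counts = defaultdict(int)
for word in (w for line in lines for w in line.split()):
    spark_counts[word] += 1
```

Both produce identical counts; the difference that matters at scale is execution (materialized intermediate shuffle files versus a lazily evaluated, largely in-memory DAG).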
Re: How to test DFS?
You could just list the file contents in your hadoop data/ directories on the individual nodes... somewhere in there the file blocks will be floating around.

On Tue, May 26, 2015 at 4:59 PM, Caesar Samsi caesarsa...@mac.com wrote:

Hello, how would I go about confirming that a file has been distributed successfully to all datanodes? I would like to demonstrate this capability in a short briefing for my colleagues. Can I access the file from the datanode itself (to date I can only access the files from the master node, not the slaves)? Thank you, Caesar.

-- jay vyas
Re: What skills to Learn to become Hadoop Admin
Setting up vendor distros is a great first step.

1) Running TeraSort and benchmarking is a good step. You can also run larger, full-stack hadoop applications like bigpetstore, which we curate here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/.

2) Write some mapreduce or spark jobs which write data to a persistent transactional store, such as SOLR or HBase. This is a hugely important part of real world hadoop administration, where you will encounter problems like running out of memory, possibly CPU oversubscription on some nodes, and so on.

3) Now, did you want to go deeper into the build/setup/deployment of hadoop? It's worth trying to build/deploy/debug hadoop ecosystem components from scratch by setting up Apache BigTop, which packages RPM/DEB artifacts and provides puppet recipes for distributions. It's the original root of both the cloudera and hortonworks distributions, so you will learn something about both by playing with it. We have some exercises you can use to guide you and get started: https://cwiki.apache.org/confluence/display/BIGTOP/BigTop+U%3A+Exersizes . Feel free to join the mailing list for questions.

On Sat, Mar 7, 2015 at 9:32 AM, max scalf oracle.bl...@gmail.com wrote:

Krish, I don't mean to hijack your mail here, but I wanted to find out how/what you did for the below portion, as I am trying to go down your path as well. I was able to get a 4-5 node cluster going using ambari and cdh, and now I want to take it to the next level. What have you done for the below? "I have done a web log integration using flume and twitter sentiment analysis."

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

Hi, I would like to enter the Big Data world as a Hadoop Admin, and I have set up a 7 node cluster using Ambari, Cloudera Manager and Apache Hadoop. I have installed services like hive, oozie, zookeeper etc. I have done a web log integration using flume and twitter sentiment analysis.
I wanted to understand: what other skills should I learn? Thanks, Krish

-- jay vyas
Re: Interview Questions asked for Hadoop Admin
Hi Krish. I'm going to interpret this as "What is a real world hadoop project workload I can run to study for my upcoming job interview?" :) ...

You could look here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/bigpetstore-mapreduce If you understand that application, you will do just fine :). We use custom input formats to generate arbitrarily large data sets, pig processing, and mahout's recommender, all in the bigpetstore-mapreduce implementation. Also, it's all unit tested (the jobs themselves), so you can run and inspect changes locally and get a feel for maintaining a real world hadoop app. Running it and modifying the data generation and other phases will be a great form of preparation for you, and you can run it all by spinning up VMs in apache bigtop.

On Thu, Feb 12, 2015 at 1:03 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi, does anybody have interview questions that were asked during their interview for a Hadoop admin role? I found a few on the internet, but if somebody who has attended an interview can give us an idea, that would be great. Thanks, Krish

-- jay vyas
Re: Home for Apache Big Data Solutions?
Bigtop.. yup! Mr Asanjar: why don't you post an email about what you're doing on the Apache bigtop list? We'd love to hear from you. There could possibly be some overlap, and our goal is to plumb the hadoop ecosystem as well.

On Feb 9, 2015, at 4:41 PM, Artem Ervits artemerv...@gmail.com wrote:

I believe Apache Bigtop is what you're looking for. Artem Ervits

On Feb 9, 2015 8:15 AM, Jean-Baptiste Onofré j...@nanthrax.net wrote:

Hi Amir, thanks for the update. Please let me know if you need some help on the proposal and to qualify your ideas. Regards, JB

On 02/09/2015 02:05 PM, MrAsanjar . wrote:

Hi Chris, thanks for the information, will get on it ...

Hi JB, glad that you are familiar with Juju. However, my personal goal is not to promote any tool but to take the next step, which is to build a community for apache big data solutions.

"do you already have a kind of proposal/description of your projects?" Working on it :) I got the idea while flying back from South Africa on Saturday. During my trip I noticed most of the communities spending their precious resources on solution plumbing, without much emphasis on solution best practices, due to the lack of expertise. By the time a Big Data solution framework becomes operational, funding has diminished enough to limit solution activity (i.e. data analytic payload development). I am sure we could find similar scenarios with other institutions and SMBs (small and medium-size businesses) anywhere. In a nutshell, my goals are as follows:
1) Make Big Data solutions available to everyone
2) Encapsulate the best practices
3) All orchestration tools are welcomed - some solutions could have a hybrid tooling model
4) Enforce automated testing and quality control
5) Share analytic payloads (i.e. mapreduce apps, storm topologies, Pig scripts, ...)

"Is it like distribution, or tooling?" Good question. I envision having a distribution model, as it has dependencies on Apache hadoop project distributions.

"What's the current license?"
Charms/Bundles are moving to the Apache 2.0 license, target date 2/27.

Regards, Amir Sanjar, Big Data Solution Lead, Canonical

On Sun, Feb 8, 2015 at 10:46 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Dear Amir, thank you for your interest in contributing these projects to the ASF! Sincerely appreciate it. My suggestion would be to look into the Apache Incubator, which is the home for incoming projects at the ASF. The TL;DR answer is:
1. You'll need to create a proposal for each project that you would like to bring in, using: http://incubator.apache.org/guides/proposal.html
2. You should put your proposal up on a public wiki for each project: http://wiki.apache.org/incubator/ Create a new page, e.g. YourProjectProposal, which would in turn become http://wiki.apache.org/incubator/YouProjectProposal (you will need to request permissions to add the page on the wiki).
3. Recruit at least 3 IPMC/ASF members to mentor your project: http://people.apache.org/committers-by-project.html#incubator-pmc http://people.apache.org/committers-by-project.html#member
4. Submit your proposal for consideration at the Incubator
5. Enjoy!

Cheers and good luck. Cheers, Chris

Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA

-----Original Message-----
From: MrAsanjar .
afsan...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Sunday, February 8, 2015 at 8:36 AM
To: user@hadoop.apache.org, dev-i...@bigtop.apache.org
Subject: Home for Apache Big Data Solutions?

Hi all, my name is Amir Sanjar, Big Data Solution Development Lead at Canonical. My team has been developing various Big Data solutions built on top of Apache Hadoop projects (i.e. Hadoop, Hive, Pig, ..). We would like to contribute these pure open source solutions
Re: Any working VM of Apache Hadoop ?
Also, BigTop has a very flexible vagrant infrastructure: https://github.com/apache/bigtop/tree/master/bigtop-deploy/vm/vagrant-puppet

On Jan 18, 2015, at 3:37 PM, Andre Kelpe ake...@concurrentinc.com wrote:

Try our vagrant setup: https://github.com/Cascading/vagrant-cascading-hadoop-cluster - André

On Sat, Jan 17, 2015 at 10:07 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi, I am looking for a working VM of Apache Hadoop, not the cloudera or Hortonworks VMs. If anybody has one and can share it, that would be great. Thanks, Krish

-- André Kelpe an...@concurrentinc.com http://concurrentinc.com
Re: HDFS-based database for Big and Small data?
1) Phoenix can be used on top of hbase for richer querying semantics. That combo might be good for complex workloads.

2) SolrCloud also might fit the bill here. Solr can be backed by any Hadoop compatible FS, including HDFS; it's resilient by that mechanism, and it offers sophisticated indexing and searching options, although the querying is limited...

On Jan 3, 2015, at 9:39 AM, Wilm Schumacher wilm.schumac...@gmail.com wrote:

On 03.01.2015 at 08:44, Alec Taylor wrote: "Want to replace MongoDB with an HDFS-based database in my architecture. Note that this is a new system, not a rewrite of an old one. Are there any open-source fast read/write databases built on HDFS"

yeah. As Ted wrote: hbase.

"with a model similar to a document-store" - well, then PERHAPS hbase isn't the right choice. What exactly do you need from the definition of a doc-store? If you e.g. rely highly on ad hoc queries or secondary indexes, then perhaps hbase could lead to some additional work for you.

"that can hold my regular business logic and enables an object model in Python? (E.g.: via Data Mapper or Active Record patterns)" - in addition to Ted's link, you could also use thrift, if this is enough control for you. Depends on your requirements.

Best wishes, Wilm
Re: New to this group.
Many demos out there are for the business community... For a demonstration of hadoop at a finer grained level (how it's deployed, packaged, installed and used), for a developer who wants to learn hadoop the hard way, I'd suggest:

1 - Getting Apache bigtop stood up on VMs, and
2 - Running the BigPetStore application, which is meant to demonstrate end to end building, testing and deployment of a hadoop batch analytics system with mapreduce, pig, and mahout.

This will also expose you to puppet, gradle, and vagrant, all in a big data app which solves real world problems like jar dependencies and multiple ecosystem components. Since BPS generates its own data, you don't waste time worrying about external data sets, Twitter credentials, etc., and can test both on your laptop and on a 100 node cluster (similar to teragen, but for the whole ecosystem). Since it features integration tests and is tested on Bigtop's hadoop distribution (which is 100% pure Apache based), it's IMO the purest learning source, not blurred with company specific downloads or branding.

Disclaimer: of course I'm biased, as I work on it... :) But we've been working hard to make bigtop easily consumable as a gateway drug to bigdata processing, and if you have a solid linux and Java background, I'm sure others would agree it's a great place to get immersed in the hadoop ecosystem.

On Jan 2, 2015, at 1:05 PM, Krish Donald gotomyp...@gmail.com wrote:

I would like to work on some case studies, like the ones I have seen on Hortonworks (twitter sentiment analysis, web log analysis etc.). If somebody can suggest other case studies which can be worked on and put on a resume later, that would be great, as I don't have real project experience.

On Fri, Jan 2, 2015 at 10:33 AM, Ted Yu yuzhih...@gmail.com wrote:

You can search for open JIRAs which are related to admin.
Here is an example query: https://issues.apache.org/jira/browse/HADOOP-9642?jql=project%20%3D%20HADOOP%20AND%20status%20%3D%20Open%20AND%20text%20~%20%22admin%22 FYI

On Fri, Jan 2, 2015 at 10:24 AM, Krish Donald gotomyp...@gmail.com wrote:

I have a fair understanding of the hadoop ecosystem... I have set up a multinode cluster using VMs on my personal laptop for Hadoop 2.0. But beyond that, I would like to work on some project to get a good hold on the subject. I basically would like to go into the Hadoop Administration side, as my background is as an RDBMS database administrator.

On Fri, Jan 2, 2015 at 10:11 AM, Wilm Schumacher wilm.schumac...@gmail.com wrote:

Hi, the standard books may be a good start. I liked the following:
the definitive guide: http://www.amazon.de/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
hadoop in action: http://www.manning.com/lam2/
hadoop in practice: http://www.manning.com/holmes2/
A list is here: http://wiki.apache.org/hadoop/Books
Hope this helps. Best wishes, Wilm

On 02.01.2015 at 19:02, Krish Donald wrote:

Hi, I am new to this group and hadoop. Please help me learn hadoop and suggest some self study projects. Thanks, Krish Donald
Re: hadoop / hive / pig setup directions
Hi Bhupendra,

The Apache BigTop project was born to solve the general problem of dealing with, and verifying the functionality of, the various components in the hadoop ecosystem. It also creates RPM and APT repos for installing hadoop, and puppet recipes for initializing the file system and installing components in a clear and dependency aware manner. And we have smoke tests to validate that hive, pig, and so on are all working. You should definitely consider checking it out if you're building a hadoop environment or big data stack.

The best way to get started is with the vagrant recipes, which spin up a cluster from scratch for you. Once that works, you can take the puppet code and run it on bare metal. One advantage of this approach is that you are using bits that the community tests for you, and you will avoid reinventing the wheel of writing a bunch of shell scripts for things like synchronizing config files, yum installing components across a cluster, and smoke tests.

On Dec 16, 2014, at 9:05 AM, GUPTA bhupendra1...@gmail.com wrote:

Hello all, firstly, I am a neophyte in the world of Hadoop.. So far, I have got a hadoop single node cluster running on Ubuntu; the end state of this was that the datanode and namenode servers were running. But from here, I am not sure how to proceed, in the sense: how do I get the other pieces of the hadoop ecosystem, like Hive, Pig, Hbase and maybe Ambari as well, installed, set up and running? I would appreciate access to materials which say "these are MUST HAVEs for any hadoop project". Just trying to get all the pieces together...

Regards, Bhupendra
Re: Hadoop Learning Environment
Hi Tim. I'd suggest using apache bigtop for this. BigTop integrates the hadoop ecosystem into a single upstream distribution, packages everything, and curates smoke tests and vagrant/docker recipes for deployment. Also, we curate a blueprint hadoop application (bigpetstore) which you can build yourself, easily, and run to generate, process, and visualize data across the bigdata ecosystem. You can also easily deploy bigtop onto EC2 if you want to pay for it.

On Tue, Nov 4, 2014 at 2:28 PM, Tim Dunphy bluethu...@gmail.com wrote:

Hey all, I want to set up an environment where I can teach myself hadoop. Usually the way I'll handle this is to grab a machine off the Amazon free tier and set up whatever software I want. However, I realize that Hadoop is a memory intensive, big data solution. So what I'm wondering is: would a t2.micro instance be sufficient for setting up a cluster of hadoop nodes with the intention of learning it? To keep things running longer in the free tier, I would either set up however many nodes I want and keep them stopped when I'm not actively using them, or just set up a few nodes with a few different accounts (with a different gmail address for each one.. easy enough to do). Failing that, what are some other free/cheap solutions for setting up a hadoop learning environment?

Thanks, Tim

-- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B

-- jay vyas
Re: Hadoop Learning Environment
Hi daemeon: Actually, for most folks who would want to actually use a hadoop cluster, I would think setting up bigtop is super easy! If you have issues with it, ping me and I can help you get started. Also, we have docker containers - so you don't even *need* a VM to run a 4 or 5 node hadoop cluster.

install vagrant
install VirtualBox
git clone https://github.com/apache/bigtop
cd bigtop/bigtop-deploy/vm/vagrant-puppet
vagrant up

Then vagrant destroy when you're done. This to me is easier than manually downloading an appliance, picking memory, starting the virtualbox gui, loading the appliance, etc... and also it's easy to turn the simple single node bigtop VM into a multinode one, by just modifying the Vagrantfile.

On Tue, Nov 4, 2014 at 5:32 PM, daemeon reiydelle daeme...@gmail.com wrote:

What you want as a sandbox depends on what you are trying to learn. If you are trying to learn to code in e.g. PigLatin, Sqoop, or similar, all of the suggestions (perhaps excluding BigTop due to its setup complexities) are great. Laptop? Perhaps, but laptops are really kind of infuriatingly slow (because of the hardware - you pay a price for a 30-45 watt average heating bill). A laptop is an OK place to start if it is e.g. an i5 or i7 with lots of memory. What do you think of the thought that you will pretty quickly graduate to wanting a small'ish desktop for your sandbox? A simple, single node Hadoop instance will let you learn many things. The next level of complexity comes when you are attempting to deal with data whose processing needs to be split up, so you can learn about how to split data in mapping, reduce the splits via reduce jobs, etc. For that, you could get a windows desktop box or e.g. RedHat/CentOS and use virtualization. Something like a 4 core i5 with 32gb of memory, running 3, or for some things 4, VMs. You could load e.g. hortonworks into each of the VMs and practice setting up a 3/4 way cluster.
Throw in 2-3 1tb drives off of eBay and you can have a lot of learning. *...“The race is not to the swift,nor the battle to the strong,but to those who can see it coming and jump aside.” - Hunter ThompsonDaemeon* On Tue, Nov 4, 2014 at 1:24 PM, oscar sumano osum...@gmail.com wrote: you can try the pivotal vm as well. http://pivotalhd.docs.pivotal.io/tutorial/getting-started/pivotalhd-vm.html On Tue, Nov 4, 2014 at 3:13 PM, Leonid Fedotov lfedo...@hortonworks.com wrote: Tim, download Sandbox from http://hortonworks/com You will have everything needed in a small VM instance which will run on your home desktop. *Thank you!* *Sincerely,* *Leonid Fedotov* Systems Architect - Professional Services lfedo...@hortonworks.com office: +1 855 846 7866 ext 292 mobile: +1 650 430 1673 On Tue, Nov 4, 2014 at 11:28 AM, Tim Dunphy bluethu...@gmail.com wrote: Hey all, I want to setup an environment where I can teach myself hadoop. Usually the way I'll handle this is to grab a machine off the Amazon free tier and setup whatever software I want. However I realize that Hadoop is a memory intensive, big data solution. So what I'm wondering is, would a t2.micro instance be sufficient for setting up a cluster of hadoop nodes with the intention of learning it? To keep things running longer in the free tier I would either setup however many nodes as I want and keep them stopped when I'm not actively using them. Or just setup a few nodes with a few different accounts (with a different gmail address for each one.. easy enough to do). Failing that, what are some other free/cheap solutions for setting up a hadoop learning environment? Thanks, Tim -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. 
-- jay vyas
Re: TestDFSIO with FS other than defaultFS
Hi jeff. Wrong fs means that your configuration doesn't know how to bind ofs to the OrangeFS file system class. You can debug the configuration using fs.dumpConfiguration(), and you will likely see references to hdfs in there. By the way, have you tried our bigtop hcfs tests yet? We now support over 100 Hadoop file system compatibility tests... You can see a good sample of what parameters should be set for a hcfs implementation here: https://github.com/gluster/glusterfs-hadoop/blob/master/conf/core-site.xml On Oct 2, 2014, at 12:42 PM, Jeffrey Denton den...@clemson.edu wrote: Hello all, I'm trying to run TestDFSIO using a different file system other than the configured defaultFS and it doesn't work for me: $ hadoop org.apache.hadoop.fs.TestDFSIO -Dtest.build.data=ofs://test/user/$USER/TestDFSIO -write -nrFiles 1 -fileSize 10240 14/10/02 11:24:19 INFO fs.TestDFSIO: TestDFSIO.1.7 14/10/02 11:24:19 INFO fs.TestDFSIO: nrFiles = 1 14/10/02 11:24:19 INFO fs.TestDFSIO: nrBytes (MB) = 10240.0 14/10/02 11:24:19 INFO fs.TestDFSIO: bufferSize = 100 14/10/02 11:24:19 INFO fs.TestDFSIO: baseDir = ofs://test/user/denton/TestDFSIO 14/10/02 11:24:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/02 11:24:20 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
14/10/02 11:24:20 INFO fs.TestDFSIO: creating control file: 10737418240 bytes, 1 files java.lang.IllegalArgumentException: Wrong FS: ofs://test/user/denton/TestDFSIO/io_control, expected: hdfs://dsci at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:191) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:595) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:591) at org.apache.hadoop.fs.TestDFSIO.createControlFile(TestDFSIO.java:290) at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:751) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:650) At Clemson University, we're running HDP-2.1 (Hadoop 2.4.0.2.1) on 16 data nodes and 3 separate master nodes for the resource manager and two namenodes; however, for this test, the data nodes are really being used to run the map tasks with job output being written to 16 separate OrangeFS servers. Ideally, we would like the 16 HDFS data nodes and two namenodes to be the defaultFS, but would also like the capability to run jobs using other OrangeFS installations. The above error does not occur when OrangeFS is configured to be the defaultFS. Also, we have no problems running teragen/terasort/teravalidate when OrangeFS IS NOT the defaultFS. So, is it possible to run TestDFSIO using a FS other than the defaultFS? If you're interested in the OrangeFS classes, they can be found here: I have not yet run any of the FS tests released with 2.5.1 but hope to soon. 
Regards, Jeff Denton OrangeFS Developer Clemson University den...@clemson.edu
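For reference, the scheme-to-class binding jay describes in the reply above looks roughly like this in core-site.xml. The property name follows Hadoop's fs.&lt;scheme&gt;.impl convention; the class name below is a placeholder, not taken from this thread, so substitute the actual OrangeFS implementation class from your jar:

```xml
<!-- core-site.xml: bind the ofs:// scheme to a FileSystem class so that
     URIs like ofs://test/... resolve instead of falling through to the
     defaultFS. Class name is illustrative only. -->
<property>
  <name>fs.ofs.impl</name>
  <value>org.orangefs.hadoop.fs.ofs.OrangeFileSystem</value>
</property>
```

With the binding in place, overriding the default per invocation (e.g. passing -Dfs.defaultFS=ofs://... on the TestDFSIO command line) is worth trying, since TestDFSIO creates its control files through the default FileSystem rather than through test.build.data alone.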
Re:
See https://wiki.apache.org/hadoop/HCFS/

YES, YARN is written to the FileSystem interface. It works on S3FileSystem and GlusterFileSystem and any other HCFS. We have run, and continue to run, the many tests in apache bigtop's test suite against our hadoop clusters running on alternative file system implementations, and it works.

When you say "HDFS does not support fs.AbstractFileSystem.s3.impl": that is true. If your file system is configured using HDFS, then s3 urls will not be used, ever. When you create a FileSystem object in hadoop, it reads the uri (i.e. glusterfs:///) and then finds the file system binding in your core-site.xml (i.e. fs.AbstractFileSystem.glusterfs.impl). So the URI must have a corresponding entry in the core-site.xml. As a reference implementation, you can see https://github.com/gluster/glusterfs-hadoop/blob/master/conf/core-site.xml

On Fri, Sep 26, 2014 at 10:10 AM, Naganarasimha G R (Naga) garlanaganarasi...@huawei.com wrote:

Hi All, I have the following doubts on pluggable FileSystems and YARN:
1. If all the implementations should extend FileSystem, then why is there a parallel class AbstractFileSystem, which ViewFS extends?
2. Is YARN supposed to run on any pluggable org.apache.hadoop.fs.FileSystem, like s3? If it is, then when submitting a job, on the client side YARNRunner calls FileContext.getFileContext(this.conf), which in turn calls FileContext.getAbstractFileSystem(), which throws an exception for S3. So I am not able to run a YARN job with ViewFS with S3 as a mount. And based on the code, even if I configure only S3, it is also going to fail.
3. HDFS does not support fs.AbstractFileSystem.s3.impl with some default class similar to org.apache.hadoop.fs.s3.S3FileSystem?

Regards, Naga
Huawei Technologies Co., Ltd. Mobile: +91 9980040283 Email: naganarasimh...@huawei.com http://www.huawei.com

-- jay vyas
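As a sketch of the double binding described above: a scheme needs one entry for the classic FileSystem API and a second for the newer AbstractFileSystem/FileContext API (the one YARN's client path uses). The class names below are illustrative; the linked glusterfs-hadoop core-site.xml has the authoritative values:

```xml
<!-- core-site.xml: one binding per API. Without the AbstractFileSystem
     entry, FileContext.getFileContext() cannot resolve the scheme even
     though FileSystem.get() can. Class names are illustrative. -->
<property>
  <name>fs.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.local.GlusterFs</value>
</property>
```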
Re: To Generate Test Data in HDFS (PDGF)
While on the subject, you can also use the bigpetstore application to do this, in apache bigtop. This data is well suited to hbase (semi structured, transactional, and featuring some global patterns which can make for meaningful queries and so on).

Clone apache/bigtop
cd bigtop-bigpetstore
gradle clean package # build the jar

Then follow the instructions in the README to generate as many records as you want in a distributed context. Each record is around 80 bytes, so roughly 5 x 10^10 records should be on the scale you are looking for.

On Sep 22, 2014, at 5:14 AM, arthur.hk.c...@gmail.com wrote:

Hi, I need to generate a large amount of test data (4TB) in Hadoop. Has anyone used PDGF to do so? Could you share your cookbook for PDGF in Hadoop (or HBase)? Many thanks, Arthur
Re: how to setup Kerberized Hadoop?
Once you read the docs and get a base understanding... here is a recipe you can try for a maintainable, easy to manage setup:

- Puppet-IPA (a puppet recipe for FreeIPA, for setting up kerberos realms and users)
- then layer in apache bigtop's puppet hadoop modules (for installation and setup of the hadoop cluster)
- then do the glue necessary to kerberize the existing, running hadoop services (FreeIPA will set up the kerberos realm for you, add users, and so on - all you have to do is add the kerberos security info into the core-site.xml)

On Mon, Sep 15, 2014 at 3:52 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

Hi, have you already looked at the existing documentation?
For apache: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SecureMode.html
For cloudera: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.6.0/CDH4-Security-Guide/cdh4sg_topic_3.html
Some random blogs: http://blog.godatadriven.com/kerberos-cloudera-setup.html
Regards, Shahab

On Mon, Sep 15, 2014 at 3:47 PM, Xiaohua Chen xiaohua.c...@gmail.com wrote:

Hi experts: I am new to Hadoop. We want to set up a Kerberized hadoop for testing. Can you share any guidelines or instructions on how to set up a Kerberized hadoop env? Thanks. Sophia

-- jay vyas
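The "kerberos security info" mentioned in the recipe above boils down to properties like these in core-site.xml (these two names are the standard Hadoop ones; each daemon additionally needs its own principal and keytab settings in hdfs-site.xml, yarn-site.xml, etc.):

```xml
<!-- core-site.xml: switch the cluster from simple auth to Kerberos. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```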
Re: Tez and MapReduce
Yes: as an example of running a MapReduce job followed by a Tez job, you can see our last post on this: https://blogs.apache.org/bigtop/entry/testing_apache_tez_with_apache . You can see in the bigtop/tez testing blogpost that you can easily confirm on the web UI that Tez is being used. From TezClient.java: /** * TezClient is used to submit Tez DAGs for execution. DAGs are executed via a * Tez App Master. TezClient can run the App Master in session or non-session * mode. * In non-session mode, each DAG is executed in a different App Master that * exits after the DAG execution completes. * In session mode, the TezClient creates a single instance of the App Master * and all DAGs are submitted to the same App Master. * Session mode may give better performance when a series of DAGs need to be * executed, because it enables resource re-use across those DAGs. Non-session * mode should be used when the user wants to submit a single DAG or wants to * disconnect from the cluster after submitting a set of unrelated DAGs. * If API recommendations are followed, then the choice of running in session or * non-session mode is transparent to writing the application. By changing the * session mode configuration, the same application can run in session or * non-session mode. */ On Mon, Sep 1, 2014 at 12:43 PM, Alexander Pivovarov apivova...@gmail.com wrote: e.g. in hive, to switch engines: set hive.execution.engine=mr; or set hive.execution.engine=tez; tez is faster, especially on complex queries. On Aug 31, 2014 10:33 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Can Tez and MapReduce live together and get along in the same cluster? B. -- jay vyas
Re: hadoop/yarn and task parallelization on non-hdfs filesystems
Your FileSystem implementation should provide specific tuning parameters for IO. For example, in the GlusterFileSystem, we have a buffer parameter that is typically embedded into core-site.xml: https://github.com/gluster/glusterfs-hadoop/blob/master/src/main/java/org/apache/hadoop/fs/glusterfs/GlusterVolume.java Similarly, in HDFS, there are tuning parameters that would go in hdfs-site.xml. IIRC from your stackoverflow question, the Hadoop Compatible FileSystem you are using is backed by a company of some sort, so you should contact the engineers working on the implementation about how to tune the underlying FS. Regarding mapreduce and yarn: task optimization at that level is independent of the underlying file system. There are some parameters you can specify with your job, like setting the minimum number of tasks, which can increase/decrease the total number of tasks. From some experience tuning web crawlers with this stuff, I can say that a high number will increase parallelism but might decrease availability of your cluster (and locality of individual jobs). A high number of tasks generally works well when doing something CPU- or network-intensive. On Fri, Aug 15, 2014 at 11:22 AM, java8964 java8...@hotmail.com wrote: I believe Calvin mentioned before that this parallel file system is mounted into the local file system. In this case, will Hadoop just use java.io.File as the local file system, treat them as local files, and not split the files? Just want to know the logic of hadoop handling local files. One suggestion I can think of is to split the files manually outside of hadoop. For example, generate lots of small files of 128M or 256M size. In this case, each mapper will process one small file, so you can get good utilization of your cluster, assuming you have a lot of small files. 
Yong From: ha...@cloudera.com Date: Fri, 15 Aug 2014 16:45:02 +0530 Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems To: user@hadoop.apache.org Does your non-HDFS filesystem implement a getBlockLocations API, that MR relies on to know how to split files? The API is at http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus , long, long), and MR calls it at https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L392 If not, perhaps you can enforce a manual chunking by asking MR to use custom min/max split sizes values via config properties: https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L66 On Fri, Aug 15, 2014 at 10:16 AM, Calvin iphcal...@gmail.com wrote: I've looked a bit into this problem some more, and from what another person has written, HDFS is tuned to scale appropriately [1] given the number of input splits, etc. In the case of utilizing the local filesystem (which is really a network share on a parallel filesystem), the settings might be set conservatively in order not to thrash the local disks or present a bottleneck in processing. Since this isn't a big concern, I'd rather tune the settings to efficiently utilize the local filesystem. Are there any pointers to where in the source code I could look in order to tweak such parameters? 
Thanks, Calvin [1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems On Tue, Aug 12, 2014 at 12:29 PM, Calvin iphcal...@gmail.com wrote: Hi all, I've instantiated a Hadoop 2.4.1 cluster and I've found that running MapReduce applications will parallelize differently depending on what kind of filesystem the input data is on. Using HDFS, a MapReduce job will spawn enough containers to maximize use of all available memory. For example, a 3-node cluster with 172GB of memory with each map task allocating 2GB, about 86 application containers will be created. On a filesystem that isn't HDFS (like NFS or in my use case, a parallel filesystem), a MapReduce job will only allocate a subset of available tasks (e.g., with the same 3-node cluster, about 25-40 containers are created). Since I'm using a parallel filesystem, I'm not as concerned with the bottlenecks one would find if one were to use NFS. Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml) configuration that will allow me to effectively maximize resource utilization? Thanks, Calvin -- Harsh J -- jay vyas
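To make the custom min/max split size suggestion from this thread concrete, these are ordinary Hadoop 2.x job configuration properties that can go in mapred-site.xml or be passed per-job with -D flags (the byte values below are examples only, forcing splits between 128MB and 256MB):

```xml
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>134217728</value>   <!-- 128 MB -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value>   <!-- 256 MB -->
</property>
```

Lowering the max split size yields more, smaller splits (and therefore more map tasks) even when the FileSystem reports a single large block.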
Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?
Also, consider apache bigtop. That is the apache upstream Hadoop initiative, and it comes with smoke tests + puppet recipes for setting up your own Hadoop distro from scratch. IMHO, if learning or building your own tooling around Hadoop, bigtop is ideal. If interested in purchasing support, then the vendor distros are a good gateway. On Aug 12, 2014, at 5:31 PM, Aaron Eng a...@maprtech.com wrote: On that note, 2 is also misleading/incomplete. You might want to explain which specific features you are referencing so the original poster can figure out if those features are relevant. The inverse of 2 is also true; things like consistent snapshots and full random read/write over NFS are in MapR and not in HDFS. On Tue, Aug 12, 2014 at 2:10 PM, Kai Voigt k...@123.org wrote: 3. seems a biased and incomplete statement. Cloudera's distribution CDH is fully open source. The proprietary "stuff" you refer to is most likely Cloudera Manager, an additional tool to make deployment, configuration and monitoring easy. Nobody is required to use it to run a Hadoop cluster. Kai (a Cloudera Employee) On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Hortonworks. Here is my reasoning: 1. Hortonworks is 100% open source. 2. MapR has stuff on their roadmap that Hortonworks has already accomplished and has moved on from to other things. 3. Cloudera has proprietary stuff in their stack. No. 4. Hortonworks makes training super accessible and there is a community around it. 5. Who the heck is BigInsights? (Which should tell you something.) Adaryl Bob Wakefield, MBA Principal Mass Street Analytics 913.938.6685 www.linkedin.com/in/bobwakefieldmba Twitter: @BobLovesData From: mani kandan Sent: Tuesday, August 12, 2014 3:12 PM To: user@hadoop.apache.org Subject: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode? Which distribution are you people using? Cloudera vs Hortonworks vs BigInsights? 
Re: Bench-marking Hadoop Performance
There are a lot of tests out there and it can be tough to determine what is a standard. - TeraGen/TeraSort and TestDFSIO are starting points. - Various other non-apache projects (such as YCSB or HiBench) will have good benchmarks for certain types of cases. - If looking for a more comprehensive long-term strategy, I'd suggest you ask on the bigtop mailing list, where we are building a broader community around uniform smoke testing and benchmarking of hadoop, hadoop compatible file systems, and YARN applications. On Tue, Jul 22, 2014 at 11:23 AM, Charley Newtonne cnewto...@gmail.com wrote: This is a new cluster I'm putting up and I need to get an idea of what to expect from a performance standpoint. Older docs point to GridMix and TestDFSIO. However, most of that documentation is obsolete and no longer applies to 2.4. Where can I find benchmarking docs for 2.4? What are my options? Also, I have searched safari books online, including rough cuts, but am not seeing books for the 2.4 release. If you know of a book for this release, please share. Thank you. -- jay vyas
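For the record, the TeraGen/TeraSort and TestDFSIO starting points above are invoked roughly like this on a 2.x install (a sketch only; the jar paths, row counts, and file sizes are illustrative and depend on your distribution layout):

```shell
# Generate rows, sort them, then validate the sort output
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 1000000000 /bench/tera-in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /bench/tera-in /bench/tera-out
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate /bench/tera-out /bench/tera-report

# Filesystem throughput: write then read 10 files of 1000MB each
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
```

TestDFSIO prints aggregate throughput and average IO rate per file at the end of each run, which is the number most people compare across clusters.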
Re: Hadoop 2.4 test jar files.
FYI, the FS tests have just been overhauled and I'm not sure if those jars have the latest FS tests (HADOOP-9361). For those tests it's easy to add them by building hadoop and just adding the hadoop-common and hadoop-common test jars as maven dependencies locally. On Tue, Jul 22, 2014 at 2:00 PM, Charley Newtonne cnewto...@gmail.com wrote: ..You can expand the one(s) you're interested in and run tests contained in them... How is that done? How do I know what these classes do and what arguments they take? On Tue, Jul 22, 2014 at 1:42 PM, Ted Yu yuzhih...@gmail.com wrote: These jar files contain source code for the respective hadoop modules. You can expand the one(s) you're interested in and run the tests contained in them. Cheers On Tue, Jul 22, 2014 at 9:47 AM, Charley Newtonne cnewto...@gmail.com wrote: I have spent hours trying to find out how to run these jar files. The older versions are documented on the web and in some of the books. These, however, are not. How do I know ... - The purpose of each one of these jar files. - The class to call and what it does. - The arguments to pass. /a01/hadoop/2.4.0/share/hadoop/hdfs/hadoop-hdfs-2.4.0-tests.jar /a01/hadoop/2.4.0/share/hadoop/hdfs/sources/hadoop-hdfs-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-sls-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-datajoin-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-archives-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-gridmix-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-extras-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-streaming-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-distcp-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-rumen-2.4.0-test-sources.jar -- jay vyas
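To make the "add them as maven dependencies" step concrete: after building hadoop with mvn install, the test classes are available as a test-jar artifact alongside the main jar, so a local project can depend on them like this (the version shown is illustrative):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.4.0</version>
</dependency>
<!-- the companion test jar, containing the FS contract test base classes -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.4.0</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
```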
Re: clarification on HBASE functionality
HBase is not hardcoded to hdfs: it works on any file system that implements the FileSystem interface; we've run it on glusterfs, for example. I assume some have also run it on s3 and other alternative file systems. ** However ** for best performance, the direct block IO hooks on hdfs can boost high-throughput applications. Ultimately, the hbase root directory only needs a fully qualified FileSystem URI which maps to a FileSystem class. On Jul 14, 2014, at 5:59 PM, Ted Yu yuzhih...@gmail.com wrote: Right. hbase is different from Cassandra in this regard. On Mon, Jul 14, 2014 at 2:57 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Now this is different from Cassandra, which does NOT use HDFS, correct? (Sorry. Don’t know why that needed two emails.) B. From: Ted Yu Sent: Monday, July 14, 2014 4:53 PM To: user@hadoop.apache.org Subject: Re: clarification on HBASE functionality Yes. See http://hbase.apache.org/book.html#arch.hdfs On Mon, Jul 14, 2014 at 2:52 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: HBASE uses HDFS to store its data, correct? B.
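Concretely, the "fully qualified FileSystem URI" lives in hbase-site.xml; any scheme with a registered FileSystem implementation on the classpath works. The glusterfs authority below is purely illustrative:

```xml
<!-- hbase-site.xml: point HBase at any Hadoop-compatible FileSystem -->
<property>
  <name>hbase.rootdir</name>
  <!-- e.g. hdfs://namenode:8020/hbase for a stock HDFS deployment -->
  <value>glusterfs://server:port/hbase</value>
</property>
```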
Re: Hadoop virtual machine
I really like the Cascading recipes above, thanks for sharing that! Also, we have *apache bigtop vagrant recipes* which we curate for this kind of thing, and which are really useful: you can spin up a one-node or multi-node cluster just by running the startup.sh script. These are probably the most configurable and flexible, super easy to use, and allow you maximal control over your environment. 1) git clone https://github.com/apache/bigtop 2) cd bigtop-deploy/vm/vagrant/vagrant-puppet 3) Follow the directions in the README to create your hadoop cluster. You can look into the provision script to see how you can customize exactly which components (hbase, mahout, pig, ...) come installed in your distribution. Feel free to drop a line on the bigtop mailing list if you need any help getting them up and running. On Sun, Jul 6, 2014 at 12:47 PM, Andre Kelpe ake...@concurrentinc.com wrote: We have a multi-vm or single-vm setup with apache hadoop, if you want to give that a spin: https://github.com/Cascading/vagrant-cascading-hadoop-cluster - André On Sun, Jul 6, 2014 at 9:05 AM, MrAsanjar . afsan...@gmail.com wrote: For my hadoop development and testing I use LXC (linux containers) instead of a VM, mainly due to their lightweight resource consumption. As a matter of fact, as I am typing, my ubuntu system is automatically building a 6-node hadoop cluster on my 16G laptop. If you have an Ubuntu system you can install a fully configurable Hadoop 2.2.0 single-node or multi-node cluster in less than 10 minutes. 
Here is what you need to do: 1) Install and learn Ubuntu Juju (shouldn't take an hour) - instructions: https://juju.ubuntu.com/docs/getting-started.html 2) There are two types of hadoop charms: a) single node, for hadoop development: https://jujucharms.com/?text=hadoop2-devel b) multi-node, for testing: https://jujucharms.com/?text=hadoop Let me know if you need more help. On Sun, Jul 6, 2014 at 7:59 AM, Marco Shaw marco.s...@gmail.com wrote: Note that the CDH link is for Cloudera, which only provides Hadoop for Linux. HDP has pre-built VMs for both Linux and Windows hosts. You can also search for the HDInsight emulator, which runs on Windows and is based on HDP. Marco On Jul 6, 2014, at 12:38 AM, Gavin Yue yue.yuany...@gmail.com wrote: http://hortonworks.com/products/hortonworks-sandbox/ or CDH5 http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html On Sat, Jul 5, 2014 at 11:27 PM, Manar Elkady m.elk...@fci-cu.edu.eg wrote: Hi, I am a newcomer to Hadoop, and I have read many online tutorials for setting up Hadoop on Windows using virtual machines, but all of them link to old versions of Hadoop virtual machines. Could anyone help me find a Hadoop virtual machine which includes a newer version of hadoop? Or should I do it myself from scratch? Also, any well-explained Hadoop installation tutorials and any other helpful material are appreciated. Manar, -- -- André Kelpe an...@concurrentinc.com http://concurrentinc.com -- jay vyas
Re: Hadoop with SAN
You can either use the SAN to back your datanodes, or implement a custom FileSystem over your SAN storage. Either approach has different drawbacks depending on your requirements.
Re: No job can run in YARN (Hadoop-2.2)
Sounds odd. So (1) you got a FileNotFoundException, and (2) you fixed it by commenting out memory-specific config parameters? Not sure how that would work... Any other details, or am I missing something else? On May 11, 2014, at 4:16 AM, Tao Xiao xiaotao.cs@gmail.com wrote: I'm sure this problem is caused by incorrect configuration. I commented out all the configurations regarding memory, and then jobs could run successfully. 2014-05-11 0:01 GMT+08:00 Tao Xiao xiaotao.cs@gmail.com: I installed Hadoop-2.2 on a cluster of 4 nodes, following Hadoop YARN Installation: The definitive guide. The configurations are as follows: ~/.bashrc core-site.xml hdfs-site.xml mapred-site.xml slaves yarn-site.xml I started the NameNode, DataNodes, ResourceManager and NodeManagers successfully, but no job can run successfully. For example, I ran the following job: [root@Single-Hadoop ~]# yarn jar /var/soft/apache/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 4 The output is as follows: 14/05/10 23:56:25 INFO mapreduce.Job: Task Id : attempt_1399733823963_0004_m_00_0, Status : FAILED Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) 14/05/10 23:56:25 INFO mapreduce.Job: Task Id : attempt_1399733823963_0004_m_01_0, Status : FAILED Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) ... ... 14/05/10 23:56:36 INFO mapreduce.Job: map 100% reduce 100% 14/05/10 23:56:37 INFO mapreduce.Job: Job job_1399733823963_0004 failed with state FAILED due to: Task failed task_1399733823963_0004_m_00 Job failed as tasks failed. 
failedMaps:1 failedReduces:0 14/05/10 23:56:37 INFO mapreduce.Job: Counters: 10 Job Counters Failed map tasks=7 Killed map tasks=1 Launched map tasks=8 Other local map tasks=6 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=21602 Total time spent by all reduces in occupied slots (ms)=0 Map-Reduce Framework CPU time spent (ms)=0 Physical memory (bytes) snapshot=0 Virtual memory (bytes) snapshot=0 Job Finished in 24.515 seconds java.io.FileNotFoundException: File does not exist: hdfs://Single-Hadoop.zd.com/user/root/QuasiMonteCarlo_1399737371038_1022927375/out/reduce-out at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1749) at
Yarn hangs @Scheduled
Hi folks: My yarn jobs seem to be hanging in the SCHEDULED state. I've restarted my nodemanager a few times, but no luck. What are the possible reasons that YARN job submission hangs? I know one is resource availability, but this is a fresh cluster on a VM with only one job, one NM, and one RM. 14/04/24 16:20:32 INFO ipc.Server: Auth successful for yarn@IDH1.LOCAL(auth:SIMPLE) 14/04/24 16:20:32 INFO authorize.ServiceAuthorizationManager: Authorization successful for yarn@IDH1.LOCAL (auth:KERBEROS) for protocol=interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB 14/04/24 16:20:32 INFO resourcemanager.ClientRMService: Allocated new applicationId: 4 14/04/24 16:20:33 INFO resourcemanager.ClientRMService: Application with id 4 submitted by user yarn 14/04/24 16:20:33 INFO resourcemanager.RMAuditLogger: USER=yarn IP=192.168.122.100 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: Storing application with id application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW to NEW_SAVING 14/04/24 16:20:33 INFO recovery.RMStateStore: Storing info for app: application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW_SAVING to SUBMITTED 14/04/24 16:20:33 INFO fair.FairScheduler: Accepted application application_1398370674313_0004 from user: yarn, in queue: default, currently num of applications: 4 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from SUBMITTED to ACCEPTED 14/04/24 16:20:33 INFO resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1398370674313_0004_01 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from NEW to SUBMITTED 14/04/24 16:20:33 INFO fair.FairScheduler: Added Application Attempt 
appattempt_1398370674313_0004_01 to scheduler from user: yarn 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from SUBMITTED to SCHEDULED -- Jay Vyas http://jayunit100.blogspot.com
Re: Yarn hangs @Scheduled
I fixed the issue by setting yarn.scheduler.minimum-allocation-mb=1024. I'm thinking this happens a lot in VMs where you run with low memory. If memory is too low, I think other failures will occur at runtime when you start daemons or tasks... If too high, then the tasks will hang... On Apr 24, 2014, at 5:25 PM, Vinod Kumar Vavilapalli vino...@apache.org wrote: How much memory do you see as available on the RM web page? And what are the memory requirements for this app? And is this an MR job? +Vinod Hortonworks Inc. http://hortonworks.com/ On Thu, Apr 24, 2014 at 1:23 PM, Jay Vyas jayunit...@gmail.com wrote: Hi folks: My yarn jobs seem to be hanging in the SCHEDULED state. I've restarted my nodemanager a few times, but no luck. What are the possible reasons that YARN job submission hangs? I know one is resource availability, but this is a fresh cluster on a VM with only one job, one NM, and one RM. 14/04/24 16:20:32 INFO ipc.Server: Auth successful for yarn@IDH1.LOCAL (auth:SIMPLE) 14/04/24 16:20:32 INFO authorize.ServiceAuthorizationManager: Authorization successful for yarn@IDH1.LOCAL (auth:KERBEROS) for protocol=interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB 14/04/24 16:20:32 INFO resourcemanager.ClientRMService: Allocated new applicationId: 4 14/04/24 16:20:33 INFO resourcemanager.ClientRMService: Application with id 4 submitted by user yarn 14/04/24 16:20:33 INFO resourcemanager.RMAuditLogger: USER=yarn IP=192.168.122.100 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: Storing application with id application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW to NEW_SAVING 14/04/24 16:20:33 INFO recovery.RMStateStore: Storing info for app: application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW_SAVING to SUBMITTED 
14/04/24 16:20:33 INFO fair.FairScheduler: Accepted application application_1398370674313_0004 from user: yarn, in queue: default, currently num of applications: 4 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from SUBMITTED to ACCEPTED 14/04/24 16:20:33 INFO resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1398370674313_0004_01 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from NEW to SUBMITTED 14/04/24 16:20:33 INFO fair.FairScheduler: Added Application Attempt appattempt_1398370674313_0004_01 to scheduler from user: yarn 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from SUBMITTED to SCHEDULED -- Jay Vyas http://jayunit100.blogspot.com
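For anyone hitting the same hang, the fix described in this thread corresponds to something like the following in yarn-site.xml (both property names are standard; the NodeManager capacity value is illustrative and should be sized to your VM):

```xml
<!-- smallest container the scheduler will grant -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<!-- total memory each NodeManager advertises; must be >= the minimum allocation -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
```

If the NodeManager's advertised capacity is smaller than the minimum allocation, no container can ever be granted and applications sit in SCHEDULED forever.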
Re: Strange error in Hadoop 2.2.0: FileNotFoundException: file:/tmp/hadoop-hadoop/mapred/
Is this happening in the job client, or in the mappers? On Tue, Apr 22, 2014 at 11:21 AM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Hello, I am running Hadoop 2.2.0 in single-node cluster mode. My application dies with the following strange error: Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: file:/tmp/hadoop-hadoop/mapred/local/1398179594286/part-0 (No such file or directory) This looks like the kind of file that should have been created on the fly (and then deleted). Does anyone know what this error is really a symptom of? Perhaps some permissions issue? Thank you, Natalia -- Jay Vyas http://jayunit100.blogspot.com
Re: Shuffle Error after enabling Kerberos authentication
(bump) This is a good question. I'm new to kerberos as well, and have been wondering how to prevent scenarios such as this from happening. My thought is that, since Kerberos IIRC requires a ticket for each pair of client + service working together, maybe there is a chance that, if *any* two nodes in a cluster haven't been initialized with the right tickets to talk together, then an error can happen during shuffle-sort, because so much distributed copying is going on. In any case, I'd love to know any good smoke tests for a large kerberized hadoop cluster that don't require running a mapreduce job. On Apr 19, 2014, at 5:32 AM, Terance Dias terance.d...@gmail.com wrote: Hi, I'm using apache hadoop-2.1.0-beta. I'm able to set up a basic multi-node cluster and run map reduce jobs. But when I enable Kerberos authentication, the reduce task fails with the following error. Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157) Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. 
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:311) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:243) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) I did a search and found that people have generally seen this error when their network configuration is not correct, so that the data nodes are not able to communicate with each other to shuffle the data. I don't think that is the problem in my case, because everything works fine when Kerberos authentication is disabled. Any idea what the problem could be? Thanks, Terance. -- Jay Vyas http://jayunit100.blogspot.com
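As a starting point for the non-MapReduce smoke tests asked about above, something like the following, run from each node, can at least confirm that tickets are granted and that authenticated RPC to the NameNode and ResourceManager works (the principal and keytab path are illustrative):

```shell
# Obtain a ticket non-interactively from a keytab, then confirm it
kinit -kt /etc/security/keytabs/smoke.keytab smokeuser@EXAMPLE.COM
klist

# Exercise authenticated RPC against the NameNode and ResourceManager
hdfs dfs -ls /
yarn node -list

# Negative test: with no ticket, the same call should fail with a GSS error
kdestroy
hdfs dfs -ls / && echo "WARNING: cluster accepted an unauthenticated request"
```

This does not exercise node-to-node shuffle traffic, but it quickly flags nodes whose keytabs, principals, or krb5.conf are broken before you burn time on a full job.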
Re: MapReduce for complex key/value pairs?
- Adding parsing logic directly in your mappers/reducers (e.g. just writing JSON strings) is the simplest, least elegant way to do it. - A more advanced approach is to write custom writables which serialize and parse the data. - The truly portable and right way to do it is to define a schema and use Avro to parse it. Unlike manually adding parsing to app logic, or adding JSON deserialization to your mappers/reducers, proper Avro serialization has the benefit of increasing performance and app portability while also making the code more maintainable (it interoperates with pure java domain objects). On Tue, Apr 8, 2014 at 2:30 PM, Harsh J ha...@cloudera.com wrote: Yes, you can write custom writable classes that detail and serialise your required data structure. If you have Hadoop: The Definitive Guide, check out the section Serialization under the chapter Hadoop I/O. On Tue, Apr 8, 2014 at 9:16 PM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Dear All, I was wondering if the following is possible using MapReduce. I would like to create a job that loops over a bunch of documents, tokenizes them into ngrams, and stores not only the counts of the ngrams but also _which_ document(s) had each particular ngram. In other words, the key would be the ngram but the value would be an integer (the count) _and_ an array of document ids. Is this something that can be done? Any pointers would be appreciated. I am using Java, btw. Thank you, Natalia Connolly -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
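To make the custom-writable option concrete, here is a minimal sketch of the write/readFields pattern for an (ngram -> count + document ids) value. It uses only java.io so it compiles standalone; in a real job the class would additionally declare implements org.apache.hadoop.io.Writable (whose contract is exactly these two methods) and be set as the job's map output value class. The class and field names are illustrative, not from the thread:

```java
import java.io.*;
import java.util.*;

// Sketch of a composite value: an occurrence count plus the ids of the
// documents containing the ngram. write()/readFields() mirror the two
// methods Hadoop's Writable interface requires.
public class NgramStats {
    private int count;
    private List<String> docIds = new ArrayList<>();

    public NgramStats() {}                        // Writables need a no-arg constructor

    public void add(String docId) { count++; docIds.add(docId); }

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());              // length-prefix the variable part
        for (String id : docIds) out.writeUTF(id);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        int n = in.readInt();
        docIds = new ArrayList<>(n);
        for (int i = 0; i < n; i++) docIds.add(in.readUTF());
    }

    public int getCount() { return count; }
    public List<String> getDocIds() { return docIds; }

    public static void main(String[] args) throws IOException {
        NgramStats s = new NgramStats();
        s.add("doc-1"); s.add("doc-7");

        // Round-trip through a byte stream, as the MR framework would do
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        s.write(new DataOutputStream(buf));
        NgramStats back = new NgramStats();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(back.getCount() + " " + back.getDocIds()); // 2 [doc-1, doc-7]
    }
}
```

The same shape extends naturally: the reducer sums counts and concatenates the id lists before emitting one NgramStats per ngram key.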
Re: Hadoop Serialization mechanisms
But I believe, w.r.t. whether we will see performance gains when using avro/thrift/... over writables: it depends on the writable implementation. For example, if I have a writable serialization which can use a bit map to store an enum, but then read that enum back as a string, it will look the same to the user, but my writable implementation would be superior. We can obviously say that if you use avro/thrift/protocol buffers in an efficient way, then yes, you will see a performance gain over, say, storing everything as Text writable objects. But clever optimizations can be done even within the Writable framework as well. On Sun, Mar 30, 2014 at 4:08 PM, Harsh J ha...@cloudera.com wrote: Does Hadoop provide a pluggable feature for Serialization for both the above cases? - You can override the RPC serialisation module and engine with a custom class if you wish to, but it would not be a trivial task. - You can easily use custom data serialisation modules for I/O. Is Writable the default Serialization mechanism for both the above cases? While MR's built-in examples in Apache Hadoop continue to use Writables, the RPCs have moved to using Protocol buffers from 2.x onwards. Were there any changes w.r.t. Serialization from Hadoop 1.x to Hadoop 2.x? Yes, partially; see above. Will there be a significant performance gain if the default Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers or Thrift in Map Reduce programming? Yes, you should see a gain in using a more efficient data serialisation format for data files. On Sun, Mar 30, 2014 at 9:09 PM, Jay Vyas jayunit...@gmail.com wrote: Those are all great questions, and mostly difficult to answer. I haven't played with the serialization APIs in some time, but let me try to give some guidance. W.r.t. your bulleted questions above: 1) Serialization is file system independent: the use of any hadoop compatible file system should support any kind of serialization. 2) See (1). 
The default serialization is Writables: But you can easily add your own by modifying the io.serializations configuration parameter. 3) I doubt anything significantly affected the way serialization works: The main thrust of 1.x -> 2.x was in the way services are deployed, not changing the internals of how data is serialized. After all, the serialization APIs need to remain stable even as the architecture of hadoop changes. 4) It depends on the implementation. If you have a custom writable that is really good at compressing your data, that will be better than using a thrift auto-generated API for serialization that is uncustomized out of the box. Example: Say you are writing strings and you know each string is max 3 characters. A smart Writable serializer with custom implementations optimized for your data will beat a thrift serialization approach. But I think in general, the advantage of thrift/avro is that it's easier to get really good compression natively out-of-the-box, due to the fact that many different data types are strongly supported by the way they apply the schemas (for example, a thrift struct can contain a boolean, two strings, and an int; these types will all be optimized for you by thrift). Whereas in Writables, you cannot as easily create sophisticated types with optimization of nested properties. On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote: Hello All, AFAIK Hadoop serialization comes into the picture in 2 areas: putting data on the wire, i.e., for interprocess communication between nodes using RPC; and putting data on disk, i.e. using Map Reduce for persistent storage, say HDFS. I have a couple of questions regarding the Serialization mechanisms used in Hadoop: Does Hadoop provide a pluggable feature for Serialization for both the above cases? Is Writable the default Serialization mechanism for both the above cases? Were there any changes w.r.t. Serialization from Hadoop 1.x to Hadoop 2.x?
Will there be a significant performance gain if the default Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers or Thrift in Map Reduce programming? Thanks, -RR -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
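The "max 3 characters" point above can be made concrete. If every value is known to be an ASCII string of at most 3 characters, a domain-aware serializer can emit a fixed 3-byte field, while the generic DataOutput.writeUTF() encoding spends two extra bytes per value on a length prefix. The class name below is hypothetical; this is a self-contained sketch of the comparison, not Hadoop code.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Illustrates the advantage of domain-aware serialization: for ASCII
// strings known to be at most 3 chars, a fixed-width encoding is smaller
// than the generic length-prefixed encoding of DataOutput.writeUTF().
class FixedWidthDemo {

    // Fixed 3-byte field, padded with NUL bytes for shorter strings.
    static byte[] writeFixed(String s) {
        byte[] buf = new byte[3];
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, buf, 0, src.length);
        return buf;
    }

    // Generic encoding: writeUTF() emits a 2-byte length prefix followed
    // by the modified-UTF-8 bytes.
    static byte[] writeGeneric(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return bos.toByteArray();
    }
}
```

For "abc" the fixed encoding is 3 bytes versus 5 for writeUTF(); over millions of records in a shuffle, that kind of saving is exactly the "clever optimization within the Writable framework" described above.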
Jobs fail immediately in local mode ?
I'm running a job in local mode, and have found that it returns immediately, switching the job state to FAILURE. From the /tmp/hadoop-jay directory, I see that clearly an attempt was made to run the job, and that some files seem to have been created, but I don't see any clues. ├── [102] local │ └── [102] localRunner │ └── [170] jay │ ├── [ 68] job_local1531736937_0001 │ ├── [ 68] job_local218993552_0002 │ └── [136] jobcache │ ├── [102] job_local1531736937_0001 │ │ └── [102] attempt_local1531736937_0001_m_00_0 │ │ └── [136] output │ │ ├── [ 14] file.out │ │ └── [ 32] file.out.index │ └── [102] job_local218993552_0002 │ └── [102] attempt_local218993552_0002_m_00_0 │ └── [136] output │ ├── [ 14] file.out │ └── [ 32] file.out.index └── [136] staging ├── [102] jay1531736937 └── [102] jay218993552 Any thoughts on how I can further diagnose what's happening and why my job fails without a stacktrace? Because I don't have hadoop installed on the system (i.e. I'm just running a java app that fires up a hadoop client locally), I can't see anything in /var/log. -- Jay Vyas http://jayunit100.blogspot.com
Re: Doubt
Certainly it is, and quite common, especially if you have some high performance machines: they can run as mapreduce slaves and also double as mongo hosts. The problem would of course be that when running mapreduce jobs you might have very slow network bandwidth at times, and if your front end needs fast response times all the time from the mongo instances, you could be in trouble. On Wed, Mar 19, 2014 at 11:50 AM, praveenesh kumar praveen...@gmail.com wrote: Why not? It's just a matter of installing 2 different packages. Depending on what you want to use it for, you need to take care of a few things, but as far as installation is concerned, it should be easily doable. Regards Prav On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote: Hi all, is it possible to install MongoDB on the same VM which contains hadoop? -- amiable harsha -- Jay Vyas http://jayunit100.blogspot.com
Re: What if file format is dependent upon first few lines?
-- method 1 -- You could, I think, just extend FileInputFormat with isSplitable returning false. Then each file won't be broken up into separate blocks, and will be processed as a whole by a single mapper. This is probably the easiest thing to do, but if you have huge files, it won't perform very well. -- method 2 -- You can use Harsh's suggestion (thanks for that idea, I didn't know about it). 1) In the setup method of a mapper, you can get the file path using ((FileSplit) context.getInputSplit()).getPath(); 2) Then, also in the mapper's setup method, you should be able to open a file input stream and call seek(0) to read the file header, as Harsh says. 3) When you process the header, you can store the results from the setup method in a local variable, and the map method can read from that variable and proceed. On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO raofeng...@gmail.com wrote: thanks, Harsh. could you specify more detail, or give some links or an example where I can start? 2014-02-27 22:17 GMT+08:00 Harsh J ha...@cloudera.com: A mapper's record reader implementation need not be restricted to strictly only the input split boundary. It is a loose relationship - you can always seek(0), read the lines you need to prepare, then seek(offset) and continue reading. Apache Avro (http://avro.apache.org) has a similar format - the header contains the schema a reader needs to work. On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO raofeng...@gmail.com wrote: Below is a fake sample of a Microsoft IIS log: #Software: Microsoft Internet Information Services 7.5 #Version: 1.0 #Date: 2013-07-04 20:00:00 #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390 ...
The first four lines describe the file format, which is a must for parsing each log line. It means the log file could NOT simply be split; otherwise the second split would lose the file format information. How could each mapper get the first few lines of the file? -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
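The header-parsing half of method 2 above can be sketched on its own. In a real mapper the reader would come from FileSystem.open() on the path obtained via ((FileSplit) context.getInputSplit()).getPath(), followed by seek(0); here the Hadoop specifics are replaced by a plain Reader so the sketch is self-contained, and the class name is illustrative.

```java
import java.io.*;

// Sketch of method 2: before processing its split, a mapper re-reads the
// start of the file and parses the "#Fields:" header line from an IIS log.
// Hadoop specifics (FileSplit, FileSystem.open, seek(0)) are replaced by a
// plain BufferedReader here.
class IisHeaderReader {

    // Returns the column names declared by the "#Fields:" directive, or an
    // empty array if the header block contains none.
    static String[] readFieldNames(BufferedReader r) throws IOException {
        String line;
        while ((line = r.readLine()) != null) {
            if (line.startsWith("#Fields:")) {
                return line.substring("#Fields:".length()).trim().split(" ");
            }
            if (!line.startsWith("#")) {
                break; // past the header block; no #Fields line found
            }
        }
        return new String[0];
    }
}
```

A mapper would call this once in setup(), cache the returned names in a field, and then use them to map each tab/space-separated record to named columns in map().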
YARN job exits fast without failure, but does nothing
Hi yarn: I've traced an oozie problem to a yarn task log, which originates from an oozie-submitted job: http://paste.fedoraproject.org/78099/92698193/raw/ Although the above yarn task ends in SUCCESS, it seems to do essentially nothing. Has anyone ever seen a log like that before? Any insight into why I might have an empty task like this would be appreciated. I won't go into details about oozie here since this is the yarn mailing list, but the link to my original problem is here: http://qnalist.com/questions/4726691/oozie-reports-unkown-hadoop-job-failure-but-no-error-indication-in-yarn -- Jay Vyas http://jayunit100.blogspot.com
Re: How to ascertain why LinuxContainer dies?
Not sure where the containers dump standard out/error to? I figured it would be propagated in the node manager logs if anywhere, right? Sent from my iPhone On Feb 14, 2014, at 4:46 AM, Harsh J ha...@cloudera.com wrote: Hi, Does your container command generate any stderr/stdout outputs that you can check under the container's work directory after it fails? On Fri, Feb 14, 2014 at 9:46 AM, Jay Vyas jayunit...@gmail.com wrote: I have a linux container that dies. The nodemanager logs only say: WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch : org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:202) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Where can I find the root cause of the non-zero exit code? -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J
Re: How to ascertain why LinuxContainer dies?
Okay Harsh: your hint was enough to get me back on track! I found the linux container logs and they are wonderful :)... I guess at the end of each container run, logs get propagated into the distributed file system's /var/log directories. In any case, once I dug in there, I found the cryptic failure was because my done_intermediate permissions were bad. Anyways, thanks for the hint Harsh! After monitoring the local /var/log/hadoop-yarn/container/ directory, I was able to see that the stdout/stderr files were being deleted, and then after some googling I found a post about how YARN aggregates logs into the DFS. Anyways, problem solved. For those curious: if debugging Yarn linux containers that are dying (as shown in the [local] /var/log/hadoop-yarn/ nodemanager logs), you can dig in more after the task dies by going into hadoop fs -cat /var/log/hadoop-yarn/apps/oozie_user/logs/application_1392385522708_0008/* On Fri, Feb 14, 2014 at 9:17 AM, German Florez-Larrahondo german...@samsung.com wrote: I believe that errors on containers are not propagated to the standard Java logs. You have to look into the std* and syslog files of the container. Here is an example: *.../userlogs/application_1391549207212_0006/container_1391549207212_0006_01_27* [htf@gfldesktop container_1391549207212_0006_01_27]$ ls -lart total 60 -rw-rw-r-- 1 htf htf 0 Feb 4 17:27 stdout -rw-rw-r-- 1 htf htf 0 Feb 4 17:27 stderr drwx--x--- 28 htf htf 4096 Feb 4 17:27 .. drwx--x--- 2 htf htf 4096 Feb 4 17:27 . -rw-rw-r-- 1 htf htf 50471 Feb 4 17:31 syslog Regards ./g -Original Message- From: Jay Vyas [mailto:jayunit...@gmail.com] Sent: Friday, February 14, 2014 7:02 AM To: user@hadoop.apache.org Cc: user@hadoop.apache.org Subject: Re: How to ascertain why LinuxContainer dies? Not sure where the containers dump standard out/error to? I figured it would be propagated in the node manager logs if anywhere, right?
Sent from my iPhone On Feb 14, 2014, at 4:46 AM, Harsh J ha...@cloudera.com wrote: Hi, Does your container command generate any stderr/stdout outputs that you can check under the container's work directory after it fails? On Fri, Feb 14, 2014 at 9:46 AM, Jay Vyas jayunit...@gmail.com wrote: I have a linux container that dies. The nodemanager logs only say: WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch : org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:202) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Where can I find the root cause of the non-zero exit code? -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
How to ascertain why LinuxContainer dies?
I have a linux container that dies. The nodemanager logs only say: WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch : org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:202) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Where can I find the root cause of the non-zero exit code? -- Jay Vyas http://jayunit100.blogspot.com
Re: Test hadoop code on the cloud
As a slightly more advanced option for OpenStack people: consider trying Savanna (Hadoop provisioned on top of OpenStack) as well. On Wed, Feb 12, 2014 at 10:23 AM, Silvina Caíno Lores silvi.ca...@gmail.com wrote: You can check Amazon Elastic MapReduce, which comes preconfigured on EC2 but you need to pay a little for it, or make your own custom installation on EC2 (beware that EC2 instances come with nothing but really basic shell tools on them, so it may take a while to get it running). Amazon's free tier allows you to instantiate several tiny machines; when you spend your free quota they start charging you, so be careful. Good luck :D On 12 February 2014 13:27, Andrea Barbato and.barb...@gmail.com wrote: Thanks for the answer, but what if I want to test my code on a fully distributed installation? (for more accurate performance) 2014-02-12 13:01 GMT+01:00 Zhao Xiaoguang cool...@gmail.com: I think you can test it in Amazon EC2 with pseudo distributed mode; it supports 1 tiny instance for 1 year free. Send From My Macbook On Feb 12, 2014, at 6:29 PM, Andrea Barbato and.barb...@gmail.com wrote: Hi! I need to test my hadoop code on a cluster, what is the simplest way to do this on the cloud? Is there any way to do it for free? Thanks in advance -- Jay Vyas http://jayunit100.blogspot.com
YARN FSDownload: How did Mr1 do it ?
I'm noticing that resource localization is much more complex in YARN than MR1; in particular, the timestamps need to be identical, or else an exception is thrown. I never saw that in MR1. How did MR1 JobTrackers handle resource localization differently than MR2 App Masters? -- Jay Vyas http://jayunit100.blogspot.com
Re: performance of hadoop fs -put
No, I'm using a glob pattern; it's all done in one put statement. On Tue, Jan 28, 2014 at 9:22 PM, Harsh J ha...@cloudera.com wrote: Are you calling one command per file? That's bound to be slow as it invokes a new JVM each time. On Jan 29, 2014 7:15 AM, Jay Vyas jayunit...@gmail.com wrote: I'm finding that hadoop fs -put on a cluster is quite slow for me when I have large amounts of small files... much slower than native file ops. Note that I'm using the RawLocalFileSystem as the underlying backing filesystem that is being written to in this case, so HDFS isn't the issue. I see that the Put class creates a linked list of the elements in the path. 1) Is there a more performant way to run fs -put? 2) Has anyone else noted that fs -put has extra overhead? I'm going to trace some more, but I just wanted to bounce this off the mailing list... maybe others have also run into this issue. ** Is hadoop fs -put inherently slower than a unix cp action, regardless of filesystem -- and if so, why? ** -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: DistributedCache deprecated
Gotcha, this makes sense. On Wed, Jan 29, 2014 at 4:44 PM, praveenesh kumar praveen...@gmail.com wrote: @Jay - Plus if you look at the DistributedCache class, these methods have been added inside the Job class. I am guessing they have kept the functionality the same, just merged the DistributedCache class into the Job class itself, giving developers more methods with fewer classes to worry about, thus simplifying the API. I hope that makes sense. Regards Prav On Wed, Jan 29, 2014 at 9:41 PM, praveenesh kumar praveen...@gmail.com wrote: @Jay - I don't know how the Job class is replacing the DistributedCache class, but I remember trying distributed cache functions like void addArchiveToClassPath(Path archive) - add an archive path to the current set of classpath entries; void addCacheArchive(URI uri) - add an archive to be localized; void addCacheFile(URI uri) - add a file to be localized (see http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html), and it works fine, the same way you were using DC before. Well, I am not sure what would be the best answer, but if you are trying to use DC, I was able to do it with the Job class itself. Regards Prav On Wed, Jan 29, 2014 at 9:27 PM, Jay Vyas jayunit...@gmail.com wrote: Thanks for asking this: I'm not sure, and didn't realize this until you mentioned it! 1) Prav: You are implying that we would use the Job class... but how could it replace the DC? 2) The point of the DC is to replicate a file so that it's present and local on ALL nodes.
I didn't know it was deprecated, but there must be some replacement for it - or maybe it just got renamed and moved? SO... what is the future of the DistributedCache for mapreduce jobs? On Wed, Jan 29, 2014 at 4:22 PM, praveenesh kumar praveen...@gmail.com wrote: I think you can use the Job class. http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html Regards Prav On Wed, Jan 29, 2014 at 9:13 PM, Giordano, Michael michael.giord...@vistronix.com wrote: I noticed that in Hadoop 2.2.0 org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated. (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class) Is there a class that provides equivalent functionality? My application relies heavily on DistributedCache. Thanks, Mike G. This communication, along with its attachments, is considered confidential and proprietary to Vistronix. It is intended only for the use of the person(s) named above. Note that unauthorized disclosure or distribution of information not generally known to the public is strictly prohibited. If you are not the intended recipient, please notify the sender immediately. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
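The Job-class replacement that Prav describes can be sketched as follows. This is an API-usage sketch only, not a runnable example: it needs the Hadoop 2.x mapreduce client on the classpath plus a live cluster, and the HDFS path and job name are hypothetical; `conf` and `context` come from the surrounding driver and Mapper.

```java
// Client side (driver): the Job methods replace the old static
// DistributedCache helpers. Path and job name are hypothetical.
Job job = Job.getInstance(conf, "my-job");
job.addCacheFile(new URI("hdfs:///shared/lookup.txt"));

// Task side, e.g. in Mapper.setup(Context context): the localized
// files are retrievable via the context.
URI[] cached = context.getCacheFiles();
```

So the functionality survives; only the entry points moved from the deprecated DistributedCache class onto Job (for adding) and JobContext (for reading).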
Re: Passing data from Client to AM
While you're at it, what about adding values to the Configuration() object - does that still work as a hack for information passing? On Wed, Jan 29, 2014 at 5:25 PM, Arun C Murthy a...@hortonworks.com wrote: Command line arguments and env variables are the most direct options. A more onerous option is to write some data to a file in HDFS, use LocalResource to ship it to the container on each node, and get the application code to read that file locally. (In MRv1 parlance, that is the Distributed Cache). hth, Arun On Jan 29, 2014, at 12:59 PM, Brian C. Huffman bhuff...@etinternational.com wrote: I'm looking at Distributed Shell as an example for writing a YARN application. My question is why are the script path and associated metadata saved as environment variables? Are there any other ways besides environment variables or command line arguments for passing data from the Client to the ApplicationMaster? Thanks, Brian -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You. -- Jay Vyas http://jayunit100.blogspot.com
Re: Hadoop-2.2.0 and Pig-0.12.0 - error IBM_JAVA
Thanks for sharing this, as we had the same problem and are wrestling with similar errors. I'm starting to think that there is something overly difficult about pig/hadoop 2.x deployment, related to which version of pig you use. Cheolsoo has helped us resolve our issue by pointing us to https://issues.apache.org/jira/browse/PIG-3729 Hope somebody can illuminate what's actually going on with pig, hadoop 2.x, and hadoop 1.x: why don't the standard pig jars work on 2.x? On Tue, Jan 28, 2014 at 2:23 PM, Serge Blazhievsky hadoop...@gmail.com wrote: Which hadoop distribution are you using? On Tue, Jan 28, 2014 at 10:04 AM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi Guys, I'm running hadoop version 2.2.0 with pig-0.12.0; when I try to run any job I get the error below: *java.lang.NoSuchFieldError: IBM_JAVA* Is this because of the Java version or a compatibility issue between hadoop and pig? I'm using Java version *1.6.0_31*. Please help me out. -- Regards, Viswa.J -- Jay Vyas http://jayunit100.blogspot.com
performance of hadoop fs -put
I'm finding that hadoop fs -put on a cluster is quite slow for me when I have large amounts of small files... much slower than native file ops. Note that I'm using the RawLocalFileSystem as the underlying backing filesystem that is being written to in this case, so HDFS isn't the issue. I see that the Put class creates a linked list of the elements in the path. 1) Is there a more performant way to run fs -put? 2) Has anyone else noted that fs -put has extra overhead? I'm going to trace some more, but I just wanted to bounce this off the mailing list... maybe others have also run into this issue. ** Is hadoop fs -put inherently slower than a unix cp action, regardless of filesystem -- and if so, why? ** -- Jay Vyas http://jayunit100.blogspot.com
Strange rpc exception in Yarn
Hi folks: At the **end** of a successful job, I'm getting some strange stack traces when using pig; however, judging from the stacktrace, it doesn't seem to be pig-specific. Rather, it appears that the job client is attempting to do something funny. Has anyone ever seen this sort of exception in Yarn? It seems as though it's related to an IPC call, but the IPC call is throwing an exception in the hasNext(..) implementation in the AbstractFileSystem. ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:roofmonkey (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.fs.AbstractFileSystem$1.hasNext(AbstractFileSystem.java:861) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:656) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:668) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:722) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$300(HistoryFileManager.java:77) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:275) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:708) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(HistoryFileManager.java:847) at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:107) at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:207) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:200) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:196) at java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:196) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:228) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
Re: Shutdown hook for FileSystems
What happens when you remove the shutdown hook? Is that supposed to trigger an exception?
Re: What is the difference between Hdfs and DistributedFileSystem?
Yes, yes, and YES! The use of alternative file systems to HDFS is an exciting part of the hadoop ecosystem, allowing us to plug mapreduce applications into different storage backends. Lots of folks in the hadoop community are working hard to democratize storage on Hadoop. Take a moment to scan some of these articles to get an idea of how modular the hadoop stack really is, and how broad the ecosystem is in terms of storage backends. http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/s3/S3FileSystem.html http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/RawLocalFileSystem.html http://answers.oreilly.com/topic/456-get-to-know-hadoop-filesystems/ https://wiki.apache.org/hadoop/HCFS http://www.gluster.org/2013/10/automated-hadoop-deployment-on-glusterfs-with-apache-ambari/ http://www.redhat.com/about/news/archive/2013/10/red-hat-contributes-apache-hadoop-plug-in-to-the-gluster-community On Mon, Jan 13, 2014 at 12:10 PM, Michael sjp120...@gmail.com wrote: HDFS is an implementation of a distributed file system. There can be other implementations of a generic distributed file system (e.g. the Google File System, GFS). On 13 January 2014 17:01, 梁李印 liyin.lian...@aliyun-inc.com wrote: What is the difference between Hdfs.java and DistributedFileSystem.java in Hadoop2? Best Regards, Liyin Liang Tel: 78233 Email: liyin.lian...@alibaba-inc.com -- Jay Vyas http://jayunit100.blogspot.com
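As a concrete illustration of that pluggability: pointing a cluster at an alternative Hadoop-compatible file system is largely a matter of core-site.xml configuration. The property values below follow the style of the glusterfs plug-in linked above, but treat the exact scheme and class names as illustrative rather than authoritative:

```xml
<!-- core-site.xml sketch: values are illustrative, in the style of a
     glusterfs-hadoop plug-in configuration -->
<property>
  <name>fs.defaultFS</name>
  <value>glusterfs:///</value>
</property>
<property>
  <name>fs.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
</property>
```

Hadoop resolves a URI scheme to a FileSystem implementation through fs.SCHEME.impl-style properties, which is exactly what lets mapreduce jobs run unmodified against a non-HDFS backend.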
Re: Ways to manage user accounts on hadoop cluster when using kerberos security
I recently found a pretty simple and easy way to set up ldap for my machines on rhel and wrote it up, using jumpbox and authconfig. If you are in the cloud and only need a quick, easy ldap and nsswitch setup, this is, I think, the easiest / cheapest way to do it. I know rhel and fedora come with authconfig; not sure about the other Linux distros: http://jayunit100.blogspot.com/2013/12/an-easy-way-to-centralize.html?m=1 On Jan 8, 2014, at 1:22 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: On Jan 7, 2014, at 2:55 PM, Manoj Samel manoj.sa...@gmail.com wrote: I am assuming that if the users are in LDAP, using PAM for LDAP can solve the issue. That's how I've seen this issue addressed. +Vinod CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: Debug Hadoop Junit Test in Eclipse
Excellent question. It's not trivial to debug a distributed app in eclipse, but it is totally doable using javaagents. We've written it up here: http://jayunit100.blogspot.com/2013/07/deep-dive-into-hadoop-with-bigtop-and.html FYI cc Brad Childs (https://github.com/childsb) at Red Hat has helped me with the tutorial; he might have some extra advice also (cc'd on this email). I've written up one way to do this using the bigtop VMs there. On Mon, Dec 16, 2013 at 8:07 AM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I want to trace how a file upload (-put) happens in hadoop, so I'm junit testing TestDFSShell.java. When I try to debug the test, it fails due to a test timed out exception. I believe this is because I am trying to stop one thread while the rest are working. I have changed the breakpoint property to suspend the VM, but still the same problem. How can I trace calls made by the datanode/namenode when running the TestDFSShell.java junit test through eclipse? I am using hadoop 2.2.0 -- Best Regards, Karim Ahmed Awara -- This message and its contents, including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email. -- Jay Vyas http://jayunit100.blogspot.com
Re: Debug Hadoop Junit Test in Eclipse
In that case I guess you will have to statically trace the code yourself. On Mon, Dec 16, 2013 at 10:32 AM, Karim Awara karim.aw...@kaust.edu.sa wrote: Useful post; however, I am not trying to debug mapreduce programs with their associated VMs. I want to modify the HDFS source code for how it uploads files, so I am only looking to trace fs commands through the DFS shell. I believe this should require less work in debugging than actually going to the mapred VMs! -- Best Regards, Karim Ahmed Awara On Mon, Dec 16, 2013 at 5:57 PM, Jay Vyas jayunit...@gmail.com wrote: Excellent question. It's not trivial to debug a distributed app in eclipse, but it is totally doable using javaagents. We've written it up here: http://jayunit100.blogspot.com/2013/07/deep-dive-into-hadoop-with-bigtop-and.html FYI cc Brad Childs (https://github.com/childsb) at Red Hat has helped me with the tutorial; he might have some extra advice also (cc'd on this email). I've written up one way to do this using the bigtop VMs there. On Mon, Dec 16, 2013 at 8:07 AM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I want to trace how a file upload (-put) happens in hadoop, so I'm junit testing TestDFSShell.java. When I try to debug the test, it fails due to a test timed out exception. I believe this is because I am trying to stop one thread while the rest are working. I have changed the breakpoint property to suspend the VM, but still the same problem. How can I trace calls made by the datanode/namenode when running the TestDFSShell.java junit test through eclipse? I am using hadoop 2.2.0 -- Best Regards, Karim Ahmed Awara -- This message and its contents, including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email.
-- Jay Vyas http://jayunit100.blogspot.com
Pluggable distributed cache impl
Are there any ways to plug in an alternate distributed cache implementation (i.e., when nodes of a cluster already have an NFS mount or other local data service...)?
Re: multiusers in hadoop through LDAP
So, not knowing much about LDAP, but being very interested in the multiuser problem on multiuser filesystems, I was excited to see this question. I'm researching the same thing at the moment, and it seems complicated by the fact that: - the FileSystem API itself provides implementations for getting group and user names / permissions. And furthermore - the Linux task controllers launch jobs as the user submitting the job, whereas the regular task controllers launch tasks under the YARN daemon name, IIRC. So where does LDAP begin and the TaskController / FileSystem notions of ownership end? I guess I'm also asking what the entities are which are ownable in a hadoop app, and how we can leverage the GroupMappingServiceProviders to deploy more flexible hadoop environments. Any thoughts on this would be appreciated. On Tue, Dec 10, 2013 at 6:38 PM, Adam Kawa kawa.a...@gmail.com wrote: Please have a look at the hadoop.security.group.mapping.ldap.* settings as Hardik Pandya suggests. In advance, just to share our story related to LDAP + hadoop.security.group.mapping.ldap.*, if you run into the same limitation as we did: In many cases hadoop.security.group.mapping.ldap.* should solve your problem. Unfortunately, they did not work for us. The problematic setting relates to an additional filter to use when searching for LDAP groups. We wanted to use a posixGroups filter, but it is currently not supported by Hadoop. Finally, we found a workaround using the name service switch configuration, where we specified that LDAP should be the primary source of information about the groups of our users. This means that we solved this problem at the operating system level, not at the Hadoop level. You can read more about this issue here: http://hakunamapdata.com/a-user-having-surprising-troubles-running-more-resource-intensive-hive-queries/ and here http://www.slideshare.net/AdamKawa/hadoop-adventures-at-spotify-strata-conference-hadoop-world-2013 (slides 18-26).
2013/12/10 Hardik Pandya smarty.ju...@gmail.com: Have you looked at hadoop.security.group.mapping.ldap.* in hadoop-common/core-default.xml (http://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/core-default.xml)? This additional resource may also help: http://hakunamapdata.com/a-user-having-surprising-troubles-running-more-resource-intensive-hive-queries/ On Tue, Dec 10, 2013 at 3:06 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi, In my cluster, I want to have multiple users for different purposes. The usual method is to add a user through the OS on the Hadoop NameNode. I notice hadoop also supports LDAP; could I add a user through LDAP instead of through the OS, so that a user authenticated by LDAP can also access their HDFS directory? Regards -- Jay Vyas http://jayunit100.blogspot.com
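For reference, a minimal core-site.xml sketch of the hadoop.security.group.mapping.ldap.* settings discussed in this thread. The server URL, bind user, and base DN below are placeholders you would replace with your own directory's values; the group filter shown is the default, and (as Adam notes above) a posixGroups filter is the case that was not supported:

```xml
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop,ou=services,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectClass=group)</value>
</property>
```

See core-default.xml (linked above) for the full list of ldap.* properties and their defaults.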
Write a file to local disks on all nodes of a YARN cluster.
I want to put a file on all nodes of my cluster so that it is locally readable (not in HDFS). Assuming that I can't guarantee a FUSE mount or NFS or anything of the sort on my cluster, is there a poor man's way to do this in YARN? Something like: for each node n in cluster: n.copyToLocal(a, /tmp/a); So that afterwards, all nodes in the cluster have a file a in /tmp. -- Jay Vyas http://jayunit100.blogspot.com
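The loop sketched above could look like this as a shell one-liner, assuming passwordless ssh and a slaves file listing one hostname per line (both assumptions on my part, not something YARN provides):

```shell
# naive push of ./a to /tmp/a on every node listed in slaves
for n in $(cat slaves); do
  scp ./a "$n:/tmp/a"
done
```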
FSMainOperations FSContract tests?
Mainly @steveloughran: Is it safe to say that *old* fs semantics are in the FSContract tests, and *new* fs semantics in the FSMainOps tests? I ask because your Swift filesystem work has tests which used the FSContract libs as well as the FSMainOps ones. Not sure why you need both? There is pretty high redundancy, it seems.
Re: how to prevent JAVA HEAP OOM happen in shuffle process in a MR job?
The version is really important here. - If 1.x, then where (NN, JT, TT?) - If 2.x, then where? (AM, NM, ...?) -- probably less likely here, since the resources are ephemeral. I know that some older 1.x versions had an issue with the jobtracker having an ever-expanding hashmap or something like that, so that if you ran 100s of jobs, you could get OOM errors on the JobTracker.
Re: Hadoop Test libraries: Where did they go ?
Yup, we figured it out eventually. The artifacts now use the test-jar directive, which creates a jar file that you can reference in mvn using the type tag in your dependencies. However, FYI, I haven't been able to successfully google for the quintessential classes in the hadoop test libs like the fs BaseContractTest by name, so they are now harder to find than before. So I think it's unfortunate that they are not a top-level maven artifact. It's misleading, as it's now very easy to assume from looking at hadoop in mvn central that hadoop-test is just an old library that nobody updates anymore. Just a thought, but maybe hadoop-test could be rejuvenated to point to hadoop-common somehow? On Nov 25, 2013, at 4:52 AM, Steve Loughran ste...@hortonworks.com wrote: I see a hadoop-common-2.2.0-tests.jar in org.apache.hadoop/hadoop-common; SHA1 a9994d261d00295040a402cd2f611a2bac23972a, which resolves in a search engine to http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/2.2.0/ It looks like it is now part of the hadoop-common artifacts; you just say you want the test bits: http://maven.apache.org/guides/mini/guide-attached-tests.html On 21 November 2013 23:28, Jay Vyas jayunit...@gmail.com wrote: It appears to me that http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-test is no longer updated. Where does hadoop now package the test libraries? Looking in the .//hadoop-common-project/hadoop-common/pom.xml file in the hadoop 2.x branches, I'm not sure whether or not src/test is packaged into a jar anymore... but I fear it is not. -- Jay Vyas http://jayunit100.blogspot.com -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law.
If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
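For anyone landing on this thread later, the test-jar dependency Steve describes looks like this in a pom.xml (using version 2.2.0 as discussed above):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.2.0</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
```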
Hadoop Test libraries: Where did they go ?
It appears to me that http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-test is no longer updated. Where does hadoop now package the test libraries? Looking in the .//hadoop-common-project/hadoop-common/pom.xml file in the hadoop 2.x branches, I'm not sure whether or not src/test is packaged into a jar anymore... but I fear it is not. -- Jay Vyas http://jayunit100.blogspot.com
Re: how to stream the video from hdfs
I believe there is a FUSE mount for HDFS which will allow you to open files normally in your streaming app rather than requiring the Java API. Also consider that for media and highly available binary data for a front end, HDFS might be overkill because of the block/NameNode requirement... If HDFS is not required but you still want a hadoop-compatible DFS, you could also try Gluster, which may be a little better suited for read-only, unblocked data for streaming from a front end. On Nov 13, 2013, at 12:50 AM, mallik arjun mallik.cl...@gmail.com wrote: Dear Jens, I want to put the videos into hdfs, and then I want to stream those videos to a PHP front end. On Tue, Nov 12, 2013 at 11:50 PM, Jens Scheidtmann jens.scheidtm...@gmail.com wrote: Dear Mallik, Please tell us what you are trying to accomplish; maybe then somebody is able to help you... Jens On Monday, 11 November 2013, mallik arjun wrote: hi all, how to stream the video from hdfs.
YARN And NTP
Hi folks. Is there a way to make YARN more forgiving with last modification times? I'm hitting the following exception from org.apache.hadoop.yarn.util.FSDownload: "changed on src filesystem (expected " + resource.getTimestamp() + ", was " + sStat.getModificationTime()); I realize that the times should be the same, but depending on the underlying filesystem, the semantics of this last modified time might vary. Any thoughts on this? -- Jay Vyas http://jayunit100.blogspot.com
Re: Uploading a file to HDFS
I've diagrammed the hadoop HDFS write path here: http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html On Tue, Oct 1, 2013 at 5:24 PM, Ravi Prakash ravi...@ymail.com wrote: Karim! Look at DFSOutputStream.java:DataStreamer HTH Ravi -- From: Karim Awara karim.aw...@kaust.edu.sa To: user user@hadoop.apache.org Sent: Thursday, September 26, 2013 7:51 AM Subject: Re: Uploading a file to HDFS Thanks for the reply. When the client caches 64KB of data on its own side, do you know which set of major Java classes/files is responsible for that action? -- Best Regards, Karim Ahmed Awara On Thu, Sep 26, 2013 at 2:25 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Case 2: While selecting a target DN for write operations, the NN will always prefer the DN from which the client is sending the data; in some cases the NN ignores that DN when there are disk space issues or other health symptoms. The rest works the same. Thanks Jitendra On Thu, Sep 26, 2013 at 4:15 PM, Shekhar Sharma shekhar2...@gmail.com wrote: It's not the namenode that does the reading or breaking of the file. When you run the command hadoop fs -put input output, hadoop is a script file which is the default client for hadoop. When the client contacts the namenode for writing, the NN creates a block id and asks 3 DNs to host the block (replication factor of 3), and this information is sent to the client. The client caches 64KB of data on its own side and then pushes the data to the DN, and this data gets pushed through the pipeline. This process gets repeated until 64MB of data is written, and if the client wants to write more it will again contact the NN via a heartbeat signal, and this process continues... Check "how does writing happen in HDFS?" Regards, Som Shekhar Sharma +91-8197243810 On Thu, Sep 26, 2013 at 3:41 PM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I have a couple of questions about the process of uploading a large file (10GB) to HDFS.
To make sure my understanding is correct, assume I have a cluster of N machines. What happens in the following: Case 1: assuming I want to upload a file (input.txt) of size K GB that resides on the local disk of machine 1 (which happens to be the namenode only): if I am running the command -put input.txt {some hdfs dir} from the namenode (assuming it does not play the datanode role), will the namenode read the first 64MB into a temporary pipe and then transfer it to one of the cluster datanodes once finished? Or does the namenode not do any reading of the file, but rather ask a certain datanode to read the 64MB window from the file remotely? Case 2: assume machine 1 is the namenode, but I run the -put command from machine 3 (which is a datanode). Who will start reading the file? -- Best Regards, Karim Ahmed Awara -- Jay Vyas http://jayunit100.blogspot.com
Re: Retrieve and compute input splits
Technically, the block locations are provided by the InputSplit, which in the FileInputFormat case is provided by the FileSystem interface. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html The thing to realize here is that the FileSystem implementation is provided at runtime - so the InputSplit class is responsible for creating a FileSystem implementation using reflection, and then calling getBlockLocations on a given file or set of files which the input split corresponds to. I think your confusion here lies in the fact that the input splits create a filesystem; however, they don't know what the filesystem implementation actually is - they only rely on the abstract contract, which provides a set of block locations. See the FileSystem abstract class for details on that. On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian mohaj...@gmail.com wrote: For the JobClient to compute the input splits, doesn't it need to contact the NameNode? Only the NameNode knows where the splits are; how can it compute them without that additional call? On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal sonalgoy...@gmail.com wrote: The input splits are not copied; only the information on the location of the splits is copied to the jobtracker so that it can assign tasktrackers which are local to the split. Check the Job Initialization section at http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/ To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai saigr...@yahoo.in wrote: Hi, I have attached the anatomy of MR from the Definitive Guide. In step 6 it says the JT/Scheduler retrieves input splits computed by the client from HDFS. The above line refers to the client computing input splits. 1. Why does the JT/Scheduler retrieve the input splits and what does it do? If it is retrieving the input split, does this mean it goes to the block and reads each record and gets the record back to the JT? If so, this is a lot of data movement for large files, which is not data locality, so I'm getting confused. 2. How does the client know how to calculate the input splits? Any help please. Thanks Sai -- Jay Vyas http://jayunit100.blogspot.com
Re: Extending DFSInputStream class
This is actually somewhat common in some of the hadoop core classes: private constructors and inner classes. I think in the long term JIRAs should be opened for these to make them public and pluggable with public parameterized constructors wherever possible, so that modularizations can be provided. On Thu, Sep 26, 2013 at 10:46 AM, Rob Blah tmp5...@gmail.com wrote: Hi, I would like to wrap DFSInputStream by extension. However, it seems that the DFSInputStream constructor is package private. Is there any way to achieve my goal? Also, just out of curiosity, why have you made this class inaccessible to developers, or am I missing something? regards tmp -- Jay Vyas http://jayunit100.blogspot.com
Re: Extending DFSInputStream class
The way we have gotten around this in the past is extending and then copying the private code and creating a brand new implementation. -- Jay Vyas http://jayunit100.blogspot.com
Re: Concatenate multiple sequence files into 1 big sequence file
IIRC, sequence files can be concatenated as-is and read as one large file, but maybe I'm forgetting something.
RawLocalFileSystem, getPos and NullPointerException
What is the correct behaviour for getPos in a record reader, and how should it behave when the underlying stream is null? It appears this can happen in the RawLocalFileSystem. Not sure if it's implemented more safely in DistributedFileSystem just yet. I've found that getPos in the RawLocalFileSystem's input stream can throw a NullPointerException if its underlying stream is closed. I discovered this when playing with a custom record reader. To patch it, I simply check if a call to stream.available() throws an exception, and if so, I return 0 in the getPos() function. The existing getPos() implementation is found here: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20/src/examples/org/apache/hadoop/examples/MultiFileWordCount.java What should be the correct behaviour of getPos() in the RecordReader? http://stackoverflow.com/questions/18708832/hadoop-rawlocalfilesystem-and-getpos -- Jay Vyas http://jayunit100.blogspot.com
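A self-contained sketch of the workaround described above, using plain java.io rather than the actual Hadoop classes (SafePos and its fields are hypothetical names for illustration): getPos() probes the stream with available() and returns 0 if that throws, instead of letting a NullPointerException escape.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration (SafePos is a made-up name, not a Hadoop class):
// guard getPos() against a dead underlying stream by probing it with
// available(), which throws IOException on closed file-backed streams.
public class SafePos {
    private final InputStream in;
    private long pos;

    public SafePos(InputStream in) {
        this.in = in;
    }

    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            pos++;  // track the byte offset ourselves
        }
        return b;
    }

    // Returns the current byte offset, or 0 if the stream is unusable.
    public long getPos() {
        try {
            in.available();  // probe: throws on closed file-backed streams
            return pos;
        } catch (IOException e) {
            return 0L;
        }
    }

    public static void main(String[] args) throws IOException {
        SafePos s = new SafePos(new ByteArrayInputStream(new byte[]{1, 2, 3}));
        s.read();
        s.read();
        System.out.println(s.getPos()); // prints 2
    }
}
```

Whether returning 0 (versus rethrowing as IOException) is the "correct" contract is exactly the open question above; this just shows the shape of the patch.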
Re: hadoop cares about /etc/hosts ?
Jitendra: When you say "check your masters file content", what are you referring to? On Mon, Sep 9, 2013 at 8:31 AM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Also can you please check your masters file content in the hadoop conf directory? Regards Jitendra On Mon, Sep 9, 2013 at 5:11 PM, Olivier Renault orena...@hortonworks.com wrote: Could you confirm that you put the hash in front of 192.168.6.10 localhost? It should look like: # 192.168.6.10 localhost Thanks Olivier On 9 Sep 2013 12:31, Cipher Chen cipher.chen2...@gmail.com wrote: Hi everyone, I have solved a configuration problem due to myself in hadoop cluster mode. I had this configuration: <property> <name>fs.default.name</name> <value>hdfs://master:54310</value> </property> and the hosts file /etc/hosts: 127.0.0.1 localhost 192.168.6.10 localhost ### 192.168.6.10 tulip master 192.168.6.5 violet slave and when I was trying to start-dfs.sh, the namenode failed to start. The namenode log hinted that: 13/09/09 17:09:02 INFO namenode.NameNode: Namenode up at: localhost/192.168.6.10:54310 ... 13/09/09 17:09:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithF [the same retry line repeats through "Already tried 9 time(s)"] ... Now I know deleting the line 192.168.6.10 localhost ### would fix this. But I still don't know why hadoop would resolve master to localhost/127.0.0.1. It seems http://blog.devving.com/why-does-hbase-care-about-etchosts/ explains this, but I'm not quite sure. Is there any other explanation for this? Thanks. -- Cipher Chen -- Jay Vyas http://jayunit100.blogspot.com
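For the archives, the working /etc/hosts from this thread ends up looking like the fragment below; the key point is that 192.168.6.10 maps only to the real hostnames, never to localhost, so the NameNode binds to the routable address instead of 127.0.0.1:

```
127.0.0.1    localhost
192.168.6.10 tulip master
192.168.6.5  violet slave
```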
MultiFileLineRecordReader vs CombineFileRecordReader
I've found that there are two different implementations of seemingly the same class for implementing RecordReaders for the MultiFileWordCount class: MultiFileLineRecordReader (implemented as an inner class in some versions of MultiFileWordCount) and CombineFileRecordReader. Is there any major difference between these classes, and why the redundancy? I'm thinking maybe one was retroactively added at some point, based on some git detective work which I tried... but I figured it might just be easier to ask here :) -- Jay Vyas http://jayunit100.blogspot.com
examples of HADOOP REST API
Hi, it appears that there are some completed JIRAs for the Hadoop REST services for monitoring via HTTP calls. Are there any examples of these in use? I don't see any docs on the URLs over which the hadoop REST API publishes cluster information. I'm assuming also that there might be some overlap between this and the Ambari REST services, but I'm not sure where to start digging. I want to run some REST calls at the end of some jobs to query how many tasks failed, etc... Hopefully, I could get this in JSON rather than scraping HTML. Thanks! -- Jay Vyas http://jayunit100.blogspot.com
Re: e-Science app on Hadoop
There are literally hundreds. Here is a great review article on how mapreduce is used in the bioinformatics algorithms space: http://www.biomedcentral.com/1471-2105/11/S12/S1 On Fri, Aug 16, 2013 at 3:38 PM, Felipe Gutierrez felipe.o.gutier...@gmail.com wrote: Hello, Does anybody know an e-Science application to run on Hadoop? Thanks. Felipe -- Felipe Oliveira Gutierrez -- felipe.o.gutier...@gmail.com -- https://sites.google.com/site/lipe82/Home/diaadia -- Jay Vyas http://jayunit100.blogspot.com
Mapred.system.dir: should JT start without it?
Is there a startup contract for mapreduce to make its own mapred.system.dir? Also, it seems that the jobtracker can start up even if this directory was not created / doesn't exist - I'm thinking that if that's the case, the JT should fail up front.
Re: Why LineRecordWriter.write(..) is synchronized
Then is this a bug? Synchronization in the absence of any race condition is normally considered bad. In any case I'd like to know why this writer is synchronized whereas the other ones are not. That is, I think, the point at issue: either the other writers should be synchronized or else this one shouldn't be - consistency across the write implementations is probably desirable so that changes to output formats or record writers don't lead to bugs in multithreaded environments. On Aug 8, 2013, at 6:50 AM, Harsh J ha...@cloudera.com wrote: While we don't fork by default, we do provide a MultithreadedMapper implementation that would require such synchronization. But if you are asking is it necessary, then perhaps the answer is no. On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote: It's not hadoop-forked threads; we may create a line record writer, then call this writer concurrently. On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote: Hi, Thanks for your reply. May I know where hadoop forks multiple threads to use a single RecordWriter? regards, sathwik On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote: Because we may use multiple threads to write a single file. On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote: Hi, LineRecordWriter.write(..) is synchronized. I did not find any other RecordWriter implementations that define write as synchronized. Any specific reason for this? regards, sathwik
Re: solr -Reg
True that it deserves a posting on the Solr lists, but I think it's still partially relevant here... The SolrInputFormat and SolrOutputFormat handle this for you and will be used in your map reduce jobs. They will output one core per reducer, where each reducer corresponds to a core. This is necessary since all indices are stored locally per core. Remember that even though you might be able to create shards from several terabytes easily in hadoop, hosting them will require some very high performance hardware. On Jul 28, 2013, at 1:11 PM, Harsh J ha...@cloudera.com wrote: The best place to ask questions pertaining to Solr would be on its own lists. Head over to http://lucene.apache.org/solr/discussion.html On Sun, Jul 28, 2013 at 11:37 AM, Venkatarami Netla venkatarami.ne...@cloudwick.com wrote: Hi, I am a beginner with Solr, so please explain step by step how to use Solr with HDFS and map reduce. Thanks, Regards -- N Venkata Rami Reddy -- Harsh J
Re: Staging directory ENOTDIR error.
This was a very odd error - it turns out that I had created a file called tmp in my fs root directory, which meant that when the jobs were trying to write to the tmp directory, they ran into the not-a-dir exception. In any case, I think the error reporting in the NativeIO class should be revised. On Thu, Jul 11, 2013 at 10:24 PM, Devaraj k devara...@huawei.com wrote: Hi Jay, Here the client is trying to create a staging directory in the local file system, which actually should be created in HDFS. Could you check whether you have configured "fs.defaultFS" in the client to point at HDFS? Thanks, Devaraj k From: Jay Vyas [mailto:jayunit...@gmail.com] Sent: 12 July 2013 04:12 To: common-u...@hadoop.apache.org Subject: Staging directory ENOTDIR error. Hi, I'm getting an ungoogleable exception, never seen this before. This is on a hadoop 1.1 cluster... It appears that it's permissions related... Any thoughts as to how this could crop up? I assume it's a bug in my filesystem, but not sure. 13/07/11 18:39:43 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:ENOTDIR: Not a directory ENOTDIR: Not a directory at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) -- Jay Vyas http://jayunit100.blogspot.com
Re: CompositeInputFormat
Map-side joins will use the CompositeInputFormat. They will only really be worth doing if one data set is small and the other is large. This is a good example: http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/ The trick is to google for CompositeInputFormat.compose() :) On Thu, Jul 11, 2013 at 5:02 PM, Botelho, Andrew andrew.bote...@emc.com wrote: Hi, I want to perform a JOIN on two sets of data with Hadoop. I read that the class CompositeInputFormat can be used to perform joins on data, but I can't find any examples of how to do it. Could someone help me out? It would be much appreciated. Thanks in advance, Andrew -- Jay Vyas http://jayunit100.blogspot.com
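A hedged sketch of what the compose() call looks like with the old mapred API (the paths and input format here are made up for illustration; see the example link above for a full runnable job):

```java
// Sketch only: an inner join of two sorted, identically-partitioned inputs.
JobConf conf = new JobConf();
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner",                          // join type: inner, outer, override...
    KeyValueTextInputFormat.class,    // format of each child dataset
    new Path("/data/left"),           // hypothetical inputs; must be sorted
    new Path("/data/right")));        // and partitioned identically
```

Note the precondition that makes map-side joins cheap: both inputs must already be sorted on the join key and split into the same number of partitions, otherwise the framework can't line the records up mapper-side.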
Staging directory ENOTDIR error.
Hi, I'm getting an ungoogleable exception, never seen this before. This is on a hadoop 1.1 cluster... It appears that it's permissions related... Any thoughts as to how this could crop up? I assume it's a bug in my filesystem, but not sure. 13/07/11 18:39:43 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:ENOTDIR: Not a directory ENOTDIR: Not a directory at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) -- Jay Vyas http://jayunit100.blogspot.com
Data node EPERM not permitted.
Hi: I've mounted my own ext4 disk on /mnt/sdb and chmodded it to 777. However, when starting the data node: /etc/init.d/hadoop-hdfs-datanode start I get the following error in my logs (bottom of this message). What is the EPERM error caused by, and how can I reproduce it? I'm assuming that, since the directory permissions are recursively set to 777, there shouldn't be a way that this error could occur, unless somewhere intermittently the directory permissions are being changed by hdfs to the wrong thing. 2013-07-06 15:54:13,968 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid dfs.datanode.data.dir /mnt/sdb/hadoop-hdfs/cache/hdfs/dfs/data : EPERM: Operation not permitted at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:605) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:439) at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:138) at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:154) at org.apache.hadoop.hdfs.server.datanode.DataNode.getDataDirsFromURIs(DataNode.java:1659) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1638) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1575) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1598) at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1751) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1772) -- Jay Vyas http://jayunit100.blogspot.com
starting Hadoop, the new way
Hi: Is there a hadoop 2.0 tutorial for 1.0 people? I'm used to running start-all.sh, but it appears that the new MR2 version of hadoop is much more sophisticated. In any case, I'm wondering what the standard way to start the new generation of hadoop/mr2 hadoop/mapreduce and hadoop/hdfs is, and whether I need to set any particular env variables when doing so. -- Jay Vyas http://jayunit100.blogspot.com
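For what it's worth, the rough 2.x equivalents of start-all.sh are the sbin scripts below, assuming HADOOP_HOME is set and HADOOP_CONF_DIR already points at your configs (start-all.sh still ships in 2.x but is deprecated):

```shell
# HDFS daemons (NameNode, DataNodes, SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh
# YARN daemons (ResourceManager, NodeManagers) - replaces JobTracker/TaskTrackers
$HADOOP_HOME/sbin/start-yarn.sh
# MapReduce JobHistory server (optional)
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```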
Re: HDFS interfaces
Looking in the source, it appears that in HDFS the NameNode supports getting this info directly via the client, and ultimately communicates block locations to the DFSClient, which is used by the DistributedFileSystem:

    /**
     * @see ClientProtocol#getBlockLocations(String, long, long)
     */
    static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
        String src, long start, long length) throws IOException {
      try {
        return namenode.getBlockLocations(src, start, length);
      } catch (RemoteException re) {
        throw re.unwrapRemoteException(AccessControlException.class,
            FileNotFoundException.class,
            UnresolvedPathException.class);
      }
    }

On Tue, Jun 4, 2013 at 2:00 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote: There are many instances of getFileBlockLocations in hadoop/fs. Can you explain which one is the main one? "It must be combined with a method of logically splitting the input data along block boundaries, and of launching tasks on worker nodes that are close to the data splits" Is this a user level task or a system level task? Regards, Mahmood

From: John Lilley john.lil...@redpoint.net
To: user@hadoop.apache.org; Mahmood Naderan nt_mahm...@yahoo.com
Sent: Tuesday, June 4, 2013 3:28 AM
Subject: RE: HDFS interfaces

Mahmood, It is in the FileSystem interface: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path, long, long) This by itself is not sufficient for application programmers to make good use of data locality. It must be combined with a method of logically splitting the input data along block boundaries, and of launching tasks on worker nodes that are close to the data splits. MapReduce does both of these things internally along with the file-format input classes.
For an application to do so directly, see the new YARN-based interfaces ApplicationMaster and ResourceManager. These are, however, very new, and there is little documentation and few examples. john

From: Mahmood Naderan [mailto:nt_mahm...@yahoo.com]
Sent: Monday, June 03, 2013 12:09 PM
To: user@hadoop.apache.org
Subject: HDFS interfaces

Hello, It is stated in the HDFS architecture guide ( https://hadoop.apache.org/docs/r1.0.4/hdfs_design.html) that "HDFS provides interfaces for applications to move themselves closer to where the data is located." What are these interfaces, and where are they in the source code? Is there any manual for the interfaces? Regards, Mahmood -- Jay Vyas http://jayunit100.blogspot.com
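To make the locality idea above concrete: conceptually, a client maps a byte range of a file onto block indices, then asks the NameNode which hosts hold those blocks. Below is a toy, Hadoop-free sketch of just the byte-range-to-block arithmetic; the class and method names are made up for illustration and do not exist in Hadoop.

```java
// Toy sketch (no Hadoop dependencies): which blocks does a byte range touch?
// Bytes [start, start+length) of a file span the blocks whose indices fall
// in [start/blockSize, (start+length-1)/blockSize].
public class BlockRangeSketch {
    static long firstBlock(long start, long blockSize) {
        return start / blockSize;
    }

    static long lastBlock(long start, long length, long blockSize) {
        return (start + length - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a common HDFS block size
        // A 200MB read starting at byte 0 spans blocks 0 and 1.
        System.out.println(firstBlock(0, blockSize));                    // 0
        System.out.println(lastBlock(0, 200L * 1024 * 1024, blockSize)); // 1
    }
}
```

Given those block indices, getFileBlockLocations is what tells you which datanodes host each block, so a scheduler can place a task near its split.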
Re: Install hadoop on multiple VMs in 1 laptop like a cluster
Just FYI, if you are on linux, KVM and kickstart are really good for this as well, and we have some kickstart Fedora 16 hadoop setup scripts I can share to spin up a cluster of several VMs on the fly with static IPs (that, to me, is usually the tricky part with hadoop VM cluster setup - setting up the VMs with static IP addresses, getting the nodes to talk / ssh to each other, and consistently defining the slaves file). But if you are stuck with VMware, then I believe VMware also has a vagrant plugin now, which will be much easier for you to maintain. Manually cloning machines doesn't scale well when you want to rebuild your cluster. On Fri, May 31, 2013 at 10:56 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Sai Sai, You can take a look at this also: http://goo.gl/iXzae I just did that yesterday for some other folks I'm working with. Maybe not the best way, but working like a charm. JM 2013/5/31 shashwat shriparv dwivedishash...@gmail.com: Try this http://www.youtube.com/watch?v=gIRubPl20oo - there will be three videos, 1-3; watch them and you can do what you need to do. Thanks Regards ∞ Shashwat Shriparv On Fri, May 31, 2013 at 5:52 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Hi, You can create a clone machine from an existing virtual machine in VMware and then run it as a separate virtual machine. http://www.vmware.com/support/ws55/doc/ws_clone_new_wizard.html After installing, you have to make sure that all the virtual machines are set up with the correct network configuration so that they can ping each other (you should use Host-only network settings in the network configuration). I hope this will help you. Regards Jitendra On Fri, May 31, 2013 at 5:23 PM, Sai Sai saigr...@yahoo.in wrote: Just wondering if anyone has any documentation or references to any articles on how to simulate a multi node cluster setup on 1 laptop, with hadoop running on multiple ubuntu VMs. Any help is appreciated. Thanks Sai -- Jay Vyas http://jayunit100.blogspot.com
Re: What else can be built on top of YARN.
What is the separation of concerns between YARN and Zookeeper? That is, where does YARN leave off and where does Zookeeper begin? Or is there some overlap? On Thu, May 30, 2013 at 2:42 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Rahul, It is at least for the reasons that Vinod listed that porting my application onto YARN is easier than making it work in the MapReduce framework. The main purpose of me using YARN is to exploit its resource management capabilities. Thanks, Kishore On Wed, May 29, 2013 at 11:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks for the response Krishna. I was wondering if it were possible to use MR to solve your problem instead of building the whole stack on top of yarn. Most likely it's not possible, that's why you are building it. I wanted to know why that is? I am just trying to find out the need, or why we might need to write the application on yarn. Rahul On Wed, May 29, 2013 at 8:23 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Rahul, I am porting a distributed application that runs on a fixed set of given resources to YARN, with the aim of being able to run it on dynamically selected resources, whichever are available at the time of running the application. Thanks, Kishore On Wed, May 29, 2013 at 8:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi all, I was going through the motivation behind Yarn. Splitting the responsibility of the JT is the major concern. Ultimately the base (Yarn) was built in a generic way, for building other generic distributed applications too. I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would be beneficial when run in parallel, but again, we can do those using map-only jobs in MR.
Can someone tell me a scenario where an application can utilize Yarn features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs. thanks, Rahul -- Jay Vyas http://jayunit100.blogspot.com
Re: understanding souce code structure
Hi! A few weeks ago I had the same question... I tried a first iteration at documenting this by going through the classes, starting with key/value pairs, in the blog post below. http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html Note it's not perfect yet, but I think it should provide some insight into things. The linchpin of it all is the DFSOutputStream and DataStreamer classes. Anyways... feel free to borrow the contents and roll your own, or comment on it and leave some feedback, or let me know if anything is missing. It definitely would be awesome to have a rock solid view of the full write path. On May 27, 2013, at 2:10 PM, Mahmood Naderan nt_mahm...@yahoo.com wrote: Hello, I am trying to understand the source of hadoop, especially HDFS. I want to know where exactly I should look in the source code for how HDFS distributes the data. Also, how does the map reduce engine try to read the data? Any hint regarding the location of those in the source code is appreciated. Regards, Mahmood
Re: Configuring SSH - is it required? for a psedo distriburted mode?
Actually, I should amend my statement -- SSH is required, but passwordless ssh (I guess) you can live without, if you are willing to enter your password for each process that gets started. But why wouldn't you want to implement passwordless ssh in a pseudo-distributed cluster? It's very easy to implement on a single node: cat ~/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys On Thu, May 16, 2013 at 11:31 AM, Jay Vyas jayunit...@gmail.com wrote: Yes it is required -- in pseudo-distributed mode the jobtracker is not necessarily aware that the task trackers / data nodes are on the same machine, and will thus attempt to ssh into them when starting the respective daemons etc. (i.e. start-all.sh) On Thu, May 16, 2013 at 11:21 AM, kishore alajangi alajangikish...@gmail.com wrote: When you start the hadoop processes, each process will ask for a password to start; to overcome this we configure SSH, whether you use a single node or multiple nodes. If you can enter the password for each process, it's not mandatory, even if you use multiple systems. Thanks, Kishore. On Thu, May 16, 2013 at 8:24 PM, Raj Hadoop hadoop...@yahoo.com wrote: Hi, I have a dedicated user on a Linux server for hadoop. I am installing it in pseudo-distributed mode on this box. I want to test my programs on this machine. But I see that the installation steps mention that SSH needs to be configured. If it is a single node, I don't require it... right? Please advise. I was looking at this site http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ It mentioned this - "Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section."
Thanks, Raj -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
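For the archives, the single-node passwordless-ssh setup in full is something like the following sketch (default key paths assumed; adjust the user and home directory for your box):

```shell
# Passwordless ssh to localhost for a pseudo-distributed node.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# Generate a key only if one doesn't already exist (empty passphrase):
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the key for local logins:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify (first connection will prompt to accept the host key):
# ssh localhost echo ok
```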
partition as block?
Hi guys: I'm wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can I increase performance with either: 1) A custom FileInputFormat 2) A custom Partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) would be a precise way to use partitions as a poor man's block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com
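For context on option (2): the default partitioning just hashes the key modulo the partition count, so raising the number of partitions shrinks each partition's share of the data roughly linearly. A dependency-free sketch of that arithmetic (the class name here is mine; the logic mirrors what Hadoop's HashPartitioner does):

```java
// Sketch of hash partitioning: more partitions => smaller output per partition.
public class PartitionSketch {
    // Mask the sign bit so the result is non-negative, then take the modulus.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Every key lands in [0, numPartitions); doubling numPartitions
        // roughly halves the expected data per partition file.
        for (String k : new String[]{"alpha", "beta", "gamma"}) {
            int p = partitionFor(k, 16);
            System.out.println(k + " -> partition " + p);
        }
    }
}
```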
Re: partition as block?
Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized on a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller. Wondering if there is a way to hack an increased number of partitions as a mechanism to simulate blocks - or whether this is just a bad idea altogether :) On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Jay, What are you going to do in your custom InputFormat and Partitioner? Is your InputFormat going to create larger splits which will overlap with larger blocks? If that is the case, IMHO, then you are going to reduce the no. of mappers, thus reducing the parallelism. Also, a much larger block size will put extra overhead when it comes to disk I/O. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:16 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I'm wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can I increase performance with either: 1) A custom FileInputFormat 2) A custom Partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) would be a precise way to use partitions as a poor man's block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: partition as block?
Yes, it is a problem at the first stage. What I'm wondering, though, is whether the intermediate results - which happen after the mapper phase - can be optimized. On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq donta...@gmail.com wrote: Hmmm. I was actually thinking about the very first step: how are you going to create the maps? Suppose you are on a block-less filesystem and you have a custom Format that is going to give you the splits dynamically. This means that you are going to store the file as a whole and create the splits as you continue to read the file. Wouldn't it be a bottleneck from a 'disk' point of view?? Are you not going away from the distributed paradigm?? Am I taking it in the correct way? Please correct me if I am getting it wrong. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:34 AM, Jay Vyas jayunit...@gmail.com wrote: Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized on a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller. Wondering if there is a way to hack an increased number of partitions as a mechanism to simulate blocks - or whether this is just a bad idea altogether :) On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Jay, What are you going to do in your custom InputFormat and Partitioner? Is your InputFormat going to create larger splits which will overlap with larger blocks? If that is the case, IMHO, then you are going to reduce the no. of mappers, thus reducing the parallelism. Also, a much larger block size will put extra overhead when it comes to disk I/O.
Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:16 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: Im wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can i increase performance with either: 1) A custom FileInputFormat 2) A custom partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) will be a precise way to use partitions as a poor mans block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: partition as block?
What do you mean, increasing the size? I'm talking more about increasing the number of partitions... which actually decreases individual file size. On Apr 30, 2013, at 4:09 PM, Mohammad Tariq donta...@gmail.com wrote: Increasing the size can help us to an extent, but increasing it further might cause problems during copy and shuffle. If the partitions are too big to be held in memory, we'll end up with a disk based shuffle, which is gonna be slower than a RAM based shuffle, thus delaying the entire reduce phase. Furthermore, the N/W might get overwhelmed. I think keeping it considerably high will definitely give you some boost. But it'll require high level tinkering. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 1:29 AM, Jay Vyas jayunit...@gmail.com wrote: Yes, it is a problem at the first stage. What I'm wondering, though, is whether the intermediate results - which happen after the mapper phase - can be optimized. On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq donta...@gmail.com wrote: Hmmm. I was actually thinking about the very first step: how are you going to create the maps? Suppose you are on a block-less filesystem and you have a custom Format that is going to give you the splits dynamically. This means that you are going to store the file as a whole and create the splits as you continue to read the file. Wouldn't it be a bottleneck from a 'disk' point of view?? Are you not going away from the distributed paradigm?? Am I taking it in the correct way? Please correct me if I am getting it wrong. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:34 AM, Jay Vyas jayunit...@gmail.com wrote: Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized on a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller.
Wondering, if there is a way to hack an increased numbers of partitions as a mechanism to simulate blocks - or wether this is just a bad idea altogether :) On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Jay, What are you going to do in your custom InputFormat and partitioner?Is your InputFormat is going to create larger splits which will overlap with larger blocks?If that is the case, IMHO, then you are going to reduce the no. of mappers thus reducing the parallelism. Also, much larger block size will put extra overhead when it comes to disk I/O. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:16 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: Im wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can i increase performance with either: 1) A custom FileInputFormat 2) A custom partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) will be a precise way to use partitions as a poor mans block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: Maven dependency
this should be enough to get started (you can pick the 1.* version if you want the newer APIs and stuff, but for the elephant book, the older APIs will work fine as well):

    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>0.20.2</version>
      </dependency>
    </dependencies>

On Wed, Apr 24, 2013 at 3:13 PM, Kevin Burton rkevinbur...@charter.net wrote: I am reading "Hadoop in Action" and the author on page 51 puts forth this code:

    // imports (not shown in the book):
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCount2 {
      public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount2.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(TokenCountMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        client.setConf(conf);
        try {
          JobClient.runJob(conf);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }

Which is an example of a simple MapReduce job. But being a beginner, I am not sure how to set up a project for this code. If I am using Maven, what are the Maven dependencies that I need? There are several map reduce dependencies and I am not sure which to pick. Are there other dependencies needed (such as JobConf)? What are the imports needed? During the construction of the configuration, what heuristics are used to find the configuration for the Hadoop cluster?

Thank you. -- Jay Vyas http://jayunit100.blogspot.com
Re: Append MR output file to an exitsted HDFS file
I might be misunderstanding, but you want each reducer to append its output to corresponding files that already exist in HDFS? Remember that the reducers usually output globs, so you will have several parts to your output - so the append has to be done in a way where each new reducer partition corresponds to an old partition. If so - maybe you could play with your own OutputFormat, taking the source from one that serves as a starting point, and replacing the part that creates the output stream with a call to append() instead of create(). The reason this is tricky is that each OutputFormat is going to have to find the corresponding file to append to. On Sun, Apr 21, 2013 at 10:54 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi All, Can I append a MR output file to an existing file on HDFS? I'm using CDH4.1.2 with MRv2. Regards -- Jay Vyas http://jayunit100.blogspot.com
Re: Writing intermediate key,value pairs to file and read it again
How many intermediate keys? If small enough, you can keep them in memory. If large, you can just wait for the job to finish and siphon them into your job as input with the MultipleInputs API. On Apr 20, 2013, at 10:43 AM, Vikas Jadhav vikascjadha...@gmail.com wrote: Hello, can anyone help me with the following issue: writing intermediate key,value pairs to a file and reading them again. Let us say I have to write each intermediate pair received @reducer to a file, and then read that back as key value pairs and use it for processing. I found the IFile.java file, which has a reader and writer, but I am not able to understand how to use it. For example, I don't understand the Counter value as the last parameter (spilledRecordsCounter). Thanks. -- Regards, Vikas
JobSubmissionFiles: past , present, and future?
Hi guys: I'm curious about the changes and future of the JobSubmissionFiles class. Grepping around on the web, I'm finding some code snippets that suggest that hadoop security is not handled the same way on the staging directory as before: http://javasourcecode.org/html/open-source/hadoop/hadoop-0.20.203.0/org/apache/hadoop/mapreduce/JobSubmissionFiles.java.html http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201210.mbox/%3ccaocnvr0eylsckxaocpnm7kbzwphvcdjbbx5a+azes_s6pws...@mail.gmail.com%3E But I'm having trouble definitively pinning this to versions. Why the difference in the if/else logic here, and what is the future of permissions on .staging?
Re: JobSubmissionFiles: past , present, and future?
To update on this: it was just pointed out to me by Matt Farrellee that the auto-fix of permissions is a failsafe in case of a race condition, and is not meant to mend bad permissions in all cases: https://github.com/apache/hadoop-common/commit/f25dc04795a0e9836e3f237c802bfc1fe8a243ad Something to keep in mind - if you see the "fixing staging permissions" error message a lot, then there might be a more systemic problem in your fs... at least, that was the case for us. On Apr 12, 2013, at 6:11 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I'm curious about the changes and future of the JobSubmissionFiles class. Grepping around on the web, I'm finding some code snippets that suggest that hadoop security is not handled the same way on the staging directory as before: http://javasourcecode.org/html/open-source/hadoop/hadoop-0.20.203.0/org/apache/hadoop/mapreduce/JobSubmissionFiles.java.html http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201210.mbox/%3ccaocnvr0eylsckxaocpnm7kbzwphvcdjbbx5a+azes_s6pws...@mail.gmail.com%3E But I'm having trouble definitively pinning this to versions. Why the difference in the if/else logic here, and what is the future of permissions on .staging?
Re: No build.xml when trying to build FUSE
hadoop-hdfs builds with maven, not ant. You might also need to install the serialization libraries. See http://wiki.apache.org/hadoop/HowToContribute . As an aside, you could try to use gluster as a FUSE mount if you simply want an HA, FUSE-mountable filesystem which is mapreduce compatible: https://github.com/gluster/hadoop-glusterfs . -- Forwarded message -- From: YouPeng Yang yypvsxf19870...@gmail.com Date: Wed, Apr 10, 2013 at 10:06 AM Subject: No build.xml when trying to build FUSE To: user@hadoop.apache.org Dear All, I want to integrate FUSE with Hadoop, so I checked out the code using the command: [root@Hadoop ~]# svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-trunk However, I did not find any ant build.xml to build fuse-dfs in hadoop-trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib. Did I check out the wrong code, or is there another way to build fuse-dfs? Please guide me. Thanks, regards -- Jay Vyas http://jayunit100.blogspot.com
Re: Copy Vs DistCP
DistCP is a full-blown mapreduce job (mapper only, where the mappers do a fully parallel copy to the destination). CP appears (correct me if I'm wrong) to simply invoke the FileSystem and issue a copy command for every source file. I have an additional question: how is CP, which is internal to a cluster, optimized (if at all)? On Wed, Apr 10, 2013 at 6:20 PM, KayVajj vajjalak...@gmail.com wrote: I have a few questions regarding the usage of DistCP for copying files in the same cluster. 1) Which one is better within the same cluster, and what factors (like file size etc.) would influence the usage of one over the other? 2) When we run a cp command like the one below from a client node of the cluster (not a data node), how does the cp command work: i) like an MR job, or ii) does it copy the files locally and then copy them back to the new location? Example of the copy command: hdfs dfs -cp /some_location/file /new_location/ Thanks, your responses are appreciated. -- Kay -- Jay Vyas http://jayunit100.blogspot.com
Re: Distributed cache: how big is too big?
Hmmm... maybe I'm missing something, but (@Bjorn) why would you use hdfs as a replacement for the distributed cache? After all, the distributed cache is just a file with replication over the whole cluster, which isn't in hdfs. Can't you just make the cache size big and store the file there? What advantage is hdfs distribution of the file over all nodes? On Apr 9, 2013, at 6:49 AM, Bjorn Jonsson bjorn...@gmail.com wrote: Put it once on hdfs with a replication factor equal to the number of DNs. No startup latency on job submission or max size, and you can access it from anywhere with fs since it sticks around until you replace it? Just a thought. On Apr 8, 2013 9:59 PM, John Meza j_meza...@hotmail.com wrote: I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing. To make the Hadoop solution work, I need to deploy the data directory to each DN when the job is executed. I know this isn't new and is commonly done with a Distributed Cache. Based on experience, what are the common file sizes deployed in a Distributed Cache? I know smaller is better, but how big is too big? The larger the cache deployed, I have read, the more startup latency there will be. I also assume there are other factors that play into this. I know that:
- Default local.cache.size = 10GB
- Range of desirable sizes for a Distributed Cache = 10KB - 1GB??
- Distributed Cache is normally not used if larger than = ?
Another option: put the data directories on each DN and provide the location to the TaskTracker? thanks John
The Job.xml file
Hi guys: I can't find much info about the life cycle of the job.xml file in hadoop. My thoughts are: 1) It is created by the job client. 2) It is only read by the JobTracker. 3) Task trackers are (indirectly) configured by information in job.xml, because the JobTracker decomposes its contents into individual tasks. So, my (related) questions are: Is there a way to start a job directly from a job.xml file? What components depend on and read the job.xml file? Where is job.xml defined/documented (if anywhere)? -- Jay Vyas http://jayunit100.blogspot.com
MVN repository for hadoop trunk
Hi guys: Is there a mvn repo for hadoop's 3.0.0 trunk build? Clearly the hadoop pom.xml allows us to build hadoop from scratch and installs it as 3.0.0-SNAPSHOT -- but it's not clear whether there is a published version of this snapshot jar somewhere. -- Jay Vyas http://jayunit100.blogspot.com
Re: MVN repository for hadoop trunk
This is awesome, thanks! On Sat, Apr 6, 2013 at 5:14 PM, Harsh J ha...@cloudera.com wrote: Thanks Giri, was not aware of this one! On Sun, Apr 7, 2013 at 2:38 AM, Giridharan Kesavan gkesa...@hortonworks.com wrote: All the hadoop snapshot artifacts are available through the snapshots url: https://repository.apache.org/content/groups/snapshots -Giri On Sat, Apr 6, 2013 at 2:00 PM, Harsh J ha...@cloudera.com wrote: I don't think we publish nightly or rolling jars anywhere on maven central from trunk builds. On Sun, Apr 7, 2013 at 2:17 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: Is there a mvn repo for hadoop's 3.0.0 trunk build? Clearly the hadoop pom.xml allows us to build hadoop from scratch and installs it as 3.0.0-SNAPSHOT -- but it's not clear whether there is a published version of this snapshot jar somewhere. -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
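For anyone finding this thread later: to consume those snapshot artifacts from a build, you would point your pom at the snapshots repository Giri mentioned — something like the fragment below (the repository id is arbitrary; only the URL comes from this thread):

```xml
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/groups/snapshots</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```

With that in place, a dependency on a 3.0.0-SNAPSHOT artifact should resolve without building trunk locally.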
cannot find /usr/lib/hadoop/mapred/
Hi guys: I'm getting an odd error involving a file called toBeDeleted. I've never seen this - somehow it's blocking my task trackers from starting.

2013-03-06 16:19:24,657 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Cannot find root /usr/lib/hadoop/mapred/ for execution of task deletion of toBeDeleted/2013-03-06_02-25-40.379_4 on /usr/lib/hadoop/mapred/ with original name /usr/lib/hadoop/mapred/toBeDeleted/2013-03-06_02-25-40.379_4
    at org.apache.hadoop.util.AsyncDiskService.execute(AsyncDiskService.java:95)
    at org.apache.hadoop.util.MRAsyncDiskService.execute(MRAsyncDiskService.java:115)
    at org.apache.hadoop.util.MRAsyncDiskService.init(MRAsyncDiskService.java:105)
    at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:742)
    at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1522)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3821)

-- Jay Vyas http://jayunit100.blogspot.com