Re: Problem running Hadoop 0.23.0
Hi Nitin, It looks like you may be using the wrong port number - try 8088 for the resource manager UI. Cheers, Tom On Mon, Nov 28, 2011 at 4:02 AM, Nitin Khandelwal nitin.khandel...@germinait.com wrote: Hi, I was trying to setup Hadoop 0.23.0 with help of http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/SingleCluster.html. After starting resourcemanager and nodemanager, I get following error when i try to hit Hadoop UI: org.apache.hadoop.ipc.RPC$VersionMismatch: Server IPC version 5 cannot communicate with client version 47. There is no significant error in Hadoop logs (it shows everything started successfully). Do you have any idea about this error? Thanks, -- Nitin Khandelwal
Re: cannot use distcp in some s3 buckets
On Thu, Oct 13, 2011 at 2:06 PM, Raimon Bosch raimon.bo...@gmail.com wrote: By the way, The url I'm trying has a '_' in the bucket name. Could be this the problem? Yes, underscores are not permitted in hostnames. Cheers, Tom 2011/10/13 Raimon Bosch raimon.bo...@gmail.com Hi, I've been having some problems with one of our s3 buckets. I have asked on amazon support with no luck yet https://forums.aws.amazon.com/thread.jspa?threadID=78001. I'm getting this exception only with our oldest s3 bucket with this command: hadoop distcp s3://MY_BUCKET_NAME/logfile-20110815.gz /tmp/logfile-20110815.gz java.lang.IllegalArgumentException: Invalid hostname in URI s3://MY_BUCKET_NAME/logfile-20110815.gz /tmp/logfile-20110815.gz at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:41) at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82) As you can see, hadoop is rejecting my url before starting to do the authorization steps. Someone has been in a similar issue? I have already tested the same operation in newer s3 buckets and the command is working correctly. Thanks in advance, Raimon Bosch.
Re: updated example
JobConf and the old API are no longer deprecated in the forthcoming 0.20.205 release, so you can continue to use it without issue. The equivalent in the new API is setInputFormatClass() on org.apache.hadoop.mapreduce.Job. Cheers, Tom On Tue, Oct 11, 2011 at 9:18 AM, Keith Thompson kthom...@binghamton.edu wrote: I see that the JobConf class used in the WordCount tutorial is deprecated for the Configuration class. I am wanting to change the file input format (to the StreamInputFormat for XML as in Hadoop: The Definitive Guide pp. 212-213) but I don't see a setInputFormat method in the Configuration class as there was in the JobConf class. Is there an updated example using the non-deprecated classes and methods? I have searched but not found one. Regards, Keith
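For reference, a minimal sketch of the new-API setup (the driver class name and the choice of TextInputFormat are placeholders; the StreamInputFormat from the book example is an old-API class, so for that particular case the un-deprecated JobConf route remains the simplest):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class XmlJobDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "xml processing");
        job.setJarByClass(XmlJobDriver.class);
        // New-API replacement for JobConf.setInputFormat(); substitute your XML-aware format
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }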
Re: Distributed cluster filesystem on EC2
You might consider Apache Whirr (http://whirr.apache.org/) for bringing up Hadoop clusters on EC2. Cheers, Tom On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans ev...@yahoo-inc.com wrote: Dmitry, It sounds like an interesting idea, but I have not really heard of anyone doing it before. It would make for a good feature to have tiered file systems all mapped into the same namespace, but that would be a lot of work and complexity. The quick solution would be to know what data you want to process before hand and then run distcp to copy it from S3 into HDFS before launching the other map/reduce jobs. I don't think there is anything automatic out there. --Bobby Evans On 8/29/11 4:56 PM, Dmitry Pushkarev u...@stanford.edu wrote: Dear hadoop users, Sorry for the off-topic. We're slowly migrating our hadoop cluster to EC2, and one thing that I'm trying to explore is whether we can use alternative scheduling systems like SGE with shared FS for non data intensive tasks, since they are easier to work with for lay users. One problem for now is how to create shared cluster filesystem similar to HDFS, distributed with high-performance, somewhat POSIX compliant (symlinks and permissions), that will use amazon EC2 local nonpersistent storage. Idea is to keep original data on S3, then as needed fire up a bunch of nodes, start shared filesystem, and quickly copy data from S3 to that FS, run the analysis with SGE, save results and shut down that filesystem. I tried things like S3FS and similar native S3 implementation but speed is too bad. Currently I just have a FS on my master node that is shared via NFS to all the rest, but I pretty much saturate 1GB bandwidth as soon as I start more than 10 nodes. Thank you. I'd appreciate any suggestions and links to relevant resources!. Dmitry
Re: 0.21.0 - Java Class Error
Hi Witold, Is this on Windows? The scripts were re-structured after Hadoop 0.20, and looking at them now I notice that the cygwin path translation for the classpath seems to be missing. You could try adding the following line to the if $cygwin clause in bin/hadoop-config.sh: CLASSPATH=`cygpath -p -w $CLASSPATH` It's worth filing a bug for this too. Cheers, Tom On Thu, Apr 7, 2011 at 1:24 PM, Witold Januszewski wit...@skni.org wrote: To Whom It May Concern, When trying to run Hadoop 0.21 with JDK 1.6_23 I get an error: java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName. The full error log is in the attached .png Can you help me? I'd be grateful. Yours faithfully, Witold Januszewski
Re: hadoop installation problem(single-node)
The instructions at http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html should be what you need. Cheers, Tom On Wed, Mar 2, 2011 at 12:59 AM, Manish Yadav manish.ya...@orkash.com wrote: Dear Sir/Madam, I'm very new to Hadoop and am trying to install it on my computer as a single-node cluster, following a weblink. I'm using Ubuntu 10.04 64-bit as my operating system and have installed Java in /usr/java/jdk1.6.0_24. The steps I took to install Hadoop are the following: I made a group hadoop and a user hadoop with its home directory under the hadoop directory; I have a directory called projects where I downloaded the Hadoop binary and extracted it; I also configured ssh. Then I made changes to some files, which I'm attaching with this mail - please check them: 1: hadoop-env.sh 2: core-site.xml 3: mapreduce-site.xml 4: hdfs-site.xml 5: the hadoop user's .bashrc 6: the hadoop user's .profile After making changes to these files, I logged into the hadoop account and ran a few commands, with the following result: hadoop@ws40-man-lin:~$ echo $HADOOP_HOME /home/hadoop/project/hadoop-0.20.0 hadoop@ws40-man-lin:~$ hadoop namenode -format hadoop: command not found hadoop@ws40-man-lin:~$ namenode -format namenode: command not found hadoop@ws40-man-lin:~$ Now I'm completely stuck and don't know what to do. Please help me, as there is no more help around the net. I'm attaching the files I changed - can you tell me the exact configuration I should use to install Hadoop?
Re: Missing files in the trunk ??
These files are generated files. If you run ant avro-generate eclipse then Eclipse should find these files. Cheers, Tom On Mon, Feb 28, 2011 at 2:43 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: Hi all, I checked out the map-reduce trunk a few days back and following files are missing.. import org.apache.hadoop.mapreduce.jobhistory.Events; import org.apache.hadoop.mapreduce.jobhistory.JhCounter; import org.apache.hadoop.mapreduce.jobhistory.JhCounterGroup; import org.apache.hadoop.mapreduce.jobhistory.JhCounters; ant jar works well but eclipse finds these files missing in the corresponding packages .. I browsed the trunk online but couldn't trace these files.. Any help is highly appreciated :) -- Regards, Bharath .V w:http://research.iiit.ac.in/~bharath.v
Re: 0.21 found interface but class was expected
Hi Steve, Sorry to hear about the problems you had. The issue you hit was a result of MAPREDUCE-954, and there was some discussion on that JIRA about compatibility. I believe the thinking was that the context classes are framework classes, so users don't extend/implement them in the normal course of use, and it's also understood that users would recompile their apps (i.e. source compatibility). However, tools like MRUnit which extend/implement these classes do need to be updated when a change like this happens. We tried hard to make 0.21 as backwards compatible with 0.20 as possible, a big part of which was going through all the APIs and annotating their audience and stability (see http://developer.yahoo.com/blogs/hadoop/posts/2010/05/towards_enterpriseclass_compat/ for background). The new MapReduce API (in org.apache.hadoop.mapreduce), which is what we are talking about here, is not yet declared stable (unlike the old API) and these classes are marked with @InterfaceStability.Evolving to show that they can change even between minor releases. I think we could improve visibility to users by publishing a list of incompatible changes in the API for each release - so I've opened HADOOP-7035 for this. Cheers, Tom On Sun, Nov 14, 2010 at 7:41 AM, Konstantin Boudnik c...@apache.org wrote: Oh, thank you Todd! For a second there I thought that Hadoop developers have promised a full binary compatibility (in true Solaris sense of the word). Now I understand that such thing never been promised. Even though Hadoop haven't come over 'major' version change yet. Seriously. Steve, you are talking about leaving and breathing system here. To best of my understanding first stable Hadoop version was suppose to be 1.0 - a major version according to your own terms. Which apparently hasn't came around yet. Now, what exactly you are frustrated about? Cos On Sat, Nov 13, 2010 at 06:50PM, Todd Lipcon wrote: We do have policies against breaking APIs between consecutive major versions except for very rare exceptions (eg UnixUserGroupInformation went away when security was added). We do *not* have any current policies that existing code can work against different major versions without a recompile in between. Switching an implementation class to an interface is a case where a simple recompile of the dependent app should be sufficient to avoid issues. For whatever reason, the JVM bytecode for invoking an interface method (invokeinterface) is different than invoking a virtual method in a class (invokevirtual). -Todd On Sat, Nov 13, 2010 at 5:28 PM, Lance Norskog goks...@gmail.com wrote: It is considered good manners :) Seriously, if you want to attract a community you have an obligation to tell them when you're going to jerk the rug out from under their feet. On Sat, Nov 13, 2010 at 3:27 PM, Konstantin Boudnik c...@apache.org wrote: It doesn't answer my question. I guess I will have to look for the answer somewhere else On Sat, Nov 13, 2010 at 03:22PM, Steve Lewis wrote: Java libraries are VERY reluctant to change major classes in a way that breaks backward compatability - NOTE that while the 0.18 packages are deprecated, they are separate from the 0.20 packages allowing 0.18 code to run on 0.20 systems - this is true of virtually all Java libraries On Sat, Nov 13, 2010 at 3:08 PM, Konstantin Boudnik c...@apache.org wrote: As much as I love ranting I can't help but wonder if there were any promises to make 0.21+ be backward compatible with 0.20 ? Just curious? 
On Sat, Nov 13, 2010 at 02:50PM, Steve Lewis wrote: I have a long rant at http://lordjoesoftware.blogspot.com/ on this, but the moral is that there seems to have been a deliberate decision that 0.20 code may not be compatible with later versions - I have NEVER seen a major library so directly abandon backward compatibility On Fri, Nov 12, 2010 at 8:04 AM, Sebastian Schoenherr sebastian.schoenh...@student.uibk.ac.at wrote: Hi Steve, we had a similar problem. We've compiled our code with version 0.21 but included the wrong jars into the classpath (version 0.20.2; NInputFormat.java). It seems that Hadoop changed this class to an interface, maybe you have a similar problem. Hope this helps. Sebastian Quoting Steve Lewis lordjoe2...@gmail.com: Cassandra sees this error with 0.21 of hadoop Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected I see something similar Error: Found interface org.apache.hadoop.mapreduce.TaskInputOutputContext, but class was expected I find this especially puzzling since org.apache.hadoop.mapreduce.TaskInputOutputContext IS a class, not an interface
Re: How to stop a mapper within a map-reduce job when you detect bad input
On Thu, Oct 21, 2010 at 8:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! The equivalent functionality is in org.apache.hadoop.mapreduce.Mapper#run. Cheers Tom ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
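For reference, a rough sketch of what that looks like in the new API - override run() so the loop over the record reader can catch the EOFException thrown by a truncated gzip file (the class name and type parameters are illustrative, not from the original job):

    import java.io.EOFException;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SkipCorruptInputMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      @Override
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
          while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
          }
        } catch (EOFException e) {
          // Truncated or corrupt gzip stream: give up on the rest of this split
          // instead of failing the task.
          System.err.println("Skipping corrupt input split: " + e);
        }
        cleanup(context);
      }
    }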
Re: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell
Hi Mike, What do you get if you type ./hadoop classpath? Does it contain the Hadoop common JAR? To avoid the deprecation warning you should use hadoop fs, not hadoop dfs. Tom On Wed, Sep 15, 2010 at 12:53 PM, Mike Franon kongfra...@gmail.com wrote: Hi, I just setup 3 node hadoop cluster using the latest version from website , 0.21.0 I am able to start all the daemons, when I run jps I see datanode, namenode, secondary, tasktracker, but I was running a test and trying to run the following command: ./hadoop dfs -ls, and I get the following error: DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) Could not find the main class: org.apache.hadoop.fs.FsShell. Program will exit. If i try this command instead: ./hadoop hdfs -ls Exception in thread main java.lang.NoClassDefFoundError: hdfs Caused by: java.lang.ClassNotFoundException: hdfs at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) Could not find the main class: hdfs. Program will exit. Does anyone know what the command really is I should be using? Thanks
Re: Hadoop 0.21.0 release Maven repo
Hi Sonal, The 0.21.0 jars are not available in Maven yet, since the process for publishing them post split has changed. See HDFS-1292 and MAPREDUCE-1929. Cheers, Tom On Fri, Sep 10, 2010 at 1:33 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, Can someone please point me to the Maven repo for 0.21 release? Thanks. Thanks and Regards, Sonal www.meghsoft.com http://in.linkedin.com/in/sonalgoyal
Re: Ivy
The 0.21.0 jars are not in the Apache Maven repos yet, since the process for publishing them post split has changed. HDFS-1292 and MAPREDUCE-1929 are the tickets to fix this. Cheers, Tom On Sat, Aug 28, 2010 at 9:10 PM, Mark static.void@gmail.com wrote: On 8/27/10 9:25 AM, Owen O'Malley wrote: On Aug 27, 2010, at 8:04 AM, Mark wrote: Is there a public ivy repo that has the latest hadoop? Thanks The hadoop jars and poms should be pushed into the central Maven repositories, which Ivy uses. -- Owen I am looking for the latest version 0.21.0 so our team can build Map/Reduce classes against it
[ANNOUNCE] Apache Hadoop 0.21.0 released
Hi everyone, I am pleased to announce that Apache Hadoop 0.21.0 is available for download from http://hadoop.apache.org/common/releases.html. Over 1300 issues have been addressed since 0.20.2; you can find details at http://hadoop.apache.org/common/docs/r0.21.0/releasenotes.html http://hadoop.apache.org/hdfs/docs/r0.21.0/releasenotes.html http://hadoop.apache.org/mapreduce/docs/r0.21.0/releasenotes.html Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2. Thanks to all who contributed to this release! Tom
Re: Implementing S3FileSystem#append
Hi Oleg, I don't know of any plans to implement this. However, since this is a block-based storage system which uses S3, I wonder whether an implementation could use some of the logic in HDFS for block storage and append in general. Cheers, Tom On Thu, Aug 12, 2010 at 8:34 AM, Aleshko, Oleg o.ales...@itransition.com wrote: Hi! Is there any plans on implementing append function for S3 file system? I'm currently considering using it for implementation of resume upload functionality. The other option would be to use EBS, but it looks like an overkill. Thanks, Oleg.
Re: Hadoop 0.21 :: job.getCounters() returns null?
Hi Felix, Aaron Kimball hit the same problem - it's being discussed at https://issues.apache.org/jira/browse/MAPREDUCE-1920. Thanks for reporting this. Cheers, Tom On Tue, Jul 6, 2010 at 11:26 AM, Felix Halim felix.ha...@gmail.com wrote: I tried hadoop 0.21 release candidate. job.waitForCompletion(true); Counters ctrs = job.getCounters(); // here ctrs is null In the previous hadoop version 0.20.2 it worked fine for all times. Is this a bug in 0.21 ? Or i'm missing some settings? Thanks, Felix Halim
Re: Next Release of Hadoop version number and Kerberos
Hi Ananth, The next release of Hadoop will be 0.21.0, but it won't have Kerberos authentication in it (since it's not all in trunk yet). The 0.22.0 release later this year will have a working version of security in it. Cheers, Tom On Wed, Jul 7, 2010 at 8:09 AM, Ananth Sarathy ananth.t.sara...@gmail.com wrote: is the next release of Hadoop going to .21 or .22? I was just wondering, cause I am hearing conflicting things about the next release having Kerberos security but looking through some past emails, hearing that it was coming in .22. Ananth T Sarathy
Re: Cloudera EC2 scripts
Hi Mark, You can find the latest version of the scripts at http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228.tar.gz. Documentation is at http://archive.cloudera.com/docs/ec2.html. The source code is currently in src/contrib/cloud in Hadoop Common, but is in the process of moving to a new Incubator project called Whirr (see http://incubator.apache.org/projects/whirr.html). Cheers, Tom On Thu, May 27, 2010 at 10:11 PM, Mark Kerzner markkerz...@gmail.com wrote: That would be fine, but where is the link to get them On Fri, May 28, 2010 at 12:10 AM, Andrew Nguyen andrew-lists-had...@ucsfcti.org wrote: I didn't have any problems using the scripts that are in CDH3 (beta, March 2010) to bring up and tear down Hadoop cluster instances with EC2. I think there were some differences between the documentation and the actual scripts but it's been a few weeks and I don't have access to my notes right now to see what they were. --Andrew On May 27, 2010, at 9:31 PM, Mark Kerzner wrote: Hi, I was using the beta version of Cloudera scripts from a while back, and I think there is a stable version, but I can't find it. It tells me to go download a Hadoop distribution, and there I can't find cloudera scripts. I do see something there, hadoop-0.18.3/src/contrib/ec2/bin, but it does not look right. Is it me? Thank you, Mark
Re: problem w/ data load
Hi Susanne, Hadoop uses the file extension to detect that a file is compressed. I believe Hive does too. Did you store the compressed file in HDFS with a .gz extension? Cheers, Tom BTW It's best to send Hive questions like these to the hive-user@ list. On Sun, May 2, 2010 at 11:22 AM, Susanne Lehmann susanne.lehm...@metamarketsgroup.com wrote: Hi, I want to load data from HDFS to Hive, the data is in compressed files. The data is stored in flat files, the delimiter is ^A (ctrl-A). As long as I use de-compressed files everything is working fine. Since ctrl-A is the default delimiter I even don't need a specification for it. I do the following: hadoop dfs -put /test/file new hive DROP TABLE test_new; OK Time taken: 0.057 seconds hive CREATE TABLE test_new( bla int, bla string, etc bla string); OK Time taken: 0.035 seconds hive LOAD DATA INPATH /test/file INTO TABLE test_new; Loading data to table test_new OK Time taken: 0.063 seconds But if I do the same with the same file compressed it's not working anymore. I tried tons of different table definitions with the delimiter specified, but it doesn't go. The load itself works, but the data is always NULL, so there is a delimiter problem I conclude. Any help is greatly appreciated!
Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20
Hi Yuanyuan, I think you've found a bug - could you file a JIRA issue for this please? Thanks, Tom On Wed, Apr 28, 2010 at 11:04 PM, Yuanyuan Tian yt...@us.ibm.com wrote: I have a problem in getting the input file name in the mapper when using MultipleInputs. I need to use MultipleInputs to support different formats for the inputs to my MapReduce job. And inside each mapper, I also need to know the exact input file that the mapper is processing. However, conf.get("map.input.file") returns null. Can anybody help me solve this problem? Thanks in advance. public class Test extends Configured implements Tool { static class InnerMapper extends MapReduceBase implements Mapper<Writable, Writable, NullWritable, Text> { public void configure(JobConf conf) { String inputName = conf.get("map.input.file"); ... } } public int run(String[] arg0) throws Exception { JobConf job; job = new JobConf(Test.class); ... MultipleInputs.addInputPath(conf, new Path("A"), TextInputFormat.class); MultipleInputs.addInputPath(conf, new Path("B"), SequenceFileFormat.class); ... } } Yuanyuan
Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20
Hi Yuanyuan, Thanks for filing an issue. To work around the issue could you use a regular FileInputFormat in a set of map-only jobs (which can read the input file names) so you can create a common input for a final MR job? This is admittedly less efficient since it needs more jobs. Cheers, Tom On Thu, Apr 29, 2010 at 10:37 AM, Yuanyuan Tian yt...@us.ibm.com wrote: Hi Tom, I have file a JIRA ticket (MAPREDUCE-1743) for this issue. At the mean time, can you suggest an alternative approach to achieve what I want (supporting different input formats and get the input file name in each mapper)? Yuanyuan Tom White ---04/29/2010 09:42:44 AM---Hi Yuanyuan, I think you've found a bug - could you file a JIRA issue for this please? From: Tom White t...@cloudera.com To: common-user@hadoop.apache.org Date: 04/29/2010 09:42 AM Subject: Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20 Hi Yuanyuan, I think you've found a bug - could you file a JIRA issue for this please? Thanks, Tom On Wed, Apr 28, 2010 at 11:04 PM, Yuanyuan Tian yt...@us.ibm.com wrote: I have a problem in getting the input file name in the mapper when uisng MultipleInputs. I need to use MultipleInputs to support different formats for my inputs to the my MapReduce job. And inside each mapper, I also need to know the exact input file that the mapper is processing. However, conf.get(map.input.file) returns null. Can anybody help me solve this problem? Thanks in advance. public class Test extends Configured implements Tool{ static class InnerMapper extends MapReduceBase implements MapperWritable, Writable, NullWritable, Text { public void configure(JobConf conf) { String inputName=conf.get(map.input.file)); ... } } public int run(String[] arg0) throws Exception { JonConf job; job = new JobConf(Test.class); ... MultipleInputs.addInputPath(conf, new Path(A), TextInputFormat.class); MultipleInputs.addInputPath(conf, new Path(B), SequenceFileFormat.class); ... } } Yuanyuan
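To illustrate the workaround: one map-only job per input format, each using a plain FileInputFormat so that map.input.file is populated, with every record tagged with its source path before the final MR job. The class and field names below are illustrative, not from the thread:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Map-only job over a single input format: map.input.file is set here,
    // so each record can carry its source file into the common intermediate data.
    public class TagWithFilenameMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

      private String inputFile;

      public void configure(JobConf conf) {
        inputFile = conf.get("map.input.file");
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, NullWritable> output, Reporter reporter)
          throws IOException {
        // Prefix each record with its source file name (an illustrative tagging scheme)
        output.collect(new Text(inputFile + "\t" + value.toString()), NullWritable.get());
      }
    }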
Re: File permissions on S3FileSystem
Hi Danny, S3FileSystem has no concept of permissions, which is why this check fails. The change that introduced the permissions check was introduced in https://issues.apache.org/jira/browse/MAPREDUCE-181. Could you file a bug for this please? Cheers, Tom On Thu, Apr 22, 2010 at 4:16 AM, Danny Leshem dles...@gmail.com wrote: Hello, I'm running a Hadoop cluster using 3 small Amazon EC2 machines and the S3FileSystem. Till lately I've been using 0.20.2 and everything was ok. Now I'm using the latest trunc 0.22.0-SNAPSHOT and getting the following thrown: Exception in thread main java.io.IOException: The ownership/permissions on the staging directory s3://my-s3-bucket/mnt/hadoop.tmp.dir/mapred/staging/root/.staging is not as expected. It is owned by and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx-- at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:107) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:312) at org.apache.hadoop.mapreduce.Job.submit(Job.java:961) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:977) at com.mycompany.MyJob.runJob(MyJob.java:153) at com.mycompany.MyJob.run(MyJob.java:177) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.mycompany.MyOtherJob.runJob(MyOtherJob.java:62) at com.mycompany.MyOtherJob.run(MyOtherJob.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.mycompany.MyOtherJob.main(MyOtherJob.java:117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:187) (The it is owned by ... and permissions is not a mistake, seems like the empty string is printed there) My configuration is as follows: core-site: fs.default.name=s3://my-s3-bucket fs.s3.awsAccessKeyId=[key id omitted] fs.s3.awsSecretAccessKey=[secret key omitted] hadoop.tmp.dir=/mnt/hadoop.tmp.dir hdfs-site: empty mapred-site: mapred.job.tracker=[domU-XX-XX-XX-XX-XX-XX.compute-1.internal:9001] mapred.map.tasks=6 mapred.reduce.tasks=6 Any help would be appreciated... Best, Danny
Re: JobConf.setJobEndNotificationURI
I think you can set the URI on the configuration object with the key JobContext.END_NOTIFICATION_URL. Cheers, Tom On Tue, Feb 23, 2010 at 12:02 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, I am looking for counterpart to JobConf.setJobEndNotificationURI() in org.apache.hadoop.mapreduce Please advise. Thanks
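A minimal sketch of what that would look like, assuming the JobContext.END_NOTIFICATION_URL constant mentioned above exists in your version; the URL is illustrative, and $jobId/$jobStatus are the substitution variables documented for the old JobConf.setJobEndNotificationURI():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;

    public class NotifyingJobSetup {
      public static Job createJob() throws Exception {
        Job job = new Job(new Configuration(), "my job");  // placeholder job name
        // Set the end-of-job notification URL on the underlying configuration;
        // the framework substitutes $jobId and $jobStatus when it calls back.
        job.getConfiguration().set(JobContext.END_NOTIFICATION_URL,
            "http://example.com/jobdone?id=$jobId&status=$jobStatus");
        return job;
      }
    }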
Re: Cloudera AMIs
Hi Sonal, You should use the one with the later date. The Cloudera AMIs don't actually have Hadoop installed on them, just Java and some other base packages. Hadoop is installed at start up time; you can find more information at http://archive.cloudera.com/docs/ec2.html. Cheers, Tom P.S. For Cloudera-specific questions please consider using the Cloudera forum at http://getsatisfaction.com/cloudera On Sun, Mar 14, 2010 at 7:03 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I want to know which Cloudera AMI supports which Hadoop version. For example, ami-2932d440:cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090602-i386.manifest.xml ami-ed59bf84: cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-i386.manifest.xml Whats the difference between the two? Which Hadoop version do they support? I need to use the 0.20+ release. Thanks and Regards, Sonal
Re: Is it possible to share a key across maps?
Please submit a patch for the documentation change - perhaps at https://issues.apache.org/jira/browse/HADOOP-5973. Cheers, Tom On Wed, Jan 13, 2010 at 12:09 AM, Amogh Vasekar am...@yahoo-inc.com wrote: +1 for the documentation change in mapred-tutorial. Can we do that and publish using a normal apache account? Thanks, Amogh On 1/13/10 2:29 AM, Raymond Jennings III raymondj...@yahoo.com wrote: Amogh, You bet it helps! Thanks! Sometimes it's very difficult to map between the old and the new APIs. I was digging for that answer for awhile. Thanks. --- On Tue, 1/12/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: Is it possible to share a key across maps? To: common-user@hadoop.apache.org common-user@hadoop.apache.org, raymondj...@yahoo.com raymondj...@yahoo.com, core-u...@hadoop.apache.org core-u...@hadoop.apache.org Date: Tuesday, January 12, 2010, 3:32 PM Re: Is it possible to share a key across maps? (Sorry for the spam if any, mails are bouncing back for me) Hi, In setup() use this, FileSplit split = (FileSplit)context.getInputSplit(); split.getPath() will return you the Path. Hope this helps. Amogh On 1/13/10 1:25 AM, Raymond Jennings III raymondj...@yahoo.com wrote: Hi Gang, I was able to use this on an older version that uses the JobClient class to run the job but not on the newer api with the Job class. The Job class appears to use a setup() method instead of a configure() method but the map.input.file attribute does not appear to be available via the conf class the setup() method. Have you tried to do what you described using the newer api? Thank you. --- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn wrote:
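Putting Amogh's snippet into context, a minimal new-API mapper that grabs the input path in setup(); the class name and type parameters are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class PathAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Path inputPath;

      @Override
      protected void setup(Context context) {
        // New-API replacement for conf.get("map.input.file")
        FileSplit split = (FileSplit) context.getInputSplit();
        inputPath = split.getPath();
      }
      // map() can then use inputPath as needed
    }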
Re: Implementing VectorWritable
Have a look at org.apache.hadoop.io.ArrayWritable. You may be able to use this class in your application, or at least use it as a basis for writing VectorWritable. Cheers, Tom On Tue, Dec 29, 2009 at 1:37 AM, bharath v bharathvissapragada1...@gmail.com wrote: Can you please tell me, what is the functionality of those 2 methods (how should I implement the same in this VectorWritable)? Thanks On Tue, Dec 29, 2009 at 11:25 AM, Jeff Zhang zjf...@gmail.com wrote: The readFields and write method is empty? When data is transferred from map phase to reduce phase, data is serialized and deserialized, so the write and readFields will be called. You should not leave them empty. Jeff Zhang On Tue, Dec 29, 2009 at 1:29 PM, bharath v bharathvissapragada1...@gmail.com wrote: Hi, I've implemented a simple VectorWritable class as follows package com; import org.apache.hadoop.*; import org.apache.hadoop.io.*; import java.io.*; import java.util.Vector; public class VectorWritable implements WritableComparable { private Vector<String> value = new Vector<String>(); public VectorWritable() {} public VectorWritable(Vector<String> value) { set(value); } public void set(Vector<String> val) { this.value = val; } public Vector<String> get() { return this.value; } public void readFields(DataInput in) throws IOException { //value = in.readInt(); } public void write(DataOutput out) throws IOException { // out.writeInt(value); } public boolean equals(Object o) { if (!(o instanceof VectorWritable)) return false; VectorWritable other = (VectorWritable)o; return this.value.equals(other.value); } public int hashCode() { return value.hashCode(); } public int compareTo(Object o) { Vector thisValue = this.value; Vector thatValue = ((VectorWritable)o).value; return (thisValue.size() < thatValue.size() ? -1 : (thisValue.size() == thatValue.size() ? 0 : 1)); } public String toString() { return value.toString(); } public static class Comparator extends WritableComparator { public Comparator() { super(VectorWritable.class); } public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) { int thisValue = readInt(b1, s1); int thatValue = readInt(b2, s2); return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1)); } } static { // register this comparator WritableComparator.define(VectorWritable.class, new Comparator()); } } The map phase is outputting correct Text, VectorWritable pairs, but in the reduce phase when I iterate over the values Iterable I am getting the size of the vector to be 0. I think there is a minor mistakeak in my VectorWritable implementation. Can anyone point it out? Thanks
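For the two empty methods, a minimal sketch of one way to serialize a Vector<String>, using Text's string helpers; this encoding is illustrative, not the only option. Note that the raw-byte Comparator registered in the class above assumes fixed-size int keys and would also need rethinking for a variable-length format like this:

      public void write(DataOutput out) throws IOException {
        out.writeInt(value.size());            // element count first
        for (String s : value) {
          Text.writeString(out, s);            // length-prefixed UTF-8 string
        }
      }

      public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        value = new Vector<String>(size);
        for (int i = 0; i < size; i++) {
          value.add(Text.readString(in));
        }
      }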
Re: Configuration for Hadoop running on Amazon S3
If you are using S3 as your file store then you don't need to run HDFS (and indeed HDFS will not start up if you try). Cheers, Tom 2009/12/17 Rekha Joshi rekha...@yahoo-inc.com: Not sure what the whole error is, but you can always alternatively try this - <property> <name>fs.default.name</name> <value>s3://BUCKET</value> </property> <property> <name>fs.s3.awsAccessKeyId</name> <value>ID</value> </property> <property> <name>fs.s3.awsSecretAccessKey</name> <value>SECRET</value> </property> And I am not sure what is the base hadoop version on S3, but possibly if the S3 wiki is correct try updating conf/hadoop-site.xml Cheers, /R On 12/18/09 10:23 AM, 松柳 lamfeeli...@gmail.com wrote: Hi all, I tried to run my hadoop program on S3 by following this wiki page: http://wiki.apache.org/hadoop/AmazonS3 I configured the core-site.xml by adding <property> <name>fs.default.name</name> <value>s3://ID:sec...@bucket</value> </property> and I specified the accesskey and secretkey by using the URI format: s3://ID:sec...@bucket however, it fails and the datanodes report: NumberFormatException ... Is this the right way to config hadoop running on s3? If so, what's the problem? Regards Song
Re: Master and slaves on hadoop/ec2
Correct. The master runs the namenode and jobtracker, but not a datanode or tasktracker. Tom On Tue, Nov 24, 2009 at 4:57 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, do I understand it correctly that, when I launch a Hadoop cluster on EC2, the master will not be doing any work, and it is just for organizing work, while the slaves will be actual workers? Thank you, Mark
Re: How do I reference S3 from an EC2 Hadoop cluster?
On Tue, Nov 24, 2009 at 9:27 PM, Mark Kerzner markkerz...@gmail.com wrote: Yes, Tom, I saw all these problems. I think that I should stop trying to imitate EMR - that's where the storing data on S3 appeared, and transfer data directly to the Hadoop cluster. Then I will be using all as intended. Is there a way to scp directly to the HDFS, or do I need to scp to local storage on some machine, and then - to HDFS? distcp is the appropriate tool for this. There is some guidance on http://wiki.apache.org/hadoop/AmazonS3. Also, is there a way to make the master a bigger instance than that of the slaves? No, this is not supported, but I can see it would be useful, particularly for larger clusters. Please consider opening a JIRA for it. Cheers, Tom Thank you, Mark On Tue, Nov 24, 2009 at 11:20 PM, Tom White t...@cloudera.com wrote: Mark, If the data was transferred to S3 outside of Hadoop then you should use the s3n filesystem scheme (see the explanation on http://wiki.apache.org/hadoop/AmazonS3 for the differences between the Hadoop S3 filesystems). Also, some people have had problems embedding the secret key in the URI, so you can set it in the configuration as follows: <property> <name>fs.s3n.awsAccessKeyId</name> <value>ID</value> </property> <property> <name>fs.s3n.awsSecretAccessKey</name> <value>SECRET</value> </property> Then use a URI of the form s3n://BUCKET/path/to/logs Cheers, Tom On Tue, Nov 24, 2009 at 5:47 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I need to copy data from S3 to HDFS. This instruction bin/hadoop distcp s3://ID:SECRET@BUCKET/path/to/logs logs does not seem to work. Thank you.
Re: How do I reference S3 from an EC2 Hadoop cluster?
Mark, If the data was transferred to S3 outside of Hadoop then you should use the s3n filesystem scheme (see the explanation on http://wiki.apache.org/hadoop/AmazonS3 for the differences between the Hadoop S3 filesystems). Also, some people have had problems embedding the secret key in the URI, so you can set it in the configuration as follows: <property> <name>fs.s3n.awsAccessKeyId</name> <value>ID</value> </property> <property> <name>fs.s3n.awsSecretAccessKey</name> <value>SECRET</value> </property> Then use a URI of the form s3n://BUCKET/path/to/logs Cheers, Tom On Tue, Nov 24, 2009 at 5:47 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I need to copy data from S3 to HDFS. This instruction bin/hadoop distcp s3://ID:SECRET@BUCKET/path/to/logs logs does not seem to work. Thank you.
Re: Apache Hadoop and Fedora, or Clouder Hadoop and Ubuntu?
Hi Mark, HADOOP-6108 will add Cloudera's EC2 scripts to the Apache distribution, with the difference that they will run Apache Hadoop. The same scripts will also support Cloudera's Distribution for Hadoop, simply by using a different boot script on the instances. So I would suggest you use these scripts since they are more flexible than the existing bash-based ones in Apache (e.g. they also support EBS), and are likely to have more features added, and support more cloud providers over time. Hope this helps. Tom On Sun, Nov 15, 2009 at 7:31 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, guys, sorry for kind of making you do my work, but I have a conundrum. I have been developing on Ubuntu, and preferred to run the same Ubuntu Linux on EC2, and indeed, that is what Amazon Elastic MR was giving me. But now I am running my own cluster on EC2, and Apache Hadoop images are all on Fedora. I have already figured out the scripts and it all works - except that I have not tested on Fedora, and I do use Linux packages. Alternatively, I could run on Cloudera's Hadoop, and they have Ubuntu. But, I would probably to switch to their distribution in my code, and learn their startup scripts. Which way is better? Thank you, Mark
Re: Apache Hadoop and Fedora, or Clouder Hadoop and Ubuntu?
On Sun, Nov 15, 2009 at 8:39 PM, Mark Kerzner markkerz...@gmail.com wrote: Tom, do I understand correctly that with these scripts I can use the Apache Hadoop configuration as I am used to, and run and EC2 image that contains Cloudera Hadoop distribution? Yes, you can run Apache Hadoop with your existing configuration. PS. I could not download them from here, http://issues.apache.org/jira/secure/attachment/12422889/HADOOP-6108.patch, was getting, too many open files error. I think this may be a transient problem (if it recurs you can report it to in...@apache.org). Thank you, Mark On Sun, Nov 15, 2009 at 10:29 PM, Tom White t...@cloudera.com wrote: Hi Mark, HADOOP-6108 will add Cloudera's EC2 scripts to the Apache distribution, with the difference that they will run Apache Hadoop. The same scripts will also support Cloudera's Distribution for Hadoop, simply by using a different boot script on the instances. So I would suggest you use these scripts since they are more flexible than the existing bash-based ones in Apache (e.g. they also support EBS), and are likely to have more features added, and support more cloud providers over time. Hope this helps. Tom On Sun, Nov 15, 2009 at 7:31 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, guys, sorry for kind of making you do my work, but I have a conundrum. I have been developing on Ubuntu, and preferred to run the same Ubuntu Linux on EC2, and indeed, that is what Amazon Elastic MR was giving me. But now I am running my own cluster on EC2, and Apache Hadoop images are all on Fedora. I have already figured out the scripts and it all works - except that I have not tested on Fedora, and I do use Linux packages. Alternatively, I could run on Cloudera's Hadoop, and they have Ubuntu. But, I would probably to switch to their distribution in my code, and learn their startup scripts. Which way is better? Thank you, Mark
Re: Confused by new API MultipleOutputFormats using Hadoop 0.20.1
Multiple outputs has been ported to the new API in 0.21. See https://issues.apache.org/jira/browse/MAPREDUCE-370. Cheers, Tom On Sat, Nov 7, 2009 at 6:45 AM, Xiance SI(司宪策) adam...@gmail.com wrote: I just fall back to old mapred.* APIs, seems MultipleOutputs only works for the old API. wishes, Xiance On Mon, Nov 2, 2009 at 9:12 AM, Paul Smith psm...@aconex.com wrote: Totally stuck here, I can't seem to find a way to resolve this, but I can't use the new API _and_ use the MultipleOutputFormats class. I found this thread which is related, but doesn't seem to help me (or I missed something completely, certainly possible): http://markmail.org/message/u4wz5nbcn5rawydq#query:hadoop%20MultipleTextOutputFormat%20OutputFormat%20Job%20JobConf+page:1+mid:5wy63oqa2vs6bj7b+state:results My controller Job class is simple, but I get a compile error trying to add the new MultipleOutputs: public class ControllerMetricGrinder { public static class MetricNameMultipleTextOutputFormat extends MultipleTextOutputFormat<String, ControllerMetric> { @Override protected String generateFileNameForKeyValue(String key, ControllerMetric value, String name) { return key; } } public static void main(String[] args) throws Exception { Job job = new Job(); job.setJarByClass(ControllerMetricGrinder.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(ControllerMetric.class); job.setMapperClass(ControllerMetricMapper.class); job.setCombinerClass(ControllerMetricReducer.class); job.setReducerClass(ControllerMetricReducer.class); // COMPILE ERROR HERE MultipleOutputs.addMultiNamedOutput(job, "metrics", MetricNameMultipleTextOutputFormat.class, Text.class, ControllerMetric.class); job.setNumReduceTasks(5); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } (mappers and reducers are using the new API, and are in separate classes). MultipleOutputs doesn't take a Job, it only takes a JobConf. Any ideas? I'd prefer to use the new API (because I've written it that way), but I'm guessing now I'll have to go and rework everything to the OLD API to get this to work. I'm trying to create a File-per-metric name (there's only 5). thoughts? Paul
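For reference, a rough sketch of the 0.21 new-API MultipleOutputs usage. ControllerMetric is the class from the thread and is assumed to be a Writable; the named output, reducer name, and base-path scheme are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Driver side (new API):
    //   MultipleOutputs.addNamedOutput(job, "metrics",
    //       TextOutputFormat.class, Text.class, ControllerMetric.class);

    public class MetricSplittingReducer
        extends Reducer<Text, ControllerMetric, Text, ControllerMetric> {
      private MultipleOutputs<Text, ControllerMetric> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<Text, ControllerMetric>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<ControllerMetric> values, Context context)
          throws IOException, InterruptedException {
        for (ControllerMetric metric : values) {
          // The fourth argument is a base output path, so each metric name gets its own file
          mos.write("metrics", key, metric, key.toString());
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
      }
    }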
Re: Multiple Input Paths
MultipleInputs is available from Hadoop 0.19 onwards (in org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input for the new API in later versions). Tom On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant mark.vige...@riskmetrics.com wrote: Amogh, That sounds so awesome! Yeah I wish I had that class now. Do you have any tips on how to create such a delegating class? The best I can come up with is to just submit both files to the mapper using multiple input paths and then having anif statement at the beginning of the map that checks which file it's dealing with but I'm skeptical that I can even make that work... Is there a way you know of that I could submit 2 mapper classes to the job? -Original Message- From: Amogh Vasekar [mailto:am...@yahoo-inc.com] Sent: Wednesday, November 04, 2009 1:50 AM To: common-user@hadoop.apache.org Subject: Re: Multiple Input Paths Hi Mark, A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs. This would allow you to have a different inputformat, mapper depending on the path you are getting the split from. It uses special Delegating[mapper/input] classes to resolve this. I understand backporting this is more or less out of question, but the ideas there might provide pointers to help you solve your current problem. Just a thought :) Amogh On 11/3/09 8:44 PM, Mark Vigeant mark.vige...@riskmetrics.com wrote: Hey Vipul No I haven't concatenated my files yet, and I was just thinking over how to approach the issue of multiple input paths. I actually did what Amandeep hinted at which was we wrote our own XMLInputFormat and XMLRecordReader. When configuring the job in my driver I set job.setInputFormatClass(XMLFileInputFormat.class) and what it does is send chunks of XML to the mapper as opposed to lines of text or whole files. So I specified the Line Delimiter in the XMLRecordReader (ie startTag) and everything in between the tags startTag and /startTag are sent to the mapper. Inside the map function is where to parse the data and write it to the table. What I have to do now is just figure out how to set the Line Delimiter to be something common in both XML files I'm reading. Currently I have 2 mapper classes and thus 2 submitted jobs which is really inefficient and time consuming. Make sense at all? Sorry if it doesn't, feel free to ask more questions Mark -Original Message- From: Vipul Sharma [mailto:sharmavi...@gmail.com] Sent: Monday, November 02, 2009 7:48 PM To: common-user@hadoop.apache.org Subject: RE: Multiple Input Paths Mark, were you able to concatenate both the xml files together. What did you do to keep the resulting xml well forned? Regards, Vipul Sharma, Cell: 281-217-0761
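A short sketch of the MultipleInputs call in the new API; the paths, input formats, and mapper classes are placeholders for the two inputs discussed above:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class TwoFormatJobDriver {
      public static void configureInputs(Job job) {
        // Each path gets its own input format and its own mapper class,
        // so no "which file am I in?" check is needed inside a single map() method.
        MultipleInputs.addInputPath(job, new Path("inputA"),
            TextInputFormat.class, FirstFormatMapper.class);
        MultipleInputs.addInputPath(job, new Path("inputB"),
            SequenceFileInputFormat.class, SecondFormatMapper.class);
      }
    }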
Re: Terminate Instances Terminating ALL EC2 Instances
Hi Mark, Sorry to hear that all your EC2 instances were terminated. Needless to say, this should certainly not happen. The scripts are a Python rewrite (see HADOOP-6108) of the bash ones so HADOOP-1504 is not applicable, but the behaviour should be the same: the terminate-cluster command lists the instances that it will terminate, and prompts for confirmation that they should be terminated. Is it listing instances that are not in the cluster? I have used this script a lot and it has never terminated any instances that are not in the cluster. What are the names of the security groups that the instances are in (both those in the cluster, and those outside the cluster that are inadvertently terminated)? Thanks, Tom On Mon, Oct 19, 2009 at 4:41 PM, Mark Stetzer stet...@gmail.com wrote: Hey all, While running the (latest as of Friday) Cloudera-created EC2 scripts, I noticed that running the terminate-cluster script kills ALL of your EC2 nodes, not just those associated with the cluster. This has been documented before in HADOOP-1504 (http://issues.apache.org/jira/browse/HADOOP-1504), and a fix was integrated way back on June 21, 2007. My questions are: 1) Is anyone else seeing this? I can reproduce this behavior consistently. AND 2) Is this a regression in the common code, a problem with the Cloudera scripts, or just user error on my part? Just trying to get to the bottom of this so no one else has to see all of their EC2 instances die accidentally :( Thanks! -Mark
Re: Terminate Instances Terminating ALL EC2 Instances
On Mon, Oct 19, 2009 at 5:34 PM, Mark Stetzer stet...@gmail.com wrote: Hi Tom, The terminate-cluster script only lists the instances that are part of the cluster (master and all slaves) as far as I can tell. As an example, I set up a cluster of 1 master and 5 slaves, then started an additional non-Hadoop server via the AWS mgmt. console running a completely different AMI (OpenSolaris 2009.06 just to be very different). terminate-cluster only listed the 6 instances that were part of the cluster if I remember correctly. I have 4 security groups: default, default-master, default-slave, and mark-default. mark-default wasn't even added until after I started the Hadoop cluster; I added it to log in to the OpenSolaris instance. I think there is a bug here. I've filed https://issues.apache.org/jira/browse/HADOOP-6320. As an immediate workaround you can avoid calling the Hadoop cluster default, and make sure that you don't create non-Hadoop EC2 instances in the cluster group. Thanks, Tom Does this help at all? Thanks. -Mark On Mon, Oct 19, 2009 at 11:52 AM, Tom White t...@cloudera.com wrote: Hi Mark, Sorry to hear that all your EC2 instances were terminated. Needless to say, this should certainly not happen. The scripts are a Python rewrite (see HADOOP-6108) of the bash ones so HADOOP-1504 is not applicable, but the behaviour should be the same: the terminate-cluster command lists the instances that it will terminate, and prompts for confirmation that they should be terminated. Is it listing instances that are not in the cluster? I have used this script a lot and it has never terminated any instances that are not in the cluster. What are the names of the security groups that the instances are in (both those in the cluster, and those outside the cluster that are inadvertently terminated)? Thanks, Tom On Mon, Oct 19, 2009 at 4:41 PM, Mark Stetzer stet...@gmail.com wrote: Hey all, While running the (latest as of Friday) Cloudera-created EC2 scripts, I noticed that running the terminate-cluster script kills ALL of your EC2 nodes, not just those associated with the cluster. This has been documented before in HADOOP-1504 (http://issues.apache.org/jira/browse/HADOOP-1504), and a fix was integrated way back on June 21, 2007. My questions are: 1) Is anyone else seeing this? I can reproduce this behavior consistently. AND 2) Is this a regression in the common code, a problem with the Cloudera scripts, or just user error on my part? Just trying to get to the bottom of this so no one else has to see all of their EC2 instances die accidentally :( Thanks! -Mark
Re: JobTracker startup failure when starting hadoop-0.20.0 cluster on Amazon EC2 with contrib/ec2 scripts
Hi Jeyendran, Were there any errors reported in the datanode logs? There could be a problem with datanodes contacting the namenode, caused by firewall configuration problems (EC2 security groups). Cheers, Tom On Fri, Sep 4, 2009 at 12:17 AM, Jeyendran Balakrishnanjbalakrish...@docomolabs-usa.com wrote: I downloaded Hadoop 0.20.0 and used the src/contrib/ec2/bin scripts to launch a Hadoop cluster on Amazon EC2, after building a new Hadoop 0.20.0 AMI. I launched an instance with my new Hadoop 0.20.0 AMI, then logged in and ran the following to launch a new cluster: root(/vol/hadoop-0.20.0) bin/launch-hadoop-cluster hadoop-test 2 After the usual EC2 wait, one master and two slave instances were launched on EC2, as expected. When I ssh'ed into the instances, here is what I found: Slaves: DataNode and NameNode are running Master: Only NameNode is running I could use HDFS commands (using $HADOOP_HOME/bin/hadoop scripts) without any problems, from both master and slaves. However, since JobTracker is not running, I cannot run map-reduce jobs. I checked the logs from /vol/hadoop-0.20.0/logs for the JobTracker, reproduced below: 2009-09-03 18:55:38,486 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and h dfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively 2009-09-03 18:55:38,520 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: / STARTUP_MSG: Starting JobTracker STARTUP_MSG: host = domU-12-31-39-06-44-E3.compute-1.internal/10.208.75.17 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009 / 2009-09-03 18:55:38,652 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=50002 2009-09-03 18:55:38,703 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2009-09-03 18:55:38,827 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030 2009-09-03 18:55:38,827 INFO org.mortbay.log: jetty-6.1.14 2009-09-03 18:55:48,425 INFO org.mortbay.log: Started selectchannelconnec...@0.0.0.0:50030 2009-09-03 18:55:48,427 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 2009-09-03 18:55:48,432 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 50002 2009-09-03 18:55:48,432 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030 2009-09-03 18:55:48,541 INFO org.apache.hadoop.mapred.JobTracker: Cleaning up the system directory 2009-09-03 18:55:48,628 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/hadoop/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(F SNamesystem.java:1256) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:4 22) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav a:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor Impl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:739) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy4.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav a:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor Impl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvo cationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocation Handler.java:59) at $Proxy4.addBlock(Unknown Source) at
Re: Can't find TestDFSIO
Hi Cam, Looks like it's in hadoop-hdfs-hdfswithmr-test-0.21.0-dev.jar, which should be built with ant jar-test. Cheers, Tom On Mon, Aug 24, 2009 at 8:22 PM, Cam Macdonellc...@cs.ualberta.ca wrote: Thanks Danny, It currently does not show up hadoop-common-test, hadoop-hdfs-test or hadoop-mapred-test with 0.21-dev. So either it has been a victim of the project split or I didn't specify the right target for Ant. Cam Gross, Danny wrote: Hi Cam, For what it's worth, in 19.1, I see TestDFSIO in the hadoop-0.19.1-test.jar. Best regards, Danny -Original Message- From: Cam Macdonell [mailto:c...@cs.ualberta.ca] Sent: Monday, August 24, 2009 12:00 PM To: common-user@hadoop.apache.org Subject: Can't find TestDFSIO Hi, I'm trying to run the TestDFSIO benchmark that is mentioned in the hadoop o'reilly book. However, I can't find it in any of the jars (common, mapred or hdfs). For example, I presume it would be under hdfs, but the only mentioned test is 'dfsthroughput'. $ ./bin/hadoop jar /home/cam/research/SVN/hadoop/lib/hadoop-hdfs-test-0.21.0-dev.jar An example program must be given as the first argument. Valid program names are: dfsthroughput: measure hdfs throughput Has the name of TestDFSIO changed or am I looking in the wrong place? Any tips or pointers are appreciated, Cam
Re: File Chunk to Map Thread Association
Hi Roman, Have a look at CombineFileInputFormat - it might be related to what you are trying to do. Cheers, Tom On Thu, Aug 20, 2009 at 10:59 AM, roman kolcunroman.w...@gmail.com wrote: On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun roman.w...@gmail.com wrote: Hello Harish, I know that TaskTracker creates separate threads (up to mapred.tasktracker.map.tasks.maximum) which execute the map() function. However, I haven't found the piece of code which associate FileSplit with the given map thread. Is it downloaded locally in the TaskTracker function or in MapTask? Yes this is done by the MapTask. Thanks, I will have a better look into it. I know I can increase the input file size by changing 'mapred.min.split.size' , however, the file is split sequentially and very rarely two consecutive HDFS blocks are stored on a single node. This means that the data locality will not be exploited cause every map() will have to download part of the file from network. Roman Kolcun I see what you mean - you want to modify the hadoop code to allocate multiple (non-sequential) data-local blocks to one MapTask. That's exactly what I want to do. I don't know if you'll achieve much by doing all that work. Basically I would like to emulate larger DFS blocksize. I've performed 2 word count benchmarks on a cluster of 10 machines with 100GB file. With 64MB blocksize it took 2035 seconds, when I've increased it to 256MB it took 1694 seconds - which is 16.76% increase. Hadoop lets you reuse the launched JVMs for multiple MapTasks. That should minimize the overhead of launching MapTasks. Increasing the DFS blocksize for the input files is another means to achieve the same effect. Do you think that this could be eliminated by reusing JVMs? I am doing it as a project for my university degree so I really hope it will lower the processing time significantly. I would like to make it general for different block sizes. Thank you for your help. Roman Kolcun
Re: MapFile performance
On Mon, Aug 3, 2009 at 3:09 AM, Billy Pearson billy_pear...@sbcglobal.net wrote: Not sure if it's still there, but there was a param in the hadoop-site conf file that would allow you to skip x number of index entries when reading it into memory. This is io.map.index.skip (default 0), which will skip this number of keys for every key in the index. For example, if set to 2, one third of the keys will end up in memory. From what I understand we can find the key offset just before the data and seek once and read until we find the key. Billy - Original Message - From: Andy Liu andyliu1227-re5jqeeqqe8avxtiumw...@public.gmane.org Newsgroups: gmane.comp.jakarta.lucene.hadoop.user To: core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org Sent: Tuesday, July 28, 2009 7:53 AM Subject: MapFile performance I have a bunch of Map/Reduce jobs that process documents and write the results out to a few MapFiles. These MapFiles are subsequently searched in an interactive application. One problem I'm running into is that if the values in the MapFile data file are fairly large, lookup can be slow. This is because the MapFile index only stores every 128th key by default (io.map.index.interval), and after the binary search it may have to scan/skip through up to 127 values (off of disk) before it finds the matching record. I've tried io.map.index.interval = 1, which brings average get() times from 1200ms to 200ms, but at the cost of memory during runtime, which is undesirable. One possible solution is to have the MapFile index store every single key/offset pair. Then MapFile.Reader, upon startup, would read every 128th key into memory. MapFile.Reader.get() would behave the same way, except instead of seeking through the values SequenceFile it would seek through the index SequenceFile until it finds the matching record, and then it can seek to the corresponding offset in the values. I'm going off the assumption that it's much faster to scan through the index (small keys) than it is to scan through the values (large values). Or maybe the index can be some kind of disk-based btree or bdb-like implementation? Has anybody encountered this problem before? Andy
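For completeness, a minimal sketch of using io.map.index.skip on the reader side is shown below. It assumes the MapFile.Reader API from this era; the path and the Text key/value types are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep only one of every three index keys in memory (skip = 2), trading
    // some lookup speed for a smaller in-memory index.
    conf.setInt("io.map.index.skip", 2);
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/docs.map", conf);
    try {
      Text value = new Text();
      // get() returns null if the key is not present in the MapFile.
      if (reader.get(new Text("some-key"), value) != null) {
        System.out.println(value);
      }
    } finally {
      reader.close();
    }
  }
}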
Re: Status of 0.19.2
I've now updated the news section, and the documentation on the website to reflect the 0.19.2 release. There were several reports of it being more stable than 0.19.1 in the voting thread: http://www.mail-archive.com/common-...@hadoop.apache.org/msg00051.html Cheers, Tom On Tue, Jul 28, 2009 at 12:37 PM, Tamir Kamara tamirkam...@gmail.com wrote: Hi, I've seen that the 0.19.2 version was added recently to the downloads but there's no entry under the news section. Is it stable enough for deployment? Thanks, Tamir
Re: Reading GZIP input files.
That's for the case where you want to do the decompression yourself, explicitly, perhaps when you are reading the data out of HDFS (and not using MapReduce). When using compressed data as input to a MapReduce job, Hadoop will automatically decompress the files for you. Tom On Fri, Jul 31, 2009 at 5:34 PM, David Been daveb...@gmail.com wrote: I'm new, reading Tom White's book, but there is an example using: CompressionCodecFactory factory = new CompressionCodecFactory(conf); CompressionCodec codec = factory.getCodec(inputPath); // infers from file ext. InputStream in = codec.createInputStream(fs.open(inputPath)); On Fri, Jul 31, 2009 at 8:01 AM, prashant ullegaddi prashullega...@gmail.com wrote: Hi guys, I have a set of 1000 gzipped plain text files. How to read them in Hadoop? Is there any built-in class available for it? Btw, I'm using hadoop-0.18.3. Regards, Prashant.
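To make the MapReduce case concrete, here is a minimal sketch of a job that reads gzipped text files with no explicit decompression code. It uses the old 0.18-era API to match the version mentioned above, relies on the default identity mapper and reducer, and the input/output paths are made up for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class GzipInputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(GzipInputJob.class);
    conf.setJobName("gzip-input");
    // TextInputFormat notices the .gz extension and decompresses each file
    // transparently. Note that gzip files are not splittable, so each file
    // is processed by a single map task.
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.addInputPath(conf, new Path("/input/gzipped"));
    FileOutputFormat.setOutputPath(conf, new Path("/output/decompressed"));
    // No mapper or reducer set: the identity defaults simply pass the
    // decompressed (offset, line) records through to the output.
    JobClient.runJob(conf);
  }
}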
Re: Using JobControl in hadoop
Hi Raakhi, JobControl is designed to be run from a new thread:

Thread t = new Thread(jobControl);
t.start();

Then you can run a loop to poll for job completion and print out status:

String oldStatus = null;
while (!jobControl.allFinished()) {
  String status = getStatusString(jobControl);
  if (!status.equals(oldStatus)) {
    System.out.println(status);
    oldStatus = status;
  }
  try {
    Thread.sleep(1000);
  } catch (InterruptedException e) {
    // ignore
  }
}

Hope this helps. Tom On Fri, Jul 17, 2009 at 9:10 AM, Rakhi Khatwani rakhi.khatw...@gmail.com wrote: Hi, I was trying out a map-reduce example using JobControl. I create a JobConf conf1 object and add the necessary information, then I create a job object: Job job1 = new Job(conf1); and then I declare the JobControl object as follows: JobControl jobControl = new JobControl("JobControl1"); jobControl.addJob(job1); jobControl.run(); When I execute it in the console, I get the following output: 09/07/17 13:10:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 09/07/17 13:10:16 INFO mapred.FileInputFormat: Total input paths to process : 4 and there is no other output, but from the UI I can see that the job has been executed. Is there any way I can direct the output to the console? Or is there a way in which, while the job is running, I can continue processing from main (I want to try suspending/stopping jobs etc.)? Regards, Raakhi
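A note on getStatusString() above: it is not part of the JobControl API, but a small helper you would write yourself. A possible sketch, assuming the old-API org.apache.hadoop.mapred.jobcontrol classes (the class name and output format here are just illustrative):

import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlStatus {
  // Hypothetical helper: summarise how many jobs are in each state.
  public static String getStatusString(JobControl jobControl) {
    return "waiting: " + jobControl.getWaitingJobs().size()
        + ", ready: " + jobControl.getReadyJobs().size()
        + ", running: " + jobControl.getRunningJobs().size()
        + ", successful: " + jobControl.getSuccessfulJobs().size()
        + ", failed: " + jobControl.getFailedJobs().size();
  }
}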
Re: access Configuration object in Partioner??
Hi Jianmin, Partitioner extends JobConfigurable, so you can implement the configure() method to access the JobConf. Hope that helps. Cheers, Tom On Tue, Jul 14, 2009 at 10:27 AM, Jianmin Woo jianmin_...@yahoo.com wrote: Hi, I am considering implementing a Partitioner that needs to access the parameters in the job's Configuration. However, there is no straightforward way to do this. Are there any suggestions? Thanks, Jianmin
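For illustration, a minimal sketch of an old-API partitioner that picks up a job parameter in configure() is shown below; the key/value types and the "my.partition.offset" parameter are made up for the example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class OffsetPartitioner implements Partitioner<Text, IntWritable> {

  private int offset;

  // Called once by the framework with the job's configuration.
  public void configure(JobConf job) {
    offset = job.getInt("my.partition.offset", 0);
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the result is always non-negative.
    return ((key.hashCode() + offset) & Integer.MAX_VALUE) % numPartitions;
  }
}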
Re: more than one reducer in standalone mode
There's a Jira to fix this here: https://issues.apache.org/jira/browse/MAPREDUCE-434 Tom On Mon, Jul 13, 2009 at 12:34 AM, jason hadoop jason.had...@gmail.com wrote: If the jobtracker is set to local, there is no way to have more than 1 reducer. On Sun, Jul 12, 2009 at 12:21 PM, Rares Vernica rvern...@gmail.com wrote: Hello, Is it possible to have more than one reducer in standalone mode? I am currently using 0.17.2.1 and I do: job.setNumReduceTasks(4); before starting the job and it seems that Hadoop overrides the variable, as it says: 09/07/12 12:07:40 INFO mapred.MapTask: numReduceTasks: 1 Thanks! Rares -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: access Configuration object in Partioner??
Hi Jianmin, Sorry - I (incorrectly) assumed you were using the old API. Partitioners don't yet work with the new API (see https://issues.apache.org/jira/browse/MAPREDUCE-565). However, when they do, you can make your Partitioner implement Configurable (by extending Configured, for example), and this will give you access to the job configuration, since the framework will set it for you on the partitioner. Cheers, Tom On Tue, Jul 14, 2009 at 12:46 PM, Jianmin Woo jianmin_...@yahoo.com wrote: Thanks a lot for your information, Tom. I am using the org.apache.hadoop.mapreduce.Partitioner in 0.20. It seems that the org.apache.hadoop.mapred.Partitioner is deprecated and will be removed in the future. Do you have some suggestions on this? Thanks, Jianmin From: Tom White t...@cloudera.com To: common-user@hadoop.apache.org Sent: Tuesday, July 14, 2009 6:03:34 PM Subject: Re: access Configuration object in Partioner?? Hi Jianmin, Partitioner extends JobConfigurable, so you can implement the configure() method to access the JobConf. Hope that helps. Cheers, Tom On Tue, Jul 14, 2009 at 10:27 AM, Jianmin Woo jianmin_...@yahoo.com wrote: Hi, I am considering implementing a Partitioner that needs to access the parameters in the job's Configuration. However, there is no straightforward way to do this. Are there any suggestions? Thanks, Jianmin
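When the new API does support it, a partitioner along these lines should work. This is a sketch only - the key/value types and the "my.partition.offset" parameter are made up, and since the new-API Partitioner is an abstract class, this example implements Configurable directly rather than extending Configured:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OffsetPartitioner extends Partitioner<Text, IntWritable>
    implements Configurable {

  private Configuration conf;
  private int offset;

  // The framework calls setConf() when it instantiates the partitioner.
  public void setConf(Configuration conf) {
    this.conf = conf;
    offset = conf.getInt("my.partition.offset", 0);
  }

  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the result is always non-negative.
    return ((key.hashCode() + offset) & Integer.MAX_VALUE) % numPartitions;
  }
}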
Re: Restarting a killed job from where it left
Hi Akhil, Have a look at the mapred.jobtracker.restart.recover property. Cheers, Tom On Sun, Jul 12, 2009 at 12:06 AM, akhil1988 akhilan...@gmail.com wrote: Hi All, I am looking for ways to restart my Hadoop job from where it left off when the entire cluster goes down or the job gets stopped for some reason, i.e. I am looking for ways in which I can store the status of my job at regular intervals, and then when I restart the job it starts from where it left off rather than starting from the beginning again. Can anyone please give me some references to read about ways to handle this? Thanks, Akhil -- View this message in context: http://www.nabble.com/Restarting-a-killed-job-from-where-it-left-tp2618p2618.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
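For reference, this is a JobTracker-side setting rather than a per-job one, so it goes in the cluster configuration file read by the JobTracker (hadoop-site.xml or mapred-site.xml, depending on the version). A sketch, assuming the property name is unchanged in your release:

<property>
  <name>mapred.jobtracker.restart.recover</name>
  <value>true</value>
  <description>Recover and resubmit jobs that were running when the
  JobTracker was restarted (sketch; verify against your version's
  default configuration).</description>
</property>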