Re: 1st Hadoop India User Group meet
Sanjay, Congratulations on holding the first meetup. All the best with it. It's exciting to see work being done in India involving Hadoop. I've been a part of some projects in the Hadoop ecosystem and have done some research work during my graduate studies as well as for a project at Cisco Systems. I'm traveling to Delhi in December and would love to meet and talk about how and what you and other users are doing in this area. Would you be interested? Looking forward to hearing from you. Regards, Amandeep On Mon, Nov 9, 2009 at 10:19 PM, Sanjay Sharma sanjay.sha...@impetus.co.in wrote: We are planning to hold the first Hadoop India user group meetup on 28th November 2009 in Noida. We would be talking about our experiences with Apache Hadoop/HBase/Hive/Pig/Nutch/etc. The agenda would be: - Introductions - Sharing experiences on Hadoop and related technologies - Establishing the agenda for the next few meetings - Information exchange: tips, tricks, problems and open discussion - Possible speaker TBD (invitations open!!) {we do have something to share on "Hadoop for newbie" and "Hadoop Advanced Tuning"} My company (Impetus) would be providing the meeting room and we should be able to accommodate around 40-60 friendly people. Coffee, tea, and some snacks will be provided. Please join the LinkedIn Hadoop India User Group (http://www.linkedin.com/groups?home=gid=2258445trk=anet_ug_hm) OR the Yahoo group (http://tech.groups.yahoo.com/group/hadoopind/) and confirm your attendance. Regards, Sanjay Sharma Follow our updates on www.twitter.com/impetuscalling. * Impetus Technologies is exhibiting its capabilities in Mobile and Wireless in the GSMA Mobile Asia Congress, Hong Kong from November 16-18, 2009. Visit http://www.impetus.com/mlabs/GSMA_events.html for details. NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
Re: Where is the eclipse plug-in for hadoop 0.20.1
Hi Stephen, Thank you. It works. Jeff Zhang On Mon, Nov 9, 2009 at 10:31 PM, Stephen Watt sw...@us.ibm.com wrote: Hi Jeff, That is correct. The plugin for 0.20.1 exists only in src/contrib as it has some build and runtime issues. It is presently being tracked here - http://issues.apache.org/jira/browse/HADOOP-6360 In the interim, if you go to that JIRA, you can obtain a 0.20.1 plugin.jar that I have attached to the JIRA as a stop-gap measure. I'd appreciate it if you could report in the JIRA what works for you and what does not with the attached plugin. Also, if you have any additional features for the plugin that you would like to request, feel free to add them as a comment to the JIRA. Regards Steve Watt From: Jeff Zhang zjf...@gmail.com To: core-u...@hadoop.apache.org Date: 11/09/2009 12:09 AM Subject: Where is the eclipse plug-in for hadoop 0.20.1 Hi all, I could not find the eclipse plug-in for hadoop 0.20.1. I only found the source code of the eclipse plugin, but I do not know how to build the plug-in. Could anyone give some help? Thank you. Jeff Zhang
[Ask for help]: IOException: Expecting a line not the end of stream, hadoop-0.20.1 in Damn Small Linux
Dear all, I am new to learning Hadoop, and encountered a problem while following the Hadoop Quick Start (http://hadoop.apache.org/common/docs/current/quickstart.html) tutorial. Everything in Cygwin is okay, but not in Damn Small Linux (DSL). In Damn Small Linux, after executing the command bin/hadoop jar hadoop-0.20.1-examples.jar grep input output 'dfs[a-z.]+', it output the errors shown in OUTPUT_01 below. Based on the errors, I tried df -k; its output is shown as OUTPUT_02, and it wasn't NULL/empty. I tried searching all the mailing lists of common-user and core-user, and Google, but found no solution, so I have to send this email to modestly ask for you guys' help. Please kindly reply if anyone has an idea. Thanks in advance! =) Best regards, Neo Tan OUTPUT_01:== r...@box:/home/hadoop/hadoop-0.20.1# bin/hadoop jar hadoop-0.20.1-examples.jar grep input output 'dfs[a-z.]+' 09/11/10 17:12:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/11/10 17:12:43 INFO mapred.FileInputFormat: Total input paths to process : 5 09/11/10 17:12:44 INFO mapred.FileInputFormat: Total input paths to process : 5 09/11/10 17:12:44 INFO mapred.JobClient: Running job: job_local_0001 09/11/10 17:12:44 INFO mapred.MapTask: numReduceTasks: 1 09/11/10 17:12:44 INFO mapred.MapTask: io.sort.mb = 100 09/11/10 17:12:45 INFO mapred.MapTask: data buffer = 79691776/99614720 09/11/10 17:12:45 INFO mapred.MapTask: record buffer = 262144/327680 09/11/10 17:12:45 INFO mapred.MapTask: Starting flush of map output 09/11/10 17:12:45 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: Expecting a line not the end of stream at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109) at org.apache.hadoop.util.Shell.runCommand(Shell.java:179) at org.apache.hadoop.util.Shell.run(Shell.java:134) at org.apache.hadoop.fs.DF.getAvailable(DF.java:73) at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1431) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176) 09/11/10 17:12:45 WARN util.Shell: Error reading the error stream java.io.IOException: Stream closed at java.io.BufferedReader.ensureOpen(BufferedReader.java:97) at java.io.BufferedReader.readLine(BufferedReader.java:292) at java.io.BufferedReader.readLine(BufferedReader.java:362) at org.apache.hadoop.util.Shell$1.run(Shell.java:164) 09/11/10 17:12:45 INFO mapred.JobClient: map 0% reduce 0% 09/11/10 17:12:45 INFO mapred.JobClient: Job complete: job_local_0001 09/11/10 17:12:45 INFO mapred.JobClient: Counters: 0 java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.hadoop.examples.Grep.run(Grep.java:69) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.Grep.main(Grep.java:93) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) r...@box:/home/hadoop/hadoop-0.20.1# = OUTPUT_02:= r...@box:/home/hadoop/hadoop-0.20.1# df -k Filesystem 1k-blocks Used Available Use% Mounted on /dev/hda1 8254240 1027828 6807120 13% / =
Lucene + Hadoop
Hi, I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based on the contents of the files (i.e. if the author is hrishikesh, it should be added to an index for hrishikesh. There has to be a separate index for every author). For this, I am keeping multiple IndexWriters open, one for every author, and maintaining them in a hashmap in the map() function. I parse the incoming file and if I see the author is one for which I have already opened an IndexWriter, I just add this file to that index, else I create a new IndexWriter for the new author. As authors might run into thousands, I am closing the IndexWriters and clearing the hashmap once it reaches a certain threshold and starting all over again. There is no reduce function. Does this logic sound correct? Is there any other way of implementing this requirement? --Hrishi DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
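A minimal sketch of the per-author IndexWriter mapper described above, written against the old (org.apache.hadoop.mapred) API. The author extraction, document construction, and IndexWriter setup are placeholders, since they depend on the input format and on the Lucene version in use; the threshold value is likewise illustrative.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class AuthorIndexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private static final int MAX_OPEN_WRITERS = 100;   // illustrative threshold
  private final Map<String, IndexWriter> writers = new HashMap<String, IndexWriter>();

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    String author = extractAuthor(value);            // parse the record for its author
    IndexWriter writer = writers.get(author);
    if (writer == null) {
      if (writers.size() >= MAX_OPEN_WRITERS) {
        closeAll();                                  // flush and clear before growing again
      }
      writer = openWriterFor(author);                // placeholder: open a Lucene index per author
      writers.put(author, writer);
    }
    writer.addDocument(toDocument(value));           // add the parsed file to that author's index
  }

  @Override
  public void close() throws IOException {
    closeAll();                                      // make sure everything is flushed at task end
  }

  private void closeAll() throws IOException {
    for (IndexWriter w : writers.values()) {
      w.close();
    }
    writers.clear();
  }

  // --- placeholders; the real code depends on the input format and Lucene release ---
  private String extractAuthor(Text record) {
    return record.toString().split("\t")[0];         // e.g. author in the first column
  }
  private Document toDocument(Text record) {
    Document doc = new Document();
    // doc.add(new Field(...)) for whichever fields you index
    return doc;
  }
  private IndexWriter openWriterFor(String author) throws IOException {
    // e.g. new IndexWriter(<directory for author>, <analyzer>, ...) -- constructor varies by Lucene version
    throw new UnsupportedOperationException("fill in for your Lucene version");
  }
}

Note that this only bounds the number of writers open at once; if the author distribution is very skewed, an alternative worth considering is keying the map output by author and building each index in a reduce, so that one author's documents arrive together.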
Automate EC2 cluster termination
Hi, I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great but I want to automate it a bit more. I want to be able to: - start cluster - copy data from S3 to the DFS - run the job - copy result data from DFS to S3 - verify it all copied ok - shutdown the cluster. I guess the hardest part is reliably detecting when a job is complete. I've seen solutions that provide a time based shutdown but they are not suitable as our jobs vary in time. Has anyone made a script that does this already? I'm using the Cloudera python scripts to start/terminate my cluster. Thanks, John
Re: Automate EC2 cluster termination
You should be able to detect the status of the job in your Java main() method: either call job.waitForCompletion() and, when the job finishes running, use job.isSuccessful(), or, if you want to, write a custom watcher thread that polls job status manually; the latter will allow you to, for instance, launch several jobs and wait for them to return. You will poll the job tracker using either method, but I think the overhead is pretty minimal. I'm not sure if it's necessary to copy data from S3 to DFS, btw (unless you have a performance reason to do so... even then, since you're not really guaranteed very much locality on EC2 you probably won't see a huge difference). You should probably just set the default file system to s3. See http://wiki.apache.org/hadoop/AmazonS3 . On 11/10/09 9:13 AM, John Clarke wrote: Hi, I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great but I want to automate it a bit more. I want to be able to: - start cluster - copy data from S3 to the DFS - run the job - copy result data from DFS to S3 - verify it all copied ok - shutdown the cluster. I guess the hardest part is reliably detecting when a job is complete. I've seen solutions that provide a time based shutdown but they are not suitable as our jobs vary in time. Has anyone made a script that does this already? I'm using the Cloudera python scripts to start/terminate my cluster. Thanks, John
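A rough sketch of the completion check described above, written against the newer org.apache.hadoop.mapreduce API (on 0.18.x the rough equivalents are JobClient.runJob() and RunningJob.isSuccessful()). The driver exits non-zero on failure so that an outer wrapper script — for example one built around the Cloudera start/terminate scripts mentioned in the thread — can decide whether to copy results back to S3 and shut the cluster down. The job name, paths, and wrapper convention are illustrative, not part of the poster's setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunAndReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "ec2-batch-job");          // illustrative job name
    job.setJarByClass(RunAndReport.class);
    // mapper/reducer/output types would be configured here
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Blocks until the job finishes, printing progress to the client console.
    boolean finished = job.waitForCompletion(true);

    // isSuccessful() re-checks the final job state via the JobTracker.
    if (finished && job.isSuccessful()) {
      System.exit(0);   // wrapper script proceeds: copy DFS -> S3, then terminate the cluster
    } else {
      System.exit(1);   // wrapper script keeps the cluster up for debugging
    }
  }
}

The DFS-to-S3 copy and the call to the terminate script can then live in that wrapper, gated on the exit code.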
Hadoop NameNode not starting up
I am running Hadoop on a single server. The issue I am running into is that the start-all.sh script is not starting up the NameNode. The only way I can start the NameNode is by formatting it, and I end up losing data in HDFS. Does anyone have a solution to this issue? Kaushal
Next Boston Hadoop Meetup, Tuesday, November 24th
After a packed, energetic first Boston Hadoop Meetup, we're having another. The next one will be in two weeks, on Tuesday, November 24th, 7 pm, at the HubSpot offices: http://www.meetup.com/bostonhadoop/calendar/11834241/ (HubSpot is at 1 Broadway, Cambridge on the fifth floor. There Will Be Food. There Will Be Beer.) As before, we'll aim to have two c. 20-minute presentations, with plenty of time for Q&A after each, and then a few 5-minute lightning talks. Also, the eating and the chatting. Please feel free to contact me if you've got an idea for a talk of any length, on Hadoop, Hive, Pig, HBase, etc. -Dan Milstein 617-401-2855 dmilst...@hubspot.com http://dev.hubspot.com/
Re: Cross Join
Thanks to all who commented on this. I think there was some confusion over what I was trying to do: indeed there was no common key between the two tables to join on, which made all the methods I investigated either inappropriate or inefficient. In the end I decided to write my own join class. It can be written in a reducer or a mapper. While the reducer implementation is a bit cleaner, the mapper implementation provides (theoretically) better distributed processing. For those who are interested, the basic algorithm is: x is defined as the cross product of two vectors proc crossproduct: Allow mapreduce to partition the left side of the input on each mapper let left_i = save all the left side key/value pairs that are processed on that node in cleanup (or at the end of the reduce): let right = open the right side of the join on each node through hdfs for each pair of pairs in left_i x right: if transform(pair) != null emit transform(pair) else continue endfor end on each end proc The important On 11/5/09 1:15 PM, Ashutosh Chauhan wrote: Hi Edmund, If you can prepare your dataset in a way org.apache.hadoop.mapred.join requires, then it might be an efficient way to do joins in your case. IMHO, though, the requirements placed by it are pretty restrictive. Also, instead of reinventing the wheel, I would also suggest you take a look at how Pig tries to solve the problem of joining large datasets. It has in fact four different join algorithms implemented and one or more of them should satisfy your requirements. It seems to me the merge-join of Pig is well suited to your case. Its only requirement is that it wants the dataset to be sorted on both sides. Datasets need not be equipartitioned, need not have the same number of partitions, etc. You said that sorting the dataset is a pain in your case. Pig's orderby is quite sophisticated and performs sorting quite efficiently. If indeed doing a sort is not an option, then you may want to consider the hash join or skewed join of Pig. Joins in Pig are explained at a high level here: http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/ Hope it helps, Ashutosh On Thu, Nov 5, 2009 at 06:19, Jason Venner jason.had...@gmail.com wrote: Look at the join package in map reduce, it provides this functionality quite cleanly, for ordered datasets that have the same partitioning. org.apache.hadoop.mapred.join in hadoop 19 On Wed, Nov 4, 2009 at 6:52 AM, Edmund Kohlwey ekohl...@gmail.com wrote: Hi, I'm looking for an efficient way to do a cross join. I've gone through a few implementations, and I wanted to seek some advice before attempting another. The join is a large collection to large collection - so there are no trick optimizations like downloading one side of the join on each node (i.e. map-side join). The output of the join will be sparse (it's basically matching a large collection of regexes to a large collection of strings), but because of the nature of the data there's not really any way to pre-process either side of the join. 1. Naive approach - on a single node, iterate over both collections, resulting in reading the left file 1 time and the right file n times - I know this is bad. 2. Indexed approach - index each data item with a row/col - requires replicating, sorting, and shuffling all the records 2 times - also not good. This actually seemed to perform worse than 1, and resulted in running out of disk space on the mappers when output was spilled to disk. I'm now considering what to try next.
One idea is to improve on 1 by blocking the reads, so that the right side of the join is read b times, where b is the number of blocks the left side is split into. The other (imho, best) idea is to write a reduce-side join, which would actually be fully parallelized, which basically relies on map/reduce to split the left side into blocks, and then allows each reducer to stream through the right side once. In this version, the right side is still downloaded b times, but the operation is done in parallel. The only issue with this is that I would need to iterate over the reduce iterators multiple times, which is something that M/R doesn't allow (I think). I know I could save the contents of the iterator locally, but this seems like a bad design choice too. Does anybody know if there's a smart way to iterate twice in a reducer? There's probably some methods I haven't really thought of. Does anyone have any suggestions? -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
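A rough Java sketch of the mapper-side blocked cross join outlined in the message above, using the old mapred API: the framework splits the left side across mappers, each mapper buffers its block of left records, and in close() it streams the right side once from HDFS and emits the non-null transforms. The config key, the line-oriented right-side format, and the regex-style transform are illustrative assumptions, not the poster's actual code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BlockedCrossJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> left = new ArrayList<String>();   // left_i: this mapper's block
  private JobConf conf;
  private OutputCollector<Text, Text> out;

  @Override
  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    this.out = output;            // keep a handle so close() can emit
    left.add(value.toString());   // buffer the left-side records seen by this mapper
  }

  @Override
  public void close() throws IOException {
    if (out == null) return;      // empty split, nothing buffered
    Path rightPath = new Path(conf.get("crossjoin.right.path"));  // illustrative config key
    FileSystem fs = rightPath.getFileSystem(conf);
    BufferedReader right = new BufferedReader(new InputStreamReader(fs.open(rightPath)));
    try {
      String r;
      while ((r = right.readLine()) != null) {       // stream the right side once per mapper
        for (String l : left) {
          String joined = transform(l, r);           // placeholder join/filter logic
          if (joined != null) {
            out.collect(new Text(l), new Text(joined));
          }
        }
      }
    } finally {
      right.close();
    }
  }

  private String transform(String leftRec, String rightRec) {
    // e.g. emit the right record if the left record (a regex) matches it, else null
    return rightRec.matches(leftRec) ? rightRec : null;
  }
}

The reducer-side variant has the same shape, with the buffering happening in reduce() and the streaming read in close().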
Re: Hadoop NameNode not starting up
Is there error output from start-all.sh? On 11/9/09 11:10 PM, Kaushal Amin wrote: I am running Hadoop on single server. The issue I am running into is that start-all.sh script is not starting up NameNode. Only way I can start NameNode is by formatting it and I end up losing data in HDFS. Does anyone have solution to this issue? Kaushal
Error with replication and namespaceID
On the actual datanodes I see the following exception: I am not sure what the namespaceID is or how to sync them. Thanks for any advice! / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = pingo-3.poly.edu/128.238.55.33 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: namenode namespaceID = 1016244663; datanode namespaceID = 1687029285 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) --- On Mon, 11/9/09, Boris Shkolnik bo...@yahoo-inc.com wrote: From: Boris Shkolnik bo...@yahoo-inc.com Subject: Re: newbie question - error with replication To: common-user@hadoop.apache.org Date: Monday, November 9, 2009, 5:02 PM Make sure you have at least one datanode running. Look at the data node log file. (logs/*-datanode-*.log) Boris. On 11/9/09 7:15 AM, Raymond Jennings III raymondj...@yahoo.com wrote: I am trying to resolve an IOException error. I have a basic setup and shortly after running start-dfs.sh I get a: error: java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 Any pointers how to resolve this? Thanks!
java.io.IOException: Could not obtain block:
Hello everyone, I am getting this error java.io.IOException: Could not obtain block:, when running on my new cluster. When I ran the same job on the single node it worked perfectly; I then added in the second node, and received this error. I was running the grep sample job. I am running Hadoop 0.19.2, because of a dependency on Nutch (even though this was not a Nutch job). I am not running HBase, and the version of Java is OpenJDK 1.6.0. Does anybody have any ideas? Thanks in advance, -John
Hadoop User Group (Bay Area) - next Wednesday (Nov 18th) at Yahoo!
Hi all, We are one week away from the next Bay Area Hadoop User Group - Yahoo! Sunnyvale Campus, next Wednesday (Nov 18th) at 6PM. We have an exciting evening planned: *Katta, Solr, Lucene and Hadoop - Searching at scale, Jason Rutherglen and Jason Venner *Walking through the New File system API, Sanjay Radia, Yahoo! *Keep your data in Jute but still use it in python, Paul Tarjan, Yahoo! Please RSVP here: http://www.meetup.com/hadoop/calendar/11724002/ Please note that this is the last HUG for 2009, as we will not have a meeting in December (due to the holidays). We will open 2010 with a HUG on Jan 20th. Looking forward to seeing you next week! Dekel
Re: Re: how to read file in hadoop
It is because the content I read from the file is encoded in UTF-8. I used Text.decode to decode it back to a plain text string, and the problem is gone now. -Gang ----- Original Message ----- From: Gang Luo lgpub...@yahoo.com.cn To: common-user@hadoop.apache.org Date: 2009/11/10 (Tue) 12:14:44 AM Subject: Re: Re: how to read file in hadoop I downloaded it to my local filesystem. The content is correct; I can see it either by command or by text editor. So, I think the file itself has no problem. --Gang ----- Original Message ----- From: Jeff Zhang zjf...@gmail.com To: common-user@hadoop.apache.org Date: 2009/11/9 (Mon) 11:58:22 PM Subject: Re: Re: how to read file in hadoop Maybe you can download the file to local to see what content is there. Jeff Zhang 2009/11/10 Gang Luo lgpub...@yahoo.com.cn Since there has been no response to this question up to now, I'd like to describe more details about it. I try to read a file in HDFS and copy it to another file. It works well, and the content I can see by 'cat' is what it is supposed to be. The only problem is that, when I read it into a byte[] and print it out to stdout, it is NOT what it should be. Thus, I cannot do anything (e.g. comparison) except write it directly to another file. I guess this problem may be due to the setting of file format (text or binary) or encoding (e.g. UTF-8). Can someone give me some ideas? --Gang ----- Original Message ----- From: Gang Luo lgpub...@yahoo.com.cn To: common-user@hadoop.apache.org Date: 2009/11/9 (Mon) 11:47:02 AM Subject: how to read file in hadoop Hi all, I want to use the HDFS IO API to read a result file of the previous mapreduce job. But what I read is not what is in that file; that is, the content I print to stdout is different from what I get from the console by the command 'cat'. I guess there may be some problem with the file format (binary or text). Can anyone give me some hints? Gang Luo
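A small sketch of the fix Gang describes: read the raw bytes from an HDFS file and convert them with Text.decode() rather than relying on the platform default charset. The path and the whole-file read are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;

public class ReadHdfsAsUtf8 {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/user/gang/output/part-00000");   // illustrative path
    FileSystem fs = path.getFileSystem(conf);

    FileStatus status = fs.getFileStatus(path);
    byte[] buffer = new byte[(int) status.getLen()];         // fine for small result files

    FSDataInputStream in = fs.open(path);
    try {
      IOUtils.readFully(in, buffer, 0, buffer.length);       // fill the whole buffer
    } finally {
      in.close();
    }

    // MapReduce text output is UTF-8; Text.decode converts the raw bytes correctly,
    // whereas new String(buffer) would use the platform default charset.
    String contents = Text.decode(buffer);
    System.out.println(contents);
  }
}

new String(buffer) only happens to work when the platform default charset is UTF-8, which is presumably why the printed bytes looked wrong.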
stdout logs ?
Hi all, In src/contrib/data_join/src/java/org/apache/hadoop/contrib/utils/join/DataJoinJob.java I found a couple of println statements (shown below) which are getting executed when submitted for a job. I am not sure to which stdout they are printing. I searched in logs/* but didn't find it. Can somebody please tell me where they are logged? BTW, I am running this job on a cluster which has only one node. try { running = jc.submitJob(job); JobID jobId = running.getID(); System.out.println("Job " + jobId + " is submitted"); while (!running.isComplete()) { System.out.println("Job " + jobId + " is still running."); try { Thread.sleep(6); } catch (InterruptedException e) {
Re: Error with replication and namespaceID
Hi Ray, You'll probably find that even though the name node starts, it doesn't have any data nodes and is completely empty. Whenever hadoop creates a new filesystem, it assigns a large random number to it to prevent you from mixing datanodes from different filesystems on accident. When you reformat the name node its FS has one ID, but your data nodes still have chunks of the old FS with a different ID and so will refuse to connect to the namenode. You need to make sure these are cleaned up before reformatting. You can do it just by deleting the data node directory, although there's probably a more official way to do it. On 11/10/09 11:01 AM, Raymond Jennings III wrote: On the actual datanodes I see the following exception: I am not sure what the namespaceID is or how to sync them. Thanks for any advice! / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = pingo-3.poly.edu/128.238.55.33 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: namenode namespaceID = 1016244663; datanode namespaceID = 1687029285 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) --- On Mon, 11/9/09, Boris Shkolnikbo...@yahoo-inc.com wrote: From: Boris Shkolnikbo...@yahoo-inc.com Subject: Re: newbie question - error with replication To: common-user@hadoop.apache.org Date: Monday, November 9, 2009, 5:02 PM Make sure you have at least one datanode running. Look at the data node log file. (logs/*-datanode-*.log) Boris. On 11/9/09 7:15 AM, Raymond Jennings IIIraymondj...@yahoo.com wrote: I am trying to resolve an IOException error. I have a basic setup and shortly after running start-dfs.sh I get a: error: java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 Any pointers how to resolve this? Thanks!
Re: stdout logs ?
Hi Siddu, I asked this question a couple of days ago. You should use your browser to access the jobtracker. Click a job id -> map -> pick a map task -> click the link in the task log column, and you will see the output at stdout and stderr. -Gang ----- Original Message ----- In src/contrib/data_join/src/java/org/apache/hadoop/contrib/utils/join/DataJoinJob.java I found a couple of println statements (shown below) which are getting executed when submitted for a job. I am not sure to which stdout they are printing. I searched in logs/* but didn't find it. Can somebody please tell me where they are logged? BTW, I am running this job on a cluster which has only one node. try { running = jc.submitJob(job); JobID jobId = running.getID(); System.out.println("Job " + jobId + " is submitted"); while (!running.isComplete()) { System.out.println("Job " + jobId + " is still running."); try { Thread.sleep(6); } catch (InterruptedException e) {
Re: Error with replication and namespaceID
Thanks!!! That worked! I guess I can edit the number on the datanodes as well but if there is an even more official way to resolve this I would be interested in hearing about it. --- On Tue, 11/10/09, Edmund Kohlwey ekohl...@gmail.com wrote: From: Edmund Kohlwey ekohl...@gmail.com Subject: Re: Error with replication and namespaceID To: common-user@hadoop.apache.org Date: Tuesday, November 10, 2009, 1:46 PM Hi Ray, You'll probably find that even though the name node starts, it doesn't have any data nodes and is completely empty. Whenever hadoop creates a new filesystem, it assigns a large random number to it to prevent you from mixing datanodes from different filesystems on accident. When you reformat the name node its FS has one ID, but your data nodes still have chunks of the old FS with a different ID and so will refuse to connect to the namenode. You need to make sure these are cleaned up before reformatting. You can do it just by deleting the data node directory, although there's probably a more official way to do it. On 11/10/09 11:01 AM, Raymond Jennings III wrote: On the actual datanodes I see the following exception: I am not sure what the namespaceID is or how to sync them. Thanks for any advice! / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = pingo-3.poly.edu/128.238.55.33 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: namenode namespaceID = 1016244663; datanode namespaceID = 1687029285 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) --- On Mon, 11/9/09, Boris Shkolnikbo...@yahoo-inc.com wrote: From: Boris Shkolnikbo...@yahoo-inc.com Subject: Re: newbie question - error with replication To: common-user@hadoop.apache.org Date: Monday, November 9, 2009, 5:02 PM Make sure you have at least one datanode running. Look at the data node log file. (logs/*-datanode-*.log) Boris. On 11/9/09 7:15 AM, Raymond Jennings IIIraymondj...@yahoo.com wrote: I am trying to resolve an IOException error. I have a basic setup and shortly after running start-dfs.sh I get a: error: java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 Any pointers how to resolve this? Thanks!
Should I upgrade from 0.18.3 to the latest 0.20.1?
Hi, I've been working on my project for about a year, and I decided to upgrade from 0.18.3 (which was stable and already old even back then). I have started, but I see that many classes have changed, many are deprecated, and I need to re-write some code. Is it worth it? What are the advantages of doing this? Other areas of concern are: - Will Amazon EMR work with the latest Hadoop? - What about Cloudera distribution or Yahoo distribution? Thank you, Mark
error setting up hdfs?
had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls ls: Cannot access .: No such file or directory. anyone else get this one? i started changing settings on my box to get all of my cores working, but immediately hit this error. since then i started from scratch and have hit this error again. what am i missing?
Re: error setting up hdfs?
You need to specify a path. Try bin/hadoop dfs -ls / Steve Watt From: zenkalia zenka...@gmail.com To: core-u...@hadoop.apache.org Date: 11/10/2009 03:04 PM Subject: error setting up hdfs? had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls ls: Cannot access .: No such file or directory. anyone else get this one? i started changing settings on my box to get all of my cores working, but immediately hit this error. since then i started from scratch and have hit this error again. what am i missing?
Re: Automate EC2 cluster termination
Hi John, Have you considered Amazon Elastic MapReduce? (Disclaimer: I work on Elastic MapReduce) http://aws.amazon.com/elasticmapreduce/ It waits for your job to finish and then automatically shuts down the cluster. With a simple command like: elastic-mapreduce --create --num-instances 10 --jar s3://mybucket/my.jar --args s3://mybucket/input/,s3://mybucket/output/ It will automatically create a cluster, run your jar, and then shut everything down. Elastic MapReduce costs a little bit more than just plain EC2, but if it prevents your cluster from running longer than necessary, you might save money. Andrew On 11/10/09 6:13 AM, John Clarke clarke...@gmail.com wrote: Hi, I use EC2 to run my Hadoop jobs using Cloudera's 0.18.3 AMI. It works great but I want to automate it a bit more. I want to be able to: - start cluster - copy data from S3 to the DFS - run the job - copy result data from DFS to S3 - verify it all copied ok - shutdown the cluster. I guess the hardest part is reliably detecting when a job is complete. I've seen solutions that provide a time based shutdown but they are not suitable as our jobs vary in time. Has anyone made a script that does this already? I'm using the Cloudera python scripts to start/terminate my cluster. Thanks, John
Re: Hadoop NameNode not starting up
You need to go to your logs directory and have a look at what is going on in the namenode log. What version are you using ? I'm going to take a guess at your issue here and say that you used the /tmp as a path for some of your hadoop conf settings and you have rebooted lately. The /tmp dir is wiped out on reboot. Kind regards Steve Watt From: Kaushal Amin kaushala...@gmail.com To: common-user@hadoop.apache.org Date: 11/10/2009 08:47 AM Subject: Hadoop NameNode not starting up I am running Hadoop on single server. The issue I am running into is that start-all.sh script is not starting up NameNode. Only way I can start NameNode is by formatting it and I end up losing data in HDFS. Does anyone have solution to this issue? Kaushal
Re: Hadoop NameNode not starting up
Did you format it for the first time? Another quick way to figure it out is ${HADOOP_HOME}/bin/hadoop start namenode and see what error it gives. -Sagar Stephen Watt wrote: You need to go to your logs directory and have a look at what is going on in the namenode log. What version are you using ? I'm going to take a guess at your issue here and say that you used the /tmp as a path for some of your hadoop conf settings and you have rebooted lately. The /tmp dir is wiped out on reboot. Kind regards Steve Watt From: Kaushal Amin kaushala...@gmail.com To: common-user@hadoop.apache.org Date: 11/10/2009 08:47 AM Subject: Hadoop NameNode not starting up I am running Hadoop on single server. The issue I am running into is that start-all.sh script is not starting up NameNode. Only way I can start NameNode is by formatting it and I end up losing data in HDFS. Does anyone have solution to this issue? Kaushal
Re: error setting up hdfs?
ok, things are working.. i must have forgotten what i did when first setting up hadoop... should these responses be considered inconsistent/an error? hmm. hadoop dfs -ls error hadoop dfs -ls / irrelevant stuff about the path you're in hadoop dfs -mkdir lol works fine hadoop dfs -ls Found 1 items drwxr-xr-x - hadoop supergroup 0 2009-11-10 05:28 /user/hadoop/lol thanks stephen. -mike On Tue, Nov 10, 2009 at 1:19 PM, Stephen Watt sw...@us.ibm.com wrote: You need to specify a path. Try bin/hadoop dfs -ls / Steve Watt From: zenkalia zenka...@gmail.com To: core-u...@hadoop.apache.org Date: 11/10/2009 03:04 PM Subject: error setting up hdfs? had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls ls: Cannot access .: No such file or directory. anyone else get this one? i started changing settings on my box to get all of my cores working, but immediately hit this error. since then i started from scratch and have hit this error again. what am i missing?
Anyone using Hadoop in Austin, Texas ?
Just curious to see if there are any hadoop compatriots around and if there are, maybe we could organize a meetup. Regards Steve Watt
Re: Anyone using Hadoop in Austin, Texas ?
Me in Houston :) Mark On Tue, Nov 10, 2009 at 3:32 PM, Stephen Watt sw...@us.ibm.com wrote: Just curious to see if there are any hadoop compatriots around and if there are, maybe we could organize a meetup. Regards Steve Watt
Re: error setting up hdfs?
You don't need to specify a path. If you don't specify a path argument for ls, then it uses your home directory in HDFS (/user/yourusernamehere). When you first started HDFS, /user/hadoop didn't exist, so 'hadoop fs -ls' -- 'hadoop fs -ls /user/hadoop' -- directory not found. When you mkdir'd 'lol', you were actually effectively doing mkdir -p /user/hadoop/lol, so then it created your home directory underneath of that. - Aaron On Tue, Nov 10, 2009 at 1:30 PM, zenkalia zenka...@gmail.com wrote: ok, things are working.. i must have forgotten what i did when first setting up hadoop... should these responses be considered inconsistent/an error? hmm. hadoop dfs -ls error hadoop dfs -ls / irrelevant stuff about the path you're in hadoop dfs -mkdir lol works fine hadoop dfs -ls Found 1 items drwxr-xr-x - hadoop supergroup 0 2009-11-10 05:28 /user/hadoop/lol thanks stephen. -mike On Tue, Nov 10, 2009 at 1:19 PM, Stephen Watt sw...@us.ibm.com wrote: You need to specify a path. Try bin/hadoop dfs -ls / Steve Watt From: zenkalia zenka...@gmail.com To: core-u...@hadoop.apache.org Date: 11/10/2009 03:04 PM Subject: error setting up hdfs? had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls ls: Cannot access .: No such file or directory. anyone else get this one? i started changing settings on my box to get all of my cores working, but immediately hit this error. since then i started from scratch and have hit this error again. what am i missing?
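The same home-directory behaviour Aaron describes, seen from the Java FileSystem API; a small sketch (the printed URIs depend on your fs.default.name): relative paths, including the implicit '.' behind a bare 'hadoop fs -ls', resolve against /user/<username>, which does not exist until something creates it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HomeDirDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path home = fs.getHomeDirectory();            // e.g. hdfs://namenode:9000/user/hadoop
    System.out.println("home dir:    " + home);
    System.out.println("working dir: " + fs.getWorkingDirectory());

    // A bare 'hadoop fs -ls' lists '.', which resolves to the working/home directory.
    // On a fresh filesystem the home directory does not exist yet, hence the ls error:
    System.out.println("home exists: " + fs.exists(home));

    // 'hadoop fs -mkdir lol' is a relative path too, so it behaves like
    // mkdir -p /user/<username>/lol and creates the home directory as a side effect.
    fs.mkdirs(new Path("lol"));
    System.out.println("home exists now: " + fs.exists(home));
  }
}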
Re: Lucene + Hadoop
I think that sounds right. I believe that's what I did when I implemented this type of functionality for http://simpy.com/ I'm not sure why this is a Hadoop thing, though. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Hrishikesh Agashe hrishikesh_aga...@persistent.co.in To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Tue, November 10, 2009 4:56:33 AM Subject: Lucene + Hadoop Hi, I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based on contents of the files (i.e. if author is hrishikesh, it should be added to a index for hrishikesh. There has to be a separate index for every author). For this, I am keeping multiple IndexWriter open for every author and maintaining them in a hashmap in map() function. I parse incoming file and if I see author is one for which I already have opened a IndexWriter, I just add this file in that index, else I create a new IndesWriter for new author. As authors might run into thousands, I am closing IndexWriter and clearing hashmap once it reaches a certain threshold and starting all over again. There is no reduced function. Does this logic sound correct? Is there any other way of implementing this requirement? --Hrishi DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
Re: Hadoop User Group Maryland/DC Area
Hey Abhi, Check out http://www.meetup.com/Hadoop-DC/. Regards, Jeff On Tue, Nov 10, 2009 at 9:26 AM, Abhishek Pratap abhishek@gmail.comwrote: Hi Guys Just wondering if there is any Hadoop group functioning in the Maryland/DC area. I would love to be a part and learn few things along the way. Cheers, -Abhi
Re: Should I upgrade from 0.18.3 to the latest 0.20.1?
The new API in 0.20.x is likely not what you'll see in the final Hadoop 1.0 release, which I've heard some people forecast within the next 18 months or so (we'll see). There will likely be a 0.21.x series, and then the final release. That having been said, it's much more similar to what you'll see in the final release. Depending on how complex your jobs are, you may see minor or no changes in the final release, or you may see dramatic ones. I think (someone correct me if I'm wrong) the basic map and reduce abstract classes are just about set in stone, but if you're using other stuff like file formats, custom splits, etc. then you may see a lot of differences. I've also noticed a lot of changes in how the job and task trackers work, even in the current trunk. There's also some interesting work being done by Yahoo on pipelining MR jobs, which will not be in any 0.20.x release. The other thing about 0.20.x is that a lot of the old API (like joins, etc.) has not been updated, so your application may be a hodgepodge patchwork of the two APIs. Are there any portions of the new API which are particularly attractive to you? That might help people suggest whether or not you should switch to satisfy that need. If you don't have any needs particular to the 0.20.x API then there's probably little reason to switch. If you do upgrade to 0.20.1, make sure to get the Cloudera or Yahoo distributions. The current stable (0.20.1) release on the Apache page is very buggy. On 11/10/09 3:30 PM, Mark Kerzner wrote: Hi, I've been working on my project for about a year, and I decided to upgrade from 0.18.3 (which was stable and already old even back then). I have started, but I see that many classes have changed, many are deprecated, and I need to re-write some code. Is it worth it? What are the advantages of doing this? Other areas of concern are: - Will Amazon EMR work with the latest Hadoop? - What about Cloudera distribution or Yahoo distribution? Thank you, Mark
Re: NameNode/DataNode JobTracker/TaskTracker
On Mon, Nov 9, 2009 at 1:04 PM, John Martyniak j...@beforedawnsolutions.com wrote: Thanks Todd. I wasn't sure if that is possible. But you pointed out an important point and that is it is just NN and JT that would run remotely. So in order to do this would I just install the complete hadoop instance on each one. And then would they be configed as masters? Or should NameNode and JobTracker run on the same machine? So there would be one master. Either way. On all clusters but the largest, the NN and JT are not significant users of CPU. On medium size clusters they can start to use up multiple GBs of RAM. If you're using less than 30 nodes you can *probably* get by with one machine for both; I say probably because it depends on not just your total capacity but also the number of files you have. There are some rough sizing estimates if you google the archives for CompressedOops I think - someone did some measurements of the NN's memory requirements. So when I start the cluster would I start it from the NN/JT machine. Could it also be started from any of the other cluster members. It doesn't matter - Hadoop itself doesn't use SSH or anything. The daemons just all have to be started somehow. If you're using the Cloudera distribution with RPM/Deb you can use init scripts. If you prefer shell scripts and ssh you can use the provided start-all scripts, your own scripts, or something like pdssh or cap shell. If you're a masochist you can log into each node individually and start the daemons by hand. I do not recommend this last option :) sorry for all of the seemingly basic questions, but want to get it right the first time:) Sure thing- we're here to help. -Todd On Nov 9, 2009, at 1:11 PM, Todd Lipcon wrote: On Mon, Nov 9, 2009 at 7:20 AM, John Martyniak j...@beforedawnsolutions.com wrote: Can the NameNode/DataNode JobTracker/TaskTracker run on a server that isn't part of the cluster meaning I would like to run it on a machine that wouldn't participate in the processing of data, and wouldn't participate in the HDFS data sharing, and would solely focus on the NameNode/DataNode JobTracker/TaskTracker tasks. Yes, running the NN and the JT on servers that don't also run TT/DN is very common and recommended for clusters of more than maybe 5 nodes. -Todd
Re: java.io.IOException: Could not obtain block:
I've not encountered an error like this, but here's some suggestions: 1. Try to make sure that your two node cluster is setup correctly. Querying the web interface, using any of the included dfs utils (eg. hadoop dfs -ls), or looking in your log directory may yield more useful stack traces or errors. 2. Open up the source and check out the code around the stack trace. This sucks, but hadoop is actually pretty easy to surf through in Eclipse, and most classes are kept within a reasonable number of lines of code and fairly readable. 3. Rip out the parts of Nutch you need and drop them in your project, and forget about 0.19. This isn't ideal, but you have to remember that this whole ecosystem is still forming and sometimes it makes sense to rip stuff out and transplant it into your project rather than depending on 2-3 classes from a project which you otherwise don't use. On 11/10/09 11:32 AM, John Martyniak wrote: Hello everyone, I am getting this error java.io.IOException: Could not obtain block:, when running on my new cluster. When I ran the same job on the single node it worked perfectly, I then added in the second node, and receive this error. I was running the grep sample job. I am running Hadoop 0.19.2, because of a dependency on Nutch (Eventhough this was not a Nutch job). I am not running HBase, the version of Java is OpenJDK 1.6.0. Does anybody have any ideas? Thanks in advance, -John
Re: java.io.IOException: Could not obtain block:
Edmund, Thanks for the advice. It turns out that it was the firewall running on the second cluster node. So I stopped that and all is working correctly. Now that I have the second node working the way that it is supposed to probably, going to bring another couple of nodes online. Wish me luck:) -John On Nov 10, 2009, at 9:30 PM, Edmund Kohlwey wrote: I've not encountered an error like this, but here's some suggestions: 1. Try to make sure that your two node cluster is setup correctly. Querying the web interface, using any of the included dfs utils (eg. hadoop dfs -ls), or looking in your log directory may yield more useful stack traces or errors. 2. Open up the source and check out the code around the stack trace. This sucks, but hadoop is actually pretty easy to surf through in Eclipse, and most classes are kept within a reasonable number of lines of code and fairly readable. 3. Rip out the parts of Nutch you need and drop them in your project, and forget about 0.19. This isn't ideal, but you have to remember that this whole ecosystem is still forming and sometimes it makes sense to rip stuff out and transplant it into your project rather than depending on 2-3 classes from a project which you otherwise don't use. On 11/10/09 11:32 AM, John Martyniak wrote: Hello everyone, I am getting this error java.io.IOException: Could not obtain block:, when running on my new cluster. When I ran the same job on the single node it worked perfectly, I then added in the second node, and receive this error. I was running the grep sample job. I am running Hadoop 0.19.2, because of a dependency on Nutch (Eventhough this was not a Nutch job). I am not running HBase, the version of Java is OpenJDK 1.6.0. Does anybody have any ideas? Thanks in advance, -John