Is HDFS reliable? Very odd error
I copied a 230GB file into my Hadoop cluster. After my MR job kept failing, I tracked the error down to one line of formatted text. I copied the file back out of HDFS, and when I compare it to the original file there are about 20 bytes on one line (out of 230GB) that differ. Is there no CRC or checksum done when copying files into HDFS? (Just to be clear, I copied the original file out of HDFS - not the output of my MR job.)
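For reference, HDFS does checksum data: the client verifies a CRC for every io.bytes.per.checksum bytes (512 by default) on read, so silent corruption usually creeps in before the write or on a local disk. One way to narrow it down is to hash the file on both sides right after the upload; a minimal sketch, assuming hypothetical local and HDFS paths:

    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CompareMd5 {
        // hash any stream, local or HDFS, so the two can be compared
        static byte[] md5(InputStream in) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[65536];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
            in.close();
            return md.digest();
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            byte[] a = md5(FileSystem.get(conf).open(new Path("/data/bigfile.txt")));        // hypothetical HDFS path
            byte[] b = md5(FileSystem.getLocal(conf).open(new Path("/home/me/bigfile.txt"))); // hypothetical local path
            System.out.println(Arrays.equals(a, b) ? "match" : "MISMATCH");
        }
    }

If the copy in HDFS hashes the same immediately after the upload but differs later, that points at HDFS; if it differs immediately, the corruption happened on the way in.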
Preferred Java version
Is 1.6.0_17 or 1.6.0_20 preferred as the JRE for Hadoop? Thank you.
Help with Hadoop runtime error
Does anyone know what might be causing this error? I am using Hadoop version 0.20.2 and it happens when I run bin/hadoop dfs -copyFromLocal ...

10/07/09 15:51:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 128.238.55.43:50010
10/07/09 15:51:45 INFO hdfs.DFSClient: Abandoning block blk_2932625575574450984_1002
Re: Help with Hadoop runtime error
Hi Ted, thanks for your reply. That does not seem to make a difference, though. I put that property in the xml file, restarted everything, and tried to transfer the file again, but the same thing occurred. I had my cluster working perfectly for about a year, but I recently had some disk failures, scrubbed all of my machines, reinstalled Linux (same version), and moved from Hadoop 0.20.1 to 0.20.2.

----- Original Message -----
From: Ted Yu yuzhih...@gmail.com
To: common-user@hadoop.apache.org
Sent: Fri, July 9, 2010 4:26:30 PM
Subject: Re: Help with Hadoop runtime error

Please see the description of xcievers at: http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements You can confirm that you have an xcievers problem by grepping the datanode logs for the error message pasted in the last bullet point.

On Fri, Jul 9, 2010 at 1:10 PM, Raymond Jennings III raymondj...@yahoo.com wrote:
Does anyone know what might be causing this error? I am using Hadoop version 0.20.2 and it happens when I run bin/hadoop dfs -copyFromLocal ...
10/07/09 15:51:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 128.238.55.43:50010
10/07/09 15:51:45 INFO hdfs.DFSClient: Abandoning block blk_2932625575574450984_1002
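For reference, the setting being discussed goes in hdfs-site.xml on every datanode (the property name really is spelled with the transposed "ie"; 4096 is the value the HBase requirements page suggests, so treat it as a starting point):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

The datanodes must be restarted for it to take effect.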
Newbie to HDFS compression
Are there instructions on how to enable compression (and which type to use) on HDFS? Does this have to be done during installation, or can it be added to a running cluster? Thanks, Ray
Re: Newbie to HDFS compression
Oh, maybe that's what I meant :-) I recall reading something on this mailing list that the compression codec is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities. Looking at the config xml files, it's not clear what I need to do. Thanks.

----- Original Message -----
From: Eric Sammer esam...@cloudera.com
To: common-user@hadoop.apache.org
Sent: Thu, June 24, 2010 5:09:33 PM
Subject: Re: Newbie to HDFS compression

There is no file-system-level compression in HDFS. You can store compressed files in HDFS, however.

On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III raymondj...@yahoo.com wrote:
Are there instructions on how to enable (which type?) of compression on hdfs? Does this have to be done during installation or can it be added to a running cluster? Thanks, Ray

--
Eric Sammer
twitter: esammer
data: www.cloudera.com
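What Hadoop does ship with is per-file compression at the MapReduce layer; the codec that has to be fetched separately for license reasons is LZO, but the built-in gzip codec needs no extra install. A minimal driver sketch with the 0.20 API:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // in the driver, after creating the Job:
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // intermediate map output can be compressed independently:
    job.getConfiguration().setBoolean("mapred.compress.map.output", true);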
Which version of java is the preferred version?
I recall reading some time ago on this mailing list that certain JRE versions were recommended and others were not. Was 1.6.0_17 the preferred one? Thank you.
Custom partitioner question
I am trying to create my partitioner but I am getting an exception. Is anything required other than providing the method public int getPartition and extending the Partitioner class?

java.lang.RuntimeException: java.lang.NoSuchMethodException: TSPmrV6$TSPPartitioner.<init>()
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:527)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NoSuchMethodException: TSPmrV6$TSPPartitioner.<init>()
    at java.lang.Class.getConstructor0(Unknown Source)
    at java.lang.Class.getDeclaredConstructor(Unknown Source)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
    ... 4 more
Re: Custom partitioner question
Hi Ted, that does not appear to be the problem I am having. I tried adding it as you said but I get the same runtime error. Here is my partitioner:

    public class MyPartitioner extends Partitioner<Text, Text> {
        public MyPartitioner() {
        }
        public int getPartition(Text key, Text value, int num_partitions) {
            String key2 = key.toString();
            int hash = key2.hashCode();
            hash = hash % num_partitions;
            return(hash);
        }
    }

and in my main I have:

    job.setMapOutputValueClass(Text.class);
    job.setMapOutputKeyClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setPartitionerClass(MyPartitioner.class);

Thanks.

--- On Thu, 6/3/10, Ted Yu yuzhih...@gmail.com wrote:
From: Ted Yu yuzhih...@gmail.com
Subject: Re: Custom partitioner question
To: common-user@hadoop.apache.org
Date: Thursday, June 3, 2010, 2:10 PM

An empty ctor is needed for your Partitioner class.

On Thu, Jun 3, 2010 at 10:13 AM, Raymond Jennings III raymondj...@yahoo.com wrote:
I am trying to create my partitioner but I am getting an exception. Is anything required other than providing the method public int getPartition and extending the Partitioner class?
java.lang.RuntimeException: java.lang.NoSuchMethodException: TSPmrV6$TSPPartitioner.<init>() ...
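A likely cause, inferred from the NoSuchMethodException on TSPmrV6$TSPPartitioner.<init>(): the TSPPartitioner in the failing job is a non-static inner class, so its real constructor takes a hidden reference to the enclosing TSPmrV6 instance and ReflectionUtils finds no no-arg constructor. Declaring the nested class static fixes that; a sketch using the names from the trace (the & Integer.MAX_VALUE mask also guards against the negative values String.hashCode() can return, which would otherwise yield a negative partition number):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class TSPmrV6 {
        // static is the important part: a non-static inner class has no no-arg constructor
        public static class TSPPartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text key, Text value, int numPartitions) {
                return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }
    }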
Getting zero length files on the reduce output.
I have a cluster of 12 slave nodes. I see that for some jobs half of the part-r-* output files are zero bytes after the job completes. Does this mean the hash function that distributes the data across the reducer nodes is not working all that well? On other jobs the output is pretty even across all reducers, but on certain jobs only half of the reducers produce files bigger than 0, and it is reproducible. Can I change this hash function in any way? Thanks.
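Unless a job installs its own partitioner, reducer assignment comes from the default HashPartitioner, so skew like this usually means many keys land in the same hash bucket - or there are simply fewer distinct keys than reducers. The default boils down to one line, and job.setPartitionerClass is the hook for replacing it (see the custom partitioner thread above for a full example):

    // what the default HashPartitioner computes, in essence:
    int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;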
How can I synchronize writing to an HDFS file
I want to write to a common HDFS file from within my map method. Given that each task runs in a separate JVM (on separate machines), making a method synchronized will not work, I assume. Are there any file-locking or other methods to guarantee mutual exclusion on HDFS? (I want to append to this file, and I have the append option turned on.) Thanks.
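HDFS offers no file locking, so one common workaround (a pattern sketch, not the only option) is to avoid mutual exclusion entirely: each task writes its own side file named after its task attempt ID, and the pieces are concatenated afterwards with hadoop fs -getmerge. The output directory here is hypothetical:

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // inside the Mapper: one private file per task attempt, so no locking is needed
    String attempt = context.getTaskAttemptID().toString();
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FSDataOutputStream out = fs.create(new Path("/shared/out/" + attempt));
    out.writeBytes("one record\n");
    out.close();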
Decommissioning a node
I've got a dead machine in my cluster. I want to safely update HDFS so that nothing references this machine; then I want to rebuild it and put it back into service in the cluster. Does anyone have any pointers on how to do this (the first part - updating HDFS so that the machine is no longer referenced)? Thank you.
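The stock mechanism for this is the exclude file: point dfs.hosts.exclude at a file listing the host, then tell the namenode to re-read it. In hdfs-site.xml (the file path here is an arbitrary choice):

    <property>
      <name>dfs.hosts.exclude</name>
      <value>/opt/hadoop/conf/excludes</value>
    </property>

then:

    echo dead-node-hostname >> /opt/hadoop/conf/excludes
    bin/hadoop dfsadmin -refreshNodes

For a machine that is already dead there is nothing to drain; once the namenode marks the datanode dead it re-replicates that node's blocks on its own, and fsck will show when nothing is under-replicated anymore.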
Re: Hadoop does not follow my setting
Isn't the number of mappers you specify only a suggestion?

--- On Thu, 4/22/10, He Chen airb...@gmail.com wrote:
From: He Chen airb...@gmail.com
Subject: Hadoop does not follow my setting
To: common-user@hadoop.apache.org
Date: Thursday, April 22, 2010, 12:50 PM

Hi everyone, I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a 30GB file and I plan to test the performance of different numbers of mappers; for example, for a wordcount job I plan to test 22 mappers, then 44, 66, and 110 mappers. I set mapred.map.tasks equal to 22, but when I ran the job it showed 436 mappers total. I thought maybe wordcount sets its parameters inside its own program, so I passed -Dmapred.map.tasks=22 to the program, but it was still 436 on my next try. I found that 30GB divided by 436 is just 64MB, which is exactly my block size. Any suggestions will be appreciated. Thank you in advance!

--
Best Wishes!
Chen He
(402) 613-9298
PhD student, CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln, NE 68588
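That matches the observed behavior: for FileInputFormat, mapred.map.tasks is only a hint, and the real map count is the number of input splits, which defaults to one per 64MB block. To force roughly 22 maps over 30GB, the minimum split size has to be raised to about 30GB/22 ≈ 1.4GB; a sketch of the invocation (the examples jar name varies by release):

    bin/hadoop jar hadoop-0.20.0-examples.jar wordcount \
        -Dmapred.min.split.size=1400000000 input output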
JobTracker website data - can it be increased?
I am running an application that has many iterations, and I find that the JobTracker's web page cuts off many of the initial runs. Is there any way to increase the number of previous jobs retained, so that their results are still available on the JobTracker's web page? Thank you.
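There is a knob for this in mapred-site.xml: the JobTracker keeps the last mapred.jobtracker.completeuserjobs.maximum completed jobs per user in memory (100 by default); older ones drop off the page, though the job history logs still have them. Raising it trades JobTracker memory for retention:

    <property>
      <name>mapred.jobtracker.completeuserjobs.maximum</name>
      <value>500</value>
    </property>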
why does 'jps' lose track of hadoop processes ?
After running Hadoop for some period of time, the command 'jps' fails to report any Hadoop process on any node in the cluster. The processes are still running, as can be seen with 'ps -ef | grep java'. In addition, scripts like stop-dfs.sh and stop-mapred.sh no longer find the processes to stop.
RE: why does 'jps' lose track of hadoop processes ?
That would explain why the processes cannot be stopped, but the mystery of why jps loses track of these active processes still remains. Even when jps does not report any Hadoop process, I can still submit and run jobs just fine. I will have to check, the next time it happens, whether the Hadoop pids are the same as what is in the file. If different, that would somehow mean the Hadoop process was being restarted?

--- On Mon, 3/29/10, Bill Habermaas b...@habermaas.us wrote:
From: Bill Habermaas b...@habermaas.us
Subject: RE: why does 'jps' lose track of hadoop processes ?
To: common-user@hadoop.apache.org
Date: Monday, March 29, 2010, 11:44 AM

Sounds like your pid files are getting cleaned out of whatever directory they are being written to (maybe garbage collection on a temp directory?). Look at this (taken from hadoop-env.sh):

# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids

The hadoop shell scripts look in the directory that is defined there.
Bill

-----Original Message-----
From: Raymond Jennings III [mailto:raymondj...@yahoo.com]
Sent: Monday, March 29, 2010 11:37 AM
To: common-user@hadoop.apache.org
Subject: why does 'jps' lose track of hadoop processes ?

After running hadoop for some period of time, the command 'jps' fails to report any hadoop process on any node in the cluster. The processes are still running, as can be seen with 'ps -ef | grep java'. In addition, scripts like stop-dfs.sh and stop-mapred.sh no longer find the processes to stop.
Re: why does 'jps' lose track of hadoop processes ?
Yes, I am.

--- On Mon, 3/29/10, Bill Au bill.w...@gmail.com wrote:
From: Bill Au bill.w...@gmail.com
Subject: Re: why does 'jps' lose track of hadoop processes ?
To: common-user@hadoop.apache.org
Date: Monday, March 29, 2010, 1:04 PM

Are you running jps under the same user id that the hadoop processes are running under?
Bill

On Mon, Mar 29, 2010 at 11:37 AM, Raymond Jennings III raymondj...@yahoo.com wrote:
After running hadoop for some period of time, the command 'jps' fails to report any hadoop process on any node in the cluster. The processes are still running, as can be seen with 'ps -ef | grep java'. In addition, scripts like stop-dfs.sh and stop-mapred.sh no longer find the processes to stop.
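One detail that would tie both symptoms to the same /tmp cleanup, offered as a hypothesis: jps does not scan the process table at all - it reads the JVM perf-data files under /tmp/hsperfdata_<user>. A cron job such as tmpwatch that purges old files from /tmp removes both those files and the default pid files, so jps goes blind while the daemons keep running. A quick check the next time it happens:

    # each live JVM should have a file named after its pid here
    ls -l /tmp/hsperfdata_$(whoami)
    # compare against what is actually running
    ps -ef | grep java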
Question about ChainMapper
I would like to try to use a ChainMapper/ChainReducer, but I see that the last parameter is a JobConf, which I am not creating since I am using the latest API version. Has anyone tried to do this with the newer API? Can I extract a JobConf object from somewhere? Thanks
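As of 0.20, ChainMapper/ChainReducer exist only for the old org.apache.hadoop.mapred API - there is no JobConf to extract from a new-API Job - so the pragmatic route is to write that one driver against JobConf. A sketch of the old-API call shape, with MyDriver, AMap, and BMap as hypothetical classes:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    JobConf job = new JobConf(MyDriver.class);
    ChainMapper.addMapper(job, AMap.class,
        LongWritable.class, Text.class,   // this mapper's input key/value types
        Text.class, Text.class,           // this mapper's output key/value types
        true, new JobConf(false));        // pass by value, plus a private per-mapper conf
    ChainMapper.addMapper(job, BMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));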
Is there a size limit on a line for a text file?
That is, for the input to a mapper, or for the output of either a mapper or a reducer?
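There is no hard limit in the file format itself, but the 0.20 text reader has a configurable cap: LineRecordReader skips any line longer than mapred.linerecordreader.maxlength, and the default is Integer.MAX_VALUE, so in practice a single line is bounded by task memory. Setting it explicitly guards against one pathological line blowing up a mapper:

    <property>
      <name>mapred.linerecordreader.maxlength</name>
      <value>10000000</value>
    </property>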
java.io.IOException: Spill failed
Any pointers on what might be causing this? Thanks!

java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1006)
    at java.io.DataOutputStream.write(Unknown Source)
    at org.apache.hadoop.io.Text.write(Text.java:282)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:854)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at TSPmrV2$TSPMapper3.MapEmit(TSPmrV2.java:587)
    at TSPmrV2$TSPMapper3.map(TSPmrV2.java:571)
    at TSPmrV2$TSPMapper3.map(TSPmrV2.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201003181420_4634/attempt_201003181420_4634_m_00_0/output/spill142.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1183)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
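Reading the bottom frames of the trace: LocalDirAllocator could not find any directory in mapred.local.dir with room for spill142.out, which usually means the tasktracker's local disk(s) filled up mid-job - either with this job's spills or with leftover jobcache data. A quick check on the affected node:

    # default local dir; adjust to whatever mapred.local.dir points at
    df -h /tmp/hadoop-root/mapred/local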
Is there an easy way to clear old jobs from the jobtracker webpage?
I'd like to be able to clear the list of jobs that have completed running from the JobTracker web page. Is there an easy way to do this without restarting the cluster?
Can I pass a user value to my reducer?
I need to pass a counter value from the main program to my reducer. Can this be done through the context parameter somehow?
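Yes - values set on the job's Configuration in the driver are visible to every task through the context. A sketch (the property key is made up for the example):

    // in the driver, before submitting the job:
    job.getConfiguration().setInt("myapp.counter", 42);

    // inside the Reducer subclass:
    @Override
    protected void setup(Context context) {
        int counter = context.getConfiguration().getInt("myapp.counter", 0);
    }

Note the flow is one-way and fixed at submit time; values computed by mappers while the job runs have to travel as counters or as part of the map output instead.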
I want to group similar keys in the reducer.
Is it possible to override a method in the reducer so that similar keys will be grouped together? For example, I want all keys with the values KEY1 and KEY2 to be merged together. (My reducer has a KEY of type Text.) Thanks.
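Not by overriding a method on the Reducer itself, but the job has a hook for exactly this: a grouping comparator decides which consecutive keys share one reduce() call. One caveat: it only merges keys that are adjacent in sort order, so if other keys can sort between KEY1 and KEY2, the sort comparator needs the same aliasing rule. A sketch using the KEY1/KEY2 example from the question:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class AliasGroupingComparator extends WritableComparator {
        public AliasGroupingComparator() {
            super(Text.class, true); // true = instantiate keys for comparison
        }

        private static String canonical(String k) {
            return k.equals("KEY2") ? "KEY1" : k; // collapse the aliases into one group
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return canonical(a.toString()).compareTo(canonical(b.toString()));
        }
    }

    // in the driver:
    job.setGroupingComparatorClass(AliasGroupingComparator.class);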
How do I upgrade my hadoop cluster using hadoop?
I thought there was a utility that you run from one node to do the upgrade for you, and it would do a copy to every other node?
SEQ
Are there any examples that show how to create a SEQ (SequenceFile) file in HDFS?
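A minimal sketch with the 0.20 SequenceFile API; the path and the key/value types are arbitrary choices for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/tmp/example.seq"), IntWritable.class, Text.class);
            try {
                for (int i = 0; i < 10; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            } finally {
                writer.close();
            }
        }
    }

SequenceFile.Reader is the mirror image for reading, and SequenceFileInputFormat/SequenceFileOutputFormat plug the format straight into MapReduce jobs.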
Anyone use MapReduce for TSP approximations?
I am interested in seeing how mapreduce could be used to approximate the traveling salesman problem. Anyone have a pointer? Thanks.
How do I get access to the Reporter within Mapper?
I am using the non-deprecated Mapper. Can I obtain it from the Context somehow? Does anyone have an example of this? Thanks.
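In the new API the Context has absorbed the old Reporter's duties, so there is nothing separate to obtain - counters, status, and progress all go through the context (the group and counter names here are made up):

    // inside map() of the new-API Mapper:
    context.getCounter("MyApp", "RecordsSeen").increment(1);
    context.setStatus("processing " + key);
    context.progress(); // keeps a long-running task from being killed as unresponsive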
Is it possible to run multiple mapreduce jobs from within the same application
In other words: I have a situation where I want to feed the output of the first iteration of my MapReduce job to a second iteration, and so on. I have a for loop in my main method to set up the job parameters and run it through all iterations, but on about the third run the Hadoop processes lose their association with the 'jps' command, and then weird things start happening. I remember reading somewhere about chaining - is that what is needed? I'm not sure what causes jps to stop reporting the Hadoop processes even though they are still active, as can be seen with the ps command. Thanks. (This is on version 0.20.1)
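Running jobs back to back from one driver needs no special chaining support; ChainMapper is for stringing mappers together inside a single job, not for iteration. A common shape for the loop, with MyDriver/MyMapper/MyReducer and the path scheme as placeholders, where each iteration reads the previous one's output:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    int numIterations = 5;               // placeholder
    Path in = new Path("/data/iter0");   // hypothetical seed input
    for (int i = 1; i <= numIterations; i++) {
        Job job = new Job(new Configuration(), "iteration-" + i);
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        Path out = new Path("/data/iter" + i);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        if (!job.waitForCompletion(true)) {
            throw new RuntimeException("iteration " + i + " failed");
        }
        in = out; // this round's output feeds the next round
    }

The jps symptom is more likely the /tmp cleanup discussed in the earlier jps thread than anything this loop does.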
Question about Join.java example
Is there a typo in the Join.java example that comes with Hadoop? It has the line: JobConf jobConf = new JobConf(getConf(), Sort.class); Shouldn't that be Join.class? Is there an equivalent example that uses the newer API instead of the deprecated calls?
Re: Need to re replicate
I would try running the rebalancer utility. I would be curious to see what that does and whether it fixes it.

--- On Wed, 1/27/10, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:
From: Ananth T. Sarathy ananth.t.sara...@gmail.com
Subject: Need to re replicate
To: common-user@hadoop.apache.org
Date: Wednesday, January 27, 2010, 9:28 PM

One of our datanodes went bye-bye. We added a bunch more datanodes, but when I do an fsck I get a report that a bunch of files are only replicated on 2 servers, which makes sense, because we had 3 and lost one. Now that we have 6 more, is there anything I need to do to replicate those files, or will the cluster fix itself?
Ananth
Re: Passing whole text file to a single map
Not sure if this solves your problem, but I had a similar case where there was unique data at the beginning of the file, and if that file was split between maps I would lose it for the 2nd and subsequent maps. I was able to pull the file name from the conf and read the first two lines for every map.

--- On Sat, 1/23/10, stolikp stol...@o2.pl wrote:
From: stolikp stol...@o2.pl
Subject: Passing whole text file to a single map
To: core-u...@hadoop.apache.org
Date: Saturday, January 23, 2010, 9:49 AM

I've got some text files in my input directory and I want to pass each single text file (the whole file, not just a line) to a map (one file per map). How can I do this? TextInputFormat splits text into lines and I do not want this to happen. I tried: http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F but it doesn't work for me; the compiler doesn't know what NonSplitableTextInputFormat.class is. I'm using hadoop 0.20.1
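The NonSplitableTextInputFormat named in that streaming document is a class you are expected to write yourself, which is why the compiler cannot find it. With the 0.20 Java API it is a tiny subclass; note that each map still receives the file line by line - it just receives all of the lines:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one split, and therefore one map, per file
        }
    }

    // in the driver: job.setInputFormatClass(NonSplittableTextInputFormat.class);

Getting the whole file as a single key/value record would additionally need a custom RecordReader.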
Re: Google has obtained the patent over mapreduce
I am not a patent attorney either, but for what it's worth, many times a patent is sought solely to protect a company from being sued by another. So even though Hadoop is out there, it could be the case that Google has no intent of suing anyone who uses it - they just wanted to protect themselves from someone else claiming it as their own and then suing Google. But yes, the patent system clearly has problems, as you stated.

--- On Wed, 1/20/10, Edward Capriolo edlinuxg...@gmail.com wrote:
From: Edward Capriolo edlinuxg...@gmail.com
Subject: Re: Google has obtained the patent over mapreduce
To: common-user@hadoop.apache.org
Date: Wednesday, January 20, 2010, 12:09 PM

Interesting situation. I try to compare mapreduce to the camera. Let's say Google is Kodak, Apache is Polaroid, and MapReduce is a camera. Imagine Kodak invented the camera privately, never sold it to anyone, but produced some document describing what a camera did. Polaroid followed the document, produced a camera, and sold it publicly. Kodak later patents the camera, even though no one outside of Kodak can confirm Kodak ever made a camera before Polaroid. Not saying that is what happened here, but Google releasing the GFS pdf was a large factor in causing Hadoop to happen. Personally, it seems like they gave away too much information before they had the patent. The patent system faces many problems, including this 'back to the future' issue: it takes so long to get a patent that no one can wait, and by the time a patent is issued there are already multiple viable implementations of it. I am no patent lawyer or anything, but I notice the phrase "master process" all over the claims. Maybe if a piece of software (hadoop) had a distributed process, that would be sufficient to say hadoop technology does not infringe on this patent. I think it would be interesting to look deeply at each claim and determine whether hadoop could be designed to not infringe on these patents, to deal with what-if scenarios.

On Wed, Jan 20, 2010 at 11:29 AM, Ravi ravindra.babu.rav...@gmail.com wrote:
Hi, I too read about that news. I don't think that it will be any problem. However, Google didn't invent the model. Thanks.

On Wed, Jan 20, 2010 at 9:47 PM, Udaya Lakshmi udaya...@gmail.com wrote:
Hi, As a user of hadoop, is there anything to worry about in Google obtaining the patent over mapreduce? Thanks.
Obtaining name of file in map task
I am trying to determine the name of the file that is being used for the map task. I am trying to use the setup() method to read the input file name with:

    public void setup(Context context) {
        Configuration conf = context.getConfiguration();
        String inputfile = conf.get("map.input.file");
        ...

But inputfile is always null. Does anyone have a pointer on how to do this? Thanks.
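With the new (0.20) API the map.input.file property is not populated in the task configuration, which is why the get() returns null; the input split itself carries the path, though, assuming a FileInputFormat-based job:

    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    @Override
    protected void setup(Context context) {
        // valid when the InputFormat is a FileInputFormat subclass
        FileSplit split = (FileSplit) context.getInputSplit();
        String inputfile = split.getPath().getName();
    }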
Re: Is it possible to share a key across maps?
Hi Gang, I was able to use this on an older version that uses the JobClient class to run the job, but not with the newer API and the Job class. The Job class appears to use a setup() method instead of a configure() method, but the map.input.file attribute does not appear to be available via the conf in the setup() method. Have you tried to do what you described using the newer API? Thank you.

--- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn wrote:
From: Gang Luo lgpub...@yahoo.com.cn
Subject: Re: Is it possible to share a key across maps?
To: common-user@hadoop.apache.org
Date: Friday, January 8, 2010, 10:03 PM

I don't do that in the map method, but in the configure(JobConf) method, which runs ahead of any map method call in that map task. JobConf.get("map.input.file") can tell you which file this map task is processing. Use this path to read the first line of the corresponding file. All of this is done in the configure method, that is, before any map method is called.
-Gang

----- Original Message -----
From: Raymond Jennings III raymondj...@yahoo.com
To: common-user@hadoop.apache.org
Date: Friday, January 8, 2010, 7:54:30 PM
Subject: Re: Is it possible to share a key across maps?

Hi, do you do this in the map method (open the file and read the first line)? Could you explain a little more how you do it with configure()? Thank you.

--- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn wrote:
From: Gang Luo lgpub...@yahoo.com.cn
Subject: Re: Is it possible to share a key across maps?
To: common-user@hadoop.apache.org
Date: Friday, January 8, 2010, 4:46 PM

I would do it like this: in each map task, I get the input file for this mapper in configure(), and manually read the first line of that file to get the user ID. Then I start running the map function.
-Gang

----- Original Message -----
From: Raymond Jennings III raymondj...@yahoo.com
To: common-user@hadoop.apache.org
Date: Friday, January 8, 2010, 4:23:15 PM
Subject: Is it possible to share a key across maps?

I have large files where the userid is the first line of each file. I want to use that value as the output of the map phase for each subsequent line of the file. If each map task gets a chunk of this file, only one map task will read the key value from the first line. Is there any way I can force the other map tasks to wait until this key is read and then somehow pass this value to the other map tasks? Or is my reasoning incorrect? Thanks.
Re: Is it possible to share a key across maps?
It looks like what you are referring to is the deprecated class - which has made for some confusing conversations in the past. It seems like many users still use the older API, and most of the examples still use it. I would like to stay with the more recent API, where it looks like the call is actually setup() instead of configure(). Not sure if it's a one-to-one mapping, though.

--- On Fri, 1/8/10, Jeff Zhang zjf...@gmail.com wrote:
From: Jeff Zhang zjf...@gmail.com
Subject: Re: Is it possible to share a key across maps?
To: common-user@hadoop.apache.org
Date: Friday, January 8, 2010, 11:15 PM

Actually you can treat the mapper task as a template design pattern; here's the pseudocode:

Mapper.configure(JobConf)
for each record in InputSplit:
    do Mapper.map(key, value, outputkey, outputvalue)
Mapper.close()

Any subclass of Mapper can override the three methods configure(), map(), and close() to do customization.

2010/1/8 Gang Luo lgpub...@yahoo.com.cn:
I don't do that in the map method, but in the configure(JobConf) method, which runs ahead of any map method call in that map task. JobConf.get("map.input.file") can tell you which file this map task is processing. Use this path to read the first line of the corresponding file. All of this is done in the configure method, that is, before any map method is called.
-Gang

--
Best Regards
Jeff Zhang
Can map reduce methods print to console in eclipse?
I tried writing to stderr, but I guess that is not valid. Can someone tell me how I can output some text during either the map or reduce methods?
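Writing to stderr is valid - it just does not land where expected. Under the LocalJobRunner inside Eclipse it goes straight to the Eclipse console; on a real cluster, each task's stdout/stderr are captured under logs/userlogs/ on the node that ran the task and are browsable from the task-details page of the JobTracker web UI:

    // Eclipse console in local mode; the task attempt's stderr file on a cluster
    System.err.println("map saw key: " + key);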
Is it possible to share a key across maps?
I have large files where the userid is the first line of each file. I want to use that value as the output of the map phase for each subsequent line of the file. If each map task gets a chunk of this file, only one map task will read the key value from the first line. Is there any way I can force the other map tasks to wait until this key is read and then somehow pass this value to the other map tasks? Or is my reasoning incorrect? Thanks.
Other sources for hadoop api help
I am trying to develop some Hadoop programs, and I see that most of the examples included in the distribution use deprecated classes and methods. Are there any other sources for learning the API besides the javadocs, which are not the best source for beginners trying to write Hadoop programs? Thanks.
Jobs stop at 0%
I have recently been seeing a problem where jobs that previously worked fine (with no code changes) stop at map 0%. Restarting Hadoop on the cluster solves the problem, but there is nothing in the log files to indicate what the cause is. Has anyone seen something similar?
Errors seen on the jobtracker node
Does anyone have any idea what might be causing the following three errors I am seeing? I am not able to determine what job was running or what was happening at the times listed, but I am hoping that with a little more information I can track down what is happening:

hadoop-root-jobtracker-pingo-2.poly.edu.log.2009-11-11:2009-11-11 11:38:57,720 ERROR org.apache.hadoop.mapred.JobHistory: Failed creating job history log file, disabling history
hadoop-root-jobtracker-pingo-2.poly.edu.log.2009-11-11:2009-11-11 11:38:57,782 ERROR org.apache.hadoop.mapred.JobHistory: Failed to store job conf on the local filesystem
hadoop-root-jobtracker-pingo-2.poly.edu.log.2009-12-13:2009-12-13 22:30:04,495 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file . by DFSClient_-340809610
Combiner phase question
Does the combiner run once per data node or once per map task? (That is, can it run multiple times on the same data node, after each map task?) Thanks.
Good idea to run NameNode and JobTracker on same machine?
Do people normally combine these two processes onto one machine? Currently I have them on separate machines, but I am wondering whether they really use that much CPU time; maybe I should combine them and create another DataNode.
Has anyone gotten the Hadoop eclipse plugin to work on Windows?
I have been pulling my hair out on this one. I tried building it within Eclipse - no errors - but when I put the jar file in and restart Eclipse, I can see the Map/Reduce perspective, but once I try to do anything it bombs with random cryptic errors. I looked at Stephen's notes on JIRA but still no go. I am desperate to get this working, so bribes, kick-backs, and other reciprocity will be gladly considered. ;-) Thanks! Ray
build / install hadoop plugin question
The plugin that is included in the Hadoop distribution under src/contrib/eclipse-plugin - how does that get installed? It does not appear to be in a standard plugin format. Do I have to build it first, and if so, can you tell me how? Thanks. Ray
Re: build / install hadoop plugin question
That's what I would normally do for a plugin, but this has a sub-directory of eclipse-plugin (and not plugins), and the files are all java files, not class files. This is in the Hadoop directory src/contrib/eclipse-plugin. It looks to me like it has to be built first and then copied into the plugins directory?

--- On Fri, 11/20/09, Dhaivat Pandit ceo.co...@gmail.com wrote:
From: Dhaivat Pandit ceo.co...@gmail.com
Subject: Re: build / install hadoop plugin question
To: common-user@hadoop.apache.org
Date: Friday, November 20, 2009, 9:05 PM

Just paste it in the eclipse installation's plugins folder and restart eclipse.
-dp

On Nov 20, 2009, at 2:08 PM, Raymond Jennings III raymondj...@yahoo.com wrote:
The plugin that is included in the hadoop distribution under src/contrib/eclipse-plugin - how does that get installed as it does not appear to be in a standard plugin format? Do I have to build it first and if so can you tell me how? Thanks. Ray
Re: build / install hadoop plugin question
Could you explain further how to do this? I have never built a plugin before. Do I do this from within Eclipse? Thanks!

--- On Fri, 11/20/09, Dhaivat Pandit ceo.co...@gmail.com wrote:
From: Dhaivat Pandit ceo.co...@gmail.com
Subject: Re: build / install hadoop plugin question
To: common-user@hadoop.apache.org
Date: Friday, November 20, 2009, 9:53 PM

Yes, if it's not built you can do "ant eclipse". It will generate the plugin jar and you can paste it into the plugin directory.
-dp

On Nov 20, 2009, at 6:49 PM, Raymond Jennings III raymondj...@yahoo.com wrote:
That's what I would normally do for a plugin, but this has a sub-directory of eclipse-plugin (and not plugins), and the files are all java files, not class files ...
Can I change the block size and then restart?
Can I just change the block size in the config and restart, or do I have to reformat? It's okay if what is currently in the file system stays at the old block size, if that's possible.
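A restart without reformatting gives exactly that behavior: dfs.block.size is applied per file at write time, so existing files keep their old block size and only newly written files pick up the new one. In hdfs-site.xml:

    <property>
      <name>dfs.block.size</name>
      <value>134217728</value> <!-- 128MB, as an example; the stock default is 67108864 (64MB) -->
    </property>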
Re: About Hadoop pseudo distribution
If I understand you correctly, you can run jps and see the java JVMs running on each machine - that should tell you whether you are running in pseudo-distributed mode or not.

--- On Thu, 11/12/09, kvorion kveinst...@gmail.com wrote:
From: kvorion kveinst...@gmail.com
Subject: About Hadoop pseudo distribution
To: core-u...@hadoop.apache.org
Date: Thursday, November 12, 2009, 12:02 PM

Hi All, I have been trying to set up a hadoop cluster on a number of machines, a few of which are multicore machines. I have been wondering whether the hadoop pseudo distribution is something that can help me take advantage of the multiple cores on my machines. All the tutorials say that the pseudo-distributed mode lets you start each daemon in a separate java process. I have the following configuration settings in hadoop-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://athena:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>athena:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

I am not sure if this is really running in pseudo-distributed mode. Are there any indicators or outputs that confirm what mode you are running in?
User permissions on dfs ?
Is there a way that I can set up directories in DFS for individual users and set the permissions such that only that user can read and write to them, so that if I do a 'hadoop dfs -ls' I would get /user/user1, /user/user2, etc., each directory only readable and writable by the respective user? I don't want to format an entire DFS filesystem for each user, just let them have one sub-directory off of the main /users DFS directory that only they (and root) can read and write to. Right now, if I run a mapreduce app as any user but root, I am unable to save the intermediate files in DFS. Thanks!
Re: User permissions on dfs ?
Ah okay, I was looking at the options for hadoop and it only shows fs and not dfs - now I realize they are one and the same. Thanks!

--- On Wed, 11/11/09, Allen Wittenauer awittena...@linkedin.com wrote:
From: Allen Wittenauer awittena...@linkedin.com
Subject: Re: User permissions on dfs ?
To: common-user@hadoop.apache.org
Date: Wednesday, November 11, 2009, 1:59 PM

On 11/11/09 8:50 AM, Raymond Jennings III raymondj...@yahoo.com wrote:
Is there a way that I can setup directories in dfs for individual users and set the permissions such that only that user can read and write ...

A) Don't run Hadoop as root. All of your user-submitted code will also run as root. This is bad. :)

B) You should be able to create user directories:
hadoop dfs -mkdir /user/username
hadoop dfs -chown username /user/username
...

C) If you are attempting to run pig (and some demos), it has a dependency on a world-writable /tmp. :(
hadoop dfs -mkdir /tmp
hadoop dfs -chmod a+w /tmp

D) If you are on Solaris, whoami isn't in the default path. This confuses the hell out of Hadoop, so you may need to hack all your machines to make Hadoop happy here.
Error with replication and namespaceID
On the actual datanodes I see the following exception. I am not sure what the namespaceID is or how to sync them. Thanks for any advice!

/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = pingo-3.poly.edu/128.238.55.33
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.1
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009
/
2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: namenode namespaceID = 1016244663; datanode namespaceID = 1687029285
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

--- On Mon, 11/9/09, Boris Shkolnik bo...@yahoo-inc.com wrote:
From: Boris Shkolnik bo...@yahoo-inc.com
Subject: Re: newbie question - error with replication
To: common-user@hadoop.apache.org
Date: Monday, November 9, 2009, 5:02 PM

Make sure you have at least one datanode running. Look at the datanode log file. (logs/*-datanode-*.log)
Boris.

On 11/9/09 7:15 AM, Raymond Jennings III raymondj...@yahoo.com wrote:
I am trying to resolve an IOException error. I have a basic setup and shortly after running start-dfs.sh I get:
error: java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
Any pointers on how to resolve this? Thanks!
Re: Error with replication and namespaceID
Thanks!!! That worked! I guess I can edit the number on the datanodes as well, but if there is an even more official way to resolve this I would be interested in hearing about it.

--- On Tue, 11/10/09, Edmund Kohlwey ekohl...@gmail.com wrote:
From: Edmund Kohlwey ekohl...@gmail.com
Subject: Re: Error with replication and namespaceID
To: common-user@hadoop.apache.org
Date: Tuesday, November 10, 2009, 1:46 PM

Hi Ray,
You'll probably find that even though the name node starts, it doesn't have any data nodes and is completely empty. Whenever hadoop creates a new filesystem, it assigns a large random number to it to prevent you from mixing datanodes from different filesystems by accident. When you reformat the name node, its FS has one ID, but your data nodes still have chunks of the old FS with a different ID and so will refuse to connect to the namenode. You need to make sure these are cleaned up before reformatting. You can do it just by deleting the data node directory, although there's probably a more official way to do it.

On 11/10/09 11:01 AM, Raymond Jennings III wrote:
On the actual datanodes I see the following exception ... Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: namenode namespaceID = 1016244663; datanode namespaceID = 1687029285 ...
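For the record, the number lives in a plain text file named VERSION under each storage directory, so the edit-the-number route means changing the namespaceID line on each datanode while it is stopped:

    # on a datanode (path taken from the dfs.data.dir in the logs above)
    cat /tmp/hadoop-root/dfs/data/current/VERSION
    # namespaceID=1687029285   <- make this match the namenode's namespaceID, then restart

Deleting the data directory, as above, stays the cleaner option when the data is expendable, since it lets the datanode re-register fresh.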
newbie question - error with replication
I am trying to resolve an IOException error. I have a basic setup, and shortly after running start-dfs.sh I get:

error: java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

Any pointers on how to resolve this? Thanks!