Archives larger than 2^31 bytes in DistributedCache
In hadoop-0.17 we tried to use a 2.2GB archive and seemingly ran into http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6599383:

java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:114)
        at java.util.zip.ZipFile.<init>(ZipFile.java:131)
        at org.apache.hadoop.fs.FileUtil.unZip(FileUtil.java:421)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:338)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:161)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:137)

Is there any known workaround for hadoop-0.17 (other than multiple smaller archives)? Is it correct to assume that in hadoop-0.18 this is no longer an issue when using tar.gz?

Thanks,
Christian
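For reference, a minimal sketch of shipping an archive through the DistributedCache (the HDFS path and link name are hypothetical, and whether a .tar.gz gets unpacked with tar rather than java.util.zip depends on the Hadoop version, which is exactly the question above):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
    // Register the archive; tasks see the unpacked contents under the link name after the '#'.
    DistributedCache.addCacheArchive(
        new URI("hdfs:///user/christian/big-archive.tar.gz#bigarchive"), conf);
    DistributedCache.createSymlink(conf);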
Please help, don't know how to solve--java.io.IOException: WritableName can't load class
Hello, guys, I am very new to Hadoop. I was trying to read Nutch data files using a script I found on http://wiki.apache.org/nutch/Getting_Started . After two days of trying, I still cannot get it to work. The error I now get is "java.lang.RuntimeException: java.io.IOException: WritableName can't load class". Below is my script:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

/**
 * @author mudong
 */
public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // TODO code application logic here
        try {
            Configuration conf = new Configuration();
            conf.addResource(new Path("/home/mudong/programming/java/hadoop-0.17.2.1/conf/hadoop-default.xml"));
            //conf.addResource(new Path("/home/mudong/programming/java/hadoop-0.18.1/conf/hadoop-default.xml"));
            FileSystem fs = FileSystem.get(conf);

            String seqFile = new String("/home/mudong/programming/java/nutch-0.9/crawl/segments/20081021075837/content/part-0");
            MapFile.Reader reader;
            reader = new MapFile.Reader(fs, seqFile, conf);

            Class keyC = reader.getKeyClass();
            Class valueC = reader.getValueClass();

            while (true) {
                WritableComparable key = null;
                Writable value = null;
                try {
                    key = (WritableComparable) keyC.newInstance();
                    value = (Writable) valueC.newInstance();
                } catch (Exception ex) {
                    ex.printStackTrace();
                    System.exit(-1);
                }
                try {
                    if (!reader.next(key, value)) {
                        break;
                    }
                    System.out.println(key);
                    System.out.println(value);
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("Exception occured. " + e);
                    break;
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Exception occured. " + e);
        }
    }
}

When I run the script above, I get error messages like the ones below:

java.lang.RuntimeException: java.io.IOException: WritableName can't load class
        at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1612)
Exception occured. java.lang.RuntimeException: java.io.IOException: WritableName can't load class
        at org.apache.hadoop.io.MapFile$Reader.getValueClass(MapFile.java:248)
        at test.Main.main(Main.java:36)
Caused by: java.io.IOException: WritableName can't load class
        at org.apache.hadoop.io.WritableName.getClass(WritableName.java:74)
        at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1610)
        ... 2 more
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.protocol.Content
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:581)
        at org.apache.hadoop.io.WritableName.getClass(WritableName.java:72)

I've tried a lot of things, but it's just not working. I use hadoop-0.17.2.1.

Thanks a lot, guys,
Rongdong
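For anyone hitting the same error: the root cause shown in the trace is the ClassNotFoundException for org.apache.nutch.protocol.Content, i.e. the Nutch classes are not on the program's classpath, so WritableName cannot instantiate the value class recorded in the MapFile. The simplest fix is to add the Nutch jar (and its lib directory) to the JVM classpath when launching. A minimal sketch of an alternative, done in code (the jar path is hypothetical):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;

    // Hand the Configuration a class loader that can see the Nutch classes,
    // so Configuration.getClassByName() / WritableName.getClass() can resolve them.
    URL nutchJar = new File("/home/mudong/programming/java/nutch-0.9/nutch-0.9.jar").toURI().toURL();
    conf.setClassLoader(new URLClassLoader(new URL[] { nutchJar }, conf.getClassLoader()));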
RE: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1
Wow, if the issue is fixed with version 0.20, then could we please have a patch for version 0.18? Thanks, Deepika -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: Thursday, October 30, 2008 12:19 PM To: core-user@hadoop.apache.org Subject: Re: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1 So, Philippe reports that the problem goes away with 0.20-dev (trunk?): http://mahout.markmail.org/message/swmzreg6fnzf6icv We aren't totally clear on the structure of SVN for Hadoop, but it seems like it is not fixed by this patch. On Oct 29, 2008, at 10:28 AM, Grant Ingersoll wrote: > We'll try it out... > > On Oct 28, 2008, at 3:00 PM, Arun C Murthy wrote: > >> >> On Oct 27, 2008, at 7:05 PM, Grant Ingersoll wrote: >> >>> Hi, >>> >>> Over in Mahout (lucene.a.o/mahout), we are seeing an oddity with >>> some of our clustering code and Hadoop 0.18.1. The thread in >>> context is at: http://mahout.markmail.org/message/vcyvlz2met7fnthr >>> >>> The problem seems to occur when going from 0.17.2 to 0.18.1. In >>> the user logs, we are seeing the following exception: >>> 2008-10-27 21:18:37,014 INFO org.apache.hadoop.mapred.Merger: Down >>> to the last merge-pass, with 2 segments left of total size: 5011 >>> bytes >>> 2008-10-27 21:18:37,033 WARN org.apache.hadoop.mapred.ReduceTask: >>> attempt_200810272112_0011_r_00_0 Merge of the inmemory files >>> threw an exception: java.io.IOException: Intermedate merge failed >>> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier >>> $InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147) >>> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier >>> $InMemFSMergeThread.run(ReduceTask.java:2078) >>> Caused by: java.lang.NumberFormatException: For input string: "[" >> >> If you are sure that this isn't caused by your application-logic, >> you could try running with http://issues.apache.org/jira/browse/HADOOP-4277 >> . >> >> That bug caused many a ship to sail in large circles, hopelessly. >> >> Arun >> >>> >>> at >>> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java: >>> 1224) >>> at java.lang.Double.parseDouble(Double.java:510) >>> at >>> org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java: >>> 60) >>> at >>> org >>> .apache >>> .mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256) >>> at >>> org >>> .apache >>> .mahout >>> .clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38) >>> at >>> org >>> .apache >>> .mahout >>> .clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31) >>> at org.apache.hadoop.mapred.ReduceTask >>> $ReduceCopier.combineAndSpill(ReduceTask.java:2174) >>> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access >>> $3100(ReduceTask.java:341) >>> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier >>> $InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134) >>> >>> And in the main output log (from running bin/hadoop jar mahout/ >>> examples/build/apache-mahout-examples-0.1-dev.job >>> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job) we see: >>> 08/10/27 21:18:41 INFO mapred.JobClient: Task Id : >>> attempt_200810272112_0011_r_00_0, Status : FAILED >>> java.io.IOException: attempt_200810272112_0011_r_00_0The >>> reduce copier failed >>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255) >>> at org.apache.hadoop.mapred.TaskTracker >>> $Child.main(TaskTracker.java:2207) >>> >>> If I run this exact same job on 0.17.2 it all runs fine. 
I >>> suppose either a bug was introduced in 0.18.1 or a bug was fixed >>> that we were relying on. Looking at the release notes between the >>> fixes, nothing in particular struck me as related. If it helps, I >>> can provide the instructions for how to run the example in >>> question (they need to be written up anyway!) >>> >>> >>> I see some related things at http://hadoop.markmail.org/search/?q=Merge+of+the+inmemory+files+threw+a n+exception >>> , but those are older, it seems, so not sure what to make of them. >>> >>> Thanks, >>> Grant >> > > -- > Grant Ingersoll > Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans. > http://www.lucenebootcamp.com > > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > >
Re: Status FUSE-Support of HDFS
It has come a long way since 0.18, and Facebook keeps our (0.17) dfs mounted via fuse and uses that for some operations. There have recently been some problems with fuse-dfs when used in a multithreaded environment, but those have been fixed in 0.18.2 and 0.19. (Do not use 0.18 or 0.18.1.)

The current (known) issues are:

1. Wrong semantics when copying over an existing file - namely it does a delete and then re-creates the file, so ownership/permissions may end up wrong. There is a patch for this.
2. When directories have 10s of thousands of files, performance can be very poor.
3. POSIX truncate is supported only for truncating a file to 0 size, since HDFS doesn't support truncate.
4. Appends are not supported - this is a libhdfs problem and there is a patch for it.

It is still a pre-1.0 product for sure, but it has been pretty stable for us.

-- pete

On 10/31/08 9:08 AM, "Robert Krüger" <[EMAIL PROTECTED]> wrote:

Hi,

could anyone tell me what the current status of FUSE support for HDFS is? Is this something that can be expected to be usable in a few weeks/months in a production environment? We have been really happy/successful with HDFS in our production system. However, some software we use in our application simply requires an OS-level file system, which currently requires us to do a lot of copying between HDFS and a regular file system for processes which require that software; FUSE support would really eliminate that one disadvantage we have with HDFS. We wouldn't even require the performance to be outstanding, because just by eliminating the copy step we would greatly increase the throughput of those processes.

Thanks for sharing any thoughts on this.

Regards,
Robert
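For context, mounting with fuse-dfs generally boils down to something like the following (namenode host, port, and mount point are hypothetical, and the exact wrapper script and options vary by release, so check the README under src/contrib/fuse-dfs):

    # build the fuse-dfs contrib module, then mount HDFS at /mnt/hdfs
    ./fuse_dfs_wrapper.sh dfs://namenode-host:9000 /mnt/hdfs
    ls /mnt/hdfs    # ordinary POSIX tools now see HDFS paths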
Re: Mapper settings...
On Oct 31, 2008, at 3:15 PM, Bhupesh Bansal wrote: Why do we need these setters in JobConf ?? jobConf.setMapOutputKeyClass(String.class); jobConf.setMapOutputValueClass(LongWritable.class); Just historical. The Mapper and Reducer interfaces didn't use to be generic. (Hadoop used to run on Java 1.4 too...) It would be nice to remove the need to call them. There is an old bug open to check for consistency HADOOP-1683. It would be even better to make the setting of both the map and reduce output types optional if they are specified by the template parameters. -- Owen
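For completeness, the calls being discussed look like this in a job driver today (class names are illustrative; note the key type is Text rather than java.lang.String, since map output keys must be Writable):

    JobConf job = new JobConf(MyJob.class);      // MyJob/MyMapper/MyReducer are placeholders
    job.setMapperClass(MyMapper.class);          // e.g. emits <Text, LongWritable>
    job.setReducerClass(MyReducer.class);
    // Needed whenever the map output types differ from the job's final output types;
    // HADOOP-1683 is about checking these against what the Mapper/Reducer actually declare.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);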
Re: LeaseExpiredException and too many xceiver
Config on most Y! clusters sets dfs.datanode.max.xcievers to a large value .. something like 1k to 2k. You could try that. Raghu. Nathan Marz wrote: Looks like the exception on the datanode got truncated a little bit. Here's the full exception: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.115-50010-1225485937590, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256 at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) On Oct 31, 2008, at 2:49 PM, Nathan Marz wrote: Hello, We are seeing some really bad errors on our hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errrors. In the namenode logs, we see a ton of errors like: 2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$ org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/dustintmp/shredded_dataunits/_temporary/_attempt_200810311418_0002_m_23_0$ at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166) at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) In the datanode logs, we see a ton of errors like: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$ of concurrent xcievers 256 at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) Anyone have any ideas on what may be wrong? Thanks, Nathan Marz Rapleaf
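For reference, the setting Raghu mentions lives in the datanode configuration; a sketch of the hadoop-site.xml entry (the value is illustrative, the property name really is spelled "xcievers", and datanodes need a restart to pick the change up):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2048</value>
    </property>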
Mapper settings...
Hey guys, Just curious, Why do we need these setters in JobConf ?? jobConf.setMapOutputKeyClass(String.class); jobConf.setMapOutputValueClass(LongWritable.class); We should be able to extract these from OutputController of Mapper class ?? IMHO, they have to be consistent with OutputCollector class. so why have extra point of failures ? Best Bhupesh
Re: To Compute or Not to Compute on Prod
Currently, I'm just researching so I'm just playing with the idea of streaming log data into the HDFS. I'm confused about: "...all you need is a Hadoop install. Your production node doesn't need to be a datanode." If my production node is *not* a dataNode then how can I do "hadoop dfs put?" I was under the impression that when I install HDFS on a cluster each node in the cluster is a dataNode. Shahab On Fri, Oct 31, 2008 at 1:46 PM, Norbert Burger <[EMAIL PROTECTED]>wrote: > What are you using to "stream logs into the HDFS"? > > If the command-line tools (ie., "hadoop dfs put") work for you, then all > you > need is a Hadoop install. Your production node doesn't need to be a > datanode. > > On Fri, Oct 31, 2008 at 2:35 PM, shahab mehmandoust <[EMAIL PROTECTED] > >wrote: > > > I want to stream data from logs into the HDFS in production but I do NOT > > want my production machine to be apart of the computation cluster. The > > reason I want to do it in this way is to take advantage of HDFS without > > putting computation load on my production machine. Is this possible*?* > > Furthermore, is this unnecessary because the computation would not put a > > significant load on my production box (obviously depends on the > map/reduce > > implementation but I'm asking in general)*?* > > > > I should note that our prod machine hosts our core web application and > > database (saving up for another box :-). > > > > Thanks, > > Shahab > > >
Re: LeaseExpiredException and too many xceiver
Looks like the exception on the datanode got truncated a little bit. Here's the full exception: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.115-50010-1225485937590, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256 at org.apache.hadoop.dfs.DataNode $DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) On Oct 31, 2008, at 2:49 PM, Nathan Marz wrote: Hello, We are seeing some really bad errors on our hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errrors. In the namenode logs, we see a ton of errors like: 2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/ shredded_dataunits/_t$ org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/ dustintmp/shredded_dataunits/_temporary/ _attempt_200810311418_0002_m_23_0$ at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166) at org .apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java: 1097) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun .reflect .DelegatingMethodAccessorImpl .invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) In the datanode logs, we see a ton of errors like: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$ of concurrent xcievers 256 at org.apache.hadoop.dfs.DataNode $DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) Anyone have any ideas on what may be wrong? Thanks, Nathan Marz Rapleaf
LeaseExpiredException and too many xceiver
Hello, We are seeing some really bad errors on our hadoop cluster. After reformatting the whole cluster, the first job we run immediately fails with "Could not find block locations..." errrors. In the namenode logs, we see a ton of errors like: 2008-10-31 14:20:44,799 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 7276, call addBlock(/tmp/dustintmp/shredded_dataunits/_t$ org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/ dustintmp/shredded_dataunits/_temporary/ _attempt_200810311418_0002_m_23_0$ at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1166) at org .apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java: 1097) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun .reflect .DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) In the datanode logs, we see a ton of errors like: 2008-10-31 14:20:09,978 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-2129547091-10.100.11.1$ of concurrent xcievers 256 at org.apache.hadoop.dfs.DataNode $DataXceiver.run(DataNode.java:1030) at java.lang.Thread.run(Thread.java:619) Anyone have any ideas on what may be wrong? Thanks, Nathan Marz Rapleaf
RE: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1
Hi Devraj, It was pretty consistent with my comparator class in my old email(the one that uses UTF8). While trying to resolve the issue, I changed UTF8 to Text. That made it disappear for a while but then it came back again. My new Comparator class(with Text) is - public class IncrementalURLIndexKey implements WritableComparable { private Text url; private long userid; public IncrementalURLIndexKey() { } public IncrementalURLIndexKey(Text url, long userid) { this.url = url; this.userid = userid; } public Text getUrl() { return url; } public long getUserid() { return userid; } public void write(DataOutput out) throws IOException { url.write(out); out.writeLong(userid); } public void readFields(DataInput in) throws IOException { url = new Text(); url.readFields(in); userid = in.readLong(); } public int compareTo(Object o) { IncrementalURLIndexKey other = (IncrementalURLIndexKey) o; int result = url.compareTo(other.getUrl()); if (result == 0) result = CUID.compare(userid, other.userid); return result; } /** * A Comparator optimized for IncrementalURLIndexKey. */ public static class GroupingComparator extends WritableComparator { public GroupingComparator() { super(IncrementalURLIndexKey.class, true); } public int compare(WritableComparable a, WritableComparable b) { IncrementalURLIndexKey key1 = (IncrementalURLIndexKey) a; IncrementalURLIndexKey key2 = (IncrementalURLIndexKey) b; return key1.getUrl().compareTo(key2.getUrl()); } } static { WritableComparator.define(IncrementalURLIndexKey.class, new GroupingComparator()); } } Thanks, Deepika -Original Message- From: Devaraj Das [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 28, 2008 9:01 PM To: core-user@hadoop.apache.org Subject: Re: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1 Quick question (I haven't looked at your comparator code yet) - is this reproducible/consistent? On 10/28/08 11:52 PM, "Deepika Khera" <[EMAIL PROTECTED]> wrote: > I am getting a similar exception too with Hadoop 0.18.1(See stacktrace > below), though its an EOFException. Does anyone have any idea about what > it means and how it can be fixed? > > 2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: > attempt_200810241922_0844_r_06_0 Merge of the inmemory files threw > an exception: java.io.IOException: Intermedate merge failed > at > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doIn > MemMerge(ReduceTask.java:2147) > at > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run( > ReduceTask.java:2078) > Caused by: java.lang.RuntimeException: java.io.EOFException > at > org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java: > 103) > at > org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:269) > at > org.apache.hadoop.util.PriorityQueue.upHeap(PriorityQueue.java:122) > at > org.apache.hadoop.util.PriorityQueue.put(PriorityQueue.java:49) > at > org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:321) > at org.apache.hadoop.mapred.Merger.merge(Merger.java:72) > at > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doIn > MemMerge(ReduceTask.java:2123) > ... 
1 more > Caused by: java.io.EOFException > at > java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323) > at org.apache.hadoop.io.UTF8.readFields(UTF8.java:103) > at com.collarity.io.IOUtil.readUTF8(IOUtil.java:213) > at > com.collarity.url.IncrementalURLIndexKey.readFields(IncrementalURLIndexK > ey.java:40) > at > org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java: > 97) > ... 7 more > > 2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: > attempt_200810241922_0844_r_06_0 Merging of the local FS files threw > an exception: java.io.IOException: java.lang.RuntimeException: > java.io.EOFException > at > org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java: > 103) > at > org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:269) > at > org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:135) > at > org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:102) > at > org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.ja > va:226) > at > org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242) > at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83) > at > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(Reduc > eTask.java:2035) > Caused by: java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:180) > at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) > at com.collarity.io.IOUtil.readUTF8(IOUtil.java:213) > at > com.collarity.url.IncrementalURLIndexKey.readFields(IncrementalURLIndexK > ey.java:40) > at > org.apache.hadoop.io.Writable
Re: To Compute or Not to Compute on Prod
Hi, We have deployed a new monitoring system Chukwa ( http://wiki.apache.org/hadoop/Chukwa) that is doing exactly that. Also this system provide an easy way to post-process you log file and extract useful information using M/R. /Jerome. On 10/31/08 1:46 PM, "Norbert Burger" <[EMAIL PROTECTED]> wrote: > What are you using to "stream logs into the HDFS"? > > If the command-line tools (ie., "hadoop dfs put") work for you, then all you > need is a Hadoop install. Your production node doesn't need to be a > datanode. > > On Fri, Oct 31, 2008 at 2:35 PM, shahab mehmandoust <[EMAIL PROTECTED]>wrote: > >> I want to stream data from logs into the HDFS in production but I do NOT >> want my production machine to be apart of the computation cluster. The >> reason I want to do it in this way is to take advantage of HDFS without >> putting computation load on my production machine. Is this possible*?* >> Furthermore, is this unnecessary because the computation would not put a >> significant load on my production box (obviously depends on the map/reduce >> implementation but I'm asking in general)*?* >> >> I should note that our prod machine hosts our core web application and >> database (saving up for another box :-). >> >> Thanks, >> Shahab >>
Re: To Compute or Not to Compute on Prod
What are you using to "stream logs into the HDFS"? If the command-line tools (ie., "hadoop dfs put") work for you, then all you need is a Hadoop install. Your production node doesn't need to be a datanode. On Fri, Oct 31, 2008 at 2:35 PM, shahab mehmandoust <[EMAIL PROTECTED]>wrote: > I want to stream data from logs into the HDFS in production but I do NOT > want my production machine to be apart of the computation cluster. The > reason I want to do it in this way is to take advantage of HDFS without > putting computation load on my production machine. Is this possible*?* > Furthermore, is this unnecessary because the computation would not put a > significant load on my production box (obviously depends on the map/reduce > implementation but I'm asking in general)*?* > > I should note that our prod machine hosts our core web application and > database (saving up for another box :-). > > Thanks, > Shahab >
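Concretely, a client-only box just needs the Hadoop distribution plus a config pointing at the cluster, along the lines of (hostname, port, and paths hypothetical):

    # hadoop-site.xml on the production box sets fs.default.name to the cluster's
    # namenode, e.g. hdfs://namenode-host:9000; the box is neither a datanode nor a tasktracker.
    bin/hadoop dfs -put /var/log/myapp/access.log /logs/access.log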
Redirecting Libhdfs output
Hey all, libhdfs prints out useful information to stderr in the function errnoFromException; unfortunately, in the C application framework I use, the stderr is directed to /dev/null, making debugging miserably hard. Does anyone have any suggestions to make the errnoFromException function write out the error to a different stream? Brian
Re: To Compute or Not to Compute on Prod
Definitely speaking java Do you think I'm being paranoid about the possible load? Shahab On Fri, Oct 31, 2008 at 11:52 AM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > Shahab, > > This can be done. > If you client speaks java you can connect to hadoop and write as a stream. > > If you client does not have java. The thrift api will generate stubs > in a variety of languages > > Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs > > Shameless plug -- If you just want to stream data I created a simple > socket server- > http://www.jointhegrid.com/jtgweb/lhadoopserver/index.jsp > > So you do not have to be part of the cluster to write to it. >
Re: To Compute or Not to Compute on Prod
Shahab, This can be done. If you client speaks java you can connect to hadoop and write as a stream. If you client does not have java. The thrift api will generate stubs in a variety of languages Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs Shameless plug -- If you just want to stream data I created a simple socket server- http://www.jointhegrid.com/jtgweb/lhadoopserver/index.jsp So you do not have to be part of the cluster to write to it.
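A minimal sketch of the "write as a stream" option from a Java client (namenode address and paths are hypothetical; the client only needs the Hadoop jars plus network access to the namenode and datanodes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:9000");   // hypothetical namenode address
    FileSystem fs = FileSystem.get(conf);

    // Stream log data straight into an HDFS file from the production machine.
    FSDataOutputStream out = fs.create(new Path("/logs/access.log.2008-10-31"));
    out.write("an example log line\n".getBytes());
    out.close();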
To Compute or Not to Compute on Prod
I want to stream data from logs into HDFS in production, but I do NOT want my production machine to be a part of the computation cluster. The reason I want to do it this way is to take advantage of HDFS without putting computation load on my production machine. Is this possible? Furthermore, is this unnecessary because the computation would not put a significant load on my production box (obviously this depends on the map/reduce implementation, but I'm asking in general)?

I should note that our prod machine hosts our core web application and database (saving up for another box :-).

Thanks,
Shahab
Re: SecondaryNameNode on separate machine
True, dfs.http.address is the NN Web UI address. This is where the NN http server runs. Besides the Web UI there is also a servlet running on that server which is used to transfer image and edits from the NN to the secondary using http get. So the SNN uses both addresses, fs.default.name and dfs.http.address. When the SNN finishes the checkpoint, the primary needs to transfer the resulting image back. This is done via the http server running on the SNN.

Answering Tomislav's question: the difference between fs.default.name and dfs.http.address is that fs.default.name is the name-node's RPC address, where clients and data-nodes connect, while dfs.http.address is the NN's http server address, where our browsers connect, but it is also used for transferring image and edits files.

--Konstantin

Otis Gospodnetic wrote:

Konstantin & Co, please correct me if I'm wrong, but looking at hadoop-default.xml makes me think that dfs.http.address is only the URL for the NN *Web UI*. In other words, this is where people go to look at the NN. The secondary NN must then be using only the Primary NN URL specified in fs.default.name. This URL looks like hdfs://name-node-hostname-here/. Something in Hadoop then knows the exact port for the Primary NN based on the URI schema (e.g. "hdfs://") in this URL. Is this correct?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Tomislav Poljak <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, October 30, 2008 1:52:18 PM
Subject: Re: SecondaryNameNode on separate machine

Hi,

can you, please, explain the difference between fs.default.name and dfs.http.address (like how and when the SecondaryNameNode uses fs.default.name and how/when it uses dfs.http.address)? I have set them both to the same (namenode's) hostname:port. Is this correct (or does dfs.http.address need some other port)?

Thanks,
Tomislav

On Wed, 2008-10-29 at 16:10 -0700, Konstantin Shvachko wrote:

SecondaryNameNode uses the http protocol to transfer the image and the edits from the primary name-node and vice versa. So the secondary does not access local files on the primary directly. The primary NN should know the secondary's http address. And the secondary NN needs to know both fs.default.name and dfs.http.address of the primary. In general we usually create one configuration file hadoop-site.xml and copy it to all other machines, so you don't need to set up different values for all servers.

Regards,
--Konstantin

Tomislav Poljak wrote:

Hi,

I'm not clear on how the SecondaryNameNode communicates with the NameNode (if deployed on a separate machine). Does the SecondaryNameNode use a direct connection (over some port and protocol), or is it enough for the SecondaryNameNode to have access to the data which the NameNode writes locally on disk?

Tomislav

On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote:

I think a lot of the confusion comes from this thread: http://www.nabble.com/NameNode-failover-procedure-td11711842.html
Particularly because the wiki was updated with wrong information, not maliciously I'm sure. This information is now gone for good.

Otis, your solution is pretty much like the one given by Dhruba Borthakur and augmented by Konstantin Shvachko later in the thread, but I never did it myself. One thing should be clear though: the NN is and will remain a SPOF (just like HBase's Master) as long as a distributed manager service (like Zookeeper) is not plugged into Hadoop to help with failover.
J-D On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic < [EMAIL PROTECTED]> wrote: Hi, So what is the "recipe" for avoiding NN SPOF using only what comes with Hadoop? From what I can tell, I think one has to do the following two things: 1) configure primary NN to save namespace and xa logs to multiple dirs, one of which is actually on a remotely mounted disk, so that the data actually lives on a separate disk on a separate box. This saves namespace and xa logs on multiple boxes in case of primary NN hardware failure. 2) configure secondary NN to periodically merge fsimage+edits and create the fsimage checkpoint. This really is a second NN process running on another box. It sounds like this secondary NN has to somehow have access to fsimage & edits files from the primary NN server. http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNodedoes not describe the best practise around that - the recommended way to give secondary NN access to primary NN's fsimage and edits files. Should one mount a disk from the primary NN box to the secondary NN box to get access to those files? Or is there a simpler way? In any case, this checkpoint is just a merge of fsimage+edits files and again is there in case the box with the primary NN dies. That's what's described on http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNodemore or less. Is this sufficient, or are there other things one has to do to eliminate N
Re: ApacheCon US 2008
I will also be presenting on Mahout (machine learning) on Wednesday at 3:30 (I think). It will have some Hadoop flavor in it. -Grant On Oct 31, 2008, at 1:46 PM, Owen O'Malley wrote: Just a reminder that ApacheCon US is next week in New Orleans. There will be a lot of Hadoop developers and talks. (I'm CC'ing core-user because it has the widest coverage. Please join the low traffic [EMAIL PROTECTED] list for cross sub-project announcements.) * Hadoop Camp with lots of talks about Hadoop o Introduction to Hadoop by Owen O'Malley o A Tour of Apache Hadoop by Tom White o Programming Hadoop Map/Reduce by Arun Murthy o Hadoop at Yahoo! by Eric Baldeschwieler o Hadoop Futures Panel o Using Hadoop for an Intranet Seach Engine by Shivakumar Vaithyanthan o Cloud Computing Testbed by Thomas Sandholm o Improving Virtualization and Performance Tracing of Hadoop with Open Solaris by George Porter o An Insight into Hadoop Usage at Facebook by Dhruba Borthakur o Pig by Alan Gates o Zookeeper, Coordinating the Distributed Application by Ben Reed o Querying JSON Data on Hadoop using Jaql by Kevin Beyer o HBase by Michael Stack * Hadoop training on Practical Problem Solving in Hadoop * Cloudera is providing a test Hadoop cluster and a Hadoop hacking contest. There is also a new Hadoop tutorial available. -- Owen
Re: ApacheCon US 2008
Hi, Hope somebody will record at least fraction of these talks and put them on the web as soon as possible.Lukas On Fri, Oct 31, 2008 at 6:46 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote: > Just a reminder that ApacheCon US is next week in New Orleans. There will > be a lot of Hadoop developers and talks. (I'm CC'ing core-user because it > has the widest coverage. Please join the low traffic [EMAIL PROTECTED] list > for cross sub-project announcements.) > >* Hadoop Camp with lots of talks about Hadoop > o Introduction to Hadoop by Owen O'Malley > o A Tour of Apache Hadoop by Tom White > o Programming Hadoop Map/Reduce by Arun Murthy > o Hadoop at Yahoo! by Eric Baldeschwieler > o Hadoop Futures Panel > o Using Hadoop for an Intranet Seach Engine by Shivakumar > Vaithyanthan > o Cloud Computing Testbed by Thomas Sandholm > o Improving Virtualization and Performance Tracing of Hadoop with > Open Solaris > by George Porter > o An Insight into Hadoop Usage at Facebook by Dhruba Borthakur > o Pig by Alan Gates > o Zookeeper, Coordinating the Distributed Application by Ben Reed > o Querying JSON Data on Hadoop using Jaql by Kevin Beyer > o HBase by Michael Stack >* Hadoop training on Practical Problem Solving in Hadoop >* Cloudera is providing a test Hadoop cluster and a Hadoop hacking > contest. > > There is also a new Hadoop tutorial available. > > -- Owen -- http://blog.lukas-vlcek.com/
RE: ApacheCon US 2008
Owen, Just wanted to mention that there is a talk on Hive as well on Friday 9:30AM... Ashish -Original Message- From: Owen O'Malley [mailto:[EMAIL PROTECTED] Sent: Friday, October 31, 2008 10:47 AM To: [EMAIL PROTECTED] Cc: core-user@hadoop.apache.org Subject: ApacheCon US 2008 Just a reminder that ApacheCon US is next week in New Orleans. There will be a lot of Hadoop developers and talks. (I'm CC'ing core-user because it has the widest coverage. Please join the low traffic [EMAIL PROTECTED] list for cross sub-project announcements.) * Hadoop Camp with lots of talks about Hadoop o Introduction to Hadoop by Owen O'Malley o A Tour of Apache Hadoop by Tom White o Programming Hadoop Map/Reduce by Arun Murthy o Hadoop at Yahoo! by Eric Baldeschwieler o Hadoop Futures Panel o Using Hadoop for an Intranet Seach Engine by Shivakumar Vaithyanthan o Cloud Computing Testbed by Thomas Sandholm o Improving Virtualization and Performance Tracing of Hadoop with Open Solaris by George Porter o An Insight into Hadoop Usage at Facebook by Dhruba Borthakur o Pig by Alan Gates o Zookeeper, Coordinating the Distributed Application by Ben Reed o Querying JSON Data on Hadoop using Jaql by Kevin Beyer o HBase by Michael Stack * Hadoop training on Practical Problem Solving in Hadoop * Cloudera is providing a test Hadoop cluster and a Hadoop hacking contest. There is also a new Hadoop tutorial available. -- Owen
ApacheCon US 2008
Just a reminder that ApacheCon US is next week in New Orleans. There will be a lot of Hadoop developers and talks. (I'm CC'ing core-user because it has the widest coverage. Please join the low traffic [EMAIL PROTECTED] list for cross sub-project announcements.) * Hadoop Camp with lots of talks about Hadoop o Introduction to Hadoop by Owen O'Malley o A Tour of Apache Hadoop by Tom White o Programming Hadoop Map/Reduce by Arun Murthy o Hadoop at Yahoo! by Eric Baldeschwieler o Hadoop Futures Panel o Using Hadoop for an Intranet Seach Engine by Shivakumar Vaithyanthan o Cloud Computing Testbed by Thomas Sandholm o Improving Virtualization and Performance Tracing of Hadoop with Open Solaris by George Porter o An Insight into Hadoop Usage at Facebook by Dhruba Borthakur o Pig by Alan Gates o Zookeeper, Coordinating the Distributed Application by Ben Reed o Querying JSON Data on Hadoop using Jaql by Kevin Beyer o HBase by Michael Stack * Hadoop training on Practical Problem Solving in Hadoop * Cloudera is providing a test Hadoop cluster and a Hadoop hacking contest. There is also a new Hadoop tutorial available. -- Owen
Re: SecondaryNameNode on separate machine
Otis Gospodnetic wrote: Konstantin & Co, please correct me if I'm wrong, but looking at hadoop-default.xml makes me think that dfs.http.address is only the URL for the NN *Web UI*. In other words, this is where we people go look at the NN. The secondary NN must then be using only the Primary NN URL specified in fs.default.name. This URL looks like hdfs://name-node-hostname-here/. Something in Hadoop then knows the exact port for the Primary NN based on the URI schema (e.g. "hdfs://") in this URL. Is this correct? Yes. The default port for an HDFS URI is 8020 (NameNode.DEFAULT_PORT). The value of fs.default.name is used by HDFS. When starting the namenode or datanodes, this must be an HDFS URI. If this names an explicit port, then that will be used, otherwise the default, 8020 will be used. The default port for HTTP URIs is 80, but the namenode typically runs its web UI on 50070 (the default for dfs.http.address). Doug
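Putting the two addresses side by side, a sketch of the relevant hadoop-site.xml entries (hostnames are placeholders; 8020 and 50070 are just the defaults Doug mentions, and 50090 is the default secondary HTTP port):

    <!-- RPC address: clients, datanodes and the secondary NN connect here -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:8020</value>
    </property>
    <!-- HTTP address: web UI, plus the servlet the secondary uses to fetch image/edits -->
    <property>
      <name>dfs.http.address</name>
      <value>namenode-host:50070</value>
    </property>
    <!-- on the secondary: where its own HTTP server (used for the transfer back) listens -->
    <property>
      <name>dfs.secondary.http.address</name>
      <value>secondary-host:50090</value>
    </property>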
Status FUSE-Support of HDFS
Hi,

could anyone tell me what the current status of FUSE support for HDFS is? Is this something that can be expected to be usable in a few weeks/months in a production environment? We have been really happy/successful with HDFS in our production system. However, some software we use in our application simply requires an OS-level file system, which currently requires us to do a lot of copying between HDFS and a regular file system for processes which require that software; FUSE support would really eliminate that one disadvantage we have with HDFS. We wouldn't even require the performance to be outstanding, because just by eliminating the copy step we would greatly increase the throughput of those processes.

Thanks for sharing any thoughts on this.

Regards,
Robert
Re: [core-user] Help deflating output files
You can override this property by passing -jobconf mapred.output.compress=false to the hadoop binary, e.g.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.0-streaming.jar \
  -input "/user/root/input" \
  -mapper 'cat' \
  -reducer 'wc -l' \
  -output "/user/root/output" \
  -jobconf mapred.job.name="Experiment" \
  -jobconf mapred.output.compress=false

-- Martin

Jim R. Wilson wrote:
>
> Hi all,
>
> I'm using hadoop-streaming to execute Python jobs in an EC2 cluster.
> The output directory in HDFS has part-0.deflate files - how can I
> deflate them back into regular text?
>
> In my hadoop-site.xml, I unfortunately have:
>   <property>
>     <name>mapred.output.compress</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.type</name>
>     <value>BLOCK</value>
>   </property>
>
> Of course, I could re-build my AMIs without this option, but is there
> some way I can read my deflate files without going through that
> hassle? I'm hoping there's a command-line program to read these files,
> since none of my code is Java.
>
> Thanks in advance for any help. :)
>
> -- Jim R. Wilson (jimbojw)
>
--
View this message in context: http://www.nabble.com/-core-user--Help-deflating-output-files-tp17658751p20268639.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
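If rebuilding the AMIs isn't an option, one way to read a .deflate output file programmatically is through Hadoop's DefaultCodec, the codec behind that extension (a sketch; the file name is illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.DefaultCodec;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    DefaultCodec codec = new DefaultCodec();
    codec.setConf(conf);

    // Wrap the compressed HDFS stream in the codec's decompressing stream.
    Path p = new Path("/user/root/output/part-00000.deflate");  // illustrative file name
    BufferedReader in = new BufferedReader(
        new InputStreamReader(codec.createInputStream(fs.open(p))));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();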
RE: Browse HDFS file in URL
The Hadoop file API allows you to open a file based on a URL:

Path file = new Path("hdfs://hadoop00:54313/user/hadoop/conflated.20081016/part-9");
JobConf job = new JobConf(new Configuration(), ReadFileHadoop.class);
job.setJobName("test");
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(file);

-Original Message-
From: Neal Lee (RDLV) [mailto:[EMAIL PROTECTED]
Sent: Friday, October 31, 2008 4:00
To: core-user@hadoop.apache.org
Cc: Neal Lee (RDLV)
Subject: Browse HDFS file in URL

Hi All,

I'm wondering that can I browse a HDFS file in URL (ex. http://host/test.jpeg) so that I can show this file on my webapp directly.

Thanks,
Neal

This correspondence is from Cyberlink Corp. and is intended only for use by the recipient named herein, and may contain privileged, proprietary and/or confidential information, and is intended only to be seen and used by named addressee(s). You are notified that any discussion, dissemination, distribution or copying of this correspondence and any attachments, is strictly prohibited, unless otherwise authorized or consented to in writing by the sender. If you have received this correspondence in error, please notify the sender immediately, and please permanently delete the original and any copies of it and any attachment and destroy any related printouts without reading or copying them.
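To get from that FSDataInputStream to the browser, the webapp only has to copy the bytes into its response stream; a minimal sketch (the servlet response and content type are illustrative):

    // Inside e.g. a servlet's doGet(), after opening fileIn as above:
    response.setContentType("image/jpeg");               // illustrative content type
    java.io.OutputStream out = response.getOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = fileIn.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    fileIn.close();
    out.flush();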
Re: hostname in logs
Alex Loddengaard wrote:

Thanks, Steve. I'll look into this patch. As a temporary solution I use a log4j variable to manually set a "hostname" private field in the Appender. This solution is rather annoying, but it'll work for now. Thanks again.

What about having the task tracker pass down some JVM properties of interest, like hostname/processname? I've done things in the past (testing) that stored stuff by hostname, which works with 1 process per host. Once you start running lots of processes, you want more detail.
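A minimal sketch of that idea (not Hadoop code; the property name is made up): let the launcher pass something like -Dlogging.hostname=host42, resolve it once with a fallback to the IP address as suggested above, and have the Appender read the cached value.

    public final class HostnameHolder {
      private static final String HOST = resolve();

      private static String resolve() {
        // Prefer a value handed down by the launching process.
        String fromProp = System.getProperty("logging.hostname");
        if (fromProp != null && fromProp.length() > 0) {
          return fromProp;
        }
        try {
          // Fall back to the IP address to avoid a slow reverse-DNS lookup.
          return java.net.InetAddress.getLocalHost().getHostAddress();
        } catch (java.net.UnknownHostException e) {
          return "unknown-host";
        }
      }

      /** Resolved once per JVM; cheap to call from a log4j Appender or Layout. */
      public static String get() {
        return HOST;
      }
    }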
Browse HDFS file in URL
Hi All, I'm wondering that can I browse a HDFS file in URL (ex. http://host/test.jpeg) so that I can show this file on my webapp directly. Thanks, Neal This correspondence is from Cyberlink Corp. and is intended only for use by the recipient named herein, and may contain privileged, proprietary and/or confidential information, and is intended only to be seen and used by named addressee(s). You are notified that any discussion, dissemination, distribution or copying of this correspondence and any attachments, is strictly prohibited, unless otherwise authorized or consented to in writing by the sender. If you have received this correspondence in error, please notify the sender immediately, and please permanently delete the original and any copies of it and any attachment and destroy any related printouts without reading or copying them.
Re: hostname in logs
Thanks, Steve. I'll look in to this patch. As a temporary solution I use a log4j variable to manually set a "hostname" private field in the Appender. This solution is rather annoying, but it'll work fro now. Thanks again. Alex On Fri, Oct 31, 2008 at 3:58 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: > Alex Loddengaard wrote: > >> I'd like my log messages to display the hostname of the node that they >> were >> outputted on. Sure, this information can be grabbed from the log >> filename, >> but I would like each log message to also have the hostname. I don't >> think >> log4j provides support to include the hostname in a log, so I've tried >> programmatically inserting the hostname with the following three >> approaches. >> These are all within a log4j Appender. >> -Using exec to run "hostname" from the command line. This returns null. >> -Using InetAddress.getLocalHost().getHostName(). This returns null. >> -Using InetAddress.getLocalHost().getHostAddress(). This returns null. >> > > You sure your real/virtual hosts networking is set up right? I've seen > problems in hadoop there > https://issues.apache.org/jira/browse/HADOOP-3426 > Have a look/apply that patch and see what happens > > Each of these approaches works in an isolated test, but they all return >> null >> when in Hadoop's context. I believe I'd be able to get the hostname with >> a >> Java call to a Hadoop configuration method if I were in a Mapper or >> Reducer, >> but because I'm in a log4j Appender, I don't have access to any of >> Hadoop's >> configuration APIs. How can I get the hostname? >> > > Log4J appenders should have access to the hostname info, But you are going > to risk time and trouble if you do that in every operation; every new > process runs a risk of a 30s delay even if you cache it from then on. It is > usually a lot faster/easier just to push out the IP address, as that > doesn't trigger a reverse DNS lookup or anything. >
Re: hostname in logs
Alex Loddengaard wrote: I'd like my log messages to display the hostname of the node that they were outputted on. Sure, this information can be grabbed from the log filename, but I would like each log message to also have the hostname. I don't think log4j provides support to include the hostname in a log, so I've tried programmatically inserting the hostname with the following three approaches. These are all within a log4j Appender. -Using exec to run "hostname" from the command line. This returns null. -Using InetAddress.getLocalHost().getHostName(). This returns null. -Using InetAddress.getLocalHost().getHostAddress(). This returns null. You sure your real/virtual hosts networking is set up right? I've seen problems in hadoop there https://issues.apache.org/jira/browse/HADOOP-3426 Have a look/apply that patch and see what happens Each of these approaches works in an isolated test, but they all return null when in Hadoop's context. I believe I'd be able to get the hostname with a Java call to a Hadoop configuration method if I were in a Mapper or Reducer, but because I'm in a log4j Appender, I don't have access to any of Hadoop's configuration APIs. How can I get the hostname? Log4J appenders should have access to the hostname info, But you are going to risk time and trouble if you do that in every operation; every new process runs a risk of a 30s delay even if you cache it from then on. It is usually a lot faster/easier just to push out the IP address, as that doesn't trigger a reverse DNS lookup or anything.
Re: TaskTrackers disengaging from JobTracker
To complete the picture: not only was our network swamped, I realized tonight that the NameNode/JobTracker was running on a 99% full disk (it hit 100% full about thirty minutes ago). That poor JobTracker was fighting against a lot of odds. As soon as we upgrade to a bigger disk and switch it back on, I'll apply the supplied patch to the cluster. Thank you for looking into this! - Aaron On Thu, Oct 30, 2008 at 3:42 PM, Raghu Angadi <[EMAIL PROTECTED]> wrote: > Raghu Angadi wrote: > >> Devaraj fwded the stacks that Aaron sent. As he suspected there is a >> deadlock in RPC server. I will file a blocker for 0.18 and above. This >> deadlock is more likely on a busy network. >> >> > Aaron, > > Could you try the patch attached to > https://issues.apache.org/jira/browse/HADOOP-4552 ? > > Thanks, > Raghu. >