RE: Hadoop Write Performance
Raghu, I was using 0.17.2.1, but I installed 0.18.3 a couple of days ago. I also moved my secondary namenode and jobtracker to another machine. In addition, my network operations people had misconfigured some switches, which turned out to be my bottleneck. After all of that, my writer and Hadoop are working great.

-Xavier

-----Original Message-----
From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
Sent: Wednesday, February 18, 2009 11:49 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop Write Performance

What is the Hadoop version? You could check the log on a datanode around that time and post any suspicious errors. For example, you can trace a particular block through the client and datanode logs. Most likely it is not a NameNode issue, but you can check the NameNode log as well.

Raghu.

Xavier Stevens wrote:

Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this:

java.io.IOException: Could not get block locations. Aborting...
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)

Which is perhaps indicating that the namenode is overwhelmed?

Thanks,
-Xavier
Hadoop Performance
I'm uploading a bunch of data to a new cluster, but I'm getting terrible performance. I'm using Hadoop 0.18.3 and my write speed never breaks 10 MB/s right now. Does anyone know what might be wrong? Or have suggestions or alternatives to dfs -put?

Thanks,
-Xavier
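For reference, a programmatic equivalent of dfs -put, as a minimal sketch assuming the 0.18-era FileSystem API; the class name and argument handling are illustrative, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS, equivalent to: bin/hadoop dfs -put <src> <dst>
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        fs.close();
    }
}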
RE: Hadoop Performance
Disregard. I discovered that this was a networking performance issue unrelated to Hadoop.

-Xavier

-----Original Message-----
From: Xavier Stevens
Sent: Tuesday, February 17, 2009 3:59 PM
To: core-user@hadoop.apache.org
Subject: Hadoop Performance

I'm uploading a bunch of data to a new cluster, but I'm getting terrible performance. I'm using Hadoop 0.18.3 and my write speed never breaks 10 MB/s right now. Does anyone know what might be wrong? Or have suggestions or alternatives to dfs -put?

Thanks,
-Xavier
Hadoop Write Performance
Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this:

java.io.IOException: Could not get block locations. Aborting...
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)

Which is perhaps indicating that the namenode is overwhelmed?

Thanks,
-Xavier
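For concreteness, a minimal sketch of the kind of writer described, with one HDFS output stream per thread, assuming the 0.17/0.18-era FileSystem API; paths, thread count, and payload sizes are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiThreadedHdfsWriter {
    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final byte[] buf = new byte[64 * 1024];  // payload per write, illustrative
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            // Each thread writes to its own file, as in the setup above.
            final Path file = new Path("/user/xavier/out/part-" + i);
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        FSDataOutputStream out = fs.create(file);
                        for (int n = 0; n < 1000; n++) {
                            out.write(buf);
                        }
                        out.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}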
Reading Protocol Buffers as BytesWritable value
I am currently trying to read back values where I have previously output Text,BytesWritable pairs. The key is Hadoop's Text writable, and the value is a Protocol Buffer byte array wrapped in a BytesWritable. Here is a snippet showing the output configuration:

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
SequenceFileAsBinaryOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileAsBinaryOutputFormat.setSequenceFileOutputKeyClass(job, Text.class);
SequenceFileAsBinaryOutputFormat.setSequenceFileOutputValueClass(job, BytesWritable.class);
FileOutputFormat.setOutputPath(job, outDataPath);

In another job I am trying to read this back in:

job.setInputFormat(org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat.class);

public static class Map extends MapReduceBase implements Mapper<Text,BytesWritable,Text,LongWritable> {
    ...
}

I get an error like this:

java.io.IOException: hdfs://localhost:4000/user/myuser/step1-out/part-3.deflate not a SequenceFile
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1458)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
    at org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat$SequenceFileAsBinaryRecordReader.<init>(SequenceFileAsBinaryInputFormat.java:67)
    at org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat.getRecordReader(SequenceFileAsBinaryInputFormat.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Am I doing something wrong here? Or is there just some inherent problem with what I am trying to do?

-Xavier
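Two observations, neither confirmed in the thread. First, the .deflate extension on part-3 is what compressed TextOutputFormat output looks like, which may mean the writing job never actually applied SequenceFileAsBinaryOutputFormat (for example, if job.setOutputFormat was never called); that is an inference only. Second, if the part files really are SequenceFiles of Text keys and BytesWritable values, the plain SequenceFileInputFormat hands back the stored key/value types directly, whereas SequenceFileAsBinaryInputFormat presents both key and value as raw BytesWritable. A sketch of the read side under those assumptions; MyMessage stands in for a hypothetical generated protocol buffer class:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Reader job setup would use the typed input format:
// job.setInputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat.class);

public class ReadProtoMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, LongWritable> {
    public void map(Text key, BytesWritable value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        // Only the first getLength() bytes of the backing array hold the
        // serialized message; the rest is padding from BytesWritable.
        byte[] raw = new byte[value.getLength()];
        System.arraycopy(value.getBytes(), 0, raw, 0, value.getLength());
        MyMessage msg = MyMessage.parseFrom(raw);  // hypothetical protobuf class
        output.collect(key, new LongWritable(msg.getSerializedSize()));
    }
}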
RE: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory
I'm still seeing this problem on a cluster using Hadoop 0.18.2. I tried dropping the max number of map tasks per node from 8 to 7. I still get the error, although it's less frequent. But I don't get the error at all when using Hadoop 0.17.2. Does anyone have any suggestions?

-Xavier

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Thursday, October 09, 2008 2:07 AM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory

Thanks Alexander!!

On Thu, Oct 9, 2008 at 4:49 PM, Alexander Aristov [EMAIL PROTECTED] wrote:

I received such errors when I overloaded data nodes. You may increase swap space or run fewer tasks.

Alexander

2008/10/9 Edward J. Yoon [EMAIL PROTECTED]

Hi, I received the message below. Can anyone explain it?

08/10/09 11:53:33 INFO mapred.JobClient: Task Id : task_200810081842_0004_m_00_0, Status : FAILED
java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 10 more

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org

--
Best Regards
Alexander Aristov

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
RE: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory
1) It doesn't look like I'm out of memory, but it is coming really close.
2) overcommit_memory is set to 2, overcommit_ratio = 100

As for the JVM, I am using Java 1.6.

**Note of Interest**: The virtual memory I see allocated in top for each task is more than what I am specifying in the hadoop job/site configs. Currently each physical box has 16 GB of memory. When idle, I see the datanode and tasktracker using:

             RES    VIRT
Datanode     145m   1408m
Tasktracker  206m   1439m

So taking that into account, I do 16000 MB - (1408 + 1439) MB, which leaves me with about 13150 MB. In my old settings I was using 8 map tasks, so 13150 / 8 is roughly 1644 MB per task. My mapred.child.java.opts is -Xmx1536m, which should leave me a little head room. When running, though, I see some tasks reporting 1900m.

-Xavier

-----Original Message-----
From: Brian Bockelman [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 2:42 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory

Hey Xavier,

1) Are you out of memory (dumb question, but doesn't hurt to ask...)? What does Ganglia tell you about the node?

2) Do you have /proc/sys/vm/overcommit_memory set to 2? Telling Linux not to overcommit memory on Java 1.5 JVMs can be very problematic. Java 1.5 asks for min heap size + 1 GB of reserved, non-swap memory on Linux systems by default. The 1 GB of reserved, non-swap memory is used for the JIT to compile code; this bug wasn't fixed until later Java 1.5 updates.

Brian

On Nov 18, 2008, at 4:32 PM, Xavier Stevens wrote:

I'm still seeing this problem on a cluster using Hadoop 0.18.2. I tried dropping the max number of map tasks per node from 8 to 7. I still get the error, although it's less frequent. But I don't get the error at all when using Hadoop 0.17.2. Does anyone have any suggestions?

-Xavier

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Thursday, October 09, 2008 2:07 AM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory

Thanks Alexander!!

On Thu, Oct 9, 2008 at 4:49 PM, Alexander Aristov [EMAIL PROTECTED] wrote:

I received such errors when I overloaded data nodes. You may increase swap space or run fewer tasks.

Alexander

2008/10/9 Edward J. Yoon [EMAIL PROTECTED]

Hi, I received the message below. Can anyone explain it?

08/10/09 11:53:33 INFO mapred.JobClient: Task Id : task_200810081842_0004_m_00_0, Status : FAILED
java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 10 more

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org

--
Best Regards
Alexander Aristov

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
Un-Blacklist Node
Is there a way to un-blacklist a node without restarting Hadoop?

Thanks,
-Xavier
RE: Un-Blacklist Node
I don't think this is the per-job blacklist. The datanode is still running, but the tasktracker isn't on the slave machine. Can I just run start-mapred on that machine?

-Xavier

-----Original Message-----
From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 14, 2008 1:43 PM
To: core-user@hadoop.apache.org
Subject: Re: Un-Blacklist Node

Xavier,

On Aug 14, 2008, at 12:18 PM, Xavier Stevens wrote:

Is there a way to un-blacklist a node without restarting Hadoop?

Which blacklist are you talking about? Per-job blacklist of TaskTrackers? Hadoop Daemons?

Arun
RE: java.io.IOException: Cannot allocate memory
Actually I found the problem was that our operations people had enabled overcommit on memory and restricted it to 50%...lol. Telling them to make it 100% fixed the problem.

-Xavier

-----Original Message-----
From: Taeho Kang [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 31, 2008 6:16 PM
To: core-user@hadoop.apache.org
Subject: Re: java.io.IOException: Cannot allocate memory

Are you using HadoopStreaming? If so, then a subprocess created by a HadoopStreaming job can take as much memory as it needs. In that case, the system will run out of memory and other processes (e.g. TaskTracker) may not be able to run properly or may even be killed by the OS.

/Taeho

On Fri, Aug 1, 2008 at 2:24 AM, Xavier Stevens [EMAIL PROTECTED] wrote:

We're currently running jobs on machines with around 16GB of memory with 8 map tasks per machine. We used to run with max heap set to 2048m. Since we started using version 0.17.1 we've been getting a lot of these errors:

task_200807251330_0042_m_000146_0: Caused by: java.io.IOException: java.io.IOException: Cannot allocate memory
task_200807251330_0042_m_000146_0:     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
task_200807251330_0042_m_000146_0:     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
task_200807251330_0042_m_000146_0:     at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.util.Shell.run(Shell.java:134)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

We haven't changed our heap sizes at all. Has anyone else experienced this? Is there a way around it other than reducing heap sizes excessively low? I've tried all the way down to 1024m max heap and I still get this error.

-Xavier
java.io.IOException: Cannot allocate memory
We're currently running jobs on machines with around 16GB of memory with 8 map tasks per machine. We used to run with max heap set to 2048m. Since we started using version 0.17.1 we've been getting a lot of these errors:

task_200807251330_0042_m_000146_0: Caused by: java.io.IOException: java.io.IOException: Cannot allocate memory
task_200807251330_0042_m_000146_0:     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
task_200807251330_0042_m_000146_0:     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
task_200807251330_0042_m_000146_0:     at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.util.Shell.run(Shell.java:134)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
task_200807251330_0042_m_000146_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

We haven't changed our heap sizes at all. Has anyone else experienced this? Is there a way around it other than reducing heap sizes excessively low? I've tried all the way down to 1024m max heap and I still get this error.

-Xavier
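For reference, the per-task max heap discussed in this thread is controlled by the mapred.child.java.opts property; a minimal sketch, assuming the old mapred API, using the lowest value tried above:

import org.apache.hadoop.mapred.JobConf;

public class TaskHeapConfig {
    // Returns a JobConf with each child task JVM's heap capped at 1024m;
    // the tasktracker passes this option to the child JVM's command line.
    public static JobConf withHeapCap() {
        JobConf conf = new JobConf();
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        return conf;
    }
}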
RE: client connect as different username?
This is how I've done it before:

1.) Create a hadoop user/group.
2.) Make the local filesystem dfs directories writable by the hadoop group and set the sticky bit.
3.) Run hadoop as the hadoop user.
4.) Then add all of your users to the hadoop group.

I also changed the dfs.permissions.supergroup property to hadoop in $HADOOP_HOME/conf/hadoop-site.xml. This works pretty well for us. Hope it helps.

Cheers,
-Xavier

-----Original Message-----
From: Chris Collins [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 11, 2008 5:18 PM
To: core-user@hadoop.apache.org
Subject: Re: client connect as different username?

The finer point to this is that in development you may be logged in as user x and have a shared HDFS instance that a number of people are using. In that mode it's not practical to sudo, as you have all your development tools set up for user x. HDFS is set up with a single user; what is the procedure to add users to that HDFS instance? It has to support it, surely? It's really not obvious; looking in the HDFS docs that come with the distro, nothing springs out, and the hadoop command-line tool doesn't have anything that vaguely looks like a way to create a user. Help is greatly appreciated. I am sure it's somewhere blindingly obvious. How are other people doing this, other than sudoing to one single user name?

Thanks

ChRiS

On Jun 11, 2008, at 5:11 PM, [EMAIL PROTECTED] wrote:

The best way is to use the sudo command to execute the hadoop client. Does it work for you?

Nicholas

----- Original Message -----
From: Bob Remeika [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, June 11, 2008 12:56:14 PM
Subject: client connect as different username?

Apologies if this is an RTM response, but I looked and wasn't able to find anything concrete. Is it possible to connect to HDFS via the HDFS client under a different username than I am currently logged in as? Here is our situation: I am user bobr on the client machine. I need to add something to the HDFS cluster as the user companyuser. Is this possible with the current set of APIs, or do I have to upload and chown?

Thanks,
Bob
RE: FileSystem.create
I don't even get one successful write. I could put a Thread.sleep after I call create to see if that works, but this seems like a hack. Further down the stack trace I start to get a "not replicated yet" exception after the retry logic kicks in. Why does the API return me a stream that I can't use?

-Xavier

-----Original Message-----
From: Runping Qi
Sent: Wednesday, May 14, 2008 9:02 PM
To: core-user@hadoop.apache.org
Subject: RE: FileSystem.create

My experience is to call Thread.sleep(100) after every N (say 1000) dfs writes.

-----Original Message-----
From: Xavier Stevens
Sent: Wednesday, May 14, 2008 10:47 AM
To: core-user@hadoop.apache.org
Subject: FileSystem.create

I'm having some problems creating a new file on HDFS. I am attempting to do this after my MapReduce job has finished, while trying to combine all part-00* files into a single file programmatically. It's throwing a LeaseExpiredException saying the file I just created doesn't exist. Any idea why this is happening or what I can do to fix it?

-Xavier

Here is the code snippet
===
FileSystem fileSys = FileSystem.get(job);
FSDataOutputStream fsdos = fileSys.create(new Path(outputIndexPath));
if (!fileSys.exists(new Path(outputIndexPath))) {
    System.err.println("File still does not exist: " + outputIndexPath);
} else {
    System.out.println("File exists: " + outputIndexPath);
}
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fsdos, "UTF-8"));

Output with stack trace
===
File exists: output/index.txt
08/05/14 03:20:13 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /user/xstevens/output/index.txt File does not exist. [Lease. Holder: 44 46 53 43 6c 69 65 6e 74 5f 2d 31 30 31 34 35 38 35 32 32 33, heldlocks: 0, pendingcreates: 1]
    at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1160)
    at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
    at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:312)
    at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
    at org.apache.hadoop.ipc.Client.call(Client.java:512)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
    at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2074)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1967)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1487)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1601)
FileSystem.create
I'm having some problems creating a new file on HDFS. I am attempting to do this after my MapReduce job has finished, while trying to combine all part-00* files into a single file programmatically. It's throwing a LeaseExpiredException saying the file I just created doesn't exist. Any idea why this is happening or what I can do to fix it?

-Xavier

Here is the code snippet
===
FileSystem fileSys = FileSystem.get(job);
FSDataOutputStream fsdos = fileSys.create(new Path(outputIndexPath));
if (!fileSys.exists(new Path(outputIndexPath))) {
    System.err.println("File still does not exist: " + outputIndexPath);
} else {
    System.out.println("File exists: " + outputIndexPath);
}
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fsdos, "UTF-8"));

Output with stack trace
===
File exists: output/index.txt
08/05/14 03:20:13 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /user/xstevens/output/index.txt File does not exist. [Lease. Holder: 44 46 53 43 6c 69 65 6e 74 5f 2d 31 30 31 34 35 38 35 32 32 33, heldlocks: 0, pendingcreates: 1]
    at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1160)
    at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1097)
    at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:312)
    at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
    at org.apache.hadoop.ipc.Client.call(Client.java:512)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
    at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2074)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1967)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1487)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1601)

Xavier Stevens
Sr. Software Engineer
FOX INTERACTIVE MEDIA
e: [EMAIL PROTECTED]
p: 310.633.9749
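Incidentally, the combining step described here, concatenating all part-00* files into one, can also be done with Hadoop's FileUtil.copyMerge; a minimal sketch, assuming the 0.16-era API and illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergePartFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate every file under the job's output directory into a
        // single HDFS file; the last argument is a string appended after
        // each input ("" for none), and 'false' keeps the source files.
        FileUtil.copyMerge(fs, new Path("output"), fs,
                           new Path("index-merged.txt"), false, conf, "");
    }
}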
Problem retrieving entry from compressed MapFile
Currently I can retrieve entries if I use MapFileOutputFormat via conf.setOutputFormat with no compression specified. But I was trying to do this:

public void configure(JobConf jobConf) {
    ...
    this.writer = new MapFile.Writer(jobConf, fileSys, dirName,
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK);
    ...
}

public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter) throws IOException {
    ...
    writer.append(newkey, newvalue);
    ...
}

to use SequenceFile block compression, and then later trying to retrieve the output values in a separate class:

public static void main(String[] args) throws Exception {
    ...
    conf.setInputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat.class);
    ...
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys, inDataPath, defaults);
    Partitioner part = (Partitioner)ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
    Text entryValue = null;
    entryValue = (Text)MapFileOutputFormat.getEntry(readers, part, new Text("mykey"), new Text());
    if (entryValue != null) {
        System.out.println("My Entry's Value: ");
        System.out.println(entryValue.toString());
    }
    for (MapFile.Reader reader : readers) {
        if (reader != null) {
            reader.close();
        }
    }
}

But when I use block compression, I no longer get a result from MapFileOutputFormat.getEntry. What am I doing wrong? And/or is there a way for this to work using conf.setOutputFormat(MapFileOutputFormat.class) and conf.setMapOutputCompressionType(SequenceFile.CompressionType.BLOCK)?
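On the closing question: setMapOutputCompressionType governs intermediate map output rather than a job's final files, so block compression on the final MapFiles would instead go through the SequenceFileOutputFormat settings, which MapFileOutputFormat shares (per Doug Cutting's note elsewhere in this archive). A hedged sketch, assuming a 0.19-era mapred API; whether this resolves the getEntry problem isn't established in the thread:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class BlockCompressedMapFileJob {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setOutputFormat(MapFileOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // Final-output compression settings; MapFileOutputFormat reads
        // the same parameters as SequenceFileOutputFormat.
        FileOutputFormat.setCompressOutput(conf, true);
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            SequenceFile.CompressionType.BLOCK);
        return conf;
    }
}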
RE: What's the best way to get to a single key?
I was thinking because it would be easier to search a single index. Unless I don't have to worry, and Hadoop searches all my indexes at the same time. Is this the case?

-Xavier

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2008 3:45 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:

Thanks for everything so far. It has been really helpful. I have one more question. Is there a way to merge MapFile index/data files?

No. To append text files you can use 'bin/hadoop fs -getmerge'. To merge sorted SequenceFiles (like MapFile index/data files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.hadoop.fs.Path,%20boolean)

But this doesn't generate a MapFile. Why is a single file preferable?

Doug
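A minimal sketch of the Sorter-based merge Doug points to, assuming Text keys and values and illustrative paths; as he notes, the result is one sorted SequenceFile, not a MapFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MergeSortedData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Sorter sorter =
            new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
        // Each reducer's MapFile directory holds a sorted "data" file.
        Path[] inputs = new Path[] {
            new Path("out/part-00000/data"),
            new Path("out/part-00001/data"),
        };
        // Merge the sorted inputs into one SequenceFile; 'false' keeps
        // the input files.
        sorter.merge(inputs, new Path("merged-data"), false);
    }
}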
RE: What's the best way to get to a single key?
So I read some more through the Javadocs. I had 11 reducers on my original job, leaving me with 11 MapFile directories. I am passing in their parent directory here as outDir:

MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys, outDir, defaults);
Partitioner part = (Partitioner)ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
Text entryValue = (Text)MapFileOutputFormat.getEntry(readers, part, new Text("mykey"), null);
System.out.println("My Entry's Value: ");
System.out.println(entryValue.toString());

But I am getting an exception:

Exception in thread "main" java.lang.ArithmeticException: / by zero
    at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:35)
    at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:85)
    at mypackage.MyClass.main(ProfileReader.java:110)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

I am assuming I am doing something wrong, but I'm not sure what it is yet. Any ideas?

-Xavier

-----Original Message-----
From: Xavier Stevens
Sent: Mon 3/10/2008 3:49 PM
To: core-user@hadoop.apache.org
Subject: RE: What's the best way to get to a single key?

I was thinking because it would be easier to search a single index. Unless I don't have to worry, and Hadoop searches all my indexes at the same time. Is this the case?

-Xavier

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2008 3:45 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:

Thanks for everything so far. It has been really helpful. I have one more question. Is there a way to merge MapFile index/data files?

No. To append text files you can use 'bin/hadoop fs -getmerge'. To merge sorted SequenceFiles (like MapFile index/data files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.hadoop.fs.Path,%20boolean)

But this doesn't generate a MapFile. Why is a single file preferable?

Doug
RE: What's the best way to get to a single key?
Thanks for everything so far. It has been really helpful. I have one more question: is there a way to merge MapFile index/data files? Assuming there is, what is the best way to do so? I was reading the Javadocs on it, and it looked like this is possible, but it wasn't very explicit. Obviously I could specify a single reducer, but with my data size that would be really slow.

Thanks,
-Xavier

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 04, 2008 12:53 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Xavier Stevens wrote:

Is there a way to do this when your input data is using SequenceFile compression?

Yes. A MapFile is simply a directory containing two SequenceFiles named "data" and "index". MapFileOutputFormat uses the same compression parameters as SequenceFileOutputFormat. SequenceFileInputFormat recognizes MapFiles and reads the data file. So you should be able to just switch from specifying SequenceFileOutputFormat to MapFileOutputFormat in your jobs, and everything should work the same except you'll have index files that permit random access.

Doug
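A minimal sketch of the switch Doug describes, assuming the old mapred API; the key/value classes are illustrative:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;

public class MapFileOutputSwitch {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        // Previously: conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputFormat(MapFileOutputFormat.class);
        // MapFile keys must be WritableComparable so entries stay sorted;
        // Text qualifies. Existing compression settings carry over.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        return conf;
    }
}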
What's the best way to get to a single key?
I am curious how others might be solving this problem. I want to retrieve a record from HDFS based on its key. Are there any methods that can shortcut this type of search, to avoid parsing all the data until you find it? Obviously HBase would do this as well, but I wanted to know if there is a way to do it using just Map/Reduce and HDFS.

Thanks,
-Xavier