Re: Seattle Hadoop/Scalability/NoSQL Meetup Tonight!
Thanks for coming, everyone! We had around 25 people. A *huge* success, for Seattle. And a big thanks to 10gen for sending Richard. Can't wait to see you all next month.

On Wed, Feb 24, 2010 at 2:15 PM, Bradford Stephens bradfordsteph...@gmail.com wrote:

> The Seattle Hadoop/Scalability/NoSQL (yeah, we vary the title) meetup is tonight! We're going to have a guest speaker from MongoDB :)
>
> As always, it's at the University of Washington, Allen Computer Science building, Room 303 at 6:45pm. You can find a map here: http://www.washington.edu/home/maps/southcentral.html?cse
>
> If you can, please RSVP here (not required, but very nice): http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/

--
http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution. Process, store, query, search, and serve all your data.
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Reduce step never starts, can't read output from mappers? (Too many fetch-failures)
(re-posted from the mapreduce-user list in case anyone here might have an answer)

Hello,

I have set up a cluster with one NameNode/JobTracker and three DataNode/TaskTrackers, and I'm having issues with the reduce step being unable to start. Masters and slaves can ping and ssh each other. I'm attaching the conf files (same on all machines). Is there anything else I should be looking at?

Log output for the JobTracker and one of the TaskTrackers that seems suspicious:

JobTracker:

exj...@exjobb-1:~$ hadoop jar /opt/hadoop/hadoop-0.20.1-examples.jar wordcount input/sessions-20100205145800.txt output-wordcount
10/02/24 11:15:24 INFO input.FileInputFormat: Total input paths to process : 1
10/02/24 11:15:25 INFO mapred.JobClient: Running job: job_201002240852_0003
10/02/24 11:15:26 INFO mapred.JobClient: map 0% reduce 0%
10/02/24 11:15:49 INFO mapred.JobClient: map 1% reduce 0%
[... map climbs steadily, one percent at a time, while reduce stays at 0% ...]
10/02/24 11:27:13 INFO mapred.JobClient: map 69% reduce 0%
10/02/24 11:27:17 INFO [message truncated here in the archive]
Re: Hadoop key mismatch
On Wed, Feb 24, 2010 at 3:30 PM, Larry Homes larr.ho...@gmail.com wrote:

> Hello, I am trying to sort some values using a simple map and reduce without any processing, but I think I messed up my data types somehow. Rather than try to paste code in an email, I have described the problem and pasted all the code (nicely formatted) here: http://www.coderanch.com/t/484435/Distributed-Java/java/Hadoop-key-mismatch
>
> Thanks

I think the first problem you are having is that you changed the signature of the map method incorrectly:

public void map(Text key, Text value, Context context)

The type of the key should be LongWritable: the key is an integer giving the byte offset of the line within the file, and the value is the entire line of text. Try:

public void map(LongWritable key, Text value, Context context)

Adjust accordingly and you should be ok. (At least until the next problem :)
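A minimal sketch of a corrected mapper, assuming the job reads plain text via the default TextInputFormat and the new (0.20 mapreduce) API, which the Context parameter implies. Class and variable names here are illustrative, not taken from Larry's actual code:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the framework hands the mapper a LongWritable key
// (the byte offset of the line in the input file) and a Text value (the
// line itself). The third and fourth type parameters declare the output
// key/value types, which must match what context.write() emits.
public class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text EMPTY = new Text("");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the line as the output key so the shuffle sorts by it.
        context.write(value, EMPTY);
    }
}
```

This won't compile without the Hadoop 0.20 jars on the classpath; it's only meant to show where each type goes. If the output types in the class declaration disagree with what map() writes, you get exactly the key/value mismatch error being discussed.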
Re: CDH2 or Apache Hadoop - Official Debian packages
Allen,

> For all intents and purposes, the Debian package sounds just like a re-packaging of the Apache distribution in .deb form.

You're perfectly right. Most Debian packages are just re-packagings of the upstream projects, but with additional management information and logic to ease the installation and make them work well on the platform and together with other programs. It's the beautiful world of package management:

apt-get install hadoop
less /usr/share/doc/hadoop/README
...

Have fun with hadoop.

One caveat: there is no version namespace. Everything is called just hadoop, not hadoop-0.18 or hadoop-0.20 as in the Cloudera packages, thus making upgrades really hard and not suitable for anything real. Actually my hope is in Hadoop's plan to eventually establish a stable API, so that upgrades will be backwards compatible. As long as that isn't the case, the Debian package is intended for only three audiences:

- People who are willing to deal with any upgrade hassles for the benefit of an official Debian package
- People who'd like to try out and learn Hadoop with an easily installable package
- Me

That said, I'm going to use the Debian package on a tiny production cluster of 5 machines.

Thomas Koch, http://www.koch.ro
Re: Sun JVM 1.6.0u18
On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey sc...@richrelevance.com wrote:

> On Feb 15, 2010, at 9:54 PM, Todd Lipcon wrote:
>> Hey all, just a note that you should avoid upgrading your clusters to 1.6.0u18. We've seen a lot of segfaults or bus errors on the DN when running with this JVM. Stack found the same thing on one of his clusters as well.
>
> Have you seen this for 32-bit, 64-bit, or both? If 64-bit, was it with -XX:+UseCompressedOops?

Just 64-bit, no compressed oops. But I haven't tested other variables.

> Any idea if there are Sun bugs open for the crashes?

I opened one, yes. I think Stack opened a separate one. Haven't heard back.

> I have found some notes that suggest that -XX:-ReduceInitialCardMarks will work around some known crash problems with 6u18, but that may be unrelated.

Yep, I think that is probably a likely workaround as well. For now I'm recommending a downgrade to our clients, rather than introducing cryptic -XX flags :)

> Lastly, I assume that Java 6u17 should work the same as 6u16, since it is a minor patch over 6u16 whereas 6u18 includes a new version of HotSpot. Can anyone confirm that?

I haven't heard anything bad about u17 either. But since we know u16 to be very good and nothing important is new in u17, I still like to recommend u16.

-Todd
Use intermediate compression for Map output or not?
Hi Hadoop gurus,

Here's a question about intermediate compression. As I understand it, the point of compressing map output is to reduce the network traffic incurred when intermediate files are shuffled from map tasks to reduce tasks that do not reside on the same boxes. So, depending on various factors such as how the cluster is set up, data size, the nature of the problem being solved, and the quality of the M/R program (e.g. a Pig script), the reduction in network traffic may or may not compensate for the time spent compressing and decompressing. In other words, intermediate compression may fail at its goal of reducing the overall runtime of an M/R job.

A blog post (http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) gives compression/decompression ratios and speeds, and reports a positive result from compressing the raw input data to an M/R job, but offers no tests or insight about intermediate compression. So I am wondering if there are any case studies or test results guiding when to use intermediate compression: pros and cons, settings, pitfalls and gains...

Thanks,
Michael
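For anyone wanting to experiment: in 0.20, turning on intermediate (map-output) compression is two properties, settable cluster-wide in mapred-site.xml or per-job via JobConf. The LZO codec line is optional and assumes the hadoop-lzo libraries from the blog post above are installed; without it, the zlib-based DefaultCodec is used:

```xml
<!-- mapred-site.xml (or set per-job through JobConf) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<!-- Optional: requires the hadoop-lzo native libraries on every node. -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

Since these are per-job settings, one practical way to answer the question for a given workload is simply to run the same job twice, with and without the flag, and compare wall-clock time and shuffle bytes in the job counters.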
Re: CDH2 or Apache Hadoop - Official Debian packages
On Feb 25, 2010, at 10:20 AM, Allen Wittenauer wrote:

>> Actually my hope is in the plan of hadoop to once establish a stable API (as planned) so that an upgrade will be backwards compatible.
>
> History shows you are in for a long wait.

I hope not, and I'm trying to make sure that isn't true. At this point, we have a lot of customers inside Yahoo who yell at our SVP when anyone breaks API compatibility with the previous release. My hope is to get to the point where we do one major release a year, and each major release is backwards compatible with the previous major release (as in, you don't need to recompile your code). Bonus points if we can get a minor release out at the half-year point. And of course bug-fix releases as needed...

-- Owen
Re: CDH2 or Apache Hadoop - Official Debian packages
On 2/25/10 8:39 AM, Thomas Koch tho...@koch.ro wrote:

> - no version namespace, everything is called just hadoop, not hadoop-0.18 or hadoop-0.20 as in the cloudera package ... and thus making upgrades really hard and not suitable for anything real.
>
> Actually my hope is in the plan of hadoop to once establish a stable API (as planned) so that an upgrade will be backwards compatible.

History shows you are in for a long wait.

It is also worth pointing out that API compat is only part of the issue. Without ABI compat, it is still a very rough road. [A point lost on way too many in the Hadoop community; too many devs, not enough ops.]
Re: Sun JVM 1.6.0u18
On Feb 15, 2010, at 9:54 PM, Todd Lipcon wrote:

> Hey all, just a note that you should avoid upgrading your clusters to 1.6.0u18. We've seen a lot of segfaults or bus errors on the DN when running with this JVM. Stack found the same thing on one of his clusters as well.

Have you seen this for 32-bit, 64-bit, or both? If 64-bit, was it with -XX:+UseCompressedOops?

Any idea if there are Sun bugs open for the crashes?

I have found some notes that suggest that -XX:-ReduceInitialCardMarks will work around some known crash problems with 6u18, but that may be unrelated.

Lastly, I assume that Java 6u17 should work the same as 6u16, since it is a minor patch over 6u16 whereas 6u18 includes a new version of HotSpot. Can anyone confirm that? We've found 1.6.0u16 to be very stable.

-Todd
Hadoop freeze?
I ran into the following problem running a Hadoop job written in Pig. Please help check what caused the issue. As far as I can tell, it seems the job/task tracker failed for some reason, but the name/data nodes are still functioning. The job simply makes no progress at all (no output, no log), although a couple of other Hadoop jobs ran successfully before this one. "hadoop fs -ls" can still list files. But when I ran "hadoop job -list", it took too long and then failed with the following error:

Exception in thread main java.io.IOException: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
    at org.apache.hadoop.ipc.Client.call(Client.java:699)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.mapred.$Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
    at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:435)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:429)
    at org.apache.hadoop.mapred.JobClient.run(JobClient.java:1512)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1727)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
    at sun.nio.ch.IOUtil.read(IOUtil.java:206)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:271)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:493)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)

The web interface to the JobTracker on port 50030 simply came back with no response at all. Checking netstat, sometimes it shows 50030 and sometimes not; connections and ports with the data nodes were shown there. Then, if I ran another Pig job, it failed with the following error:

Error before Pig is launched
ERROR 6009: Failed to create job client: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
org.apache.pig.backend.executionengine.ExecException: ERROR 6009: Failed to create job client: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:217)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:137)
    at org.apache.pig.impl.PigContext.connect(PigContext.java:199)
    at org.apache.pig.PigServer.init(PigServer.java:169)
    at org.apache.pig.PigServer.init(PigServer.java:158)
    at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:54)
    at org.apache.pig.Main.main(Main.java:395)
Caused by: java.io.IOException: Call to hostname/ip-address:50002 failed on local exception: Connection reset by peer
    at org.apache.hadoop.ipc.Client.call(Client.java:699)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.mapred.$Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
    at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:435)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:429)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:398)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
    ... 6 more
Caused by: java.io.IOException: Connection reset by peer
    [same sun.nio.ch / SocketInputStream frames as in the first trace; truncated in the original message]
Re: On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir
Hello,

I fixed this by running >= 2 slaves. I was testing with 1 when this error occurred.

Regards
Saptarshi

On Tue, Feb 23, 2010 at 2:57 PM, Todd Lipcon t...@cloudera.com wrote:

> Hi Saptarshi,
>
> Can you please ssh into the JobTracker node and check that this directory is mounted, writable by the hadoop user, and not full?
>
> -Todd
>
> On Fri, Feb 19, 2010 at 2:13 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:
>
>> Hello, not sure if I should post this here or on Cloudera's message board, but here goes. When I run EC2 using the latest CDH2 and Hadoop 0.20 (setting the env variables via hadoop-ec2) and launch a job with "hadoop jar ...", I get the following error:
>>
>> 10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir
>>     at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975)
>>     at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279)
>>     at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256)
>>     at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240)
>>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
>>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
>>     at org.apache.hadoop.ipc.Client.call(Client.java:740)
>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>>     at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>     at org.godhuli.f.RHMR.submitAndMonitorJob(RHMR.java:195)
>>
>> but the value of mapred.local.dir is /mnt/hadoop/mapred/local. Any ideas?
Re: cluster involvement trigger
Hi,

The number of mappers initialized depends largely on your input format (the getSplits of your input format). Almost all input formats available in Hadoop derive from FileInputFormat, hence the "1 mapper per file block" notion (this is actually 1 mapper per split). You say that you have many small files. In general, each of these small files (< 64 MB) will be processed by a single mapper. However, I would suggest looking at CombineFileInputFormat, which does the job of packaging many small files together, taking data locality into account, for better performance (task initialization time is a significant factor in Hadoop's performance). On the other side, many small files will hamper your NameNode's performance, since file metadata is stored in memory, and will limit its overall capacity with respect to the number of files.

Amogh

On 2/25/10 11:15 PM, Michael Kintzer michael.kint...@zerk.com wrote:

> Hi,
>
> We are using the streaming API. We are trying to understand what Hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution. With default settings (64 MB chunk size in HDFS), if the input file is less than 64 MB, will the data processing only occur on a single TaskTracker node, even if our cluster size is greater than 1?
>
> For example, we are trying to figure out if Hadoop is more efficient at processing: a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the -archives argument, or b) a single input file containing all the raw data represented by the 100K or 1M small files. With (a), our input file is under 64 MB. With (b), our input file is very large.
>
> Thanks for any insight,
> -Michael
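For concreteness, the two streaming invocations being compared might look roughly like this (all paths, script names, and the jar name are hypothetical; -archives ships an archive to every task via the distributed cache and unpacks it under the given link name):

```shell
# Option (a): small index file as input; the 100K small files travel inside
# a jar shipped to every task's working directory (unpacked under ./data).
# Because the index file is under 64 MB, it is a single split, so only ONE
# map task runs no matter how many TaskTrackers the cluster has.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
  -archives hdfs:///user/mk/smallfiles.jar#data \
  -input /user/mk/index.txt \
  -output /user/mk/out-a \
  -mapper map.py -reducer reduce.py

# Option (b): all raw data concatenated into one large input file; HDFS
# stores it as ~64 MB blocks and the job gets one map task per split, so
# multiple TaskTrackers are involved automatically.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
  -input /user/mk/alldata.txt \
  -output /user/mk/out-b \
  -mapper map.py -reducer reduce.py
```

This is only a sketch to make the split behavior visible; it answers the threshold question directly: with default settings, involvement of additional nodes is driven by the number of input splits, not by cluster size.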