New York user group?
Please let me know if you would be interested in joining a NY Hadoop user group if one existed. I know about 5-6 people in New York City running Hadoop; I am sure there are many more. Let me know. If there is some interest, I will try to put together a first meeting. thanks -Alex ALEX DORMAN [EMAIL PROTECTED] contextweb.com
Is Hadoop compatible with IBM JDK 1.5 64 bit for AIX 5?
The Hadoop documentation says Sun's JDK must be used; this message is posted to make sure there is an official statement about this.
RE: New York user group?
Yes. I am interested. Date: Fri, 18 Jul 2008 05:59:33 -0700 From: [EMAIL PROTECTED] Subject: New York user group? To: core-user@hadoop.apache.org
Hadoop 0.17.1 namenode service can't start on Windows XP.
Hi, I followed the instructions from http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/ to install Hadoop 0.17.1 on my Windows XP computer, whose computer name is AMBER and whose current user name is User. I installed Cygwin on G:\. I have verified that ssh and bin/hadoop version work fine, but when trying to start the DFS service I found the following problems:

1. Hadoop can't create the logs directory automatically if it does not exist in the install directory.
2. The datanode service can automatically create the G:\tmp\hadoop-SYSTEM\dfs\data directory, but the namenode service can't automatically create the G:\tmp\hadoop-User directory and its subdirectories. Even after I manually created the G:\tmp\hadoop-User\dfs\name\image directory, the namenode service still couldn't start. I found the following exceptions in the namenode's log file:

2008-07-18 22:11:46,578 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = amber/116.76.140.27
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.17.1
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008
2008-07-18 22:11:47,234 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=47110
2008-07-18 22:11:47,250 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: localhost/127.0.0.1:47110
2008-07-18 22:11:47,265 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-07-18 22:11:47,281 INFO org.apache.hadoop.dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: fsOwner=User,None,root,Administrators,Users,ORA_DBA
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true
2008-07-18 22:11:48,359 INFO org.apache.hadoop.dfs.Storage: Storage directory G:\tmp\hadoop-User\dfs\name does not exist.
2008-07-18 22:11:48,359 INFO org.apache.hadoop.ipc.Server: Stopping server on 47110
2008-07-18 22:11:48,359 ERROR org.apache.hadoop.dfs.NameNode: org.apache.hadoop.dfs.InconsistentFSStateException: Directory G:\tmp\hadoop-User\dfs\name is in an inconsistent state: storage directory does not exist or is not accessible.
  at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:154)
  at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
  at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
  at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
  at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
  at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
  at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
  at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
  at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
2008-07-18 22:11:48,359 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at amber/116.76.140.27

A second startup attempt at 22:26:35 logged the same startup sequence, again ending with:

2008-07-18 22:26:37,515 INFO org.apache.hadoop.dfs.Storage: Storage directory G:\tmp\hadoop-User\dfs\name does not exist.
2008-07-18 22:26:37,515 INFO org.apache.hadoop.ipc.Server: Stopping server on
Re: Is Hadoop compatible with IBM JDK 1.5 64 bit for AIX 5?
I'm not sure if this is useful info, but I used both the Sun and the IBM JDK under Linux to run version 0.16.iForget of Hadoop, without any problems. I did some brief performance testing, didn't see any significant difference, then we switched over to the Sun JDK exclusively as per the recommendation of the docs. -Colin On Fri, Jul 18, 2008 at 9:24 AM, Amber [EMAIL PROTECTED] wrote: The Hadoop documentation says Sun's JDK must be used, this message is post to make sure that there is official statement about this.
using too many mappers?
Is it possible that using too many mappers causes issues in Hadoop 0.17.1? I have an input data directory with 100 files in it. I am running a job that takes these files as input. When I set -jobconf mapred.map.tasks=200 in the job invocation, it seems like the mappers receive empty inputs (which my binary does not cleanly handle). When I unset the mapred.map.tasks parameter, the job runs fine, and many mappers do get used because the input files are manually split. Can anyone offer an explanation / have there been changes in the use of this parameter between 0.16.4 and 0.17.1? Ashish
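For what it's worth, mapred.map.tasks is only a hint for file-based input: FileInputFormat derives a "goal" split size from it, then clamps that by the minimum split size and the block size, and every file yields at least one split. A rough Python sketch of that heuristic (simplified; the names and defaults here are illustrative, not the exact 0.17.1 code):

```python
def split_size(total_input_bytes, requested_maps,
               block_size=64 * 1024 * 1024, min_split=1):
    # The requested map count only sets a "goal" size per split; the
    # effective split size is clamped between the minimum split size
    # and the block size.
    goal = max(total_input_bytes // max(requested_maps, 1), 1)
    return max(min_split, min(goal, block_size))

# 100 files of 10 MB each with 200 requested maps: the goal size drops
# to ~5 MB, so files get cut into smaller splits; asking for far more
# maps than the data supports can produce tiny or empty splits.
print(split_size(100 * 10 * 1024 * 1024, 200))
```

Under this heuristic, requesting only a few maps would instead be clamped up to the block size, which is why the parameter behaves as a hint rather than an exact count.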
Re: Reduce stalling
I'm having the same problem :-/ Maps are going fine while the reduce phase stalls at 9-16%, and then resumes after a long while (30-40 minutes). I'm using hadoop 0.16.0 (r618351) and the wordcount hadoop-example... next week I'll try with a newer hadoop version (perhaps trunk) to see if I can reproduce this issue :-S On Sat, Jun 21, 2008 at 8:28 AM, Arnie Horta [EMAIL PROTECTED] wrote: Hello, I am having a problem... when I configure more than one node on my hadoop cluster, the reduce jobs all stall. The logs look EXACTLY like the logs described by Amit Kumar Singh in his post last month. I have checked that the disk is fine (I formatted the HDFS and deleted all the data on the data nodes to make sure) and this is not a firewall issue or a problem with /etc/hosts. It works fine with one node, but adding a second node starts the stalling. I have tried using both .16.2 and .16.4, to no avail. HELP!
Timeouts when running balancer
I'm trying to rebalance my cluster as I've added two more nodes. When I run the balancer with the default threshold I am seeing timeouts in the logs:

2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Decided to move block -8432927406854991437 with a length of 128 MB bytes from 10.11.6.234:50010 to 10.11.6.235:50010 using proxy source 10.11.6.234:50010
2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Starting Block mover for -8432927406854991437 from 10.11.6.234:50010 to 10.11.6.235:50010
2008-07-18 09:52:46,826 WARN org.apache.hadoop.dfs.Balancer: Timeout moving block -8432927406854991437 from 10.11.6.234:50010 to 10.11.6.235:50010 through 10.11.6.234:50010

I read in the balancer user guide (http://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2) that the default transfer rate is 1 MB/sec. I tried increasing this to 1 GB/sec but I'm still seeing the timeouts. All of the nodes have gigE NICs and are on the same switch. -- David O'Dell Director, Operations e: [EMAIL PROTECTED] t: (415) 738-5152 180 Townsend St., Third Floor San Francisco, CA 94107
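One detail that often trips people up here: the bandwidth cap is the dfs.balance.bandwidthPerSec property, whose value is in bytes per second (the 1 MB/s default is 1048576), and in this era of Hadoop it is read by each datanode at startup, so the new value has to be in the datanodes' configuration and the datanodes restarted before it takes effect. A sketch of the hadoop-site.xml entry (the 10 MB/s value shown is just an example, not a recommendation):

```xml
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- Bytes per second each datanode may spend on balancing;
       the default is 1048576 (1 MB/s). -->
  <value>10485760</value>
</property>
```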
help request: 0.16.0 java.io.IOException: Filesystem closed org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_....
I am seeing an odd mix of errors in a job we have running on a particular cluster of machines. Has anyone seen this before, and what is actually the problem? We are running Linux (CentOS 5.1, on 8-way Xeons, with all disks under RAID 5) and gigE switches between the machines. The namenode machine does not run a datanode or a tasktracker and is essentially idle. Thanks!

2008-07-18 09:25:43,626 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Filesystem closed
  at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:158)
  at org.apache.hadoop.dfs.DFSClient.access$500(DFSClient.java:58)
  at org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1095)
  at java.io.FilterInputStream.close(FilterInputStream.java:155)
  at org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:97)
  at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:277)
  at org.apache.hadoop.mapred.KeyValueLineRecordReader.close(KeyValueLineRecordReader.java:113)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:155)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:212)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

and

2008-07-18 04:26:41,056 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_200807141819_0007_m_001981_1/spill0.out in any of the configured local directories
  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
  at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:77)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:464)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:713)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

-- Jason Venner Attributor - Program the Web http://www.attributor.com/ Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
Re: can hadoop read files backwards
Well, here is the problem I'm trying to solve. I have a data set that looks like this:

ID  Type  Timestamp
A1  X     1215647404
A2  X     1215647405
A3  X     1215647406
A1  Y     1215647409

I want to count how many A1 Y events show up within 5 seconds of an A1 X. I was planning to have the data sorted by ID then timestamp, then read it backwards (or have it sorted by reverse timestamp), going through it caching all Y's for the same ID for 5 seconds to either find a matching X or not. The results don't need to be 100% accurate. So if hadoop gives me the same file with the same lines in order then this will work. It seems hadoop is really good at solving problems that depend on 1 line at a time, but not multiple lines? Hadoop has to get data in order, and be able to work on multiple lines, otherwise how could it be setting records in data sorts. I'd appreciate other suggestions on how to go about doing this. Jim R. Wilson wrote: does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? You can't really depend on the order that the lines are given - it's best to think of them as random. The purpose of MapReduce/Hadoop is to distribute a problem among a number of cooperating nodes. The idea is that any given line can be interpreted separately, completely independent of any other line. So in wordcount, this makes sense. For example, say you and I are nodes. Each of us gets half the lines in a file and we can count the words we see and report on them - it doesn't matter what order we're given the lines, or which lines we're given, or even whether we get the same number of lines (if you're faster at it, or maybe you get shorter lines, you may get more lines to process in the interest of saving time). So if the project you're working on requires getting the lines in a particular order, then you probably need to rethink your approach. It may be that hadoop isn't right for your problem, or maybe that the problem just needs to be attacked in a different way.
Without knowing more about what you're trying to achieve, I can't offer any specifics. Good luck! -- Jim On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: I have a program based on wordcount.java and I have files that are smaller than 64mb (so I believe each file is one task). Does wordcount get the lines in order? or are they random? can I have hadoop return them in reverse order? Jim R. Wilson wrote: It sounds to me like you're talking about hadoop streaming (correct me if I'm wrong there). In that case, there's really no order to the lines being doled out, as I understand it. Any given line could be handed to any given mapper task running on any given node. I may be wrong, of course; someone closer to the project could give you the right answer in that case. -- Jim R. Wilson (jimbojw) On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: is there a way to have hadoop hand over the lines of a file backwards to my mapper? as in, give the last line first.
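If the data can be delivered to a reducer so that each ID's events arrive in timestamp order (e.g. via a composite key and a secondary sort), the 5-second matching described above reduces to a single forward pass, with no need to read anything backwards. A minimal Python sketch of that pass over the sample data from the thread (the function name and line format are assumptions for illustration):

```python
def count_within_window(lines, window=5):
    # One forward pass: remember the most recent X per ID, and count each
    # Y that follows within `window` seconds of that ID's last X.
    last_x = {}
    matches = 0
    for line in lines:
        event_id, event_type, ts = line.split()
        ts = int(ts)
        if event_type == "X":
            last_x[event_id] = ts
        elif event_type == "Y":
            if event_id in last_x and ts - last_x[event_id] <= window:
                matches += 1
    return matches

sample = [
    "A1 X 1215647404",
    "A2 X 1215647405",
    "A3 X 1215647406",
    "A1 Y 1215647409",
]
print(count_within_window(sample))  # the lone A1 Y lands 5 seconds after A1 X, so 1
```

The dictionary keyed by ID means the pass tolerates interleaved IDs, as long as each ID's own events are time-ordered.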
Re: [PIG LATIN] how to get the size of a data bag
Charles, The right forum for Pig is [EMAIL PROTECTED]; I'm redirecting you there... good luck! Arun On Jul 18, 2008, at 11:51 AM, charles du wrote: Hi: Just started learning hadoop and pig latin. How can I get the number of elements in a data bag? For example, a data bag like the following has four elements: B = {1, 2, 3, 5} I tried C = COUNT(B), but it did not work. Thanks. -- tp
[Streaming]What is the difference between streaming options: -file and -CacheFile ?
Hi All, I am using Hadoop Streaming. I am confused by the streaming options -file and -cacheFile. It seems that they mean the same thing, right? Another misleading pair of options is -numReduceTasks and -jobconf mapred.reduce.tasks. Both are used to control (or give a hint for) the number of reducers. Thanks
Re: [Streaming]What is the difference between streaming options: -file and -CacheFile ?
On Jul 18, 2008, at 4:53 PM, Steve Gao wrote: Hi All, I am using Hadoop Streaming. I am confused by the streaming options -file and -cacheFile. It seems that they mean the same thing, right? The difference is that -file will 'ship' your file (a local file) to the cluster, while -cacheFile assumes that it is already present on HDFS at the given path. Another misleading pair of options is -numReduceTasks and -jobconf mapred.reduce.tasks. Both are used to control (or give a hint for) the number of reducers. Yes, they are equivalent. hth, Arun
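To make the -file case concrete: the usual thing shipped with -file is a small local mapper script that reads lines from stdin and emits tab-separated key/value pairs for the framework to sort and shuffle. A minimal wordcount-style sketch in Python (illustrative only; it is driven here from a string rather than sys.stdin so it can run standalone):

```python
import io

def map_line(line):
    # Emit one tab-separated (word, 1) pair per word: the usual contract
    # between a streaming mapper and the framework's sort/shuffle.
    return [f"{word}\t1" for word in line.split()]

def run_mapper(stream):
    # In a real job this would be run_mapper(sys.stdin), with the script
    # shipped to the cluster via something like: -file mapper.py -mapper mapper.py
    pairs = []
    for line in stream:
        pairs.extend(map_line(line.rstrip("\n")))
    return pairs

sample = io.StringIO("the quick fox\nthe fox\n")
for pair in run_mapper(sample):
    print(pair)
```

The same script could instead be uploaded to HDFS once and referenced with -cacheFile, which avoids re-shipping it on every job submission.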
Re: [Streaming]What is the difference between streaming options: -file and -CacheFile ?
One more little question: why is Hadoop streaming designed this way, with 2 different options to do the same thing (i.e. control the number of reducers)? What's the point here? Thanks --- On Fri, 7/18/08, Arun C Murthy [EMAIL PROTECTED] wrote: From: Arun C Murthy [EMAIL PROTECTED] Subject: Re: [Streaming]What is the difference between streaming options: -file and -CacheFile ? To: core-user@hadoop.apache.org, Steve Gao [EMAIL PROTECTED] Date: Friday, July 18, 2008, 8:27 PM
Hadoop with Axis
Hello Again: I'm currently running Hadoop with various Client objects in the Map phase. A given Axis service provides the class of the Client to be used in this situation, which runs the call over the wire to the provided URL and translates the objects returned into Writable objects. When I use the code without Hadoop, it runs just fine: objects are returned from over the wire. When I run the code inside of Hadoop's structure, I am getting null objects within the return type (although the return type itself is not null) from the service. This is literally the same code. Do you think this is a timing thing, where the connection is taking too long so Hadoop kills it? It's only a few seconds, but I thought I should ask. Are there other things I should be looking into? Thanks, Kylie -- The Circle of the Dragon -- unlock the mystery that is the dragon. http://www.blackdrago.com/index.html Light, seeking light, doth the light of light beguile! -- William Shakespeare's Love's Labor's Lost