Re: Append to Files..
Sandeep Dhawan, Noida wrote: Hello, I am currently using hadoop-0.18.0. I am not able to append to files in DFS. I came across a fix which was done in version 0.19.0 (http://issues.apache.org/jira/browse/HADOOP-1700). But I cannot migrate to the 0.19.0 version because it runs on JDK 1.6 and I have to stick to JDK 1.5. Therefore, I would like to know if there is a patch available for this in 0.18.0.

It is not considered a bug, but a new feature. Features go into later versions, especially big changes to the filesystem.

You can compile Hadoop 0.19 yourself and set the property javac.version=1.5 in the file build.properties to produce a JAR file that can be loaded and used under Java 5. There is nothing -yet- in the JAR file that is Java6+ only, though everyone recommends using a recent Java 6 JVM for its better memory management. Hadoop 0.19+ on Java 5 is not supported, and it is not tested on Java 5 before release. Which means that although you can recompile Hadoop for Java 5, you are on your own at that point.

If you do need to append, upgrade.

-steve
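For reference, the append support that HADOOP-1700 added in 0.19 is exposed through FileSystem.append(). A minimal sketch of using it, assuming a 0.19 cluster; the file path is just an example, and whether append also needs to be enabled in the configuration varies between 0.19.x releases:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Reopen an existing file for append (0.19+ only; not available on 0.18).
        Path log = new Path("/user/sandeep/events.log");   // example path
        FSDataOutputStream out = fs.append(log);
        out.writeBytes("another record\n");
        out.close();
      }
    }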
Re: killJob method in JobClient
Robert Goodman wrote: I have some code where I create my own Hadoop job and then use the JobClient to submit the Hadoop job. I noticed that the JobClient class has a killJob() method. I was planning to play around and try to kill a running Hadoop job. Does anybody know the status of the killJob() method? I'm using Hadoop 0.17.2.1.

My concern is that it might work in some phases (map, reduce, etc.) of a Hadoop job, but not in other phases, which is hard to find with simple testing. There may be hidden gotchas that are hard to detect.

This concern is really driven by the fact that the jobtracker.jsp in Hadoop doesn't have an option to kill a job. This would be a useful option.

If you get your API operations added as more functional tests to the Hadoop codebase, then you can be reasonably confident that nobody is going to break your code. This is the best way of ensuring that your system remains supported: make it part of the test suite.

-steve
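For anyone trying the same experiment, here is a minimal sketch of killing a job through the old mapred client API. The names below come from the 0.17-era JobClient/RunningJob interfaces; the empty JobConf and the abort condition are placeholders for whatever your application actually does:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class KillJobExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();               // mapper, reducer, paths set up elsewhere
        JobClient client = new JobClient(conf);
        RunningJob running = client.submitJob(conf); // submit without blocking

        // ... later, if the job has to be aborted before it finishes:
        if (!running.isComplete()) {
          running.killJob();                        // asks the JobTracker to kill the whole job
        }
      }
    }

If memory serves for that release, the command line offers the same operation via bin/hadoop job -kill with the job ID.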
talk: My other computer is a datacentre
Here's a presentation on datacentres and MR my colleague and I gave to the local university this week: My other computer is a datacentre. MapReduce was demoed on a different dataset from normal (Bluetooth devices scanned from a static location) and implemented in Erlang, because the students are learning functional languages and it is easier to show the concepts when you really do have a stateless runtime underneath. The overall theme, though, is how you run things like this in real datacentres, which is why I have photos of some of the Yahoo! Hadoop team and one of their datacentres. http://www.slideshare.net/steve_l/my-other-computer-is-a-datacentre-presentation
Re: Map/Reduce from web servers
Allen, Jeffrey wrote: Hi, I have an existing enterprise system using web services. I'd like to have an event in the web service eventually result in a map/reduce being performed. It would be very desirable to be able to package up the map/reduce classes into a jar that gets deployed inside the WAR file for the web service (primarily just to make redeployment easy). In effect, I believe what I'm asking is if there is a way I can stream the map/reduce code (i.e., the jar file) down to the JobTracker. Could someone please confirm whether I can do this or not, and if I can, help point me in the right direction? Thanks in advance for any insight...

You could embed the JAR in the WAR as a single resource, stream it out from there to the filesystem and then upload it. As a feature creep, you could see if the copy operations handled jar: as a filesystem URI, which would then enable a streamed upload.

-steve
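A rough sketch of the "embed and stream out" approach from inside the web application, using the old mapred API; the /WEB-INF/myjob.jar resource path, the method name and the rest of the job setup are placeholders, not a definitive implementation:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import javax.servlet.ServletContext;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public void submitBundledJob(ServletContext ctx) throws Exception {
      // Pull the job jar back out of the deployed WAR...
      InputStream in = ctx.getResourceAsStream("/WEB-INF/myjob.jar"); // placeholder path
      File localJar = File.createTempFile("myjob", ".jar");
      OutputStream out = new FileOutputStream(localJar);
      byte[] buf = new byte[8192];
      for (int n = in.read(buf); n > 0; n = in.read(buf)) {
        out.write(buf, 0, n);
      }
      out.close();
      in.close();

      // ...then tell the job to ship that jar to the cluster when it is submitted.
      JobConf conf = new JobConf();
      conf.setJar(localJar.getAbsolutePath());
      // set mapper, reducer, input and output paths here as usual
      JobClient.runJob(conf);
    }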
Re: Which replica?
I'm looking at alternative policies for task and data placement. As a first step, I'd like to be able to observe what Hadoop is doing without modifying our cluster's software. We saw that the datanodes log every block that is read from them, but we didn't see any way to map from those block names to a (filename, chunk) pair.

Doug Cutting wrote: A task may read from more than one block. For example, in line-oriented input, lines frequently cross block boundaries. And a block may be read from more than one host. For example, if a datanode dies midway through providing a block, the client will switch to using a different datanode. So the mapping is not simple. This information is also not, as you inferred, available to applications. Why do you need this? Do you have a compelling reason? Doug

James Cipar wrote: Is there any way to determine which replica of each chunk is read by a map-reduce program? I've been looking through the hadoop code, and it seems like it tries to hide those kinds of details from the higher level API. Ideally, I'd like the host the task was running on, the file name and chunk number, and the host the chunk was read from.
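What the public API does expose is the forward mapping from a file to its block locations: where each chunk of a file could be read from, though not which replica was actually used by a given task. A sketch against the FileSystem API of roughly that era (the path is an example, and the exact method signature has shifted between releases):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus stat = fs.getFileStatus(new Path("/data/input/part-00000")); // example path
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation block : blocks) {
          // Each entry is a (file offset, length, candidate hosts) triple --
          // the hosts the data could be served from, not the replica that was read.
          System.out.println("offset " + block.getOffset()
              + " len " + block.getLength()
              + " hosts " + java.util.Arrays.toString(block.getHosts()));
        }
      }
    }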
Re: Hadoop and .tgz files
On Mon, 01 Dec 2008 12:16:28 EST, Ryan LeCompte wrote: I believe I spoke a little too soon. Looks like Hadoop supports .gz files, not .tgz. :-)

On Mon, Dec 1, 2008 at 10:46 AM, Ryan LeCompte wrote: Hello all, I'm using Hadoop 0.19 and just discovered that it has no problems processing .tgz files that contain text files. I was under the impression that it wouldn't be able to break a .tgz file up into multiple maps, but instead just treat it as 1 map per .tgz file. Was this a recent change or enhancement? I'm noticing that it is breaking up the .tgz file into multiple maps. Thanks, Ryan

Work is in progress to support splitting of .bz2 files. See http://issues.apache.org/jira/browse/HADOOP-4012

I don't believe splitting of .tgz files is possible, since something compressed with gzip can only be uncompressed from the beginning.

-John Heidemann
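The reason a gzip-compressed input normally ends up as a single map is that the input format refuses to split any file it has a compression codec for. A sketch of that check using the old mapred API, modelled on how TextInputFormat behaved around 0.18/0.19; the class name here is made up for illustration, not the shipped code:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class CompressionAwareInputFormat extends TextInputFormat {
      private CompressionCodecFactory codecs;

      public void configure(JobConf conf) {
        super.configure(conf);
        codecs = new CompressionCodecFactory(conf);
      }

      protected boolean isSplitable(FileSystem fs, Path file) {
        // Files whose suffix matches a registered codec (.gz by default) are
        // handed to a single map task; only unrecognised files get split.
        return codecs.getCodec(file) == null;
      }
    }

The suffix check may also explain why a .tgz file appears to split: unless .tgz is registered as a gzip extension, the codec factory does not recognise it and the file is treated as ordinary, splittable, uncompressed input.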
[NYC Hadoop meetup] 12/17 Cascading by Chris Wensel
The next New York Hadoop User Group meeting is scheduled for Wednesday, December 17th at ContextWeb, 6:30pm. Join us for a talk on Cascading, an API for defining and executing complex and fault-tolerant data processing workflows on a Hadoop cluster. Chris will specifically cover what the processing model looks like, many of its core features, and the kinds of problems Cascading was intended to solve or prevent. To RSVP: http://www.meetup.com/Hadoop-NYC/calendar/9240064/ -Alex