Re: Append to Files..

2008-12-02 Thread Steve Loughran

Sandeep Dhawan, Noida wrote:

Hello,

 


I am currently using hadoop-0.18.0. I am not able to append to files in
DFS. I came across a fix that went into version 0.19.0
(http://issues.apache.org/jira/browse/HADOOP-1700), but I cannot migrate
to the 0.19.0 version because it runs on JDK 1.6 and I have to stick to
JDK 1.5. Therefore, I would like to know if there is a patch available
for this bug for 0.18.0.



It is not considered a bug, but a new feature. New features go into the 
later versions, especially big changes to the filesystem.


You can compile Hadoop 0.19 yourself and set the property 
javac.version=1.5 in the file build.properties to produce a JAR file 
that can be loaded and used under Java5. There is nothing -yet- in the 
JAR file that is Java6-only, though everyone recommends a recent Java6 
JVM for its better memory management. Hadoop 0.19+ on Java5 is not 
supported, and it is not tested on Java5 before release, which means 
that although you can recompile Hadoop for Java5, you are on your own 
at that point.


If you do need to append, upgrade.

-steve
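
For reference, a minimal sketch of what the append call added in 0.19
looks like; the path is illustrative, and on some releases append may
also need to be enabled in the cluster configuration before it will
succeed:

// Minimal sketch, assuming Hadoop 0.19+ (HADOOP-1700): append a line to
// an existing HDFS file. The path is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/append-demo.txt");  // must already exist
        FSDataOutputStream out = fs.append(file);      // no equivalent on 0.18
        out.write("another line\n".getBytes("UTF-8"));
        out.close();
    }
}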


Re: killJob method in JobClient

2008-12-02 Thread Steve Loughran

Robert Goodman wrote:

I have some code where I create my own Hadoop job and then use the JobClient
to submit it. I noticed that the JobClient class has a killJob() method. I
was planning to play around and try to kill a running Hadoop job. Does
anybody know the status of the killJob() method? I'm using Hadoop 0.17.2.1.
My concern is that it might work in some phases of a job (map, reduce, etc.)
but not in others, which would be hard to find with simple testing; there
may be hidden gotchas that are hard to detect. This concern is really driven
by the fact that jobtracker.jsp in Hadoop doesn't have an option to kill a
job, which would be a useful option.



If you get your API operations added as more functional tests to the 
Hadoop codebase, then you can be reasonably confident that nobody is 
going to break your code. This is the best way of ensuring that your 
system remains supported: make it part of the test suite.


-steve
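
For what it's worth, a minimal sketch of driving killJob() through the
old org.apache.hadoop.mapred API; the job name and paths are
illustrative and the real mapper/reducer setup is elided:

// Minimal sketch (old org.apache.hadoop.mapred API, 0.17.x era): submit
// a job asynchronously, keep the RunningJob handle, kill it on demand.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class KillableJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.setJobName("killable-demo");             // illustrative name
        // Mapper/reducer default to the identity classes; paths are illustrative.
        FileInputFormat.setInputPaths(conf, new Path("/tmp/killjob-in"));
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/killjob-out"));

        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);      // returns without waiting

        // ... later, e.g. from another thread or an admin hook:
        if (!job.isComplete()) {
            job.killJob();                            // ask the JobTracker to kill it
        }
    }
}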


talk: My other computer is a datacentre

2008-12-02 Thread Steve Loughran


Here's a presentation on datacentres and MapReduce that my colleague and 
I gave to the local university this week: My other computer is a 
datacentre. MapReduce was demoed on a different dataset from normal 
(Bluetooth devices scanned from a static location) and implemented in 
Erlang, because the students are learning functional languages and it is 
easier to show the concepts when you really do have a stateless runtime 
underneath. The overall theme, though, is how you run things like this in 
real datacentres, which is why I have photos of some of the Yahoo! 
Hadoop team and one of their datacentres.


http://www.slideshare.net/steve_l/my-other-computer-is-a-datacentre-presentation


Re: Map/Reduce from web servers

2008-12-02 Thread Steve Loughran

Allen, Jeffrey wrote:

Hi,

I have an existing enterprise system using web services. I'd like to have an 
event in the web service eventually result in a map/reduce job being run. It 
would be very desirable to be able to package the map/reduce classes into a 
JAR that gets deployed inside the WAR file for the web service (primarily 
just to make redeployment easy). In effect, I believe what I'm asking is 
whether there is a way I can stream the map/reduce code (i.e., the JAR file) 
down to the JobTracker.

Could someone please confirm whether I can do this or not, and if I can, help 
point me in the right direction?

Thanks in advance for any insight...


You could embed the JAR in the WAR as a single resource, stream it out 
from there to the local filesystem and then upload it. As feature creep, 
you could see whether the copy operations handle jar: as a filesystem 
URI, which would then enable a streamed upload.


-steve
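
A minimal sketch of that first approach; every name and path here is
illustrative, and it assumes the job JAR was packaged somewhere on the
webapp classpath (e.g. under WEB-INF/classes) as /mr-job.jar:

// Minimal sketch, not from the thread: the embedded JAR is streamed out
// to a local temp file so that JobConf.setJar() has a path Hadoop can
// ship to the cluster at submit time. Names and paths are illustrative.
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class WarJobLauncher {

    public RunningJob launch() throws Exception {
        // 1. Stream the embedded JAR out of the WAR onto the local filesystem.
        File localJar = File.createTempFile("mr-job", ".jar");
        InputStream in = getClass().getResourceAsStream("/mr-job.jar");
        OutputStream out = new FileOutputStream(localJar);
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) > 0; ) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();

        // 2. Point the job at that JAR and submit; job submission copies
        //    the JAR up to the cluster so the tasks can load it.
        JobConf conf = new JobConf();
        conf.setJobName("submitted-from-webapp");
        conf.setJar(localJar.getAbsolutePath());
        // conf.setMapperClass(...); conf.setReducerClass(...);  // classes live in the JAR
        FileInputFormat.setInputPaths(conf, new Path("/tmp/webapp-in"));
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/webapp-out"));
        return new JobClient(conf).submitJob(conf);
    }
}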


Re: Which replica?

2008-12-02 Thread Jim Cipar
I'm looking at alternative policies for task and data placement.  As a 
first step, I'd like to be able to observe what Hadoop is doing without 
modifying our cluster's software.  We saw that the datanodes log every 
block that is read from them, but we didn't see any way to map from 
those block names to a (filename, chunk) pair.




Doug Cutting wrote:
A task may read from more than one block.  For example, in 
line-oriented input, lines frequently cross block boundaries.  And a 
block may be read from more than one host.  For example, if a datanode 
dies midway through providing a block, the client will switch to using 
a different datanode.  So the mapping is not simple.  This information 
is also not, as you inferred, available to applications.  Why do you 
need this?  Do you have a compelling reason?


Doug

James Cipar wrote:
Is there any way to determine which replica of each chunk is read by 
a map-reduce program?  I've been looking through the hadoop code, and 
it seems like it tries to hide those kinds of details from the higher 
level API.  Ideally, I'd like the host the task was running on, the 
file name and chunk number, and the host the chunk was read from.
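
For the placement half of this, the public API will at least tell you
which hosts hold each block-sized range of a file, and hadoop fsck
<path> -files -blocks -locations prints the block IDs per file, which
can then be matched against the datanode logs. A minimal sketch of the
API route, with illustrative argument handling:

// Minimal sketch, not from the thread: list which hosts hold each
// block-sized chunk of a file via the public FileSystem API. This shows
// where replicas live, not which replica a given task actually read from.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDump {
    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]);                // e.g. /user/foo/data.txt
        FileSystem fs = file.getFileSystem(new Configuration());
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                + " length=" + b.getLength()
                + " hosts=" + Arrays.toString(b.getHosts()));
        }
    }
}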






Re: Hadoop and .tgz files

2008-12-02 Thread John Heidemann
On Mon, 01 Dec 2008 12:16:28 EST, Ryan LeCompte wrote: 
I believe I spoke a little too soon. Looks like Hadoop supports .gz
files, not .tgz. :-)


On Mon, Dec 1, 2008 at 10:46 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm using Hadoop 0.19 and just discovered that it has no problems
 processing .tgz files that contain text files. I was under the
 impression that it wouldn't be able to break a .tgz file up into
 multiple maps, but instead just treat it as 1 map per .tgz file. Was
 this a recent change or enhancement? I'm noticing that it is breaking
 up the .tgz file into multiple maps.

 Thanks,
 Ryan



Work is in progress to support splitting of .bz2 files; see
http://issues.apache.org/jira/browse/HADOOP-4012

I don't believe splitting of .tgz files is possible: something
compressed with gzip can only be decompressed from the beginning.

   -John Heidemann
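
The .gz handling comes from Hadoop's codec lookup, which goes by file
suffix: a name ending in .gz resolves to GzipCodec and is read as a
single, unsplittable stream, while a suffix the factory does not
recognise (such as .tgz) gets no codec at all, so the raw bytes are
split like any other file, which is probably what produced the multiple
maps seen earlier in the thread. A minimal sketch of that lookup, with
an illustrative file argument:

// Minimal sketch (illustrative file argument): open a file through
// Hadoop's codec lookup. A *.gz name resolves to GzipCodec and is
// decompressed as one stream; an unrecognised suffix such as *.tgz gets
// no codec, so the bytes are passed through untouched.
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path(args[0]);                   // e.g. /logs/access.gz
        FileSystem fs = p.getFileSystem(conf);
        InputStream in = fs.open(p);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
        if (codec != null) {
            in = codec.createInputStream(in);         // wrap with the decompressor
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        for (String line; (line = reader.readLine()) != null; ) {
            System.out.println(line);
        }
        reader.close();
    }
}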



[NYC Hadoop meetup] 12/17 Cascading by Chris Wensel

2008-12-02 Thread Alex Dorman
The next New York Hadoop User Group meeting is scheduled for Wednesday,
December 17th at ContextWeb, 6:30pm.

Join us for a talk on Cascading, an API for defining and executing
complex and fault-tolerant data processing workflows on a Hadoop
cluster. Chris will specifically cover what the processing model looks
like, many of its core features, and the kinds of problems Cascading was
intended to solve or prevent.

To RSVP: http://www.meetup.com/Hadoop-NYC/calendar/9240064/

-Alex