Re: Is Hadoop compatible with IBM JDK 1.5 64 bit for AIX 5?

2008-07-18 Thread Colin Freas
I'm not sure if this is useful info, but I used both the Sun and the IBM JDK
under Linux to run version 0.16.x of Hadoop (I forget the exact minor release)
without any problems.  I did some brief performance testing, didn't see any
significant difference, and then we switched over to the Sun JDK exclusively,
as per the recommendation in the docs.

-Colin

On Fri, Jul 18, 2008 at 9:24 AM, Amber <[EMAIL PROTECTED]> wrote:

> The Hadoop documentation says "Sun's JDK must be used"; this message is
> posted to make sure that there is an official statement about this.


Re: Input/Output Formatters and FileTypes

2008-06-20 Thread Colin Freas
We'd been using text input and output exclusively, but eventually realized
some efficiency improvements by using slightly more sophisticated classes
specific to our application.

Our main use of Hadoop is processing activity logs from a fleet of servers.
We get about 6GB of compressed data per day.  We were running reports based
on different dimensions in our logs.  At first, we were making a pass
through the data for each dimension.  The thing is, if we included the
dimension as part of the key, we could actually do the first MR job we need
in one pass.  But this slightly improved version of our reports still uses
the text input and output for keys, values, and output.

Where we use a custom class is when we process these intermediate results
into a final summary.  Our Summarizer class is the OutputValueClass for our
job, though the output format is still text (which calls the toString
method of our Summarizer).  Our final MR job operates on elements of
Summarizer, after deciding what to do based on the dimension label in the
key and based on certain characteristics of the key and value from the
initial MR job.  This allows us to keep track of 4 independent tallies in
our summarizing MR job.

It was fairly easy to write the OutputValueClass, though our jobs are pretty
straightforward.  It's easy to see how the approach could be extended in more
interesting ways, though.
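
For reference, here's a rough sketch of what such an output value class can
look like.  The class name matches ours, but the fields and tallies below are
illustrative, not our actual code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Summarizer implements Writable {

    // Illustrative tallies; the real class keeps track of four of them.
    private long requests;
    private long errors;

    public void add(long requests, long errors) {
        this.requests += requests;
        this.errors += errors;
    }

    // Called whenever the framework serializes the value (spill, shuffle, etc.).
    public void write(DataOutput out) throws IOException {
        out.writeLong(requests);
        out.writeLong(errors);
    }

    public void readFields(DataInput in) throws IOException {
        requests = in.readLong();
        errors = in.readLong();
    }

    // TextOutputFormat writes the value by calling toString(), which is why
    // the job's output can stay plain text even with a custom value class.
    public String toString() {
        return requests + "\t" + errors;
    }
}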


-Colin




On Fri, Jun 20, 2008 at 1:10 PM, Mathos Marcer <[EMAIL PROTECTED]>
wrote:

> Presumably like most I've started off with mainly using "Text" based
> input and output formatters and using key and values as Text or
> IntWritable.  I've been looking more into the other formatters and
> writable classes and wondering what they would do for me.  To help
> spur some best practices and lessons learned conversations:  What are
> the benefits of the other formatters?  And benefits of MapFiles and
> SequenceFiles?  What are people out there using or have found gave
> them the greatest benefits?
>
> ==
> MM
>


Re: Stack Overflow When Running Job

2008-06-10 Thread Colin Freas
We keep running into this problem.  I've checked out the latest trunk,
applied the patch, and rebuilt the tar.gz.

Then I thought: would I need to run an upgrade on HDFS for this to work?
I'm not sure I'm up for that.

Any idea of the time until 0.17.1?

On Mon, Jun 9, 2008 at 4:22 PM, Runping Qi <[EMAIL PROTECTED]> wrote:

>
> This is a known problem for 0.17.0:
> https://issues.apache.org/jira/browse/HADOOP-3442
>
> It should be fixed in 0.17.1
>
> Runping
>
>
> > -Original Message-
> > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 09, 2008 12:56 PM
> > To: core-user@hadoop.apache.org
> > Subject: Re: Stack Overflow When Running Job
> >
> > We were getting this exact same problem in a really simple MR job, on
> > input
> > produced from a known-working MR job.
> >
> > It seemed to happen intermittently, and we couldn't figure out what
> was up.
> > In the end we solved the problem by increasing the number of maps (80
> to
> > 200; this is a 6-node, 12-core cluster).  Apparently, QuickSort can
> have
> > problems with big chunks of pre-sorted data.  Too much recursion, I
> > believe.
> >
> > This might not be what's going on with you, maybe you're on a cluster
> of
> > some other scale, but this worked for us (and in a setup with Hadoop
> 0.17.)
> >
> > Good luck!
> >
> > -Colin
> >
> > On Mon, Jun 2, 2008 at 3:18 PM, Devaraj Das <[EMAIL PROTECTED]>
> wrote:
> >
> > > Hi, do you have a testcase that we can run to reproduce this?
> Thanks!
> > >
> > > > -Original Message-
> > > > From: jkupferman [mailto:[EMAIL PROTECTED]
> > > > Sent: Monday, June 02, 2008 9:22 AM
> > > > To: core-user@hadoop.apache.org
> > > > Subject: Stack Overflow When Running Job
> > > >
> > > >
> > > > Hi everyone,
> > > > I have a job running that keeps failing with Stack Overflows
> > > > and I really don't see how that is happening.
> > > > The job runs for about 20-30 minutes before one task errors,
> > > > then a few more error and it fails.
> > > > I am running hadoop-17 and I've tried lowering these settings
> > > > to no avail:
> > > > io.sort.factor 50
> > > > io.seqfile.sorter.recordlimit 50
> > > >
> > > > java.io.IOException: Spill failed
> > > >   at
> > > > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
> > > > MapTask.java:594)
> > > >   at
> > > > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
> > > > MapTask.java:576)
> > > >   at
> java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
> > > >   at Group.write(Group.java:68)
> > > >   at GroupPair.write(GroupPair.java:67)
> > > >   at
> > > > org.apache.hadoop.io.serializer.WritableSerialization$Writable
> > > Serializer.serialize(WritableSerialization.java:90)
> > > >   at
> > > > org.apache.hadoop.io.serializer.WritableSerialization$Writable
> > > Serializer.serialize(WritableSerialization.java:77)
> > > >   at
> > > > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTa
> > > > sk.java:434)
> > > >   at MyMapper.map(MyMapper.java:27)
> > > >   at MyMapper.map(MyMapper.java:10)
> > > >   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> > > >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
> > > >   at
> > > >
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> > > > Caused by: java.lang.StackOverflowError
> > > >   at java.io.DataInputStream.readInt(DataInputStream.java:370)
> > > >   at Group.readFields(Group.java:62)
> > > >   at GroupPair.readFields(GroupPair.java:60)
> > > >   at
> > > > org.apache.hadoop.io.WritableComparator.compare(WritableCompar
> > > > ator.java:91)
> > > >   at
> > > > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTa
> > > > sk.java:494)
> > > >   at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
> > > >   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> > > >   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> > > > the above line repeated 200x
> > > >
> > > > I defined a WritableComparable called GroupPair which simply
> > > > holds two Group objects, each of which contains two integers.
> > > > I fail to see how QuickSort could recurse 200+ times since
> > > > that would require an insanely large number of entries, far
> > > > more than the 500 million that had been output at that point.
> > > >
> > > > How is this even possible? And what can be done to fix this?
> > > > --
> > > > View this message in context:
> > > > http://www.nabble.com/Stack-Overflow-When-Running-Job-tp175935
> > > > 94p17593594.html
> > > > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > > >
> > > >
> > >
> > >
>


Simple question: call collect multiple times?

2008-06-09 Thread Colin Freas
Sorry if this is a dumb question, but in all my MR classes, I've only ever
called collect once, and now I find myself wanting to call collect multiple
times.  Looking at the API it seems like there shouldn't be a problem with
that, but I just wanted to make sure.  (...and to seed Google with the
answer for the next Hadooper that wonders.  ;)
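
For that future Hadooper: the answer turns out to be yes, collect may be
called any number of times per map call.  A minimal sketch against the classic
org.apache.hadoop.mapred API (the class and the tokenizing logic are made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MultiCollectMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        // One collect() call per token: any number of calls per map() is fine.
        for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
                output.collect(new Text(token), ONE);
            }
        }
    }
}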

-Colin


Re: Stack Overflow When Running Job

2008-06-09 Thread Colin Freas
We were getting this exact same problem in a really simple MR job, on input
produced from a known-working MR job.

It seemed to happen intermittently, and we couldn't figure out what was up.
In the end we solved the problem by increasing the number of maps (80 to
200; this is a 6-node, 12-core cluster).  Apparently, QuickSort can have
problems with big chunks of pre-sorted data.  Too much recursion, I believe.
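
For anyone who hits this later: the change is just the map-count hint, set
either in the job driver or via mapred.map.tasks.  A rough sketch, with a
made-up class name and hypothetical paths, against the 0.17-era JobConf API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SpillWorkaroundDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SpillWorkaroundDriver.class);
        conf.setJobName("spill-workaround-sketch");
        // More maps means less data in each map's sort buffer, which is what
        // avoided the deep QuickSort recursion for us.
        conf.setNumMapTasks(200);   // a hint: the InputFormat decides the final split count
        // Mapper/reducer classes and key/value types omitted; defaults apply.
        conf.setInputPath(new Path("/user/hadoop/input"));
        conf.setOutputPath(new Path("/user/hadoop/output"));
        JobClient.runJob(conf);
    }
}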

This might not be what's going on with you, maybe you're on a cluster of
some other scale, but this worked for us (and in a setup with Hadoop 0.17.)

Good luck!

-Colin

On Mon, Jun 2, 2008 at 3:18 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:

> Hi, do you have a testcase that we can run to reproduce this? Thanks!
>
> > -Original Message-
> > From: jkupferman [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 02, 2008 9:22 AM
> > To: core-user@hadoop.apache.org
> > Subject: Stack Overflow When Running Job
> >
> >
> > Hi everyone,
> > I have a job running that keeps failing with Stack Overflows
> > and I really don't see how that is happening.
> > The job runs for about 20-30 minutes before one task errors,
> > then a few more error and it fails.
> > I am running hadoop-17 and I've tried lowering these settings
> > to no avail:
> > io.sort.factor 50
> > io.seqfile.sorter.recordlimit 50
> >
> > java.io.IOException: Spill failed
> >   at
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
> > MapTask.java:594)
> >   at
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
> > MapTask.java:576)
> >   at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
> >   at Group.write(Group.java:68)
> >   at GroupPair.write(GroupPair.java:67)
> >   at
> > org.apache.hadoop.io.serializer.WritableSerialization$Writable
> Serializer.serialize(WritableSerialization.java:90)
> >   at
> > org.apache.hadoop.io.serializer.WritableSerialization$Writable
> Serializer.serialize(WritableSerialization.java:77)
> >   at
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTa
> > sk.java:434)
> >   at MyMapper.map(MyMapper.java:27)
> >   at MyMapper.map(MyMapper.java:10)
> >   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
> >   at
> > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> > Caused by: java.lang.StackOverflowError
> >   at java.io.DataInputStream.readInt(DataInputStream.java:370)
> >   at Group.readFields(Group.java:62)
> >   at GroupPair.readFields(GroupPair.java:60)
> >   at
> > org.apache.hadoop.io.WritableComparator.compare(WritableCompar
> > ator.java:91)
> >   at
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTa
> > sk.java:494)
> >   at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
> >   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> >   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> > the above line repeated 200x
> >
> > I defined a WritableComparable called GroupPair which simply
> > holds two Group objects, each of which contains two integers.
> > I fail to see how QuickSort could recurse 200+ times since
> > that would require an insanely large number of entries, far
> > more than the 500 million that had been output at that point.
> >
> > How is this even possible? And what can be done to fix this?
> > --
> > View this message in context:
> > http://www.nabble.com/Stack-Overflow-When-Running-Job-tp175935
> > 94p17593594.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
>


Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Colin Freas
The MR jobs I'm performing are not CPU intensive, so I've always assumed
that they're more IO bound.  Maybe that's an exceptional situation, but I'm
not really sure.

A good motherboard with a local IO channel per disk, feeding individual
cores, with memory partitioned up between them...  and I've heard good
things about Intel's next tock vis-a-vis internal system throughput.

And yes, this would be a task for a paravirtualization system like Xen.
Again, it's just a thought, but with low-end quad-core processors running about
$300, and the potential to cut the number of machines you need to physically
set up by 75%, I'm not sure I'd say it'd only be good for a proof of
concept.

Also, I just set up a dozen-odd boxes that are two generations behind modern
hardware, and promptly blew a fuse.  The TDP on the Xeon 3.06GHz chips I'm
using is 89W.  The TDP on an Intel Q6600 is 65W, and it represents 4 cores.

It's a simple experiment, but I don't have the resources on hand to run it.
I'm curious if anyone has seen the performance impact from the different
setups we're talking about.  I also think you could come close to faking it
with Hadoop config changes.

-Colin


On Fri, Jun 6, 2008 at 12:41 PM, Edward Capriolo <[EMAIL PROTECTED]>
wrote:

> I once asked a wise man in charge of a rather large multi-datacenter
> service, "Have you ever considered virtualization?" He replied, "All
> the CPUs here are pegged at 100%."
>
> There may be applications for this type of processing. I have thought
> about systems like this from time to time. This thinking goes in
> circles. Hadoop is designed for storing and processing on different
> hardware.  Virtualization lets you split a system into sub-systems.
>
> Virtualization is great for proof of concept.
> For example, I have deployed this: I installed VMware with two linux
> systems on my windows host, I followed a hadoop multi-system-tutorial
> running on two vmware nodes. I was able to get the word count
> application working, I also confirmed that blocks were indeed being
> stored on both virtual systems and that processing was being shared
> via MAP/REDUCE.
>
> The processing, however, was slow; of course this is the fault of
> VMware. VMware has a very high emulation overhead. Xen has less
> overhead. LinuxVserver and OpenVZ use software virtualization (they
> have very little (almost no) overhead). Regardless of how much
> overhead, overhead is overhead. Personally I find that VMware falls
> short of its promises.
>


Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Colin Freas
I've wondered about this using single or dual quad-core machines with one
spindle per core, and partitioning them out into 2, 4, 8, whatever virtual
machines, possibly marking each physical box as a "rack".

There would be some initial and ongoing sysadmin costs.  But could this
increase throughput on a small cluster of 2 or 3 boxes with 16 or 24 cores,
running many jobs, by limiting the number of cores each job runs on to, say, 8?
Has anyone tried such a setup?


On Fri, Jun 6, 2008 at 10:30 AM, Brad C <[EMAIL PROTECTED]> wrote:

> Hello Everyone,
>
> I've been brainstorming recently and it's always been in the back of my
> mind: Hadoop offers the functionality of clustering commodity systems
> together, but how would one go about virtualising them apart again?
>
> Kind Regards
>
> Brad :)
>


primary namenode not starting

2008-05-09 Thread Colin Freas
The primary namenode on my cluster seems to have stopped working.  The
secondary name node starts, but the primary fails with the error message
below.

I've scoured the cluster, particularly this node, for changes, but I haven't
found any that I believe would cause this problem.

If anyone has an idea what I might look for, I'd appreciate any help.  Also,
is there any way to increase the verbosity of the logging?

-Colin

---

2008-05-09 11:31:46,484 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = dev04/10.0.2.12
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.16.1
STARTUP_MSG:   build =
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.16 -r 635123;
compiled by 'hadoopqa' on Sun Mar  9 05:44:19 UTC 2008
************************************************************/
2008-05-09 11:31:46,656 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=54310
2008-05-09 11:31:46,665 INFO org.apache.hadoop.dfs.NameNode: Namenode up at:
dev04/10.0.2.12:54310
2008-05-09 11:31:46,671 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-05-09 11:31:46,676 INFO org.apache.hadoop.dfs.NameNodeMetrics:
Initializing NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2008-05-09 11:31:46,761 INFO org.apache.hadoop.fs.FSNamesystem:
fsOwner=hadoop,hadoop
2008-05-09 11:31:46,761 INFO org.apache.hadoop.fs.FSNamesystem:
supergroup=supergroup
2008-05-09 11:31:46,761 INFO org.apache.hadoop.fs.FSNamesystem:
isPermissionEnabled=true
2008-05-09 11:31:47,132 INFO org.apache.hadoop.ipc.Server: Stopping server
on 54310
2008-05-09 11:31:47,135 ERROR org.apache.hadoop.dfs.NameNode:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at
org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

2008-05-09 11:31:47,135 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at dev04/10.0.2.12
************************************************************/


hdfs "injection" node?

2008-04-16 Thread Colin Freas
I have a machine that stores a lot of the data I need to put into my
cluster's HDFS.  It's on the same private network as the nodes, but it isn't
a node itself.

What is the easiest way to have it be able to directly inject the data files
into HDFS, without it acting as a datanode for replicas?
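
For concreteness, the kind of thing I mean by injecting directly: my
understanding is that a machine with just the Hadoop jars and a
hadoop-site.xml whose fs.default.name points at the namenode can write into
HDFS as a pure DFS client, without storing any replicas itself, along these
lines (paths made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsInjector {
    public static void main(String[] args) throws Exception {
        // Picks up conf/hadoop-site.xml from the classpath; fs.default.name
        // should point at the cluster's namenode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // acts as a client only, not a datanode
        fs.copyFromLocalFile(new Path("/data/logs/requests.20080416.gz"),
                             new Path("/user/hadoop/incoming/requests.20080416.gz"));
        fs.close();
    }
}

The shell equivalent, run from that same machine with the same config, should
just be bin/hadoop dfs -copyFromLocal <local path> <hdfs path>.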

I tried an NFS mount, but something (within Hadoop, NFS, my hardware, or
somewhere else) caused it to hang whenever I transferred more than a few
hundred files.

I'm hoping for a more direct solution, like setting up a dummy datanode
without any local storage, or something along those lines.  Just wondering if
there's a trick to that.

-Colin


changing master node?

2008-04-14 Thread Colin Freas
i changed the master node on my cluster because the original crashed hard.

my nodes share an nfs mounted /conf.  i changed all the ip's appropriately,
starting and stopping seems to work fine.

when i do a bin/hadoop dfs -ls i get this message repeating itself over and
over:

08/04/14 06:01:10 INFO ipc.Client: Retrying connect to server: /
10.0.2.13:54310. Already tried 1 time(s).

is there something more i need to do to reconfigure the system?  do i need
to reformat hdfs, with all the accompanying headaches, or is there a simpler
solution?

-colin


Re: Formatting the file system: Misleading hint in Wiki?

2008-04-10 Thread Colin Freas
This has been my experience as well.  This should be mentioned in the
Getting Started pages until resolved.

-colin



On Thu, Apr 10, 2008 at 10:54 AM, Michaela Buergle <
[EMAIL PROTECTED]> wrote:

> Hi all,
> on http://wiki.apache.org/hadoop/GettingStartedWithHadoop - it says:
> "Do not format a running Hadoop filesystem, this will cause all your
> data to be erased."
>
> It seems to me however that currently you better not format a Hadoop
> filesystem at all (after the first time, that is), running or not because:
>
> "Now if the cluster starts with the reformatted name-node, and not
> reformatted data-nodes the data-nodes will fail with
> java.io.IOException: Incompatible namespaceIDs ..."
> (http://issues.apache.org/jira/browse/HADOOP-1212 + personal experience)
>
> If I haven't missed an obvious solution to this problem, I suggest
> mentioning that issue in the Wiki.
>
> micha
>


Re: "incorrect data check

2008-04-09 Thread Colin Freas
I tried a somewhat naive version of this using streaming, and it failed
miserably.

I went with:

bin/hadoop jar ./contrib/streaming/hadoop-0.16.1-streaming.jar -input views
-output md5out -mapper org.apache.hadoop.maprred.lib.IdentityMapper -reducer
"md5sum -b -"

...but I think that's the wrong semantic.

The input directory is a bunch of gz files.  Are they passed to the reducer
(md5sum) as a whole, or are they decompressed and passed?  Are they passed
in on stdin?

Is there a way to ensure they're passed as a complete file?  Would I need to
write my own InputFormat handler, maybe extending FileInputFormat to ensure
the files aren't decompressed or split?
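
for reference, a minimal sketch of that kind of isSplitable override against
the old mapred api (class name made up).  note that the stock TextInputFormat
already declines to split files it recognizes as compressed, and the
decompression happens in the record reader, so a byte-for-byte checksum would
also need a RecordReader that returns the raw file bytes rather than lines:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// keeps every input file in a single split so one map task sees the whole file
public class WholeFileTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}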



-colin


On Tue, Apr 8, 2008 at 6:15 PM, Norbert Burger <[EMAIL PROTECTED]>
wrote:

> Colin, how about writing a streaming mapper which simply runs md5sum on
> each
> file it gets as input?  Run this task along with the identity reducer, and
> you should be able to identify pretty quickly if there's an HDFS corruption
> issue.
>
> Norbert
>
> On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas <[EMAIL PROTECTED]> wrote:
>
> > so, in an attempt to track down this problem, i've stripped out most of
> > the
> > files for input, trying to identify which ones are causing the problem.
> >
> > i've narrowed it down, but i can't pinpoint it.  i keep getting these
> > incorrect data check errors below, but the .gz files test fine with
> gzip.
> >
> > is there some way to run an md5 or something on the files in hdfs and
> > compare it to the checksum of the files on my local machine?
> >
> > i've looked around the lists and through the various options to send to
> > .../bin/hadoop, but nothing is jumping out at me.
> >
> > this is particularly frustrating because it's causing my jobs to fail,
> > rather than skipping the problematic input files.  i've also looked
> > through
> > the conf file and don't see anything similar about skipping bad files
> > without killing the job.
> >
> > -colin
> >
> >
> > On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas <[EMAIL PROTECTED]>
> wrote:
> >
> > > running a job on my 5 node cluster, i get these intermittent
> exceptions
> > in
> > > my logs:
> > >
> > > java.io.IOException: incorrect data check
> > >   at
> >
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
> > Method)
> > >
> > >   at
> >
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
> > >   at
> >
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
> > >   at
> >
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> > >
> > >   at java.io.InputStream.read(InputStream.java:89)
> > >   at
> >
> org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
> > >   at
> >
> org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
> > >
> > >   at
> >
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
> > >   at
> > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
> > >   at
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
> > >
> > >   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> > >   at
> > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084
> > >
> > >
> > > they occur across all the nodes, but i can't figure out which file is
> > > causing the problem.  i'm working on the assumption it's a specific
> file
> > > because it's precisely the same error that occurs on each node.  i've
> > > scoured the logs and can't find any reference to which file caused the
> > > hiccup.  but this is causing the job to fail.  other files are
> processed
> > > without a problem.  the files are 720 .gz files, ~100mb each.  other
> > files
> > > are processed on each node without a problem.  i'm in the middle
> testing
> > the
> > > .gz files, but i don't think the problem is necessarily in the source
> > data,
> > > as much as in when i copied it into hdfs.
> > >
> > > so my questions are these:
> > > is this a known issue?
> > > is there some way to determine which file or files are causing these
> > > exceptions?
> > > is there a way to run something like "gzip -t blah.gz" on the file in
> > > hdfs?  or maybe a checksum?
> > > is there a reason other than a corrupt datafile that would be causing
> > > this?
> > > in the original mapreduce paper, they talk about a mechanism to skip
> > > records that cause problems.  is there a way to have hadoop skip these
> > > problematic files and the associated records and continue with the
> job?
> > >
> > >
> > > thanks,
> > > colin
> > >
> >
>


Re: "incorrect data check

2008-04-08 Thread Colin Freas
so, in an attempt to track down this problem, i've stripped out most of the
files for input, trying to identify which ones are causing the problem.

i've narrowed it down, but i can't pinpoint it.  i keep getting these
incorrect data check errors below, but the .gz files test fine with gzip.

is there some way to run an md5 or something on the files in hdfs and
compare it to the checksum of the files on my local machine?

i've looked around the lists and through the various options to send to
.../bin/hadoop, but nothing is jumping out at me.

this is particularly frustrating because it's causing my jobs to fail,
rather than skipping the problematic input files.  i've also looked through
the conf file and don't see anything similar about skipping bad files
without killing the job.

-colin


On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas <[EMAIL PROTECTED]> wrote:

> running a job on my 5 node cluster, i get these intermittent exceptions in
> my logs:
>
> java.io.IOException: incorrect data check
>   at 
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native 
> Method)
>
>   at 
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>
>   at java.io.InputStream.read(InputStream.java:89)
>   at 
> org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
>   at 
> org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
>
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084
>
>
> they occur across all the nodes, but i can't figure out which file is
> causing the problem.  i'm working on the assumption it's a specific file
> because it's precisely the same error that occurs on each node.  i've
> scoured the logs and can't find any reference to which file caused the
> hiccup.  but this is causing the job to fail.  other files are processed
> without a problem.  the files are 720 .gz files, ~100mb each.  other files
> are processed on each node without a problem.  i'm in the middle of testing the
> .gz files, but i don't think the problem is necessarily in the source data,
> as much as in when i copied it into hdfs.
>
> so my questions are these:
> is this a known issue?
> is there some way to determine which file or files are causing these
> exceptions?
> is there a way to run something like "gzip -t blah.gz" on the file in
> hdfs?  or maybe a checksum?
> is there a reason other than a corrupt datafile that would be causing
> this?
> in the original mapreduce paper, they talk about a mechanism to skip
> records that cause problems.  is there a way to have hadoop skip these
> problematic files and the associated records and continue with the job?
>
>
> thanks,
> colin
>


"incorrect data check

2008-04-08 Thread Colin Freas
running a job on my 5 node cluster, i get these intermittent exceptions in
my logs:

java.io.IOException: incorrect data check
at 
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
Method)
at 
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
at java.io.InputStream.read(InputStream.java:89)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084


they occur across all the nodes, but i can't figure out which file is
causing the problem.  i'm working on the assumption it's a specific file
because it's precisely the same error that occurs on each node.  i've
scoured the logs and can't find any reference to which file caused the
hiccup.  but this is causing the job to fail.  other files are processed
without a problem.  the files are 720 .gz files, ~100mb each.  other files
are processed on each node without a problem.  i'm in the middle of testing the
.gz files, but i don't think the problem is necessarily in the source data,
as much as in when i copied it into hdfs.

so my questions are these:
is this a known issue?
is there some way to determine which file or files are causing these
exceptions?
is there a way to run something like "gzip -t blah.gz" on the file in hdfs?
or maybe a checksum?
is there a reason other than a corrupt datafile that would be causing this?
in the original mapreduce paper, they talk about a mechanism to skip records
that cause problems.  is there a way to have hadoop skip these problematic
files and the associated records and continue with the job?


thanks,
colin


Re: on number of input files and split size

2008-04-06 Thread Colin Freas
i just wanted to reiterate ted's point here.

on my first run through with hadoop i used our log files as they were, which
are designed as small input files for a mysql database instance.  the files
were at most a few megabytes in size, and we had something like 10,000 of
them.  performance was atrocious.  it was really disheartening.

but then i strung them together into files of about 250mb and performance was
fantastic.  then compressing those 250mb files increased performance again.
increased performance as in jobs that were taking hours (on 5 machines)
were now taking 20 minutes.
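
for what it's worth, the stringing-together needs nothing fancy; a rough
sketch (not our actual script; paths and the chunk size are made up) of packing
small logs into ~250mb files before compressing them and copying them into
hdfs:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class LogConcatenator {
    private static final long CHUNK_BYTES = 250L * 1024 * 1024;  // ~250mb per output file

    public static void main(String[] args) throws IOException {
        File[] logs = new File("/data/logs/small").listFiles();
        if (logs == null) return;
        Arrays.sort(logs);                    // deterministic order
        byte[] buf = new byte[64 * 1024];
        int chunk = 0;
        long written = CHUNK_BYTES;           // forces a new chunk on the first file
        FileOutputStream out = null;
        for (File log : logs) {
            if (written >= CHUNK_BYTES) {     // current chunk is full, start the next one
                if (out != null) out.close();
                out = new FileOutputStream("/data/logs/big/chunk-" + (chunk++) + ".log");
                written = 0;
            }
            FileInputStream in = new FileInputStream(log);
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
                written += n;
            }
            in.close();
        }
        if (out != null) out.close();
    }
}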

so, you know, if you're wondering whether it's really worth the trouble to get
the input into larger chunks: my experience, though limited, is that it
absolutely is.

-colin


On Fri, Apr 4, 2008 at 5:20 PM, Prasan Ary <[EMAIL PROTECTED]> wrote:

> So it seems best for my application if I can somehow consolidate smaller
> files into a couple of large files.
>
>  All of my files reside on S3, and I am using the 'distcp' command to copy
> them to hdfs on EC2 before running a MR job. I was thinking it would be nice
> if I could modify distcp such that each EC2 image running 'distcp' on the
> EC2 cluster will concatenate input files into a single file, so that at the
> end of the copy process, we will have as many files as there are machines
> in the cluster.
>
>  Any thoughts on how I should proceed with this? Or whether this is a good
> idea at all?
>
>
>
> Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> The split will depend entirely on the input format that you use and the
> files that you have. In your case, you have lots of very small files so
> the
> limiting factor will almost certainly be the number of files. Thus, you
> will have 1000 splits (one per file).
>
> Your performance, btw, will likely be pretty poor with so many small
> files.
> Can you consolidate them? 100MB of data should probably be in no more than
> a few files if you want good performance. At that, most kinds of
> processing
> will be completely dominated by job startup time. If your jobs are I/O
> bound, they will be able to read 100MB of data in a just a few seconds at
> most. Startup time for a hadoop job is typically 10 seconds or more.
>
>
> On 4/4/08 12:58 PM, "Prasan Ary" wrote:
>
> > I have a question on how input files are split before they are given out
> to
> > Map functions.
> > Say I have an input directory containing 1000 files whose total size is
> 100
> > MB, and I have 10 machines in my cluster and I have configured 10
> > mapred.map.tasks in hadoop-site.xml.
> >
> > 1. With this configuration, do we have a way to know what size each
> split
> > will be of?
> > 2. Does split size depend on how many files there are in the input
> > directory? What if I have only 10 files in input directory, but the
> total size
> > of all these files is still 100 MB? Will it affect split size?
> >
> > Thanks.
> >
> >
>
>
>
>
>


Performance impact of underlying file system?

2008-04-01 Thread Colin Freas
Is the performance of Hadoop impacted by the underlying file system on the
nodes at all?

All my nodes are ext3.  I'm wondering if using XFS, Reiser, or ZFS might
improve performance.

Does anyone have any offhand knowledge about this?

-Colin


Re: reduce task hanging or just slow?

2008-03-31 Thread Colin Freas
I believe that this is exactly what happened.

I'm not sure exactly what happened, but the networking stack on the master
node was all screwed up somehow.  All the machines serve double duty as
development boxes, and they're on two different networks.  The master node
could contact the cluster network but not the open net.  Once we got that
working, things seemed alright, even though before that all the cluster
machines could contact the master node on the private gig-e network.

So, this is a pain in the ass.  Is there a way to get it to bind hostnames
to the ips in my slaves file?  Or just use the ips in slaves outright?  And
is there some way to know for sure this is what the problem is?  Is this
related to HADOOP-1374?  Could that bug be this hostname thing?

-Colin



On Mon, Mar 31, 2008 at 8:58 PM, Mafish Liu <[EMAIL PROTECTED]> wrote:

> Hi:
>I have met the similar problem with you.  Finally, I found that this
> problem was caused by the hostname resolution because hadoop use hostname
> to
> access other nodes.
>To fix this, try open your jobtracker log file( It often resides in
> $HADOOP_HOME/logs/hadoop--jobtracker-.log ) to see if there is a
> error:
> "FATAL org.apache.hadoop.mapred.JobTracker: java.net.UnknownHostException:
> Invalid hostname for server: local"
>If, it is, adding ip-hostname pairs to /etc/hosts files on all of you
> nodes may fix this problem.
>
> Good luck and best regards.
>
> Mafish
>
> --
> [EMAIL PROTECTED]
> Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
>


Re: Hadoop streaming performance problem

2008-03-31 Thread Colin Freas
Really?

I would expect the opposite: for compressed files to process slower.

You're saying that is not the case, and that compressed files actually
increase the speed of jobs?

-Colin

On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
wrote:

> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> provide the input files gzipped. Not great difference (e.g. 50% slower
> when not gzipped, plus it took more than twice as long to upload the
> data to HDFS-on-S3 in the first place), but still probably relevant.
>
> Andreas
>
> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> > I'm running custom map programs written in C++. What the programs do is
> very
> > simple. For example, in program 2, for each input line
> > ID node1 node2 ... nodeN
> > the program outputs
> > node1 ID
> > node2 ID
> > ...
> > nodeN ID
> >
> > Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> >
> > I agree that it depends on the scripts. I tried replicating the
> computation
> > for each input line by 10 times and saw significantly better speedup.
> But it
> > is still pretty bad that Hadoop streaming has such big overhead for
> simple
> > programs.
> >
> > I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> > speed up on the cluster.
> >
> > Lin
> >
> > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> > wrote:
> >
> > > are you running a custom map script or a standard linux command like
> WC?
> > >  If
> > > custom, what does your script do?
> > >
> > > How much ram do you have?  what are you Java memory settings?
> > >
> > > I used the following setup
> > >
> > > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> task
> > > max.
> > >
> > > I got the following results
> > >
> > > WC 30-40% speedup
> > > Sort 40% speedup
> > > Grep 5X slowdown (turns out this was due to what you described
> above...
> > > Grep
> > > is just very highly optimized for command line)
> > > Custom perl script which is essentially a For loop which matches each
> row
> > > of
> > > a dataset to a set of 100 categories) 60% speedup.
> > >
> > > So I do think that it depends on your script... and some other
> settings of
> > > yours.
> > >
> > > Theo
> > >
> > > On Mon, Mar 31, 2008 at 2:00 PM, lin <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > programs. So far the performance has been pretty disappointing.
> > > >
> > > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > > capacity
> > > > of each node is 2. The Hadoop version is 0.15.
> > > >
> > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> in
> > > > standalone (on a single CPU core). Program 2 runs for 5 minutes on the
> > > > Hadoop
> > > > cluster and 4.5 minutes in standalone. Both programs run as map-only
> > > jobs.
> > > >
> > > > I understand that there is some overhead in starting up tasks,
> reading
> > > to
> > > > and writing from the distributed file system. But they do not seem
> to
> > > > explain all the overhead. Most map tasks are data-local. I modified
> > > > program
> > > > 1 to output nothing and saw the same magnitude of overhead.
> > > >
> > > > The output of top shows that the majority of the CPU time is
> consumed by
> > > > Hadoop java processes (e.g.
> org.apache.hadoop.mapred.TaskTracker$Child).
> > > > So
> > > > I added a profile option (-agentlib:hprof=cpu=samples) to
> > > > mapred.child.java.opts.
> > > >
> > > > The profile results show that most of CPU time is spent in the
> following
> > > > methods
> > > >
> > > >   rank   self  accum   count trace method
> > > >
> > > >   1 23.76% 23.76%1246 300472
> > > java.lang.UNIXProcess.waitForProcessExit
> > > >
> > > >   2 23.74% 47.50%1245 300474 java.io.FileInputStream.readBytes
> > > >
> > > >   3 23.67% 71.17%1241 300479 java.io.FileInputStream.readBytes
> > > >
> > > >   4 16.15% 87.32% 847 300478 java.io.FileOutputStream.writeBytes
> > > >
> > > > And their stack traces show that these methods are for interacting
> with
> > > > the
> > > > map program.
> > > >
> > > >
> > > > TRACE 300472:
> > > >
> > > >
> > > >  java.lang.UNIXProcess.waitForProcessExit(
> UNIXProcess.java:Unknownline)
> > > >
> > > >java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > >
> > > >java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > >
> > > > TRACE 300474:
> > > >
> > > >java.io.FileInputStream.readBytes(
> FileInputStream.java:Unknown
> > > > line)
> > > >
> > > >java.io.FileInputStream.read(FileInputStream.java:199)
> > > >
> > > >java.io.BufferedInputStream.read1(BufferedInputStream.java
> :256)
> > > >
> > > >java.io.BufferedInputStream.read(BufferedInputStream.java
> :317)
> > > >
> > > >java.io.BufferedInputStream.fill(BufferedInputStream.java

reduce task hanging or just slow?

2008-03-31 Thread Colin Freas
I've set up a job to run on my small 4 (sometimes 5) node cluster on dual
processor server boxes with 2-8GB of memory.

My job processes 24 100-300MB files that are a day's worth of logs; total
data is about 6GB.

I've modified the word count example to do what I need, and it works fine on
small test files.

I've set the number of map tasks at 200, the number of reduce tasks to 14.
Things seem to go along fine, the map % climbs nicely, along with the
reduce.  Once the map hits 100% though, the reduce % stops increasing.
Right now it's stuck around 58%.  I was hoping changing the number of reduce
tasks would help, but I'm not really sure it did.  I had tried this once
before with the default number of reduce jobs, and I got to 100% (Map) and
14% (Reduce) before I saw this hanging behavior.

I'm just trying to understand what's happening here, and if there's
something I can do to increase the performance, short of adding nodes.  Is
it likely I've set something up incorrectly somewhere?

Any help appreciated.

Thanks!

-Colin


nfs mount hadoop-site?

2008-03-27 Thread Colin Freas
are there any issues with having the hadoop-site.xml in .../conf placed on
an nfs mounted dir that all my nodes have access to?

-colin


Re: MapReduce with related data from disparate files

2008-03-25 Thread Colin Freas
Thanks Ted, Nathan.  Great advice.

So I've been looking at the InputFormat, RecordReader, and InputSplit
interfaces and associated classes and trying to get my head around it.

For the situation I'm in, where I have two types of file, the names are
distinct, and the names actually have time stamps built in.  I have files
like:

requests.20080323.1240
request_data.20080323.1240

There's a business rule that says any data about the requests must be in the
like-named request_data file.

So, if I implemented my own InputFormat and/or RecordReader, is there some
way to get access to the file name that's providing the input, so that I can
direct the input to the appropriate InputFormat/RecordReader/Mapper?  Is
this what I want to do?
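
One thing that looks like it might do part of this without a custom
InputFormat: as I understand it, the job conf a mapper sees carries a
"map.input.file" property with the path of the file behind the current split,
so the map can tag each record by source.  A rough sketch against the classic
mapred API (written with 0.17-style generic signatures, so the type parameters
may need dropping on earlier releases; the tagging scheme and class names are
made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RequestJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private boolean isAncillary;

    public void configure(JobConf job) {
        // Path of the file backing this task's split.
        String inputFile = job.get("map.input.file", "");
        isAncillary = inputFile.contains("request_data.");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        String primaryKey = line.split(",", 2)[0].trim();
        // Tag each record with its source so the reducer can tell them apart.
        output.collect(new Text(primaryKey),
                       new Text((isAncillary ? "D|" : "R|") + line));
    }
}

The reducer would then see, for each primary key, the one "R|" request record
plus any "D|" ancillary records, and could do the flattening there.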

I feel like the different input from the files should go to the same map,
and just manipulate the values associated with the keys common to both
files.  I'm not really sure how to do this.  Are there any more complex
examples of Hadoop setups anywhere?  I looked around, but I've found mostly
low-level tutorial stuff about getting the cluster up and running, but not
so much about subsequently bending it to my will.



-Colin


On Mon, Mar 24, 2008 at 5:18 PM, Nathan Wang <[EMAIL PROTECTED]> wrote:

>
> It's possible to do the whole thing in one round of map/reduce.
> The only requirement is to be able to differentiate between the 2
> different types of input files, possibly using different file name
> extensions.
>
> One of my coworkers wrote a smart InputFormat class that creates a
> different RecordReader for each file type, based on the input file's
> extension.
>
> In each RecordReader, you create a special typed value object for that
> input.  So, in your map method, you collect different value objects from
> different RecordReaders.  In your reduce method, for each key, you do
> necessary processing on the collection based on the value object types.
>
> The main point here is to keep track of the differences from the
> beginning to the end, and process them accordingly.
>
> Nathan
>
> -Original Message-
> From: Colin Freas [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 24, 2008 1:36 PM
> To: core-user@hadoop.apache.org
> Subject: MapReduce with related data from disparate files
>
> I have a cluster of 5 machines up and accepting jobs, and I'm trying to
> work
> out how to design my first MapReduce task for the data I have.
>
> So, I wonder if anyone has any experience with the sort of problem I'm
> trying to solve, and what the best ways to use Hadoop and MapReduce for
> it
> are.
>
> I have two sets of related comma delimited files.  One is a set of
> unique
> records with something like a primary key in one of the fields.  The
> other
> is a set of records keyed to the first with the primary key.  It's
> basically
> a request log, and ancillary data about the request.  Something like:
>
> file 1:
> asdf, 1, 2, 5, 3 ...
> qwer, 3, 6, 2, 7 ...
> zxcv, 2, 3, 6, 4 ...
>
> file 2:
> asdf, 10, 3
> asdf, 3, 2
> asdf, 1, 3
> zxcv, 3, 1
>
> I basically need to flatten this mapping, and then perform some analysis
> on
> the result.  I wrote a processing program that runs on a single machines
> to
> create a map like this:
>
> file 1-2:
> asdf, 1, 2, 5, 3, ... 10, 3, 3, 2, 1, 3
> qwer, 3, 6, 2, 7, ... , , , , ,
> zxcv, 2, 3, 6, 4, ... , , 3, 1, ,
>
> ... where the "flattening" puts in blank values for missing ancillary
> data.
> I then sample this map taking some small number of entire records for
> output, and extrapolate some statistics from the results.
>
> So, what I'd really like to do is figure out exactly what questions I
> need
> to ask, and instead of sampling, do an enumeration.
>
> Is my best bet to create the conflated data file (that I labeled "file
> 1-2"
> above) in one task, then do analysis using another?  Or is it better to
> do
> the conflation and aggregation in one step, and then combine those?
>
> I'm not sure how clear this is, but I believe it gets the gist across.
>
> Any thoughts appreciated, any questions answered.
>
>
>
> -Colin
>


MapReduce with related data from disparate files

2008-03-24 Thread Colin Freas
I have a cluster of 5 machines up and accepting jobs, and I'm trying to work
out how to design my first MapReduce task for the data I have.

So, I wonder if anyone has any experience with the sort of problem I'm
trying to solve, and what the best ways to use Hadoop and MapReduce for it
are.

I have two sets of related comma delimited files.  One is a set of unique
records with something like a primary key in one of the fields.  The other
is a set of records keyed to the first with the primary key.  It's basically
a request log, and ancillary data about the request.  Something like:

file 1:
asdf, 1, 2, 5, 3 ...
qwer, 3, 6, 2, 7 ...
zxcv, 2, 3, 6, 4 ...

file 2:
asdf, 10, 3
asdf, 3, 2
asdf, 1, 3
zxcv, 3, 1

I basically need to flatten this mapping, and then perform some analysis on
the result.  I wrote a processing program that runs on a single machines to
create a map like this:

file 1-2:
asdf, 1, 2, 5, 3, ... 10, 3, 3, 2, 1, 3
qwer, 3, 6, 2, 7, ... , , , , ,
zxcv, 2, 3, 6, 4, ... , , 3, 1, ,

... where the "flattening" puts in blank values for missing ancillary data.
I then sample this map taking some small number of entire records for
output, and extrapolate some statistics from the results.

So, what I'd really like to do is figure out exactly what questions I need
to ask, and instead of sampling, do an enumeration.

Is my best bet to create the conflated data file (that I labeled "file 1-2"
above) in one task, then do analysis using another?  Or is it better to do
the conflation and aggregation in one step, and then combine those?

I'm not sure how clear this is, but I believe it gets the gist across.

Any thoughts appreciated, any questions answered.



-Colin


Re: Master as DataNode

2008-03-21 Thread Colin Freas
yup, got it working with that technique.

pushed it out to 5 machines, things look good.  appreciate the help.

what is it that causes this?  i know i formatted the dfs more than once.  is
that what does it?  or just adding nodes, or...  ?

-colin


On Fri, Mar 21, 2008 at 2:30 PM, Jeff Eastman <[EMAIL PROTECTED]>
wrote:

> I encountered this while I was starting out too, while moving from a
> single
> node cluster to more nodes. I suggest clearing your hadoop-datastore
> directory, reformatting the HDFS and restarting again. You are very close
> :)
> Jeff
>
> > -Original Message-
> > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > Sent: Friday, March 21, 2008 11:18 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: Master as DataNode
> >
> > ah:
> >
> > 2008-03-21 14:06:05,526 ERROR org.apache.hadoop.dfs.DataNode:
> > java.io.IOException: Incompatible namespaceIDs in
> > /var/tmp/hadoop-datastore/hadoop/dfs/data: namenode namespaceID =
> > 2121666262; datanode namespaceID = 2058961420
> >
> >
> > looks like i'm hitting this "Incompatible namespaceID" bug:
> > http://issues.apache.org/jira/browse/HADOOP-1212
> >
> > is there a work around for this?
> >
> > -colin
> >
> >
> > On Fri, Mar 21, 2008 at 1:50 PM, Jeff Eastman <
> [EMAIL PROTECTED]>
> > wrote:
> >
> > > Check your logs. That should work out of the box with the
> configuration
> > > steps you described.
> > >
> > > Jeff
> > >
> > > > -Original Message-
> > > > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > > > Sent: Friday, March 21, 2008 10:40 AM
> > > > To: core-user@hadoop.apache.org
> > > > Subject: Master as DataNode
> > > >
> > > > setting up a simple hadoop cluster with two machines, i've gotten to
> > the
> > > > point where the two machines can see each other, things seem fine,
> but
> > > i'm
> > > > trying to set up the master as both a master and a slave, just for
> > > testing
> > > > purposes.
> > > >
> > > > so, i've put the master into the conf/masters file and the
> conf/slaves
> > > > file.
> > > >
> > > > things seem to work, but there's no DataNode process listed with jps
> > on
> > > > the
> > > > master.  i'm wondering if there's a switch i need to flip to tell
> > hadoop
> > > > to
> > > > use the master as a datanode even if it's in the slaves file?
> > > >
> > > > thanks again.
> > > >
> > > > -colin
> > >
> > >
> > >
>
>
>


Re: Master as DataNode

2008-03-21 Thread Colin Freas
ah:

2008-03-21 14:06:05,526 ERROR org.apache.hadoop.dfs.DataNode:
java.io.IOException: Incompatible namespaceIDs in
/var/tmp/hadoop-datastore/hadoop/dfs/data: namenode namespaceID =
2121666262; datanode namespaceID = 2058961420


looks like i'm hitting this "Incompatible namespaceID" bug:
http://issues.apache.org/jira/browse/HADOOP-1212

is there a work around for this?

-colin


On Fri, Mar 21, 2008 at 1:50 PM, Jeff Eastman <[EMAIL PROTECTED]>
wrote:

> Check your logs. That should work out of the box with the configuration
> steps you described.
>
> Jeff
>
> > -Original Message-
> > From: Colin Freas [mailto:[EMAIL PROTECTED]
> > Sent: Friday, March 21, 2008 10:40 AM
> > To: core-user@hadoop.apache.org
> > Subject: Master as DataNode
> >
> > setting up a simple hadoop cluster with two machines, i've gotten to the
> > point where the two machines can see each other, things seem fine, but
> i'm
> > trying to set up the master as both a master and a slave, just for
> testing
> > purposes.
> >
> > so, i've put the master into the conf/masters file and the conf/slaves
> > file.
> >
> > things seem to work, but there's no DataNode process listed with jps on
> > the
> > master.  i'm wondering if there's a switch i need to flip to tell hadoop
> > to
> > use the master as a datanode even if it's in the slaves file?
> >
> > thanks again.
> >
> > -colin
>
>
>


Master as DataNode

2008-03-21 Thread Colin Freas
setting up a simple hadoop cluster with two machines, i've gotten to the
point where the two machines can see each other, things seem fine, but i'm
trying to set up the master as both a master and a slave, just for testing
purposes.

so, i've put the master into the conf/masters file and the conf/slaves file.

things seem to work, but there's no DataNode process listed with jps on the
master.  i'm wondering if there's a switch i need to flip to tell hadoop to
use the master as a datanode even if it's in the slaves file?

thanks again.

-colin


Re: NFS mounted home, host RSA keys, localhost, strict sshds and bad mojo.

2008-03-21 Thread Colin Freas
ah, yes.  that worked.  thanks!

On Fri, Mar 21, 2008 at 12:48 PM, Natarajan, Senthil <[EMAIL PROTECTED]>
wrote:

> I guess the following files might have a localhost entry; change it to the hostname
>
> /conf/masters
> /conf/slaves
>
>
> -Original Message-
> From: Colin Freas [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 21, 2008 12:25 PM
> To: core-user@hadoop.apache.org
> Subject: NFS mounted home, host RSA keys, localhost, strict sshds and bad
> mojo.
>
> i'm working to set up a cluster across several machines where users' home
> dirs are on an nfs mount.
>
> i set up key authentication for the hadoop user, install all the software
> on
> one node, get everything running, and move on to another node.
>
> once there, however, my sshd complains because the host key associated
> with
> "localhost" is a different machine, and it refuses the connection.
>
> i'm just testing here, so I can remove the localhost entry from
> .ssh/known_hosts, but, is this going to be an issue going forward if the
> home dirs are shared like this?
>
> can i get hadoop to use the ip or the hostname instead of localhost?  i
> scanned the config files, but it didn't jump out at me.
>
>
> -colin
>


NFS mounted home, host RSA keys, localhost, strict sshds and bad mojo.

2008-03-21 Thread Colin Freas
i'm working to set up a cluster across several machines where users' home
dirs are on an nfs mount.

i set up key authentication for the hadoop user, install all the software on
one node, get everything running, and move on to another node.

once there, however, my sshd complains because the host key associated with
"localhost" is for a different machine, and it refuses the connection.

i'm just testing here, so I can remove the localhost entry from
.ssh/known_hosts, but, is this going to be an issue going forward if the
home dirs are shared like this?

can i get hadoop to use the ip or the hostname instead of localhost?  i
scanned the config files, but it didn't jump out at me.


-colin