Re: hadoop jobs take long time to setup

2009-06-28 Thread Stuart White
Although I've never done it, I believe you could manually copy your jar
files out to your cluster, somewhere on Hadoop's classpath; that would
remove the need to copy them to your cluster at the start of each job.

On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou wrote:

> Hi.
>
> Running without a jobtracker makes the job start almost instantly.
> I think it is due to something with the classloader. I use a huge number of
> jar files (jobConf.set("tmpjars", "jar1.jar,jar2.jar")...), which I guess
> need to be loaded every time.
>
> If I issue conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
> JVM live forever?
>
> Cheers
>
> //Marcus
>
> On Sun, Jun 28, 2009 at 9:54 PM, tim robertson wrote:
>
> > How long does it take to start the code locally in a single thread?
> >
> > Can you reuse the JVM so it only starts once per node per job?
> > conf.setNumTasksToExecutePerJvm(-1)
> >
> > Cheers,
> > Tim
> >
> >
> >
> > On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou wrote:
> > > Hi.
> > >
> > > Wonder how one should improve the startup times of a Hadoop job. Some of
> > > my jobs, which have a lot of dependencies in terms of many jar files, take
> > > a long time to start in Hadoop, up to 2 minutes sometimes.
> > > The data input amounts in these cases are negligible, so it seems that
> > > Hadoop has a really high setup cost, which I can live with, but this seems
> > > too much.
> > >
> > > Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
> > > minutes to set it up... 20-30 seconds max would be a lot more reasonable.
> > >
> > > Hints?
> > >
> > > //Marcus
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.he...@tailsweep.com
> > > http://www.tailsweep.com/
> > >
> >
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
>


Re: Confused about partitioning and reducers

2009-06-27 Thread Stuart White
Please disregard this question.  I think I'm mistaken.

On Sat, Jun 27, 2009 at 10:25 AM, Stuart White wrote:

> If I call HashPartitioner.getPartition(), passing a key of 4 and a
> numPartitions of 5, it returns a partition of 4.  (Which is what I would
> expect.)
>
> However, if I have a mapred job, and in my mapper I emit a record with key
> 4, I'm configured to use the HashPartitioner, I have 5 Reducers configured,
> and I'm using the IdentityReducer, the record with key 4 gets handled by
> Reducer #0 (because it gets written out to part-0).
>
> I would have expected a record with key 4 to be handled by reducer #4 (and
> therefore written to part-4) because the HashPartitioner returns 4 for a
> key of 4 and a numPartitions of 5.
>
> Obviously I'm missing something here.  What is the logic for deciding which
> partition of records is handled by which reducer instance?
>
> It can't be random, otherwise map-side join wouldn't work.
>
> Thanks.
>


Confused about partitioning and reducers

2009-06-27 Thread Stuart White
If I call HashPartitioner.getPartition(), passing a key of 4 and a
numPartitions of 5, it returns a partition of 4.  (Which is what I would
expect.)

However, if I have a mapred job, and in my mapper I emit a record with key
4, I'm configured to use the HashPartitioner, I have 5 Reducers configured,
and I'm using the IdentityReducer, the record with key 4 gets handled by
Reducer #0 (because it gets written out to part-0).

I would have expected a record with key 4 to be handled by reducer #4 (and
therefore written to part-4) because the HashPartitioner returns 4 for a
key of 4 and a numPartitions of 5.

Obviously I'm missing something here.  What is the logic for deciding which
partition of records is handled by which reducer instance?

It can't be random, otherwise map-side join wouldn't work.

Thanks.
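
For reference, the behavior the question assumes can be checked directly; a
minimal sketch with an IntWritable key (HashPartitioner, at least in 0.19,
derives the partition from the key's hashCode, and the reducer for partition i
writes the part file numbered i):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class PartitionCheck {
  public static void main(String[] args) {
    HashPartitioner<IntWritable, NullWritable> p =
        new HashPartitioner<IntWritable, NullWritable>();
    IntWritable key = new IntWritable(4);   // IntWritable.hashCode() is just the value
    int partition = p.getPartition(key, NullWritable.get(), 5);
    // Equivalent to (key.hashCode() & Integer.MAX_VALUE) % 5, i.e. 4 here, so
    // this record should land in the reducer that writes the fifth part file.
    System.out.println(partition);
  }
}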


Does balancer ensure a file's replication is satisfied?

2009-06-23 Thread Stuart White
In my Hadoop cluster, I've had several drives fail lately (and they've
been replaced).  Each time a new empty drive is placed in the cluster,
I run the balancer.

I understand that the balancer will redistribute the load of file
blocks across the nodes.

My question is: will balancer also look at the desired replication of
a file, and if the actual replication of a file is less than the
desired (because the file had blocks stored on the lost drive), will
balancer re-replicate those lost blocks?

If not, is there another tool that will ensure the desired replication
factor of files is satisfied?

If this functionality doesn't exist, I'm concerned that I'm slowly,
silently losing my files as I replace drives, and I may not even
realize it.

Thoughts?


Re: which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Stuart White
http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html#Required+Software


On Wed, Jun 10, 2009 at 12:02 PM, Roldano Cattoni  wrote:

> It works, many thanks.
>
> Last question: is this information documented somewhere in the package? I
> was not able to find it.
>
>
>  Roldano
>
>
>
> On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote:
> > Java 1.6.
> >
> > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni wrote:
> >
> > > A very basic question: which Java version is required for hadoop-0.19.1?
> > >
> > > With jre1.5.0_06 I get the error:
> > >  java.lang.UnsupportedClassVersionError: Bad version number in .class file
> > >  at java.lang.ClassLoader.defineClass1(Native Method)
> > >  (..)
> > >
> > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06
> > >
> > >
> > > Thanks in advance for your kind help
> > >
> > >  Roldano
> > >
>


Re: which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Stuart White
Java 1.6.

On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni  wrote:

> A very basic question: which Java version is required for hadoop-0.19.1?
>
> With jre1.5.0_06 I get the error:
>  java.lang.UnsupportedClassVersionError: Bad version number in .class file
>  at java.lang.ClassLoader.defineClass1(Native Method)
>  (..)
>
> By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06
>
>
> Thanks in advance for your kind help
>
>  Roldano
>


Renaming all nodes in Hadoop cluster

2009-06-02 Thread Stuart White
Is it possible to rename all nodes in a Hadoop cluster and not lose
the data stored on HDFS?  Of course I'll need to update the "masters"
and "slaves" files, but I'm not familiar with how HDFS tracks where it
has written all the blocks of the files.  Is it possible to retain the
data written to HDFS when renaming all nodes in the cluster, and if
so, what additional configuration changes, if any, are required?


Re: InputFormat for fixed-width records?

2009-05-28 Thread Stuart White
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley  wrote:

>
> The update to the terasort example has an InputFormat that does exactly
> that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
> easy to write, but I should upload it soon. The output types are Text, but
> they just have the binary data in them.
>

Would you mind uploading it or sending it to the list?


InputFormat for fixed-width records?

2009-05-28 Thread Stuart White
I need to process a dataset that contains text records of fixed length
in bytes.  For example, each record may be 100 bytes in length, with
the first field being the first 10 bytes, the second field being the
second 10 bytes, etc...  There are no newlines in the file.  Field
values have been either whitespace-padded or truncated to fit within
the specific locations in these fixed-width records.

Does Hadoop have an InputFormat to support processing of such files?
I looked but couldn't find one.

Of course, I could pre-process the file (outside of Hadoop) to put
newlines at the end of each record, but I'd prefer not to require such
a prep step.

Thanks.
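
Not the terasort code Owen mentions, but a minimal sketch of a record reader
along those lines, assuming 100-byte records and splits aligned on record
boundaries (for example by making the input format non-splittable or sizing
splits to a multiple of the record length):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class FixedWidthRecordReader implements RecordReader<LongWritable, BytesWritable> {
  private static final int RECORD_LEN = 100;   // assumed fixed record length
  private final FSDataInputStream in;
  private final long start;
  private final long end;
  private long pos;

  public FixedWidthRecordReader(Configuration conf, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(conf);
    in = fs.open(split.getPath());
    start = split.getStart();                  // assumed to be a multiple of RECORD_LEN
    end = start + split.getLength();
    pos = start;
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos + RECORD_LEN > end) {
      return false;                            // no whole record left in this split
    }
    byte[] buf = new byte[RECORD_LEN];
    in.readFully(pos, buf);                    // positioned read of one record
    key.set(pos);                              // key = byte offset of the record
    value.set(buf, 0, RECORD_LEN);
    pos += RECORD_LEN;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public float getProgress() {
    return end == start ? 1.0f : (pos - start) / (float) (end - start);
  }
  public void close() throws IOException { in.close(); }
}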


Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Stuart White
I need to do a reduce-side join of two datasets.  It's a many-to-many
join; that is, each dataset can contain multiple records with any given
key.

Every description of a reduce-side join I've seen involves
constructing the keys output by your mapper such that records from one
dataset will be presented to the reducers before records from the
second dataset.  I should "hold on" to the value from the one dataset
and remember it as I iterate across the values from the second
dataset.

This seems like it only works well for one-to-many joins (when one of
your datasets will only have a single record with any given key).
This scales well because you're only remembering one value.

In a many-to-many join, if you apply this same algorithm, you'll need
to remember all values from one dataset, which of course will be
problematic (and won't scale) when dealing with large datasets with
large numbers of records with the same keys.

Does an efficient algorithm exist for a many-to-many reduce-side join?
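
To make that pattern concrete, a minimal sketch of the one-to-many case
described above (not an answer to the many-to-many question): the mapper is
assumed to tag values with an "A|" or "B|" prefix, and key grouping is assumed
to be arranged so the single dataset-A value arrives first for each key.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OneToManyJoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String held = null;                        // the single record from dataset A
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("A|")) {
        held = v.substring(2);                 // remember the A side (one value only)
      } else if (held != null) {
        // stream the B side past the held A value, emitting one joined record each
        output.collect(key, new Text(held + "\t" + v.substring(2)));
      }
    }
  }
}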


Re: Map-side join: Sort order preserved?

2009-05-14 Thread Stuart White
On Thu, May 14, 2009 at 10:25 AM, jason hadoop  wrote:
> If you put up a discussion question on www.prohadoopbook.com, I will fill in
> the example on how to do this.

Done.  Thanks!

http://www.prohadoopbook.com/forum/topics/preserving-partition-file


Map-side join: Sort order preserved?

2009-05-14 Thread Stuart White
I'm implementing a map-side join as described in chapter 8 of "Pro
Hadoop".  I have two files that have been partitioned using the
TotalOrderPartitioner on the same key into the same number of
partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
one Mapper will handle an entire partition.

I want the output to be written in the same partitioned, total sort
order.  If possible, I want to accomplish this by setting my
NumReducers to 0 and having the output of my Mappers written directly
to HDFS, thereby skipping the partition/sort step.

My question is this: Am I guaranteed that the Mapper that processes
part-0 will have its output written to the output file named
part-0, the Mapper that processes part-1 will have its output
written to part-1, etc... ?

If so, then I can preserve the partitioning/sort order of my input
files without re-partitioning and re-sorting.

Thanks.


Access counters from within Reducer#configure() ?

2009-05-11 Thread Stuart White
I'd like to be able to access my job's counters from within my
Reducer's configure() method (so I can know how many records were
output from my mappers).  Is this possible?  Thanks!


Re: Hadoop Summit 2009 - Open for registration

2009-05-06 Thread Stuart White
Any chance these presentations (as well as the Cloudera ones on the
following day) will be recorded and uploaded to YouTube?

On Tue, May 5, 2009 at 4:10 PM, Ajay Anand  wrote:
> This year's Hadoop Summit
> (http://developer.yahoo.com/events/hadoopsummit09/) is confirmed for
> June 10th at the Santa Clara Marriott, and is now open for registration.
>
>
>
> We have a packed agenda, with three tracks - for developers,
> administrators, and one focused on new and innovative applications using
> Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera,
> Facebook, HP, Microsoft, and the Yahoo! team, as well as leading
> universities including UC Berkeley, CMU, Cornell, U of Maryland, U of
> Nebraska and SUNY.
>
>
>
> From our experience last year with the rush for seats, I would encourage
> people to register early at http://hadoopsummit09.eventbrite.com/
>
>
>
> Looking forward to seeing you at the summit!
>
>
>
> Ajay
>
>


Re: Multiple outputs and getmerge?

2009-04-21 Thread Stuart White
On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi  wrote:
>
> I once used MultipleOutputFormat and created
>   (mapred.work.output.dir)/type1/part-_
>   (mapred.work.output.dir)/type2/part-_
>    ...
>
> And JobTracker took care of the renaming to
>   (mapred.output.dir)/type{1,2}/part-__
>
> Would that work for you?

Can you please explain this in more detail?  It looks like you're
using MultipleOutputFormat for *both* of your outputs?  So, you simply
don't use the OutputCollector passed as a parm to Mapper#map()?
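
For what it's worth, here is my reading of that suggestion as a minimal sketch
(not Koji's actual code): every record still goes through the normal
OutputCollector, and a MultipleTextOutputFormat subclass picks a per-type
subdirectory under the job's output directory. The key-prefix test is purely
illustrative.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class TypeRoutingOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // 'name' is the default leaf name for this task's output file; prefixing a
    // directory puts each record type under its own subdirectory.
    String type = key.toString().startsWith("EXCEPTION") ? "type2" : "type1";
    return type + "/" + name;
  }
}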


Re: Multiple outputs and getmerge?

2009-04-21 Thread Stuart White
On Tue, Apr 21, 2009 at 12:06 PM, Todd Lipcon  wrote:
> Would dfs -cat do what you need? e.g:
>
> ./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* >
> /tmp/exceptions-merged

Yes, that would work.  Thanks for the suggestion.


Re: Put computation in Map or in Reduce

2009-04-20 Thread Stuart White
Unless you need the hashing/sorting provided by the reduce phase, I'd
recommend placing your logic in your mapper and, when setting up your
job, calling JobConf#setNumReduceTasks(0), so that the reduce phase
won't be executed.  In that case, any records emitted by your mapper
will be written to the output.

http://hadoop.apache.org/core/docs/r0.19.1/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)
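
A minimal driver sketch of that setup (IdentityMapper stands in for the real
Tika extraction mapper, and the paths are taken from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    // IdentityMapper stands in for the real extraction mapper (e.g. one calling Tika)
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);           // no reduce phase: mapper output goes straight to HDFS
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}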


On Mon, Apr 20, 2009 at 10:25 PM, Mark Kerzner  wrote:
> Hi,
>
> in an MR step, I need to extract text from various files (using Tika). I
> have put text extraction into reduce(), because I am writing the extracted
> text to the output on HDFS. But now it occurs to me that I might as well
> have put it into map() and used the default reduce(), which will write every
> map() result out. Is that true?
>
> Thank you,
> Mark
>


Multiple outputs and getmerge?

2009-04-20 Thread Stuart White
I've written a MR job with multiple outputs.  The "normal" output goes
to files named part-X and my secondary output records go to files
I've chosen to name "ExceptionDocuments" (and therefore are named
"ExceptionDocuments-m-X").

I'd like to pull merged copies of these files to my local filesystem
(two separate merged files, one containing the "normal" output and one
containing the ExceptionDocuments output).  But, since hadoop lands
both of these outputs to files residing in the same directory, when I
issue "hadoop dfs -getmerge", what I get is a file that contains both
outputs.

To get around this, I have to move files around on HDFS so that my
different outputs are in different directories.

Is this the best/only way to deal with this?  It would be better if
hadoop offered the option of writing different outputs to different
output directories, or if getmerge offered the ability to specify a
file prefix for files desired to be merged.

Thanks!


Re: Can we somehow read from the HDFS without converting it to local?

2009-04-10 Thread Stuart White
Not sure if this is what you're looking for...

http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample


On Thu, Apr 9, 2009 at 10:56 PM, Sid123  wrote:
>
> I need to reuse the O/P of my DFS file without copying to local. Is there a
> way?
>
>


Re: Coordination between Mapper tasks

2009-03-28 Thread Stuart White
> You might want to look at a memcached solution some students and I worked
> out for exactly this problem.

Thanks, Jimmy!  This paper describes my problem exactly.

I started working to implement the memcached solution you describe,
and I've run into a small problem.  I've described it on the
spymemcached forum:

http://groups.google.com/group/spymemcached/browse_thread/thread/7b4d82bca469ed20

Essentially, it seems the keys are being hashed inconsistently by
spymemcached across runs.  This, of course, will result in
inconsistent/invalid results.

Did you guys run into this?  Since I'm new to memcached, I'm hoping
that this is simply something I don't understand or am overlooking.


Re: Coordination between Mapper tasks

2009-03-21 Thread Stuart White
Thanks to everyone for your feedback.  I'm unfamiliar with many of the
technologies you've mentioned, so it may take me some time to digest
all your responses.  The first thing I'm going to look at is Ted's
suggestion of a pure map-reduce solution by pre-joining my data with
my lookup values.

On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley  wrote:
> On Thu, Mar 19, 2009 at 6:42 PM, Stuart White wrote:
>
>>
>> My process requires a large dictionary of terms (~ 2GB when loaded
>> into RAM).  The terms are looked-up very frequently, so I want the
>> terms memory-resident.
>>
>> So, the problem is, I want 3 processes (to utilize CPU), but each
>> process requires ~2GB, but my nodes don't have enough memory to each
>> have their own copy of the 2GB of data.  So, I need to somehow share
>> the 2GB between the processes.
>
>
> I would recommend using the multi-threaded map runner. Have 1 map/node and
> just use 3 worker threads that all consume the input. The only disadvantage
> is that it works best for cpu-heavy loads (or maps that are doing crawling,
> etc.), since you only have one record reader for all three of the map
> threads.
>
> In the longer term, it might make sense to enable parallel jvm reuse in
> addition to serial jvm reuse.
>
> -- Owen
>


Re: Coordination between Mapper tasks

2009-03-19 Thread Stuart White
The nodes in my cluster have 4 cores & 4 GB RAM.  So, I've set
mapred.tasktracker.map.tasks.maximum to 3 (leaving 1 core for
"breathing room").

My process requires a large dictionary of terms (~ 2GB when loaded
into RAM).  The terms are looked-up very frequently, so I want the
terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each
process requires ~2GB, but my nodes don't have enough memory to each
have their own copy of the 2GB of data.  So, I need to somehow share
the 2GB between the processes.

What I have currently implemented is a standalone RMI service that,
during startup, loads the 2GB dictionaries.  My mappers are simply RMI
clients that call this RMI service.

This works just fine.  The only problem is that my standalone RMI
service is totally "outside" Hadoop.  I have to ssh onto each of the
nodes, start/stop/reconfigure the services manually, etc...

So, I was thinking that, at job startup, the processes on each node
would (using ZooKeeper) elect a leader responsible for hosting the 2GB
dictionaries.  This process would load the dictionaries and share them
via RMI.  The other processes would recognize that another process on
the box is the leader, and they would act as RMI clients to that
process.

To make this work, I'm calling conf.setNumTasksToExecutePerJvm(-1) so
that Hadoop does not create new JVMs for each task.

Also note that the processes are "grouped" by node; that is, the
ZooKeeper path that I'll use for coordination will include the
hostname, so that only processes on the same node will compete for
leadership.

Anyway, in short, I was looking for a way to elect a leader process
per node responsible for hosting/sharing a large amount of
memory-resident data via RMI.

Hopefully that made sense...


Coordination between Mapper tasks

2009-03-18 Thread Stuart White
I'd like to implement some coordination between Mapper tasks running
on the same node.  I was thinking of using ZooKeeper to provide this
coordination.

I think I remember hearing that MapReduce and/or HDFS use ZooKeeper
under-the-covers.

So, I'm wondering... in my Mappers, if I want distributed
coordination, can I "piggy-back" onto the ZooKeeper instance being
used by the underlying MapRed/HDFS?  The benefit being that I don't
need to create/configure/run my own ZooKeeper instance.


Re: Release batched-up output records at end-of-job?

2009-03-17 Thread Stuart White
Yeah, I thought of that, but I was concerned that, even if it did
work, if it wasn't guaranteed behavior, that it might stop working in
a future release.  I'll go ahead and give that a try.

Can anybody provide details on this new API?

Thanks for the response!
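
Concretely, a minimal sketch of the approach quoted below; the batch size and
the per-batch processing are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BatchingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private static final int BATCH_SIZE = 1000;               // placeholder block size
  private final List<Text> batch = new ArrayList<Text>();
  private OutputCollector<Text, Text> out;                  // remembered across map() calls

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    out = output;                          // same instance for the whole task
    batch.add(new Text(value));
    if (batch.size() >= BATCH_SIZE) {
      processBatch();
    }
  }

  public void close() throws IOException {
    if (!batch.isEmpty()) {
      processBatch();                      // flush the final partial block
    }
  }

  private void processBatch() throws IOException {
    for (Text t : batch) {                 // placeholder "processing": emit as-is
      out.collect(t, t);
    }
    batch.clear();
  }
}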

On Tue, Mar 17, 2009 at 7:29 AM, Jingkei Ly  wrote:
> You should be able to keep a reference to the OutputCollector provided
> to the #map() method, and then use it in the #close() method.
>
> I believe that there's a new API that will actually provide the output
> collector to the close() method via a context object, but in the mean
> time I think the above should work.
>
> -----Original Message-----
> From: Stuart White [mailto:stuart.whi...@gmail.com]
> Sent: 17 March 2009 12:13
> To: core-user@hadoop.apache.org
> Subject: Release batched-up output records at end-of-job?
>
> I have a mapred job that simply performs data transformations in its
> Mapper.  I don't need sorting or reduction, so I don't use a Reducer.
>
> Without getting too detailed, the nature of my processing is such that
> it is much more efficient if I can process blocks of records
> at-a-time.  So, what I'd like to do is, in my Mapper, in the map()
> function, simply add the incoming record to a list, and once that list
> reaches a certain size, process the batched-up records, and then call
> output.collect() multiple times to release the output records, each
> corresponding to one of the input records.
>
> At the end of the job, my Mappers will have partially full blocks of
> records.  I'd like to go ahead and process these blocks at end-of-job,
> regardless of their sizes, and release the corresponding output
> records.
>
> How can I accomplish this?  In my Mapper#map(), I have no way of
> knowing whether a record is the final record.  The only end-of-job
> hook that I'm aware of is for my Mapper to override
> MapReduceBase#close(), but when in that method, there is no
> OutputCollector available.
>
> Is it possible to batch-up records, and at end-of-job, process and
> release any final partial blocks?
>
> Thanks!
>


Release batched-up output records at end-of-job?

2009-03-17 Thread Stuart White
I have a mapred job that simply performs data transformations in its
Mapper.  I don't need sorting or reduction, so I don't use a Reducer.

Without getting too detailed, the nature of my processing is such that
it is much more efficient if I can process blocks of records
at-a-time.  So, what I'd like to do is, in my Mapper, in the map()
function, simply add the incoming record to a list, and once that list
reaches a certain size, process the batched-up records, and then call
output.collect() multiple times to release the output records, each
corresponding to one of the input records.

At the end of the job, my Mappers will have partially full blocks of
records.  I'd like to go ahead and process these blocks at end-of-job,
regardless of their sizes, and release the corresponding output
records.

How can I accomplish this?  In my Mapper#map(), I have no way of
knowing whether a record is the final record.  The only end-of-job
hook that I'm aware of is for my Mapper to override
MapReduceBase#close(), but when in that method, there is no
OutputCollector available.

Is it possible to batch-up records, and at end-of-job, process and
release any final partial blocks?

Thanks!


Controlling maximum # of tasks per node on per-job basis?

2009-03-13 Thread Stuart White
My cluster nodes have 2 dual-core processors, so, in general, I want
to configure my nodes with a maximum of 3 task processes executed per
node at a time.

But, for some jobs, my tasks load large amounts of memory, and I
cannot fit 3 such tasks on a single node.  For these jobs, I'd like to
enforce running a maximum of 1 task process per node at a time.

I've tried to enforce this by setting
mapred.tasktracker.map.tasks.maximum at runtime, but I see it has no
effect, because this is a configuration for the TaskTracker, which is
of course already running before my job starts.

Is there no way to configure a maximum # of map tasks per node on a
per-job basis?

Thanks!


Not a host:port pair when running balancer

2009-03-11 Thread Stuart White
I've been running hadoop-0.19.0 for several weeks successfully.

Today, for the first time, I tried to run the balancer, and I'm receiving:

java.lang.RuntimeException: Not a host:port pair: hvcwydev0601

In my hadoop-site.xml, I have this:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hvcwydev0601/</value>
</property>

What do I need to change to get the balancer to work?  It seems I need
to add a port to fs.default.name.  If so, what port?  Can I just pick
any port?  If I specify a port, do I need to specify any other parms
accordingly?

I searched the forum, and found a few posts on this topic, but it
seems that the configuration parms have changed over time, so I'm not
sure what the current correct configuration is.

Also, if fs.default.name is supposed to have a port, I'll point out
that the docs don't say so:
http://hadoop.apache.org/core/docs/r0.19.1/cluster_setup.html

The example given for fs.default.name is "hdfs://hostname/".

Thanks!
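
For reference, a port-qualified value would look something like the following;
9000 is only the port the quickstart docs happen to use, and any free port
should work as long as the NameNode and every client and daemon are configured
with the same value:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hvcwydev0601:9000/</value>
</property>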


OT: How to search mailing list archives?

2009-03-08 Thread Stuart White
This is slightly off-topic, and I realize this question is not
specific to Hadoop, but what is the best way to search the mailing
list archives?  Here's where I'm looking:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/

I don't see any way to search the archives.  Am I missing something?
Is there another archive site I should be looking at?

Thanks!


Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
On Tue, Mar 3, 2009 at 9:16 PM, Nick Cen  wrote:
> have you tried MultipleOutputFormat and its subclasses?

Nope (didn't know it existed).  I'll take a look at it.

Both of these suggestions sound great.  Thanks for the tips!


Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
I have a large amount of data, from which I'd like to extract multiple
different types of data, writing each type of data to different sets
of output files.  What's the best way to accomplish this?  (I should
mention, I'm only using a mapper.  I have no need for sorting or
reduction.)

Of course, if I only wanted one output file, I could just .collect() the
output from my mapper and let mapreduce write the output for me.  But,
to get multiple output files, the only way I can see is to manually
write the files myself from within my mapper.  If that's the correct
way, then how can I get a unique filename for each mapper instance?
Obviously hadoop has solved this problem, because it writes out its
partition files (part-0, etc...) with unique numbers.  Is there a
way for my mappers to get this unique number being used so they can
use it to ensure a unique filename?

Thanks!
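
If the manual-file route is taken anyway, one way to get a unique per-task
number (an assumption on my part, not something confirmed in this thread) is
the task-local "mapred.task.partition" property, read in configure():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UniquelyNamedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private String uniqueSuffix;

  public void configure(JobConf job) {
    // "mapred.task.partition" is the index Hadoop itself uses to number part
    // files; it is set per task in the localized job configuration.
    int partition = job.getInt("mapred.task.partition", 0);
    uniqueSuffix = String.format("%05d", partition);   // e.g. "00000", "00001", ...
    // a side file could then be named something like "mydata-" + uniqueSuffix
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new Text(uniqueSuffix), value);     // placeholder body
  }
}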


MapReduce jobs with expensive initialization

2009-02-28 Thread Stuart White
I have a mapreduce job that requires expensive initialization (loading
of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm to -1 to
force mapreduce to reuse JVMs when executing tasks.

How I've been performing my initialization is, in my mapper, I
override MapReduceBase#configure, read my parms from the JobConf, and
load my dictionaries.

It appears, from the tests I've run, that even though
NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
are being created for each task, and therefore I'm still re-running
this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive
initialization per-task?  Should I move my initialization code out of
my mapper class and into my "main" class?  If so, how do I pass
references to the loaded dictionaries from my main class to my mapper?

Thanks!
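
One common workaround (a suggestion, not something from this thread): with JVM
reuse enabled, static fields survive from task to task even though Hadoop
constructs a fresh Mapper instance per task, so the load can be guarded by a
static reference. A minimal sketch, with the actual dictionary load stubbed out:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DictionaryMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  // static, so it survives across tasks when the JVM is reused
  private static Map<String, String> dictionary;

  public void configure(JobConf job) {
    synchronized (DictionaryMapper.class) {
      if (dictionary == null) {               // only the first task in this JVM pays the cost
        dictionary = loadDictionaries(job);   // placeholder for the real (expensive) load
      }
    }
  }

  private static Map<String, String> loadDictionaries(JobConf job) {
    return new java.util.HashMap<String, String>();   // stand-in implementation
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String hit = dictionary.get(value.toString());    // placeholder lookup
    if (hit != null) {
      output.collect(value, new Text(hit));
    }
  }
}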


Re: General questions about Map-Reduce

2009-01-12 Thread Stuart White
> On Sun, Jan 11, 2009 at 9:05 PM, tienduc_dinh wrote:
>
> Is there any article which describes it ?
>

I'd also recommend Google's MapReduce whitepaper:

http://labs.google.com/papers/mapreduce.html


Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

2008-12-30 Thread Stuart White
I agree.  Using a List seems to make more sense.

FYI... I opened a jira for this:
https://issues.apache.org/jira/browse/HADOOP-4864

On Tue, Dec 30, 2008 at 3:53 PM, Jason Venner  wrote:

> The path separator is a major issue with a number of items in the
> configuration data set that are multiple items packed together via the path
> separator.
> the class path
> the distributed cache
> the input path set
>
> all suffer from the path.separator issue for 2 reasons:
> 1. the difference across jvms, as indicated in the previous email item
> (I had missed this!)
> 2. separator characters that happen to be embedded in the individual
> elements are not escaped before the item is added to the existing set.
>
> For all of the pain we have with these packed items, it may be simpler to
> serialize a List for multi element items rather than packing them
> with the path.separator system property item.
>
>
>
> Aaron Kimball wrote:
>
>> Hi Stuart,
>>
>> Good sleuthing out that problem :) The correct way to submit patches is to
>> file a ticket on JIRA (https://issues.apache.org/jira/browse/HADOOP). Create
>> an account, create a new issue describing the bug, and then attach the patch
>> file. There'll be a discussion there and others can review your patch and
>> include it in the codebase.
>>
>> Cheers,
>> - Aaron
>>
>> On Fri, Dec 12, 2008 at 12:14 PM, Stuart White wrote:
>>
>>
>>
>>> Ok, I'll answer my own question.
>>>
>>> This is caused by the fact that hadoop uses
>>> system.getProperty("path.separator") as the delimiter in the list of
>>> jar files passed via -libjars.
>>>
>>> If your job spans platforms, system.getProperty("path.separator")
>>> returns a different delimiter on the different platforms.
>>>
>>> My solution is to use a comma as the delimiter, rather than the
>>> path.separator.
>>>
>>> I realize comma is, perhaps, a poor choice for a delimiter because it
>>> is valid in filenames on both Windows and Linux, but the -libjars uses
>>> it as the delimiter when listing the additional required jars.  So, I
>>> figured if it's already being used as a delimiter, then it's
>>> reasonable to use it internally as well.
>>>
>>> I've attached a patch (against 0.19.0) that applies this change.
>>>
>>> Now, with this change, I can submit hadoop jobs (requiring multiple
>>> supporting jars) from my Windows laptop (via cygwin) to my 10-node
>>> Linux hadoop cluster.
>>>
>>> Any chance this change could be applied to the hadoop codebase?
>>>
>>>
>>>
>>
>>
>>
>


Simple data transformations in Hadoop?

2008-12-13 Thread Stuart White
(I'm quite new to hadoop and map/reduce, so some of these questions
might not make complete sense.)

I want to perform simple data transforms on large datasets, and it
seems Hadoop is an appropriate tool.  As a simple example, let's say I
want to read every line of a text file, uppercase it, and write it
out.

First question: would Hadoop be an appropriate tool for something like this?

What is the best way to model this type of work in Hadoop?

I'm thinking my mappers will accept a Long key that represents the
byte offset into the input file, and a Text value that represents the
line in the file.

I *could* simply uppercase the text lines and write them to an output
file directly in the mapper (and not use any reducers).  So, there's a
question: is it considered bad practice to write output files directly
from mappers?

Assuming it's advisable in this example to write a file directly in
the mapper - how should the mapper create a unique output partition
file name?  Is there a way for a mapper to know its index in the total
# of mappers?

Assuming it's inadvisable to write a file directly in the mapper - I
can output the records to the reducers using the same key and using
the uppercased data as the value.  Then, in my reducer, should I write
a file?  Or should I collect() the records in the reducers and let
hadoop write the output?

If I let hadoop write the output, is there a way to prevent hadoop
from writing the key to the output file?  I may want to perform
several transformations, one-after-another, on a set of data, and I
don't want to place a superfluous key at the front of every record for
each pass of the data.

I appreciate any feedback anyone has to offer.
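
On the map-only and superfluous-key points, a minimal sketch of the uppercase
example, assuming the driver calls setNumReduceTasks(0) along with
setOutputKeyClass(NullWritable.class) and setOutputValueClass(Text.class);
each map task then writes its own uniquely numbered part file, and
TextOutputFormat drops a NullWritable key so only the value appears in the
output:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UppercaseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<NullWritable, Text> output, Reporter reporter)
      throws IOException {
    // emit a NullWritable key so TextOutputFormat writes the value alone
    // (it skips the key and the tab separator when the key is NullWritable)
    output.collect(NullWritable.get(), new Text(line.toString().toUpperCase()));
  }
}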


Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

2008-12-12 Thread Stuart White
Ok, I'll answer my own question.

This is caused by the fact that hadoop uses
system.getProperty("path.separator") as the delimiter in the list of
jar files passed via -libjars.

If your job spans platforms, system.getProperty("path.separator")
returns a different delimiter on the different platforms.

My solution is to use a comma as the delimiter, rather than the path.separator.

I realize comma is, perhaps, a poor choice for a delimiter because it
is valid in filenames on both Windows and Linux, but the -libjars uses
it as the delimiter when listing the additional required jars.  So, I
figured if it's already being used as a delimiter, then it's
reasonable to use it internally as well.

I've attached a patch (against 0.19.0) that applies this change.

Now, with this change, I can submit hadoop jobs (requiring multiple
supporting jars) from my Windows laptop (via cygwin) to my 10-node
Linux hadoop cluster.

Any chance this change could be applied to the hadoop codebase?
diff -ur src/core/org/apache/hadoop/filecache/DistributedCache.java src_working/core/org/apache/hadoop/filecache/DistributedCache.java
--- src/core/org/apache/hadoop/filecache/DistributedCache.java 2008-11-13 21:09:36.0 -0600
+++ src_working/core/org/apache/hadoop/filecache/DistributedCache.java 2008-12-12 14:07:48.865460800 -0600
@@ -710,7 +710,7 @@
     throws IOException {
     String classpath = conf.get("mapred.job.classpath.archives");
     conf.set("mapred.job.classpath.archives", classpath == null ? archive
-             .toString() : classpath + System.getProperty("path.separator")
+             .toString() : classpath + ","
              + archive.toString());
     FileSystem fs = FileSystem.get(conf);
     URI uri = fs.makeQualified(archive).toUri();
@@ -727,8 +727,7 @@
     String classpath = conf.get("mapred.job.classpath.archives");
     if (classpath == null)
       return null;
-    ArrayList list = Collections.list(new StringTokenizer(classpath, System
-                                      .getProperty("path.separator")));
+    ArrayList list = Collections.list(new StringTokenizer(classpath, ","));
     Path[] paths = new Path[list.size()];
     for (int i = 0; i < list.size(); i++) {
       paths[i] = new Path((String) list.get(i));


-libjars with multiple jars broken when client and cluster reside on different OSs?

2008-12-11 Thread Stuart White
I've written a simple map/reduce job that demonstrates a problem I'm
having.  Please see attached example.

Environment:
  hadoop 0.19.0
  cluster resides across linux nodes
  client resides on cygwin

To recreate the problem I'm seeing, do the following:

- Setup a hadoop cluster on linux

- Perform the remaining steps on cygwin, with a hadoop installation
configured to point to the linux cluster.  (set fs.default.name and
mapred.job.tracker)

- Extract the tarball.  Change into created directory.
  tar xvfz Example.tar.gz
  cd Example

- Edit build.properties, set your hadoop.home appropriately, then
build the example.
  ant

- Load the file Example.in into your dfs
  hadoop dfs -copyFromLocal Example.in Example.in

- Execute the provided shell script, passing it testID 1.
  ./Example.sh 1
  This test does not use -libjars, and it completes successfully.

- Next, execute testID 2.
  ./Example.sh 2
  This test uses -libjars with 1 jarfile (Foo.jar), and it completes
successfully.

- Next, execute testID 3.
  ./Example.sh 3
  This test uses -libjars with 1 jarfile (Bar.jar), and it completes
successfully.

- Next, execute testID 4.
  ./Example.sh 4
  This test uses -libjars with 2 jarfiles (Foo.jar and Bar.jar), and
it fails with a ClassNotFoundException.

This behavior only occurs when calling from cygwin to linux or vice
versa.   If both the cluster and the client reside on either linux or
cygwin, the problem does not occur.

I'm continuing to dig to see what I can figure out, but since I'm very
new to hadoop (started using it this week), I thought I'd go ahead and
throw this out there to see if anyone can help.

Thanks!


Example.tar.gz
Description: GNU Zip compressed data