heterogeneous cluster

2008-07-14 Thread Sandhya E
Hi

We currently have a cluster of 3 nodes, all of the same
configuration [4 CPUs, 8GB RAM, 2TB]. The Namenode web UI shows that DFS
used % is 30% [nearly 700GB used per node]. CPU utilization remains
continuously high on all the nodes in the cluster. So, to fit the
needs of the growing number of jobs we are submitting to the Hadoop
cluster, we are planning to add more boxes [of smaller capacity: 2 CPUs,
8GB RAM, 150GB]. In the long term we want to grow the cluster with many
smaller-capacity boxes and replace the larger-capacity boxes. To
start off, we may have a mix of large boxes and smaller
ones in the cluster. Will this setup work fine?

Thanks & Regards
Sandhya


Re: Hadoop and lucene integration

2008-07-14 Thread bhupendar

Thanks for the reply 

It's nice to get more information about Hadoop, and I have also started looking at
Nutch.
The only thing I am not able to figure out yet is whether we can integrate Hadoop
with Lucene or not, and whether there is any tutorial that will help me do this.

Thanks a lot again for your response and help 
Regards
Bhupendra






Re: multiple Output Collectors ?

2008-07-14 Thread Alejandro Abdelnur
Check MultipleOutputFormat and MultipleOutputs (the latter was
committed to trunk last week).
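
A rough sketch of the MultipleOutputs usage from org.apache.hadoop.mapred.lib
(the named outputs, key/value classes and the literal "url" key below are
placeholders, not part of the API):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver side: declare the extra outputs.
// JobConf conf = new JobConf(MyJob.class);
// MultipleOutputs.addNamedOutput(conf, "links", TextOutputFormat.class, Text.class, Text.class);
// MultipleOutputs.addNamedOutput(conf, "content", TextOutputFormat.class, Text.class, Text.class);

// Mapper side: one collector per named output, in addition to the normal one.
public class PageMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private MultipleOutputs mos;

  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  public void map(LongWritable key, Text page, OutputCollector<Text, Text> out,
      Reporter reporter) throws IOException {
    out.collect(new Text("url"), page);                                   // regular output as usual
    mos.getCollector("links", reporter).collect(new Text("url"), new Text("..."));
    mos.getCollector("content", reporter).collect(new Text("url"), page); // second stream
  }

  public void close() throws IOException {
    mos.close();
  }
}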


On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Is it possible to have more than one output collector for one map?
>
> My inputs are records of html pages. I am mapping each url to its
> html-content and want to have two output collectors: one that maps
> each  -->  and another that maps
>  to something else (difficult to explain).
>
> Please help. Thanks
>
> -k
>


different dfs block size

2008-07-14 Thread Rong-en Fan
Hi,

I'm wondering how dfs.block.size affects the memory consumption of the
NameNode for a fixed set of data. I know it is determined by the number
of blocks and the number of replicas, but how much memory does one block
use in the NameNode? In addition, what would be the pros/cons of a
bigger/smaller block size?

Thanks,
Rong-En Fan


Re: Writable readFields and write functions

2008-07-14 Thread Chris Douglas


-- Presently, my RecordReader converts XML strings from a file to MyWritable
object
-- When readFields is called, RecordReader should provide the next
MyWritable object, if there is one
-- When write is called, MyWriter should write the objects out


Not quite. Your RecordReader may produce MyWritable records, but
readFields may not be involved. For your MyWritable records to get to
your reduce, they should implement the Writable interface so the
framework may regard them as streams of bytes. Your OutputFormat -
which may use your MyWriter - may take the MyWritable objects you emit
from your reduce and make them conform to whatever format your spec
requires.

* Your InputFormat takes XML and provides MyWritable objects to your mapper
* The framework calls MyWritable::write(byte_stream) and
MyWritable::readFields(byte_stream) to push records you emit from your
mapper across the network, between abstractions, etc.
* Your OutputFormat takes MyWritable objects you emit from your reducer
and stores them according to the format you specify

With many exceptions, most RecordReaders calling readFields are
reading from structured, generic formats (like SequenceFile). -C


The RecordReader is record-oriented, but both the readFields and write
functions are byte-oriented... in order for Hadoop to be happy, I need to
coordinate my record-oriented to byte-oriented.

Is this correct? I just want to make sure before I tinker more with the
code, to have the design properly down.

Thanks!
Kylie


On Mon, Jul 14, 2008 at 3:43 PM, Chris Douglas <[EMAIL PROTECTED]>
wrote:

It's easiest to consider write as a function that converts your  
record to
bytes and readFields as a function restoring your record from  
bytes. So it

should be the case that:

MyWritable i = new MyWritable();
i.initWithData(some_data);
i.write(byte_stream);
...
MyWritable j = new MyWritable();
j.initWithData(some_other_data); // (1)
j.readFields(byte_stream);
assert i.equals(j);

Note that the assert should be true whether or not (1) is present,  
i.e. a
call to readFields should be deterministic and without hysteresis  
(it should
make no difference whether the Writable is newly created or if it  
formerly
held some other state). readFields must also consume the entire  
record, so
for example, if write outputs three integers, readFields must  
consume three
integers. Variable-sized Writables are common, but any optional/ 
variably

sized fields must be encoded to satisfy the preceding.

So if your MyBigWritable record held two ints (integerA, integerB)  
and a

MyWritable (my_writable), its write method might look like:

out.writeInt(integerA);
out.writeInt(integerB);
my_writable.write(out);

and readFields would restore:

integerA = in.readInt();
integerB = in.readInt();
my_writable.readFields(in);

There are many examples in the source of simple, compound, and
variably-sized Writables.

Your RecordReader is responsible for providing a key and value to  
your map.
Most generic formats rely on Writables or another mode of  
serialization to

write and restore objects to/from structured byte sequences, but less
generic InputFormats will create Writables from byte streams.
TextInputFormat, for example, will create Text objects from CR- 
delimited
files, though Text objects are not, themselves, encoded in the  
file. In
contrast, a SequenceFile storing the same data will encode the
Text object

(using its write method) and will restore that object as encoded.

The critical difference is that the framework needs to convert your  
record
to a byte stream at various points- hence the Writable interface-  
while you
may be more particular about the format from which you consume and  
the
format to which you need your output to conform. Note that you can  
elect to

use a different serialization framework if you prefer.

If your data structure will be used as a key (implementing
WritableComparable), it's strongly recommended that you implement a
RawComparator, which can compare the serialized bytes directly  
without

deserializing both arguments. -C


On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:

Hi There!

I'm currently working on code for my own Writable object (called
ServiceWritable) and I've been working off LongWritable for this  
one. I

was
wondering, however, about the following two functions:

public void readFields(java.io.DataInput in)
and
public void write(java.io.DataOutput out)

I have my own RecordReader object to read in the complex type  
Service, and

I
also have my own Writer object to write my complex type ResultSet  
for

output. In LongWritable, the code is very simple:

value = in.readLong()
and
out.writeLong(value);

Since I am dealing with more complex objects, the ObjectWritable  
won't

help
me. I'm a little confused with the interaction here between my
RecordReader,
and Writer objects--because there does not seem to be any  
directly. Can

someone help me out here?

Thanks,
Kylie




Re: Is this supported that Combiner emits keys other than its input key set?

2008-07-14 Thread Chris Douglas
Yes; a combiner that emits a key that should go to a different  
partition is incorrect. If this were legal, then the combiner output  
would also need to be buffered, sorted, spilled, etc., effectively  
requiring another map phase. The combiner's purpose is to decrease the  
volume of data that needs to be shuffled or spilled (wordcount is the  
perfect example). It should not be thought of as a stage of  
computation. -C
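
For reference, a sketch of the wordcount-style setup: the reducer class doubles
as the combiner, which is safe precisely because summing partial counts changes
neither the keys nor their partitions (the class names here are placeholders):

JobConf conf = new JobConf(WordCount.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(TokenizeMapper.class);
conf.setCombinerClass(SumReducer.class);   // same class as the reducer
conf.setReducerClass(SumReducer.class);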


On Jul 14, 2008, at 4:46 PM, Keliang Zhao wrote:


Hi there,

I read the code a bit, though I am not sure if I got it right. It
appears to me that when the memory buffer of a mapper is full, it spills and
gets sorted by partition id and by keys. Then, if there is a combiner
defined, it will work on each partition. However, it seems that the
outputs of a combiner are put in the same input partition, which means
that the keys emitted by a combiner have to be in the same partition as
the inputs to it. Is this the case?

Best,
-Kevin




Setting inputs in configure()

2008-07-14 Thread schnitzi

I have some mapping jobs that are chained together, and would like to set the
inputs for them in an overridden configure(JobConf) method.  When I try to
do it this way, though, I get an error like this:

aggregatorJob failed: java.io.IOException: No input paths specified in input
at
org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:173)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:347)
at
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)

If I add dummy input paths initially, though, then let them get overridden
in my configure() method, it works fine.  Am I doing something wrong?  Is
this bad practice?  It would seem to me that, internally, it should be
calling the (potentially overridden) configure() method before validating
the inputs.


Thanks
Mark



Re: Writable readFields and write functions

2008-07-14 Thread Kylie McCormick
Hello Chris:
Thanks for the prompt reply!

So, to conclude from your note:
-- Presently, my RecordReader converts XML strings from a file to MyWritable
object
-- When readFields is called, RecordReader should provide the next
MyWritable object, if there is one
-- When write is called, MyWriter should write the objects out

The RecordReader is record-oriented, but both the readFields and write
functions are byte-oriented... in order for Hadoop to be happy, I need to
coordinate my record-oriented to byte-oriented.

Is this correct? I just want to make sure before I tinker more with the
code, to have the design properly down.

Thanks!
Kylie


On Mon, Jul 14, 2008 at 3:43 PM, Chris Douglas <[EMAIL PROTECTED]>
wrote:

> It's easiest to consider write as a function that converts your record to
> bytes and readFields as a function restoring your record from bytes. So it
> should be the case that:
>
> MyWritable i = new MyWritable();
> i.initWithData(some_data);
> i.write(byte_stream);
> ...
> MyWritable j = new MyWritable();
> j.initWithData(some_other_data); // (1)
> j.readFields(byte_stream);
> assert i.equals(j);
>
> Note that the assert should be true whether or not (1) is present, i.e. a
> call to readFields should be deterministic and without hysteresis (it should
> make no difference whether the Writable is newly created or if it formerly
> held some other state). readFields must also consume the entire record, so
> for example, if write outputs three integers, readFields must consume three
> integers. Variable-sized Writables are common, but any optional/variably
> sized fields must be encoded to satisfy the preceding.
>
> So if your MyBigWritable record held two ints (integerA, integerB) and a
> MyWritable (my_writable), its write method might look like:
>
> out.writeInt(integerA);
> out.writeInt(integerB);
> my_writable.write(out);
>
> and readFields would restore:
>
> integerA = in.readInt();
> integerB = in.readInt();
> my_writable.readFields(in);
>
> There are many examples in the source of simple, compound, and
> variably-sized Writables.
>
> Your RecordReader is responsible for providing a key and value to your map.
> Most generic formats rely on Writables or another mode of serialization to
> write and restore objects to/from structured byte sequences, but less
> generic InputFormats will create Writables from byte streams.
> TextInputFormat, for example, will create Text objects from CR-delimited
> files, though Text objects are not, themselves, encoded in the file. In
> contrast, a SequenceFile storing the same data will encode the Text object
> (using its write method) and will restore that object as encoded.
>
> The critical difference is that the framework needs to convert your record
> to a byte stream at various points- hence the Writable interface- while you
> may be more particular about the format from which you consume and the
> format to which you need your output to conform. Note that you can elect to
> use a different serialization framework if you prefer.
>
> If your data structure will be used as a key (implementing
> WritableComparable), it's strongly recommended that you implement a
> RawComparator, which can compare the serialized bytes directly without
> deserializing both arguments. -C
>
>
> On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:
>
>  Hi There!
>> I'm currently working on code for my own Writable object (called
>> ServiceWritable) and I've been working off LongWritable for this one. I
>> was
>> wondering, however, about the following two functions:
>>
>> public void readFields(java.io.DataInput in)
>> and
>> public void write(java.io.DataOutput out)
>>
>> I have my own RecordReader object to read in the complex type Service, and
>> I
>> also have my own Writer object to write my complex type ResultSet for
>> output. In LongWritable, the code is very simple:
>>
>> value = in.readLong()
>> and
>> out.writeLong(value);
>>
>> Since I am dealing with more complex objects, the ObjectWritable won't
>> help
>> me. I'm a little confused with the interaction here between my
>> RecordReader,
>> and Writer objects--because there does not seem to be any directly. Can
>> someone help me out here?
>>
>> Thanks,
>> Kylie
>>
>
>


-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Is this supported that Combiner emits keys other than its input key set?

2008-07-14 Thread Keliang Zhao
Hi there,

I read the code a bit, though I am not sure if I got it right. It
appears to me that when the memory buffer of a mapper is full, it spills and
gets sorted by partition id and by keys. Then, if there is a combiner
defined, it will work on each partition. However, it seems that the
outputs of a combiner are put in the same input partition, which means
that the keys emitted by a combiner have to be in the same partition as
the inputs to it. Is this the case?

Best,
-Kevin


Re: Writable readFields and write functions

2008-07-14 Thread Chris Douglas
It's easiest to consider write as a function that converts your record  
to bytes and readFields as a function restoring your record from  
bytes. So it should be the case that:


MyWritable i = new MyWritable();
i.initWithData(some_data);
i.write(byte_stream);
...
MyWritable j = new MyWritable();
j.initWithData(some_other_data); // (1)
j.readFields(byte_stream);
assert i.equals(j);

Note that the assert should be true whether or not (1) is present,  
i.e. a call to readFields should be deterministic and without  
hysteresis (it should make no difference whether the Writable is newly  
created or if it formerly held some other state). readFields must also
consume the entire record, so for example, if write outputs three  
integers, readFields must consume three integers. Variable-sized  
Writables are common, but any optional/variably sized fields must be  
encoded to satisfy the preceding.


So if your MyBigWritable record held two ints (integerA, integerB) and  
a MyWritable (my_writable), its write method might look like:


out.writeInt(integerA);
out.writeInt(integerB);
my_writable.write(out);

and readFields would restore:

integerA = in.readInt();
integerB = in.readInt();
my_writable.readFields(in);

There are many examples in the source of simple, compound, and  
variably-sized Writables.
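
Putting the pieces above together, a complete compound Writable might look like
the following sketch (it assumes MyWritable is itself some existing Writable;
switch to WritableComparable and add compareTo if it will be used as a key):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyBigWritable implements Writable {
  private int integerA;
  private int integerB;
  private MyWritable my_writable = new MyWritable();

  public void write(DataOutput out) throws IOException {
    out.writeInt(integerA);
    out.writeInt(integerB);
    my_writable.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    // consumes exactly the bytes write() produced, in the same order
    integerA = in.readInt();
    integerB = in.readInt();
    my_writable.readFields(in);
  }
}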


Your RecordReader is responsible for providing a key and value to your  
map. Most generic formats rely on Writables or another mode of  
serialization to write and restore objects to/from structured byte  
sequences, but less generic InputFormats will create Writables from  
byte streams. TextInputFormat, for example, will create Text objects  
from CR-delimited files, though Text objects are not, themselves,  
encoded in the file. In contrast, a SequenceFile storing the same
data will encode the Text object (using its write method) and will  
restore that object as encoded.


The critical difference is that the framework needs to convert your  
record to a byte stream at various points- hence the Writable  
interface- while you may be more particular about the format from  
which you consume and the format to which you need your output to  
conform. Note that you can elect to use a different serialization  
framework if you prefer.


If your data structure will be used as a key (implementing  
WritableComparable), it's strongly recommended that you implement a  
RawComparator, which can compare the serialized bytes directly without  
deserializing both arguments. -C
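
To illustrate that last recommendation, the usual pattern (borrowed from the
built-in Writables) is a nested comparator registered in a static block; this
sketch assumes MyBigWritable implements WritableComparable and that the first
field serialized by write() is an int:

// inside MyBigWritable; uses org.apache.hadoop.io.WritableComparator
public static class Comparator extends WritableComparator {
  public Comparator() {
    super(MyBigWritable.class);
  }

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int a = readInt(b1, s1);   // first int written by write()
    int b = readInt(b2, s2);
    return (a < b) ? -1 : ((a == b) ? 0 : 1);
  }
}

static {
  // register so the framework compares serialized keys without deserializing them
  WritableComparator.define(MyBigWritable.class, new Comparator());
}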


On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:


Hi There!
I'm currently working on code for my own Writable object (called
ServiceWritable) and I've been working off LongWritable for this  
one. I was

wondering, however, about the following two functions:

public void readFields(java.io.DataInput in)
and
public void write(java.io.DataOutput out)

I have my own RecordReader object to read in the complex type  
Service, and I

also have my own Writer object to write my complex type ResultSet for
output. In LongWritable, the code is very simple:

value = in.readLong()
and
out.writeLong(value);

Since I am dealing with more complex objects, the ObjectWritable  
won't help
me. I'm a little confused with the interaction here between my  
RecordReader,
and Writer objects--because there does not seem to be any directly.  
Can

someone help me out here?

Thanks,
Kylie




Re: When does reducer read mapper's intermediate result?

2008-07-14 Thread Chris Douglas
Not quite; the intermediate output is written to the local disk on the  
node executing MapTask and fetched over HTTP by the ReduceTask. The  
ReduceTask need only wait for the MapTask to complete successfully  
before fetching its output, but it cannot start before all MapTasks  
have finished. The intermediate output is sorted, so the ReduceTask  
only needs to merge the output produced by the map and group by key  
(using the grouping comparator). -C
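
For reference, the comparators involved are configured on the JobConf in the
old API; the class names below are placeholders, and both settings default to
the key class's own comparator:

JobConf conf = new JobConf(MyJob.class);
// order in which keys are sorted/merged on the reduce side
conf.setOutputKeyComparatorClass(MyKeyComparator.class);
// decides which consecutive keys are grouped into a single reduce() call
conf.setOutputValueGroupingComparator(MyGroupingComparator.class);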


On Jul 14, 2008, at 3:59 PM, Mori Bellamy wrote:

i'm pretty sure that the reducer waits for all of the map tasks'
output to be written to HDFS (or else i see no use for the Combiner
class). i'm not sure about your second question though. my gut
tells me "no"



On Jul 14, 2008, at 3:50 PM, Kevin wrote:


Hi, there,

I am interested in the implementation details of hadoop mapred. In
particular, does the reducer wait till a map task ends and then fetch
the output (key-value pairs)? If so, is the very file produced by a
mapper for the reducer sorted before reducer gets it? (which means
that the reducer only needs to do merge sort when it gets all the
intermediate files from different mappers).

Best,
-Kevin






Re: When does reducer read mapper's intermediate result?

2008-07-14 Thread Mori Bellamy
i'm pretty sure that the reducer waits for all of the map tasks'
output to be written to HDFS (or else i see no use for the Combiner
class). i'm not sure about your second question though. my gut tells
me "no"



On Jul 14, 2008, at 3:50 PM, Kevin wrote:


Hi, there,

I am interested in the implementation details of hadoop mapred. In
particular, does the reducer wait till a map task ends and then fetch
the output (key-value pairs)? If so, is the very file produced by a
mapper for the reducer sorted before reducer gets it? (which means
that the reducer only needs to do merge sort when it gets all the
intermediate files from different mappers).

Best,
-Kevin




Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Mori Bellamy
Weird. I use eclipse, but that's never happened to me. When  you set  
up your JobConfs, for example:

JobConf conf2 = new JobConf(getConf(),MyClass.class)
is your "MyClass" in the same package as your driver program? also, do  
you run from eclipse or from the command line (i've never tried to  
launch a hadoop task from eclipse). if you run from the command line:


hadoop jar MyMRTaskWrapper.jar myEntryClass option1 option2...

and all of the requisite resources are in MyMRTaskWrapper.jar, i don't  
see what the problem would be. if this is the way you run a hadoop  
task, are you sure that all of the resources are getting compiled into  
the same jar? when you export a jar from eclipse, it won't pack up  
external resources by default. (look into addons like FatJAR for that).



On Jul 14, 2008, at 2:25 PM, Sean Arietta wrote:



Well that's what I need to do also... but Hadoop complains to me  
when I
attempt to do that. Are you using Eclipse by any chance to develop?  
The
error I'm getting seems to be stemming from the fact that Hadoop  
thinks I am
uploading a new jar for EVERY execution of JobClient.runJob() so it  
fails
indicating the job jar file doesn't exist. Did you have to turn  
something
on/off to get it to ignore that or are you using a different IDE?  
Thanks!


Cheers,
Sean


Mori Bellamy wrote:


hey sean,

i later learned that the method i originally posted (configuring
different JobConfs and then running them, blocking style, with
JobClient.runJob(conf)) was sufficient for my needs. the reason it  
was
failing before was somehow my fault and the bugs somehow got fixed  
x_X.


Lukas gave me a helpful reply pointing me to TestJobControl.java (in
the hadoop source directory). it seems like this would be helpful if
your job dependencies are complex. but for me, i just need to do one
job after another (and every job only depends on the one right before
it), so the code i originally posted works fine.
On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:



Could you please provide some small code snippets elaborating on how
you
implemented that? I have a similar need as the author of this thread
and I
would appreciate any help. Thanks!

Cheers,
Sean


Joman Chu-2 wrote:


Hi, I use Toolrunner.run() for multiple MapReduce jobs. It seems to
work
well. I've run sequences involving hundreds of MapReduce jobs in a
for
loop and it hasn't died on me yet.

On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:

Hey all, I'm trying to chain multiple mapreduce jobs together to
accomplish a complex task. I believe that the way to do it is as
follows:

JobConf conf = new JobConf(getConf(), MyClass.class); //configure
job
set mappers, reducers, etc
SequenceFileOutputFormat.setOutputPath(conf,myPath1);
JobClient.runJob(conf);

//new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
SequenceFileInputFormat.setInputPath(conf,myPath1); //more
configuration... JobClient.runJob(conf2)

Is this the canonical way to chain jobs? I'm having some trouble
with
this
method -- for especially long jobs, the latter MR tasks sometimes
do not
start up.





--
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net





--
View this message in context:
http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.







--
View this message in context: 
http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18453200.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





When does reducer read mapper's intermediate result?

2008-07-14 Thread Kevin
Hi, there,

I am interested in the implementation details of hadoop mapred. In
particular, does the reducer wait till a map task ends and then fetch
the output (key-value pairs)? If so, is the very file produced by a
mapper for the reducer sorted before reducer gets it? (which means
that the reducer only needs to do merge sort when it gets all the
intermediate files from different mappers).

Best,
-Kevin


Writable readFields and write functions

2008-07-14 Thread Kylie McCormick
Hi There!
I'm currently working on code for my own Writable object (called
ServiceWritable) and I've been working off LongWritable for this one. I was
wondering, however, about the following two functions:

public void readFields(java.io.DataInput in)
and
public void write(java.io.DataOutput out)

I have my own RecordReader object to read in the complex type Service, and I
also have my own Writer object to write my complex type ResultSet for
output. In LongWritable, the code is very simple:

value = in.readLong()
and
out.writeLong(value);

Since I am dealing with more complex objects, the ObjectWritable won't help
me. I'm a little confused with the interaction here between my RecordReader,
and Writer objects--because there does not seem to be any directly. Can
someone help me out here?

Thanks,
Kylie


Re: How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Kevin
Thank you, Chris. This solves my questions.
-Kevin


On Mon, Jul 14, 2008 at 11:17 AM, Chris Douglas <[EMAIL PROTECTED]> wrote:
> "Yielding equal partitions" means that each input source will offer n
> partitions and for any given partition 0 <= i < n, the records in that
> partition are 1) sorted on the same key 2) unique to that partition, i.e. if
> a key k is in partition i for a given source, k appears in no other
> partitions from that source and if any other source contains k, all
> occurrences appear in partition i from that source. All the framework really
> effects is the cartesian product of all matching keys, so yes, that implies
> equi-joins.
>
> It's a fairly strict requirement. Satisfying it is less onerous if one is
> joining the output of several m/r jobs, each of which uses the same
> keys/partitioner, the same number of reduces, and each output file
> (part-x) of each job is not splittable. In this case, n is equal to the
> number of output files from each job (the number of reduces), (1) is
> satisfied if the reduce emits records in the same order (i.e. no new keys,
> no records out of order), and (2) is guaranteed by the partitioner and (1).
>
> An InputFormat capable of parsing metadata about each source to generate
> partitions from the set of input sources is ideal, but I can point to no
> existing implementation. -C
>
> On Jul 14, 2008, at 9:20 AM, Kevin wrote:
>
>> Hi,
>>
>> I find limited information about this package which looks like could
>> do "equi?" join. "Given a set of sorted datasets keyed with the same
>> class and yielding equal partitions, it is possible to effect a join
>> of those datasets prior to the map. " What does "yielding equal
>> partitions" mean?
>>
>> Thank you.
>>
>> -Kevin
>
>


Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Joman Chu
Hi, I don't have the code sitting in front of me at the moment, but
I'll do some of it from memory and I'll post a real snippet tomorrow
night. Hopefully, this can get you started

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyMainClass {
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(),
                new ClassThatImplementsTool(), args);
        // make sure you see the API for other trickiness you can do.
        System.exit(exitCode);
    }
}

public class ClassThatImplementsTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // this method gets called by ToolRunner.run
        // do all sorts of configuration here
        // i.e., set your Map, Combine, Reduce classes
        // look at the JobConf/Configuration class API
        return 0;
    }
}

The main thing to know is that ToolRunner.run() will call your
class's run() method.
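
For the chaining case discussed in this thread, a minimal run() might simply
configure and submit the jobs one after another (a sketch: the paths, job names
and placeholder settings are assumptions, and it relies on the class extending
Configured so getConf() is available; each JobClient.runJob call blocks until
that job finishes):

public int run(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);
    Path output = new Path(args[2]);

    JobConf first = new JobConf(getConf(), ClassThatImplementsTool.class);
    first.setJobName("first-pass");
    FileInputFormat.setInputPaths(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    // set mapper/reducer/key/value classes for the first job here
    JobClient.runJob(first);

    JobConf second = new JobConf(getConf(), ClassThatImplementsTool.class);
    second.setJobName("second-pass");
    FileInputFormat.setInputPaths(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    // set mapper/reducer/key/value classes for the second job here
    JobClient.runJob(second);
    return 0;
}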

Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net


On Mon, Jul 14, 2008 at 4:38 PM, Sean Arietta <[EMAIL PROTECTED]> wrote:
>
> Could you please provide some small code snippets elaborating on how you
> implemented that? I have a similar need as the author of this thread and I
> would appreciate any help. Thanks!
>
> Cheers,
> Sean
>
>
> Joman Chu-2 wrote:
>>
>> Hi, I use Toolrunner.run() for multiple MapReduce jobs. It seems to work
>> well. I've run sequences involving hundreds of MapReduce jobs in a for
>> loop and it hasn't died on me yet.
>>
>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>>> accomplish a complex task. I believe that the way to do it is as follows:
>>>
>>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure job
>>> set mappers, reducers, etc
>>> SequenceFileOutputFormat.setOutputPath(conf,myPath1);
>>> JobClient.runJob(conf);
>>>
>>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
>>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>>> configuration... JobClient.runJob(conf2)
>>>
>>> Is this the canonical way to chain jobs? I'm having some trouble with
>>> this
>>> method -- for especially long jobs, the latter MR tasks sometimes do not
>>> start up.
>>>
>>>
>>
>>
>> --
>> Joman Chu
>> AIM: ARcanUSNUMquam
>> IRC: irc.liquid-silver.net
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
>


Re: multiple Output Collectors ?

2008-07-14 Thread Joman Chu
One cheap hack that comes to mind is to extend the GenericWritable and
ArrayWritable classes and write a second and third MapReduce job that
will both parse over your first job's output, and each will select for
the Key-Value pair it wants.
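
A minimal sketch of the GenericWritable part (the two wrapped types are chosen
only for illustration):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;

public class PageOutputWritable extends GenericWritable {
  private static final Class[] TYPES = { Text.class, BytesWritable.class };

  protected Class[] getTypes() {
    return TYPES;
  }
}

// usage in the first job's map:
//   PageOutputWritable wrapped = new PageOutputWritable();
//   wrapped.set(new Text(...));   // or a BytesWritable, depending on the record
//   output.collect(key, wrapped);
// the follow-up jobs call get() and check the wrapped type to select what they want.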

Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net


On Mon, Jul 14, 2008 at 2:19 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Is it possible to have more than one output collector for one map?
>
> My inputs are records of html pages. I am mapping each url to its
> html-content and want to have two output collectors: one that maps
> each  -->  and another that maps
>  to something else (difficult to explain).
>
> Please help. Thanks
>
> -k
>
>


FileSplit hosts

2008-07-14 Thread Nathan Marz
What's the behavior of giving FileSplit "null" for the hosts field in
the constructor? Will the framework figure out which hosts the data
comes from on its own?


Thanks,
Nathan Marz



Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Sean Arietta

Well that's what I need to do also... but Hadoop complains to me when I
attempt to do that. Are you using Eclipse by any chance to develop? The
error I'm getting seems to be stemming from the fact that Hadoop thinks I am
uploading a new jar for EVERY execution of JobClient.runJob() so it fails
indicating the job jar file doesn't exist. Did you have to turn something
on/off to get it to ignore that or are you using a different IDE? Thanks!

Cheers,
Sean


Mori Bellamy wrote:
> 
> hey sean,
> 
> i later learned that the method i originally posted (configuring  
> different JobConfs and then running them, blocking style, with  
> JobClient.runJob(conf)) was sufficient for my needs. the reason it was  
> failing before was somehow my fault and the bugs somehow got fixed x_X.
> 
> Lukas gave me a helpful reply pointing me to TestJobControl.java (in  
> the hadoop source directory). it seems like this would be helpful if  
> your job dependencies are complex. but for me, i just need to do one  
> job after another (and every job only depends on the one right before  
> it), so the code i originally posted works fine.
> On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:
> 
>>
>> Could you please provide some small code snippets elaborating on how  
>> you
>> implemented that? I have a similar need as the author of this thread  
>> and I
>> would appreciate any help. Thanks!
>>
>> Cheers,
>> Sean
>>
>>
>> Joman Chu-2 wrote:
>>>
>>> Hi, I use Toolrunner.run() for multiple MapReduce jobs. It seems to  
>>> work
>>> well. I've run sequences involving hundreds of MapReduce jobs in a  
>>> for
>>> loop and it hasn't died on me yet.
>>>
>>> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
 Hey all, I'm trying to chain multiple mapreduce jobs together to
 accomplish a complex task. I believe that the way to do it is as  
 follows:

 JobConf conf = new JobConf(getConf(), MyClass.class); //configure  
 job
 set mappers, reducers, etc
 SequenceFileOutputFormat.setOutputPath(conf,myPath1);
 JobClient.runJob(conf);

 //new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
 SequenceFileInputFormat.setInputPath(conf,myPath1); //more
 configuration... JobClient.runJob(conf2)

 Is this the canonical way to chain jobs? I'm having some trouble  
 with
 this
 method -- for especially long jobs, the latter MR tasks sometimes  
 do not
 start up.


>>>
>>>
>>> -- 
>>> Joman Chu
>>> AIM: ARcanUSNUMquam
>>> IRC: irc.liquid-silver.net
>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
> 
> 
> 




Ideal number of mappers and reducers; any physical limits?

2008-07-14 Thread Lukas Vlcek
Hi,

I have a couple of *basic* questions about Hadoop internals.

1) If I understood correctly, the ideal number of Reducers is equal to the number
of distinct keys (or custom Partitioners) emitted from all Mappers at a
given Map-Reduce iteration. Is that correct?

2) In the configuration a maximum number of Reducers can be set. How does
Hadoop handle the situation when there are more intermediate keys emitted
from Mappers than this number? AFAIK the intermediate results are stored in
SequenceFiles. Does it mean that this intermediate persistent storage is
somehow scanned for all records of the same key (or custom Partitioner value)
and such a chunk of data is sent to one Reducer, and if no Reducer is left then
the process waits until some of them are done and can be assigned a new chunk
of data?

3) Is there any recommendation about how to set up a job if the number of
intermediate keys is not known beforehand?

4) Is there any physical limit on the number of Reducers imposed by Hadoop's
internal architecture?

... and finally ...

5) Does anybody know how, and what exactly, the folks at Yahoo! use Hadoop for?
If the biggest reported Hadoop cluster has something like 2000 machines, then
the total number of Mappers/Reducers can be like 2000*200 (assuming there
are, for example, 200 Reducers running on each machine), which is a big number
but still probably not big enough to handle processing of really large
graph data structures IMHO. As far as I understood, Google is not directly
using the Map-Reduce form of PageRank calculation for whole-internet graph
processing (see http://www.youtube.com/watch?v=BT-piFBP4fE). So, if Yahoo!
needs a scaling algorithm for really large tasks, what do they use if not
Hadoop?

Regards,
Lukas

-- 
http://blog.lukas-vlcek.com/


Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Mori Bellamy

hey sean,

i later learned that the method i originally posted (configuring  
different JobConfs and then running them, blocking style, with  
JobClient.runJob(conf)) was sufficient for my needs. the reason it was  
failing before was somehow my fault and the bugs somehow got fixed x_X.


Lukas gave me a helpful reply pointing me to TestJobControl.java (in  
the hadoop source directory). it seems like this would be helpful if  
your job dependencies are complex. but for me, i just need to do one  
job after another (and every job only depends on the one right before  
it), so the code i originally posted works fine.

On Jul 14, 2008, at 1:38 PM, Sean Arietta wrote:



Could you please provide some small code snippets elaborating on how  
you
implemented that? I have a similar need as the author of this thread  
and I

would appreciate any help. Thanks!

Cheers,
Sean


Joman Chu-2 wrote:


Hi, I use Toolrunner.run() for multiple MapReduce jobs. It seems to  
work
well. I've run sequences involving hundreds of MapReduce jobs in a  
for

loop and it hasn't died on me yet.

On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:

Hey all, I'm trying to chain multiple mapreduce jobs together to
accomplish a complex task. I believe that the way to do it is as  
follows:


JobConf conf = new JobConf(getConf(), MyClass.class); //configure  
job

set mappers, reducers, etc
SequenceFileOutputFormat.setOutputPath(conf,myPath1);
JobClient.runJob(conf);

//new job JobConf conf2 = new JobConf(getConf(),MyClass.class)
SequenceFileInputFormat.setInputPath(conf,myPath1); //more
configuration... JobClient.runJob(conf2)

Is this the canonical way to chain jobs? I'm having some trouble  
with

this
method -- for especially long jobs, the latter MR tasks sometimes  
do not

start up.





--
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net





--
View this message in context: 
http://www.nabble.com/How-to-chain-multiple-hadoop-jobs--tp18370089p18452309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Sean Arietta

Could you please provide some small code snippets elaborating on how you
implemented that? I have a similar need as the author of this thread and I
would appreciate any help. Thanks!

Cheers,
Sean


Joman Chu-2 wrote:
> 
> Hi, I use Toolrunner.run() for multiple MapReduce jobs. It seems to work
> well. I've run sequences involving hundreds of MapReduce jobs in a for
> loop and it hasn't died on me yet.
> 
> On Wed, July 9, 2008 4:28 pm, Mori Bellamy said:
>> Hey all, I'm trying to chain multiple mapreduce jobs together to
>> accomplish a complex task. I believe that the way to do it is as follows:
>> 
>> JobConf conf = new JobConf(getConf(), MyClass.class); //configure job
>> set mappers, reducers, etc 
>> SequenceFileOutputFormat.setOutputPath(conf,myPath1); 
>> JobClient.runJob(conf);
>> 
>> //new job JobConf conf2 = new JobConf(getConf(),MyClass.class) 
>> SequenceFileInputFormat.setInputPath(conf,myPath1); //more
>> configuration... JobClient.runJob(conf2)
>> 
>> Is this the canonical way to chain jobs? I'm having some trouble with
>> this
>> method -- for especially long jobs, the latter MR tasks sometimes do not
>> start up.
>> 
>> 
> 
> 
> -- 
> Joman Chu
> AIM: ARcanUSNUMquam
> IRC: irc.liquid-silver.net
> 
> 
> 




Re: Hadoop and lucene integration

2008-07-14 Thread Naama Kraus
I think you may find a lot of information about Hadoop in general in
Hadoop's Wiki http://hadoop.apache.org/core/

Re. Hadoop and search, you might also want to take a look at Nutch
http://lucene.apache.org/nutch/

In general, Hadoop allows one to store huge amounts of data on a cluster of
commodity nodes and process it in an efficient way (parallel, computation
near data) using the MapReduce framework. Hadoop is an infrastructure; each
application should use it in a way that fits its needs.

Hope this helps,
Naama

On Mon, Jul 14, 2008 at 2:58 PM, bhupendar <[EMAIL PROTECTED]> wrote:

>
> Thanks for the response.
>
> The problem I am facing here is that I don't have any clue about Hadoop. So
> first I am trying to analyse whether I can integrate Hadoop with the
> existing application developed using Lucene or not. I need some clue or
> tutorial which talks about Hadoop integration with Lucene. I have started
> exploring the two links that you have given.
> Thanks a lot.
> Looking forward to some more useful information about Hadoop.
>
> Regards,
> Bhupendra
> --
> View this message in context:
> http://www.nabble.com/Re%3A-Hadoop-and-lucene-integration-tp18441305p18442379.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


multiple Output Collectors ?

2008-07-14 Thread Khanh Nguyen
Hello,

Is it possible to have more than one output collector for one map?

My inputs are records of html pages. I am mapping each url to its
html-content and want to have two output collectors: one that maps
each  -->  and another that maps
 to something else (difficult to explain).

Please help. Thanks

-k


Re: How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Chris Douglas
"Yielding equal partitions" means that each input source will offer n  
partitions and for any given partition 0 <= i < n, the records in that  
partition are 1) sorted on the same key 2) unique to that partition,  
i.e. if a key k is in partition i for a given source, k appears in no  
other partitions from that source and if any other source contains k,  
all occurrences appear in partition i from that source. All the  
framework really effects is the cartesian product of all matching  
keys, so yes, that implies equi-joins.


It's a fairly strict requirement. Satisfying it is less onerous if one  
is joining the output of several m/r jobs, each of which uses the same  
keys/partitioner, the same number of reduces, and each output file  
(part-x) of each job is not splittable. In this case, n is equal  
to the number of output files from each job (the number of reduces),  
(1) is satisfied if the reduce emits records in the same order (i.e.  
no new keys, no records out of order), and (2) is guaranteed by the  
partitioner and (1).


An InputFormat capable of parsing metadata about each source to  
generate partitions from the set of input sources is ideal, but I can  
point to no existing implementation. -C
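
Concretely, when those conditions hold, the join is wired up roughly like this
(the paths, the "inner" operation and SequenceFileInputFormat below are
illustrative choices, not requirements):

JobConf conf = new JobConf(MyJoinJob.class);
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", SequenceFileInputFormat.class, "out/jobA", "out/jobB"));
// the map then receives (key, TupleWritable) pairs, where the tuple holds the
// matching value from each source for that key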


On Jul 14, 2008, at 9:20 AM, Kevin wrote:


Hi,

I find limited information about this package which looks like could
do "equi?" join. "Given a set of sorted datasets keyed with the same
class and yielding equal partitions, it is possible to effect a join
of those datasets prior to the map. " What does "yielding equal
partitions" mean?

Thank you.

-Kevin




Pulling input from http?

2008-07-14 Thread Khanh Nguyen
Hello,

I am struggling to get Hadoop to pull input from an http source, but so
far no luck. Is it even possible, given that in this case the input is
not placed in Hadoop's file system? Example code would be ideal.

Thanks.

-k


How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Kevin
Hi,

I find limited information about this package which looks like could
do "equi?" join. "Given a set of sorted datasets keyed with the same
class and yielding equal partitions, it is possible to effect a join
of those datasets prior to the map. " What does "yielding equal
partitions" mean?

Thank you.

-Kevin


Hadoop User Group UK

2008-07-14 Thread Johan Oskarsson

Update on the Hadoop user group in the UK:

It will be hosted at Skills Matter in Clerkenwell, London on August 19. 
We'll have presentations from both developers and users of Apache Hadoop.


The event is free and anyone is welcome, but we only have room for 60 
people so make sure you're on the attending list @ 
http://upcoming.yahoo.com/event/506444 if you're coming.
We're sponsored by Yahoo! Developer Network (lunch+beer), Skills matter 
(beer) and Last fm (room hire), thanks guys!


If you're interested in speaking please let us know at 
[EMAIL PROTECTED], we can still squeeze in some interesting 
presentations or lightning talks.


Preliminary times:
10.00 -> 10.45: Doug Cutting (Project founder, Yahoo!) - Hadoop overview
10.45 -> 11.30: Tom White (Lexemetech) - Hadoop on Amazon S3/EC2
11.30 -> 12.15: Steve Loughran and Julio Guijarro (HP) - Smartfrog and 
Hadoop
12.15 -> 13.15: Free lunch! (Sandwich, fruit, drink and crisps. Meat and 
veggie options available)
13.15 -> 14.00: Martin Dittus and Johan Oskarsson (Last.fm) - Hadoop 
usage at Last fm

14.00 -> 15.00: Lightning talks (5-10 minutes each)
15.00 -> 16.00: Panel discussion
16.00 -> 17.00: Free beer!
17.00 -> xx.xx: Wandering to a nearby pub

Lightning talks include:
Miles Osborne (University of Edinburgh) - Using Nutch and Hadoop for 
Natural Language Processing

Tim Sell (Last fm intern) - PostgreSQL to HBase replication

For those of you who cannot attend we'll try to put presentations up on 
the wiki and perhaps even record the event in some fashion.


/Johan


Re: Why is the task run in a child JVM?

2008-07-14 Thread Torsten Curdt

On 7/14/08, Jason Venner <[EMAIL PROTECTED]> wrote:


One benefit is that if your map or reduce behaves badly it can't  
take down

the task tracker.



As the tracker jvm could also be monitored (and restarted) from
outside, the internal execution might still be worth looking into, at
least to have the option. We had a patch ... but it's terribly out of
date.


On the other hand, starting up the second jvm is supposed to be much
faster on 1.6; no idea if that's true though.


cheers
--
Torsten


Re: Why is the task run in a child JVM?

2008-07-14 Thread Shengkai Zhu
Well, I got it.


On 7/14/08, Jason Venner <[EMAIL PROTECTED]> wrote:
>
> One benefit is that if your map or reduce behaves badly it can't take down
> the task tracker.
>
> In our case we have some poorly behaved external native libraries we use,
> and we have to forcibly ensure that the child vms are killed when the child
> main finishes (often by kill -9), so the fact the child (task) is a separate
> jvm process is very helpful.
>
> The downside is the jvm start time. Has anyone experimented with the jar
> freezing for more than the standard boot class path jars to speed up
> startup?
>
>
> Shengkai Zhu wrote:
>
>> What are the benefits of such a design compared to multi-threading?
>>
>>
>>
> --
> Jason Venner
> Attributor - Program the Web 
> Attributor is hiring Hadoop Wranglers and coding wizards, contact if
> interested
>



-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: Is it possible to input two different files under same mapper

2008-07-14 Thread Jason Venner

This sounds like a good task for the Data Join code.
If you can set up so that all of your data is stored in MapFiles, with 
the same type of key and the same partitioning setup and count, it will 
go very well.


Mori Bellamy wrote:

Hey Amer,
It sounds to me like you're going to have to write your own input 
format (or at least modify an existing one). Take a look here:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html 



I'm not sure how you'd go about doing this, but i hope this helps you.

(Also, have you considered preprocessing your input so that any 
arbitrary mapper can know whether or not it's looking at a line from 
the "large file"?)

On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:


HI,
My requirement is to compare the contents of one very large file (GB 
to TB size) with a bunch of smaller files (100s of MB to GB  sizes). 
Is there a way I can give the mapper the 1st file independently of 
the remaining bunch?

Amer



--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Re: Why is the task run in a child JVM?

2008-07-14 Thread Jason Venner
One benefit is that if your map or reduce behaves badly it can't take 
down the task tracker.


In our case we have some poorly behaved external native libraries we 
use, and we have to forcibly ensure that the child vms are killed when 
the child main finishes (often by kill -9), so the fact the child (task) 
is a separate jvm process is very helpful.


The downside is the jvm start time. Has anyone experimented with the jar 
freezing for more than the standard boot class path jars to speed up 
startup?



Shengkai Zhu wrote:

What are the benefits of such a design compared to multi-threading?

  

--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Why is the task run in a child JVM?

2008-07-14 Thread Shengkai Zhu
What are the benefits of such a design compared to multi-threading?

-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: Hadoop and lucene integration

2008-07-14 Thread bhupendar

Thanks for the response.

The problem I am facing here is that I don't have any clue about Hadoop. So
first I am trying to analyse whether I can integrate Hadoop with the
existing application developed using Lucene or not. I need some clue or
tutorial which talks about Hadoop integration with Lucene. I have started
exploring the two links that you have given.
Thanks a lot.
Looking forward to some more useful information about Hadoop.

Regards,
Bhupendra



Re: Parameterized InputFormats

2008-07-14 Thread Alejandro Abdelnur
If your InputFormat implements Configurable you'll get access to the
JobConf via the setConf(Configuration) method when Hadoop creates an
instance of your class.
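
A small sketch of that pattern, extending TextInputFormat purely for
illustration (the property name is made up):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.TextInputFormat;

public class ParameterizedTextInputFormat extends TextInputFormat implements Configurable {
  private Configuration conf;
  private int threshold;

  public void setConf(Configuration conf) {
    // called by the framework right after it instantiates this class;
    // the JobConf passed in is a Configuration, so job settings are visible here
    this.conf = conf;
    this.threshold = conf.getInt("my.inputformat.threshold", 10);
  }

  public Configuration getConf() {
    return conf;
  }
}

// driver side:
//   conf.setInt("my.inputformat.threshold", 42);
//   conf.setInputFormat(ParameterizedTextInputFormat.class);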

On Mon, Jun 30, 2008 at 11:20 PM, Nathan Marz <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Are there any plans to change the JobConf API so that it takes an instance
> of an InputFormat rather than the InputFormat class? I am finding the
> inability to properly parameterize my InputFormats to be very restricting.
> What's the reasoning behind having the class as a parameter rather than an
> instance?
>
> -Nathan Marz
>


Re: Hadoop and lucene integration

2008-07-14 Thread Naama Kraus
Hi bhupendra,

You may find these links helpful:

https://issues.apache.org/jira/browse/HADOOP-2951
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00596.html

Naama

On Mon, Jul 14, 2008 at 8:37 AM, bhupendar <[EMAIL PROTECTED]> wrote:

>
> Hi all
>
> I have created a search engine using lucene to search on the file system,
> and it is working fine right now.
> I heard somewhere that using hadoop we can increase the performance of the
> search engine. I just want to know:
> 1) how can hadoop be plugged into my search engine, and what are the things it
> will improve if it can be plugged in
> 2) is there any tutorial on hadoop + lucene which can help me out in plugging
> hadoop into my search engine
> 3) one more thing i want to know is how we can distribute the lucene index
> using hadoop
>
> Thanks and Regards
> bhupendra
> --
> View this message in context:
> http://www.nabble.com/Hadoop-and-lucene-integration-tp18437758p18437758.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>
>


-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


Re: Outputting to different paths from the same input file

2008-07-14 Thread Alejandro Abdelnur
You can use MultipleOutputFormat or MultipleOutputs (it has been
committed to SVN a few days ago) for this.

Then you can use a filter on your input dir for the next jobs so only
files matching a given name/pattern are used.
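
For example, if the first job wrote named outputs "stats" and "budget" into its
output directory, each follow-up job can pick up only its own files with a glob
on the input path (the directory name and prefixes are placeholders):

FileInputFormat.setInputPaths(statsJob, new Path("firstJobOutput/stats-*"));
FileInputFormat.setInputPaths(budgetJob, new Path("firstJobOutput/budget-*"));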

A

On Fri, Jul 11, 2008 at 8:54 PM, Jason Venner <[EMAIL PROTECTED]> wrote:
> We open side effect files in our map and reduce jobs to 'tee' off additional
> data streams.
> We open them in the /configure/ method and close them in the /close/ method
> The /configure/ method provides access to the /JobConf.
>
> /We create our files relative to value of conf.get("mapred.output.dir"), in
> the map/reduce object instances.
>
> The files end up in the conf.getOutputPath() directory, and we move them out
> based on knowing the shape of the file names, after the job finishes.
>
>
> Then after the job is finished move all of the files to another location
> using a file name based filter to select the files to move (from the job
>
> schnitzi wrote:
>>
>> Okay, I've found some similar discussions in the archive, but I'm still
>> not
>> clear on this.  I'm new to Hadoop, so 'scuse my ignorance...
>>
>> I'm writing a Hadoop tool to read in an event log, and I want to produce
>> two
>> separate outputs as a result -- one for statistics, and one for budgeting.
>> Because the event log I'm reading in can be massive, I would like to only
>> process it once.  But the outputs will each be read by further M/R
>> processes, and will be significantly different from each other.
>>
>> I've looked at MultipleOutputFormat, but it seems to just want to
>> partition
>> data that looks basically the same into this file or that.
>>
>> What's the proper way to do this?  Ideally, whatever solution I implement
>> should be atomic, in that if any one of the writes fails, neither output
>> will be produced.
>>
>>
>> AdTHANKSvance,
>> Mark
>>
>
> --
> Jason Venner
> Attributor - Program the Web 
> Attributor is hiring Hadoop Wranglers and coding wizards, contact if
> interested
>