Re: Multiple k,v pairs from a single map - possible?

2009-04-02 Thread 皮皮
Thank you very much. This is what I am looking for.

2009/3/27 Brian MacKay 

>
> Amandeep,
>
> Add this to your driver.
>
> MultipleOutputs.addNamedOutput(conf, "PHONE",TextOutputFormat.class,
> Text.class, Text.class);
>
> MultipleOutputs.addNamedOutput(conf, "NAME", TextOutputFormat.class,
> Text.class, Text.class);
>
>
>
> And in your reducer
>
>  private MultipleOutputs mos;
>
> public void reduce(Text key, Iterator values,
>OutputCollector output, Reporter reporter) {
>
>
>  // namedOutPut = either PHONE or NAME
>
>while (values.hasNext()) {
>String value = values.next().toString();
>mos.getCollector(namedOutPut, reporter).collect(
>new Text(value), new Text(othervals));
>}
>}
>
>@Override
>public void configure(JobConf conf) {
>super.configure(conf);
>mos = new MultipleOutputs(conf);
>}
>
>public void close() throws IOException {
>mos.close();
>}
>
>
>
> By the way, have you had a chance to post your Oracle fix to
> DBInputFormat ?
> If so, what is the Jira tag #?
>
> Brian
>
> -Original Message-
> From: Amandeep Khurana [mailto:ama...@gmail.com]
> Sent: Friday, March 27, 2009 5:46 AM
> To: core-user@hadoop.apache.org
> Subject: Multiple k,v pairs from a single map - possible?
>
> Is it possible to output multiple key value pairs from a single map
> function
> run?
>
> For example, the mapper outputting two key/value pairs
> simultaneously...
>
> Can I write multiple output.collect(...) commands?
>
> Amandeep
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
>
>
>
>
>
>
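
To the original question: yes, a single map() call may invoke output.collect() any number of times. A minimal sketch using the old 0.18/0.19 API (the tab-separated name/phone record layout is an assumed example, not from the thread):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TwoPairMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Assumed record layout: name <TAB> phone
    String[] fields = line.toString().split("\t");
    if (fields.length < 2) {
      return;                 // skip malformed lines
    }
    // Two collect() calls from one map() invocation are perfectly legal.
    output.collect(new Text("NAME"), new Text(fields[0]));
    output.collect(new Text("PHONE"), new Text(fields[1]));
  }
}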


Re: Using HDFS to serve www requests

2009-04-02 Thread Snehal Nagmote

Can you please explain what exactly adding an NIO bridge means and how it can
be done, and what the advantages would be in this case?




Steve Loughran wrote:
> 
> Edward Capriolo wrote:
>> It is a little more natural to connect to HDFS from apache tomcat.
>> This will allow you to skip the FUSE mounts and just use the HDFS-API.
>> 
>> I have modified this code to run inside tomcat.
>> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
>> 
>> I will not testify to how well this setup will perform under internet
>> traffic, but it does work.
>> 
> 
> If someone adds an NIO bridge to hadoop filesystems then it would be 
> easier; leaving you only with the performance issues.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-HDFS-to-serve-www-requests-tp22725659p22862098.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: skip setting output path for a sequential MR job..

2009-04-02 Thread some speed
Removing the file programmatically is doing the trick for me. Thank you all
for your answers and help :-)

On Tue, Mar 31, 2009 at 12:25 AM, some speed  wrote:

> Hello everyone,
>
> Is it necessary to redirect the output of reduce to a file? When I try
> to run the same M-R job more than once, it throws an error that the output
> file already exists. I don't want to use command line args, so I hard-coded
> the file name into the program.
>
> So, is there a way I could delete a file on HDFS programmatically?
> Or can I skip setting an output file path and just have my output print to
> the console?
> Or can I just append to an existing file?
>
>
> Any help is appreciated. Thanks.
>
> -Sharath
>
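
For anyone finding this in the archive, the programmatic delete is a couple of lines against the FileSystem API. A sketch for the old JobConf-style driver (the output path and job class names are placeholders):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RunWithCleanOutput {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RunWithCleanOutput.class);
    // ... set input path, mapper, reducer, key/value classes here ...

    Path out = new Path("/user/sharath/output");   // hypothetical hard-coded path
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(out)) {
      fs.delete(out, true);      // recursively remove the previous run's output
    }
    FileOutputFormat.setOutputPath(conf, out);

    JobClient.runJob(conf);
  }
}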


Re: HDFS data block clarification

2009-04-02 Thread Owen O'Malley
The last block of an HDFS file only occupies the required space. So a
4k file only consumes 4k on disk.


-- Owen

On Apr 2, 2009, at 18:44, javateck javateck  wrote:


Can someone tell whether a file will occupy one or more blocks? For
example, the default block size is 64MB, and if I save a 4k file to HDFS,
will the 4K file occupy the whole 64MB block alone? So in this case, do I
need to configure the block size to 10k if most of my files are less than
10K?

thanks,


Re: HDFS data block clarification

2009-04-02 Thread jason hadoop
HDFS only allocates as much physical disk space as is required for a block, up
to the block size for the file (+ some header data).
So if you write a 4k file, the single block for that file will be around 4k.

If you write a 65M file, there will be two blocks, one of roughly 64M, and
one of roughly 1M.

You can verify this yourself by, on a datanode, running:
find ${dfs.data.dir} -iname blk'*' -type f -ls

Note: the above command will only work as expected if a single directory is
defined for dfs block storage, and ${dfs.data.dir} is replaced with the
effective value of the configuration parameter dfs.data.dir from your
hadoop configuration.
dfs.data.dir is commonly defined as ${hadoop.tmp.dir}/dfs/data.

The following rather insane bash shell command will print out the value of
dfs.data.dir on the local machine.
It must be run from the hadoop installation directory, and makes 2 temporary
names in /tmp/f.PID.input and /tmp/f.PID.output
This little ditty relies on the fact that the configuration parameters are
pushed into the process environment for streaming jobs.

Streaming Rocks!

B=/tmp/f.$$;
date > ${B}.input;
rmdir ${B}.output;
bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
  -D fs.default.name=file:/// -jt local \
  -input ${B}.input -output ${B}.output \
  -numReduceTasks 0 -mapper env;
grep dfs.data.dir ${B}.output/part-0;
rm ${B}.input;
rm -rf ${B}.output

On Thu, Apr 2, 2009 at 6:44 PM, javateck javateck wrote:

>  Can someone tell whether a file will occupy one or more blocks? For
> example, the default block size is 64MB, and if I save a 4k file to HDFS,
> will the 4K file occupy the whole 64MB block alone? So in this case, do I
> need to configure the block size to 10k if most of my files are less than
> 10K?
>
> thanks,
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
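
If you would rather check from the client side than poke around ${dfs.data.dir}, the FileSystem API will report how many blocks a file actually has and how long each one is. A sketch (FileSystem.getFileBlockLocations with this signature exists in the 0.18+ releases, as far as I recall):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path(args[0]);

    FileStatus stat = fs.getFileStatus(p);
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());

    System.out.println(p + ": " + stat.getLen() + " bytes, "
        + blocks.length + " block(s), nominal block size " + stat.getBlockSize());
    for (BlockLocation b : blocks) {
      // A 4k file shows a single block of length 4k, not 64MB.
      System.out.println("  offset " + b.getOffset() + ", length " + b.getLength());
    }
  }
}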


HDFS data block clarification

2009-04-02 Thread javateck javateck
  Can someone tell whether a file will occupy one or more blocks? For
example, the default block size is 64MB, and if I save a 4k file to HDFS,
will the 4K file occupy the whole 64MB block alone? So in this case, do I
need to configure the block size to 10k if most of my files are less than
10K?

thanks,


Re: Checking if a streaming job failed

2009-04-02 Thread Miles Osborne
here is how i do it (in perl).  hadoop streaming is actually called by
a shell script, which in this case expects compressed input and
produces compressed output.  but you get the idea:

(the mailer had messed up the formatting somewhat)
sub runStreamingCompInCompOut {
my $mapper = shift @_;
my $reducer = shift @_;
my $inDir = shift @_;
my $outDir = shift @_;
my $numMappers = shift @_;
my $numReducers = shift @_;
my $jobName = $runName . ":" . shift @_;
my $cmd = "sh runStreamingCompInCompOut.sh $mapper $reducer $inDir "
. "$outDir $jobName $numMappers $numReducers &> /tmp/.trace";
print STDERR "Running: $cmd\n";
system $cmd;
open IN, "/tmp/.trace" or die "can't open streaming trace";
while(!eof(IN)){
my $line = <IN>;
(my $date,my $time,my $status) = split(/\s+/,$line);
if ($status eq "ERROR") {
print STDERR "command: $cmd failed\n";
exit(-1);
}
}
}


2009/4/3 Mayuran Yogarajah :
> Hello, does anyone know how I can check if a streaming job (in Perl) has
> failed or succeeded? The only way I can see at the moment is to check
> the web interface for that jobID and parse out the '*Status:*' value.
>
> Is it not possible to do this using 'hadoop job -status' ? I see there is a
> count
> for failed map/reduce tasks, but map/reduce tasks failing is normal (or so
> I thought).  I am under the impression that if a task fails it will simply
> be
> reassigned to a different node.  Is this not the case?  If this is normal
> then I
> can't reliably use this count to check if the job as a whole failed or
> succeeded.
>
> Any feedback is greatly appreciated.
>
> thanks,
> M
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
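
If screen-scraping the web UI or parsing a trace file feels fragile, the job's final state can also be asked of the JobTracker through the Java client API; a sketch (JobClient.getJob(String) and RunningJob.isSuccessful() are in the 0.18 line, as far as I know). For streaming specifically, the exit status of the bin/hadoop streaming command itself should also be non-zero when the job fails.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobSucceeded {
  public static void main(String[] args) throws Exception {
    // Assumes the cluster's hadoop-site.xml is on the classpath.
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(args[0]);   // e.g. job_200904011612_0025
    if (job == null) {
      System.err.println("Unknown job: " + args[0]);
      System.exit(2);
    }
    if (job.isComplete() && job.isSuccessful()) {
      System.out.println("SUCCEEDED");
    } else {
      System.out.println(job.isComplete() ? "FAILED" : "STILL RUNNING");
      System.exit(1);
    }
  }
}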


Re: RPM spec file for 0.19.1

2009-04-02 Thread Christophe Bisciglia
Hey Ian, we are totally fine with this - the only reason we didn't
contribute the SPEC file is that it is the output of our internal
build system, and we don't have the bandwidth to properly maintain
multiple RPMs.

That said, we chatted about this a bit today, and were wondering if
the community would like us to host RPMs for all releases in our
"devel" repository. We can't stand behind these from a reliability
angle the same way we can with our "blessed" RPMs, but it's a
manageable amount of additional work to have our build system spit
those out as well.

If you'd like us to do this, please add a "me too" to this page:
http://www.getsatisfaction.com/cloudera/topics/should_we_release_host_rpms_for_all_releases

We could even skip the branding on the "devel" releases :-)

Cheers,
Christophe

On Thu, Apr 2, 2009 at 12:46 PM, Ian Soboroff  wrote:
>
> I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615)
> with a spec file for building a 0.19.1 RPM.
>
> I like the idea of Cloudera's RPM file very much.  In particular, it has
> nifty /etc/init.d scripts and RPM is nice for managing updates.
> However, it's for an older, patched version of Hadoop.
>
> This spec file is actually just Cloudera's, with suitable edits.  The
> spec file does not contain an explicit license... if Cloudera have
> strong feelings about it, let me know and I'll pull the JIRA attachment.
>
> The JIRA includes instructions on how to roll the RPMs yourself.  I
> would have attached the SRPM but they're too big for JIRA.  I can offer
> noarch RPMs built with this spec file if someone wants to host them.
>
> Ian
>
>


Checking if a streaming job failed

2009-04-02 Thread Mayuran Yogarajah

Hello, does anyone know how I can check if a streaming job (in Perl) has
failed or succeeded? The only way I can see at the moment is to check
the web interface for that jobID and parse out the '*Status:*' value.

Is it not possible to do this using 'hadoop job -status' ? I see there 
is a count

for failed map/reduce tasks, but map/reduce tasks failing is normal (or so
I thought).  I am under the impression that if a task fails it will 
simply be
reassigned to a different node.  Is this not the case?  If this is 
normal then I
can't reliably use this count to check if the job as a whole failed or 
succeeded.


Any feedback is greatly appreciated.

thanks,
M


Re: Hardware - please sanity check?

2009-04-02 Thread Philip Zeyliger
>
>
> I've been assuming that RAID is generally a good idea (disks fail quite
> often, and it's cheaper to hotswap a drive than to rebuild an entire box).
>

Hadoop data nodes are often configured without RAID (i.e., "JBOD" = Just a
Bunch of Disks)--HDFS already provides for the data redundancy.  Also, if
you stripe across disks, you're liable to be as slow as the slowest of your
disks, so data nodes are typically configured to point to multiple disks.

-- Philip


Question about upgrading

2009-04-02 Thread Usman Waheed

Hello,

I have a 5 node cluster with one master node. I am upgrading from 16.4 to
18.3, but am a little confused about whether I am doing it the right way. I
read up on the documentation and how to use the -upgrade switch, but want to
make sure I haven't missed any step.


First I took down the cluster by issuing stop-all.sh on the master node.
I installed the new hadoop by untarring the tarball and then copied the
config files from the old setup 16.4/conf/* into 18.3/conf/*. Changed some
symlinks to point to the new version. Performed this step on all the
master and slave nodes.


Then I went on the master and started the master node only with the
-upgrade switch, using the command in the new hadoop version directory.
Waited for everything to go smoothly. No errors were reported; I didn't
change any settings in the config files, just copied from the old
version's conf directory.


Then I started the other data nodes; it should work, right? Did I miss
anything? I am the sysadmin for this setup, so I want to make sure I do this
right and don't have to reformat the file system; I can't afford to lose
the data on every upgrade. I want to keep the file system as is and upgrade
from 16.4 to 18.3.


If I missed any important detail or steps, please advise.

Thanks,
Usman


Re: Hardware - please sanity check?

2009-04-02 Thread Patrick Angeles
I had a similar curiosity, but more regarding disk speed.
Can I assume linear improvement between 7200rpm -> 10k rpm -> 15k rpm? How
much of a bottleneck is disk access?

Another question is regarding hardware redundancy. What is the relative
value of the following:
- RAID / hot-swappable drives
- dual NICs
- redundant backplane
- redundant power supply
- UPS

I've been assuming that RAID is generally a good idea (disks fail quite
often, and it's cheaper to hotswap a drive than to rebuild an entire box).
Dual NICs are also good, as both can be used at the same time. Everything
else is not necessary in a Hadoop cluster.

On Thu, Apr 2, 2009 at 11:33 AM, tim robertson wrote:

> Thanks Miles,
>
> Thus far most of my work has been on EC2 large instances and *mostly*
> my code is not memory intensive (I sometimes do joins against polygons
> and hold Geospatial indexes in memory, but am aware of keeping things
> within the -Xmx for this).
> I am mostly  looking to move routine data processing and
> transformation (lots of distinct, count and group by operations) off a
> chunky mysql DB (200million rows and growing) which gets all locked
> up.
>
> We have gigabit switches.
>
> Cheers
>
> Tim
>
>
>
> On Thu, Apr 2, 2009 at 4:15 PM, Miles Osborne  wrote:
> > make sure you also have a fast switch, since you will be transmitting
> > data across your network and this will come to bite you otherwise
> >
> > (roughly, you need one core per hadoop-related job, each mapper, task
> > tracker etc;  the per-core memory may be too small if you are doing
> > anything memory-intensive.  we have 8-core boxes with 50 -- 33 GB RAM
> > and 8 x 1 TB disks on each one;  one box however just has 16 GB of RAM
> > and it routinely falls over when we run jobs on it)
> >
> > Miles
> >
> > 2009/4/2 tim robertson :
> >> Hi all,
> >>
> >> I am not a hardware guy but about to set up a 10 node cluster for some
> >> processing of (mostly) tab files, generating various indexes and
> >> researching HBase, Mahout, pig, hive etc.
> >>
> >> Could someone please sanity check that these specs look sensible?
> >> [I know 4 drives would be better but price is a factor (second hand
> >> not an option, hosting is not either as there is very good bandwidth
> >> provided)]
> >>
> >> Something along the lines of:
> >>
> >> Dell R200 (8GB is max memory)
> >> Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB
> >> 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs)
> >> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive
> >>
> >>
> >> Dell R300 (can be expanded to 24GB RAM)
> >> Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS
> >> 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs)
> >> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive
> >>
> >>
> >> If there is a major flaw please can you let me know.
> >>
> >> Thanks,
> >>
> >> Tim
> >> (not a hardware guy ;o)
> >>
> >
> >
> >
> > --
> > The University of Edinburgh is a charitable body, registered in
> > Scotland, with registration number SC005336.
> >
>


Lost TaskTracker Errors

2009-04-02 Thread Bhupesh Bansal
Hey Folks, 

For the last 2-3 days I have been seeing many of these errors popping up in our
hadoop cluster:

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing

The JobTracker logs don't have any more info, and the task tracker logs are
clean.

The failures occurred with these symptoms:
1. Datanodes will start timing out
2. HDFS will get extremely slow (hdfs -ls will take like 2 mins vs 1s in
normal mode)

The datanode logs on failing tasktracker nodes are filled up with:

2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(172.16.216.64:50010,
storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075,
ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to
172.16.216.62:50010 got java.net.SocketTimeoutException: 48 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689
remote=/172.16.216.62:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
    at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
    at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
    at java.lang.Thread.run(Thread.java:619)


We are running a 10 Node cluster (hadoop-0.18.1) on Dual Quad core boxes (8G
RAM) with these properties
1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 10240
6. dfs.datanode.max.xcievers = 512

The map jobs write a ton of data for each record. Will increasing
"dfs.datanode.handler.count" help in this case? What other
configuration changes can I try?


Best
Bhupesh




Re: Running MapReduce without setJar

2009-04-02 Thread Farhan Husain
I did all of them, i.e. I used setMapClass, setReduceClass and new
JobConf(MapReduceWork.class), but it still cannot run the job without a jar
file. I understand the reason: it looks for those classes inside a jar.
But I think there should be some better way to find those classes without
using a jar, though I am not sure whether it is possible at all.

On Thu, Apr 2, 2009 at 2:56 PM, Rasit OZDAS  wrote:

> You can point to them by using
> conf.setMapClass(..) and conf.setReduceClass(..)  - or something
> similar, I don't have the source nearby.
>
> But something weird has happened to my code. It runs locally when I
> start it as java process (tries to find input path locally). I'm now
> using trunk, maybe something has changed with new version. With
> version 0.19 it was fine.
> Can somebody point out a clue?
>
> Rasit
>



-- 
Mohammad Farhan Husain
Research Assistant
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas


Re: Running MapReduce without setJar

2009-04-02 Thread Rasit OZDAS
You can point to them by using
conf.setMapClass(..) and conf.setReduceClass(..)  - or something
similar, I don't have the source nearby.

But something weird has happened to my code. It runs locally when I
start it as java process (tries to find input path locally). I'm now
using trunk, maybe something has changed with new version. With
version 0.19 it was fine.
Can somebody point out a clue?

Rasit


Re: Multiple k,v pairs from a single map - possible?

2009-04-02 Thread Amandeep Khurana
Here's the JIRA for the Oracle fix.
https://issues.apache.org/jira/browse/HADOOP-5616

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Mar 27, 2009 at 5:18 AM, Brian MacKay
wrote:

>
> Amandeep,
>
> Add this to your driver.
>
> MultipleOutputs.addNamedOutput(conf, "PHONE",TextOutputFormat.class,
> Text.class, Text.class);
>
> MultipleOutputs.addNamedOutput(conf, "NAME", TextOutputFormat.class,
> Text.class, Text.class);
>
>
>
> And in your reducer
>
>  private MultipleOutputs mos;
>
> public void reduce(Text key, Iterator values,
>OutputCollector output, Reporter reporter) {
>
>
>  // namedOutPut = either PHONE or NAME
>
>while (values.hasNext()) {
>String value = values.next().toString();
>mos.getCollector(namedOutPut, reporter).collect(
>new Text(value), new Text(othervals));
>}
>}
>
>@Override
>public void configure(JobConf conf) {
>super.configure(conf);
>mos = new MultipleOutputs(conf);
>}
>
>public void close() throws IOException {
>mos.close();
>}
>
>
>
> By the way, have you had a chance to post your Oracle fix to
> DBInputFormat ?
> If so, what is the Jira tag #?
>
> Brian
>
> -Original Message-
> From: Amandeep Khurana [mailto:ama...@gmail.com]
> Sent: Friday, March 27, 2009 5:46 AM
> To: core-user@hadoop.apache.org
> Subject: Multiple k,v pairs from a single map - possible?
>
> Is it possible to output multiple key value pairs from a single map
> function
> run?
>
> For example, the mapper outputting two key/value pairs
> simultaneously...
>
> Can I write multiple output.collect(...) commands?
>
> Amandeep
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
>
>
>
>
>
>
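
Since the snippet above leaves namedOutPut and othervals to the reader, here is a self-contained sketch of a reducer along the same lines (old API; the rule for choosing between the PHONE and NAME outputs is an invented example, and the driver still needs the two addNamedOutput() calls shown above):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class SplitReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    super.configure(conf);
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Invented rule: keys starting with "P" go to the PHONE output,
    // everything else to the NAME output.
    String namedOutput = key.toString().startsWith("P") ? "PHONE" : "NAME";
    while (values.hasNext()) {
      mos.getCollector(namedOutput, reporter).collect(key, values.next());
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }
}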


RPM spec file for 0.19.1

2009-04-02 Thread Ian Soboroff

I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615)
with a spec file for building a 0.19.1 RPM.

I like the idea of Cloudera's RPM file very much.  In particular, it has
nifty /etc/init.d scripts and RPM is nice for managing updates.
However, it's for an older, patched version of Hadoop.

This spec file is actually just Cloudera's, with suitable edits.  The
spec file does not contain an explicit license... if Cloudera have
strong feelings about it, let me know and I'll pull the JIRA attachment.

The JIRA includes instructions on how to roll the RPMs yourself.  I
would have attached the SRPM but they're too big for JIRA.  I can offer
noarch RPMs built with this spec file if someone wants to host them.

Ian



Re: Amazon Elastic MapReduce

2009-04-02 Thread Peter Skomoroch
Kevin,

The API accepts any arguments you can pass in the standard jobconf for
Hadoop 18.3, it is pretty easy to convert over an existing jobflow to a JSON
job description that will run on the service.

-Pete

On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson  wrote:

> So if I understand correctly, this is an automated system to bring up a
> hadoop cluster on EC2, import some data from S3, run a job flow, write the
> data back to S3, and bring down the cluster?
>
> This seems like a pretty good deal. At the pricing they are offering,
> unless
> I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
> cheaper to use this new service.
>
> Does this use an existing Hadoop job control API, or do I need to write my
> flows to conform to Amazon's API?
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread He Chen
It seems like the InMemoryFileSystem class has been deprecated in Hadoop
0.19.1. Why?

I want to reuse the result of reduce as the next timestep's map input. Cascading
does not work, because the data of each step is dependent; I use each
timestep's mapreduce job as a synchronization point. If the InMemoryFileSystem is
deprecated, how can I reduce the I/O for each timestep's mapreduce job?
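
In case it helps others with the same pattern: without InMemoryFileSystem the usual approach is simply to chain one job per timestep, pointing each step's input at the previous step's output directory, and to cut the intermediate I/O cost by lowering the output replication (a sketch; the paths are made up and the mapper/reducer setup is elided):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    int steps = Integer.parseInt(args[0]);
    Path current = new Path("/sim/step0");     // hypothetical initial input

    for (int i = 1; i <= steps; i++) {
      JobConf conf = new JobConf(IterativeDriver.class);
      conf.setJobName("timestep-" + i);
      // ... set the simulation's mapper/reducer and key/value classes here ...
      conf.setInt("dfs.replication", 1);       // intermediate data: one replica is enough

      FileInputFormat.setInputPaths(conf, current);
      Path next = new Path("/sim/step" + i);
      FileOutputFormat.setOutputPath(conf, next);

      JobClient.runJob(conf);                  // blocks, so the steps stay synchronized
      current = next;                          // this step's output feeds the next step
    }
  }
}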

2009/4/2 Farhan Husain 

> Is there a way to implement some OutputCollector that can do what Andy
> wants
> to do?
>
> On Thu, Apr 2, 2009 at 10:21 AM, Rasit OZDAS  wrote:
>
> > Andy, I didn't try this feature. But I know that Yahoo had a
> > performance record with this file format.
> > I came across a file system included in hadoop code (probably that
> > one) when searching the source code.
> > Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem
> > But if you have a lot of big files, this approach won't be suitable I
> > think.
> >
> > Maybe someone can give further info.
> >
> > 2009/4/2 andy2005cst :
> > >
> > > thanks for your reply. Let me explain more clearly, since Map Reduce is
> > just
> > > one step of my program, I need to use the output of reduce for furture
> > > computation, so i do not need to want to wirte the output into disk,
> but
> > > wanna to get the collection or list of the output in RAM. if it
> directly
> > > wirtes into disk, I have to read it back into RAM again.
> > > you have mentioned a special file format, will you please show me what
> is
> > > it? and give some example if possible.
> > >
> > > thank you so much.
> > >
> > >
> > > Rasit OZDAS wrote:
> > >>
> > >> Hi, hadoop is normally designed to write to disk. There are a special
> > file
> > >> format, which writes output to RAM instead of disk.
> > >> But I don't have an idea if it's what you're looking for.
> > >> If what you said exists, there should be a mechanism which sends
> output
> > as
> > >> objects rather than file content across computers, as far as I know
> > there
> > >> is
> > >> no such feature yet.
> > >>
> > >> Good luck.
> > >>
> > >> 2009/4/2 andy2005cst 
> > >>
> > >>>
> > >>> I need to use the output of the reduce, but I don't know how to do.
> > >>> use the wordcount program as an example if i want to collect the
> > >>> wordcount
> > >>> into a hashtable for further use, how can i do?
> > >>> the example just show how to let the result onto disk.
> > >>> myemail is : andy2005...@gmail.com
> > >>> looking forward your help. thanks a lot.
> > >>> --
> > >>> View this message in context:
> > >>>
> >
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
> > >>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> M. Raşit ÖZDAŞ
> > >>
> > >>
> > >
> > > --
> > > View this message in context:
> >
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html
> > > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >
> > >
> >
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>
>
>
> --
> Mohammad Farhan Husain
> Research Assistant
> Department of Computer Science
> Erik Jonsson School of Engineering and Computer Science
> University of Texas at Dallas
>



-- 
Chen He
RCF CSE Dept.
University of Nebraska-Lincoln
US


Re: Amazon Elastic MapReduce

2009-04-02 Thread Kevin Peterson
So if I understand correctly, this is an automated system to bring up a
hadoop cluster on EC2, import some data from S3, run a job flow, write the
data back to S3, and bring down the cluster?

This seems like a pretty good deal. At the pricing they are offering, unless
I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
cheaper to use this new service.

Does this use an existing Hadoop job control API, or do I need to write my
flows to conform to Amazon's API?


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Farhan Husain
Is there a way to implement some OutputCollector that can do what Andy wants
to do?

On Thu, Apr 2, 2009 at 10:21 AM, Rasit OZDAS  wrote:

> Andy, I didn't try this feature. But I know that Yahoo had a
> performance record with this file format.
> I came across a file system included in hadoop code (probably that
> one) when searching the source code.
> Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem
> But if you have a lot of big files, this approach won't be suitable I
> think.
>
> Maybe someone can give further info.
>
> 2009/4/2 andy2005cst :
> >
> > Thanks for your reply. Let me explain more clearly: since Map Reduce is
> > just one step of my program, I need to use the output of reduce for further
> > computation, so I do not want to write the output to disk, but
> > want to get the collection or list of the output in RAM. If it directly
> > writes to disk, I have to read it back into RAM again.
> > You have mentioned a special file format; will you please show me what it
> > is, and give some example if possible?
> >
> > thank you so much.
> >
> >
> > Rasit OZDAS wrote:
> >>
> >> Hi, hadoop is normally designed to write to disk. There are a special
> file
> >> format, which writes output to RAM instead of disk.
> >> But I don't have an idea if it's what you're looking for.
> >> If what you said exists, there should be a mechanism which sends output
> as
> >> objects rather than file content across computers, as far as I know
> there
> >> is
> >> no such feature yet.
> >>
> >> Good luck.
> >>
> >> 2009/4/2 andy2005cst 
> >>
> >>>
> >>> I need to use the output of the reduce, but I don't know how to do.
> >>> use the wordcount program as an example if i want to collect the
> >>> wordcount
> >>> into a hashtable for further use, how can i do?
> >>> the example just show how to let the result onto disk.
> >>> myemail is : andy2005...@gmail.com
> >>> looking forward your help. thanks a lot.
> >>> --
> >>> View this message in context:
> >>>
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
> >>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>>
> >>>
> >>
> >>
> >> --
> >> M. Raşit ÖZDAŞ
> >>
> >>
> >
> > --
> > View this message in context:
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
Mohammad Farhan Husain
Research Assistant
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas


Re: A bizarre problem in reduce method

2009-04-02 Thread Farhan Husain
Thanks Rasit for your suggestion. Actually, I should have let the group know
earlier that I solved the problem and it had nothing to do with the reduce
method. I used my reducer class as the combiner too which is not appropriate
in this case. I just got rid of the combiner and everything works fine now.
I think the Map/Reduce tutorial in hadoop's website should talk more about
the combiner. In the word count example the reducer can work as a combiner
but not in all other problems. This should be highlighted a little bit more
in the tutorial.
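
For readers hitting the same symptom: a combiner may run zero, one or several times on partial value lists, so reusing the reducer as a combiner is only safe when the reduce logic is associative and commutative and its output types match the map output types. A word-count-style reducer like the sketch below qualifies; a reducer that concatenates values and appends a count, as in the job discussed here, does not.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Safe to register with conf.setCombinerClass(SumReducer.class): summing
// partial groups and then summing the partial sums gives the same answer
// as summing everything at once.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}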

On Thu, Apr 2, 2009 at 8:50 AM, Rasit OZDAS  wrote:

> Hi, Husain,
>
> 1. You can use a boolean control in your code.
>   boolean hasAlreadyOned = false;
>int iCount = 0;
>   String sValue;
>   while (values.hasNext()) {
>   sValue = values.next().toString();
>   iCount++;
>if (sValue.equals("1"))
> hasAlreadyOned = true;
>
>   if (!hasAlreadyOned)
> sValues += "\t" + sValue;
>   }
>   ...
>
> 2. You're actually controlling for 3 elements, not 2. You should use  if
> (iCount == 1)
>
> 2009/4/1 Farhan Husain 
>
> > Hello All,
> >
> > I am facing some problems with a reduce method I have written which I
> > cannot
> > understand. Here is the method:
> >
> >@Override
> >public void reduce(Text key, Iterator values,
> > OutputCollector output, Reporter reporter)
> >throws IOException {
> >String sValues = "";
> >int iCount = 0;
> >String sValue;
> >while (values.hasNext()) {
> >sValue = values.next().toString();
> >iCount++;
> >sValues += "\t" + sValue;
> >
> >}
> >sValues += "\t" + iCount;
> >//if (iCount == 2)
> >output.collect(key, new Text(sValues));
> >}
> >
> > The output of the code is like the following:
> >
> > D0U0:GraduateStudent0lehigh:GraduateStudent11
>  1
> > D0U0:GraduateStudent1lehigh:GraduateStudent11
>  1
> > D0U0:GraduateStudent10lehigh:GraduateStudent11
>  1
> > D0U0:GraduateStudent100lehigh:GraduateStudent11
> >  1
> > D0U0:GraduateStudent101lehigh:GraduateStudent1
> > D0U0:GraduateCourse0121
> > D0U0:GraduateStudent102lehigh:GraduateStudent11
> >  1
> > D0U0:GraduateStudent103lehigh:GraduateStudent11
> >  1
> > D0U0:GraduateStudent104lehigh:GraduateStudent11
> >  1
> > D0U0:GraduateStudent105lehigh:GraduateStudent11
> >  1
> >
> > The problem is there cannot be so many 1's in the output value. The
> output
> > which I expect should be like this:
> >
> > D0U0:GraduateStudent0lehigh:GraduateStudent1
> > D0U0:GraduateStudent1lehigh:GraduateStudent1
> > D0U0:GraduateStudent10lehigh:GraduateStudent1
> > D0U0:GraduateStudent100lehigh:GraduateStudent1
> > D0U0:GraduateStudent101lehigh:GraduateStudent
> > D0U0:GraduateCourse02
> > D0U0:GraduateStudent102lehigh:GraduateStudent1
> > D0U0:GraduateStudent103lehigh:GraduateStudent1
> > D0U0:GraduateStudent104lehigh:GraduateStudent1
> > D0U0:GraduateStudent105lehigh:GraduateStudent1
> >
> > If I do not append the iCount variable to sValues string, I get the
> > following output:
> >
> > D0U0:GraduateStudent0lehigh:GraduateStudent
> > D0U0:GraduateStudent1lehigh:GraduateStudent
> > D0U0:GraduateStudent10lehigh:GraduateStudent
> > D0U0:GraduateStudent100lehigh:GraduateStudent
> > D0U0:GraduateStudent101lehigh:GraduateStudent
> > D0U0:GraduateCourse0
> > D0U0:GraduateStudent102lehigh:GraduateStudent
> > D0U0:GraduateStudent103lehigh:GraduateStudent
> > D0U0:GraduateStudent104lehigh:GraduateStudent
> > D0U0:GraduateStudent105lehigh:GraduateStudent
> >
> > This confirms that there is no 1's after each of those values (which I
> > already know from the intput data). I do not know why the output is
> > distorted like that when I append the iCount to sValues (like the given
> > code). Can anyone help in this regard?
> >
> > Now comes the second problem which is equally perplexing. Actually, the
> > reduce method which I want to run is like the following:
> >
> >@Override
> >public void reduce(Text key, Iterator values,
> > OutputCollector output, Reporter reporter)
> >throws IOException {
> >String sValues = "";
> >int iCount = 0;
> >String sValue;
> >while (values.hasNext()) {
> >sValue = values.next().toString();
> >iCount++;
> >sValues += "\t" + sValue;
> >
> >}
> >s

Re: Running MapReduce without setJar

2009-04-02 Thread Farhan Husain
Does this class need to have the mapper and reducer classes too?

On Wed, Apr 1, 2009 at 1:52 PM, javateck javateck wrote:

> you can run from java program:
>
>JobConf conf = new JobConf(MapReduceWork.class);
>
>// setting your params
>
>JobClient.runJob(conf);
>
>
> On Wed, Apr 1, 2009 at 11:42 AM, Farhan Husain  wrote:
>
> > Can I get rid of the whole jar thing? Is there any way to run map reduce
> > programs without using a jar? I do not want to use "hadoop jar ..."
> either.
> >
> > On Wed, Apr 1, 2009 at 1:10 PM, javateck javateck  > >wrote:
> >
> > > I think you need to set a property (mapred.jar) inside hadoop-site.xml,
> > > then
> > > you don't need to hardcode in your java code, and it will be fine.
> > > But I don't know if there is any way that we can set multiple jars,
> since
> > a
> > > lot of times our own mapreduce class needs to reference other jars.
> > >
> > > On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain 
> > wrote:
> > >
> > > > Hello,
> > > >
> > > > Can anyone tell me if there is any way running a map-reduce job from
> a
> > > java
> > > > program without specifying the jar file by JobConf.setJar() method?
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Mohammad Farhan Husain
> > > > Research Assistant
> > > > Department of Computer Science
> > > > Erik Jonsson School of Engineering and Computer Science
> > > > University of Texas at Dallas
> > > >
> > >
> >
> >
> >
> > --
> > Mohammad Farhan Husain
> > Research Assistant
> > Department of Computer Science
> > Erik Jonsson School of Engineering and Computer Science
> > University of Texas at Dallas
> >
>



-- 
Mohammad Farhan Husain
Research Assistant
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas
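
For the archive: to the best of my understanding the jar requirement comes from the cluster side, not the client side. JobConf(Class) only records which jar contains that class so the framework can ship it to the task JVMs; launched from a plain java command with no such jar, only the local runner can load the mapper and reducer, unless the classes already sit on every node's classpath. A sketch of the usual driver shape, with identity classes standing in for real ones:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MapReduceWork {
  public static void main(String[] args) throws Exception {
    // The class argument lets Hadoop locate the jar that contains it
    // (the same thing setJar() does explicitly).
    JobConf conf = new JobConf(MapReduceWork.class);
    conf.setJobName("map-reduce-work");

    conf.setMapperClass(IdentityMapper.class);     // real jobs: your own classes
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);    // TextInputFormat keys (byte offsets)
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}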


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Bryan Duxbury
I don't really see what the downside of reading it from disk is. A  
list of word counts should be pretty small on disk so it shouldn't  
take long to read it into a HashMap. Doing anything else is going to  
cause you to go a long way out of your way to end up with the same  
result.


-Bryan

On Apr 2, 2009, at 2:41 AM, andy2005cst wrote:



I need to use the output of the reduce, but I don't know how to do it.
Use the wordcount program as an example: if I want to collect the wordcount
into a hashtable for further use, how can I do that?
The example just shows how to write the result to disk.
My email is: andy2005...@gmail.com
Looking forward to your help. Thanks a lot.
--
View this message in context: http://www.nabble.com/HELP%3A-I-wanna- 
store-the-output-value-into-a-list-not-write-to-the-disk- 
tp22844277p22844277.html

Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
That seems interesting, we have 3 replications as default.
Is there a way to define, lets say, 1 replication for only job-specific files?

2009/4/2 Owen O'Malley :
>
> On Apr 2, 2009, at 2:41 AM, andy2005cst wrote:
>
>>
>> I need to use the output of the reduce, but I don't know how to do.
>> use the wordcount program as an example if i want to collect the wordcount
>> into a hashtable for further use, how can i do?
>
> You can use an output format and then an input format that uses a database,
> but in practice, the cost of writing to hdfs and reading it back is not a
> problem, especially if you set the replication of the output files to 1.
> (You'll need to re-run the job if you lose a node, but it will be fast.)
>
> -- Owen
>



-- 
M. Raşit ÖZDAŞ
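
On the replication question: as far as I know dfs.replication is a client-side, per-file setting, so it can be lowered just for one job's output, either by setting it on the JobConf before submitting (the tasks' DFS clients then create their files with that replication) or after the fact with FileSystem.setReplication(). A small sketch of both:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class LowReplicationOutput {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LowReplicationOutput.class);

    // Option 1: files written by this job's tasks get a single replica.
    conf.setInt("dfs.replication", 1);
    // ... configure and submit the job as usual ...

    // Option 2: lower the replication of an already-written output file.
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path(args[0]), (short) 1);
  }
}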


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Owen O'Malley


On Apr 2, 2009, at 2:41 AM, andy2005cst wrote:



I need to use the output of the reduce, but I don't know how to do it.
Use the wordcount program as an example: if I want to collect the wordcount
into a hashtable for further use, how can I do that?


You can use an output format and then an input format that uses a  
database, but in practice, the cost of writing to hdfs and reading it  
back is not a problem, especially if you set the replication of the  
output files to 1. (You'll need to re-run the job if you lose a node,  
but it will be fast.)


-- Owen


Re: HadoopConfig problem -Datanode not able to connect to the server

2009-04-02 Thread Rasit OZDAS
I have no idea, but there are many "use hostname instead of IP"
issues. Try once hostname instead of IP.

2009/3/26 mingyang :
> Check that your iptables is turned off.
>
> 2009/3/26 snehal nagmote 
>
>> hello,
>> We configured hadoop successfully, but after some days  its configuration
>> file from datanode( hadoop-site.xml) went off , and datanode was not coming
>> up ,so we again did the same configuration, its showing one datanode and
>> its
>> name as localhost rather than expected as either name of respected datanode
>> m/c or ip address of   actual datanode in ui interfece of hadoop.
>>
>> But the capacity shows as 80.0 GB (we have one namenode (40 GB) and one
>> datanode (40 GB)), which means the capacity is updated. We can browse the
>> filesystem, and it shows whatever directories we create on the namenode.
>>
>> But when we try to access the same through the datanode machine,
>> i.e. doing ssh and executing a series of commands, it is not able to connect
>> to the server, saying "retrying connect to the server":
>>
>> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: /
>> 172.16.6.102:21011. Already tried 0 time(s).
>>
>> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: /
>> 172.16.6.102:21011. Already tried 1 time(s)
>>
>>
>> Moreover, we added one more datanode and formatted the namenode, but that
>> datanode is not getting added. We do not understand what the problem is.
>>
>> Can configuration files on a datanode automatically get lost after some
>> days?
>>
>> I have one more doubt. According to my understanding, the namenode doesn't
>> store any data; it stores metadata about all the data. So when I execute
>> mkdir on the namenode machine and copy some files into it, the data is
>> actually stored on the datanode attached to it. Please correct me if I am
>> wrong; I am very new to hadoop.
>> So if I am able to view the data through the interface, it means the data is
>> properly stored on the respective datanode. So why does it show localhost as
>> the datanode name rather than the respective datanode's name?
>>
>> can you please help.
>>
>>
>> Regards,
>> Snehal Nagmote
>> IIIT hyderabad
>>
>
>
>
> --
> Best regards,
>
>
> Wang Mingyang (王明阳)
>



-- 
M. Raşit ÖZDAŞ


Re: Hardware - please sanity check?

2009-04-02 Thread tim robertson
Thanks Miles,

Thus far most of my work has been on EC2 large instances and *mostly*
my code is not memory intensive (I sometimes do joins against polygons
and hold Geospatial indexes in memory, but am aware of keeping things
within the -Xmx for this).
I am mostly  looking to move routine data processing and
transformation (lots of distinct, count and group by operations) off a
chunky mysql DB (200million rows and growing) which gets all locked
up.

We have gigabit switches.

Cheers

Tim



On Thu, Apr 2, 2009 at 4:15 PM, Miles Osborne  wrote:
> make sure you also have a fast switch, since you will be transmitting
> data across your network and this will come to bite you otherwise
>
> (roughly, you need one core per hadoop-related job, each mapper, task
> tracker etc;  the per-core memory may be too small if you are doing
> anything memory-intensive.  we have 8-core boxes with 50 -- 33 GB RAM
> and 8 x 1 TB disks on each one;  one box however just has 16 GB of RAM
> and it routinely falls over when we run jobs on it)
>
> Miles
>
> 2009/4/2 tim robertson :
>> Hi all,
>>
>> I am not a hardware guy but about to set up a 10 node cluster for some
>> processing of (mostly) tab files, generating various indexes and
>> researching HBase, Mahout, pig, hive etc.
>>
>> Could someone please sanity check that these specs look sensible?
>> [I know 4 drives would be better but price is a factor (second hand
>> not an option, hosting is not either as there is very good bandwidth
>> provided)]
>>
>> Something along the lines of:
>>
>> Dell R200 (8GB is max memory)
>> Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB
>> 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs)
>> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive
>>
>>
>> Dell R300 (can be expanded to 24GB RAM)
>> Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS
>> 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs)
>> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive
>>
>>
>> If there is a major flaw please can you let me know.
>>
>> Thanks,
>>
>> Tim
>> (not a hardware guy ;o)
>>
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
Andy, I didn't try this feature. But I know that Yahoo had a
performance record with this file format.
I came across a file system included in hadoop code (probably that
one) when searching the source code.
Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem
But if you have a lot of big files, this approach won't be suitable I think.

Maybe someone can give further info.

2009/4/2 andy2005cst :
>
> Thanks for your reply. Let me explain more clearly: since Map Reduce is just
> one step of my program, I need to use the output of reduce for further
> computation, so I do not want to write the output to disk, but
> want to get the collection or list of the output in RAM. If it directly
> writes to disk, I have to read it back into RAM again.
> You have mentioned a special file format; will you please show me what it
> is, and give some example if possible?
>
> thank you so much.
>
>
> Rasit OZDAS wrote:
>>
>> Hi, hadoop is normally designed to write to disk. There are a special file
>> format, which writes output to RAM instead of disk.
>> But I don't have an idea if it's what you're looking for.
>> If what you said exists, there should be a mechanism which sends output as
>> objects rather than file content across computers, as far as I know there
>> is
>> no such feature yet.
>>
>> Good luck.
>>
>> 2009/4/2 andy2005cst 
>>
>>>
>>> I need to use the output of the reduce, but I don't know how to do.
>>> use the wordcount program as an example if i want to collect the
>>> wordcount
>>> into a hashtable for further use, how can i do?
>>> the example just show how to let the result onto disk.
>>> myemail is : andy2005...@gmail.com
>>> looking forward your help. thanks a lot.
>>> --
>>> View this message in context:
>>> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> --
>> M. Raşit ÖZDAŞ
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
M. Raşit ÖZDAŞ


Re: Identify the input file for a failed mapper/reducer

2009-04-02 Thread Rasit OZDAS
Two quotes for this problem:

"Streaming map tasks should have a "map_input_file" environment
variable like the following:
map_input_file=hdfs://HOST/path/to/file"

"the value for map.input.file gives you the exact information you need."

(didn't try)
Rasit

2009/3/26 Jason Fennell :
> Is there a way to identify the input file a mapper was running on when
> it failed?  When a large job fails because of bad input lines I have
> to resort to rerunning the entire job to isolate a single bad line
> (since the log doesn't contain information on the file that that
> mapper was running on).
>
> Basically, I would like to be able to do one of the following:
> 1. Find the file that a mapper was running on when it failed
> 2. Find the block that a mapper was running on when it failed (and be
> able to find file names from block ids)
>
> I haven't been able to find any documentation on facilities to
> accomplish either (1) or (2), so I'm hoping someone on this list will
> have a suggestion.
>
> I am using the Hadoop streaming API on hadoop 0.18.2.
>
> -Jason
>



-- 
M. Raşit ÖZDAŞ
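
For Java jobs the same information is available inside the task itself: the mapper can read map.input.file from its JobConf in configure() and attach it to whatever it reports when a record cannot be parsed. A sketch (old API; the "parsing" here is only a stand-in):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TracingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String inputFile;

  @Override
  public void configure(JobConf conf) {
    // Set by the framework to the split's file, e.g. hdfs://host/path/to/file
    inputFile = conf.get("map.input.file");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    try {
      // ... real parsing would go here; this just echoes the line ...
      output.collect(new Text(inputFile), value);
    } catch (RuntimeException e) {
      // Make the bad file and byte offset visible in the task logs / web UI.
      throw new RuntimeException("bad record in " + inputFile
          + " at offset " + key.get(), e);
    }
  }
}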


Re: Amazon Elastic MapReduce

2009-04-02 Thread Chris K Wensel

You should check out the new pricing.

On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote:

Seems like I would have to pay additional money, so why not configure a hadoop
cluster in EC2 by myself? This has already been automated using scripts.






On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne   
wrote:



... and only in the US

Miles

2009/4/2 zhang jianfeng :

Does it support pig ?


On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel   
wrote:




FYI

Amazons new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/

And Cascading 1.0 supports it:
http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html

cheers,
ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/








--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Join Variation

2009-04-02 Thread jason hadoop
Probably be available in a week or so, as draft one isn't quite finished :)

On Thu, Apr 2, 2009 at 1:45 AM, Stefan Podkowinski  wrote:

> .. and is not yet available as an alpha book chapter. Any chance uploading
> it?
>
> On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop 
> wrote:
> > Just for fun, chapter 9 in my book is a work through of solving this
> class
> > of problem.
> >
> >
> > On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop  >wrote:
> >
> >> For the classic map/reduce job, you have 3 requirements.
> >>
> >> 1) a comparator that provide the keys in ip address order, such that all
> >> keys in one of your ranges, would be contiguous, when sorted with the
> >> comparator
> >> 2) a partitioner that ensures that all keys that should be together end
> up
> >> in the same partition
> >> 3) and output value grouping comparator that considered all keys in a
> >> specified range equal.
> >>
> >> The comparator only sorts by the first part of the key, the search file
> has
> >> a 2 part key begin/end the input data has just a 1 part key.
> >>
> >> A partitioner that new ahead of time the group sets in your search set,
> in
> >> the way that the tera sort example works would be ideal:
> >> ie: it builds an index of ranges from your seen set so that the ranges
> get
> >> rougly evenly split between your reduces.
> >> This requires a pass over the search file to write out a summary file,
> >> which is then loaded by the partitioner.
> >>
> >> The output value grouping comparator, will get the keys in order of the
> >> first token, and will define the start of a group by the presence of a 2
> >> part key, and consider the group ended when either another 2 part key
> >> appears, or when the key value is larger than the second part of the
> >> starting key. - This does require that the grouping comparator maintain
> >> state.
> >>
> >> At this point, your reduce will be called with the first key in the key
> >> equivalence group of (3), with the values of all of the keys
> >>
> >> In your map, any address that is not in a range of interest is not
> passed
> >> to output.collect.
> >>
> >> For the map side join code, you have to define a comparator on the key
> type
> >> that defines your definition of equivalence and ordering, and call
> >> WritableComparator.define( Key.class, comparator.class ), to force the
> join
> >> code to use your comparator.
> >>
> >> For tables with duplicates, per the key comparator, in map side join,
> your
> >> map fuction will receive a row for every permutation of the duplicate
> keys:
> >> if you have one table a, 1; a, 2; and another table with a, 3; a, 4;
> your
> >> map will receive4 rows, a, 1, 3; a, 1, 4; a, 2, 3; a, 2, 4;
> >>
> >>
> >>
> >> On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara  >wrote:
> >>
> >>> Thanks for all who replies.
> >>>
> >>> Stefan -
> >>> I'm unable to see how converting IP ranges to network masks would help
> >>> because different ranges can have the same network mask and with that I
> >>> still have to do a comparison of two fields: the searched IP with
> >>> from-IP&mask.
> >>>
> >>> Pig - I'm familier with pig and use it many times, but I can't think of
> a
> >>> way to write a pig script that will do this type of "join". I'll ask
> the
> >>> pig
> >>> users group.
> >>>
> >>> The search file is indeed large in terms of the amount records.
> However, I
> >>> don't see this as an issue yet, because I'm still puzzeled with how to
> >>> write
> >>> the job in plain MR. The join code is looking for an exact match in the
> >>> keys
> >>> and that is not what I need. Would a custom comperator which will look
> for
> >>> a
> >>> match in between the ranges, be the right choice to do this ?
> >>>
> >>> Thanks,
> >>> Tamir
> >>>
> >>> On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop  >>> >wrote:
> >>>
> >>> > If the search file data set is large, the issue becomes ensuring that
> >>> only
> >>> > the required portion of search file is actually read, and that those
> >>> reads
> >>> > are ordered, in search file's key order.
> >>> >
> >>> > If the data set is small, most any of the common patterns will work.
> >>> >
> >>> > I haven't looked at pig for a while, does pig now use indexes in map
> >>> files,
> >>> > and take into account that a data set is sorted?
> >>> > Out of the box, the map side join code, org.apache.hadoop.mapred.join
> >>> will
> >>> > do a decent job of this, but the entire search file set will be read.
> >>> > To stop reading the entire search file, a record reader or join type,
> >>> would
> >>> > need to be put together to:
> >>> > a) skip to the first key of interest, using the index if available
> >>> > b) finish when the last possible key of interest has been delivered.
> >>> >
> >>> > On Wed, Mar 25, 2009 at 6:05 AM, John Lee 
> >>> wrote:
> >>> >
> >>> > > In addition to other suggestions, you could also take a look at
> >>> > > building a Cascading job with a custom Joiner class.
> >>> > >
> >>> > > - John
> >>> > >
> >>> > > On Tue, M

Re: Hardware - please sanity check?

2009-04-02 Thread Miles Osborne
make sure you also have a fast switch, since you will be transmitting
data across your network and this will come to bite you otherwise

(roughly, you need one core per hadoop-related job, each mapper, task
tracker etc;  the per-core memory may be too small if you are doing
anything memory-intensive.  we have 8-core boxes with 50 -- 33 GB RAM
and 8 x 1 TB disks on each one;  one box however just has 16 GB of RAM
and it routinely falls over when we run jobs on it)

Miles

2009/4/2 tim robertson :
> Hi all,
>
> I am not a hardware guy but about to set up a 10 node cluster for some
> processing of (mostly) tab files, generating various indexes and
> researching HBase, Mahout, pig, hive etc.
>
> Could someone please sanity check that these specs look sensible?
> [I know 4 drives would be better but price is a factor (second hand
> not an option, hosting is not either as there is very good bandwidth
> provided)]
>
> Something along the lines of:
>
> Dell R200 (8GB is max memory)
> Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB
> 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs)
> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive
>
>
> Dell R300 (can be expanded to 24GB RAM)
> Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS
> 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs)
> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive
>
>
> If there is a major flaw please can you let me know.
>
> Thanks,
>
> Tim
> (not a hardware guy ;o)
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: hdfs-doubt

2009-04-02 Thread Rasit OZDAS
It seems that either NameNode or DataNode is not started.
You can take a look at log files, and paste related lines here.

2009/3/29 deepya :
>
> Thanks,
>
> I have another doubt.I just want to run the examples and see how it works.I
> am trying to copy the file from local file system to hdfs using the command
>
>  bin/hadoop fs -put conf input
>
> It is giving the following error.
> 09/03/29 05:50:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.NoRouteToHostException: No route to host
> 09/03/29 05:50:54 INFO hdfs.DFSClient: Abandoning block
> blk_-5733385806393158149_1053
>
> I have only one datanode in my cluster and my replication factor is also
> 1(as configured in the conf file in hadoop-site.xml).Can you please provide
> the solution for this.
>
>
> Thanks in advance
>
> SreeDeepya
>
>
> sree deepya wrote:
>>
>> Hi sir/madam,
>>
>> I am SreeDeepya, doing an MTech at IIIT. I am working on a project named
>> "cost effective and scalable storage server". Our main goal of the project
>> is to be able to store images on a server, and the data can be up to
>> petabytes. For that we are using HDFS. I am new to hadoop and am just
>> learning about it.
>>     Can you please clarify some of the doubts I have.
>>
>>
>>
>> At present we configured one datanode and one namenode.Jobtracker is
>> running
>> on namenode and tasktracker on datanode.Now namenode also acts as
>> client.Like we are writing programs in the namenode to store or retrieve
>> images.My doubts are
>>
>> 1.Can we put the client and namenode in two separate systems?
>>
>> 2.Can we access the images from the datanode of hadoop cluster from a
>> machine in which hdfs is not there?
>>
>> 3.At present we may not have data upto petabytes but will be in
>> gigabytes.Is
>> hadoop still efficient in storing mega and giga bytes of data
>>
>>
>> Thanking you,
>>
>> Yours sincerely,
>> SreeDeepya
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/hdfs-doubt-tp22764502p22765332.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
M. Raşit ÖZDAŞ


Re: a doubt regarding an appropriate file system

2009-04-02 Thread Rasit OZDAS
I'm not sure whether I understood you correctly, but if so, there is a previous
thread that helps in understanding what Hadoop is intended to be, and what
disadvantages it has:
http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html

2009/4/2 Rasit OZDAS 
>
> If performance is important to you, Look at the quote from a previous thread:
>
> "HDFS is a file system for distributed storage typically for distributed
> computing scenerio over hadoop. For office purpose you will require a SAN
> (Storage Area Network) - an architecture to attach remote computer storage
> devices to servers in such a way that, to the operating system, the devices
> appear as locally attached. Or you can even go for AmazonS3, if the data is
> really authentic. For opensource solution related to SAN, you can go with
> any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
> zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac
> Server + XSan."
>
> --nitesh
>
> Besides, I wouldn't use HDFS for this purpose.
>
> Rasit



--
M. Raşit ÖZDAŞ


Hardware - please sanity check?

2009-04-02 Thread tim robertson
Hi all,

I am not a hardware guy but about to set up a 10 node cluster for some
processing of (mostly) tab files, generating various indexes and
researching HBase, Mahout, pig, hive etc.

Could someone please sanity check that these specs look sensible?
[I know 4 drives would be better but price is a factor (second hand
not an option, hosting is not either as there is very good bandwidth
provided)]

Something along the lines of:

Dell R200 (8GB is max memory)
Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB
8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs)
2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive


Dell R300 (can be expanded to 24GB RAM)
Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS
8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs)
2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive


If there is a major flaw please can you let me know.

Thanks,

Tim
(not a hardware guy ;o)


Re: a doubt regarding an appropriate file system

2009-04-02 Thread Rasit OZDAS
If performance is important to you, Look at the quote from a previous
thread:

"HDFS is a file system for distributed storage typically for distributed
computing scenerio over hadoop. For office purpose you will require a SAN
(Storage Area Network) - an architecture to attach remote computer storage
devices to servers in such a way that, to the operating system, the devices
appear as locally attached. Or you can even go for AmazonS3, if the data is
really authentic. For opensource solution related to SAN, you can go with
any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac
Server + XSan."

--nitesh

Besides, I wouldn't use HDFS for this purpose.

Rasit


Re: A bizarre problem in reduce method

2009-04-02 Thread Rasit OZDAS
Hi, Husain,

1. You can use a boolean control in your code.
   boolean hasAlreadyOned = false;
   int iCount = 0;
   String sValues = "";
   String sValue;
   while (values.hasNext()) {
       sValue = values.next().toString();
       iCount++;
       if (sValue.equals("1"))
           hasAlreadyOned = true;

       if (!hasAlreadyOned)
           sValues += "\t" + sValue;
   }
   ...

2. You're actually controlling for 3 elements, not 2. You should use  if
(iCount == 1)
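3. One more thing worth checking (this is only a guess, since the driver code is not
shown above): if the same reducer class is also registered as the combiner, the
reduce no longer sees the raw map output but the combiner's output, which already
has the count appended. That would produce exactly the kind of extra trailing "1"s
shown above, and would also keep iCount from ever being 2. The line to look for in
the driver (MyReducer is a placeholder name):

    // If present, either remove this or use a separate combiner class whose
    // output matches the map output format:
    conf.setCombinerClass(MyReducer.class);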

2009/4/1 Farhan Husain 

> Hello All,
>
> I am facing some problems with a reduce method I have written which I
> cannot
> understand. Here is the method:
>
>@Override
>public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter)
>throws IOException {
>String sValues = "";
>int iCount = 0;
>String sValue;
>while (values.hasNext()) {
>sValue = values.next().toString();
>iCount++;
>sValues += "\t" + sValue;
>
>}
>sValues += "\t" + iCount;
>//if (iCount == 2)
>output.collect(key, new Text(sValues));
>}
>
> The output of the code is like the following:
>
> D0U0:GraduateStudent0lehigh:GraduateStudent111
> D0U0:GraduateStudent1lehigh:GraduateStudent111
> D0U0:GraduateStudent10lehigh:GraduateStudent111
> D0U0:GraduateStudent100lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent101lehigh:GraduateStudent1
> D0U0:GraduateCourse0121
> D0U0:GraduateStudent102lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent103lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent104lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent105lehigh:GraduateStudent11
>  1
>
> The problem is there cannot be so many 1's in the output value. The output
> which I expect should be like this:
>
> D0U0:GraduateStudent0lehigh:GraduateStudent1
> D0U0:GraduateStudent1lehigh:GraduateStudent1
> D0U0:GraduateStudent10lehigh:GraduateStudent1
> D0U0:GraduateStudent100lehigh:GraduateStudent1
> D0U0:GraduateStudent101lehigh:GraduateStudent
> D0U0:GraduateCourse02
> D0U0:GraduateStudent102lehigh:GraduateStudent1
> D0U0:GraduateStudent103lehigh:GraduateStudent1
> D0U0:GraduateStudent104lehigh:GraduateStudent1
> D0U0:GraduateStudent105lehigh:GraduateStudent1
>
> If I do not append the iCount variable to sValues string, I get the
> following output:
>
> D0U0:GraduateStudent0lehigh:GraduateStudent
> D0U0:GraduateStudent1lehigh:GraduateStudent
> D0U0:GraduateStudent10lehigh:GraduateStudent
> D0U0:GraduateStudent100lehigh:GraduateStudent
> D0U0:GraduateStudent101lehigh:GraduateStudent
> D0U0:GraduateCourse0
> D0U0:GraduateStudent102lehigh:GraduateStudent
> D0U0:GraduateStudent103lehigh:GraduateStudent
> D0U0:GraduateStudent104lehigh:GraduateStudent
> D0U0:GraduateStudent105lehigh:GraduateStudent
>
> This confirms that there is no 1's after each of those values (which I
> already know from the intput data). I do not know why the output is
> distorted like that when I append the iCount to sValues (like the given
> code). Can anyone help in this regard?
>
> Now comes the second problem which is equally perplexing. Actually, the
> reduce method which I want to run is like the following:
>
>@Override
>public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter)
>throws IOException {
>String sValues = "";
>int iCount = 0;
>String sValue;
>while (values.hasNext()) {
>sValue = values.next().toString();
>iCount++;
>sValues += "\t" + sValue;
>
>}
>sValues += "\t" + iCount;
>if (iCount == 2)
>output.collect(key, new Text(sValues));
>}
>
> I want to output only if "values" contained only two elements. By looking
> at
> the output above you can see that there is at least one such key values
> pair
> where values have exactly two elements. But when I run the code I get an
> empty output file. Can anyone solve this?
>
> I have tried many versions of the code (e.g. using StringBuffer instead of
> String, using flags instead of integer count) but nothing works. Are these
> problems due to bugs in Hadoop? Please let me know any kind of solution you
> can think of.
>
> Thanks,
>
> --
> Mohammad Farhan Husain
> Research Assistant
> Department of Computer Science
> Erik Jonsson School of Engineering and Computer Science
> University of T

Re: what change to be done in OutputCollector to print custom writable object

2009-04-02 Thread Rasit OZDAS
There is also a good alternative,
We use ObjectInputFormat and ObjectRecordReader.
With it you can easily do File <-> Object translations.
I can send a code sample to your mail if you want.
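(ObjectInputFormat / ObjectRecordReader look like custom classes, so they are not
shown here. For reference, a minimal sketch of the standard route: the default
TextOutputFormat prints keys and values by calling toString(), so overriding
toString() in the custom Writable is usually enough to control what the part files
contain. PersonWritable below is just a made-up example type.)

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A custom value type; TextOutputFormat writes whatever toString() returns.
    public class PersonWritable implements Writable {
        private String name = "";
        private int age;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(name);
            out.writeInt(age);
        }

        public void readFields(DataInput in) throws IOException {
            name = in.readUTF();
            age = in.readInt();
        }

        // This is what ends up in the output file.
        @Override
        public String toString() {
            return name + "\t" + age;
        }
    }

The driver would then declare it with conf.setOutputValueClass(PersonWritable.class)
and keep the default output format.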


Re: Cannot resolve Datonode address in slave file

2009-04-02 Thread Guilherme Germoglio
you should append id_dsa.pub to ~/.ssh/authorized_keys on the other
computers from the cluster. if your home directory is shared by all of them
(e.g., you're mounting /home/$user using NFS), "cat ~/.ssh/id_dsa.pub >>
~/.ssh/authorized_keys" might work. however, if it isn't shared, you might
use 'ssh-copy-id' to all your nodes (or append id_dsa.pub manually).

2009/4/2 Puri, Aseem 

> Hi Rasit,
>
> Now I got a different problem when I start my Hadoop server the slave
> datanode do not accept password. It gives message permission denied.
>
>I have also use the commands on all m/c
>
> $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>
> But my problem is not solved. Any suggestion?
>
>
> -Original Message-
> From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
> Sent: Thursday, April 02, 2009 6:49 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Cannot resolve Datonode address in slave file
>
> Hi, Sim,
>
> I've two suggessions, if you haven't done yet:
>
> 1. Check if your other hosts can ssh to master.
> 2. Take a look at logs of other hosts.
>
> 2009/4/2 Puri, Aseem 
>
> >
> > Hi
> >
> >I have a small Hadoop cluster with 3 machines. One is my
> > NameNode/JobTracker + DataNode/TaskTracker and other 2 are
> > DataNode/TaskTracker. So I have made all 3 as slave.
> >
> >
> >
> > In slave file I have put names of all there machines as:
> >
> >
> >
> > master
> >
> > slave
> >
> > slave1
> >
> >
> >
> > When I start Hadoop cluster it always start DataNode/TaskTracker on last
> > slave in the list and do not start DataNode/TaskTracker on other two
> > machines. Also I got the message as:
> >
> >
> >
> > slave1:
> >
> > : no address associated with name
> >
> > : no address associated with name
> >
> > slave1: starting datanode, logging to
> > /home/HadoopAdmin/hadoop/bin/../logs/hadoo
> >
> > p-HadoopAdmin-datanode-ie11dtxpficbfise.out
> >
> >
> >
> > If I change the order in slave file like this:
> >
> >
> >
> > slave
> >
> > slave1
> >
> > master
> >
> >
> >
> > then DataNode/TaskTracker on master m/c starts and not on other two.
> >
> >
> >
> > Please tell how I should solve this problem.
> >
> >
> >
> > Sim
> >
> >
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
Guilherme

msn: guigermog...@hotmail.com
homepage: http://germoglio.googlepages.com


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread andy2005cst

Thanks for your reply. Let me explain more clearly: since MapReduce is just
one step of my program, I need to use the output of the reduce for further
computation, so I do not want to write the output to disk but rather get the
collection or list of the output in RAM. If it is written directly to disk,
I have to read it back into RAM again.
You mentioned a special file format; could you please show me what it is
and give an example if possible?

thank you so much.


Rasit OZDAS wrote:
> 
> Hi, hadoop is normally designed to write to disk. There are a special file
> format, which writes output to RAM instead of disk.
> But I don't have an idea if it's what you're looking for.
> If what you said exists, there should be a mechanism which sends output as
> objects rather than file content across computers, as far as I know there
> is
> no such feature yet.
> 
> Good luck.
> 
> 2009/4/2 andy2005cst 
> 
>>
>> I need to use the output of the reduce, but I don't know how to do.
>> use the wordcount program as an example if i want to collect the
>> wordcount
>> into a hashtable for further use, how can i do?
>> the example just show how to let the result onto disk.
>> myemail is : andy2005...@gmail.com
>> looking forward your help. thanks a lot.
>> --
>> View this message in context:
>> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> M. Raşit ÖZDAŞ
> 
> 

-- 
View this message in context: 
http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Running MapReduce without setJar

2009-04-02 Thread Rasit OZDAS
Yes, as additional info:
you can use this code to just start the job and not wait until it's finished:

JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);  // submitJob() returns immediately; runJob() blocks until the job completes

2009/4/1 javateck javateck 

> you can run from java program:
>
>JobConf conf = new JobConf(MapReduceWork.class);
>
>// setting your params
>
>JobClient.runJob(conf);
>
>


Re: Reducer side output

2009-04-02 Thread Rasit OZDAS
I think the problem is that you don't have permission to write to the path you define.
Did you try it with a path under your user directory?

You can also change the permissions from the console.
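If the side-effect file is the goal, another option (just a sketch, assuming the
0.18/0.19 mapred API) is to create it under the task's work output directory, which
the task can always write to and which the framework moves to the job output
directory when the task commits:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class SideOutputHelper {
        // 'conf' is the JobConf captured in the reducer's configure(JobConf) method.
        public static FSDataOutputStream openSideFile(JobConf conf) throws IOException {
            Path workDir = FileOutputFormat.getWorkOutputPath(conf);
            FileSystem fs = workDir.getFileSystem(conf);
            String partition = conf.get("mapred.task.partition");
            // Note the relative child path: a child path starting with "/" resolves to
            // the HDFS root, which looks like what triggers the inode="" permission
            // error in the stack trace above.
            return fs.create(new Path(workDir, "group_stat-part-" + partition));
        }
    }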

2009/4/1 Nagaraj K 

> Hi,
>
> I am trying to do a side-effect output along with the usual output from the
> reducer.
> But for the side-effect output attempt, I get the following error.
>
> org.apache.hadoop.fs.permission.AccessControlException:
> org.apache.hadoop.fs.permission.AccessControlException: Permission denied:
> user=nagarajk, access=WRITE, inode="":hdfs:hdfs:rwxr-xr-x
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
>at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:52)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.(DFSClient.java:2311)
>at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:477)
>at
> org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:178)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:391)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:383)
>at
> org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1310)
>at
> org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1275)
>at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:319)
>at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)
>
> My reducer code;
> =
> conf.set("group_stat", "some_path"); // Set during the configuration of
> jobconf object
>
> public static class ReducerClass extends MapReduceBase implements
> Reducer {
>FSDataOutputStream part=null;
>JobConf conf;
>
>public void reduce(Text key, Iterator values,
>   OutputCollector output,
>   Reporter reporter) throws IOException {
>double i_sum = 0.0;
>while (values.hasNext()) {
>i_sum += ((Double) values.next()).valueOf();
>}
>String [] fields = key.toString().split(SEP);
>if(fields.length==1)
>{
>   if(part==null)
>   {
>   FileSystem fs = FileSystem.get(conf);
>String jobpart =
> conf.get("mapred.task.partition");
>part = fs.create(new
> Path(conf.get("group_stat"),"/part-000"+jobpart)) ; // Failing here
>   }
>   part.writeBytes(fields[0] +"\t" + i_sum +"\n");
>
>}
>else
>output.collect(key, new DoubleWritable(i_sum));
>}
> }
>
> Can you guys let me know what I am doing wrong here!.
>
> Thanks
> Nagaraj K
>



-- 
M. Raşit ÖZDAŞ


RE: Cannot resolve Datonode address in slave file

2009-04-02 Thread Puri, Aseem
Hi Rasit,
 
Now I have a different problem: when I start my Hadoop cluster, the slave datanodes
do not accept the password. It gives the message "permission denied".

I have also used these commands on all machines:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

But my problem is not solved. Any suggestions?


-Original Message-
From: Rasit OZDAS [mailto:rasitoz...@gmail.com] 
Sent: Thursday, April 02, 2009 6:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot resolve Datonode address in slave file

Hi, Sim,

I've two suggessions, if you haven't done yet:

1. Check if your other hosts can ssh to master.
2. Take a look at logs of other hosts.

2009/4/2 Puri, Aseem 

>
> Hi
>
>I have a small Hadoop cluster with 3 machines. One is my
> NameNode/JobTracker + DataNode/TaskTracker and other 2 are
> DataNode/TaskTracker. So I have made all 3 as slave.
>
>
>
> In slave file I have put names of all there machines as:
>
>
>
> master
>
> slave
>
> slave1
>
>
>
> When I start Hadoop cluster it always start DataNode/TaskTracker on last
> slave in the list and do not start DataNode/TaskTracker on other two
> machines. Also I got the message as:
>
>
>
> slave1:
>
> : no address associated with name
>
> : no address associated with name
>
> slave1: starting datanode, logging to
> /home/HadoopAdmin/hadoop/bin/../logs/hadoo
>
> p-HadoopAdmin-datanode-ie11dtxpficbfise.out
>
>
>
> If I change the order in slave file like this:
>
>
>
> slave
>
> slave1
>
> master
>
>
>
> then DataNode/TaskTracker on master m/c starts and not on other two.
>
>
>
> Please tell how I should solve this problem.
>
>
>
> Sim
>
>


-- 
M. Raşit ÖZDAŞ


Re: Strange Reduce Bahavior

2009-04-02 Thread Rasit OZDAS
Yes, we've constructed a local version of a Hadoop process.
We needed 500 input files in Hadoop to reach the speed of the local process;
total time was 82 seconds on a cluster of 6 machines.
And I think that's good performance compared with other distributed processing
systems.

2009/4/2 jason hadoop 

> 3) The framework is designed for working on large clusters of machines
> where
> there needs to be a little delay between operations to avoid massive
> network
> loading spikes, and the initial setup of the map task execution environment
> on a machine, and the initial setup of the reduce task execution
> environment
> take a bit of time.
> In production jobs, these delays and setup times are lost in the overall
> task run time.
> In the small test job case the delays and setup times will be the bulk of
> the time spent executing the test.
>
>
>


Re: Cannot resolve Datonode address in slave file

2009-04-02 Thread Rasit OZDAS
Hi, Sim,

I've two suggessions, if you haven't done yet:

1. Check if your other hosts can ssh to master.
2. Take a look at logs of other hosts.

2009/4/2 Puri, Aseem 

>
> Hi
>
>I have a small Hadoop cluster with 3 machines. One is my
> NameNode/JobTracker + DataNode/TaskTracker and other 2 are
> DataNode/TaskTracker. So I have made all 3 as slave.
>
>
>
> In slave file I have put names of all there machines as:
>
>
>
> master
>
> slave
>
> slave1
>
>
>
> When I start Hadoop cluster it always start DataNode/TaskTracker on last
> slave in the list and do not start DataNode/TaskTracker on other two
> machines. Also I got the message as:
>
>
>
> slave1:
>
> : no address associated with name
>
> : no address associated with name
>
> slave1: starting datanode, logging to
> /home/HadoopAdmin/hadoop/bin/../logs/hadoo
>
> p-HadoopAdmin-datanode-ie11dtxpficbfise.out
>
>
>
> If I change the order in slave file like this:
>
>
>
> slave
>
> slave1
>
> master
>
>
>
> then DataNode/TaskTracker on master m/c starts and not on other two.
>
>
>
> Please tell how I should solve this problem.
>
>
>
> Sim
>
>


-- 
M. Raşit ÖZDAŞ


Re: reducer in M-R

2009-04-02 Thread Rasit OZDAS
Since every file name is different, you have a unique key for each map
output.
That means, every iterator has only one element. So you won't need to search
for a given name.
But it's possible that I misunderstood you.
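As an aside, if the file name is not in the key yet, the map can pull it from the
input split. A minimal sketch, assuming the old mapred API and TextInputFormat (so
the values arriving at the reducer are the file's lines, not one blob of content):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (file name, line) so the reducer sees all lines of one file under one key.
    public class FileNameMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            FileSplit split = (FileSplit) reporter.getInputSplit();
            output.collect(new Text(split.getPath().getName()), value);
        }
    }

In the reducer the iterator then holds every line of the file, which can be
concatenated if the whole content is needed as a single value.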

2009/4/2 Vishal Ghawate 

> Hi ,
>
> I just wanted to know that values parameter passed to the reducer is always
> iterator ,
>
> Which is then used to iterate through for particular key
>
> Now I want to use file name as key and file content as its value
>
> So how can I set the parameters in the reducer
>
>
>
> Can anybody please help me on this.
>
>
>



-- 
M. Raşit ÖZDAŞ


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
Hi, Hadoop is normally designed to write to disk. There is a special file format
which writes output to RAM instead of disk,
but I don't know whether it's what you're looking for.
For what you describe to exist, there would have to be a mechanism which sends
output as objects rather than file content across computers; as far as I know
there is no such feature yet.

Good luck.
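Since JobClient.runJob() blocks until the job is finished, the usual workaround is
simply to read the part files back from HDFS afterwards. A minimal sketch, assuming
plain text output with word<TAB>count lines as in the WordCount example:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class OutputLoader {
        // Reads "word<TAB>count" lines from every part-* file under outputDir into a map.
        public static Map<String, Long> loadCounts(JobConf conf, Path outputDir) throws IOException {
            Map<String, Long> counts = new HashMap<String, Long>();
            FileSystem fs = outputDir.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(outputDir)) {
                if (!status.getPath().getName().startsWith("part-")) {
                    continue; // skip _logs and other non-output files
                }
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] fields = line.split("\t");
                        counts.put(fields[0], Long.parseLong(fields[1]));
                    }
                } finally {
                    reader.close();
                }
            }
            return counts;
        }
    }

After runJob(conf) returns, calling OutputLoader.loadCounts(conf, outputPath) gives
you the hashtable you described, at the cost of one extra read of the output.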

2009/4/2 andy2005cst 

>
> I need to use the output of the reduce, but I don't know how to do.
> use the wordcount program as an example if i want to collect the wordcount
> into a hashtable for further use, how can i do?
> the example just show how to let the result onto disk.
> myemail is : andy2005...@gmail.com
> looking forward your help. thanks a lot.
> --
> View this message in context:
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: mapreduce problem

2009-04-02 Thread Rasit OZDAS
MultipleOutputFormat would be what you want; it lets a job write to multiple
output files.
I can paste some code here if you want.
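For reference, a rough sketch of that approach (0.18/0.19 mapred API; the routing
rule below, keyed on a hypothetical first tab-separated field of the value, is only
an illustration):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Routes each reduce output record to a file named after part of the record itself.
    public class LogTypeOutputFormat extends MultipleTextOutputFormat<Text, Text> {

        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            String logType = value.toString().split("\t")[0];  // hypothetical field
            return logType + "/" + name;                        // e.g. ERROR/part-00000
        }
    }

The driver just sets conf.setOutputFormat(LogTypeOutputFormat.class); the reducer
keeps emitting key/value pairs as usual and the output format takes care of
splitting them across files.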

2009/4/2 Vishal Ghawate 

> Hi,
>
> I am new to map-reduce programming model ,
>
>  I am writing a MR that will process the log file and results are written
> to
> different files on hdfs  based on some values in the log file
>
>The program is working fine even if I haven't done any
> processing in reducer ,I am not getting how to use reducer for solving my
> problem efficiently
>
> Can anybody please help me on this.
>
>
>
>
>



-- 
M. Raşit ÖZDAŞ


Re: Amazon Elastic MapReduce

2009-04-02 Thread Brian Bockelman


On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:

seems like I should pay for additional money, so why not configure a  
hadoop
cluster in EC2 by myself. This already have been automatic using  
script.





Not everyone has a support team or an operations team or enough time  
to learn how to do it themselves.  You're basically paying for the  
fact that the only thing you need to know to use Hadoop is:

1) Be able to write the Java classes.
2) Press the "go" button on a webpage somewhere.

You could use Hadoop with little-to-zero systems knowledge (and  
without institutional support), which would always make some  
researchers happy.


Brian





On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne   
wrote:



... and only in the US

Miles

2009/4/2 zhang jianfeng :

Does it support pig ?


On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel   
wrote:




FYI

Amazons new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/

And Cascading 1.0 supports it:
http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html

cheers,
ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/








--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.





reducer in M-R

2009-04-02 Thread Vishal Ghawate
Hi ,

I just wanted to know whether the values parameter passed to the reducer is
always an iterator,

which is then used to iterate through the values for a particular key.

Now I want to use the file name as the key and the file content as its value,

so how can I set these parameters in the reducer?

Can anybody please help me with this?




HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread andy2005cst

I need to use the output of the reduce, but I don't know how to do it.
Using the wordcount program as an example: if I want to collect the word counts
into a hashtable for further use, how can I do that?
The example just shows how to write the result to disk.
My email is: andy2005...@gmail.com
Looking forward to your help. Thanks a lot.
-- 
View this message in context: 
http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: hadoop job controller

2009-04-02 Thread Stefan Podkowinski
You can get the job progress and completion status through an instance
of org.apache.hadoop.mapred.JobClient . If you really want to use perl
I guess you still need to write a small java application that talks to
perl and JobClient on the other side.
There's also some support for Thrift in the Hadoop contrib package, but
I'm not sure if it exposes any job-client-related methods.
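A small bridge along those lines could look like this (a sketch, assuming the
0.18/0.19 API and that the job id, e.g. job_200903061521_0045, is passed in from the
perl side; the exit code gives the pass/fail answer directly):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    // Prints PASS / FAIL / RUNNING / UNKNOWN for a job id given on the command line.
    public class JobStatusCheck {
        public static void main(String[] args) throws Exception {
            // Picks up the jobtracker address from the Hadoop config on the classpath.
            JobClient client = new JobClient(new JobConf());
            RunningJob job = client.getJob(args[0]);
            if (job == null) {
                System.out.println("UNKNOWN");
                System.exit(2);
            } else if (!job.isComplete()) {
                System.out.println("RUNNING");
                System.exit(3);
            } else if (job.isSuccessful()) {
                System.out.println("PASS");
                System.exit(0);
            } else {
                System.out.println("FAIL");
                System.exit(1);
            }
        }
    }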

On Thu, Apr 2, 2009 at 12:46 AM, Elia Mazzawi
 wrote:
>
> I'm writing a perl program to submit jobs to the cluster,
> then wait for the jobs to finish, and check that they have completed
> successfully.
>
> I have some questions,
>
> this shows what is running
> ./hadoop job  -list
>
> and this shows the completion
> ./hadoop job -status  job_200903061521_0045
>
>
> but i want something that just says pass / fail
> cause with these, i have to check that its done then check that its 100%
> completed.
>
> which must exist since the webapp jobtracker.jsp knows what is what.
>
> also a controller like that must have been written many times already,  are
> there any around?
>
> Regards,
> Elia
>


mapreduce problem

2009-04-02 Thread Vishal Ghawate
Hi,

I am new to the map-reduce programming model.

I am writing an MR job that processes a log file; the results are written to
different files on HDFS based on some values in the log file.

The program is working fine even though I haven't done any
processing in the reducer, but I don't see how to use the reducer to solve my
problem efficiently.

Can anybody please help me with this?

 




Re: Join Variation

2009-04-02 Thread Stefan Podkowinski
.. and is not yet available as an alpha book chapter. Any chance uploading it?

On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop  wrote:
> Just for fun, chapter 9 in my book is a work through of solving this class
> of problem.
>
>
> On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop wrote:
>
>> For the classic map/reduce job, you have 3 requirements.
>>
>> 1) a comparator that provide the keys in ip address order, such that all
>> keys in one of your ranges, would be contiguous, when sorted with the
>> comparator
>> 2) a partitioner that ensures that all keys that should be together end up
>> in the same partition
>> 3) and output value grouping comparator that considered all keys in a
>> specified range equal.
>>
>> The comparator only sorts by the first part of the key, the search file has
>> a 2 part key begin/end the input data has just a 1 part key.
>>
>> A partitioner that new ahead of time the group sets in your search set, in
>> the way that the tera sort example works would be ideal:
>> ie: it builds an index of ranges from your seen set so that the ranges get
>> rougly evenly split between your reduces.
>> This requires a pass over the search file to write out a summary file,
>> which is then loaded by the partitioner.
>>
>> The output value grouping comparator, will get the keys in order of the
>> first token, and will define the start of a group by the presence of a 2
>> part key, and consider the group ended when either another 2 part key
>> appears, or when the key value is larger than the second part of the
>> starting key. - This does require that the grouping comparator maintain
>> state.
>>
>> At this point, your reduce will be called with the first key in the key
>> equivalence group of (3), with the values of all of the keys
>>
>> In your map, any address that is not in a range of interest is not passed
>> to output.collect.
>>
>> For the map side join code, you have to define a comparator on the key type
>> that defines your definition of equivalence and ordering, and call
>> WritableComparator.define( Key.class, comparator.class ), to force the join
>> code to use your comparator.
>>
>> For tables with duplicates, per the key comparator, in map side join, your
>> map fuction will receive a row for every permutation of the duplicate keys:
>> if you have one table a, 1; a, 2; and another table with a, 3; a, 4; your
>> map will receive4 rows, a, 1, 3; a, 1, 4; a, 2, 3; a, 2, 4;
>>
>>
>>
>> On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara wrote:
>>
>>> Thanks for all who replies.
>>>
>>> Stefan -
>>> I'm unable to see how converting IP ranges to network masks would help
>>> because different ranges can have the same network mask and with that I
>>> still have to do a comparison of two fields: the searched IP with
>>> from-IP&mask.
>>>
>>> Pig - I'm familier with pig and use it many times, but I can't think of a
>>> way to write a pig script that will do this type of "join". I'll ask the
>>> pig
>>> users group.
>>>
>>> The search file is indeed large in terms of the amount records. However, I
>>> don't see this as an issue yet, because I'm still puzzeled with how to
>>> write
>>> the job in plain MR. The join code is looking for an exact match in the
>>> keys
>>> and that is not what I need. Would a custom comperator which will look for
>>> a
>>> match in between the ranges, be the right choice to do this ?
>>>
>>> Thanks,
>>> Tamir
>>>
>>> On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop >> >wrote:
>>>
>>> > If the search file data set is large, the issue becomes ensuring that
>>> only
>>> > the required portion of search file is actually read, and that those
>>> reads
>>> > are ordered, in search file's key order.
>>> >
>>> > If the data set is small, most any of the common patterns will work.
>>> >
>>> > I haven't looked at pig for a while, does pig now use indexes in map
>>> files,
>>> > and take into account that a data set is sorted?
>>> > Out of the box, the map side join code, org.apache.hadoop.mapred.join
>>> will
>>> > do a decent job of this, but the entire search file set will be read.
>>> > To stop reading the entire search file, a record reader or join type,
>>> would
>>> > need to be put together to:
>>> > a) skip to the first key of interest, using the index if available
>>> > b) finish when the last possible key of interest has been delivered.
>>> >
>>> > On Wed, Mar 25, 2009 at 6:05 AM, John Lee 
>>> wrote:
>>> >
>>> > > In addition to other suggestions, you could also take a look at
>>> > > building a Cascading job with a custom Joiner class.
>>> > >
>>> > > - John
>>> > >
>>> > > On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara 
>>> > > wrote:
>>> > > > Hi,
>>> > > >
>>> > > > We need to implement a Join with a between operator instead of an
>>> > equal.
>>> > > > What we are trying to do is search a file for a key where the key
>>> falls
>>> > > > between two fields in the search file like this:
>>> > > >
>>> > > > main file (ip, a, b):
>>> > > > (80, zz, yy)
>>> > > > (125, vv, bb)

Announcing Amazon Elastic MapReduce

2009-04-02 Thread Sirota, Peter
Dear Hadoop community,

We are excited today to introduce the public beta of Amazon Elastic MapReduce, 
a web service that enables developers to easily and cost-effectively process 
vast amounts of data. It utilizes a hosted Hadoop (0.18.3) running on the 
web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and 
Amazon Simple Storage Service (Amazon S3).

Using Amazon Elastic MapReduce, you can instantly provision as much or as 
little capacity as you like to perform data-intensive tasks for applications 
such as web indexing, data mining, log file analysis, machine learning, 
financial analysis, scientific simulation, and bioinformatics research.  Amazon 
Elastic MapReduce lets you focus on crunching or analyzing your data without 
having to worry about time-consuming set-up, management or tuning of Hadoop 
clusters or the compute capacity upon which they sit.

Working with the service is easy: Develop your processing application using our 
samples or by building your own, upload your data to Amazon S3, use the AWS 
Management Console or APIs to specify the number and type of instances you 
want, and click "Create Job Flow." We do the rest, running Hadoop over the 
number of specified instances, providing progress monitoring, and delivering 
the output to Amazon S3.

We will be posting several patches to Hadoop today and are hoping to become a 
part of this exciting community moving forward.

We hope this new service will prove a powerful tool for your data processing 
needs and becomes a great development platform to build sophisticated data 
processing applications. You can sign up and start using the service today at 
http://aws.amazon.com/elasticmapreduce.

Our forums are available to ask any questions or suggest features: 
http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52

Sincerely,

The Amazon Web Services Team



Re: Amazon Elastic MapReduce

2009-04-02 Thread zhang jianfeng
Seems like I would have to pay additional money, so why not configure a Hadoop
cluster on EC2 myself? This has already been automated with scripts.





On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne  wrote:

> ... and only in the US
>
> Miles
>
> 2009/4/2 zhang jianfeng :
> > Does it support pig ?
> >
> >
> > On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel  wrote:
> >
> >>
> >> FYI
> >>
> >> Amazons new Hadoop offering:
> >> http://aws.amazon.com/elasticmapreduce/
> >>
> >> And Cascading 1.0 supports it:
> >> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
> >>
> >> cheers,
> >> ckw
> >>
> >> --
> >> Chris K Wensel
> >> ch...@wensel.net
> >> http://www.cascading.org/
> >> http://www.scaleunlimited.com/
> >>
> >>
> >
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>


Re: Amazon Elastic MapReduce

2009-04-02 Thread Miles Osborne
... and only in the US

Miles

2009/4/2 zhang jianfeng :
> Does it support pig ?
>
>
> On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel  wrote:
>
>>
>> FYI
>>
>> Amazons new Hadoop offering:
>> http://aws.amazon.com/elasticmapreduce/
>>
>> And Cascading 1.0 supports it:
>> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
>>
>> cheers,
>> ckw
>>
>> --
>> Chris K Wensel
>> ch...@wensel.net
>> http://www.cascading.org/
>> http://www.scaleunlimited.com/
>>
>>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: Amazon Elastic MapReduce

2009-04-02 Thread zhang jianfeng
Does it support pig ?


On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel  wrote:

>
> FYI
>
> Amazons new Hadoop offering:
> http://aws.amazon.com/elasticmapreduce/
>
> And Cascading 1.0 supports it:
> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
>
> cheers,
> ckw
>
> --
> Chris K Wensel
> ch...@wensel.net
> http://www.cascading.org/
> http://www.scaleunlimited.com/
>
>


Amazon Elastic MapReduce

2009-04-02 Thread Chris K Wensel


FYI

Amazons new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/

And Cascading 1.0 supports it:
http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html

cheers,
ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Cannot resolve Datonode address in slave file

2009-04-02 Thread Puri, Aseem

Hi

I have a small Hadoop cluster with 3 machines. One is my
NameNode/JobTracker + DataNode/TaskTracker and the other 2 are
DataNode/TaskTracker only. So I have listed all 3 as slaves.

In the slaves file I have put the names of all three machines as:

 

master

slave

slave1

 

When I start the Hadoop cluster it always starts the DataNode/TaskTracker on the
last slave in the list and does not start the DataNode/TaskTracker on the other two
machines. Also I get this message:

 

slave1:

: no address associated with name

: no address associated with name

slave1: starting datanode, logging to
/home/HadoopAdmin/hadoop/bin/../logs/hadoo

p-HadoopAdmin-datanode-ie11dtxpficbfise.out

 

If I change the order in slave file like this:

 

slave

slave1

master

 

then the DataNode/TaskTracker starts on the master machine and not on the other two.

 

Please tell how I should solve this problem.

 

Sim



Re: Strange Reduce Bahavior

2009-04-02 Thread jason hadoop
1) when running in pseudo-distributed mode, only 2 values for the reduce
count are accepted, 0 and 1. All other positive values are mapped to 1.

2) The single reduce task spawned has several steps, and each of these steps
accounts for about 1/3 of its overall progress.

The 1st third, is collecting all of the map outputs, each of which in your
example has 0 records.
The 2nd third, is to produce a single sorted set from all of the map
outputs.
The 3rd third, is to reduce the sorted set.

So, you get progress reports,

3) The framework is designed for working on large clusters of machines where
there needs to be a little delay between operations to avoid massive network
loading spikes, and the initial setup of the map task execution environment
on a machine, and the initial setup of the reduce task execution environment
take a bit of time.
In production jobs, these delays and setup times are lost in the overall
task run time.
In the small test job case the delays and setup times will be the bulk of
the time spent executing the test.
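As an aside: if a test job like this needs no reduce step at all, setting the reduce
count to zero turns it into a map-only job and removes the shuffle/sort/reduce
phases entirely; maps then write straight to the output directory. A minimal driver
sketch, assuming the 0.19 mapred API and made-up input/output paths:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyScan {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapOnlyScan.class);
            conf.setJobName("scan-only");
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // Zero reduces: no shuffle, no sort, no reduce phase at all.
            conf.setNumReduceTasks(0);
            JobClient.runJob(conf);   // default mapper is the identity mapper
        }
    }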

On Wed, Apr 1, 2009 at 10:31 PM, Sriram Krishnan  wrote:

> Hi all,
>
> I am new to this list, and relatively new to Hadoop itself. So if this
> question has been answered before, please point me to the right thread.
>
> We are investigating the use of Hadoop for processing of geo-spatial data.
> In its most basic form, out data is laid out in files, where every row has
> the format -
> {index, x, y, z, }
>
> I am writing some basic Hadoop programs for selecting data based on x and y
> values, and everything appears to work correctly. I have Hadoop 0.19.1
> running in pseudo-distributed on a Linux box. However, as a academic
> exercise, I began writing some code that simply reads every single line of
> my input file, and does nothing else - I hoped to gain an understanding on
> how long it would take for Hadoop/HDFS to read the entire data set. My Map
> and Reduce functions are as follows:
>
>public void map(LongWritable key, Text value,
>OutputCollector output,
>Reporter reporter) throws IOException {
>
>// do nothing
>return;
>}
>
>public void reduce(Text key, Iterator values,
>   OutputCollector output,
>   Reporter reporter) throws IOException {
>// do nothing
>return;
>}
>
> My understanding is that the above map function will produce no
> intermediate key/value pairs - and hence, the reduce function should take no
> time at all. However, when I run this code, Hadoop seems to spend an
> inordinate amount of time in the reduce phase. Here is the Hadoop output -
>
> 09/04/01 20:11:12 INFO mapred.JobClient: Running job: job_200904011958_0005
> 09/04/01 20:11:13 INFO mapred.JobClient:  map 0% reduce 0%
> 09/04/01 20:11:21 INFO mapred.JobClient:  map 3% reduce 0%
> 09/04/01 20:11:25 INFO mapred.JobClient:  map 7% reduce 0%
> 
> 09/04/01 20:13:17 INFO mapred.JobClient:  map 96% reduce 0%
> 09/04/01 20:13:20 INFO mapred.JobClient:  map 100% reduce 0%
> 09/04/01 20:13:30 INFO mapred.JobClient:  map 100% reduce 4%
> 09/04/01 20:13:35 INFO mapred.JobClient:  map 100% reduce 7%
> ...
> 09/04/01 20:14:05 INFO mapred.JobClient:  map 100% reduce 25%
> 09/04/01 20:14:10 INFO mapred.JobClient:  map 100% reduce 29%
> 09/04/01 20:14:15 INFO mapred.JobClient: Job complete:
> job_200904011958_0005
> 09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15
> 09/04/01 20:14:15 INFO mapred.JobClient:   File Systems
> 09/04/01 20:14:15 INFO mapred.JobClient: HDFS bytes read=1787707732
> 09/04/01 20:14:15 INFO mapred.JobClient: Local bytes read=10
> 09/04/01 20:14:15 INFO mapred.JobClient: Local bytes written=932
> 09/04/01 20:14:15 INFO mapred.JobClient:   Job Counters
> 09/04/01 20:14:15 INFO mapred.JobClient: Launched reduce tasks=1
> 09/04/01 20:14:15 INFO mapred.JobClient: Launched map tasks=27
> 09/04/01 20:14:15 INFO mapred.JobClient: Data-local map tasks=27
> 09/04/01 20:14:15 INFO mapred.JobClient:   Map-Reduce Framework
> 09/04/01 20:14:15 INFO mapred.JobClient: Reduce input groups=1
> 09/04/01 20:14:15 INFO mapred.JobClient: Combine output records=0
> 09/04/01 20:14:15 INFO mapred.JobClient: Map input records=44967808
> 09/04/01 20:14:15 INFO mapred.JobClient: Reduce output records=0
> 09/04/01 20:14:15 INFO mapred.JobClient: Map output bytes=2
> 09/04/01 20:14:15 INFO mapred.JobClient: Map input bytes=1787601210
> 09/04/01 20:14:15 INFO mapred.JobClient: Combine input records=0
> 09/04/01 20:14:15 INFO mapred.JobClient: Map output records=1
> 09/04/01 20:14:15 INFO mapred.JobClient: Reduce input records=0
>
> As you can see, the reduce phase takes a little more than a minute - which
> is about a third of the execution time. However, the number of reduce tasks
> spawned is 1, and reduce input records is 0. Why does it spend so long on
> the reduce ph